Analysing State-Backed Propaganda Websites: A New Dataset and Linguistic Study

The paper "Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study," presented at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) by researchers from the Department of Computer Science, University of Sheffield, investigates key characteristics of state-backed websites that spread disinformation. It focuses on two websites, Reliable Recent News (RRN) and WarOnFakes (WoF), which publish content in multiple languages, including Arabic, Chinese, English, French, German, and Spanish.

Aims of the study:

  1. Create a new dataset with articles from RRN and WoF.
  2. Perform cross-site unsupervised topic clustering across the multilingual dataset.
  3. Conduct linguistic and temporal analysis of translations and topics over time.
  4. Analyse articles with false publication dates.

Methodology

The researchers collected all posts from the two websites in March 2023 using the WordPress REST API. They extracted the text and metadata of each post, including its publication and modification times. The content was clustered using BERTopic, and several linguistic tools, including the LIWC2015 lexicon, were used for analysis.
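
The collection step can be illustrated with a minimal Python sketch. It is not the authors' extractor (that is linked at the end of this page). It assumes a site that exposes the standard WordPress REST endpoint /wp-json/wp/v2/posts with the per_page and page parameters and the X-WP-TotalPages response header. The helper names (make_http_fetcher, iter_posts, extract_record) are hypothetical and chosen only for illustration.

```python
import html
import json
import re
from typing import Callable, Iterator, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical fetcher type: given a page number, return (posts, total_pages).
FetchPage = Callable[[int], Tuple[List[dict], int]]

TAG_RE = re.compile(r"<[^>]+>")


def make_http_fetcher(base_url: str, per_page: int = 100) -> FetchPage:
    """Build a fetcher for a WordPress site (assumes the REST API is enabled)."""
    def fetch_page(page: int) -> Tuple[List[dict], int]:
        query = urlencode({"per_page": per_page, "page": page})
        with urlopen(f"{base_url}/wp-json/wp/v2/posts?{query}") as resp:
            total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
            return json.load(resp), total_pages
    return fetch_page


def iter_posts(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every post by walking the paginated listing page by page."""
    page = 1
    while True:
        posts, total_pages = fetch_page(page)
        yield from posts
        if not posts or page >= total_pages:
            break
        page += 1


def extract_record(post: dict) -> dict:
    """Keep the plain text plus publication and modification times."""
    # Crude tag stripping; a real pipeline would likely use an HTML parser.
    text = html.unescape(TAG_RE.sub(" ", post["content"]["rendered"]))
    return {
        "id": post["id"],
        "published": post["date_gmt"],
        "modified": post["modified_gmt"],
        "title": html.unescape(post["title"]["rendered"]),
        "text": " ".join(text.split()),
    }
```

Passing the fetcher in as a parameter keeps the pagination logic testable without network access. For a live site, something like `[extract_record(p) for p in iter_posts(make_http_fetcher(base_url))]` would apply, where base_url is the site root.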

Key Findings

  1. Dataset Composition: The dataset includes 14,053 translations of 3,447 articles posted from March 4, 2022, to March 6, 2023. These articles cover a wide range of topics, showing significant overlap between RRN and WoF.
  2. Topic Analysis: Articles were clustered into 144 topics, with many sentences classified as outliers. Themes included military actions, political events, and economic issues, indicating coordinated efforts in spreading disinformation.
  3. Linguistic Characteristics: The analysis showed that RRN and WoF articles are more negative than those from genuine news sources like the New York Times. They frequently use emotional and speculative language, focusing more on present and future events.
  4. Temporal Patterns: Posts are typically published on weekdays, with lower activity during Russian public holidays. The study also found evidence of backdating, where non-English posts were given false dates to align with the original articles.
  5. Cyrillic Characters: The presence of Cyrillic characters in 178 articles suggests the original content was likely written in Russian before being translated. Some translations contained remnants of machine translation tool interfaces.
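
Two of the checks above lend themselves to a short Python sketch: flagging texts that contain Cyrillic characters, and counting posts per weekday. This is a simplified illustration under stated assumptions, not the procedure used in the paper. It assumes timestamps in the ISO 8601 form returned by WordPress (for example 2022-03-04T10:15:00) and treats any character in the basic Cyrillic Unicode block (U+0400 to U+04FF) as Cyrillic. The function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable

# Basic Cyrillic Unicode block; extended Cyrillic blocks are ignored here.
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def contains_cyrillic(text: str) -> bool:
    """Return True if the text contains at least one Cyrillic character."""
    return CYRILLIC_RE.search(text) is not None


def weekday_counts(timestamps: Iterable[str]) -> Dict[str, int]:
    """Count posts per weekday from ISO 8601 timestamps."""
    counts = Counter(datetime.fromisoformat(ts).weekday() for ts in timestamps)
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return {name: counts.get(i, 0) for i, name in enumerate(WEEKDAYS)}
```

Applied to the non-Russian translations, a check like contains_cyrillic would surface leftover Russian fragments, and weekday_counts gives the distribution behind a weekday-versus-weekend comparison.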

Conclusion

The study highlights the importance of understanding state-backed disinformation operations. By analysing the content and dissemination strategies of RRN and WoF, the researchers revealed how these sites aim to influence public opinion and spread propaganda. The dataset and findings offer valuable resources for further research in disinformation detection and analysis, promoting greater awareness and more effective countermeasures against such operations.

The paper was published in the ACL Anthology and is available at: https://aclanthology.org/2023.emnlp-main.349/

The open-access source code was published on GitHub: https://github.com/GateNLP/wordpress-site-extractor

Public Dataset on Zenodo for research purposes: https://zenodo.org/records/10007383

The public dataset is derived from the complete dataset (restricted access): https://zenodo.org/records/10008933

Software on Zenodo accompanying the paper: https://zenodo.org/records/10008086