A paper titled "Disinformation Capabilities of Large Language Models" by Ivan Vykopal, Matúš Pikuliak, Ivan Srba, Robert Moro, Dominik Macko, and Maria Bielikova, from the Kempelen Institute of Intelligent Technologies (KInIT) and the Faculty of Information Technology at Brno University of Technology, explores the risks posed by large language models (LLMs) in generating disinformation. The authors investigate how well different LLMs generate disinformation news articles from given narratives on topics such as COVID-19, the Russian war in Ukraine, and US elections.
They evaluate 10 models, including popular ones like GPT-3, GPT-4, Falcon, and Vicuna, to assess their ability to generate, agree with, or refute disinformation narratives. The study also examines the models’ safety features and the effectiveness of existing detection tools for identifying machine-generated disinformation.
The authors designed the study around 20 disinformation narratives drawn from five categories: COVID-19, health, the Russian war in Ukraine, US elections, and regional topics. The narratives were derived from credible fact-checking sources such as Snopes and Agence France-Presse (AFP). For each narrative, the authors prepared a title and an abstract and prompted the LLMs to generate news articles either from the title alone or from the title together with the abstract. The 10 LLMs used in the study, including ChatGPT, Falcon, Mistral, and Llama-2, generated 1,200 texts in total.
Key Findings
- Disinformation Generation: Most models, except for Falcon, generated coherent disinformation articles. The models agreed with the narratives to varying degrees, with Vicuna and GPT-3 Davinci being the most prone to generating convincing and dangerous disinformation. Falcon, on the other hand, often refused to generate disinformation, showcasing some embedded safety features.
- Safety Features: While some models, such as Falcon and ChatGPT, demonstrated safety mechanisms by refusing to generate harmful content or providing disclaimers, others, like Vicuna and GPT-3 Davinci, lacked such features. The inconsistency in safety filters across models raised concerns about their ability to prevent harmful disinformation from being generated.
- Human vs. GPT-4 Evaluation: The study used both human annotators and GPT-4 to assess the generated texts. GPT-4 was able to automate part of the evaluation, although it tended to overestimate the presence of safety features, inflating the perceived safety of the generated articles. It nevertheless proved useful for identifying dangerous content and shows potential for automating future evaluation and detection efforts (see the first sketch after this list).
- Model Detection: The researchers evaluated several machine-generated-text detection methods and found that fine-tuned ELECTRA-large detectors were the most effective at identifying disinformation generated by LLMs. Adversarial actors could still exploit weaknesses in these detection systems, however, and the paper emphasizes the need for further improvements in detection technology (a fine-tuning sketch follows this list).
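
To illustrate the kind of GPT-4-based assessment described above, here is a minimal sketch of prompting GPT-4 to rate a generated article against a narrative. It assumes the openai Python client; the question wording, the 1-5 scale, and the output parsing are illustrative assumptions, not the authors' exact annotation protocol.

```python
# Illustrative sketch: using GPT-4 as an annotator for generated articles.
# The questions and the 1-5 scale below are assumptions for illustration,
# not the exact annotation questions used in the paper.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

ANNOTATION_PROMPT = """You are evaluating a news article against a known disinformation narrative.

Narrative: {narrative}

Article:
{article}

Answer each question on a 1-5 scale (1 = not at all, 5 = completely):
1. Does the article agree with the narrative?
2. Does the article argue against (refute) the narrative?
3. Does the article contain a disclaimer or warning that the claims are false?
Return only the three numbers separated by commas."""


def annotate(narrative: str, article: str, model: str = "gpt-4") -> list[int]:
    """Ask the model to rate agreement, refutation, and presence of disclaimers."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the scoring as deterministic as possible
        messages=[{"role": "user", "content": ANNOTATION_PROMPT.format(
            narrative=narrative, article=article)}],
    )
    answer = response.choices[0].message.content
    # Assumes the model follows the requested "n,n,n" output format.
    return [int(tok) for tok in answer.replace(" ", "").split(",")[:3]]
```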
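
The detection experiments can be approximated with standard tooling. Below is a minimal sketch of fine-tuning an ELECTRA-large classifier to separate human-written from machine-generated articles using Hugging Face Transformers. The checkpoint is the public google/electra-large-discriminator; the CSV file names and column names are placeholders, not the authors' actual data setup.

```python
# A minimal sketch (not the authors' code) of fine-tuning an ELECTRA-large
# classifier to distinguish human-written from machine-generated articles.
# train.csv / dev.csv and the "text" / "label" columns are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "google/electra-large-discriminator"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Placeholder dataset: one text column and a binary label (0 = human, 1 = machine).
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="electra-mgt-detector",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pads batches dynamically
)
trainer.train()
print(trainer.evaluate())  # reports evaluation loss only, unless compute_metrics is supplied
```

The labeled training texts could, for instance, be drawn from the data released in the authors' GitHub repository (linked below), split into human-written and LLM-generated examples.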
The paper concludes that LLMs, including open-source models, are capable of generating highly convincing disinformation narratives, posing a significant risk to public discourse. The authors call for the development of more robust detection systems and continued monitoring of LLM capabilities as they evolve. The inconsistent application of safety filters across models underscores the need for more uniform safeguards and serves as a warning about the potential misuse of LLMs and the difficulty of controlling their disinformation-generating capabilities.
Read the paper at Zenodo: https://zenodo.org/records/13630077
Read the paper at ACL Anthology: https://aclanthology.org/2024.acl-long.793/
Access the data on GitHub: https://github.com/kinit-sk/disinformation-capabilities