The paper "Authorship Obfuscation in Multilingual Machine-Generated Text Detection", by researchers from KInIT, the MIT Lincoln Laboratory, and the University of Pennsylvania in the US, investigates the growing challenge of detecting machine-generated text (MGT) in multilingual settings, focusing on the role of authorship obfuscation (AO).
With the proliferation of large language models (LLMs) capable of generating text indistinguishable from human writing, techniques such as paraphrasing and homoglyph substitution threaten the robustness of detection systems. The authors benchmarked 10 AO techniques across 37 detection methods, spanning 11 languages, resulting in an unprecedented evaluation of over 4,000 combinations.
Key findings reveal that:
- Homoglyph attacks are the most effective obfuscation method, with success rates exceeding 70% in some languages: they exploit vulnerabilities in character encoding and in language-model embeddings, and they often evade detection without significantly degrading readability.
- Paraphrasing techniques, while effective in altering text to bypass detection, often cause language shifts or distortions in multilingual settings. ChatGPT paraphrasing proved less impactful but demonstrated potential with refined prompts.
- Backtranslation, a common AO method, varies in effectiveness by language pair and detector but highlights the nuanced challenge of preserving semantic integrity while evading detection.
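To make the homoglyph idea concrete, here is a minimal sketch of such an attack: visually identical characters from another script (here, Cyrillic) replace Latin letters, so the text looks unchanged to a human reader while its byte representation, and thus its tokenization, differs. The mapping below is a small illustrative subset, not the substitution table used in the paper.

```python
# Hypothetical homoglyph attack: swap selected Latin letters for
# visually identical Cyrillic counterparts. Mapping is illustrative only.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "c": "\u0441",  # Cyrillic small es
    "p": "\u0440",  # Cyrillic small er
}

def obfuscate(text: str) -> str:
    """Replace mapped characters with homoglyphs from another script."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

original = "a detector scores this sentence"
attacked = obfuscate(original)
print(attacked)              # renders almost identically on screen...
print(original == attacked)  # ...but the strings differ: False
```

Because most detectors operate on token IDs rather than rendered glyphs, even a handful of such substitutions can push a text out of the distribution the detector was trained on.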
The study also shows that preprocessing, such as detecting multi-script homoglyph text, is critical for mitigating AO effects. However, relying solely on preprocessing, without addressing other AO methods, leaves significant gaps in detection capability.
A significant contribution of the research lies in its exploration of adversarial robustness. By introducing obfuscated text for data augmentation, the authors show improved detection capabilities across multilingual models. However, the effectiveness of adversarial retraining depends heavily on the AO methods used, with homoglyph attacks seeing the greatest performance boost post-training. Conversely, methods like backtranslation and basic paraphrasing presented challenges in maintaining balanced detection performance across all language groups.
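The data-augmentation step described above can be sketched as follows. This is a hypothetical outline, not the authors' training code: a fraction of the machine-generated training samples is passed through an AO transform and appended to the training set, so the detector sees obfuscated text during (re)training.

```python
import random

def augment_with_obfuscated(samples, obfuscate, ratio=0.5, seed=0):
    """Hypothetical augmentation: append AO-transformed copies of a
    random fraction of the machine-generated samples.

    samples   -- list of {"text": str, "label": "human"|"machine"} dicts
    obfuscate -- any AO transform (homoglyphs, paraphrase, backtranslation)
    ratio     -- fraction of machine samples to duplicate in obfuscated form
    """
    rng = random.Random(seed)  # fixed seed for reproducible augmentation
    machine = [s for s in samples if s["label"] == "machine"]
    chosen = rng.sample(machine, int(len(machine) * ratio))
    return samples + [
        {"text": obfuscate(s["text"]), "label": "machine"} for s in chosen
    ]

samples = [
    {"text": "human essay", "label": "human"},
    {"text": "llm output one", "label": "machine"},
    {"text": "llm output two", "label": "machine"},
]
bigger = augment_with_obfuscated(samples, str.swapcase, ratio=1.0)
print(len(bigger))  # 5
```

Consistent with the paper's finding, the choice of `obfuscate` matters: retraining against homoglyph-perturbed samples yields the largest gains, while backtranslation and basic paraphrasing make balanced performance across language groups harder to maintain.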
The dataset curated for this study includes approximately 740,000 samples of human and machine-generated texts, processed through various AO methods. This comprehensive dataset, available for research purposes, offers a valuable resource for further exploration in MGT detection.
The authors underline the need for multilingual adaptability in detection frameworks and advocate for the integration of adversarial retraining with preprocessing strategies. By addressing the specific challenges posed by homoglyph and paraphrasing-based attacks, future detection systems can better handle the dynamic and evolving threat landscape posed by advanced LLMs.
- Zenodo: https://zenodo.org/records/14247322
- GitHub: https://github.com/kinit-sk/mAO