KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection

In the paper "KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection", Michal Spiegel and Dominik Macko describe their approach to detecting machine-generated text using fine-tuned large language models (LLMs).

The paper was submitted to Task 8 of SemEval-2024 (the International Workshop on Semantic Evaluation), which focused on identifying machine-generated text across multiple languages, domains, and text generators. The problem matters because large language models such as GPT-3, BLOOMZ, and ChatGPT are increasingly used to produce human-like text in many languages, and that text can be misused for academic cheating, disinformation, or plagiarism.

The task was divided into three subtasks:

A) distinguishing human-written from machine-generated text,

B) classifying different types of machine-generated text,

and C) detecting mixed human and machine-generated content.

The authors focused on subtask A, particularly its multilingual track, which included languages such as Arabic, Chinese, English, and Russian. Their system combines two fine-tuned LLMs, Falcon-7B and Mistral-7B, with statistical detection methods through a two-step majority voting ensemble. They also introduced per-language threshold calibration: instead of a single global cutoff, a separate classification threshold is applied for each language, which improved detection accuracy across the multilingual dataset. The final system combined the LLM-based predictions with statistical metrics such as entropy and rank.
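
To make the idea of per-language threshold calibration concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: it assumes each text already has a detector score (higher means more likely machine-generated) and that a labeled development set with language tags is available. The function names best_threshold, calibrate_per_language, and predict, the accuracy-maximizing search, and the fallback value 0.5 are all hypothetical choices made for this example.

```python
from collections import defaultdict


def best_threshold(scores, labels):
    """Return the cutoff t that maximizes accuracy of the rule `score >= t`."""
    # Candidate cutoffs: every observed score, plus one above the maximum
    # (which corresponds to labeling everything as human-written).
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:  # ties keep the lowest cutoff
            best_t, best_acc = t, acc
    return best_t


def calibrate_per_language(scores, labels, langs):
    """Fit one threshold per language on a labeled development set."""
    grouped = defaultdict(lambda: ([], []))
    for s, y, lang in zip(scores, labels, langs):
        grouped[lang][0].append(s)
        grouped[lang][1].append(y)
    return {lang: best_threshold(s, y) for lang, (s, y) in grouped.items()}


def predict(score, lang, thresholds, default=0.5):
    """Label a text as machine-generated (1) or human-written (0).

    Languages unseen during calibration fall back to `default`
    (an assumption of this sketch).
    """
    return int(score >= thresholds.get(lang, default))
```

With thresholds fitted this way, the same raw score can lead to different labels depending on the language, which is the effect the calibration is meant to capture when a detector's scores are distributed differently across languages.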

The submitted system ranked fourth with 95% accuracy, within 1 percentage point of the winning team. The authors attribute this result to the ensemble, which pairs the predictions of the fine-tuned Falcon-7B and Mistral-7B models with statistical detection metrics such as Binoculars to generalize better across languages.
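
A two-step majority vote of this kind can be sketched in a few lines of Python. The grouping used here is an assumption made for illustration, and the paper's exact scheme may differ: step one collapses the statistical detectors into a single vote, and step two takes the majority over that vote together with the votes of the fine-tuned LLMs. The function names majority and two_step_vote are hypothetical.

```python
def majority(votes):
    """Return 1 if strictly more than half of the binary votes are 1, else 0."""
    return int(sum(votes) * 2 > len(votes))


def two_step_vote(llm_votes, stat_votes):
    """Combine binary votes (1 = machine-generated) in two stages.

    Assumed grouping, for illustration only:
      step 1: the statistical detectors vote among themselves;
      step 2: their joint vote is counted alongside each LLM's vote.
    """
    stat_vote = majority(stat_votes)                 # step 1
    return majority(list(llm_votes) + [stat_vote])   # step 2


# Example: the two LLMs disagree, so the statistical detectors break the tie.
label = two_step_vote(llm_votes=[1, 0], stat_votes=[1, 1, 0])
```

Under this grouping the statistical detectors only decide the outcome when the two LLMs disagree; when both LLMs agree, their shared label wins regardless of the statistical vote.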

The authors also evaluated alternative systems, including models fine-tuned separately for each language and models trained on multilingual datasets. After the submission deadline they additionally fine-tuned Llama-2-7B, which further demonstrated the potential of combining such models with statistical detection tools.

The study shows that combining fine-tuned large language models with statistical detection metrics improves the accuracy and robustness of machine-generated text detection across languages. The authors advocate further exploration of per-language calibration and ensemble techniques. Although their solution performed well, they acknowledge that hyperparameter tuning and other system refinements could improve it further. The paper contributes valuable insights into multilingual machine-generated text detection and offers a promising basis for future work.

Link to Zenodo: https://zenodo.org/records/13851347

Link to ACL Anthology: https://aclanthology.org/2024.semeval-1.84/

Data in GitHub: https://github.com/kinit-sk/semeval-2024-task-8-machine-text-detection