The paper, presented at SemEval-2023 by researchers including our partners at the University of Sheffield, reports on work to classify online news articles. The work involved identifying the genre (opinion, objective reporting, or satire), detecting framing techniques, and classifying persuasion methods in a multilingual context. The team used machine learning models including mBERT and RoBERTa, enhanced with task-adaptive pretraining and ensemble methods, to handle multilingual data and class imbalance. These approaches achieved top rankings across multiple languages.
The paper, "SheffieldVeraAI at SemEval-2023 Task 3: Mono and Multilingual Approaches for News Genre, Topic and Persuasion Technique Classification", details the work conducted for Task 3 of SemEval-2023. The task focused on classifying news articles by genre, topic framing, and persuasion techniques across multiple languages, with the goal of distinguishing between opinion pieces, objective reporting, and satire while also identifying specific framing and persuasion methods.
Key Objectives and Methods
The team addressed the task with both monolingual and multilingual models. The task was divided into three subtasks:
- News Genre Classification: Differentiating between opinion, objective reporting, and satire.
- Framing Techniques Detection: Identifying 14 different framing techniques used in news articles.
- Persuasion Techniques Classification: Recognizing 23 persuasion techniques, grouped into six high-level categories.
To tackle these challenges, the team used a variety of advanced machine learning techniques, including:
- mBERT Models: For multilingual contexts, with some models using adapters and task-adaptive pretraining.
- RoBERTa Models: For English-specific tasks, fine-tuned on relevant datasets.
- Ensemble Methods: Combining predictions from multiple models to improve accuracy.
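As an illustration of how predictions from several models can be combined, here is a minimal sketch of soft voting, which averages per-class probabilities and picks the highest. This summary does not state how the team combined its models, so the averaging rule, the function names, and the label strings below are assumptions made for illustration only.

```python
from typing import List, Sequence

# Hypothetical label order for the genre subtask (illustrative only).
GENRES = ["opinion", "reporting", "satire"]


def soft_vote(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class probabilities produced by several models."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [
        sum(probs[c] for probs in model_probs) / n_models
        for c in range(n_classes)
    ]


def predict_genre(model_probs: Sequence[Sequence[float]]) -> str:
    """Return the label with the highest averaged probability."""
    averaged = soft_vote(model_probs)
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return GENRES[best]


# Three hypothetical models scoring one article.
article_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
print(predict_genre(article_scores))
```

In this example one model prefers "opinion", but the averaged scores favour "reporting", which shows how an ensemble can smooth over a single model's mistake.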
Results and Findings
- Subtask 1 (News Genre): The ensemble of mBERT models was highly effective, ranking joint-first for German and achieving the highest mean rank among multilingual teams.
- Subtask 2 (Framing): The team used separate ensembles for monolingual (RoBERTa-MUPPET) and multilingual (XLM-RoBERTa with adapters) models, securing first place in three languages and the best average rank overall.
- Subtask 3 (Persuasion Techniques): Different strategies were employed for English and other languages, using RoBERTa-Base and mBERT models respectively, achieving top 10 positions in all languages, including second place for English.
Key Techniques and Challenges
- Class Imbalance: The team addressed this by employing techniques like oversampling and class weighting.
- Multilingual and Monolingual Model Comparison: They found that different subtasks and languages benefitted from different approaches. For instance, multilingual models generally performed better for non-English languages, while monolingual models were more effective for English.
- Task-Adaptive Pretraining (TAPT): This was particularly useful for improving the performance of multilingual models on specific tasks.
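The two class-imbalance techniques named above can be sketched in a few lines. This is a generic illustration, not the team's implementation: the inverse-frequency weighting formula, the random oversampling strategy, and all function names are assumptions.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


def oversample(
    examples: Sequence[str], labels: Sequence[str], seed: int = 0
) -> List[Tuple[str, str]]:
    """Randomly duplicate minority-class examples up to the majority size."""
    rng = random.Random(seed)
    by_class: Dict[str, List[str]] = {}
    for text, label in zip(examples, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(texts) for texts in by_class.values())
    balanced: List[Tuple[str, str]] = []
    for label, texts in by_class.items():
        extra = [rng.choice(texts) for _ in range(target - len(texts))]
        balanced.extend((text, label) for text in texts + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced training set: satire is rare.
labels = ["opinion"] * 6 + ["reporting"] * 3 + ["satire"] * 1
examples = [f"article {i}" for i in range(len(labels))]

print(class_weights(labels))
print(Counter(label for _, label in oversample(examples, labels)))
```

Class weights leave the data unchanged and scale each example's contribution to the loss, whereas oversampling changes the training set itself; the two can be used separately or together.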
Conclusion
The research highlighted the importance of combining monolingual and multilingual models tailored to the specific characteristics of each subtask. The ensemble approaches and advanced preprocessing techniques contributed significantly to the team's high rankings. The study provides valuable insights into the effectiveness of various machine learning strategies for multilingual news classification, offering a framework for future research in this area.
The paper is published in the ACL Anthology and is available at https://aclanthology.org/2023.semeval-1.275/.
The open-source code is available at https://github.com/GateNLP/semeval2023-multilingual-news-detection.
Zenodo record: https://zenodo.org/records/8159066