Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation

The paper "Effects of Diversity Incentives on Sample Diversity and Downstream Model Performance in LLM-Based Text Augmentation", by Jan Cegin, Branislav Pecher, Jakub Simko, Ivan Srba, Maria Bielikova, and Peter Brusilovsky, affiliated with the Kempelen Institute of Intelligent Technologies (KInIT), the Faculty of Information Technology at Brno University of Technology, and the University of Pittsburgh, investigates how different diversity incentive methods influence the quality of text generated by large language models (LLMs) and the performance of downstream models trained on the augmented data.

Specifically, the paper explores whether diversity incentives traditionally used in crowdsourcing tasks (taboo words, chaining, and hints) can enhance the lexical diversity of generated paraphrases and the performance of classifiers trained on them.

The authors note that using LLMs such as GPT-4 for text augmentation has become common in domains like sentiment and news classification. However, the effects of incorporating diversity incentives into LLM-based generation remain under-researched. These incentives, adapted from crowdsourcing methods, aim to produce more diverse paraphrases and ultimately improve the robustness of downstream models.

The paper addresses two main research questions:

  1. Does the use of diversity incentives in LLMs lead to more diverse paraphrases?
  2. Do downstream models perform better when trained on data augmented with diversity incentives? (See the sketch of the augment-then-train loop below.)
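
To make the second question concrete, here is a minimal sketch of the augment-then-train loop it refers to. This is an assumption-laden stand-in, not the paper's pipeline: `call_llm` is a hypothetical placeholder for any LLM client, and a TF-IDF plus logistic regression classifier substitutes for the BERT and Mistral models fine-tuned in the paper.

```python
# Hedged sketch of the augment-then-train loop behind the second
# research question; not the paper's code. `call_llm` is a hypothetical
# placeholder, and scikit-learn stands in for BERT/Mistral fine-tuning.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def call_llm(prompt: str) -> str:
    """Placeholder: wire up an LLM client of your choice here."""
    raise NotImplementedError


def augment(seeds: list[str], labels: list[str], n_paraphrases: int = 3):
    """Expand (text, label) pairs with LLM paraphrases of each seed."""
    texts, ys = list(seeds), list(labels)
    for seed, label in zip(seeds, labels):
        for _ in range(n_paraphrases):
            texts.append(call_llm(f"Paraphrase this sentence: {seed}"))
            ys.append(label)  # a paraphrase inherits its seed's label
    return texts, ys


# Downstream model trained on the augmented data:
# clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
# clf.fit(*augment(train_texts, train_labels))
```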

To answer these questions, the researchers conducted experiments using five LLMs and six datasets across tasks like sentiment analysis and news classification. They applied three diversity incentive methods, each sketched in code after the list:

  • Taboo words: prohibiting the model from using certain significant words in paraphrases.
  • Chaining: using outlier paraphrases from previous rounds as seed sentences for further paraphrasing.
  • Hints: providing the model with examples of diverse paraphrases during the prompting process.
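
At heart, these three incentives are prompt-construction strategies. The following is a minimal sketch of how each might be phrased; the wording and helper names are illustrative assumptions, not the paper's exact templates.

```python
# Illustrative prompt builders for the three incentives; the exact
# wording used in the paper differs, and these helpers are hypothetical.

def taboo_prompt(seed: str, taboo_words: list[str]) -> str:
    """Forbid significant words taken from earlier paraphrases."""
    banned = ", ".join(taboo_words)
    return (f"Paraphrase the following sentence without using any of "
            f"these words: {banned}.\nSentence: {seed}")


def chaining_prompt(outlier: str) -> str:
    """Re-seed the next round with an outlier paraphrase from the last."""
    return f"Paraphrase the following sentence: {outlier}"


def hints_prompt(seed: str, diverse_examples: list[str]) -> str:
    """Show diverse prior paraphrases as in-context hints."""
    shots = "\n".join(f"- {ex}" for ex in diverse_examples)
    return (f"Here are examples of diverse paraphrases:\n{shots}\n"
            f"Paraphrase the following sentence in a new way: {seed}")
```

Note how the hints prompt resembles few-shot in-context learning, which is the mechanism the authors credit for its downstream gains.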

Their findings revealed several key points:

  • Taboo words significantly increased the lexical diversity of generated paraphrases (one standard diversity measure is sketched after this list) but had limited effect on the performance of downstream models. The method also struggled with stability, producing inconsistent results across models and datasets.
  • Hints consistently improved the performance of downstream models, both in terms of accuracy and stability. The authors attribute this to its similarity to in-context learning, where examples guide the model to generate more appropriate paraphrases. This method outperformed both the baseline and other diversity incentives, especially when fine-tuning models like BERT and Mistral.
  • Chaining did not significantly improve lexical diversity or model performance, often producing inconsistent results, likely due to LLMs generating progressively lower-quality outputs when relying on outliers as seeds.
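
For reference, one standard way to quantify lexical diversity is distinct-n: the ratio of unique word n-grams to all n-grams in a set of paraphrases. The paper's exact metrics may differ; this is only an illustrative measure.

```python
# distinct-n: unique word n-grams divided by total n-grams; higher
# values mean more lexically diverse paraphrases. Illustrative only.

def distinct_n(paraphrases: list[str], n: int = 2) -> float:
    ngrams = []
    for text in paraphrases:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# A more varied set scores higher:
print(distinct_n(["the movie was great", "the movie was great fun"]))  # ~0.57
print(distinct_n(["the movie was great", "a truly delightful film"]))  # 1.0
```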

The study concludes that hints are the most effective diversity incentive for improving downstream model performance, particularly in classification tasks, while taboo words are better suited to increasing lexical diversity. Combining the two methods, however, did not yield significant further improvements, suggesting they are best applied separately.

In summary, the authors advocate further exploration of diversity incentives, especially hints, as a promising way to enhance both the quality of LLM-generated data and the performance of downstream classifiers. These findings could inform future strategies for LLM-based data augmentation, particularly in low-resource settings or wherever diversity in training data is critical.

Link to Zenodo: https://zenodo.org/records/13630106

Link to ACL Anthology: https://aclanthology.org/2024.acl-long.710/