Modern language models are trained on vast amounts of text scraped from the internet, often only lightly filtered. Given the growing share of machine-generated text online, it is increasingly likely that new language models will train on the outputs of previous generations. Following previous work, we use the term “self-consuming training” for this process. To analyse how self-consuming training might affect LLMs, we repeatedly simulate the process with a GPT-2-based LLM on several datasets. We then score the outputs on multiple attributes, including quality, overall text diversity, emotion, toxicity, and perceived author identity, using both established metrics and fine-tuned classifier models. Based on these scores, we outline potential effects that self-consuming training may have on modern language models.
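The self-consuming loop described above can be illustrated with a deliberately minimal toy sketch (not the paper's actual GPT-2 setup): a "model" here is just a unigram distribution fitted to a corpus, and each generation is trained solely on samples drawn from the previous generation's model. The function names (`fit`, `generate`, `self_consuming`) are hypothetical. Because each generation can only emit tokens that survived the previous one, vocabulary size, one crude proxy for text diversity, can never grow across generations.

```python
import random
from collections import Counter

def fit(corpus):
    # "Train" a toy model: estimate a unigram distribution over tokens.
    counts = Counter(corpus)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def generate(model, n, rng):
    # Sample n tokens from the fitted distribution.
    toks = list(model)
    weights = [model[t] for t in toks]
    return rng.choices(toks, weights=weights, k=n)

def self_consuming(corpus, generations, rng):
    # Repeatedly retrain on the previous generation's output,
    # tracking vocabulary size as a crude diversity proxy.
    vocab_sizes = [len(set(corpus))]
    for _ in range(generations):
        model = fit(corpus)
        corpus = generate(model, len(corpus), rng)  # next model sees only synthetic text
        vocab_sizes.append(len(set(corpus)))
    return vocab_sizes

rng = random.Random(0)
corpus = [f"tok{i}" for i in range(50)] * 2  # 50 distinct tokens, 100 total
sizes = self_consuming(corpus, 10, rng)
```

In this sketch `sizes` is non-increasing: rare tokens that fail to be sampled in one generation are lost to all later ones, a toy analogue of the diversity loss that self-consuming training can induce in real LLMs.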