
Researchers from Midjourney and New York University have developed a new approach that could help language models generate more diverse creative texts without significantly sacrificing quality.


In a recently published paper, the team introduces "deviation metrics" into the AI training process. The method measures how different each generated text is from the others produced for the same prompt: the texts are converted into embeddings, and the pairwise cosine distances between those embeddings give the system a mathematical measure of variation.
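
Concretely, a deviation score of this kind can be sketched as each response's mean cosine distance to its sibling responses for the same prompt. This is a minimal illustration assuming the embeddings have already been produced by some text-embedding model; it is not necessarily the paper's exact implementation:

```python
import numpy as np

def deviation_scores(embeddings: np.ndarray) -> np.ndarray:
    """For each response, return its mean cosine distance to the other
    responses generated for the same prompt (higher = more distinctive).
    `embeddings` has shape (n_responses, dim)."""
    # Normalize rows so dot products become cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T          # pairwise cosine distance matrix
    n = len(embeddings)
    # Average each row, excluding the zero self-distance on the diagonal.
    return dist.sum(axis=1) / (n - 1)

# Toy example: four responses to one prompt, embedded in 3 dimensions.
emb = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.1, 0.0],   # near-duplicate of the first response
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
print(deviation_scores(emb))  # near-duplicates score lower than outliers
```

As expected, the two near-duplicate responses receive lower deviation scores than the two distinct ones.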

Three-stage flowchart: process for improving AI responses through deviation analysis, DPO/ORPO training and diversification
The training approach evaluates differences between LLM responses to enhance output variety. | Image: Chung et al.

Initial testing looks promising. Models trained with the new method generated 23 percent more diverse texts, while quality scores, based on Reddit upvote data, dropped by only about five percent.

A test case shows how this works in practice. When given the prompt "Why are you shaking, my love? You're king now," the standard GPT-4o model mostly stuck to stories about nervous new rulers. The modified Llama-3.1-8B model, despite being smaller, produced everything from dark fantasy tales about bear princes to supernatural stories set underwater.

Tabular representation: three AI models (GPT-4o and Llama-3.1 variants) generate different narrative responses to a royal writing prompt.
Modified Llama models show greater variety in storytelling compared to GPT-4o with identical prompts. | Image: Chung et al.

Human evaluators backed up these findings, rating the texts as more varied while maintaining quality. However, the researchers only tested against the older GPT-4o, not the newer GPT-4.5, which produces more natural-sounding text but costs more to use.

Comparison chart: DDPO-both vs. GPT-4o and DPO with win rates for storytelling quality and diversity, DDPO-both leads in all categories.
Data shows the modified model outperforming others in both story quality and variety. | Image: Chung et al.

Two types of diversity

The researchers focused on two kinds of variety: semantic (different story content and plots) and stylistic (writing that sounds like it comes from different authors). They developed a separate deviation metric for each type but found that combining them worked best.
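
One simple way to combine the two signals is a weighted average of the per-response deviation scores. The blend below, including the 0.5 default weight, is an illustrative assumption rather than the paper's formula:

```python
import numpy as np

def combined_deviation(semantic_dev, style_dev, alpha=0.5):
    """Blend per-response deviation scores. `alpha` weights semantic
    (content) deviation against stylistic deviation; the equal-weight
    default is an assumption for illustration."""
    return alpha * np.asarray(semantic_dev) + (1 - alpha) * np.asarray(style_dev)

sem = [0.8, 0.2, 0.5]   # content differs most in response 0
sty = [0.1, 0.9, 0.5]   # style differs most in response 1
print(combined_deviation(sem, sty))  # [0.45 0.55 0.5]
```

Under an equal weighting, a response can rank as highly distinctive by being unusual in either content or style.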

For their research, the team used more than 100,000 prompt-response pairs from Reddit's r/WritingPrompts. They discovered they could get significantly better variety with just four different responses per prompt.

The system can maintain quality by using carefully selected training examples or setting minimum standards for how different responses need to be. This makes it more flexible than other methods for increasing output variety.
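
Such selection could be sketched as a filter over candidate training examples that enforces both a deviation floor and a quality floor. The field names, threshold values, and sample data below are hypothetical, not taken from the paper:

```python
# Hypothetical candidate responses for one prompt, each carrying the
# scores a pipeline like this would need (field names are illustrative).
pairs = [
    {"response": "bear-prince fantasy",   "deviation": 0.72, "quality": 0.80},
    {"response": "nervous-king drama",    "deviation": 0.15, "quality": 0.90},
    {"response": "underwater ghost tale", "deviation": 0.64, "quality": 0.55},
]

def select_training_examples(pairs, min_deviation=0.3, min_quality=0.6):
    """Keep only responses that differ enough from their siblings
    (min_deviation) while still clearing a quality floor (min_quality)."""
    return [p for p in pairs
            if p["deviation"] >= min_deviation and p["quality"] >= min_quality]

kept = select_training_examples(pairs)
print([p["response"] for p in kept])  # only the bear-prince story passes both
```

Raising `min_deviation` pushes the model toward variety; raising `min_quality` guards against the quality loss the researchers measured.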

Some questions still need answers. The researchers haven't yet shown whether their method works beyond creative writing - technical documentation and summaries might require different approaches. The technique's effectiveness in online training environments, which many large models use, also remains untested.


The quality measurement system itself raises questions. While Reddit upvotes provide some insight into text quality, they miss important factors like technical accuracy, consistency, and professional writing standards. These limitations suggest more comprehensive evaluation methods may be needed.

Even with these open questions, the technique could change how LLMs handle creative writing tasks, where current models often fall into repetitive patterns. The researchers say they'll share their code on GitHub, so others can build on their work.

Summary
  • Midjourney and New York University researchers have developed a new training method that allows AI language models to generate a wider variety of texts.
  • The approach incorporates "deviation metrics" to assess the differences between texts during the training process. The researchers applied this method to two existing training techniques and evaluated its effectiveness using over 100,000 prompt-response pairs sourced from Reddit.
  • While text quality decreased slightly, by around five percent, the diversity of responses increased by 23 percent. Compared to GPT-4o, the modified Llama-3.1-8B model produced a significantly broader range of stories, from dark fantasy to supernatural tales set underwater.
Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.