
One of the latest open-source projects from LAION and Intel aims to give AI systems a better grasp of human emotion.


The "Empathic Insight" suite includes models and datasets designed to analyze facial images or audio files and rate the intensity of 40 different emotion categories. For faces, emotions are scored on a scale from 0 to 7; for voices, the system labels emotions as absent, slightly pronounced, or strongly pronounced.

Close-up of a worried-looking woman and table showing emotion predictions, high scores for disappointment, distress, anger.
The Empathic Insight models can detect up to 40 distinct emotions in facial images. | Image: LAION

EmoNet, the backbone of these models, draws on a taxonomy of 40 emotion categories developed from the "Handbook of Emotions," a landmark reference in psychology. The researchers expanded the usual list of basic emotions, adding cognitive states like concentration and confusion, physical states such as pain and fatigue, and social emotions including shame and pride. They argue that emotions aren't universally readable - instead, the brain constructs them from a range of signals. As a result, their models work with probability estimates, not fixed labels.
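In practice, working with probability estimates rather than fixed labels means reporting a graded distribution over categories instead of picking a single winner. A minimal sketch of the idea, not the project's actual inference code:

```python
import math

def scores_to_distribution(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw per-emotion scores into a probability distribution via softmax,
    so downstream code sees graded estimates rather than one fixed label."""
    exps = {name: math.exp(s) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

dist = scores_to_distribution({"distress": 5.1, "anger": 4.3, "concentration": 1.2})
# roughly {'distress': 0.68, 'anger': 0.31, 'concentration': 0.01}
# instead of collapsing everything to the single label "distress"
```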

Training with synthetic faces and voices

To train the models, the team used over 203,000 facial images and 4,692 audio samples. The speech data comes from Laion's Got Talent, a dataset of more than 5,000 hours of synthetic recordings in English, German, Spanish, and French, all generated with OpenAI's GPT-4o audio model.

Three portraits: a triumphant 70-year-old Latina, a smiling 30-year-old African American man, and a wistful 40-year-old Southeast Asian woman.
Synthetic sample images from the EmoNet Face Benchmark showcase the diversity of the training data. | Image: LAION

To avoid privacy issues and improve demographic diversity, LAION relied entirely on synthetic data. The facial images were created with text-to-image models like Midjourney and Flux, then programmatically varied by age, gender, and ethnicity. All audio samples were reviewed by psychology-trained experts, and only ratings on which three independent reviewers agreed made it into the dataset.
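The programmatic variation could look something like the template sweep below; the prompt wording and attribute lists are illustrative assumptions, not the team's actual pipeline:

```python
# Hypothetical sketch of programmatic prompt variation: the article says images
# were generated with text-to-image models and varied by age, gender, and
# ethnicity, but this exact prompt template is an assumption.
from itertools import product

ages        = ["20-year-old", "40-year-old", "70-year-old"]
genders     = ["man", "woman"]
ethnicities = ["Latin American", "African American", "Southeast Asian"]
emotions    = ["triumphant", "wistful", "distressed"]

prompts = [
    f"Close-up portrait of a {emotion} {age} {ethnicity} {gender}, photorealistic"
    for age, gender, ethnicity, emotion in product(ages, genders, ethnicities, emotions)
]
print(len(prompts))  # 3 * 2 * 3 * 3 = 54 prompt variants from one template
```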

Outperforming established emotion AI

According to LAION, the Empathic Insight models outperform existing competitors in benchmarks. On the EmoNet Face HQ benchmark, the Empathic Insight Face model showed a higher correlation with human expert ratings than Gemini 2.5 Pro or closed-source APIs like Hume AI. The key metric was how closely the AI's assessments matched those of psychology professionals.

Bar chart: Agreement between the models and human emotion assessments in percent, EmpathicInsight-Face Large ~40%.
EmoNet ratings align with human expert assessments up to 40 percent of the time, compared to 25-30 percent for standard VLMs and near zero for random baselines. | Image: LAION
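The article doesn't spell out the exact agreement metric; a common way to quantify how closely model scores track expert ratings is a rank correlation, sketched here for a single image:

```python
# Illustrative only: computes Spearman rank correlation between a model's
# per-emotion scores and averaged human expert ratings for one face image.
from scipy.stats import spearmanr

model_scores  = [6.2, 5.8, 4.1, 0.3, 0.1]   # e.g. disappointment, distress, anger, pride, joy
expert_scores = [6.5, 5.0, 4.4, 0.0, 0.2]   # mean ratings from psychology professionals

rho, p_value = spearmanr(model_scores, expert_scores)
print(f"rank correlation: {rho:.2f} (p={p_value:.3f})")
```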

The researchers also report strong results in speech emotion recognition. On the EmoNet Voice Benchmark, the Empathic Insight Voice model outperformed existing audio models across all 40 emotion categories. The team experimented with different model sizes and audio processing methods to optimize results.

Enhanced transcription with BUD-E Whisper

Beyond emotion recognition, LAION developed BUD-E Whisper, an upgraded version of OpenAI's Whisper model. While Whisper transcribes speech to text, BUD-E Whisper adds structured descriptions of emotional tone, detects vocal outbursts like laughter and sighs, and estimates speaker traits such as age and gender.
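Calling such a model might look like the sketch below, using the Hugging Face transformers pipeline; the repository ID and output fields are assumptions rather than a documented interface:

```python
# Hypothetical usage sketch: the model ID and the shape of the enriched output
# are assumptions; check LAION's Hugging Face pages for the real interface.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="laion/BUD-E-Whisper")  # hypothetical repo ID
result = asr("sample_clip.wav")

print(result["text"])
# Beyond the plain transcript, BUD-E Whisper is described as adding structured
# emotion descriptions, vocal bursts (laughter, sighs), and speaker traits.
```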

The EmoNet models are released under a Creative Commons license, with the accompanying code under Apache 2.0. Datasets and models can be downloaded from Hugging Face, where both Empathic Insight models come in "Small" and "Large" versions to suit different use cases and hardware requirements.
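Fetching the weights could be as simple as a snapshot download with the huggingface_hub client; the repository ID below is a placeholder, not a confirmed name:

```python
# Hypothetical download sketch using the huggingface_hub client; the repo ID
# is a placeholder, not a confirmed name from the article.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="laion/Empathic-Insight-Face-Large")  # hypothetical ID
print(f"Model files downloaded to: {local_dir}")
```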


Intel has supported the project since 2021 as part of its open-source AI strategy, with a focus on optimizing models for Intel hardware.

Summary
  • LAION and Intel have released Empathic Insight, a suite of models and datasets that can analyze facial images and audio files across 40 emotion categories, covering not only emotional but also cognitive, physical, and social states.
  • The models were trained and validated exclusively with synthetic image and voice data to protect privacy and ensure demographic diversity.
  • The researchers also introduced BUD-E Whisper, an enhanced version of OpenAI’s Whisper, which can now detect emotional tones and speaker characteristics alongside transcription.