This blog post outlines the challenges of benchmarking emotional intelligence in AI systems, highlighting issues such as subjective scene settings, ambiguous labelling, and hidden assumptions that often lead to inconsistent evaluations. It calls for an interdisciplinary, nuanced approach - one that not only measures outcomes but also examines the reasoning behind responses and their consistency - to better capture the complexities of human emotion in real-life scenarios.
1. Introduction: Why Benchmarking Matters (and Why It's Flawed)
Benchmarks are essential for evaluating AI performance, particularly in language modeling, by testing capabilities across various tasks. As models rapidly evolve, benchmarking must also adapt to ensure accurate assessments. However, outdated benchmarks risk misrepresenting a model’s true abilities.
We have identified three major challenges in benchmarking emotion-related tasks, where even state-of-the-art models struggle: subjectivity in problem descriptions, biases in labelling, and hidden assumptions in benchmark tasks. Efforts like EmoBench and EQ-Bench aim to improve these evaluations but remain incomplete, highlighting the need for ongoing refinement of benchmarking methodologies.
In the sections that follow, we will delve deeper into these challenges, exploring the theoretical and practical limitations of current benchmarks, and proposing pathways towards more reliable, comprehensive evaluations for emotional intelligence in AI.
2. Emotional Intelligence Benchmarks: State of the Art
This article follows the taxonomy of EmoBench, which classifies emotional intelligence evaluation into two domains: emotional understanding and emotional application.
Emotional Understanding involves recognising, interpreting, and comprehending emotional nuances beyond basic detection. Benchmarks in this domain test models' ability to infer emotions beyond superficial keyword associations.
Emotional Application refers to the practical use of emotional intelligence in decision-making and responses. Benchmarks in this domain often simulate social interactions, conflict resolution, and empathy-driven decision-making, requiring models to exhibit contextually appropriate emotional responses.
Though this taxonomy provides a robust framework for evaluating text-based emotional intelligence, it has notable limitations. Its exclusive focus on text restricts the assessment of non-verbal emotional cues such as tone of voice, facial expressions, and body language, elements critical to holistic emotional understanding but outside the scope of this discussion.
3. The Problem of Subjectivity in Scene Settings
The first challenge we identified in LLM benchmarks is the subjectivity in scene settings, which leads to inconsistencies both within and across different benchmark datasets.
Within benchmarks, cultural context influences data creation, embedding implicit norms that may not be universally understood. Across benchmarks, varying definitions of emotions—such as Ekman’s six basic emotions vs. dimensional approaches—affect evaluation consistency. A model may perform well in one framework but struggle in another due to these discrepancies.
Subjective scene settings can also produce spurious errors, where models give reasonable responses that simply do not align with the human-labelled answer. These so-called "errors" sometimes disappear across multiple attempts, indicating interpretive variability rather than a failure of comprehension.
Consider the following example:
"Rosy and Andie went on a hike. Rosy's blood sugar gets low very quickly, so she brought chocolates with her. Andie saw Rosy's face had gone pale while she was grabbing a piece of chocolate, so she said, "Sorry, did I scare you?"
Here, limited context results in multiple interpretations. While all models predicted emotions based on Andie's reaction to Rosy, human labellers marked "Amusement" as correct, likely viewing Andie's question as rhetorical or playful. This divergence illustrates how subjective framing can lead to discrepancies between model outputs and human-labelled answers, not because the models fail to understand, but because the scenario itself allows for multiple valid interpretations.
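One practical way to separate interpretive variability from genuine comprehension failure is to sample each scenario several times and compare run-to-run self-consistency with gold-label accuracy. The sketch below is a minimal illustration under our own assumptions; `classify_emotion` is a hypothetical wrapper around whichever model is being evaluated, not part of any existing benchmark harness.

```python
from collections import Counter

def consistency_report(scenario, choices, gold_label, classify_emotion, n_runs=10):
    """Sample the model repeatedly and separate two signals:
    accuracy (agreement with the human gold label) and
    self-consistency (agreement with the model's own modal answer)."""
    answers = [classify_emotion(scenario, choices) for _ in range(n_runs)]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "accuracy": counts.get(gold_label, 0) / n_runs,
        "self_consistency": modal_count / n_runs,
        "modal_answer": modal_answer,
        "answer_distribution": dict(counts),
    }
```

A scenario on which a model is highly self-consistent yet persistently "wrong" is a candidate for re-labelling or reframing rather than evidence of a model failure.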
4. Labelling Ambiguity and Human Subjectivity
Labelling ambiguity and human subjectivity further limit reliable LLM benchmarking, particularly in emotion classification tasks. Because labels are determined by human annotators rather than strict rules or universal psychological theories, subjective interpretation inevitably shapes dataset creation.
Consider the following example of how these challenges play out in practice:
"Dorea was trying to cook a Baklava. When she took it out of the oven, the Baklava was ruined as the bread was not crispy, and the filling was bursting all over the pan. At that moment, her daughter came home and noticed her mom's fresh yet ruined Baklava. She tasted it and gave a thumbs-up to her mother. "
The ground truth label is Delight, but interpretations vary. Dorea's feelings depend on factors such as her personality, her relationship with her daughter, and her past experiences. DeepSeek-R1 suggested another plausible reading: that the daughter was simply being kind, which could lead Dorea to feel embarrassment instead. This demonstrates how emotional perception is shaped by context and personal biases, making labelling inherently subjective.
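One way to make this subjectivity measurable is to collect labels from several annotators per scenario and report chance-corrected agreement alongside the gold label. The sketch below is a minimal, assumption-laden illustration: it computes Fleiss' kappa from a hypothetical count matrix, and the emotion categories and annotator counts are invented for demonstration.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) matrix where
    counts[i, j] = number of annotators assigning item i to category j.
    Assumes the same number of annotators rated every item."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item agreement: fraction of concordant annotator pairs.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    # Expected agreement under chance, from the marginal category frequencies.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return float((p_i.mean() - p_e) / (1 - p_e))

# Hypothetical data: 3 scenarios, 5 annotators, categories = [Delight, Embarrassment, Gratitude]
counts = np.array([
    [5, 0, 0],   # unanimous: a single gold label is defensible
    [2, 2, 1],   # split: the "ground truth" is contestable
    [4, 1, 0],
])
print(round(fleiss_kappa(counts), 3))
```

Items with unanimous labels can keep a single gold answer, while split items like the Baklava scenario are better scored against the full label distribution or excluded from strict accuracy metrics.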
5. Hidden and Unintentional Assumptions
Another major flaw in LLM benchmarks is the presence of hidden and unintentional assumptions that subtly influence model responses and their evaluation.
Benchmark scenarios often embed implicit biases regarding context, reasoning, or expected knowledge. This is especially problematic in emotion classification, where ambiguous cues force models to infer details, leading to responses that appear incorrect but are logically sound. Even when context is provided, unclear phrasing can create uncertainty, making evaluation more about interpretation than actual understanding.
Additionally, models must often assume implicit character context. Differences in how they interpret motivations or emotions can lead to multiple, equally valid answers. The same model may even produce different responses under varying conditions.
Example:
"India and Jane had prepared a surprise for their best friend, Blair, whose birthday was in 3 days. A day before Blair's birthday, the three of them were playing together in Blair's room and India started talking about the surprise. Before India could talk further, Jane faked a cough and gave India a stare. "
When tested, DeepSeek-R1 had to decide whether Blair had noticed the hints; both assumptions were reasonable given the lack of explicit cues. Similarly, GPT-o3-mini-high encountered ambiguity in the answer choice "oblivious", interpreting it as unawareness rather than an emotional reaction. This highlights how unclear instructions and wording introduce unintended biases, forcing models to rely on their own interpretive frameworks rather than a clearly defined evaluation standard.
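One way to detect this kind of wording-driven bias is to re-run the same scenario with paraphrased answer options and check whether the model's choice stays stable. The sketch below is purely illustrative; `ask` is a hypothetical function that returns the index of the option the model selects.

```python
def wording_sensitivity(scenario, option_sets, ask):
    """option_sets holds several paraphrases of the same answer options,
    in the same order (e.g. "oblivious" vs. "unaware of the surprise").
    Disagreement across phrasings suggests that wording, not emotional
    reasoning, is driving the model's answer."""
    picks = [ask(scenario, options) for options in option_sets]
    return {"picks": picks, "stable": len(set(picks)) == 1}
```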
6. Theoretical Limits to Accurate Evaluation
Evaluating LLMs in emotional intelligence presents inherent theoretical limitations, as emotional reasoning lacks clear-cut, universally agreed-upon answers. Unlike mathematics or formal logic, emotions are heuristic and context-dependent, making precise assessment ambiguous.
Example:
"It was the day of the school's talent competition. Backstage, Sara, who was performing her stand-up comedy show, felt like she had prepared a good sketch for her act and only needed one last rehearsal. So she started pacing back and forth while mumbling words."
The task requires interpreting an emotional response within a nuanced social context. Unlike a mathematical equation, there is no single "correct" answer. Emotional states are fluid, context-dependent, and deeply subjective, influenced by cultural, personal, and situational factors. This divergence from deterministic problem-solving underscores how fundamentally different evaluating models on emotional intelligence tasks is from evaluating them on tasks with verifiable answers.
Approaches to Benchmarking Emotional Intelligence:
Advanced models like OpenAI's Omni series and DeepSeek-R1 offer opportunities for human-in-the-loop feedback, refining both models and benchmarks iteratively. Recognising the limits of precise evaluation isn't a weakness but a step towards better, more adaptive benchmarking that reflects the complexities of human emotion.
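As a concrete illustration of what a less brittle scoring rule could look like, the sketch below scores a model's answer against the full distribution of human labels instead of a single gold answer, giving partial credit for any interpretation that annotators also endorsed. This is our own minimal sketch, not a protocol used by EmoBench, EQ-Bench, or any other published benchmark, and the example labels for the Sara scenario are invented.

```python
def soft_score(model_answer: str, annotator_labels: list[str]) -> float:
    """Credit equals the fraction of annotators who chose the model's answer,
    so an answer picked by 2 of 5 annotators scores 0.4 rather than 0."""
    if not annotator_labels:
        raise ValueError("need at least one annotator label")
    return annotator_labels.count(model_answer) / len(annotator_labels)

# Hypothetical labels for the Sara scenario: annotators split between two readings.
print(soft_score("excitement",
                 ["nervousness", "nervousness", "excitement", "excitement", "nervousness"]))  # 0.4
```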
7. Moving Forward: Towards More Robust Emotion AI Benchmarking
As the challenges of evaluating emotional intelligence in LLMs become clearer, it is evident that a paradigm shift is needed. Rather than focusing solely on outcomes, future benchmarks must prioritise the reasoning processes behind model predictions.
Achieving these improvements requires interdisciplinary collaboration: AI researchers, psychologists, linguists, and prompt engineers must work together to develop benchmarks that are both technically robust and grounded in human psychology and language principles.
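To make the idea of prioritising reasoning processes more tangible, here is a minimal sketch of a rubric that scores the evidence a model cites separately from the label it predicts. The field names, weights, and grounding check are our own assumptions for illustration, not an established evaluation standard.

```python
from dataclasses import dataclass

@dataclass
class EmotionJudgement:
    label: str                 # the emotion the model predicts
    cited_evidence: list[str]  # scenario spans the model says support the label

def rubric_score(judgement: EmotionJudgement, accepted_labels: set[str], scenario: str) -> float:
    """Blend outcome and process: 0.5 for any accepted label,
    plus up to 0.5 for cited evidence actually present in the scenario text."""
    label_score = 0.5 if judgement.label in accepted_labels else 0.0
    if judgement.cited_evidence:
        grounded = sum(span in scenario for span in judgement.cited_evidence)
        evidence_score = 0.5 * grounded / len(judgement.cited_evidence)
    else:
        evidence_score = 0.0
    return label_score + evidence_score
```

Combined with consistency checks and annotator label distributions like those sketched earlier, this kind of rubric rewards models that reach a defensible answer for the right reasons rather than by lucky guessing.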
At GoBubble, we are committed to advancing emotion AI benchmarking through innovative methodologies and collective expertise. Our upcoming blog post will outline concrete proposals for addressing these challenges and share insights into our models’ performance on newly refined benchmarks. We look forward to engaging with the research community and shaping the future of emotion-AI evaluation together. Stay tuned!