This blog post outlines the challenges of benchmarking emotional intelligence in AI systems, highlighting issues such as subjective scene settings, ambiguous labelling, and hidden assumptions that often lead to inconsistent evaluations. It calls for an interdisciplinary, nuanced approach - one that not only measures outcomes but also examines the reasoning behind responses and their consistency - to better capture the complexities of human emotion in real-life scenarios.
1. Introduction: Why Benchmarking Matters (and Why It's Flawed)
Benchmarks are essential for evaluating AI performance, particularly in language modeling, by testing capabilities across various tasks. As models rapidly evolve, benchmarking must also adapt to ensure accurate assessments. However, outdated benchmarks risk misrepresenting a model’s true abilities.
We have identified three major challenges in benchmarking emotion-related tasks, where even state-of-the-art models struggle: subjectivity in problem descriptions, biases in labelling, and hidden assumptions in benchmark tasks. Efforts like EmoBench and EQ-Bench aim to improve these evaluations but remain incomplete, highlighting the need for ongoing refinement of benchmarking methodologies.
In the sections that follow, we will delve deeper into these challenges, exploring the theoretical and practical limitations of current benchmarks, and proposing pathways towards more reliable, comprehensive evaluations for emotional intelligence in AI.
2. Emotional Intelligence Benchmarks: State of the Art
This article follows the taxonomy of EmoBench, which classifies emotional intelligence evaluation into two domains: emotional understanding and emotional application.
Emotional Understanding involves recognising, interpreting, and comprehending emotional nuances beyond basic detection. Benchmarks in this domain test models' ability to infer emotions beyond superficial keyword associations.
Emotional Application refers to the practical use of emotional intelligence in decision-making and responses. Benchmarks in this domain often simulate social interactions, conflict resolution, and empathy-driven decision-making, requiring models to exhibit contextually appropriate emotional responses.
Though this taxonomy provides a robust framework for evaluating text-based emotional intelligence, it has notable limitations. Its exclusive focus on text restricts the assessment of non-verbal emotional cues such as tone of voice, facial expressions, and body language, elements critical to holistic emotional understanding but outside the scope of this discussion.
3. The Problem of Subjectivity in Scene Settings
The first challenge we identified in LLM benchmarks is the subjectivity in scene settings, which leads to inconsistencies both within and across different benchmark datasets.
Within benchmarks, cultural context influences data creation, embedding implicit norms that may not be universally understood. Across benchmarks, varying definitions of emotions—such as Ekman’s six basic emotions vs. dimensional approaches—affect evaluation consistency. A model may perform well in one framework but struggle in another due to these discrepancies.
Subjective scene settings can also produce spurious errors, where models give reasonable responses that simply do not align with the human-labelled answer. These so-called "errors" sometimes disappear across multiple attempts, indicating interpretive variability rather than a failure of comprehension.
Consider the following example:
"Rosy and Andie went on a hike. Rosy's blood sugar gets low very quickly, so she brought chocolates with her. Andie saw Rosy's face had gone pale while she was grabbing a piece of chocolate, so she said, "Sorry, did I scare you?"
Here, limited context results in multiple interpretations. While all models predicted emotions based on Andie's reaction to Rosy, human labellers marked "Amusement" as correct, likely viewing Andie's question as rhetorical or playful. This divergence illustrates how subjective framing can lead to discrepancies between model outputs and human-labelled answers, not because the models fail to understand, but because the scenario itself allows for multiple valid interpretations.
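One practical way to separate interpretive variability from genuine comprehension failure is to sample each scenario several times and compare run-to-run self-consistency with gold-label accuracy. The sketch below is a minimal illustration under our own assumptions; `classify_emotion` is a hypothetical wrapper around whichever model is being evaluated, not part of any existing benchmark harness.

```python
from collections import Counter

def consistency_report(scenario, choices, gold_label, classify_emotion, n_runs=10):
    """Sample the model repeatedly and separate two signals:
    accuracy (agreement with the human gold label) and
    self-consistency (agreement with the model's own modal answer)."""
    answers = [classify_emotion(scenario, choices) for _ in range(n_runs)]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "accuracy": counts.get(gold_label, 0) / n_runs,
        "self_consistency": modal_count / n_runs,
        "modal_answer": modal_answer,
        "answer_distribution": dict(counts),
    }
```

A scenario on which a model is highly self-consistent yet persistently "wrong" is a candidate for re-labelling or reframing rather than evidence of a model failure.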
4. Labelling Ambiguity and Human Subjectivity
Labelling ambiguity and human subjectivity further limit reliable LLM benchmarking, particularly in emotion classification tasks. Because labels are determined by human annotators rather than strict rules or universal psychological theories, subjective interpretation inevitably shapes dataset creation.
Consider the following example of how these challenges play out in practice:
"Dorea was trying to cook a Baklava. When she took it out of the oven, the Baklava was ruined as the bread was not crispy, and the filling was bursting all over the pan. At that moment, her daughter came home and noticed her mom's fresh yet ruined Baklava. She tasted it and gave a thumbs-up to her mother. "
The ground truth label is Delight, but interpretations vary. Dorea's feelings depend on factors such as her personality, her relationship with her daughter, and her past experiences. DeepSeek-R1 suggested another plausible reading: that the daughter was simply being kind, which could lead Dorea to feel embarrassment instead. This demonstrates how emotional perception is shaped by context and personal biases, making labelling inherently subjective.
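One way to make this subjectivity measurable is to collect labels from several annotators per scenario and report chance-corrected agreement alongside the gold label. The sketch below is a minimal, assumption-laden illustration: it computes Fleiss' kappa from a hypothetical count matrix, and the emotion categories and annotator counts are invented for demonstration.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) matrix where
    counts[i, j] = number of annotators assigning item i to category j.
    Assumes the same number of annotators rated every item."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item agreement: fraction of concordant annotator pairs.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    # Expected agreement under chance, from the marginal category frequencies.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return float((p_i.mean() - p_e) / (1 - p_e))

# Hypothetical data: 3 scenarios, 5 annotators, categories = [Delight, Embarrassment, Gratitude]
counts = np.array([
    [5, 0, 0],   # unanimous: a single gold label is defensible
    [2, 2, 1],   # split: the "ground truth" is contestable
    [4, 1, 0],
])
print(round(fleiss_kappa(counts), 3))
```

Items with unanimous labels can keep a single gold answer, while split items like the Baklava scenario are better scored against the full label distribution or excluded from strict accuracy metrics.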
5. Hidden and Unintentional Assumptions
Another major flaw in LLM benchmarks is the presence of hidden and unintentional assumptions that subtly influence model responses and their evaluation.
Benchmark scenarios often embed implicit biases regarding context, reasoning, or expected knowledge. This is especially problematic in emotion classification, where ambiguous cues force models to infer details, leading to responses that appear incorrect but are logically sound. Even when context is provided, unclear phrasing can create uncertainty, making evaluation more about interpretation than actual understanding.
Additionally, models must often assume implicit character context. Differences in how they interpret motivations or emotions can lead to multiple, equally valid answers. The same model may even produce different responses under varying conditions.
Example:
"India and Jane had prepared a surprise for their best friend, Blair, whose birthday was in 3 days. A day before Blair's birthday, the three of them were playing together in Blair's room and India started talking about the surprise. Before India could talk further, Jane faked a cough and gave India a stare. "
When tested, DeepSeek-R1 had to decide whether Blair had noticed the hints; both assumptions were reasonable given the lack of explicit cues. Similarly, GPT-o3-mini-high encountered ambiguity in the answer choice "oblivious", interpreting it as unawareness rather than an emotional reaction. This highlights how unclear instructions and wording introduce unintended biases, forcing models to rely on their own interpretive frameworks rather than a clearly defined evaluation standard.
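One way to detect this kind of wording-driven bias is to re-run the same scenario with paraphrased answer options and check whether the model's choice stays stable. The sketch below is purely illustrative; `ask` is a hypothetical function that returns the index of the option the model selects.

```python
def wording_sensitivity(scenario, option_sets, ask):
    """option_sets holds several paraphrases of the same answer options,
    in the same order (e.g. "oblivious" vs. "unaware of the surprise").
    Disagreement across phrasings suggests that wording, not emotional
    reasoning, is driving the model's answer."""
    picks = [ask(scenario, options) for options in option_sets]
    return {"picks": picks, "stable": len(set(picks)) == 1}
```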
6. Theoretical Limits to Accurate Evaluation
Evaluating LLMs in emotional intelligence presents inherent theoretical limitations, as emotional reasoning lacks clear-cut, universally agreed-upon answers. Unlike mathematics or formal logic, emotions are heuristic and context-dependent, making precise assessment ambiguous.
Example:
"It was the day of the school's talent competition. Backstage, Sara, who was performing her stand-up comedy show, felt like she had prepared a good sketch for her act and only needed one last rehearsal. So she started pacing back and forth while mumbling words."
The task requires interpreting an emotional response within a nuanced social context. Unlike a mathematical equation, there is no single "correct" answer. Emotional states are fluid, context-dependent, and deeply subjective, influenced by cultural, personal, and situational factors. This divergence from deterministic problem-solving underscores how fundamentally different evaluating models on emotional intelligence tasks is from evaluating them on tasks with verifiable answers.
Approaches to Benchmarking Emotional Intelligence:
Advanced models like OpenAI's Omni series and DeepSeek-R1 offer opportunities for human-in-the-loop feedback, refining both models and benchmarks iteratively. Recognising the limits of precise evaluation isn't a weakness but a step towards better, more adaptive benchmarking that reflects the complexities of human emotion.
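As a concrete illustration of what a less brittle scoring rule could look like, the sketch below scores a model's answer against the full distribution of human labels instead of a single gold answer, giving partial credit for any interpretation that annotators also endorsed. This is our own minimal sketch, not a protocol used by EmoBench, EQ-Bench, or any other published benchmark, and the example labels for the Sara scenario are invented.

```python
def soft_score(model_answer: str, annotator_labels: list[str]) -> float:
    """Credit equals the fraction of annotators who chose the model's answer,
    so an answer picked by 2 of 5 annotators scores 0.4 rather than 0."""
    if not annotator_labels:
        raise ValueError("need at least one annotator label")
    return annotator_labels.count(model_answer) / len(annotator_labels)

# Hypothetical labels for the Sara scenario: annotators split between two readings.
print(soft_score("excitement",
                 ["nervousness", "nervousness", "excitement", "excitement", "nervousness"]))  # 0.4
```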
7. Moving Forward: Towards More Robust Emotion AI Benchmarking
As the challenges of evaluating emotional intelligence in LLMs become clearer, it is evident that a paradigm shift is needed. Rather than focusing solely on outcomes, future benchmarks must prioritise the reasoning processes behind model predictions.
Achieving these improvements requires interdisciplinary collaboration: AI researchers, psychologists, linguists, and prompt engineers must work together to develop benchmarks that are both technically robust and grounded in human psychology and language principles.
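To make the idea of prioritising reasoning processes more tangible, here is a minimal sketch of a rubric that scores the evidence a model cites separately from the label it predicts. The field names, weights, and grounding check are our own assumptions for illustration, not an established evaluation standard.

```python
from dataclasses import dataclass

@dataclass
class EmotionJudgement:
    label: str                 # the emotion the model predicts
    cited_evidence: list[str]  # scenario spans the model says support the label

def rubric_score(judgement: EmotionJudgement, accepted_labels: set[str], scenario: str) -> float:
    """Blend outcome and process: 0.5 for any accepted label,
    plus up to 0.5 for cited evidence actually present in the scenario text."""
    label_score = 0.5 if judgement.label in accepted_labels else 0.0
    if judgement.cited_evidence:
        grounded = sum(span in scenario for span in judgement.cited_evidence)
        evidence_score = 0.5 * grounded / len(judgement.cited_evidence)
    else:
        evidence_score = 0.0
    return label_score + evidence_score
```

Combined with consistency checks and annotator label distributions like those sketched earlier, this kind of rubric rewards models that reach a defensible answer for the right reasons rather than by lucky guessing.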
At GoBubble, we are committed to advancing emotion AI benchmarking through innovative methodologies and collective expertise. Our upcoming blog post will outline concrete proposals for addressing these challenges and share insights into our models’ performance on newly refined benchmarks. We look forward to engaging with the research community and shaping the future of emotion-AI evaluation together. Stay tuned!