Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance
Summary
Current large audio language models (LALMs) tend to rely on lexical content rather than acoustic cues when interpreting emotion. The paper introduces LISTEN, a benchmark designed to disentangle lexical reliance from acoustic sensitivity in narrative speech, and evaluates six state-of-the-art LALMs on it. Results show lexical dominance across evaluations: models default to predicting neutral emotion when lexical cues are absent and perform near chance in paralinguistic settings, suggesting many LALMs effectively transcribe speech rather than listen to it.