Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance
Summary
Current large audio language models (LALMs) tend to rely on lexical content rather than acoustic cues when interpreting emotion. The paper introduces LISTEN, a benchmark designed to disentangle lexical reliance from acoustic sensitivity in narrative speech, and evaluates six state-of-the-art LALMs on it. Results show lexical dominance across evaluations: models default to predicting neutral emotion when lexical cues are absent and perform near chance in paralinguistic settings, suggesting many LALMs effectively transcribe speech rather than listen to it.