Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities
We present HumorBench, a benchmark designed to evaluate large language models' (LLMs) ability to reason about and explain sophisticated humor in cartoon captions. As reasoning models increasingly saturate existing benchmarks in mathematics and science, novel and challenging evaluations of model intelligence beyond STEM domains are essential.
Comprehending text-based humor depends on reasoning: a reader must connect concepts in the cartoon and caption to external cultural references, wordplay, and other mechanisms. HumorBench includes approximately 300 unique cartoon-caption pairs from the New Yorker Caption Contest and Cartoonstock.com, with expert-annotated evaluation rubrics identifying essential joke elements.
Performance of current models on HumorBench (as of July 2025)
| Rank | Model | Score (%) |
|---|---|---|
Score represents the percentage of humor elements correctly identified. See our interactive results viewer for detailed analysis.
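As a rough illustration of the score definition above, the sketch below computes the percentage of rubric elements marked as identified. It assumes a simple pooled average over all elements; the `ElementJudgment` record, the `humor_score` function, and the sample judgments are hypothetical and are not taken from the HumorBench codebase or data.

```python
from dataclasses import dataclass


@dataclass
class ElementJudgment:
    """Hypothetical record: one rubric element and whether the
    model's explanation was judged to have identified it."""
    cartoon_id: str
    element: str
    identified: bool


def humor_score(judgments: list[ElementJudgment]) -> float:
    """Percentage of rubric elements identified, pooled over all cartoons.

    Assumption: every element carries equal weight, regardless of how
    many elements its cartoon has.
    """
    if not judgments:
        raise ValueError("no judgments to score")
    hits = sum(j.identified for j in judgments)
    return 100.0 * hits / len(judgments)


# Illustrative judgments only; not real benchmark data.
judgments = [
    ElementJudgment("chess", "Death is literally playing a chess game", True),
    ElementJudgment("chess", "Alludes to a 'chess match with death'", False),
    ElementJudgment("shark", "The shark views the swimmer as groceries", True),
    ElementJudgment("shark", "Caption is voiced from the shark's perspective", True),
]

print(humor_score(judgments))  # 75.0
```

Under this assumption, cartoons with more rubric elements contribute more to the total. Averaging per-cartoon scores instead would weight each cartoon equally and could give a different number.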
HumorBench takes a fundamentally different approach to evaluating humor understanding in AI. Previous benchmarks often conflate two distinct challenges: understanding why something is intended to be funny (objective comprehension) and finding it amusing (subjective appreciation). We focus exclusively on the former.
Given a textual description of a cartoon and its caption, models must explain what the joke is, identifying the specific connections and mental leaps required to understand the humor. For instance, a model should recognize that "Death" playing chess references both the literal game and the metaphorical "chess match with death," or understand how a swimming person might be interpreted as "groceries" from a shark's perspective.
This approach reveals that humor comprehension is fundamentally a reasoning task. Models must connect visual elements, caption text, and external knowledge, making the same kinds of logical leaps required in STEM domains but applying them to cultural and linguistic contexts.