HumorBench

Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities

Reuben Narad¹, Siddharth Suresh², Jiayi Chen², Pine S.L. Dysart-Bricken¹,
Bob Mankoff³, Robert Nowak², Jifan Zhang², Lalit Jain¹
¹University of Washington, Seattle    ²University of Wisconsin-Madison    ³Air Mail and Cartoon Collections

Abstract

We present HumorBench, a benchmark designed to evaluate the ability of large language models (LLMs) to reason about and explain sophisticated humor in cartoon captions. As reasoning models increasingly saturate existing benchmarks in mathematics and science, novel and challenging evaluations of model intelligence beyond STEM domains are essential.

Text-based humor comprehension fundamentally involves reasoning: it requires identifying connections between concepts in cartoons and captions on the one hand, and external cultural references, wordplay, and other mechanisms on the other. HumorBench includes approximately 300 unique cartoon-caption pairs from the New Yorker Caption Contest and Cartoonstock.com, with expert-annotated evaluation rubrics identifying essential joke elements.

Leaderboard

Performance of current models on HumorBench (as of July 2025)

Rank Model Score (%)

Score represents the percentage of humor elements correctly identified. See our interactive results viewer for detailed analysis.
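
As a rough illustration of the score described above, the sketch below computes the percentage of rubric elements judged as identified. The data layout (one list of boolean judgments per cartoon) and the function name `element_score` are assumptions made for illustration, not the benchmark's actual code. Pooling elements across cartoons, rather than averaging per cartoon, is also an assumption, and how each element is judged is outside the scope of this sketch.

```python
from typing import Dict, List


def element_score(judgments: Dict[str, List[bool]]) -> float:
    """Return the percentage of rubric elements marked as identified.

    `judgments` maps a hypothetical cartoon ID to one boolean per rubric
    element (True means the explanation covered that element). Elements
    are pooled across cartoons; this pooling is an assumption.
    """
    total = sum(len(flags) for flags in judgments.values())
    if total == 0:
        raise ValueError("no rubric elements to score")
    hits = sum(sum(flags) for flags in judgments.values())
    return 100.0 * hits / total


# Made-up judgments for two cartoons, for illustration only.
example = {"cartoon_a": [True, False], "cartoon_b": [True, True]}
print(element_score(example))  # 3 of 4 elements identified -> 75.0
```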

Key Findings

How HumorBench Works

HumorBench Overview

HumorBench takes a fundamentally different approach to evaluating humor understanding in AI. Previous benchmarks often conflate two distinct challenges: understanding why something is intended to be funny (objective comprehension) and finding it amusing (subjective appreciation). We focus exclusively on the former.

The Task

Given a textual description of a cartoon and its caption, models must explain what the joke is, identifying the specific connections and mental leaps required to understand the humor. For instance, a model might need to recognize that "Death" playing chess references both the literal game and the metaphorical "chess match with death," or to understand how a swimming person might be interpreted as "groceries" from a shark's perspective.
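
To make the task format concrete, here is a minimal sketch of how one benchmark item and its prompt could be represented. The class `CartoonItem`, its field names, the prompt wording, and the example text are all hypothetical, written only to illustrate the inputs named above (a cartoon description and a caption) alongside a rubric of joke elements that is kept out of the prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CartoonItem:
    """One hypothetical benchmark item (field names are assumptions)."""
    description: str  # textual description of the cartoon
    caption: str      # the caption being explained
    rubric: List[str] = field(default_factory=list)  # essential joke elements


def build_prompt(item: CartoonItem) -> str:
    """Build the model input; the rubric is deliberately not included."""
    return (
        f"Cartoon description: {item.description}\n"
        f"Caption: {item.caption}\n"
        "Explain what the joke is."
    )


# Made-up item for illustration; not taken from the dataset.
item = CartoonItem(
    description="A shark swims beneath a person paddling at the surface.",
    caption="(placeholder caption)",
    rubric=["The shark sees the swimmer as groceries."],
)
print(build_prompt(item))
```

Keeping the rubric separate from the prompt reflects the idea that the model should produce the explanation unaided, with the rubric used only afterwards to check which elements the explanation covers.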

What Makes HumorBench Different

This approach reveals that humor comprehension is fundamentally a reasoning task. Models must connect visual elements, caption text, and external knowledge—making the same kinds of logical leaps required in STEM domains, but applied to cultural and linguistic contexts.