HumorDB: Can AI Understand Graphical Humor?

Authors: Vedaant V Jain, Felipe dos Santos Alves Feitosa, Gabriel Kreiman
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

Paper (arXiv) · Poster · Code & Dataset

The Core Idea: What does it mean to “understand” an image?

Modern AI can easily label the objects in the images below: “surgeon,” “patient,” “operating room”. But can it tell you why the image on the left is funny, while the nearly identical image on the right is not?

This is the central question of HumorDB. We find that while AI is good at literal-level classification, it fails at human-level abstract reasoning. Humor, which relies on understanding context, expectations, and incongruity, is the perfect testbed for this.

We introduce HumorDB, a new dataset and benchmark built on minimally contrastive pairs. We take a humorous image and make a subtle edit to remove only the humorous element, forcing the model to prove it can pinpoint the exact source of the joke.
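To see why contrastive pairs are a stricter test than single images, consider a pair-level metric that credits a model only when it gets both members of a pair right. The sketch below uses hypothetical helper names and toy data, not the released HumorDB evaluation code:

```python
# Sketch of a pair-level metric for minimally contrastive pairs.
# `predict` is any callable returning True if the model calls an image funny;
# the function and toy data below are illustrative, not from the HumorDB code.

def pair_accuracy(pairs, predict):
    """Fraction of pairs where the funny member is labeled funny AND
    its minimally edited counterpart is labeled not funny."""
    correct = sum(predict(funny) and not predict(edited)
                  for funny, edited in pairs)
    return correct / len(pairs)

# Toy stand-ins: strings instead of images, a keyword-matching "model".
toy_pairs = [("surgeon_funny", "surgeon_edited"),
             ("carpet_funny", "carpet_edited")]
toy_model = lambda name: name.endswith("_funny")
print(pair_accuracy(toy_pairs, toy_model))  # 1.0 for this toy model
```

Under a metric like this, a model that calls every image funny scores 0, so superficial cues shared by both members of a pair cannot inflate the score.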

[Image: Funny surgeon scene]
Funny (83% of humans agree)

[Image: Minimally edited, non-funny surgeon scene]
Not Funny (86% of humans agree)

Key Findings at a Glance

We tested state-of-the-art vision models (ViT, DINOv2) and large vision-language models (GPT-4o, Gemini-Flash) against human performance on three tasks: binary classification, funniness rating, and pairwise comparison (which image is funnier).
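The three tasks map onto simple, standard metrics. The sketch below shows one plausible way to score each; the function names and toy numbers are illustrative assumptions, not the paper's exact protocol:

```python
# Illustrative scoring for the three benchmark tasks; toy data only.

def binary_accuracy(preds, labels):
    # Task 1: binary funny / not-funny classification accuracy.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def rating_mae(preds, ratings):
    # Task 2: mean absolute error against mean human funniness ratings.
    return sum(abs(p - r) for p, r in zip(preds, ratings)) / len(ratings)

def comparison_accuracy(choices, majority):
    # Task 3: agreement with the human-majority pick of the funnier image.
    return sum(c == m for c, m in zip(choices, majority)) / len(choices)

print(binary_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))               # 0.75
print(rating_mae([3.0, 4.5], [4.0, 4.0]))                        # 0.75
print(comparison_accuracy(["left", "right"], ["left", "left"]))  # 0.5
```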

1. A Clear Human-AI Gap Remains

Models perform well above chance, but all trail human-level accuracy. The gap is most pronounced in the "Comparison Task," which requires nuanced judgment.

[Graph: AI performance below human performance]
2. AI Fails to "Look at the Joke"

Even when a model is correct, its internal attention maps rarely focus on the humorous region. Models are "right for the wrong reason," relying on superficial cues, not the joke itself.

[Attention map: AI looking at the wrong part of an image]
3. Abstraction is the Hardest Challenge

Performance varied by image type. All models struggled most with abstract content such as sketches, performing near chance.

[Graph: performance on sketches is lowest]

The Dataset

HumorDB is a diverse, controlled dataset designed for rigorous evaluation.

  • 3,542 Total Images
  • 1,271 Minimally Contrastive Pairs
  • 650 Human Annotators
  • 5 Image Types: Photos (36%), Photoshopped (35%), Cartoons (14%), Sketches (5%), and AI-Generated (10%)

Examples from HumorDB

[Image: Optical illusion of a magic carpet]
GPT-4o: "...the optical illusion created by the shadow...makes it appear as though someone is flying on a magic carpet."

[Image: Dog with sunglasses drinking from a coconut]
Gemini-Flash: "...the dog is wearing sunglasses and enjoying a coconut drink."

[Image: Forced perspective of a snowboarder on the moon]
LLaVA: "...depicts a cartoon of a person inside a box, seemingly being "pulled out" by a hand using a toothpick. ..."

Abstract

Despite significant advancements in image segmentation and object detection, understanding complex scenes remains a major challenge. Here, we focus on graphical humor as a paradigmatic example of image interpretation that requires elucidating the interaction of different scene elements in the context of prior cognitive knowledge. This paper introduces HumorDB, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. The dataset comprises diverse images spanning photos, cartoons, sketches, and AI-generated content, including minimally contrastive pairs where subtle edits differentiate between humorous and non-humorous versions. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding. While pretrained vision-language models perform better than vision-only models, they still struggle with abstract sketches and subtle humor cues. Analysis of attention maps shows that even when models correctly classify humorous images, they often fail to focus on the precise regions that make the image funny. Preliminary mechanistic interpretability studies and evaluation of model explanations provide initial insights into how different architectures process humor. Our results identify promising trends and current limitations, suggesting that an effective understanding of visual humor requires sophisticated architectures capable of detecting subtle contextual features and bridging the gap between visual perception and abstract reasoning. All the code and data are available here: https://github.com/kreimanlab/HumorDB.


Citation

If you find our work useful, please consider citing:

@misc{jain2025humordbaiunderstandgraphical,
      title={HumorDB: Can AI understand graphical humor?},
      author={Vedaant V Jain and Felipe dos Santos Alves Feitosa and Gabriel Kreiman},
      year={2025},
      eprint={2406.13564},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.13564},
}