ITHACA, N.Y. — Artificial intelligence (AI) models nowadays are as impressive as ever, but new research reports even the most advanced systems still aren’t capable of grasping the very human concept of humor. Sure, large AI networks may be able to generate thousands of simple “why did the chicken cross the road” or “knock, knock” jokes, but scientists at Cornell University explain that AI programs still don’t actually understand why the jokes are funny.
Study authors used hundreds of entries in The New Yorker magazine’s Cartoon Caption Contest as a source of testing data and jokes. They then challenged both AI models and humans with three distinct tasks: matching a joke to a cartoon, identifying a winning caption, and explaining why the winning caption is funny.
Across all the tasks, humans performed demonstrably better than the machines. So, even though AI breakthroughs like ChatGPT have narrowed the human-machine performance gap over the past year or so, these findings suggest AI still has a ways to go before it can host its own comedy special.
“The way people challenge AI models for understanding is to build tests for them – multiple choice tests or other evaluations with an accuracy score,” says lead study author Jack Hessel, Ph.D. ’20, a research scientist at the Allen Institute for AI (AI2), in a university release.
“And if a model eventually surpasses whatever humans get at this test, you think, ‘OK, does this mean it truly understands?’ It’s a defensible position to say that no machine can truly ‘understand’ because understanding is a human thing. But, whether the machine understands or not, it’s still impressive how well they do on these tasks.”
To research this topic, the team gathered 14 years’ worth of New Yorker caption contests (over 700 altogether). Each of those contests included a caption-less cartoon, that week’s entries, and the three finalists selected by New Yorker editors. Some contests also recorded crowd quality estimates for every submission.
For each contest, study authors assessed two kinds of AI across the three tasks: “from pixels” models (computer vision) and “from description” models (analysis of human summaries of cartoons).
“There are datasets of photos from Flickr with captions like, ‘This is my dog,’” Hessel explains. “The interesting thing about the New Yorker case is that the relationships between the images and the captions are indirect, playful, and reference lots of real-world entities and norms. And so the task of ‘understanding’ the relationship between these things requires a bit more sophistication.”
During the experiment, the matching phase entailed AI models selecting the finalist caption for the given cartoon from among “distractors” that were in fact finalists but for other contests. The quality ranking portion asked AI models to differentiate a finalist caption from a nonfinalist, and the explanation phase asked AI models to create free text explaining how a high-quality caption relates to the cartoon.
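The matching setup described above can be illustrated with a short Python sketch. This is not the study authors' code. The `Contest` class, the function names, and the choice of five options per question are assumptions made purely for illustration. The sketch shows how a multiple-choice item might be assembled from one contest's finalist plus distractor finalists drawn from other contests, and how an accuracy score is then computed.

```python
import random
from dataclasses import dataclass


@dataclass
class Contest:
    """Hypothetical container for one caption contest."""
    contest_id: int
    finalists: list[str]


def build_matching_item(target, all_contests, num_choices=5, rng=None):
    """Build one multiple-choice item for the matching task.

    The correct answer is a finalist caption from the target contest.
    Distractors are finalist captions from *other* contests.
    num_choices=5 is an illustrative assumption, not a figure from the article.
    """
    if rng is None:
        rng = random.Random(0)
    correct = rng.choice(target.finalists)
    pool = [
        caption
        for contest in all_contests
        if contest.contest_id != target.contest_id
        for caption in contest.finalists
    ]
    distractors = rng.sample(pool, num_choices - 1)
    choices = distractors + [correct]
    rng.shuffle(choices)
    return choices, choices.index(correct)


def accuracy(predictions, answers):
    """Fraction of items where the predicted index equals the answer index."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)
```

A model (or a person) would pick one index per item, and `accuracy` would then yield the kind of percentage score the article reports for the matching task.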
Hessel wrote the vast majority of human-generated explanations himself, but only after crowdsourcing the task proved unsatisfactory. Ultimately, he was able to generate 60-word explanations for over 650 cartoons.
“A number like 650 doesn’t seem very big in a machine-learning context, where you often have thousands or millions of data points,” Hessel notes, “until you start writing them out.”
The results revealed a notable gap between AI and human-level “understanding” of why a cartoon is funny. On the multiple-choice test of matching a caption to a cartoon, the best AI model reached only 62 percent accuracy, far below the 94 percent humans achieved in the same setting. When human-written and AI-generated explanations were compared, humans came out on top again, with their explanations preferred roughly 2-to-1.
While AI may not “understand” humor yet, study authors write it can still serve as a collaborative tool humorists may want to use to brainstorm new joke ideas. Even the study itself was written in a humorous manner, with playful comments and footnotes throughout.
“This three or four years of research wasn’t always super fun,” says study co-author Lillian Lee, the Charles Roy Davis Professor in the Cornell Ann S. Bowers College of Computing and Information Science, “but something we try to do in our work, or at least in our writing, is to encourage more of a spirit of fun.”
This study won a best-paper award at the 61st annual meeting of the Association for Computational Linguistics.