You’re in a brainstorming session. The prompt on the screen reads: “If you had the talent to invent things just by thinking of them, what would you create?” Around the table, your team starts typing. Ideas trickle in—some thoughtful, some rushed. You run the same prompt through ChatGPT-4. It gives you four polished, surprising, and detailed answers in seconds.
This scenario isn’t hypothetical. In a large-scale experiment conducted by researchers from the University of Lausanne, over 1,000 U.S. adults were asked to respond to open-ended creativity tasks. So were two generative AI models—ChatGPT-4 and Bard (now Gemini). The responses were then rated for creativity by over 3,000 additional participants. The results were clear: ChatGPT-4 consistently produced ideas that were rated more creative than those generated by humans.
And not just on average. Among the top 1% of highest-rated responses, eight came from ChatGPT-4, compared to six from humans and three from humans using ChatGPT for help.
This wasn’t about trivia or simple Q&A. The tasks were designed to test open, divergent creativity: questions with no single correct answer, where novelty, surprise, and usefulness were the criteria. And when measured across these dimensions, ChatGPT-4 outperformed humans, even when humans had access to it.
But that’s not the whole story. Let’s dig deeper.
Research Methodology
The study examined how generative AI compares to humans in tasks that require creativity and strategic reasoning. It involved over 4,000 participants and was structured around two main experiments: one measuring creative output and the other measuring strategic adaptability.
In the creativity task, 1,250 U.S. adults were asked to respond to one of two open-ended prompts:
- “If you had the talent to invent things just by thinking of them, what would you create?”
- “Imagine and describe a town, city, or society in the future.”
Participants were given up to 10 minutes and 1,000 characters to write their responses. Some worked independently (HumanBaseline), others were informed they were competing with AI (HumanAgainstAI), and some were explicitly allowed to use either ChatGPT or Gemini to assist them (HumanPlusAI).
Separately, ChatGPT-4 and Gemini were given the same prompts and asked to produce creative answers. Both models were prompted in isolated chats with the instruction:
“Give 4 alternative and creative answers to the following question within 1,000 characters for each answer.”
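For readers who want to see what that setup looks like in practice, here is a minimal sketch using the OpenAI Python SDK. It is not the study’s code: the model name, client configuration, and default settings are illustrative assumptions, and only the prompt text mirrors the instruction quoted above (the study ran its prompts in isolated chat sessions).

```python
# A minimal sketch (not the study's code) of the AI prompting step, assuming
# the OpenAI Python SDK. Model name and settings are illustrative assumptions;
# only the prompt text mirrors the instruction quoted in the article.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

task = "If you had the talent to invent things just by thinking of them, what would you create?"
instruction = (
    "Give 4 alternative and creative answers to the following question "
    "within 1,000 characters for each answer.\n\n" + task
)

# One fresh request per prompt, analogous to the isolated chats used in the study.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": instruction}],
)
print(response.choices[0].message.content)
```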
All responses—human and AI—were evaluated by 3,336 online raters, each of whom assessed 20 random responses on four dimensions:
- Overall creativity
- Novelty/originality
- Surprise
- Usefulness
The raters were divided into three groups: some saw original texts, some saw grammar-corrected versions of human responses, and others were told that some responses might be AI-generated and were asked to guess the source.
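To make the scoring step concrete, here is a small sketch of how per-response scores could be aggregated from individual ratings. The table layout, column names, and toy values are hypothetical and illustrate only the averaging logic, not the study’s actual analysis.

```python
# A sketch of the rating aggregation, using a hypothetical table with one row
# per rater-response pair and one column per rated dimension. Toy values only.
import pandas as pd

DIMENSIONS = ["creativity", "novelty", "surprise", "usefulness"]

ratings = pd.DataFrame(
    [
        {"response_id": "r1", "source": "ChatGPT-4", "creativity": 8, "novelty": 7, "surprise": 6, "usefulness": 8},
        {"response_id": "r1", "source": "ChatGPT-4", "creativity": 7, "novelty": 8, "surprise": 7, "usefulness": 7},
        {"response_id": "r2", "source": "HumanBaseline", "creativity": 5, "novelty": 6, "surprise": 4, "usefulness": 6},
        {"response_id": "r2", "source": "HumanBaseline", "creativity": 6, "novelty": 5, "surprise": 5, "usefulness": 7},
    ]
)

# Average each response's ratings across its raters, then compare groups.
per_response = ratings.groupby(["response_id", "source"], as_index=False)[DIMENSIONS].mean()
by_source = per_response.groupby("source")[DIMENSIONS].mean()
print(by_source)
```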
This design allowed researchers to test not only the quality of outputs but also whether perceptions about AI influence ratings, and how humans respond to working alongside or against generative models.
Key Finding: AI Outperformed Humans in Creativity
The experiment revealed a clear outcome: ChatGPT-4 produced more creative responses than humans.
When rated blindly by online raters, responses from ChatGPT-4 received significantly higher creativity scores than those from the HumanBaseline group. This finding was consistent across all measures—overall creativity, originality, surprise, and usefulness. In contrast, Gemini scored lower than human responses across the board.
Among the top-performing ideas—the top 10%, 5%, and 1% of all responses based on creativity scores—ChatGPT-4 dominated:
- Top 10%: 43% of ChatGPT-4’s responses landed in this tier.
- Top 5%: 58.8% of the responses in this tier came from ChatGPT-4.
- Top 1%: 8 of the 17 highest-rated responses were written by ChatGPT-4.
By comparison, only 4% of HumanBaseline responses made it into the top 10%, and Gemini contributed just one response to that same tier.
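Note that those tiers can be read in two ways: a group’s reach into a tier (what fraction of its own responses score that high) and a group’s share of the tier (what fraction of the tier it supplies). The sketch below, with hypothetical column names and toy scores, shows how both fall out of the same scored data.

```python
# A sketch of the top-tier breakdown (hypothetical column names, toy scores).
import pandas as pd

per_response = pd.DataFrame(
    {
        "source": ["ChatGPT-4"] * 4 + ["HumanBaseline"] * 6,
        "creativity": [7.2, 7.4, 6.9, 7.1, 5.0, 5.5, 6.0, 4.8, 5.2, 6.8],
    }
)

def tier_breakdown(scores: pd.DataFrame, quantile: float):
    cutoff = scores["creativity"].quantile(quantile)
    in_tier = scores["creativity"] >= cutoff
    share_of_tier = scores.loc[in_tier, "source"].value_counts(normalize=True)  # group's share of the tier
    reach_rate = in_tier.groupby(scores["source"]).mean()                       # fraction of a group's responses reaching the tier
    return share_of_tier, reach_rate

for q in (0.90, 0.95, 0.99):
    share, reach = tier_breakdown(per_response, q)
    print(f"Top {round((1 - q) * 100)}%: share_of_tier={share.to_dict()} reach_rate={reach.to_dict()}")
```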
Even when humans had access to AI tools, they couldn’t outperform the model itself. In the HumanPlusAI condition, participants who used ChatGPT-4 or 3.5 produced responses that were less creative on average than those generated by ChatGPT-4 alone. The average creativity score for ChatGPT-4 responses was 7.24, while the scores for HumanPlusAI users were lower, even among those who used the same model.
Augmentation Isn’t Enough (Yet)
The experiment tested not only whether AI could outperform humans, but also whether humans could improve their own creative performance by collaborating with AI. The results show that while augmentation does help, it doesn’t match the effectiveness of AI working alone.
Participants in the HumanPlusAI group—those who were explicitly allowed to use ChatGPT or Gemini—produced more creative responses than those in the HumanBaseline group. The average creativity score increased significantly for this group, showing that access to generative tools can enhance human creativity.
But when compared to ChatGPT-4 alone, the augmented responses still fell short. Even participants who used ChatGPT-4 (not 3.5 or Gemini) didn’t match the creativity level achieved by the model operating independently. According to the paper, “Humans plus ChatGPT perform worse than ChatGPT alone.”
The authors attribute this to how the AI was prompted. In the study, ChatGPT-4 was explicitly instructed to generate creative and novel ideas. In contrast, human participants likely submitted the task prompt directly without additional guidance, which could lead to more conventional or literal outputs.
This distinction underscores the importance of how AI is used, not just whether it is used. Even though participants had access to the same model, differences in approach and prompting produced significant differences in output.
Bias, Perception, and Reality
The study didn’t just measure how creative the responses were—it also examined how people perceive creativity when they believe AI is involved. The results uncovered a consistent pattern: raters penalized responses they believed were generated by AI, even when those responses were just as creative—or more so.
In one treatment group (AIRaters), participants were informed that some of the texts they were evaluating may have been created by AI. They were also asked to guess the source of each response. When these raters thought a response was AI-generated, they consistently gave it lower creativity scores, regardless of the actual source.
This penalty appeared across multiple sub-dimensions:
- Originality
- Usefulness
- Overall creativity
Only the surprise dimension remained unaffected by perceived authorship.
Importantly, the penalty wasn’t based on accurate detection. Raters often misclassified the origin of the responses:
- They correctly guessed human-written responses 63% of the time.
- They identified ChatGPT-4 outputs as AI only 61% of the time.
- They were least accurate with Gemini, identifying it as AI just 37% of the time.
- Responses from the HumanPlusAI group were guessed to be AI in 59% of cases.
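For completeness, here is a tiny sketch of how such detection rates could be tallied from raters’ guesses, again with hypothetical column names and toy rows rather than the study’s data.

```python
# A sketch of the detection-rate tally (hypothetical column names, toy rows).
import pandas as pd

guesses = pd.DataFrame(
    [
        {"true_source": "HumanBaseline", "guessed_ai": False},
        {"true_source": "HumanBaseline", "guessed_ai": True},
        {"true_source": "ChatGPT-4", "guessed_ai": True},
        {"true_source": "ChatGPT-4", "guessed_ai": False},
        {"true_source": "Gemini", "guessed_ai": False},
        {"true_source": "HumanPlusAI", "guessed_ai": True},
    ]
)

# Per source, the rate at which raters labelled responses as AI-generated.
# For AI sources this is detection accuracy; for human-written responses,
# accuracy is 1 minus this rate.
ai_guess_rate = guesses.groupby("true_source")["guessed_ai"].mean()
print(ai_guess_rate)
```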
This inability to reliably detect AI authorship, combined with a measurable penalty when AI was suspected, points to what the authors describe as a behavioral constraint: “This is a novel phenomenon similar to algorithm aversion.”
Yet despite this bias, ChatGPT-4’s responses still received the highest average creativity ratings overall, even among raters who were told that some responses might be AI-generated. The study concludes: “Judges are very bad at guessing which ideas are AI-generated.”
So What Does This Mean for Founders and Tech Leaders?
The findings of this study make one thing very clear: generative AI—specifically ChatGPT-4—currently outperforms humans in open-ended creative tasks. And while augmenting human input with AI does lead to improvement, it still doesn’t match the quality of AI working alone.
For founders, tech leaders, and COOs, this raises a pressing operational question: what role should humans play in creative workflows where AI now excels?
The authors point out that:
“While AI can augment human creativity, the effect size is relatively small, and human-AI collaboration does not yet outperform AI operating alone.”
Why? Because effective use of AI isn’t just about access; it’s about execution. The gap appears to hinge on how the tools are used. The study highlights that participants who used ChatGPT-4 didn’t prompt it with the same precision as the researchers did, resulting in lower-quality outputs. This suggests that organizations can’t expect value from AI without also developing AI literacy and prompting skills internally.
Moreover, the results expose a second layer of complexity: perception bias. Even when AI-generated responses are objectively better, people still score them lower when they suspect AI involvement. As the paper notes:
“This is a novel phenomenon similar to algorithm aversion… and might be a behavioral constraint to the adoption of AI.”
So while AI can clearly perform at high levels, adoption across teams and stakeholders may be slowed by skepticism, misperceptions, or underestimation. These social dynamics matter. The authors suggest that ensuring responsible deployment of AI will require addressing biases and misconceptions—not just technical capability.
Finally, on the question of innovation: humans still hold a critical edge in one area—idea diversity at the highest levels of creativity. In the most creative responses, human-generated ideas were more semantically distinct than those from AI. The authors note:
“The most creative individuals in our study are capable of generating ideas that are more unique compared to their AI-generated counterparts.”
That’s a signal to tech leaders: the future isn’t about choosing between AI and human talent—it’s about designing systems where each excels in what it does best. Let AI handle scale, consistency, and surprising originality. Let humans explore uncharted spaces and push the limits of uniqueness. The real advantage lies in combining both—but doing so deliberately.
This Isn’t Just About Winning Brainstorms
This isn’t the final word on AI and creativity. It’s not some definitive proof that humans have been outclassed. It’s one study—well-designed, large-scale, but still just one. And there are clear limitations. Most notably, the humans using AI in the experiment weren’t trained researchers or expert prompt engineers. They didn’t get coaching on how to frame questions or guide the model. That matters.
So no, this doesn’t mean AI has permanently overtaken human creativity. But it does show something that can’t be ignored: how far these tools have come—and how wide the performance gap can get when used well.
That’s the real headline. When people and AI collaborate intentionally, when the setup is right, and when the prompts are sharp—the results improve. Not just slightly. Meaningfully. The most exciting potential isn’t in replacing humans with AI or comparing which one is better. It’s in building systems where each complements the other. Where AI brings speed, volume, and structure—and people bring context, direction, and originality.
That’s been the point of AI all along: not to outperform humans, but to help them do more. To generate better ideas, faster. To think differently. To spend less time stuck and more time building something worthwhile.
The opportunity here isn’t about who wins the creativity contest. It’s about who learns to use AI well enough to build something that actually matters.