Computational Creativity
I can’t tell you what a psychological relief it was, after months of working with agents, to find something that they were genuinely terrible at: coming up with creative ad campaign ideas. We are not talking “not quite there yet” bad. We are talking “consistently pegged at 7-17% success rates on our evals” bad. This relief turned out to be short-lived, however, as it took me 6 weeks of banging my head against a terminal before the ideas were good enough that I felt like our product was finally, mercifully ready for beta.
Slop is a 4-letter Word
There doesn’t seem to be a unified definition of computational creativity, but the ones I’ve found all converge on something like the below from Google’s AI Overview, which is a fair synthesis:
Computational creativity is a multidisciplinary field at the intersection of AI, cognitive psychology, and the arts, aiming to build software that exhibits creative behaviors. It involves developing systems that can independently generate, evaluate, and refine novel outputs—such as art, music, or poetry—often by exploring "conceptual spaces". (Sources: 1, 2, 3; generated on 5/7/26).
If you replace “art, music, or poetry” with “advertising ideas,” this is a big part of what we’re building at branchpoints: software that helps marketers R&D potential promotional directions at a speed and scale that was impossible pre-GenAI. Problem is, if the ideas aren’t actually any good, congratulations friendo, you’ve just built yourself a slop factory.
We are not in the business of slop, which I define differently than Webster’s slur for low quality, high volume digital content. To me, slop is simply anything mid. I don’t care how something was produced, I only care if it’s good. A long time ago, I used to spend time working out of the BBDO offices in New York City, and there was this one conference room with an Ira Glass quote painted on the wall which has always stuck with me:
“It's hard to make something that's interesting. It's really, really hard. It's like a law of nature, a law of aerodynamics, that anything that's written or anything that's created wants to be mediocre. The natural state of all writing is mediocrity... So what it takes to make anything more than mediocre is such an act of will.”
- Ira Glass, emphasis mine
Viewing the world through this lens gives me hope, because if you believe in Ira’s law, and I do, anything you find to be meh can be made spectacular if paid enough human time and attention. It also explains why the world was lousy with slop long before GenAI came along.
Unfortunately for me this January, the thing I was finding very meh was v1 of the product we just built. I couldn’t for the life of me figure out why our output was converging on the same dozen trite ideas, regardless of brand or category or strategy or brief or prompts or context or tools or anything else in our application layer. I was caught particularly off guard by how bad things were because I had always had such great results pairing with LLMs to develop ideas that genuinely excited me. To be clear, the goal here was not to discover some universal theory of creativity. There is no such thing, obviously. We were simply trying to encode into software a very specific creative standard for this type of work honed over many years: mine.
The first step in solving any problem is accurately identifying it. What followed was a journey to the heart of model training, machine learning failure modes, and statistical distributions. Deep waters for someone who majored in their native language. Funny enough, I would come to find that mid turns out to be a much more technically accurate term for AI output than I anticipated.
LLMs as Consensus Machines
Marketing has always felt like a natural fit for LLM-based applications because their non-determinism is a feature, not a bug. In marketing, the solution space is vast and success isn't predictable in advance. There's no formula that guarantees a campaign will bend the market towards your brand, no matter how many celebrities you stuff into your ad. The only guarantee is that predictable = invisible. One plus one should never equal two; it should add up to Dilly Dilly. Arriving at ideas like this is hard work. Breakthrough creativity has always come from a great volume of thinking, and science has empirically proven that quantity begets quality. You simply have to explore hundreds of ideas to arrive at a single great one, no shortcuts. Well, unless you’re an LLM. Then the literature says that quantity does not increase quality in creative ideation and that LLMs fail at creativity in advertising because they consistently “drift toward mediocrity.”
Disclaimer: I spend a lot of time trying to understand AI better, out of curiosity and also because doing so improves our business. But I’m not a statistician. And I’m certainly not an AI or ML researcher. So take everything that follows with a grain of non-technical salt. However, at this point I have stared at and annotated thousands of data traces from our product. So my understanding feels sufficiently deep on the many and diverse failure modes that arise when trying to get models and agents to generate solid advertising creative. Pain is a great teacher.
If you are interested in the current strengths and limitations of AI today (May ‘26) for strategic and creative exploration, I think the single most important thing to understand is mode collapse, an idea I first encountered in the fascinating Verbalized Sampling paper. While 'mode collapse' is apparently a failure mode first seen in earlier generative-model research—again, I’m not a researcher—to me it perfectly describes the wall we hit when first testing our product that prizes divergent thinking. Because we got a lot of crap like this:

Each of these ad ideas, if you can even call them that, is from a different LLM call with different instructions and context. Yet there is a stunning and disgusting convergence in thinking. The first mistake I made, before I understood what was happening under the hood, was throwing more compute at the problem. I thought “If I can’t get a few chained API calls to come up with great ads, I’ll just let an unbounded agent rip until they crack it, COGS and latency be damned!” This made matters significantly worse. Agents are, in their simplest conceptual form, chat in a loop. Yes, I know an agent has 1,000 definitions. Not here to debate them. Internally, we use Willison’s canonical reasoning model with tools in a loop to achieve a goal. Andreessen’s more recent model + shell + file system + markdown + cron is emphatically not how we’ve built our product, and feels more like an architectural opinion (a strong one, mind you!) than a definition. But I digress. Broadly speaking, agents involve tons of design choices around a model wrapped in state, tools, policies, and validation that are as heterogeneous as the teams that build them and the use cases they solve. So even if in one turn you manage to get the LLM to a strange and exciting place, the next time around the loop, the agent will work very hard to get back to its safe and boring comfort zone.
This is because when you use an LLM today, you are effectively using a consensus machine. During pretraining, GPT-style models absorb patterns from an enormous range of human writing: the standard, the strange, the elegant, the cliché, and everything in between. But during post-training, the models are aligned to be helpful, safe, and liked by a broad set of users. That process can reduce output diversity, pulling models toward narrower, more homogenized outputs despite the staggering breadth of data they were pretrained on. One visible symptom of this phenomenon is the widely discussed AI writing “smell.” It turns out when optimizing a product for most people, most people don’t like strange or unexpected responses. You know what makes for great advertising? Strange and unexpected ideas!
This means it's important to separate what models are capable of from how they’re most commonly experienced. By the time these models reach you inside popular chat products, they are swaddled in safety layers, system prompts, product defaults, and all manner of other settings that bias toward polished, “helpful” responses. Theoretically, this is why outputs for creative use cases can feel mid. Practically, it means you and your nearest competitor may get uncomfortably similar answers when asking the same chatbot to solve a marketing challenge like, say, “How do we drive awareness of our product among cardiologists?”
Thank you for bearing with this overly technical explanation, but you can now see our conundrum: when you attempt to engineer computational creativity with LLMs, you are fighting the statistical gravity of the models. Looping this problem with an off-the-shelf agent often makes it worse. In all our experiments and testing, general purpose agents mostly came up with lame, repetitive ideas. So you need a lot of scaffolding in your application, or what people now call a harness, to shake models and agents out of their boring habits. We ended up having to build our own creative misfit agents who hate the normy middle and work really hard to get to surprising and exciting places while staying on brief. This engineering effort was non-trivial.
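To put a rough number on that repetitiveness, here is a minimal Python sketch. It is illustrative only and not the pipeline described in this post: it assumes a run of concepts is available as plain strings, and it uses word-overlap (Jaccard) similarity as a crude stand-in for "these are the same idea." The function names and sample concepts are hypothetical.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercased word set for one concept; crude, ignores word order."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Word overlap between two concepts: 0.0 = disjoint, 1.0 = identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(concepts: list[str]) -> float:
    """Average Jaccard similarity over every pair of concepts in a run."""
    sets = [tokens(c) for c in concepts]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0  # fewer than two concepts: nothing to compare
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical sample runs, invented for illustration.
converged = [
    "Couple walks on the beach at sunset",
    "Older couple walks the beach at sunset with a dog",
    "Family walks on the beach at sunset",
]
diverse = [
    "Couple walks on the beach at sunset",
    "A vending machine that only accepts apologies",
    "Rival mascots share an awkward elevator ride",
]

print(f"converged run: {mean_pairwise_similarity(converged):.2f}")
print(f"diverse run:   {mean_pairwise_similarity(diverse):.2f}")
```

A higher score means the run is circling the same territory. Word overlap misses paraphrases, so treat it as a smoke alarm for convergence rather than a measure of creative quality.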
For the record, I think this will all change and change quickly. I suspect labs will soon expose more controls for diversity, sampling, exploration, and creative variance. The crude controls we have today are a start, but don’t let you actually steer towards creative quality as you have defined it. You know when the GOAT calls out the problem of “no competitive advantages” between two Claude-run firms, the model labs are likely already working on solving it. In fact, I know precisely one guy whose friend used to train models at a hyperscaler, and he told me as much. Yes, I realize that’s how rumors start.
LLMs as Concept Calculators
I suppose we should pause to discuss the elephant in the room: is it perverse to try and offload creative thinking to models in the first place? My big picture take on basically any debate in AI right now, where intelligence is increasingly capable, ambient, and disposable, is that there are no longer binary truths, because both poles of any argument are usually simultaneously correct. In this case, the things that are both true are:
- Humans have enjoyed a multi-millennia run of primacy as the undisputed champions of creative thinking, and will remain exceedingly excellent at it
- Models and agents, if you play with them seriously enough, are now undeniably creative in surprising and uncomfortable ways
One is not replacing the other. Far from it. It’s just that now, from a wide enough vantage point, both poles are effectively true. Arguing about this does not interest me. Finding novel ways to pair people with these new creative capabilities interests me greatly. Let me tell you a little story…
In the summer after ChatGPT stormed the scene, I used to run workshops where we’d demo generative AI for clients and talk about how brands were starting to use this technology. The room was always a mix of nerves and excitement, so we’d defuse the tension by asking what everyone wanted AI to be able to do for them. Slide 2 revealed a GIF of a robot doing laundry, we laughed, and things would generally take on a more jovial bent from there. But to drive the point home that we, the humans, had nothing to worry about, I’d close the ice breaker by calling LLMs “word calculators.” I’d explain how they were just statistical next-token prediction machines, and since no one has ever lost their job to a calculator, we shouldn’t be worried now either. I think this generally helped, and it was an honest representation of my understanding of the tech at the time.
Problem is I was wrong. Or at best, half-right. Back then I didn’t understand that in the process of learning to generate text one token at a time, models absorb far more than word patterns. They encode something closer to math on meaning. Their generated responses are shaped by patterns learned from their training data, including relationships among ideas, styles, genres, arguments, and concepts embedded inside the model as high-dimensional numerical representations. Framed this way, to me the much better and more exciting metaphor for LLMs is concept calculators. They are machines that let you push your own ideas and intent through a massive mathematical model of human meaning and see what new combinations come back.
When you come up with ad ideas, the industry jargon for this exercise is literally called “concepting.” It stands to reason then that a concept calculator would be awfully helpful with this type of work, if only it weren’t broken half the time spewing out pocket watch entrails. Herein lies the tension of applying LLMs for our use case, and the heart of our problem: there now exists a general-purpose technology that can help you generate and refine ideas by working through patterns learned from a vast archive of human expression. But to use the tech effectively for creative work, you have to find inventive new ways to outmaneuver its default pull toward consensus thinking. Problem accurately identified; now we needed a solve.
Engineering Divergence
After a spin through arXiv was a lot less fruitful than I anticipated, I went searching for answers in the most unlikely of places: books. I have a shelf in my office full of books on advertising and art and writing and strategy and thinking and a whole bunch of other creative stuff. I began poring over them for inspiration on how to get our product to be more creative. Then I spent the next 6 weeks translating my favorite findings and ideas into software. In the end, we built 7 versions of our creative generation pipeline, and our north star internal eval metric (an intentionally very fuzzy, very opinionated “Would I proudly show this to a client?”) jumped from embarrassingly meh to quite exciting:
- v1 = 7-17%
- v2-5 = ~20-30%
- v6-7 = ~40-60%

The mechanics of how we got those jumps were pretty unsexy. Our product would generate a run of 100 creative concepts. And I would grade them. Each run took me 3 hours, and I was mentally exhausted afterwards. Then Claude Code would evaluate the same run, knock it out in 10 minutes, and was always hungry for more. For the entirety of the month we did this, when comparing our analyses side-by-side, Claude and I were always 30 points apart in our grading no matter what the baseline pass rate was (he’s not nearly hard enough on the work). But we almost always agreed on the best 5-10 ideas from any given run. To this day I cannot tell you why this is. LLMs are so weird.
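The pattern above, two graders far apart on pass rate yet close on the best ideas, can be made concrete with a small Python sketch. It assumes each grader assigns a numeric score per concept and that a concept "passes" at or above a threshold; the post does not say how grades were recorded, so the scores, the threshold, and all names here are hypothetical.

```python
def pass_rate(scores: dict[str, float], threshold: float) -> float:
    """Fraction of concepts a grader scored at or above the pass threshold."""
    if not scores:
        raise ValueError("no scores to grade")
    return sum(s >= threshold for s in scores.values()) / len(scores)


def top_k(scores: dict[str, float], k: int) -> set[str]:
    """IDs of the k highest-scored concepts for one grader."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def top_k_overlap(a: dict[str, float], b: dict[str, float], k: int) -> float:
    """Share of the top-k picks that two graders have in common."""
    return len(top_k(a, k) & top_k(b, k)) / k


# Hypothetical scores: the model grades every concept 3 points higher.
human = {"c1": 7, "c2": 6, "c3": 4, "c4": 3, "c5": 2}
model = {"c1": 10, "c2": 9, "c3": 7, "c4": 6, "c5": 5}

gap = pass_rate(model, 6) - pass_rate(human, 6)
print(f"pass-rate gap: {gap:.0%}")
print(f"top-2 overlap: {top_k_overlap(human, model, 2):.0%}")
```

In this toy data the lenient grader inflates every score by the same amount, so the pass rates diverge while the ranking stays identical. That is one possible reading of a constant gap alongside agreement on the best ideas, not an explanation the post confirms.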
Anyway, through this analysis we would systematically identify failure modes, then reason on how to solve them (with our brains, not AI), then tinker with the system until we got rid of [insert failure mode]. Then we’d update our eval suite to keep tabs on these failure modes for regressions. Claude’s most adorably infuriating weakness also led to a real product aha moment around cognitive debt and the need for cognitive forcing functions in AIUX, where you build points of mental friction into your product so users think more critically and less passively when using it.
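As a rough illustration of keeping tabs on failure modes for regressions, the Python sketch below compares how often each tagged failure mode appears in a baseline run versus a current run. It assumes every graded concept carries a set of failure-mode tags; the tag names, the tolerance, and the function names are hypothetical and are not taken from the eval suite described here.

```python
from collections import Counter


def failure_rates(tags_per_concept: list[set[str]]) -> dict[str, float]:
    """Share of concepts in a run that exhibit each failure-mode tag."""
    if not tags_per_concept:
        raise ValueError("empty run")
    counts = Counter(tag for tags in tags_per_concept for tag in tags)
    total = len(tags_per_concept)
    return {tag: n / total for tag, n in counts.items()}


def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, float]]:
    """Failure modes whose rate rose by more than `tolerance` since the baseline."""
    return {
        tag: (baseline.get(tag, 0.0), rate)
        for tag, rate in current.items()
        if rate - baseline.get(tag, 0.0) > tolerance
    }


# Hypothetical tagged runs: one set of failure tags per graded concept.
baseline_run = [{"beach_stroll"}, set(), {"flowery_prose"}, set()]
current_run = [{"beach_stroll"}, {"beach_stroll"}, set(), set()]

flagged = regressions(failure_rates(baseline_run), failure_rates(current_run))
for tag, (before, after) in flagged.items():
    print(f"{tag}: {before:.0%} -> {after:.0%}")
```

The tagging itself still has to come from a grader. This only shows the bookkeeping that turns those tags into a regression alert.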
We discovered this by accident because, basically, Claude kept glazing us on the incredible creative prowess of our system. Problem was it wasn’t so amazing yet. When we’d ask Claude why he felt this way, he kept citing output like the below as proof the product was improving:
“A glowing peach orb falls softly into the gently rolling turquoise of the Pacific. The clouds blush aubergine as two lovers in their mid-70’s stroll with the waves and their golden retriever lapping at their feet, pari-passu.”
And I’d ask, as politely as possible, “Why, Claude, did you select this concept as a great advertising campaign?” And he’d go “The prose! The imagery! This idea is F. Scott incarnate!” And I’d respond, again as politely as possible, “Don’t you ever pitch me an effing beach stroll concept again.” (I am embarrassed by how much I cursed at Claude Code during this period.)
This failure mode, while blood-boiling to work through, was ultimately highly instructive. For our creative evaluation use case, it revealed how easily an LLM-as-a-Judge could be snowed by flowery language and barely-there ideas. We didn’t know it at the time, but the literature has identified a constellation of related weaknesses in a model’s ability to evaluate the quality of an idea: style-over-substance, style bias, self-preference, and verbosity bias. So we stopped treating AI judgement as a viable arbiter of creative quality, because models are too willing to reward the most articulate slop.
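One cheap probe for this kind of bias is to check whether a judge's scores simply track how long a concept is. The Python sketch below computes the Pearson correlation between word count and judge score. It is an assumption-laden illustration, not the analysis run for this post: the `graded` input shape and the function names are hypothetical, and a high correlation is only a hint of verbosity bias, since longer concepts could also be genuinely better.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def length_score_correlation(graded: list[tuple[str, float]]) -> float:
    """Correlation between each concept's word count and its judge score."""
    lengths = [float(len(text.split())) for text, _ in graded]
    scores = [float(score) for _, score in graded]
    return pearson(lengths, scores)
```

Running `length_score_correlation` on the same concepts for a human grader and a model judge, then comparing the two values, would show whether the model leans on length more than the human does.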

I could go on and on about all our experiments and failures and learnings, but have bored you long enough. The larger point is there is a crazy amount of R&D needed by non-technical SMEs to get the models and agents to do fuzzy things well enough to ship them. I can’t wait to build harness v8, because I’m already brimming with ideas (and fresh data) on how to improve things even further. I also recognize everything we build is temporary. The abiding law of AI products is to, a couple times a year or so, remove all the scaffolding you built around the stuff AI wasn’t good at yet in order to free up the newer, more powerful models to fully cook. So we’re about due for that too. But I also feel like I’ve fallen down a deep well, where there is no “correct way” to do any of this, and it’s all just how people choose to force their personal standards into products for subjective use cases.
Finding Our Place in the Loop
You hire creative people because they are opinionated. Because they are fractious and persnickety. Because they don’t see the world like most people. Because the mode is invisible to them. Because there is value in leasing their lens for a bit to get a fresh look at your thing-in-need-of-freshening. These same people can now encode their opinions into systems that are repeatable and scalable and, if architected properly, even self-improving. As intelligence becomes a utility like water or electricity, the people and teams who play with it and pipe it in interesting, highly opinionated ways will be the ones who separate themselves from the commodified mode.
What makes an ad idea great anyway? In the end, this is the opinion we’d been trying so hard to automate. I have always liked Luke Sullivan’s definition from Hey Whipple, Squeeze This, the single best book about advertising ever written:
“Eventually you get to an idea that dramatizes the benefit of your client’s product or service. Dramatizes is the key word. You must dramatize it in a unique, provocative, compelling, and memorable way.”
In an attention economy, where pattern breaking your brand’s way into your audience's field of view seems to be the goal of any marketing effort these days, the above strikes me as a pretty good definition. When evaluating ideas, I also keep my own internal measure near at hand: “Is it obvious in hindsight?” Meaning, you’d personally never arrive at this idea on your own. But after seeing it, it gets its hooks in you, and you start accidentally reaching for it to describe the problem. It’s so clear and useful, it becomes the de facto way you and the team now discuss the communication challenge itself. Most people can identify great ideas like this fairly easily whether they’re “creative” or not, because these ideas tend to leap off the page and worm their way into how you think. A great ad idea advertises itself.
I have not yet found or been able to build a model, agent, or system strong enough at identifying ideas like this. Which means I cannot yet confidently offload it to AI. So for now, this is where in our loop the humans sit.