AI-Native Experimentation

The Mechanical Turk Is Now an LLM

An LLM lets you fake almost any product's core feature for a weekend, so "we cannot test it cheaply" is no longer a constraint. It is an excuse.

Published 2026-06-14

In 1770, a chess-playing machine called the Mechanical Turk beat aristocrats across Europe. Inside the cabinet, a human chess master worked the levers. The machine was a fake. The chess was real, and so was the lesson: you can test whether people want a thing long before you can actually build it.

Here is the call. If your product idea has an AI feature at its core, you can fake that feature convincingly enough to get real demand data in a weekend, for the price of a few cents in API calls. "We cannot test it cheaply" is no longer true. For most AI features, it is an excuse.

The expensive mistake

Teams building AI features skip validation because the feature itself looks too hard to fake. The reasoning sounds careful: "the whole point is the model, so a fake door tells us nothing. We have to build the real pipeline to know if it works." So they spend three months on data pipelines, fine-tuning, evals, and integrations, then ship, only to discover that the people they pictured do not show up, or will not pay. The mistake is treating "hard to build" as "impossible to fake." Those are not the same thing, and conflating them is how a quarter disappears.

What Savoia's IBM test actually proved

Alberto Savoia, who coined the term pretotyping (test "the right it" before you build it) in his book The Right It, tells the story of how IBM checked demand for speech-to-text decades before the technology was viable. Building a real speech recognizer was not commercially possible at the time. So instead of building one, IBM sat a test user at a microphone in front of a screen, while a skilled typist in the next room listened and typed what the user said. To the user, the machine appeared to transcribe speech in real time. That is the Mechanical Turk move: a human behind the curtain doing the work the product would eventually do.

IBM did not learn "speech-to-text is neat." They learned things that would have cost a fortune to discover after a full build: dictating for an hour made people's throats hurt, and users would not speak confidential information out loud in an open office. Real behavior, real friction, real signal, before a single line of recognition code. The fake backend was a human typist. Today it is an LLM.

Why the excuse died

Three things changed.

  1. The hidden worker is now software. Savoia needed a typist. You need an API key. An LLM can stand in for the "magic" in a wide range of products: summarize this contract, draft this reply, sort these tickets, pull these fields out of this document, generate these twenty variants. If the core feature is "the software understands language or produces it," an LLM can fake it at demo quality this afternoon.
  2. The cost collapsed. As of June 2026, a single call to a frontier model costs cents, not dollars. A weekend of faked usage for a small test cohort costs less than lunch. Set that next to the months of engineering a real version takes, and the asymmetry is not subtle.
  3. You are testing demand, not the model. This is the part teams miss. The pretotype is not asking "is our model good enough?" It is asking "will the people we are building for actually use this, and will they pay?" A human typist answered that question for IBM with no recognition technology at all. An LLM answers it for you while also being good enough to feel real.
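To make the cost point concrete, here is a back-of-the-envelope sketch. The cohort size, calls per user, and per-call price are placeholder assumptions, not quoted prices; swap in your own provider's numbers.

```python
def weekend_cost_dollars(users: int, calls_per_user: int, cents_per_call: int) -> float:
    """Total API spend for a pretotype cohort, in dollars.

    Works in whole cents to avoid floating-point rounding surprises.
    """
    total_cents = users * calls_per_user * cents_per_call
    return total_cents / 100


# Hypothetical inputs: 10 testers, 20 calls each, 3 cents per call.
cost = weekend_cost_dollars(users=10, calls_per_user=20, cents_per_call=3)
print(f"${cost:.2f}")  # $6.00
```

Even if your real numbers are several times higher, the total stays small next to a quarter of engineering time.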

The usual move is to build the real backend first, because "we need to know if it works technically." Here is what the pattern says instead: build the fake backend first, because the technical risk is rarely what kills you. Demand risk is. Most products fail because nobody wanted them, not because the engineering was wrong.

A weekend pretotype you can actually run

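Say you want to build an AI contract-review tool for small law firms. The real version is months of document parsing, a tuned model, integrations, and a security review. The faked version is a weekend: a one-page promise of what the tool does, an upload form wired to an LLM, and you reading every output before it goes back to the firm.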

You have not built the product. You have built the experiment that tells you whether to. If three of your first ten firms upload a real contract and ask about pricing, that is signal worth acting on. If everyone bounces at the upload step, you just saved three months.
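Counting that signal does not need an analytics stack. A minimal sketch, assuming you log one `(prospect, step)` pair per action; the step names and the `worth_building` helper are illustrative, and the threshold of three comes from the example above.

```python
# Hypothetical funnel steps, in the order a prospect would hit them.
STEPS = ["visited", "uploaded", "asked_pricing"]


def funnel(events):
    """Count distinct prospects that reached each step.

    events: iterable of (prospect_id, step) pairs; unknown steps are ignored.
    """
    reached = {step: set() for step in STEPS}
    for prospect, step in events:
        if step in reached:
            reached[step].add(prospect)
    return {step: len(ids) for step, ids in reached.items()}


def worth_building(counts, min_pricing_asks=3):
    """True when enough prospects asked about pricing."""
    return counts["asked_pricing"] >= min_pricing_asks


# Made-up log: ten firms visit, four upload a contract, three ask about pricing.
events = [(f"firm-{i}", "visited") for i in range(10)]
events += [(f"firm-{i}", "uploaded") for i in range(4)]
events += [(f"firm-{i}", "asked_pricing") for i in range(3)]

counts = funnel(events)
print(counts)                  # {'visited': 10, 'uploaded': 4, 'asked_pricing': 3}
print(worth_building(counts))  # True
```

The same tally shows the failure case: if `uploaded` stays at zero, the bounce is at the upload step and no model quality would have fixed it.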

The counterpoint

"But an LLM will get things wrong, and a bad answer burns the prospect." Fair. This is exactly where the human in the loop earns its keep, the same way IBM's typist did. At pretotype scale (tens of users, not thousands), you can read every output before it ships. You are not running a service. You are running a test with a person checking the work. The moment quality at scale becomes the real question, you have already answered the prior one (do they want it) and earned the right to build the real thing.
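One way to enforce "read every output before it ships" is to make approval the only path to the prospect. A minimal sketch under stated assumptions: `ReviewQueue`, `Draft`, and `fake_llm` are hypothetical names, and `fake_llm` stands in for whatever model call you actually wire up.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Draft:
    prospect: str
    text: str
    approved: bool = False


@dataclass
class ReviewQueue:
    generate: Callable[[str], str]  # stand-in for the LLM call
    pending: List[Draft] = field(default_factory=list)
    sent: List[Draft] = field(default_factory=list)

    def submit(self, prospect: str, document: str) -> Draft:
        """Generate a draft and hold it; nothing reaches the prospect yet."""
        draft = Draft(prospect, self.generate(document))
        self.pending.append(draft)
        return draft

    def approve(self, draft: Draft, edited_text: Optional[str] = None) -> None:
        """A human releases the draft, optionally after correcting it."""
        self.pending.remove(draft)
        if edited_text is not None:
            draft.text = edited_text
        draft.approved = True
        self.sent.append(draft)


def fake_llm(document: str) -> str:
    # Placeholder: a real test would call your model provider here.
    return f"DRAFT REVIEW: {document[:40]}"


queue = ReviewQueue(generate=fake_llm)
draft = queue.submit("firm-1", "This agreement is made between two parties.")
queue.approve(draft, edited_text="Clause 4 limits liability; flag for review.")
print(len(queue.pending), len(queue.sent))  # 0 1
```

At tens of users the queue is short enough to clear by hand, which is the whole point of running it at pretotype scale.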

The honest limit: this tells you about demand and first-impression quality. It does not prove you can hit production accuracy or that the unit economics work. Do not let a good fake-backend signal talk you into skipping those checks later. It de-risks the first and biggest question, not all of them.

The call, again

The Mechanical Turk was a human in a box in 1770. It was a typist behind a screen for IBM. It is an LLM behind a form in 2026. The trick has not changed: fake the "magic" with whatever is cheapest, and measure whether anyone wants it before you build it for real. What changed is that the cheapest option now sits behind an API, and it is good enough to convince a real user for the length of a test.

This week, take the one AI feature you were about to spend a quarter building. Write the one-page promise, wire the form to an LLM with yourself reading every output, and put it in front of ten real prospects. Count how many come back. Let that number, not your conviction, decide whether you build it.