/* global window */
// Content object for the article renderer (blog/blog-article.jsx).
// Produced by the Writer→Editor→Humanize→Proofreader chain; draft in blog/pipeline/drafts/synthetic-vs-human.md
window.__ARTICLE__ = {
  slug: "synthetic-vs-human",
  laneLabel: "EVIDENCE INDEX",
  kicker: "METHODOLOGY",
  readMins: 7,
  dateLabel: "May 6, 2026",
  title: "Synthetic vs. human research: when each one wins",
  deck: "A practitioner's decision matrix for the silicon-sample era — where simulated respondents earn the call, where they quietly lie to you, and the validation evidence behind each line.",
  tags: ["synthetic-research", "research-methods", "evidence-trail"],
  toc: [
    { id: "question", num: "01 · THE QUESTION", title: "The wrong question" },
    { id: "definition", num: "02 · DEFINITION", title: "What 'synthetic research' means" },
    { id: "wins", num: "03 · WHERE IT WINS", title: "Where synthetic wins" },
    { id: "misleads", num: "04 · WHERE IT FAILS", title: "Where synthetic misleads" },
    { id: "matrix", num: "05 · THE MATRIX", title: "The decision matrix" },
    { id: "blend", num: "06 · IN PRACTICE", title: "How to blend them" },
    { id: "closing", num: "07 · THE VERDICT", title: "The closing position" },
  ],
  body: [
    { t: "h2", id: "question", num: "01 · THE QUESTION", text: "The wrong question, asked once a week" },
    { t: "p", html: `It lands in my inbox about once a week, usually from a product lead or a CMO under deadline: <em>"Can we just run this with synthetic respondents instead?"</em>` },
    { t: "p", html: `It's the wrong question. Framed that way, it's a budget question wearing a methodology costume. The word "instead" is doing all the work, and "instead" is where teams get hurt. The honest version is narrower: for this specific decision, what does synthetic earn, and what does it not?` },
    { t: "p", html: `Synthetic research is real, and it works for some things. It also fails at others, confidently and invisibly. The failures don't announce themselves. A synthetic study returns clean numbers, tight distributions, a tidy ranking of your five concepts. It looks like evidence. Sometimes it is. Sometimes it's an expensive mirror reflecting your own assumptions back at you in 14-point font.` },
    { t: "p", html: `So here's the matrix I actually use, and the evidence under each cell.` },
    { t: "pullquote", text: "Synthetic research doesn't fail loudly. It returns clean numbers, tight distributions, a tidy ranking — and sometimes that tidiness is the tell." },

    { t: "h2", id: "definition", num: "02 · DEFINITION", text: `What "synthetic research" actually means` },
    { t: "p", html: `Strip the hype and there are three things people mean.` },
    { t: "p", html: `The first is <strong>LLM personas</strong>, the "silicon samples." You condition a large language model on a demographic backstory (age, income, region, political lean) and ask it to answer as that person. The foundational work is Argyle et al., <em>"Out of One, Many"</em> (2023): GPT-3, conditioned on thousands of real respondents' backstories, could reproduce US survey response distributions well enough that the authors coined the term <em>algorithmic fidelity</em>. That paper is why this field exists. It's also routinely over-read.` },
    { t: "p", html: `The second is <strong>synthetic survey respondents</strong>, the same idea industrialized. You generate hundreds or thousands of simulated answers to a questionnaire to estimate how a population would respond: concept scores, message preference, purchase intent.` },
    { t: "p", html: `The third is <strong>augmented synthetic data</strong>. You collect a small real sample, then use a model to extrapolate a larger one. ESOMAR's buyer guidance is blunt about the catch. There's a <em>minimum viable data</em> threshold below which augmentation produces nothing reliable, because the synthetic output is only ever as good as the real primary data behind it (ESOMAR, 2024).` },
    { t: "p", html: `If you're not a researcher, think of synthetic research as a flight simulator. It's superb for rehearsing the route, catching obvious problems, and training your instincts cheaply. It is not the thing you certify the aircraft on.` },

    { t: "h2", id: "wins", num: "03 · WHERE IT WINS", text: "Where synthetic wins, and the evidence for it" },
    { t: "p", html: `Synthetic earns real decisions. Here's where the validation holds up.` },
    { t: "p", html: `Start with <strong>pre-testing your instrument</strong>. Before you field a survey to humans, run it through a model. Confusing wording, dead-end logic, leading scales: synthetic respondents surface these fast and for almost nothing. You're not measuring opinion. You're debugging the questionnaire. Low stakes, high yield. It's the least controversial use, and the one Qualtrics and most serious vendors put first (Qualtrics, 2025).` },
    { t: "p", html: `Then there's <strong>wide concept and message screening</strong>, where the strongest recent evidence lives. A 2025 study reproducing human <em>purchase intent</em> across 57 personal-care surveys (9,300 human respondents) found that a careful elicitation method reached <strong>88–92% correlation</strong> with human ratings and produced realistic response distributions, matching not just the average but the spread (Wang et al., 2025). It ranked concepts correctly, tracked age and income sensitivity, and flagged the duds. If you have 30 messaging variants and budget to test six with humans, synthetic is a defensible way to cut to the survivors.` },
    { t: "p", html: `Third, <strong>approximating large, well-trodden attitudes</strong>. When you're estimating broad, stable, frequently-surveyed opinions in a well-represented population, models do reasonably well. That's the Argyle result, and it replicates for mainstream attitudes (Argyle et al., 2023). Dillion et al. (2023) found a striking <strong>r = .95</strong> between model and human <em>moral judgments</em> on standard scenarios (reported in MeasuringU, 2025). For the broad middle of the distribution, on questions the internet has already discussed to death, synthetic is a fast directional read.` },
    { t: "p", html: `The common thread: synthetic wins when the answer already exists somewhere in the training data, the population is well-represented, and you need <em>direction</em>, not <em>truth</em>. Speed and breadth, not precision.` },

    { t: "h2", id: "misleads", num: "04 · WHERE IT FAILS", text: "Where synthetic misleads, and the evidence for that" },
    { t: "p", html: `Now the failures. These aren't edge cases. They're structural.` },
    { t: "p", html: `The big one, nearly universal, is <strong>variance collapse</strong>. LLMs cluster around the perceived majority answer and crush the spread. Across studies, models reproduce only roughly 7–67% of the true standard deviation (Engipulse, 2025; arXiv 2601.15755, 2026). The person who <em>strongly</em> disagrees, the 3% with the weird-but-real use case, the long tail where most product insight actually lives: they vanish. Your synthetic study can match the human <em>mean</em> perfectly and still be useless, because the moment you look at variance, subgroup means, or the correlations between variables, it falls apart (Bisbee et al., 2024).` },
    { t: "p", html: `There's also <strong>effect-size inflation</strong>. Synthetic respondents agree too much and react too hard. They reproduce the <em>direction</em> of a persuasion effect but markedly overstate its <em>magnitude</em> against real humans (Almeida et al., 2024, via MeasuringU). Size a launch off synthetic lift and you'll over-forecast.` },
    { t: "p", html: `Then come the <strong>niche and underrepresented segments</strong>. Where you most need help (small, hard-to-reach groups) is exactly where models break. Verasight's polling test found crosstab errors of <strong>15 points for Black respondents</strong> and <strong>~20 points for other racial groups</strong>, and a broader literature shows models flatten within-group heterogeneity into caricature (Verasight, 2024). They don't simulate the minority view. They stereotype it.` },
    { t: "p", html: `Last, <strong>real behavior, novelty, and willingness-to-pay</strong>. A model can tell you what people <em>say</em> about a familiar product. It can't give you a behavioral signal: what someone actually <em>did</em> at 02:14 when the checkout flow stalled. It has no lived experience to draw on for a genuinely novel product, and WTP estimates inherit both the variance collapse and the sycophancy. Models also show a near-total absence of uncertainty. Verasight found LLMs produced <strong>0% "don't know"</strong> where 3% of real respondents (about 8 million US adults) chose it. On out-of-sample policy questions, models <em>reversed</em> the real distribution, reporting 59% support where the truth was 28% (Verasight, 2024).` },
    { t: "p", html: `The replication record sharpens the point. A review of 12 peer-reviewed comparison studies tallied 9 encouraging findings against 14 discouraging ones, and one replication effort succeeded only <strong>21% of the time</strong> (MeasuringU, 2025; Park et al., 2024).` },
    { t: "pullquote", text: "A synthetic study can nail the human average and still be worthless — because the insight you're paid to find lives in the variance it just erased." },

    { t: "h2", id: "matrix", num: "05 · THE MATRIX", text: "The decision matrix" },
    { t: "p", html: `Here's the call, decision by decision. This matrix is NeroView's own framework, our read of the evidence above, not a citation.` },
    { t: "table",
      headers: ["Decision type", "Synthetic", "Human", "Why"],
      rows: [
        ["Debug a survey / pilot the instrument", "✅ Primary", "—", "Debugging wording and logic, not measuring opinion. Lowest stakes."],
        ["Wide concept/message screening (cut 30 → 6)", "✅ Primary", "Validate survivors", "~88–92% correlation on familiar categories; good at ranking, bad at magnitude."],
        ["Broad, well-surveyed attitudes (directional)", "✅ Useful", "Spot-check", "Algorithmic fidelity holds for the mainstream middle."],
        ["Subgroup / niche / underrepresented segments", "⚠️ Misleads", "✅ Required", "15–20pt crosstab errors; flattens within-group variance into stereotype."],
        ["Emotion, frustration, delight (the <em>why</em>)", "❌ No", "✅ Required", "No lived experience; no genuine affective signal. Needs source clips."],
        ["Genuinely novel product / category", "❌ No", "✅ Required", "Nothing in training data to draw on."],
        ["Willingness-to-pay / pricing commitment", "❌ No", "✅ Required", "Variance collapse + sycophancy inflate and distort."],
        ["Actual behavior (what they <em>did</em>)", "❌ No", "✅ Required", "Synthetic produces stated attitudes, never behavioral signal."],
        ["Go / no-go launch decision", "❌ No", "✅ Required", "High-stakes; effect-size inflation forecasts you off a cliff."],
      ] },
    { t: "p", html: `Industry guidance converges here. Qualtrics puts it plainly: synthetic "augments human research, it does not replace it," with humans as "the source of truth that anchors synthetic models" (Qualtrics, 2025). ESOMAR's 2025 code revision leans hard on transparency and "the necessity of human oversight" (ESOMAR, 2024/2025).` },

    { t: "h2", id: "blend", num: "06 · IN PRACTICE", text: "How to blend them: synthetic as pre-flight" },
    { t: "p", html: `The workflow that survives contact with reality is sequential, not substitutive.` },
    { t: "p", html: `First, <strong>pre-flight on synthetic</strong>. Run your instrument through a model. Screen your wide concept set. Generate hypotheses, find the obvious failures, rank the field. Treat every synthetic output as a <em>hypothesis</em>, never a finding.` },
    { t: "p", html: `Then <strong>validate the survivors with real participants</strong>. Take the six concepts synthetic liked and put them in front of humans. Now you get what synthetic structurally cannot: the source clip of someone's face falling at the price reveal, the transcript moment where they misread your value prop, the behavioral signal of the abandoned task, the segment pattern that only shows up in the niche you couldn't simulate. You also get an honest confidence indicator that accounts for the <em>don't-knows</em> a model would have erased.` },
    { t: "p", html: `This is where method becomes discipline. We obsess over the evidence trail at NeroView because synthetic and human research fail in <em>opposite</em> directions, and you can only tell them apart if every claim is traceable to its source. A synthetic concept score with no clip behind it is an opinion. A human insight you can play back at 02:14, read in the transcript, and watch repeat across a segment is evidence. When someone challenges the finding, you don't defend the model. You play the tape.` },

    { t: "h2", id: "closing", num: "07 · THE VERDICT", text: "The closing position" },
    { t: "p", html: `Synthetic research is the best pre-flight tool our field has ever had. It is not a participant. It approximates the average of the familiar and erases the tails, the emotion, the novelty, and the niche, which is to say it erases most of what you were hired to find.` },
    { t: "p", html: `So stop asking whether synthetic can replace humans. Ask the better question: which decisions has synthetic earned? Screening, piloting, directional reads on well-trodden ground? Yes, and gladly. Pricing, emotion, behavior, novelty, niche, and the go/no-go? No, and don't let a clean-looking distribution talk you out of it.` },
    { t: "p", html: `A good researcher isn't the one who picks a side. It's the one who knows, decision by decision, which tool has earned the call, and can show you the receipt.` },

    { t: "references", items: [
      { n: 1, html: `Argyle, L. P., et al. (2023). "Out of One, Many: Using Language Models to Simulate Human Samples." <em>Political Analysis</em>. <a href="https://www.cambridge.org/core/journals/political-analysis/article/out-of-one-many-using-language-models-to-simulate-human-samples/035D7C8A55B237942FB6DBAD7CAA4E49" target="_blank" rel="noopener">cambridge.org</a> (preprint: <a href="https://arxiv.org/abs/2209.06899" target="_blank" rel="noopener">arxiv.org/abs/2209.06899</a>)` },
      { n: 2, html: `Bisbee, J., et al. (2024). "Synthetic Replacements for Human Survey Data? The Perils of Large Language Models." <em>Political Analysis</em>. (via the MeasuringU review, ref. 3.)` },
      { n: 3, html: `MeasuringU (2025). "A Review of Experiments with Synthetic Users." <a href="https://measuringu.com/review-of-experiments-with-synthetic-users/" target="_blank" rel="noopener">measuringu.com</a>` },
      { n: 4, html: `Wang et al. (2025). "LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings." (arXiv 2510.08338). <a href="https://arxiv.org/html/2510.08338v1" target="_blank" rel="noopener">arxiv.org/html/2510.08338v1</a>` },
      { n: 5, html: `Verasight (2024). "Your Polls on ChatGPT." <a href="https://www.verasight.io/reports/synthetic-sampling" target="_blank" rel="noopener">verasight.io</a>` },
      { n: 6, html: `Qualtrics (2025). "Synthetic Data for Market Research FAQ." <a href="https://www.qualtrics.com/articles/strategy-research/synthetic-data-market-research/" target="_blank" rel="noopener">qualtrics.com</a>` },
      { n: 7, html: `ESOMAR (2024). "Synthetic Data in Marketing Studies," and the 2025 ICC/ESOMAR Code revision. <a href="https://ana.esomar.org/api/public/document/file_renderer/12519" target="_blank" rel="noopener">esomar.org</a>` },
      { n: 8, html: `Engipulse (2025) / arXiv 2601.15755 (2026). <a href="https://engipulse.com/data-analytics/synthetic-survey-data-when-llms-can-and-cannot-replace-human-respondents/" target="_blank" rel="noopener">engipulse.com</a>` },
    ] },
  ],
  related: [
    { href: "/blog/post.html", title: "Stop trusting AI summaries. Start trusting the evidence trail.", meta: "9 min · Index" },
    { href: "/blog/sample-report.html", title: "The challenger's playbook — beating the Giants (n=259)", meta: "14 min · Report" },
  ],
};