AI Evaluation Is the Real Work: What We Learned Teaching a Government Cohort to Break Our Prototypes

Earlier this month, the Aspen Institute, the policy non-profit, brought together a group of 23 state and county benefits administrators in Pittsburgh to discuss the challenges of delivering benefits at scale. Collected Company partnered with the New Practice Lab at New America to translate agency implementation challenges into prototypes as part of a 36-hour sprint. On Monday we brainstormed, on Tuesday we built, and on Wednesday morning we walked the group through four functional prototypes: a document parser that pulled requirements out of federal legislation, state policy, and state guidance; a case management assistant that read an application form and flagged what a caseworker still needed to ask; a training tool for call center operators; and a bot that answered claimant questions about unemployment eligibility.
The demos worked. The administrators laughed at the right moments, leaned in when we expected them to, and asked sharp questions about how the tools handled the edge cases in their real workflows.
But instead of discussing all the new features these tools could have, we wanted to focus on all the ways they could fail.
When Teo wrote about what changes when building gets cheap, he focused on the first step of the build: defining the problem. We’re here to talk about the next step: evaluating the solution.
A working prototype used to signal weeks of engineering and decisions that had already survived scrutiny. AI collapses that signal. "Looks done" is now the cheapest thing AI produces. Knowing how it will fail is the skill that scales.
We Asked Them to Break Our Demos
We ran an exercise called Break It! Each small group got a prototype and seven prompts:
Harmful Advice: Where could the tool give actively harmful advice?
Bias: Where could bias show up?
Data Vulnerabilities: What happens when the data is messy, incomplete, corrupted, or personally identifiable?
Staff Incentives/Workarounds: What bad staff incentives or operational workarounds might emerge?
Public Trust Breakdown: How could public trust break?
Changes Over Time: How will you know when an initial “good” solution has drifted?
Explainability: Can a caseworker explain a decision the AI makes to an applicant?
The groups did not run out of things to find. A training bot that produced plausible answers also confidently produced wrong ones, because plausible and accurate are qualities AI is bad at distinguishing. A document parser that worked on Oregon's tidy PDFs would behave unpredictably for a state with less organized policy documentation. A faulty eligibility tool that agency staff trust too much produces a worse outcome for claimants than no tool at all.
The point of our exercise was to show that AI failure modes are the product specifications. If you cannot capture the needs of real beneficiaries and describe how a tool will fail, you have not yet described an effective tool.
What the Conversation Revealed
Most of the executives in the room had never tried to break a tool before, but once we gave them a framework for it, they couldn’t stop. The failure modes weren’t hard to find; nobody had previously told them that finding gaps was now part of their job.
For most of enterprise software's history, evaluation was the vendor's job. Building was hard, only the vendor had the capacity to test what they built, and the buyer had reasonable grounds to trust that it had been tested. But when anyone can stand up a working prototype in a day, the organization whose name will be on the press release if something goes wrong has to own evaluation in a way they did not before.
AI compresses time and scale, which means it compresses both your capacity for good and your capacity for harm. The question changes from "how is it built?" to "how does it break?" and the answer is rarely loud. AI fails silently and inconsistently. Whoever is accountable for the system needs a way to see those failures before the public does.
Every leader shipping AI now holds two jobs they may not have held before. Neither is technical.
The first is definition, which happens before anything gets built: defining what success looks like in language a non-engineer can evaluate, and listening to the real user early enough that what gets built is shaped by them rather than discovered by them.
The second is evaluation, which happens continuously during and after the build. The goal is to build in measurement and learning goals from the start, and to create a real path for the people closest to the harm to flag it as it happens.
These two jobs are not government-specific. The cohort in Pittsburgh ran benefits programs, where the stakes are visible. A wrong eligibility decision can affect whether someone is able to see a doctor, pay rent, or feed their family. But the same two jobs apply to a media company deploying an editorial assistant, a SaaS team shipping an in-product agent, or a non-profit automating a client-facing workflow. The prototype changes, but the accountability does not.
Three Ways to Begin Operationalizing AI Evaluation
Define what "good" means before you see the demo. Not the demo's success criteria, but your team’s mission-aligned and operational success criteria. What outcome are you trying to change, by how much, for whom, and over what time frame? The demo will be impressive regardless. Your job is to know, in advance, what would and would not count as the tool working. If you can’t write that down before you see the demo, you’re not ready to evaluate.
Ask the AI vendor two specific questions: what does it look like when your tool is wrong, and how would I know? If they cannot answer, they have not done the evaluation work themselves. That’s a disqualifier.
Schedule a “Break It!” session before you write the spec. Pull together the people who would absorb the operational fallout if a tool fails: front-line staff, legal, anyone customer-facing. Give them a vendor's demo and the seven prompts above, and let them produce a list of likely failure modes.
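For teams that want to make the first step concrete, one option is to write the success criterion down as a small record before the demo and check observed results against it later. The sketch below is only an illustration of that idea: the class name, field names, and example values are all hypothetical and are not figures or tools from the Pittsburgh sprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriterion:
    """One definition of "good", written down before the demo (hypothetical schema)."""
    outcome: str            # what outcome you are trying to change
    population: str         # for whom
    baseline: float         # where the measure stands today
    target: float           # by how much: the value that counts as working
    timeframe_days: int     # over what time frame
    higher_is_better: bool = True

    def is_met(self, observed: float) -> bool:
        """Return True if the observed value reaches the target."""
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


# Illustrative values only; an agency would substitute its own measures.
criterion = SuccessCriterion(
    outcome="median days from application to eligibility decision",
    population="first-time unemployment claimants",
    baseline=21.0,
    target=14.0,
    timeframe_days=180,
    higher_is_better=False,
)

print(criterion.is_met(12.0))  # True: 12 days is at or below the 14-day target
print(criterion.is_met(18.0))  # False: 18 days misses the target
```

The record is frozen so the definition cannot be quietly adjusted after the demo; changing what counts as "working" means writing a new criterion in the open.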
Evaluation as a Posture, Not a Phase
For most of software history, evaluation was a phase the vendor did before delivery. AI makes evaluation a posture instead. The leaders who ship AI well in the next few years will be the ones who treat "is this still good?" as a question they ask forever, not one they answer once.
Ayushi Roy is the Chief Program Officer at New America's New Practice Lab. In partnership with Collected Company, they designed and led the AI Lab for the Aspen Institute's FSP State Executive Cohort. Read Part 1 of the story: “When Building Is Cheap, Problem Definition is Everything.”