How to Know When AI Is Good Enough: 10 Lessons from Our NY Tech Week Panel

The AI conversation is changing. A year ago, most organizations were asking what AI could do. Today, they're asking whether they can trust it enough to deploy it.
All of our clients are navigating this transition in some form, yet the conversations seem to be happening in silos. That’s why we were excited to bring together practitioners and operators to share concrete ways they approach this topic and discuss the deeply human questions around it.
Our NY Tech Week panel featured Justin Bleuel (OpenAI), Teresa Mondría Terol (NPR), Sanket Karuri (Code for America), and Zoë Egelman (Elego Law). The panel explored how teams across AI research, journalism, civic technology, and legal services evaluate quality, manage risk, and decide when AI is ready for the real world.
Here are 10 key takeaways.
1. “Good enough” depends on the stakes
There is no universal quality threshold for AI. A system that is 97% accurate might be an incredible achievement in one domain and completely unacceptable in another. As Teresa explained, in journalism, “97, 98 or 99 don't fly. It needs to be 100.”
In order to set your bar, you need to define the consequences of failure.
2. Evaluation has to measure value, not just risk
Testing tends to focus on risk minimization. But effective AI evaluation also asks what value the system creates. Sanket framed deployment decisions in the public sector as balancing “operational risk” against “the opportunity to serve your community better.”
A good evaluation plan requires you to define success as much as failure.
3. AI moves work from creation to verification
AI might make development faster, but the tradeoff is significantly more time spent in testing. Justin described having “almost a bigger backlog of things to review than I ever have before” in his role as a product manager at OpenAI. As AI accelerates creation, organizations often discover that the bottleneck shifts downstream into review, testing, and approval.
Capacity planning needs to account for these extra review cycles.
4. Continuous evaluation matters more than perfect evaluation
Some failures may not arise in early testing. Sanket pointed out that certain communities might be underrepresented in training data or less likely to use early versions of a product. “You're not going to catch everything beforehand,” he warned, “because you may not know upfront everything to test for.”
As we’ve written before, the leaders who ship AI well in the next few years will be the ones who treat “is this still good?” as a question they ask forever, not one they answer once.
5. “Human-in-the-loop” means different things
“Human-in-the-loop” has become a catch-all phrase for AI systems that keep a person involved somewhere in the process. Sometimes the human is there to catch errors. Sometimes they are there to exercise professional judgment. Sometimes they are there to build trust inside an organization that is still learning how to work with AI. And in some sectors, a human may be required for legal, ethical, or institutional reasons. As Teresa put it, “‘Human-in-the-loop’ is doing a lot of jobs right now.”
These jobs are different. If the human is only reviewing outputs because the model is not yet reliable, that role may shrink as systems improve. But if the human is there because trust, accountability, or professional judgment is central to the service (as in law, journalism, or public benefits), their role is not just a temporary workaround. It is part of what makes the system legitimate.
Before deploying AI, teams should be clear about what job the human is doing in the loop: checking, deciding, approving, interpreting, building trust, or taking responsibility. Each has different implications for risk, cost, staffing, and scale.
6. AI can be a better judge than people expect
One common alternative to human-in-the-loop is to use a larger, more capable AI model to judge the output of a smaller model.
The obvious question is whether you can trust the larger model’s judgment. But it’s worth considering whether the baseline, human judgment, is as reliable as we expect. Discussing writing evaluation, Justin noted that if a room full of experts graded the same writing sample, “I would think we would agree with each other maximum 60% of the time.”
That means AI evaluators don't need perfect accuracy to be useful. If they can match human agreement rates on subjective tasks, they may already be providing meaningful value.
Many organizations set a higher bar for AI than they do for humans. Good evaluation starts by measuring both against the same standard.
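One way to hold AI and human graders to the same standard is to compute the same agreement metric for both. The sketch below is illustrative only: the grader names, the pass/fail labels, and the helper functions are hypothetical and did not come from the panel. It compares how often human experts agree with each other against how often an AI grader agrees with those same experts.

```python
from itertools import combinations


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two graders gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("graders must label the same non-empty set of items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def mean_pairwise_agreement(graders):
    """Average agreement rate over every pair of graders."""
    pairs = list(combinations(graders.values(), 2))
    return sum(agreement_rate(a, b) for a, b in pairs) / len(pairs)


# Hypothetical pass/fail grades for the same five writing samples.
human_grades = {
    "expert_1": ["pass", "fail", "pass", "pass", "fail"],
    "expert_2": ["pass", "pass", "pass", "fail", "fail"],
    "expert_3": ["fail", "fail", "pass", "pass", "fail"],
}
ai_grades = ["pass", "fail", "pass", "pass", "fail"]

# Baseline: how often do the human experts agree with each other?
human_baseline = mean_pairwise_agreement(human_grades)

# Same metric, same samples: how often does the AI agree with each expert?
ai_vs_humans = sum(
    agreement_rate(ai_grades, grades) for grades in human_grades.values()
) / len(human_grades)

print(f"human-human agreement: {human_baseline:.0%}")
print(f"AI-human agreement:    {ai_vs_humans:.0%}")
```

Raw agreement is the simplest possible measure and does not correct for agreement that would happen by chance. A chance-corrected statistic such as Cohen's kappa may be a better fit when labels are imbalanced.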
7. Disclaimers are a strategy choice
When a product says, “AI can make mistakes, please verify,” it is making a decision about who bears the cost of errors. Consumer tools often shift that burden to users. Journalism organizations generally cannot. As Teresa noted, “Putting the burden of verification on the reader is a challenge of that trust.”
What customers expect from your product should determine how you communicate the limitations of an AI system.
8. You're inheriting someone else's definition of “good enough”
Every AI deployment sits on top of a chain of prior judgments. Zoë captured this dynamic perfectly: “Someone at the frontier model decided this was good enough to ship, and then someone at the legal tech company decided, ‘Okay, this is good enough to go to lawyers.’”
By the time your organization adopts AI, multiple parties have already made decisions about quality, risk, and readiness. Your due diligence means understanding the entire stack of assumptions.
9. AI evaluation requires rubrics and repeatable testing
Traditional software testing assumes that if something works once, it will likely work again.
AI systems are different. Because outputs are nondeterministic, teams need structured evaluation methods: test sets, scoring rubrics, and repeatable grading processes. As Justin explained, a core piece of the work is “building the rubrics.”
Instead of evaluating outputs once, the best evaluation frameworks provide a repeatable process that lets you assess every future version of the system.
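To make the idea of a test set, a rubric, and repeatable grading concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Criterion` class, the three rubric checks, the test inputs, and the stub system that stands in for a real model call. The point it shows is that each input is run several times, because a single passing output says little about a nondeterministic system.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # True if the output meets the criterion


def evaluate(system, test_set, rubric, runs_per_case=5):
    """Run each test input several times; return the pass rate per criterion."""
    passes = {c.name: 0 for c in rubric}
    total = 0
    for prompt in test_set:
        for _ in range(runs_per_case):
            output = system(prompt)
            total += 1
            for c in rubric:
                passes[c.name] += c.check(output)
    return {name: count / total for name, count in passes.items()}


# Hypothetical rubric: simple, automatable checks on each output.
rubric = [
    Criterion("non_empty", lambda out: bool(out.strip())),
    Criterion("under_200_chars", lambda out: len(out) <= 200),
    Criterion("cites_source", lambda out: "[source]" in out),
]


def make_stub_system(seed):
    """Stand-in for a nondeterministic model: sometimes omits the citation."""
    rng = random.Random(seed)

    def system(prompt):
        citation = " [source]" if rng.random() < 0.8 else ""
        return f"Summary of: {prompt}{citation}"

    return system


# Hypothetical test inputs; a real test set would be far larger.
test_set = ["benefits eligibility notice", "court filing deadline"]

scores = evaluate(make_stub_system(seed=0), test_set, rubric)
for name, rate in scores.items():
    print(f"{name}: {rate:.0%}")
```

Because the test set and rubric are ordinary data, the same `evaluate` call can be rerun against each new version of the system and the pass rates compared over time. Real rubrics often include criteria that need a human or model grader rather than a string check; those would slot in as additional `check` functions.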
10. How you test is also a choice
Not every organization can evaluate AI the same way.
At OpenAI, Justin described a world where features can be rolled out, measured, and improved through large-scale experimentation. Teams can A/B test features with millions of users and quickly gather signals about what's working.
For public-sector applications, Sanket described a very different reality. When a system affects government services, experimentation carries real consequences. Instead of learning through large-scale deployment, teams often need to validate systems in sandboxes, test against synthetic data, and earn confidence incrementally before putting them in front of the public.
The lesson is that evaluation is about how you measure as much as what you measure. The testing strategy that works for a consumer product may be unacceptable for a public benefits program, a newsroom, or a legal practice.
The Bottom Line
If there was one theme that connected every panelist's perspective, it was that determining whether AI is “good enough” is rarely a technical decision alone. It's a product decision, a risk decision, a governance decision, and often a trust decision.
Helping organizations answer those questions is a growing part of our work. Often the challenge is less about building AI systems than about defining quality, designing evaluation processes, and creating the organizational confidence to put a system in front of real users.
Or, as Zoë put it: “It's a question of, ‘What are you comfortable standing behind?’”