Conclusion

Maud Nalpas

Alexandra Klepper

You've reached the end of our course on AI evals. You mapped your existing web testing knowledge to the world of LLMs, created rule-based unit tests, built and tested your judge model, and wired up your testing pipeline.

Our industry is concerned about vibes and LLM nondeterminism. In reality, if you have ever built a web app that needs to work flawlessly across browsers, devices, and screen sizes, you're prepared for this. You already know the challenges: one input leading to multiple possible behaviors, an environment you cannot entirely control, and the infamous "Works on my machine."

The solution is testing. Evals are exactly this: tests for your AI features. Your web tests gave you the confidence to ship in wild browser environments, and evals do the same thing for your AI features. Build your evals, and ship away!

Before you dive in, take a moment to ask yourself a few key questions:

What makes an output "bad"? Define your failure cases. Get deeply familiar with your data, and collaborate closely with domain experts.

What makes an output "good" versus "ideal"? Define your expectations clearly before asking a model to grade them.

How often will you run evals? Evaluation-driven development is one approach you can take, but set expectations for how often you'll evaluate after your application is deployed.

The AI space moves fast, and building a full pipeline can feel overwhelming. Start small: write one rule-based test and build one basic LLM judge. Once you establish that baseline, you stop guessing and get your power back as an engineer. You cross the gap from a fun internal prototype to a robust feature you can test, measure, and ship with confidence. Remember, evals built by humans are subject to human failings, and bias is built in. Run regular audits of your models and evaluations to find and address it.
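To make the "start small" advice concrete, here is a minimal sketch of what a single rule-based test could look like. It assumes an imagined feature that generates short product summaries; the word limit, the banned phrases, and the names `check_summary` and `pass_rate` are all hypothetical and not taken from the course's companion code.

```python
from dataclasses import dataclass

# Hypothetical limits for an imagined "product summary" feature.
MAX_WORDS = 50
BANNED_PHRASES = ("as an ai", "i cannot")


@dataclass
class RuleResult:
    passed: bool
    failures: list[str]


def check_summary(output: str) -> RuleResult:
    """Apply deterministic rules to one model output."""
    failures = []
    text = output.strip()

    # Rule 1: the model must return something.
    if not text:
        failures.append("empty output")

    # Rule 2: the summary must stay within the word budget.
    if len(text.split()) > MAX_WORDS:
        failures.append(f"longer than {MAX_WORDS} words")

    # Rule 3: no refusal or boilerplate phrasing (simple substring match).
    lowered = text.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            failures.append(f"contains banned phrase: {phrase!r}")

    return RuleResult(passed=not failures, failures=failures)


def pass_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass every rule."""
    if not outputs:
        return 0.0
    return sum(check_summary(o).passed for o in outputs) / len(outputs)
```

Running `pass_rate` over a small set of saved outputs gives you a single number to track as a baseline. The list of failures per output is usually more useful than the score itself, because it tells you which failure cases to define next.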

Follow this course to build your first tests, check out the companion code, and start testing. Share what you've learned: How are you running your evals? Get in touch with us at @ChromiumDev, share with us on Bluesky, or set up one-on-one office hours with the Web.dev AI Team.

