What you'll learn

What to expect from this series, and what you should know before you start.

In this series, you build evaluations (evals) for our example application, ThemeBuilder. You'll learn how to:

  • Build a robust, end-to-end evaluation workflow so you can trust what you ship to your users.
  • Use the LLM-as-a-judge pattern to measure subjective quality. Create a judge with minimal setup, or use advanced techniques to develop a custom judge that thinks like top domain experts.
  • Automate your pipeline by running evals at build time (CI/CD) and in production, to catch regressions early.
  • Apply techniques that give you statistical confidence, proving your results aren't just a lucky draw from your test set, and optimize your eval design to catch subtle regressions.
  • Use evals to select the best model for your use case.
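To make the LLM-as-a-judge idea concrete before you meet it in the series, here is a minimal sketch. It assumes a hypothetical `call_llm` helper standing in for your model provider's API (stubbed here with a canned reply so the example runs end to end); the rubric wording and the `Score: <n>` parsing convention are illustrative choices, not the series' prescribed format.

```python
import re

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model API call.
    # Stubbed with a canned reply so this sketch is self-contained.
    return "Score: 4\nReason: The theme respects the requested palette."

def judge(task: str, output: str) -> int:
    """LLM-as-a-judge: ask a model to grade an output on a 1-5 rubric."""
    prompt = (
        "You are grading the output of a theme-generation app.\n"
        f"Task: {task}\n"
        f"Output: {output}\n"
        "Rate the output from 1 (poor) to 5 (excellent) for how well it "
        "satisfies the task. Reply with 'Score: <n>' on the first line."
    )
    reply = call_llm(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return int(match.group(1))

score = judge("Dark theme with blue accents", "...generated CSS...")
print(score)  # → 4
```

In practice you would swap the stub for a real API call and aggregate scores across a test set; the key idea is that the judge prompt encodes your rubric, and the parser turns free-form model output into a number you can track.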

Approach

Think of this series as your starting point. You can build your full evals pipeline using only the main guidance, which we've based on standard industry best practices, and explore more advanced techniques when you're ready to level up.

Whether you use a ready-made evals platform or build your own, the concepts and techniques you'll learn are tool-agnostic. Understanding the why behind them helps you dodge common traps and develop an expert evals pipeline, no matter what stack you choose.

Once you've completed the series, you'll be able to iterate on your prompt, upgrade your LLM, or switch your LLM entirely, and ship to your users with confidence.

Prerequisites

You should have some experience building with LLMs. We assume you're already comfortable with: