Design your evaluations

Define what "good" and "bad" look like for your AI application.

Before designing our tests, let's look at a typical perfect output from ThemeBuilder. Each evaluation we build processes a version of this object:

{
  "id": "example-002",
  "userInput": {
    "companyName": "Nova news",
    "description": "Space exploration news and educational content.",
    "audience": "science enthusiasts",
    "tone": [
      "informative",
      "scientific",
      "inspiring"
    ]
  },
  "appOutput": {
    "motto": "Unveiling the universe.",
    "colorPalette": {
      "textColor": "#E2E8F0",
      "backgroundColor": "#0B0D17",
      "primary": "#7000FF",
      "secondary": "#00C2FF"
    }
  }
}
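To make the shape of this record explicit, here is a minimal sketch of it as Python TypedDicts. The type names (`EvalRecord`, `UserInput`, and so on) are illustrative assumptions; only the field names come from the example above.

```python
# Illustrative types mirroring the example object; the class names are
# assumptions, the keys come from the ThemeBuilder example record.
from typing import TypedDict

class UserInput(TypedDict):
    companyName: str
    description: str
    audience: str
    tone: list[str]

class ColorPalette(TypedDict):
    textColor: str
    backgroundColor: str
    primary: str
    secondary: str

class AppOutput(TypedDict):
    motto: str
    colorPalette: ColorPalette

class EvalRecord(TypedDict):
    id: str
    userInput: UserInput
    appOutput: AppOutput
```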

Define success and failure

To create our evaluations, the first step is to define what success and failure look like. To do this, we need to know our data: what kinds of faulty outputs are we likely to get in production? If available, we should look at production data.

Examples of faulty outputs for ThemeBuilder include:

  • Incorrect data structure:
    • Invalid JSON, missing keys
    • Color palette values are not hexadecimal
    • The motto or some colors are empty strings
    • The motto is longer than our set limit of 6 words.
  • Bad motto:
    • The motto doesn't match the brand, audience or tone.
    • The motto is toxic.
  • Bad color palette:
    • The color palette doesn't match the brand, audience or tone.
    • The text-to-background color contrast ratio is less than 4.5:1.

Example user input

User input: {
 "companyName": "Moon Cafe",
 "description": "A cozy nocturnal coffee shop serving late-night espresso and pastries.",
 "audience": "night owls and students"
}

Output: Incorrect data

// Uses the wrong key tagline instead of motto.
// Array colors instead of the required colorPalette object.
Output: {"tagline": "Freshly brewed", "colors": ["#f0f0f0"]}

// The motto is over our 6-word limit
Output: {
  "motto": "The best place for late night espresso and cozy pastries",
  "colorPalette": ...
}

// Colors are invalid hexadecimal strings
Output: {
  "motto": "Brewed for the moon.",
  "colorPalette": {"textColor": "grey", "backgroundColor": "white", "primary": "neon-purple", "secondary": "##00C2FF"}
}
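All of the structural failures above can be caught with plain code before any judge model is involved. A minimal sketch in Python, assuming the output has already been parsed from JSON (`check_format` is a hypothetical helper name):

```python
import re

# A hex color like "#0B0D17": "#" followed by exactly six hex digits.
HEX_RE = re.compile(r"^#[0-9A-Fa-f]{6}$")
REQUIRED_COLORS = {"textColor", "backgroundColor", "primary", "secondary"}
MAX_MOTTO_WORDS = 6

def check_format(output: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the format is valid."""
    errors = []
    motto = output.get("motto")
    if not isinstance(motto, str) or not motto.strip():
        errors.append("motto missing or empty")
    elif len(motto.split()) > MAX_MOTTO_WORDS:
        errors.append(f"motto longer than {MAX_MOTTO_WORDS} words")
    palette = output.get("colorPalette")
    if not isinstance(palette, dict) or set(palette) != REQUIRED_COLORS:
        errors.append("colorPalette missing or has wrong keys")
    else:
        for key, value in palette.items():
            if not isinstance(value, str) or not HEX_RE.match(value):
                errors.append(f"{key} is not a hex color: {value!r}")
    return errors
```

Run against the faulty outputs above, this flags the wrong `tagline`/`colors` keys, the 10-word motto, and each non-hexadecimal color string.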

Output: Bad motto

// Brand and tone mismatch (too cold for a cozy vibe)
Output motto: "Beans for maximum productivity."

// Toxic (rude and unwelcoming)
Output motto: "Go away loser, we're busy."

Output: Bad color palette

// Brand and tone mismatch (clashing neon colors for a cozy cafe)
Output color palette: {
  "textColor": "#00FF00", "backgroundColor": "#FF00FF",
  "primary": "#FFFF00", "secondary": "#0000FF"
}

// Contrast ratio below the 4.5:1 requirement
Output color palette: {
  "textColor": "#CCCCCC", "backgroundColor": "#FFFFFF",
  "primary": "#EEEEEE", "secondary": "#DDDDDD"
}
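The contrast check is fully objective: WCAG 2 defines the contrast ratio in terms of the relative luminance of the two colors. A minimal sketch in Python (function names are assumptions; the formula follows the WCAG definition):

```python
def _linear(channel: int) -> float:
    # sRGB channel (0-255) to linear-light value, per the WCAG 2 definition.
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color: str) -> float:
    # hex_color is "#RRGGBB", e.g. "#0B0D17".
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)

def contrast_ratio(fg: str, bg: str) -> float:
    # (L_lighter + 0.05) / (L_darker + 0.05); ranges from 1:1 to 21:1.
    lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)
```

The low-contrast palette above fails clearly: `contrast_ratio("#CCCCCC", "#FFFFFF")` comes out around 1.6, far below the 4.5:1 requirement, while the example output's `#E2E8F0` on `#0B0D17` passes comfortably.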

Define evaluation criteria and methods

Based on our failure modes, we can define our evaluation criteria. From there, it's straightforward to define the eval method we'll need to check each criterion:

  • To test our objective criteria, we'll write regular code (rule-based evals).
  • To test our subjective criteria, we'll use a judge model with rubric-based evals: since there's no single perfect version of a creative motto to use as a reference, we don't compare the output to a fixed answer. Instead, we provide the judge with a clear set of criteria (the rubric) to guide its evaluation.
Evaluation criteria | Evaluation method
The data format is correct: valid JSON, all keys present, hexadecimal colors, no empty values, motto is under six words | Rule-based (objective)
The text-to-background color contrast ratio is accessible | Rule-based
The motto matches the brand, audience and tone | LLM judge (subjective)
The color palette matches the brand, audience and tone | LLM judge
The motto isn't toxic | LLM judge
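For the LLM-judge rows, the judge receives the user input, the generated output, and the rubric. A sketch of what the motto rubric could look like; the wording, scoring scale, and helper name `build_judge_prompt` are illustrative assumptions, not a prescribed format:

```python
# Hypothetical rubric prompt for the motto judge; the scale and wording
# are illustrative, not a fixed API.
MOTTO_RUBRIC = """\
You are evaluating a company motto generated by ThemeBuilder.

Company: {company_name}
Description: {description}
Audience: {audience}
Requested tone: {tone}

Motto to evaluate: "{motto}"

Score each criterion from 1 (fails) to 5 (excellent):
1. Brand fit: the motto reflects the company description.
2. Audience fit: the motto speaks to the stated audience.
3. Tone fit: the motto matches the requested tone.
4. Safety: the motto contains no toxic or offensive language.

Return JSON: {{"brand": n, "audience": n, "tone": n, "safety": n}}
"""

def build_judge_prompt(record: dict) -> str:
    user, out = record["userInput"], record["appOutput"]
    return MOTTO_RUBRIC.format(
        company_name=user["companyName"],
        description=user["description"],
        audience=user["audience"],
        tone=", ".join(user["tone"]),
        motto=out["motto"],
    )
```

Asking the judge for structured JSON scores (rather than free text) makes its verdicts easy to aggregate across a whole eval dataset.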

Use task-specific criteria

Besides your use-case-specific metrics, use standard criteria and metrics relevant to the task. For example, common metrics for summarization include:

  • Alignment: The summary follows specific user instructions, tone, or style.
  • Concision: The summary says only what is needed and nothing more.
  • Richness: The summary includes all key points.
  • Correctness: The summary is factual and true.
  • Groundedness: Every claim is traced back to the source to prevent hallucinations.

Prebuilt evals

Eval solutions like Vertex Gen AI Evals API, Braintrust, Datadog, DeepEval, and LangSmith offer managed evals or prebuilt metrics that may fit your use case.

Explore what's available.