Design your evaluations

Define what "good" and "bad" look like for your AI application.

Before designing our tests, let's look at a typical perfect output from ThemeBuilder. Each evaluation we build processes a version of this object:

{
  "id": "example-002",
  "userInput": {
    "companyName": "Nova news",
    "description": "Space exploration news and educational content.",
    "audience": "science enthusiasts",
    "tone": [
      "informative",
      "scientific",
      "inspiring"
    ]
  },
  "appOutput": {
    "motto": "Unveiling the universe.",
    "colorPalette": {
      "textColor": "#E2E8F0",
      "backgroundColor": "#0B0D17",
      "primary": "#7000FF",
      "secondary": "#00C2FF"
    }
  }
}
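To make the shape of this record explicit, here is a minimal sketch of it as Python TypedDicts. The type names (`EvalRecord`, `UserInput`, and so on) are illustrative assumptions; only the field names come from the example above.

```python
# Illustrative types mirroring the example object; the class names are
# assumptions, the keys come from the ThemeBuilder example record.
from typing import TypedDict

class UserInput(TypedDict):
    companyName: str
    description: str
    audience: str
    tone: list[str]

class ColorPalette(TypedDict):
    textColor: str
    backgroundColor: str
    primary: str
    secondary: str

class AppOutput(TypedDict):
    motto: str
    colorPalette: ColorPalette

class EvalRecord(TypedDict):
    id: str
    userInput: UserInput
    appOutput: AppOutput
```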

Define success and failure

To create our evaluations, the first step is to define what success and failure look like. To do this, we need to know our data: what kinds of faulty outputs are we likely to get in production? If available, we should look at production data.

Examples of faulty outputs for ThemeBuilder include:

  • Incorrect data structure:
    • Invalid JSON, missing keys
    • Color palette values are not hexadecimal
    • The motto or some colors are empty strings
    • The motto is longer than our set limit of 6 words.
  • Bad motto:
    • The motto doesn't match the brand, audience or tone.
    • The motto is toxic.
  • Bad color palette:
    • The color palette doesn't match the brand, audience or tone.
    • The text-to-background color contrast ratio is less than 4.5:1.

Example user input

User input: {
 "companyName": "Moon Cafe",
 "description": "A cozy nocturnal coffee shop serving late-night espresso and pastries.",
 "audience": "night owls and students"
}

Output: Incorrect data

// Uses the wrong key tagline instead of motto.
// Array colors instead of the required colorPalette object.
Output: {"tagline": "Freshly brewed", "colors": ["#f0f0f0"]}

// The motto is over our 6-word limit
Output: {
  "motto": "The best place for late night espresso and cozy pastries",
  "colorPalette": ...
}

// Colors are invalid hexadecimal strings
Output: {
  "motto": "Brewed for the moon.",
  "colorPalette": {"textColor": "grey", "backgroundColor": "white", "primary": "neon-purple", "secondary": "##00C2FF"}
}
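All of the structural failures above can be caught with plain code before any judge model is involved. A minimal sketch in Python, assuming the output has already been parsed from JSON (`check_format` is a hypothetical helper name):

```python
import re

# A hex color like "#0B0D17": "#" followed by exactly six hex digits.
HEX_RE = re.compile(r"^#[0-9A-Fa-f]{6}$")
REQUIRED_COLORS = {"textColor", "backgroundColor", "primary", "secondary"}
MAX_MOTTO_WORDS = 6

def check_format(output: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the format is valid."""
    errors = []
    motto = output.get("motto")
    if not isinstance(motto, str) or not motto.strip():
        errors.append("motto missing or empty")
    elif len(motto.split()) > MAX_MOTTO_WORDS:
        errors.append(f"motto longer than {MAX_MOTTO_WORDS} words")
    palette = output.get("colorPalette")
    if not isinstance(palette, dict) or set(palette) != REQUIRED_COLORS:
        errors.append("colorPalette missing or has wrong keys")
    else:
        for key, value in palette.items():
            if not isinstance(value, str) or not HEX_RE.match(value):
                errors.append(f"{key} is not a hex color: {value!r}")
    return errors
```

Run against the faulty outputs above, this flags the wrong `tagline`/`colors` keys, the 10-word motto, and each non-hexadecimal color string.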

Output: Bad motto

// Brand and tone mismatch (too cold for a cozy vibe)
Output motto: "Beans for maximum productivity."

// Toxic (rude and unwelcoming)
Output motto: "Go away loser, we're busy."

Output: Bad color palette

// Brand and tone mismatch (clashing neon colors for a cozy cafe)
Output color palette: {
  "textColor": "#00FF00", "backgroundColor": "#FF00FF",
  "primary": "#FFFF00", "secondary": "#0000FF"
}

// Contrast ratio below the 4.5:1 requirement
Output color palette: {
  "textColor": "#CCCCCC", "backgroundColor": "#FFFFFF",
  "primary": "#EEEEEE", "secondary": "#DDDDDD"
}
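The contrast check is fully objective: WCAG 2 defines the contrast ratio in terms of the relative luminance of the two colors. A minimal sketch in Python (function names are assumptions; the formula follows the WCAG definition):

```python
def _linear(channel: int) -> float:
    # sRGB channel (0-255) to linear-light value, per the WCAG 2 definition.
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color: str) -> float:
    # hex_color is "#RRGGBB", e.g. "#0B0D17".
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)

def contrast_ratio(fg: str, bg: str) -> float:
    # (L_lighter + 0.05) / (L_darker + 0.05); ranges from 1:1 to 21:1.
    lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)
```

The low-contrast palette above fails clearly: `contrast_ratio("#CCCCCC", "#FFFFFF")` comes out around 1.6, far below the 4.5:1 requirement, while the example output's `#E2E8F0` on `#0B0D17` passes comfortably.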

Define evaluation criteria and methods

Based on our failure modes, we can define our evaluation criteria. From there, it's straightforward to define the eval method we'll need to check each criterion:

  • To test our objective criteria, we'll write regular code (rule-based evals).
  • To test our subjective criteria, we'll use a judge model with rubric-based evals: since there's no single perfect version of a creative motto to use as a reference, we don't compare the output to a fixed answer. Instead, we provide the judge with a clear set of criteria (the rubric) to guide its evaluation.
Evaluation criteria | Evaluation method
The data format is correct: valid JSON, all keys present, hexadecimal colors, no empty values, motto is under six words | Rule-based (objective)
The text-to-background color contrast ratio is accessible | Rule-based
The motto matches the brand, audience and tone | LLM judge (subjective)
The color palette matches the brand, audience and tone | LLM judge
The motto isn't toxic | LLM judge
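For the LLM-judge rows, the judge receives the user input, the generated output, and the rubric. A sketch of what the motto rubric could look like; the wording, scoring scale, and helper name `build_judge_prompt` are illustrative assumptions, not a prescribed format:

```python
# Hypothetical rubric prompt for the motto judge; the scale and wording
# are illustrative, not a fixed API.
MOTTO_RUBRIC = """\
You are evaluating a company motto generated by ThemeBuilder.

Company: {company_name}
Description: {description}
Audience: {audience}
Requested tone: {tone}

Motto to evaluate: "{motto}"

Score each criterion from 1 (fails) to 5 (excellent):
1. Brand fit: the motto reflects the company description.
2. Audience fit: the motto speaks to the stated audience.
3. Tone fit: the motto matches the requested tone.
4. Safety: the motto contains no toxic or offensive language.

Return JSON: {{"brand": n, "audience": n, "tone": n, "safety": n}}
"""

def build_judge_prompt(record: dict) -> str:
    user, out = record["userInput"], record["appOutput"]
    return MOTTO_RUBRIC.format(
        company_name=user["companyName"],
        description=user["description"],
        audience=user["audience"],
        tone=", ".join(user["tone"]),
        motto=out["motto"],
    )
```

Asking the judge for structured JSON scores (rather than free text) makes its verdicts easy to aggregate across a whole eval dataset.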

Use task-specific criteria

Besides your use-case-specific metrics, use standard criteria and metrics relevant to the task. For example, common metrics for summarization include:

  • Alignment: The summary follows specific user instructions, tone, or style.
  • Concision: The summary says only what is needed and nothing more.
  • Richness: The summary includes all key points.
  • Correctness: The summary is factual and true.
  • Groundedness: Every claim is traced back to the source to prevent hallucinations.

Prebuilt evals

Eval solutions like Vertex Gen AI Evals API, Braintrust, Datadog, DeepEval, and LangSmith offer managed evals or prebuilt metrics that may fit your use case.

Explore what's available.