Build rule-based evaluations

Automate the basics: use code to catch simple errors.

Maud Nalpas

Now that you know which failures you want to catch with rule-based evals, it's time to implement the corresponding evaluator functions:

evalDataFormat(): Checks that the data format is correct. This includes valid JSON, all keys present, no empty values, motto is under six words, hexadecimal colors.
evalContrastRatio(): Checks that the text-to-background color contrast ratio is accessible.

Implement rule-based evals

Choose a scoring method

The evaluation criteria are binary. Your rule-based eval functions should produce a binary output, such as a PASS or FAIL label.

ThemeBuilder app output (full theme object) → evalDataFormat() → PASS or FAIL label. PASS if the data format meets all constraints. FAIL otherwise.
ThemeBuilder app output (color palette object) → evalContrastRatio() → PASS or FAIL label .PASS if the ratio is > 4.5. FAIL otherwise.

Define an evals type

The PASS or FAIL metric is a boolean, but you can choose to implement it as a string label (category) for readability.

To keep things lean, you can use the same TypeScript type for both your rule-based and the LLM judge evals you'll implement later. Create an EvalResult type that wraps a binary EvalLabel category, and a rationale field for the judge model to explain its rating.

enum EvalLabel {
    PASS = "PASS",
    FAIL = "FAIL"
}

interface EvalResult {
    label: EvalLabel;
    rationale?: string;
}

Implement evaluators

Zod is a great tool for schema validation, as it handles both the JSON structure and custom rules. It's declarative, which makes the validation code readable. Define detailed error reports with specific paths and reasons for failures, for easier troubleshooting.

import { z } from 'zod';
import { MAX_WORD_COUNT } from './app.config';

const hexColorRegex = /^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$/;

// Reusable schema for hex colors
const HexColor = z.string().regex(hexColorRegex, { message: "Invalid hex color code" });

// zod schema definition for AppOutput
export const AppOutputSchema = z.object({
  motto: z.string().min(1, { message: "Motto is missing or empty" }).refine((val) => {
    const words = val.replace(/[^\w\s]|_/g, "").trim();
    const count = words ? words.split(/\s+/).length : 0;
    return count > 0 && count <= MAX_WORD_COUNT;
  }, { message: `Motto must be between 1 and ${MAX_WORD_COUNT} words` }),
  colorPalette: z.object({
    textColor: HexColor,
    backgroundColor: HexColor,
    primary: HexColor,
    secondary: HexColor
  }).catchall(HexColor)
});

Contrast ratio

Keep domain logic, such as the contrast-ratio calculations, in separate utility functions.

/*
 * Input: ColorPalette {"textColor":"#333333","backgroundColor":"#000000", ...}
 * Output: EvalResult {"status":"FAIL","rationale":"Contrast ratio is 1.66:1 (must be >= 4.5:1)."}
 * minContrastRatio is an app config variable, MIN_CONTRAST_RATIO = 4.5
*/
export function evalContrastRatio(colorPalette: ColorPalette, minContrastRatio: number): EvalResult {
  if (!colorPalette || !colorPalette.textColor || !colorPalette.backgroundColor) {
    return { status: EvalLabel.FAIL, rationale: "Missing textColor or backgroundColor." };
  }
  try {
    const ratio = getContrastRatio(colorPalette.textColor, colorPalette.backgroundColor);
    const rationale = `Contrast ratio is ${ratio.toFixed(2)}:1 (must be >= ${minContrastRatio}:1).`;
    if (ratio < minContrastRatio) {
      return { status: EvalLabel.FAIL, rationale };
    }
    return { status: EvalLabel.PASS, rationale };
  } catch (e) {
    return { status: EvalLabel.FAIL, rationale: "Could not calculate contrast ratio (invalid hex?)." };
  }
}

Take a look at our evaluator code for evalDataFormat() and evalContrastRatio().

Test rule-based evals

Rule-based evaluations are deterministic, so you can implement classic unit tests to check their behavior. Build your tests to run various outputs through the evaluators and assert whether they return the PASS or FAIL label you expected.

If a test case expects the evaluator to return a FAIL and it does, the test outputs a PASS because the evaluator behaved as intended.

import { MIN_CONTRAST_RATIO } from '../src/app.config'; // 4.5

const testCases = [
  {
  // ...
      appOutput: {
        motto: "Test motto",
        colorPalette: {
          textColor: "#333333",
          backgroundColor: "#000000",
          primary: "#FF0000",
          secondary: "#333333"
        }
      },
    expected: {
      // Dark grey on black (low contrast): FAIL
      contrast: EvalLabel.FAIL
    }
  }
  // ... more test cases
];

testCases.forEach((testCase) => {
  const result = evalContrastRatio(
    testCase.appOutput.colorPalette as any, MIN_CONTRAST_RATIO
  );
  const actualEvalLabel = result.label;
  const expectedEvalLabel = testCase.expected.contrast;
  const isSuccess = actualEvalLabel === expectedEvalLabel;
 // ...
});

Try it

Clone the evals-course project.
Set it up.
Run the rule-based evaluator tests yourself.

Design evaluations

Part 1