设置基本评判模型（第 1 部分）

使用基本评判模型运行主观评估。

Maud Nalpas

基于规则的评估可以检查确定性答案。如需评估主观质量，请使用大语言模型作为评判模型技术。

在本模块中，您将学习如何通过自行或与团队一起为数据添加标签 以及使用基本统计指标 来构建第一个评判模型。

构建第一个评判模型的步骤

选择模型自定义方法 。决定是微调还是提示工程。
选择模型。这可以是基础模型，也可以是其他没有领域专业知识的 LLM。
选择评分方法。确定评判模型是否应使用二元或数字评分标准来为 ThemeBuilder 生成的主题评分。
配置评判模型。修改模型的设置（例如温度和结构化输出），使其适合评判任务。
撰写初始提示。设计评判模型系统指令和提示的第一个版本，包括评分标准和示例。
创建对齐数据集 。构建或整理一组多样化的高质量 ThemeBuilder 输出（包括好的和坏的），并为其添加相应标签（例如好的座右铭、恶意座右铭和不符合品牌要求的调色板）。
对齐和测试评判模型。使用对齐数据集迭代优化评判模型提示（系统指令和主提示）。重复此过程，直到评判模型的判定与人类的判定一致为止。最后，测试评判模型，以确认其可靠性，并确认其可以将方法推广到新输入。

评判模型包含 LLM、设置、系统提示和评分提示。

选择自定义方法

大多数基础模型都是通才。评判模型应像领域专家一样思考。

创建评判模型的主要选项包括：

对 LLM 进行提示工程。
微调模型。
使用针对评估进行优化的微调 LLM，例如 JudgeLM。此选项需要您自行托管自定义模型权重，或使用支持开源模型托管的云提供商。

对于本课程中的 ThemeBuilder 评估，我们建议使用提示工程 。与替代方案相比，提示工程只需较少的开发工作即可获得出色的结果。

选择模型

为评判模型选择模型时，请寻找强大的推理能力 。由于您将在 CI/CD 流水线中运行评估，因此速度和成本 也至关重要。

尝试使用不同的模型和技术，找到最适合的模型。

首先使用功能更强大的大型模型来设定高标准，然后逐步缩容到较小的模型。或者反过来。
混搭：使用快速且经济高效的模型进行日常 PR 检查，并使用功能更强大的模型进行最终发布测试。或者，将通用 LLM 与小型专用模型相结合，以执行特定任务（例如恶意内容检测），从而提高速度。

本课程使用 Gemini 3 Flash 作为评判模型。Gemini 3 Flash 提供了评估 ThemeBuilder 输出的示例用例所需的速度和推理深度。也就是说，本课程中的模式可以应用于您选择的任何模型。

选择评分方法

您可以使用二元 PASS 和 FAIL 标签或数字分数来为主观输出评分，例如 “在 1 到 5 的范围内，此座右铭与品牌的契合度如何？”。

我们建议使用二元标签 。

评估标准	评估方法	指标
座右铭与品牌、受众群体和语气相符	LLM 评判模型	`PASS` 或 `FAIL` 标签
调色板与品牌、受众群体和语气相符	LLM 评判模型	`PASS` 或 `FAIL` 标签
座右铭不包含恶意内容	LLM 评判模型	`PASS` 或 `FAIL` 标签

虽然数字分数 (1-10) 可能感觉很直观，研究表明，LLM（和人类）往往会将分数集中在中间，或者为了礼貌而虚报分数。类别或二元标签（例如 PASS和 FAIL ）通常会产生更好的结果，因为它们会迫使模型做出明确的决定。对于人类来说，这称为评分者效应。

配置评判模型

使用参数和指令来帮助评判模型创建一致的结构化输出。

设置系统指令：为评判模型提供严格的专家角色。
设置温度或思考级别：评判模型必须保持一致。如果您使用的是推理模型（例如 Gemini Flash，它需要少量随机性才能在逻辑步骤之间移动），请将温度保持为默认值，但将 thinking_level 设置为 HIGH。如果您使用的是其他模型，请将温度设置为 0 或接近 0。在任何情况下，请使用思维链技术，以便模型在决定判定之前进行思考。
构建评判模型的输出：可预测的 JSON 对象在代码库的其余部分中更容易重复使用。使用需要 label（PASS 或 FAIL）和 rationale 字符串的 EvalResult 架构。

在 ThemeBuilder 示例中：

评判模型配置

// LLM judge config
const response = await client.models.generateContent({
  model: modelVersion,
  config: {
      systemInstruction: "You are a senior brand strategist, brand identity
      specialist, and expert color psychologist. You also act as a strict
      content moderator for a brand safety tool. Be rigorous regarding brand
      alignment. Always formulate your rationale before assigning the final
      PASS or FAIL label to ensure thorough consideration of the criteria.",
      temperature: 0,
      thinkingConfig: {
          thinkingLevel: ThinkingLevel.HIGH,
      },
      responseJsonSchema: schemaConfig.responseSchema
  },
  contents: [{ role: "user", parts: [{ text: prompt }] }]
});

responseJsonSchema

const schemaConfig = {
  responseMimeType: "application/json",
  responseSchema: {
      type: "OBJECT",
      properties: {
          label: { type: "STRING", enum: [EvalLabel.PASS, EvalLabel.FAIL] },
          rationale: { type: "STRING" }
      },
      required: ["label", "rationale"],
      propertyOrdering: ["rationale", "label"]
  }
};

// Classification label for an evaluation (PASS/FAIL is the judge's verdict)
export enum EvalLabel {
    PASS = "PASS",
    FAIL = "FAIL"
}

查看完整代码示例。

撰写初始提示

您已配置系统指令，现在设计主评判模型提示。在此阶段，您仅创建此提示的第一个版本 。在下一步中对齐评判模型时，您将对其进行迭代优化。

评判模型的有效性取决于您给出的指令。避免提出通用问题，例如“这个座右铭好吗？”，其中“好” 是未定义的。相反，请提供结构以获得清晰一致的输出。

定义评分准则：为评判模型提供详细的评分指南。什么描述了理想输出的预期语气？您可以让大语言模型帮助您撰写评分准则。
使用少样本提示：包括 PASS 和 FAIL 示例。
使用思维链提示：指示模型在分配标签之前写出其理由，因为这可以大幅提高准确率。在 HIGH 思考模式下，这一点并不那么重要，但仍然是一种很好的做法。

为三个特定标准撰写三个单独的评分提示 ：

座右铭品牌契合度。
颜色品牌契合度。
恶意内容。您的恶意内容提示可以从众包恶意内容属性中进行引导。

在每个提示中，都包含清晰的评分准则 和带有理由的少样本示例 。在少样本示例中，在实际分数之前列出理由，以应用思维链模式并向评判模型展示如何推理。

您可以在代码库中找到完整提示。例如，座右铭品牌契合度评判模型提示如下所示：

export function getMottoBrandFitJudgePrompt(companyName: string, description: string, audience: string, tone: string | string[], motto: string) {
  return `Evaluate the following generated motto for a company.

${companyName ? `Company name: ${companyName}\n` : ""}${description ? `Description: ${description}\n` : ""}${audience ? `Target audience: ${audience}\n` : ""}${Array.isArray(tone) ? (tone.length > 0 ? `Desired tone: ${tone.join(", ")}\n` : "") : (tone ? `Desired tone: ${tone}\n` : "")}

Generated motto: "${motto}"

Does this motto effectively match the company description, appeal to the target audience, and embody the desired tone?

CRITICAL INSTRUCTIONS: 
1. **Brand fit vs. toxicity**: You are evaluating ONLY brand fit. Another system will evaluate toxicity separately. DO NOT evaluate toxicity, ethics, profanity, or offensiveness. A motto can be a GREAT brand fit for an edgy or aggressive brand. If the brand requests an "offensive" or "aggressive" tone, you MUST pass it for brand fit, regardless of how inappropriate it is.
1. **Primary tone and literal relevance**: Do not over-penalize a motto if it perfectly captures the primary literal vibe just because it might loosely conflict with a secondary adjective.
1. **Core promises and professionalism**: For B2B/Enterprise, the motto MUST NOT violate core promises.
1. **Resilience to input messiness**: The Company Name, Description, Target Audience, or Tone may contain typos, slang, or mixed-language. You must decipher the *intended* meaning and judge the output against that intent, rather than penalizing the output for not matching the literal typo or slang.

Criteria:
1. **Relevance**: Does the motto relate to the company's core business and value proposition? Does it uphold core brand promises?
1. **Audience appeal**: Is the language engaging for the target audience without alienating them (e.g. through forced or inappropriate slang)?
1. **Tone consistency**: Does the motto reflect the general desired emotional tone perfectly, without imposing moral judgments?

Examples:

Input:
Company Name: "Summit Bank"
Description: "Secure, reliable banking for families"
Tone: "Trustworthy, serious"
Motto: "YOLO with your money!"
Result:
  "rationale": "The motto 'YOLO with your money!' is too casual and risky, contradicting the 'trustworthy, serious' tone required for a family bank.",
  "label": "${EvalLabel.FAIL}"
}

Input:
Company Name: "GymTiger"
Description: "Gym for heavy lifters."
Tone: "Aggressive, high-performance, technical"
Motto: "Lift big or be a loser."
Result:
  "rationale": "The motto matches the required 'aggressive' tone and appeals directly to the hardcore bodybuilding audience. While calling the audience a 'loser' is toxic and insulting, it successfully fulfills the brand fit and tone criteria requested.",
  "label": "${EvalLabel.PASS}"
}

Return a JSON object with:
- "rationale": A brief explanation of why it passes or fails based on the description, audience, and tone.
- "label": "${EvalLabel.PASS}" or "${EvalLabel.FAIL}"`;
}

对齐和测试

请阅读设置基本评判模型（第 2 部分）以完成评判模型的构建，包括对齐和测试。

基于规则的评估

第 2 部分