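Complete the setup of a basic judge model so you can start running subjective evaluations.
Align and test the judge
You have an initial judge, but you can't trust it yet. The judge is ready only when it consistently agrees with human judgment.
Create an alignment dataset
To calibrate the judge, you need an alignment dataset: a small, high-quality set of inputs and outputs that humans have scored. This dataset serves as your ground truth. You use it to verify that the judge's logic consistently matches your expectations.
Your alignment dataset should contain 30 to 50 input-output pairs. That's large enough to cover some edge cases, yet small enough to label in a short time.
In the ThemeBuilder example, an entry in the alignment dataset looks like this (input, output, human label):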
{
"id": "sample-014",
"userInput": {
"companyName": "Rawrr!",
"audience": "kids 5-10",
"tone": ["prehistoric", "loud", "fun"]
},
"appOutput": {
"motto": "Experiencing the prehistoric era."
},
"humanEvaluation": {
"mottoBrandFit": {
"label": "FAIL",
"rationale": "While on-theme, this motto is too formal for kids.
It fails to capture the required 'loud' and 'fun' energy."
}
}
}
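To generate inputs and outputs, you can pull them from production logs (if available), craft data by hand, use an LLM (synthetic data), or start with a few handpicked samples and have an LLM expand your dataset.
Once your inputs and outputs are ready, work with your team to label each output PASS or FAIL using your rubric. These labels become your ground truth.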
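Make sure your alignment dataset includes PASS and FAIL examples of varying difficulty, for example:
- 10 happy-path examples that the judge should label PASS.
- 20 examples that the judge should label FAIL:
  - Obvious failures, such as highly toxic or completely off-brand mottos.
  - Subtle failures, such as a grammatically perfect motto that is slightly too formal for a playful brand, or whose tone is a bit off.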
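An LLM judge is a gatekeeper. When the dataset contains more failures than passes, aligning the judge to it gives you more opportunities to tune the rubric to catch failures, which ultimately makes the judge better at detecting them.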
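Before you start aligning, it can help to confirm that the dataset actually has the size and PASS/FAIL balance described above. The following is a minimal sketch, not part of ThemeBuilder: the `AlignmentEntry` type and the `summarizeComposition` function are hypothetical names, and counting an entry as a failure when any of its criteria is labeled FAIL is an assumption made for illustration.

```typescript
type Label = "PASS" | "FAIL";

// Hypothetical shape: only the fields needed for this check.
interface AlignmentEntry {
  id: string;
  // Maps a criterion name (for example "mottoBrandFit") to its human label.
  humanEvaluation: Record<string, { label: Label; rationale: string }>;
}

interface Composition {
  total: number;
  passCount: number; // entries where every criterion is PASS
  failCount: number; // entries with at least one FAIL (assumption)
  warnings: string[];
}

function summarizeComposition(entries: AlignmentEntry[]): Composition {
  const failCount = entries.filter((entry) =>
    Object.values(entry.humanEvaluation).some((v) => v.label === "FAIL")
  ).length;
  const passCount = entries.length - failCount;

  const warnings: string[] = [];
  // The text suggests 30 to 50 input-output pairs.
  if (entries.length < 30 || entries.length > 50) {
    warnings.push(`Expected 30 to 50 entries, found ${entries.length}.`);
  }
  // The text suggests more failures than passes.
  if (failCount <= passCount) {
    warnings.push("Expected more FAIL entries than PASS entries.");
  }
  return { total: entries.length, passCount, failCount, warnings };
}
```

Calling `summarizeComposition(dataset)` on a set of 10 PASS and 20 FAIL entries would return no warnings under these assumptions.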
Once ready, your alignment dataset should look like this:
Happy path test case (PASS)
// Easy, clean input + Good output
{
"id": "sample-001",
"userInput": {
"companyName": "Kinetica",
"description": "Carbon-fiber plated performance footwear engineered for
elite marathon runners.",
"audience": "competitive triathletes and professional runners",
"tone": [
"aggressive",
"high-performance",
"technical"
]
},
"appOutput": {
"motto": "Unlock your kinetic potential.",
"colorPalette": {
"textColor": "#FFFFFF",
"backgroundColor": "#000000",
"primary": "#DC2626",
"secondary": "#E2E8F0"
}
},
"humanEvaluation": {
"mottoBrandFit": {
"label": "PASS",
"rationale": "This motto powerfully aligns the brand's technical
engineering with the ambitious goals of its elite athletic audience.
Relevance: Uses 'kinetic' to expertly link the brand to physical
energy. Audience appeal: 'Unlock your potential' resonates perfectly
with competitive runners. Tone consistency: Nails the required
aggressive, high-performance marks."
},
"mottoToxicity": {
"label": "PASS",
"rationale": "Perfectly clean and motivational. No offensive or
exclusionary language."
},
"colorBrandFit": {
"label": "PASS",
"rationale": "The chosen color palette perfectly mirrors Kinetica's
aggressive and technical brand identity by utilizing high-impact tones
that resonate with elite athletes. Relevance: Psychological association:
Blood red creates urgency and speed. Harmony: Stark contrast against
black/white feels highly technical.
Appropriateness: Extremely effective aesthetic for premium athletic gear."
}
}
}
Obvious failure (FAIL)
// Off-brand color palette
{
"id": "sample-014",
"userInput": {
"companyName": "Rawrr!",
"description": "Dinosaur themed playground and party venue.",
"audience": "kids 5-10",
"tone": [
"prehistoric",
"loud",
"fun"
]
},
"appOutput": {
"motto": "Experiencing the prehistoric era.",
"colorPalette": {
"textColor": "#4A4A4A",
"backgroundColor": "#F5F5DC",
"primary": "#D2B48C",
"secondary": "#C0C0C0"
}
},
"humanEvaluation": {
"mottoBrandFit": {
"label": "FAIL",
"rationale": "While the motto relates to the dinosaur theme, its overly
academic and formal tone fails to capture the loud and fun energy
essential for a children's playground brand. Relevance: Effectively fits
the dinosaur theme. Audience appeal: A bit formal ('Experiencing' versus
something punchy), acceptable for parents booking events but should be
more exciting for kids, it's too formal and academic for a children's
playground, lacks the 'loud' and 'fun' energy requested in the tone.
Tone consistency: It touches on the 'prehistoric' element adequately."
},
"mottoToxicity": {
"label": "PASS",
"rationale": "A completely family-friendly, educational-sounding statement."
},
"colorBrandFit": {
"label": "FAIL",
"rationale": "This muted and sophisticated color scheme fails to capture
the high-energy, prehistoric spirit required to attract and excite a young
audience. Relevance: Psychological association: The 'sad beige', tan, and
muted greys evoke a sterile, 'adult minimalist' home décor aesthetic.
Harmony: The colors are muddy and lifeless. Appropriateness: For a 'loud'
and 'fun' children's playground targeting 5-10 year olds, this palette is
a spectacular failure. It desperately needs vibrant, exciting primary
colors to attract kids."
}
}
}
Subtle failure (FAIL)
// Almost on-brand color palette
{
"id": "sample-023",
"userInput": {
"companyName": "Apex Dental",
"description": "High-end cosmetic dentistry specializing in porcelain
veneers and laser whitening.",
"audience": "Professionals seeking a perfect smile",
"tone": [
"clean",
"professional",
"bright"
]
},
"appOutput": {
"motto": "Designing your brightest smile.",
"colorPalette": {
"textColor": "#1A202C",
"backgroundColor": "#FFFFFF",
"primary": "#FFC107",
"secondary": "#E2E8F0"
}
},
"humanEvaluation": {
"mottoBrandFit": {
"label": "PASS",
"rationale": "The motto perfectly captures the premium essence of the
brand by combining high-end dental aesthetics with a clear appeal to a
professional clientele. Relevance: Relates perfectly to cosmetic
dentistry and teeth whitening. Audience appeal: 'Brightest smile' is a
highly effective, aspirational hook for professionals wanting to look
their best. Tone consistency: Clean, upbeat, and exceedingly professional."
},
"mottoToxicity": {
"label": "PASS",
"rationale": "A very positive, medical-grade, and safe statement."
},
"colorBrandFit": {
"label": "FAIL",
"rationale": "The choice of bright yellow is a fundamental branding
failure for a cosmetic dental practice as it creates a direct and
repellent visual link to tooth discoloration, undermining the clinic's
high-end whitening positioning. Relevance: Psychological association:
While yellow technically fulfills the word 'bright', in the specific
context of dentistry, a primary bright yellow is subconsciously and
intensely associated with plaque, decay, and stained teeth.
Harmony: It stands out strongly but sends the wrong message.
Appropriateness: This is a massive psychological misstep for a whitening
clinic. It subverts trust in their core service by visually reminding
customers of the problem rather than the solution."
}
}
}
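Align the judge
With your ground truth in place, you can align the judge with the human labels. Your goal is for the judge to consistently agree with you and mimic human judgment. You can compute an alignment score: the percentage of judge-created labels that match the human-created labels.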
// total = all test cases
// aligned = test cases where humanEval.label === llmJudgeEval.label
// For example, PASS and PASS
const alignment = (aligned / total) * 100;
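Set a target alignment score, such as 85%. Your target may vary depending on your use case.
Run the judge against the alignment dataset. If your alignment score is below the target, read the judge's rationales to understand what confused it. Revise the system instructions and the judge prompt to close the gap. Repeat until you reach the target score.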
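The scoring step can be sketched in a few lines. This is an illustrative example rather than the ThemeBuilder implementation: the `JudgedCase` shape and both function names are assumptions, and the example expects that you have already collected one judge label per test case.

```typescript
type Verdict = "PASS" | "FAIL";

// Hypothetical shape: one human label and one judge label per test case.
interface JudgedCase {
  id: string;
  humanLabel: Verdict;
  judgeLabel: Verdict;
  judgeRationale: string;
}

// Percentage of test cases where the judge label matches the human label.
function alignmentScore(cases: JudgedCase[]): number {
  if (cases.length === 0) {
    throw new Error("Cannot score an empty dataset.");
  }
  const aligned = cases.filter((c) => c.humanLabel === c.judgeLabel).length;
  return (aligned / cases.length) * 100;
}

// Cases to review by hand: read each judgeRationale to see what confused the judge.
function disagreements(cases: JudgedCase[]): JudgedCase[] {
  return cases.filter((c) => c.humanLabel !== c.judgeLabel);
}
```

Listing the disagreements next to the score keeps the iteration loop concrete: each prompt revision should shrink that list without special-casing the individual entries in it.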
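Best practices
To help your judge score consistently, follow these best practices:
- Avoid overfitting. Instructions should be general, not tailored to your alignment dataset. If you give narrow instructions, such as avoiding certain phrases, the judge becomes very good at passing this particular alignment test but fails to generalize to new data. This problem is called overfitting.
- Optimize the system instructions and the judge prompt. Prompt optimization techniques include editing the prompt by hand, asking another LLM to suggest improvements, or combining the two. Techniques range from manual to highly advanced, such as algorithms that mimic biological evolution. Record your changes so you can revert them if needed.
To see alignment in action for ThemeBuilder, run the alignment test yourself.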
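Stress-test with bootstrapping
Reaching your 85% alignment target doesn't guarantee that the judge performs well on real data. Stress-test your judge with a statistical technique called bootstrapping. Bootstrapping lets you create new versions of your dataset without any additional labeling.
- Test: Randomly resample 30 items from your dataset, with replacement. In a given run, one tricky test case might be picked five times, which makes the test much harder. Run the alignment test on several of these randomized sets, then compute the mean alignment score and the variance of the scores across runs. There's no fixed number of iterations, but 10 is a good baseline for a medium-sized project. Run more iterations for higher confidence.
- Fix: If your alignment score swings widely (high variance), your judge isn't reliable yet. Your initial score was a fluke, because you happened to hit some easy cases. Broaden your rubric and add more diverse, tricky examples to your alignment dataset.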
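The resampling procedure can be sketched as follows. Treat this as one possible approach under stated assumptions: the `Outcome` type and function names are hypothetical, and the sketch reuses already-recorded judge results for each resample instead of calling the judge again, which is only equivalent when the judge answers deterministically. The `rng` parameter is injectable so that runs can be made reproducible.

```typescript
// Hypothetical shape: whether the judge label matched the human label.
interface Outcome {
  id: string;
  aligned: boolean;
}

// Draw `size` items with replacement, so the same item can appear several times.
function resample<T>(items: T[], size: number, rng: () => number = Math.random): T[] {
  return Array.from({ length: size }, () => items[Math.floor(rng() * items.length)]);
}

function bootstrapAlignment(
  outcomes: Outcome[],
  iterations: number = 10, // baseline suggested in the text
  sampleSize: number = 30, // resample size suggested in the text
  rng: () => number = Math.random
): { scores: number[]; mean: number; variance: number } {
  if (outcomes.length === 0 || iterations < 1 || sampleSize < 1) {
    throw new Error("Need outcomes, at least one iteration, and a positive sample size.");
  }
  const scores: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const sample = resample(outcomes, sampleSize, rng);
    const aligned = sample.filter((o) => o.aligned).length;
    scores.push((aligned / sampleSize) * 100);
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  // Population variance of the per-iteration alignment scores.
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) * (s - mean), 0) / scores.length;
  return { scores, mean, variance };
}
```

A mean near your target with a small variance suggests the original score was not a lucky draw. What counts as "small" is a judgment call for your project; the text does not fix a threshold.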
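You can try it yourself.
Test self-consistency
You can trust a judge only if it always gives the same answer for the same input. If you set the temperature to 0, the judge should be 100% consistent. Test to confirm.
- Test: Run your judge several times against the exact same dataset, for example a random sample of your alignment dataset. Compute the variance for each test case across the repeated runs. Aim for 100% consistency (zero variance). A variance above zero is a failure, because it means the judge gave different answers for the same input.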
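- Fix: Your judge prompt may be ambiguous, or the temperature may be too high. Rewrite the parts of the prompt that seem vague, especially the rubric. If you haven't already lowered the temperature, set it to 0 (or set thinking_level to high).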
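For PASS/FAIL labels, zero variance for a test case simply means every repeated run produced the same label. The sketch below checks exactly that. The input shape (one id-to-label map per run) and the `selfConsistency` name are assumptions for illustration, and the sketch expects every run to cover the same test case ids.

```typescript
type RunLabel = "PASS" | "FAIL";

// runs[i] maps a test case id to the label the judge gave in the i-th repeated run.
function selfConsistency(
  runs: Array<Record<string, RunLabel>>
): { consistency: number; unstableIds: string[] } {
  if (runs.length === 0) {
    throw new Error("Need at least one run.");
  }
  const first = runs[0];
  const ids = Object.keys(first);
  if (ids.length === 0) {
    throw new Error("Runs contain no test cases.");
  }
  // A test case is unstable if any run disagrees with the first run.
  const unstableIds = ids.filter((id) => runs.some((run) => run[id] !== first[id]));
  const consistency = ((ids.length - unstableIds.length) / ids.length) * 100;
  return { consistency, unstableIds };
}
```

Any id in `unstableIds` points at an input where the rubric is probably ambiguous, so those are good places to start rewriting the prompt.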
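To see this in action, run the test yourself.
Final exam
Bootstrapping gives you an initial check against overfitting. Now you run a final test on new data. This is your final confirmation that the judge can correctly score new inputs.
- Test: Set aside a separate final exam dataset of 20 human-labeled samples that were never used during alignment. Run your judge against this dataset.
- Fix: If the alignment score stays high, your judge is ready! If the score drops sharply, you've likely overfit: you tuned the prompt so many times that it passes only the specific alignment data. Broaden your prompt, rubric, and few-shot examples.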
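The final exam is only meaningful if none of its samples leaked into the alignment process. The following sketch shows one way to check for leaks and to flag a sharp drop. Both function names are hypothetical, and the 10-point `maxDrop` default is an arbitrary illustrative threshold: the text does not define how large a drop counts as sharp.

```typescript
// Ids that appear in both datasets. Ideally this returns an empty array.
function findLeakedIds(alignmentIds: string[], finalExamIds: string[]): string[] {
  const seen = new Set(alignmentIds);
  return finalExamIds.filter((id) => seen.has(id));
}

// Compare the alignment score (in percent) on both datasets.
// maxDrop is an assumed threshold; pick one that suits your use case.
function finalExamVerdict(
  alignmentScore: number,
  finalExamScore: number,
  maxDrop: number = 10
): "READY" | "POSSIBLE_OVERFIT" {
  return alignmentScore - finalExamScore > maxDrop ? "POSSIBLE_OVERFIT" : "READY";
}
```

For example, `finalExamVerdict(90, 60)` returns `"POSSIBLE_OVERFIT"`, which under these assumptions would send you back to broadening the prompt and rubric.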
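To see this in action, run the test yourself.
Summary
You ran different tests to build a basic judge, including:
- The alignment test checks whether the judge's verdicts are correct.
- The bootstrap and final exam tests check data sensitivity: when facing new data, does the judge stay correct enough?
- The self-consistency test measures system noise: how much the LLM judge's own internal randomness affects the results.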