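Once your basic judge model is set up, you can start running subjective evaluations.
Align and test your judge
You have an initial judge, but you don't trust it yet. A judge is only ready when its verdicts agree with human judgment.
Build an alignment dataset
To calibrate your judge, you need an alignment dataset: a small, high-quality set of inputs and outputs that humans have evaluated. This dataset serves as your ground truth. You use it to confirm that the judge's logic works as expected.
An alignment dataset should contain 30 to 50 input/output pairs. That is large enough to cover some edge cases, but small enough to label in a short time.
In the ThemeBuilder example, an entry in the alignment dataset looks like this (input, output, human label):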
{
"id": "sample-014",
"userInput": {
"companyName": "Rawrr!",
"audience": "kids 5-10",
"tone": ["prehistoric", "loud", "fun"]
},
"appOutput": {
"motto": "Experiencing the prehistoric era."
},
"humanEvaluation": {
"mottoBrandFit": {
"label": "FAIL",
"rationale": "While on-theme, this motto is too formal for kids.
It fails to capture the required 'loud' and 'fun' energy."
}
}
}
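To generate inputs and outputs, you can pull them from production logs (if available), craft the data by hand, use an LLM (synthetic data), or start from a few hand-picked samples and ask an LLM to expand the dataset.
Once the inputs and outputs are ready, work with your team to label each output PASS or FAIL using the rubric. These labels become your ground truth.
Make sure the alignment dataset includes PASS and FAIL examples of varying difficulty. For example:
- 10 happy-path cases that the judge should label PASS.
- 20 cases that the judge should label FAIL:
  - Obvious failures, such as extremely toxic or completely off-brand mottos.
  - Subtle failures, such as a motto that is grammatically correct but slightly too formal for a playful brand, or that only partially matches the tone.
An LLM judge is a gatekeeper. If the dataset contains more failing cases than passing ones, you get more opportunities to tune the rubric to catch failures, which ultimately makes the judge better at detecting them.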
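Before you start aligning, it can help to check the PASS/FAIL mix of your labels. The following is a minimal sketch, not part of ThemeBuilder: `countLabels` is a hypothetical helper, and it assumes every entry follows the `humanEvaluation[criterion].label` shape shown in the sample entry.

```javascript
// Hypothetical helper: count human labels for one criterion.
// Assumes entry.humanEvaluation[criterion].label is "PASS" or "FAIL".
function countLabels(dataset, criterion) {
  const counts = { PASS: 0, FAIL: 0, missing: 0 };
  for (const entry of dataset) {
    const label = entry.humanEvaluation?.[criterion]?.label;
    if (label === "PASS" || label === "FAIL") {
      counts[label] += 1;
    } else {
      // Unlabeled or malformed entries are tracked separately.
      counts.missing += 1;
    }
  }
  return counts;
}

// Made-up entries, for illustration only.
const sampleDataset = [
  { id: "a", humanEvaluation: { mottoBrandFit: { label: "PASS" } } },
  { id: "b", humanEvaluation: { mottoBrandFit: { label: "FAIL" } } },
  { id: "c", humanEvaluation: { mottoBrandFit: { label: "FAIL" } } },
];

const labelCounts = countLabels(sampleDataset, "mottoBrandFit");
console.log(labelCounts); // { PASS: 1, FAIL: 2, missing: 0 }
```

If FAIL does not outnumber PASS, or if `missing` is above zero, consider relabeling or adding cases before you run the judge.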
Once the alignment dataset is ready, it should look something like this:
Happy-path case (PASS)
// Easy, clean input + Good output
{
"id": "sample-001",
"userInput": {
"companyName": "Kinetica",
"description": "Carbon-fiber plated performance footwear engineered for
elite marathon runners.",
"audience": "competitive triathletes and professional runners",
"tone": [
"aggressive",
"high-performance",
"technical"
]
},
"appOutput": {
"motto": "Unlock your kinetic potential.",
"colorPalette": {
"textColor": "#FFFFFF",
"backgroundColor": "#000000",
"primary": "#DC2626",
"secondary": "#E2E8F0"
}
},
"humanEvaluation": {
"mottoBrandFit": {
"label": "PASS",
"rationale": "This motto powerfully aligns the brand's technical
engineering with the ambitious goals of its elite athletic audience.
Relevance: Uses 'kinetic' to expertly link the brand to physical
energy. Audience appeal: 'Unlock your potential' resonates perfectly
with competitive runners. Tone consistency: Nails the required
aggressive, high-performance marks."
},
"mottoToxicity": {
"label": "PASS",
"rationale": "Perfectly clean and motivational. No offensive or
exclusionary language."
},
"colorBrandFit": {
"label": "PASS",
"rationale": "The chosen color palette perfectly mirrors Kinetica's
aggressive and technical brand identity by utilizing high-impact tones
that resonate with elite athletes. Relevance: Psychological association:
Blood red creates urgency and speed. Harmony: Stark contrast against
black/white feels highly technical.
Appropriateness: Extremely effective aesthetic for premium athletic gear."
}
}
}
Obvious failure (FAIL)
// Off-brand color palette
{
"id": "sample-014",
"userInput": {
"companyName": "Rawrr!",
"description": "Dinosaur themed playground and party venue.",
"audience": "kids 5-10",
"tone": [
"prehistoric",
"loud",
"fun"
]
},
"appOutput": {
"motto": "Experiencing the prehistoric era.",
"colorPalette": {
"textColor": "#4A4A4A",
"backgroundColor": "#F5F5DC",
"primary": "#D2B48C",
"secondary": "#C0C0C0"
}
},
"humanEvaluation": {
"mottoBrandFit": {
"label": "FAIL",
"rationale": "While the motto relates to the dinosaur theme, its overly
academic and formal tone fails to capture the loud and fun energy
essential for a children's playground brand. Relevance: Effectively fits
the dinosaur theme. Audience appeal: A bit formal ('Experiencing' versus
something punchy), acceptable for parents booking events but should be
more exciting for kids, it's too formal and academic for a children's
playground, lacks the 'loud' and 'fun' energy requested in the tone.
Tone consistency: It touches on the 'prehistoric' element adequately."
},
"mottoToxicity": {
"label": "PASS",
"rationale": "A completely family-friendly, educational-sounding statement."
},
"colorBrandFit": {
"label": "FAIL",
"rationale": "This muted and sophisticated color scheme fails to capture
the high-energy, prehistoric spirit required to attract and excite a young
audience. Relevance: Psychological association: The 'sad beige', tan, and
muted greys evoke a sterile, 'adult minimalist' home décor aesthetic.
Harmony: The colors are muddy and lifeless. Appropriateness: For a 'loud'
and 'fun' children's playground targeting 5-10 year olds, this palette is
a spectacular failure. It desperately needs vibrant, exciting primary
colors to attract kids."
}
}
}
Subtle failure (FAIL)
// Almost on-brand color palette
{
"id": "sample-023",
"userInput": {
"companyName": "Apex Dental",
"description": "High-end cosmetic dentistry specializing in porcelain
veneers and laser whitening.",
"audience": "Professionals seeking a perfect smile",
"tone": [
"clean",
"professional",
"bright"
]
},
"appOutput": {
"motto": "Designing your brightest smile.",
"colorPalette": {
"textColor": "#1A202C",
"backgroundColor": "#FFFFFF",
"primary": "#FFC107",
"secondary": "#E2E8F0"
}
},
"humanEvaluation": {
"mottoBrandFit": {
"label": "PASS",
"rationale": "The motto perfectly captures the premium essence of the
brand by combining high-end dental aesthetics with a clear appeal to a
professional clientele. Relevance: Relates perfectly to cosmetic
dentistry and teeth whitening. Audience appeal: 'Brightest smile' is a
highly effective, aspirational hook for professionals wanting to look
their best. Tone consistency: Clean, upbeat, and exceedingly professional."
},
"mottoToxicity": {
"label": "PASS",
"rationale": "A very positive, medical-grade, and safe statement."
},
"colorBrandFit": {
"label": "FAIL",
"rationale": "The choice of bright yellow is a fundamental branding
failure for a cosmetic dental practice as it creates a direct and
repellent visual link to tooth discoloration, undermining the clinic's
high-end whitening positioning. Relevance: Psychological association:
While yellow technically fulfills the word 'bright', in the specific
context of dentistry, a primary bright yellow is subconsciously and
intensely associated with plaque, decay, and stained teeth.
Harmony: It stands out strongly but sends the wrong message.
Appropriateness: This is a massive psychological misstep for a whitening
clinic. It subverts trust in their core service by visually reminding
customers of the problem rather than the solution."
}
}
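}
Reach alignment
Once your ground truth data is ready, you can align the judge with the human labels. Your goal is for the judge to consistently agree with your verdicts and mimic human judgment. To measure this, compute an alignment score: the percentage of judge labels that match the human labels.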
// total = all test cases
// aligned = test cases where humanEval.label === llmJudgeEval.label
// For example, PASS and PASS
const alignment = (aligned / total) * 100;
Set a target alignment score, such as 85%. The right target varies by use case.
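As a runnable version of the formula, here is a minimal sketch. The `{ id, humanLabel, judgeLabel }` result shape, the `computeAlignment` name, and the sample data are assumptions made for illustration; adapt them to however you store judge output.

```javascript
// Hypothetical result shape: { id, humanLabel, judgeLabel }.
function computeAlignment(results) {
  if (results.length === 0) {
    throw new Error("No test cases to score.");
  }
  // A case is aligned when the judge label equals the human label.
  const aligned = results.filter(
    (r) => r.humanLabel === r.judgeLabel
  ).length;
  return (aligned / results.length) * 100;
}

// Example target from the text; choose your own per use case.
const TARGET_ALIGNMENT = 85;

// Made-up results, for illustration only.
const exampleResults = [
  { id: "case-1", humanLabel: "PASS", judgeLabel: "PASS" },
  { id: "case-2", humanLabel: "FAIL", judgeLabel: "FAIL" },
  { id: "case-3", humanLabel: "FAIL", judgeLabel: "PASS" },
  { id: "case-4", humanLabel: "FAIL", judgeLabel: "FAIL" },
];

const exampleAlignment = computeAlignment(exampleResults);
console.log(exampleAlignment); // 75
console.log(
  exampleAlignment >= TARGET_ALIGNMENT ? "Target met" : "Keep tuning"
);
```

With multiple criteria (such as mottoBrandFit and colorBrandFit), you would likely compute one score per criterion rather than a single pooled number.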
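Run the judge against the alignment dataset. If the alignment score is below target, read the judge's rationales to understand what confused it. Revise the system instructions and judge prompt to close the gap. Repeat until you reach the target score.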
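To make the review step concrete, the sketch below pulls out only the cases where the judge and the human disagree, together with the judge's rationale. The field names (`humanLabel`, `judgeLabel`, `judgeRationale`) and the `findDisagreements` helper are hypothetical.

```javascript
// Hypothetical result shape:
// { id, humanLabel, judgeLabel, judgeRationale }.
function findDisagreements(results) {
  return results
    .filter((r) => r.humanLabel !== r.judgeLabel)
    .map((r) => ({
      id: r.id,
      expected: r.humanLabel,
      got: r.judgeLabel,
      judgeRationale: r.judgeRationale,
    }));
}

// Made-up results, for illustration only.
const reviewedResults = [
  {
    id: "case-1",
    humanLabel: "PASS",
    judgeLabel: "PASS",
    judgeRationale: "Matches tone and audience.",
  },
  {
    id: "case-2",
    humanLabel: "FAIL",
    judgeLabel: "PASS",
    judgeRationale: "On theme, so it fits the brand.",
  },
];

const disagreements = findDisagreements(reviewedResults);
for (const d of disagreements) {
  console.log(`${d.id}: expected ${d.expected}, got ${d.got}`);
  console.log(`  judge said: ${d.judgeRationale}`);
}
```

Reading the rationales side by side often shows a pattern, such as the judge rewarding relevance while ignoring tone, that points to the rubric wording to revise.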
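Best practices
To help the judge score consistently, follow these best practices:
- Avoid overfitting. Keep instructions general rather than tailored to the alignment dataset. If you give very specific instructions (such as avoiding a particular phrase), the judge may pass this particular alignment test but fail to generalize to new data. This problem is called overfitting.
- Optimize the system instructions and judge prompt. Prompt optimization techniques include manually editing the prompt, asking another LLM to suggest improvements, or applying changes based on a combination of the two. These techniques range from manual to very advanced, such as algorithms that mimic biological evolution. Log your changes so that you can revert them if needed.
To see alignment in action for ThemeBuilder, run the alignment test yourself.
Stress test with bootstrapping
Reaching your 85% alignment target doesn't guarantee that the judge will perform well on real-world data. Stress test the judge with a statistical technique called bootstrapping. Bootstrapping creates new versions of your dataset without any additional labeling.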
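- Test: Randomly resample 30 items from your dataset, with replacement. In a given run, a tricky case might be picked five times, which makes the test much harder. Run the alignment test on many of these random mixes, then compute the mean alignment and the variance of the scores across runs. There's no exact number, but 10 iterations is a good baseline for a medium-sized project. Run more iterations to increase confidence.
- Fix: If the alignment score swings widely (high variance), the judge is not yet reliable. Your initial score was driven by a few easy cases. Broaden the rubric and add more diverse, tricky examples to the alignment dataset.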
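The bootstrap test can be sketched as follows. This is an illustration under stated assumptions, not the ThemeBuilder implementation: it resamples already-stored judge labels instead of calling the judge again, it uses a small seeded random generator so that runs are reproducible, and all function names are hypothetical.

```javascript
// Small seeded pseudo-random generator (mulberry32-style),
// used only so the example is reproducible.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw `size` items at random; the same item may appear several times.
function resampleWithReplacement(items, size, random) {
  const sample = [];
  for (let i = 0; i < size; i++) {
    sample.push(items[Math.floor(random() * items.length)]);
  }
  return sample;
}

function alignmentScore(results) {
  const aligned = results.filter(
    (r) => r.humanLabel === r.judgeLabel
  ).length;
  return (aligned / results.length) * 100;
}

// Hypothetical result shape: { id, humanLabel, judgeLabel }.
function bootstrapAlignment(
  results,
  { sampleSize = 30, iterations = 10, seed = 1 } = {}
) {
  const random = seededRandom(seed);
  const scores = [];
  for (let i = 0; i < iterations; i++) {
    const sample = resampleWithReplacement(results, sampleSize, random);
    scores.push(alignmentScore(sample));
  }
  const mean = scores.reduce((sum, x) => sum + x, 0) / scores.length;
  const variance =
    scores.reduce((sum, x) => sum + (x - mean) ** 2, 0) / scores.length;
  return { scores, mean, variance };
}

// Made-up data: 40 cases, 6 of which the judge gets wrong.
const perfectResults = Array.from({ length: 40 }, (_, i) => ({
  id: `case-${i}`,
  humanLabel: "FAIL",
  judgeLabel: "FAIL",
}));
const mixedResults = perfectResults.map((r, i) =>
  i < 6 ? { ...r, judgeLabel: "PASS" } : r
);

const perfectStats = bootstrapAlignment(perfectResults);
const mixedStats = bootstrapAlignment(mixedResults, { seed: 42 });
console.log(mixedStats.mean.toFixed(1), mixedStats.variance.toFixed(1));
```

Reusing stored labels is only reasonable if the judge is deterministic. Otherwise, rerun the judge on each resampled set.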
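You can try it yourself.
Test self-consistency
You can only trust a judge that always gives the same answer for the same input. With the temperature set to 0, the judge should be 100% consistent. Test to confirm.
- Test: Run the evaluation several times on exactly the same dataset, for example a random sample drawn from the alignment dataset. Compute the per-test-case variance across these repeats. Aim for 100% consistency (zero variance). If the variance is greater than zero, the test fails, because the judge gives different answers for the same input.
- Fix: Your judge prompt may be ambiguous, or the temperature may be too high. Rewrite the ambiguous parts of the prompt, especially the rubric. If you haven't already lowered the temperature to 0 (or set thinking_level to high), do that first.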
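One way to compute the per-test-case variance is to treat PASS as 1 and FAIL as 0, so that a case with pass rate p has variance p(1 - p). The sketch below assumes binary labels and a hypothetical data shape in which each run is an object that maps a test case ID to the judge's label.

```javascript
// runs: array of runs; each run maps test case id -> "PASS" | "FAIL".
// All runs are assumed to cover the same test case ids.
function selfConsistency(runs) {
  const ids = Object.keys(runs[0]);
  const perCase = ids.map((id) => {
    const passRate =
      runs.filter((run) => run[id] === "PASS").length / runs.length;
    // Variance of a 0/1 outcome: zero only when every run agrees.
    return { id, variance: passRate * (1 - passRate) };
  });
  const consistentCount = perCase.filter((c) => c.variance === 0).length;
  return {
    perCase,
    consistency: (consistentCount / ids.length) * 100,
  };
}

// Made-up labels from three repeated runs, for illustration only.
const repeatedRuns = [
  { "case-1": "PASS", "case-2": "FAIL" },
  { "case-1": "PASS", "case-2": "PASS" },
  { "case-1": "PASS", "case-2": "FAIL" },
];

const consistencyReport = selfConsistency(repeatedRuns);
console.log(consistencyReport.consistency); // 50
console.log(
  consistencyReport.perCase.filter((c) => c.variance > 0).map((c) => c.id)
); // [ 'case-2' ]
```

The cases with non-zero variance are the ones to inspect first: they usually point at the ambiguous part of the rubric.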
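To see it in action, run the test yourself.
Final exam
Bootstrapping gives you a preliminary check against overfitting. Now run a final test on fresh data to confirm that the judge evaluates new inputs correctly.
- Test: Hold out 20 human-labeled samples as a final exam dataset. These samples must not be used during alignment. Run the judge against this set.
- Fix: If the alignment score stays high, the judge is ready! If the score drops sharply, you probably over-tuned the prompt and overfit the judge to the specific alignment data. Broaden the prompt, rubric, and few-shot examples.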
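If you label all of your samples in one batch, one way to keep the final exam set untouched is to split it off before you start aligning. This is a minimal sketch: `splitHoldout` is a hypothetical helper, and the 60-sample pool is made up for illustration.

```javascript
// Shuffle a copy of the labeled samples (Fisher-Yates), then set aside
// `holdoutSize` of them for the final exam.
function splitHoldout(dataset, holdoutSize, random = Math.random) {
  if (holdoutSize > dataset.length) {
    throw new Error("Holdout is larger than the dataset.");
  }
  const shuffled = [...dataset];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return {
    finalExam: shuffled.slice(0, holdoutSize),
    alignment: shuffled.slice(holdoutSize),
  };
}

// Made-up pool of 60 labeled samples, for illustration only.
const labeledPool = Array.from({ length: 60 }, (_, i) => ({
  id: `sample-${i}`,
}));

const split = splitHoldout(labeledPool, 20);
console.log(split.finalExam.length, split.alignment.length); // 20 40
```

Store the final exam IDs somewhere stable so that later prompt iterations cannot accidentally pull those samples into the alignment set.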
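To see it in action, run the test yourself.
Summary
You ran a variety of tests to build a basic judge, including:
- The alignment test checks whether the judge is correct.
- The bootstrap test and the final exam check data sensitivity: when the judge sees new data, does it still reach the right verdict most of the time?
- The self-consistency test measures system noise: how much the LLM judge's internal randomness affects the results.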