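Set up your basic judge (part 2)

Finish setting up your basic judge so that you can start running subjective evaluations.

Align and test your judge

You have an initial judge, but you can't trust it yet. The judge is ready only when its verdicts match human judgment.

Build an alignment dataset

To calibrate your judge, you need an alignment dataset: a small, high-quality set of inputs and outputs that humans have evaluated. This dataset serves as your ground truth. You use it to confirm that the judge's logic works as expected.

The alignment dataset should contain 30 to 50 input/output pairs. That's large enough to cover some edge cases, but small enough to label in a short time.

In the ThemeBuilder example, an entry in the alignment dataset looks like this (input, output, human label):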

{
  "id": "sample-014",
  "userInput": {
    "companyName": "Rawrr!",
    "audience": "kids 5-10",
    "tone": ["prehistoric", "loud", "fun"]
  },
  "appOutput": {
    "motto": "Experiencing the prehistoric era."
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "FAIL",
      "rationale": "While on-theme, this motto is too formal for kids.
        It fails to capture the required 'loud' and 'fun' energy."
    }
  }
}

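To generate inputs and outputs, you can pull them from production logs (if available), craft the data manually, use an LLM (synthetic data), or start with a few handpicked samples and ask an LLM to expand the dataset.

Once the inputs and outputs are ready, work with your team to label each output as PASS or FAIL, using your rubric. These labels become your ground truth.

Make sure the alignment dataset includes PASS and FAIL examples of varying difficulty. For example:

  • 10 happy-path cases that the judge should label PASS
  • 20 cases that the judge should label FAIL:
    • Obvious failures, such as extremely toxic or completely off-brand mottos.
    • Subtle failures, such as a motto that is grammatically correct but slightly too formal for a playful brand, or that only partially matches the tone.

An LLM judge is a gatekeeper. If the dataset contains more failing cases than passing cases, you have more opportunities to tune the rubric to catch failures, which ultimately makes the judge better at detecting them.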
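Before labeling goes much further, it can help to check the dataset's composition automatically. The following is a minimal sketch, not part of ThemeBuilder: the helper name summarizeLabels and the toy dataset are assumptions for illustration, and only the humanEvaluation shape comes from the sample entry shown earlier.

```javascript
// Hypothetical helper: summarizes the human labels for one metric and checks
// the guidance above (30 to 50 entries, more FAIL cases than PASS cases).
function summarizeLabels(dataset, metric) {
  const counts = { PASS: 0, FAIL: 0 };
  for (const entry of dataset) {
    const label = entry.humanEvaluation?.[metric]?.label;
    if (label === "PASS" || label === "FAIL") counts[label] += 1;
  }
  return {
    total: dataset.length,
    ...counts,
    // Entries with a missing or unexpected label still need human review.
    unlabeled: dataset.length - counts.PASS - counts.FAIL,
    sizeInRange: dataset.length >= 30 && dataset.length <= 50,
    failHeavy: counts.FAIL > counts.PASS,
  };
}

// Toy data (made up): 10 PASS entries followed by 20 FAIL entries.
const dataset = Array.from({ length: 30 }, (_, i) => ({
  id: `sample-${String(i + 1).padStart(3, "0")}`,
  humanEvaluation: { mottoBrandFit: { label: i < 10 ? "PASS" : "FAIL" } },
}));

const summary = summarizeLabels(dataset, "mottoBrandFit");
console.log(summary);
```

Running the same check once per metric (for example mottoToxicity and colorBrandFit) shows whether any single metric lacks failing examples.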

Once your alignment dataset is ready, it should look something like this:

Happy-path case (PASS)

// Easy, clean input + Good output
{
  "id": "sample-001",
  "userInput": {
    "companyName": "Kinetica",
    "description": "Carbon-fiber plated performance footwear engineered for
    elite marathon runners.",
    "audience": "competitive triathletes and professional runners",
    "tone": [
      "aggressive",
      "high-performance",
      "technical"
    ]
  },
  "appOutput": {
    "motto": "Unlock your kinetic potential.",
    "colorPalette": {
      "textColor": "#FFFFFF",
      "backgroundColor": "#000000",
      "primary": "#DC2626",
      "secondary": "#E2E8F0"
    }
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "PASS",
      "rationale": "This motto powerfully aligns the brand's technical
      engineering with the ambitious goals of its elite athletic audience.
      Relevance: Uses 'kinetic' to expertly link the brand to physical
      energy. Audience appeal: 'Unlock your potential' resonates perfectly
      with competitive runners. Tone consistency: Nails the required
      aggressive, high-performance marks."
    },
    "mottoToxicity": {
      "label": "PASS",
      "rationale": "Perfectly clean and motivational. No offensive or
      exclusionary language."
    },
    "colorBrandFit": {
      "label": "PASS",
      "rationale": "The chosen color palette perfectly mirrors Kinetica's
      aggressive and technical brand identity by utilizing high-impact tones
      that resonate with elite athletes. Relevance: Psychological association:
      Blood red creates urgency and speed. Harmony: Stark contrast against
      black/white feels highly technical.
      Appropriateness: Extremely effective aesthetic for premium athletic gear."
    }
  }
}

Obvious failure (FAIL)

// Off-brand color palette
{
  "id": "sample-014",
  "userInput": {
    "companyName": "Rawrr!",
    "description": "Dinosaur themed playground and party venue.",
    "audience": "kids 5-10",
    "tone": [
      "prehistoric",
      "loud",
      "fun"
    ]
  },
  "appOutput": {
    "motto": "Experiencing the prehistoric era.",
    "colorPalette": {
      "textColor": "#4A4A4A",
      "backgroundColor": "#F5F5DC",
      "primary": "#D2B48C",
      "secondary": "#C0C0C0"
    }
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "FAIL",
      "rationale": "While the motto relates to the dinosaur theme, its overly
      academic and formal tone fails to capture the loud and fun energy
      essential for a children's playground brand. Relevance: Effectively fits
      the dinosaur theme. Audience appeal: A bit formal ('Experiencing' versus
      something punchy), acceptable for parents booking events but should be
      more exciting for kids, it's too formal and academic for a children's
      playground, lacks the 'loud' and 'fun' energy requested in the tone.
      Tone consistency: It touches on the 'prehistoric' element adequately."
    },
    "mottoToxicity": {
      "label": "PASS",
      "rationale": "A completely family-friendly, educational-sounding statement."
    },
    "colorBrandFit": {
      "label": "FAIL",
      "rationale": "This muted and sophisticated color scheme fails to capture
      the high-energy, prehistoric spirit required to attract and excite a young
      audience. Relevance: Psychological association: The 'sad beige', tan, and
      muted greys evoke a sterile, 'adult minimalist' home décor aesthetic.
      Harmony: The colors are muddy and lifeless. Appropriateness: For a 'loud'
      and 'fun' children's playground targeting 5-10 year olds, this palette is
      a spectacular failure. It desperately needs vibrant, exciting primary
      colors to attract kids."
    }
  }
},

Subtle failure (FAIL)

// Almost on-brand color palette
{
  "id": "sample-023",
  "userInput": {
    "companyName": "Apex Dental",
    "description": "High-end cosmetic dentistry specializing in porcelain
        veneers and laser whitening.",
    "audience": "Professionals seeking a perfect smile",
    "tone": [
      "clean",
      "professional",
      "bright"
    ]
  },
  "appOutput": {
    "motto": "Designing your brightest smile.",
    "colorPalette": {
      "textColor": "#1A202C",
      "backgroundColor": "#FFFFFF",
      "primary": "#FFC107",
      "secondary": "#E2E8F0"
    }
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "PASS",
      "rationale": "The motto perfectly captures the premium essence of the
      brand by combining high-end dental aesthetics with a clear appeal to a
      professional clientele. Relevance: Relates perfectly to cosmetic
      dentistry and teeth whitening. Audience appeal: 'Brightest smile' is a
      highly effective, aspirational hook for professionals wanting to look
      their best. Tone consistency: Clean, upbeat, and exceedingly professional."
    },
    "mottoToxicity": {
      "label": "PASS",
      "rationale": "A very positive, medical-grade, and safe statement."
    },
    "colorBrandFit": {
      "label": "FAIL",
      "rationale": "The choice of bright yellow is a fundamental branding
      failure for a cosmetic dental practice as it creates a direct and
      repellent visual link to tooth discoloration, undermining the clinic's
      high-end whitening positioning. Relevance: Psychological association:
      While yellow technically fulfills the word 'bright', in the specific
      context of dentistry, a primary bright yellow is subconsciously and
      intensely associated with plaque, decay, and stained teeth.
      Harmony: It stands out strongly but sends the wrong message.
      Appropriateness: This is a massive psychological misstep for a whitening
      clinic. It subverts trust in their core service by visually reminding
      customers of the problem rather than the solution."
    }
  }
},

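Reach agreement

Once your ground truth data is ready, you can align the judge with the human labels. Your goal is for the judge to consistently agree with your verdicts and mimic human judgment. To measure this, compute an alignment score: the percentage of judge-created labels that match the human-created labels.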

// total = all test cases
// aligned = test cases where humanEval.label === llmJudgeEval.label
// For example, PASS and PASS
const alignment = (aligned / total) * 100;

Set a target alignment score, such as 85%. The right target depends on your use case.
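As a rough sketch, the score and the target check could be wrapped in a small function. The function name alignmentScore, the TARGET constant, and the toy results are assumptions for illustration; the humanEval and llmJudgeEval field names follow the comments in the snippet above.

```javascript
// Percentage of test cases where the judge's label matches the human label.
function alignmentScore(results) {
  if (results.length === 0) {
    throw new Error("No test cases to score.");
  }
  const aligned = results.filter(
    (r) => r.humanEval.label === r.llmJudgeEval.label
  ).length;
  return (aligned / results.length) * 100;
}

// Example target from the text; adjust it to your use case.
const TARGET = 85;

// Toy results (made up): the judge agrees with the human on 7 of 8 cases.
const results = Array.from({ length: 8 }, (_, i) => ({
  id: `case-${i + 1}`,
  humanEval: { label: "PASS" },
  llmJudgeEval: { label: i < 7 ? "PASS" : "FAIL" },
}));

const score = alignmentScore(results);
const meetsTarget = score >= TARGET;
console.log(`Alignment: ${score}% (target ${TARGET}%), ready: ${meetsTarget}`);
```

If you track several metrics, computing one score per metric usually says more than a single combined number, because a judge can be well aligned on toxicity and poorly aligned on brand fit.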

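Run the judge against the alignment dataset. If the alignment score is below your target, read the judge's rationales to understand what confused it. Revise the system instruction and the judge prompt to close the gap. Repeat until you reach your target score.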
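To make that review step concrete, here is a minimal sketch that pulls out only the cases where the judge and the human disagree, along with the judge's rationale. The function name findDisagreements, the result shape, and the sample data are assumptions for illustration.

```javascript
// Collects disagreements and splits them by direction. For a gatekeeper,
// a false pass (judge says PASS, human says FAIL) is usually the costlier error.
function findDisagreements(results) {
  const falsePasses = [];
  const falseFails = [];
  for (const r of results) {
    if (r.humanEval.label === r.llmJudgeEval.label) continue;
    const item = {
      id: r.id,
      human: r.humanEval.label,
      judge: r.llmJudgeEval.label,
      judgeRationale: r.llmJudgeEval.rationale,
    };
    if (r.llmJudgeEval.label === "PASS") {
      falsePasses.push(item);
    } else {
      falseFails.push(item);
    }
  }
  return { falsePasses, falseFails };
}

// Toy judge results (made up) for three cases.
const judged = [
  {
    id: "case-a",
    humanEval: { label: "PASS" },
    llmJudgeEval: { label: "PASS", rationale: "Matches the requested tone." },
  },
  {
    id: "case-b",
    humanEval: { label: "FAIL" },
    llmJudgeEval: { label: "PASS", rationale: "On-theme, so acceptable." },
  },
  {
    id: "case-c",
    humanEval: { label: "PASS" },
    llmJudgeEval: { label: "FAIL", rationale: "Too short to judge." },
  },
];

const review = findDisagreements(judged);
console.log(JSON.stringify(review, null, 2));
```

Reading the rationales side by side often reveals a pattern, such as the judge treating "on-theme" as sufficient for brand fit, which points to the part of the rubric to rewrite.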

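Best practices

To help your judge score consistently, follow these best practices:

  • Avoid overfitting. Keep instructions general rather than tailored to the alignment dataset. If you give narrow instructions (such as avoiding a specific phrase), the judge may pass this particular alignment test but fail to generalize to new data. This problem is called overfitting.
  • Optimize the system instruction and the judge prompt. Prompt optimization techniques include editing the prompt manually, asking another LLM to suggest improvements, or applying changes based on a combination of these techniques. They range from manual to very advanced, such as algorithms that mimic biological evolution. Log your changes so that you can revert them if needed.

To see alignment in action for ThemeBuilder, run the alignment test yourself.

Example alignment test.

Stress test with bootstrapping

Reaching the 85% alignment target doesn't guarantee that the judge performs well on real data. Stress test your judge with a statistical technique called bootstrapping. Bootstrapping creates new versions of the dataset without any additional labeling.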

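  • Test: Randomly resample 30 items from the dataset with replacement. In one run, a tricky case might be picked five times, which makes the test much harder. Run the alignment test on these random combinations several times, then compute the mean alignment score and the variance of the scores across runs. There's no exact number, but 10 iterations is a good baseline for a medium-sized project. Run more iterations to increase your confidence.
  • Fix: If the alignment score swings widely (high variance), the judge isn't reliable yet. Your initial score was carried by a few easy cases. Broaden the rubric and add more diverse and tricky examples to the alignment dataset.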
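The resampling procedure can be sketched in a few lines. This is an illustrative sketch, not the ThemeBuilder test itself: the function names, the option names, and the toy results are assumptions, and the variance shown is the population variance of the per-run scores. The sketch resamples existing judge results instead of calling the judge again, which works when the judge has already labeled every item.

```javascript
// Percentage of cases where the judge's label matches the human label.
function alignmentScore(results) {
  const aligned = results.filter(
    (r) => r.humanEval.label === r.llmJudgeEval.label
  ).length;
  return (aligned / results.length) * 100;
}

// Draws `size` items at random, with replacement, so duplicates are expected.
function resampleWithReplacement(items, size, random = Math.random) {
  return Array.from(
    { length: size },
    () => items[Math.floor(random() * items.length)]
  );
}

// Repeats the alignment test on resampled datasets and summarizes the spread.
function bootstrapAlignment(
  results,
  { iterations = 10, sampleSize = 30, random = Math.random } = {}
) {
  const scores = [];
  for (let i = 0; i < iterations; i++) {
    const sample = resampleWithReplacement(results, sampleSize, random);
    scores.push(alignmentScore(sample));
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length;
  return { scores, mean, variance, stdDev: Math.sqrt(variance) };
}

// Toy results (made up): the judge agrees with the human on 26 of 30 cases.
const judgedCases = Array.from({ length: 30 }, (_, i) => ({
  id: `case-${i + 1}`,
  humanEval: { label: "FAIL" },
  llmJudgeEval: { label: i < 26 ? "FAIL" : "PASS" },
}));

const report = bootstrapAlignment(judgedCases, { iterations: 10 });
console.log(report.mean.toFixed(1), report.stdDev.toFixed(1));
```

The output varies from run to run because the samples are random. A small standard deviation relative to the mean suggests that the score doesn't hinge on a handful of easy cases.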

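Visualization of a bootstrap test. Because objects are subsampled with replacement, some classes may be overrepresented (yellow marbles in bootstrap samples 1 and 2), others may be underrepresented (red marbles in bootstrap samples 1 and 2), and some may even be missing (green marbles in bootstrap sample 3). See the original scientific figure on ResearchGate.

You can try it yourself.

Example bootstrap test.

Test self-consistency

You can only trust a judge that always gives the same answer for the same input. With the temperature set to 0, the judge should be 100% consistent. Test to confirm.

  • Test: Run the evaluation several times on exactly the same data, for example a random sample from the alignment dataset. Compute the per-test-case variance across these repeated runs. Aim for 100% consistency (zero variance). If the variance is greater than zero, the test fails, because the judge gives different answers for the same input.
  • Fix: Your judge prompt may be ambiguous, or the temperature may be too high. Rewrite the vague parts of the prompt, especially the rubric. If you haven't already, lower the temperature to 0 (or set thinking_level to high).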
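Because the labels are categorical (PASS or FAIL), one simple way to approximate the per-test-case variance check is to count the distinct labels each test case received across runs: a case is consistent only when that count is one. The following sketch takes that approach. The function name selfConsistency and the run format (an object that maps each test case ID to a label) are assumptions for illustration.

```javascript
// runs: one object per repeated run, mapping test case ID to the judge's label.
function selfConsistency(runs) {
  const ids = Object.keys(runs[0]);
  // A case is inconsistent if the judge produced more than one distinct label.
  const inconsistent = ids.filter(
    (id) => new Set(runs.map((run) => run[id])).size > 1
  );
  return {
    consistency: ((ids.length - inconsistent.length) / ids.length) * 100,
    inconsistent,
  };
}

// Toy data (made up): three runs over four cases; "case-3" flips in run 2.
const runs = [
  { "case-1": "PASS", "case-2": "FAIL", "case-3": "FAIL", "case-4": "PASS" },
  { "case-1": "PASS", "case-2": "FAIL", "case-3": "PASS", "case-4": "PASS" },
  { "case-1": "PASS", "case-2": "FAIL", "case-3": "FAIL", "case-4": "PASS" },
];

const consistencyReport = selfConsistency(runs);
console.log(consistencyReport);
```

The list of inconsistent IDs tells you which inputs to inspect first when you look for ambiguous wording in the rubric.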

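To see it in action, run the test yourself.

Terminal output of a judge consistency test.
In this example, we tested 6 samples for each of three metrics (motto toxicity, motto brand fit, and color brand fit). The results are almost fully stable, but a few samples are inconsistent.

Final exam

Bootstrapping gives you a first check against overfitting. Now, run a final test with new data to confirm that the judge evaluates unseen inputs correctly.

  • Test: Set aside 20 human-labeled samples as a final exam dataset. These samples must not be used during alignment. Run the judge against this set.
  • Fix: If the alignment score stays high, your judge is ready! If the score drops significantly, you probably over-tuned the prompt and overfit it to the specific alignment data. Broaden your prompt, rubric, and few-shot examples.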
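A minimal sketch of the final exam comparison follows. The function name finalExam, the maxDrop threshold of 10 percentage points, and the toy data are assumptions for illustration. The text doesn't define what counts as a significant drop, so choose a threshold that fits your project.

```javascript
// Percentage of cases where the judge's label matches the human label.
function alignmentScore(results) {
  const aligned = results.filter(
    (r) => r.humanEval.label === r.llmJudgeEval.label
  ).length;
  return (aligned / results.length) * 100;
}

// Compares alignment on the tuning data with alignment on held-out data.
// maxDrop is a hypothetical threshold, in percentage points.
function finalExam(alignmentResults, holdoutResults, maxDrop = 10) {
  // Held-out samples must never have been used during alignment.
  const seen = new Set(alignmentResults.map((r) => r.id));
  const leaked = holdoutResults.filter((r) => seen.has(r.id)).map((r) => r.id);
  if (leaked.length > 0) {
    throw new Error(`Held-out samples used in alignment: ${leaked.join(", ")}`);
  }
  const tuned = alignmentScore(alignmentResults);
  const heldOut = alignmentScore(holdoutResults);
  const drop = tuned - heldOut;
  return { tuned, heldOut, drop, likelyOverfit: drop > maxDrop };
}

// Builds toy judge results (made up) with a given number of agreements.
function makeResults(prefix, total, agreed) {
  return Array.from({ length: total }, (_, i) => ({
    id: `${prefix}-${i + 1}`,
    humanEval: { label: "FAIL" },
    llmJudgeEval: { label: i < agreed ? "FAIL" : "PASS" },
  }));
}

const exam = finalExam(
  makeResults("align", 8, 7), // 87.5% on the alignment dataset
  makeResults("final", 4, 2) // 50% on the held-out final exam dataset
);
console.log(exam);
```

In this made-up example the score falls from 87.5% to 50%, which the sketch flags as likely overfitting.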

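To see it in action, run the test yourself.

Summary

You ran a range of tests to build a basic judge:

  • The alignment test checks whether the judge is correct.
  • The bootstrap test and the final exam check sensitivity to data. Is the judge still right most of the time when it sees new data?
  • The self-consistency test measures system noise: how much the LLM judge's internal randomness affects the results.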