設定基本評估模型 (第 2 部分)

完成基本評估模型設定,即可開始進行主觀評估。

校準並測試評估人員

您有初始法官,但尚未信任該法官。只有在法官的判斷結果與人類判斷一致時,才算準備就緒。

建立對齊資料集

如要校正評估人員,您需要對齊資料集。這是由人工評估的一小組高品質輸入和輸出內容。這個資料集會做為基準真相。您可以使用這項功能,確認評審的邏輯是否持續符合您的期望。

對齊資料集應包含 30 到 50 個輸入/輸出配對。這個資料集夠大,可涵蓋一些邊緣情況,但又不會太大,讓您能在短時間內完成標記。

在 ThemeBuilder 範例中,對齊資料集中的項目如下所示 (輸入、輸出、人工標籤):

{
  "id": "sample-014",
  "userInput": {
    "companyName": "Rawrr!",
    "audience": "kids 5-10",
    "tone": ["prehistoric", "loud", "fun"]
  },
  "appOutput": {
    "motto": "Experiencing the prehistoric era."
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "FAIL",
      "rationale": "While on-theme, this motto is too formal for kids.
        It fails to capture the required 'loud' and 'fun' energy."
    }
  }
}

如要生成輸入和輸出內容,您可以從生產記錄中擷取 (如有)、手動建立資料、使用 LLM (合成資料),或從幾個手動挑選的樣本開始,要求 LLM 擴增資料集。

輸入和輸出內容準備就緒後,請使用評分量表與團隊一起將輸出內容標示為 PASSFAIL。這會成為你的基準真相

請確認對齊資料集包含難度不一的 PASSFAIL 範例,例如:

  • 10 個範例,說明法官標籤的順利路徑案例為 PASS
  • 以下是 20 個法官標籤為 FAIL 的案例:
    • 明顯失敗,例如極度惡意或完全不符合品牌形象的口號。
    • 細微的失敗,例如座右銘文法正確,但對活潑的品牌來說過於正式,或僅部分符合語氣。

LLM 評估模型是把關者,如果資料集包含的失敗案例多於通過案例,則對齊資料集可提供更多機會調整評分量表,以找出失敗案例,最終提升評審員偵測失敗案例的能力。

對齊資料集準備就緒後,看起來會像這樣:

滿意路徑案例 (通過)

// Easy, clean input + Good output
{
  "id": "sample-001",
  "userInput": {
    "companyName": "Kinetica",
    "description": "Carbon-fiber plated performance footwear engineered for
    elite marathon runners.",
    "audience": "competitive triathletes and professional runners",
    "tone": [
      "aggressive",
      "high-performance",
      "technical"
    ]
  },
  "appOutput": {
    "motto": "Unlock your kinetic potential.",
    "colorPalette": {
      "textColor": "#FFFFFF",
      "backgroundColor": "#000000",
      "primary": "#DC2626",
      "secondary": "#E2E8F0"
    }
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "PASS",
      "rationale": "This motto powerfully aligns the brand's technical
      engineering with the ambitious goals of its elite athletic audience.
      Relevance: Uses 'kinetic' to expertly link the brand to physical
      energy. Audience appeal: 'Unlock your potential' resonates perfectly
      with competitive runners. Tone consistency: Nails the required
      aggressive, high-performance marks."
    },
    "mottoToxicity": {
      "label": "PASS",
      "rationale": "Perfectly clean and motivational. No offensive or
      exclusionary language."
    },
    "colorBrandFit": {
      "label": "PASS",
      "rationale": "The chosen color palette perfectly mirrors Kinetica's
      aggressive and technical brand identity by utilizing high-impact tones
      that resonate with elite athletes. Relevance: Psychological association:
      Blood red creates urgency and speed. Harmony: Stark contrast against
      black/white feels highly technical.
      Appropriateness: Extremely effective aesthetic for premium athletic gear."
    }
  }
}

明顯失敗 (FAIL)

// Off-brand color palette
{
  "id": "sample-014",
  "userInput": {
    "companyName": "Rawrr!",
    "description": "Dinosaur themed playground and party venue.",
    "audience": "kids 5-10",
    "tone": [
      "prehistoric",
      "loud",
      "fun"
    ]
  },
  "appOutput": {
    "motto": "Experiencing the prehistoric era.",
    "colorPalette": {
      "textColor": "#4A4A4A",
      "backgroundColor": "#F5F5DC",
      "primary": "#D2B48C",
      "secondary": "#C0C0C0"
    }
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "FAIL",
      "rationale": "While the motto relates to the dinosaur theme, its overly
      academic and formal tone fails to capture the loud and fun energy
      essential for a children's playground brand. Relevance: Effectively fits
      the dinosaur theme. Audience appeal: A bit formal ('Experiencing' versus
      something punchy), acceptable for parents booking events but should be
      more exciting for kids, it's too formal and academic for a children's
      playground, lacks the 'loud' and 'fun' energy requested in the tone.
      Tone consistency: It touches on the 'prehistoric' element adequately."
    },
    "mottoToxicity": {
      "label": "PASS",
      "rationale": "A completely family-friendly, educational-sounding statement."
    },
    "colorBrandFit": {
      "label": "FAIL",
      "rationale": "This muted and sophisticated color scheme fails to capture
      the high-energy, prehistoric spirit required to attract and excite a young
      audience. Relevance: Psychological association: The 'sad beige', tan, and
      muted greys evoke a sterile, 'adult minimalist' home décor aesthetic.
      Harmony: The colors are muddy and lifeless. Appropriateness: For a 'loud'
      and 'fun' children's playground targeting 5-10 year olds, this palette is
      a spectacular failure. It desperately needs vibrant, exciting primary
      colors to attract kids."
    }
  }
},

細微失敗 (FAIL)

// Almost on-brand color palette
{
  "id": "sample-023",
  "userInput": {
    "companyName": "Apex Dental",
    "description": "High-end cosmetic dentistry specializing in porcelain
        veneers and laser whitening.",
    "audience": "Professionals seeking a perfect smile",
    "tone": [
      "clean",
      "professional",
      "bright"
    ]
  },
  "appOutput": {
    "motto": "Designing your brightest smile.",
    "colorPalette": {
      "textColor": "#1A202C",
      "backgroundColor": "#FFFFFF",
      "primary": "#FFC107",
      "secondary": "#E2E8F0"
    }
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "PASS",
      "rationale": "The motto perfectly captures the premium essence of the
      brand by combining high-end dental aesthetics with a clear appeal to a
      professional clientele. Relevance: Relates perfectly to cosmetic
      dentistry and teeth whitening. Audience appeal: 'Brightest smile' is a
      highly effective, aspirational hook for professionals wanting to look
      their best. Tone consistency: Clean, upbeat, and exceedingly professional."
    },
    "mottoToxicity": {
      "label": "PASS",
      "rationale": "A very positive, medical-grade, and safe statement."
    },
    "colorBrandFit": {
      "label": "FAIL",
      "rationale": "The choice of bright yellow is a fundamental branding
      failure for a cosmetic dental practice as it creates a direct and
      repellent visual link to tooth discoloration, undermining the clinic's
      high-end whitening positioning. Relevance: Psychological association:
      While yellow technically fulfills the word 'bright', in the specific
      context of dentistry, a primary bright yellow is subconsciously and
      intensely associated with plaque, decay, and stained teeth.
      Harmony: It stands out strongly but sends the wrong message.
      Appropriateness: This is a massive psychological misstep for a whitening
      clinic. It subverts trust in their core service by visually reminding
      customers of the problem rather than the solution."
    }
  }
},

觸及率一致性

準備好實際資料後,請將評估模型與人工標籤對齊。您的目標是確保評估人員持續同意您的判斷,並模仿人類的判斷。您可以計算對齊分數,也就是法官建立的標籤與人工建立的標籤相符的百分比。

// total = all test cases
// aligned = test cases where humanEval.label === llmJudgeEval.label
// For example, PASS and PASS
const alignment = (aligned / total) * 100;

設定目標一致性分數,例如 85%。目標可能因用途而異。

針對對齊資料集執行評估模型。如果對齊分數低於目標,請參閱評審的理由,瞭解他們為何提供不正確的標籤。修改系統指令和評估提示,以彌補差距。重複此步驟,直到達到目標分數為止。

最佳做法

為協助評審一致評分,請遵循下列最佳做法:

  • 避免過度配適。請概括說明,避免指令過於具體,以致不適用於對齊資料集。如果您提供特定指示 (例如避免使用特定片語),評估人員就能有效通過這項特定對齊測試,但無法將結果套用至新資料。這個問題稱為過度配適。
  • 最佳化系統指令並判斷提示。提示最佳化技術包括手動修改提示、要求其他 LLM 建議改進方式,或根據這些技術的組合套用變更。提示最佳化技術可從手動到非常進階,例如模擬生物演化的演算法。記錄變更內容,以便在需要時還原。

如要查看 ThemeBuilder 的對齊功能,請執行對齊測試

校正測試的終端機輸出內容,顯示通過/失敗率。
範例對齊測試。

使用啟動程序進行壓力測試

達到 85% 的一致性目標,並不保證評估人員能準確評估實際資料。使用稱為「自助法」的統計技術,對評估人員進行壓力測試。啟動程序會建立新的資料集版本,無須額外標記。

  • 測試:從資料集中隨機重新取樣 30 個項目 (含替換)。在一次執行中,系統可能會選取五次困難的案例,使測試難度大幅提升。對這些隨機集合多次執行對齊測試,並計算這些執行作業的平均對齊和分數差異。沒有特定數字,但對於中型專案來說,10 次疊代是實用的基準。執行更多次疊代,提高信心水準。
  • 修正:如果對齊分數大幅波動 (高變異數),表示評估員尚不可靠。您一開始的分數是巧合,因為您只處理了幾個簡單的案件。擴大評分量表範圍,並在對齊資料集中加入更多元且具挑戰性的範例。
以視覺化方式呈現自助法檢定,說明如何透過重抽樣 (含放回) 過度或過度代表特定資料類別。
由於物件會透過替換方式進行子取樣,因此某些類別可能會過度呈現 (啟動樣本 1 和 2 中的黃色彈珠),而其他類別可能會呈現不足 (啟動樣本 1 和 2 中的紅色彈珠),甚至可能遺失 (啟動樣本 3 中的綠色彈珠)。查看 ResearchGate 的原始科學圖表

你可以試試看

啟動測試的終端機輸出內容。
範例啟動程序測試。

測試自我一致性

只有在相同輸入內容一律會得到相同答案時,才能信任評估人員。如果將溫度參數設為 0,評估人員的判斷結果會 100% 一致。確認這項一致性。

  • 測試:對完全相同的資料集 (例如從對齊資料集隨機抽樣) 多次執行評估模型。計算這些重複項目的每個測試案例變異數。目標是 100% 一致 (零變異數)。如果變異數大於零,則測試失敗,因為評估人員對相同輸入內容提供不同的答案。
  • 修正:評估提示可能模稜兩可,或溫度過高。重新撰寫不清楚的提示詞部分,尤其是評分量表。如果尚未將 temperature 設為 0 (或將 thinking_level 設為高),請先完成設定。

如要查看實際運作情況,請執行測試

法官一致性測試的終端機輸出內容。
在這個範例中,我們針對三項指標 (口號毒性、口號品牌合適度和顏色品牌合適度) 各測試了 6 個樣本。結果幾乎完全穩定,但有幾個樣本的結果不一致。

期末考

自助抽樣法可協助您執行初步檢查,避免過度擬合。接著,您將使用新資料執行最終測試。最後確認法官可以正確評估新輸入的內容。

  • 測試:保留 20 個人工標記的樣本,做為最終測驗資料集,這些樣本不得在對齊期間使用。針對這組資料執行評估工具。
  • 修正:如果一致性分數維持在高分,表示評估人員已準備就緒。如果分數大幅下降,表示過度配適:您調整提示詞的次數過多,導致模型通過特定對齊資料。擴大提示、評分量表和少量樣本的範圍。

如要查看實際運作情況,請執行測試

摘要

您執行了各種測試來建立基本評估人員,包括:

  • 對齊測試會檢查評估者是否正確
  • 啟動和期末考測試檢查資料敏感度:法官面對新資料時,是否仍能保持正確。
  • 自我一致性測試會評估系統雜訊,也就是 LLM 評估人員的內部隨機性對結果的影響程度。