Google은 AI 기술을 사용하여 콘텐츠를 사용자의 기본 언어로 번역합니다. AI 번역에는 오류가 있을 수 있습니다.

기본 평가 모델 설정 (2부)

주관적 평가를 실행하려면 기본 평가 모델 설정을 완료하세요.

Maud Nalpas

심판 정렬 및 테스트

초기 심판이 있지만 아직 신뢰할 수 없습니다. 심사자는 사람의 판단에 일관되게 동의할 때만 준비된 것으로 간주됩니다.

정렬 데이터 세트 만들기

심사위원을 보정하려면 정렬 데이터 세트가 필요합니다. 이는 사람이 수동으로 평가한 입력 및 출력의 작고 고품질 컬렉션입니다. 이 데이터 세트는 그라운드 트루스 역할을 합니다. 이를 사용하여 심사위원의 논리가 내 기대치와 일관되게 일치하는지 확인합니다.

정렬 데이터 세트에는 30~50개의 입력-출력 쌍이 포함되어야 합니다. 이 세트는 일부 특이 사례를 포함할 만큼 크지만 단기간에 라벨을 지정할 수 있을 만큼 작습니다.

ThemeBuilder 예시에서 정렬 데이터 세트의 항목은 다음과 같습니다(입력, 출력, 사람 라벨).

{
  "id": "sample-014",
  "userInput": {
    "companyName": "Rawrr!",
    "audience": "kids 5-10",
    "tone": ["prehistoric", "loud", "fun"]
  },
  "appOutput": {
    "motto": "Experiencing the prehistoric era."
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "FAIL",
      "rationale": "While on-theme, this motto is too formal for kids.
        It fails to capture the required 'loud' and 'fun' energy."
    }
  }
}

입력과 출력을 생성하려면 프로덕션 로그에서 추출하거나(사용 가능한 경우) 데이터를 수동으로 만들거나 LLM(합성 데이터)을 사용하거나 직접 선택한 샘플 몇 개로 시작하여 LLM에 데이터 세트 증강을 요청하면 됩니다.

입력과 출력이 준비되면 평가 기준을 사용하여 팀과 함께 출력을 PASS 또는 FAIL로 라벨링합니다. 이것이 그라운드 트루스가 됩니다.

정렬 데이터 세트에는 다양한 난이도의 PASS 예시와 FAIL 예시가 모두 포함되어야 합니다. 예를 들면 다음과 같습니다.

심사위원이 PASS로 라벨을 지정한 해피 패스 사례 10개
심사위원이 FAIL로 라벨을 지정한 20가지 사례:
- 명백한 실패(예: 매우 유해하거나 완전히 브랜드에 맞지 않는 모토)
- 미묘한 실패: 예를 들어 문법적으로는 완벽하지만 장난기 있는 브랜드에 비해 너무 공식적이거나 분위기에 부분적으로만 맞는 슬로건

LLM 심사자는 문지기입니다. 통과 사례보다 실패 사례가 더 많이 포함된 데이터 세트에 맞추면 실패를 포착하도록 기준표를 조정할 기회가 더 많아지고 궁극적으로 심사위원의 실패 감지 능력이 향상됩니다.

정렬 데이터 세트가 준비되면 다음과 같이 표시됩니다.

해피 패스 케이스 (통과)

// Easy, clean input + Good output
{
  "id": "sample-001",
  "userInput": {
    "companyName": "Kinetica",
    "description": "Carbon-fiber plated performance footwear engineered for
    elite marathon runners.",
    "audience": "competitive triathletes and professional runners",
    "tone": [
      "aggressive",
      "high-performance",
      "technical"
    ]
  },
  "appOutput": {
    "motto": "Unlock your kinetic potential.",
    "colorPalette": {
      "textColor": "#FFFFFF",
      "backgroundColor": "#000000",
      "primary": "#DC2626",
      "secondary": "#E2E8F0"
    }
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "PASS",
      "rationale": "This motto powerfully aligns the brand's technical
      engineering with the ambitious goals of its elite athletic audience.
      Relevance: Uses 'kinetic' to expertly link the brand to physical
      energy. Audience appeal: 'Unlock your potential' resonates perfectly
      with competitive runners. Tone consistency: Nails the required
      aggressive, high-performance marks."
    },
    "mottoToxicity": {
      "label": "PASS",
      "rationale": "Perfectly clean and motivational. No offensive or
      exclusionary language."
    },
    "colorBrandFit": {
      "label": "PASS",
      "rationale": "The chosen color palette perfectly mirrors Kinetica's
      aggressive and technical brand identity by utilizing high-impact tones
      that resonate with elite athletes. Relevance: Psychological association:
      Blood red creates urgency and speed. Harmony: Stark contrast against
      black/white feels highly technical.
      Appropriateness: Extremely effective aesthetic for premium athletic gear."
    }
  }
}

명백한 실패 (FAIL)

// Off-brand color palette
{
  "id": "sample-014",
  "userInput": {
    "companyName": "Rawrr!",
    "description": "Dinosaur themed playground and party venue.",
    "audience": "kids 5-10",
    "tone": [
      "prehistoric",
      "loud",
      "fun"
    ]
  },
  "appOutput": {
    "motto": "Experiencing the prehistoric era.",
    "colorPalette": {
      "textColor": "#4A4A4A",
      "backgroundColor": "#F5F5DC",
      "primary": "#D2B48C",
      "secondary": "#C0C0C0"
    }
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "FAIL",
      "rationale": "While the motto relates to the dinosaur theme, its overly
      academic and formal tone fails to capture the loud and fun energy
      essential for a children's playground brand. Relevance: Effectively fits
      the dinosaur theme. Audience appeal: A bit formal ('Experiencing' versus
      something punchy), acceptable for parents booking events but should be
      more exciting for kids, it's too formal and academic for a children's
      playground, lacks the 'loud' and 'fun' energy requested in the tone.
      Tone consistency: It touches on the 'prehistoric' element adequately."
    },
    "mottoToxicity": {
      "label": "PASS",
      "rationale": "A completely family-friendly, educational-sounding statement."
    },
    "colorBrandFit": {
      "label": "FAIL",
      "rationale": "This muted and sophisticated color scheme fails to capture
      the high-energy, prehistoric spirit required to attract and excite a young
      audience. Relevance: Psychological association: The 'sad beige', tan, and
      muted greys evoke a sterile, 'adult minimalist' home décor aesthetic.
      Harmony: The colors are muddy and lifeless. Appropriateness: For a 'loud'
      and 'fun' children's playground targeting 5-10 year olds, this palette is
      a spectacular failure. It desperately needs vibrant, exciting primary
      colors to attract kids."
    }
  }
},

미묘한 실패 (FAIL)

// Almost on-brand color palette
{
  "id": "sample-023",
  "userInput": {
    "companyName": "Apex Dental",
    "description": "High-end cosmetic dentistry specializing in porcelain
        veneers and laser whitening.",
    "audience": "Professionals seeking a perfect smile",
    "tone": [
      "clean",
      "professional",
      "bright"
    ]
  },
  "appOutput": {
    "motto": "Designing your brightest smile.",
    "colorPalette": {
      "textColor": "#1A202C",
      "backgroundColor": "#FFFFFF",
      "primary": "#FFC107",
      "secondary": "#E2E8F0"
    }
  },
  "humanEvaluation": {
    "mottoBrandFit": {
      "label": "PASS",
      "rationale": "The motto perfectly captures the premium essence of the
      brand by combining high-end dental aesthetics with a clear appeal to a
      professional clientele. Relevance: Relates perfectly to cosmetic
      dentistry and teeth whitening. Audience appeal: 'Brightest smile' is a
      highly effective, aspirational hook for professionals wanting to look
      their best. Tone consistency: Clean, upbeat, and exceedingly professional."
    },
    "mottoToxicity": {
      "label": "PASS",
      "rationale": "A very positive, medical-grade, and safe statement."
    },
    "colorBrandFit": {
      "label": "FAIL",
      "rationale": "The choice of bright yellow is a fundamental branding
      failure for a cosmetic dental practice as it creates a direct and
      repellent visual link to tooth discoloration, undermining the clinic's
      high-end whitening positioning. Relevance: Psychological association:
      While yellow technically fulfills the word 'bright', in the specific
      context of dentistry, a primary bright yellow is subconsciously and
      intensely associated with plaque, decay, and stained teeth.
      Harmony: It stands out strongly but sends the wrong message.
      Appropriateness: This is a massive psychological misstep for a whitening
      clinic. It subverts trust in their core service by visually reminding
      customers of the problem rather than the solution."
    }
  }
},

도달범위 정렬

정답이 준비되면 심사위원을 사람 라벨과 정렬합니다. 목표는 평가자가 일관되게 동의하고 인간의 판단을 모방하도록 하는 것입니다. 심사위원이 만든 라벨이 사람이 만든 라벨과 일치하는 비율로 정렬 점수를 계산할 수 있습니다.

// total = all test cases
// aligned = test cases where humanEval.label === llmJudgeEval.label
// For example, PASS and PASS
const alignment = (aligned / total) * 100;

타겟 정렬 점수를 설정합니다(예: 85%). 사용 사례에 따라 타겟이 달라질 수 있습니다.

정렬 데이터 세트에 대해 심사 모델을 실행합니다. 정렬 점수가 타겟보다 낮으면 심사위원의 근거를 읽고 잘못된 라벨이 제공된 이유를 파악하세요. 시스템 요청 사항과 심사관 프롬프트를 수정하여 격차를 해소합니다. 목표 점수에 도달할 때까지 이 과정을 반복합니다.

권장사항

심사위원이 일관되게 점수를 매길 수 있도록 다음 권장사항을 따르세요.

과적합 방지 일반적인 지침을 사용하고 정렬 데이터 세트에 너무 구체적인 지침을 사용하지 마세요. 특정 문구를 피하는 것과 같은 구체적인 지침을 제공하면 심사자는 이 특정 정렬 테스트를 효과적으로 통과하지만 새 데이터로 일반화하지는 못합니다. 이 문제를 과적합이라고 합니다.
시스템 요청 사항과 판단 프롬프트를 최적화합니다. 프롬프트 최적화 기법에는 프롬프트를 수동으로 수정하거나, 다른 LLM에 개선사항을 제안해 달라고 요청하거나, 이러한 기법의 조합에 따라 변경사항을 적용하는 방법이 있습니다. 프롬프트 최적화 기법은 수동에서 매우 고급까지 다양합니다(예: 생물학적 진화를 모방하는 알고리즘). 필요한 경우 변경사항을 되돌릴 수 있도록 변경사항을 기록해 둡니다.

ThemeBuilder에서 정렬이 실제로 작동하는지 확인하려면 정렬 테스트를 실행하세요.

정렬 테스트의 터미널 출력으로, 통과/실패 비율을 보여줍니다. — 조정 상황 테스트 예시

부트스트래핑으로 스트레스 테스트

85% 정렬 타겟에 도달한다고 해서 심사위원이 실제 데이터에서 우수한 성능을 발휘하는 것은 아닙니다. 부트스트래핑이라는 통계 기법으로 심사위원을 스트레스 테스트합니다. 부트스트래핑을 사용하면 추가 라벨링 작업 없이 데이터 세트의 새 버전을 만들 수 있습니다.

테스트: 데이터 세트에서 복원 추출을 사용하여 30개의 항목을 무작위로 재샘플링합니다. 한 번의 실행에서 어려운 케이스가 5번 선택되어 테스트가 훨씬 더 어려워질 수 있습니다. 이러한 무작위 집합에서 정렬 테스트를 여러 번 실행하고 이러한 실행에서 평균 정렬 및 점수 분산을 계산합니다. 특정 숫자는 없지만 10회 반복은 중간 규모 프로젝트에 유용한 기준입니다. 신뢰도를 높이기 위해 더 많은 반복을 실행합니다.
해결: 정렬 점수가 크게 변동 (분산이 높음)되면 심판이 아직 신뢰할 수 없습니다. 초기 점수는 몇 가지 쉬운 사례로 인해 우연히 나온 것입니다. 기준표를 확대하고 정렬 데이터 세트에 더 다양하고 어려운 예시를 추가하세요.

복원 추출을 사용한 리샘플링이 특정 데이터 클래스를 과도하게 표현하거나 과소하게 표현할 수 있는 방법을 보여주는 부트스트랩 테스트의 시각적 표현 — 대체로 객체가 하위 샘플링되므로 일부 클래스는 과도하게 표현될 수 있고 (부트스트랩 샘플 1과 2의 노란색 구슬) 다른 클래스는 과소 표현될 수 있으며 (부트스트랩 샘플 1과 2의 빨간색 구슬) 누락될 수도 있습니다 (부트스트랩 샘플 3의 녹색 구슬). ResearchGate의 원본 과학적 수치를 확인합니다.

사용해 보세요.

자체 일관성 테스트

심판은 동일한 입력에 대해 항상 동일한 답변을 제공하는 경우에만 신뢰할 수 있습니다. 온도를 0로 설정한 경우 심사위원은 100% 일관됩니다. 이 일관성을 확인합니다.

테스트: 정렬 데이터 세트에서 무작위로 추출한 것과 같은 정확히 동일한 데이터 세트에서 심판을 여러 번 실행합니다. 이러한 반복에 걸쳐 각 테스트 사례의 분산을 계산합니다. 100% 일관성 (분산 0)을 목표로 합니다. 분산이 0보다 크면 심판이 동일한 입력에 대해 다른 답변을 제공하므로 테스트가 실패합니다.
해결: 심사 프롬프트가 모호하거나 온도가 너무 높을 수 있습니다. 명확하지 않은 프롬프트 부분, 특히 점수 매기기 루브릭을 다시 작성합니다. 아직 온도를 낮추지 않았다면 온도를 0으로 낮춥니다 (또는 thinking_level을 높게 설정).

실제로 작동하는지 확인하려면 테스트를 실행하세요.

심사관 일관성 테스트의 터미널 출력입니다. — 이 예에서는 세 가지 측정항목(모토 유해성, 모토 브랜드 적합성, 색상 브랜드 적합성) 각각에 대해 6개의 샘플을 테스트했습니다. 결과는 거의 완전히 안정적이었지만 일부 샘플에서는 결과가 일관되지 않았습니다.

Final exam

부트스트래핑은 과적합을 방지하기 위한 초기 검사를 실행하는 데 도움이 되었습니다. 그런 다음 새 데이터를 사용하여 최종 테스트를 실행합니다. 심사위원이 새로운 입력을 올바르게 평가할 수 있는지 확인하는 마지막 단계입니다.

테스트: 정렬 중에 사용하지 않은 인간 라벨 샘플 20개로 구성된 별도의 최종 시험 데이터 세트를 유지합니다. 이 세트에 대해 심판을 실행합니다.
수정: 정렬 점수가 높게 유지되면 심사자가 준비된 것입니다. 점수가 급격히 떨어지면 과적합을 나타냅니다. 특정 정렬 데이터를 통과하기 위해 프롬프트를 너무 많이 조정했기 때문입니다. 프롬프트, 루브릭, 퓨샷 예시를 확장합니다.