Can you create AI evaluations?

  1. Why is intuition an insufficient way to measure the quality of LLM-based applications?

  2. Which of the following is an example of a rule-based evaluation for the ThemeBuilder application?

  3. What is the primary purpose of using pairwise evaluation instead of pointwise evaluation?

  4. When configuring a judge model, why should you set the temperature to 0?

  5. What does it mean to overfit in your evaluation pipeline?

  6. What is the bootstrapping technique used for?

  7. What metric is used to measure 'agreement beyond luck' between human experts or between a judge and a human?

  8. When evaluating toxicity, why should you prioritize recall over precision?

  9. What is the dynamic rubric pattern?