Inter-Annotator Agreement Metrics

Measure consistency between annotators to identify training gaps and guideline issues, and to ensure high-quality training data.

Key Metrics

  • Cohen's Kappa (κ) - Accounts for chance agreement between two annotators. Values above 0.8 are generally considered excellent agreement for classification tasks.
  • Fleiss' Kappa - Generalizes Cohen's Kappa to more than two annotators.
  • Krippendorff's Alpha - Works for any number of annotators, handles missing data, and supports nominal, ordinal, and interval labels.
  • IoU (Intersection over Union) - Essential for bounding-box and segmentation annotation; an IoU above 0.75 between annotators is typically considered acceptable (computation sketches for these metrics follow this list).
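
A minimal sketch of computing the three chance-corrected metrics in Python, assuming the scikit-learn, statsmodels, and krippendorff packages are installed; the labels below are illustrative, not from any real project:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
import krippendorff

# Two annotators labelling the same 8 items.
annotator_a = ["cat", "dog", "dog", "cat", "bird", "cat", "dog", "bird"]
annotator_b = ["cat", "dog", "cat", "cat", "bird", "cat", "dog", "dog"]

# Cohen's kappa: chance-corrected agreement between exactly two annotators.
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

# Fleiss' kappa: more than two annotators, no missing labels.
# Rows are items, columns are annotators (category codes 0=cat, 1=dog, 2=bird);
# aggregate_raters converts this into the per-item category counts fleiss_kappa expects.
ratings = np.array([
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [2, 2, 2],
    [0, 0, 0],
    [1, 1, 1],
    [2, 1, 2],
])
counts, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(counts))

# Krippendorff's alpha: any number of annotators, missing labels as np.nan.
# Rows are annotators, columns are items.
reliability_data = np.array([
    [0, 1, 1, 0, 2, 0, 1, 2],
    [0, 1, 0, 0, 2, 0, 1, 1],
    [0, 1, 1, np.nan, 2, 0, 1, 2],
])
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement="nominal"))
```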
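
IoU itself is plain geometry and needs no library; a generic sketch for axis-aligned bounding boxes in (x_min, y_min, x_max, y_max) pixel coordinates, not tied to any particular labelling tool:

```python
def bbox_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Two annotators drew slightly offset boxes around the same object.
print(bbox_iou((10, 10, 110, 110), (20, 20, 120, 120)))  # ~0.68, below the 0.75 bar
```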

When Agreement is Low

  1. Identify specific labels or items with high disagreement (see the sketch after this list)
  2. Discuss edge cases as a team to understand root causes
  3. Update annotation guidelines with clarifying examples
  4. Re-train annotators on problematic categories
  5. Consider if the labeling task itself is too subjective
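
A hedged sketch of steps 1 and 2, assuming two annotators' labels are loaded into a pandas DataFrame (the column names and labels here are illustrative): the annotator-vs-annotator confusion matrix shows which label pairs are being confused, the per-label agreement rate points at categories that need clarified guidelines or re-training, and the final filter lists the concrete items to discuss as a team.

```python
import pandas as pd

df = pd.DataFrame({
    "item_id":     [1, 2, 3, 4, 5, 6, 7, 8],
    "annotator_a": ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"],
    "annotator_b": ["spam", "ham", "ham",  "ham", "spam", "spam", "spam", "ham"],
})

# Confusion matrix between the two annotators: off-diagonal cells reveal which
# label pairs are being mixed up.
print(pd.crosstab(df["annotator_a"], df["annotator_b"]))

# Per-label agreement rate, sorted so the most problematic categories come first.
df["agree"] = df["annotator_a"] == df["annotator_b"]
print(df.groupby("annotator_a")["agree"].mean().sort_values())

# Concrete disagreements to walk through as a team (step 2).
print(df.loc[~df["agree"], ["item_id", "annotator_a", "annotator_b"]])
```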