Measure consistency between annotators to identify training gaps and guideline issues, and to ensure high-quality training data.
Inter-Annotator Agreement Metrics
- Cohen's Kappa (κ) - Accounts for chance agreement between two annotators. Values above 0.8 generally indicate excellent agreement for classification tasks.
- Fleiss' Kappa - Extends Cohen's Kappa to more than two annotators.
- Krippendorff's Alpha - Works for any number of annotators and handles missing annotations.
- IoU (Intersection over Union) - Measures spatial overlap and is essential for bounding box and segmentation annotation. Values above 0.75 are typically acceptable. A computation sketch for several of these metrics follows below.
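A minimal sketch of how these metrics might be computed, assuming scikit-learn and statsmodels are installed; the labels and box coordinates are illustrative only, and Krippendorff's Alpha is omitted here (third-party packages such as `krippendorff` provide it).

```python
# Sketch: computing agreement metrics on a toy labeling task.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Cohen's Kappa: two annotators labeling the same 8 items.
annotator_a = ["cat", "dog", "dog", "cat", "bird", "cat", "dog", "bird"]
annotator_b = ["cat", "dog", "cat", "cat", "bird", "cat", "dog", "dog"]
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.3f}")

# Fleiss' Kappa: three annotators; rows are items, columns are annotators.
ratings = np.array([
    ["cat", "cat", "cat"],
    ["dog", "dog", "cat"],
    ["bird", "bird", "bird"],
    ["cat", "dog", "dog"],
])
table, _ = aggregate_raters(ratings)  # item x category count matrix
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")

def bbox_iou(box_a, box_b):
    """Intersection over Union for two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(f"IoU: {bbox_iou([10, 10, 60, 60], [20, 15, 70, 65]):.3f}")
```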
When Agreement is Low
- Identify the specific labels or items with high disagreement (see the sketch after this list)
- Discuss edge cases as a team to understand root causes
- Update annotation guidelines with clarifying examples
- Re-train annotators on problematic categories
- Consider whether the labeling task itself is too subjective
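A minimal sketch of surfacing the labels and items with the most disagreement, assuming two annotators and a pandas DataFrame; the column names (`item_id`, `label_a`, `label_b`) are hypothetical and depend on your annotation export format.

```python
# Sketch: finding where two annotators disagree most often.
import pandas as pd

annotations = pd.DataFrame({
    "item_id": [1, 2, 3, 4, 5, 6],
    "label_a": ["cat", "dog", "dog", "bird", "cat", "dog"],
    "label_b": ["cat", "cat", "dog", "dog", "cat", "cat"],
})

annotations["disagree"] = annotations["label_a"] != annotations["label_b"]

# Disagreement rate per label, using annotator A's label as the grouping key.
per_label = (
    annotations.groupby("label_a")["disagree"]
    .mean()
    .sort_values(ascending=False)
)
print(per_label)

# Items to review first when discussing edge cases as a team.
print(annotations.loc[annotations["disagree"], "item_id"].tolist())
```

Labels with the highest disagreement rates are the first candidates for clarifying examples in the guidelines or targeted re-training.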