How to review results

There are three recommended strategies for reviewing results, summarized in the table below. The first batch of data should be comprehensively reviewed by your team and/or experts contracted by Centaur. For subsequent batches, we recommend reviewing at least 10-15 cases per category to get a good sense of labeler performance.

| Strategy | How to find | What they show | What to look for |
| --- | --- | --- | --- |
| Low agreement cases | Filter by Labeled cases and sort by ascending agreement: go to States and select Labeled, then go to More Columns and select Agreement. | Cases where labelers did not converge on similar answers | Any patterns in the types of cases with low agreement, e.g., edge cases that may not be covered in the instructions |
| High difficulty cases | Filter by Gold Standard cases and sort by descending difficulty: go to States and select Gold Standard, then go to More Columns and select Difficulty. | Cases where labelers did not agree with the gold standard answer | Any patterns in the kinds of cases labelers are getting wrong, and/or any incorrect gold standard answers |
| Random sample | Multiple ways, e.g., filter by Labeled and pick a random page to review | A random sample of cases to get a general sense of labeler performance | How labelers are doing on the task; any other interesting trends |
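If you export your results to a file, you can also surface these cases programmatically. The sketch below ranks cases by a simple agreement score (the fraction of labelers who chose the most common answer) so the least-converged cases come first. The file name `results.csv` and its column names are assumptions for illustration, not Centaur's actual export format.

```python
# Sketch: surface low agreement cases from an exported results file.
# Assumes a hypothetical CSV with one row per (case, labeler) and the
# columns "case_id", "labeler_id", "label"; adjust names to your export.
import csv
from collections import Counter, defaultdict

labels_by_case = defaultdict(list)
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        labels_by_case[row["case_id"]].append(row["label"])

def agreement(labels):
    """Fraction of labelers who chose the most common label for a case."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)

# Lowest agreement first, mirroring "sort by ascending agreement".
ranked = sorted(labels_by_case.items(), key=lambda kv: agreement(kv[1]))
for case_id, labels in ranked[:15]:
    print(f"{case_id}: agreement={agreement(labels):.2f} labels={labels}")
```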
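A similar sketch surfaces high difficulty cases by approximating difficulty as the share of labels that missed the gold standard answer; the `gold_label` column is again an assumption about the export.

```python
# Sketch: flag gold standard cases that labelers answered incorrectly.
# Assumes the same hypothetical CSV, with a "gold_label" column that is
# non-empty only on gold standard cases.
import csv
from collections import defaultdict

wrong = defaultdict(int)   # case_id -> labels that missed the gold answer
total = defaultdict(int)   # case_id -> total labels on the case
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if not row.get("gold_label"):
            continue  # not a gold standard case
        total[row["case_id"]] += 1
        if row["label"] != row["gold_label"]:
            wrong[row["case_id"]] += 1

# Hardest cases first, mirroring "sort by descending difficulty".
ranked = sorted(total, key=lambda c: wrong[c] / total[c], reverse=True)
for case_id in ranked[:15]:
    print(f"{case_id}: {wrong[case_id]}/{total[case_id]} labels missed the gold answer")
```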
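Finally, a sketch for drawing a reproducible random sample of 10-15 cases per category, matching the per-batch review guidance above; the `category` column is likewise an assumption.

```python
# Sketch: draw a reproducible random sample of up to 15 labeled cases per
# category for manual review. Assumes the same hypothetical CSV with a
# "category" column.
import csv
import random
from collections import defaultdict

cases_by_category = defaultdict(set)
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        cases_by_category[row["category"]].add(row["case_id"])

random.seed(0)  # fixed seed so the same sample can be re-reviewed later
for category, case_ids in cases_by_category.items():
    sample = random.sample(sorted(case_ids), k=min(15, len(case_ids)))
    print(f"{category}: review cases {sample}")
```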

If you find any significant errors in the results, consider these strategies to improve future batches. Share your findings with your project manager, who can assist you as well.

If you would like to use Centaur accuracy metrics to review your results, check out Case level metrics and Task level metrics.