Performance measurement

We measure our labelers’ performance to ensure we only include high-quality reads in our final labels.

How is performance measured?

Labelers receive a score from 0 to 100 on each gold standard and labeled case based on how closely their read matches the Correct Label. Labelers are not scored on unlabeled cases.

Case-level scores are calculated differently based on the annotation type.

Then, the labeler’s scores on the last 20–50 cases are averaged to create their overall score. This overall score is a dynamic representation of the labeler’s performance on the task at any given moment. Reads from labelers with overall scores above a certain threshold will be considered qualified reads and incorporated into the Majority Label.
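The rolling overall score described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the window size and qualification threshold here are assumed placeholder values, since the actual window varies between 20 and 50 cases and the threshold is not specified.

```python
from collections import deque

# Illustrative assumptions -- the real window is 20-50 cases and the
# qualification threshold is not published.
WINDOW = 20
THRESHOLD = 80.0

class LabelerScore:
    def __init__(self, window=WINDOW):
        # deque(maxlen=...) automatically drops the oldest score,
        # keeping only the most recent cases in the average.
        self.recent = deque(maxlen=window)

    def record(self, case_score):
        """Record a 0-100 score for one gold standard or labeled case."""
        self.recent.append(case_score)

    @property
    def overall(self):
        """Average of the most recent case scores, or None if unscored."""
        return sum(self.recent) / len(self.recent) if self.recent else None

    def is_qualified(self, threshold=THRESHOLD):
        """Reads count toward the Majority Label only above the threshold."""
        overall = self.overall
        return overall is not None and overall >= threshold

s = LabelerScore()
for score in [100, 100, 50, 100, 83]:
    s.record(score)
print(s.overall)         # 86.6
print(s.is_qualified())  # True
```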

Classification

Single Select

This is a binary score—100 if the read matches the Correct Label, 0 if not.
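In code, single-select scoring is a one-line comparison (a sketch; the function name is ours):

```python
def single_select_score(read, correct_label):
    """Binary score: 100 if the read matches the Correct Label, else 0."""
    return 100 if read == correct_label else 0

print(single_select_score("cat", "cat"))  # 100
print(single_select_score("dog", "cat"))  # 0
```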

Multi-Select

  • If labelers can select more than one choice, they receive a score of 100 for each answer choice they correctly select or omit, and a score of 0 for each answer choice they incorrectly select or omit. The scores on each answer choice are averaged together to determine the case's score.

Example: consider the task "Select all the colors of the flag" with answer choices red, blue, green, yellow, white, and black, where the Correct Label is red, blue, white. Because there are 6 choices, and each correct selection or omission earns 100 points, the best possible score a labeler could get is 600 out of 600.

  • A selection of red, blue would receive a score of 500/600, or 83%, for correctly selecting colors red and blue, correctly omitting green, yellow, black, and incorrectly omitting white.
    Color choice   Color present   Color selected   Points collected
    Red            Yes             Yes              100
    Blue           Yes             Yes              100
    White          Yes             No               0
    Green          No              No               100
    Yellow         No              No               100
    Black          No              No               100
  • A selection of yellow, white would receive a score of 300/600, or 50%, for incorrectly omitting red and blue, incorrectly selecting yellow, correctly selecting white, and correctly omitting green and black.
    Color choice   Color present   Color selected   Points collected
    Red            Yes             No               0
    Blue           Yes             No               0
    White          Yes             Yes              100
    Green          No              No               100
    Yellow         No              Yes              0
    Black          No              No               100
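The multi-select rule above can be sketched as a small function (names are ours; the flag example reproduces the two worked cases):

```python
def multi_select_score(selected, correct, choices):
    """Score each answer choice 100 if it was correctly selected or
    correctly omitted, 0 otherwise, then average across all choices."""
    selected, correct = set(selected), set(correct)
    points = [100 if ((c in selected) == (c in correct)) else 0
              for c in choices]
    return sum(points) / len(points)

choices = ["red", "blue", "white", "green", "yellow", "black"]
correct = {"red", "blue", "white"}
print(multi_select_score({"red", "blue"}, correct, choices))      # 500/6, ~83.3
print(multi_select_score({"yellow", "white"}, correct, choices))  # 300/6 = 50.0
```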

Polygon and pixel segmentation

The score is based on the intersection over union (IoU) of the labeler-drawn shape over the Correct Label. Higher IoU translates to a higher score.

If a case contains multiple findings, the IoU is computed over all findings combined: the intersection and union are taken between the combined area of the labeler's findings and the combined area of the gold standard (or labeled case) findings.

In multi-class polygon and pixel segmentation, IoUs are calculated for each individual class within a case. The final IoU for the case is the average of the individual IoUs for each class.

Box, line, circle segmentation

The closer the labeler's shapes match the Correct Label in size and location, the higher the score will be.

  • We compare the shapes drawn by the labeler with the Correct Label, measuring the distance between nearest pairs of labeler and Correct Label shapes using a distance metric known as the Hausdorff distance. Combining all of these pairwise Hausdorff distance measurements yields the labeler's score.
  • Scoring can be curved to be more lenient or more difficult, based on the difficulty of the task or size of the findings.

In multi-class box, line, and circle segmentation, a score is calculated for each class based on the above methodology. The overall score of the case is determined by averaging the individual class scores.
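The Hausdorff distance itself can be computed as below, treating each shape as a set of points. How the distances are combined and curved into a final score is not specified, so this sketch shows only the metric:

```python
import math

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets: the largest
    distance from any point in one set to its nearest point in the other.
    Smaller values mean the shapes match more closely."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))

# Two horizontal segments one unit apart: every point's nearest
# counterpart is exactly 1.0 away.
print(hausdorff([(0, 0), (1, 0)], [(0, 1), (1, 1)]))  # 1.0
```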

Range Selection

NER:

The score is based on the character-wise intersection over union (IoU) of the labeler-drawn highlights over the Correct Label. Higher IoU translates to a higher score.

If a case contains multiple findings, the IoU is computed over all findings combined: the intersection and union are taken between the combined highlighted characters of the labeler's findings and those of the gold standard (or labeled case).

In multi-class NER, IoUs are calculated for each individual class within a case. The final IoU for the case is the average of the individual IoUs for each class.
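Character-wise IoU can be sketched by expanding each highlight span into the set of character indices it covers (span format and names are assumptions for illustration; end indices are treated as exclusive):

```python
def chars(spans):
    """Expand (start, end) highlight spans into a set of character
    indices, merging multiple findings into one set. End is exclusive."""
    return {i for start, end in spans for i in range(start, end)}

def char_iou(pred_spans, gold_spans):
    """Character-wise intersection over union of two sets of highlights."""
    pred, gold = chars(pred_spans), chars(gold_spans)
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

# Highlights over characters 0-4 vs. 2-6: 3 shared characters out of 7.
print(char_iou([(0, 5)], [(2, 7)]))  # 3/7
```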

Time Range Selection:

The score is based on the deci-second intersection over union (IoU) of the labeler selections over the Correct Label. Higher IoU translates to a higher score.

If a case contains multiple findings, the IoU is computed over all findings combined: the intersection and union are taken between the combined selected deci-seconds of the labeler's findings and those of the gold standard (or labeled case).

In multi-class time range selection, IoUs are calculated for each individual class within a case. The final IoU for the case is the average of the individual IoUs for each class.
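Deci-second IoU follows the same pattern as the character-wise case, discretizing each selected time range into tenth-of-a-second ticks (range format and names are illustrative assumptions):

```python
def ticks(ranges):
    """Expand (start, end) times in seconds into deci-second ticks,
    merging multiple findings into one set. End is exclusive."""
    return {t for start, end in ranges
              for t in range(round(start * 10), round(end * 10))}

def time_iou(pred_ranges, gold_ranges):
    """Deci-second intersection over union of two sets of time ranges."""
    pred, gold = ticks(pred_ranges), ticks(gold_ranges)
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

# Selections 0.0-2.0s vs. 1.0-3.0s overlap for 1s out of a 3s union.
print(time_iou([(0.0, 2.0)], [(1.0, 3.0)]))  # 1/3
```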