Audio Clip Annotation

Centaur supports audio clip annotation for both classification and time range selection tasks. This allows your labelers to listen to audio clips and either classify them or highlight specific time ranges within the audio.

Audio clip annotation is available on both the DiagnosUs mobile app and the web labeling interface.

Supported Task Types

Classification

Labelers listen to an audio clip and select from a set of answer choices. Use classification when the goal is to categorize the entire clip.

Example use cases:

  • Classifying clinical scene types from ambient audio
  • Identifying conversational topics
  • Categorizing audio quality or content type

Time Range Selection

Labelers listen to an audio clip and highlight one or more time ranges within it. Tasks can be single-class or multi-class, meaning labelers can annotate multiple types of events within the same clip.

Example use cases:

  • Highlighting segments containing protected health information (PHI) or personally identifiable information (PII)
  • Marking event boundaries such as the start and end of a patient history
  • Identifying and classifying overlapping audio events during an examination

Supported Audio Formats and File Sizes

Audio files can be imported directly into the Centaur platform. The following formats are supported:

| Format | File Extension | Max File Size | Notes |
| --- | --- | --- | --- |
| WAV | .wav | 100 MB | Uncompressed; larger files per minute of audio |
| MP3 | .mp3 | 100 MB | Compressed; recommended for longer clips |

As a rule of thumb, keep the total file size for a single case under 100 MB. S3 imports have a maximum size limit of 5 GB per import operation.
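As a pre-flight check, you can validate file sizes against these limits before starting an import. The function below is an illustrative sketch, not part of any Centaur SDK; the names and return shape are assumptions.

```python
import os

# Limits from the table above; constants are illustrative, not an official API.
PER_CASE_LIMIT = 100 * 1024 ** 2   # 100 MB per case
S3_IMPORT_LIMIT = 5 * 1024 ** 3    # 5 GB per S3 import operation

def check_import_sizes(paths):
    """Return (oversized_cases, batch_ok) for a list of audio file paths.

    oversized_cases: files exceeding the 100 MB per-case limit.
    batch_ok: whether the whole batch fits in one 5 GB S3 import.
    """
    sizes = {p: os.path.getsize(p) for p in paths}
    oversized = [p for p, s in sizes.items() if s > PER_CASE_LIMIT]
    batch_ok = sum(sizes.values()) <= S3_IMPORT_LIMIT
    return oversized, batch_ok
```

Running this locally before upload avoids failed imports from oversized cases.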

Data Dicer for Long Audio Clips

For long audio files, Centaur's data dicer can automatically split clips into smaller sub-clips for annotation. This is recommended when individual clips exceed 60 to 120 seconds, as shorter clips improve labeler throughput and annotation quality.

Key details:

  • The data dicer supports clips up to 100 MB.
  • Use the data dicer for tasks where full-clip context is not required for accurate annotation. If annotators need to hear the full recording for context, consider providing transcripts alongside shorter clips.
  • Centaur can provide credit estimates for the sub-clips generated by the data dicer.
  • Contact your project manager to enable the data dicer for your task.

Best Practices for Audio Annotation

Keep clips short

Aim for clips between 30 and 120 seconds whenever possible. Internal testing shows that annotator engagement and accuracy decline significantly with longer clips. If your source audio is longer, use the data dicer or pre-segment your clips before import.
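If you pre-segment clips yourself rather than using the data dicer, fixed-length splitting can be done with Python's standard-library `wave` module. This is a minimal sketch for uncompressed WAV input; the function name and output naming scheme are assumptions, and compressed formats like MP3 would need a tool such as ffmpeg instead.

```python
import math
import wave

def split_wav(path, chunk_seconds=60, out_prefix="subclip"):
    """Split a WAV file into fixed-length sub-clips.

    Writes files named <out_prefix>_000.wav, <out_prefix>_001.wav, ...
    The final chunk holds whatever audio remains and may be shorter.
    """
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = chunk_seconds * src.getframerate()
        n_chunks = math.ceil(src.getnframes() / frames_per_chunk)
        outputs = []
        for i in range(n_chunks):
            frames = src.readframes(frames_per_chunk)
            out_path = f"{out_prefix}_{i:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # header frame count is fixed up on close
                dst.writeframes(frames)
            outputs.append(out_path)
    return outputs
```

For example, `split_wav("visit.wav", chunk_seconds=90)` would break a long recording into 90-second sub-clips, in line with the 30-to-120-second guidance above.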

Write specific, actionable prompts

Audio annotation can be more ambiguous than image classification. Your labeler prompt and instructions should clearly define what labelers are listening for, including examples of edge cases. For range selection, specify whether annotators should include silence or pauses between events.

Provide sufficient gold standards

Audio tasks can be harder to calibrate than image tasks. Provide enough gold standard cases to reliably measure annotator performance; at a minimum, more than four per task. With too few gold standards, performance scores may be unreliable and quality assessment becomes difficult.

For audio tasks, ensure that your gold standard clips are representative of your full dataset. If you plan to use the data dicer, gold standard clips should match the length of the diced sub-clips, not the original long-form audio.

Consider multi-class complexity

For time range selection with multiple label classes (e.g., "patient entry" and "patient exit" in the same clip), ensure your instructions clearly distinguish when each label should be applied. Annotators may find overlapping or adjacent events in the same clip confusing without strong guidance.

Optimize for mobile

Many DiagnosUs labelers annotate on mobile devices with varying network conditions. Compressed formats like .mp3 reduce download times and memory usage. Keeping individual clip file sizes well under 100 MB will improve the labeler experience.
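The size difference is easy to quantify. The back-of-envelope calculation below uses typical values (CD-quality PCM for WAV, a 128 kbps MP3 bitrate) as illustrative assumptions; your actual files will vary with sample rate, channel count, and encoder settings.

```python
def wav_bytes_per_minute(sample_rate=44100, sample_width_bytes=2, channels=2):
    """Uncompressed PCM: rate * sample width * channels bytes per second."""
    return sample_rate * sample_width_bytes * channels * 60

def mp3_bytes_per_minute(bitrate_kbps=128):
    """Compressed audio size depends only on the encoding bitrate."""
    return bitrate_kbps * 1000 * 60 // 8

# CD-quality stereo WAV is ~10.6 MB per minute, while a 128 kbps MP3 is
# ~0.96 MB per minute: roughly an 11x smaller download for mobile labelers.
```

This is why a one-hour WAV can blow past the 100 MB limit while the equivalent MP3 stays well under it.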

Monitor throughput and quality early

Launch a small initial batch and review results before scaling up. Pay attention to average annotation duration per case. If labelers are spending significantly less time than the clip length, they may be skipping content. If they are spending far longer, the task may be too complex or the clips too long.
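A simple duration-ratio screen can surface both failure modes. The record format and thresholds below are hypothetical illustrations (0.8x for likely skipping, 3x for likely overload), not fields from any Centaur export.

```python
def flag_suspect_cases(records, low_ratio=0.8, high_ratio=3.0):
    """Screen annotation times against clip lengths.

    records: iterable of (case_id, clip_seconds, annotation_seconds) tuples
             (a hypothetical shape; adapt to your actual export fields).
    Returns (too_fast, too_slow): case ids where labelers likely skipped
    content, and cases that may be too complex or too long.
    """
    too_fast, too_slow = [], []
    for case_id, clip_s, annot_s in records:
        ratio = annot_s / clip_s
        if ratio < low_ratio:
            too_fast.append(case_id)
        elif ratio > high_ratio:
            too_slow.append(case_id)
    return too_fast, too_slow
```

Reviewing the flagged cases from a small pilot batch tells you whether to tighten instructions, shorten clips, or both before scaling up.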

Results and Export

Audio annotation results follow the same export patterns as other task types. Classification results are exported with the standard classification schema, and time range selection results follow the ranges export schema. See the Label export glossaries for field-level documentation.

For API-based result retrieval, see API JSON Results.