Common Voice 23 — Georgian ASR Quality Control Dataset
This dataset was created as part of a quality control (QC) process for the Mozilla Common Voice Corpus 23 (Georgian subset).
Common Voice 23 — Georgian ASR Quality Control Dataset
DOI: 10.57967/hf/6884 ·
License: CC BY 4.0 ·
Language: ka (Georgian)
Overview
This dataset was created as part of a quality control (QC) process for the Mozilla Common Voice Corpus 23 (Georgian subset). The main goal is to identify corrupted, low-quality, or mislabeled recordings that might have passed through Common Voice validation but are unsuitable for training or evaluation.
Methodology
- User Selection: Unique users were selected from the Common Voice 23 Georgian corpus, including only those with at least 3 valid audio samples.
- Automatic Transcription: Each audio clip was transcribed using custom Georgian ASR models developed at ASR.GE.
-
Error Metrics:
Transcriptions were compared to the reference sentences from Common Voice and
the following metrics were computed for each sample:
CER(Character Error Rate)WER(Word Error Rate)
High CER/WER values often correspond to faulty recordings, such as noise, silence, or non-speech segments.
Dataset Fields
| Column | Description |
|---|---|
path |
Relative path to the original Common Voice audio file |
reference |
Ground-truth text provided by Common Voice |
cer_min |
Minimum Character Error Rate across model predictions |
wer_min |
Minimum Word Error Rate across model predictions |
Purpose and Use Cases
This dataset enables:
- Detection of problematic users or recordings in the Common Voice Georgian subset.
- Improved dataset curation by filtering out noisy or low-quality samples.
- ASR benchmarking on real-world Georgian data.
It can be directly linked with Common Voice 23 (ka) using the path field,
which matches the original file structure.
Opens in a new tab