სიახლეებზე დაბრუნება
22 Feb, 2026

Common Voice 23 — Georgian ASR Quality Control Dataset

Common Voice 23 — Georgian ASR Quality Control Dataset

This dataset was created as part of a quality control (QC) process for the Mozilla Common Voice Corpus 23 (Georgian subset).

Common Voice 23 — Georgian ASR Quality Control Dataset

DOI: 10.57967/hf/6884 · License: CC BY 4.0 · Language: ka (Georgian)

Overview

This dataset was created as part of a quality control (QC) process for the Mozilla Common Voice Corpus 23 (Georgian subset). The main goal is to identify corrupted, low-quality, or mislabeled recordings that might have passed through Common Voice validation but are unsuitable for training or evaluation.

Methodology

  • User Selection: Unique users were selected from the Common Voice 23 Georgian corpus, including only those with at least 3 valid audio samples.
  • Automatic Transcription: Each audio clip was transcribed using custom Georgian ASR models developed at ASR.GE.
  • Error Metrics: Transcriptions were compared to the reference sentences from Common Voice and the following metrics were computed for each sample:
    • CER (Character Error Rate)
    • WER (Word Error Rate)

    High CER/WER values often correspond to faulty recordings, such as noise, silence, or non-speech segments.

Dataset Fields

Column Description
path Relative path to the original Common Voice audio file
reference Ground-truth text provided by Common Voice
cer_min Minimum Character Error Rate across model predictions
wer_min Minimum Word Error Rate across model predictions

Purpose and Use Cases

This dataset enables:

  • Detection of problematic users or recordings in the Common Voice Georgian subset.
  • Improved dataset curation by filtering out noisy or low-quality samples.
  • ASR benchmarking on real-world Georgian data.

It can be directly linked with Common Voice 23 (ka) using the path field, which matches the original file structure.

View on Hugging Face

Opens in a new tab