This dataset was created as part of a quality control (QC) process for the Mozilla Common Voice Corpus 23 (Georgian subset).

Common Voice 23 — Georgian ASR Quality Control Dataset

DOI: 10.57967/hf/6884 · License: CC BY 4.0 · Language: ka (Georgian)

Overview

This dataset was created as part of a quality control (QC) process for the Mozilla Common Voice Corpus 23 (Georgian subset). The main goal is to identify corrupted, low-quality, or mislabeled recordings that might have passed through Common Voice validation but are unsuitable for training or evaluation.

Methodology

User Selection: Unique users were selected from the Common Voice 23 Georgian corpus, including only those with at least 3 valid audio samples.
Automatic Transcription: Each audio clip was transcribed using custom Georgian ASR models developed at ASR.GE.
Error Metrics: Transcriptions were compared to the reference sentences from Common Voice and the following metrics were computed for each sample:
- CER (Character Error Rate)
- WER (Word Error Rate)
High CER/WER values often correspond to faulty recordings, such as noise, silence, or non-speech segments.

Dataset Fields

Column	Description
`path`	Relative path to the original Common Voice audio file
`reference`	Ground-truth text provided by Common Voice
`cer_min`	Minimum Character Error Rate across model predictions
`wer_min`	Minimum Word Error Rate across model predictions

Purpose and Use Cases

This dataset enables:

Detection of problematic users or recordings in the Common Voice Georgian subset.
Improved dataset curation by filtering out noisy or low-quality samples.
ASR benchmarking on real-world Georgian data.

It can be directly linked with Common Voice 23 (ka) using the path field, which matches the original file structure.

View on Hugging Face

Opens in a new tab

Common Voice 23 — Georgian ASR Quality Control Dataset

Common Voice 23 — Georgian ASR Quality Control Dataset

Overview

Methodology

Dataset Fields

Purpose and Use Cases

| სხვა სიახლეები

AI ქართულ სამართალში: როგორ ვასწავლეთ საგადასახადო კოდექსი ხელოვნურ ინტელექტს

B2B: Real-time Georgian speech transcription right on your desktop

ASR.GE - ხელოვნური ინტელექტი თქვენს სამედიცინო დაწესებულებაში

ქართული AI-ს მომავალი უფრო მეტია, ვიდრე უბრალოდ ჩატი. 🚀 - Survey | კვლევა