Kathy Reid<p>For the past couple of years, as each new <span class="h-card" translate="no"><a href="https://mozilla.social/@mozilla" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>mozilla</span></a></span> <a href="https://aus.social/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CommonVoice</span></a> dataset of <a href="https://aus.social/tags/voice" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>voice</span></a> <a href="https://aus.social/tags/data" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>data</span></a> is released, I've been using <span class="h-card" translate="no"><a href="https://vis.social/@observablehq" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>observablehq</span></a></span> to visualise the <a href="https://aus.social/tags/metadata" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>metadata</span></a> coverage across the 100+ languages in the dataset. </p><p>Version 17 was released yesterday (big ups to the team - EM Lewis-Jong, <span class="h-card" translate="no"><a href="https://mastodon.social/@jessie" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>jessie</span></a></span>, Gina Moape, Dmitrij Feller) and there are some super interesting insights from the visualisation: </p><p>➡ Catalan (ca) now has more data in Common Voice than English (en) (!)</p><p>➡ The language with the highest average audio utterance duration, at nearly 7 seconds, is Icelandic (is). Perhaps Icelandic words are longer? I suspect so!</p><p>➡ Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. 
Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).</p><p>➡ Votic (vot) has the highest percentage of invalidated utterances, at 76%. I wonder whether this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences that are deliberately invalid), given the current geopolitical instability in Russia. </p><p>See the visualisation here and let me know your thoughts below!</p><p>➡ <a href="https://observablehq.com/@kathyreid/mozilla-common-voice-v17-dataset-metadata-coverage" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">observablehq.com/@kathyreid/mo</span><span class="invisible">zilla-common-voice-v17-dataset-metadata-coverage</span></a></p><p><a href="https://aus.social/tags/linguistics" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>linguistics</span></a> <a href="https://aus.social/tags/languages" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>languages</span></a> <a href="https://aus.social/tags/data" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>data</span></a> <a href="https://aus.social/tags/VoiceAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>VoiceAI</span></a> <a href="https://aus.social/tags/VoiceData" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>VoiceData</span></a> <a href="https://aus.social/tags/SpeechAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SpeechAI</span></a> <a href="https://aus.social/tags/SpeechData" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SpeechData</span></a> <a href="https://aus.social/tags/DataViz" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DataViz</span></a></p>