AgenticUniverse - Previously Formi
  1. Our Research

STT - Nuances and Insights

There is a lot of research comparing different ASR services with a focus on bias regarding gender, race, or age. Koenecke et al. (2020) report a nearly twice as high WER for Black American speakers compared to white American speakers. Regardless of each vendor's individual accuracy, these racial disparities exist across all tested vendors.
They attribute the main problem to a performance gap in the acoustic model, caused by insufficient training data featuring Black American speakers. Tatman and Kasten (2017) likewise note that the accuracy of ASR systems depends on sociolinguistic factors and is worse for Black Americans than for white Americans.
ASR does not, of course, distinguish between skin colours; it reflects biases in the training data. ASR therefore performs worse for speakers of groups underrepresented in that data. As a result, accuracy decreases for speakers with regional accents and for second-language learners.
Tadimeti et al. (2022) report a performance gap between general American and non-American accents. This is not limited to English, as Cumbal et al. (2021) show by comparing native and non-native speakers of Swedish. Bias in ASR applies across many dimensions: Catania et al. (2019) show that emotional speech decreases accuracy compared to neutral speech, and the company Speechmatics (2023) reports that speaker age is also a factor. ASR shows the highest accuracy for speakers aged 28 to 36 and the highest error rates for the oldest group, aged 60 to 81.
The average WER of these providers across the English datasets was 8.61%; supplying a custom vocabulary reduced it to 8.10%.
On the English datasets, streaming transcription had a higher WER (10.9%) than batch transcription (9.37%).
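WER figures like these are computed by aligning the hypothesis transcript against the reference and normalizing the edit distance by the reference length. A minimal illustrative sketch (not the scoring code any of these vendors use):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    word count, via dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match

    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat in the mat"))  # one substitution in six words
```

A vendor's reported 8.61% corresponds to roughly one word-level error every twelve reference words, which is why even small accent-driven phoneme confusions move the headline number noticeably.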
Speakers can enhance ASR accuracy by speaking more clearly and slowly.
Pronunciation Traits:
North Indian Hindi-speaking individuals often fail to articulate the phoneme /r/ when it occurs within a word and is followed by a consonant sound.
This results in variations such as elongation of the preceding vowel or incomplete articulation of /r/.
Linguistic and Signal Processing Analysis:
Improper coordination between the tongue and lips contributes to the inaccurate articulation of /r/.
The second and third formant frequencies were identified as critical in this mispronunciation pattern.
Observational Insights:
A noticeable deviation in the formant frequencies between the neutral speaker and the test subjects was observed, particularly for words requiring precise /r/ articulation.
Scatter plots and boxplots showed significant dispersion in formant frequency values for the affected sounds.
Practical Implications:
Findings can inform the design of language-learning pedagogies tailored for non-native English speakers.
The study contributes to understanding the barriers non-native speakers face and can aid in developing tools for accent modification or training.
Hindi and Telugu showed significant differences in vowel quality, consonant articulation (e.g., degree of retroflexion in stops), and suprasegmental features like rhythm and timing.
1. Vowel Quality Differences:
/u/ Vowel Fronting: Telugu speakers tended to produce the vowel /u/ with a more fronted tongue position than Hindi speakers. This fronting was more pronounced in their native languages but was also observed in their IE speech.
/i/ and /e/ Vowels: Hindi speakers produced the vowels /i/ and /e/ with a higher (more raised) tongue position compared to Telugu speakers. These differences were present in both the native languages and subtly in IE.
/ɑ/ Vowel Height: There was an interaction between language task and speaker background for the vowel /ɑ/. Telugu speakers produced a higher (more close) /ɑ/ in IE than in their native language, differing from Hindi speakers.
2. Production of the Fricative /s/:
Spectral Characteristics: Hindi speakers produced the /s/ sound with a lower average frequency (center of gravity) than Telugu speakers in both their native languages and in IE. This indicates that the quality of /s/ differed between the groups, possibly due to differences in tongue placement or lip rounding.
3. Phrase-Final Lengthening:
Duration of Final Vowels: Hindi speakers exhibited more extensive phrase-final lengthening than Telugu speakers in both their native languages and in IE. This means that Hindi speakers tended to lengthen the final syllable of phrases more than Telugu speakers.
4. Retroflexion in Stops:
Degree of Retroflexion: While both groups used retroflex stops, Hindi speakers produced the voiceless retroflex stop /ʈ/ with greater retroflexion (tongue curled back) than the voiced retroflex stop /ɖ/, whereas Telugu speakers showed the opposite pattern. In IE, however, both groups showed similar degrees of retroflexion, suggesting convergence in IE.
5. Lexical Stress Patterns:
Summary:
Phonetic Differences Exist but Are Subtle: While there are specific phonetic differences in how native Hindi and Telugu speakers produce certain sounds in IE—such as vowel quality adjustments, the articulation of /s/, and phrase-final lengthening—these differences are subtle.
Experienced Listeners Can Detect Differences: Only listeners with experience in IE and its regional nuances can reliably perceive these differences in speech.
Minimal Impact on Overall Recognition: For general speech recognition and comprehension, these phonetic differences do not pose significant challenges. IE maintains sufficient uniformity across speakers of different native languages to be mutually intelligible and function effectively as a lingua franca.
The paper analyzes how Indian English speakers often replace /θ/ and /ð/ with /t/ and /d/, leading to recognition errors in ASR systems.
https://rest.neptune-prod.its.unimelb.edu.au/server/api/core/bitstreams/2d9303f0-a810-5408-8594-b30cb8ff2a0d/content
1. Vowel Phonological Changes

a. Reduction in Vowel Length Distinction
Explanation:
In native English, vowel length can distinguish meaning between words (e.g., "ship" /ɪ/ vs. "sheep" /iː/).
Indian languages often do not use vowel length to differentiate meaning in the same way.
Indian English speakers may not consistently produce long vowels, leading to shorter durations for vowels that are typically long in native English.
Impact on ASR:
ASR systems may confuse words that rely on vowel length distinctions, resulting in misrecognitions between minimal pairs like "sit" and "seat" or "hit" and "heat."
The acoustic models, expecting longer vowel durations, may misclassify shorter vowels, affecting word recognition accuracy.
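The collision can be sketched with a toy lexicon (the phoneme tuples here are illustrative, not drawn from any real ASR dictionary): once the /iː/ vs /ɪ/ length contrast is neutralized, the minimal pairs above become homophones from the recognizer's point of view.

```python
from collections import defaultdict

# Toy lexicon: illustrative phoneme tuples for four minimal-pair words.
LEXICON = {
    "sit":  ("s", "ɪ", "t"),
    "seat": ("s", "iː", "t"),
    "hit":  ("h", "ɪ", "t"),
    "heat": ("h", "iː", "t"),
}

def neutralize_length(pron):
    """Model the loss of the vowel-length contrast: /iː/ -> /ɪ/."""
    return tuple("ɪ" if p == "iː" else p for p in pron)

# Group words by their neutralized pronunciation; any group with more than
# one member has collapsed into a set of homophones.
merged = defaultdict(list)
for word, pron in LEXICON.items():
    merged[neutralize_length(pron)].append(word)

for pron, words in merged.items():
    if len(words) > 1:
        print("/" + " ".join(pron) + "/ ->", sorted(words))
```

With the contrast gone, the acoustic evidence alone can no longer separate "sit" from "seat"; disambiguation falls entirely on the language model.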
b. Monophthongization of Diphthongs
Explanation:
Diphthongs are vowels that involve a glide from one vowel to another within the same syllable (e.g., /eɪ/ in "make"). Indian English speakers may pronounce these as monophthongs (single, pure vowel sounds).
For instance, /eɪ/ may be realized as /e/, and /aɪ/ as /a/.
Impact on ASR:
The lack of the glide changes the acoustic signature of the vowel, leading ASR systems to misinterpret or fail to recognize the intended word.
Words like "bait" and "bet" may sound similar, causing confusion in transcription.
c. Vowel Centralization and Lack of Reduction
Explanation:
In native English, unstressed syllables often contain a reduced vowel sound, typically a schwa (/ə/). Indian English speakers may pronounce these vowels more fully, without reduction.
For example, the word "banana" may have all vowels pronounced distinctly rather than reducing the middle syllable to a schwa.
Impact on ASR:
ASR systems may expect a reduced vowel and misalign the phonetic sequence when a full vowel is pronounced, leading to errors in word recognition.
The mismatch in expected vs. actual vowel quality affects the acoustic model's ability to accurately decode the speech signal.
d. Substitution of Vowel Qualities
Explanation:
Certain vowel sounds not present in Indian languages may be substituted with the closest native equivalent.
The vowel /æ/ (as in "cat") may be pronounced as /ɛ/ (as in "bet") or /a/ (as in "father").
Impact on ASR:
These substitutions can cause the ASR system to map the spoken input to unintended words with similar vowel sounds.
For example, "bag" may be misrecognized as "beg" or "bog," depending on the vowel substitution.

2. Consonant Phonological Changes

a. Use of Retroflex Consonants
Explanation:
Indian languages often use retroflex consonants ([ʈ], [ɖ], [ɳ]) articulated with the tongue curled back.
These sounds may replace alveolar stops (/t/, /d/) in Indian English, altering the place of articulation.
Impact on ASR:
The acoustic properties of retroflex sounds differ from alveolar stops, leading to misclassification by the ASR's acoustic model.
Words containing /t/ and /d/ may be misrecognized or confused with other words, increasing error rates.
b. Substitution of Dental Fricatives with Stops
Explanation:
The dental fricatives /θ/ ("thin") and /ð/ ("this") are not present in most Indian languages.
Indian English speakers often substitute these sounds with dental or alveolar stops /t/ and /d/, or sometimes with /s/ and /z/.
Impact on ASR:
This leads to minimal pairs sounding identical (e.g., "thin" pronounced as "tin," "then" as "den"), causing the ASR to misinterpret the intended word.
The substitution affects the phonetic accuracy expected by the ASR, resulting in incorrect transcriptions.
c. Voicing Distinctions and Devoicing
Explanation:
There may be less distinction between voiced and voiceless consonant pairs in Indian English.
Voiceless consonants (/p/, /t/, /k/) may be unaspirated and sound closer to their voiced counterparts (/b/, /d/, /g/).
Impact on ASR:
The reduced voicing contrast can lead the ASR to confuse words like "bat" and "pat," "cod" and "god."
The acoustic model may not correctly identify the consonant, leading to misrecognition.
d. Simplification of Consonant Clusters
Explanation:
Complex consonant clusters, particularly at word-initial positions, may be simplified.
Speakers might insert a vowel (epenthesis) within clusters (e.g., "school" pronounced as "iskool") or omit a consonant.
Impact on ASR:
The insertion or deletion alters the phoneme sequence, causing the ASR system to misalign the speech input with its models.
Words with simplified clusters may not be recognized or may be confused with other words.
e. Variations in Aspiration
Explanation:
In English, voiceless stops are aspirated at the beginning of stressed syllables, but this feature may be absent or inconsistently applied in Indian English.
Aspirated and unaspirated consonants may not be distinguished.
Impact on ASR:
The lack of expected aspiration can change the acoustic signal, causing the ASR to misidentify the consonant sound.
This can lead to errors in recognizing words where aspiration is a distinguishing feature.
f. Interchangeability of /v/ and /w/
Explanation:
Indian English speakers may not differentiate between /v/ and /w/ sounds, often pronouncing them similarly.
This is influenced by the phonology of Indian languages, where such distinctions may not exist.
Impact on ASR:
The confusion between /v/ and /w/ can result in misrecognition of words like "veil" and "wail," "very" and "wary."
ASR systems may struggle to correctly transcribe these words without clear acoustic distinctions.
g. Devoicing of Final Consonants
Explanation:
Voiced consonants at the end of words may be devoiced (e.g., "bag" pronounced with a final /k/ sound).
This can happen due to syllable-final devoicing common in some Indian languages.
Impact on ASR:
The devoiced consonant alters the expected phonetic realization, causing the ASR to misinterpret the word.
Words like "ride" may be heard as "right," leading to transcription errors.

3. Phonological Processes Affecting ASR

a. Epenthesis (Vowel Insertion)
Explanation:
To simplify pronunciation, speakers may insert a vowel between consonants (e.g., "film" as "filum," "help" as "helup").
This process eases articulation but changes the syllable structure.
Impact on ASR:
The added vowel results in an unexpected phoneme sequence, making it difficult for the ASR to match the input with the correct word.
The system may interpret the word as a different one or fail to recognize it altogether.
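The misalignment can be illustrated with Python's standard difflib: the epenthetic vowel surfaces as an insertion when the spoken phoneme sequence is aligned against the dictionary form (the transcriptions below are illustrative):

```python
import difflib

# Expected (dictionary) vs spoken (epenthetic) phoneme sequences for "film";
# the spoken form "filum" has a vowel inserted into the /lm/ cluster.
expected = ["f", "ɪ", "l", "m"]
spoken   = ["f", "ɪ", "l", "ʊ", "m"]

# Align the two sequences and report the edit operations.
matcher = difflib.SequenceMatcher(a=expected, b=spoken)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    print(op, expected[i1:i2], spoken[j1:j2])
```

The alignment shows a clean match up to /l/, then a pure insertion of /ʊ/; a decoder expecting the dictionary sequence must either absorb that extra frame of vowel evidence or drift into a different word hypothesis.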
b. Consonant Deletion
Explanation:
Consonants, especially in clusters or at the ends of words, may be omitted (e.g., "friend" pronounced as "fren," "cold" as "col").
This simplification affects the word's phonetic structure.
Impact on ASR:
Missing consonants lead to incomplete acoustic cues, causing the ASR to misrecognize the word or confuse it with a similar-sounding word.
c. Assimilation and Phoneme Alteration
Explanation:
Sounds may change due to the influence of neighboring sounds (e.g., "input" pronounced with a bilabial nasal /m/ instead of /n/ before /p/).
These assimilation processes can differ from those in native English.
Impact on ASR:
The altered sounds deviate from the expected pronunciation patterns, reducing ASR accuracy.
The system may not correctly identify the assimilated phonemes.

4. Influence on ASR Results

Mismatch with Acoustic Models
ASR systems are trained on speech data that assumes certain phonetic realizations of phonemes. Phonological variations in Indian English introduce discrepancies between the expected and actual acoustic signals.
Inadequate Lexicon Representation
Pronunciation dictionaries used in ASR may lack the variations found in Indian English, leading to incorrect phoneme-to-word mappings.
Increased Word Error Rates (WER)
The cumulative effect of phonological changes results in higher WER for Indian English speakers, as the ASR struggles to correctly recognize and transcribe speech.
Reduced Language Model Effectiveness
Language models predict word sequences based on training data. Phonological variations can lead to unexpected word sequences or misrecognized words, reducing the efficacy of these models in decoding speech.

5. Mitigation Strategies

a. Data Collection and Training
Collect extensive speech data from Indian English speakers to capture the range of phonological variations.
Train acoustic models on this data to improve recognition of altered phonemes and pronunciations.
b. Lexicon Expansion
Update the pronunciation dictionary to include alternative pronunciations common in Indian English.
Use multiple pronunciations for words to cover variations in vowel and consonant realizations.
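One way to generate such alternative pronunciations is to apply the substitution rules documented above to the base lexicon entries. A minimal sketch, assuming a handful of illustrative rules and a toy lexicon (neither reflects any specific ASR system):

```python
# Single-phoneme substitution rules drawn from the variations described
# above: dental fricatives -> stops, and the /v/~/w/ merger.
RULES = [("θ", "t"), ("ð", "d"), ("v", "w")]

def expand(pron):
    """Return the base pronunciation plus every single-rule variant."""
    variants = {pron}
    for src, dst in RULES:
        if src in pron:
            variants.add(tuple(dst if p == src else p for p in pron))
    return variants

# Toy lexicon with illustrative phoneme tuples.
lexicon = {"thin": ("θ", "ɪ", "n"), "very": ("v", "ɛ", "r", "i")}
expanded = {word: expand(pron) for word, pron in lexicon.items()}
```

After expansion, an input pronounced /t ɪ n/ can still map back to "thin", at the cost of increased homophony with "tin" that the language model must resolve.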
c. Acoustic Model Adaptation
Employ techniques like transfer learning to adapt existing models to the acoustic characteristics of Indian English.
Use accent adaptation methods to fine-tune models for better performance with specific accents.
d. End-to-End ASR Systems
Leverage end-to-end neural network models that learn to map audio directly to text, potentially capturing accent variations more effectively without relying on predefined phoneme models.
e. Speaker and Accent Identification
Implement systems that first identify the speaker's accent and then apply accent-specific models or adjustments to improve recognition accuracy.
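The two-stage idea can be sketched as follows; both the accent classifier and the per-accent recognizers are placeholders, not real implementations:

```python
def identify_accent(audio) -> str:
    # Placeholder: a real system would run an accent classifier on the audio.
    return "indian_english"

def recognize(audio, models: dict) -> str:
    """Stage 1: identify the accent. Stage 2: route to an accent-specific
    recognizer, falling back to a general model for unknown accents."""
    accent = identify_accent(audio)
    model = models.get(accent, models["general"])
    return model(audio)

# Stand-in recognizers keyed by accent label.
models = {
    "general": lambda audio: "<general transcript>",
    "indian_english": lambda audio: "<accent-adapted transcript>",
}
print(recognize(b"...", models))
```

The fallback path matters in practice: a misrouted accent-specific model can perform worse than the general one, so routing confidence should gate the switch.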
f. Phoneme Set Modification
Adjust the phoneme set used by the ASR to include phonemes or allophones present in Indian English but absent in native English.
This allows the acoustic model to account for sounds like retroflex consonants.
g. Pronunciation Learning
Incorporate pronunciation learning algorithms that adapt to the speaker's phonological patterns over time.
This user-specific adaptation can significantly improve ASR performance for frequent users.

6. Conclusion

Phonological changes in vowels and consonants among Indian English speakers present significant challenges for ASR systems. These changes alter the acoustic realization of speech sounds, leading to mismatches with ASR models trained on native English pronunciations. Understanding these variations is essential for:
Enhancing ASR systems to be more inclusive and effective for non-native English accents.
Reducing recognition errors and improving user satisfaction among Indian English speakers.
Developing robust ASR technologies that can handle the diversity of global English pronunciations.
By implementing targeted strategies such as collecting accent-specific data, adapting acoustic models, expanding pronunciation lexicons, and utilizing advanced ASR architectures, we can mitigate the impact of phonological changes on ASR performance.

Note: This analysis is based on linguistic research on Indian English phonology and its implications for ASR systems. It highlights the importance of considering phonological variations in developing and refining speech recognition technologies for diverse user populations.
Modified at 2025-08-09 11:58:18