O.E. Scharenborg | TU Delft Repository

Improving child speech recognition with augmented child-like speech

Conference paper (2024) - Y. Zhang (author), Z. Yue (author), T.B. Patel (author), O.E. Scharenborg (author)

State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) c ...

Improving End-to-End Models for Children’s Speech Recognition

Journal article (2024) - T.B. Patel (author), O.E. Scharenborg (author)

Children’s Speech Recognition (CSR) is a challenging task due to the high variability in children’s speech patterns and limited amount of available annotated children’s speech data. We aim to improve CSR in the often-occurring scenario that no children’s speech data is available ...

Automatic evaluation of spontaneous oral cancer speech using ratings from naive listeners

Journal article (2023) - B.M. Halpern (author), B.M. Halpern (author), B.M. Halpern (author), S. Feng (author), Rob van Son (author), Rob van Son (author), Michiel van den Brekel (author), Michiel van den Brekel (author), O.E. Scharenborg (author)

In this paper, we build and compare multiple speech systems for the automatic evaluation of the severity of a speech impairment due to oral cancer, based on spontaneous speech. To be able to build and evaluate such systems, we collected a new spontaneous oral cancer speech corpus ...

Improving Adaptive Learning Models Using Prosodic Speech Features

Conference paper (2023) - Thomas Wilschut (author), Florian Sense (author), O.E. Scharenborg (author), Hedderik van Rijn (author)

Cognitive models of memory retrieval aim to describe human learning and forgetting over time. Such models have been successfully applied in digital systems that aid in memorizing information by adapting to the needs of individual learners. The memory models used in these systems ...

BIAS in Flemish automatic speech recognition

Conference paper (2023) - Aaricia Herygers (author), Vass Verkhodanova (author), Matt Coler (author), O.E. Scharenborg (author), Munir Georges (author), Munir Georges (author)

Research has shown that automatic speech recognition (ASR) systems exhibit biases against different speaker groups, e.g., based on age or gender. This paper presents an investigation into bias in recent Flemish ASR. Seeing as Belgian Dutch, which is also known as Flemish, is ofte ...

AnyoneNet

Synchronized Speech and Talking Head Generation for Arbitrary Persons

Journal article (2023) - X. Wang (author), X. Wang (author), X. Wang (author), Qicong Xie (author), Lei Xie (author), Jihua Zhu (author), O.E. Scharenborg (author)

Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos ...

Neural representations of non-native speech reflect proficiency and interference from native language knowledge

Journal article (2023) - Christian Brodbeck (author), Katerina Danae Kandylaki (author), O.E. Scharenborg (author)

Learning to process speech in a foreign language involves learning new representations for mapping the auditory signal to linguistic structure. Behavioral experiments suggest that even listeners that are highly proficient in a non-native language experience interference from repr ...

DAIS: The Delft Database of EEG Recordings of Dutch Articulated and Imagined Speech

Conference paper (2023) - Bo Dekker (author), Alfred C. Schouten (author), AC Schouten (author), A.C. Schouten (author), O.E. Scharenborg (author)

Silent speech interfaces could enable people who lost the ability to use their voice or gestures to communicate with the external world, e.g., through decoding the person’s brain signals when imagining speech. Only a few and small databases exist that allow for the development an ...

Exploring Data Augmentation in Bias Mitigation Against Non-Native-Accented Speech

Conference paper (2023) - Y. Zhang (author), Aaricia Herygers (author), T.B. Patel (author), Z. Yue (author), O.E. Scharenborg (author)

Automatic speech recognition (ASR) should serve every speaker, not only the majority “standard” speakers of a language. In order to build inclusive ASR, mitigating the bias against speaker groups who speak in a “non-standard” or “diverse” way is crucial. We aim to mitigate the bi ...

The Multimodal Information Based Speech Processing (Misp) 2022 Challenge

Audio-Visual Diarization And Recognition

Conference paper (2023) - Zhe Wang (author), Shilong Wu (author), Diyuan Liu (author), More Authors..., Hang Chen (author), Mao-Kui He (author), Jun Du (author), Chin-Hui Lee (author), Jingdong Chen (author), Shinji Watanabe (author), Sabato Marco Siniscalchi (author), Sabato Marco Siniscalchi (author), O.E. Scharenborg (author)

The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting the research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 ch ...

Improving Whispered Speech Recognition Performance Using Pseudo-Whispered Based Data Augmentation

Conference paper (2023) - Zhaofeng Lin (author), T.B. Patel (author), O.E. Scharenborg (author)

Whispering is a distinct form of speech known for its soft, breathy, and hushed characteristics, often used for private communication. The acoustic characteristics of whispered speech differ substantially from normally phonated speech and the scarcity of adequate training data le ...

Towards inclusive automatic speech recognition

Journal article (2023) - S. Feng (author), B.M. Halpern (author), B.M. Halpern (author), B.M. Halpern (author), O. Kudina (author), Olya Kudina (author), O.E. Scharenborg (author)

Practice and recent evidence show that state-of-the-art (SotA) automatic speech recognition (ASR) systems do not perform equally well for all speaker groups. Many factors can cause this bias against different speaker groups. This paper, for the first time, systematically quantifi ...

The Presence of Background Noise Extends the Competitor Space in Native and Non-Native Spoken-Word Recognition

Insights from Computational Modeling

Journal article (2022) - Themis Karaminis (author), Florian Hintz (author), O.E. Scharenborg (author)

Oral communication often takes place in noisy environments, which challenge spoken-word recognition. Previous research has suggested that the presence of background noise extends the number of candidate words competing with the target word for recognition and that this extension ...

Towards Identity Preserving Normal to Dysarthric Voice Conversion

Conference paper (2022) - Wen-Chin Huang (author), B.M. Halpern (author), B.M. Halpern (author), B.M. Halpern (author), Lester Phillip Violeta (author), O.E. Scharenborg (author), Tomoki Toda (author)

We present a voice conversion framework that converts normal speech into dysarthric speech while preserving the speaker identity. Such a framework is essential for (1) clinical decision making processes and alleviation of patient stress, (2) data augmentation for dysarthric speec ...

The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition

Journal article (2022) - Luke Prananta (author), B.M. Halpern (author), B.M. Halpern (author), B.M. Halpern (author), S. Feng (author), O.E. Scharenborg (author)

In this paper, we investigate several existing and a new state-of-the-art generative adversarial network-based (GAN) voice conversion method for enhancing dysarthric speech for improved dysarthric speech recognition. We compare key components of existing methods as part of a rigo ...

Discovering phonetic inventories with crosslingual automatic speech recognition

Journal article (2022) - Piotr Żelasko (author), S. Feng (author), Laureano Moro Velázquez (author), Ali Abavisani (author), Saurabhchand Bhati (author), O.E. Scharenborg (author), Mark Hasegawa-Johnson (author), Najim Dehak (author)

The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual train ...

Modelling Human Word Learning and Recognition Using Visually Grounded Speech

Journal article (2022) - D.G.M. Merkx (author), D.G.M. Merkx (author), D.G.M. Merkx (author), Sebastiaan Scholten (author), Stefan L. Frank (author), Mirjam Ernestus (author), O.E. Scharenborg (author)

Many computational models of speech recognition assume that the set of target words is already given. This implies that these models learn to recognise speech in a biologically unrealistic manner, i.e. with prior lexical knowledge and explicit supervision. In contrast, visually g ...

Comparing data augmentation and training techniques to reduce bias against non-native accents in hybrid speech recognition systems

Conference paper (2022) - Yixuan Zhang (author), Y. Zhang (author), T.B. Patel (author), O.E. Scharenborg (author)

One important problem that needs tackling for wide deployment of Automatic Speech Recognition (ASR) is the bias in ASR, i.e., ASRs tend to generate more accurate predictions for certain speaker groups while making more errors on speech from other groups. We aim to reduce bias aga ...

The differential roles of lexical and sublexical processing during spoken-word recognition in clear and in noise

Journal article (2022) - Antje Strauß (author), Tongyu Wu (author), James M. McQueen (author), O.E. Scharenborg (author), Florian Hintz (author)

Successful spoken-word recognition relies on interplay between lexical and sublexical processing. Previous research demonstrated that listeners readily shift between more lexically-biased and more sublexically-biased modes of processing in response to the situational context in w ...

Successful spoken-word recognition relies on interplay between lexical and sublexical processing. Previous research demonstrated that listeners readily shift between more lexically-biased and more sublexically-biased modes of processing in response to the situational context in which language comprehension takes place. Recognizing words in the presence of background noise reduces the perceptual evidence for the speech signal and – compared to the clear – results in greater uncertainty. It has been proposed that, when dealing with greater uncertainty, listeners rely more strongly on sublexical processing. The present study tested this proposal using behavioral and electroencephalography (EEG) measures. We reasoned that such an adjustment would be reflected in changes in the effects of variables predicting recognition performance with loci at lexical and sublexical levels, respectively. We presented native speakers of Dutch with words featuring substantial variability in (1) word frequency (locus at lexical level), (2) phonological neighborhood density (loci at lexical and sublexical levels) and (3) phonotactic probability (locus at sublexical level). Each participant heard each word in noise (presented at one of three signal-to-noise ratios) and in the clear and performed a two-stage lexical decision and transcription task while EEG was recorded. Using linear mixed-effects analyses, we observed behavioral evidence that listeners relied more strongly on sublexical processing when speech quality decreased. Mixed-effects modelling of the EEG signal in the clear condition showed that sublexical effects were reflected in early modulations of ERP components (e.g., within the first 300 msec post word onset). In noise, EEG effects occurred later and involved multiple regions activated in parallel. Taken together, we found evidence – especially in the behavioral data – supporting previous accounts that the presence of background noise induces a stronger reliance on sublexical processing.

@en

Using cross-model learnings for the Gram Vaani ASR Challenge 2022

Journal article (2022) - T.B. Patel (author), O.E. Scharenborg (author)

In the diverse and multilingual land of India, Hindi is spoken as a first language by a majority of its population. Efforts are made to obtain data in terms of audio, transcriptions, dictionary, etc. to develop speech-technology applications in Hindi. Similarly, the Gram-Vaani AS ...