AccentShift: A Streaming Framework for Real-Time Cross-Accent Speech Translation

Description

Accent variation is a persistent barrier to spoken communication and a documented source of bias in speech technology, where recognition accuracy degrades on non-native and regional English. We present AccentShift, a streaming framework for real-time cross-accent speech translation that converts spoken English from one accent toward a target accent while preserving lexical content. The system is a modular cascade of chunked automatic speech recognition (ASR), an optional language-model rewriting stage, and accent-conditioned neural text-to-speech, orchestrated over WebSocket connections for low-latency interaction. Rather than report simulated benchmarks, we ground the design in measurements on two public corpora. Across 345 recordings from six dialects of the British Isles, Whisper attains a 9.7% overall word error rate with a significant 8.6-point spread across dialects (one-way ANOVA, $p=0.039$), with Irish speech being the hardest. On 160 recordings from four non-native first-language groups, the overall error rate rises to 12.8%, confirming that non-native speakers bear a heavier recognition burden. On identical-text dialect recordings, speaking rate differs significantly across dialects and remains so after controlling for the gender imbalance in our sample ($p<0.001$), showing that accent is robustly encoded in surface timing. We complement these measurements with a self-consistent latency budget showing that sub-second response (480 ms in direct mode) is achievable when the combined real-time factor stays below one, with input buffering as the dominant cost. We are explicit about scope: we measure the accent gap and the latency envelope, and leave perceptual evaluation of conversion quality to future human studies.
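The latency argument in the description can be illustrated with a small sketch. It is not the authors' budget: the class name `StageBudget`, the additive latency model, and every number in the usage note are assumptions chosen for illustration. It also assumes that "direct mode" means the optional rewriting stage is skipped, which the description does not state. The sketch only shows why a combined real-time factor below one lets the cascade keep up with incoming audio, and why the buffered chunk length can dominate the response time.

```python
from dataclasses import dataclass


@dataclass
class StageBudget:
    """Hypothetical per-chunk latency model for an ASR -> (rewrite) -> TTS cascade."""

    chunk_ms: float            # audio buffered before any processing can start
    asr_rtf: float             # ASR processing time divided by audio duration
    tts_rtf: float             # TTS processing time divided by audio duration
    rewrite_rtf: float = 0.0   # assumed 0 when the rewriting stage is skipped
    network_ms: float = 0.0    # assumed fixed transport overhead per chunk

    @property
    def combined_rtf(self) -> float:
        # Stages run one after another, so their real-time factors add up.
        return self.asr_rtf + self.rewrite_rtf + self.tts_rtf

    def keeps_up(self) -> bool:
        # Below 1.0, each chunk is processed faster than the next one arrives,
        # so no backlog builds up over the course of a stream.
        return self.combined_rtf < 1.0

    def latency_ms(self) -> float:
        # Buffering delay, plus processing time for that chunk, plus transport.
        processing_ms = self.combined_rtf * self.chunk_ms
        return self.chunk_ms + processing_ms + self.network_ms


# Illustrative values only; none of them come from the paper.
budget = StageBudget(chunk_ms=300, asr_rtf=0.2, tts_rtf=0.3, network_ms=20)
overloaded = StageBudget(chunk_ms=300, asr_rtf=0.7, tts_rtf=0.5)
```

With these illustrative values, `budget.keeps_up()` is true and the 300 ms buffering term is larger than processing and transport combined, whereas `overloaded.keeps_up()` is false because its combined real-time factor exceeds one. Under this model, shortening the chunk reduces latency only as long as the real-time factors do not rise in response.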

Authors

DOI: 10.5281/zenodo.20753986

Publication Date: 2026-06-18


