Voxxwire: Designing a Privacy-Preserving, Fully Offline Speech Translation System for Real-Time Cross-Lingual Communication
DOI: https://doi.org/10.51094/jxiv.3482

Keywords: privacy-preserving speech translation, offline inference, voice activity detection, automatic speech recognition, neural machine translation, text-to-speech synthesis, edge AI deployment, on-device processing

Abstract
Multilingual communication in virtual meetings has become a cornerstone of modern global collaboration, yet the tools most people rely on for speech translation still depend heavily on cloud infrastructure. This creates a troubling combination of privacy exposure, network dependency, and recurring costs that limits their adoption in sensitive or resource-constrained environments. In this paper, we introduce Voxxwire, an open-source desktop application that performs complete end-to-end speech translation entirely on the user’s local machine—without sending a single byte of data over the internet.
Our system brings together four neural components within a unified asynchronous pipeline: Silero-based voice activity detection for segmenting speech, CTranslate2-accelerated Whisper inference for multilingual recognition, Argos Translate for offline neural machine translation across more than 49 language directions, and Piper for VITS-based speech synthesis. A key innovation is the dual-channel architecture that simultaneously processes both microphone input and system audio loopback, enabling bidirectional translation during live video-conference sessions. To tackle the re-translation feedback loop that inevitably arises when synthesized audio is recaptured by the loopback channel, we devised a deterministic timing gate that suppresses capture during playback windows—a lightweight alternative to computationally expensive acoustic echo cancellation.
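The deterministic timing gate described above can be sketched as follows. This is a minimal illustration of the general idea, not Voxxwire's actual implementation: loopback frames are dropped while locally synthesized audio is playing, plus a short guard band for output latency. All names here (`PlaybackGate`, `margin_s`) are hypothetical.

```python
import time


class PlaybackGate:
    """Suppresses loopback capture during (and briefly after) TTS playback."""

    def __init__(self, margin_s: float = 0.25):
        self._block_until = 0.0    # monotonic timestamp when capture may resume
        self._margin_s = margin_s  # guard band for audio-device output latency

    def notify_playback(self, duration_s: float) -> None:
        """Call when synthesized audio of `duration_s` seconds starts playing."""
        end = time.monotonic() + duration_s + self._margin_s
        # Extend, never shorten, the blocked window (overlapping playbacks).
        self._block_until = max(self._block_until, end)

    def capture_allowed(self) -> bool:
        """True if loopback frames may be forwarded to the ASR stage."""
        return time.monotonic() >= self._block_until


# Usage: the loopback capture thread checks the gate before queuing frames,
# and the TTS playback thread calls notify_playback() for each utterance.
gate = PlaybackGate(margin_s=0.05)
gate.notify_playback(duration_s=0.1)
```

Because the gate only compares monotonic timestamps, it adds no per-frame signal processing, which is what makes it a cheap alternative to acoustic echo cancellation.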
Conflict of Interest Disclosure
The authors declare that they have no conflicts of interest.
References
CSA Research, "Can’t Read, Won’t Buy – B2C: How language and localization drive global purchasing decisions," Tech. Rep., 2024.
Google LLC, "Google Translate: Conversation Mode," 2024. [Online]. Available: https://translate.google.com
Microsoft Corporation, "Microsoft Translator – Real-time translation for conversations and meetings," 2024. [Online]. Available: https://www.microsoft.com/translator
Apple Inc., "Translation – Apple Developer Documentation," iOS 18, 2024.
DeepL SE, "DeepL API Documentation," 2024. [Online]. Available: https://www.deepl.com/docs-api
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proc. 40th Int. Conf. Mach. Learn. (ICML), 2023, pp. 28492–28518.
SYSTRAN, "faster-whisper: CTranslate2-optimized Whisper inference," GitHub repository, 2023. [Online]. Available: https://github.com/SYSTRAN/faster-whisper
Silero Team, "Silero VAD: Pre-trained enterprise-grade voice activity detector," 2021. [Online]. Available: https://github.com/snakers4/silero-vad
A. Benyassine et al., "ITU-T Recommendation G.729 Annex B: A silence compression scheme for V.70 applications," IEEE Commun. Mag., vol. 35, no. 9, pp. 64–73, 1997.
Argos Open Technologies, LLC, "Argos Translate: Open-source offline translation library written in Python," 2021. [Online]. Available: https://github.com/argosopentech/argos-translate
M. Junczys-Dowmunt et al., "Marian: Fast neural machine translation in C++," in Proc. ACL 2018 System Demonstrations, 2018, pp. 116–121.
M. Hansen, "Piper: A fast, local neural text-to-speech system," 2023. [Online]. Available: https://github.com/rhasspy/piper
Research Intelo, "Offline Translation Models on Device Market Research Report 2033," 2025. Global market valued at $1.9B in 2024, projected to reach $6.7B by 2033.
KUDO AI, "AI Speech Translation in 2025 & Beyond: Data & Trends," Jan. 2025. Edge AI market for speech translation projected to grow by 35% in 2025.
S. Deng et al., "Edge intelligence: The confluence of edge computing and artificial intelligence," IEEE Internet Things J., vol. 7, no. 8, pp. 6694–6747, 2020.
S. Laskaridis et al., "Melting point: Mobile evaluation of language transformers," arXiv preprint arXiv:2403.12844, 2024.
European Parliament and Council of the EU, "General Data Protection Regulation (GDPR)," Regulation (EU) 2016/679, 2016.
"Real-Time Speech-to-Text on Edge: A Prototype System for Ultra-Low Latency Communication with AI-Powered NLP," Information, vol. 16, no. 8, Art. 685, Aug. 2025.
J. Btia and G. David, "Embedded Implementation of Speech-to-Text Translation Using Compressed Deep Neural Networks," Nat. J. Signal Image Process., vol. 1, no. 3, pp. 39–47, 2025.
Published
Submitted: 2026-03-18 09:55:04 UTC
Published: 2026-05-12 10:36:53 UTC

License
Copyright (c) 2026
Parmar, Jay
Raj
Darshan Ramoliya
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
