Voxxwire: Designing a Privacy-Preserving, Fully Offline Speech Translation System for Real-Time Cross-Lingual Communication
DOI: https://doi.org/10.51094/jxiv.3482

Keywords: privacy-preserving speech translation, offline inference, voice activity detection, automatic speech recognition, neural machine translation, text-to-speech synthesis, edge AI deployment, on-device processing

Abstract
Multilingual communication in virtual meetings has become a cornerstone of modern global collaboration, yet the tools most people rely on for speech translation still depend heavily on cloud infrastructure. This creates a troubling combination of privacy exposure, network dependency, and recurring costs that limits their adoption in sensitive or resource-constrained environments. In this paper, we introduce Voxxwire, an open-source desktop application that performs complete end-to-end speech translation entirely on the user’s local machine—without sending a single byte of data over the internet.
Our system brings together four neural components within a unified asynchronous pipeline: Silero-based voice activity detection for segmenting speech, CTranslate2-accelerated Whisper inference for multilingual recognition, Argos Translate for offline neural machine translation across more than 49 language directions, and Piper for VITS-based speech synthesis. A key innovation is the dual-channel architecture that simultaneously processes both microphone input and system audio loopback, enabling bidirectional translation during live video-conference sessions. To tackle the re-translation feedback loop that inevitably arises when synthesized audio is recaptured by the loopback channel, we devised a deterministic timing gate that suppresses capture during playback windows—a lightweight alternative to computationally expensive acoustic echo cancellation.
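The deterministic timing gate described above can be sketched as follows. This is a minimal illustration of the general idea, not Voxxwire's actual implementation: loopback frames are dropped while locally synthesized audio is playing, plus a short guard band for output latency. All names here (`PlaybackGate`, `margin_s`) are hypothetical.

```python
import time


class PlaybackGate:
    """Suppresses loopback capture during (and briefly after) TTS playback."""

    def __init__(self, margin_s: float = 0.25):
        self._block_until = 0.0    # monotonic timestamp when capture may resume
        self._margin_s = margin_s  # guard band for audio-device output latency

    def notify_playback(self, duration_s: float) -> None:
        """Call when synthesized audio of `duration_s` seconds starts playing."""
        end = time.monotonic() + duration_s + self._margin_s
        # Extend, never shorten, the blocked window (overlapping playbacks).
        self._block_until = max(self._block_until, end)

    def capture_allowed(self) -> bool:
        """True if loopback frames may be forwarded to the ASR stage."""
        return time.monotonic() >= self._block_until


# Usage: the loopback capture thread checks the gate before queuing frames,
# and the TTS playback thread calls notify_playback() for each utterance.
gate = PlaybackGate(margin_s=0.05)
gate.notify_playback(duration_s=0.1)
```

Because the gate only compares monotonic timestamps, it adds no per-frame signal processing, which is what makes it a cheap alternative to acoustic echo cancellation.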
Conflict of Interest Disclosure
The authors declare that they have no conflicts of interest.
References
CSA Research, "Can’t Read, Won’t Buy – B2C: How language and localization drive global purchasing decisions," Tech. Rep., 2024.
Google LLC, "Google Translate: Conversation Mode," 2024. [Online]. Available: https://translate.google.com
Microsoft Corporation, "Microsoft Translator – Real-time translation for conversations and meetings," 2024. [Online]. Available: https://www.microsoft.com/translator
Apple Inc., "Translation – Apple Developer Documentation," iOS 18, 2024.
DeepL SE, "DeepL API Documentation," 2024. [Online]. Available: https://www.deepl.com/docs-api
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proc. 40th Int. Conf. Mach. Learn. (ICML), 2023, pp. 28492–28518.
SYSTRAN, "faster-whisper: CTranslate2-optimized Whisper inference," GitHub repository, 2023. [Online]. Available: https://github.com/SYSTRAN/faster-whisper
Silero Team, "Silero VAD: Pre-trained enterprise-grade voice activity detector," 2021. [Online]. Available: https://github.com/snakers4/silero-vad
A. Benyassine et al., "ITU-T Recommendation G.729 Annex B: A silence compression scheme for V.70 applications," IEEE Commun. Mag., vol. 35, no. 9, pp. 64–73, 1997.
Argos Open Technologies, LLC, "Argos Translate: Open-source offline translation library written in Python," 2021. [Online]. Available: https://github.com/argosopentech/argos-translate
M. Junczys-Dowmunt et al., "Marian: Fast neural machine translation in C++," in Proc. ACL 2018 System Demonstrations, 2018, pp. 116–121.
M. Hansen, "Piper: A fast, local neural text-to-speech system," 2023. [Online]. Available: https://github.com/rhasspy/piper
Research Intelo, "Offline Translation Models on Device Market Research Report 2033," 2025. Global market valued at $1.9B in 2024, projected to reach $6.7B by 2033.
KUDO AI, "AI Speech Translation in 2025 & Beyond: Data & Trends," Jan. 2025. Edge AI market for speech translation projected to grow by 35% in 2025.
S. Deng et al., "Edge intelligence: The confluence of edge computing and artificial intelligence," IEEE Internet Things J., vol. 7, no. 8, pp. 6694–6747, 2020.
S. Laskaridis et al., "Melting point: Mobile evaluation of language transformers," arXiv preprint arXiv:2403.12844, 2024.
European Parliament and Council of the EU, "General Data Protection Regulation (GDPR)," Regulation (EU) 2016/679, 2016.
"Real-Time Speech-to-Text on Edge: A Prototype System for Ultra-Low Latency Communication with AI-Powered NLP," Information, vol. 16, no. 8, Art. 685, Aug. 2025.
J. Btia and G. David, "Embedded Implementation of Speech-to-Text Translation Using Compressed Deep Neural Networks," Nat. J. Signal Image Process., vol. 1, no. 3, pp. 39–47, 2025.
Published
Submitted: 2026-03-18 09:55:04 UTC
Published: 2026-05-12 10:36:53 UTC

License
Copyright (c) 2026
Parmar, Jay
Raj
Darshan Ramoliya
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
