JEPA-v0: a self-supervised audio encoder for real-time speech translation
Summary
The article introduces JEPA-v0, a self-supervised audio encoder designed for real-time speech-to-speech translation that preserves voice and prosody. It explains the architecture (a context encoder, a target encoder with an exponential moving average (EMA), and a predictor), compares learning strategies (masked reconstruction vs. contrastive learning), and discusses results on benchmarks. It closes with limitations and future directions: improving temporal resolution and frequency structure to enable better downstream translation.
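
As a rough illustration of the "target encoder with EMA" component, here is a minimal sketch assuming the common JEPA-style setup in which the target encoder's weights track an exponential moving average of the context encoder's weights. The summary does not give the update rule or its hyperparameters, so the function name `ema_update`, the dict-of-lists parameter layout, and the decay values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical parameter layout: name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(target: Params, context: Params, decay: float = 0.99) -> Params:
    """Return new target weights: decay * target + (1 - decay) * context."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    return {
        name: [
            decay * t + (1.0 - decay) * c
            for t, c in zip(target[name], context[name])
        ]
        for name in target
    }


# Toy usage: the target starts at zero and moves slowly toward the context weights.
context_params: Params = {"w": [1.0, 2.0]}
target_params: Params = {"w": [0.0, 0.0]}
target_params = ema_update(target_params, context_params, decay=0.9)
```

In this kind of setup the target encoder receives no gradients; only the context encoder and predictor are trained, and a decay close to 1 keeps the prediction targets changing slowly.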