About Aldea
Headquartered in Miami, Aldea is a next-generation AI company focused on voice-based clinical and expert applications. Our flagship product, Advisor, uses proprietary AI to scale the impact of world-class minds across personal development, finance, parenting, relationships, and more, with faster, more cost-effective performance than traditional models.
As a multidisciplinary team of builders, researchers, and product thinkers, we value clear thinking, sharp writing, and strong intuition for what people need.
This is a rare opportunity to join an early-stage startup that will help define a new category.
Aldea is a multi-modal foundational AI company reimagining the scaling laws of intelligence. We believe today's architectures create unnecessary bottlenecks for the evolution of software. Our mission is to build the next generation of foundational models that power a more expressive, contextual, and intelligent human–machine interface.
The Role
We are seeking a Foundational AI Research Scientist (Speech) to advance the frontier of speech understanding and generation. You will lead applied research in speech-to-text (STT), text-to-speech (TTS), and speech-to-speech modeling, designing architectures and training strategies that redefine fidelity, controllability, and efficiency in voice-based systems.
This role blends deep research expertise with strong engineering intuition. You'll drive end-to-end experimentation, from model design and training-pipeline setup to empirical validation, and help translate breakthroughs into production-grade systems.
What You'll Do
- Research and prototype novel architectures for STT, TTS, and speech-to-speech modeling.
- Design and execute experiments validating new methods for scalability, performance, and quality.
- Collaborate cross-functionally with engineering teams to integrate research into real-world products.
- Stay current with foundational research in speech processing and generative modeling.
Minimum Qualifications
- Ph.D. in Computer Science, Engineering, or a related field.
- 3+ years of relevant industry experience.
- Demonstrated experience training or researching TTS, STT, or speech-to-speech models.
- Deep understanding of modern sequence-modeling architectures, including State Space Models (SSMs), sparse attention mechanisms, Mixture of Experts (MoE), and linear attention variants.
- Proven experience with pre-training foundational models from scratch on large-scale datasets.
- Track record of working with massive multi-modal datasets (audio, text, and speech corpora at scale).
- Deep expertise in PyTorch, Transformers, and modern deep-learning frameworks.
- Ability to translate complex research ideas into high-performance, maintainable code.
- Evidence of research excellence through impactful technical contributions.
Nice to Have
- Experience with voice-based AI applications or multi-speaker synthesis.
- Publication record in top-tier venues (ICML, NeurIPS, ICLR, ICASSP, Interspeech).
- Background in cross-lingual or multilingual speech systems.
- Experience with data curation, filtering, and quality assessment pipelines for speech data.
Compensation & Benefits
- Competitive base salary
- Performance-based bonus aligned with research milestones
- Equity participation
- Comprehensive health, dental, and vision coverage
- Flexible paid time off