We are seeking a Speech AI Engineer to design, build, and deploy intelligent speech systems that transcribe human speech into accurate text and detect emotional tone and intent from voice.
Responsibilities:
- Design and develop speech-to-text (ASR) models using modern architectures such as Whisper, Wav2Vec2, or Conformer.
- Build speech emotion recognition (SER) models to classify tone, emotion, and mood from voice inputs.
- Collect, preprocess, and annotate custom speech datasets — dataset creation is part of this role.
- Apply data augmentation and noise-robust training for better real-world performance.
- Implement quantization, pruning, and optimization of models for real-time inference on servers.
- Develop and expose REST APIs for model access using FastAPI, ensuring scalability and security (a minimal serving sketch follows this list).
- Integrate speech models with text-processing and chatbot systems for unified voice–text experiences.
- Manage the full ML lifecycle — training, validation, deployment, monitoring, and continuous improvement.
- Containerize and deploy models using Docker and orchestrate services via Kubernetes.
- Continuously explore state-of-the-art speech and multimodal AI research to improve accuracy and reduce latency.
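To make the serving responsibility concrete, here is a minimal, hypothetical sketch of a FastAPI endpoint wrapping a Hugging Face Whisper ASR pipeline. The checkpoint name, route, and absence of auth or batching are illustrative assumptions, not the team's actual stack; the Transformers pipeline decodes the uploaded bytes via ffmpeg.

```python
# Hypothetical sketch only: a minimal FastAPI transcription endpoint wrapping a
# Hugging Face Whisper pipeline. Checkpoint, route, and response schema are
# placeholders, not a prescribed production design.
from fastapi import FastAPI, File, UploadFile
from transformers import pipeline

app = FastAPI(title="ASR demo")

# Load the model once at startup; "openai/whisper-small" is an illustrative checkpoint.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

@app.post("/transcribe")
async def transcribe(audio: UploadFile = File(...)) -> dict:
    # The pipeline accepts raw audio bytes and handles decoding/resampling internally
    # (ffmpeg must be available on the host).
    audio_bytes = await audio.read()
    result = asr(audio_bytes)
    return {"filename": audio.filename, "text": result["text"]}
```

A sketch like this would typically be launched with `uvicorn main:app` (assuming the file is `main.py`) and would still need authentication, request limits, and batching before production use.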
Required Skills & Experience:
- Hands-on experience in speech recognition, speech emotion detection, or speaker identification.
- Proficiency in Python and deep learning frameworks (PyTorch, TensorFlow, Hugging Face Transformers).
- Strong understanding of audio feature extraction (MFCC, log-mel spectrograms) and sequence-modeling objectives such as CTC loss (see the feature-extraction sketch after this list).
- Experience with model quantization and optimization (ONNX, TensorRT, TorchScript); an export example follows the list.
- Proficiency with FastAPI, Docker, and Kubernetes for scalable deployment.
- Familiarity with CI/CD, MLOps, and model monitoring workflows.
- Knowledge of GPU acceleration and multi-model inference management.
- Familiarity with databases (MongoDB, PostgreSQL, Redis).
- Practical experience with real-time or streaming audio pipelines is a strong plus.
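For illustration only, a short sketch of log-mel feature extraction with torchaudio. The 16 kHz sample rate, 400-point FFT, 160-sample hop, and 80 mel bins are common Whisper-style defaults assumed here, not requirements of the role.

```python
# Illustrative log-mel spectrogram extraction with torchaudio; parameter values
# are assumptions chosen to mirror common ASR front-ends.
import torch
import torchaudio

def log_mel_features(path: str, target_sr: int = 16000) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)          # (channels, samples)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=target_sr,
        n_fft=400,
        hop_length=160,
        n_mels=80,
    )(waveform.mean(dim=0))                       # mono mix-down -> (n_mels, frames)
    return torch.log(mel + 1e-6)                  # log compression for numerical stability
```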
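Likewise, a hedged example of the kind of optimization work mentioned above: exporting a Wav2Vec2 CTC model to ONNX and applying dynamic INT8 quantization with onnxruntime. The checkpoint, file paths, and opset version are placeholders.

```python
# Assumed workflow sketch: ONNX export plus dynamic quantization of a CTC model.
# Checkpoint and output paths are illustrative only.
import torch
from transformers import Wav2Vec2ForCTC
from onnxruntime.quantization import quantize_dynamic, QuantType

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h", return_dict=False
).eval()
dummy = torch.randn(1, 16000)  # one second of 16 kHz audio as a tracing input

torch.onnx.export(
    model,
    dummy,
    "wav2vec2.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={"input_values": {0: "batch", 1: "samples"}},
    opset_version=14,
)

# Dynamic quantization converts weights to INT8 for faster CPU inference.
quantize_dynamic("wav2vec2.onnx", "wav2vec2.int8.onnx", weight_type=QuantType.QInt8)
```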