We are seeking someone to drive the development of innovative Generative AI solutions, with a focus on introducing cutting-edge speech interfaces. This role involves integrating ASR (Automatic Speech Recognition, also known as Speech-to-Text), LLM (Large Language Model), and TTS (Text-to-Speech) systems, initially targeting English and Indian English, with plans to expand incrementally to other Indian languages. We are looking for candidates who have relevant experience in, or have contributed to, the implementation of similar work.
This position requires a minimum of 5 years of industry experience, including at least 2 years dedicated to speech interfaces (ASR and/or TTS).
Key Responsibilities:
Collaborate to design and implement ASR + LLM + TTS pipelines for real-time and semi-real-time applications.
Build or fine-tune ASR and TTS models optimized for low-resource environments.
Expand language capabilities step-by-step, starting with English and Indian English, followed by other Indian languages.
Manage the collection and preprocessing of large-scale audio datasets for both ASR and TTS.
Develop backend web services to integrate ASR, TTS, and LLM components for seamless user experiences.
Ensure the integration of speech-enabled Generative AI applications with platforms like automotive systems and phone helplines.
Continuously monitor advancements in ASR/TTS, including model architectures, open-source models, and services focused on Indian languages.
Required Experience:
1. Hands-on experience with on-premises ASR and TTS models, including fine-tuning and deployment.
2. Proficiency in using cloud-based ASR and TTS services, such as Google Speech-to-Text, Amazon Polly, or Azure Cognitive Services.
3. Familiarity with translation systems for multilingual applications.
4. Knowledge of state-of-the-art ASR and TTS architectures and their incremental/streaming generation capabilities.
5. Expertise in integrating and "gluing" modules for real-time and semi-real-time audio applications.
6. Strong understanding of ASR, TTS, and LLM models that support incremental generation for real-time or semi-real-time applications.
7. Proven experience in managing audio data collection pipelines, including annotation, preprocessing, and storage.
8. Strong experience in backend web development using Python frameworks such as Flask or FastAPI.
9. Proficiency in asynchronous programming in Python to enable efficient real-time operations.
10. Experience with at least one major cloud platform (e.g., AWS, Azure, Google Cloud).
11. Familiarity with serverless solutions for scalable and cost-effective deployments.
12. Experience in training or fine-tuning deep neural network (DNN) models for speech applications (ASR/TTS).
13. Experience with asynchronous programming, streaming APIs, and event-driven architectures, preferably focused on real-time operations.
14. Delivery of at least one functional solution in the speech domain (ASR/TTS) that is still in operation.