Quezon City, Metro Manila, Philippines
12 hours ago
Data Analyst

As a Data Analyst, you will play a crucial role in the data preprocessing phase of our project to fine-tune the Whisper model for Taglish and other languages. Your responsibilities will include collecting, organizing, cleaning, and preparing high-quality multilingual data for model training. You will work closely with the machine learning team to ensure that the data meets the necessary standards for effective model training.

Key Responsibilities:

Data Collection and Organization:

Gather raw audio files in various formats (e.g., MP3, WAV, FLAC) from diverse sources such as interviews, podcasts, and YouTube videos. Organize files into a structured directory hierarchy, ensuring a clear and consistent file naming convention.

Audio Preprocessing:

Convert audio files to the required format (16 kHz, mono, 16-bit signed integer WAV) using tools like FFmpeg. Transcribe audio files, either manually or through a transcription service, and store each transcript as a text file whose name matches its audio file.
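
As an illustration of the conversion step, the following is a minimal sketch that assumes FFmpeg is installed and on the PATH. The function names and the output directory layout are hypothetical and not part of our actual pipeline; the FFmpeg options used are -ar (sample rate), -ac (channel count), and -c:a pcm_s16le (16-bit signed little-endian PCM).

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Return an FFmpeg command that writes 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-i", str(src),       # input file in any format FFmpeg can decode
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single (mono) channel
        "-c:a", "pcm_s16le",  # encode as 16-bit signed integer PCM
        str(dst),
    ]


def convert_to_target_wav(src: Path, out_dir: Path) -> Path:
    """Convert one audio file, keeping the original stem as the file name.

    Hypothetical helper: the naming scheme is an assumption for illustration.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / (src.stem + ".wav")
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return dst
```

Keeping the command construction separate from the call to subprocess.run makes the settings easy to review and test without running FFmpeg.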

Data Cleaning and Normalization:

Clean and normalize text data to address spelling variations, punctuation issues, and formatting inconsistencies. Standardize abbreviations and contractions, and remove special characters or unnecessary symbols.
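
To give a sense of what this cleaning can look like, here is a minimal sketch using only the Python standard library. The abbreviation table, the decision to lowercase, and the set of punctuation that is kept are all assumptions made for illustration; the real rules would be agreed with the machine learning team.

```python
import re
import unicodedata

# Hypothetical abbreviation table; real entries would come from the project.
ABBREVIATIONS = {
    "brgy.": "barangay",
    "dr.": "doctor",
}


def normalize_transcript(text: str) -> str:
    """Apply a simple, illustrative normalization pass to one transcript."""
    # Unify Unicode forms so visually identical strings compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Replace curly quotes with their plain ASCII equivalents.
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Lowercasing is an assumption; some pipelines preserve case.
    text = text.lower()
    # Expand known abbreviations token by token.
    text = " ".join(ABBREVIATIONS.get(word, word) for word in text.split())
    # Drop symbols other than letters, digits, whitespace, and basic punctuation.
    text = re.sub(r"[^\w\s'.,?!-]", "", text)
    # Collapse any repeated whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()
```

Because \w is Unicode-aware in Python 3, letters such as "ñ" survive the symbol filter.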

Data Segmentation and Labeling:

Split lengthy audio recordings into smaller, manageable segments. Create and maintain a metadata file that maps audio files to their corresponding transcriptions and alignment details.
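
The segmentation and metadata step could be sketched as follows. This example assumes fixed-length segments of at most 30 seconds (the window length Whisper processes) and a CSV manifest; the column names, the segment naming scheme, and both function names are hypothetical. A production pipeline would more likely cut at pauses in speech than at fixed offsets.

```python
import csv
import io


def plan_segments(duration_s: float, max_len_s: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording."""
    if duration_s <= 0 or max_len_s <= 0:
        raise ValueError("duration and segment length must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


def build_manifest_rows(audio_id: str, segments, transcripts) -> list[dict]:
    """Pair each segment with its transcript in one metadata row."""
    if len(segments) != len(transcripts):
        raise ValueError("every segment needs exactly one transcript")
    return [
        {
            "audio_file": f"{audio_id}_{i:04d}.wav",  # hypothetical naming scheme
            "source": audio_id,
            "start_s": start,
            "end_s": end,
            "text": text,
        }
        for i, ((start, end), text) in enumerate(zip(segments, transcripts))
    ]


def manifest_to_csv(rows: list[dict]) -> str:
    """Serialize manifest rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Recording the start and end times alongside each transcript keeps every segment traceable to its position in the source recording.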

Quality Assurance and Validation:

Conduct thorough quality checks to validate the dataset for accuracy, consistency, and completeness. Identify and resolve issues in the audio and text data, such as misalignments or incorrect transcriptions.
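
Some of these checks can be automated. The sketch below uses Python's standard wave module to confirm that a file matches the target format and that a non-empty transcript exists beside it. The function name and the wording of the issue messages are hypothetical, and the check does not assess transcription accuracy, which still requires human review.

```python
import wave
from pathlib import Path


def check_pair(wav_path: Path, txt_path: Path) -> list[str]:
    """Return a list of problems found for one audio/transcript pair."""
    issues = []

    # The transcript must exist and contain some text.
    if not txt_path.exists() or not txt_path.read_text(encoding="utf-8").strip():
        issues.append("missing or empty transcript")

    # The audio must be 16 kHz, mono, 16-bit, and non-empty.
    with wave.open(str(wav_path), "rb") as wf:
        if wf.getframerate() != 16000:
            issues.append(f"sample rate is {wf.getframerate()} Hz, expected 16000")
        if wf.getnchannels() != 1:
            issues.append(f"{wf.getnchannels()} channels, expected 1")
        if wf.getsampwidth() != 2:
            issues.append(f"{8 * wf.getsampwidth()}-bit samples, expected 16")
        if wf.getnframes() == 0:
            issues.append("audio contains no frames")

    return issues
```

An empty list means the pair passed these basic checks; any returned messages can feed directly into a quality report.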

Data Analysis and Reporting:

Use data analysis techniques to evaluate dataset health and completeness. Provide regular reports on data collection progress, challenges, and recommendations for improvements.

Collaboration and Communication:

Work closely with the machine learning team to address any data-related issues. Provide regular updates on data collection and preprocessing progress.