Vancouver, Canada
Software Engineer, Developer Experience & MLOps

About the team

Dialpad’s Ai Engineering team works centrally alongside the Data Science, Telephony, and Product Engineering teams to produce The Good Ai. In this role, you’ll leverage and acquire a broad skill set spanning Distributed Systems Engineering, DevOps, MLOps, and Data Engineering to deliver functionality essential to powering Dialpad’s Ai products.

Your role

As a Software Engineer – AI Developer Experience & ML Platform, you will design, build, and optimize the infrastructure, tooling, and workflows that enable engineers and data scientists to efficiently develop, deploy, and scale AI-powered applications. Your role will be multifaceted, spanning both developer experience (DevEx), where you’ll streamline development, testing, and deployment, and MLOps & ML platform engineering, where you’ll ensure scalable, reliable, and high-performance AI/ML workloads.

For DevEx, you’ll focus on improving the productivity and happiness of engineers and data scientists by building robust development environments, automating workflows, and enhancing observability. You’ll work with tools such as Kubernetes, Grafana, Terraform, and CircleCI, leveraging GCP services like GKE, Cloud Workstations, Cloud Run, and BigQuery, to create a seamless developer experience.

For MLOps, you’ll architect and maintain scalable AI infrastructure, enabling real-time analytics, efficient model training, and optimized inference. You’ll work with vLLM, Apache Beam, SQL, and Kubeflow, using GCP services like Vertex AI, Dataflow, BigQuery, and GKE to build and maintain end-to-end AI pipelines. Your contributions will directly impact the scalability, performance, and reliability of Dialpad’s AI-driven insights, ensuring that AI models and analytics run efficiently at scale.
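
For a flavor of the pipeline work this involves, here is a minimal, illustrative sketch of a streaming Apache Beam job that reads events from Pub/Sub, filters them, and writes them to BigQuery. It is not Dialpad’s actual pipeline; the project, topic, table, and field names are hypothetical placeholders.

```python
# Illustrative only: a minimal streaming pipeline sketch, not Dialpad's code.
# Project, topic, table, and field names below are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Streaming options; in practice the runner, project, and region would be
    # supplied via CLI flags or Terraform-managed configuration.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/call-events"  # hypothetical topic
            )
            | "ParseJSON" >> beam.Map(json.loads)
            | "KeepCompletedCalls" >> beam.Filter(lambda e: e.get("status") == "completed")
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="example-project:analytics.call_events",  # hypothetical existing table
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```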

What you’ll do (i.e., Responsibilities)

First Week

Merge your first PR & learn the review process: Make a small contribution, go through the code review process, and get familiar with team coding standards.
Learn CI/CD workflows & deployment process: Understand how changes move from development to production, including CircleCI, Terraform, and GitOps workflows.
Test, deploy, and monitor a change in production: Push a minor update, observe logs, metrics, traces and alerts, and ensure smooth rollout.
Meet the team & key stakeholders: Get to know your immediate team and cross-functional teams (ML engineers, data scientists, platform engineers).

First 3 Months

Work directly on Dialpad’s AI/ML pipelines, Vertex AI and Kubernetes-based dev environments to enhance platform performance.
Optimize developer workflows, including CI/CD pipelines (CircleCI) and infrastructure (Terraform), to accelerate AI and ML deployments.
Strengthen observability and debugging (Grafana, Loki, OpenTelemetry) for better insights and faster issue resolution.
Collaborate with cross-functional teams to identify bottlenecks, ship quick wins, and demonstrate measurable improvements.
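
As a purely illustrative example of the observability work above, the sketch below shows how a Python service might emit OpenTelemetry traces that a backend surfaced in Grafana could pick up. The service, span, and attribute names are hypothetical, and a local OTLP collector endpoint is assumed.

```python
# Illustrative only: a minimal OpenTelemetry tracing setup for a Python service.
# Service, span, and attribute names are hypothetical; the OTLP endpoint is assumed.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Tag every span with the service name so traces are searchable in the backend.
provider = TracerProvider(resource=Resource.create({"service.name": "inference-gateway"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def handle_request(payload: dict) -> dict:
    # Wrap the critical path in a span so latency and errors show up in traces.
    with tracer.start_as_current_span("model_inference") as span:
        span.set_attribute("model.name", "transcript-summarizer")  # hypothetical model
        span.set_attribute("request.size_bytes", len(str(payload)))
        # ... call the model server here ...
        return {"status": "ok"}


if __name__ == "__main__":
    handle_request({"transcript": "example"})
```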

First 6 Months

Streamlining ML deployments, data ingestion, and environment setup through internal CLI tools, templates, and dashboards to empower self-serve developer workflows.
Automating ML model testing and deployment rollbacks to harden CI/CD for AI/ML, including improving GitOps workflows with CircleCI and Terraform for increased reliability.
Using Grafana, Loki, OpenTelemetry, and Vertex AI Model Monitoring to enhance AI/ML observability by expanding logging, tracing, and real-time metrics.
Optimizing Kubernetes and cloud workflows to improve GKE-based AI workloads, autoscaling policies, and resource efficiency, addressing growing ML and data pipeline demands.
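
To make the self-serve deployment idea concrete, here is a hedged sketch of the kind of step an internal CLI or template might wrap using the Vertex AI Python SDK. The project, region, bucket, model name, and serving container are hypothetical placeholders rather than a description of Dialpad’s actual setup.

```python
# Illustrative only: one deploy step an internal self-serve tool might wrap.
# Project, region, bucket, model name, and container image are hypothetical.
from google.cloud import aiplatform


def deploy_model(project: str, region: str, artifact_uri: str) -> None:
    aiplatform.init(project=project, location=region)

    # Register the trained model artifact in the Vertex AI Model Registry.
    model = aiplatform.Model.upload(
        display_name="transcript-summarizer",  # hypothetical model name
        artifact_uri=artifact_uri,
        # Example prebuilt serving container; the image and version would differ in practice.
        serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
    )

    # Deploy to an autoscaling endpoint; a rollback amounts to shifting traffic
    # back to a previously deployed model version.
    endpoint = model.deploy(
        machine_type="n1-standard-4",
        min_replica_count=1,
        max_replica_count=3,
        traffic_percentage=100,
    )
    print(f"Deployed to endpoint: {endpoint.resource_name}")


if __name__ == "__main__":
    deploy_model("example-project", "us-central1", "gs://example-bucket/models/summarizer/")
```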

First 12 Months

Optimize AI compute cost and efficiency by implementing autoscaling, spot instance scheduling, and GPU/TPU resource optimization to balance performance and cost.
Build a self-serve AI infrastructure by developing internal developer tooling, dashboards, or APIs that enable engineers and data scientists to easily deploy models and manage data pipelines.
Enable AI-driven analytics at scale by ensuring real-time AI insights power customer-facing features with sub-second query latencies in Pinot, BigQuery, and Dataflow.
Automate infrastructure provisioning by expanding GitOps-driven automation for deploying and managing AI workloads, Kubernetes clusters, and cloud resources.

Technologies you know

Kubernetes & Cloud Infrastructure – Managing GKE, Terraform, Cloud Workstations, and IAM for scalable AI/ML workloads. Related: AWS, Azure, Docker, Kubernetes
CI/CD & GitOps – Automating deployments with CircleCI, Terraform, and Cloud Build to streamline AI/ML workflows. Related: ArgoCD, Jenkins, GitLab
ML Pipeline & Data Processing – Working with Vertex AI Pipelines, MLFlow, Apache Beam, Spark, Dataflow, Pub/Sub, and BigQuery to enable real-time AI analytics. Related: Athena, Kafka, Flink, Redshift, Spark, Snowflake, Databricks
Observability & Monitoring – Implementing Grafana, Loki, OpenTelemetry, and Vertex AI Model Monitoring for debugging and tracking AI performance. Related: Prometheus, Jaeger
Model Deployment & Serving – Understanding Kubeflow, TensorFlow Serving, Triton Inference Server, and strategies for scalable ML inference. Related: ONNX, TorchServe

Skills you’ll bring

You have a Bachelor’s Degree in Computer Science, Software Engineering, Mathematics, or a related field, or equivalent work experience.
You have 3+ years of experience in DevOps, MLOps, Developer Experience, or related roles.
You have strong fundamentals in software engineering, cloud infrastructure, and distributed systems.
You thrive in a collaborative, distributed team and can work effectively across time zones.
You have experience building and maintaining AI/ML infrastructure, CI/CD pipelines, or developer tooling.
You enjoy automating and optimizing development workflows, from CI/CD pipelines to AI/ML deployments.
You take a data-driven approach to system reliability, ensuring observability, monitoring, and performance tracking.
You believe in choosing the right tool for the job, balancing scalability, efficiency, and maintainability.
You are comfortable working across infrastructure, AI/ML pipelines, and developer tooling to support high-scale applications.
You enjoy continuous learning and knowledge-sharing, improving both your skills and your team’s capabilities.
You are fluent in English and communicate complex technical concepts clearly.

Bonus Points For

A track record of Open Source contributions in DevOps, MLOps, or AI tooling.
Experience in the Python ecosystem and related ML/DevOps libraries.
Hands-on expertise with cloud providers such as Google Cloud Platform (GCP) or AWS.
Experience with GitOps workflows and tools like ArgoCD, Flux, or Terraform.
Familiarity with AI/ML observability, model monitoring, and real-time inference optimization.

Benefits, time-off, and wellness

An apple a day keeps the doctor away, and it doesn’t hurt that we offer flexible time off and great options for medical, dental, and vision plans for all employees. Along with that, employees receive a monthly stipend to help cover their cell phone and home internet bills, and we reimburse gym membership costs, a variety of wellness events, and more!

Professional development

Dialpad offers reimbursement for expenses related to professional development, up to an annual limit per calendar year.
