Bogota, Colombia
3 days ago
Reliability Operations Engineer

At Infobip, we dream big. We value creativity, persistence, and innovation, passionately believing that it is through teamwork that we can all reach greater heights.


Since 2006, we have been innovating at the edge of technological possibilities and are now shaping global communications of the future. Through 75+ offices on six continents, Infobip’s platform is used by almost 80% of the population, making it the largest network of its kind and the only full-stack cloud communication platform globally.


Join us on our mission to create life-changing interactions between humans and online services with new and unseen solutions.

Job description:

As part of Reliability Operations, you will work in a team that strives to identify, respond to, and mitigate platform incidents. When a platform incident occurs, you and your team will be the first responders, involving the responsible individuals in mitigation and driving the resolution.
Your job will include improving the observability of our platform and collaborating with other engineers on common mitigation tactics. Automation is a big part of the job: we strive for meaningful alerting rather than being triggered by every small glitch, so fine-tuning existing alerts and improving our processes are among our priorities.
Does your eye twitch when something breaks, and do you already have a list of possible improvements in your head? This is the place you're looking for.

What you will do:

- Be a first responder to platform alerts
- Monitor our products for issues, prioritize and triage them, and assess client impact
- Detect issues, identify them (affected systems, locations, responsible teams), and respond in a timely manner by using runbooks
- Clearly communicate (summarize) and escalate platform incidents to the responsible individuals
- Actively contribute to current runbooks and create new ones
- When an incident is reported, be the driver of the incident resolution (incident commander)
- Based on alerts, try to prevent an issue from becoming an incident


More about you:

- You have an engineering or support background and a passion for IT, with at least 1 year of prior experience in the same or a similar job
- You have experience with tools for monitoring systems (Grafana, Prometheus, New Relic, Graylog, Kibana, Elasticsearch, OpenSearch…)
- You have a strong systems-thinking and problem-solving mindset
- You are genuinely interested in how things work, and driven when they don’t
- You have strong analytical and investigative skills, combined with the ability to navigate through substantial amounts of data to gather critical information in a timely manner
- You are genuinely interested in site reliability and want to learn about mitigation tactics
- Hands-on knowledge of system administration tasks is an advantage, but not a prerequisite
- You can speak fluently with clients and colleagues alike, and have a great command of English
- You exhibit an advanced level of teamwork, excellent communication skills, and a high degree of independence
- You are efficient in execution and inclined toward continuous improvement, experimentation, and self-education

A bit more on what kind of people we are looking for:

- tech savvy
- curious with attention to detail
- critical thinkers
- system-knowledge, holistic view
- enjoys troubleshooting
- responsible
- clear communicator
- problem solver
- willing to teach / mentor others

Infobip employees are people with diverse backgrounds, characteristics, and experiences who share the same passion and talent that help us achieve our mission. That's why Infobip is committed to creating a diverse workplace and is proud to be an equal-opportunity employer.

All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender, gender identity, national origin, citizenship, disability, veteran status, or any other part of one's identity.
