Senior Site Reliability Engineer - Storage Operations
IBM
**Introduction**
About IBM
IBM is a global technology and innovation company. It is the largest technology and consulting employer in the world, with a presence in 170 countries. The diversity and breadth of the entire IBM portfolio of research, consulting, solutions, services, systems and software uniquely distinguishes IBM from other companies in the industry.
Over the past 100 years, a lot has changed at IBM. In this new era of Cognitive Business, IBM is helping to reshape industries as diverse as healthcare, retail, banking, travel, manufacturing, and many more by bringing together our expertise in Cloud, Analytics, Security, Mobile, and the Internet of Things. We like to say, "be essential." We are changing how we craft. How we collaborate. How we analyze. How we engage.
Join the next generation of innovators, inventors and entrepreneurs who are crafting the very way the world works. We want the brightest minds doing work that inspires, in an environment where growth is supported. IBMers get to discover their potential, so they’re inspired to build breakthroughs that help our clients succeed. We’re building teams with dynamic strengths, made up of people who want their ideas to matter. Join us. You’ll be proud to call yourself an IBMer.
Our Culture:
IBM is committed to crafting a diverse environment and is proud to be an equal opportunity employer. You will receive consideration for employment without regard to your race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.
Business Unit Introduction
IBM Cloud Computing is a one-stop shop that provides all the cloud solutions and cloud tools industries need. The IBM Cloud portfolio includes infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS) offered through public, private and hybrid cloud delivery models, in addition to the components that make up those clouds.
IBM Cloud ensures seamless integration into public and private cloud environments. The infrastructure is secure, scalable, and flexible, providing customized enterprise solutions that have made IBM Cloud the hybrid cloud market leader, with market-leading IaaS and PaaS platforms. The IBM Cloud platform is IBM's public cloud offering, providing services to global enterprises. IBM Cloud is the Cloud for Smarter Business, built on open technology with developer tools, and it supports solutions by industry. We run the services and workloads from Watson, Blockchain, Services, Security, and IoT.
Ready to help drive IBM's success in the Cloud market? This is your chance to research and learn new cloud-related technology products and services, as well as to design and implement quick cloud-based prototypes while advancing your career in leading-edge technology.
**Your role and responsibilities**
Who you are:
As a Site Reliability Engineer (SRE) in the IBM Cloud Infrastructure organization, you will be responsible for ensuring the reliability, scalability, and operational efficiency of IBM Cloud's storage services. You will work closely with development teams, SRE peers and engineering managers to automate infrastructure management, optimize system performance, and enhance monitoring capabilities. This role involves writing code, building automation, troubleshooting production issues, and improving overall service reliability.
How we’ll help you grow:
You’ll have access to all the technical and management training courses to become the expert you want to be.
You’ll learn directly from senior members and leaders in this field.
You'll have the opportunity to work with multiple clients.
Key Responsibilities:
Reliability & Scalability
· Design, build, and maintain highly available, distributed storage services with a focus on reliability, scalability, and security.
· Implement auto-scaling, load balancing, and failover strategies to ensure seamless service availability.
· Analyze performance bottlenecks, optimize system efficiency, and contribute to capacity planning efforts.
Automation & Infrastructure as Code
· Develop infrastructure automation using PHP, Go, Kubernetes, and other cloud-native technologies.
· Implement self-healing mechanisms and automated remediation processes to minimize manual intervention.
Incident Management & Monitoring
· Respond to production incidents, lead root cause analyses (RCA), and implement long-term fixes to improve system resilience.
· Develop observability solutions, including monitoring, logging, and alerting, using tools like Prometheus, Grafana, Splunk, and IBM Cloud Monitoring.
· Establish SLOs (Service Level Objectives), SLIs (Service Level Indicators), and SLAs (Service Level Agreements) to measure and enhance system reliability.
Security & Compliance
· Ensure compliance with security best practices and regulatory requirements.
· Implement secret management, encryption, and access control for sensitive infrastructure components.
· Participate in security audits, vulnerability assessments, and compliance automation efforts.
Cross-Team Collaboration & DevOps Culture
· Work closely with development, operations, and security teams to design and implement resilient architectures.
· Advocate for DevOps/SRE best practices, including blameless postmortems, incident retrospectives, and operational readiness reviews.
· Provide mentorship to junior engineers and contribute to knowledge-sharing across teams.
**Required technical and professional expertise**
Technical Skills
· Programming Languages: PHP, Go, Python, Bash, or other scripting languages
· Cloud & Infrastructure: Kubernetes, Docker, Terraform, IBM Cloud, AWS, or other cloud providers
· Storage Technologies: NetApp, Ceph, GlusterFS, NFS, or other cloud storage solutions
· CI/CD & Automation: GitHub Actions, Jenkins, Ansible, ArgoCD
· Monitoring & Logging: Prometheus, Grafana, ELK stack, Splunk, Datadog
Experience
· 6+ years of experience in SRE, DevOps, or Software Engineering roles.
· A solid understanding of cloud infrastructure and operations is a must.
· Knows their way around a Unix/Linux shell, can write shell scripts, and understands Linux internals.
· Experience in Software Development Life Cycle, Test Driven Development, Continuous Integration and Continuous Delivery.
· Experience with containers, such as Docker, Kubernetes and OpenShift.
· Strong understanding of Linux systems administration, networking, and distributed systems.
· Experience with troubleshooting complex production incidents and implementing permanent fixes.
· Ability to write clean, maintainable, and efficient automation code.
· Experience debugging complex problems.
· Expertise in Ansible, Bash, core Python development, and deployments in production environments is a must.
**Preferred technical and professional experience**
· Strong familiarity with one of C, C++, Go, Python, or Java
· PHP and Perl development experience
· IBM Cloud API knowledge
· Experience with monitoring applications such as Grafana, ELK stack, Prometheus, Nagios, and Sysdig
· Familiarity with Jinja2 and Mustache deployment templates
· Familiarity with cloud deployment tooling such as Razee and LaunchDarkly