Sacramento, CA, 94278, USA
1 day ago
Site Reliability Engineer
Job Description

Insight Global is seeking a Site Reliability Engineer to join a team leading a critical platform modernization effort. The Site Reliability Engineer (SRE) is a key member of stream-aligned teams responsible for ensuring the performance and reliability of applications developed by scrum teams. The SRE focuses on load/performance testing, monitoring, troubleshooting, error management, and related duties that improve the stability and availability of applications. The ideal candidate will have strong experience with observability tools and with cloud-native infrastructure.

Additional responsibilities include:

Performance Testing: Conduct load/performance testing to assess application scalability and performance under various conditions, identify bottlenecks, and optimize system resources.

Monitoring: Implement and maintain monitoring solutions leveraging the MDE toolset to track application health, performance metrics, SLAs, and system behavior in real time, proactively identifying and resolving issues before they impact users. Ensure early detection and resolution of issues to minimize downtime and maintain high availability.

Troubleshooting: Investigate and troubleshoot incidents, outages, and performance issues, using diagnostic tools and techniques to identify root causes and implement effective solutions. Restore service functionality quickly and efficiently to minimize impact on users and business operations.

Error Management: Design and implement error management strategies, including error handling, logging, and alerting mechanisms, to capture and address application errors and anomalies. Improve application stability and reliability by minimizing error rates and providing timely alerts for critical issues.

Automation: Work with MDE and Scrum teams to develop and maintain automation scripts and tools that streamline repetitive tasks and automate deployment processes. Increase operational efficiency, reduce manual intervention, and enhance the consistency and reliability of deployment and configuration processes.

Incident Response Management: Lead incident response efforts during critical incidents, coordinating cross-functional teams, communicating updates to stakeholders, and conducting post-incident reviews to identify areas for improvement. Minimize the impact of incidents on business operations, ensure effective response and resolution, and drive continuous improvement in incident management processes.

Developer Support: Provide guidance and support during the development process, coaching developers on good software design patterns that sustain site reliability and operations. Reduce errors and outages caused by improperly functioning code and solutions.

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com.

To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/
Skills and Requirements

4+ years of experience in Site Reliability Engineering/DevOps
Proficiency in system-level testing tools (e.g., JMeter, Gatling) and techniques for assessing application scalability and performance
Experience with monitoring tools (e.g., Datadog, Prometheus, Grafana, New Relic) for real-time monitoring and alerting
Knowledge of error management strategies, including error handling, logging, and alerting techniques
Automation skills, including scripting languages (e.g., Go, Bash)
General experience with Terraform and Kubernetes
Experience with ArgoCD and Helm
Exposure to Crossplane