Plano, Texas, United States
Senior Site Reliability Engineer
Overview & Responsibilities

Duties: Design and implement highly available and scalable systems. Ensure systems are designed to improve reliability, availability, and scalability. Develop system architecture, design software and hardware systems, and select and configure infrastructure components.

Develop and implement alerting systems. Develop and implement monitoring systems that identify issues with systems in real time. Set up performance monitoring tools, implement log aggregation and analysis systems, and configure alerting systems to notify of potential issues.

Automate routine operational tasks, such as deployments, configuration management, and backups. Develop custom scripts or use automation tools to streamline tasks and reduce the risk of human error.

Participate in incident management and troubleshooting efforts when issues arise. Analyze system logs, debug code, and work with other technical teams to identify and resolve issues. Review code, provide feedback on projects, and offer advice on technical issues.

Establish and maintain processes for managing incidents. Develop incident management plans and establish escalation procedures. Respond to incidents as they occur, coordinating the efforts of technical teams to identify the root cause of the incident, mitigate its impact, and restore service as quickly as possible. Communicate with partners throughout the incident management process, providing updates on the status of the incident and managing expectations regarding resolution times. Conduct post-incident reviews to identify areas for improvement in the incident management process and opportunities for automation. Analyze incident data, identify root causes, and make recommendations for process improvements. Develop and implement incident management policies and procedures that align with industry-standard processes and organizational goals.

Conduct research, develop documentation, and work with other partners to ensure consensus and adoption. Continuously improve systems and processes. Stay up to date with emerging technologies and standard processes in the field.

Requirements

Bachelor’s degree or higher in Computer Information Systems or a related technical field required. 5 years of experience required in Information Technology, including any experience with: incident management; using Microsoft Azure to manage and automate cloud infrastructure; log monitoring and management software (Application Insights, LogicMonitor, and New Relic); and performance testing and analysis.

Must have legal authority to work in the U.S. EEOE. Relocation not available. Position is onsite/hybrid. Must be located in Dallas-Ft. Worth.

Resume to: Tara Dowie, Talent Management Manager, Republic Finance, LLC, 7031 Commerce Circle, Suite 100, Baton Rouge, LA 70809, or email tdowie@republicfinance.com. Reference AB24 + job title in cover letter/subject line.