Remote
Site Reliability Engineer
Zenoss is seeking an experienced Site Reliability Engineer (SRE) to join a team of engineers and architects creating a breakthrough ITOps and AIOps platform. We are looking for individuals with experience and knowledge in building software and tools that support our Ops and Support teams, who provide SaaS offerings on cloud and microservices-based architectures.

Zenoss has a culture that encourages its employees to ask questions, challenge assumptions, and dig in to address the problems posed by highly distributed cloud and SaaS software systems. As an SRE, you will be responsible for the architecture, design, implementation, and delivery of features and functionality used in supporting, operating, and monitoring Zenoss Cloud. We offer a healthy work/life balance, a positive work environment, and a host of amenities to enable our teams to do their best work. This is an ideal role for a self-motivated professional with a passion for cutting-edge technology and creative problem solving.

This position can be remote (work from home) or based in our Austin, TX office.

Responsibilities:
- Develop, deploy, operate, and support cloud infrastructure, primarily utilizing GCP
- Work with development, operations, and support personnel to identify, isolate, and diagnose issues, handle support escalations, and plan and deliver high-value monitoring and alerting features
- Review technical designs and information, automate processes through scripting, install and configure software, and validate technical environments
- Take responsibility for the ongoing maintenance, security, and availability of several applications based on business requirements, adhering to tight operations, security, and procedural models
- Ensure production-level systems are running at all times and have multiple levels of redundancy to meet committed SLAs
- Apply professional-level technical skill and judgment to provide non-routine technical support for production operations to drive optimal performance, reliability, redundancy, and scale
- Document environment topology and installation details along with incident reviews
- Automate tasks using scripting and configuration management systems
- Communicate highly technical information to both technical and non-technical personnel
- Work with customers to troubleshoot and resolve technical issues
- Troubleshoot network performance issues, perform intrusion monitoring, and maintain disaster recovery procedures
- Plan for, and recommend, expansion of capacity and upgrades, patches, and new applications and equipment when necessary
- Participate in the development of information technology and infrastructure projects
- Document and thoroughly understand the application architecture and system configuration across platforms
- Determine the root cause and duration of an outage, and recommend steps to resolve issues
- Provide 24x7 support for all network and server systems that are pivotal to production

Required Experience / Skills:
- Bachelor's degree in Computer Science/Engineering or equivalent relevant experience
- 3-6 years of professional hands-on experience with cloud production environments hosted on GCP using BigTable, BigQuery, Dataflow, GKE, and other GCP services
- Experience with CI/CD tools like Spinnaker and Jenkins and with cloud-based software development and delivery processes/methodologies
- Strong scripting skills and demonstrated ability to automate tasks (SaltStack, Python, Terraform preferred)
- Strong understanding of networking, firewalls, load balancers, and databases
- Strong verbal and written communication skills
- Project and task oriented, with a focus on details and an ability to proactively communicate detailed status to customer and project teams
- Strong organization skills and an ability to work both within a team and independently
- Ability to make sound decisions based on customer needs and technical knowledge
- Self-motivated and able to work under pressure to deliver high-quality solutions
- Ability to work after hours, including weekends and nights, when required