Lead Site Reliability Engineer

Caylent Inc, Brazil

Experience: 1 Year
Traveling: No
Qualification: As mentioned in job details
Total Vacancies: 1
Posted on: Mar 24, 2021
Last Date: Apr 24, 2021

Job Description

Job Overview

The Lead Site Reliability Engineer will help design, implement, and optimize the tools and best practices that support a 24x7 Alert Response Service Offering. Ideal candidates will know how best to leverage and integrate tools for monitoring, alerting, on-call scheduling, escalation workflows, runbook documentation, root-cause analysis, and incident management. They will also know current thinking in Site Reliability Engineering well enough to help educate customers on how to implement an SRE program effectively for their applications and systems. If you want to combine your passion for Site Reliability Engineering with the opportunity to help build the systems, processes, and philosophies that drive a best-of-breed new practice in this space, this might be just the challenge you are looking for!

Responsibilities and Duties
  • Select, customize, and integrate the optimal tools to support our Alert Response Service as a part of helping customers adopt best-of-breed SRE fundamentals.
  • Document our processes, procedures, practices, and methodologies that define our opinionated approach to achieving SRE objectives.
  • Help develop materials that introduce SRE concepts to customers and lead them through an organizational readiness process that empowers their teams to take end-to-end ownership of their application components, with suitable and adequate component metrics and functional monitoring coverage.
  • Develop processes and procedures for our Alert Response team, who will handle initial alerts, determine whether a customer-impacting incident is occurring, attempt to resolve issues covered by existing runbooks, and direct targeted escalations to the right teams based on root-cause analysis, minimizing both alert fatigue and the number of times on-call resources must be pulled in to address critical incidents.
  • Help design and measure KPIs for our Services and for Customer Success following the adoption of our processes and best practices for site reliability engineering and alert/incident response.
  • Perform monitoring/alerting readiness assessments and ensure appropriate work backlogs are generated for changes required to set customers and their applications up for success.
  • Continuously develop and improve our offering(s) focused on customer site availability and platform incident response.

Requirements

Qualifications

Experience required:

  • Past experience as a Site Reliability Engineer with the systems and tooling which facilitate monitoring, alerting, and incident response for production workloads.
  • Extensive knowledge of Site Reliability Engineering theory and best practices, at a level where you can discuss the state of the art in this discipline in depth.
  • Past experience as a Systems Administrator responsible for Linux/Unix systems (desired).
  • Past experience managing multi-team platform support/incident response.

Experience desired:

  • Past experience as a Manager or Lead for a DevOps or SRE team, with responsibility for a team of Engineers supporting a production product or platform.

Skills required:

  • Knowledge of current monitoring and alerting tools for Cloud Native workloads, such as Prometheus, Grafana, and Alertmanager.
  • Knowledge of modern log aggregation tools for Cloud Native workloads, such as ELK/EFK stack implementations, Grafana Loki, Graylog, and Fluentd/Fluent Bit.
  • Knowledge of current monitoring and alerting tools for Serverless technologies.
  • Strong knowledge of one or more Alert Management tools, such as PagerDuty, Opsgenie, AlertOps, or VictorOps.
  • Strong knowledge of other monitoring tools, such as CloudWatch, Nagios, MRTG, Zabbix, SolarWinds, and WhatsUp.
  • Knowledge of Kubernetes and Container Orchestration
  • Linux Systems Administration Fundamentals
  • Experience managing Cloud infrastructure in AWS
  • Fundamentals of containers, containerd, and Docker
  • Knowledge of remote systems management using OpenSSH
  • Understanding of Linux logging subsystems
  • Troubleshooting and managing Linux services
  • Understanding of DNS fundamentals
  • Experience working with wiki-based or markdown-based documentation tools
  • Intermediate-level knowledge of AWS Cloud (often represented by the AWS Solutions Architect Associate Certification)

Skills desired, but not required:

  • Advanced-level knowledge of AWS Cloud (often represented by the AWS Solutions Architect Professional Certification)

Caylent Inc

Information Technology and Services - Santo Domingo Este, Dominican Republic