Director, Site Reliability Engineering

Benchmark Education Company

🇺🇸 Florida | Remote | $140K–$180K/yr | Posted 12mo ago

K-12, Engineering, Site Reliability Engineering, DevOps, Cloud Computing, Infrastructure, Leadership, Observability

Summary

The Director of Site Reliability Engineering provides strategic leadership to the SRE team, ensuring the reliability, performance, and scalability of critical systems. The role balances operational excellence with innovation while driving reliability best practices across engineering teams.

Key Responsibilities: Define and execute site reliability vision, lead and mentor the SRE team, establish SLIs/SLOs/error budgets, and drive incident management processes including response, on-call coordination, and postmortem analysis. Partner with engineering and product teams to implement observability practices, reduce operational toil, and develop capacity planning and performance testing strategies.

Skills & Tools: 8+ years in SRE or DevOps with 3+ years of leadership experience, expertise in SLIs/SLOs, incident management, and observability tools like Prometheus, Grafana, or Datadog. Strong leadership, communication, and stakeholder management skills with hands-on software engineering background in Python, Bash, or similar languages.

Qualifications: 8+ years of experience in Site Reliability Engineering or DevOps with proven track record managing large-scale distributed systems and 3+ years in a leadership role. Expertise in defining SLIs, SLOs, error budgets, incident management processes, and leading on-call rotations and postmortems.

Location: Remote - Florida, United States of America

Compensation: $140,000 – $180,000/year

Job Description

Fast Facts

We are looking for a Director of Site Reliability Engineering to lead our SRE team, ensuring system reliability and performance through strategic leadership and operational excellence.

Responsibilities: Oversee the SRE team's operations and strategy, define reliability best practices, establish SLIs and SLOs, and improve incident management processes to enhance system resilience.

Skills: 8+ years in SRE or DevOps, leadership experience, expertise in SLIs/SLOs, incident management, and observability tools like Prometheus and Grafana.

Qualifications: Experience leading on-call rotations and postmortems is required. Deep experience with AWS cloud environments and a hands-on software engineering background are preferred.

Location: Remote - Florida, USA

Compensation: Not provided by employer. Typical compensation for this position ranges from $140,000 to $180,000 per year.

Position Purpose:

We are seeking a Director of Site Reliability Engineering (SRE) to lead our SRE team in ensuring the availability, performance, and scalability of our critical systems. This role is responsible for defining and driving reliability strategies, operational excellence, and incident response processes at scale. You will collaborate closely with engineering, DevOps, and product teams to establish best practices and implement processes that enhance system resilience and service performance.

Responsibilities:

Leadership & Strategy
Define and execute the vision for site reliability, balancing innovation with operational stability.
Lead, mentor, and grow a high-performing SRE team, fostering a culture of ownership and continuous improvement.
Partner with Engineering, DevOps, and Product teams to embed reliability best practices into the development lifecycle.

Operational Excellence
Establish and refine SLIs, SLOs, and error budgets to measure and improve service reliability.
Develop and drive incident management processes, including real-time incident response, on-call coordination, and postmortem analysis to prevent recurring issues.
Implement and standardize operational readiness reviews and escalation procedures to ensure teams are equipped to handle incidents effectively.
Drive initiatives to reduce operational toil, leveraging automation where applicable to enhance team efficiency.
Collaborate with engineering teams to define performance testing and capacity planning strategies to proactively mitigate reliability risks.
Champion the adoption of observability, logging, and monitoring best practices, ensuring visibility into system health and performance.

Qualifications:

8+ years of experience in Site Reliability Engineering, DevOps, or related fields, with at least 3 years in a leadership role.
Proven track record of driving operational excellence in large-scale, distributed systems.
Expertise in defining and implementing SLIs, SLOs, error budgets, and incident management processes.
Strong knowledge of observability tools such as Prometheus, Grafana, Datadog, New Relic, or similar.
Experience leading on-call rotations, postmortems, and operational readiness programs.
Excellent leadership, communication, and stakeholder management skills.

Preferred Qualifications:

Deep experience with AWS cloud environments, including operational best practices for high availability and reliability.
AWS certifications such as AWS Certified DevOps Engineer – Professional, AWS Certified Solutions Architect – Professional, or AWS Certified Advanced Networking – Specialty.
Experience with AWS monitoring and logging tools (CloudWatch, X-Ray, AWS Config, GuardDuty).
Experience scaling SRE practices in high-growth or regulated environments.
Hands-on background in software engineering with Python, Bash, or similar languages.

About Us

Benchmark Education Company is a leading publisher of core, supplemental, and intervention literacy and language resources in English and Spanish, both print and digital, as well as world-class professional development. Since its founding in 1998, our company has proven to be one of the most nimble and innovative content creators on the cutting edge of pedagogy and technology. The digital content in our many learning programs delivers all the rigor of its print counterpart and is designed for virtual and blended learning contexts.

Benchmark Education Publishing (BEC) and its affiliates are proud to be an Equal Opportunity Employer.

For further information, visit us at: https://www.benchmarkeducation.com