This position has been filled

Director, Site Reliability Engineering
Benchmark Education Company
Summary
Director of Site Reliability Engineering leading a strategic SRE team to ensure system reliability, performance, and scalability across critical systems. The role balances operational excellence with innovation while driving reliability best practices across engineering teams.
Job Description
Fast Facts
We are looking for a Director of Site Reliability Engineering to lead our SRE team, ensuring system reliability and performance through strategic leadership and operational excellence.
Responsibilities: Oversee the SRE team's operations and strategy, define reliability best practices, establish SLIs and SLOs, and improve incident management processes to enhance system resilience.
Skills: 8+ years in SRE or DevOps, leadership experience, expertise in SLIs/SLOs, incident management, and observability tools like Prometheus and Grafana.
Qualifications: Experience in AWS cloud environments, leadership in managing on-call rotations, and knowledge of software engineering are preferred.
Location: Remote - Florida, USA
Compensation: Not provided by employer. Typical compensation for this position ranges from $140,000 to $180,000.
Position Purpose:
We are seeking a Director of Site Reliability Engineering (SRE) to lead our SRE team in ensuring the availability, performance, and scalability of our critical systems. This role is responsible for defining and driving reliability strategies, operational excellence, and incident response processes at scale. You will collaborate closely with engineering, DevOps, and product teams to establish best practices and implement processes that enhance system resilience and service performance.
Responsibilities:
- Leadership & Strategy
  - Define and execute the vision for site reliability, balancing innovation with operational stability.
  - Lead, mentor, and grow a high-performing SRE team, fostering a culture of ownership and continuous improvement.
  - Partner with Engineering, DevOps, and Product teams to embed reliability best practices into the development lifecycle.
- Operational Excellence
  - Establish and refine SLIs, SLOs, and error budgets to measure and improve service reliability.
  - Develop and drive incident management processes, including real-time incident response, on-call coordination, and postmortem analysis to prevent recurring issues.
  - Implement and standardize operational readiness reviews and escalation procedures to ensure teams are equipped to handle incidents effectively.
  - Drive initiatives to reduce operational toil, leveraging automation where applicable to enhance team efficiency.
  - Collaborate with engineering teams to define performance testing and capacity planning strategies to proactively mitigate reliability risks.
  - Champion the adoption of observability, logging, and monitoring best practices, ensuring visibility into system health and performance.
Qualifications:
- 8+ years of experience in Site Reliability Engineering, DevOps, or related fields, including at least 3 years in a leadership role.
- Proven track record of driving operational excellence in large-scale, distributed systems.
- Expertise in defining and implementing SLIs, SLOs, error budgets, and incident management processes.
- Strong knowledge of observability tools such as Prometheus, Grafana, Datadog, New Relic, or similar.
- Experience leading on-call rotations, postmortems, and operational readiness programs.
- Excellent leadership, communication, and stakeholder management skills.
Preferred Qualifications:
- Deep experience with AWS cloud environments, including operational best practices for high availability and reliability.
- AWS certifications such as AWS Certified DevOps Engineer – Professional, AWS Certified Solutions Architect – Professional, or AWS Certified Advanced Networking – Specialty.
- Experience with AWS monitoring and logging tools (CloudWatch, X-Ray, AWS Config, GuardDuty).
- Experience scaling SRE practices in high-growth or regulated environments.
- Hands-on background in software engineering with Python, Bash, or similar languages.
About Us
Benchmark Education Company is a leading publisher of core, supplemental, and intervention literacy and language resources in English and Spanish, both print and digital, as well as world-class professional development. Since its founding in 1998, our company has proven to be one of the most nimble and innovative content creators on the cutting edge of pedagogy and technology. The digital content in our many learning programs delivers all the rigor of its print counterpart and is designed for virtual and blended learning contexts.
Benchmark Education Publishing (BEC) and its affiliates are proud to be an Equal Opportunity Employer.
For further information, visit us at: https://www.benchmarkeducation.com
