Lead Site Reliability Engineer
We are looking for a highly experienced Lead Site Reliability Engineer to drive SRE outcomes for business-critical applications in the Risk Technology space. This role requires a strong application, infrastructure, and engineering mindset, with the ability to work closely with application support, development, observability, and technology teams to improve reliability, resiliency, operational readiness, and automation maturity.
The ideal candidate will define and enable SRE goals, establish reliability requirements, evaluate SLAs, SLOs, SLIs, and error budgets, identify critical user journeys, and ensure that business-critical applications are supported with the right monitoring, alerting, automation, and operational practices.
This is not a traditional DevOps role. We are looking for someone who can go deep into application architecture, workflows, design, code behavior, infrastructure dependencies, and production support challenges, while helping teams mature their reliability engineering practices.
Key Responsibilities
- Lead SRE enablement for business-critical applications across Risk Technology.
- Partner closely with application support, development, infrastructure, and observability teams to improve reliability and resiliency.
- Define SRE priorities, goals, standards, and measurable outcomes for application teams.
- Establish and evaluate SLAs, SLOs, SLIs, error budgets, and service health indicators.
- Identify and document critical user journeys, application dependencies, failure points, and recovery expectations.
- Drive observability improvements by ensuring the right monitors, alerts, dashboards, logs, traces, and metrics are in place.
- Review application architecture, workflows, design patterns, and production support processes to identify reliability gaps.
- Support code-level analysis, code review discussions, and engineering recommendations from an SRE perspective.
- Improve incident management, post-incident reviews, root cause analysis, runbooks, and operational readiness.
- Build automation using Python, Ansible, and Terraform to reduce manual effort and improve operational efficiency.
- Leverage AWS services, including AI-based solutions, to improve SRE efficiency, automation, monitoring, and operational outcomes.
- Help application teams adopt industry-standard SRE practices inspired by mature engineering organizations.
- Work hands-on with teams to improve production stability, resiliency, scalability, and supportability.
Required Skills and Experience
- Minimum 10 years of experience in SRE, production engineering, application reliability, infrastructure engineering, or related technology roles.
- Strong understanding of SRE principles, including SLIs, SLOs, SLAs, error budgets, toil reduction, incident management, and reliability engineering.
- Deep experience supporting business-critical applications in production environments.
- Strong application architecture knowledge with the ability to understand design, workflows, dependencies, and failure scenarios.
- Hands-on experience with Python automation.
- Hands-on experience with Ansible and Terraform automation.
- Strong knowledge of observability practices, including metrics, logs, traces, dashboards, alerting, and service health monitoring.
- Ability to partner with application support and development teams to improve reliability from both operational and engineering perspectives.
- Strong understanding of cloud, infrastructure, networking, databases, middleware, and application runtime environments.
- Experience reviewing code, supporting code quality discussions, and identifying reliability risks in application changes.
- Strong problem-solving skills with the ability to deep dive into complex technical issues.
- Excellent communication skills with the ability to translate technical risks into business impact.
Preferred Qualifications
- Experience in financial services, banking, risk technology, regulatory platforms, or other high-criticality environments.
- Exposure to AWS services and AI-enabled automation or operational intelligence capabilities.
- Experience with Prometheus, Grafana, OpenTelemetry, Splunk, Datadog, Dynatrace, New Relic, or similar observability platforms.
- Knowledge of Kubernetes, OpenShift, containers, CI/CD pipelines, and modern distributed systems.
- Experience building reliability scorecards, operational readiness reviews, service maturity assessments, and production support standards.
- Strong understanding of resiliency patterns, failover, disaster recovery, capacity planning, and performance engineering.
Ideal Candidate Profile
The ideal candidate is a hands-on SRE leader who can think like an engineer, operate like a production owner, and partner like a trusted advisor to application teams. They should be comfortable going deep into application behavior, understanding business workflows, challenging reliability gaps, and enabling practical SRE outcomes that improve stability, resiliency, and operational excellence.
Pay: $70.00 per hour
Work Location: In person