Positions: 2 (Full Time)
Experience: 8-14 Years
Lead Site Reliability Engineer (SRE) – Offshore Operations
About the Role
We are expanding our Site Reliability Engineering organization to support a 24x7 follow-the-sun operating model. This is a newly established offshore SRE role focused on real-time incident response, proactive prevention, and continuous automation. As a Lead SRE, you will manage a team of 5-6 engineers while ensuring the reliability, availability, and performance of mission-critical production systems. Every minute matters—you will act decisively to prevent service degradation and protect the customer experience.
Key Responsibilities
Incident Response & Leadership
- Lead and mentor a team of 5-6 SREs in daily operations and incident response activities
- Act as the first responder to production alerts, rapidly assessing severity and initiating mitigation
- Serve as Incident Commander during major incidents, leading bridge calls with clarity and decisiveness
- Drive root cause isolation within 30 minutes for critical incidents whenever possible
- Communicate effectively with engineering, product, and leadership teams during high-pressure situations
- Maintain a strong presence and clear ownership on incident bridges, making decisions with confidence
Team Management & Operations
- Oversee daily activities and coordinate closely with client leads and managers
- Prepare weekly and monthly KPI reports on team performance and reliability metrics
- Drive continuous improvement initiatives across the team
Proactive Reliability Engineering
- Identify patterns, trends, and signals to prevent incidents before they occur
- Continuously improve alert quality, reduce noise, and increase signal fidelity
- Partner with engineering teams to enhance system resilience and reliability
Automation & Toil Reduction
- Eliminate manual work by automating operational tasks, ticket handling, and repetitive workflows
- Build and improve tooling across incident response, observability, and operations
- Leverage AI-assisted development tools (e.g., Cursor, Claude) where they provide clear value
Platform & Systems Support
- Troubleshoot across hybrid ecosystems including on-prem VMs (Linux & Windows; VMware), cloud platforms (AWS, GCP, Azure), and containerized environments (Kubernetes)
- Diagnose and resolve issues across networking, Kubernetes, CDN, and traffic management layers (Akamai, waiting rooms, etc.)
Required Technical Skills & Experience
Core Engineering & Operations
- Strong hands-on experience in incident management and triage in production environments
- Proven ability to troubleshoot complex distributed systems under pressure
- Solid understanding of Linux systems administration (performance tuning, networking, NTP, etc.)
- Prior experience as an Incident Commander or in similar leadership roles during outages
Cloud & Infrastructure
- Hands-on experience with AWS core services (S3, Lambda, Load Balancers, ECS, EC2)
- Familiarity with GCP and/or Azure environments
- Proven experience operating in multi-cloud and hybrid environments
Containers & Orchestration
- Experience troubleshooting Kubernetes clusters (pods, ingress, configuration issues)
- Understanding of containerized application architectures
DevOps & CI/CD
- Strong knowledge of DevOps practices and CI/CD pipelines
- Hands-on experience with Harness, GitHub, and/or GitLab
Application & Technology Stack
- Working knowledge of Java, Node.js, and React-based applications
- Understanding of database connectivity and dependencies across Oracle, MariaDB, and MSSQL
- Strong awareness of how to troubleshoot database-related issues (no DBA ownership required)
Networking
- Strong foundational knowledge of TCP/IP, DNS, and HTTP(S)
- Experience with load balancing and network troubleshooting
- Ability to diagnose connectivity issues between services and databases
Preferred Qualifications
- Experience in large-scale enterprise (Fortune 500) environments supporting mission-critical applications
- Familiarity with Akamai CDN and traffic management tools
- Experience in high-volume, high-availability production environments
- Track record of leading teams in fast-paced, incident-driven environments
Key Traits for Success
- Bias for Action: Move fast and decisively when systems are at risk
- Strong Communicator: Excel under pressure on incident bridges and with cross-functional teams
- Systems Thinker: Connect the dots across complex architectures and identify root causes
- Automation Mindset: Constantly look to reduce toil and improve operational efficiency
- Team Leader: Mentor and develop SRE team members while maintaining high standards
- Continuous Learner: Stay current with tools, AI capabilities, and emerging technologies
Working Hours
Shift 1: 3:00 AM – 12:30 PM IST (5:30 PM – 3:00 AM EDT)
Shift 2: 11:00 AM – 8:30 PM IST (1:30 AM – 11:00 AM EDT)
Why This Role Matters
This team forms the backbone of our global reliability strategy, ensuring continuous coverage and rapid response across all hours. As a Lead SRE, you will directly impact uptime, customer experience, and operational excellence—playing a critical role in preventing and resolving issues before they become major incidents. Your leadership will set the standard for reliability across our offshore operations.
What We're Looking For
We need a seasoned SRE professional with proven leadership experience, exceptional troubleshooting skills, and a passion for automation and continuous improvement. You should thrive in high-pressure environments, lead by example, and have the technical depth to mentor a growing team while maintaining hands-on involvement in critical incidents.
