Lead SRE 40003

Cephas Consultancy Services Private Limited • Bangalore, Karnātaka, India • 3w ago

Positions: 2 • Full Time
Experience
8 - 14 Years

Lead Site Reliability Engineer (SRE) – Offshore Operations

About the Role

We are expanding our Site Reliability Engineering organization to support a 24x7 follow-the-sun operating model. This is a newly established offshore SRE role focused on real-time incident response, proactive prevention, and continuous automation. As a Lead SRE, you will manage a team of 5-6 engineers while ensuring the reliability, availability, and performance of mission-critical production systems. Every minute matters: you will act decisively to prevent service degradation and protect the customer experience.

Key Responsibilities

Incident Response & Leadership

Lead and mentor a team of 5-6 SREs in daily operations and incident response activities
Act as the first responder to production alerts, rapidly assessing severity and initiating mitigation
Serve as Incident Commander during major incidents, leading bridge calls with clarity and decisiveness
Drive root cause isolation within 30 minutes for critical incidents whenever possible
Communicate effectively with engineering, product, and leadership teams during high-pressure situations
Maintain strong presence and ownership on incident bridges with confident decision-making

Team Management & Operations

Oversee daily activities and coordinate closely with client leads and managers
Prepare weekly and monthly KPI reports on team performance and reliability metrics
Drive continuous improvement initiatives across the team

Proactive Reliability Engineering

Identify patterns, trends, and signals to prevent incidents before they occur
Continuously improve alert quality, reduce noise, and increase signal fidelity
Partner with engineering teams to enhance system resilience and reliability

Automation & Toil Reduction

Eliminate manual work by automating operational tasks, ticket handling, and repetitive workflows
Build and improve tooling across incident response, observability, and operations
Leverage AI-assisted development tools (e.g., Cursor, Claude) where they provide clear value

Platform & Systems Support

Troubleshoot across hybrid ecosystems including on-prem VMs (Linux & Windows; VMware), cloud platforms (AWS, GCP, Azure), and containerized environments (Kubernetes)
Diagnose and resolve issues across networking, Kubernetes, CDN, and traffic management layers (Akamai, waiting rooms, etc.)

Required Technical Skills & Experience

Core Engineering & Operations

Strong hands-on experience in incident management and triage in production environments
Proven ability to troubleshoot complex distributed systems under pressure
Solid understanding of Linux systems administration (performance tuning, networking, NTP, etc.)
Prior experience as an Incident Commander or in similar leadership roles during outages

Cloud & Infrastructure

Hands-on experience with AWS core services (S3, Lambda, Load Balancers, ECS, EC2)
Familiarity with GCP and/or Azure environments
Proven experience operating in multi-cloud and hybrid environments

Containers & Orchestration

Experience troubleshooting Kubernetes clusters (pods, ingress, configuration issues)
Understanding of containerized application architectures

DevOps & CI/CD

Strong knowledge of DevOps practices and CI/CD pipelines
Hands-on experience with Harness, GitHub, and/or GitLab

Application & Technology Stack

Working knowledge of Java, Node.js, and React-based applications
Understanding of database connectivity and dependencies across Oracle, MariaDB, and MSSQL
Strong troubleshooting awareness (no DBA ownership required)

Networking

Strong foundational knowledge of TCP/IP, DNS, and HTTP(S)
Experience with load balancing and network troubleshooting
Ability to diagnose connectivity issues between services and databases

Preferred Qualifications

Experience in large-scale enterprise (Fortune 500) environments supporting mission-critical applications
Familiarity with Akamai CDN and traffic management tools
Experience in high-volume, high-availability production environments
Track record of leading teams in fast-paced, incident-driven environments

Key Traits for Success

Bias for Action: Move fast and decisively when systems are at risk
Strong Communicator: Excel under pressure on incident bridges and with cross-functional teams
Systems Thinker: Connect the dots across complex architectures and identify root causes
Automation Mindset: Constantly look to reduce toil and improve operational efficiency
Team Leader: Mentor and develop SRE team members while maintaining high standards
Continuous Learner: Stay current with tools, AI capabilities, and emerging technologies

Working Hours

Shift 1: 3:00 AM – 12:30 PM IST (5:30 PM – 3:00 AM EDT)

Shift 2: 11:00 AM – 8:30 PM IST (1:30 AM – 11:00 AM EDT)

Why This Role Matters

This team forms the backbone of our global reliability strategy, ensuring continuous coverage and rapid response across all hours. As a Lead SRE, you will directly impact uptime, customer experience, and operational excellence, playing a critical role in preventing and resolving issues before they become major incidents. Your leadership will set the standard for reliability across our offshore operations.

What We're Looking For

We need a seasoned SRE professional with proven leadership experience, exceptional troubleshooting skills, and a passion for automation and continuous improvement. You should thrive in high-pressure environments, lead by example, and have the technical depth to mentor a growing team while maintaining hands-on involvement in critical incidents.