Site Reliability Engineer
We are seeking a skilled Site Reliability Engineer to ensure the reliability, scalability, and performance of cloud-based systems running on AWS. This is a hands-on engineering role focused on applying software engineering principles to operations, with a strong emphasis on automation, observability, and reliability engineering.
As an SRE, you will work closely with CloudOps, DevOps, Platform, and Engineering teams to build and operate highly reliable systems. You will define reliability standards, reduce operational toil, and leverage automation and AI-driven insights to manage production environments proactively and improve system resilience.
Key Responsibilities
Design, build, and operate highly reliable and scalable AWS-based systems
Define and manage SLIs, SLOs, and error budgets in alignment with business goals
Identify reliability risks and drive architectural and operational improvements
Conduct capacity planning, resilience testing, and failure analysis
Build automation to eliminate manual operational tasks and reduce toil
Develop event-driven and self-healing mechanisms using AWS Lambda and native services
Implement proactive issue detection and automated remediation workflows
Leverage AI/ML and AIOps capabilities for predictive alerting and anomaly detection
Design and maintain monitoring, logging, alerting, and observability solutions
Integrate and optimize AWS services, including CloudWatch, CloudTrail, AWS Config, IAM, Lambda, and AI/ML services
Participate in on-call rotations and lead incident response for production issues
Perform root cause analysis (RCA) and drive post-incident improvements
Partner with CloudOps teams to improve operational maturity and reliability practices
Collaborate with DevOps and Engineering teams to embed reliability into CI/CD pipelines
Balance feature velocity and reliability using error budgets and SRE best practices
Influence design decisions to improve system reliability and maintainability
Drive continuous improvement initiatives focused on reliability, performance, and efficiency
Contribute to runbooks, reliability standards, and operational documentation
Support governance, compliance, and security requirements through automation
Requirements
Required Skills & Experience
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field
3–6 years of experience in SRE, DevOps, CloudOps, or Platform Engineering roles
Strong hands-on experience with AWS services, including EC2, S3, RDS, VPC, IAM, Lambda, API Gateway, and ECS/EKS
Strong programming and scripting skills (Python, Go, or similar)
Experience with Infrastructure as Code (Terraform, CloudFormation, CDK)
Solid understanding of distributed systems, fault tolerance, and reliability patterns
Experience defining and managing SLIs, SLOs, and error budgets at scale
Hands-on experience with monitoring, logging, and observability tools
Experience managing production incidents and on-call responsibilities
Strong problem-solving, collaboration, and communication skills
Experience with microservices, event-driven and serverless architectures, and containers
Familiarity with chaos engineering, load testing, and resilience testing practices
Experience with cost-aware reliability engineering and performance optimization
Exposure to AIOps or ML-based monitoring solutions
Experience in startup or fast-paced product environments
Willingness to support teams and customers across multiple time zones
- Impact: Play a pivotal role in shaping a rapidly growing venture studio with cloud-driven digital transformation.
- Culture: Thrive in a collaborative, innovative environment that values creativity, ownership, and agility.
- Growth: Access professional development opportunities and mentorship from experienced peers.
- Benefits: Competitive salary, wellness packages, and flexible work arrangements that support your lifestyle and goals.