Site Reliability Engineer

We are seeking a skilled Site Reliability Engineer to ensure the reliability, scalability, and performance of cloud-based systems running on AWS. This is a hands-on engineering role focused on applying software engineering principles to operations, with a strong emphasis on automation, observability, and reliability engineering.


As an SRE, you will work closely with CloudOps, DevOps, Platform, and Engineering teams to build and operate highly reliable systems. You will define reliability standards, reduce operational toil, and leverage automation and AI-driven insights to manage production environments proactively and improve system resilience.


Key Responsibilities

  • Design, build, and operate highly reliable and scalable AWS-based systems

  • Define and manage SLIs, SLOs, and error budgets in alignment with business goals

  • Identify reliability risks and drive architectural and operational improvements

  • Conduct capacity planning, resilience testing, and failure analysis

  • Build automation to eliminate manual operational tasks and reduce toil

  • Develop event-driven and self-healing mechanisms using AWS Lambda and native services

  • Implement proactive issue detection and automated remediation workflows

  • Leverage AI/ML and AIOps capabilities for predictive alerting and anomaly detection

  • Design and maintain monitoring, logging, alerting, and observability solutions

  • Integrate and optimize AWS services including CloudWatch, CloudTrail, AWS Config, IAM, Lambda, and AI/ML services

  • Participate in on-call rotations and lead incident response for production issues

  • Perform root cause analysis (RCA) and drive post-incident improvements

  • Partner with CloudOps teams to improve operational maturity and reliability practices

  • Collaborate with DevOps and Engineering teams to embed reliability into CI/CD pipelines

  • Balance feature velocity and reliability using error budgets and SRE best practices

  • Influence design decisions to improve system reliability and maintainability

  • Drive continuous improvement initiatives focused on reliability, performance, and efficiency

  • Contribute to runbooks, reliability standards, and operational documentation

  • Support governance, compliance, and security requirements through automation



Requirements

Required Skills & Experience

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field

  • 3–6 years of experience in SRE, DevOps, CloudOps, or Platform Engineering roles

  • Strong hands-on experience with AWS services including EC2, S3, RDS, VPC, IAM, Lambda, API Gateway, and ECS/EKS

  • Strong programming and scripting skills (Python, Go, or similar)

  • Experience with Infrastructure as Code (Terraform, CloudFormation, CDK)

  • Solid understanding of distributed systems, fault tolerance, and reliability patterns

  • Experience defining and managing SLIs, SLOs, and error budgets at scale

  • Hands-on experience with monitoring, logging, and observability tools

  • Experience managing production incidents and on-call responsibilities

  • Strong problem-solving, collaboration, and communication skills

  • Experience with microservices, event-driven and serverless architectures, and containers

  • Familiarity with chaos engineering, load testing, and resilience testing practices

  • Experience with cost-aware reliability engineering and performance optimization

  • Exposure to AIOps or ML-based monitoring solutions

  • Experience in start-up or fast-paced product environments

  • Willingness to support teams and customers across multiple time zones


What We Offer
  • Impact: Play a pivotal role in shaping a rapidly growing venture studio focused on cloud-driven digital transformation.
  • Culture: Thrive in a collaborative, innovative environment that values creativity, ownership, and agility.
  • Growth: Access professional development opportunities and mentorship from experienced peers.
  • Benefits: Competitive salary, wellness packages, and flexible work arrangements that support your lifestyle and goals.