Role Summary
We are looking for a Site Reliability Engineer who will work closely with our development teams to continuously improve the uptime, scalability, and reliability of our services. This role focuses on application‑level reliability, architecture best practices, automation, and developer enablement. It does not involve day‑to‑day infrastructure maintenance or sysadmin responsibilities.
Key Responsibilities
- Partner with development teams to design and improve service architectures with a strong focus on reliability, scalability, and reducing operational toil
- Contribute to defining and promoting 12‑Factor App and cloud‑native best practices across teams
- Support development teams in deploying and optimizing their services on Kubernetes, without being responsible for infrastructure operations
- Build internal tools, scripts, and automation (primarily in Python) to enhance delivery quality, observability, and operational efficiency
- Define and implement SLIs, SLOs, and SLAs, and establish clear, documented reliability standards
- Improve service observability by designing metrics, dashboards, and alerting
- Participate in incident analysis and root cause investigations, focusing on application and service layers
- Identify and automate repetitive processes to reduce operational overhead
- Explore and leverage AI‑powered tools to improve development, testing, and operational workflows
Required Skills & Experience
- Hands‑on experience deploying and debugging services on Kubernetes
- Strong programming skills, preferably in Python
- Solid understanding of SRE principles, including SLIs, SLOs, SLAs, error budgets, monitoring, and alerting
- Strong familiarity with the 12‑Factor methodology and cloud‑native application design
- Experience with observability tools (e.g., Prometheus, Grafana)
- Ability to analyze complex service‑level issues and propose pragmatic solutions
- Familiarity with CI/CD pipelines and release engineering practices
Nice to Have
- Experience using AI tools to enhance development, debugging, testing, or operational workflows
- Knowledge of containerization and modern deployment practices
- Experience designing developer golden paths or platform engineering practices
- Understanding of DevOps concepts and the ability to collaborate effectively with DevOps and infrastructure teams
Personal Attributes
- A passion for reducing toil and improving software quality through automation
- Strong communication skills and the ability to collaborate closely with development teams
- Product‑oriented thinking with a focus on end‑to‑end service reliability
- System‑level thinking and the ability to identify architectural bottlenecks