We are looking for a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our systems by automating processes, monitoring infrastructure, and improving service efficiency.
Responsibilities
- Ensure high availability, reliability, and scalability of production systems.
- Implement monitoring, logging, and alerting solutions to improve system observability.
- Perform capacity planning, load testing, and system tuning.
- Respond to incidents, participate in on-call rotations, and lead root cause analysis (RCA).
- Automate operational tasks to reduce manual intervention and improve system reliability.
- Work closely with operations teams to build reliable and resilient applications.
- Establish and enforce best practices for production readiness and release processes.
Requirements
- 3–7 years of experience as an SRE, Production Engineer, or in a related role.
- Expertise in monitoring, logging, and alerting systems (Prometheus, Grafana, ELK stack, etc.).
- Experience with incident management, postmortems, and root cause analysis.
- Proficiency in Git, Linux, and programming or scripting (Python, Go, Bash, etc.).
- Familiarity with Kubernetes, Docker, and microservices architectures.
- Strong problem-solving and troubleshooting skills in production environments.
- Knowledge of networking and security best practices.