We are looking for a passionate and experienced Senior Site Reliability Engineer (SRE) to join our infrastructure team. In this role, you will be responsible for building and maintaining scalable, resilient systems that power our core services. You will work closely with engineering and DevOps teams to ensure high availability, performance, and operational efficiency across all systems.
Key Responsibilities:
- Design and implement scalable and highly available infrastructure solutions
- Manage monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, EFK Stack)
- Collaborate with DevOps, backend, and data teams to improve CI/CD and automation workflows
- Define and track SLOs, SLAs, and SLIs for critical services
- Conduct incident response, root cause analysis, and performance tuning
- Document infrastructure processes, configurations, and architectural decisions
Requirements:
- 3+ years of experience as an SRE, DevOps Engineer, or related role
- Strong Linux system administration background
- Hands-on experience with CI/CD tools (GitLab CI preferred)
- Deep knowledge of Kubernetes and related tools (Helm, ArgoCD, etc.)
- Proficiency in building and maintaining observability tools
- Solid understanding of infrastructure and Infrastructure as Code tooling (e.g., Ansible)
- Strong problem-solving skills and a collaborative mindset
What You’ll Get:
- The chance to work on real scaling challenges
- A team that values transparency, curiosity, and learning
- Budget for learning, courses, and conferences
Benefits:
Join our friendly and dynamic team and enjoy a range of perks, such as:
- Monthly social events and gatherings
- Breakfast
- Lunch subsidies
- Transportation budget
- On-site medical care
- Comprehensive health insurance
- Parking space
- Seasonal and special credits and discounts from Okala
- Occasional gifts