About the Role:
We are seeking a Senior Site Reliability Engineer (SRE) to join our team and take ownership of the reliability, scalability, and performance of our production systems. As an SRE, you will work closely with software engineers, DevOps teams, and infrastructure specialists to automate operations, enhance observability, and improve system resilience. Your expertise in bare-metal platforms, incident management, and performance optimization will help ensure seamless service delivery.
Key Responsibilities:
- Reliability & Availability: Ensure high availability and performance of critical systems by implementing SRE best practices.
- Incident Management: Lead incident response, root cause analysis (RCA), and post-mortems to prevent future issues.
- Observability & Monitoring: Develop and maintain monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK).
- Automation & Tooling: Automate manual operations, CI/CD pipelines, and infrastructure provisioning using tools like Terraform, Ansible, and Kubernetes.
- Performance Optimization: Identify bottlenecks and optimize databases, networks, and application performance.
- Security & Compliance: Work with security teams to enforce best practices in access control, encryption, and compliance (e.g., SOC 2, ISO 27001).
- Capacity Planning: Analyze traffic patterns and resource utilization to ensure scalability and cost optimization.
- Collaboration: Partner with development teams to embed SRE principles into software design and deployment strategies.
Qualifications & Experience:
- 3+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
- Strong coding skills in Python, Go, or Bash for automation and scripting.
- Expertise in Kubernetes and container orchestration (EKS, GKE, AKS, or self-managed clusters).
- Deep knowledge of bare-metal and virtualization platforms (KVM, VMware) and infrastructure as code (Terraform, Ansible).
- Experience with observability tools (Prometheus, Grafana, eBPF, OpenTelemetry).
- Proficiency in Linux system administration and networking fundamentals.
- Experience with CI/CD tools (GitLab CI, Argo CD, Flux CD) and GitOps principles.
- Incident response and on-call experience with a focus on reducing MTTR and improving MTTD.
- Strong analytical and problem-solving skills, with a proactive mindset toward reliability improvements.
Preferred Qualifications:
- Experience with distributed systems and database reliability engineering (PostgreSQL, MySQL, MongoDB).
- Knowledge of service meshes (Istio, Linkerd) and API gateways.
- Experience with chaos engineering and fault injection testing.
- Understanding of FinOps and cloud cost optimization strategies.
Why Join Us?
- Work with cutting-edge cloud and DevOps technologies in a high-scale environment.
- A culture of learning, innovation, and an automation-first mindset.
- Competitive salary, benefits, and remote/hybrid work flexibility.
- Ownership of high-impact reliability projects across the organization.