The SRE Engineer is a key member of the SRE team, responsible for ensuring the reliability, performance, and scalability of business-specific services. Reporting to the SRE Supervisor, the SRE Engineer collaborates with the DevOps Core team to leverage standardized infrastructure tools while managing and optimizing services tailored to their business unit. This role requires strong technical expertise in SRE practices, hands-on operational skills, and a focus on supporting developers to deliver robust applications.
Responsibilities
- Monitor and maintain the reliability and performance of business-specific services or web applications.
- Use application performance monitoring (APM) tooling and Vector to track application performance metrics and process logs, ensuring high availability and an optimal user experience.
- Respond to incidents, troubleshoot issues, and implement fixes to minimize downtime and meet Service Level Agreements (SLAs).
- Deploy and manage services in Kubernetes using the standardized Helm charts and custom configurations (e.g., custom Helm values) provided by the DevOps Core team.
- Apply security policies and configure Custom Resource Definitions (CRDs) to meet business-specific requirements within dedicated Kubernetes namespaces.
- Monitor resource usage and optimize deployments to ensure efficient use of infrastructure resources.
- Execute and maintain CI/CD pipelines using GitLab Runners, adhering to standardized templates provided by DevOps Core.
- Automate operational tasks, such as backups and configuration updates.
- Submit requirements (e.g., service configurations) via Merge Requests to DevOps Core for integration with centralized systems.
- Work closely with the DevOps Core team to leverage centralized tools (e.g., Vault, Elasticsearch, GitLab) and provide feedback for improving standardized frameworks.
- Support development teams by ensuring reliable infrastructure and addressing application-specific needs.
- Participate in sync meetings with the SRE Supervisor and DevOps Core to align on requirements and resolve operational issues.
- Adhere to security best practices, including role-based access control (RBAC), secrets management with Vault, and Kubernetes security policy enforcement (e.g., Tetragon, OPA).
- Integrate security scans into CI/CD pipelines to ensure the deployment of secure container images.
- Document and report compliance-related activities to the SRE Supervisor for audits.
- Participate in on-call rotations to respond to incidents and perform root cause analysis for service disruptions.
- Document incidents, resolutions, and operational procedures to contribute to the team’s knowledge base and prevent recurrence.
- Collaborate with the SRE Supervisor to conduct post-mortem analyses and implement preventive measures.
- Identify opportunities to optimize service performance, such as reducing service latency or improving web application response times.
- Propose enhancements to tools and processes (e.g., Helm Charts, monitoring alerts) and share feedback with the SRE Supervisor and DevOps Core.
- Stay current with SRE best practices and business-specific technologies to drive innovation.
Requirements
- Technical Expertise: Proficiency in SRE practices, Kubernetes, and CI/CD pipelines.
- Operational Skills: Experience in monitoring, troubleshooting, and optimizing distributed systems.
- Collaboration: Ability to work effectively with development teams, DevOps Core, and other stakeholders to deliver reliable services.
- Problem-Solving: Strong analytical skills to diagnose and resolve operational issues efficiently.
- Security Awareness: Knowledge of security practices, including RBAC, secrets management, and container security.
- Communication Skills: Strong verbal and written skills in Persian and English for technical documentation and team collaboration.