We are looking for a Site Reliability Engineer (Internship) to help ensure the reliability, availability, and scalability of our systems. The role works closely with development teams to improve system resilience, automate operations, respond to production incidents, and support production changes (software updates, bug fixes, and security patches).
Responsibilities
- Understands basic cloud infrastructure and monitoring tools such as Prometheus, Grafana, or Cloud Monitoring.
- Assists in routine operational tasks, such as system checks and basic incident response under supervision.
- Learns to follow established procedures for system maintenance, patching, and updates.
- Gains foundational knowledge of Service Level Objectives (SLOs) and error budgets.
- Supports documentation of incidents and postmortems.
Requirements
- Experience with at least one programming language (e.g., C++, C#, Java, Python, JavaScript).
- Exposure to scripting languages (e.g., Python, Bash).
- Basic understanding of Linux/Unix systems and networking fundamentals.
- Familiarity with at least one cloud provider (e.g., Huawei Cloud, AWS, Azure).
- Knowledge of CI/CD pipelines and version control (Git).
- New graduate with a Bachelor’s degree in Computer Science, Software Engineering, or a related field, or equivalent practical experience.
- Strong willingness to learn cloud operations, observability, and automation.
- Excellent communication skills.
- Pragmatic, problem-solving attitude.
- Naturally curious, with a proactive approach to learning.
- Good documentation writing and editing skills.
Preferred
- Familiarity with OpenStack deployment or operations.
- Familiarity with public cloud deployment or operations.