IranServer is seeking a detail-oriented, forward-thinking NOC Manager to ensure service stability, lead proactive monitoring, and manage critical infrastructure operations. In this role, you will lead the operations team, standardize processes, and drive infrastructure transformation projects at one of the largest hosting platforms in the country.
If you are interested in leading operational teams, building standardized frameworks, implementing intelligent automation, making decisions in critical situations, and working in large-scale environments, this position can be a turning point in your career.
Key Responsibilities:
- Managing, planning, and directing the NOC team across various shifts to ensure High Availability
- Continuous monitoring of the network, servers, datacenter equipment, and critical services
- Analyzing operational incidents, identifying critical points, and taking quick corrective action
- Defining, documenting, and improving processes related to Monitoring, Incident Response, Escalation, and Major Incident Management
- Close collaboration with Network, SRE, Datacenter, DevOps, and SOC teams
- Designing and implementing operational KPIs, including Uptime, MTTR, MTTD, SLA, and Capacity Metrics
- Analyzing alerts and events using Zabbix, Grafana, Prometheus, ELK, and Splunk
- Managing capacity and performance
- Analyzing incidents and preparing standard RCA reports for management
- Active participation in designing, updating, and testing Disaster Recovery and Business Continuity plans
- Contributing to the design of advanced dashboards for service health and Observability
- Proposing and implementing improvements in NOC tools, procedures, standards, and automation
- Leading the team during critical events and coordinating among teams to minimize downtime
Required Qualifications & Skills:
Technical Skills
- Hands-on experience in designing, documenting, and improving operational processes (Incident, Problem, Change, Escalation Flow, SOP, Runbook)
- Ability to build a Process-Driven NOC structure and enforce operational discipline
- Strong knowledge of networking concepts: Routing, Switching, BGP, OSPF, VLAN, Firewalling
- Familiarity with datacenter structures and operations, including Power, Cooling, Rack Layout, and Connectivity
- Experience with monitoring and observability tools: Zabbix / Prometheus / Grafana / ELK / Splunk
- Operations automation skills using:
  - Python / Bash Scripting
  - Ansible / SaltStack
  - API Integration
- Familiarity with AIOps and Machine Learning for Operations, including:
  - False alert reduction
  - Failure prediction
  - Service behavior analysis
  - Anomaly detection
- Experience in designing intelligent dashboards for trend analysis, capacity forecasting, and health scoring
- Understanding of SRE concepts and metrics: SLA, SLO, Error Budget
- Ability to analyze logs, perform advanced troubleshooting, and deliver complete RCAs
- Experience with Cloud architectures, OpenStack, Kubernetes, or microservices is considered a plus
Behavioral Skills
- Continuous improvement mindset and interest in building standard and automated structures
- Systematic and data-driven thinking in decision-making
- Ability to lead teams toward an Automation-First culture
- Crisis management and accurate decision-making under pressure
- Strong communication skills for coordination across teams
- Commitment to documentation, operational discipline, and transparency