Era4

Technical Operations Manager - AI

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.


The Technical Operations Manager is responsible for the implementation and day-to-day running of a new, greenfield Technical Operations Centre, encompassing Client Support, SRE / AIOps and Automation. Sitting across traditional Service Desk and SRE / AI Engineering, this new role ensures that Era4’s sovereign AI/HPC infrastructure is supported, monitored, and delivered to contracted SLA targets from day one. The best part is that it is yours to build.


You will design and shape the function, embed SLO-driven thinking and agentic approaches, own escalation pathways, and translate complex infrastructure events into clear customer communications. This is a foundational role with a direct line to leadership and genuine scope to shape how the function operates at scale.


Key Responsibilities:


Operations Leadership:

  • Own the end-to-end operational performance of the 24x7 Operations Centre: incident management, change management, and problem management
  • Serve as the primary escalation point for P1/P2 incidents, providing incident command and coordinating resolution across SRE, Service Desk, DevOps partners, and third-party vendors
  • Maintain and continuously improve operational runbooks, SOPs, and the postmortem-to-runbook learning pipeline
  • Lead regular operational reviews (weekly, monthly) and produce management-ready performance reports covering MTTR, SLA adherence, error budget consumption, and incident trends
  • Manage on-call rotas and escalation schedules across the Operations Centre; coordinate overnight cover and handoff procedures
  • Own the change advisory process, ensuring all infrastructure changes are risk-assessed, scheduled, and communicated appropriately
  • Line-manage the Service Desk function: own ticket triage workflows, SLA timers, first-contact resolution targets, and customer communication standards within the ITSM platform
  • Champion a customer-first culture across the Service Desk


SRE / AIOps:

  • Own the SLO/error budget framework at a programme level: hold the team accountable to error budget targets, use burn rate data to drive prioritisation decisions, and escalate when automation investment needs to be throttled or accelerated.
  • Provide operational context in sprint planning and backlog prioritisation; ensure the SRE team’s roadmap is anchored to customer experience, customer-impacting risk reduction and compliance milestones, not engineering preference alone.
  • Manage and develop third-party integrations at both a service and a technical level.


Required Experience & Skills:

  • Comfortable and confident dealing directly with clients, from technical support tickets to service reviews with senior leadership.
  • Proven background in infrastructure operations, HPC, SRE, NOC, managed services, or an equivalent mission-critical environment, in a management or senior lead role.
  • Demonstrated experience across at least two of the three domains: NOC/incident operations, service management (ITSM, SLA governance), and SRE/platform engineering — with sufficient working knowledge of the third to operate effectively as an escalation point.
  • Working knowledge of observability tooling — Grafana, Prometheus, or equivalent; able to read dashboards, interrogate alert logic, and hold meaningful conversations with engineering teams and third-party vendors.
  • Fluency with SLA/SLO frameworks: designing, implementing, and reporting against contractual and internal service targets.
  • Strong Linux, container and infrastructure knowledge, specifically in supporting GPU and HPC workloads in production.


One or more of the following would be an advantage:

  • Operational experience with GPU infrastructure (NVIDIA HGX, DGX, InfiniBand) or AI/HPC compute environments.
  • Familiarity with DCGM Exporter, GPU telemetry, or equivalent high-density compute monitoring.
  • Experience integrating and automating ticketing platforms (Halo, ServiceNow, Freshservice, or equivalent) and with ITIL-based incident, problem, and change management.
  • Hands-on experience with GitLab, GitOps workflows, and infrastructure-as-code (Terraform, Ansible, or AWX).
  • Exposure to agentic remediation / AIOps tooling — automated alerting, event correlation, or self-healing runbooks.
  • Exposure to one or more of Python, Go, Bash, PromQL.
  • Experience in a data centre, cloud, colocation, managed hosting, or sovereign cloud environment.

 

Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

Technology

United Kingdom - Hybrid (Visit to office / site locations required)
