Era4

SRE / AIOps Engineer

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. Converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration opportunities with cleaner, efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.

**This is a greenfield role, building a modern Agentic approach to Client and Infrastructure Operations**.

 

Role Summary:

We are seeking Automation & AIOps Engineers who sit at the intersection of Site Reliability Engineering and modern AI-driven operations. Embedded within Era4's engineering-led Operations Centre, this role exists to build a modern AI Platform Operations function from scratch, designing tooling and agentic workflows, with no legacy to deal with.

 

Key Responsibilities:

 

Runbook Automation & Agent Development:

  • Build agentic, executable workflows capable of triaging, diagnosing, and where appropriate autonomously remediating known failure patterns.
  • Build and maintain LLM-backed agents targeting the observability stack, ITSM platform, and infrastructure APIs (e.g. DCIM, IPAM, hypervisor layers).
  • Develop auditable, Client-focused automations for Client interactions and workflows, with appropriate controls.
  • Develop safe, auditable automation with appropriate controls for higher-risk platform actions.

 

Operational Tooling & Self-Service Enablement:

  • Build internal tooling that empowers engineers and service desk analysts: CLI utilities, ChatOps integrations (Slack/Teams bots), status dashboards, and self-service automation hooks.
  • Reduce dependency on DevSecOps and engineering teams for routine operational tasks through automation.
  • Maintain and contribute to a library of automation assets, agent prompts, and runbook-as-code artefacts, version-controlled and peer-reviewed.

 

Event & Alert Intelligence:

  • Develop the automation layer around monitoring and event management: alert suppression logic, enrichment pipelines, correlation rules, and alert-to-ticket integrations.
  • Continuously tune signal-to-noise ratios across monitoring tooling (Prometheus, Mimir, Grafana, or equivalent) to improve situational awareness.
  • Design and implement event correlation and deduplication logic to reduce alert storms and improve incident context.

 

Continuous Improvement & Knowledge Capture:

  • Identify common operational patterns and tasks as candidates for automation; maintain and prioritise a toil reduction backlog.
  • Participate in post-incident reviews and translate findings into updated automation, runbooks, or agent logic.
  • Contribute to the evolution of Era4's operational standards, tooling architecture, and agent framework.

 

Essential Experience:

 

Technical – Core Element:

  • Strong Python development skills, including scripting for automation, API integration, and data processing.
  • Hands-on experience with observability and monitoring platforms: Prometheus, Grafana, Mimir, or equivalent.
  • Experience integrating with ITSM platforms (ServiceNow, Halo, Jira Service Management, or similar) via API.
  • Solid understanding of event-driven architectures, message queues, and webhook-based automation patterns.
  • Strong understanding of managing GPU infrastructure in production, including key signals and metrics and the automation of workflows.
  • Familiarity with Infrastructure-as-Code principles and cloud-native environments (Kubernetes, Terraform, or similar).

 

Technical – Agent & AI:

  • Demonstrable experience building LLM-powered agents or automation using frameworks such as LangChain, LlamaIndex, the Anthropic SDK, OpenAI function calling, or comparable tooling.
  • Understanding of agentic design patterns: tool use, structured output, human-in-the-loop controls, and chain-of-thought reasoning for operational tasks.
  • Comfort operating in an API-first environment, integrating agents with infrastructure APIs, DCIM, IPAM, and hypervisor control planes.

 

Operational:

  • Prior experience in an SRE, Senior Operations, or Platform Engineering environment, with exposure to on-call operations and incident management processes.
  • Experience in converting narrative runbooks into executable automation or codified decision trees.
  • Understanding of ITIL-aligned incident and change management principles and ITSM tooling.

 

One or more would be an advantage:

  • Exposure to data centre or colocation operations, particularly high-density compute or GPU infrastructure environments.
  • Experience with ChatOps tooling: building Slack or Microsoft Teams bots for operational workflows.
  • Familiarity with DCIM platforms and telemetry pipelines (power, thermal, network).
  • Knowledge of OpenTelemetry, distributed tracing, or log aggregation platforms (Loki, ELK, Splunk).
  • Contributions to open-source observability or automation tooling.
  • Experience in a start-up or scale-up environment where tooling is built from scratch.

 

Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

 

Diversity & Inclusion

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. 

 

Executive & Operations

United Kingdom - Hybrid (occasional visits to London office)
