Reliability Operations Specialist

About Nexcess

Nexcess provides specialty cloud solutions for organizations where performance and compliance have to coexist. We serve businesses worldwide, from agencies scaling client sites to enterprises running mission-critical operations. We've built our reputation on deep technical expertise and genuine partnership with every client we work with. Behind every environment we manage is a team of people who take the craft seriously and keep showing up when it matters.

About the role

We're looking for a Reliability Operations Specialist to help drive operational excellence across incident management, service reliability, observability, and continuous improvement initiatives. This role serves as a central coordinator and subject matter expert for reliability practices, helping teams improve service stability, reduce operational risk, and strengthen operational readiness across the organization.

The Reliability Operations Specialist partners closely with engineering, infrastructure, security, and operations teams to facilitate incident response, oversee post-incident reviews, track corrective actions, and provide visibility into the health and reliability of our platforms. This role does not have direct people management responsibilities but plays a critical role in influencing reliability outcomes through collaboration, process ownership, and data-driven decision making.

Location: Remote

Employment type: Permanent, Full-time

Pay Range: $85,000 - $100,000 annually. The final compensation offered will be determined based on factors including location, experience, skills, qualifications, and market conditions.

What you'll do

Incident Management & Operational Excellence

Participate in major incident response activities and serve as an Incident Commander when assigned.
Coordinate incident response efforts across multiple teams during service-impacting events.
Facilitate escalation management, stakeholder communications, and status reporting.
Support the ongoing improvement of incident management processes, procedures, and operational readiness.
Drive initiatives focused on reducing Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
Identify opportunities to improve operational efficiency, reliability, and service delivery.

Post-Mortem Management & Continuous Improvement

Coordinate and facilitate blameless post-mortem reviews following significant incidents.
Ensure post-mortems are completed accurately, consistently, and within established timelines.
Analyze incident trends to identify recurring issues, systemic risks, and improvement opportunities.
Maintain accountability for corrective action tracking and closure.
Partner with stakeholders to prioritize and drive reliability-focused improvements.
Foster a culture of learning, accountability, and continuous improvement.

Reliability & Observability

Partner with engineering teams to define, maintain, and mature Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Support the development and evolution of observability practices, including monitoring, alerting, dashboards, and telemetry standards.
Analyze reliability metrics and operational performance data to identify opportunities for improvement.
Recommend and track initiatives that improve platform stability, resiliency, and service performance.
Help establish operational best practices that support scalable and reliable platform operations.

Reporting & Stakeholder Communication

Develop reliability reporting for engineering leadership and executive stakeholders.
Maintain incident communication standards and stakeholder notification processes.
Provide regular reporting on incident performance, corrective actions, reliability trends, and service health.
Translate technical reliability metrics into actionable business insights and recommendations.
Present findings and recommendations to technical and non-technical audiences.

What you bring

3+ years of experience in Product Operations, Platform Operations, Technical Customer Support, Incident Coordination, Site Operations, IT Service Management (ITSM), or a related operational role.
Experience participating in or coordinating major incident response activities.
Knowledge of incident management, root cause analysis, problem management, and post-mortem methodologies.
Experience working with monitoring, alerting, observability, or operational reporting tools.
Strong analytical and organizational skills with exceptional attention to detail.
Excellent written and verbal communication skills.
Ability to work effectively across multiple teams and influence outcomes without direct authority.
Strong problem-solving skills and the ability to remain calm and organized during high-pressure situations.

Preferred Qualifications

Experience working with Service Level Objectives (SLOs), Service Level Indicators (SLIs), and reliability metrics.
Familiarity with Linux systems, cloud infrastructure, networking concepts, hosting platforms, or distributed systems.
Knowledge of ITIL, operational excellence frameworks, or Site Reliability Engineering (SRE) principles.
Experience supporting high-availability SaaS, hosting, cloud, or infrastructure environments.
Experience creating executive-level operational reports, dashboards, and presentations.
Experience using observability and incident management platforms.

What we offer

Comprehensive benefits package
Traditional and Roth 401(k) with company matching
A collaborative, team-oriented culture
Consistent and predictable work hours
Engaging, varied work that keeps each day different
Opportunities to contribute ideas and influence how work gets done

Disclaimer:

This job description is only a summary of the typical functions of the position. It is not intended to be an exhaustive or comprehensive list of all job responsibilities, tasks, or duties. Additional duties and tasks may be assigned as part of the job function. Nexcess reserves the right to modify, interpret, or apply this job description in a way that best supports organizational needs. The job description in no way creates or implies an employment contract. Employment remains “at will”.

Equal Employment Opportunity Policy:

Nexcess is committed to offering equal employment opportunity without regard to age, color, disability, gender, gender identity, genetic information, marital status, military status, national origin, race, religion, sexual orientation, veteran status, or any other legally protected characteristic.

Platform Reliability

United Kingdom

Bulgaria

India

Remote (United States)