Senior Site Reliability Engineer

Order.co is the System of Action for the Office of the CFO, transforming the way businesses purchase and pay into an intuitive, B2C-like shopping experience. Order.co leverages embedded AI agents and embedded financial products to reinvent the way businesses connect with their vendors.

End users enjoy a seamless, zero-training buying experience, while finance and procurement leaders gain a single platform to orchestrate how the business “should operate.” The result is an all-in-one solution that serves as a gravitational pull for spend and data, automating and eliminating procurement and finance workflows from requisition to reconciliation along the way.

Order.co is on the cutting edge of B2B Agentic Commerce, poised to be the market leader in creating a more predictive, prescriptive, and personalized experience for users.

Founded in 2016 and headquartered in New York City, Order.co oversees nearly half a billion dollars in annualized spend across hundreds of customers like WeWork, SoulCycle, Lume, and [solidcore]. Order.co has raised $75M in funding from industry-leading investors like MIT, Stage 2 Capital, Rally Ventures, 645 Ventures, and more. Order.co has been proudly named a 50 to Watch by Spend Matters and a Best Place to Work by BuiltIn and Inc. Magazine.

The Role

As a Senior Site Reliability Engineer on the Platform team, you will ensure that software systems are reliable, scalable, performant, and operationally efficient. You blend software engineering skills with infrastructure and operations expertise to keep critical systems running smoothly while enabling rapid product development.

Responsibilities

Reliability Engineering & Infrastructure Ownership

Design, build, and operate highly available, scalable, and fault-tolerant infrastructure and platform services
Own reliability, availability, latency, and operational excellence for critical production systems and services
Define and maintain service level objectives (SLOs), service level indicators (SLIs), and error budgets across platform systems
Lead incident response efforts for complex production outages; drive root-cause analysis and long-term remediation actions
Build resilient systems that gracefully handle failures, traffic spikes, dependency degradation, and regional outages
Continuously improve system reliability through automation, observability, performance tuning, and capacity planning

Automation & Platform Engineering

Develop infrastructure automation and self-service tooling to reduce operational toil and improve engineering velocity
Build and maintain CI/CD pipelines, deployment automation, and release engineering workflows
Implement infrastructure as code (IaC) practices using tools such as Terraform, CloudFormation, and container orchestration
Improve developer experience by building reliable internal platforms, operational tooling, and standardized deployment patterns
Drive adoption of GitOps, immutable infrastructure, and automated remediation patterns

Observability & Operational Excellence

Design and maintain comprehensive monitoring, logging, tracing, and alerting systems for distributed services
Establish actionable alerting standards that reduce noise while improving incident detection and response times
Analyze production trends, system bottlenecks, and failure patterns to proactively prevent incidents
Lead operational readiness reviews, disaster recovery planning, and game-day exercises
Improve mean time to detect (MTTD) and mean time to recovery (MTTR) through tooling, automation, and process refinement

Systems Architecture & Scalability

Participate actively in architecture and infrastructure design reviews
Propose scalable and reliable platform designs that account for multi-region deployment, redundancy, failover, and security considerations
Evaluate trade-offs between reliability, scalability, operational complexity, and engineering velocity
Identify systemic risks and operational gaps before they become production incidents
Partner with engineering teams to ensure services are designed with operability, observability, and resilience in mind from day one

Security & Compliance

Approach infrastructure and operational practices with a strong security mindset
Implement and maintain secure cloud networking, secrets management, IAM policies, and infrastructure hardening standards
Partner with Security and Compliance teams to ensure systems meet organizational and regulatory requirements
Drive operational best practices around vulnerability management, patching, and production access controls

End-to-End Ownership & Collaboration

Scope and estimate infrastructure and reliability initiatives accurately
Coordinate production rollouts, maintenance events, and reliability improvements across teams
Communicate operational risks, dependencies, and incident impacts clearly to technical and non-technical stakeholders
Collaborate closely with Software Engineering, Security, Product, and Operations teams to improve platform reliability and scalability
Serve as a trusted escalation point during critical production incidents

Mentorship & Technical Leadership

Mentor junior and mid-level engineers on reliability engineering principles, operational excellence, and infrastructure best practices
Raise the operational maturity of the engineering organization through documentation, reviews, and technical guidance
Drive improvements in team standards around observability, incident management, automation, and infrastructure design
Influence technical decisions through credibility, operational expertise, and strong engineering judgment

Qualifications

You are motivated by accountability — you own outcomes, not just tasks
You are results-oriented and measure success by shipped, working software
You are motivated by correctness in code that touches money — the consequences of a bug land on real customer balances, and you take that seriously
You love helping people on your team grow and improve
Writing tests is an integral part of your development process, not an afterthought
You know how to design and build software incrementally — you don't need a complete spec to make progress
Collaborating with the people around you to achieve a goal motivates you
You are collaborative, open-minded, and actively developing your craft
You are curious and pragmatic about AI-driven solutions — you apply them where they add real value and stay skeptical where they don't
Familiarity with AI-assisted development tools — you understand how they work, where they help, and where they fail. Prior hands-on use is a plus; intellectual curiosity and the instinct to evaluate AI output critically are what matter

Technical Skills

Strong foundation in computer science fundamentals: data structures, algorithms, and system design
Familiarity with building production-grade applications and services using Ruby and Ruby on Rails
Deep expertise with Linux systems administration and production troubleshooting
Strong experience operating cloud infrastructure at scale, particularly within AWS environments
Experience with Kubernetes, container orchestration, and cloud-native infrastructure patterns
Proficiency with infrastructure as code tools such as Terraform or CloudFormation
Expertise designing and operating CI/CD pipelines and deployment automation systems
Deep understanding of observability tooling including Datadog, OpenTelemetry, or similar platforms
Strong knowledge of distributed systems reliability patterns including redundancy, failover, autoscaling, rate limiting, and graceful degradation
Experience building automation and operational tooling using languages such as Python, Go, Bash, or Ruby
Strong understanding of networking fundamentals including DNS, load balancing, TLS, VPNs, firewalls, and service discovery
Hands-on experience with incident response, root-cause analysis, and production operations in high-availability environments
Familiarity with SRE methodologies including SLOs, SLIs, error budgets, capacity planning, and operational maturity modeling
Experience implementing secure infrastructure and cloud security best practices including IAM, secrets management, and vulnerability remediation
Proven ability to design scalable, resilient, and maintainable platform systems and APIs
Experience supporting distributed microservices architectures and event-driven systems
Strong understanding of operational excellence principles including automation-first engineering and toil reduction
Experience using AI-assisted engineering tools (e.g., Claude, GitHub Copilot) as force multipliers while applying sound operational and engineering judgment
Excellent debugging and systems thinking skills across infrastructure, networking, application, and platform layers

What Great Looks Like

A Senior Site Reliability Engineer on the Platform team who is thriving at this level demonstrates:

Reliable delivery of complex work — consistently ships multi-part solutions on time with low defect rates
Low defects in owned areas — proactively monitors and improves the quality of the systems they own; that means incident-free quarters in code paths that move funds and clean reconciliation against vendor reports
Measurable mentorship impact — engineers around you write better code because of your reviews and guidance

"Someone we can depend on for the work that matters — especially the work that touches money."

Failure Modes We Screen Against

We actively evaluate candidates for the following anti-patterns during the interview process:

Failure Mode	What It Looks Like
Strong coder, weak owner	Ships code but doesn't manage to the task — owns the merge, not the outcome; hands off and moves on without monitoring or fixing post-release issues
Solo expert	Hoards knowledge instead of sharing — becomes a single point of failure and blocks team growth
Overconfident designer	Proposes solutions without considering trade-offs — jumps to conclusions, resists alternative approaches
Rubber-stamper	Produces AI-generated output without verifying it against the codebase, tests, or business context

Interview Process

Our 5-round process is designed to evaluate you across all competency areas. AI tools are permitted in technical rounds.

Round	Format	What We Evaluate
1 — Hiring Manager Screen	60 min, conversational	Career trajectory, mentorship philosophy, technical influence examples, communication style
2 — Take-Home + PR Discussion	72h take-home + 60 min live	Navigating unfamiliar code, ownership and decomposition discipline visible in your PR, root-cause judgment, AI tool usage
3 — System Design + Artifact Critique	60 min, Miro board	Requirements gathering, schema/API design, trade-off articulation, calibrated code-review judgment on a teammate's PR
4 — Team Interview (conditional)	30 min, behavioral	Collaboration patterns, mentorship behavior, negotiation behavior with cross-functional partners
5 — Culture Add	30 min, People Team	Organizational values alignment

Round 4 is conditional: it runs when the team needs additional behavioral signal after Rounds 2 and 3, and is otherwise skipped. Your recruiter will tell you whether it's scheduled before your loop is finalized.

The Round 2 (Take-Home + PR Discussion) and Round 3 (System Design) exercises are drawn from real problems so the technical evaluation is grounded in the work you'd actually be doing.

What You’ll Receive

Competitive compensation including base salary, bonus, and equity
Employer-sponsored 401(k) with match
Comprehensive medical, dental, and vision coverage
Flexible time off and hybrid work environment

The anticipated annual salary range for this role is $175,000 - $200,000. Actual compensation and title will be commensurate with experience, qualifications, knowledge, and skills.

410 - Engineering

Remote (United States)
