Hercules Careers

Sr. DevOps Engineer (Portugal)

About HerculesAI

HerculesAI helps finance and operations leaders solve problems that are too complex, large-scale, or time-consuming for human teams to manage alone. Its platform automates the validation and verification of data across millions of high-volume, rules-based transactions, improving billing accuracy, reducing costs, and accelerating cash flow. Built on a modular, multi-AI agent architecture, HerculesAI delivers industry-specific solutions for staffing, insurance, government, and financial services. Its accuracy and consistency enable enterprises to achieve levels of precision and speed that were previously out of reach. 

Headquartered in the United States, HerculesAI also has offices in the United Kingdom, Armenia, Canada, and Portugal. 

About the role

We are seeking a Senior DevOps Engineer to design, automate, and operate infrastructure across both cloud (Azure, AWS, GCP) and on-prem environments. The role focuses on Kubernetes operations, CI/CD automation, hybrid infrastructure management, and security, with opportunities to support AI-powered workloads. This is a hands-on engineering position, ideal for someone who thrives in hybrid and multi-cloud environments.

What you'll do

  • Deploy, scale, and manage Kubernetes clusters in both cloud and on-premises environments.
  • Build and maintain CI/CD pipelines using modern automation tools and Infrastructure as Code practices.
  • Manage hybrid infrastructure, ensuring scalability, resilience, and disaster recovery readiness.
  • Strengthen security and compliance through identity management, network policies, and encryption strategies.
  • Implement observability solutions (metrics, logging, tracing) to ensure system reliability and performance.
  • Optimize infrastructure performance and cloud costs while maintaining high availability.
  • Own Kubernetes operations across cloud and on-prem: provision clusters, manage upgrades, enforce policies, and standardize app delivery (Helm/Kustomize) with progressive rollouts (blue/green, canary).
  • Design, build, and maintain CI/CD pipelines (GitHub Actions/Azure DevOps/Argo CD) using IaC (Terraform) and GitOps; enforce quality/security gates and artifact promotion.
  • Architect and operate hybrid infrastructure (Azure, AWS, GCP, on-prem): networking, identity, storage, backup/DR, and capacity planning with clear RTO/RPO objectives; run DR tests regularly.
  • Implement zero-trust and compliance controls: IAM/least-privilege, secrets management (Vault/KMS), mTLS, network policies, container image scanning/signing/attestation (Trivy/Cosign), SBOMs, and policy-as-code (OPA/Gatekeeper/Kyverno).
  • Establish observability end-to-end: metrics, logs, traces (Prometheus/Grafana/OpenTelemetry/ELK), SLOs/SLIs, alerting, runbooks, and on-call rotation hygiene.
  • Optimize performance and cost: right-size workloads, set requests/limits, enable autoscaling, implement spot/reserved strategies, and produce FinOps reporting.
  • Partner with Engineering, Product, AI, and Security to support AI/LLM workloads (GPU scheduling, device plugins, quotas), model artifact storage, and data pipelines; drive post-release verification and incident retrospectives.
  • Create paved roads and reusable templates: environment blueprints, bootstrap scripts, golden images, and self-service tooling for developers.
  • Lead incident response: triage, rollback, root-cause analysis, corrective actions, and knowledge base updates.

Key Areas of Expertise

  • Kubernetes & Containers: Cluster deployment, scaling, and application delivery.
  • Automation & IaC: Terraform, Helm, Kustomize, CI/CD pipelines.
  • Hybrid Infrastructure: Cloud (Azure, AWS, GCP) and on-premises systems.
  • Security & Compliance: Zero-trust networking, identity access management, encryption.
  • Observability & Reliability: Monitoring, logging, and tracing with industry-standard tools.
  • Performance & Cost Optimization: Resource tuning and efficiency improvements.
  • GitOps & Release Engineering: Argo CD/Flux, promotion gates, progressive delivery (blue/green, canary).
  • Networking & Platform Security: VPC/VNet, ingress controllers, service mesh, mTLS, network policies.
  • Supply-Chain Security: Image scanning/signing/attestation (Trivy, Cosign), SBOMs, provenance.
  • Policy as Code & Guardrails: OPA/Gatekeeper/Kyverno; pre-merge checks in CI; admission policies.
  • Observability at SLO Level: Prometheus, Grafana, OpenTelemetry, ELK; actionable alerting; runbooks.
  • Resilience/DR: Backup/restore testing, RTO/RPO adherence, capacity planning.
  • FinOps: Cost per workload/unit, right-sizing, autoscaling, spot/reserved strategy.
  • AI/ML Infra (nice-to-have): GPU nodes, device plugins, quotas, scheduling constraints.

Necessary Qualifications

  • 6+ years in DevOps/SRE/Platform Engineering with hands-on ownership of Kubernetes, CI/CD, and hybrid cloud operations at scale.
  • Required: strong proficiency with Terraform (IaC), Helm/Kustomize, container registries, and GitOps workflows (e.g., Argo CD/Flux).
  • Proficient with at least one major CI system (GitHub Actions, Azure DevOps) and artifact management; fluent in scripting (Bash) and one programming language (Python or Go preferred).
  • Deep knowledge of cloud primitives (Azure/AWS/GCP) and on-prem virtualization; networking (VPC/VNet, ingress, service mesh), storage, and security controls.
  • Observability stack experience (Prometheus, Grafana, OpenTelemetry, ELK), SLI/SLO design, and actionable alerting.
  • Security by default: IAM, secrets management (Vault/KMS), image scanning/signing (Trivy/Cosign), SBOMs, and policy as code (OPA/Gatekeeper/Kyverno).
  • Proven track record with DR/BCP, backup/restore testing, and capacity planning; comfort with incident command and postmortems.
  • Bonus: experience supporting AI workloads (GPU nodes, quotas), performance profiling, and cost modeling/FinOps.

Technology Office

Remote (Portugal)
