Senior SRE - Distributed Systems & Cloud Infrastructure

About Us


Chess.com is one of the largest gaming sites in the world and the #1 platform for playing, learning, and enjoying chess.


We are a team of 600+ fully remote people in 60+ countries working hard to serve the global chess community. We support 200M+ chess players worldwide with the best possible product, content, and tools!


We are a tech company. A gaming company. A content company. And we do it all with passion and commitment to the game. Above all, we prize our mission-driven, flat, life-celebrating, no-corporate culture, and we look forward to meeting you and learning more about what you can bring to the team.



About You

  • You’re a passionate member of the Chess.com community, with an acute understanding of our users and their needs.
  • You have advanced expertise in distributed systems and several years of experience integrating and optimizing cloud-native services using Kubernetes, Golang, and TypeScript at scale.
  • You excel at deep-diving into both application code and core system internals to optimize performance and architect robust solutions.
  • You thrive in globally distributed teams, are humble and humorous, and take strong ownership of your work.
  • You’re enthusiastic about tackling the complexities of high-traffic, data-intensive environments and are eager to push the limits of infrastructure reliability and scalability for Chess.com.



What You'll Do

Architect & Optimize Infrastructure:

  • Lead the design and optimization of cloud-native services using Kubernetes, Terraform, and GitOps tools like ArgoCD.
  • Develop high-performance integration patterns and manage scalable, distributed systems handling extensive data volumes.

Deep Performance Tuning:

  • Dive into Golang and TypeScript codebases to identify and resolve performance bottlenecks at scale.
  • Optimize infrastructure and application code to achieve aggressive performance and reliability targets, with a focus on chess programming at the bit level.

Collaboration & Best Practices:

  • Work closely with development teams to refine cloud service integration architectures and implement best practices.
  • Monitor and enhance system reliability and performance through effective collaboration and innovative solutions.

Incident Response & Operational Excellence:

  • Participate in incident response for critical infrastructure issues, ensuring rapid resolution and minimal downtime.
  • Drive improvements in infrastructure reliability, scalability, and operational efficiency.

Infrastructure & Automation:

  • Utilize Terraform and Kubernetes to manage and scale our cloud infrastructure, ensuring robust, automated deployment processes.



Required Skills

High-Scale Cloud Operations:

  • 5+ years of experience managing and scaling large-scale, cloud-native distributed systems.
  • Deep understanding of Kubernetes, Terraform, and GitOps practices.
  • Expertise in observability practices and the ability to support incident response and on-call rotations.

Advanced Development in Golang:

  • Extensive experience in high-performance service development with Golang.
  • Proven ability to profile and optimize applications for high throughput and reliable operation.

Distributed Systems Expertise:

  • Strong knowledge of distributed systems design, failure modes, and robust architectural principles.
  • Experience with data modeling and indexing strategies to support efficient service operations.

Performance Optimization:

  • Demonstrated experience improving system reliability and performance through deep code-level and architectural analysis.

Communication & Collaboration:

  • Excellent written and verbal communication skills.
  • Experience working in globally distributed teams.



Preferred Skills

Chess Programming:

  • Experience in chess programming, including bit-level manipulations and optimizations.
  • Experience with C/C++.

Observability & Cloud Practices:

  • Familiarity with modern observability tools and practices.
  • Hands-on experience with Kubernetes and cloud-native workflows.



About the Opportunity

  • This is a full-time opportunity.
  • We are 100% remote (work from anywhere!).

