SRE / High-Middle DevOps
Aghanim is an integrated commerce, liveops automation, community engagement, and payments platform for video games.
Mobile games have traditionally depended on app stores for distribution, payments, and player relationships. We believe there is a better way. Aghanim helps game studios build direct relationships with players, sell directly, and build their future on their own terms. Today, more than 100 games worldwide are already building this future with Aghanim.
Our team brings together people across Los Angeles, New York, Seoul, Beijing, London, Lisbon, Belgrade, and other locations around the globe, with deep expertise in gaming, fintech, and technology. We move quickly, keep communication direct, and focus on getting things done. We believe the best people thrive when they have autonomy, ownership, and a stake in the company's success.
We’re looking for a Middle/High-Middle DevOps / SRE Engineer to help run and improve our production platform on GCP and GKE, fronted by Cloudflare, with observability in Datadog and CI/CD in GitHub Actions.
You’ll work closely with Senior/Principal engineers, implementing reliability improvements, expanding monitoring coverage, and reducing operational toil, which is especially important in a high-load system with sudden traffic spikes.
Key Responsibilities
Platform Operations
Operate and improve production systems on GCP, GKE, and related managed services
Contribute to platform reliability, scalability, and operational improvements alongside Senior and Principal engineers
Infrastructure & Delivery
Implement infrastructure changes using Terraform and maintain Kubernetes configurations, Helm charts, and deployment tooling
Improve CI/CD automation and deployment reliability
Observability & Incident Management
Build and maintain monitoring, alerting, and observability in Datadog
Participate in incident response, troubleshooting, and postmortem activities
Security & Cost Optimization
Support security tooling, vulnerability remediation, and secure platform practices
Identify and implement cost optimization opportunities without compromising reliability
Required Qualifications
Hands-on production experience with Kubernetes (ideally GKE) and basic cluster operations.
Working experience with Terraform and Helm in PR-based workflows.
Familiarity with GCP services used in SaaS operations (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Cloud Run, Memorystore).
Monitoring, alerting, and troubleshooting skills (preferably with Datadog).
Strong scripting/automation mindset to reduce manual work and prevent repetitive incidents.
Reliability awareness: understanding how changes affect availability/latency and how to operate under SLA constraints.
Preferred Qualifications
Cloudflare basics (WAF/DNS, edge concepts; Workers/CDN are a plus).
Experience writing/maintaining runbooks and participating in postmortems.
Exposure to SOC 2 / PCI-DSS requirements or willingness to learn.
Experience in high-load consumer products or game development.
Why Join Us
World-class team – work alongside experienced professionals from around the globe who have built products used by millions of players
High growth, high impact – be part of a fast-growing company where ideas turn into products and reach customers in days, not months
Autonomy and ownership – we trust people to make decisions, take initiative, and drive results
Modern tools and technology – use AI, automation, and modern tools as part of your everyday work
Equity – participate in the company's growth and long-term success