Aghanim

SRE / High-Middle DevOps

Lisbon, Portugal

Full-Time

On-site

Aghanim is an integrated commerce, liveops automation, community engagement, and payments platform for video games.

Mobile games have traditionally depended on app stores for distribution, payments, and player relationships. We believe there is a better way. Aghanim helps game studios build direct relationships with players, sell directly, and build their future on their own terms. Today, more than 100 games worldwide are already building this future with Aghanim.

Our team brings together people across Los Angeles, New York, Seoul, Beijing, London, Lisbon, Belgrade and other locations around the globe, with deep expertise in gaming, fintech and technology. We move quickly, keep communication direct, and focus on getting things done. We believe the best people thrive when they have autonomy, ownership, and a stake in the company's success.

We’re looking for a Middle/High-Middle DevOps / SRE Engineer to help run and improve our production platform on GCP and GKE, fronted by Cloudflare, with observability in Datadog and CI/CD in GitHub Actions.

You’ll work closely with Senior/Principal engineers, implementing reliability improvements, expanding monitoring coverage, and reducing operational toil, which is especially important in a high-load system with sudden traffic spikes.

Key Responsibilities

Platform Operations

Operate and improve production systems on GCP, GKE, and related managed services
Contribute to platform reliability, scalability, and operational improvements alongside Senior and Principal engineers

Infrastructure & Delivery

Implement infrastructure changes using Terraform and maintain Kubernetes configurations, Helm charts, and deployment tooling
Improve CI/CD automation and deployment reliability

Observability & Incident Management

Build and maintain monitoring, alerting, and observability in Datadog
Participate in incident response, troubleshooting, and postmortem activities

Security & Cost Optimization

Support security tooling, vulnerability remediation, and secure platform practices
Identify and implement cost optimization opportunities without compromising reliability

Required Qualifications

Hands-on production experience with Kubernetes (ideally GKE) and basic cluster operations.
Working experience with Terraform and Helm in PR-based workflows.
Familiarity with GCP services used in SaaS operations (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Cloud Run, Memorystore).
Monitoring/alerting and troubleshooting skills (preferably Datadog).
Strong scripting/automation mindset to reduce manual work and prevent repetitive incidents.
Reliability awareness: understanding how changes affect availability/latency and how to operate under SLA constraints.

Preferred Qualifications

Cloudflare basics (WAF/DNS, edge concepts; Workers/CDN is a plus).
Experience writing/maintaining runbooks and participating in postmortems.
Exposure to SOC 2 / PCI-DSS requirements or willingness to learn.
Experience with high-load consumer products or game development.

Why Join Us

World-class team – work alongside experienced professionals from around the globe who have built products used by millions of players
High growth, high impact – be part of a fast-growing company where ideas turn into products and reach customers in days, not months
Autonomy and ownership – we trust people to make decisions, take initiative, and drive results
Modern tools and technology – use AI, automation, and modern tools as part of your everyday work
Equity – participate in the company's growth and long-term success
