Staff/Senior Staff Engineer, Kubernetes
TLDR
Lead multi-cloud Kubernetes operations and cloud-native architecture for large-scale production clusters, driving reliability, security, and automation across Alibaba Cloud and AWS.
Who We Are
What You’ll Be Doing
- K8s cluster lifecycle management: Own the build, scaling, version upgrades, daily operations, fault diagnosis, and performance tuning of large-scale production Kubernetes clusters; ensure 24/7 high availability and stable operations; support continuous business iteration.
- Alibaba Cloud & AWS multi-cloud operations (core responsibility): Operate, govern, and optimize Alibaba Cloud and AWS resources across dual-cloud environments, covering container services, networking, storage, IAM, load balancing, databases, and object storage; manage configuration changes, cost optimization, and disaster recovery to achieve unified multi-cloud governance.
- Cloud-native architecture and optimization: Lead containerization and microservices operational rollout; optimize Pod scheduling, resource quotas, network policies, image management, and log monitoring systems; resolve cluster resource fragmentation, business adaptation, and network interoperability challenges.
- Stability and security: Build comprehensive K8s cluster monitoring, alerting, logging, and distributed tracing systems; define operations runbooks, change processes, and incident response plans; strengthen cluster security controls, disable high-risk permissions, harden container runtime environments, and ensure infrastructure and business data security.
- Automated operations and DevOps: Develop operations automation scripts using Shell/Python; integrate Jenkins, GitLab CI, and ArgoCD to build automated release, inspection, and backup systems; implement Infrastructure as Code (IaC) principles to improve efficiency and reduce human error.
- Incident management and post-mortem optimization: Lead online incident response, conduct root cause analysis, produce post-mortem reports, and continuously optimize cluster architecture, resource allocation, monitoring strategy, and long-term stability assurance mechanisms.
- Technical knowledge sharing and team empowerment: Track cloud-native and public cloud technology developments; document operations best practices and technical specifications; help the team improve its multi-cloud K8s operations capabilities.
What We Look For In You
- Bachelor's degree or above in a computer-related field; 4+ years of hands-on experience operating production-level Kubernetes clusters; proficient in K8s core principles and components including Pod, Deployment, StatefulSet, Service, Ingress, CRD, controllers, scheduling strategies, network models, and storage mounting; able to independently resolve complex cluster failures and performance bottlenecks.
- Proficient in Alibaba Cloud and AWS dual-cloud operations, with independent experience in dual-cloud production environments:
  - Alibaba Cloud: proficient in ACK Container Service, ECS, SLB, VPC, RAM, RDS, OSS, CloudMonitor, security groups, and snapshot backups.
  - AWS: proficient in EKS, EC2, S3, VPC, IAM, TGW, load balancing, CloudWatch, and security policies; practical experience in overseas cloud deployment, operations, and disaster recovery.
- Proficient in Linux system administration; familiar with system optimization, permission control, process management, log analysis, and online troubleshooting.
- Familiar with mainstream container runtimes (containerd/Docker); understand K8s networking (CNI plugins such as Calico/Flannel), storage (CSI), and multi-cluster management; familiar with Istio/Envoy service mesh, east-west traffic governance, canary (gray) releases, and network interoperability.
- Strong Shell and Python automation skills; experienced with CI/CD pipelines (Jenkins, GitLab CI, ArgoCD); familiar with IaC tools (Terraform, Ansible, Helm); experienced with observability stacks (Prometheus, Grafana, ELK/EFK, Jaeger, SkyWalking).
- Preferred: experience in large-scale public cloud environments (100+ nodes); multi-cloud cost optimization; K8s security hardening (OPA/Gatekeeper, Pod Security Standards, Falco); Kubernetes CKA/CKS certification; experience with AI/LLM workload scheduling (GPU scheduling, distributed training).
Perks & Benefits
- Competitive total compensation package
- L&D programs and education subsidy for employees' growth and development
- Various team building programs and company events
- Wellness and meal allowances
- Comprehensive healthcare schemes for employees and dependants
- More that we would love to tell you about along the way!
OKX operates as a prominent cryptocurrency exchange, enabling users to buy, sell, and trade a wide range of digital assets, including Bitcoin and Ethereum. In addition to facilitating crypto trading, OKX has developed OKX Wallet, a widely used platform for accessing decentralized applications and exploring the Web3 landscape.
- Founded: 2017
- Employees: 500+
- Industry: Diversified Financial Services