Site Reliability Engineer - AI Agents
TLDR
Bridge platform engineering and AI infrastructure to power scalable, production-grade AI agent systems, with a focus on reliability, observability, and developer tooling.
You will be responsible for designing, operating, and scaling the infrastructure layer that powers AI agent systems in production, ensuring reliability, observability, and developer usability across the platform.
Responsibilities
- Design, build, and operate scalable cloud infrastructure supporting AI agent execution, orchestration, and model serving in production
- Ensure reliability, performance, and observability of distributed agentic systems across internal and external products
- Develop platform services, APIs, SDKs, and self-service tooling to enable efficient consumption of AI infrastructure
- Manage compute, orchestration, and deployment infrastructure supporting AI and ML workloads at scale
- Build and maintain CI/CD pipelines for reliable, automated deployment of AI services and agent workflows
- Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS environments
- Design and operate monitoring, logging, alerting, and incident response systems tailored to AI/ML workloads
- Define reliability patterns, guardrails, and failure recovery mechanisms for LLM and agent-based systems
- Collaborate with AI and Data Engineering teams to evolve experimental prototypes into production-grade systems
- Manage Kubernetes-based container orchestration environments for scalable deployment of services
- Implement security controls, access management, and infrastructure best practices across systems
- Document architecture, runbooks, and operational procedures to support platform adoption and reliability
Qualifications
- 5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar roles in production environments
- Hands-on experience supporting ML infrastructure, model serving, or MLOps pipelines in production
- Experience building developer platforms, internal tools, APIs, or SDKs used at scale by engineering teams
- Strong understanding of platform engineering principles, including self-service infrastructure and developer experience design
- Proficiency with Infrastructure as Code tools, particularly Terraform
- Strong experience with Kubernetes and containerized systems (Docker)
- Solid cloud infrastructure experience, preferably AWS
- Strong scripting and programming skills (Python preferred, plus bash/shell proficiency)
- Experience designing and operating observability, monitoring, and alerting systems
- Experience with incident response processes and on-call operational ownership
- Strong collaboration skills across AI, data, and engineering teams
- High-ownership mindset with the ability to operate in fast-paced, high-stakes production environments
- Familiarity with AI agent systems, LLM-based applications, or orchestration frameworks is a strong plus
Benefits
- Competitive compensation package with performance-based incentives
- Fully remote-friendly structure with flexibility across eligible regions
- Comprehensive health coverage including medical, dental, and vision (where applicable)
- Retirement savings plans with employer contributions (where applicable)
- Flexible PTO policy and paid company holidays
- Mental health and wellness support programs
- Learning and development budget for continuous technical growth
- Opportunity to work on cutting-edge AI agent infrastructure at global scale
- High-ownership engineering culture with strong cross-functional collaboration
- Exposure to advanced platform engineering and applied AI systems
Requirements
The ideal candidate is a strong SRE or platform engineer with experience in cloud-native systems, production infrastructure, and exposure to ML or AI-driven workloads.
Jobgether runs the largest remote job platform, effectively linking job seekers with over 200,000 flexible and remote opportunities that match their unique skills and preferences. Our focus is on enhancing the hiring process, ensuring efficiency while prioritizing the candidate experience, particularly in the growing health and wellness sector.
- Founded: 2020
- Employees: 11-50
- Industry: Professional Services