Jobgether

Site Reliability Engineer - AI Agents

TLDR

Platform-focused SRE role building scalable cloud-native infra for AI agent workloads, blending SRE, MLOps, and developer tooling for production-ready AI systems.

Accountabilities:

You will be responsible for designing, operating, and scaling resilient infrastructure systems that support AI agent workloads in production, ensuring reliability, scalability, and developer usability across the platform.

  • Design, build, and operate cloud-native infrastructure supporting AI agent execution, orchestration, and model serving at scale
  • Ensure reliability, observability, and performance of distributed agentic systems across internal and external-facing products
  • Develop platform services, APIs, SDKs, and self-service tooling to enable teams to efficiently consume AI infrastructure capabilities
  • Manage and optimize compute, orchestration, and serving layers for AI and ML workloads in production environments
  • Build and maintain CI/CD pipelines to enable safe, fast, and reliable deployment of AI services and agent workflows
  • Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS-based infrastructure
  • Design monitoring, alerting, and observability systems tailored to AI/ML and agent-based workloads
  • Define and enforce reliability patterns, guardrails, and failure recovery mechanisms for LLM and agentic systems
  • Collaborate with AI, Data Engineering, and Product teams to transform experimental prototypes into production-ready systems
  • Manage Kubernetes-based container orchestration environments, ensuring scalable and efficient workload deployment
  • Implement security best practices and access controls across infrastructure and platform services
  • Document system architecture, operational procedures, and runbooks to support team knowledge sharing and reliability

Requirements:

The ideal candidate is a strong platform-minded engineer with deep SRE experience, a solid understanding of cloud-native systems, and exposure to AI/ML infrastructure or agent-based systems.

  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar production-focused roles
  • Hands-on experience supporting ML systems, model serving infrastructure, or MLOps pipelines in production environments
  • Strong experience building developer platforms, internal tools, APIs, or SDKs used by engineering teams at scale
  • Deep understanding of platform engineering principles, including self-service infrastructure and developer experience design
  • Strong proficiency with Infrastructure as Code tools, particularly Terraform
  • Advanced experience with Kubernetes and containerized environments (Docker)
  • Solid cloud infrastructure experience, preferably within AWS environments
  • Strong programming and scripting skills (Python preferred, plus bash/shell proficiency)
  • Experience designing and operating observability, logging, monitoring, and alerting systems
  • Proven experience with incident response, on-call rotations, and production reliability ownership
  • Strong cross-functional collaboration skills across AI, data, and engineering teams
  • High ownership mindset with the ability to operate in fast-moving, high-stakes production environments
  • Familiarity with AI/agent systems, orchestration frameworks, or LLM-based applications is a strong plus

Benefits:

  • Competitive compensation package with performance-based incentives
  • Remote-first working model across multiple eligible countries
  • Comprehensive medical, dental, and vision insurance coverage (where applicable)
  • Retirement savings plans with employer contribution options
  • Flexible PTO policy and company holidays
  • Mental health support and wellness programs
  • Learning and development budget for technical and professional growth
  • Opportunities to work on cutting-edge AI agent infrastructure at global scale
  • Inclusive, distributed engineering culture with strong emphasis on ownership and impact
  • Regular opportunities to collaborate with high-performing AI and platform engineering teams

How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
 
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
 
 

Jobgether runs the largest remote job platform, effectively linking job seekers with over 200,000 flexible and remote opportunities that match their unique skills and preferences. Our focus is on enhancing the hiring process, ensuring efficiency while prioritizing the candidate experience, particularly in the growing health and wellness sector.

Founded: 2020
Employees: 11-50
Industry: Professional Services