Jobgether

Site Reliability Engineer (SRE)

TLDR

Drive reliability and scalability of distributed systems in Kubernetes, shaping observability, incident response, and AI-driven automation to improve platform resilience.

Accountabilities:

In this role, you will be responsible for building and maintaining highly reliable systems while continuously improving operational maturity across engineering teams. You will define reliability standards, lead incident management practices, and drive automation initiatives that reduce operational toil and increase system resilience.

  • Define and track SLI, SLO, and SLA metrics, operating with error budget principles
  • Design and implement high availability, disaster recovery, and resilience strategies (RTO/RPO)
  • Build and evolve observability platforms (logs, metrics, traces, alerts, dashboards)
  • Lead incident response processes, including on-call coordination and escalation flows
  • Perform root cause analysis (RCA) and post-mortem reviews with preventive actions
  • Optimize system performance through capacity planning, tuning, and infrastructure analysis
  • Drive automation and self-healing solutions to eliminate repetitive operational tasks
  • Apply AI-driven approaches (AIOps) for anomaly detection, log analysis, and troubleshooting
  • Collaborate with development teams to improve system reliability and deployment safety
  • Ensure security, compliance, and operational best practices in production environments

Requirements:

We are looking for a strong technical profile with deep infrastructure understanding, solid automation skills, and a proactive mindset focused on reliability and scalability.

  • Experience as an SRE, DevOps, or Backend/Platform Engineer in production environments
  • Strong knowledge of Kubernetes, Docker, and cloud-native architectures
  • Solid experience with observability tools (Grafana, Prometheus, ELK, Datadog, or similar)
  • Strong understanding of Linux systems, networking, HTTP, DNS, and TLS/SSL
  • Proficiency in scripting/automation using Python, Shell, or similar languages
  • Experience with distributed systems, incident management, and troubleshooting
  • Familiarity with CI/CD pipelines, infrastructure automation, and Git workflows
  • Knowledge of reliability engineering concepts (SLI, SLO, error budgets) is highly valued
  • Experience with high-availability systems and production-scale environments
  • Strong analytical thinking, autonomy, and structured problem-solving skills
  • Clear communication skills and ability to collaborate across engineering teams
  • Familiarity with AIOps, OpenTelemetry, or chaos engineering is a plus

Benefits:

  • 100% remote work, with flexibility to work from anywhere in Brazil
  • Competitive compensation aligned with senior-level engineering roles
  • Health and dental care plans
  • Life insurance coverage
  • Meal and food allowances (depending on contract model)
  • Home office support and ergonomic assistance
  • Wellness and mental health support programs
  • Access to fitness and wellness platforms and partnerships
  • Learning and development programs to support career growth
  • Performance-based recognition and engagement initiatives
  • Collaborative and innovation-driven engineering culture

How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
 
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
 
 

Jobgether runs the largest remote job platform, effectively linking job seekers with over 200,000 flexible and remote opportunities that match their unique skills and preferences. Our focus is on enhancing the hiring process, ensuring efficiency while prioritizing the candidate experience, particularly in the growing health and wellness sector.

Founded: 2020
Employees: 11-50
Industry: Professional Services