Senior Site Reliability Engineer - AWS
TLDR
Lead automation and reliability efforts across cloud platforms to boost scalability, performance, and incident response in a fast-growing tech environment.
- Serve as the reliability engineering lead within a cross-functional team, providing technical leadership, mentorship, and guidance on best practices.
- Design, implement, and maintain highly automated systems that support software development, deployment, testing, monitoring, and operational workflows.
- Act as the primary advocate for reliability, scalability, and operational excellence throughout the entire software development lifecycle.
- Develop and maintain monitoring, logging, dashboarding, and alerting solutions that provide visibility into application and infrastructure health.
- Continuously improve CI/CD pipelines, automation frameworks, deployment processes, and operational tooling to increase efficiency and reduce manual effort.
- Identify and remediate reliability, performance, availability, and security risks across cloud infrastructure and production systems.
- Create and maintain technical documentation, operational procedures, architecture standards, and engineering best practices.
- Research, evaluate, and implement tools and technologies that improve system resilience and engineering productivity.
- Collaborate with engineering teams to troubleshoot complex production issues and ensure rapid incident resolution.
- Participate in on-call rotations and provide support during critical production incidents and emergency response situations.
- Mentor junior engineers and contribute to the development of a strong reliability engineering culture.
Requirements:
- 8+ years of experience in software engineering, infrastructure engineering, cloud operations, or related technical disciplines.
- Minimum of 7 years of dedicated experience in Site Reliability Engineering (SRE) or closely related reliability-focused roles.
- Strong expertise in Python, Bash, PowerShell, and other scripting or automation technologies commonly used within SRE environments.
- Extensive experience designing, building, and maintaining automated systems that handle deployment, testing, monitoring, and operational processes.
- Advanced hands-on experience with AWS services, including EC2, EKS/Kubernetes, CloudWatch, Lambda, S3, IAM, and related cloud-native technologies.
- Proven ability to implement and optimize monitoring, alerting, incident management, capacity planning, and performance optimization strategies.
- Deep understanding of CI/CD pipelines, infrastructure automation, and modern DevOps and SRE practices.
- Experience building and maintaining highly available, scalable, and resilient production systems in fast-paced environments.
- Strong problem-solving skills with the ability to independently identify reliability challenges and drive long-term improvements.
- Demonstrated success reducing operational toil through automation and process optimization.
- Excellent communication, collaboration, and stakeholder management skills.
- Bachelor’s degree in Computer Science, Information Systems, or a related field, equivalent certifications, or comparable professional experience.
- AWS certifications or other cloud-related certifications are considered a plus.
- Self-driven mindset with a passion for continuous learning, innovation, and operational excellence.
Benefits:
- Competitive base salary ranging from $175,000 to $195,000, based on experience, skills, and location.
- Comprehensive medical, dental, and vision insurance coverage for eligible employees.
- Generous paid time off program.
- Paid maternity and paternity leave.
- Short-term and long-term disability coverage.
- Opportunity to work within a dynamic, fast-growing technology organization.
- Exposure to large-scale cloud infrastructure and cutting-edge engineering challenges.
- Access to experienced leadership and mentorship opportunities.
- Strong culture focused on innovation, collaboration, and professional growth.
- Competitive compensation package designed to support pay equity.
- Company-branded merchandise and employee recognition perks.
- Opportunity to make a significant impact on highly scalable, mission-critical systems.
- Remote-friendly position.
Jobgether runs the largest remote job platform, effectively linking job seekers with over 200,000 flexible and remote opportunities that match their unique skills and preferences. Our focus is on enhancing the hiring process, ensuring efficiency while prioritizing the candidate experience, particularly in the growing health and wellness sector.
- Founded: 2020
- Employees: 11-50
- Industry: Professional Services