The big picture:
Netflix has more than 125 million subscribers worldwide. To support such a large subscriber base, we run a large, distributed, and ever-changing system. The Resilience Engineering team’s goal is to make this complex system as resilient as we possibly can, so that our customers can have a great experience. The opportunity to impact Netflix and its 125 million customers is huge! If you like scale and global impact, this is an amazing place to be.
How do we make our system more resilient? We find vulnerabilities and risks in our system before they lead to customer-facing outages. To find vulnerabilities, we build Chaos tools that allow us to inject events that we expect the system to handle, and check that the service stays healthy. We also focus on a platform for load testing services, which allows us to better understand the limits of our production systems. Finally, we track patterns of risks and vulnerabilities, which inform us of our biggest availability challenges and help us come up with risk mitigation strategies. You can read more about our work here.
We are looking for an engineering manager to lead the team and help us realize the big opportunities we have. You understand how distributed systems work and what makes them fail, especially at scale. You are passionate about finding faults and vulnerabilities and about improving Netflix’s resiliency. Having a strong software engineering background is a plus and will help set you up for success.
Our team works with many partner teams at Netflix. For instance, we work with service teams to fix vulnerabilities and with platform teams to build better-integrated Chaos tooling. So you’ll need to understand the importance of collaboration.
Finally, you want to have a big impact and aren’t afraid to take on big challenges. Our subscriber base is growing fast, and so is our opportunity!
• Team building & ensuring excellent talent density: recruiting, evaluating, coaching, celebrating
• Partnership: collaborating with partner teams to prioritize and help in addressing vulnerabilities
• Strategy: working with the team and drawing on context from the broader organization, determining and championing longer-term team direction and technology strategy
• Execution: establishing a plan and delivering on it, iterating and improving the plan along the way
• Experience with cloud-based architectures working at significant scale.
• Experience with, or understanding of, what makes complex distributed systems work well and fail.
Netflix offers a unique culture that values freedom and responsibility. You can learn more via the Netflix culture memo and jobs site.