Optimizely is hiring a

Site Reliability Engineer - Distributed Systems

San Francisco, United States


The Challenge

At Optimizely, the Distributed Systems platform is the foundation of all our Experimentation and Personalization products, which directly impact our customers’ experience. The platform is used across multiple product lines, from Counting and Analytics to Targeting and Recommendations. Based primarily on the Hadoop ecosystem on AWS, it processes many billions of events a day, and we’ve built a sophisticated data pipeline that powers a variety of queries for personalized experiences at web scale.

We are looking for a Site Reliability Engineer who will work within the Distributed Systems Engineering team to drive operational excellence in our Big Data infrastructure. This is a unique opportunity to own, operate and scale mission-critical production services on a cutting-edge technology stack.

Why is this exciting for you?

This is a unique role that draws on knowledge and expertise across multiple layers of the stack! You’ll bring an engineering mindset to the non-traditional challenges of scaling and operating business-critical services built on cutting-edge technologies like Apache Samza and HBase.

  • You are a hybrid Systems and Software Engineer who loves to build systems that automate repetitive tasks and workflows.
  • You have a strong passion for solving operational problems in complex systems through the application of solid engineering practices.
  • You believe that automation is a key component in keeping a large-scale system humming.

Your background

  • You have strong DevOps experience with Unix/Linux systems, including solid troubleshooting and problem-solving skills.
  • You know your way around Unix/Linux command line tools.
  • You have strong interpersonal communication skills and the ability to work well in a diverse, team-focused environment with other Engineers, Product Managers, etc.
  • You have production experience with data-centric applications in a web-scale environment.
  • You’ve worked on systems built on Amazon Web Services (AWS).
  • Nice to have: You have hands-on experience with Open Data platforms like Apache Hadoop, HBase or Spark.
  • Nice to have: You have experience with Java Virtual Machine (JVM) environments.

What you will be doing

Work closely with the Distributed Systems and DevOps engineering teams to:

  • Create a Reliability Engineering roadmap for ensuring that our complex, large-scale systems and services are healthy, monitored, automated and designed to scale.
  • Influence Data Services/features early on so they are designed with scale, operability and performance in mind.
  • Deliver launch plans for major features and build the necessary infrastructure (staging environments, monitoring, alerting, etc.) and operational run-books that will support the launch.
  • Drive the team through “Disaster Recovery Tests” in which we manually turn down pieces of infrastructure and services to test Optimizely’s overall resiliency to failures.
  • Design new tools and smart alerting that can help discover failures/issues in a timely fashion with the goal of automating response to non-exceptional service conditions.
  • Engage in service capacity planning and demand forecasting.


  • Commuter and transportation benefits.
  • Catered in-office lunch and dinner on weekdays.
  • Full medical insurance with very low co-pay and deductible. HMO, PPO, and HSA options available.
  • Full dental coverage including orthodontics.
  • Full vision coverage including contacts.
  • Dependents 100% covered for medical, dental, and vision.
  • Wellness Grant.
  • Unlimited vacation policy and seventeen weeks of paid parental leave.
  • 401k benefit.
  • Working with a great team and having a huge impact!