Okta is hiring a

Site Reliability Operator (Staff/Principal)

London, United Kingdom

Title of Job: Site Reliability Operator (Staff/Principal)
Role Located in: London, UK
Reports to: Manager, Site Operations


Okta is the foundation for secure connections between people and technology. By harnessing the power of the cloud, Okta allows people to access applications on any device at any time, while still enforcing strong security protections. It integrates directly with an organization’s existing directories and identity systems, as well as 4,000+ applications.

Because Okta runs on an integrated platform, organizations can implement the service quickly at large scale and low total cost.

Thousands of customers, including Adobe, Allergan, Chiquita, LinkedIn, and Western Union, trust Okta to help their organizations work faster, boost revenue, and stay secure.


Position Description:

Okta is seeking a dedicated individual to help us maintain 100% uptime in our cloud service. The Staff/Principal Site Reliability Operator acts as a lead to onsite Operators and is also responsible for day-to-day monitoring, maintenance, and troubleshooting of our customer-facing web application environments. This is a high-impact role in a fast-paced organization that is poised for massive growth and success. The ideal candidate:

  • Leads and helps build the team in our UK office
  • Has extensive Linux experience and can serve as an escalation point for junior team members
  • Has experience in off-premises cloud-based infrastructure, in particular Amazon Web Services
  • Has maintained complex custom applications on UNIX/Linux and/or Enterprise Java platforms
  • Has the ability to rapidly self-educate on new concepts and tools

  Job Duties and Responsibilities:

  • Participate in on-call rotation
  • Serve as primary escalation point from Customer Support team
  • Deploy weekly software releases and hotfixes
  • Ensure ongoing availability of all operational components of the Okta service
  • Work closely with core Engineering to ensure new features have the proper operational support and maintainability
  • Continuously refine monitoring processes, thresholds and configuration
  • Create and maintain documentation on installations, incidents, and procedures

  Minimum REQUIRED Knowledge, Skills, and Abilities:

  • Knowledge of Linux, TCP/IP, HTTP, security concepts and SQL databases
  • Excellent troubleshooting skills across multiple portions of complex multi-server environments
  • Ability to diagnose and correlate performance bottlenecks, network issues, and failure patterns
  • Proficiency in at least one scripting/interpreted language (Bash, Perl, Ruby, Python)
  • Experience as a first-responder for a high-uptime user-facing site
  • Willingness to learn from and teach others

  Bonus Knowledge, Skills, or Experience:

  • Orchestration and deployment automation systems
  • Open-source HTTP proxies (Nginx, HAProxy, Varnish)
  • Basic MySQL administration, master/slave setup, troubleshooting operational database issues
  • Custom RPM creation
  • Memcached and workalikes

Okta is an Equal Opportunity Employer.

