Faria Systems is hiring a

Site Reliability Engineer

San Francisco, United States
Remote

Site Reliability Engineers are hybrid systems and software engineers who take ownership of reliability, automation, and everything else involved in 'keeping the lights on' across Faria’s multi-product SaaS systems stack.

SREs are integrated within the Technical Operations team and work under the Head of Technical Operations, alongside the CTO and Principal Developers. We are looking for engineers who want to develop, maintain, and scale infrastructure software.

  • Reliably automate the server provisioning process to reduce manual work for our R&D team
  • Build scalable infrastructure to handle high-load, concurrent sessions, supporting ~50 million monthly page views and 500k+ active users
  • Drive the company through “Disaster Recovery Tests”, where we manually turn down pieces of infrastructure to test Faria’s overall resiliency to failures
  • Implement the systems and processes that Faria Developers use to deploy their software into production
  • Build an auto-remediation system to automatically resolve production incidents before escalating them to on-call Developers
  • Maintain proper remote presence & etiquette (acknowledge requests over Slack in a timely fashion, and never leave a request unacknowledged)
  • Tag the appropriate person and remind them every 24 hours until full resolution is achieved, so that nothing falls through the cracks
  • Adhere effectively to operating procedures (organise day-to-day work and large-scale tasks calmly, with priority-driven sequencing)