StatusPage is looking for a stellar individual to join our Denver-based development team as the new Technical Operations Lead.
Our humble team of 11 got started in 2013 with a simple goal: help make the internet break less. Step 0 is where we are right now: make it dead simple for companies to talk to each other during unexpected downtime, performance issues, or scheduled maintenance. We ship a lot of code toward this goal, and we're excited about continuing to grow the team this year.
This is mostly a technical role focused on the infrastructure we maintain, as well as the external vendors we rely on for various functions. We're a growing team, so we expect this role to be a fluid mix of tactical and strategic work. As the team grows, you will work with more DevOps engineers and possibly some dedicated Ops folks as we build a knowledgeable team of infrastructure folks.
Your P&L will consist of our infrastructure costs, our uptime and our infrastructure's ability to weather adversity, and team morale around being on call and running production infrastructure. Most of the work will be done with our development team, with coordination required with your peer, the Director of Engineering, and with the CEO on product prioritization and business risk assessment.
Our current stack consists of:
- Ubuntu-based AWS EC2 nodes behind ELB. We manage our own Postgres and Redis, and run Memcached through ElastiCache
- Ruby and Rails at the application layer
- Chef for configuration management
- Fastly for CDN, Amazon for CloudFront/Route 53/S3, Librato for metrics, Papertrail for logging, New Relic/Scout/Sysdig for node-level monitoring and APM, ThreatStack and Yubico for security
Future plans for our stack
- Multiple data centers and multiple providers
- DevOps tooling with Go
- Deployments and scaling with Docker
- Primary data store moved to NoSQL, with InfluxDB as a persistent metrics storage layer
- SOA to allow status data to power multiple applications
Qualifications and things that will make you great for this role
- 5+ years of experience architecting, configuring, and deploying production-grade applications to cloud environments
- Experience being on call and managing a team of on-call responders
- Experience setting up policies and procedures around security and data handling (both internal and external), and managing security audits and reviews from potential customers
- Capacity planning and SLA writing/verification for key customers
- Love for team building, mentoring, and growing an organization with a healthy culture and a curiosity for life and technology
Some of the projects you'll be working on to start
- Getting our build and deployment system set up around containers
- Evaluating new data stores for metric data that we store on behalf of our customers (Influx, Cassandra)
- Increasing our ability to handle crushing traffic spikes by moving more of our traffic onto our CDN partner
- Instrumenting processes and automation around failure detection, promotion of backup resources, and coordination of those changes among servers
- Working with multiple vendors to make sure our email/SMS throughput lives up to our SLA requirements
- Authoring and maintaining data and security policies that our team can get behind, and communicating those and other security matters to potential customers and partners