United States | Engineering | Full-time
Open to hire within Washington, Oregon, California, Montana, Idaho, Nevada, Utah, Arizona, Colorado, Texas, Wisconsin, Indiana, Ohio, New York, Virginia, North Carolina
About the company:
At Luna, our mission is to make quality, affordable vision care accessible for all.
Our diverse team is working to advance the vision industry with an ever-growing collection of innovative solutions designed to address the toughest challenges facing eyewear businesses and doctors. We offer suites of products and services that improve and increase vision prescription access, streamline online shopping with virtual try-on and facial measurements, and support fulfillment and delivery. We bring vision to the people.
About the team and role:
Luna is seeking an experienced Site Reliability Engineering Manager to join our SRE team.
The Manager of the Site Reliability Engineering team is part of the Platform Engineering Group, which provides advanced support for solution deployments in production customer environments. The platform engineering group deals with complex infrastructure to improve performance, visibility, stability, availability, and reliability with a focus on using codified, scripted, or automated solutions.
In this position, you will contribute to the success of the department by leading a team of distributed engineers, including planning, monitoring, and growing the group. You will be responsible for overseeing multiple infrastructure projects, adjusting plans when required, recruiting as needed, and for working with others within the Engineering team as well as other departments in the company. The ideal candidate should have great communication skills, working knowledge of different engineering disciplines, an aptitude for managing risk, and exceptional planning skills.
What you’ll do:
- Build and mentor a talented and globally distributed SRE team, including team recruitment, new talent training, system operation/maintenance/coordination and team culture building.
- Develop process specifications and plans for compliant access, configuration, disaster recovery and fault handling of critical paths of overseas SRE services.
- Develop automation, data visualization and automated monitoring processes to facilitate the optimization of the Luna Solutions digital platform infrastructure
- Work closely with engineering teams to ensure that services are correctly designed for scale, have properly defined metrics and related SLOs, and follow best practices and guidelines for health, security, observability, and operability.
- Be an advocate for improving the stability and observability of our platform through automation and data analysis.
- Provide project management and road-mapping support to the SRE team, and conduct sprint ceremonies.
- Work closely with your team to understand the wide array of systems and their interdependencies.
- Drive the design and engineering of tools, as well as platform solutions, to optimize product engineering and operation efficiencies.
- Create and continuously refine processes, thresholds, and alert mechanisms for monitoring customer systems in production.
- Confirm the accuracy of the work performed and methods used by the team.
- Work with a platform built on a broad suite of technologies: Amazon Web Services, Docker, GitHub, Jenkins, Redshift, RDS, Kubernetes, Terraform, Gruntwork.
- Work with Product Managers and Architects to identify both external and internal customer needs, generate requirements, and plan implementation, ensuring appropriate staffing to deliver according to the agreed timeline.
- Lead incident troubleshooting and participate in the management on-call rotation.
What you’ll need:
- 6+ years of engineering management experience building and running high-performance infrastructure engineering or operations teams
- 5+ years of experience with, and up-to-date knowledge of, current and emerging SRE tooling and best practices, such as Terraform, Gruntwork, Terragrunt, AWS Cloud, ELK, ECS, Docker, GitHub, CloudFormation, Serverless, CI/CD (CircleCI, Jenkins, Ansible), Grafana
- 10+ years of relevant software engineering work experience with a focus on SaaS-based applications
- Working knowledge of scripting, programming, and configuration languages, including but not limited to HCL, YAML, Bash, and Python.
- Extensive experience in SRE and DevOps methodologies and best practices. You understand how to design, build, and operate a resilient large-scale system.
- Ability to work on multiple projects in various stages simultaneously
- Experience managing projects using Agile methodology and leading scrum ceremonies
- Leadership and people management experience.
- A focus on observability. Observability is key to operating a truly reliable and scalable system. We are looking for engineers who can "Monitor Everything & Measure Everything", driving a culture of observability, metrics, KPIs, SLOs.
- Experience with application monitoring and observability tools and best practices, with a focus on improving service reliability by optimizing operational procedures and feedback loops to the teams.
- Growth mindset. A willingness to use your skills and experience to mentor less-experienced engineers. A desire to learn from others and make yourself better every day.
- Excellent written and spoken communication skills, to collaborate efficiently with both engineers and non-technical peers
- Bachelor’s or higher degree, or the equivalent, in Computer Science or a related field. (Luna recognizes that knowledge and skills equivalent to those earned in a degree program can also be achieved via nontraditional paths and welcomes applicants with nontraditional training.)
Nice to have:
- Experience at medium to large SaaS companies
- Experience with Atlassian tools and basic sysadmin knowledge of these tools
We believe in our people. If you feel you are the right person for the position, we want you to apply, even if you do not meet all of the requirements listed above. If you are not sure, don’t hesitate to send us your information. We would love to hear from you!