Drive.ai is shaping the self-driving car revolution. Our goal is to improve people's lives by transforming mobility in a cost-effective way that can impact everyone, not just those buying automobiles at the very high end of the market. We currently have autonomous vehicles on city streets.
As a DevOps and Reliability Engineer, you will be part of the team working towards our vision of autonomous vehicles. Your role will be to help design and implement our build and release infrastructure for large-scale deep learning systems. You will be responsible for designing scalable solutions and tooling that can grow as the company expands.
Scaling infrastructure: Recommend and implement solutions for scaling computational, storage, and networking units. Identify and mitigate bottlenecks and points of failure.
Coordinate with engineering and IT to develop policies for cluster utilization
Develop and utilize tools to monitor and diagnose cluster performance
Actively probe and monitor for cluster security
Highly proficient in infrastructure design
Highly proficient in Linux resource management and administration tooling
Proficient in Python or another scripting language
Comfortable in C/C++ and SQL
Familiarity with Git and repository management
Experience working with GPU servers in a high-performance computing environment