Clarifai is hiring a Senior Site Reliability Engineer

San Francisco, United States
Clarifai is an artificial intelligence company that excels at visual recognition. We do not sell an abstract, futuristic technology - we sell a solution that people can use today to solve real-world problems. We believe that the same AI technology that gives big tech companies a competitive edge should be available to developers or businesses of any size or budget. That’s why we build products to make it easy, quick, and inexpensive for developers and businesses to innovate with AI, go to market faster, and build better user experiences. We make “teaching” AI just as accessible as we make using AI, which is why our technology is the most personalizable, unbiased, accurate solution in the market.

We have secured a $30M Series B round of funding and are backed by Menlo Ventures, Google Ventures, USV, NVIDIA, Qualcomm, Osage, Lux Capital, LDV Capital, and Corazon Capital. To continue to succeed, we need people like you to join the team here in NYC!

Clarifai is proud to be an equal opportunity workplace dedicated to pursuing and hiring a diverse workforce.


Your mission, should you choose to accept it, is to keep the engine humming on our production systems and to ensure great performance.
  • You are at home in Linux production environments and have previous exposure to traffic and network management. We value experience with CoreOS or similar.
  • You monitor and scale production systems. You are not afraid of inspecting systems' internals when needed. We are even more thrilled if you're able to reverse engineer service behavior and match it against what the source code says. Our programming languages of choice are Python and Go, with a bit of C++.
  • You have worked with cloud computing infrastructure (AWS) and DevOps tools (Ansible and Docker). We appreciate even more any experience with orchestration solutions (Kubernetes, Mesos or similar), GPUs, and machine learning, but don't worry if you have none: you can learn here with us!


In your first month, you will start off by learning the ropes. You will:
  • Get familiar with our code base (plus the backend & infrastructure teams). We really want you to take this time to get comfortable working with what we've built and who has helped build it so far. We always appreciate the feedback and new perspective from a fresh pair of eyes.
  • Get acquainted with our testing and production environments, Jenkins and Kubernetes.
  • Learn about the distinctive challenges of machine learning systems using GPUs.
  • Identify and resolve production bugs.
3 months later, you will start putting yourself out there. You will:
  • Increase the coverage and reliability of our automation.
  • Work on load testing all components of our pipelines and plan capacity for our evolving needs.
  • Participate in design reviews and identify potential conflicts or issues, as well as suggest improvements.
6 months down the road, you will be on your way to making sure that when people think of Clarifai, they think of an excellent and reliable service. You will:
  • Integrate additional tracing, monitoring, and alerting to increase visibility into our services, improve uptime and meet our SLOs.
  • Be a good citizen in the open source community. You participate in discussions and submit bug reports. We value and encourage code contributions to open source projects and technologies that we use.
In 12 months, you'll have automated a lot of your initial tasks away, but fret not. By then, you will frequently identify new and more substantial improvements to our services. You'll hire another SRE to help with your new goals and projects. When a conference comes around that you'd like to attend or speak at, you'll be able to go and share what you've accomplished so far and what is still left to do.


As Clarifai's first Senior Site Reliability Engineer, you strive for 100% uptime and excellent performance!
