Member of Technical Staff - Systems
About ai&
ai& is a new global AI technology company dedicated to meeting the world's growing demand for AI. Our vision is twofold: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider. We are building a unified, optimized global platform that integrates next-generation data centers and infrastructure, heterogeneous compute serving, and advanced model services. We believe that the most effective way to build and scale AI is to own the stack from top to bottom.
At ai&, we empower small teams with the autonomy needed to tackle significant challenges. Our approach is to deconstruct large problems into manageable components and solve complex issues collaboratively. We seek highly motivated, mission-driven individuals who demonstrate strong personal agency. We value curiosity as the foundation of talent, and we are looking for people eager to develop alongside our evolving technology and expanding business.
We are actively hiring worldwide, with a presence in Tokyo, San Francisco, Austin, and Toronto. We are more than happy to meet exceptional talent where they are.
Role overview
As a Systems Engineer at ai&, you are responsible for the physical and software foundation that everything else runs on. You will plan, configure, and manage the bare-metal infrastructure that powers our data centers — from OS tuning and driver management to rack-scale GPU system provisioning. You are the person who makes sure the hardware is running at its full potential before the software teams ever touch it.
This is a hands-on role. You will work on some of the most advanced compute hardware available, including NVL72 and AMD Helios rack-scale systems, and you will be responsible for keeping them running at maximum efficiency. You think carefully about system configuration, firmware, and the low-level software decisions that compound into real performance differences at scale.
Responsibilities
Bare-Metal Infrastructure Management: Configure and manage bare-metal servers end to end. Own OS tuning, driver management, firmware upgrades, and CUDA configuration across the fleet.
Rack-Scale GPU System Operations: Lead the installation, provisioning, and continuous operation of high-density, liquid-cooled rack-scale GPU systems including NVL72 and AMD Helios deployments.
System Architecture & Planning: Plan and architect the next generation of system configurations including compute, storage, networking interconnects, routers, and switches. Make decisions that scale.
Performance Optimization: Tune system-level configurations to maximize hardware utilization and minimize overhead. Work closely with the kernel and inference teams to ensure software and hardware are fully aligned.
Cross-Team Collaboration: Work closely with the network, storage, and data center teams to ensure the physical infrastructure operates as a unified, high-performance system.
You may be a fit if you have the following skills
Bare-Metal Operations Experience: Deep hands-on experience managing large-scale bare-metal server environments. You have configured OS, drivers, firmware, and CUDA at scale and you know the failure modes.
GPU System Expertise: Experience provisioning and operating high-density GPU systems. Familiarity with NVIDIA NVLink, NVSwitch, and AMD MI-series architectures is a strong signal.
Low-Level Systems Knowledge: Strong understanding of Linux internals, kernel parameters, NUMA topology, PCIe configurations, and how these interact with AI workloads.
Infrastructure Judgment: You make system configuration decisions that hold up at scale. You think about maintainability, reproducibility, and failure recovery from the start.
Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.