Senior HPC Cluster Engineer
TLDR
Drive performance and reliability of large-scale GPU clusters and InfiniBand networks within a next-gen AI cloud infrastructure, tackling deep system-level HPC challenges.
Own the performance optimization and reliability of large-scale GPU clusters and InfiniBand networking environments supporting HPC workloads:
- Tune and optimize GPU cluster performance and InfiniBand fabric efficiency to ensure high-throughput, low-latency computing.
- Diagnose, troubleshoot, and resolve complex system-level issues across GPU, network, and compute layers.
- Integrate and validate new hardware components into existing HPC infrastructure, including support for GPUs and related accelerators.
- Work across virtualization and orchestration layers (KVM/QEMU, Kubernetes) to ensure seamless hardware utilization and deployment.
- Develop and improve automation for monitoring, fault detection, and proactive remediation in distributed compute environments.
- Configure, manage, and maintain GPU devices, PCIe systems, and InfiniBand networks to ensure stability and scalability.
Requirements:
We are looking for a highly experienced systems engineer with strong expertise in HPC and low-level infrastructure:
- 5+ years of experience in system-level software engineering with a focus on performance, scalability, or infrastructure optimization.
- 3+ years of hands-on experience with Linux systems administration, debugging, and performance tuning.
- Strong understanding of server and hardware architecture including PCIe, NICs, GPUs, and Linux kernel-level behavior.
- Proficiency in C, C++, Go, or Python for systems or performance-oriented development.
- Experience working with distributed or HPC environments and solving complex infrastructure challenges.
- Strong analytical and problem-solving skills with the ability to work on deep technical issues independently.
- Familiarity with GPU clusters, InfiniBand networking, and large-scale compute systems is highly desirable.
- Experience with KVM/QEMU or containerized orchestration environments is a plus.
- Exposure to distributed computing frameworks or libraries such as MPI or NCCL is advantageous.
Benefits:
- Competitive compensation package.
- Career development and continuous learning opportunities in advanced AI and HPC systems.
- Flexible working arrangements and remote-friendly culture across Europe.
- Opportunity to work on cutting-edge AI infrastructure and large-scale distributed systems.
- Collaborative engineering environment with high technical ownership.
- Exposure to international teams and world-class engineering challenges.
Jobgether runs the largest remote job platform, effectively linking job seekers with over 200,000 flexible and remote opportunities that match their unique skills and preferences. Our focus is on enhancing the hiring process, ensuring efficiency while prioritizing the candidate experience, particularly in the growing health and wellness sector.
- Founded: 2020
- Employees: 11-50
- Industry: Professional Services