Jobgether

Senior HPC Cluster Engineer

TLDR

Drive performance and reliability of large-scale GPU clusters and InfiniBand networks within a next-gen AI cloud infrastructure, tackling deep system-level HPC challenges.

Accountabilities:

Own the performance optimization and reliability of large-scale GPU clusters and InfiniBand networking environments supporting HPC workloads:

  • Tune and optimize GPU cluster performance and InfiniBand fabric efficiency to ensure high throughput and low-latency computing.
  • Diagnose, troubleshoot, and resolve complex system-level issues across GPU, network, and compute layers.
  • Integrate and validate new hardware components into existing HPC infrastructure, including support for GPUs and related accelerators.
  • Work across virtualization and orchestration layers (KVM/QEMU, Kubernetes) to ensure seamless hardware utilization and deployment.
  • Develop and improve automation for monitoring, fault detection, and proactive remediation in distributed compute environments.
  • Configure, manage, and maintain GPU devices, PCIe systems, and InfiniBand networks to ensure stability and scalability.

Requirements:

We are looking for a highly experienced systems engineer with strong expertise in HPC and low-level infrastructure:

  • 5+ years of experience in system-level software engineering with a focus on performance, scalability, or infrastructure optimization.
  • 3+ years of hands-on experience with Linux systems administration, debugging, and performance tuning.
  • Strong understanding of server and hardware architecture including PCIe, NICs, GPUs, and Linux kernel-level behavior.
  • Proficiency in C, C++, Go, or Python for systems or performance-oriented development.
  • Experience working with distributed or HPC environments and solving complex infrastructure challenges.
  • Strong analytical and problem-solving skills with the ability to work on deep technical issues independently.
  • Familiarity with GPU clusters, InfiniBand networking, and large-scale compute systems is highly desirable.
  • Experience with KVM/QEMU or containerized orchestration environments is a plus.
  • Exposure to distributed computing frameworks or libraries such as MPI or NCCL is advantageous.

Benefits:

  • Competitive compensation package.
  • Career development and continuous learning opportunities in advanced AI and HPC systems.
  • Flexible working arrangements and remote-friendly culture across Europe.
  • Opportunity to work on cutting-edge AI infrastructure and large-scale distributed systems.
  • Collaborative engineering environment with high technical ownership.
  • Exposure to international teams and world-class engineering challenges.

How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
 
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
 
 

Jobgether runs the largest remote job platform, effectively linking job seekers with over 200,000 flexible and remote opportunities that match their unique skills and preferences. Our focus is on enhancing the hiring process, ensuring efficiency while prioritizing the candidate experience, particularly in the growing health and wellness sector.

Founded: 2020
Employees: 11-50
Industry: Professional Services