Member of Technical Staff (AI Infrastructure Engineer)

Perplexity

London, UK

11 days ago

Full-time, AI

Skills & Technologies

Python, C++, C, Platform Engineering, YAML, APIs, DevOps, SRE, GitOps, Distributed Systems, SOLID, AWS, Kubernetes, Terraform, Ansible, TensorFlow, PyTorch, LLM, Deployment, Training

Job Description

We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

RESPONSIBILITIES

- Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads

- Manage and optimize Slurm-based HPC environments for distributed training of large language models

- Develop robust APIs and orchestration systems for both training pipelines and inference services

- Implement resource scheduling and job management systems across heterogeneous compute environments

- Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure

- Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm

- Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services

- Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

QUALIFICATIONS

- Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management

- Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization

- Experience deploying and managing distributed training systems at scale

- Deep understanding of container orchestration and distributed systems architecture

- High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query/Grouped-Query Attention, distributed training strategies)

- Experience managing GPU clusters and optimizing compute resource utilization

REQUIRED SKILLS

- Expert-level Kubernetes administration and YAML configuration management

- Proficiency with Slurm job scheduling, resource management, and cluster configuration

- Python and C++ programming with a focus on systems and infrastructure automation

- Hands-on experience with ML frameworks such as PyTorch in distributed training contexts

- Strong understanding of networking, storage, and compute resource management for ML workloads

- Experience developing APIs and managing distributed systems for both batch and real-time workloads

- Solid debugging and monitoring skills with expertise in observability tools for containerized environments

PREFERRED SKILLS

- Experience with Kubernetes operators and custom controllers for ML workloads

- Advanced Slurm administration, including multi-cluster federation and advanced scheduling policies

- Familiarity with GPU cluster management and CUDA optimization

- Experience with other ML frameworks like TensorFlow or distributed training libraries

- Background in HPC environments, parallel computing, and high-performance networking

- Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices

- Experience with container registries, image optimization, and multi-stage builds for ML workloads

REQUIRED EXPERIENCE

- Demonstrated experience managing large-scale Kubernetes deployments in production environments

- Proven track record with Slurm cluster administration and HPC workload management

- Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure

- Experience supporting both long-running training jobs and high-availability inference services

- Ideally, 3-5 years of relevant experience in ML systems deployment with a specific focus on cluster orchestration and resource management

Company & Role Analysis

Likely perks

Private Medical, Pension, 25+ Days Holiday, Stock Options, Learning Budget, Flexible Hours

Market salary range

£45,000 – £60,000 (Glassdoor, Levels.fyi, 2025)
