Certified NVIDIA AI Infrastructure & Kubernetes Platform Engineer, New York City

IT Search Corp

Full Time • New York City
 
NVIDIA AI Infrastructure & Kubernetes Platform Engineer (DGX Systems) Remote 
Related Certifications required
6 months to 1+ years
$open
U.S. citizenship (USC) or green card (GC) required

 Alternate titles depending on context:
  • AI Platform Architect – DGX & SuperPOD
  • AI Infrastructure DevOps Engineer – NVIDIA DGX Stack
  • Senior AI Systems Engineer – DGX | Kubernetes | InfiniBand 
Job Description:
 

We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record in deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads using Kubernetes, and ensuring secure, high-throughput operations across InfiniBand-powered networks. The ideal candidate will hold a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), coupled with hands-on training in DGX, BlueField, and high-speed network operations.
 
 
This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.
 
 
Core Responsibilities:
 

 AI Infrastructure Operations
  • Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
  • Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
  • Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
  • Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.
 Kubernetes Platform Engineering
  • Architect secure and scalable Kubernetes clusters optimized for GPU-accelerated workloads using NVIDIA GPU Operator.
  • Leverage expertise from CKA/CKAD/CKS to develop, deploy, and secure AI applications on Kubernetes.
  • Implement CI/CD pipelines and GitOps methodologies for deploying and managing ML workflows.
 High-Performance Networking & DPUs
  • Administer InfiniBand networks and BlueField DPUs using Unified Fabric Manager (UFM).
  • Optimize NVLink/NVSwitch performance across GPU nodes and tune fabric configurations for minimal latency and maximum throughput.
  • Use BlueField DPUs to offload storage, firewalling, and telemetry, improving AI workload security and performance.
 Security & Compliance
  • Apply best practices from the CKS certification to secure containerized AI environments.
  • Configure runtime security, secrets management, network segmentation, and auditing using DPU-enhanced Kubernetes deployments.
  • Support zero-trust architecture initiatives by enforcing workload identity, RBAC policies, and supply chain integrity across AI container images and model artifacts.
 Monitoring, Telemetry & Optimization
  • Monitor GPU, CPU, and I/O performance using NVIDIA DCGM, Prometheus, Grafana, and Base Command APIs.
  • Tune system performance and model training pipelines for cost-efficiency and throughput.
  • Build and maintain operational runbooks, incident response playbooks, and SLA reporting dashboards covering GPU utilization, thermal thresholds, and fabric health.
Qualifications:
 

 Certifications a plus:
  • Certified Kubernetes Administrator (CKA)
  • Certified Kubernetes Application Developer (CKAD)
  • Certified Kubernetes Security Specialist (CKS)
  • NVIDIA Certified Associate: AI Infrastructure & Operations (NCA-AIIO)
  • NVIDIA Certified Professional: AI Infrastructure (NCP-AII)
  • NVIDIA Certified Professional: AI Operations (NCP-AIO)
  • NVIDIA Certified Professional: AI Networking (NCP-AIN)
 Expertise With:
  • DGX System, BasePOD, and SuperPOD Administration
  • BlueField DPU Configuration & Operations
  • InfiniBand Fabric and UFM Management
  • Base Command Manager for workload orchestration
 Technical Skills:
  • Kubernetes, Helm, GPU Operator, Kubeflow
  • DevOps tools: Ansible, Terraform, GitOps, CI/CD pipelines
  • Storage: NFS, BeeGFS, Lustre
  • Networking: RoCE, InfiniBand, DPU offload, gRPC, RDMA
  • Programming/scripting: Python, YAML, Bash
Compensation: $100.00 - $130.00 per hour