Engineering

AI Engineer: ML Infrastructure

Remote
Full-Time

Company Description

Zyphe provides a privacy-first identity verification solution that prioritizes user control over personal data while ensuring businesses are protected from fraud and data breaches. Powered by a decentralized platform, Zyphe enables seamless identity verification and retention without storing Personally Identifiable Information (PII) on company servers. With advanced KYC, AML, and KYB modules built on Web3 principles, Zyphe helps organizations meet modern privacy and security requirements. The platform also offers users secure identity vaults and effortless one-click verification for smooth onboarding experiences.

Role Overview

We're looking for an AI Engineer specializing in ML Infrastructure to build and scale the platform that powers all of our machine learning systems. This is not a modeling role. You will own the entire ML platform, from training orchestration to serving infrastructure, ensuring our AI capabilities are reliable, fast, and cost-efficient in production. You'll work at the intersection of distributed systems, MLOps, and cloud-native infrastructure, building the foundation that every AI team at Zyphe depends on.

What You'll Do

Design and maintain scalable ML training pipelines with experiment tracking and reproducibility
Build and optimize model serving infrastructure for low-latency, high-availability inference
Develop feature stores and data pipelines that feed training and real-time prediction
Implement CI/CD for ML: automated testing, validation, and deployment of model artifacts
Build monitoring and alerting systems for model performance, data drift, and system health
Optimize compute costs across training and inference workloads (GPU scheduling, spot instances)
Manage Kubernetes-based ML workloads and container orchestration
Collaborate with ML engineers to translate research prototypes into production-grade systems

What We're Looking For

Strong experience building ML infrastructure and platform tooling in production
Deep knowledge of Kubernetes, Docker, and cloud-native orchestration (AWS/GCP)
Hands-on expertise with ML workflow tools (Ray, Kubeflow, MLflow, or similar)
Experience designing model serving systems (Triton, TorchServe, custom gRPC services)
Solid understanding of distributed training and GPU resource management
Strong software engineering fundamentals (Python, Go, or Rust; CI/CD; infrastructure-as-code)
Familiarity with feature stores, data versioning, and experiment tracking
Experience with cost optimization for GPU workloads is a plus

What Makes You a Great Fit

You think in reliability and throughput, not just accuracy
You're obsessed with developer experience, making ML engineers productive and autonomous
You combine systems engineering depth with ML domain understanding
You don't just keep the lights on; you build platforms that accelerate the entire team