Machine Learning Engineer Job at Evolve Group, San Jose, CA

  • Evolve Group
  • San Jose, CA

Job Description

Machine Learning Engineer

Tech start-up

San Francisco based

We’ve partnered with one of the most ambitious and technically rigorous AI research labs in the world. Based in San Francisco, this team is building foundation models entirely from scratch.

They are now hiring ML Infrastructure Engineers to design and scale the systems that power large-scale, distributed model training. If you’ve built infrastructure that runs across hundreds of GPUs, thrive under technical complexity, and want to work side-by-side with elite AI researchers — this is the role.

Key Responsibilities:

  • Build and scale distributed training systems for large-scale model training across LLMs, vision, and robotics.
  • Set up and run large-scale training across many GPUs using tools like Kubernetes, DeepSpeed, and FSDP.
  • Troubleshoot system issues (GPU errors, network problems) and build tools to monitor and recover from failures.
  • Optimize PyTorch pipelines, sharding, and sampling strategies.
  • Collaborate closely with researchers to support novel model training at scale.

Requirements:

  • 3–15 years in ML infrastructure, systems, or research engineering roles.
  • Proven experience scaling distributed training for large models.
  • Strong skills in PyTorch, CUDA, NCCL, and Kubernetes.
  • Familiar with setting up distributed training clusters.
  • Deep understanding of PyTorch dataloaders, data sharding, and sampling.
  • Strong communicator with a collaborative, mission-driven mindset.

This is a fully in-person role based in San Francisco. It's ideal for engineers excited to build at the edge of what's possible in AI.

Job Tags

Immediate start