AMD is hiring a Senior AI/ML and GPU Performance QA Engineer to lead validation and performance testing for machine learning frameworks, high-performance computing (HPC), and GPU software systems. The role involves building automated test systems, running GPU performance tests, validating AI models, and testing multi-node distributed systems. Applicants need more than eight years of experience, along with advanced GPU knowledge and CI/CD expertise.
Location: Hyderabad, India
Job Type: Full Time
Job ID: 74600
Role: Senior AI/ML and GPU Performance QA Engineer
Lead validation for ML/AI models: accuracy testing, performance benchmarking, regression, drift detection, and A/B testing.
Test and validate ML frameworks including PyTorch, Hugging Face, and MLflow.
Perform GPU testing & profiling on ROCm/CUDA platforms: memory, thermal, multi-GPU scaling.
Validate HPC frameworks, distributed runtimes, compilers, and GPU libraries.
Build scalable CI/CD workflows and automated test pipelines using Docker, Kubernetes, GitHub Actions, and Jenkins.
Validate cloud-based AI workloads on AWS SageMaker, Lambda, and S3.
Conduct multi-node distributed training and inference validation using orchestration tools (Slurm, MPI, Kubernetes).
Develop and maintain benchmarking suites for AI models and HPC workloads.
Collaborate with hardware and software teams to ensure GPU platform readiness.
Analyze performance metrics using profiling tools (Nsight, rocprof, perf) and optimize workloads.
Mentor junior engineers and contribute to validation strategy, tooling, and best practices.
Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field.
More than 8 years of experience in validation engineering, ML infrastructure, and HPC performance testing.
Hands-on experience with NVIDIA CUDA, AMD ROCm, and the broader GPU software ecosystem.
Advanced knowledge of AI model architectures, training and inference workflows, and ML performance bottlenecks.
Expertise in CI/CD systems, Git, Docker, and automated testing frameworks.
Competence in managing multi-node environments and validating distributed systems.
Familiarity with HPC benchmarks (HPL, HPCG, MLPerf) and AI benchmarking methods.
Proficiency in Python, Bash, and YAML scripting in Linux environments.
Excellent communication, documentation, and cross-functional collaboration skills.
AMD (Advanced Micro Devices) is a leading global provider of advanced computing and graphics technologies that drive innovation across artificial intelligence, data centers, personal computing, gaming, and embedded systems. Its workplace culture emphasizes collaboration, humility, and innovation, and supports employees in their professional development.