AMD is seeking an AI Software System Engineer for HPC Infrastructure Engineering to design, build, and optimize next-generation high-performance computing and AI systems. The role focuses on GPU cluster management, AI workload automation, and distributed computing innovation, offering engineers a chance to advance performance, automation, and adaptive computing solutions.
Location: Hyderabad, India
Job Type: Full Time
Job ID: 69060
Design, build, and support GPU-intensive HPC cluster computing capabilities for AI workloads.
Maintain AI/ML services and applications running on distributed TensorFlow or PyTorch architectures, as well as inference systems built with large language models.
Implement cluster automation with tools such as Terraform and Ansible, and develop a Prometheus-based monitoring framework for our clusters.
Build collaborative working relationships with our partners in North America and Europe to understand and meet their specific AI infrastructure needs.
Apply design thinking and AI/ML to optimize internal processes and the delivery of services to our customers.
Minimum 5 years of experience in Python-based HPC infrastructure engineering and AI application development.
Strong experience with SLURM, Kubernetes, and GPU clusters.
Expertise with RoCEv2, KVM, Ubuntu, GPU drivers, and 400G network interconnects.
Working knowledge of automation and monitoring tools such as Terraform, SaltStack, and Prometheus.
Exceptional problem-solving, interpersonal, and communication skills.
Bachelor's or Master's degree in Computer Science, Artificial Intelligence, or a related field.
AMD is a global technology leader specializing in high-performance computing, graphics, and AI solutions. The company develops cutting-edge products for the data center, gaming, and professional markets through the design and development of advanced processors, GPUs, and adaptive processing solutions. With a strong commitment to innovation and efficiency, AMD continues to be a pioneer in AI HPC infrastructure and digital transformation.