Principal Software Quality Engineer - GPU & Machine Learning

Advanced Micro Devices, Inc.
$210,400.00/Yr.-$315,600.00/Yr.
United States, California, San Jose
2100 Logic Drive
Jun 17, 2026
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences - from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges - striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE:
We are seeking a Principal Software Quality Engineer to serve as the senior technical leader for ROCm software validation across compute workloads and server-class systems. In this individual-contributor leadership role, you will define how AMD proves ROCm is ready to ship - from unit and component testing, through full-stack workload validation, to multi-node system-level qualification on AMD Instinct GPU platforms.

THE PERSON:
You will set the technical direction for validation strategy, build and evolve the test infrastructure that gates every ROCm release, and personally drive the hardest debugging, characterization, and qualification problems. Your work directly determines the quality bar experienced by hyperscalers, OEMs, sovereign-AI customers, and the open-source community running ROCm in production.

KEY RESPONSIBILITIES:
Own the end-to-end validation architecture for ROCm - unit, integration, framework, workload, performance, stress, stability, scale-out, and system-level test layers - across multiple GPU generations and server platforms.
Define release-qualification gates and exit criteria for ROCm software releases (functional coverage, performance regressions, stability hours, scale targets, RAS criteria) and drive the org to meet them.
Lead system-level testing for server nodes - multi-GPU topologies, PCIe/InfinityFabric/xGMI, BMC/IPMI, thermal/power, firmware interactions, and multi-node fabric (Ethernet/InfiniBand/UALink) bring-up and validation.
Drive compute workload validation and characterization - LLM training and inference (PyTorch, vLLM, Triton, JAX), recommender systems, scientific HPC kernels, MLPerf-class benchmarks - establishing reproducible methodology, baselines, and regression tracking.
Architect the test infrastructure - distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines.
Champion modern, agile quality engineering - shift-left testing, test pyramids, contract testing between layers, hermetic test environments, deterministic reproducers, and continuous validation in trunk.
Set the bar for GitHub-based quality workflows - PR gating policy, required checks, code-coverage standards, bug-bash and triage cadences, and disciplined issue management across ROCm/ repositories and partner upstream projects.
Lead complex escalation debug - partner with development, hardware, firmware, and customer-facing teams to root-cause the hardest multi-day, multi-node, multi-component failures and convert findings into durable test coverage.
Influence the roadmap - work with product management, silicon, platform, and software architecture to ensure validation readiness for next-generation Instinct GPUs and server platforms before tape-in milestones and silicon arrival.
Mentor and elevate Senior and Staff validation engineers, SDETs, and SQA leads; raise the technical bar through design review, code review, and written guidance.
Represent ROCm validation externally - strategic customer engagements, OEM qualification programs, and open-source community quality initiatives.

PREFERRED EXPERIENCE:
Strong software engineering experience with a strong validation, SDET, or quality-engineering focus, including 5+ years in a senior IC role (Staff/Principal/PMTS or equivalent) leading validation of complex systems software.
Expert-level Python for test automation and infrastructure; strong C++ for debugging and extending production code paths under test.
Deep, demonstrable validation experience in at least two of the following domains:
  GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL)
  Deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM)
  HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric)
  Linux kernel, GPU drivers, or accelerator firmware
  Distributed systems and large-scale cluster software
System-level validation for server-class compute nodes - multi-GPU, multi-node, fabric-attached environments - including stress/stability, soak, fault-injection, and RAS testing.
Proven, hands-on experience working efficiently in an agentic AI engineering environment - daily, production use of LLM-based coding agents (e.g., Cursor, Claude Code, Copilot Workspace, Codex-class agents) and orchestration frameworks for real engineering work, with demonstrable productivity, quality, or coverage gains attributable to those workflows. Comfort designing prompts, tool/MCP integrations, evaluation harnesses, and guardrails for autonomous and semi-autonomous agents.
Hands-on experience defining and shipping release qualification programs for software consumed by hyperscalers, OEMs, or other Tier-1 customers.
Mastery of GitHub at scale for quality engineering - PR gating, GitHub Actions, self-hosted runners, required status checks, release tagging, and open-source contribution and triage norms.
Strong command of modern, agile software development practices - trunk-based development, CI/CD, shift-left testing, observability, feature flags, and incremental delivery - applied specifically to validation organizations.
Excellent written and verbal communication - able to author crisp test plans, qualification reports, RFCs, and post-mortems, and to influence development teams without authority.
Direct contributions to validation, CI, or test infrastructure for ROCm, PyTorch, LLVM, Triton, vLLM, or comparable upstream open-source projects.
Demonstrated leadership in agentic-AI adoption - built or rolled out agent-based workflows across an engineering team (e.g., autonomous test generation, AI-driven log/triage pipelines, multi-agent debug systems, MCP server design, retrieval-augmented engineering knowledge bases) with measurable outcomes.
Experience operating or validating large GPU clusters (256+ GPUs) - fabric bring-up, cluster health monitoring, and fleet-level diagnostics.
Familiarity with Training/Inference/HPC industry-standard benchmark methodologies and submissions.
Background in performance validation: roofline analysis, profiler tooling (rocprof, Omniperf, Nsight-class), regression detection.
Experience with fault injection, RAS, telemetry, and long-haul stability programs for accelerator platforms.
Familiarity with hardware lab automation: BMC/IPMI/Redfish, PDU control, serial-console capture, automated re-imaging, and topology-aware test scheduling.
Prior experience standing up validation for pre-silicon / emulation / first-silicon bring-up of accelerators.

ACADEMIC CREDENTIALS:
BS/MS/PhD in Computer Science, Computer Engineering, or related discipline (or equivalent demonstrated experience).

LOCATION: San Jose, California

#LI-DR1 #LI-HYBRID

Benefits offered are described: AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services.
AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process. AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's "Responsible AI Policy" is available here. This posting is for an existing vacancy.