WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences - from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges - striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
We are seeking a Principal Member of Technical Staff (PMTS) to serve as the senior technical leader for ROCm software validation across compute workloads and server-class systems. In this individual-contributor leadership role, you will define how AMD proves ROCm is ready to ship - from unit and component testing, through full-stack workload validation, to multi-node system-level qualification on AMD Instinct GPU platforms. You will set the technical direction for validation strategy, build and evolve the test infrastructure that gates every ROCm release, and personally drive the hardest debugging, characterization, and qualification problems. Your work directly determines the quality bar experienced by hyperscalers, OEMs, sovereign-AI customers, and the open-source community running ROCm in production.
Own the end-to-end validation architecture for ROCm - unit, integration, framework, workload, performance, stress, stability, scale-out, and system-level test layers - across multiple GPU generations and server platforms.
Define release-qualification gates and exit criteria for ROCm software releases (functional coverage, performance regressions, stability hours, scale targets, RAS criteria) and drive the org to meet them.
Lead system-level testing for server nodes - multi-GPU topologies, PCIe/Infinity Fabric/xGMI, BMC/IPMI, thermal/power, firmware interactions, and multi-node fabric (Ethernet/InfiniBand/UALink) bring-up and validation.
Drive compute workload validation and characterization - LLM training and inference (PyTorch, vLLM, Triton, JAX), recommender systems, scientific HPC kernels, MLPerf-class benchmarks - establishing reproducible methodology, baselines, and regression tracking.
Architect the test infrastructure - distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines.
Champion modern, agile quality engineering - shift-left testing, test pyramids, contract testing between layers, hermetic test environments, deterministic reproducers, and continuous validation in trunk.
Set the bar for GitHub-based quality workflows - PR gating policy, required checks, code-coverage standards, bug-bash and triage cadences, and disciplined issue management across ROCm/* repositories and partner upstream projects.
Lead complex escalation debug - partner with development, hardware, firmware, and customer-facing teams to root-cause the hardest multi-day, multi-node, multi-component failures and convert findings into durable test coverage.
Influence the roadmap - work with product management, silicon, platform, and software architecture to ensure validation readiness for next-generation Instinct GPUs and server platforms before tape-in milestones and silicon arrival.
Mentor and elevate Senior and Staff validation engineers, SDETs, and SQA leads; raise the technical bar through design review, code review, and written guidance.
12+ years of professional software engineering experience with a strong validation, SDET, or quality-engineering focus, including 5+ years in a senior IC role (Staff/Principal/PMTS or equivalent) leading validation of complex systems software.
Deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM)
System-level validation for server-class compute nodes - multi-GPU, multi-node, fabric-attached environments - including stress/stability, soak, fault-injection, and RAS testing.
Proven, hands-on experience working efficiently in an agentic AI engineering environment - daily, production use of LLM-based coding agents (e.g., Cursor, Claude Code, Copilot Workspace, Codex-class agents) and orchestration frameworks for real engineering work, with demonstrable productivity, quality, or coverage gains attributable to those workflows. Comfort designing prompts, tool/MCP integrations, evaluation harnesses, and guardrails for autonomous and semi-autonomous agents.
Mastery of GitHub at scale for quality engineering - PR gating, GitHub Actions, self-hosted runners, required status checks, release tagging, and open-source contribution and triage norms.
Strong command of modern, agile software development practices - trunk-based development, CI/CD, shift-left testing, observability, feature flags, and incremental delivery - applied specifically to validation organizations.
Excellent written and verbal communication - able to author crisp test plans, qualification reports, RFCs, and post-mortems, and to influence development teams without authority.
Direct contributions to validation, CI, or test infrastructure for ROCm, PyTorch, LLVM, Triton, vLLM, or comparable upstream open-source projects.
Demonstrated leadership in agentic-AI adoption - built or rolled out agent-based workflows across an engineering team (e.g., autonomous test generation, AI-driven log/triage pipelines, multi-agent debug systems, MCP server design, retrieval-augmented engineering knowledge bases) with measurable outcomes.
Background in performance validation: roofline analysis, profiler tooling (rocprof, Omniperf, Nsight-class), regression detection.
Experience with fault injection, RAS, telemetry, and long-haul stability programs for accelerator platforms.
Familiarity with hardware lab automation: BMC/IPMI/Redfish, PDU control, serial-console capture, automated re-imaging, and topology-aware test scheduling.
ROCm powers AI and HPC workloads on AMD Instinct GPUs at the largest scale in the industry. The quality of every ROCm release is felt across millions of GPUs in production - and the validation organization is what stands between "code complete" and "customer ready." As Principal MTS for ROCm Validation, you will define that bar, build the systems that enforce it, and personally lead the toughest qualification problems on AMD's most strategic platforms. #LI-TC1 #Hybrid
AMD is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
Benefits offered are described: AMD benefits at a glance. AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process. AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's "Responsible AI Policy" is available here. This posting is for an existing vacancy.