
Principal Machine Learning Ops Engineer

Western Governors University
life insurance, flexible benefit account, parental leave, paid time off, paid holidays, sick time
United States, North Carolina, Durham
Aug 07, 2025

If you're passionate about building a better future for individuals, communities, and our country, and you're committed to working hard to play your part in building that future, consider WGU as the next step in your career.

Driven by a mission to expand access to higher education through online, competency-based degree programs, WGU is also committed to being a great place to work for a diverse workforce of student-focused professionals. The university has pioneered a new way to learn in the 21st century, one that has received praise from academic, industry, government, and media leaders. Whatever your role, working for WGU gives you a part to play in helping students graduate, creating a better tomorrow for themselves and their families.

The salary range for this position takes into account the wide range of factors that are considered in making compensation decisions including but not limited to skill sets; experience and training; licensure and certifications; and other business and organizational needs.

At WGU, it is not typical for an individual to be hired at or near the top of the range for their position, and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current range is:

Grade: Technical 413
Pay Range: $197,000.00 - $305,300.00

Job Description

Job Summary

The Principal Machine Learning Operations Engineer (Principal MLOps Engineer) is an action-oriented position that designs and builds automated processes. These processes focus on machine learning (ML) service and infrastructure stability, which enables our digital Educational Technology (Ed Tech) transformation product to use advanced NLP, knowledge engineering, and ML to accelerate innovation in scientific operations. The ideal candidate will have cloud infrastructure, infrastructure-as-code (IaC), and monitoring/instrumentation skills. A proven track record of collaboration, iteratively implementing data-intensive solutions, and strong project leadership is also required to be successful in this role. The Principal MLOps Engineer will educate stakeholders, mentor team members and, with a strong vision for how the ML/SE discipline can proactively create positive impacts, have a significant stake in defining the future of the Ed Tech function for WGU.

Job Responsibilities

Platform Ownership & Development

  • Architect, build, and maintain ML infrastructure leveraging Databricks and AWS.

  • Lead the development of reusable MLOps tooling, SDKs, CI/CD templates, and pipelines.

  • Implement and extend Databricks Asset Bundles, Databricks APIs, and Agent Frameworks for model and GenAI workloads.

CI/CD & Automation

  • Design and implement automated CI/CD pipelines using tools like GitHub Actions, Infrastructure-as-Code (Terraform, AWS CDK), and templating frameworks (Copier, Cookiecutter).

  • Automate testing, deployment, and rollback workflows to streamline the ML lifecycle from experimentation to production.

Model Lifecycle Management

  • Manage the ML lifecycle using MLflow for both classic ML (experiments, registry) and GenAI use cases (traces, evaluations).

  • Collaborate with data scientists to deploy, monitor, and iterate on ML models in real-time and batch inference environments.

  • Develop scalable and low-latency online inference systems, exposed via APIs or service endpoints.

Monitoring, Observability & Performance

  • Implement monitoring solutions for drift detection, data quality, throughput, latency, and model performance using tools like Evidently and custom dashboards.

  • Proactively manage model and system reliability, scalability, and resiliency in production environments.

GenAI & Agentic Workflows

  • Design and deploy agentic workflows using frameworks like LangChain, and integrate with Databricks Agent Frameworks.

  • Collaborate with ML/GenAI teams to evaluate and productionize LLM-based applications and tools.

Minimum Qualifications
  • 7+ years of software engineering or DevOps experience, with at least 4 years focused on MLOps or ML infrastructure.

  • Deep experience with Python, SQL, and Databricks platform development.

  • Strong proficiency in deploying and managing ML systems on AWS, including EC2, S3, EKS, SageMaker, or Lambda.

  • Hands-on experience with MLflow, GitHub Actions, Databricks SDKs, and infrastructure as code tools (Terraform, CDK, etc.).

  • Experience building and maintaining CI/CD pipelines, unit and integration tests, and using YAML for workflow orchestration.

  • Expertise in building low-latency real-time inference systems, REST/gRPC APIs for model serving, and related architecture patterns.

  • Experience working with agentic frameworks (LangChain, etc.) for GenAI applications.

  • Demonstrated ability to monitor production systems for performance, reliability, and model quality.

  • Proven track record of writing production-grade code and reusable abstractions for ML/AI systems.

  • Strong communication and cross-functional collaboration skills.

Nice to Haves
  • PhD in Computer Science, Software Engineering, Data Science, Machine Learning, Math, Physics, or a related field

  • Experience working with knowledge graph stores (Stardog, TigerGraph, Ontotext GraphDB, Neo4j) and surrounding semantic technology (OWL, RDF, SWRL, SPARQL, JSON-LD)

  • Proficient in modern big data architectural approaches (Kappa/Lambda architectures, Data Lake Zones, etc.)

  • Mastery of operating and designing stream-based data systems (Kafka, AWS Kinesis, GCP Pub/Sub, etc.), particularly under varying load

  • Deep understanding of theoretical and practical tradeoffs of various NoSQL stores (Cassandra, Elasticsearch, DynamoDB, etc.) with respect to different read/write patterns and availability/consistency requirements

  • Cloud infrastructure, IaC, and monitoring/instrumentation skills


Position & Application Details

Full-Time Regular Positions: This is a full-time, regular position (classified for 40 standard weekly hours) that is eligible for bonuses; medical, dental, vision, telehealth and mental healthcare; health savings account and flexible spending account; basic and voluntary life insurance; disability coverage; accident, critical illness and hospital indemnity supplemental coverages; legal and identity theft coverage; retirement savings plan; wellbeing program; discounted WGU tuition; flexible paid time off for rest and relaxation with no need for accrual; flexible paid sick time with no need for accrual; 11 paid holidays; and other paid leaves, including up to 12 weeks of parental leave.

How to Apply: If interested, submit an application online. Internal WGU employees will need to apply through the internal job board in Workday.

Additional Information

Disclaimer: The job posting highlights the most critical responsibilities and requirements of the job. It's not all-inclusive.

Accommodations: Applicants with disabilities who require assistance or accommodation during the application or interview process should contact our Talent Acquisition team at recruiting@wgu.edu.

Equal Employment Opportunity: All qualified applicants will receive consideration for employment without regard to any protected characteristic as required by law.
