Major League Baseball Trusts Mojo Platform!

Mojo Platform Orchestrates MLB's Bare Metal.

By Kevin Backman


At the end of the 2019 baseball season, MLB began a hardware refresh across all our major league ballparks, providing us with an opportunity to rebuild our server infrastructure using a DevOps approach.

Goals of the Refresh

  • Given the patchwork nature of our environment, we wanted to simplify our infrastructure management into a single pane of glass. It was important for us to build, manage, and support 30 MLB ballparks easily from day one, and to be able to repeat the approach when the time came to refresh our minor league ballparks.
  • Streamline app deployments so that we could consistently and reliably deploy code using the Continuous Integration and Continuous Delivery tools currently in place.
  • Reduce costs and complexity by removing virtualization and running everything on Anthos bare metal, which is Google's solution for running Anthos directly on physical servers.
  • Create a resilient ballpark cluster that could handle hardware failures. In the past, if a server went down, all VMs running on that server went down with it. This meant we couldn’t fully utilize our hardware, since we needed to run a minimum of two VMs for each app.
  • Deliver the solution within a tight schedule for the start of the 2021 MLB season with resource constraints compounded by COVID-19.

Our partnership with Google gave us a way to have all our clusters in one place, the GCP Console, with Kubernetes and Anthos on bare metal as the foundation.

Our Ballpark Infrastructure Server Footprint

All of our MLB ballparks have limited data center space, and for each we have a condensed and optimized footprint of four servers. These four servers are responsible for running applications that are critical to the game of baseball, including our Statcast and Hawk-Eye solutions.

Why GKE On-Prem?

We have limited space in the ballpark data centers, and Kubernetes provides us with the flexibility we need.

In the past we ran a traditional VMware stack (and RHEV-M in some parks) on physical machines. While this met our needs, we were over-provisioning each VM with 8 vCPUs and 32 GB of memory to run an application that only required a fraction of those resources. GKE was an obvious choice, allowing us to allocate appropriate CPU and memory resources at the pod level (e.g. cpu: 500m and memory: 1Gi). GKE also allowed us to run multiple pods across our nodes, giving us resiliency and flexibility when performing maintenance. Working with Google gave us a wealth of knowledge that we tapped into to deploy Anthos across our parks, along with the single pane of glass we were looking for.
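To illustrate the pod-level sizing mentioned above, here is a minimal sketch of a Deployment manifest using those request values. The application name and image are hypothetical placeholders, not MLB's actual workloads:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ballpark-app            # hypothetical app name
spec:
  replicas: 2                   # spread across nodes for resiliency
  selector:
    matchLabels:
      app: ballpark-app
  template:
    metadata:
      labels:
        app: ballpark-app
    spec:
      containers:
        - name: ballpark-app
          image: gcr.io/example-project/ballpark-app:1.0   # hypothetical image
          resources:
            requests:
              cpu: 500m         # a fraction of a core, vs. 8 vCPUs per VM
              memory: 1Gi       # vs. 32 GB per VM
            limits:
              cpu: "1"
              memory: 2Gi
```

The scheduler packs pods onto nodes based on these requests, which is what lets four physical servers carry workloads that previously needed a dedicated, over-provisioned VM each.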

With our ballpark clusters now living in the GCP environment, we can manage them the same way we manage our GKE clusters in the Cloud, and utilize the CI/CD tooling already in place to deploy our applications.

How did we do it?

Short answer:

Through automation and a lot of hard work.

Metify Mojo Platform for Remote Deployment and Ongoing Maintenance of our Bare Metal Servers

Deploying new technology to 30 data centers is no easy task, and it would not have been possible without automation tools and the hard work of people physically in the ballparks.

Before we could even proceed, we relied on our very capable Ballpark Infrastructure and Ballpark Ops teams to do the hardware installation. Once that was complete (no small feat on their part, given the pandemic), we were able to get to work on our portion: building out the new and improved infrastructure.

This was achieved by setting up a hardware orchestration tool called Mojo, a software suite produced by Metify that uses Redfish to interact with the bare metal servers and PXE boot them. One Mojo instance can manage hundreds of bare metal servers, greatly reducing the effort required to manage them. Mojo also lets us manage the BIOS and firmware of the servers, so we can now keep pace with patches and security hotfixes.
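Mojo drives this through Redfish, DMTF's standard REST API for baseboard management controllers. As a rough sketch of what a single PXE-boot operation looks like at the Redfish level (Mojo's internals are not public; the BMC hostname, credentials, and system ID below are hypothetical), one could build the two calls like this:

```python
# Sketch of PXE-booting one server via the Redfish API: first PATCH the
# Boot override so the next boot goes to PXE, then POST a reset action.
# BMC host, credentials, and system ID are hypothetical placeholders.
import base64
import json
import urllib.request


def pxe_boot_payload() -> dict:
    """Redfish Boot override: PXE-boot the system on its next reset only."""
    return {
        "Boot": {
            "BootSourceOverrideEnabled": "Once",
            "BootSourceOverrideTarget": "Pxe",
        }
    }


def reset_payload(reset_type: str = "ForceRestart") -> dict:
    """Payload for the standard ComputerSystem.Reset action."""
    return {"ResetType": reset_type}


def redfish_request(bmc: str, path: str, user: str, password: str,
                    payload: dict, method: str) -> urllib.request.Request:
    """Build an authenticated Redfish request (HTTP Basic auth)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"https://{bmc}{path}",
        data=json.dumps(payload).encode(),
        method=method,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Build (but do not send) the two requests against a hypothetical BMC.
patch = redfish_request("bmc.example", "/redfish/v1/Systems/1",
                        "admin", "secret", pxe_boot_payload(), "PATCH")
reset = redfish_request("bmc.example",
                        "/redfish/v1/Systems/1/Actions/ComputerSystem.Reset",
                        "admin", "secret", reset_payload(), "POST")
# urllib.request.urlopen(patch) / urlopen(reset) would send them to a real BMC.
```

Because Redfish is a uniform API across vendors, the same two calls work against hundreds of servers, which is what makes fleet-wide orchestration from a single Mojo instance practical.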

With the hardware and operating system ready to go, we switched to deploying Anthos to all the servers. The process is well documented by Google; however, instead of deploying an Anthos admin workstation for each cluster at each ballpark, we deployed our workstations in a GCP project. This simplified managing the environment: instead of 30 workstations across our parks, we have three running in the Cloud (a workstation can only deploy one cluster at a time).

By running Anthos on bare metal without VMware, we have a much cleaner and simpler environment to manage. The cost savings alone make this rebuild worth the effort.

With the Anthos clusters online, it was a simple matter of plugging them into our existing ArgoCD environment. Adding continuous deployment to the ballparks lets us build a resilient environment better suited to handling hardware failures, a challenge we faced when running VMware and earlier virtualization technologies.
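A hedged sketch of what wiring one ballpark cluster into Argo CD might look like declaratively; the repository URL, cluster name, and application name are hypothetical, not MLB's actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ballpark-app-park01            # hypothetical: one app on one park's cluster
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/ballpark-apps.git   # hypothetical repo
    targetRevision: main
    path: apps/ballpark-app
  destination:
    name: park01-anthos-cluster        # the registered ballpark cluster (hypothetical name)
    namespace: ballpark-app
  syncPolicy:
    automated:
      prune: true                      # remove resources deleted from Git
      selfHeal: true                   # revert out-of-band drift on the cluster
```

With `selfHeal` enabled, a pod lost to a hardware failure is recreated to match the state declared in Git, which is the resiliency property described above.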

Credit: Graeme Tower

The GKE solution also simplified how we handle persistent and shared storage. Using Rook Ceph within the clusters in each park allowed us to eliminate the need for a physical NAS and further reduce our data center footprint, complexity, and cost.
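For illustration, a simplified sketch of how Rook Ceph block storage is typically consumed inside a cluster; the pool and claim names are hypothetical, and a production StorageClass needs additional CSI secret parameters omitted here:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com   # Rook's Ceph RBD CSI driver
parameters:
  clusterID: rook-ceph
  pool: replicapool                       # hypothetical Ceph pool name
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data                          # hypothetical claim for an app
spec:
  storageClassName: rook-ceph-block
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```

Ceph replicates the data across the servers' local disks, so applications get durable storage without any dedicated NAS hardware in the rack.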

What did we learn?

We learned that with automation tools, Kubernetes, continuous deployment, and an infrastructure-as-code approach, we can deploy and manage resilient systems that can be trusted to run our critical applications in the ballparks. This puts us in a position to stay on the cutting edge of technology as it relates to baseball. In past hardware refreshes, it took a single engineer a minimum of six hours to build the old VMware environment from start to finish. That now takes forty minutes for one park, and multiple parks can be worked on at the same time. This is a massive reduction in engineering hours, leaving engineers more time to work on future improvements and projects.
