This morning I had the opportunity to chat with software engineers and data scientists at the AI Dev World Conference on a topic I just happen to be very passionate about.
Automating data science is hard, and we do a lot of it.
Even with the most basic models, training takes a lot of time and resources, and scoring has to be fast. So you end up with a lot of tradeoffs: when to train, how often, how much data to train on, how many resources each job needs, how you want to score, what latency you're willing to accept when scoring new data, and when to rescore old data. Factor in budget, team size, technology sprawl, etc., and it becomes an extra convoluted problem!
The Pinpoint Data Super-Guild, a combination of the Data Engineering and Data Science teams, has rapidly iterated over several different solutions to bring our data science to production at scale — without breaking the bank or introducing a pile of new technologies.
Our goals were...simple 😳. We wanted to create a framework that:
To accomplish this goal, we switched our data science language from R to Python. Python is much easier to work with in a containerized environment, and our Data Engineers had already productized Python, so the switch helped create a more robust operational data science codebase.
Our Data Scientists are now contributing unit tests, canary tests, caches, etc. This, along with the switch in language, has made the Super Guild hugely productive.
Here’s a simplified, high-level architecture diagram of our ecosystem.
As any good Data Engineer will tell you, the go-to for automating processes is the Cron timer! After porting the initial data science projects to Python, we built a K8s operator that fired jobs on a cron timer.
For each customer/data-science-project combination, we'd run a train/score job at 6 AM and 8 PM, hoping to avoid impacting services during peak traffic hours.
For our customer base at the time, this wasn't a bad solution. It's brute force and highly effective. We knew we would be paging through all the data a project needed for each customer, which isn't massive thanks to per-customer partitioning, and the jobs would train and score quickly across an entire dataset.
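The twice-daily schedule boils down to a simple next-run calculation. Here's a minimal sketch (a hypothetical helper, not our actual operator code) assuming the two fixed run hours from above:

```python
from datetime import datetime, timedelta

RUN_HOURS = (6, 20)  # 6 AM and 8 PM, chosen to dodge peak traffic hours

def next_run(now: datetime) -> datetime:
    """Return the next scheduled train/score time after `now`."""
    for hour in RUN_HOURS:
        candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if candidate > now:
            return candidate
    # Past the last slot today: roll over to 6 AM tomorrow.
    tomorrow = now + timedelta(days=1)
    return tomorrow.replace(hour=RUN_HOURS[0], minute=0, second=0, microsecond=0)
```

In K8s terms this is just two CronJob schedules, but spelling it out makes the tradeoff obvious: data can be up to ~14 hours stale between runs.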
This solution was working great...until “users” wanted to see updates to Issue Forecasts, Sprint Risks, etc. more than 2x a day. Perfectly understandable 😬.
To accommodate more frequent updates, we updated the K8s Operator to also listen for Event-API messages and run the batch jobs whenever the Agent finished an export, in addition to the twice-daily batch runs. This also allowed us to separate the "train" jobs, which can run less frequently, from the "score" jobs, which need to run as fast as the data changes. Train jobs still ran as an overnight Cron job.
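The train/score split looks roughly like this in pseudocode. The event type and field names are illustrative stand-ins, not the real Event-API schema:

```python
def handle_event(event: dict, launch_job) -> list[str]:
    """Dispatch score jobs when the Agent finishes an export.

    Training is deliberately NOT triggered here; it stays on the
    overnight cron. `event` and `launch_job` are illustrative
    stand-ins for the real Event-API message and K8s job launcher.
    """
    launched = []
    if event.get("type") == "agent.export.finished":
        customer = event["customer_id"]
        for project in event.get("projects", []):
            # Score as soon as the data changes; names are hypothetical.
            job = f"score-{customer}-{project}"
            launch_job(job)
            launched.append(job)
    return launched
```

The key design choice is that scoring frequency is now driven by data changes, while training frequency stays a deliberate, budgeted decision.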
This worked GREAT! Data was being updated! Jobs were being scheduled! Pods were flying!
...Until we ran 17k batch jobs in a day and caused an outage 🥵. Oops.
There were two major causes for this:
While chaos experiments are supposed to be conducted in a “semi-controlled” fashion 😇, this led to a momentary pause of our data science engine while we retooled.
Reviewing the outage, we found that we were firing off K8s jobs without limiting how many could run at one time. Our data science jobs took over all the resources allocated to our cluster, which impacted the other services.
So we adjusted the K8s Operator to fire at most 15 jobs at any given time, giving us a way to control job execution parallelism. We could still listen to events, run all the jobs, keep everything on budget, and update relatively frequently. Whew.
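Conceptually, the cap is just a counting semaphore over job slots. Here's a minimal sketch of the idea (the real Operator tracks K8s Jobs rather than in-process threads):

```python
import threading

MAX_PARALLEL_JOBS = 15  # hard cap on simultaneous data science jobs

# One slot per allowed concurrent job; sketch of the Operator's cap.
_slots = threading.BoundedSemaphore(MAX_PARALLEL_JOBS)

def try_launch(job_name: str, launch) -> bool:
    """Launch `job_name` only if a slot is free; otherwise leave it queued."""
    if not _slots.acquire(blocking=False):
        return False  # 15 jobs already running; re-queue the event
    try:
        launch(job_name)
        return True
    except Exception:
        _slots.release()  # don't leak the slot if the launch fails
        raise

def job_finished(job_name: str) -> None:
    """Called when a job completes, freeing its slot for the next event."""
    _slots.release()
```

The 16th event doesn't get dropped; it simply waits until one of the 15 running jobs finishes and frees a slot.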
We also implemented a Python client for our EventAPI message bus to stream messages through a Python codebase.
This effort backed the first iterations of the “Pulse” activity feed, which is continuously updating and personalized to the user! It worked! But something told us we could do even better...
…then came Agent v4 (which we’re SUPER EXCITED ABOUT!)
This meant everything would be streaming...except we’re still running BATCH jobs in data science, and the last time we tried streaming events also happened to be when we broke production 😳.
At this point, we had a large amount of usage data, and code patterns were starting to emerge in all the data science projects. We did a few things to separate concerns, improve throughput, and make the team more productive overall.
The new Operator is based on the idea of programs and commands. A program is a CLI-based Python package that implements the "bridge" client interface and can have multiple entry points, called commands. Commands generally correspond to actions that can be executed, like training, stream scoring, or launching an API.
The program-command combo is what defines a new piece of infrastructure (deployment).
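In CLI terms, a program is a parser and its commands are subcommands. A minimal sketch (the program name and flags here are illustrative):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """A 'program' is a CLI package; each subcommand is a 'command'.

    The command names (train, score, api) mirror the examples in
    the post; the program name and flags are hypothetical.
    """
    parser = argparse.ArgumentParser(prog="issue-forecast")
    commands = parser.add_subparsers(dest="command", required=True)

    train = commands.add_parser("train", help="batch model training")
    train.add_argument("--customer-id", required=True)

    score = commands.add_parser("score", help="stream scoring")
    score.add_argument("--customer-id", required=True)

    commands.add_parser("api", help="launch a scoring API")
    return parser
```

Each program-command pair (`issue-forecast train`, `issue-forecast score`, ...) then maps to its own deployment, scheduled and scaled independently.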
We needed a new CRD (a K8s Custom Resource Definition) that would let us use a simple configuration to define how these 'combos' are executed, and that would enable us to define multiple 'combos' per data science project.
Also, the Operator would lay down and manage the Keda ScaledObjects, ServiceMonitors, etc., so that any deployed data science project can autoscale, be monitored, and hopefully live on its own in the wild without daily care and feeding.
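As an illustration, a custom resource for one project might declare its combos like this. All field names and the API group are hypothetical, shown here as a Python dict for brevity:

```python
# Hypothetical custom resource for one data science project.
resource = {
    "apiVersion": "pinpoint.example/v1",   # illustrative group/version
    "kind": "DataScienceProject",
    "metadata": {"name": "issue-forecast"},
    "spec": {
        "program": "issue-forecast",
        "commands": [
            {"name": "train", "schedule": "0 2 * * *"},  # overnight cron
            {"name": "score", "trigger": "events"},      # event-driven
        ],
    },
}

def deployments(cr: dict) -> list[str]:
    """Each program-command combo becomes its own deployment."""
    program = cr["spec"]["program"]
    return [f"{program}-{cmd['name']}" for cmd in cr["spec"]["commands"]]
```

One small config, multiple independently scheduled and scaled pieces of infrastructure per project.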
Since Pinpoint uses Golang for our heavy-lifting data services, we wanted to reuse the client libraries that already had thousands of human-hours of development time behind them to drive the interop bits. The Python clients were functional but not as robust or resilient as the Golang clients.
So we built a simple TCP bridge protocol: any Python program can open a lightweight TCP connection to a Golang program that talks to the Pinpoint Messaging & Data Infrastructure.
The final implementation looks something like this: a ds-runner sidecar alongside each Python program.
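To make the bridge idea concrete, here's one way such a protocol could frame messages: length-prefixed JSON over a plain socket. This is a hypothetical illustration, not the actual ds-runner wire format:

```python
import json
import socket
import struct

def send_msg(sock: socket.socket, msg: dict) -> None:
    """Send one JSON message, prefixed with a 4-byte big-endian length."""
    payload = json.dumps(msg).encode("utf-8")
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_msg(sock: socket.socket) -> dict:
    """Read one length-prefixed JSON message."""
    header = _recv_exactly(sock, 4)
    (length,) = struct.unpack(">I", header)
    return json.loads(_recv_exactly(sock, length))

def _recv_exactly(sock: socket.socket, n: int) -> bytes:
    """recv() can return short reads; loop until we have n bytes."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("bridge closed mid-message")
        buf += chunk
    return buf
```

The Python side stays tiny; all the battle-tested messaging logic lives on the Golang side of the socket.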
Our current iteration is more robust and capable than ever. When you change anything in your PRs, Issues, etc. through Pinpoint, that automatically triggers smart re-calculations for the data science bits! Not only does the infrastructure perform better, but it’s easier to integrate new projects, too, now that our Data Scientists have a cookie-cutter repository, a pattern for the “science code goes here” and easy-to-use tools to deploy it.
Stay tuned for a deep dive into the inner workings of Project Lego. Yes, there will be CODE 🕺🏻.