Automating data science is hard, and we do a lot of it.

Even with the most basic models, training takes a lot of time and resources, and scoring has to be fast. So you end up with a lot of tradeoffs: when to train, how often, how much data to train on, how many resources each job needs, how you want to score, what latency you’re willing to accept when scoring new data, and when to rescore old data. Factor in budget, team size, technology sprawl, etc., and it becomes an extra convoluted problem!

The Pinpoint Data Super-Guild, a combination of the Data Engineering and Data Science teams, has rapidly iterated over several different solutions to bring our data science to production at scale — without breaking the bank or introducing a pile of new technologies.

Our goals were...simple 😳. We wanted to create a framework that:

  • Runs on Kubernetes (“cloud-native” — yay buzzwords!)
  • Is driven by our EventAPI messaging infrastructure
  • Supports the following interaction modes:
    • Request/Response via REST
    • Stream
    • Micro-batch (periodically flush on a set of accumulated data)
    • Batch
  • Doesn’t cost a zillion dollars.
  • We can explain it to our friends.

To accomplish these goals, we switched our data science language from R to Python. Python is much easier to work with in a containerized environment, and our Data Engineers had already productized Python, so the switch helped us build a more robust operational data science codebase.

Our Data Scientists are now contributing unit tests, canary tests, caches, etc. This, along with the switch in language, has made the Super-Guild hugely productive.

Reference architecture

Here’s a simplified, high-level architecture diagram of our ecosystem.

Glossary

  • webapp - what you see when you hit the application at pinpoint.com
  • GraphAPI - our internal GraphQL service implementation
  • EventAPI - our internal WebSocket-based eventing infrastructure
  • Agent - the program that does all the heavy lifting of getting data from github.com, Jira, GitLab, etc. into the system
  • Pipeline - does the ETL on the Agent’s output and loads it into the system
  • Train - a data science process that builds a model
  • Score - a data science process that applies the model to data and gives a predicted value with an explanation

Here's a look at our trial and error, and the solution we ultimately landed on.

 

1st Try: 100% batch + cron

Like all good Data Engineers, we reach for cron timers when automating processes! After porting the initial data science projects to Python, we built a K8s operator that fired jobs on a cron timer.

For each customer/data-science-project combination, we’d run a train/score job at 6 AM and 8 PM, hopefully avoiding any impact on the services during peak traffic hours.
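
To make that first iteration concrete, here’s a minimal sketch of the pattern (not our actual operator code): it uses the official kubernetes Python client to fan out one train Job and one score Job per customer/project whenever the timer fires. The customer IDs, project names, image, and namespace below are all hypothetical.

```python
# Minimal sketch of the cron-fired operator idea. CUSTOMERS, PROJECTS, the
# image, and the namespace are hypothetical placeholders.
from kubernetes import client, config

CUSTOMERS = ["acme", "globex"]                 # hypothetical customer IDs
PROJECTS = ["issue-forecast", "sprint-risk"]   # hypothetical project names

def launch_job(customer: str, project: str, action: str) -> None:
    """Create a one-off K8s Job that trains or scores for one customer/project."""
    name = f"{project}-{action}-{customer}"
    container = client.V1Container(
        name=name,
        image=f"registry.example.com/ds/{project}:latest",  # hypothetical image
        args=[action, "--customer", customer],
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="ds-jobs", body=job)

if __name__ == "__main__":
    # The real operator fired this from a cron timer (6 AM / 8 PM);
    # here we just fan out one round of jobs.
    config.load_incluster_config()
    for customer in CUSTOMERS:
        for project in PROJECTS:
            launch_job(customer, project, "train")
            launch_job(customer, project, "score")
```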

For our customer base at the time, this wasn’t a bad solution. It was brute-force and highly effective. We knew we would be paging through all the data a project needed for each customer, which wasn’t massive thanks to the per-customer partitioning, and the jobs would train and score quickly across an entire dataset.

This solution was working great...until “users” wanted to see updates to Issue Forecasts, Sprint Risks, etc. more than 2x a day. Perfectly understandable 😬.

 

2nd Try: Cron + event-triggered batch

To accommodate more frequent updates, we updated the K8s Operator to listen for EventAPI messages signaling that the Agent had finished an export and then run the batch jobs, in addition to the twice-daily batch jobs. This also allowed us to separate the “train” jobs, which can run less frequently, from the “score” jobs, which needed to run as fast as the data changes. Train jobs were still running as an overnight cron job.
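
Here’s a hedged sketch of that event-triggered flow. The WebSocket endpoint, message type, and field names are invented for illustration (EventAPI is our internal WebSocket infrastructure), and the Job-creation helper is stubbed out:

```python
# Illustrative event-triggered scheduler: on an "exportComplete" message,
# kick off score jobs for that customer. Endpoint and fields are hypothetical.
import asyncio
import json

import websockets

EVENT_API_URL = "wss://event-api.example.com/stream"  # hypothetical endpoint

def launch_job(customer: str, project: str, action: str) -> None:
    # Stand-in for the K8s Job creation shown in the previous sketch.
    print(f"would launch {project}/{action} for {customer}")

async def listen_for_exports() -> None:
    async with websockets.connect(EVENT_API_URL) as ws:
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") != "exportComplete":   # hypothetical message type
                continue
            customer = event["customerId"]               # hypothetical field
            for project in ("issue-forecast", "sprint-risk"):
                launch_job(customer, project, "score")

if __name__ == "__main__":
    asyncio.run(listen_for_exports())
```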

This worked GREAT! Data was being updated! Jobs were being scheduled! Pods were flying!

...Until we ran 17k batch jobs in a day and caused an outage 🥵. Oops.

There were two major causes for this:

  • Since the Operator would launch a job every time an “exportComplete” message came through, we ended up with A LOT of pods being scheduled. Our cluster autoscaler couldn’t spin up nodes fast enough, and the pods got scheduled anywhere they could fit. They sometimes squeezed out core service workloads like GraphAPI and Pipeline, which meant the entire stack ground to a halt.
  • With that many pods running and paging through all of each customer’s issues, we were accidentally DDoS-ing our own GraphAPI service.

While chaos experiments are supposed to be conducted in a “semi-controlled” fashion 😇, this led to a momentary pause of our data science engine while we retooled.

3rd Try: Event-triggered batch with worker pool + streaming option

After reviewing the outage, we found that we were firing off K8s jobs without limiting how many could run at one time. This let our data science jobs take over all the resources we had allocated for our cluster, which impacted the other services.

So we adjusted the K8s Operator to run at most 15 jobs at any given time, giving us a way to control job execution parallelism. We could still listen to events, run all the jobs, keep everything on a budget, and update relatively frequently. Whew.
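
That fix is essentially a bounded worker pool. Here’s a minimal sketch of the pattern; the cap of 15 is the only number taken from our setup, and the fake job requests and the sleep standing in for a running Job are purely illustrative:

```python
# Bounded-parallelism sketch: queue up job requests and let at most
# MAX_CONCURRENT_JOBS run at any given time.
import asyncio
import random

MAX_CONCURRENT_JOBS = 15

async def run_job(request: dict, sem: asyncio.Semaphore) -> None:
    async with sem:  # blocks until one of the 15 slots frees up
        # In the real operator this would create a K8s Job and wait for it;
        # here a short sleep stands in for the work.
        await asyncio.sleep(random.uniform(0.1, 0.5))
        print(f"finished {request['project']} for {request['customer']}")

async def main() -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT_JOBS)
    requests = [
        {"customer": f"cust-{i}", "project": "issue-forecast"} for i in range(100)
    ]
    await asyncio.gather(*(run_job(r, sem) for r in requests))

if __name__ == "__main__":
    asyncio.run(main())
```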

We also implemented a Python client for our EventAPI message bus to stream messages through a Python codebase.

This effort backed the first iterations of the “Pulse” activity feed, which is continuously updating and personalized to the user! It worked! But something told us we could do even better...

…then came Agent v4 (which we’re SUPER EXCITED ABOUT!)

This meant everything would be streaming...except we were still running BATCH jobs in data science, and the last time we tried streaming events also happened to be when we broke production 😳.

Introducing: Project Lego (current version!)

At this point, we had a large amount of usage data, and code patterns were starting to emerge in all the data science projects. We did a few things to separate concerns, improve throughput, and make the team more productive overall.

 

So, we built a new operator that:

  • Creates a cohesive platform for deploying programs inside the Pinpoint infrastructure, initially targeting data science programs, but one that can easily support other program types.
  • Provides a mechanism for the following usage patterns:
    • Request/Response via REST API
    • Streams via EventAPI
    • Triggered/Batch & Microbatch
  • Supports autoscaling
  • Has metric gathering hooks built-in
  • Has a language-agnostic interface
  • Is easy to deploy

The new Operator is based on the idea of programs and commands. A program is a CLI-based Python package that implements the “bridge” client interface and can have multiple entry points, called commands. Commands generally correspond to actions that can be executed, like training, stream scoring, or launching an API.

The program-command combo is what defines a new piece of infrastructure (deployment).
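
To give a rough feel for the program/command shape, here’s a hypothetical program sketched with argparse. The real bridge client interface and the actual command names are Pinpoint-internal and not shown here:

```python
# Hypothetical "program" exposing three commands (entry points) via a CLI.
import argparse

def train(args: argparse.Namespace) -> None:
    print(f"training model for customer {args.customer}")

def score_stream(args: argparse.Namespace) -> None:
    print("scoring messages as they arrive over the bridge connection")

def serve(args: argparse.Namespace) -> None:
    print(f"serving request/response scoring on port {args.port}")

def main() -> None:
    parser = argparse.ArgumentParser(prog="issue-forecast")  # hypothetical program name
    sub = parser.add_subparsers(dest="command", required=True)

    p_train = sub.add_parser("train")
    p_train.add_argument("--customer", required=True)
    p_train.set_defaults(func=train)

    p_stream = sub.add_parser("score-stream")
    p_stream.set_defaults(func=score_stream)

    p_serve = sub.add_parser("serve")
    p_serve.add_argument("--port", type=int, default=8080)
    p_serve.set_defaults(func=serve)

    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```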

We needed a new CRD (a K8s Custom Resource Definition) that would let us use a simple configuration to define how these ‘combos’ are executed, and that would allow multiple ‘combos’ per data science project.
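
As a sketch of what one of those ‘combo’ resources might look like, here it is expressed as a Python dict and created with the kubernetes client. The API group, kind, and field names are invented for illustration and are not the real CRD:

```python
# Hypothetical program/command "combo" resource. Group, kind, and spec fields
# are illustrative placeholders, not Pinpoint's actual CRD schema.
from kubernetes import client, config

action = {
    "apiVersion": "lego.pinpoint.example/v1alpha1",  # hypothetical API group/version
    "kind": "Action",                                # hypothetical kind
    "metadata": {"name": "issue-forecast-score"},
    "spec": {
        "program": "issue-forecast",  # which Python program to run
        "command": "score-stream",    # which entry point (command) to invoke
        "replicas": 2,
    },
}

config.load_incluster_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="lego.pinpoint.example",
    version="v1alpha1",
    namespace="ds",
    plural="actions",
    body=action,
)
```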

The Operator would also lay down the KEDA ScaledObjects, ServiceMonitors, etc., so that any data science project that is deployed can autoscale, be monitored, and hopefully live on its own in the wild without daily care and feeding.

Abstracting the implementation from the infrastructure

Since Pinpoint uses Golang for our heavy-lifting data services, we wanted the client libraries with thousands of human-hours of development time behind them to drive the interop bits. The Python clients were functional but not as robust or resilient as the Golang clients.

So we built a simple TCP bridge protocol: any Python program can open a lightweight TCP connection to a Golang program that talks to the Pinpoint Messaging & Data Infrastructure.
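
Here’s a sketch of what the Python side of such a bridge could look like, assuming (purely for illustration) newline-delimited JSON over a local socket. The real bridge protocol, port, and message shape are not shown here:

```python
# Illustrative Python side of a TCP bridge: read newline-delimited JSON
# messages from a local sidecar socket, score each one, and write the result
# back. The host/port, framing, and fields are hypothetical.
import json
import socket

BRIDGE_HOST, BRIDGE_PORT = "127.0.0.1", 9999  # hypothetical sidecar address

def score(message: dict) -> dict:
    # Stand-in for the actual data science code.
    return {"id": message.get("id"), "risk": 0.42}

def main() -> None:
    with socket.create_connection((BRIDGE_HOST, BRIDGE_PORT)) as conn:
        reader = conn.makefile("r", encoding="utf-8")
        writer = conn.makefile("w", encoding="utf-8")
        for line in reader:                 # one JSON message per line
            result = score(json.loads(line))
            writer.write(json.dumps(result) + "\n")
            writer.flush()

if __name__ == "__main__":
    main()
```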

The final implementation looks something like:

ds-runner sidecar
  • Manages all the EventAPI connections
  • Manages message tracking via GraphAPI, so we know which messages have been processed as a micro-batch/batch and which streamed messages failed
  • Manages dead-letter queueing for messages that fail to process more than 5x
  • Optionally serves the registered Action (command) as a REST API
  • Exposes a metrics endpoint and accepts custom metrics from programs

Python program + action Custom Resource Definition (CRD)
  • Defines which command/action should be invoked when a message comes through EventAPI
  • Defines the Topics, Message Filters, etc.
  • Defines the Micro-batch Threshold (optional; see the sketch after this list)
  • Defines the Number of Replicas, Resources, etc.
  • Defines the code to execute from a simple TCP socket event loop
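
The micro-batch threshold mentioned above boils down to an accumulate-and-flush loop. In Project Lego the ds-runner sidecar handles the tracking; this Python sketch, with an invented threshold and a toy message stream, just illustrates the idea:

```python
# Illustrative accumulate-and-flush loop for the micro-batch pattern:
# buffer incoming messages and process them together once a threshold is hit
# or a timeout expires. Both values here are hypothetical.
import time

MICRO_BATCH_THRESHOLD = 50     # hypothetical value from the Action config
FLUSH_INTERVAL_SECONDS = 30

def process_batch(batch: list) -> None:
    print(f"scoring {len(batch)} accumulated messages")

def run(message_source) -> None:
    buffer, last_flush = [], time.monotonic()
    for message in message_source:
        buffer.append(message)
        hit_threshold = len(buffer) >= MICRO_BATCH_THRESHOLD
        timed_out = time.monotonic() - last_flush >= FLUSH_INTERVAL_SECONDS
        if hit_threshold or timed_out:
            process_batch(buffer)
            buffer, last_flush = [], time.monotonic()
    if buffer:  # flush whatever is left at shutdown
        process_batch(buffer)

if __name__ == "__main__":
    run({"id": i} for i in range(120))  # toy message stream
```
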
After porting the workload to Project Lego, we significantly increased our streaming throughput, and our Data Science programs are now able to keep up and scale as data flows through the system.

 

Results so far

Our current iteration is more robust and capable than ever. When you change anything in your PRs, Issues, etc. through Pinpoint, it automatically triggers smart re-calculations for the data science bits! Not only does the infrastructure perform better, but it’s also easier to integrate new projects now that our Data Scientists have a cookie-cutter repository, a clear “science code goes here” pattern, and easy-to-use tools to deploy it all.

Stay tuned for a deep dive into the inner workings of Project Lego. Yes, there will be CODE 🕺🏻.

- Jon
