Pinpoint Engineering


Our 12 go-to Python libraries for data science

We use data science (machine learning, natural language processing, and more) in Pinpoint to correlate data from all the tools used to build software. This lets us unlock actionable insights, recommend ways to improve execution, predict risks, and forecast work effort.

Data science is often written off as a “black box” because its inner workings are rarely explained. Here at Pinpoint, we strive to be transparent in everything we do, including “showing our work” with customized explanations for each metric or prediction and sharing how we build our projects.

In that spirit, below I am sharing the libraries we use in our work. We don’t use any third-party applications to supplement our analyses, other than our code editor, VS Code. All of the libraries below are Python packages.

Mainstay Libraries

The below libraries are our go-to’s and are widely used within the industry.

  • pandas & NumPy for simple data calculations: These are two of the most popular Python packages of all time. They make working with data and calculations straightforward, and they are probably the most-used libraries in our codebase.
  • scikit-learn for machine learning: This is our primary machine learning workhorse. When it comes to iteration and hypothesis testing of new ideas, we rely heavily on the models contained within this library. Because of how deep this library is, we’re able to handle a wide range of predictive tasks.
  • NLTK & spaCy for natural language processing: These are the libraries we rely on most for NLP tasks. They provide outstanding APIs for textual analysis and experimentation. For smaller jobs we work mainly in NLTK; for projects requiring heavier processing, we use spaCy.
  • networkx for network/graph analysis: This library’s biggest use case for us is activity scoring within Pinpoint’s activity feed. It provides handy functions for graph generation, network analysis, and graph data structures.
  • structlog for logging: Internally, we try to label every step within our projects, from data ingestion all the way to prediction upload. structlog logs each step and any errors seen along the way, which is a massive help when tracking down issues in production.

Lesser-Known Hits

Here are other packages heavily used in our work that deserve far more recognition in the industry.

  • treeinterpreter & shap for model explanations: Our model explanations for things like issue forecast and sprint health are handled with the help of these two packages. They take the model of choice and explain how a prediction was derived at the “local” level: each forecasted issue gets an explanation unique to that issue, rather than the more general across-all-issues explanation most commonly seen in the industry, which only contributes to the “black box” stereotype mentioned earlier.
  • imbalanced-learn for balancing data: One of the challenges we face daily with our data is imbalanced data. Within our Jira issue data, we regularly see an imbalance between issues with story points and issues without. This package gives us several options to deal with this, including several over/undersampling techniques and several SMOTE variations.
  • sparse_dot_topn for quick processing of text similarity jobs: For some of our text similarity jobs, we incorporate bits of this package. With NLP projects, a big worry is processing time, and with how fast and interactive we would like our app to be, we can’t rely on single-threaded jobs for some of the work. That is where this package shines; it ratchets up popular text similarity techniques by giving us the ability to use multiple CPUs and workers.
  • sgqlc for GraphQL code generation: Our projects interact heavily with GraphQL, and sgqlc provides a really nice client with documented code-generation functions. We tied the library’s code generation into our CI pipeline, ensuring that our Python GraphQL schema stays up to date with the latest services.

For more on our data science infrastructure, I recommend checking out the related posts below.

About Jose 


Related Posts:

  • The TL;DR from my session on AI and EngOps at AIDevWorld
  • Two data science life hacks to improve your workflow
  • Pinpoint Book Club Reads: The Tyranny of Metrics
