This morning I had the opportunity to chat with software engineers and data scientists at the AI Dev World Conference on a topic I just happen to be v...
At Pinpoint, we offer customers two types of data models for the insights across our platform that leverage data science and machine learning. Our team commonly uses a Global Data Model, which leverages data across our customer base, and a Custom Data Model, which is specific to a single customer and trained on their own historical data.
As with any data science model, the more data you have, the more effective the training of the model, and ultimately the more accurate the prediction and the clearer the explanation. As a side note, we value transparency in all of our data science, which is why we provide explanations in the product ("showing our homework") about how a metric or prediction is calculated.
Below, we cover the advantages and disadvantages of each data model, as well as how we determine which to use for each customer.
The advantages of using a customer-specific model are straightforward. We build a model or predictor solely from that specific customer's data, which yields the highest accuracy. This means that other companies' practices, behavior, size, and engineering efficiency do not affect the customer's predictions.
Because the model accounts only for a specific customer's behavior, we can not only form more accurate predictions but also give clearer, more well-defined explanations for those predictions. We automatically default to this model unless a customer does not have enough data to meet the threshold described in the following section.
The Global Data Model is used when a customer does not have enough historical data (e.g., completed issues for our Issue Forecast) to properly train the model. With fewer data points to train on, the model performs worse when given new data to predict, producing predictions that are less accurate and harder to explain, which is why we provide the Global Data Model. It most commonly applies to newer companies, or companies just starting to use a tool such as an issue-tracking platform.
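The fallback rule above can be sketched in a few lines. This is a hypothetical illustration, not Pinpoint's actual code: the threshold value, function name, and the idea of counting completed issues are all assumptions made for the example.

```python
# Hypothetical sketch of the custom-vs-global fallback rule.
# The threshold value and names are illustrative, not Pinpoint's code.
MIN_COMPLETED_ISSUES = 200  # assumed minimum history for a custom model

def select_model(completed_issue_count: int) -> str:
    """Pick which data model backs a customer's forecasts."""
    if completed_issue_count >= MIN_COMPLETED_ISSUES:
        return "custom"  # enough history to train on the customer's own data
    return "global"      # new customer: fall back to the global model

print(select_model(1500))  # a mature customer gets the custom model
print(select_model(40))    # a brand-new customer gets the global model
```

The key design point is that the customer never has to choose: the platform checks the available history and picks the most accurate model it can justify.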
The Global Data Model leverages a large, universal dataset that is representative of our entire customer base. This gives customers with limited historical data an accurate forecast immediately, rather than making them wait until they have enough data to build a custom model. Because companies' projects can vary with their size (enterprise vs. startup) and the types of projects they run, we normalize the global dataset so that it remains a good representation regardless of these differing factors. This helps prevent the attributes of our customer base from skewing the dataset.
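One common way to normalize data of this kind is to rescale each company's metrics by that company's own mean and spread (a z-score). The sketch below uses made-up numbers and is only one possible technique; Pinpoint's actual normalization may differ.

```python
# Illustrative per-company normalization for a global dataset.
# Z-scoring each company against itself keeps large enterprises
# from dominating small startups. Data and names are made up.
from statistics import mean, stdev

def normalize_company(values):
    """Z-score one company's metric so scale differences wash out."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# An enterprise closing ~100 issues per week and a startup closing ~10
# land on the same scale after normalization:
enterprise = normalize_company([90, 100, 110])
startup = normalize_company([9, 10, 11])
print(enterprise)  # [-1.0, 0.0, 1.0]
print(startup)     # [-1.0, 0.0, 1.0]
```

After this step, "a week that was busy for this company" means the same thing numerically whether the company ships ten issues a week or a hundred.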
The determining factor for the threshold comes down to a prevalent topic in data science: variance. Variance measures how much a model's predictions change when it is trained on different samples of data; a high-variance model has effectively memorized its training set and adjusts poorly to data it has never seen before. For our purposes, the lower the variance, the better. After experimenting with different threshold levels, we chose the lowest threshold that would let customers use their own model while still maintaining an acceptable amount of variance.
As we get more data, whether specific to a particular customer or for our global dataset, the models will only get more accurate, with clearer explanations to support the predictions.