How we use machine learning for intelligent commit linking

Just a few days ago, we wrote about four reasons to link your code commits to their originating issue. The idea is that remembering to link your commits (basically, by flagging the issue ID in the commit comment) is a small price to pay for the benefits Pinpoint can deliver. That’s true. But on the Data Science side we’ve been wondering: what if there were a way to eliminate even that small price? Is this a case where machine learning could supplement or even replace a tedious manual task?

The short answer is yes.

V1: automated linking, brute force

One of the engineering performance signals we surface is called Traceability. Traceability identifies how many of your total commits were linked to issues. We use this signal to help evaluate the hygiene of the commit process, as well as to perform higher-level analyses based on the kinds of work triggering the commits.

Our original approach to determining commit links went as follows. We would search the body of the commit message (or the name of the branch that contained the commit) using a regular expression, something similar to <Issue Identifier>[\s-_]\d+: a project key, then a whitespace, hyphen, or underscore separator, then one or more digits. When we found one or more matches, we would cross-reference each identifier against the data in the issue system, e.g. Jira. If the ID, say “BE-123,” was found in Jira, we would know the commit was linked to that particular issue. If we didn't find it, we would assume the match was a false positive and discard it. This is a fairly common practice; Jira even has a related built-in capability called Smart Commits.
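The brute-force approach above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the exact pattern, the function name find_linked_issues, and the sample tracker contents are all assumptions made for the example.

```python
import re

# Assumed pattern: an uppercase project key, one separator (whitespace,
# hyphen, or underscore), then digits. The production expression may differ.
ISSUE_ID = re.compile(r"\b([A-Z][A-Z0-9]*)[\s\-_](\d+)\b")


def find_linked_issues(commit_message, branch_name, known_issue_ids):
    """Return candidate IDs from the commit that also exist in the tracker."""
    linked = []
    for text in (commit_message, branch_name):
        for key, number in ISSUE_ID.findall(text):
            issue_id = f"{key}-{number}"  # normalize to the KEY-123 form
            # A match that is not a real issue is treated as a false positive.
            if issue_id in known_issue_ids and issue_id not in linked:
                linked.append(issue_id)
    return linked


# Hypothetical tracker contents and commit, for illustration only.
known = {"BE-123", "BE-124"}
message = "BE-123 fix flaky test, see also UTF-8 notes"
print(find_linked_issues(message, "feature/be_cleanup", known))  # ['BE-123']
```

Note that “UTF-8” matches the pattern but is discarded because no such issue exists, which is exactly the false-positive check described above.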

The slight problem is as mentioned above: all this requires the developer to remember to flag the issue ID in his or her commit. Not difficult, no, but when you’re under the gun and sprinting for a finish, it’s the kind of small additional tax that tends to go unpaid. As we wrote in our prior post, it’s simply “too hard to see specifically how the practice helps me, my team, and/or the company at large.”

Enter machine learning.

V2: automated linking, natural language processing (NLP)

To get developers out of the business of having to tag their commits, we use document similarity from R's text2vec package. Document similarity works exactly as it sounds—it identifies semantic similarities as well as similar concepts within two different documents. In our case, we use Jaccard similarity. This is an NLP technique that measures the number of common words that appear within two sentences (documents) versus the number of unique words within those same sentences:

J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of unique words in the two documents.

A short example will show how this technique derives similarity. Let’s say we have one commit, and four different candidate issues* that might link. For this exercise, we're trying to find a linkage to the issue titled, “Test repository docker base needs to be fixed.”

[Figure: linking commits example]

In order to capture similarities, we run through some straightforward NLP cleaning procedures:

  • We remove any non-alphanumeric characters
  • We transform all letters to lowercase
  • We eliminate all stop words (to, and, be, etc.)

Finally, we stem the words. We want our program to recognize that “fixes” should be similar to “fixed,” so we stem both words to “fix.”

[Figure: linking commits with natural language processing]

The result looks a little strange, but you can see how useful stemming is for words like “changes,” “fixes,” and “landing.”
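For readers who want to try the cleaning steps, here is a rough Python sketch. Our pipeline uses R's text2vec, so treat this as an approximation: the stop-word list is a tiny illustrative sample, and crude_stem is a toy suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# A small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "be", "for", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Return the set of stemmed, lowercased, non-stop-word tokens."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # drop non-alphanumerics
    tokens = text.lower().split()                # lowercase, then tokenize
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}


print(sorted(clean("Test repository docker base needs to be fixed.")))
# ['base', 'docker', 'fix', 'need', 'repository', 'test']
```

With this toy stemmer, “fixes” and “fixed” both reduce to “fix,” while “changes” becomes the odd-looking “chang.”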

Now that we have our corpus cleaned and ready for analysis, we run our Jaccard similarity. It renders the following results:

[Figure: Jaccard similarity results for linking commits]

In this case, we want our program to recognize that our commit is linked to the third issue in our table. Jaccard similarity does in fact identify that relationship as the strongest. The score comes from taking the number of common words found in both sentences, five, and dividing by the number of unique words that appear in either sentence, seven: 5/7 ≈ 0.714.
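The scoring itself is only a few lines. The Python sketch below is illustrative rather than our actual implementation: the token sets and issue IDs are hypothetical (they are not the rows from the table above), chosen so that the best candidate shares five of seven unique words with the commit.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets: shared words / all unique words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty documents
    return len(a & b) / len(a | b)


def best_match(commit_tokens, issues):
    """Return (issue_id, score) for the highest-scoring candidate issue."""
    scored = {
        issue_id: jaccard(commit_tokens, tokens)
        for issue_id, tokens in issues.items()
    }
    return max(scored.items(), key=lambda pair: pair[1])


# Hypothetical, already-cleaned tokens; not the data from the post's table.
commit = {"fix", "test", "repository", "docker", "base", "image"}
issues = {
    "ISSUE-1": {"land", "page", "chang"},
    "ISSUE-2": {"test", "repository", "docker", "base", "need", "fix"},
}
issue_id, score = best_match(commit, issues)
print(issue_id, round(score, 3))  # ISSUE-2 0.714
```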

It’s simple—but pretty powerful. We combine this method with word embedding to identify similar words that are written differently, like “bug” and “defect,” and then use a scoring method to ensure we’re getting the most likely candidate.

Look for this feature in our next release, coming soon...


*In the real world, we might find hundreds or thousands of candidate issues. Any issue that’s open and predates the commit in question is a candidate for linkage.

