Pinpoint Engineering

Building Data Science Applications with R: Takeaways from the RStudio Conference

One of our engineering goals this year at Pinpoint is to improve the entire team’s knowledge of the methodologies and tools available to us as software engineers, data engineers and data scientists. We have found some success in having team members attend conferences related to their roles, then share their takeaways with the team when they return. This exposes us to new ideas and new tools and steadily improves our overall knowledge.

At Pinpoint, we use R for most of our data science work, from data exploration and data cleaning to modeling and testing. To improve how we use R, a colleague and I attended the RStudio Conference in January. My goal was both to pick up concrete tips and tricks for using R day to day and to see how other companies are using R. I was particularly interested in how R is used in large-scale projects and how companies scale their data science applications, so that we could apply those strategies as our own applications grow. I attended several talks by people using R for serious business cases and quickly noticed a theme: most companies are creating an API with the plumber package and deploying containerized versions of their R programs as microservices.
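To make the plumber pattern concrete, here is a minimal sketch of an R function exposed as an HTTP endpoint. It is an illustration, not code shown at the conference: the `mean_handler` and `build_api` names and the `/mean` route are hypothetical, and the router-building calls assume plumber 1.0 or later is installed.

```r
# Hypothetical handler: takes a comma-separated string such as "1,2,3"
# and returns the mean wrapped in a list (plumber serializes lists to JSON).
mean_handler <- function(values) {
  nums <- as.numeric(strsplit(values, ",", fixed = TRUE)[[1]])
  list(mean = mean(nums))
}

# Build a router with one GET endpoint.
# Assumes the plumber package (>= 1.0) is installed; nothing is served yet.
build_api <- function() {
  plumber::pr_get(plumber::pr(), "/mean", mean_handler)
}

# To serve locally (this call blocks the R session):
# plumber::pr_run(build_api(), port = 8000)
```

With the server running, a request to `/mean?values=1,2,3` would call `mean_handler` with the query string value. A containerized deployment like the ones described in the talks would package a script like this into an image, which is not shown here.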

Even as a regular R user, I still have a lot to learn about the R ecosystem and its capabilities. Here are my takeaways and highlights from the conference.

New (to me) packages

My goal was to expand my knowledge of the R ecosystem. The conference was a great way to learn about R packages I hadn’t heard of or used, and to determine whether there is anything we should incorporate into our process.

These are some of the new packages I discovered at the conference:

  • rayshader is an open-source package for producing two- and three-dimensional data visualizations in R, mainly three-dimensional topographic maps. These models can be rotated or scripted to create animations.
  • The tsibble package makes it easier to group information by time period. This is a package I could see us implementing because our work is fundamentally about making sense of how data changes over user-defined time periods in development methodologies such as Scrum or Kanban.
  • The feasts package works with tsibble to add visualization capabilities for time series data and to make it easier to separate seasonal patterns from long-term trends. I anticipate it will also be useful at Pinpoint as we look to improve our visualizations for trending data.
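As a rough illustration of the time-period grouping that tsibble offers, the sketch below rolls daily counts up into weekly totals. The `daily` data frame and its values are made up for the example, and the code assumes the tsibble and dplyr packages are installed.

```r
library(tsibble)
library(dplyr)

# Made-up daily commit counts for two full weeks, starting on a Monday.
daily <- data.frame(
  day = as.Date("2020-01-06") + 0:13,
  commits = c(3, 5, 2, 4, 6, 0, 0, 4, 4, 3, 5, 7, 1, 0)
)

# Declare `day` as the time index, regroup the index by week,
# then summarise within each week.
weekly <- daily %>%
  as_tsibble(index = day) %>%
  index_by(week = ~ yearweek(.)) %>%
  summarise(total = sum(commits))

print(weekly)
```

Swapping `yearweek()` for `yearmonth()` or `yearquarter()` changes the grouping period without touching the rest of the pipeline. A custom period such as a sprint would need its own grouping function, which is not shown here.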

The Tidyverse also continues to improve, with bug fixes and new features added continually. At this point it is an essential tool for anyone using R for data science.

My favorite talks

The talks I enjoyed the most focused on the intersection of machine learning and human intelligence. Humans design machine learning algorithms as well as the organizations that use artificial intelligence. The human element is often missed in discussions about artificial intelligence, from the way debugging is handled (by humans) to how we can prevent bias in machine learning. Here are my favorite talks.

  • How to win an AI hackathon without doing AI: Colin Gillespie talked about how data science is about human intelligence, not just fancy algorithms. Great data science requires taking into account the actual problem that needs solving and finding the best approach. 
  • Debugging R: Debugging R is challenging for everyone. In this talk, Jenny Bryan shared some debugging techniques. I was also glad to hear her best-practice recommendations, including how creating a ‘minimal reproducible example’ can speed up the debugging process.
  • Technical debt is a social problem: We think of technical debt as a technical problem, but Gordon Shotwell made a compelling case that it’s really a communications problem: a failure to communicate with the future. One trick he mentioned is to separate maintainers from users, because they have different priorities and need different types of information.
  • Google PAIR: This talk from Fernanda Viégas and Martin Wattenberg highlighted that we need to understand, and be able to communicate transparently, how neural networks work, and to ensure the machine learning applications we build are actually serving people well.
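The ‘minimal reproducible example’ idea from the debugging talk can be sketched in a few lines of base R. This is my own illustration, not an example from the talk: the `scores` data frame is made up, and the bug it isolates (a numeric column stored as text) is just one common case.

```r
# Smallest input that still shows the problem:
# a column that looks numeric but is stored as text.
scores <- data.frame(points = c("3", "5", "8"), stringsAsFactors = FALSE)

# The failing call, wrapped in tryCatch so the script keeps running
# and the error message is captured instead of stopping execution.
outcome <- tryCatch(
  sum(scores$points),
  error = function(e) conditionMessage(e)
)
print(outcome)

# The fix, checked against the same tiny input.
fixed <- sum(as.numeric(scores$points))
print(fixed)
```

Cutting a failing script down to something this small often exposes the cause on its own, and it gives a colleague something they can run as-is. The reprex package can render a snippet like this together with its output for sharing.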

I’ve already started using some of these new techniques and tools in my work at Pinpoint as I share them with the rest of the team. Ultimately, they will make our product more resilient, easier to debug and more transparent for developers and managers. 
