Newsletter #083

The data scientist role is one of the most sought-after jobs in technology today. However, it’s also one of the most misunderstood, and this contributes to a lot of the dissatisfaction that many data scientists experience in industry.

I believe that when it comes to data science and machine learning, companies fall into one of two groups.

The first group is companies where ML/DS is a core competency. These are companies that wouldn’t exist were it not for machine learning. Popular examples here include Google, Facebook, Amazon, and Stitch Fix. Several of these companies didn’t start out as ML-driven companies, but if you removed their ML competency today, these companies would be totally different. Companies where ML is a core competency don’t need to be convinced that machine learning can add business value. They have several ML-driven products in production, and many other ML-driven bets that are being developed and tested.

The second group is companies where ML is not a core competency. The vast majority of companies fall into this bucket. At these companies, data scientists and ML engineers might work on projects that complement existing products, (attempt to) automate internal operational processes, or otherwise seek to drive other efficiencies. Many of these places have decided ML is worth the investment (otherwise their data scientists wouldn’t have jobs), but sometimes that investment is driven by FOMO (fear of missing out) rather than the desire to innovate. These companies invest millions of dollars into their data science teams, so naturally they want to market themselves as ML-driven companies even if their ML efforts are net-negative.

Although they’ve decided ML is worth the investment, leaders at companies in group two need to be convinced that deploying machine learning is worth the risk. This is a people problem, not a technology problem. Solving this requires cultivating relationships with appropriate business stakeholders and decision makers. It requires trust. The business needs to trust that data science understands the domain and what’s at stake (which often includes the stakeholders’ reputation and job). Correspondingly, the data scientists need to have stakeholder buy-in. Any technology solution won’t be perfect at first, so data scientists and business stakeholders will need to work together to iron out the kinks.

This is a hard problem to solve unless the data science team already has some level of influence. It’s a bit of a chicken-and-egg problem. But companies that solve this challenge can begin to generate business value through machine learning and data science. Those that perfect their solutions can even "cross the chasm" to become ML-first companies.


Here’s what I’ve been reading/watching/listening to recently:

  • Column Names as Contracts – While there are accepted strategies for making contracts with users of software and user interfaces, similar strategies are less widespread for data tables. This article describes controlled vocabularies for column names as a simple approach to building a shared understanding of how each field in a data set is intended to work. The post introduces the concept with an example and demonstrates how controlled vocabularies can offer lightweight solutions to rote data validation, discoverability, and wrangling.

  • pointblank – Last week I referenced the great_expectations library for "unit testing" data in Python. This week I discovered pointblank, a similar library to methodically validate your data (whether in-memory as data frames or as db tables) in R. The package contains a collection of powerful validation functions, maintains information on tables that is updated when the tables are updated, generates automated data quality reports, and supports a variety of databases. It can also be used in pipeline processes to periodically check data, trigger warnings, raise errors, or write out information to logs when validations exceed specified failure thresholds. I’m a big believer in the need for data validations to catch errors in ML pipelines.

  • Data Science Project Flow for Startups – A data science consultant provides his take on how to structure and carry out projects with teams of 1-4 data scientists. The process is divided into three aspects that run in parallel: product, data science, and data engineering. It involves data science repeatedly checking in with product to ensure that KPIs are satisfied. The process itself is broken down into 4 phases: scoping, research, model development, and deployment.

  • Peer Reviewing Data Science Projects – A follow-up to the previous article that proposes a structured process for peer reviewing data science projects. The post suggests two different peer review processes: one for the research phase and the other for the model development phase. I especially enjoyed the extensive list of questions the author proposes for the research phase and how these can be used to reduce risk associated with the project. Extremely well-written and insightful.

  • Good Data Analysis – My team at 2U has been performing a lot of data analysis recently, so I decided to return to this excellent resource from Google data analysts that I shared wayyyy back in issue 7. This document summarizes the ideas and techniques that careful, methodical data analysts use on large, high-dimensional data sets. It’s split into three sections: technical (techniques for examining data), process (how to approach data, what questions to ask), and mindset (how to work with others and communicate insights). One of the most interesting subsections describes the danger of mix shifts, where the relative sizes of subpopulations differ between the groups being compared. Mix shifts can lead to Simpson’s paradox "in which a trend appears in several different groups of data but disappears or reverses when these groups are combined."
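
To make the mix-shift point from the last item concrete, here is a minimal Python sketch. The numbers are invented purely for illustration (they don't come from the linked document), and the segment names and helper function are hypothetical. Conversion rate goes up in every segment from one week to the next, yet the overall rate goes down, because the traffic mix shifts toward the lower-converting segment.

```python
# Invented numbers for illustration only: (visits, conversions) per segment.
traffic = {
    "week_1": {"desktop": (800, 80), "mobile": (200, 4)},
    "week_2": {"desktop": (200, 22), "mobile": (800, 24)},
}


def overall_rate(segments):
    """Pooled conversion rate across all segments in one week."""
    total_visits = sum(visits for visits, _ in segments.values())
    total_conversions = sum(conv for _, conv in segments.values())
    return total_conversions / total_visits


# Conversion rate within each segment, per week.
segment_rates = {
    week: {name: conv / visits for name, (visits, conv) in segments.items()}
    for week, segments in traffic.items()
}

# Conversion rate with the segments pooled together, per week.
overall_rates = {week: overall_rate(segments) for week, segments in traffic.items()}

for week in traffic:
    per_segment = ", ".join(
        f"{name} {rate:.1%}" for name, rate in segment_rates[week].items()
    )
    print(f"{week}: {per_segment}, overall {overall_rates[week]:.1%}")
```

In this made-up data, desktop rises from 10% to 11% and mobile from 2% to 3%, while the pooled rate falls from 8.4% to 4.6%. Nothing got worse within either segment; only the mix changed, which is why it's worth checking a metric by subgroup before drawing conclusions from the aggregate.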

That does it for this week’s issue. If you have any thoughts, I’d love to hear them in the comment section below!

3 thoughts on “Newsletter #083”

  1. HI Luigi,
    I am a freelance writer writing about data science, AI, machine learning and other related technologies. I am always on the lookout for content to consume on the related technologies. It was a sheer chance I landed onto your website — through a quote by you, in one of the articles. I have subscribed to your newsletter and hoping to learn something new in the field.

    Thanks and Regards
    Swati
