Newsletter #086

A common belief is that individual contributors (ICs) in technical roles who transition into management have to sacrifice much of the technical work they perform. To some extent, I think this is true, at least in terms of the type of technical work you perform as an individual contributor. But as someone who transitioned into management from an IC role, I can confidently say that technical work is a large part of data science management and that managers have plenty of opportunities to improve their technical abilities. In fact, I believe you might get even more exposure to technical work as a manager, depending on the types of projects and employees that you’re managing.

Let me first talk about the type of technical work that you’ll probably stop doing, at least if you’re 100% managing and you’re not expected to contribute code or perform analysis. As an IC, your day is spent writing code, analyzing data, generating hypotheses, training models, validating those models, and deploying those models. This is technical work in the sense that you’re building; you’re translating your knowledge directly into deliverables like code or visualizations. If you transition into management and are no longer expected to produce these deliverables, then this type of technical work will end. And if you’ve never managed before, you might think "Oh, I’m not doing technical work anymore." But that’s not true. It’s just that you’re not doing that type of technical work anymore.

So what kind of technical work do you actually do?

First, you’re probably going to perform some type of technical work whenever you’re coaching a direct report. Suppose one of your reports has to explain their analysis to you. You need to be able to understand their approach. You need to be able to look through and understand their code. You need to be able to advise them on their approach and implementation. Providing technical guidance like this is one area where there’s room for lots of technical work as a manager.

If you’re managing multiple reports across different projects, the breadth of techniques and skills required across those projects will be greater than what any single project demands. You might have to apply traditional statistical analysis tools on one project, whereas you’ll need modeling skills on another. One project might be more engineering focused whereas another might be more analytically focused. And regardless of which techniques are used, you’ll constantly be asking technical questions and thinking about how to connect the analytical results to business outcomes. This is all very, very technical work.

One caveat here is that I’m talking specifically about managing individual contributors, i.e. the people that do the actual work. I don’t know if the same could be said for managers of managers because I’ve never managed managers. I’ve managed data science ICs for a little over a year and I’ve been a tech lead for several years. So I’m still close to projects and individual products.

I’ll also add that technical work at the management level is higher-level in the sense that you typically try to answer more general questions than you would as an IC. You’ll have to take a general question, break it up into sub-questions, and then distribute those sub-questions across a team. As ICs produce analyses, it’s your job to synthesize the individual insights into a comprehensive and cohesive body of work that answers the original general question. This process is highly non-linear, iterative, and very technical. And keep in mind that answering the general question will require you to step into and understand each sub-question. You can think of this like a neural net, where the lower layers (ICs) are responsible for learning low-level feature representations, like edge detection in computer vision. The higher layers (managers) are responsible for learning higher-level concepts, like detecting faces in an image.

Any high-fives for using a neural net as an analogy for data science management? No? Ok. See you next week then!


Here’s what I’ve been reading/watching/listening to recently:

  • Data Cleaning IS Analysis, Not Grunt Work – "The act of cleaning data imposes values/judgments/interpretations upon data intended to allow downstream analysis algorithms to function and give results. That’s exactly the same as doing data analysis. In fact, “cleaning” is just a spectrum of reusable data transformations on the path towards doing a full data analysis." Eloquently written and filled with humor, this post argues that the goal of data cleaning is to improve the signal-to-noise ratio in data in order to improve analytical results. Rather than think about cleaning as a separate step, we’d do better to acknowledge that data cleaning IS data analysis.
  • Four communication techniques for solving technical problems – A few weeks ago I wrote that generating business value with ML is often a people problem that involves convincing key decision makers to take risks. Effectively solving people problems requires strong communication skills, especially when discussing technical topics that involve multiple trade-offs. This article describes four approaches to technical communication: "1) go in the right direction by working from problem to solution; 2) prevent circular, chaotic conversation by split-tracking; 3) remove friction and land-mines by emphasizing empathy; 4) monitor for conversations becoming stuck or moving above, below or tangentially away from scope set out in your agenda."
  • Andrew Ng: Bridging AI’s Proof-of-Concept to Production Gap – Andrew Ng recently gave a talk on the common roadblocks that prevent AI projects from succeeding in production. According to him, the biggest challenges in bridging the research to production gap are small data, generalizability and robustness, and change management. While AI success has mostly been in the big data domain of consumer internet companies, the largest opportunities lie in non consumer tech industries like retail and healthcare. Note that the talk itself starts around 6:30 and ends at 45 minutes.
  • A Contextual-Bandit Approach to Personalized News Article Recommendation – A multi-armed bandit is an approach to experimentation where the system learns to divert traffic away from poorly-performing treatments and towards the better-performing ones. A contextual bandit takes advantage of additional context on users/items to improve the allocation of users to different treatments (see the short sketch after this list). This article describes Yahoo’s success (that’s right, Yahoo!) with using contextual bandits to recommend news articles to users and their approach for validating those algorithms in an offline setting. For a fantastic (and short) introduction to bandit algorithms for website optimization, check out Bandit Algorithms for Website Optimization (affiliate link).
  • WhyLogs: Embrace Data Logging Across Your ML Systems – Last week I wrote about a critical bug in one of our production applications that could have been diagnosed with better monitoring and logging. This post introduces WhyLogs, an open source package purposefully built for data logging in ML pipelines. According to the post, WhyLogs logs properties of data as it moves through an ML system, aggregates logs, supports a wide range of ML data types, and tags data segments with labels for slicing and dicing. I’m excited to see companies and tools emerging that tackle challenges on the boundary of data science and software engineering.
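
Since the bandit item above is the most algorithmic entry this week, here’s a minimal sketch of the core idea in Python. To be clear, this is an illustrative epsilon-greedy multi-armed bandit, not the LinUCB contextual algorithm from the Yahoo paper, and the article names and click rates below are made up for the example.

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit: mostly exploit the best-known arm,
    occasionally explore a random one."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}    # times each arm was chosen
        self.values = {arm: 0.0 for arm in self.arms}  # running mean reward per arm

    def select_arm(self):
        # With probability epsilon, explore a random arm; otherwise exploit.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm, reward):
        # Incremental update of the running mean reward for the chosen arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Hypothetical usage: arms are article IDs, reward is 1 for a click, 0 otherwise.
true_click_rates = {"article_a": 0.05, "article_b": 0.12, "article_c": 0.08}
bandit = EpsilonGreedyBandit(arms=true_click_rates.keys())

for _ in range(10_000):
    arm = bandit.select_arm()
    reward = 1 if random.random() < true_click_rates[arm] else 0
    bandit.update(arm, reward)

print(bandit.values)  # estimates should roughly recover the true click rates
```

A contextual bandit extends this by conditioning the arm choice on user and article features, rather than keeping one global reward estimate per arm.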

That does it for this week’s issue. If you have any thoughts, I’d love to hear them in the comment section below!
