Why I Started MLinProduction

Ever catch yourself thinking that you should read just one more book or take a few more online courses before beginning that new project?

"Start before you’re ready" is good advice for anyone, but especially for people like me who tend to overanalyze or overprepare to the point of not progressing towards some desired outcome. Famed computer scientist Donald Knuth expressed a similar sentiment when he said that "premature optimization is the root of all evil" (although the quote is originally attributed to Sir Tony Hoare). Said a little differently, anything worth doing is worth doing badly at first.

Adopting this mentality in late 2018 changed my life and helped me start this blog, MLinProduction.com, and my weekly newsletter on production machine learning systems. At the time I knew nothing about creating and sharing content online. I had never written a blog post, recorded a YouTube video, or thought about starting a podcast. But what I did have was a burning desire to help people like me, machine learning engineers and data scientists, learn how to build ML products.

When I started writing online I thought it best to keep the blog totally technical and disregard my personality, subjective opinions, and motivations as much as possible. But I’ve had a change of heart. Much of my writing is influenced by personal experiences from working in tech: my roles at different companies, the projects I’ve worked on, and the teammates I’ve worked with deeply influence what I write and how I write it.

But it’s never too late to pivot.

So I’d like to share why I started MLinProduction. In this post I’ll describe what prompted me to begin writing online and why I chose to write about machine learning systems. Along the way I’ll share insecurities that I’ve had to overcome. My goal is to expose myself; to inject some of my personality, humor, and faults into this project. This post might be catharsis, more for me than it is for you, but I hope it helps anyone seeking to build a better version of themselves.

Let me begin with a problem statement.

The Problem: Most Machine Learning Content is Useless to Me.

In late 2018 I was working as a data scientist at 2U. As a teammate on a small team that served multiple business units within a complex business, I was responsible for all aspects of the machine learning lifecycle for my projects. Not only did I build the predictive models, but I was in charge of deploying and operationalizing them as well. The role was incredibly exciting and involved a lot of self-directed learning.

Since I’ve always considered myself a generalist (before 2U I had only worked at small startups and was a single-digit employee several times), I was intrinsically motivated to be a "full-stack" ML guy. And given my previous roles as a software engineer, data engineer, and ML Engineer, I was uniquely positioned to contribute to my team and company in myriad ways.

One could imagine that I had the perfect background for the job, but that doesn’t mean the work was easy. As I’ve written previously, running ML systems in production is a complex, heterogeneous challenge. Yes, effectively training models requires understanding learning algorithms and optimization, but you also need to know how to prevent data leakage and capture experimental metadata. Model deployment is an extensive topic that requires knowledge of machine learning, software engineering, and DevOps. And the work is hardly done after you ship a model; I’d argue that the work really begins post-deployment.

To combat the challenges I discovered "on the job", I did what I usually do when I don’t know something: use the internet to learn from other people who’ve had similar experiences. I researched blogs, signed up for newsletters, created YouTube playlists of conference talks, and listened to every seemingly relevant podcast that would fit on my phone’s storage. I used some good ole’ fashioned AI, a breadth-first search across media channels to learn how to build, run, and maintain real world machine learning systems. Surely plenty of detailed resources existed online within reach of a Google search – the digital equivalent of a stone’s throw away.

Boy, was I wrong.

But I did notice two things. First, most high-quality machine learning content is about learning algorithms and software frameworks used to implement those algorithms. But learning algorithms are a tiny component of building production ML systems.

Figure: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex. Source.

Second, most online machine learning content, in general, is completely useless to me. A lot of it is aimed at beginners, which is great, but there’s also a ton of clickbait. There’s a lot of great academic literature available, especially if you follow machine learning Twitter and scan arXiv. But most of that stuff doesn’t help you solve day-to-day problems. For example, how do you effectively frame an ML problem with competing interests from different business units?

Figure: The universe of Online Machine Learning content today.

This really surprised me. As a practitioner, I knew that the most time-consuming aspects of delivering machine learning solutions involved infrastructure, feature engineering, deployment, and monitoring. But if you don’t work in the field, if your idea of what data scientists work on comes from blog posts, podcast episodes, or YouTube videos, your conception of applied machine learning is completely misguided. I realized that what I did as a data scientist did not align with what the internet suggested I did as a data scientist.

The Personal: My Entrepreneurial Itch

Before becoming a data scientist at 2U, I was an ML Engineer at a startup called CTRL-Labs. The company was working on an extremely fascinating problem – decoding neural EMG signals to construct a brain-machine interface – and the founders hired a stellar team of research scientists, software engineers, and hardware engineers to build the solution. Like many of my teammates, I was driven by the difficulty of the challenge at hand. Unaware of my own limits, I worked long hours, usually opening my laptop as soon as I got home to continue developing the research infrastructure needed to decode our own neural networks. Eleven months of working in this way drove me to burnout and depression.

When I look back now, I interpret the six-month bout of major depression that ensued as my body forcing my mind to take a break. Those six months were some of the darkest of my life, as the passion to build that had once engulfed my soul withered; just thinking about a trip to the grocery store frightened me to the point of hiding underneath my bedsheets.

My depression paralyzed me. Rather than work a full-time job, I worked part time as an adjunct at Fordham University teaching courses in big data engineering and applied stats. To supplement the pittance of an income paid to adjuncts (one day I’ll write about how ridiculous that is), I decided to start my own tutoring company. Besides the additional income, I started the company because I wanted to start something. This was partially contextual: at the time I was angry at CTRL-Labs’ management for "making me depressed" and wanted to "stick it to the man" and not work for anyone else. In time I realized that the anger was misplaced. I was playing the victim rather than owning the unhealthy work habits that contributed to my illness.

But the entrepreneurial itch remained after my depression ended and my joie de vivre returned. Months after taking the full-time role at 2U, I couldn’t stop thinking about what it felt like to make that first tutoring sale. That feeling of taking an idea all the way from an abstract mental conception to a concrete dollar in my pocket was intoxicating. But I didn’t have an idea of what to start, nor did I want to leave my job at 2U (which I really enjoyed).

Serendipity stepped in when I stumbled onto people like Nathan Barry and Amy Hoy who advocated for a different way of starting a business. Rather than starting with an idea, quitting your job, and trying to raise VC, they believe in starting with an audience. That is, build an audience of people online by teaching, either through writing or podcasting or youtubing, and figure out how to provide that audience value. All you need is 1,000 true fans to be a successful creator. So I decided to do just that.

The Solution: Build A Centralized Repository of Best Practices

It’s easy to find good-quality content if you’re a machine learning beginner. Just follow a Medium blog like Towards Data Science, subscribe to the KDNuggets newsletter, and consume everything published on machinelearningmastery. Every week you’ll get multiple emails from each of these sources, many of which contain at least 1-2 helpful resources. On the other hand, finding helpful resources for deploying models or setting up end-to-end ML pipelines requires hours of research each week.

Weekly Newsletter

My first goal was to simplify the process of finding high-quality machine learning resources for practitioners like me: mid to senior level data scientists, ML engineers, and ML-focused Product Managers. I want to make it as easy for these people to find good content as it is for beginners.

So I started by writing and publishing a weekly newsletter that contains 5 links to blog posts, journal papers, conference talks, and podcast episodes specifically targeting data scientists, ML engineers, and ML product managers. My focus was (and is) on finding and sharing the most valuable content I find, not on sharing "the latest information." Removing this arbitrary constraint on when a piece was published has led me to share resources published from the early 2000s all the way to 2020.

My second goal was to accurately describe what ML practitioners do in their day-to-day work. Obviously these are technical roles that involve coding, algorithms, and statistics, so I write about technical concepts. But like many other technical roles, the job involves a large nontechnical component as well.

To accomplish this, I write about my personal experiences working in data science. Each newsletter includes a brief summary of an experience I’ve had as an individual contributor or manager. Sometimes I’ll write about specific technical concepts like how to use ensembles to combine multiple types of machine learning models or how you should monitor model performance on subslices of your data rather than rely on aggregate performance metrics. Other times I’ll write about broader ideas, like thinking about a tech company’s products and services as a living organism’s nervous system.
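To make the point about subslices concrete, here is a minimal sketch in plain Python. The record fields (`region`, `label`, `prediction`) and the toy data are hypothetical, chosen only to show how a healthy aggregate metric can hide a weak slice.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Return (overall_accuracy, {slice_value: accuracy}) for labeled predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        total[slice_value] += 1
        correct[slice_value] += int(record["label"] == record["prediction"])
    overall = sum(correct.values()) / sum(total.values())
    per_slice = {s: correct[s] / total[s] for s in total}
    return overall, per_slice


# Hypothetical toy data: 8 correct US predictions, 1 correct and 1 wrong EU prediction.
records = (
    [{"region": "US", "label": 1, "prediction": 1}] * 8
    + [{"region": "EU", "label": 1, "prediction": 1}]
    + [{"region": "EU", "label": 1, "prediction": 0}]
)

overall, per_slice = accuracy_by_slice(records, "region")
print(f"overall accuracy: {overall:.2f}")
for region, accuracy in per_slice.items():
    print(f"{region}: {accuracy:.2f}")
```

In this toy example the aggregate accuracy looks fine while the smaller slice performs much worse, which is exactly the kind of gap an aggregate-only dashboard would miss.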

Blog Posts

Part of the reason I started by writing a weekly newsletter rather than blog posts was fear. I was terrified of writing blog posts. Some of that fear stemmed from negative self-talk that said I didn’t have anything worth writing about. How could I, a data scientist without a PhD, who wasn’t working on cutting edge deep learning models at a company like Google or Facebook, have anything meaningful to contribute? Why would anyone read my content when they could access the writing of industry pioneers like Xavier Amatriain or Andrej Karpathy?

But like most fears, mine weren’t rooted in reality. Rather, they were the projections of certain limiting beliefs. Sure, I’m not as experienced as Xavier Amatriain or Andrej Karpathy. But that doesn’t imply I have nothing of value to offer data scientists and ML engineers. In fact, I’ve worked across the data stack, first as a data engineer, then as a machine learning engineer, and finally as a data scientist. I’ve taught grad courses in big data engineering and data analysis. The lessons learned from these experiences are definitely valuable to a great many people. But self-doubt and comparing myself to others led me to disregard this for a long time. A good lesson here is not to compare yourself to others. Instead, always compare yourself to a previous version of you.

After 5 consecutive weeks of publishing the newsletter, I built up the courage and confidence to begin writing weekly blog posts. I took the same angle with this writing as I did for the newsletter. Rather than write about learning algorithms, I sought to describe day-to-day production machine learning challenges: topics like how to package and serve models using Docker images and how data science hiring works.

Several of these posts have been part of multi-post series, including a series on Docker, another on Kubernetes, and most recently my 10-part series on Continuous Deployment for Machine Learning. These long-form series allow me to dig deep into challenges you’ll only face if you’re managing machine learning powered products at scale. For instance, serving models in an online setting involves much more than writing a Flask app: you’ll need to query input data from multiple data sources, optimize feature engineering, choose a deployment strategy, A/B test, and monitor predictions.
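As one small illustration of the A/B testing step, here is a hedged sketch of deterministic traffic splitting using only the Python standard library. The function name, the salt string, and the 10% treatment share are all assumptions made for the example; it is not taken from the series mentioned above.

```python
import hashlib


def assign_variant(user_id: str, treatment_share: float = 0.1,
                   salt: str = "model-v2-rollout") -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    Hashing (salt, user_id) means the same user always sees the same model
    variant, and changing the salt reshuffles users for a new experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical usage: route a request to the candidate or the current model.
variant = assign_variant("user-123")
print(variant)
```

Hashing instead of calling a random number generator per request keeps assignments stable across servers and restarts, which makes the downstream comparison of the two variants easier to reason about.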

And this is just the beginning. Recently I started interviewing big names in production ML including Eric Colson (Chief Algorithms Officer Emeritus at Stitch Fix) and Erik Bernhardsson (CTO of Better) as well as rising stars like Jeremy Jordan. Even though I’m not as experienced as some of these individuals, being able to bring their knowledge to the community through my writing is a unique value only I can provide.

Conclusion

To summarize, my motivation to start my blog and newsletter stems from 2 main factors, one practical and the other personal.

The first is my desire to solve a real problem: there’s a lot of machine learning content online, but it’s really hard to filter out the noise. Most content is written for beginners and there’s a drastic shortage of best practices for running real world machine learning systems.

The second is my goal of running my own business. Rather than quit my job to create a startup, I’ve decided to build an audience by writing about a topic I care about. Working towards this goal has meant confronting my fears and self-limiting beliefs. At times I still experience feelings of inadequacy, but I choose to publish anyway. Exposing what I know about ML through writing has loosened fear’s grip on my psyche. Seeing this transformation motivates me to keep writing and building towards my goal.

If you’re a working data scientist or ML engineer looking for high-quality content, then sign up for my newsletter. Each week you’ll get the most practical articles, blog posts, papers, and conference videos (and more) I find.

7 thoughts on “Why I Started MLinProduction”

  1. When I wanted to study machine learning, I was excited and wanted to mix with my domain knowledge. I started studying in a conventional way and as I learned, working with the algorithms on already cleaned, structured data, I felt, I woke up from a bad dream. Then upon “GOOGLE searching”, I eventually stumbled on your blog and started gaining the real perspective to what more should I know and learn. As the picture in the above blog depicts, the market has huge clickbait and very low standard learning material. But fortunately, your blog is helping in realizing the actual things that happen behind the whole pipeline.
    I wish you make a series of study materials (only if you can and have time and mostly if you have an interest in sharing) on MLOps knowledge.
    Thank you for sharing and be happy.

    1. I hope this had an edit button. Sorry, it came out in the wrong way. Forgive me.
      I wish you make a series of study materials (only if you can and have time for writing) on MLOps knowledge.

    2. Thanks Srujan! I really appreciate the kind words.

      I’m actually currently working on some courses for the community : )

  2. This is an awesome blog that concentrates on the challenges of machine learning in production.
    Many data scientists lose sleep thinking about not generating business value and getting FIRED. At least I know many of them, and I myself used to feel like that at one point in time. Data scientists are often involved in making several PoCs (Proofs of Concept); only a few are funded by the relevant business team, and only a few of the funded ones go into production. Later, among the ones in production, only a few generate business value. By then several months / years would have gone by !!

    Luigi has done great work in helping Data scientists reducing the time it takes to hit production to generate Business Value.

    Appreciate It.
