Top ML Resources: Interview with Rui Carmo

Rui Carmo

Below is an interview with Rui Carmo, Cross-Domain Solution Architect, EMEA Architecture and Industry Services, Office of the CTO, at Microsoft.

Our interview with Rui Carmo is part of our interview series with the creators of some of the top ML resources we’ve come across online.

Rui’s top ML resource, So You’re Going to Manage a Data Science Team, is a fantastic piece on the people and processes required to build a data team that works.

1. How did you get started working on machine learning? How have you progressed through the ML space?

In college, I had the opportunity to major in what was then “AI” (essentially heuristics and symbolic programming, along the lines of LISP machines and Mathematica), but I eventually decided to go for distributed systems in order to pursue a career in telcos. That was nearly 30 years ago. I had a taste for (some) maths, so I sort of “kept in touch”, but never really “got it” until, many years later, Vodafone decided to investigate churn prediction. I was one of the few people who “got” the maths, understood how to get the relevant data off the network, and could talk to the statisticians and BI folk, so I brushed up on my statistics and went to work.

That project was one of the factors that led me, around 2010, to become the head of Big Data at another telco and (as both steward and slave of our freshly minted Hadoop cluster) dig into ways to derive meaning from all the data we handled. By that time Spark was starting to become usable, so I set up a small cluster of PCs, hooked up Jupyter (then IPython) to it, and started revisiting the field in earnest. Although most of what we did was data engineering rather than data science, that was when I largely ditched R for Python, started using notebooks daily, became immune to the hype around tools, and began focusing on processes, people, and how they could make sense of the data.

I eventually joined Microsoft in 2015 (partly due to my open-source background and partly due to my experience in data and statistically driven ML) and rode along with Microsoft’s AI product roll-outs, digging further and further into CNTK/TensorFlow (having ready access to cloud GPUs was a plus there), ML pipelines of various sorts and, of course, Spark/Databricks.

2. What kinds of machine learning problems do you work on today?

I’ve settled somewhere between doing large-scale feature extraction (which is one of the trickiest parts of the job, and where my data engineering chops are actually useful) and doing conventional recommendation/classification approaches (which are around 80% of what businesses look for).

I’ve done financial portfolio analysis/forecasting, anomaly detection and even a few deep learning projects around image segmentation (think satellite photo analysis, but focusing on identifying contiguous areas rather than specific ground features), but, again, most conventional business problems reap tremendous benefits from a simple, straightforward “80% solution” (that can be as simple as Bayesian classification) rather than massive computation.
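As a sketch of how simple that “80% solution” can be, here is a hedged, from-scratch Gaussian naive Bayes classifier on made-up churn-style numbers (the feature, labels and values are all illustrative, not from any real project):

```python
# Hypothetical example: classify customers as "churn"/"stay" from one
# numeric feature, using plain Gaussian naive Bayes (no libraries needed).
import math
from collections import defaultdict

def fit(samples):
    """samples: list of (feature_value, label). Returns per-class stats."""
    by_label = defaultdict(list)
    for x, y in samples:
        by_label[y].append(x)
    stats = {}
    n = len(samples)
    for y, xs in by_label.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs) or 1e-9
        stats[y] = (mean, var, len(xs) / n)  # mean, variance, class prior
    return stats

def predict(stats, x):
    """Pick the label maximizing log prior + log Gaussian likelihood."""
    def score(y):
        mean, var, prior = stats[y]
        return (math.log(prior)
                - 0.5 * math.log(2 * math.pi * var)
                - (x - mean) ** 2 / (2 * var))
    return max(stats, key=score)

# Toy data: short average sessions -> "churn", long -> "stay".
train = [(1.0, "churn"), (2.0, "churn"), (9.0, "stay"), (11.0, "stay")]
model = fit(train)
print(predict(model, 1.5))   # -> churn
print(predict(model, 10.0))  # -> stay
```

The point is not this particular model but the size of it: a handful of lines often captures most of the business value before any “massive computation” is justified.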

3. What challenges do you face as a machine learning practitioner?

Debunking the hype. These days it’s much easier than it was 5–10 years ago to explain to people that you can’t derive meaning from junk data, and I increasingly come across projects where customers already “get” things and can even provide nicely labeled, nearly curated datasets – but expectations about what you can achieve are still always too high.

By that I mean that even when the model is sound, the use of (or reliance on) it is often sub-optimal. The best parallel I can draw is with recommendation models, which are often overused to the point where everything you see on a website is a recommendation for something that is no longer relevant to your customer…

4. What’s your favorite machine learning tool? What problem does it solve?

Definitely notebooks, most notably Jupyter (and Seaborn). I need to get an interactive, nearly tactile feel for the data, and the first thing I do on a new dataset is either plot a histogram or toss up a table with a few hundred samples. Which is why I love the sheer power of Zeppelin notebooks atop Spark, and the way I can slice huge datasets with it.
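That first pass can be a few lines of pandas in a notebook cell; here is a hedged sketch on synthetic data (the column names and distributions are invented purely for illustration):

```python
# Hypothetical first look at a new dataset: eyeball a sample table and a
# coarse text histogram before doing anything fancier.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "session_minutes": rng.exponential(scale=12.0, size=1_000),
    "plan": rng.choice(["prepaid", "contract"], size=1_000),
})

print(df.head(10))  # a table with a few raw rows, as a sanity check

# Bin one column into 10 buckets: a quick histogram without plotting.
counts = pd.cut(df["session_minutes"], bins=10).value_counts().sort_index()
print(counts)       # bin -> row count, smallest interval first
```

In a real notebook the `print` calls would be replaced by rich cell output and a Seaborn `histplot`, but the workflow is the same: look at rows, then look at distributions.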

But I am very conscious of the limitations of notebooks for building proper, reproducible processes (which is usually the one thing I am asked to deliver), so I only use them in the exploration stages of a project (and for a few spot checks along the way). Most of what I deliver are end-to-end dataflows, and I usually strip the prototype code out of notebooks and build libraries from it, with proper error checking, stream handling, etc.

5. What differentiates successful industry ML projects from unsuccessful projects?

I am tempted to reply “sheer luck and hype”, since the industry moves so fast and sometimes feels like an endless cornucopia of products. But there are lots of ML-based services that are successful and under the radar (like fraud detection or biometrics SaaS), so I would say that a key indicator of success (besides adoption) is whether you can actually build a stable, growing business out of it.

That applies to everything: libraries, tools, services, etc. But there are two examples I keep going back to: I am particularly fond of the way TensorFlow became the industry standard (although I prefer using it through Keras to avoid the overhead), and I am following the evolution of SparkML with interest because I think that it (or something like it) will eventually replace the hodgepodge of Python/R/Scala libraries that people are running today.

6. What advice do you have for ML practitioners who are struggling to build machine learning solutions into products?

I worked in Marketing for a long time during my telco years, so when I look at a product or service I am especially attuned to the balance between hype and actual benefits. These days, a lot of people rely on hyperbole and completely over-the-top wording to “sell” their ideas, so my main advice would be to communicate accurately and succinctly the benefits/results of using your solution, along with your understanding of the problems it solves.

By all means be enthusiastic about it, but show that you can bring more actual value to the table than a bunch of empty words.

You can follow Rui on Twitter here.
