Below is an interview with Jeremy Jordan, Data Scientist, Machine Learning, at Proofpoint.
This interview is part of our series with the creators of some of the top ML resources we’ve come across online.
Jeremy’s top ML resource, Building machine learning products: a problem well-defined is a problem half-solved, is an excellent read on creating machine learning products, filled with resources for blending AI and design.
1. How did you get started working on machine learning? How have you progressed through the ML space?
I first started playing around with machine learning during my junior year at university. I was pursuing a degree in materials science and engineering but was interested in exploring career opportunities in data science. That summer, I worked at Red Hat as a data science intern, which really confirmed my interest in the field. After graduating, I embarked on a full-time independent study of machine learning, working through online courses and doing projects to practice applying what I was learning. Since I was doing an independent study, I didn’t have a formal way to evaluate my understanding (e.g., exams in school), so I developed a habit of writing about what I learned to "teach it back to myself" and ensure that I had sufficient understanding before progressing.
One of my side projects garnered interest from a local angel investor, and I ended up raising a round to launch a startup and continue the project. This startup was a great experience in learning how to build a machine learning team and understanding the production considerations for machine learning systems. Unfortunately, our core hypothesis (on which we built the business) turned out to be incorrect, and we decided to shut down the company.
I spent a couple of months consulting on short-term projects while looking for new opportunities, until I found a great fit at Proofpoint, where I’m currently employed.
2. What kinds of machine learning problems do you work on today?
At Proofpoint, we spend a lot of time working on people-centric cybersecurity problems. What do I mean by people-centric, you might ask? Well, if you look at the "attack surface" for enterprise companies, their compute infrastructure is usually pretty secure. However, the people working at that company (who have access to the compute infrastructure) can often be much more susceptible to an attack. We serve our customers by protecting their employees from potential attacks, which may come in the form of a phishing email or malicious file attachment.
Working on machine learning problems in cybersecurity provides many interesting challenges – adversarial environments, highly imbalanced classes, and non-stationary data distributions just to name a few.
Outside of work, I’ve been playing around with generative models and exploring the intersection of machine learning and art. I hope to share more details on this work later this year.
3. Could you elaborate on the challenges you face as a machine learning practitioner?
Yeah, so one of the things that makes cybersecurity a really interesting area to apply machine learning is that we operate in a naturally adversarial environment. By that, I mean that our models are usually tasked with detecting the activity of malicious actors who are actively working to avoid being detected! It turns out that cyber-crime is a huge enterprise, and these actors effectively run their operations as a business. They’ll launch "campaigns" which send out malicious payloads to a targeted set of people, and they’ll track how well each campaign performs. If they notice that a campaign isn’t performing well (because we’re detecting and blocking their content), they’ll change the payload to try to evade detection.
Due to this adversarial relationship, we’re highly motivated as machine learning practitioners to build models which learn robust features to discriminate between benign and malicious content. However, even if you’re training models which learn robust features from the data, these systems still need to be maintained. The challenge here is that the maintenance of such systems is often invisible to those outside of the field. When you first deliver a model, it’s easy for the receiving stakeholder to believe that the work is complete. After all, now they have a thing that can make predictions! And it seems to be working well! But in order to ensure that the system continues to work well, you need to have robust monitoring tools and retraining pipelines in place. And retraining implies the ongoing activity of labeling data. Much of this work is not sexy and the value isn’t immediately realized, so it can sometimes be difficult to justify the time spent on those efforts.
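For a concrete, if simplified, picture of what that monitoring can look like, here is a minimal sketch that compares the model's recent score distribution against a reference window with a two-sample Kolmogorov-Smirnov test. This is a generic pattern, not a description of Proofpoint's actual tooling; the beta-distributed scores, window sizes, and significance level are arbitrary placeholders:

```python
import numpy as np
from scipy.stats import ks_2samp

def scores_have_drifted(reference_scores, recent_scores, alpha=0.01):
    """Flag a shift in the model's score distribution with a two-sample KS test."""
    _, p_value = ks_2samp(reference_scores, recent_scores)
    return p_value < alpha

# Toy example: the "recent" window is deliberately shifted, so the check fires.
rng = np.random.default_rng(0)
reference = rng.beta(2, 8, size=5_000)  # scores captured around deployment time
recent = rng.beta(4, 6, size=5_000)     # scores from the latest traffic window
if scores_have_drifted(reference, recent):
    print("Score distribution shifted; investigate and consider retraining.")
```

A check like this only tells you that the inputs or outputs have moved, not that accuracy has dropped, which is part of why the labeling work Jeremy describes next is so important.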
And then to continue on the topic of labeling data… at Proofpoint, we perform inference on billions of observations per day. Given data of that volume, there’s a pretty clear need to sample which observations get selected for labeling. Unfortunately, because the vast majority of those observations are completely benign, a naive random sampling approach isn’t practical. To effectively collect labels from highly imbalanced classes, we employ various active learning techniques in an attempt to capture all of the malicious content and a representative sample of the benign content. One unfortunate side effect of this highly non-uniform sampling strategy is that the labeled data we use for training and evaluating models has a different distribution from the real-time data stream. To build an accurate estimate of real-world performance, we’ll deploy our models into production in "shadow mode," where the predictions aren’t actually used by other systems until we’ve had a chance to validate the model quality.
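Here is a minimal sketch of that shadow-mode pattern, assuming a generic predict interface; the ThresholdModel stand-in and classify wrapper are hypothetical, not Proofpoint's implementation. The candidate model scores the same live traffic, but only the production model's verdict is ever returned downstream:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shadow")

class ThresholdModel:
    """Stand-in for a real classifier: flags anything scoring above a cutoff."""
    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, score):
        return "malicious" if score > self.cutoff else "benign"

def classify(observation, live_model, shadow_model):
    live_verdict = live_model.predict(observation)
    # The shadow model sees the same traffic, but its verdict is only logged.
    # Comparing the two logged streams offline estimates real-world performance
    # before the candidate model is promoted.
    logger.info("obs=%r live=%s shadow=%s",
                observation, live_verdict, shadow_model.predict(observation))
    return live_verdict  # downstream systems only ever act on the live verdict

print(classify(0.7, ThresholdModel(0.5), ThresholdModel(0.8)))
```

Because the shadow model is evaluated on the unsampled production stream, this comparison sidesteps the distribution mismatch in the labeled training data.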
So yeah, there are many challenges which make working in this space interesting. It certainly keeps the work engaging 🙂
4. What’s your favorite machine learning tool? What problem does it solve?
Can I give you my top three? I’m very bad at choosing favorites.
- PyTorch-Lightning automates much of the boilerplate code associated with training PyTorch models while still allowing full customization of the model definition and training process. I’ve been using it on a side project and have really been enjoying it (a minimal sketch follows this list).
- MLFlow has been super useful for experiment tracking, and I’ve been excited to see some of the new developments, particularly the model registry.
- Streamlit allows me to build interactive applications very quickly, which lets stakeholders engage with machine learning models during development and provide feedback as we iterate on various models (see the second sketch below).
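To make the PyTorch-Lightning point concrete, here is a minimal sketch of the division of labor it enables: Lightning owns the training loop, while you define only the model, loss, and optimizer. The ToyClassifier below is purely illustrative (not from Jeremy's side project), and some_dataloader in the closing comment is a hypothetical placeholder:

```python
import torch
from torch import nn
import pytorch_lightning as pl

class ToyClassifier(pl.LightningModule):
    """Lightning runs the training loop; we only define model, loss, and optimizer."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)  # sent to whatever logger is configured
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Device placement, checkpointing, and epoch loops are handled by the Trainer:
# pl.Trainer(max_epochs=5).fit(ToyClassifier(), some_dataloader)
```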
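And in the same spirit, here is a tiny hypothetical Streamlit app showing how quickly a stakeholder-facing interface can come together; the threshold and score widgets are made up for illustration, not taken from a real project:

```python
import streamlit as st

st.title("Detection model playground")

# Stakeholders can adjust the decision threshold and immediately see how the
# verdict changes, without touching any code. Run with: streamlit run app.py
threshold = st.slider("Detection threshold", 0.0, 1.0, 0.5)
score = st.number_input("Model score for this sample", 0.0, 1.0, 0.42)
st.write("Verdict:", "malicious" if score >= threshold else "benign")
```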
5. What differentiates successful industry ML projects from unsuccessful projects?
This is a tricky question to answer in a general manner because not everyone has the same measure of success. Depending on whether your definition of success more closely aligns with learning new insights or increasing company revenue, the differentiating factors may vary.
That being said, I think one very important habit which can bias you toward successful outcomes is working with your customers/users/stakeholders to deeply understand the problem they’re experiencing. This helps ensure that you’re building the right thing, which in turn helps ensure that your model will actually be used and provide value to others.
6. What advice do you have for ML practitioners who are struggling to turn machine learning solutions into products?
Machine learning projects can be very challenging! You often don’t know exactly what you should do until you try a few ideas and see what works, which lends itself to a very iterative process. I’ve tried to assemble a checklist of sorts to guide myself through the process, both for defining the requirements of a machine learning project and for building the model. However, even with these checklists to follow, I cannot overstate the value of having a coach to guide you through this process.
Even if you begin a project with the best intentions of following such a framework, it’s pretty easy to get sucked into the day-to-day operations and lose track of the overall project development. This is where it helps to have an external perspective holding you accountable. I’m very fortunate to have a manager who has been building machine learning systems for over a decade. That experience manifests in an ability to ask critical questions early on, spot potential failure modes, and gently intervene to course-correct the project.
I recognize that not everyone has the privilege of having a manager with such a background, but that doesn’t mean you can’t find a coach to help keep you accountable. For example, one of my friends slowly redefined his role at his company from mechanical engineer to machine learning engineer. His boss doesn’t have any background in machine learning and can only give input on the application of the model to their domain. However, my friend was able to seek guidance outside the company for some of his more specific machine learning questions. One of the things I love about our machine learning community is how open and supportive its members are. If you’re actively working on problems in industry and need some guidance, I’m sure you can find someone more experienced who’s willing to help guide you through the process.
You can follow Jeremy Jordan on Twitter here.