Newsletter #082

After a several-week hiatus, the newsletter is back! I’ve been pretty busy moving across the country (super happy to be a San Diego resident 😀) and tending to my now 6-month-old firstborn (I can’t believe it’s been 6 months already). But I’m excited to publish issue #082.

I’m currently thinking through how I want to continue developing MLinProduction. Many new subscribers have asked whether there’s an archive of previous newsletter issues. Up until this point, I haven’t published the newsletter on the blog, but I’m happy to report that new issues will be published on the blog as well as sent to subscribers. I’d also love to publish previous issues to the blog, but that depends on how much time I have in the near future.

Anyhow, I’ll continue to share new developments and decisions as I make them. If you have any ideas for improving the newsletter, I’d love to hear them! Either shoot me an email or drop a comment on the blog!


Here’s what I’ve been reading/watching/listening to recently:

  • Using GitHub Actions for MLOps & Data Science – The first post of a multi-part blog series on using GitHub Actions, GitHub’s native event-driven automation system, to perform ML pipeline tasks. When the author comments on a pull request, an action is triggered that performs model training and evaluation. When evaluation completes, another action comments back on the PR with the evaluation metrics, allowing the data scientist to decide whether or not to merge the changes. I like how this flow generates an auditable history of changes to code and their impact on model metrics.

  • Keeping your data pipelines healthy with the Great Expectations GitHub Action – The GitHub Actions team partnered with Great Expectations to create a workflow that can automatically test, document, and profile your data pipelines. For those unfamiliar with the Python library, Great Expectations allows you to specify "unit tests" for datasets through expectations. Expectations can be bootstrapped through the built-in profiling tool and dataset documentation is generated directly from the expectations. I’ve played around a bit with the library and am looking forward to summarizing my findings in an upcoming blog post.

  • Emerging Architectures for Modern Data Infrastructure – Data infrastructure serves two purposes at a high level: to help business leaders make better decisions through the use of data (analytic use cases) and to build data intelligence into customer-facing applications, including via machine learning (operational use cases). This a16z blog post provides an overview of three common blueprints, built on modern cloud data infrastructure from both vendors and open source, for 1) modern business intelligence, 2) multimodal data processing, and 3) AI/ML. According to the authors, these blueprints were synthesized from conversations with "hundreds of founders, corporate data leaders, and other experts – including interviewing 20+ practitioners on their current data stacks".

  • AWS Data Wrangler – An open source Python initiative from AWS Professional Services that extends pandas to AWS, connecting DataFrames to AWS data services like Redshift, Glue, Athena, EMR, and more. Check out the super simple tutorials for reading/writing DataFrames directly to S3 as flat files or Parquet files. The library also exposes methods to easily crawl your files and generate metadata tables.

  • How to put machine learning models into production – This post from the StackOverflow blog describes three key areas to consider when deploying models to production: data storage and retrieval, frameworks and tooling, and feedback and iteration. I would take the second half of the post with a grain of salt (the article seems too sales-y to me in its emphasis on GCP and TensorFlow Extended), but I agree with the author’s points about thinking through what a deployed system should look like before embarking on a project. Thanks to (multiple) subscribers for putting this article on my radar!
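For readers who haven’t used GitHub Actions before, the comment-triggered training flow described in the first link could look roughly like the sketch below. This is my own illustrative guess at such a workflow, not the author’s actual file: the `/train` trigger phrase, script names, and requirements file are all assumptions.

```yaml
# .github/workflows/train-on-comment.yml — illustrative sketch only
name: Train on PR comment
on:
  issue_comment:
    types: [created]

jobs:
  train:
    # Run only for comments on pull requests that contain the trigger phrase
    if: github.event.issue.pull_request && contains(github.event.comment.body, '/train')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - run: pip install -r requirements.txt
      - name: Train and evaluate model
        run: python train.py --output metrics.json  # hypothetical script
      - name: Comment metrics back on the PR
        uses: actions/github-script@v3
        with:
          script: |
            const metrics = require('./metrics.json');
            await github.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body: `Model metrics: ${JSON.stringify(metrics)}`
            });
```

The key idea is the `issue_comment` trigger plus the `if` guard, which together turn a PR comment into an auditable training run whose results land back on the PR thread.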
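If the "unit tests for datasets" framing in the Great Expectations link is unfamiliar, here is a minimal pure-Python sketch of the concept. To be clear, this is not the library’s actual API, just an illustration of what an expectation checks and reports:

```python
# Conceptual sketch of "expectations as dataset unit tests".
# NOT the Great Expectations API — function names are illustrative only.

def expect_column_values_to_not_be_null(rows, column):
    """Check that no record has a null value in `column`."""
    failures = [r for r in rows if r.get(column) is None]
    return {"success": not failures, "unexpected_count": len(failures)}

def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Check that non-null values in `column` fall within [min_value, max_value]."""
    failures = [
        r for r in rows
        if r.get(column) is not None
        and not (min_value <= r[column] <= max_value)
    ]
    return {"success": not failures, "unexpected_count": len(failures)}

# A tiny "dataset" as a list of records
orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 310.5},
    {"order_id": 3, "amount": None},
]

print(expect_column_values_to_not_be_null(orders, "amount"))
print(expect_column_values_to_be_between(orders, "amount", 0, 1000))
```

The real library adds a lot on top of this — bootstrapping expectations via profiling and generating dataset docs from them — but the core mechanic is the same: declarative checks that return structured pass/fail results instead of raising on the first bad row.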

That does it for this week’s issue. If you have any thoughts, I’d love to hear them in the comment section below!
