Why is Snorkel compelling?
Obtaining labeled training data is one of the largest practical roadblocks to building machine learning models in most industries today. Supervised learning requires labeled data, and neural networks, for example, often need on the order of 10,000 examples per class to generalize decently well. Even datasets where the label can be inferred through log analysis (e.g., "anomaly detection") tend to contain very few real examples of the "anomaly" class. In other cases we have plenty of data for all classes but no labels at all. Although labeling tools exist, building a quality labeled training dataset by hand is still labor- and time-intensive. Enter: Snorkel.
The project description from the website:
"Snorkel is a system for programmatically building and managing training datasets without manual labeling. In Snorkel, users can develop large training datasets in hours or days rather than hand-labeling them over weeks or months."
Snorkel - Get Started
There are 3 key operations performed in a Snorkel pipeline:
- Labeling data
- Transforming data
- Slicing data
Snorkel uses heuristic rules (or distant supervision techniques) to label data, reducing the need for manual labeling. It can also apply data transformations to create variations of the newly labeled data, and it can slice data into subsets for targeted improvement or monitoring. Because Snorkel works with the Pandas DataFrame construct, it integrates easily into a DataFrame-based ETL pipeline.
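To make the labeling step concrete, here is a minimal sketch using Snorkel's labeling-function API over a Pandas DataFrame. The `text` column, the spam/ham labels, and the keyword heuristics are illustrative assumptions rather than any particular dataset:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

# Label values; -1 means the labeling function abstains.
ABSTAIN, HAM, SPAM = -1, 0, 1

# Illustrative unlabeled data with a `text` column (assumption).
df_train = pd.DataFrame({
    "text": [
        "Check out http://example.com for free prizes",
        "Please review the attached quarterly report",
        "WIN money now!!!",
    ]
})

# Heuristic rules written as labeling functions.
@labeling_function()
def lf_contains_link(x):
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_shouting(x):
    return SPAM if x.text.isupper() or "!!!" in x.text else ABSTAIN

@labeling_function()
def lf_polite(x):
    return HAM if "please" in x.text.lower() else ABSTAIN

# Apply the labeling functions to the DataFrame to get a label matrix
# (one column of votes per labeling function).
applier = PandasLFApplier(lfs=[lf_contains_link, lf_shouting, lf_polite])
L_train = applier.apply(df=df_train)

# Combine the noisy, possibly conflicting votes into a single label per row.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=123)
df_train["label"] = label_model.predict(L=L_train)
print(df_train)
```

The resulting `label` column can then feed a downstream classifier; transformation and slicing functions follow the same decorator pattern (`@transformation_function` and `@slicing_function`).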
Jeff Dean also recently tweeted about a variation of Snorkel.
To better understand how Snorkel is used in practice, take a look at these 3 example use cases:
- Apple's Overton Project using ideas from Snorkel
- Google AI Blog: Harnessing Organizational Knowledge for Machine Learning
- MRI Image Sequence Classification Work: Weakly supervised classification of aortic valve malformations using unlabeled cardiac MRI sequences