Snorkel Flow is motivated by a tectonic shift in AI over the last decade towards powerful but data-hungry machine learning models that succeed or fail based on the massive labeled datasets they learn from.
Unfortunately, labeling these training datasets by hand is prohibitively expensive and slow for most organizations. In particular, the status quo of cheap, outsourced manual labeling, which works for some use cases (e.g., labeling stop signs, pedestrians, cats, and dogs), is not an option when data is highly private, requires subject matter experts to label, and changes rapidly, as is the case for most organizations.
Snorkel Flow is based on a novel approach to programmatically building and managing training datasets: rather than labeling thousands of data points by hand, users label, augment, and structure training datasets programmatically — e.g., using rules, heuristics, and other sources of signal. This type of input, often referred to as weak supervision, is fundamentally faster and more flexible, but also messier. Snorkel Flow relies on years of theoretical and algorithmic research from the Stanford AI Lab to manage this powerful new type of input.
Ultimately, the vision that all the algorithmic, theoretical, and empirical research behind Snorkel Flow builds to is a simple but powerful one: that of ML development as a practical, iterative, error-analysis driven process, rather than one reliant on weeks or months of labeling and relabeling data by hand.
The time and cost of labeling training data is one of the biggest bottlenecks in deploying machine learning today. Snorkel Flow instead enables users to label data programmatically, using push-button or programmatically defined labeling functions, so that data is labeled quickly, flexibly, and in an interpretable and auditable way. These labeling functions can be used to express subject matter expertise in the form of rules, heuristics, and pattern matchers, as well as to leverage organizational knowledge resources such as existing labels or models, legacy rules, knowledge bases and graphs, and more.
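To make the idea of a labeling function concrete, here is a minimal plain-Python sketch. It does not use the Snorkel Flow API; the spam-detection task, the label constants, and every function name are hypothetical and chosen only for illustration. Each function votes for a label or abstains when its rule does not apply.

```python
import re

# Label constants for a hypothetical spam-detection task (assumed, not from the source).
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text: str) -> int:
    """Heuristic: messages containing a URL are likely spam."""
    return SPAM if re.search(r"https?://", text) else ABSTAIN


def lf_short_message(text: str) -> int:
    """Heuristic: very short messages are usually legitimate."""
    return HAM if len(text.split()) < 5 else ABSTAIN


def lf_known_spam_phrase(text: str) -> int:
    """Pattern matcher backed by a small, hypothetical legacy rule list."""
    phrases = ("free money", "click here", "act now")
    return SPAM if any(p in text.lower() for p in phrases) else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_short_message, lf_known_spam_phrase]


def apply_lfs(texts):
    """Return one row of votes per text, one column per labeling function."""
    return [[lf(t) for lf in LABELING_FUNCTIONS] for t in texts]


votes = apply_lfs([
    "Click here for free money: http://example.com",
    "See you at noon",
])
print(votes)
```

Because each rule is ordinary code, it can be reviewed, versioned, and re-run on new data, which is what makes this style of labeling auditable.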
The resulting technical challenge is that these user-developed labeling strategies can be inaccurate, can overlap and disagree, can be correlated, and can have minimal data coverage. Snorkel Flow uses novel, theoretically grounded techniques to model, integrate, and clean this programmatically labeled data, yielding accuracy equal to or greater than that of hand-labeled data.
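The sketch below illustrates the problem of overlapping, conflicting, low-coverage votes using a deliberately simple stand-in: an unweighted majority vote plus two diagnostics. This is not the modeling technique Snorkel Flow uses, which the text describes only at a high level; the function names and the example vote matrix are assumptions made for illustration.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: a labeling function returns -1 when it has no opinion


def majority_vote(row):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    counts = Counter(v for v in row if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def coverage(votes, lf_index):
    """Fraction of data points on which one labeling function does not abstain."""
    return sum(row[lf_index] != ABSTAIN for row in votes) / len(votes)


def conflict_rate(votes):
    """Fraction of data points where at least two labeling functions disagree."""
    def has_conflict(row):
        return len({v for v in row if v != ABSTAIN}) > 1
    return sum(has_conflict(row) for row in votes) / len(votes)


# Rows are data points, columns are labeling functions (illustrative values).
votes = [
    [1, -1, 1],
    [-1, 0, -1],
    [1, 0, -1],
    [-1, -1, -1],
]
labels = [majority_vote(row) for row in votes]
print(labels, coverage(votes, 0), conflict_rate(votes))
```

A majority vote treats every labeling function as equally reliable and independent, which is exactly the assumption that breaks down when functions are inaccurate or correlated. That gap is why a more careful model of the labeling functions is needed.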
In addition to labeling data, Snorkel Flow enables users to programmatically perform other key operations on training data, including augmenting training datasets by creating transformed data copies, slicing training datasets into critical subsets for monitoring and model prioritization, and more.
Whereas these key techniques are often implemented by hand in practice, Snorkel Flow enables users to programmatically express them and then uses novel algorithmic approaches to tune and optimize them. The result is faster, smarter, and more finely controllable ML.
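As a rough illustration of the augmentation and slicing operations described above, the following plain-Python sketch defines one transformation function and one slicing function. The names, the word-swap transformation, and the "short text" slice are all hypothetical choices, not part of Snorkel Flow, and the sketch assumes the transformation preserves the label, which has to be checked for any real task.

```python
import random


def tf_swap_adjacent_words(text: str, rng: random.Random) -> str:
    """Transformation function: return a copy with two adjacent words swapped."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def augment(dataset, copies_per_example=1, seed=0):
    """Return the original (text, label) pairs plus transformed copies with the same label."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for text, label in dataset:
        for _ in range(copies_per_example):
            augmented.append((tf_swap_adjacent_words(text, rng), label))
    return augmented


def sf_short_text(text: str) -> bool:
    """Slicing function: flag short inputs as a subset to monitor separately."""
    return len(text.split()) < 4


def slice_dataset(dataset, slicing_function):
    """Keep only the examples that the slicing function selects."""
    return [(text, label) for text, label in dataset if slicing_function(text)]


dataset = [("please review the attached contract", 1), ("call me", 0)]
augmented = augment(dataset)
short_slice = slice_dataset(dataset, sf_short_text)
print(len(augmented), short_slice)
```

Expressing these operations as functions means they can be applied uniformly to the whole dataset and tuned later, rather than being hand-edited example by example.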
One of the largest blockers to applying machine learning to many problem domains and sectors is the sensitivity of the data involved.
With Snorkel Flow’s programmatic approach to training data, not only can training data labeling and management be kept on-premises or self-hosted, it can be done without humans needing to view the majority of the data — setting a new high bar for practical, private machine learning.
The core slowdown in most iterative development loops involving machine learning is the need to label and re-label data by hand. With the shift to programmatic labeling and management of training data, monitoring and analysis lead to immediately actionable steps for improving the data, which makes for a fast and responsive iterative loop.
Similarly, programmatic training data enables a fundamentally new but pragmatic level of auditability and versioning — since training data is labeled and managed by code.
Modern ML models and training techniques are powerful but data-hungry.
Snorkel Flow supports these types of models (e.g., BERT and XLNet) and techniques (e.g., transfer learning, multi-task learning, and ensembling) with training datasets that can be orders of magnitude larger than hand-labeled ones. New techniques are regularly integrated into the platform, so you have access to the latest technology as it evolves.
The same algorithmic and theoretical techniques that power Snorkel Flow's ability to estimate the quality of diverse labeling functions and integrate their outputs apply equally to hand-labeled data. Though no hand-labeled training data is needed to use Snorkel Flow, if you have it, Snorkel Flow will manage it for you, automatically estimating the quality of your annotators and identifying unreliable or adversarial labelers.
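One crude way to picture annotator quality estimation is to score each annotator by how often they agree with the per-item consensus. The sketch below does only that; it is a simplified proxy rather than the estimation method Snorkel Flow uses, and the annotator names, labels, and 0.5 threshold are invented for illustration.

```python
from collections import Counter


def consensus(labels):
    """Most common label among annotators for one item (ties go to the first label seen)."""
    return Counter(labels).most_common(1)[0][0]


def annotator_agreement(annotations):
    """Score each annotator by agreement with the per-item consensus.

    annotations maps an annotator name to a list of labels, one per item,
    with all lists aligned by item index.
    """
    names = list(annotations)
    n_items = len(annotations[names[0]])
    consensus_labels = [
        consensus([annotations[name][i] for name in names]) for i in range(n_items)
    ]
    return {
        name: sum(x == c for x, c in zip(annotations[name], consensus_labels)) / n_items
        for name in names
    }


# Hypothetical labels from three annotators on four items.
annotations = {
    "ann_a": [1, 0, 1, 1],
    "ann_b": [1, 0, 1, 0],
    "ann_c": [0, 1, 0, 0],
}
scores = annotator_agreement(annotations)
flagged = [name for name, score in scores.items() if score < 0.5]
print(scores, flagged)
```

Agreement with consensus can mislabel a correct minority as unreliable, so a score like this is best treated as a prompt for review rather than a verdict.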
The Snorkel Flow platform provides an array of collaboration and workflow tools to integrate subject matter expert labelers into your workflow — whether they are labeling a small test or validation data split, or providing higher-level feedback on a difficult portion of the training dataset to assist with development.
Snorkel Flow is informed by novel research into machine learning systems and weak supervision from the Stanford AI Lab and beyond. This research has been funded by DARPA, ONR, DoD, NIH, NSF, Google, Intel, Microsoft, and many others, taught in several introductory and advanced machine learning courses, and published in more than thirty-six peer-reviewed papers.