A key barrier for companies to adopt machine learning is not lack of data but lack of labeled data. Labeling data gets expensive, and the difficulties of sharing and managing large datasets for model development make it a struggle to get machine learning projects off the ground.
That’s where our “learn more from less data” approach comes into action. At JPMorgan Chase, we are focused on reducing the amount of data needed to build models. Rather than labeling ever more data, we build gold training datasets, which reduces labeling cost and increases the agility of model development.
Labeled data is a group of samples that have been tagged with one or more labels. Once a labeled dataset is available, a machine learning model can be trained on it so that, when new, unlabeled data is presented, the model can predict a likely label for it. A gold training dataset is a small, labeled dataset with high predictive power.
Active learning is a form of semi-supervised learning that works well when you have a lot of data but face the expense of getting that data labeled. By identifying the samples that are most informative, teams can spend their labeling effort on the data points that most improve the quality of the model.
Using machine learning (ML) models, active learning can help identify difficult data points and ask a human annotator to focus on labeling them.
To explain passive learning and active learning, let’s use the analogy of teacher and student. In the passive learning approach, a student learns by listening to the teacher's lecture. In active learning, the teacher describes concepts, students ask questions, and the teacher spends more time explaining the concepts that are difficult for a student to understand. Student and teacher interact and collaborate in the learning process.
In ML model development using active learning, annotator and modeler interact and collaborate. An annotator provides a small labeled dataset. The modeling team builds a model and generates input on what to label next. Within a few iterations, teams can build refined requirements, a labeled gold training set, an active learner and a working machine learning model.
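The iteration described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not our production workflow: the `oracle` function stands in for the human annotator with a made-up rule, the "model" is a toy one-dimensional threshold, and all names and numbers are hypothetical.

```python
def oracle(x):
    """Stand-in for the human annotator (hypothetical rule: positive if x >= 0.62)."""
    return int(x >= 0.62)


def fit_threshold(labeled):
    """Toy model: midpoint between the largest negative and the smallest positive.

    Assumes the labeled set already contains at least one example of each class.
    """
    largest_negative = max(x for x, y in labeled.items() if y == 0)
    smallest_positive = min(x for x, y in labeled.items() if y == 1)
    return (largest_negative + smallest_positive) / 2


def active_learning(pool, seed, budget):
    """Alternate between retraining and asking the annotator for one more label."""
    labeled = {x: oracle(x) for x in seed}  # small initial labeled dataset
    unlabeled = [x for x in pool if x not in labeled]
    for _ in range(budget):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)  # modeling team rebuilds the model
        # The point closest to the decision boundary is the one the model is least sure about.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        labeled[query] = oracle(query)  # annotator labels the requested point
        unlabeled.remove(query)
    return fit_threshold(labeled), labeled


pool = [i / 100 for i in range(101)]
threshold, labeled = active_learning(pool, seed=[0.0, 1.0], budget=10)
print(f"learned threshold {threshold:.3f} from {len(labeled)} of {len(pool)} labels")
```

In this toy setting the loop locates the boundary after labeling only a small fraction of the pool, because each query is placed where the current model is least certain.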
To identify difficult data points, we use a combination of methods, including:
Classification uncertainty sampling: When querying for labels, the strategy selects the samples with the highest uncertainty, that is, the data points the model knows least about. Labeling these data points makes the ML model more knowledgeable.
Margin uncertainty: When querying for labels, the strategy selects the samples with the smallest margin, where the margin is the difference between the predicted probabilities of the two most likely classes. These are data points the model knows about but isn’t confident enough to classify well. Labeling these examples increases model accuracy.
Entropy sampling: Entropy is a measure of uncertainty. It is proportional to the average number of guesses one has to make to find the true class. In this approach, we pick the samples with the highest entropy.
Disagreement-based sampling: With this method, we pick the samples on which different algorithms disagree. For example, suppose a model classifies samples into five classes (A, B, C, D and E) and we use five different classifiers:
Bag of words
LSTM
CNN
BERT
HAN (Hierarchical Attention Networks)
The annotator can then label the examples on which the classifiers disagree.
Information density: In this approach, we focus on the denser regions of the data and select a few points in each dense region. Labeling these data points helps the model classify the large number of data points around them.
Business value: In this method, we focus on labeling the data points that have higher business value than the others.
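To make the score-based strategies above concrete, here is a minimal sketch of how classification uncertainty, margin, entropy and classifier disagreement might be computed. It assumes each model outputs a list of class probabilities per sample. The function names, sample IDs and numbers are hypothetical, and the disagreement score shown (the share of classifiers that differ from the majority label) is only one of several possible choices.

```python
import math
from collections import Counter


def least_confidence(probs):
    """Classification uncertainty: 1 minus the highest class probability."""
    return 1.0 - max(probs)


def margin(probs):
    """Margin: gap between the two most likely classes (smaller = more ambiguous)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def entropy(probs):
    """Shannon entropy in bits (higher = more uncertain)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def vote_disagreement(votes):
    """Share of classifiers whose label differs from the majority label."""
    majority_count = Counter(votes).most_common(1)[0][1]
    return 1.0 - majority_count / len(votes)


# Hypothetical predicted class probabilities for three unlabeled samples.
pool = {
    "doc_1": [0.90, 0.05, 0.05],
    "doc_2": [0.40, 0.35, 0.25],
    "doc_3": [0.50, 0.49, 0.01],
}

most_uncertain = max(pool, key=lambda k: least_confidence(pool[k]))
smallest_margin = min(pool, key=lambda k: margin(pool[k]))
highest_entropy = max(pool, key=lambda k: entropy(pool[k]))

# Hypothetical labels from five classifiers for a single sample.
disagreement = vote_disagreement(["A", "A", "B", "C", "A"])
```

With these made-up numbers, uncertainty and entropy both select doc_2, while margin selects doc_3, so the strategies do not always agree on what to label next. The sketch leaves out information density and business value, which need a similarity measure and business input respectively.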
Traditionally, data scientists work with annotators to label a portion of their data and hope for the best when training their model. If the model isn’t sufficiently predictive, more data is labeled, and they try again until its performance reaches an acceptable level. While this approach still makes sense for some problems, for those that have vast amounts of data or unstructured data, we find that active learning is a better solution.
Active learning combines the power of machine learning with human annotators to select the next best data points to label. This intelligent selection leads to the creation of high-performance models in less time and at lower cost.
The Artificial Intelligence & Machine Learning group is focused on increasing the volume and velocity of AI applications across the firm by helping develop common platforms, reusable services and solutions.