This is a follow-up to our post from earlier this year on collecting/creating training data for NLP algorithms. Much of the information in that blog applies to computer vision (CV) algorithms (and machine learning initiatives in general), and since the post was so popular, we’ve adapted it for CV practitioners. Enjoy!
The less-exciting part? Training data. (Well, we find it exciting, but we know through our conversations with hundreds of data scientists, engineers, and product leaders that the responsibility of acquiring and/or annotating training data is a real thorn in their sides.) And yet, it's arguably the backbone of machine learning. You gotta have training data, it's gotta be high-quality, and it's best—sometimes absolutely necessary—that you get it quickly and efficiently.
There’s a (large) handful of training data sources, solutions, and strategies out there—how do you choose? How do you know which dataset or tool or vendor is the right one for your project?
That’s what this post is all about.
Broadly, you can categorize training data resources into two buckets: pre-existing, publicly available datasets; and tools and solutions for creating your own.
Pre-Existing, Publicly Available Datasets
AKA open datasets or open source datasets, these are off-the-shelf, already-annotated datasets that are available on the web for free or for purchase. As mentioned in the NLP post, one of the many great things about the data science community is its members’ commitment to sharing knowledge and resources with the field at large—we owe the bevy of open computer-vision training datasets available today to that commitment.
In general, a pre-existing dataset is a good option in two scenarios: 1) you’re just beginning the process of testing out algorithms, or 2) the model you’re building only needs to perform a general, relatively simple task. While public datasets lack specificity, the accuracy is usually good, so they’re typically reliable resources. Here are some good datasets and dataset repositories:
- Common Objects in Context (COCO)
- Google’s Open Images
- The University of Edinburgh School of Informatics’ CVonline: Image Databases
- Yet Another Computer Vision Index To Datasets (YACVID)
- This list on GitHub
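Most of these datasets ship their annotations as structured files you parse before training. COCO, for instance, distributes a single JSON file with `images`, `annotations`, and `categories` sections. As a rough sketch of working with that structure (the tiny inline JSON below is invented for illustration, but it follows the published COCO field names):

```python
import json

# A tiny, hypothetical COCO-style annotation file. Real COCO files
# use this same top-level structure, at a much larger scale.
coco_json = """
{
  "images": [{"id": 1, "file_name": "dog.jpg", "width": 640, "height": 480}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 18, "bbox": [12.0, 40.0, 200.0, 150.0]}
  ],
  "categories": [{"id": 18, "name": "dog"}]
}
"""

def index_annotations(data):
    """Group annotations by image id and attach human-readable category names."""
    categories = {c["id"]: c["name"] for c in data["categories"]}
    by_image = {img["id"]: {"file_name": img["file_name"], "objects": []}
                for img in data["images"]}
    for ann in data["annotations"]:
        by_image[ann["image_id"]]["objects"].append({
            "label": categories[ann["category_id"]],
            "bbox": ann["bbox"],  # COCO boxes are [x, y, width, height]
        })
    return by_image

index = index_annotations(json.loads(coco_json))
print(index[1]["objects"][0]["label"])  # -> dog
```

In practice you'd point a loader like this (or an off-the-shelf one) at the full annotation file rather than an inline string; the point is that "already-annotated" still means a parsing-and-indexing step before the data reaches your model.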
Tools & Solutions for Creating Your Own Training Data
It’s best to create your own training datasets when you need custom and/or highly specific annotations. If your model is intended to perform anything more sophisticated or specialized than generic computer vision functions (e.g., basic object recognition or image tagging/captioning), you’re likely to need proprietary training data.
There are three distinct approaches to generating your own training data (a blend of methods is also common):
- DIY/annotating in-house
- Crowdsourcing or outsourcing
- An end-to-end/"TDaaS" solution
Doing your own annotations (or “labeling”) is a popular choice when the accuracy bar is exceptionally high, and when you can afford to allocate employees’ time to annotating or managing the annotation process. DIY labeling is often much slower than other solutions, too, so wiggle room with deadlines is also required for this approach.
As noted, the advantage to handling data annotating in-house is that quality is often very high. It also provides the most control over the process of all three options.
Crowdsourcing or outsourcing is a frequent next step for DIY-ers when velocity becomes an issue, or when the cost of tying up high-value employees’ time in labeling—keeping them from tackling other important work—is no longer worth it. While quality often suffers with crowdsourcing/outsourcing and there is still much time and effort required of the customer (task design, writing instructions, quality control, etc.), this approach does enable companies to offload the actual annotating, and typically allows them access to vastly larger pools of annotators. (Though it should be noted, many times these “crowds” or groups are made up of largely unknown users, and targeting them by demographics, skill, or domain knowledge can be difficult or impossible.) Quality suffers, but speed and scale improve.
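One common quality-control tactic when crowdsourcing is to collect redundant labels for each item and keep only the majority answer, flagging low-agreement items for expert review. A minimal sketch of that idea (the annotator responses below are invented for illustration):

```python
from collections import Counter

def majority_label(labels, min_agreement=0.5):
    """Return the most common label if it clears the agreement
    threshold; otherwise None, flagging the item for review."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_agreement else None

# Hypothetical responses from three annotators per image.
crowd_labels = {
    "img_001.jpg": ["cat", "cat", "dog"],   # clear majority
    "img_002.jpg": ["car", "truck", "bus"], # no agreement -> review
}
for image, labels in crowd_labels.items():
    print(image, majority_label(labels))
```

Simple voting like this is only a starting point; production pipelines typically also weight annotators by their track record on gold-standard tasks, which is part of the effort the customer still has to design and maintain.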
Here are some popular annotation tools for in-house or crowdsourced annotation:
Also check out these papers for helpful tips and tricks:
- Tools for Richer Crowd Source Image Annotations
- Efficient annotation of image data sets for computer vision applications
Training Data as a Service (TDaaS)
A comprehensive training data solution, such as Mighty AI’s Training Data as a Service (TDaaS), is a relatively new category among training data options. What TDaaS-like solutions offer is a complete offloading of the entire annotation process. From determining annotation specs to creating workflows to handling task design, instructions, qualifying/managing/paying annotators, and QA, this approach requires the least amount of effort from the customer (by far).
The “total cost of ownership” (TCO) of a TDaaS solution often works out to be the most favorable, too, when you factor in time and headache savings. Employees are freed up to do high-value work instead of annotating or managing the annotation process, and, at least in the case of Mighty AI, accuracy is as high or higher than both crowdsourcing/outsourcing and internal labeling (due to our proprietary QA process, as well as how much we know about our users, and our ability to target them by domain, skill, or demographics). You get the quality of in-house with the scale and speed of crowdsourcing, minus the time and effort on your end.
While training data is perhaps the least sexy part of computer vision projects, it’s undeniably crucial. Use the info above to guide you in your decision-making process for annotation methods, tools, and solutions—now and in the future, as your training data needs change.
Note: Prior to January 10, 2017, Mighty AI was known as Spare5. While Spare5 remains the name of our consumer brand and application, we’ve relaunched our business-customer side as Mighty AI, which also serves as the parent company under which Spare5 now lives. Some posts on mty.ai have been updated with the new company name to ease confusion.