At a Glance: Generative Models & Synthetic Data

There’s a bit of buzz around generative models in the AI community right now, specifically around using them to create “synthetic data.” Need a quick primer? Get it below, plus some recommended reading in case you want to dive deeper.

What are Generative Models?

Generative models are aptly named: broadly, they’re models that generate data resembling, or in some way related to, the data they’re trained on. For example, a generative model built on a neural network (the kind of generative model we’re talking about today) and trained on the sort of face images used for facial recognition would output more, but fake, images of faces. It would output synthetic data.
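To make that concrete, here’s a minimal, purely illustrative PyTorch sketch of what “sampling” from a trained generator looks like: random noise goes in, a batch of fake images comes out. The stand-in generator below is untrained and far too small to produce anything face-like; the network, latent size, and image shape are all hypothetical:

```python
import torch
import torch.nn as nn

# Stand-in "generator": maps random noise vectors to fake 64x64 RGB images.
# A real face generator would be a much larger network trained on real photos.
latent_dim = 100
generator = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())

with torch.no_grad():
    z = torch.randn(16, latent_dim)                # 16 random noise vectors in...
    fake_faces = generator(z).view(16, 3, 64, 64)  # ...16 synthetic images out
```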

What are Synthetic Data?

Synthetic data are basically what they sound like: not-real, artificially created data. Producing synthetic data with generative models is a new-ish concept in machine learning—researchers are only just beginning to make real headway in creating viable synthetic datasets with neural networks.

[Image: examples of synthetic face images]

What are the Different Types of Generative Models?

While generative models have been around as a concept for a while, and there are some established classical techniques, three newer deep-learning approaches are emerging:

  1. Generative Adversarial Networks (GANs): a game-like process that pits a generator network against a discriminator network (see the training-step sketch after this list)
  2. Variational Autoencoders (VAEs): probabilistic graphical models plus variational Bayesian methods, trained by maximizing a “lower bound” on the data’s likelihood
  3. Autoregressive models: networks trained to generate images one pixel at a time, conditioning each pixel on those above and to the left of it
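To make the GAN idea from item 1 concrete, here’s a minimal, hypothetical training-step sketch in PyTorch. The tiny network sizes, optimizer settings, and flattened 784-pixel images are illustrative stand-ins, not any particular published architecture:

```python
import torch
import torch.nn as nn

latent_dim = 100

# Stand-in generator and discriminator, kept deliberately tiny.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

def train_step(real_images):  # real_images: a (batch, 784) tensor
    batch = real_images.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1. Train the discriminator: real images labeled 1, generated fakes labeled 0.
    fake_images = G(torch.randn(batch, latent_dim))
    d_loss = loss_fn(D(real_images), ones) + loss_fn(D(fake_images.detach()), zeros)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # 2. Train the generator: try to make the discriminator label its fakes as real.
    g_loss = loss_fn(D(fake_images), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```

In the idealized version of this game, training reaches a point where the discriminator can no longer tell the generator’s fakes from real examples.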

What are Synthetic Data from Generative Models Likely to be Useful For?

Creating art is one near-term application, and there are lots of opportunities in digital image enhancement: smoothing (denoising), filling in missing pieces (inpainting), improving resolution (super-resolution imaging), etc. There’s promise in natural language processing (NLP) applications, too, and this type of synthetic data can also be useful in reinforcement learning, where generated environments and experiences can help agents explore.

Will We Ever be Able to Use Synthetic Data from Generative Models as Training Data?

Many in the know are hoping that, in the future, we’ll get to a place where a whole bunch of synthetic examples from generative models plus a small number of real examples can train a system to the same level of performance as a large number of real examples alone. A lot of the literature on this type of synthetic data suggests we could eventually generate artificially large datasets for “pre-training”: train a machine learning system on reasonable-looking synthetic data to get it to a reasonable starting point, then “fine-tune” it on a much smaller amount of real data. The idea is that the system will have learned enough from the synthetic data to use the real data more efficiently.
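As a rough illustration of that recipe, here’s a hypothetical PyTorch sketch: pre-train on a big synthetic set, then fine-tune on a small real one. The model, the random stand-in data, and every hyperparameter here are made up for illustration:

```python
import torch
import torch.nn as nn

# Stand-in model and loss for the "pre-train on synthetic, fine-tune on real" idea.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

def run_epochs(data, targets, lr, epochs):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(data), targets)
        loss.backward()
        optimizer.step()

# Phase 1: pre-train on a large batch of (here, random) synthetic examples
# to get the model to a reasonable starting point.
synthetic_x, synthetic_y = torch.randn(10_000, 10), torch.randn(10_000, 1)
run_epochs(synthetic_x, synthetic_y, lr=1e-3, epochs=20)

# Phase 2: fine-tune on a much smaller real dataset, usually with a lower
# learning rate so the real data refines rather than overwrites.
real_x, real_y = torch.randn(100, 10), torch.randn(100, 1)
run_epochs(real_x, real_y, lr=1e-4, epochs=5)
```

The lower fine-tuning learning rate is the usual trick for keeping the small real dataset from wiping out what was learned during pre-training.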

Important note: Synthetic data can augment real datasets, but cannot replace them. Even outside of use cases where the training data must be all-authentic, generative models themselves will always need plenty of real data to learn how to produce synthetic examples in the first place. No model will ever be able to generate examples of things it’s never seen real examples of before.

The days of widely implemented “synthetic data for pre-training” approaches are still far off, and the jury is obviously still out on which applications the technique could be useful for. Interesting stuff, nonetheless, and we’ll of course report back as the field makes advances.

In the meantime, check out these great resources for more details and a peek into the current state of generative models for synthetic data:

OpenAI’s blog on generative models
The Unreasonable Effectiveness of Recurrent Neural Networks

image credit: rawpixel.com