Why machine learning uses “synthetic” data

We review opinions from the IT community and industry experts, and look at a couple of projects building tools for generating “artificial” data. One of them was presented by former employees of the US National Security Agency and Google.


Photo: Franki Chamaki / Unsplash

The machine learning data problem


Some ML algorithms need large sets of labeled data to work. For computer vision tasks, such a set is provided by the ImageNet project: its database contains more than 14 million images sorted into 22 thousand categories. Working with data at this scale pays off: algorithms trained on it misidentify the object in a photograph in only 3.75% of cases, while for humans the error rate exceeds 5%.

But building a dataset like ImageNet for every task is impossible, if only because its records are labeled (or at least verified) by hand. Meanwhile, real data, such as banking or medical records, may be closed off and inaccessible to most developers and data scientists. And even where such data is available, it must be anonymized before processing.

Synthetic data helps get around these difficulties. It is artificial and computer-generated, but it looks like the real thing.

Who works in this field


Many universities, IT companies, and startups run projects in this area. For example, Gretel builds software that generates an artificial dataset from a real one. The company was founded by former employees of Google, Amazon, and the US National Security Agency (NSA).

First, the platform analyzes the available information; as an example, the engineers used data on Uber electric scooter rides. Gretel categorizes and labels the records, then anonymizes them using differential privacy methods. The output is a “completely artificial dataset.” The developers have posted the code of their solution on GitHub.
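
Gretel's actual pipeline is not reproduced in the article, but the core idea behind differential privacy can be shown with a minimal sketch: releasing an aggregate statistic with calibrated Laplace noise, so that no single record can be inferred from the output. Everything below, from the ride values to the bounds and epsilon, is an illustrative assumption, not Gretel's code.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release a value with epsilon-differential privacy.

    Noise scale b = sensitivity / epsilon: the smaller epsilon,
    the stronger the privacy guarantee and the noisier the output.
    """
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Hypothetical example: lengths of scooter rides, in minutes.
rides = np.array([4.2, 7.5, 3.1, 9.8, 5.6])

# For a mean over n records bounded by [0, 30] minutes,
# changing one record shifts the mean by at most 30 / n.
sensitivity = 30.0 / len(rides)

private_mean = laplace_mechanism(rides.mean(), sensitivity, epsilon=1.0)
print(f"true mean: {rides.mean():.2f}, private mean: {private_mean:.2f}")
```

The smaller the privacy budget epsilon, the noisier the released value; generative approaches like Gretel's apply this kind of guarantee to an entire synthesized dataset rather than to a single statistic.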

A similar project was implemented at the University of Illinois at Urbana-Champaign. Engineers wrote a Python library that generates synthetic data for structured formats (CSV and TSV) and partially structured ones (JSON, Parquet, and Avro). For the structured formats the researchers used generative adversarial networks, and for the partially structured ones, recurrent neural networks.
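
The library's interface is not shown in the article, so the sketch below only illustrates the general idea of a generative adversarial network for tabular data: a generator learns to produce rows that a discriminator cannot distinguish from real ones. The two-column “table,” the architectures, and the hyperparameters are all assumptions chosen for brevity (PyTorch).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "real" table: two correlated numeric columns
# (e.g., ride distance and ride duration).
col = torch.randn(512, 1)
real = torch.cat([col, 2 * col + 0.1 * torch.randn(512, 1)], dim=1)

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))  # noise -> row
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # row -> logit

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Discriminator step: push real rows toward 1, generated rows toward 0.
    fake = G(torch.randn(512, 8)).detach()
    d_loss = loss_fn(D(real), torch.ones(512, 1)) + loss_fn(D(fake), torch.zeros(512, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make D label generated rows as real.
    fake = G(torch.randn(512, 8))
    g_loss = loss_fn(D(fake), torch.ones(512, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# A small "synthetic table": rows that mimic the real distribution.
print(G(torch.randn(5, 8)).detach())
```

A production tabular GAN would also need to handle categorical columns, normalization, and mode collapse; the point here is only the adversarial training loop itself.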

How effective is synthetic data?


It gives data scientists and developers a way to train models for projects in areas where big data is not yet available. According to Alex Watson, one of Gretel's founders, in many cases it is enough to have values that merely look like real user data.

Gartner estimates that by 2022, 40% of ML models will be trained on synthetic datasets.

The head of Hazy, an AI company, attributes the technology's spread to its “flexibility”: artificial data is easier to augment and modify in order to improve the performance of the trained model.

There are also computer vision tasks where it is hard to use anything other than a synthetic dataset, for example in robotics. Industrial robots and unmanned vehicles are designed with reinforcement learning: the AI system learns by directly interacting with some environment, adjusting its actions depending on the environment's response.

But an autonomous vehicle cannot be sent into the street to discover by trial and error that pedestrians must not be run over. So engineers turn to synthetic data and simulate the environment in virtual space. Nvidia, for example, offers a solution for such experiments, and research has also been done on training models with the Grand Theft Auto V game engine.
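
Neither Nvidia's simulator nor the GTA V setups are reproduced here; the deliberately toy sketch below only illustrates the principle the paragraph describes: the agent learns from a simulated environment's rewards, so the costly mistake of hitting a “pedestrian” only ever happens in software. The one-dimensional road, the reward values, and the Q-learning hyperparameters are all assumptions.

```python
import random

random.seed(0)

# A 1-D "road": the agent starts at cell 0 and wants to reach cell 4.
# Cell 2 holds a simulated pedestrian: stepping on it is heavily
# penalized, but the "jump" action can step over it.
GOAL, PEDESTRIAN = 4, 2
ACTIONS = [1, 2]  # move 1 cell, or "jump" 2 cells

q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def step(state, action):
    nxt = min(state + action, GOAL)
    if nxt == PEDESTRIAN:
        return nxt, -100.0  # collision: only ever "happens" in simulation
    if nxt == GOAL:
        return nxt, 10.0
    return nxt, -1.0        # small cost per move

for _ in range(500):  # episodes
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda a: q[(s, a)])
        nxt, r = step(s, a)
        best_next = max(q[(nxt, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        s = nxt

# Greedy policy per state: 1 = move, 2 = jump.
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)})
```

After training, the greedy policy moves one cell from the start and then jumps over the pedestrian cell, behavior it could only learn this cheaply because every collision was simulated.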


Photo: Andrea Ang / Unsplash

For all its advantages, artificial data has drawbacks. It is considered less accurate, even when generated from real data, and can lead to models that produce plausible results that cannot be reproduced in the real world. However, one Hacker News user notes in a thread on the topic that this is not such a big problem: artificial data can still be used to test the algorithms of an intelligent system.
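
That testing use case can be shown with a minimal sketch: generate random data with the same schema as the private dataset and check that the whole training pipeline runs end to end. The ten-feature schema and the logistic regression model are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a private dataset: same shape and types
# as the real one (10 numeric features, binary label), random content.
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Smoke test: the whole fit/predict path runs without real data.
model = LogisticRegression().fit(X[:800], y[:800])
print("pipeline OK, accuracy on held-out synthetic rows:",
      accuracy_score(y[800:], model.predict(X[800:])))
```

The point here is not the accuracy figure but that the fit/predict path is exercised without ever touching real user data.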

Another Hacker News user points out that such technologies complicate model training and drive up development costs; a specialist from the University of Illinois confirms this, putting the difference at up to 50%.

In any case, synthetic datasets are no “silver bullet.” They are just a tool that helps with a certain range of problems, though it is quite possible that this range will expand rapidly over time.

What we write about in our corporate blog:

A computer that refuses to die
“Take your footprints and leave for the weekend”: how to remove yourself from the most popular services
Which tools will help you comply with the GDPR
“Almost anarchy”: a brief history of Fidonet, a project that “doesn't care” about the Internet's victory
