The Way out of the GDPR Dilemma

Machine Learning Reply relies on Generative Adversarial Networks (GANs) to align data science studies with the requirements of the GDPR.

The GDPR protects consumers against having their personal data misused by companies. However, the regulation also means that product development and testing must in many cases be carried out without access to highly meaningful data - even though new developments in AI and machine learning open up great opportunities for improved products and services.

The GDPR therefore creates hurdles not only for companies: it also makes it impossible for researchers to, for example, share data among themselves or make it available to external service providers.

THE USE CASE REQUIREMENTS

  • Finding a compromise:
    The necessary protection of personal data must be maintained, but at the same time the progress made possible by Big Data and analytics should not be hindered.
  • Combining complexity and comprehensibility:
    Statistical sampling of data does not capture the interactions between attributes and therefore does not represent them with the complexity that modern methods require. Dimension-reducing methods, on the other hand, usually reduce the comprehensibility of the data.
  • Replacing information in a disguised manner:
    The models should be based on real data, but must not allow any conclusions to be drawn about its origin. It must be possible to exchange personal information in disguised form (a secure exchange of entities without duplicates). At the same time, it must be possible to generate realistic data in a way that is both optimisable and quantifiable.

EXCURSUS: GENERATIVE ADVERSARIAL NETWORKS (GANS)

The experts at Machine Learning Reply have identified the Generative Adversarial Networks (GANs) methodology as a solution to the data protection dilemma.

That AI possesses a certain kind of creativity of its own was recently demonstrated by an algorithm that generated a painting of the fictitious nobleman "Edmond de Belamy", based on 15,000 portraits from the 14th to the 20th century that were fed into the system. The Generative Adversarial Networks (GANs) method was used for this purpose. It can be used to create images such as that of Edmond de Belamy, or photo-realistic images from hand-drawn sketches.

The computer does this by having two artificial intelligences "play" against each other. First, both learn from real data - this can be structured data or, as in the example above, unstructured data such as artwork. In the second step, one AI tries to generate a new image (or a new data point), while the other has to recognize whether it is a synthetic or an original image. The two parts of a GAN thus train each other, and the synthesized data becomes increasingly realistic.
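This adversarial training loop can be sketched in a few lines. The following is a minimal illustration, assuming PyTorch; the network sizes, learning rates and data dimensions are placeholder choices for tabular data, not details of Machine Learning Reply's actual setup:

```python
# Minimal GAN training sketch (PyTorch assumed; sizes are placeholders).
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 8  # assumed dimensions for tabular data

generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(
    nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    batch_size = real_batch.size(0)
    ones = torch.ones(batch_size, 1)
    zeros = torch.zeros(batch_size, 1)

    # Discriminator update: tell real records apart from synthetic ones.
    fake = generator(torch.randn(batch_size, latent_dim)).detach()
    d_loss = bce(discriminator(real_batch), ones) + bce(discriminator(fake), zeros)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: fool the discriminator into labelling fakes as real.
    fake = generator(torch.randn(batch_size, latent_dim))
    g_loss = bce(discriminator(fake), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

In each step the discriminator is first trained to separate real from synthetic records, then the generator is trained to fool it; over many such steps the synthetic samples drift towards the real data distribution.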

The idea of synthetic data is to imitate the statistical properties of a real data set without revealing individual parts of it. The real data sets are replaced by synthetic ones that follow the same patterns. There are different approaches to generating synthetic data, for example Principal Component Analysis (PCA), autoencoders and generative models.
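As an illustration of the simplest of these approaches, the following sketch (assuming NumPy and scikit-learn; the function name and parameters are hypothetical) generates synthetic records via PCA: the real data is projected into a low-dimensional space, new latent points are sampled from a Gaussian fitted there, and the samples are mapped back to the original space:

```python
# Sketch of a simple PCA-based generator (scikit-learn assumed).
import numpy as np
from sklearn.decomposition import PCA

def pca_synthesize(real_data: np.ndarray, n_samples: int, n_components: int = 5):
    pca = PCA(n_components=n_components)
    latent = pca.fit_transform(real_data)

    # Principal components are uncorrelated by construction, so modelling
    # each one as an independent Gaussian is a crude but common approximation.
    mean, std = latent.mean(axis=0), latent.std(axis=0)
    sampled = np.random.normal(mean, std, size=(n_samples, n_components))

    return pca.inverse_transform(sampled)  # synthetic records
```

A linear method like this captures only pairwise correlations, which is exactly the limitation that motivates the more expressive autoencoder- and GAN-based generators discussed below.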

In order to evaluate which method generates data that can be used as input for both supervised and unsupervised models, Machine Learning Reply compared the performance of two generative models - the Variational Autoencoder (VAE) and the Wasserstein GAN with gradient penalty (WGAN-GP).
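The distinguishing feature of WGAN-GP is the gradient penalty that keeps the critic approximately 1-Lipschitz. The following sketch (assuming PyTorch) shows one common way to compute it; it is a generic illustration, not Machine Learning Reply's implementation:

```python
# Sketch of the WGAN-GP gradient penalty (PyTorch assumed): the critic is
# penalised when the gradient norm at points interpolated between real and
# synthetic samples deviates from 1, enforcing the Lipschitz constraint.
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    alpha = torch.rand(real.size(0), 1)  # random interpolation weights
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```

The penalty is added to the critic's loss; the weight lambda_gp = 10 is the value proposed in the original WGAN-GP paper (Gulrajani et al., 2017).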

In addition, Machine Learning Reply applied the k-nearest-neighbours (k-NN) algorithm to investigate the similarity between synthetic and real data, in order to determine which generative model produces entities that can be shared more securely.
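A check of this kind can be sketched as follows (assuming NumPy and scikit-learn; the function name and threshold are hypothetical): for each synthetic record, the distance to its nearest real neighbour is computed, and records lying too close to a real one are flagged as potential (near-)duplicates:

```python
# Sketch of a nearest-neighbour privacy check (scikit-learn assumed).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nearest_real_distances(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    # For each synthetic record, find the distance to the closest real record.
    nn = NearestNeighbors(n_neighbors=1).fit(real)
    distances, _ = nn.kneighbors(synthetic)
    return distances.ravel()

# Usage: very small distances indicate (near-)duplicates that could leak
# information about real individuals. The threshold is a data-dependent,
# assumed choice.
# flags = nearest_real_distances(real, synthetic) < threshold
```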

The big challenge here was to reproduce the statistical properties correctly while avoiding the generation of (near-)duplicates, as duplicates would reveal protected information. Only then can the procedure be used safely.

THE RESULT

Machine Learning Reply has evaluated the Wasserstein GAN method as promising for the data protection use case. It is well suited to reproducing correlations within real data sets. With the GAN method, greater overlaps were found between the synthetic and the real data than with the VAE method: the entities simulated by the VAE did not cover areas that were present in the real data.