How Synthetic Data is Revolutionising Industries

Revolutionising Industries with Synthetic Data: How Industries are Leveraging this Game-Changing Technology


It’s 2023, and AI has taken off over the last two years with the release of brand-new Large Language Models (LLMs). ChatGPT has captured the general public's imagination about what AI-generated data has to offer, and companies are now following suit. AI investment is forecast to grow at a compound annual growth rate of 37% (source) from 2023 to 2030. However, it is not always obvious how this technology will directly affect business, and LLMs have been accused of being nothing more than toys to experiment with.





Furthermore, the explosive growth of big data has brought industry several seemingly unrelated problems that have plagued business over the last decade. Regulation and privacy requirements mean that individuals must not be identifiable from data sets, but without these markers we often lose access to the data altogether. For example, sensitive data cannot be transferred beyond its initial repository, so large amounts of useful data are often wasted.


Changing market conditions often mean that AI models will lose effectiveness over time, as the data they have been trained on is no longer relevant. In addition, when the conditions do change, companies will have little to no data available for them to train on, as happened during the Covid period of 2020-21.


Finally, there is the problem of widespread biased datasets. We see this prominently where language models behave in a way that would be considered discriminatory on the basis of sex or religion. While we can use rule-based systems to avoid this behaviour, we still face the problem of training models on biased training sets. For example, the recruitment industry faces enormous issues in C-level hiring, where historically there has been a massive overrepresentation of men in these roles.


These problems, although they appear unlinked, can all be addressed with a technique from AI that has been around for decades: creating Synthetic Data. However, it is only with the appearance of models such as GPT-3 that Synthetic Data generation has become effective enough to solve these and many other problems.


What is Synthetic Data?

Synthetic data is generated data which mimics examples of real-world data. It is used as input for analysis and Machine Learning, either in combination with existing data or without any real-world data at all. This can be particularly useful when the data required for an analysis is very rarely observed in the real world, when the cost of acquiring it is too high, or when it is too difficult to acquire or process (e.g. due to privacy requirements and regulations).


Some methods for generating synthetic data include:


Rule-based Methods: This type of synthetic data is generated by a predefined set of rules. Rules can be set up to capture certain characteristics and relationships. However, this can be challenging: only a limited set of rules can be defined, and multiple rules can overlap or conflict with each other. Rule-based methods can also be time-consuming and require significant effort to set up.


Statistical Methods: Where the distribution and statistical patterns of the underlying data are known, we can use statistical methods such as Monte Carlo sampling to generate Synthetic Data. The quality of the produced data will generally depend on the subject matter expertise available in the given domain.


Machine Learning Methods: Synthetic data generated through this method requires ML algorithms which are trained on the original data to learn all its characteristics, correlations, and patterns. These ML algorithms are then used to generate completely new data points which reproduce the same characteristics from the original dataset, whilst maintaining a level of privacy.

Furthermore, with the recent advances in Generative AI, models such as Generative Pre-trained Transformer (GPT), DALL-E, and Stable Diffusion are also being used to create Synthetic Data. GPT is an ML model that is used to generate text, and DALL-E and Stable Diffusion are models that are used to generate images. Prompt engineering is helping organisations synthesise bespoke data points suitable for their use cases.
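To make the statistical approach above more concrete, here is a minimal sketch in Python. It assumes the real data is reasonably well described by a normal distribution, which is a strong simplification, and the function names and the spend figures are hypothetical, invented purely for illustration. It estimates the mean and standard deviation from a few real observations and then draws new values by Monte Carlo sampling.

```python
import random
import statistics


def fit_normal(values):
    """Estimate the mean and standard deviation of the observed values."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to `values`."""
    mu, sigma = fit_normal(values)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" observations (e.g. monthly spend), invented for illustration.
real_spend = [120.0, 95.5, 130.2, 110.8, 101.3, 125.9, 99.4, 118.6]
synthetic_spend = sample_synthetic(real_spend, n=1000)

print(f"real mean:      {statistics.mean(real_spend):.1f}")
print(f"synthetic mean: {statistics.mean(synthetic_spend):.1f}")
```

None of the synthetic values corresponds to a real record, yet the sample follows the same overall distribution. In practice the choice of distribution is where the domain expertise mentioned above comes in.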


How Does It Solve The Above Problems?

Considering the problems from the first section one-by-one then, we see how Synthetic Data can be used to confront these issues.

  1. Privacy Issues: For regulatory and ethical reasons, we cannot train our models on personal data from which individuals can be identified. A key benefit of Synthetic Data, used in the way described above, is that it provides artificial replacement data while removing the issues related to privacy. With advances in LLMs, this is becoming easier to accomplish, especially in areas previously out of reach such as unstructured documents.


  2. Changing Conditions: Synthetic data created according to statistical rules can be used to retrain Machine Learning models when they lose effectiveness. This is especially useful if we find ourselves in positions where there is insufficient data to begin with.


  3. Biased datasets: Synthetic data, especially with the use of LLMs, can be very effective in generating Synthetic versions of unstructured data where there is a lack of data corresponding to a particular characteristic. Again, this relies on Synthetic versions of data being as effective as real versions for machine learning training purposes. Synthetic data can then be used to overcome real-life biases, rather than reinforcing a damaging status quo that is counterproductive to the success of many businesses.
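As a rough illustration of the third point, the sketch below tops up an under-represented group with synthetic records until it matches the size of the majority group. It is written in Python and uses simple linear interpolation between random pairs of existing minority records. This is a simplified relative of oversampling techniques such as SMOTE, not a description of any particular production method, and the function names and feature values are hypothetical.

```python
import random


def interpolate_minority(samples, n_new, seed=0):
    """Create n_new synthetic points on the line segments between random pairs of samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # two distinct existing records
        t = rng.random()               # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


def rebalance(majority, minority, seed=0):
    """Top up the minority group with synthetic points until both groups are the same size."""
    shortfall = max(0, len(majority) - len(minority))
    return minority + interpolate_minority(minority, shortfall, seed)


# Hypothetical feature vectors: (years of experience, number of previous roles).
majority = [(20.0, 5.0), (18.0, 4.0), (25.0, 6.0), (22.0, 5.0), (19.0, 3.0), (30.0, 7.0)]
minority = [(21.0, 5.0), (17.0, 4.0), (26.0, 6.0)]

balanced_minority = rebalance(majority, minority)
print(len(majority), len(balanced_minority))
```

Interpolation only works for numeric features. For unstructured data such as documents, the generation step would instead be handled by a generative model, as described above.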


Use Case and Benefits

The market potential for synthetic data is growing as the need for large amounts of high-quality data for training machine learning models and other uses increases.

“Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI models.” – Gartner, June 2022

We consider all major industries to be ripe for the use of Synthetic Data to help solve business problems. Within each industry, Synthetic Data generation has more specific use cases, and we describe some of them here.

Within Finance, whether in banking, insurance, or any other related sector, there are many different opportunities to make use of Synthetic Data. We give a few use cases:

  1. Extreme event modelling: One of the biggest problems in risk management is working out the effects of highly unusual events. Whether the event is the financial crisis of 2007-8 or, more recently, Covid, working out its effects on people and businesses is central to assessing the risk of lending to these clients. The synthetic creation of transaction data is an incredibly powerful tool here, as we are then able to train our models with this additional data.
  2. Fraud Detection: Using generative adversarial networks, a form of Synthetic Data generation, we are able to create datasets that are, importantly, balanced. This enables classification models to work far more effectively than would otherwise be the case.
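The extreme event idea can be sketched with a toy simulation. The Python example below generates a synthetic series of daily balances for one hypothetical customer, once under normal conditions and once with an income shock applied for a fixed window. All figures (the income and spend distributions, the shock size and its timing) are assumptions chosen purely for illustration, and a real stress scenario would be calibrated against observed data.

```python
import random


def simulate_balance(n_days, shock_start=None, shock_length=0, income_factor=1.0, seed=0):
    """Return a list of daily balances for one hypothetical customer.

    Income on days inside the shock window is multiplied by income_factor.
    """
    rng = random.Random(seed)
    balance = 0.0
    history = []
    for day in range(n_days):
        income = max(0.0, rng.gauss(100.0, 10.0))  # assumed daily income
        spend = max(0.0, rng.gauss(90.0, 10.0))    # assumed daily spend
        in_shock = (
            shock_start is not None
            and shock_start <= day < shock_start + shock_length
        )
        if in_shock:
            income *= income_factor
        balance += income - spend
        history.append(balance)
    return history


# Same seed, so both runs share the same underlying random draws.
baseline = simulate_balance(n_days=180)
stressed = simulate_balance(n_days=180, shock_start=60, shock_length=30, income_factor=0.4)

print(f"baseline final balance: {baseline[-1]:.0f}")
print(f"stressed final balance: {stressed[-1]:.0f}")
```

Because both runs use the same seed, the two series are identical up to the start of the shock, which makes the effect of the stress scenario easy to isolate. Series like the stressed one could then be added to the training data of a risk model.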

Synthetic Data also has many use cases within the Retail industry. Some notable use cases are:

  1. Dynamic Content Generation: Synthetic Data can be used to generate content (e.g. marketing, social media etc.) that can be tailored to the needs of each consumer group. This helps retailers provide a more personalised experience to their customers and, more importantly, increase customer engagement.
  2. Behaviour Analysis for Next action: Synthetic Data can be used to simulate known or specific customer behaviour and preferences, enabling retailers to develop models which tailor the retailer’s next action specifically to the individual customer.

How Data Reply is Innovating with Synthetic Data

Data Reply has been at the forefront of innovation using Synthetic Data. We have created a number of solutions addressing some of the most common problems organisations are facing today which we believe synthetic data can help to address.




Synthetic Images: Generating synthetic images for design testing

Creating customised designs for products can be a time-consuming and costly process for organisations.


Through the use of Stable Diffusion, we were able to generate new designs for perfume bottles based on a single input of an existing picture and a descriptive text prompt without altering the contextual image through masking.

This is a huge advantage for companies, as it provides them with a competitive edge over market competitors and improves operational efficiency. As a result, organisations can now quickly innovate and ideate on their design process, resulting in faster product-to-market.




Prompt: [Input Image] + "Add a Men & Women crown lid on a perfume bottle"


Synthetic CV: Unconscious bias mitigation in recruitment through Synthetic Data

Throughout the corporate world, organisations have struggled to improve the representation of non-majority groups so that it aligns more closely with the make-up of the wider population. For example, as previously mentioned in this post, there have been strong efforts over the past decades to increase the number of women in C-level positions, with mixed levels of success.


Aside from more overt forms of discrimination, one factor holding back the equalisation of roles according to gender (or race, religion, age etc.) is the presence of so-called ‘unconscious bias’. For individuals, this manifests as implicitly viewing a person in a positive light because of traits that aren’t relevant to the job at hand and have more to do with shared characteristics such as race or culture. As an example, we see this in the statistical overrepresentation of one sex at board level.

This also impacts the effectiveness of AI in the recruitment process. For example, a model trained on historical CVs to automatically score potential candidates could score candidates higher based on their sex, simply due to the imbalance in the data.


Using OpenAI’s GPT-3 model, Data Reply has developed a solution that generates CVs synthetically to correct this imbalance in a pool of existing CVs, thus helping to stabilise existing CV recommendation systems used by recruiters, which may be influenced by real-world unconscious biases.





Synthetic image restoration: Augmenting low-res photos using synthetic data

Low-quality images (due to blur or low resolution) can be problematic for some workflows, because these may depend on the precision of the image to make the final decision.


For one of its computer vision-powered accelerators, Data Reply has used a framework based on Deep Learning techniques that can augment low-res photos of damaged cars. It increases the resolution of the original photo by 400%, unblurs it, and enhances information which may have been degraded or lost due to other artefacts such as noise.


Conclusion and Further Innovation

Based on our current research and discussions with industry experts who are concerned with higher risk, lower data availability and simulation, we conclude that synthetic data, with the right approach and coupled with sophisticated technologies, can support businesses in cost-effectiveness, operational efficiency, better performance, risk simulation, privacy and bias mitigation. At Data Reply we continue both to research and to work jointly with clients to support them through synthetic data. Some of the work streams we are focusing on are listed below:

  • Data Anonymisation and Moderation
  • Data Synthesis/augmentation
  • Synthetic data for better ML models
  • Simulation for safety and risk



Data Reply is a Reply Group company and a premier AWS partner, offering a broad range of advanced analytics, AI/ML and data processing services. We operate across different industries and business functions, enabling our customers to achieve meaningful business outcomes through effective use of data, accelerating innovation and time to value.



Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centres globally. Millions of customers—including the fastest-growing start-ups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.