Journey to a Highly Scalable Datalake


The customer, a large automotive company, manages a huge, scattered IT landscape across many different departments. Data is produced and collected by a variety of – often legacy – applications, databases and systems. Some of the data dates back more than 30 years and comes in many different formats. The customer began its journey to centralize reporting and data science efforts with an on-premise datalake which Data Reply built in 2016 and has been operating since. That solution, based on technology that was state of the art at the time – mainly Cloudera Hadoop and Apache Kafka – lacked the scalability and flexibility that the cloud has increasingly offered in the Big Data area in recent years. Consequently, the decision was made to migrate the existing solution to a cloud-native datalake – again trusting the domain expertise Data Reply has built through its ongoing efforts to keep up with the Big Data software landscape.

Migrating to AWS

Data Reply was tasked with building a datalake on AWS for the customer. First, this meant setting up a centralized data storage and management solution based on Amazon Simple Storage Service (S3) and migrating the data that Data Reply had collected on-premise in the Apache Hadoop Distributed File System (HDFS). In keeping with industry best practices, the data is organized in layers, starting with a landing area where the team ingests data – mainly using Kinesis and Apache NiFi, depending on the source – often in the native formats produced by the systems of origin. ETL pipelines then transform the data into a smaller number of agreed-upon formats. The pipelines use Data Reply's custom-built solutions for masking sensitive information and enriching output data into a convenient, use-case-ready form. Finally, the transformed data is placed in a last layer which Data Reply calls the "datahub". Various other AWS accounts were set up to access use-case-specific data while allowing the team to track costs on a per-use-case basis.
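The layer convention described above can be sketched as a key-naming scheme for S3 objects. The prefix names (`landing`, `refined`, `datahub`) and the date partitioning below are illustrative assumptions, not the customer's actual bucket layout:

```python
from datetime import date

# Illustrative layer prefixes; the real layout is organized per source system.
LAYERS = ("landing", "refined", "datahub")

def s3_key(layer: str, source_system: str, dataset: str,
           filename: str, day: date) -> str:
    """Build an S3 object key following a layered, date-partitioned scheme."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return (f"{layer}/{source_system}/{dataset}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
            f"{filename}")

key = s3_key("landing", "erp", "orders", "batch-001.csv", date(2021, 3, 14))
print(key)  # landing/erp/orders/year=2021/month=03/day=14/batch-001.csv
```

A `year=/month=/day=` partitioning style like this also lets downstream query engines such as Athena prune partitions efficiently.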

Which services are used?

In line with the customer's wishes, Data Reply leverages serverless solutions wherever possible. The team uses S3 for storage and AWS Glue for Spark-based ETL pipelines, which Data Reply organizes in workflows. Athena serves as the main SQL query solution. As a newer addition to the datalake's toolset, QuickSight is offered to BI analysts. Meanwhile, data scientists can spin up their own EMR clusters and use whatever tools they deem appropriate for their tasks.

Data Reply provisions the infrastructure with CloudFormation through Sceptre. To manage the configuration of the Glue workflows and jobs, the team developed its own configuration service, which is triggered by data uploads to the team's S3 buckets. Using base configurations stored in AWS Systems Manager, the service computes, from the size of the uploaded data, the number of Data Processing Units (DPUs) necessary to complete the ETL processes run on top of it. This allows the team to avoid overprovisioning resources and thus keep costs as low as possible.
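The sizing logic of such a configuration service can be sketched as follows. The per-DPU throughput figure and the min/max bounds are hypothetical stand-ins for the base configurations the service reads from Systems Manager:

```python
import math

# Hypothetical base configuration; the real values live in AWS Systems
# Manager and would differ per workflow and job.
BASE_CONFIG = {
    "gb_per_dpu": 4.0,   # data volume one DPU is assumed to handle
    "min_dpus": 2,       # lower bound for the allocation
    "max_dpus": 100,     # cost guardrail
}

def estimate_dpus(input_size_bytes: int, config: dict = BASE_CONFIG) -> int:
    """Compute a Glue DPU allocation from the input data size,
    clamped between the configured minimum and maximum."""
    size_gb = input_size_bytes / (1024 ** 3)
    dpus = math.ceil(size_gb / config["gb_per_dpu"])
    return max(config["min_dpus"], min(dpus, config["max_dpus"]))

print(estimate_dpus(50 * 1024 ** 3))  # 50 GB / 4 GB per DPU -> 13
```

Clamping to a configured maximum is what keeps a surprise bulk upload from turning into a surprise bill.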

Data Reply also uses managed services for Redis and Elasticsearch. These managed solutions back Data Reply's masking service and the functional monitoring of the respective ETL pipelines.

How the accounts are set up

The centerpiece of the Data Reply datalake is a main AWS account where data is kept in S3, separated into buckets based on the source system. In this main account Data Reply also runs the AWS Glue ETL pipelines that take care of the central data preparation tasks required by most use cases – the most important being the masking of sensitive information, for example to comply with the GDPR. By keeping these preparation tasks in the core datalake, Data Reply avoids confidentiality breaches while remaining flexible enough to provide cleartext data to those who have a legitimate business case and the necessary permissions to view it – the key being a corresponding unmasking service.
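Data Reply's masking service is custom-built, so the following is only a minimal sketch of one common approach: keyed deterministic pseudonymization. The secret key is an assumption, and the reverse lookup an unmasking service would consult (kept in Redis, say) is modeled here as a plain dict:

```python
import hmac
import hashlib

SECRET_KEY = b"example-masking-key"  # assumption: held in a secrets store

# Stand-in for the lookup table an unmasking service would keep (e.g. Redis).
reverse_index = {}

def mask(value: str) -> str:
    """Deterministically pseudonymize a value: identical cleartext always
    yields the same token, so masked columns remain joinable."""
    token = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]
    reverse_index[token] = value  # only the unmasking service may read this
    return token

def unmask(token: str) -> str:
    """Recover cleartext for callers with a legitimate business case."""
    return reverse_index[token]

t1 = mask("jane.doe@example.com")
t2 = mask("jane.doe@example.com")
assert t1 == t2                               # deterministic: joins still work
assert unmask(t1) == "jane.doe@example.com"   # privileged reverse lookup
```

Determinism is the point of this scheme: downstream use cases can still join and aggregate on masked columns without ever seeing cleartext.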

This central datalake account is accompanied by a number of use-case-specific accounts. Best practices for AWS cross-account access are used to provide each use case account with read-only access to the required data. Implementers of business case logic can then choose their tooling based on their needs while avoiding a cluttered and hard-to-manage infrastructure in the core datalake account. As a side benefit, this frees the core datalake operations team from having to deploy and operate use-case-specific solutions and pipelines.
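One building block of such cross-account access is an S3 bucket policy granting a use case account read-only permissions. The helper below builds such a policy document; the bucket name and account ID are placeholders, and a real setup would combine this with IAM roles in the use case accounts per AWS's documented cross-account patterns:

```python
import json

def readonly_bucket_policy(bucket: str, consumer_account_id: str) -> dict:
    """Build an S3 bucket policy granting one use case account
    read-only (list + get) access to a datalake bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "UseCaseReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{consumer_account_id}:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",        # needed for ListBucket
                f"arn:aws:s3:::{bucket}/*",      # needed for GetObject
            ],
        }],
    }

policy = readonly_bucket_policy("example-datahub-bucket", "111122223333")
print(json.dumps(policy, indent=2))
```

Granting only `GetObject` and `ListBucket` keeps use case accounts strictly consumers: they can read prepared data but never write back into the core datalake.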

Advantages of the new solution

  • Scalable by design. As detailed above, Data Reply sticks to serverless solutions wherever possible, leveraging the full power of the AWS cloud.
  • Centralized data storage and ETL. Data Reply has a full view of all ingested data and can provide fine-grained access to anyone within the company looking to create business value, without having to gather data from a variety of places and file formats.
  • Flexibility. While Data Reply offers templates and guidance for data scientists looking to leverage the datalake, it does not matter to the Data Reply team which tools are used to run business-case-specific code on top of the data. Data Reply offers data in modern, widely used formats like Parquet and Avro while protecting sensitive information by design.


    Data Reply is the Reply group company offering a broad range of advanced analytics and AI-powered data services. We operate across different industries and business functions, enabling our clients to achieve meaningful outcomes through the effective use of data. We have strong competences in Big Data Engineering, Data Science and IPA; we build Big Data platforms and implement ML and AI models in a manner that is repeatable, efficient, scalable, simple and yet secure. We support companies in combinatorial optimization processes with Quantum Computing techniques, delivering engines with high computational performance.