Modular Ingestion Framework for DataHub Import and Management

Building a centralized Data Hub and a modular framework for managing and modelling data

Scenario

Technology Reply supports the customer in all phases of building a centralized Data Lake: design, implementation, delivery of all required services, and preparation of the data for querying. The goal is to manage data belonging to different banks efficiently within a single Data Platform, so that all customers get the same functionality.

In particular, the centralized data lake, named Data Hub, contains data received from different sources. Each user can therefore work in a single environment for consulting and querying data.


The Data Hub provides several services across different use cases:

  • Data Processing and Modelling
  • A user interface for adding new data
  • Analytics tools for querying data

The framework is centralized, modular and configurable. Through cloud services, it allows:

  • Receiving data and processing it through event scheduling
  • Processing large amounts of data
  • Modelling information so that it is easily accessible through analytics tools

The framework includes different processing layers, each with a different goal.

  • First Layer: Receiving and checking data to automatically identify source-related issues
  • Second Layer: Formal and technical checks, and assignment of data types
  • Third/Core Layer: Applying historical logic
  • Fourth/Modeled Layer: Containing all the models useful for querying
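
The four layers can be sketched as a chain of small functions. This is only an illustration in plain Python, not the framework's actual code: the schema, the function names and the sample records are hypothetical, and "historical logic" is assumed here to mean keeping dated versions of each record.

```python
from datetime import date

# Hypothetical schema: field name -> Python type used for casting.
SCHEMA = {"account_id": str, "balance": float}


def first_layer(records):
    """Receive records and set aside those with missing fields (source issues)."""
    accepted, rejected = [], []
    for rec in records:
        target = accepted if all(k in rec for k in SCHEMA) else rejected
        target.append(rec)
    return accepted, rejected


def second_layer(records):
    """Formal/technical checks: cast each field to its declared type."""
    return [{k: cast(rec[k]) for k, cast in SCHEMA.items()} for rec in records]


def third_layer(history, records, load_date):
    """Core layer (assumed logic): close the open version of a changed record
    and open a new one."""
    for rec in records:
        current = next(
            (h for h in history
             if h["account_id"] == rec["account_id"] and h["valid_to"] is None),
            None,
        )
        if current is not None:
            if current["balance"] == rec["balance"]:
                continue  # nothing changed, keep the open version
            current["valid_to"] = load_date
        history.append({**rec, "valid_from": load_date, "valid_to": None})
    return history


def fourth_layer(history):
    """Modeled layer: expose only the current balance of each account."""
    return {h["account_id"]: h["balance"] for h in history if h["valid_to"] is None}


# Two hypothetical daily deliveries; the second record of day 1 is incomplete.
raw_day1 = [{"account_id": "A1", "balance": "100.5"}, {"account_id": "A2"}]
raw_day2 = [{"account_id": "A1", "balance": "80"}]

history = []
accepted, rejected = first_layer(raw_day1)
history = third_layer(history, second_layer(accepted), date(2024, 1, 1))
accepted2, _ = first_layer(raw_day2)
history = third_layer(history, second_layer(accepted2), date(2024, 1, 2))
model = fourth_layer(history)
```

In this sketch the incomplete record is stopped at the first layer, the core layer keeps both versions of account A1, and the modeled layer exposes only the latest balance.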

These layers are implemented with different technologies, chosen according to the needs and goal of each use case. In particular, the tools are:

  • Postgres for relational data
  • Key-value technology
  • Hadoop for large amounts of data

The framework has a modular structure: each component is a separate module with a specific goal. This modular approach makes the solution easier to extend and new functionality easier to integrate. Each module is implemented with Open Source technologies, such as PySpark, to ensure reusability. The entire framework is driven by centralized metadata structures, stored in a Postgres database, which speed up the creation of new processing flows and make the processes easy to control.
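
A metadata-driven flow of this kind might look like the following minimal sketch. It is an assumption-laden illustration, not the framework itself: the `flow_step` table, the module names and the `run_flow` helper are hypothetical, the modules are plain Python functions rather than PySpark jobs, and an in-memory SQLite database stands in for Postgres so that the example is self-contained.

```python
import sqlite3


# Hypothetical modules: each takes a list of rows and returns a new list.
def drop_empty(rows):
    return [r for r in rows if r.strip()]


def to_upper(rows):
    return [r.upper() for r in rows]


# Registry mapping the module names stored in the metadata to their code.
MODULES = {"drop_empty": drop_empty, "to_upper": to_upper}

# In-memory SQLite stands in for the Postgres metadata database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flow_step (flow TEXT, position INTEGER, module TEXT)")
conn.executemany(
    "INSERT INTO flow_step VALUES (?, ?, ?)",
    [("customers", 2, "to_upper"), ("customers", 1, "drop_empty")],
)


def run_flow(conn, flow, rows):
    """Run the modules listed for `flow` in the metadata table, in order."""
    steps = conn.execute(
        "SELECT module FROM flow_step WHERE flow = ? ORDER BY position", (flow,)
    ).fetchall()
    for (module_name,) in steps:
        rows = MODULES[module_name](rows)
    return rows


result = run_flow(conn, "customers", ["alice", " ", "bob"])
```

With this design, adding a step to a flow means inserting a row into the metadata table; the runner itself does not change.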

Advantages

The platform can be implemented through Cloud Native Services giving the following advantages:

  • Better management of reliability and service availability
  • Right-sized infrastructure with pay-per-use pricing

The framework's modularity makes it easy to extend with new add-ons depending on our customers' needs. Possible extensions are:

  • Data quality
  • Encrypted information and GDPR
  • Flow integration

Metadata structures, used to drive the processes, can be queried to easily build Data Lineage: in particular, it is possible to obtain the mode and the transformations applied during the creation of the different models.
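
As a rough illustration of how lineage could be read from such metadata, the sketch below walks a table of transformation steps backwards from a model to its sources with a recursive query. The `transformation_step` table, its columns, the dataset names and the `lineage_of` helper are all hypothetical, and in-memory SQLite is used in place of Postgres to keep the example self-contained.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres metadata database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transformation_step (source TEXT, target TEXT, transformation TEXT)"
)
conn.executemany(
    "INSERT INTO transformation_step VALUES (?, ?, ?)",
    [
        ("raw_accounts", "typed_accounts", "cast types"),
        ("typed_accounts", "core_accounts", "apply history"),
        ("core_accounts", "balance_model", "aggregate by customer"),
        ("raw_customers", "typed_customers", "cast types"),
    ],
)

# Recursive query: start from the steps that produce the model,
# then keep adding the steps that produce each of their sources.
LINEAGE_SQL = """
WITH RECURSIVE lineage(source, target, transformation) AS (
    SELECT source, target, transformation
    FROM transformation_step
    WHERE target = ?
    UNION
    SELECT s.source, s.target, s.transformation
    FROM transformation_step AS s
    JOIN lineage AS l ON s.target = l.source
)
SELECT source, target, transformation FROM lineage
"""


def lineage_of(conn, model):
    """Return every (source, target, transformation) step upstream of `model`."""
    return conn.execute(LINEAGE_SQL, (model,)).fetchall()


steps = lineage_of(conn, "balance_model")
upstream_sources = sorted(source for source, _, _ in steps)
```

Here the lineage of the hypothetical `balance_model` contains the three account steps, while the unrelated customer step is left out.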


Solution

Building the Data Hub, including the framework described, makes it more intuitive to manage the processes and to create the models users need for their analysis. Technology Reply supports the customer in all phases, from platform analysis and process identification to implementation.