Custom integration framework

Scenario

Technology Reply has designed a metadata-driven framework, based on YAML configuration files, that guides the process of integrating and orchestrating data on a cloud-based database. Adopting a custom framework offers greater flexibility and scalability, making it easy to adapt to changes in data sources, required transformations, and business requirements.

Metadata management

The core of the metadata-driven framework is the Metadata Catalog, derived from the YAML configuration files and consolidated into a centralized repository that stores all metadata related to data sources, schemas, transformations, and destinations. This repository serves as a comprehensive and up-to-date archive of the information needed to manage and orchestrate data integration processes, and it provides a structured way to store this metadata so that it remains easily accessible and manageable.
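
As an illustration, a single catalog entry of this kind might describe a source, its schema, and its destination in one YAML document. The sketch below uses purely hypothetical field names, not the framework's actual schema.

```yaml
# Hypothetical metadata catalog entry; field names are illustrative only
source:
  name: crm_orders
  type: oracle              # origin system
  ingestion_mode: batch
schema:
  columns:
    - {name: order_id,    type: number, nullable: false}
    - {name: customer_id, type: number, nullable: false}
    - {name: order_date,  type: date,   nullable: false}
    - {name: amount,      type: "decimal(18,2)", nullable: true}
destination:
  layer: standard           # first storage layer
  table: std_crm_orders
transformations:
  - model: sql/orders_cleaning.sql   # SQL model applied downstream
```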

Data pipelines and ingestion

To manage the data flow effectively, the framework implements robust ingestion pipelines. Guided by the YAML files, these pipelines handle both real-time and batch loading, ensuring that data arrives in a timely and orderly manner, ready for further processing. Once collected, the raw data is inserted into the first storage layer, known as the Standard Layer: a temporary repository where data is organized according to the structures defined in the YAML files and made promptly available for the subsequent processing stages.
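
A hypothetical ingestion configuration might declare each pipeline with its loading mode, source, and target layer; the keys below are illustrative assumptions, not the framework's actual syntax.

```yaml
# Hypothetical ingestion pipeline definitions (illustrative only)
pipelines:
  - name: orders_batch_load
    mode: batch
    source: crm_orders
    schedule: "0 2 * * *"        # nightly load
    target_layer: standard
    load_strategy: incremental   # e.g. by modification timestamp
  - name: clickstream_stream_load
    mode: streaming
    source: web_clickstream
    target_layer: standard
    checkpoint_interval: 5m
```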

Processing and transformation

The framework can integrate with data processing engines (various databases, cloud or on-premise) to perform complex transformations. This layer turns raw data into useful information by applying cleaning and transformation processes, and it ensures that transformed data always complies with the current specifications. SQL models describe the transformation operations to be applied, so that processes are executed consistently and in accordance with business standards.
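
One plausible way to tie the YAML configuration and the SQL models together is a declaration that names a transformation model and embeds the SQL it applies; the structure below is a sketch under that assumption, not the framework's actual model format.

```yaml
# Hypothetical transformation model declaration (illustrative only)
models:
  - name: orders_cleaned
    source_layer: standard
    target_layer: processed
    materialization: table
    sql: |
      SELECT order_id,
             customer_id,
             TRUNC(order_date) AS order_date,
             ROUND(amount, 2)  AS amount
      FROM   std_crm_orders
      WHERE  order_id IS NOT NULL
```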

Data quality

Data quality is essential to ensure the reliability of analyses. The framework therefore includes data validation modules that examine the data against the quality criteria defined in the YAML files. Validation rules may include checks on data types and specific formats, removal of duplicates, and functional checks. These rules are fully governable and manageable within SQL models, ensuring that validation is consistent and accurate. Monitoring and logging tools are integrated to track transformation operations and quickly identify any issues.
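
For example, validation rules of this kind could be declared in YAML and then compiled into SQL checks; the rule types and keys below are hypothetical.

```yaml
# Hypothetical data quality rules (illustrative only)
quality_checks:
  - table: orders_cleaned
    rules:
      - type: not_null
        columns: [order_id, customer_id]
      - type: unique                # duplicate detection
        columns: [order_id]
      - type: format
        column: order_date
        pattern: "YYYY-MM-DD"
      - type: functional            # business rule
        expression: "amount >= 0"
    on_failure: discard_and_report  # matches the workflow described below
```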

Data storage and access

Once transformed, data must be stored in an organized and accessible manner. The framework organizes data into different layers (raw data, processed data, analytics-ready data), with the YAML files determining how data in each layer is organized and stored, which simplifies access and management.
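
A minimal sketch of how the layers could be declared in YAML, with illustrative names and retention settings:

```yaml
# Hypothetical storage layer definitions (illustrative only)
layers:
  - name: standard        # raw, as-ingested data
    retention_days: 30
  - name: processed       # cleaned and conformed data
    retention_days: 365
  - name: analytics       # analytics-ready data marts
    retention_days: 730
```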

Schema evolution

Through the YAML configuration files, the creation of database structures and their evolution over time are managed automatically.
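
As an example, adding a column to a table definition in the YAML could be enough for the framework to evolve the corresponding database structure; the snippet below is purely illustrative.

```yaml
# Hypothetical schema definition: adding "channel" to the YAML could be
# translated by the framework into an ALTER TABLE on the target database
table: std_crm_orders
columns:
  - {name: order_id,    type: number, nullable: false}
  - {name: customer_id, type: number, nullable: false}
  - {name: order_date,  type: date,   nullable: false}
  - {name: channel,     type: varchar2(20), nullable: true}   # newly added column
```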

Orchestration

To manage dependencies between data pipelines and schedule job execution, the framework uses a workflow orchestrator that handles scheduling and retry management in case of failures, and ensures that all parts of the process work in sync.
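
A hypothetical orchestration definition might express the dependency chain, schedule, and retry policy directly in YAML:

```yaml
# Hypothetical dependency and scheduling definition (illustrative only)
jobs:
  - name: ingest_orders
    schedule: "0 2 * * *"
  - name: validate_orders
    depends_on: [ingest_orders]
    retries: 3
    retry_delay: 10m
  - name: transform_orders
    depends_on: [validate_orders]
  - name: publish_analytics
    depends_on: [transform_orders]
```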

Automation

Automation is key to ensuring the efficiency and scalability of the framework. A Continuous Integration/Continuous Deployment (CI/CD) pipeline is integrated to automate the deployment and updating of the framework's components.
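
A minimal sketch of such a pipeline, assuming a GitHub Actions-style runner and two hypothetical deployment scripts, might look like this:

```yaml
# Minimal CI/CD sketch (GitHub Actions-style runner assumed; scripts are hypothetical)
name: deploy-framework
on:
  push:
    branches: [main]
jobs:
  validate-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate YAML configurations
        run: python scripts/validate_metadata.py config/   # hypothetical validation script
      - name: Deploy pipelines and SQL models
        run: python scripts/deploy.py --env production     # hypothetical deployment script
```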

Framework workflow

- Raw data is inserted into the first storage layer (Standard Layer).
- Validation modules verify data quality; non-compliant data is discarded and reported.
- The necessary transformations are applied to the data within the underlying schema, ensuring data integrity.
- The transformed data is loaded into the subsequent storage layers and prepared for use by analytics applications.
- The orchestrator handles pipeline execution, dependency management, and scheduling, while CI/CD ensures efficient management and deployment of resources.

Advantages

  • Flexibility: YAML configuration files allow the framework to adapt easily to changes in data sources and business requirements.

  • Scalability: Project-based YAML configuration files support efficient management of large data volumes and horizontal scaling.

  • Reliability: Continuous monitoring and error management based on system tables keep processes reliable.

  • Efficiency: Optimization of cloud resources and process automation, driven by the guided configurations, reduces costs.

  • Data Governance: Centralized process lineage improves control and traceability of data.