A new paradigm for Data Management: Data Mesh

Design a modern data platform at scale based on the decentralization of Data Governance

The key features

Why apply Data Mesh

The data mesh can help organizations whose internal business areas are highly decentralized: its structure allows different teams to manage their own data and expose to the rest of the organization only quality data, packaged as a product.

Implementing a Data Mesh approach gives domain teams greater independence in modeling and governing their data while still adhering to policies applied at the global level.

Having independent teams and data domains makes it possible to create data products faster to offer to the business.

With a data mesh, each piece of data is the responsibility of a single node, which makes it possible to protect its content and monitor its usage.

Increase in Data Ownership

In many organizations, establishing a "single source of truth" or "authoritative data source" is challenging because data is repeatedly extracted and transformed without clear ownership responsibilities for the newly created data.
By adopting a data mesh approach, the authoritative data source is the Data Product published by the originating domain, with a clearly assigned data owner responsible for that data.

Greater scalability

One of the main advantages of the data mesh is that it distributes the central data team and its surrounding knowledge across domain teams, each with its own expertise. This allows domain teams to deliver optimal business value within their areas of expertise.

The lifecycle of a data product adheres to agile principles of being short and iterative, to provide quick and incremental value to data consumers.

Our approach

1 - Make the data addressable and easily identifiable

In a data mesh solution, data access must be standardized with common rules across the various domains so that data is easily accessible. Data stored in a data lake should be accessible via REST APIs that all share the same format; data stored within a database can be exposed through schemas and views defined according to standardized naming conventions. The data platform team handles this phase, still using a centralized approach.
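A minimal sketch of what such a standardized addressing convention could look like. The URL pattern, base address, and domain/product names below are hypothetical illustrations, not part of any specific platform:

```python
# Sketch of a uniform addressing scheme for data products across domains.
# Pattern and names are hypothetical: <base>/data-products/<domain>/<product>/<version>

def data_product_url(base: str, domain: str, product: str, version: str = "v1") -> str:
    """Build the standard REST address for a data product."""
    return f"{base}/data-products/{domain}/{product}/{version}"

# Every domain exposes its products through the same pattern, so consumers
# can locate any product without domain-specific knowledge.
url = data_product_url("https://data.example.com", "sales", "daily-orders")
print(url)  # https://data.example.com/data-products/sales/daily-orders/v1
```

Because the pattern is identical across domains, the centralized platform team only has to define it once; each domain fills in its own names.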

2 - Use metadata and data catalog (discoverability)

Improve the metadata and add a data catalog for discovery, so that anyone in the organization can find and "purchase" any data product. A single point is needed to search, discover, and "purchase" data within the company, along with a way for data owners and consumers to request and grant access to data products without involving a central team. At this stage, work focuses on the features of data products: adding tests for data quality, lineage, monitoring, and so on.
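The discoverability idea can be sketched as a small in-memory catalog: domains register metadata about their products, and consumers search it by keyword. The entry fields and example values are hypothetical; a real catalog would be a dedicated product or service:

```python
# Minimal sketch of a data catalog: domains register product metadata,
# consumers search it. Fields and example values are hypothetical.

from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    domain: str
    owner: str           # the accountable data owner for this product
    description: str
    tags: list = field(default_factory=list)

class DataCatalog:
    def __init__(self):
        self.entries = []

    def register(self, entry: CatalogEntry) -> None:
        """A domain team publishes its product's metadata."""
        self.entries.append(entry)

    def search(self, keyword: str) -> list:
        """Keyword search over names, descriptions, and tags."""
        k = keyword.lower()
        return [e for e in self.entries
                if k in e.name.lower()
                or k in e.description.lower()
                or any(k in t.lower() for t in e.tags)]

catalog = DataCatalog()
catalog.register(CatalogEntry("daily-orders", "sales", "sales-team@example.com",
                              "Cleaned daily order totals", ["orders", "finance"]))
print([e.name for e in catalog.search("orders")])  # ['daily-orders']
```

The key point is that registration and search are self-service: no central team sits between the producing domain and the consumer.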

3 - Implement domain-driven design by breaking down the monolithic approach

Ownership should be assigned to the domain team that creates the data, moving to a decentralized model. Each team must own its own data resources, ETL pipelines, quality, testing, etc. Federated governance is still needed for the standardization, security, and interoperability of the data. Once all of this is in place, it is advisable to expose these functionalities as services to create a self-service platform. At this stage, DataOps practices can be introduced, and observability and self-service capabilities can be improved.
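Domain-owned quality could look like the sketch below: the producing team runs its own checks before publishing a batch, instead of relying on a central team. The rules, field names, and sample rows are hypothetical:

```python
# Sketch of domain-owned data-quality checks, run by the producing team
# before a batch is published as part of its data product.
# Rules, field names, and sample data are hypothetical.

def check_quality(rows, required_fields, non_negative_fields=()):
    """Return a list of human-readable violations; an empty list means the batch passes."""
    violations = []
    for i, row in enumerate(rows):
        for f in required_fields:
            if row.get(f) in (None, ""):
                violations.append(f"row {i}: missing required field '{f}'")
        for f in non_negative_fields:
            value = row.get(f)
            if isinstance(value, (int, float)) and value < 0:
                violations.append(f"row {i}: negative value in '{f}'")
    return violations

batch = [
    {"order_id": "A1", "amount": 120.0},
    {"order_id": "", "amount": -5.0},   # fails both checks
]
issues = check_quality(batch, required_fields=["order_id"],
                       non_negative_fields=["amount"])
print(issues)
```

Federated governance would then standardize *which* checks every product must pass, while each domain implements and runs them itself.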

Fundamental factors to consider before betting on Data Mesh

Which components to use

The most common way to store data in a Data Mesh is a data lake, where data products are addressable via URL and it is possible to manage access control, versioning, encryption, metadata, and observability. Another option is data warehouses: modern data warehouses have improved significantly and, thanks to Serverless Analytics, require no maintenance; they can simply be used without worrying about the operational burden.

In a Data Mesh, data processing is encapsulated within data products. There are no central ETL pipelines; pipelines are instead hidden within each domain, so the tools already in use can be kept. Important activities such as data quality, data cleaning, and data processing are still present, but they are no longer centralized: they are managed by different teams under federated governance.

A data catalog is an organized inventory of the data resources available within the organization. It uses metadata to help organizations manage their data. Additionally, it assists data professionals in collecting, organizing, accessing, and enriching metadata to support data discovery and governance.

The data stored by each domain must be queryable so that aggregated products can be created regardless of the technology used to store the data. There can be a fast data layer in a relational database and historical data in data lake storage, but data consumers should be able to query both without being aware of the implementation details.
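Hiding the storage technology behind a uniform interface can be sketched as follows: one product is served from a "fast" store and another from a "historical" lake store, but consumers call the same method either way. The class names and sample data are hypothetical:

```python
# Sketch of a uniform read interface over heterogeneous storage backends.
# Class names and sample data are hypothetical.

class FastStoreReader:
    """Stands in for a relational database serving recent ('fast') data."""
    def __init__(self, tables): self.tables = tables
    def read(self, product): return self.tables[product]

class LakeStoreReader:
    """Stands in for data lake storage holding historical data."""
    def __init__(self, files): self.files = files
    def read(self, product): return self.files[product]

class FederatedQuery:
    """Routes each product to its backing store; consumers never see which one."""
    def __init__(self): self.routes = {}
    def register(self, product, reader): self.routes[product] = reader
    def read(self, product): return self.routes[product].read(product)

fq = FederatedQuery()
fq.register("recent-orders", FastStoreReader({"recent-orders": [{"id": 1}]}))
fq.register("orders-2020", LakeStoreReader({"orders-2020": [{"id": 99}]}))
print(fq.read("recent-orders"))  # served from the fast store
print(fq.read("orders-2020"))    # served from the lake store
```

From the consumer's point of view, `fq.read(...)` is identical for both products; the storage decision stays an internal concern of each domain.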

Data analysts are the most common data consumers; they tend to be less technical and use different tools. BI tools are extensively used within organizations to gain insights and make better decisions.