August 30, 2024
A guest post from Fabrício Ceolin, DevOps Engineer at Comet. Inspired by the growing demand…
When I first heard about the data mesh, I was sceptical. As I dug deeper into the topic, I realized it was a breath of fresh air where data architecture finally reunited with its long-lost relative: business domain ownership. — Stan Christiaens, co-founder of Collibra
This article will provide a comprehensive guide to data mesh tools across the different stages of your data pipelines. In my previous article, I explained the Data Fabric and Data Mesh approaches, their architecture, management, and boundaries as the new data governance trends, partly metaphorically and partly with examples, and described the difficulties in traditional methods that these two complementary approaches emerged to overcome.
In the intricate world of data governance, the Data Mesh concept emerges like an artist turning a tangled yarn into a harmonious mosaic, making each data strand shine in a complex yet seamless tapestry that propels our business forward. It transforms the daunting task of untangling our data ecosystem into crafting a masterpiece where every piece is vital, no matter how small. This approach not only simplifies management but also enriches our appreciation for the individual qualities of each data element, aligning them with our overarching business objectives.
As we delve into this reimagined data landscape, it becomes clear that equipping each domain team with the tools to manage and operate their data assets independently is paramount. This article aims to shed light on the tools that empower such autonomy and innovation. Yet, it’s crucial to understand that the selection of these tools will inevitably vary, as each team’s choice hinges on a multitude of factors, including the specific use case, existing technological infrastructure, and budgetary constraints.
A data catalogue is one of the most critical components of a data mesh architecture. It serves as a central inventory of all the data assets available within the organization, and it is essential for data consumers: teams from different business units and domains discover data assets and learn about their scope through the catalogue. In short, the data catalogue is the first place teams from other domains will visit when they want to use each other's data products.
A suitable data catalogue tool should capture four basic features of the data: location, format, quality, and metadata. This is the information that makes data discoverable. Such a tool enables data producers to document, understand, and explain the context of their data assets, and a proper data catalogue also makes it possible to manage data effectively.
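To picture what those four features look like in practice, here is a minimal sketch of a catalogue entry as a plain Python data structure. It is not tied to any of the products below, and every field and value shown is an illustrative assumption.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative record of a data product in a central catalogue."""
    name: str              # e.g. "orders.daily_summary"
    owner_domain: str      # the domain team accountable for the product
    location: str          # where the data lives (bucket, table, topic, ...)
    data_format: str       # e.g. "parquet", "delta", "postgres table"
    quality_score: float   # latest result of the team's own quality checks
    metadata: dict = field(default_factory=dict)  # schema, tags, descriptions

# A producing team registers its product; consuming teams search the registry.
catalog: dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    catalog[entry.name] = entry

register(CatalogEntry(
    name="orders.daily_summary",
    owner_domain="sales",
    location="s3://sales-domain/orders/daily_summary/",
    data_format="parquet",
    quality_score=0.98,
    metadata={"refresh": "daily", "pii": False},
))
```

A real catalogue product adds search, lineage, and access control on top, but the core of what it stores is not much more than this.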
When we do some market research, we come across many products that can serve our purpose of creating a data catalogue. I can list the three that stand out in my opinion as follows:
Strength — metadata management, data lineage, and collaboration features.
Strength — unified metadata management, easy integration and data quality.
Strength — smart search-driven data discovery, collaboration, and user-friendly interface.
These products are just a few options. We can literally find dozens of different options for data cataloguing. However, if it were me and I had no issues with budget or other constraints in terms of the existing data ecosystem, I’d choose one of these three.
The second critical component of the data mesh is data storage, usually a data lake, a data warehouse, or a similar store. In a data mesh architecture, we typically distribute data storage across the organization, with each team responsible for its own storage needs. This has clear benefits, but it can also add complexity: because each team is free to choose its own storage, we may end up with many different storage technologies in use across the organization. The upside is that domain teams can pick the solution that best suits their needs instead of being forced onto a single enterprise platform, which, if you remember, is a defining feature of the data mesh anyway.
For example, a team with unstructured data might choose one kind of store, while a team that only works with structured data might choose a completely different, relational database. The added benefit is that teams can scale their storage solutions independently, so we avoid the massive enterprise-wide scaling programmes that take months and consume a lot of resources; a storage solution that only serves a single team is far quicker and cheaper to scale.
There are many more product options for data storage than data cataloguing. Our choice will depend on our usage scenario, company size, budget, etc.
Complexity: Amazon S3, BigQuery, and Snowflake are cloud-native solutions, which generally means easier management and scalability.
Ease of Use: BigQuery and Snowflake are known for their ease of use and require less infrastructure management than Hadoop.
Cost Model: Each platform has a different cost model, and organizations must consider their specific usage models to evaluate cost-effectiveness.
Because of the characteristics of the data mesh, I think storage should be chosen locally at the domain level, and we then need technical leadership within the organization and a clear strategy for how all these different storage solutions will be brought together. Of course, it is simpler if every domain ends up adopting the same storage solution, but we should not restrict which storage tools different domains can use, because that would go against the data mesh methodology.
Another thing we need to think about when it comes to tools is how data flows through the system. ETL refers to extracting, transforming, and loading data from source systems into the data lake or data warehouse; it is how domain teams move data from source systems to the central repository. But this doesn't happen magically! For this, we need a data pipeline.
A well-planned data pipeline should allow us to scale horizontally to meet increasing data volumes and processing requirements. It will enable us to be more flexible and easily change our data processing pipeline to support new data sources, data formats, or other requirements. A proper pipeline ensures data quality, prevents data loss during information transfer, provides error handling capabilities, optimizes data processing, minimizes situations such as delays, and ensures timely availability of data.
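To make the extract-transform-load idea concrete, here is a minimal pipeline sketch in Python with pandas. The source file, column names, and output path are illustrative assumptions rather than part of any product discussed here.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders-pipeline")

def extract(source_path: str) -> pd.DataFrame:
    """Pull raw records from a source system (here, a CSV export)."""
    return pd.read_csv(source_path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and reshape the data into the form the domain publishes."""
    cleaned = raw.dropna(subset=["order_id", "amount"]).copy()  # basic quality gate
    cleaned["amount"] = cleaned["amount"].astype(float)
    return cleaned.groupby("order_date", as_index=False)["amount"].sum()

def load(df: pd.DataFrame, target_path: str) -> None:
    """Write the finished data product to the domain's own storage."""
    df.to_parquet(target_path, index=False)  # requires pyarrow or fastparquet

def run(source_path: str, target_path: str) -> None:
    try:
        load(transform(extract(source_path)), target_path)
        log.info("pipeline finished, wrote %s", target_path)
    except Exception:
        # Error handling keeps a failed run from silently losing data.
        log.exception("pipeline failed; source data left untouched")
        raise

# run("orders_export.csv", "orders_daily_summary.parquet")
```

The try/except around the run is the kind of error handling the previous paragraph calls for: a failed run is logged and re-raised instead of quietly dropping data.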
Integration and Scalability: As a cloud-based object storage service focused on scalability and security, Amazon S3 provides an ideal platform for storing large datasets. On the other hand, Apache NiFi is an open-source data integration tool and offers great convenience in managing data pipelines and simplifying integration processes. Combining the data management and stream orchestration capabilities of Apache NiFi with the secure and scalable storage provided by Amazon S3, these two services can create a seamless synergy for data storage and integration needs.
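As a small illustration of the storage side of that pairing, the snippet below uploads a finished data product file to an S3 bucket with boto3. The bucket name and object key are hypothetical, and in practice a tool such as NiFi would typically orchestrate this step rather than a hand-written script.

```python
import boto3
from botocore.exceptions import ClientError

def publish_to_s3(local_path: str, bucket: str, key: str) -> bool:
    """Upload a finished data product file to the domain's S3 bucket."""
    s3 = boto3.client("s3")
    try:
        s3.upload_file(local_path, bucket, key)
        return True
    except ClientError as err:
        print(f"upload failed: {err}")
        return False

# Hypothetical bucket and key, shown only for illustration:
# publish_to_s3("orders_daily_summary.parquet",
#               "sales-domain-data-products",
#               "orders/daily_summary/2024-08-30.parquet")
```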
The most crucial factor when choosing between them is the unique requirements and architecture of your own data storage and management needs. In a data mesh, data pipelines are owned by the various domain teams rather than by a central structure, so each pipeline and its components must be tailored to the needs of its domain. It is best to leave these choices to the domain teams; doing so encourages data autonomy and eases the burden on centralized data management teams.
Data quality management is critical when it comes to the data mesh. Actually, it always is! But in a data mesh, many different data products will be created and consumed by many different teams across the organization, and there is no longer a dedicated data quality team overseeing each domain. Ensuring the consistency and reliability of all these data products is therefore only possible with a suitable data quality management tool.
Otherwise, this situation can lead to a variety of problems, such as inefficiency, confusion, and even legal issues. Anyone with experience using data quality management tools knows that the key components are data profiling, data cleansing, data validation, and data monitoring. If we come from a larger organization, we most likely already have a data quality management tool in place.
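Regardless of the product, the validation and monitoring pieces can be surprisingly small in code. Below is a minimal sketch using plain pandas rather than any commercial tool; the column names, rules, and row-count threshold are illustrative assumptions.

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> list[str]:
    """Run a handful of validation rules and return any failures found."""
    failures = []
    if df["order_id"].isna().any():
        failures.append("order_id contains nulls")
    if df["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")
    if (df["amount"] < 0).any():
        failures.append("amount contains negative values")
    return failures

def monitor(df: pd.DataFrame, min_rows: int = 1000) -> list[str]:
    """A crude monitoring check: alert when today's volume looks wrong."""
    return [] if len(df) >= min_rows else [f"only {len(df)} rows, expected >= {min_rows}"]

# issues = validate_orders(orders_df) + monitor(orders_df)
# if issues:
#     raise ValueError("data quality check failed: " + "; ".join(issues))
```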
The first step here is to understand whether the tool already in use will meet the demands of a data mesh implementation. If it does not, we will need to investigate another tool that can better meet those needs. Informatica is my first choice among data quality management tools.
Weakness — It can be costly, and there may be a learning curve due to its complexity.
Weakness — Complexity may be challenging for some users, and large-scale applications may require additional infrastructure.
Weakness — It is limited in data integration and can be costly.
Weakness — There may be a learning curve for new users, and integration complexity may be apparent in some scenarios.
Finally, unlike the other components discussed here, data quality management is one area where it is wiser to standardize on a single tool used across all domains of the organization, even in a data mesh. Choosing something that will work for all the different domains is therefore important.
Data governance is becoming more prominent, especially in the data mesh. Data governance ensures that data is managed according to regulatory requirements and previously agreed corporate policies. With the right data governance tool for a data mesh, it becomes much easier for domain teams to implement governance policies and standards.
I suggest using the same tool you chose for data quality management in the previous section. You don't want a separate tool for data governance and data quality because, in most cases, a single tool provides the functionality needed for both. So, look for a tool that covers both your data governance and data quality needs. Again, my personal favourites here are Collibra and Informatica.
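To make the idea of domain-level policy enforcement concrete, here is a hedged sketch of what such a check might look like in plain Python, independent of Collibra or Informatica. The e-mail rule, column names, and sample size are illustrative assumptions.

```python
import re

import pandas as pd

EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def check_no_raw_emails(df: pd.DataFrame, allowed_columns: set[str]) -> list[str]:
    """Flag columns that appear to expose raw e-mail addresses.

    A toy stand-in for a governance policy such as "PII must be masked
    before a data product is shared outside its owning domain".
    """
    violations = []
    for column in df.columns:
        if column in allowed_columns:
            continue
        sample = df[column].astype(str).head(100)  # inspect a sample of values
        if sample.str.contains(EMAIL_PATTERN).any():
            violations.append(column)
    return violations

# customers = pd.DataFrame({"contact": ["a@example.com"], "region": ["EU"]})
# print(check_no_raw_emails(customers, allowed_columns=set()))  # -> ['contact']
```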
Rather than rushing to pick the right tool, understand your needs and evaluate two to five candidates. If you are already using an effective tool, stick with it; if it is not suitable for your business, analyze the alternatives and choose the one with the best return on investment.
Next, you need to consider how data communication between different domain teams will occur. According to data mesh best practices, you will need APIs and a service mesh for this. Now, let's discuss what an API is and what a service mesh is.
API stands for Application Programming Interface: a software interface that allows different applications and systems to share data with each other. In a data mesh, APIs are how domain teams expose their data to other domains, so teams can work with each other's data more efficiently.
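As a minimal sketch of how a domain team might expose a data product over an API, here is a tiny Flask service. The endpoint path and payload are illustrative assumptions, and a real implementation would add authentication, versioning, and a schema contract.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In a real setting this would query the domain's own storage;
# here a small in-memory table stands in for the data product.
DAILY_SUMMARY = [
    {"order_date": "2024-08-29", "total_amount": 12450.0},
    {"order_date": "2024-08-30", "total_amount": 9870.5},
]

@app.get("/data-products/orders/daily-summary")
def daily_summary():
    """Expose the sales domain's data product to other domain teams."""
    return jsonify(DAILY_SUMMARY)

if __name__ == "__main__":
    app.run(port=8080)
```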
Many people have heard of APIs, but far fewer have heard of a service mesh. A service mesh is an infrastructure layer that manages service-to-service communication within a distributed system. It provides the features organizations need to run a microservices-based architecture effectively, such as service discovery, load balancing, traffic routing, and security.
APIs and a service mesh are key components of the data mesh because they facilitate data exchange and communication between different domains. With both in place, teams can securely and efficiently access the right data at the right time.
Weakness — Kubernetes is not a direct API management tool and may need additional tools to manage complex structures.
Weakness — Installation and configuration complexity may be challenging for some users. It can also cause performance issues in high-traffic applications.
Weakness — It has deficiencies in service mesh features. In particular, it does not offer as wide a range of features as a comprehensive service mesh solution such as Istio.
Choosing the right technology here will reduce the complexity of managing data across multiple domains and make it much easier for domains to maintain and scale their data infrastructure. The tools listed above are just some of the best-known options for a data mesh implementation; there are, of course, many others.
Another critical part of the data mesh that we must consider is data visualization and reporting. We need to think about the tools we use to present domain data as a product that is easy for every domain in the organization to consume. Using APIs and a service mesh to exchange data between domains is excellent, but remember that most data consumers are not technical people! What's more, some of them may be leaders of other business areas. So, we need to think about how we visualize and structure our data to make life easier for everyone, and that is why we need a data visualization and reporting product.
Some of the best-known options are Tableau, Power BI, and QlikView. You also have solutions like Looker and even Excel. Which one we go with will depend on the expertise we have in-house. For example, suppose we already have plenty of Tableau experts and licenses committed for the next 18 months; in that case, there would be no point in switching to Power BI or anything else. However, if we are starting from scratch, or lack deep expertise in the visualization tool we are currently paying for, then it is worth looking at the different options and seeing which one meets our needs best.
Weakness — Tableau may experience some performance challenges when processing large data sets. It also lacks some advanced analytics features compared to Power BI and QlikView.
Weakness — Some of Power BI’s advanced analytics capabilities may be limited. It may underperform competitors like Tableau when processing large-scale datasets.
Weakness — QlikView’s interface and reporting features are less user-friendly than those of Tableau and Power BI. Additionally, licensing costs can be high for certain usage scenarios.
So, in summary, data visualization is important because it allows different domains to communicate their data more effectively and allows different business units to make better data-driven business decisions.
Collaboration and knowledge sharing are critical, especially for autonomous and scalable multidisciplinary organizations. This maximizes customer and revenue potential and highlights the need for effective communication between completely independent teams. Effective collaboration and smooth information flow can be supported using chat platforms like Slack and Microsoft Teams and wiki tools such as Confluence and Notion. Factors such as the choice of these tools, the way documents are shared, and the organization of meetings directly affect the effectiveness of intra-organizational collaboration.
In the process of transitioning to data network implementation, the integration and use of these tools become one of the key elements that support the overall success of organizations. Therefore, strategically deciding which technologies and methods to adopt is vital to ensuring the organization’s long-term success in managing and sharing knowledge.
Choosing the proper tools for our data mesh architecture is one of our most complex tasks. There are so many options on the market, covering everything from data storage and management to more specialized needs, that the amount of information can be overwhelming. I recommend developing a six-step strategy to manage this complexity and choose the right tools. These steps will help us understand the process, evaluate the options, and ultimately determine the best tools.
As a result, our data mesh journey is like a large orchestra playing in harmony, with each tool contributing its unique voice. In this process, choosing and integrating the right tools is like ensuring that the different musical instruments play in perfect harmony with each other. Our teams, each specialized in their own field, and the tools we choose create a fascinating symphony while maximizing the success and sustainability of our data mesh.
This orchestration increases efficiency while minimizing complexity in data management and enabling our organization to reach new heights in data-driven decision-making.