Data Storage and Operations for Data Management Professionals
Executive Summary
This webinar provides an overview of various data storage concepts such as the data warehouse, data lake, and data mesh. Howard Diesel covers the semantic data fabric, database organisation, and database technology performance. He highlights digital transformation and data management challenges, including increasing data storage and handling different data sets, as well as the unification challenges facing data integration and interoperability. Additionally, the webinar explores the role of agile teams and the data fabric in microservice domain-driven development, data products, and the data mesh environment. Finally, it examines federated computational governance, data standards, and the key characteristics of the data mesh.
Webinar Details:
Title: Data Storage for Data Management Professionals
Date: 02 February 2022
Presenter: Howard Diesel
Meetup Group: Data Professionals
Write-up Author: Howard Diesel
Contents
Data Storage Concepts
Overview of Data Warehouse, Data Lake, and Data Mesh
Semantic Data Fabric and Database Organisation
Database Technology and Performance
Challenges in Digital Transformation and Data Management
Challenges in Increasing Data Storage and Handling Different Data Sets
Data Integration and Interoperability
Data Integration and Semantic Unification
Challenges in Data Unification Approaches
Agile Teams and Data Fabric in Microservice Domain-Driven Development
Data Products and the Data Mesh Environment
Polyglot Data and Data Domains in Spotify's Data Infrastructure
Federated Computational Governance and Data Standards
Data Mesh and Its Key Characteristics
Discussion on Data Models and Metadata
Data Storage Concepts
Howard Diesel emphasises the importance of data storage and refers to previous conversations on various types of data storage. However, he notes that he could not locate all the necessary information on data storage in the context diagram and suggests adding business requirements and citizen interaction to it. Howard also mentions his intention to review essential data storage concepts, including the data fabric, ACID, and BASE. He discusses different database architecture types, such as centralised, distributed, and federated systems, along with the idea of tightly coupled and loosely coupled systems. He also addresses the CAP theorem, noting a potential error regarding high consistency and availability.
Figure 1 Implications ACID vs BASE
Figure 2 Database Architecture Types. Centralized and Distributed (not Federated)
Figure 3 Distributed Systems: CAP Theorem
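To make the ACID side of the comparison concrete, here is a minimal sketch of an atomic transaction, using SQLite purely for illustration; the account data and the overdraft rule are hypothetical, and a BASE-style store would instead accept the writes and converge later.

```python
import sqlite3

# Minimal sketch of ACID atomicity: both updates commit together or not at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 "
                     "WHERE name = 'alice'")
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:  # hypothetical business rule: no overdrafts
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 "
                     "WHERE name = 'bob'")
except ValueError:
    pass  # rollback leaves both balances untouched

# Prints the original balances: the failed transfer left no partial state.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```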
Overview of Data Warehouse, Data Lake, and Data Mesh
The trend in data processing has moved from the data warehouse to the data lake to the lakehouse, with Databricks contributing to this development. The lakehouse emphasises integrating the warehouse and the lake to create structured data. Practitioners often build views on top of a data lake to provide structure for analytics, while a shared catalogue helps maintain an organised data environment and prevents it from becoming a "swamp." Implementing a data mesh in Azure involves a templated infrastructure with federated governance at a nodal level, and there are references to Scott Taylor's discussions and diagrams about the data mesh concept. Edge computing and the data fabric are additional discussion topics, with the mesh and fabric approaches addressing similar problems through different strategies. Gartner presents a simplified view of data processing involving data sources, a data catalogue with metadata, and AI algorithms for integration and automation.
Figure 4 Federated Database: Blockchain
Figure 5 Data Warehouse to Data Lake to Lakehouse
Figure 6 Data Lake House Layers
Figure 7 Data Mesh
Figure 8 Data Mesh Components
Figure 9 Harmonized Mesh (Azure)
Figure 10 Edge Computing
Figure 11 Data Fabric: Key Pillars of a Comprehensive Data Fabric
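As a concrete illustration of the lakehouse pattern of building views on top of a data lake, here is a minimal PySpark sketch; the lake path, table, and column names are all hypothetical, and the webinar does not prescribe this particular stack.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-view").getOrCreate()

# Raw Parquet files sitting in the lake (hypothetical path).
raw = spark.read.parquet("s3a://data-lake/raw/orders/")
raw.createOrReplaceTempView("orders_raw")

# The "view on top of the lake": curated structure without moving the data.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW orders_curated AS
    SELECT order_id,
           CAST(order_ts AS DATE) AS order_date,
           customer_id,
           amount
    FROM orders_raw
    WHERE amount IS NOT NULL
""")

spark.sql("SELECT order_date, SUM(amount) "
          "FROM orders_curated GROUP BY order_date").show()
```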
Semantic Data Fabric and Database Organisation
The semantic data fabric is a concept that involves preparing and integrating data before it reaches the data consumers. PoolParty is a tool that offers a semantic data fabric to bring together data from various sources. Metadata management and integration are crucial to understanding and automating the different assets and work within the data environment. A knowledge graph built on the metadata helps transform and define the data, incorporating AI, machine learning, and active metadata. Semantic standards and natural language processing enrich the data, and the knowledge graph includes an ontology and metadata that define the relationships and attributes of the data. The ultimate goal is to handle data curation, ingestion, access, glossary and dictionary understanding, and different types of data interactions. Different types of schema and database organisation are also covered, including hierarchical, relational, and non-relational, with data structure options such as document, graph, column-family, and key-value.
Figure 12 Semantic Data Fabric Part
Figure 13 Database Organisation
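To ground the knowledge-graph idea, here is a minimal sketch using rdflib that mixes a tiny ontology with metadata about two hypothetical data assets; PoolParty and similar tools operate at far greater scale, and the namespace and asset names below are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/fabric#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Ontology: a class of assets and a lineage relationship between them.
g.add((EX.Dataset, RDF.type, RDFS.Class))
g.add((EX.derivedFrom, RDF.type, RDF.Property))

# Metadata: two assets and how they relate.
g.add((EX.customer_raw, RDF.type, EX.Dataset))
g.add((EX.customer_curated, RDF.type, EX.Dataset))
g.add((EX.customer_curated, EX.derivedFrom, EX.customer_raw))
g.add((EX.customer_curated, RDFS.label, Literal("Curated customer records")))

print(g.serialize(format="turtle"))
```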
Database Technology and Performance
Today, databases can be built in the cloud or on other platforms for various data types and use cases, and the fabric and mesh support the integration of different databases. A data manager's role includes understanding requirements, creating selection criteria for database technology, and planning and managing business continuity for data. Data storage professionals are responsible for related data management tasks, including database performance. To ensure adequate levels of operation and delivery in line with business priorities, operational level agreements (OLAs) are used, supported by proactive methodologies and procedures. OLA reports measure service response times, network round-trip transaction times, and other performance aspects of a data centre, and executive OLA reports explain the database system's performance to executives.
Figure 14 Database Organisation. Document, Graph, Column-Family and Key-Value
Figure 15 Data Storage and Operations: Activities
Figure 16 Data Storage and Operations: Activities continued including Data manager
Figure 17 Data Storage and Operations: Activities continued including Data Storage Professional
Figure 18 Too much Tech Talk
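To make the OLA-reporting idea concrete, here is a minimal sketch that times a query and flags breaches of an agreed response-time target; the 200 ms threshold and the SQLite table are hypothetical stand-ins, not figures from the webinar.

```python
import sqlite3
import time

OLA_THRESHOLD_MS = 200  # hypothetical agreed response-time target

def timed_query(conn, sql):
    """Run a query, measure its response time, and report against the OLA."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    status = "OK" if elapsed_ms <= OLA_THRESHOLD_MS else "OLA BREACH"
    print(f"{status}: {elapsed_ms:.2f} ms, {len(rows)} row(s)")
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (id INTEGER, value REAL)")
conn.executemany("INSERT INTO metrics VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(10_000)])
timed_query(conn, "SELECT AVG(value) FROM metrics")
```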
Challenges in Digital Transformation and Data Management
Howard provides an overview of the challenges associated with data management during digital transformation. He emphasises the importance of reliable data for making informed decisions and highlights the challenge of collecting and storing large amounts of data at scale. Scaling has two dimensions: increasing data storage volumes and increasing diversity of data sets.
Figure 19 Looking After Data and Master Data
Figure 20 Problem Statement
Challenges in Increasing Data Storage and Handling Different Data Sets
The traditional replication or distribution methods used to increase data storage often reduce data sharing, and a federated approach is seen as a way to improve it. In massively parallel processing (MPP), compute power, disks, and memory are distributed. The diagrams depict scenarios in which sharing is limited and connectivity is avoided to scale better. Handling data sets from various sources introduces differing data semantics, and received data sets are often poorly defined and lack a proper glossary. For instance, a bank faced challenges with a data set from an external company because it lacked proper data models and accompanying notation diagrams; despite efforts to clean, wrangle, fix, present, and analyse the data, it remained insufficient, and the bank decided to discard it.
Figure 21 Data Scaling Problem Dimensions
Figure 22 Increase Data Storage
Figure 23 Increase Data Storage continued
Figure 24 Increase in Data Sets
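As a toy illustration of the MPP distribution described above, the following sketch hash-partitions rows across a handful of "nodes" and combines partial aggregates; the node count and data are invented, and a real MPP engine distributes physical compute, disks, and memory rather than Python lists.

```python
NUM_NODES = 4  # hypothetical cluster size

rows = [{"customer_id": i, "amount": i % 97} for i in range(1_000)]

# Distribute: each row lands on exactly one node based on its key.
nodes = [[] for _ in range(NUM_NODES)]
for row in rows:
    nodes[hash(row["customer_id"]) % NUM_NODES].append(row)

# Scatter: each node computes a partial aggregate over its local slice only.
partials = [sum(r["amount"] for r in node) for node in nodes]

# Gather: the coordinator combines partial results into the final answer.
print("total amount:", sum(partials))
```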
Data Integration and Interoperability
Integrating and consolidating data from multiple sources pose significant challenges for data scientists, particularly when dealing with multiple data lakes. The issue is expected to become even more complex as data scientists seek to perform analytics across different data sets. Ensuring integration and interoperability is crucial to avoid misunderstandings and inconsistencies in data interpretation. Data integration involves the movement and consolidation of data from various sources, while interoperability refers to the ability of different systems or datasets to work together seamlessly.
Data Integration and Semantic Unification
Data integration involves consolidating data into consistent forms using a shared model known as the canonical model. The canonical model is crucial to achieving semantic unification; without it, confusion and chaos may arise in understanding and linking datasets. One approach to semantic unification is to establish a centralised human team to build a business glossary and ontology, but this can lead to missed deadlines. Other approaches include the data fabric and the data mesh, which have some fundamental differences despite similar terminology.
Figure 25 DMBOK DII (Data Integration & interoperability) Definition
Figure 26 Semantic Differences Challenge
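A minimal sketch of semantic unification through a canonical model follows: two hypothetical source systems use different field names and date representations, and both are mapped into one agreed vocabulary. The field names and mappings are invented for illustration.

```python
from datetime import date

def from_crm(record):
    """CRM source: 'fullName' and ISO date strings."""
    return {"customer_name": record["fullName"],
            "joined_on": date.fromisoformat(record["signupDate"])}

def from_billing(record):
    """Billing source: 'cust_nm' and (year, month, day) tuples."""
    return {"customer_name": record["cust_nm"],
            "joined_on": date(*record["start_dt"])}

# After mapping, both records share one vocabulary and one date type.
canonical = [
    from_crm({"fullName": "Thandi M", "signupDate": "2021-06-01"}),
    from_billing({"cust_nm": "Pieter V", "start_dt": (2020, 11, 15)}),
]
print(canonical)
```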
Challenges in Data Unification Approaches
Connecting different data sources can be challenging for data scientists, as it can result in data duplication and missing information. Beyond a centralised human team defining a canonical model, two further approaches address this challenge. The data fabric approach minimises human involvement by unifying data sets from different sources with technology. The data mesh, by contrast, relies on humans to build focused data products as domains within a decentralised structure, and it differs from the principles of the data fabric.
Figure 27 Semantic Unification (Canonical)
Figure 28 Unification Approach
Agile Teams and Data Fabric in Microservice Domain-Driven Development
Agile teams and microservice domain-driven development are being implemented based on the work done by Martin Fowler. The teams are divided into different product domains with global policies on integration and interoperability, and Howard discusses the concept of a federated computational governance policy in this context. The data fabric component architecture combines edge computing, different cloud systems, and on-premise environments, emphasising machine learning and active metadata management. The fabric forms in layers: hybrid clouds, universal connectors, continuous metadata intelligence environments for data cleaning and quality, convergence of capabilities, and, eventually, data applications. A continuous metadata approach is also discussed as a way to automate the integration of data from various sources.
Figure 29 Data Fabric Component Architecture
Figure 30 Data Fabric Example
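The layered formation described above can be sketched, very loosely, as a pipeline: universal connectors pull from hybrid sources, a continuous metadata-intelligence step applies quality rules, and clean data reaches applications. Every layer, source name, and rule below is a hypothetical simplification.

```python
def universal_connector(source: str) -> list[dict]:
    """Stand-in for a connector that pulls rows from any source."""
    samples = {
        "cloud_crm": [{"id": 1, "email": "a@example.com"},
                      {"id": 2, "email": None}],
        "on_prem_erp": [{"id": 3, "email": "c@example.com"}],
    }
    return samples.get(source, [])

def metadata_intelligence(rows: list[dict]) -> list[dict]:
    """Continuous quality step: drop rows failing a completeness rule."""
    return [r for r in rows if r.get("email")]

def data_application(rows: list[dict]) -> None:
    print(f"serving {len(rows)} clean rows to consumers")

combined = []
for source in ("cloud_crm", "on_prem_erp"):  # hybrid: cloud plus on-premise
    combined.extend(metadata_intelligence(universal_connector(source)))
data_application(combined)
```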
Data Products and the Data Mesh Environment
Data products are developed using AI and machine learning technology that integrates various data sets. Because these systems perform the integration work, understanding database technology is important for linking it to data product requirements, and the right approach is crucial for achieving the desired outcomes. Paying attention to the code from a governance perspective is therefore vital to ensure accuracy and reliability. A data mesh environment is created by building a data product environment in which operational systems, microservices, and legacy applications form the input for data products, which are the output exposed to the world. Data products are defined by their dimensions and by the code (algorithms) used to create them.
Figure 31 Unified Data Operations
Figure 32 Continuous Metadata Intelligence
Figure 33 Data Mesh Components
Figure 34 Notation: Domain, its (analytical) data product and operational system
Figure 35 Notation: Domain, its (analytical) data product and operational system continued
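To make the data-product notion concrete, here is a minimal sketch of a domain-owned product with input ports, transformation code, and metadata; the domain, port names, and transformation are hypothetical, not a notation from the webinar.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataProduct:
    """A domain-owned unit: inputs, the code that builds it, and metadata."""
    domain: str
    name: str
    input_ports: list[str]
    transform: Callable[[list[dict]], list[dict]]
    metadata: dict = field(default_factory=dict)

    def build(self, source_rows: list[dict]) -> list[dict]:
        return self.transform(source_rows)

top_sellers = DataProduct(
    domain="sales",
    name="top_sellers",
    input_ports=["orders_service.order_lines"],  # hypothetical operational input
    transform=lambda rows: sorted(rows, key=lambda r: r["units"],
                                  reverse=True)[:3],
    metadata={"owner": "sales-team", "format": "json"},
)
print(top_sellers.build([{"sku": "A", "units": 9}, {"sku": "B", "units": 42}]))
```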
Polyglot Data and Data Domains in Spotify's Data Infrastructure
Polyglot data is the use of multiple data formats while maintaining consistent semantics, which allows for flexibility and compatibility in data processing and analysis. Spotify's data infrastructure includes different domains, such as artists, podcasts, and users, each operating independently with its own data infrastructure. The podcast data domain takes in listeners' demographics and outputs data products related to podcast creation, released episodes, and top podcasts, while the artist data domain produces data products related to artists' profiles, releases, and analytics. These data domains operate in a federated manner, each defining the semantics of its own data products.
Figure 36 Data Mesh Domain Integration
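As a small illustration of polyglot data, the sketch below exposes one hypothetical data product through two output formats while keeping the field names and meanings identical; the records are invented and are not Spotify's actual data.

```python
import csv
import io
import json

top_podcasts = [
    {"podcast": "Daily News", "plays": 120_000},
    {"podcast": "Tech Talk", "plays": 87_500},
]

# JSON port for API consumers.
json_port = json.dumps(top_podcasts, indent=2)

# CSV port for spreadsheet/BI consumers -- same fields, same meanings.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["podcast", "plays"])
writer.writeheader()
writer.writerows(top_podcasts)
csv_port = buffer.getvalue()

print(json_port)
print(csv_port)
```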
Federated Computational Governance and Data Standards
Howard emphasises the importance of standardising data products through semantic and syntax modelling. He stresses the need for consistent metadata elements and formats across all data products and highlights the significance of identifying boundaries between domains and associations between different areas. Howard compares various governance approaches and suggests that a federated approach is preferred, with each ministry having its own data management office. The concept of polysemes, shared concepts that link data across domains and aim to bring everything together, is introduced.
Figure 37 Data Mesh Approach
Figure 38 Data Governance Review
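Federated computational governance means global standards are enforced in code rather than by committee. The sketch below checks every domain's data product against a hypothetical global metadata standard; the required fields and the products are invented for illustration.

```python
# Hypothetical global standard: metadata every data product must carry.
REQUIRED_METADATA = {"owner", "domain", "description",
                     "update_frequency", "format"}

products = [
    {"name": "artist_profiles",
     "metadata": {"owner": "artist-team", "domain": "artists",
                  "description": "Current artist profiles",
                  "update_frequency": "daily", "format": "parquet"}},
    {"name": "top_podcasts",
     "metadata": {"owner": "podcast-team", "domain": "podcasts"}},  # incomplete
]

# The "computational" part: compliance is checked automatically per product.
for product in products:
    missing = REQUIRED_METADATA - product["metadata"].keys()
    verdict = "compliant" if not missing else f"missing {sorted(missing)}"
    print(f"{product['name']}: {verdict}")
```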
Data Mesh and Its Key Characteristics
Data mesh is a decentralised, domain-oriented approach to data management that focuses on data ownership and treats data as a product. The concept involves understanding data interactions across domains and supporting various structured data types. It emphasises easy discovery for users through data catalogues and APIs, and a self-service data infrastructure allows teams to create and consume data products autonomously, while federated computational governance ensures effective management and control of data assets. Howard mentions various references, such as comparisons between the data mesh and related approaches like the data fabric, as well as different vendors in the market. The Q&A session included questions and comments from the audience, including remarks about the number of architecture slides presented.
Figure 39 Principles Summary and the High Level Logical Architecture
Figure 40 Data Storage Reference Material
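To illustrate the discoverability principle, here is a minimal sketch of a catalogue with a simple search function; the catalogue entries and the matching logic are hypothetical.

```python
catalogue = [
    {"name": "released_episodes", "domain": "podcasts",
     "tags": ["podcast", "episode", "release"]},
    {"name": "artist_analytics", "domain": "artists",
     "tags": ["artist", "streams", "analytics"]},
]

def discover(term: str) -> list[str]:
    """Return the names of data products whose name or tags match the term."""
    term = term.lower()
    return [entry["name"] for entry in catalogue
            if term in entry["name"].lower()
            or any(term in tag for tag in entry["tags"])]

print(discover("artist"))   # -> ['artist_analytics']
print(discover("episode"))  # -> ['released_episodes']
```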
Discussion on Data Models and Metadata
Marc is currently working on a Microsoft Azure project but faces implementation challenges with the instantiation of IDW in the Common Data Model format. There is a debate about the placement of the Synapse layer within the data lake, and the team is hesitant to follow the canonical model. Marc expresses frustration with the slow progress of data modelling. Howard expresses interest in active metadata and the data fabric, which can help build data sets and products by domain and address semantic differences across data sets.
If you would like to join the discussion, please visit our community platform, the Data Professional Expedition.
Additionally, if you would like to be a guest speaker on a future webinar, kindly contact Debbie (social@modelwaresystems.com)
Don’t forget to join our exciting LinkedIn and Meetup data communities so you don’t miss out!