Data Storage and Operations for Data Management Professionals

Executive Summary

This webinar provides an overview of data storage concepts such as the data warehouse, data lake, and data mesh. Howard Diesel covers the semantic data fabric, database organisation, and database technology performance. He highlights digital transformation and data management challenges, including increasing data storage and handling diverse data sets, as well as the unification challenges facing data integration and interoperability. Additionally, the webinar explores the role of Agile teams and the data fabric in microservice domain-driven development, data products, and the data mesh environment. Finally, it examines federated computational governance, data standards, and the key characteristics of the data mesh.

Webinar Details:

Title: Data Storage for Data Management Professionals

Date: 02 February 2022

Presenter: Howard Diesel

Meetup Group: Data Professionals

Write-up Author: Howard Diesel

Contents

Data Storage Concepts

Overview of Data Warehouse, Data Lake, and Data Mesh

Semantic Data Fabric and Database Organisation

Database Technology and Performance

Challenges in Digital Transformation and Data Management

Challenges in Increasing Data Storage and Handling Different Data Sets

Data Integration and Interoperability

Data Integration and Semantic Unification

Challenges in Data Unification Approaches

Agile Teams and Data Fabric in Microservice Domain-Driven Development

Data Products and the Data Mesh Environment

Polyglot Data and Data Domains in Spotify's Data Infrastructure

Federated Computational Governance and Data Standards

Data Mesh and Its Key Characteristics

Discussion on Data Models and Metadata

Data Storage Concepts

Howard Diesel emphasises the importance of data storage and refers to previous conversations on the various types of data storage. He notes that he could not locate all the necessary data storage information in the context diagram and suggests adding business requirements and citizen interaction to it. Howard also reviews essential data storage concepts, including the data fabric, ACID, and BASE. He discusses the different database architecture types, such as centralised, distributed, and federated systems, along with the idea of tightly coupled and loosely coupled systems. He also addresses the CAP theorem, noting a potential error in the slide regarding high consistency and availability.
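
To make the ACID side of the comparison concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and the transfer logic are hypothetical, but they show the all-or-nothing commit behaviour that BASE systems relax in favour of eventual consistency.

```python
import sqlite3

# A minimal ACID sketch: the debit and the credit commit together or not at all.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 50)])
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        con.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
        # A consistency check mid-transaction: overdrafts abort the whole unit
        (balance,) = con.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass  # the rollback restores both rows; a partial update is never visible

print(con.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 30), ('bob', 120)] -- the 70 fits, so the transaction committed
```

Under BASE, by contrast, each node would accept the write immediately and reconcile the balances later.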

Implications ACID vs BASE

Figure 1 Implications ACID vs BASE

Database Architecture Types. Centralized and Distributed (not Federated)

Figure 2 Database Architecture Types. Centralized and Distributed (not Federated)

Distributed Systems: CAP Theorem

Figure 3 Distributed Systems: CAP Theorem

Overview of Data Warehouse, Data Lake, and Data Mesh

The trend in data processing has moved from the data warehouse to the data lake to the data lakehouse, with Databricks contributing to this development. The lakehouse emphasises integrating the warehouse and the lake to create structured data. NLP practitioners often build views on top of a data lake to provide structure for analytics, while a shared catalogue keeps the data environment organised and prevents it from becoming a "swamp." Implementing a data mesh in Azure involves a templated infrastructure with federated governance at a nodal level, and Howard references Scott Taylor's discussions and diagrams on the data mesh concept. Edge computing and the data fabric are additional discussion topics; the mesh and fabric approaches address similar problems but with different strategies. Gartner presents a simplified view of data processing involving data sources, a data catalogue with metadata, and AI algorithms for integration and automation.
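
As a concrete illustration of the lakehouse pattern, here is a minimal sketch using DuckDB; the Parquet path and the column names are hypothetical. A SQL view imposes structure on raw lake files for analytics without moving the data.

```python
import duckdb  # assumed available: pip install duckdb

# A lakehouse-style sketch: expose raw Parquet files in the lake as a
# structured SQL view for analytics. Path and columns are hypothetical.
con = duckdb.connect()
con.execute("""
    CREATE VIEW daily_sales AS
    SELECT region,
           CAST(sold_at AS DATE) AS sale_date,
           SUM(amount)           AS revenue
    FROM read_parquet('lake/sales/*.parquet')
    GROUP BY region, CAST(sold_at AS DATE)
""")
print(con.execute("SELECT * FROM daily_sales LIMIT 5").fetchall())
```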

Federated Database: Blockchain

Figure 4 Federated Database: Blockchain

Data Warehouse to Data Lake to Lakehouse

Figure 5 Data Warehouse to Data Lake to Lakehouse

Data Lake House Layers

Figure 6 Data Lake House Layers

Data Mesh

Figure 7 Data Mesh

Data Mesh Components

Figure 8 Data Mesh Components

Harmonized Mesh (Azure)

Figure 9 Harmonized Mesh (Azure)

Edge Computing

Figure 10 Edge Computing

Data Fabric: Key Pillars of a Comprehensive Data Fabric

Figure 11 Data Fabric: Key Pillars of a Comprehensive Data Fabric

Semantic Data Fabric and Database Organisation

The semantic data fabric is a concept that involves preparing and integrating data before it reaches data consumers. PoolParty is a tool that offers a semantic data fabric to bring together data from various sources. Metadata management and integration are crucial for understanding and automating the different assets and work within the data environment. A knowledge graph built on the metadata helps transform and define the data, incorporating AI, machine learning, and active metadata. Semantic standards and natural language processing enrich the data, and the knowledge graph includes the ontology and metadata that define the relationships and attributes of the data. The ultimate goal is to handle data curation, ingestion, access, glossary and dictionary understanding, and the different types of data interactions. Howard also covers the different types of schema and database organisation, including hierarchical, relational, and non-relational, with data structure options such as document, graph, column-family, and key-value.
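
The knowledge-graph idea can be sketched with the rdflib library; the namespace and ontology terms below are hypothetical, but they show how metadata triples define the relationships between a dataset, the glossary term that defines it, and the domain that owns it.

```python
from rdflib import RDF, RDFS, Graph, Literal, Namespace  # pip install rdflib

# A knowledge-graph sketch: hypothetical metadata triples linking a dataset
# to its glossary term and owning domain.
EX = Namespace("http://example.org/meta/")
g = Graph()
g.bind("ex", EX)

g.add((EX.CustomerDataset, RDF.type, EX.Dataset))
g.add((EX.CustomerDataset, EX.definedBy, EX.CustomerGlossaryTerm))
g.add((EX.CustomerGlossaryTerm, RDFS.label, Literal("Customer")))
g.add((EX.CustomerDataset, EX.ownedBy, EX.SalesDomain))

# Which glossary term defines each dataset?
for dataset, term in g.subject_objects(EX.definedBy):
    print(dataset, "is defined by", term)
```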

Semantic Data Fabric Part

Figure 12 Semantic Data Fabric Part

Database Organisation

Figure 13 Database Organisation

Database Technology and Performance

In today's era, databases can be built in the cloud or on other platforms for various data types and use cases, and fabric and mesh support the integration of different databases. A data manager's role includes understanding requirements, planning for business continuity, creating criteria for selecting database technology, and managing data continuity plans. Data storage professionals, in turn, are responsible for data management tasks, including database performance. To ensure adequate levels of operation and delivery in line with business priorities, operational level agreements (OLAs) are managed with proactive methodologies and procedures. OLA reports measure service response times, network round-trip transaction times, and other performance aspects of a data centre, and executive OLA reports explain the database system's performance to executives.
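
As a rough illustration of how an OLA report might measure response times, here is a minimal sketch in Python; the query stand-in and the 100 ms target are hypothetical.

```python
import statistics
import time

def timed_ms(fn) -> float:
    """Measure a single call's response time in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000

def run_query():
    time.sleep(0.02)  # stand-in for a real database call

samples = [timed_ms(run_query) for _ in range(50)]
p95 = statistics.quantiles(samples, n=20)[18]  # 95th percentile cut point
print(f"avg={statistics.mean(samples):.1f} ms  p95={p95:.1f} ms")
print("OLA met" if p95 <= 100 else "OLA breached")  # hypothetical 100 ms target
```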

Database Organisation. Document, Graph, Column-Family and Key-Value

Figure 14 Database Organisation. Document, Graph, Column-Family and Key-Value

Data Storage and Operations: Activities

Figure 15 Data Storage and Operations: Activities

Data Storage and Operations: Activities continued including Data manager

Figure 16 Data Storage and Operations: Activities continued including Data manager

Data Storage and Operations: Activities continued including Data Storage Professional

Figure 17 Data Storage and Operations: Activities continued including Data Storage Professional

Too much Tech Talk

Figure 18 Too much Tech Talk

Challenges in Digital Transformation and Data Management

Howard provides an overview of the challenges associated with data management during digital transformation. He emphasises the importance of reliable data for making informed decisions and highlights the scaling challenge of collecting and storing large amounts of data. Howard also explains that scaling has two dimensions: increasing data storage and increasing data set diversity.

Looking After Data and Master Data

Figure 19 Looking After Data and Master Data

Problem Statement

Figure 20 Problem Statement

Challenges in Increasing Data Storage and Handling Different Data Sets

The traditional replication or distribution methods used to increase data storage often lead to reduced data sharing, and the federated approach is seen as a solution to improve data sharing. In massively parallel processing (MPP), compute power, disks, and memory are distributed; the diagrams depict scenarios where sharing is limited and connectivity is avoided to achieve better scaling. Handling data sets from various sources introduces different data semantics, and the data sets received are often not well defined and lack a proper glossary. For instance, a bank faced challenges with a data set from an external company because it came without proper data models and accompanying notation diagrams. Despite cleaning, wrangling, fixing, presenting, and analysing the data, it remained insufficient, and the bank decided to discard it.
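
A minimal sketch of the MPP distribution idea, with hypothetical customer keys: rows are hash-partitioned across nodes, which is also why sharing between shards is limited.

```python
from hashlib import sha1

# An MPP-style distribution sketch: rows are hash-partitioned across nodes,
# so each node stores and scans only its own shard.
NODES = 4

def shard_for(key: str) -> int:
    return int(sha1(key.encode()).hexdigest(), 16) % NODES

shards = {n: [] for n in range(NODES)}
for customer_id in ("c-101", "c-102", "c-103", "c-104", "c-105"):
    shards[shard_for(customer_id)].append(customer_id)

for node, rows in sorted(shards.items()):
    print(f"node {node}: {rows}")
# Key lookups route to exactly one node; cross-shard joins require data
# movement, which is why sharing between shards is limited.
```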

Data Scaling Problem Dimensions

Figure 21 Data Scaling Problem Dimensions

Increase Data Storage

Figure 22 Increase Data Storage

Increase Data Storage continued

Figure 23 Increase Data Storage continued

Increase in Data Sets

Figure 24 Increase in Data Sets

Data Integration and Interoperability

Integrating and consolidating data from multiple sources pose significant challenges for data scientists, particularly when dealing with multiple data lakes, and the issue is expected to become even more complex as data scientists perform analytics across different data sets. Ensuring integration and interoperability is crucial to avoid misunderstandings and inconsistencies in data interpretation. Data integration involves the movement and consolidation of data from various sources, while interoperability refers to the ability of different systems or datasets to work together seamlessly.

Data Integration and Semantic Unification

Data integration involves consolidating data into consistent forms using a canonical model. The canonical model is crucial to achieving semantic unification; without it, confusion and chaos can arise in understanding and linking datasets. One approach to semantic unification is to establish a centralised human team to build a business glossary and ontology, but this can lead to missed deadlines. Other approaches include the data fabric and the data mesh, which have some fundamental differences despite their similar terminology.
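
A minimal sketch of the canonical-model idea, with hypothetical source schemas: each source maps its own field names onto one shared shape so that downstream consumers see consistent semantics.

```python
# A semantic-unification sketch: two hypothetical source schemas are mapped
# onto one canonical model so downstream consumers see a single shape.
CANONICAL_FIELDS = ("customer_id", "full_name", "country")

MAPPINGS = {  # canonical field -> source field, per source system
    "crm":     {"customer_id": "CustID",  "full_name": "Name",   "country": "Ctry"},
    "billing": {"customer_id": "acct_no", "full_name": "holder", "country": "nation"},
}

def to_canonical(source: str, record: dict) -> dict:
    mapping = MAPPINGS[source]
    return {field: record[mapping[field]] for field in CANONICAL_FIELDS}

print(to_canonical("crm",     {"CustID": "42", "Name": "Ada", "Ctry": "UK"}))
print(to_canonical("billing", {"acct_no": "42", "holder": "Ada", "nation": "UK"}))
# Both sources emit: {'customer_id': '42', 'full_name': 'Ada', 'country': 'UK'}
```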

DMBOK DII (Data Integration & interoperability) Definition

Figure 25 DMBOK DII (Data Integration & interoperability) Definition

Semantic Differences Challenge

Figure 26 Semantic Differences Challenge

Challenges in Data Unification Approaches

Connecting different data sources can be challenging for data scientists, as it can result in data duplication and missing information. Beyond a centralised human team defining a canonical model, two further approaches address this challenge: the semantic data fabric and the data mesh. The data fabric approach minimises human involvement by using technology to unify data sets from different sources. The data mesh, on the other hand, relies on humans to build focused data products as domains within a decentralised structure, and it differs from the principles of the data fabric.

Semantic Unification (Canonical)

Figure 27 Semantic Unification (Canonical)

Unification Approach

Figure 28 Unification Approach

Agile Teams and Data Fabric in Microservice Domain-Driven Development

Agile teams and microservice domain-driven development are being implemented based on the work done by Martin Fowler. The teams are divided into different product domains with global policies on integration and interoperability, which is where the concept of a federated computational governance policy applies. The data fabric component architecture combines edge computing, different cloud systems, and on-premise environments, emphasising machine learning and active metadata management. The formation of the data fabric involves different layers, starting with hybrid clouds, universal connectors, and continuous metadata intelligence environments for data cleaning and quality, followed by a convergence of capabilities and, eventually, data applications. A continuous metadata approach is also discussed as a way to automate the integration of data from various sources.
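
The continuous metadata idea can be sketched as code that records lineage and freshness every time a pipeline step runs; the catalogue store and the pipeline step below are hypothetical stand-ins for an active metadata platform.

```python
import datetime
import functools

CATALOG = []  # stand-in for an active metadata store

def capture_metadata(step):
    """Record lineage and freshness metadata every time a pipeline step runs."""
    @functools.wraps(step)
    def wrapper(rows):
        result = step(rows)
        CATALOG.append({
            "step": step.__name__,
            "rows_in": len(rows),
            "rows_out": len(result),
            "ran_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })
        return result
    return wrapper

@capture_metadata
def drop_nulls(rows):
    return [r for r in rows if all(v is not None for v in r.values())]

drop_nulls([{"id": 1, "name": "Ada"}, {"id": 2, "name": None}])
print(CATALOG)  # the metadata accumulates as pipelines run
```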

Data Fabric Component Architecture

Figure 29 Data Fabric Component Architecture

Data Fabric Example

Figure 30 Data Fabric Example

Data Products and the Data Mesh Environment

Data products are developed using AI and machine learning technology that integrates various data sets. Because these systems perform the integration work, understanding database technology is important in order to link it with data product requirements, and the right approach is crucial for achieving the desired outcomes. Paying attention to the code from a governance perspective is therefore vital to ensure accuracy and reliability. A data mesh environment is created by building a data product environment in which operational systems, microservices, and legacy applications form the input for data products, and the data products themselves are the output exposed to the world. Data products are defined by their dimensions and the code (algorithms) used to create them.
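
A minimal sketch of a data product as described above, with hypothetical names: operational inputs feed the code (algorithm) that creates the product, and the output port is what consumers see.

```python
from dataclasses import dataclass
from typing import Callable, List

# A data-product sketch: inputs, the code behind the product, and the
# output port exposed to the world. All names are hypothetical.
@dataclass
class DataProduct:
    domain: str
    name: str
    input_ports: List[str]             # upstream operational systems
    transform: Callable[[list], list]  # the code (algorithm) behind the product
    output_port: str                   # where the product is exposed

top_podcasts = DataProduct(
    domain="podcasts",
    name="top-podcasts",
    input_ports=["playback-events", "podcast-catalogue"],
    transform=lambda rows: sorted(rows, key=lambda r: r["plays"], reverse=True)[:10],
    output_port="mesh://podcasts/top-podcasts",
)
print(top_podcasts.transform([{"show": "A", "plays": 9}, {"show": "B", "plays": 21}]))
```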

Unified Data Operations

Figure 31 Unified Data Operations

Continuous Metadata Intelligence

Figure 32 Continuous Metadata Intelligence

Data Mesh Components

Figure 33 Data Mesh Components

Notation: Domain, its (analytical) data product and operational system

Figure 34 Notation: Domain, its (analytical) data product and operational system

Notation: Domain, its (analytical) data product and operational system continued

Figure 35 Notation: Domain, its (analytical) data product and operational system continued

Polyglot Data and Data Domains in Spotify's Data Infrastructure

Polyglot data is the use of multiple data types while maintaining consistent semantics, which allows for flexibility and compatibility in data processing and analysis. Spotify's data infrastructure includes different domains, such as artists, podcasts, and users, each operating independently with its own data infrastructure. The podcast data domain takes in listeners' demographics and outputs data products related to podcast creation, released episodes, and top podcasts, while the artist data domain produces data products related to artists' profiles, releases, and analytics. These data domains operate in a federated manner, each defining the semantics of its own data products.
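
A small sketch of the polyglot idea, with hypothetical release data: the same data product is served as JSON and CSV while the field names and meanings stay identical.

```python
import csv
import io
import json

# A polyglot-data sketch: one data product served in several formats while
# the semantics (field names and meanings) stay identical.
RELEASES = [{"artist": "Artist A", "release": "Album 1", "year": 2021}]

def as_json(rows) -> str:
    return json.dumps(rows)

def as_csv(rows) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(as_json(RELEASES))
print(as_csv(RELEASES))
```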

Data Mesh Domain Integration

Figure 36 Data Mesh Domain Integration

Federated Computational Governance and Data Standards

Howard emphasises the importance of standardising data products through semantic and syntax modelling. He stresses the need for consistent metadata elements and formats across all data products and highlights the significance of identifying the boundaries between domains and the associations between different areas. Howard compares various governance approaches and suggests that a federated approach is preferred, with each ministry having its own data management office. The concept of "polysemes," shared concepts that link data across domains, is introduced.
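
Federated computational governance can be sketched as a global check that every domain's data product carries the standard metadata elements before publication; the required fields below are hypothetical.

```python
# A federated computational governance sketch: a global standard lists the
# metadata elements every data product must carry, and each domain's product
# is checked before publication. The required fields are hypothetical.
REQUIRED_METADATA = {"owner", "domain", "schema_version", "glossary_term"}

def validate_product(metadata: dict) -> list:
    """Return the globally required metadata elements a product is missing."""
    return sorted(REQUIRED_METADATA - metadata.keys())

product = {"owner": "podcast-team", "domain": "podcasts", "schema_version": "1.2"}
missing = validate_product(product)
print("publishable" if not missing else f"blocked, missing: {missing}")
```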

Data Mesh Approach

Figure 37 Data Mesh Approach

Data Governance Review

Figure 38 Data Governance Review

Data Mesh and Its Key Characteristics

Data mesh is a decentralised, domain-oriented approach to data management that focuses on data ownership and treats data as a product. The concept involves understanding data interactions across domains and supporting various structured data types. It emphasises easy discovery for users through data catalogues and APIs, while self-service data infrastructure allows teams to create and consume data products autonomously, and federated computational governance ensures effective management and control of data assets. Howard mentions various references, such as comparisons between the data mesh and related logical approaches like the data fabric, as well as the different vendors in the market. The Q&A session included questions and comments from the audience, including remarks about the number of market architecture slides presented.
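
Discoverability through a catalogue can be sketched as a tiny in-memory registry; the products and tags below are hypothetical.

```python
# A discoverability sketch: a tiny in-memory catalogue where teams register
# data products and consumers find them by domain or tag. All names are
# hypothetical.
CATALOGUE = [
    {"name": "top-podcasts",    "domain": "podcasts", "tags": ["analytics"]},
    {"name": "artist-profiles", "domain": "artists",  "tags": ["reference"]},
]

def discover(domain=None, tag=None):
    return [p for p in CATALOGUE
            if (domain is None or p["domain"] == domain)
            and (tag is None or tag in p["tags"])]

print(discover(domain="podcasts"))  # -> [{'name': 'top-podcasts', ...}]
```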

Principles Summary and the High Level Logical Architecture

Figure 39 Principles Summary and the High Level Logical Architecture

Data Storage Reference Material

Figure 40 Data Storage Reference Material

Discussion on Data Models and Metadata

Marc is currently working on a Microsoft Azure project but faces implementation challenges with the instantiation of the IDW in the common data model format. There is a debate about the placement of the Synapse layer within the data lake, and the team is hesitant to follow the canonical model. Marc expresses frustration with the slow progress of data modelling. Howard expresses interest in active metadata and the data fabric, which can help build data sets and products by domain and address semantic differences across data sets.

If you would like to join the discussion, please visit our community platform, the Data Professional Expedition.

Additionally, if you would like to be a guest speaker on a future webinar, kindly contact Debbie (social@modelwaresystems.com)

Don’t forget to join our exciting LinkedIn and Meetup data communities so you don’t miss out!
