Data warehousing from Conceptual to Physical - Corné Potgieter

Executive Summary

This webinar presents a comprehensive overview of Data Warehousing Automation. Corné Potgieter covers Data Modelling, Data Vault Methodology, Business Model Development, Logical Models, and Metadata Tagging in Logical Models. He delves into implementing a Data Mesh Type Architecture, Automating Data Vault Processes, and Best Practices for Safeguarding Sensitive Data in Data Vault Landscape. The webinar also discusses different Software Tools for Data Warehousing and Modelling, Data Automation and Storage in Data Vault, Data Modelling and Dimensional Modelling, and Business Vault and Access Layer in Data Management.

Webinar Details

Title: Data warehousing from Conceptual to Physical

Date: 05 March 2024

Presenter: Corné Potgieter

Meetup Group: DAMA SA User Group

Write-up Author: Howard Diesel

Contents

Data Warehousing Automation

Data Modelling Recap

Data Vault Methodology and Introduction to the CinemaOne Scenario

Implementing a Data Mesh Type Architecture

Logical Model and Business Concepts

Business Model Development

Data Vault and Metadata in Logical Model

Building a Centralized Data Model

Data Vault and Metadata Tagging in Logical Model

Automating Data Vault Processes

Automating Business Satellite Structure

Best Practices for Safeguarding Sensitive Data in Data Vault Landscape

Data Management Best Practices

Distribution of Data and Query Performance

Different Software Tools for Data Warehousing and Modelling

Data Automation and Storage in Data Vault

Data Modelling and Dimensional Modelling

Data Vault Design Principles

Business Vault and Access Layer in Data Management

Implementing an Ideal Self-Service BI Platform

Data Warehousing Automation

Corné Potgieter opens the webinar by defining Data Warehouse Automation as a process that uses Metadata captured during the design stage to generate components such as DDL and DML templates. While not everything in a Data Warehouse can be automated, repeatable patterns can be automated using Metadata. As a data consultant with experience in data warehousing and automation, Corné has seen the benefits of automating certain aspects of the process.
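
As a simple illustration of the idea (not Corné's actual templates; the metadata table and column names below are assumed), design-time Metadata can be queried to generate DDL for every repeatable structure, such as hubs:

-- Minimal sketch: generate hub DDL from design-time Metadata (hypothetical metadata.logical_entities table).
SELECT 'CREATE TABLE IF NOT EXISTS rdv.hub_' || LOWER(entity_name) || ' ('
    || 'hub_' || LOWER(entity_name) || '_hk CHAR(32), '   -- MD5 hash of the business key
    || LOWER(business_key_column) || ' VARCHAR, '
    || 'load_date TIMESTAMP_NTZ, record_source VARCHAR);' AS ddl_statement
FROM metadata.logical_entities
WHERE target_layer = 'RAW_VAULT';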

Data Warehousing from Conceptual to Physical

Figure 1 "Data Warehousing from Conceptual to Physical"

Definitions

Figure 2 Definitions

Data Warehouse Automation

Figure 3 Data Warehouse Automation

Data Modelling Recap

An effective Data Management approach involves several models that work together to ensure that everyone within a business speaks the same language when referring to key concepts. The Conceptual Model defines the core business concepts and their relationships at a high level, while the Business Glossary clearly defines these concepts to ensure organisational consistency. The Logical Model then designs the attributes, definitions, and relationships between entities, which are mapped to the business sources.

Finally, the Physical Model designs the Data Warehouse model and includes loading tables and staging tables specific to the chosen technology, such as Data Vault or dimensional modelling. By following this approach, businesses can ensure that their data is effectively managed and used to make informed decisions.

Data Modeling Recap

Figure 4 “Data Modeling Recap”

Data Vault

Figure 5 Data Vault

What is your Experience with Data Vault?

Figure 6 "What is your Experience with Data Vault?"

Data Vault Methodology and Introduction to the CinemaOne Scenario

Data Vault 2.0 is a data warehousing methodology created by Dan Linstedt for structuring data in a Data Warehouse. It breaks the model into highly repeatable structures, which makes it well suited to automation. The methodology models core business concepts in hubs, descriptive attributes in satellites, and relationships in link tables.

Corné shares an invented scenario involving CinemaOne, a nationwide cinema business that has recently adopted the Showtime software and is systematically rolling it out to its cinemas, while some cinemas still rely on siloed Data Warehouses for reporting.
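
A minimal sketch of what these three shapes could look like for the CinemaOne scenario, using illustrative table and column names rather than the presenter's exact design:

-- Hub: one row per core business concept instance, identified by its business key.
CREATE TABLE hub_movie (
    hub_movie_hk   CHAR(32),       -- MD5 hash of the business key
    movie_code     VARCHAR,        -- business key
    load_date      TIMESTAMP_NTZ,
    record_source  VARCHAR
);

-- Satellite: descriptive attributes and their history.
CREATE TABLE sat_movie_details (
    hub_movie_hk   CHAR(32),
    load_date      TIMESTAMP_NTZ,
    hash_diff      CHAR(32),       -- change detection across attribute values
    title          VARCHAR,
    genre          VARCHAR,
    record_source  VARCHAR
);

-- Link: relationship between hubs, e.g. a movie screened at a cinema.
CREATE TABLE link_screening (
    link_screening_hk  CHAR(32),
    hub_movie_hk       CHAR(32),
    hub_cinema_hk      CHAR(32),
    load_date          TIMESTAMP_NTZ,
    record_source      VARCHAR
);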

Scenario

Figure 7 Scenario

Implementing a Data Mesh Type Architecture

The fictional company, CinemaOne, plans to implement a more scalable Data Warehouse following a data mesh-type architecture. Teams will be free to build their data products in a decentralised way, while certain components will be centralised with central standards applied. Data modellers or data modelling teams will be responsible for conceptual and Logical Models.

“Showtime” sources will follow a Data Vault approach, with the central team designing the Raw Vault and its repeatable patterns and components, which teams are free to consume in their own way within certain constraints. Snowflake will be used for the centralised components, and SQL stored procedures will be used for the cloud-based SQL warehouses. The base model will involve replicating data into ADLS and historised tables for the ODS, with data teams having the option to build further on the base model in their own way.

Scenario Expanded

Figure 8 Scenario Expanded

Visualisation of Scenario

Figure 9 Visualisation of Scenario

Logical Model and Business Concepts

The Logical Model must be enriched with attributes, relationships, and Metadata augmentation to ensure effective template-driven code generation. Similarly, the Conceptual Model must define core business concepts in a Business Glossary for consistent communication. In the data model, entities like movies, screenings, and reservations are interrelated and need to be defined.

For example, a screening is a specific movie showing at a particular time and cinema, while a reservation is made by a specific customer for a specific screening and contains details such as seat numbers. The data set should illustrate how everything is linked, giving a practical feel for the data's structure and content.
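
Expressed as a rough relational sketch (attribute names assumed for illustration), the interrelationships look like this:

-- Logical-level sketch of the core entities and how they relate.
CREATE TABLE movie       (movie_id INT PRIMARY KEY, title VARCHAR(200), genre VARCHAR(50));
CREATE TABLE cinema      (cinema_id INT PRIMARY KEY, cinema_name VARCHAR(100), city VARCHAR(100));
CREATE TABLE screening   (screening_id INT PRIMARY KEY,
                          movie_id INT REFERENCES movie (movie_id),
                          cinema_id INT REFERENCES cinema (cinema_id),
                          start_time TIMESTAMP);                       -- a movie at a time and cinema
CREATE TABLE reservation (reservation_id INT PRIMARY KEY,
                          screening_id INT REFERENCES screening (screening_id),
                          customer_id INT,
                          seat_number VARCHAR(10));                    -- a customer booking for a screening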

Conceptual Model

Figure 10 Conceptual Model

Conceptual Model Expanded

Figure 11 Conceptual Model Expanded

Second Conceptual Model

Figure 12 Second Conceptual Model

Logical Model

Figure 13 Logical Model

Business Model Development

Corné emphasises the importance of solidly understanding core business concepts before diving into technical details. He specifies the need to identify entity attributes in the Logical Model, regardless of existing sources, and suggests mapping those attributes to sources as a future step. Additionally, capturing enough Metadata for automation purposes is crucial. Corné provides examples of mapping a JSON API and a sample CSV file to the Logical Model, highlighting the importance of capturing relevant Metadata for automation.

Logical Model Expanded

Figure 14 Logical Model Expanded

Figure 15 Logical Model Source Mapping

Logical Model Source Mapping Part Two

Figure 16 Logical Model Source Mapping Part Two

Data Vault and Metadata in Logical Model

When generating a Data Vault in Snowflake for the Showtime source, it is important to ensure that the column names in the source file mapping match the logical attributes in the Logical Model. Metadata can help identify which data should become part of the central Data Vault model and which data should not be included.

Additionally, tagging sensitive information, such as Personally Identifiable Information, according to the legal requirements of different jurisdictions is crucial. Adding further Metadata to attributes to indicate sensitivity, pseudo business logic, and potential calculations is essential in the Logical Model. Providing detailed pseudo logic for fields that are not available at the source helps with further analysis and understanding of the data.
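
In Snowflake, this kind of sensitivity Metadata can be carried through to the physical layer with object tags; a small sketch with assumed tag, schema, and column names:

-- Create a tag once, then attach it to sensitive columns so policies can be generated from it.
CREATE TAG IF NOT EXISTS governance.sensitivity;

ALTER TABLE rdv.sat_customer_details
    MODIFY COLUMN email SET TAG governance.sensitivity = 'PII';

ALTER TABLE rdv.sat_customer_details
    MODIFY COLUMN date_of_birth SET TAG governance.sensitivity = 'PII';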

Logical Model: Adding Pseudo Business Logic

Figure 17 Logical Model: Adding Pseudo Business Logic

Physical Model Part One

Figure 18 Physical Model Part One

Building a Centralized Data Model

This project aims to generate loading mechanisms for various tables in Snowflake and Azure SQL, including Data Vault stage tables, hubs, satellites, and link tables. A template-driven approach will create a landing zone and a historised (merged) table for customer data in Azure SQL. The resulting Logical Model, enriched with tagged Metadata, will be used to create a Data Vault on Snowflake and a base model on Azure.

The centralised team will use the Snowflake centralised data model, while other teams will continue to build their dimensional models in Azure SQL. An access layer will be added to the centralised model to ease consumption and joins for the data mart teams and to apply security policies. The focus is on automating data definition and manipulation through DDL statements and code generation for the core data model.
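
For the Azure SQL side, one possible template output is an append-only historisation load from the landing zone; the sketch below uses assumed table and column names and simplified change detection:

-- Insert a new version only when the customer is new or any tracked attribute has changed
-- (NULL handling simplified for brevity).
INSERT INTO dbo.customer_history (customer_id, full_name, email, load_date, record_source)
SELECT l.customer_id, l.full_name, l.email, SYSUTCDATETIME(), 'showtime_customer_export'
FROM dbo.customer_landing AS l
LEFT JOIN (
    SELECT customer_id, full_name, email,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY load_date DESC) AS rn
    FROM dbo.customer_history
) AS h
  ON h.customer_id = l.customer_id AND h.rn = 1
WHERE h.customer_id IS NULL
   OR h.full_name <> l.full_name
   OR h.email     <> l.email;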

Physical Model Translated into Azure SQL

Figure 19 Physical Model Translated into Azure SQL

End to End

Figure 20 "End to End"

3 things worth automating

Figure 21 "3 things worth automating"

Data Definition and Data Manipulation

Figure 22 Data Definition and Data Manipulation

Data Vault and Metadata Tagging in Logical Model

In the Data Vault's Logical Model, Metadata is assigned to attributes such as business keys, link business keys, and satellite attributes. Capturing this Metadata accurately in the Logical Model is crucial for generating hubs and satellites from templates. The hub structure includes a business key, hash key, load date, and record source, while satellites follow their own specific structure.

This Metadata can be used to generate Snowflake code and automate the data flow, including flattening the JSON structure. Snowpipe can be used for automation, allowing incremental files in a data lake source to be loaded into a stage table in Snowflake for further processing.
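
A sketch of that pattern with assumed stage, pipe, and table names; the actual generated code would come from the Metadata templates:

-- Stage table receives the raw JSON payload plus load Metadata.
CREATE TABLE stg.showtime_reservations_raw (
    raw_payload  VARIANT,
    file_name    VARCHAR,
    load_date    TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Snowpipe auto-ingests incremental files landing in the data lake stage.
CREATE PIPE stg.showtime_reservations_pipe AUTO_INGEST = TRUE AS
COPY INTO stg.showtime_reservations_raw (raw_payload, file_name)
FROM (SELECT $1, METADATA$FILENAME FROM @stg.showtime_lake_stage/reservations/)
FILE_FORMAT = (TYPE = 'JSON');

-- Flatten the nested seats array into one row per seat for downstream loading.
SELECT r.raw_payload:reservationId::VARCHAR AS reservation_id,
       s.value::VARCHAR                     AS seat_number
FROM   stg.showtime_reservations_raw r,
       LATERAL FLATTEN(input => r.raw_payload:seats) s;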

Data Definition and Data Manipulation in Snowflake

Figure 23 Data Definition and Data Manipulation in Snowflake

Map to Source

Figure 24 Map to Source

Map to Source Part Two

Figure 25 Map to Source Part Two

Data Definition and Data Manipulation in Azure SQL

Figure 26 Data Definition and Data Manipulation in Azure SQL

Automating Data Vault Processes

Automating processes with tasks and streams is a common practice for updating the Data Vault from the stage tables. While real-time pipelines are exciting, it is important to balance real-time and batch processing. Metadata mapping enables automation for JSON, CSV, SQL, and base-model code, and with the right Metadata it is possible to create pipelines with ADF, Python, PowerShell, or JSON input. Business logic can also be automated, allowing the efficient generation of Business Vault satellites and the passing of Metadata to data engineers for loading data.
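
As an example of the tasks-and-streams pattern (object names assumed, not the presenter's generated code):

-- The stream tracks new rows arriving in the stage table.
CREATE STREAM stg.showtime_reservations_stream
  ON TABLE stg.showtime_reservations_raw;

-- The task wakes up on a schedule and loads only when the stream has new data.
CREATE TASK rdv.load_hub_reservation
  WAREHOUSE = etl_wh
  SCHEDULE  = '15 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('stg.showtime_reservations_stream')
AS
  MERGE INTO rdv.hub_reservation h
  USING (
      SELECT DISTINCT
             raw_payload:reservationId::VARCHAR      AS reservation_id,
             MD5(raw_payload:reservationId::VARCHAR) AS hub_reservation_hk
      FROM stg.showtime_reservations_stream
  ) s
  ON h.hub_reservation_hk = s.hub_reservation_hk
  WHEN NOT MATCHED THEN
    INSERT (hub_reservation_hk, reservation_id, load_date, record_source)
    VALUES (s.hub_reservation_hk, s.reservation_id, CURRENT_TIMESTAMP(), 'showtime');

ALTER TASK rdv.load_hub_reservation RESUME;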

Business Vault

Figure 27 Business Vault

Automating Business Satellite Structure

Efficient use of satellite information requires appropriate automation techniques and strict adherence to Logical Models that maintain security features. Reverse engineering and generating updates should be executed carefully and with minimal automation.

New attributes or business rules may require the creation of new business satellites, and version control and clear communication between data modellers and engineers are crucial. It is essential to determine which aspects of business satellite structures can and should be automated while maintaining high quality and adherence to standards.

Security

Figure 28 Security

Best Practices for Safeguarding Sensitive Data in Data Vault Landscape

Joining data requires careful consideration to ensure accuracy and effectiveness, as automation may pose risks to business logic. A well-defined, targeted approach should be prescribed to maintain consistency. At the same time, sensitive data should be separated into dedicated satellites and schemas in the Data Vault landscape so that security can be enhanced through access restrictions and satellite specialisation.

Implementing masking policies and row-level access controls on platforms such as Snowflake can safeguard sensitive columns within those separate satellite schemas. Standard SQL enables the creation and automation of data control language (DCL) policies to manage access rights and define data security protocols. Separate templates for DCL policies allow the code that controls data access to be automated, making Data Management and data placement more definable and manageable. Incorporating additional Metadata to specify sensitivity, such as customer location or other region-specific restrictions on data access and usage, can also be useful.
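
A hedged sketch of what such generated policies might look like in Snowflake, with role names, the mapping table, and column names all assumed:

-- Masking policy: privileged roles see the real value, everyone else sees a masked value.
CREATE MASKING POLICY governance.mask_pii AS (val VARCHAR) RETURNS VARCHAR ->
  CASE WHEN CURRENT_ROLE() IN ('CENTRAL_DATA_TEAM', 'DATA_GOVERNANCE') THEN val
       ELSE '***MASKED***'
  END;

ALTER TABLE rdv.sat_customer_details
  MODIFY COLUMN email SET MASKING POLICY governance.mask_pii;

-- Row access policy: a role only sees rows for the regions mapped to it.
CREATE ROW ACCESS POLICY governance.region_filter AS (customer_region VARCHAR) RETURNS BOOLEAN ->
  EXISTS (
      SELECT 1
      FROM governance.role_region_map m
      WHERE m.role_name = CURRENT_ROLE()
        AND m.region    = customer_region
  );

ALTER TABLE rdv.sat_customer_details
  ADD ROW ACCESS POLICY governance.region_filter ON (customer_region);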

Security Tag Sensitivity

Figure 29 Security Tag Sensitivity

Security Tag Sensitivity Part Two

Figure 30 Security Tag Sensitivity Part Two

Security Tag other Metadata

Figure 31 Security Tag other Metadata

Data Management Best Practices

Metadata tagging ensures security and access control in Data Management systems. By creating row-level access policies based on user roles and regions, Metadata can help users only see relevant data. Well-tagged Metadata can also help generate security handling at the row level, optimising performance.

During the design phase of Data Management, performance optimisation should be considered when designing data objects based on known query patterns. Standardised insert patterns can lead to the creation of cluster keys and distribution definitions, optimising data distribution across nodes in MPP platforms. Access layer automation is essential for data distribution and selection queries, while link tables and driving keys can be leveraged for performance optimisation in Data Management systems like Synapse.
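
Two design-time examples of what that generated output could include, one for Snowflake and one for an Azure Synapse dedicated SQL pool (table and column names assumed):

-- Snowflake: cluster a large satellite on the keys used in most joins and filters.
ALTER TABLE rdv.sat_reservation_details CLUSTER BY (hub_reservation_hk, load_date);

-- Azure Synapse dedicated SQL pool: hash-distribute a large link table on its join key.
CREATE TABLE dbo.link_screening
(
    link_screening_hk  CHAR(32)  NOT NULL,
    hub_movie_hk       CHAR(32)  NOT NULL,
    hub_cinema_hk      CHAR(32)  NOT NULL,
    load_date          DATETIME2 NOT NULL
)
WITH (DISTRIBUTION = HASH(hub_movie_hk), CLUSTERED COLUMNSTORE INDEX);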

Performance

Figure 32 "Performance"

Performance: Design Time

Figure 33 Performance: Design Time

Data Vault

Figure 34 Data Vault

Distribution of Data and Query Performance

Data distribution is crucial for query performance but does not solve all problems. Different levers, such as indexing, clustering keys, and materialised views, can be utilised to optimise performance at design time. Skewed data can affect performance, which makes periodic checks and alterations to the cluster key necessary.

Automation can generate scripts or procedures for weekly reports and periodic corrections. Different methods for generating point-in-time (PIT) tables are also important for performance in Data Vault modelling.
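
One way such a periodic check could be scripted in Snowflake (table, column, and key choices assumed for illustration):

-- Report clustering quality for the current cluster key; schedule this as part of a weekly report.
SELECT SYSTEM$CLUSTERING_INFORMATION('rdv.sat_reservation_details',
                                     '(hub_reservation_hk, load_date)');

-- If the report shows poor clustering or heavy skew, change the cluster key as a corrective action.
ALTER TABLE rdv.sat_reservation_details CLUSTER BY (load_date);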

Performance: Periodic Corrections

Figure 35 Performance: Periodic Corrections

Performance: Re/Design and Periodic

Figure 36 Performance: Re/Design and Periodic

Different Software Tools for Data Warehousing and Modelling

Corné shares various tools used in data warehousing and modelling. Access queries can be built from a derived table rather than directly from the satellite, which retains all the history but may not be needed at that level of granularity. Synapse provides distribution syntax, while in Snowflake the skewness of data can be checked through SQL.

WhereScape and Coalesce are tools for modelling and Data Warehouse generation; Coalesce is the more modern of the two, is based on YAML files, and is only compatible with Snowflake. dbt is a preferred tool, but it lacks a modelling aspect, which can be compensated for with open-source modelling tools and a Python program that generates the YAML or SQL files. TimeXtender is similar to Coalesce, and Data Vault Builder and Qlik Compose also handle Data Warehouse generation. Corné has in-depth knowledge of WhereScape, Qlik Compose, and dbt and can answer questions about these tools.

Available Software

Figure 37 Available Software

Data Automation and Storage in Data Vault

Data automation tools can be used for experimentation and to understand how they work. Data Vault Builder recently held a webinar in which they demonstrated reverse engineering and automation.

Regarding the difference between data storage in hubs and satellites: transactional facts are stored in link tables, and satellite attributes can also be hung off link tables to provide more detail for fact tables. There is debate about modelling satellite attributes off link tables, with some suggesting modelling them as a hub and satellite instead.

Data Modelling and Dimensional Modelling

Data modelling is a crucial aspect of data warehousing, and in some cases data is stored in a link table with its descriptive attributes in a satellite. The hub table should be thin, containing the hub key (hash key), the business key, and Metadata fields. The link table should contain the hash keys of the hubs it relates, the relationship itself, and Metadata fields.

The satellite is where historisation happens. The access view is a denormalised, dimensional-style representation. A pragmatic approach should be taken: start with a virtualised representation and persist it for performance reasons only if necessary. History should live in the Data Vault, so that the data mart can be dropped and recreated at any time; if a Data Vault is not used, the dimensional model itself must store history, and the data mart can no longer simply be dropped and recreated. Slowly changing data should then be preserved in the dimensional model.
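
A sketch of such a virtualised access-layer dimension, joining a hub to the current rows of its satellite (names assumed; it can later be persisted if performance demands it):

-- Denormalised, dimensional-style view over the Data Vault; history stays in the satellite.
CREATE OR REPLACE VIEW access.dim_movie AS
SELECT h.movie_code,
       s.title,
       s.genre,
       s.load_date AS effective_from
FROM   rdv.hub_movie h
JOIN   rdv.sat_movie_details s
  ON   s.hub_movie_hk = h.hub_movie_hk
QUALIFY ROW_NUMBER() OVER (PARTITION BY h.hub_movie_hk ORDER BY s.load_date DESC) = 1;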

Data Vault Design Principles

The centralised data team is responsible for the raw Data Vault and ingestion. This Raw Data Vault aims to identify business concepts and keys without adding logic. The central team must standardise data ingestion approaches as different scenarios may arise. Additionally, they determine the creation of a raw Data Vault and base model.

The Business Vault is a logical separation where business teams can create entities and add business logic. It is crucial to avoid redundancy when adding business logic and understanding the Data Vault. The centralised model requires a decentralised team or individuals with system mastery.

Business Vault and Access Layer in Data Management

The Business Vault is responsible for adding business logic to the components, while the access layer should restrict access to HR data and other sensitive tables across the business. Ideally, business teams should define the attributes they need and receive a generated view with security policies already applied. Although the structure of the access layer should be defined centrally, data teams should be able to request the data they need flexibly.

A manual request process could be a bottleneck and lead to frustration, potentially causing teams to bypass the system. Therefore, automation or increased accessibility to the data is essential to avoid frustration and prevent teams from bypassing the system.

Implementing an Ideal Self-Service BI Platform

Corné closes the webinar by highlighting the rarity of an ideal self-service BI platform that balances its strengths and weaknesses. Emphasising practicality and simplicity, he stresses the importance of automating processes so that users can quickly access information via a front-end interface. He also warns that inefficiencies in self-service BI platforms lead to time delays that become real challenges in urgent situations.

If you would like to join the discussion, please visit our community platform, the Data Professional Expedition.

Additionally, if you would like to be a guest speaker on a future webinar, kindly contact Debbie (social@modelwaresystems.com)

Don’t forget to join our exciting LinkedIn and Meetup data communities so you don’t miss out!

Next: Data Warehousing, BI, Big Data & Data Science for Data Citizens