Data warehousing from Conceptual to Physical - Corné Potgieter
Executive Summary
This webinar presents a comprehensive overview of Data Warehousing Automation. Corné Potgieter covers Data Modelling, Data Vault Methodology, Business Model Development, Logical Models, and Metadata Tagging in Logical Models. He delves into implementing a Data Mesh Type Architecture, Automating Data Vault Processes, and Best Practices for Safeguarding Sensitive Data in Data Vault Landscape. The webinar also discusses different Software Tools for Data Warehousing and Modelling, Data Automation and Storage in Data Vault, Data Modelling and Dimensional Modelling, and Business Vault and Access Layer in Data Management.
Webinar Details
Title: Data warehousing from Conceptual to Physical
Date: 05 March 2024
Presenter: Corné Potgieter
Meetup Group: DAMA SA User Group
Write-up Author: Howard Diesel
Contents
Data Warehousing Automation
Data Modelling Recap
Data Vault Methodology and Introduction to Cinema 1 Scenario
Implementing a Data Mesh Type Architecture
Logical Model and Business Concepts
Business Model Development
Data Vault and Metadata in Logical Model
Building a Centralized Data Model
Data Vault and Metadata Tagging in Logical Model
Automating Data Vault Processes
Automating Business Satellite Structure
Best Practices for Safeguarding Sensitive Data in Data Vault Landscape
Data Management Best Practices
Distribution of Data and Query Performance
Different Software Tools for Data Warehousing and Modelling
Data Automation and Storage in Data Vault
Data Modelling and Dimensional Modelling
Data Vault Design Principles
Business Vault and Access Layer in Data Management
Implementing an Ideal Self-Service BI Platform
Data Warehousing Automation
Corné Potgieter opens the webinar by defining Data Warehouse Automation as a process that uses Metadata captured during the design stage to generate components such as DDL and DML templates. While not everything in a Data Warehouse can be automated, repeatable patterns can be automated using Metadata. As a data consultant with experience in data warehousing and automation, Corné has seen the benefits of automating certain aspects of the process.
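To make the idea concrete, here is a minimal sketch (not taken from the webinar) of how design-time metadata held in a simple Python dictionary could be rendered into DDL by a repeatable template; the table, column, and type names are hypothetical.

```python
# A minimal sketch: design-time metadata captured for one entity, then used
# to render a CREATE TABLE statement from a repeatable template.
# The table, column, and type names are purely illustrative.

entity_metadata = {
    "schema": "stage",
    "table": "stg_customer",
    "columns": [
        {"name": "customer_id",  "type": "VARCHAR(50)"},
        {"name": "first_name",   "type": "VARCHAR(100)"},
        {"name": "loyalty_tier", "type": "VARCHAR(20)"},
    ],
}

def render_ddl(meta: dict) -> str:
    """Generate a DDL statement from the captured metadata."""
    cols = ",\n    ".join(f'{c["name"]} {c["type"]}' for c in meta["columns"])
    return (
        f'CREATE TABLE {meta["schema"]}.{meta["table"]} (\n'
        f"    {cols},\n"
        "    load_date TIMESTAMP,\n"
        "    record_source VARCHAR(100)\n"
        ");"
    )

print(render_ddl(entity_metadata))
```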
Figure 1 "Data Warehousing from Conceptual to Physical"
Figure 2 Definitions
Figure 3 Data Warehouse Automation
Data Modelling Recap
An effective Data Management approach involves several models that work together to ensure that everyone within a business speaks the same language when referring to key concepts. The Conceptual Model defines the core business concepts and their relationships at a high level, while the Business Glossary clearly defines these concepts to ensure organisational consistency. The Logical Model then adds the attributes, definitions, and relationships between entities and maps business sources to those logical entities.
Finally, the Physical Model designs the Data Warehouse model and includes loading tables and staging tables specific to the chosen technology, such as Data Vault or dimensional modelling. By following this approach, businesses can ensure that their data is effectively managed and used to make informed decisions.
Figure 4 “Data Modeling Recap”
Figure 5 Data Vault
Figure 6 "What is your Experience with Data Vault?"
Data Vault Methodology and Introduction to Cinema 1 Scenario
Data Vault 2.0 is a data warehousing methodology designed by Dan Linstedt to structure data in a Data Warehouse. It breaks data out into highly repeatable, pattern-based structures, which makes it well suited to automation. The methodology models core business concepts in hubs, descriptive attributes in satellites, and relationships in link tables.
Corné shares an invented scenario involving CinemaOne, a nationwide cinema business that has recently adopted the Showtime software and is systematically rolling it out to its cinemas, while some cinemas still rely on siloed Data Warehouses for reporting.
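For illustration, a hub, satellite, and link for this scenario might look like the SQL below, shown as the statements a generation template could emit; the schema, table, and column names are assumptions, and the hash and metadata columns follow common Data Vault 2.0 conventions rather than anything prescribed in the webinar.

```python
# Sketch of the three Data Vault table types for the cinema scenario,
# expressed as the SQL a template could emit. All names are illustrative.

HUB_MOVIE = """
CREATE TABLE rv.hub_movie (
    hk_movie       CHAR(32)     NOT NULL,  -- hash of the business key
    movie_code     VARCHAR(50)  NOT NULL,  -- business key
    load_date      TIMESTAMP    NOT NULL,
    record_source  VARCHAR(100) NOT NULL
);
"""

SAT_MOVIE_DETAILS = """
CREATE TABLE rv.sat_movie_details (
    hk_movie       CHAR(32)     NOT NULL,  -- points back to hub_movie
    load_date      TIMESTAMP    NOT NULL,
    hash_diff      CHAR(32)     NOT NULL,  -- change detection
    title          VARCHAR(200),
    genre          VARCHAR(50),
    record_source  VARCHAR(100) NOT NULL
);
"""

LINK_SCREENING = """
CREATE TABLE rv.link_screening (
    hk_screening   CHAR(32)     NOT NULL,  -- hash of the combined keys
    hk_movie       CHAR(32)     NOT NULL,
    hk_cinema      CHAR(32)     NOT NULL,
    load_date      TIMESTAMP    NOT NULL,
    record_source  VARCHAR(100) NOT NULL
);
"""
```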
Figure 7 Scenario
Implementing a Data Mesh Type Architecture
The fictional company, CinemaOne, plans to implement a more scalable Data Warehouse following a data mesh-type architecture. Teams will be free to build their data products in a decentralised way, while certain components will be centralised with central standards applied. Data modellers or data modelling teams will be responsible for conceptual and Logical Models.
“Showtime” sources will follow a Data Vault approach, with the central team designing the raw Vault and the repeatable patterns and components, which teams will be free to consume in their own way within certain constraints. Snowflake will be used for centralised components, and SQL stored procedures will be used for the cloud-based SQL warehouses. The base model will involve replicating data into ADLS and historised tables for the ODS, with data teams having the option to work on the base model further in their own way.
Figure 8 Scenario Expanded
Figure 9 Visualisation of Scenario
Logical Model and Business Concepts
The Logical Model must be enriched with attributes, relationships, and Metadata augmentation to ensure effective template-driven code generation. Similarly, the Conceptual Model must define core business concepts in a Business Glossary for consistent communication. In the data model, entities like movies, screenings, and reservations are interrelated and need to be defined.
A practical example of a screening would be a specific movie showing at a particular time and cinema, while a reservation would be made by a specific customer for a specific screening and would contain details like seat numbers. The data set should illustrate how everything is linked, providing a practical feel for the data's structure and content.
Figure 10 Conceptual Model
Figure 11 Conceptual Model Expanded
Figure 12 Second Conceptual Model
Figure 13 Logical Model
Business Model Development
Corné emphasises the importance of solidly understanding core business concepts before diving into technical details. He specifies the need to identify entity attributes in the Logical Model, regardless of existing sources, and suggests mapping those attributes to sources as a future step. Additionally, capturing enough Metadata for automation purposes is crucial. Corné provides examples of mapping a JSON API and a sample CSV file to the Logical Model, highlighting the importance of capturing relevant Metadata for automation.
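A hedged sketch of what that mapping metadata could look like follows, with one hypothetical JSON API field and one CSV column mapped to logical attributes; the systems, paths, and attribute names are invented for illustration.

```python
# Illustrative source-to-logical mapping metadata for two hypothetical
# sources: a JSON API field and a CSV column.

source_mappings = [
    {
        "logical_entity": "Movie",
        "logical_attribute": "Title",
        "source_system": "Showtime API",
        "source_format": "json",
        "source_path": "$.movie.title",   # JSON path in the payload
    },
    {
        "logical_entity": "Customer",
        "logical_attribute": "Email Address",
        "source_system": "Legacy CSV extract",
        "source_format": "csv",
        "source_path": "email_addr",      # column header in the file
    },
]

def unmapped_attributes(logical_attributes, mappings):
    """Attributes modelled in the Logical Model but not yet mapped to a source."""
    mapped = {(m["logical_entity"], m["logical_attribute"]) for m in mappings}
    return [a for a in logical_attributes if a not in mapped]

print(unmapped_attributes([("Movie", "Title"), ("Movie", "Age Restriction")],
                          source_mappings))
```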
Figure 14 Logical Model Expanded
Figure 15 Logical Model Source Mapping
Figure 16 Logical Model Source Mapping Part Two
Data Vault and Metadata in Logical Model
When generating a Data Vault in Snowflake for the Showtime source, it is important to ensure that the column names in the source file mapping match the logical attributes in the Logical Model. Metadata can help identify which data should become part of the central Data Vault model and which data should not be included.
Additionally, tagging sensitive information, such as Personally Identifiable Information, according to the specific legal requirements of different jurisdictions is crucial. Adding further Metadata to attributes to indicate sensitive information, pseudo business logic, and potential calculations is essential in the Logical Model. Providing detailed pseudo logic for fields that are not available at the source helps with further analysis and understanding of the data.
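The sketch below shows, under assumed tag names, how sensitivity flags and pseudo business logic might be recorded against logical attributes; none of the specific tags, attributes, or rules come from the webinar.

```python
# Attribute-level metadata tags in the Logical Model, including a sensitivity
# flag and pseudo business logic. All tag names and values are assumptions.

attribute_tags = {
    ("Customer", "Email Address"): {
        "data_vault_target": "sat_customer_pii",  # routed to a restricted satellite
        "sensitivity": "PII",
        "jurisdictions": ["POPIA", "GDPR"],
        "pseudo_logic": None,
    },
    ("Reservation", "Ticket Revenue"): {
        "data_vault_target": "business_vault",
        "sensitivity": None,
        "jurisdictions": [],
        "pseudo_logic": "ticket_price * seats_reserved - discounts",
    },
}

def pii_attributes(tags):
    """Return the attributes that must land in restricted satellites."""
    return [attr for attr, t in tags.items() if t["sensitivity"] == "PII"]

print(pii_attributes(attribute_tags))
```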
Figure 17 Logical Model: Adding Pseudo Business Logic
Figure 18 Physical Model Part One
Building a Centralized Data Model
This project aims to generate loading mechanisms for various tables in Snowflake and Azure SQL, including Data Vault stage tables, hubs, satellites, and link tables. A template-driven approach will create a landing zone and a historised/merged table for customer data in Azure SQL. The resulting Logical Model, enhanced with Metadata tags, will be used to create a Data Vault on Snowflake and a base model on Azure.
The centralised team will use the Snowflake centralised data model, while other teams continue to build their dimensional models in Azure SQL. An access layer will be added to the centralised model for easier consumption, for joins by the data mart teams, and for applying security policies. The focus is on automating data definition and manipulation through DDL statements and code generation for the core data model.
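As one possible shape for the historised/merged customer table on Azure SQL, the snippet below shows the kind of T-SQL MERGE statement a template could generate from the landing zone; the schema, table, and column names are assumptions.

```python
# T-SQL a template could emit for the landing-zone -> historised table step
# on Azure SQL. Table and column names are illustrative.

MERGE_CUSTOMER = """
MERGE ods.customer AS target
USING landing.customer AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED AND target.row_hash <> source.row_hash THEN
    UPDATE SET target.first_name = source.first_name,
               target.email_addr = source.email_addr,
               target.row_hash   = source.row_hash,
               target.updated_at = SYSUTCDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (customer_id, first_name, email_addr, row_hash, updated_at)
    VALUES (source.customer_id, source.first_name, source.email_addr,
            source.row_hash, SYSUTCDATETIME());
"""
```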
Figure 19 Physical Model Translated into Azure SQL
Figure 20 "End to End"
Figure 21 "3 things worth automating"
Figure 22 Data Definition and Data Manipulation
Data Vault and Metadata Tagging in Logical Model
In the Data Vault's Logical Model, Metadata is attached to attributes to mark them as business keys, link business keys, or satellite attributes. Capturing this Metadata accurately in the Logical Model is crucial for generating hubs and satellites from templates. The hub structure includes a business key, hash key, load date, and record source, while satellites follow a similarly standard structure.
Metadata can be used to generate Snowflake code and automate the data flow process, including flattening the JSON structure. Snowpipe can be used for automation, allowing incremental files in a data lake source to be sent to a stage table in Snowflake for further processing.
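A hedged sketch of the kind of Snowflake code this could produce is shown below: a Snowpipe that copies incremental JSON files from an external stage into a stage table, followed by a query that flattens the nested payload; the stage, table, and field names are illustrative.

```python
# Snowflake SQL a template could emit: a pipe that copies incremental JSON
# files into a stage table, and a query that flattens the nested seats array.
# Stage, table, and payload field names are assumptions.

CREATE_PIPE = """
CREATE PIPE stage.pipe_showtime_reservations
  AUTO_INGEST = TRUE
AS
COPY INTO stage.stg_reservation_json (raw_payload, file_name)
FROM (
    SELECT $1, METADATA$FILENAME
    FROM @stage.showtime_lake_stage/reservations/
)
FILE_FORMAT = (TYPE = 'JSON');
"""

FLATTEN_RESERVATIONS = """
SELECT
    r.raw_payload:reservationId::STRING       AS reservation_id,
    r.raw_payload:customer.customerId::STRING AS customer_id,
    seat.value:seatNumber::STRING             AS seat_number,
    r.file_name
FROM stage.stg_reservation_json r,
     LATERAL FLATTEN(input => r.raw_payload:seats) seat;
"""
```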
Figure 23 Data Definition and Data Manipulation in Snowflake
Figure 24 Map to Source
Figure 25 Map to Source Part Two
Figure 26 Data Definition and Data Manipulation in Azure SQL
Automating Data Vault Processes
Automating processes with tasks and streams is a common practice for updating the Data Vault stage. While real-time pipelines are exciting, it is important to balance real-time and batch processing. Metadata mapping enables automation for JSON, CSV, SQL, and base model code, and with the right Metadata it is possible to create pipelines with ADF, Python, PowerShell, or JSON input. Data Vault processes can also automate business logic, allowing business vault satellites to be generated efficiently and the Metadata to be passed to data engineers for loading data.
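The stream-and-task pattern mentioned here might look roughly like the following in Snowflake, where a stream tracks new stage rows and a scheduled task loads them into a hub; the warehouse, schedule, and object names are assumptions.

```python
# Snowflake stream-and-task sketch: the stream tracks new rows in the stage
# table, and the task runs only when the stream has data. Names are assumed.

STREAM_AND_TASK = """
CREATE STREAM stage.strm_reservation ON TABLE stage.stg_reservation_json;

CREATE TASK rv.task_load_hub_reservation
  WAREHOUSE = load_wh
  SCHEDULE  = '15 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('stage.strm_reservation')
AS
INSERT INTO rv.hub_reservation (hk_reservation, reservation_id, load_date, record_source)
SELECT DISTINCT
    MD5(s.raw_payload:reservationId::STRING),
    s.raw_payload:reservationId::STRING,
    CURRENT_TIMESTAMP(),
    'showtime'
FROM stage.strm_reservation s
WHERE NOT EXISTS (
    SELECT 1
    FROM rv.hub_reservation h
    WHERE h.hk_reservation = MD5(s.raw_payload:reservationId::STRING)
);

ALTER TASK rv.task_load_hub_reservation RESUME;
"""
```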
Figure 27 Business Vault
Automating Business Satellite Structure
Efficient use of satellite information in businesses requires appropriate automation techniques and strict adherence to Logical Models that maintain security features. Reverse engineering and generating updates should be carefully executed with minimal automation.
New attributes or business rules may require the creation of new business satellites, and version control and clear communication between data modellers and engineers are crucial. It is essential to determine which aspects of business satellite structures can and should be automated while maintaining high quality and adherence to standards.
Figure 28 Security
Best Practices for Safeguarding Sensitive Data in Data Vault Landscape
Joining data requires careful consideration to ensure accuracy and effectiveness, as automation may pose potential risks to business logic. A well-defined approach should be prescribed and targeted to maintain consistency. At the same time, sensitive data should be separated using satellite schemas in the Data Vault landscape to enhance security measures through access restrictions and satellite specialisation.
Implementing masking policies and row-level access controls on platforms such as Snowflake can safeguard sensitive columns within separate satellite schemas. Standard SQL can be used to create and automate data control language (DCL) policies that manage access rights and define data security protocols. Separate templates for DCL policies allow the code that controls data access to be automated, making Data Management and data placement more definable and manageable. Incorporating additional Metadata to specify sensitivity, such as customer location or other restrictions on data access and usage in different regions, can be useful.
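As an example of the policy code such DCL templates could generate, the sketch below creates a Snowflake masking policy and attaches it to an email column in a PII satellite; the role, schema, and column names are hypothetical.

```python
# Snowflake masking policy a DCL template could generate for a PII satellite.
# Role, schema, and column names are illustrative.

MASKING_POLICY = """
CREATE MASKING POLICY policies.mask_email AS (val STRING) RETURNS STRING ->
  CASE
      WHEN CURRENT_ROLE() IN ('PII_READER') THEN val
      ELSE '***MASKED***'
  END;

ALTER TABLE rv.sat_customer_pii
  MODIFY COLUMN email_addr
  SET MASKING POLICY policies.mask_email;
"""
```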
Figure 29 Security Tag Sensitivity
Figure 30 Security Tag Sensitivity Part Two
Figure 31 Security Tag other Metadata
Data Management Best Practices
Metadata tagging ensures security and access control in Data Management systems. By creating row-level access policies based on user roles and regions, Metadata can help ensure that users see only the data relevant to them. Well-tagged Metadata can also help generate security handling at the row level and optimise performance.
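A minimal sketch of a role- and region-based row access policy in Snowflake, generated from this kind of tagging, might look like the following; the mapping table and role names are assumptions.

```python
# Row access policy sketch: rows are visible to the central team and to roles
# mapped to the row's region. Mapping table and role names are assumed.

ROW_ACCESS_POLICY = """
CREATE ROW ACCESS POLICY policies.region_filter AS (row_region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'CENTRAL_TEAM'
  OR EXISTS (
      SELECT 1
      FROM policies.role_region_map m
      WHERE m.role_name = CURRENT_ROLE()
        AND m.region    = row_region
  );

ALTER TABLE rv.sat_customer_details
  ADD ROW ACCESS POLICY policies.region_filter ON (region);
"""
```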
During the design phase of Data Management, performance optimisation should be considered when designing data objects based on known query patterns. Standardised insert patterns can lead to the creation of cluster keys and distribution definitions, optimising data distribution across nodes in MPP platforms. Access layer automation is essential for data distribution and selection queries, while link tables and driving keys can be leveraged for performance optimisation in Data Management systems like Synapse.
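For illustration, those design-time settings could translate into something like the following: a hash-distributed table in Synapse and a cluster key in Snowflake; the distribution and clustering columns are chosen purely as examples.

```python
# Design-time performance settings a template could emit for each platform.
# Table and column choices are illustrative.

SYNAPSE_DISTRIBUTION = """
CREATE TABLE dm.fact_reservation
WITH (
    DISTRIBUTION = HASH(hk_screening),   -- spread rows evenly across nodes
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM stage.fact_reservation_load;
"""

SNOWFLAKE_CLUSTER_KEY = """
ALTER TABLE rv.sat_reservation_details CLUSTER BY (load_date);
"""
```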
Figure 32 "Performance"
Figure 33 Performance: Design Time
Figure 34 Data Vault
Distribution of Data and Query Performance
Data distribution is crucial for query performance but does not solve all problems. Different levers, such as indexing, clustering keys, and materialised views, can be utilised to optimise performance at design time. Skewed data can affect performance, which makes periodic checks and alterations to the cluster key necessary.
Automation can generate scripts or procedures for weekly reports and periodic corrections. Different methods of generating point-in-time (PIT) tables are also important for performance in Data Vault modelling.
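A generated periodic check and a simple point-in-time load might look roughly like the sketch below; SYSTEM$CLUSTERING_INFORMATION is a standard Snowflake function, while the table and column names are assumptions.

```python
# Periodic clustering check plus a simple point-in-time (PIT) load, as the
# SQL a generated script could run. Object names are illustrative.

CLUSTERING_CHECK = """
SELECT SYSTEM$CLUSTERING_INFORMATION('rv.sat_reservation_details', '(load_date)');
"""

PIT_LOAD = """
INSERT INTO bv.pit_customer (hk_customer, snapshot_date, sat_customer_details_ldts)
SELECT h.hk_customer,
       CURRENT_DATE(),
       (SELECT MAX(s.load_date)
        FROM rv.sat_customer_details s
        WHERE s.hk_customer = h.hk_customer
          AND s.load_date <= CURRENT_DATE())
FROM rv.hub_customer h;
"""
```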
Figure 35 Performance: Periodic Corrections
Figure 36 Performance: Re/Design and Periodic
Different Software Tools for Data Warehousing and Modelling
Corné shares various tools used in data warehousing and modelling. Access queries can be built from a table rather than directly from the satellite; the satellite maintains all the history, but that level of granularity may not be required. Synapse relies on distribution syntax, while in Snowflake the skewness of data can be checked through SQL.
WhereScape and Coalesce are tools for modelling and Data Warehouse generation, with Coalesce being the more modern of the two, based on YAML files, and compatible only with Snowflake. dbt is a preferred tool, but it lacks a modelling aspect, which can be filled with open-source modelling tools and a Python program that generates the YAML or SQL files. TimeXtender is similar to Coalesce, and Data Vault Builder and Qlik Compose also handle Data Warehouse generation. Corné has in-depth knowledge of WhereScape, Qlik Compose, and dbt and can answer questions about these tools.
Figure 37 Available Software
Data Automation and Storage in Data Vault
Data automation tools are worth experimenting with in order to understand their functions. Data Vault Builder recently held a webinar demonstrating reverse engineering and automation.
In contrast to hubs and satellites, facts are typically stored in link tables, and satellite attributes can hang off link tables to provide more detail for fact tables. There is debate about modelling satellite attributes off link tables, with some suggesting modelling them as a hub and satellite instead.
Data Modelling and Dimensional Modelling
Data modelling is a crucial aspect of data warehousing, and in some cases data is stored in a link table with descriptive attributes in a satellite. The hub table should be thin, containing the hub key (hash key), the business key, and Metadata fields, while the link table should contain the hash keys of the related hubs, capturing the relationship, plus Metadata fields.
The satellite is where historisation happens, and the access view presents a denormalised, dimensional shape. A pragmatic approach should be taken: start with a virtualised representation and persist it for performance reasons if necessary. History should live in the Data Vault, so a data mart can be dropped and recreated at any time; without a Data Vault, the dimensional model itself must store history, and dropping and recreating the data mart is no longer an option. Slowly changing data should be preserved in the dimensional model.
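As a sketch of the virtualised approach, the view below denormalises a customer hub and its latest satellite rows into a dimensional shape that could later be persisted if performance demands it; all object names are illustrative.

```python
# A virtualised dimension in the access layer: the latest satellite row per
# customer joined to the hub, exposed in a dimensional shape. Names assumed.

DIM_CUSTOMER_VIEW = """
CREATE VIEW access_layer.dim_customer AS
SELECT
    h.hk_customer   AS customer_key,
    h.customer_id,
    s.first_name,
    s.email_addr,
    s.load_date     AS valid_from
FROM rv.hub_customer h
JOIN rv.sat_customer_details s
  ON s.hk_customer = h.hk_customer
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY s.hk_customer ORDER BY s.load_date DESC) = 1;
"""
```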
Data Vault Design Principles
The centralised data team is responsible for the raw Data Vault and ingestion. This Raw Data Vault aims to identify business concepts and keys without adding logic. The central team must standardise data ingestion approaches as different scenarios may arise. Additionally, they determine the creation of a raw Data Vault and base model.
The Business Vault is a logical separation where business teams can create entities and add business logic. It is crucial to avoid redundancy when adding business logic, which requires a good understanding of the Data Vault; working with the centralised model therefore needs decentralised teams or individuals who have mastered the system.
Business Vault and Access Layer in Data Management
The Business Vault is responsible for adding business logic to the component, while the access layer should restrict access to HR data and other sensitive tables in the Business Vault. Ideally, the business teams should define attributes and have a view generated with security policies already applied. Although the structure of the access layer should be defined centrally, data teams should be able to request the data they need flexibly.
A manual request process could become a bottleneck and lead to frustration, potentially causing teams to bypass the system; automation, or at least easier access to the data, is therefore essential.
Implementing an Ideal Self-Service BI Platform
Corné closes the webinar by highlighting how rare a truly ideal self-service BI platform is. Emphasising practicality and simplicity, he stresses the importance of automating processes so users can quickly access information via a front-end interface. He also warns that inefficiencies in self-service BI platforms can cause costly delays in urgent situations.
If you would like to join the discussion, please visit our community platform, the Data Professional Expedition.
Additionally, if you would like to be a guest speaker on a future webinar, kindly contact Debbie (social@modelwaresystems.com)
Don’t forget to join our exciting LinkedIn and Meetup data communities so you don’t miss out!