Multiple Timelines and Multi Activity in Data Vault with Remco Broekmans
Executive Summary
This webinar outlines the critical aspects of data vault, including the importance of time, data storage, multiple timelines, information reporting, and entity evolution processes. Remco Broekmans emphasises the significance of recording dates, handling multiple timelines in reporting, and effectively managing data processing and dimensional modelling in business. Additionally, the webinar covers best practices for customer phone numbers, record management, exceptional handling, context in dimensional models, and the use of batch ID. Remco also delves into the relevance of context and frequency in business analysis, along with notes on data storage and loading processes, ELM standards, logic models, and tool usage in data processing.
Webinar Details
Title: Multiple Timelines and Multi Activity in Data Vault with Remco Broekmans
Date: 23 August 2023
Presenter: Remco Broekmans
Meetup Group: INs and OUTs of Data Modelling
Write-up Author: Howard Diesel
Contents
The Importance of Time in Data Warehousing
Importance of Storing Data and Multiple Timelines
Understanding the Value of Information and Reporting
Importance of Recording Dates in Data Management
Multiple Timelines in Data Vault Storage Data
Understanding the Data Storage and Entity Evolution Process
Relationship Modelling and the Importance of Effective Dates
Dimensional Modelling in Business
Handling Multiple Timelines in Data Reporting
Discussion of Data Processing and Effective Dates in Reporting Systems
Importance of Dates in Fact Tables and Data Analysis
Best Practices for Handling Changes of Customer Phone Numbers in Data Vault
Record Management and Exception Handling
Understanding Data Loading and Context in a Dimensional Model
Importance of Context and Frequency in Business Analysis
Notes on Data Storage and Loading Process
Contact Information and ELM Standards
Contemplating the Use of a Batch ID and Pathological Cases
Thoughts on Logic Models and Tool Usage in Data Processing
The Importance of Time in Data Warehousing
Remco encourages interaction and presents a data vault builder and multiple timelines, referencing Einstein's quote on time illusions and questioning how to handle pauses during presentations. In data warehousing, the significance of time cannot be overstated. This is because the process of capturing and storing information is heavily dependent on temporal aspects. Interestingly, the movie "About Time" resonates with our innate desire to delve into past data and analyse it for insights.
Figure 2 Albert Einstein Quote
Figure 3 The Importance of Time in Data Warehousing
Figure 4 Data & Time in the Data Warehouse
Importance of Storing Data and Multiple Timelines
It is crucial to store all possible dates and times with data for strategic decision-making. Different timelines exist for various purposes, including supporting decision-making. Reporting data "as is" refers to providing the current value and outcome, disregarding past values or events. Data users emphasise the importance of current data, while analysts may have different perspectives. Multiple timelines, including "as was," have significance in data analysis.
Figure 5 Multiple Timelines
Understanding the Value of Information and Reporting
The information refers to data or details that something has. Data was recorded six months ago showing the manager's role in the store. The data was reported in relation to the business's impact. Previous data demonstrated the successful performance of the store within a specific timeframe. The most interesting aspect of reporting is assessing the old region's data to plan for the future. Functional dates have significance, such as the start of a contract or mortgage.
Importance of Recording Dates in Data Management
Keeping track of business day information is paramount for efficient Data Management. This involves recording new customer details and responding to requests. From both functional and technical perspectives, it is imperative to precisely record dates for various stages and changes in the data. This includes low-date timestamps in the data vault, which is essential for ensuring accurate and reliable Data Management.
Figure 6 Different Timelines
Figure 7 Store these different timelines in the Data Vault
Multiple Timelines in Data Vault Storage Data
Data Vault storage data contains multiple timelines, raising questions about living in different locations. Owning properties in different places is possible, but coordinating actions is uncertain. The satellite level can have multiple records with active and inactive indicators. Slowly changing dimensions are used to understand records' active and inactive states. Some argue that there can be multiple active records at the same time. The importance of multi-active to the satellite considering the multiple timelines is emphasised.
Understanding the Data Storage and Entity Evolution Process
The customer's address information is crucial for communication. The system records the customer's details and service ID as a reference. Remco lives in the Netherlands since 1997 and has residences additionally, in Cape Town. The system stores data with each record having an identifier. Extra keys or foreign keys can be added for storing multiple versions of customer details. The entity is broken down into components and assembled correctly during the data storage.
Figure 8 Multi Active in Data Vault - going wrong
Figure 9 Multi Active in Data Vault - going wrong continued
Relationship Modelling and the Importance of Effective Dates
The webinar delves into the uncertainty surrounding timelines and highlights the significance of factoring in effective dates in relation to key restart dates. Remco doubts the rationale behind appending another timeline to the existing key. This is particularly noteworthy since it has been widely acknowledged that link-to-link relationships are not advisable and discussing them is uncommon within the Data Modelling community.
Figure 10 Multi Active in Data Vault - going completely wrong
Figure 11 Multi Active in Data Vault - on Facts
Figure 12 Would we do this in Dimensional as well?
Dimensional Modelling in Business
Remco discusses the importance of Dimensional Data Modelling in a business context. He suggests adding parts to a key, distinguishing between the base effect date and requested delivery date, tracking history in Dimensional Modelling, using "as is" and "as was" as an alternative approach, maintaining version numbers for customer IDs to track changes and updates and discusses the potential benefits and challenges of using version numbers in Data Management.
Figure 13 Would we do this in Dimensional as well? continued
Figure 14 What about the fact table?
Handling Multiple Timelines in Data Reporting
Double entries in the fact table are not created for dimensional data. This is because there is uncertainty about which date, time, or dimension should be used in reporting. The fact table contains unique records in the event table, each consisting of four keys referencing dimensions, such as order date. To determine how the data should be presented, it is necessary to discuss adding a role-playing date for each key in the order table with the company. It's essential to consider the delivery perspective and decide if creating four distinct role-playing date dimensions is necessary.
Figure 15 What about the fact table? continued
Discussion of Data Processing and Effective Dates in Reporting Systems
It cannot be overemphasised how crucial it is to maintain accurate information and proper record-keeping of data. Remco recommends including both the order date and the requested delivery date to avoid confusion when processing orders. Additionally, he highlighted the significance of visual representations in reinforcing comprehension.
Figure 16 Back to the Basics
Importance of Dates in Fact Tables and Data Analysis
The start and end dates determine a contract's effective period and value rights. The system may provide a business or GUI date, or both, for tracking changes. Loading contact information into a warehouse requires recording the moment in time when the information is known. All significant contract dates should be included in the fact table, allowing the option to choose specific dates for analysis. Different dates serve different purposes, such as tracking employee payments or approval of funds.
Figure 17 Why still doing Multi Active
Best Practices for Handling Changes of Customer Phone Numbers in Data Vault
The customer's phone number must be changed to add a new customer phone number entry. When employees change their first names, a new entry should be added to the system. The satellite must capture the correct order date for an order mistakenly recorded as August 21st. To ensure proper understanding, care should be taken when introducing new words. New entries for the product should follow standard pricing since it has a fixed price. Multi-active and multi-timeline entries should be avoided for data consistency. Multi-active processes are still present due to the issue of writing information in later stages. Although a proposed solution by Petr Beles was disagreed upon, the discussion is ongoing. Historical tests or changes may result in multiple values for the same identifier. All related phone numbers should be stored in a designated field or distinguished using historical information to handle exceptions. Exceptions only affect the initial source system load, not subsequent updates.
Record Management and Exception Handling
Remco delves into the crucial aspect of maintaining and managing records in a professional environment. He elaborates on various scenarios that can arise during the process, including adding new records, identifying exceptions, and the importance of proper record identification. Additionally, Remco highlights using a centralised table with primary keys and timestamps to track business processes and ensure effective time management. He emphasises the significance of proper record management in facilitating efficient and organised operations.
Figure 18 Late arriving data
Figure 19 Record Management 1st and 2nd load
Understanding Data Loading and Context in a Dimensional Model
The Webinar touches on the optimal procedures for loading data, encompassing the importance of pinpointing a precise timeline for reporting and solely reporting up-to-date values that align with the context. Additionally, Remco referenced a discrepancy on January 25th and the subsequent validation of the accurate value on January 29th.
Figure 21 Understanding Data Loading and Context in a Dimensional Model
Figure 22 Different timelines
Importance of Context and Frequency in Business Analysis
To achieve effective analysis, it is crucial to store context-based data that caters to the business's varying requirements. Before deciding on the frequency of data loading for cost-effective server usage, it is recommended to consult stakeholders. This approach ensures that decisions are made with a thorough understanding of the business needs and are, therefore, more likely to result in successful outcomes.
Figure 23 Intra-batch Changes
Notes on Data Storage and Loading Process
Remco changed two full loadings on the table, including new and deleted data. He suggests using smaller data batches divided by days, adding timestamps to track the information's source and prioritising the first phone number in a list when contacting suppliers or customers. Exceptions to the general storage rules are mentioned, and further research is advised. These considerations are only relevant in the context of satellite nodes and not hub clouds and link clouds. An identifier or business key is the primary relationship in the enterprise, and it may change in different contexts.
Figure 24 Lousy source systems
Contact Information and ELM Standards
The phone standards of ELM provide guidelines on traction and data flow. Connect with the business on helpcenter.com to create a logical model. Use "info" to identify the purpose when sending information or questions. Corné suggests manipulating the load date to ensure uniqueness in the primary key for inter-batch loads. Remco recommends not intervening if the source system provides change data capture. Adjusting the load timestamp or loading smaller batches may be necessary if intervention changes occur.
Figure 25 Links and Info
Contemplating the Use of a Batch ID and Pathological Cases
Remco suggests using a batch ID to keep related emails together. He recommended stabilising the primary key by creating a three-part key to avoid issues while loading. Remco cautions against introducing a "green shift" and expresses uncertainty about "pathological cases." Christian discusses cases where additional key parts may be beneficial and brings up the Willibald case as an example for consideration.
Thoughts on Logic Models and Tool Usage in Data Processing
Remco expresses confidence in the dataset and willingness to explore it further. He discusses the challenges of preprocessing in batch-oriented source systems and proposes solutions. Additionally, Remco suggests that using tools for model creation may not always be necessary or efficient, based on his experience with Informatica. Remco emphasises the importance of considering individual cases and not relying solely on the capabilities of tools.
If you would like to join the discussion, please visit our community platform, the Data Professional Expedition.
Additionally, if you would like to be a guest speaker on a future webinar, kindly contact Debbie (social@modelwaresystems.com)
Don’t forget to join our exciting LinkedIn and Meetup data communities not to miss out!