How-to Trust Data for Data Management Professionals - Data Quality
Executive Summary
This webinar highlights key aspects of data management and quality for professionals in the field. Howard Diesel emphasises the principles of Data Quality Management and the importance of data profiling, which is critical for effective Master Data Management and ensuring data integrity throughout ETL and data transfer processes.
The framework focuses on various data quality dimensions, performance monitoring, and the application of data governance policies to enhance data completeness and accuracy. Additionally, Howard underscores the significance of data analysis and reporting, reference data management, and adhering to established data rules and measurements, all of which are vital for maintaining high-quality data that supports informed decision-making in organisations.
Webinar Details
Title: How-to Trust Data for Data Management Professionals - Data Quality
Date: 23 September 2021
Presenter: Howard Diesel
Meetup Group: Data Management Professionals
Write-up Author: Howard Diesel
Contents
Data Management Specialists and DQ for Data Professionals
Principles of Data Quality Management
Data Profiling and Data Quality
Data Quality Dimensions and Monitoring Frameworks
Data Profiling and Analysis
Data Profiling and Dictionary Usage
Data Quality Analysis and Assessment
Data Analysis and Reporting
Importance of Data Profiling in Master Data Management
Data Profiling and Quality Monitoring
Data Quality in ETL and Data Transfer Processes
Data Modelling and Data Acquisition
Application of Data Quality Dimensions
Data Governance and Policy Frameworks
Data Modelling, Completeness, and Accuracy
Data Quality and Reference Data
Data Quality and Data Rules
Data Measurement and Modelling
Data Management Specialists and DQ for Data Professionals
Data quality management involves data profiling and the setting of expectations and rules for data quality by the data manager and the data citizen. The data management professional implements the data quality activities and delivers the reports, policies, procedures, and other necessary deliverables. Effective data quality management is essential for identifying patterns and inconsistencies and for ensuring the accuracy and reliability of information.
Figure 1 "How-to Trust Data"
Figure 2 Trustworthy Data By Personae
Principles of Data Quality Management
Data quality management is a critical discipline that affects many areas, including data strategy and business requirements planning. Applying the principles of criticality, life-cycle management, and prevention is essential for effective data quality management. Identifying critical data elements depends on the business strategy, regulatory reporting, financial reporting, and master data; proper business keys and attributes are also vital in defining the critical data elements of a master data domain. Data quality dimensions, metrics, rules, and concepts are all crucial to a successful data quality management strategy.
Figure 3 Data Quality: Why?
Figure 4 "How-To" DQ Management
Figure 5 What are Data Quality Principles
Figure 6 "How to Identify Critical Data"
Data Profiling and Data Quality
Data profiling is crucial for analysing and assessing data quality and for identifying patterns and inconsistencies. The data manager sets expectations and rules for data quality, and the data citizen plays a vital role in establishing them, while the data management professional implements the data quality activities and delivers the profiling results, procedures, reports, policies, and guidelines. To achieve trust in data quality, it is essential to pay attention to the most important data quality concepts, including profiling the data and letting it speak in order to surface inconsistencies.
Figure 7 Relationship between Data Quality Dimensions and Data Quality Concepts
Figure 8 Data Profiling
Figure 9 Data Quality Activities
Figure 10 Data Quality Activities Pt.2
Figure 11 "Achieving Trust in Data"
Figure 12 "How Do we Archive Trust?"
Data Quality Dimensions and Monitoring Frameworks
Howard emphasises identifying the root cause of data quality problems. He recommends understanding the 12 dimensions of data quality specified in the Data Management Body of Knowledge (DMBOK) to enable specific and measurable data quality assessment. The focus should not solely be on matching data with the source but also on ensuring data source reliability. Howard suggests two scripts for data quality monitoring, but he encountered challenges executing dynamic SQL with large datasets. Despite the initial difficulties, running the profiling framework and viewing the results in Excel demonstrates the practicality and benefits of data quality monitoring.
Figure 13 Data Profiling - SQL Server Script
Figure 14 Profile Analysis
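The SQL Server script from the webinar is not reproduced in this write-up. As a minimal sketch of the dynamic SQL approach described above, the following builds one profiling query per column and executes them as a single batch; the table name dbo.Submission is an illustrative assumption, not the presenter's actual script.

```sql
-- Minimal dynamic-SQL profiling sketch (illustrative, not the webinar script).
-- Builds a row count, distinct count and NULL count for every column of one table.
DECLARE @table sysname = N'dbo.Submission';  -- assumed table name
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql + N' UNION ALL SELECT '
            + QUOTENAME(c.name, '''') + N' AS column_name, '
            + N'COUNT(*) AS row_count, '
            + N'COUNT(DISTINCT ' + QUOTENAME(c.name) + N') AS distinct_count, '
            + N'SUM(CASE WHEN ' + QUOTENAME(c.name) + N' IS NULL THEN 1 ELSE 0 END) AS null_count '
            + N'FROM ' + @table
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(@table);

SET @sql = STUFF(@sql, 1, 11, N'');  -- strip the leading ' UNION ALL '
EXEC sys.sp_executesql @sql;         -- as noted above, dynamic SQL can struggle on very large tables
```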
Data Profiling and Analysis
Data profiling is a valuable technique for analysing data at column level and identifying unique business keys. Profiling scripts report the data length and data type of each column, along with analysis measures such as minimum and maximum values, distinct-value counts, and zero-length percentages. These scripts can read metadata from the source table and profile the data daily.
Critical data elements can be defined, and DQM reports can be generated. The script can detect foreign keys and suggest potential data changes while conducting elementary domain analysis. Advanced options include domain analysis for reference data, upper and lower threshold analysis, null percentages, and variance warnings.
Figure 15 DataProfile - Power Query Editor
Figure 16 Data Profiling Script
Figure 17 Data Profiling Advanced Options
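As a hedged illustration of these column-level measures, a single-column profile might look like the sketch below; the table dbo.Submission and the column BrokerPseudonym are assumed names for illustration only.

```sql
-- Illustrative single-column profile: lengths, extremes, distinct values
-- and the percentage of zero-length values. Names are assumptions.
SELECT COUNT(*)                                  AS row_count,
       COUNT(DISTINCT BrokerPseudonym)           AS distinct_values,
       MIN(BrokerPseudonym)                      AS min_value,
       MAX(BrokerPseudonym)                      AS max_value,
       MIN(LEN(BrokerPseudonym))                 AS min_length,
       MAX(LEN(BrokerPseudonym))                 AS max_length,
       AVG(CAST(LEN(BrokerPseudonym) AS float))  AS avg_length,
       100.0 * SUM(CASE WHEN BrokerPseudonym = N'' THEN 1 ELSE 0 END)
             / COUNT(*)                          AS pct_zero_length
FROM dbo.Submission;
```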
Data Profiling and Dictionary Usage
The script performs advanced analytics on critical data elements, columns, and data types, analysing large-object columns such as varchar(max) and looking for distinct values and records at the boundaries. Domain analysis is performed only on columns with fewer than 200 distinct values, producing a distribution of those values for data quality assessment. The data profiling data model uses a dictionary to provide the order of fields, names, descriptions, data types, validation rules, and other important details. The data set profiling results include information on market references, syndicate references, broker pseudonyms, and business categories, as well as the specified lengths of fields compared with the actual lengths found in the data; discrepancies between the specified and actual data definitions must be addressed.
Figure 18 Data Profiling Settings
Figure 19 Data Profiling Data Model
Figure 20 Data Profiling Results
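A minimal sketch of the 200-distinct-value rule described above might look as follows; the table dbo.Submission and the column BusinessCategory are illustrative assumptions.

```sql
-- Domain analysis sketch: only produce a value distribution when the column
-- stays under the 200-distinct-value threshold. Names are assumptions.
IF (SELECT COUNT(DISTINCT BusinessCategory) FROM dbo.Submission) < 200
BEGIN
    SELECT BusinessCategory,
           COUNT(*)                                 AS frequency,
           100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_of_rows
    FROM dbo.Submission
    GROUP BY BusinessCategory
    ORDER BY frequency DESC;
END;
```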
Data Quality Analysis and Assessment
When building a calendar dimension, it is important to consider the average data length and the minimum and maximum dates. To ensure data accuracy, perform routine analyses on incoming data sets, such as CSV or Excel files, so that issues like incorrect field values are identified and addressed before the data is imported. Analysing fields for their percentage of zero-length values reveals the quality of the data and identifies potentially meaningless (or unexpectedly valuable) columns, and filtering on critical data elements shows how clean the data is and highlights issues with specific columns. Following these steps improves the accuracy and quality of your data.
Figure 21 Percentage Zero Length
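As a sketch of how near-empty columns might be flagged for follow-up, assuming the profiling output has been landed in a hypothetical dq.ProfileResults table with one row per profiled column:

```sql
-- Flag candidate-meaningless columns from landed profiling results.
-- dq.ProfileResults and its columns are hypothetical names.
SELECT column_name,
       pct_zero_length,
       pct_null
FROM dq.ProfileResults
WHERE pct_zero_length + pct_null >= 99.0  -- near-empty: query with the data citizen
ORDER BY pct_zero_length DESC;
```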
Data Analysis and Reporting
When building a DQM report, examining critical data elements and distinct values is helpful and provides evidence for rejecting files. It is important to present the data to the data citizens and seek their agreement and approval. Domain analysis is useful for asking specific, pointed questions, and a data citizen interview checklist helps pose impactful questions about the data and its effect on the business. Business character categories and their distribution are also useful for showing challenges in the data and seeking input from users.
Figure 22 Data Profiling Critical Data Elements
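One way to surface business character categories is to collapse characters into classes and count the resulting patterns. The sketch below collapses digits only, so that references such as 'AB1234' and 'AB5678' fall into the single pattern 'AB9999'; the table and column names are illustrative, and TRANSLATE requires SQL Server 2017 or later.

```sql
-- Character-category sketch: map every digit to 9 and count the patterns.
-- dbo.Submission and MarketReference are assumed names.
SELECT TRANSLATE(MarketReference, N'0123456789', N'9999999999') AS character_pattern,
       COUNT(*)                                                 AS frequency
FROM dbo.Submission
GROUP BY TRANSLATE(MarketReference, N'0123456789', N'9999999999')
ORDER BY frequency DESC;
```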
Importance of Data Profiling in Master Data Management
Data profiling is crucial for ensuring that data supports organisational needs rather than going unused. This involves creating a reference data table for fields with few distinct elements, questioning the necessity of empty fields with a zero percentage of distinct values, and discussing profiling observations with users to gain insight and to monitor the data. In master data management, data profiling is essential to avoid creating an authoritative source that is not trusted; emphasising profiling here prevents data-related issues and keeps the data reliable.
Figure 23 Data Profiling Data Type Suggestions
Figure 24 Data Quality in the Middle
Data Profiling and Quality Monitoring
Data profiling is a continuous process, performed during the specification stage and throughout the ongoing run of data through the system. Regular profiling is essential, as data can become dirty overnight. To keep data quality fit for decision-making, set targets for exception reports and continually validate the business rules.
The cycle "plan, do, check, act" should be continuously repeated to understand problems, fix them, monitor continually, and plan again. Data quality rules should be reviewed regularly to ensure they are neither too strict nor too loose, allowing for proper data filtering.
Data Quality in ETL and Data Transfer Processes
Continuous monitoring of data is crucial for identifying exceptions and anomalies, especially when new data is introduced. Placing data quality at the centre of all activities helps in understanding its impact on every other area. Data quality is assessed along the dimensions of accuracy, source accuracy, provenance accuracy, and credibility, each requiring careful assessment. Beyond accuracy, evaluating the business requirements and the styles of ownership and stewardship is necessary when applying data quality concepts to reference and master data.
Figure 25 Applying Data Quality to the Other Knowledge Areas
Figure 26 Applying Data Quality to the Other Knowledge Areas Pt.2
Data Modelling and Data Acquisition
A successful data management strategy requires careful consideration of several key factors: data provenance, universe modelling, selection criteria, authoritative-source comparison, data services provision, and the data model scorecard. Provenance involves defining data ownership and ensuring the data's accuracy, completeness, and consistency. Universe modelling involves creating a canonical model based on subject-matter expertise, integration patterns, and the stakeholders' systems of record.
When choosing a data source, the selection criteria should consider timeliness, accuracy, consistency, and completeness. Authoritative-source comparison helps ensure consistency across the different systems of record once data has been acquired from the authoritative source. Data services provision involves making master data and reference data accessible for sharing and measuring the effectiveness of those services. Finally, the data model scorecard measures the completeness and accuracy of data models, leading to new insights in data modelling and acquisition.
Figure 27 Applying Data Quality to the Other Knowledge Areas Pt.3
Figure 28 Data Model Scorecard Categories
Application of Data Quality Dimensions
Howard discusses applying data quality to the various knowledge areas by identifying their deliverables and techniques. He introduces the concept of "P3T", which comprises the principles, policies, procedures, and technology of data management, and gives examples of key deliverables, consumers, and processing steps in master data management. Lastly, he emphasises defining the principles, policies, and procedures before starting data quality work.
Figure 29 Metadata Management Practice Scorecard
Figure 30 How-to Apply Data Quality
Figure 31 Reference and Master Data Deliverables and Techniques
Figure 32 Understanding P3T
Data Governance and Policy Frameworks
The policy and procedure framework in data governance is designed to establish controls and processes that ensure adherence to standards. This involves comparing deliverables to the standard, identifying discrepancies, and correcting them. Data models, playbooks, and scorecards implement proactive and concurrent controls, provide feedback for guidelines and procedures, and evaluate data quality metrics such as sparseness and logical completeness. Additionally, applying the framework to reference and master data requires focusing on accuracy, completeness, and consistency at different levels, including column, row, table, and schema levels.
Figure 33 Policy Guidelines
Figure 34 Deliverable Data Quality Dimensions
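As a sketch of one of those levels, a row-level completeness measure might report the share of required fields populated on each row; the three required fields on the assumed dbo.Submission table are illustrative.

```sql
-- Row-level completeness sketch: percentage of required fields populated
-- per row. Table and field names are assumptions.
SELECT SubmissionId,
       (CASE WHEN MarketReference    IS NOT NULL THEN 1 ELSE 0 END
      + CASE WHEN SyndicateReference IS NOT NULL THEN 1 ELSE 0 END
      + CASE WHEN BrokerPseudonym    IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / 3
           AS row_completeness_pct
FROM dbo.Submission
ORDER BY row_completeness_pct ASC;
```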
Data Modelling, Completeness, and Accuracy
Howard emphasises the importance of applying business rules to data modelling to ensure correctness and completeness. A model's completeness is measured not only by how well it meets the requirements but also by whether it avoids over-delivering and over-complicating. A Power BI scorecard, which evaluates consistency, readability, and definition standards, is used to check the quality of the deliverables.
The accuracy of reference and master data is evaluated against the requirements, the accuracy of the data model, and links to industry taxonomies. Each concept, such as accuracy and completeness, is mapped to reference data and checked for value-domain conformance, presence in the systems of record, and the completeness of the golden record.
Figure 35 "Correctness and Completeness"
Figure 36 Data Modelling Scorecard Analysis
Figure 37 Reference and Master Data Deliverable Accuracy
Data Quality and Reference Data
When managing data, it is important to ensure the information is complete and consistent. A "golden profile" approach is now being used instead of the traditional "golden record" approach. The data services provided include publishing reference data, real-time online lookups, code-list management, master data lookups, application configuration data, and supplying data for the enterprise data warehouse.
Checking the completeness at a schema level is necessary to ensure that all master and reference data is published to the enterprise data warehouse. Consistency in data encompasses logical, format, temporal, and distributed data equivalence, specifically focusing on business-driven format requirements and slowly changing dimensions for temporal consistency.
A metadata registry and service should provide format validation, harmonisation, and definitions for reference and master data so that they can be used in data models and applications. The Zachman Framework provides a thorough way to examine all the elements and concepts needed to apply data quality to reference data, BI, and metadata.
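A minimal sketch of the schema-level completeness check described above, assuming an illustrative list of reference and master data tables expected in the warehouse:

```sql
-- Schema-level completeness sketch: any row returned is an expected table
-- that has not been published. The expected list is an assumption.
WITH expected(table_name) AS (
    SELECT v.table_name
    FROM (VALUES (N'ref.BusinessCategory'),
                 (N'mdm.Broker'),
                 (N'mdm.Syndicate')) AS v(table_name)
)
SELECT e.table_name AS missing_table
FROM expected AS e
WHERE OBJECT_ID(e.table_name) IS NULL;
```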
Data Quality and Data Rules
Maintaining high-quality data is a fundamental component of structured and organised work. Companies must carefully choose appropriate dimensions to ensure that their data is "fit for purpose" and that measurements accurately reflect the actual state of the data. Evaluating data against business rules and data rules together provides a complete picture of the data.
Data management professionals must apply their knowledge to ensure quality deliverables. Data quality dimensions offer a comprehensive evaluation tool for deliverables like data models. Applying dimensions to master data is a time-consuming but essential process for understanding and improving data quality.
Data Measurement and Modelling
Accurate and complete data is crucial for businesses to make informed decisions. Measuring data rigorously through scoring systems, allocating weight to the most important quality aspects, and ensuring consistency and equivalence across distributed data systems all help improve data quality. Universal representation and canonical modelling also aid in comparing different systems, and careful scoping and modelling ensure complete coverage of the data representation. Small iterations in data modelling help build an understanding of the universe in stages, making the process more manageable and effective.
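A small sketch of such a weighted scoring calculation follows; the dimensions, scores, and weights are illustrative assumptions rather than figures from the webinar.

```sql
-- Weighted data quality score sketch: heavier weights on the aspects that
-- matter most. All values are illustrative.
SELECT SUM(score * weight) / SUM(weight) AS weighted_dq_score
FROM (VALUES (N'Completeness', 0.92, 3.0),
             (N'Accuracy',     0.88, 3.0),
             (N'Consistency',  0.95, 2.0),
             (N'Timeliness',   0.99, 1.0)) AS d(dimension, score, weight);
```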
If you would like to join the discussion, please visit our community platform, the Data Professional Expedition.
Additionally, if you would like to be a guest speaker on a future webinar, kindly contact Debbie (social@modelwaresystems.com)
Don’t forget to join our exciting LinkedIn and Meetup data communities so you don’t miss out!