How-to Trust Data for Data Management Professionals - Data Quality

Executive Summary

This webinar highlights key aspects of data management and quality for professionals in the field. Howard Diesel emphasises the principles of Data Quality Management and the importance of data profiling, which is critical for effective Master Data Management and ensuring data integrity throughout ETL and data transfer processes.

The framework focuses on various data quality dimensions, performance monitoring, and the application of data governance policies to enhance data completeness and accuracy. Additionally, Howard underscores the significance of data analysis and reporting, reference data management, and adhering to established data rules and measurements, all of which are vital for maintaining high-quality data that supports informed decision-making in organisations.

Webinar Details

Title: How-to Trust Data for Data Management Professionals - Data Quality

Date: 23 September 2021

Presenter: Howard Diesel

Meetup Group: Data Management Professionals

Write-up Author: Howard Diesel

Contents

Data Management Specialists and DQ for Data Professionals

Principles of Data Quality Management

Data Profiling and Data Quality

Data Quality Dimensions and Monitoring Frameworks

Data Profiling and Analysis

Data Profiling and Dictionary Usage

Data Quality Analysis and Assessment

Data Analysis and Reporting

Importance of Data Profiling in Master Data Management

Data Profiling and Quality Monitoring

Data Quality in ETL and Data Transfer Processes

Data Modelling and Data Acquisition

Application of Data Quality Dimensions

Data Governance and Policy Frameworks

Data Modelling, Completeness, and Accuracy

Data Quality and Reference Data

Data Quality and Data Rules

Data Measurement and Modelling

Data Management Specialists and DQ for Data Professionals

Data quality management involves data profiling, with the data manager and data citizen setting the expectations and rules for data quality. The data management professional implements data quality activities and delivers reports, policies, procedures, and other necessary deliverables. Effective data quality management is essential for identifying patterns and inconsistencies and for ensuring information accuracy and reliability.

"How-to Trust Data"

Figure 1 "How-to Trust Data"

Trustworthy Data By Personae

Figure 2 Trustworthy Data By Personae

Principles of Data Quality Management

Data quality management is a critical practice that impacts areas such as data strategy and business requirements planning. Applying the principles of criticality, life cycle management, and prevention is essential for effective data quality management. Identifying critical data elements depends on business strategy, regulatory reporting, financial reporting, and master data. Proper business keys and attributes are also vital in defining critical data elements in a master data domain. Data quality dimensions, metrics, rules, and concepts are all crucial to a successful data quality management strategy.

Data Quality: Why?

Figure 3 Data Quality: Why?

"How-To" DQ Management

Figure 4 "How-To" DQ Management

What are Data Quality Principles

Figure 5 What are Data Quality Principles

"How to Identify Critical Data"

Figure 6 "How to Identify Critical Data"

Data Profiling and Data Quality

Data profiling is crucial for analysing and assessing data quality and for identifying patterns and inconsistencies. The data manager sets expectations and rules for data quality, while the data citizen plays a vital role in establishing expectations and rules for data management. The data management professional implements data quality activities and delivers data profiling, procedures, reports, policies, and guidelines. To achieve trust in data, attention must be paid to the most important data quality concepts, including data profiling and letting the data speak to reveal inconsistencies.

Relationship between Data Quality Dimensions and Data Quality Concepts

Figure 7 Relationship between Data Quality Dimensions and Data Quality Concepts

Data Profiling

Figure 8 Data Profiling

Data Quality Activities

Figure 9 Data Quality Activities

Data Quality Activities Pt.2

Figure 10 Data Quality Activities Pt.2

Achieving Trust in Data

Figure 11 "Achieving Trust in Data"

How Do We Achieve Trust?

Figure 12 "How Do We Achieve Trust?"

Data Quality Dimensions and Monitoring Frameworks

Howard emphasises identifying the root cause of data quality problems. He recommends understanding the twelve data quality dimensions specified in the Data Management Body of Knowledge (DMBOK) so that data quality assessments are specific and measurable. The focus should not be solely on matching data with its source but also on ensuring that the source itself is reliable. Howard suggests two scripts for data quality monitoring, noting the challenges he encountered executing dynamic SQL against large datasets. Despite these initial difficulties, running the profiling framework and viewing the results in Excel demonstrates the practicality and benefits of data quality monitoring.
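
The webinar's own SQL Server scripts are shown in the figures rather than reproduced here, but a minimal sketch of the metadata-driven approach illustrates the idea: the column list is read from INFORMATION_SCHEMA and a profiling statement is generated and executed as dynamic SQL. The table name dbo.Policy and the specific metrics below are assumptions for illustration only, not the webinar's script.

```sql
-- Minimal sketch of metadata-driven profiling (illustrative, not the webinar's script).
-- dbo.Policy is a hypothetical table; point the query at your own schema and table.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql + N'
SELECT ' + QUOTENAME(c.COLUMN_NAME, '''') + N' AS ColumnName,
       COUNT(*) AS TotalRows,
       COUNT(DISTINCT ' + QUOTENAME(c.COLUMN_NAME) + N') AS DistinctValues,
       SUM(CASE WHEN ' + QUOTENAME(c.COLUMN_NAME) + N' IS NULL THEN 1 ELSE 0 END) AS NullCount
FROM dbo.Policy
UNION ALL'
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_SCHEMA = 'dbo'
  AND c.TABLE_NAME   = 'Policy';

-- Each column adds another scan of the table, which is why dynamic SQL became slow
-- on large datasets; trim the trailing UNION ALL and execute the generated statement.
SET @sql = LEFT(@sql, LEN(@sql) - LEN('UNION ALL'));
EXEC sys.sp_executesql @sql;
```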

Data Profiling - SQL Server Script

Figure 13 Data Profiling - SQL Server Script

Profile Analysis

Figure 14 Profile Analysis

Data Profiling and Analysis

Data profiling is a valuable technique for analysing data at column level and identifying unique business keys. Profiling scripts report data length, data type, and analysis ratios such as minimum and maximum values, distinct values, and zero-length percentages. These scripts read metadata from the source table and can profile the data daily.

Critical data elements can be defined, and data quality management (DQM) reports can be generated. The script can detect foreign keys and suggest potential data changes while conducting elementary domain analysis. Advanced options include domain analysis for reference data, upper and lower threshold analysis, null percentages, and variance warnings.
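
As an illustration of the threshold and variance-warning options, the sketch below assumes a small, hypothetical table of profiling results and flags any column whose null percentage falls outside the band agreed with the business; the column names, values, and thresholds are invented for the example.

```sql
-- Illustrative only: compare today's profiled null percentages against agreed
-- upper and lower thresholds and raise a variance warning when breached.
CREATE TABLE #ProfileResult (
    ColumnName     sysname,
    NullPercentage decimal(5, 2),  -- percentage of NULLs found in today's profile
    LowerThreshold decimal(5, 2),  -- expected minimum, agreed with the data citizen
    UpperThreshold decimal(5, 2)   -- expected maximum
);

INSERT INTO #ProfileResult VALUES
    ('BrokerPseudonym',   1.50, 0.00, 2.00),
    ('BusinessCategory', 12.80, 0.00, 5.00);

SELECT ColumnName,
       NullPercentage,
       CASE WHEN NullPercentage BETWEEN LowerThreshold AND UpperThreshold
            THEN 'Within threshold'
            ELSE 'Variance warning' END AS Assessment
FROM #ProfileResult;

DROP TABLE #ProfileResult;
```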

DataProfile - Power Query Editor

Figure 15 DataProfile - Power Query Editor

Data Profiling Script

Figure 16 Data Profiling Script

Data Profiling Advanced Options

Figure 17 Data Profiling Advanced Options

Data Profiling and Dictionary Usage

The script performs advanced analytics on critical data elements, columns, and data types, analysing large objects (varchar(max)) to find distinct values and records at the boundaries. Domain analysis is performed only on columns with fewer than 200 distinct values, providing a distribution of those values for data quality assessment. The data profiling data model uses a dictionary to provide the order of fields, names, descriptions, data types, validation rules, and other important details. The data set profiling results include information on market references, syndicate references, broker pseudonyms, business categories, and the specified field lengths compared to the actual lengths found in the data. Discrepancies between specified and actual data definitions must be addressed.
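
A minimal sketch of that conditional domain analysis, assuming a hypothetical dbo.Policy table with a BusinessCategory column, might look as follows; the 200-value cut-off comes from the webinar, everything else is illustrative.

```sql
-- Only run the value-distribution (domain) analysis when the column is
-- low-cardinality enough to behave like reference data.
DECLARE @DistinctValues int;

SELECT @DistinctValues = COUNT(DISTINCT BusinessCategory)
FROM dbo.Policy;                      -- hypothetical table and column

IF @DistinctValues < 200
    SELECT BusinessCategory,
           COUNT(*) AS Occurrences,
           CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS decimal(5, 2)) AS PercentageOfRows
    FROM dbo.Policy
    GROUP BY BusinessCategory
    ORDER BY Occurrences DESC;
```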

Data Profiling Settings

Figure 18 Data Profiling Settings

Data Profiling Data Model

Figure 19 Data Profiling Data Model

Data Profiling Results

Figure 20 Data Profiling Results

Data Quality Analysis and Assessment

When building a calendar dimension, it is important to consider the average data length and the minimum and maximum dates. To ensure accuracy, routine analyses should be run on incoming data sets, such as CSV or Excel files, to identify and address issues like incorrect field values before the data is imported. Analysing fields for their percentage of zero-length values reveals the quality of the data and identifies potentially meaningless or valuable columns. Filtering on critical data elements shows the cleanliness of the data and highlights potential issues with specific columns. Following these steps improves the accuracy and quality of your data.
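
A hedged sketch of those routine pre-import checks, assuming the file has been loaded into a hypothetical staging.PolicyImport table with illustrative column names, could compute the date boundaries and the percentage of zero-length values per critical field:

```sql
-- Date boundaries, useful when sizing a calendar dimension for the staged file.
SELECT MIN(InceptionDate) AS MinimumDate,
       MAX(InceptionDate) AS MaximumDate
FROM staging.PolicyImport;            -- hypothetical staging table for the CSV

-- Percentage of zero-length (empty string) values in the critical fields.
SELECT
    CAST(100.0 * SUM(CASE WHEN LTRIM(RTRIM(MarketReference)) = '' THEN 1 ELSE 0 END)
         / NULLIF(COUNT(*), 0) AS decimal(5, 2)) AS MarketReferencePctZeroLength,
    CAST(100.0 * SUM(CASE WHEN LTRIM(RTRIM(SyndicateReference)) = '' THEN 1 ELSE 0 END)
         / NULLIF(COUNT(*), 0) AS decimal(5, 2)) AS SyndicateReferencePctZeroLength
FROM staging.PolicyImport;
```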

Percentage Zero Length

Figure 21 Percentage Zero Length

Data Analysis and Reporting

When building a DQM report, examining critical data elements and distinct values is helpful and provides evidence for rejecting files. It is important to present the data to data citizens and seek their agreement and approval. Domain analysis is useful for asking specific, pointed questions, and a data citizen interview checklist helps pose impactful questions about the data and understand its impact on the business. Additionally, business character categories and distributions can be useful for showing challenges in the data and seeking input from users.

Data Profiling Critical Data Elements

Figure 22 Data Profiling Critical Data Elements

Importance of Data Profiling in Master Data Management

Data profiling is crucial for ensuring that data supports organisational needs rather than being useless. This involves creating a reference data table for fields with few distinct values, questioning the necessity of empty fields with a zero percentage of distinct values, and discussing profiling observations with users to gain insight and monitor the data. In master data management, data profiling is essential to avoid creating an authoritative source that is not trusted. Emphasising data profiling in master data management therefore prevents data-related issues and ensures that the data is reliable.
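
As a small illustration of promoting a low-cardinality field into governed reference data, the sketch below assumes the hypothetical table and column names used earlier; descriptions and definitions would still be agreed with the data citizens afterwards.

```sql
-- Build a reference data table from the distinct values observed during profiling.
SELECT DISTINCT BusinessCategory AS BusinessCategoryCode
INTO dbo.RefBusinessCategory       -- hypothetical reference data table
FROM dbo.Policy
WHERE BusinessCategory IS NOT NULL
  AND BusinessCategory <> '';
```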

Data Profiling Data Type Suggestions

Figure 23 Data Profiling Data Type Suggestions

Data Quality in the Middle

Figure 24 Data Quality in the Middle

Data Profiling and Quality Monitoring

Data profiling is a continuous process, performed during the specification stage and throughout the ongoing flow of data from the system. Regular profiling is essential, as data can become dirty overnight. To ensure data quality for decision-making, set targets for exception reports and continually validate business rules.

The cycle "plan, do, check, act" should be continuously repeated to understand problems, fix them, monitor continually, and plan again. Data quality rules should be reviewed regularly to ensure they are neither too strict nor too loose, allowing for proper data filtering.

Data Quality in ETL and Data Transfer Processes

Continuous monitoring of data is crucial for identifying exceptions and anomalies, especially when new data is introduced. Placing data quality at the centre of all activities helps in understanding its impact on all other knowledge areas. Data quality is divided into dimensions of accuracy, source accuracy, provenance accuracy, and credibility, each requiring careful assessment. Beyond ensuring accuracy, business requirements and ownership and stewardship styles must also be evaluated when applying data quality concepts to reference and master data.

Applying Data Quality to the Other Knowledge Areas

Figure 25 Applying Data Quality to the Other Knowledge Areas

Applying Data Quality to the Other Knowledge Areas Pt.2

Figure 26 Applying Data Quality to the Other Knowledge Areas Pt.2

Data Modelling and Data Acquisition

A successful data management strategy requires careful consideration of several key factors: data provenance, universe modelling, selection criteria, authoritative source comparison, data services provision, and the data model scorecard. Provenance involves defining data ownership and ensuring the data's accuracy, completeness, and consistency. Universe modelling involves creating a model of the data universe based on subject matter expertise, integration patterns, and system-of-record stakeholders.

When choosing a data source, the selection criteria should consider timeliness, accuracy, consistency, and completeness. Authoritative source comparison helps ensure consistency across the different systems of record once data has been acquired from the authoritative source. Data services provision involves making master and reference data accessible for sharing and measuring the effectiveness of those data services. Finally, the data model scorecard measures the completeness and accuracy of data models, leading to new insights in data modelling and acquisition.

Applying Data Quality to the Other Knowledge Areas Pt.3

Figure 27 Applying Data Quality to the Other Knowledge Areas Pt.3

Data Model Scorecard Categories

Figure 28 Data Model Scorecard Categories

Application of Data Quality Dimensions

Howard discusses the application of data quality to the various knowledge areas by identifying deliverables and techniques. He introduces the concept of "P3T", which covers principles, policies, procedures, and technology in data management. Howard also provides examples of key deliverables, consumers, and processing steps in master data management. Lastly, he emphasises defining principles, policies, and procedures before starting data quality work.

Metadata Management Practice Scorecard

Figure 29 Metadata Management Practice Scorecard

How-to Apply Data Quality

Figure 30 How-to Apply Data Quality

Reference and Master Data Deliverables and Techniques

Figure 31 Reference and Master Data Deliverables and Techniques

Understanding P3T

Figure 32 Understanding P3T

Data Governance and Policy Frameworks

The policy and procedure framework in data governance is designed to establish controls and processes that ensure adherence to standards. This involves comparing deliverables to the standard, identifying discrepancies, and correcting them. Data models, playbooks, and scorecards implement proactive and concurrent controls, provide feedback for guidelines and procedures, and evaluate data quality metrics such as sparseness and logical completeness. Additionally, applying the framework to reference and master data requires focusing on accuracy, completeness, and consistency at different levels, including column, row, table, and schema levels.
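
The sketch below illustrates, with assumed table and column names, what completeness checks at two of those levels might look like: column-level population of a single attribute and table-level row reconciliation against the system of record.

```sql
-- Column level: how completely populated is a single attribute?
SELECT CAST(100.0 * COUNT(BrokerPseudonym) / NULLIF(COUNT(*), 0) AS decimal(5, 2))
       AS BrokerPseudonymPopulatedPct
FROM dbo.Policy;                          -- hypothetical master data table

-- Table level: did every row in the system of record arrive in the target?
SELECT (SELECT COUNT(*) FROM source.Policy) AS SourceRows,   -- hypothetical source
       (SELECT COUNT(*) FROM dbo.Policy)    AS TargetRows;
```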

Policy Guidelines

Figure 33 Policy Guidelines

Deliverable Data Quality Dimensions

Figure 34 Deliverable Data Quality Dimensions

Data Modelling, Completeness, and Accuracy

Howard emphasises the importance of applying business rules to data modelling to ensure correctness and completeness. The completeness of a model is measured not only by how well it meets requirements but also by whether it avoids over-delivering and over-complicating. A Power BI scorecard, which evaluates consistency, readability, and definition standards, is used to check the quality of deliverables.

The accuracy of reference and master data is evaluated against requirements, data model accuracy, and links to industry taxonomies. Each concept, such as accuracy and completeness, is mapped to reference data and assessed for value-domain conformance, presence in the systems of record, and the completeness of the golden record.

Correctness and Completeness

Figure 35 "Correctness and Completeness"

Data Modelling Scorecard Analysis

Figure 36 Data Modelling Scorecard Analysis

Reference and Master Data Deliverable Accuracy

Figure 37 Reference and Master Data Deliverable Accuracy

Data Quality and Reference Data

When managing data, it is important to ensure the information is complete and consistent. A "golden profile" approach is now being used instead of the traditional "golden record" approach. Data services include publishing reference data, providing real-time online lookups, code list management, master data lookups, application configuration data, and populating an enterprise data warehouse.

Checking completeness at schema level is necessary to ensure that all master and reference data is published to the enterprise data warehouse. Consistency encompasses logical, format, temporal, and distributed data equivalence, with particular focus on business-driven format requirements and on slowly changing dimensions for temporal consistency.
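
A hedged sketch of that schema-level check, assuming a warehouse schema named edw and an invented list of expected entities, simply reports anything that has not yet been published:

```sql
-- Entity names here are illustrative; any row returned is missing from the warehouse.
SELECT expected.EntityName
FROM (VALUES ('DimBroker'), ('DimSyndicate'), ('RefBusinessCategory')) AS expected(EntityName)
WHERE NOT EXISTS (
    SELECT 1
    FROM INFORMATION_SCHEMA.TABLES AS t
    WHERE t.TABLE_SCHEMA = 'edw'          -- hypothetical warehouse schema
      AND t.TABLE_NAME   = expected.EntityName
);
```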

A metadata registry and service should provide format validation, harmonisation, and definitions for reference and master data to be used in data models and applications. The Zachman framework thoroughly examines all the elements and concepts needed to apply data quality to reference data, BI, and metadata.

Data Quality and Data Rules

Maintaining high-quality data is a fundamental part of structured and organised work. Companies must carefully choose appropriate dimensions to ensure that their data is "fit for purpose" and that measurements accurately reflect how the data actually appears. Evaluating data against business rules and data rules together provides a complete picture of the data.

Data management professionals must apply their knowledge to ensure quality deliverables. Data quality dimensions offer a comprehensive evaluation tool for deliverables like data models. Applying dimensions to master data is a time-consuming but essential process for understanding and improving data quality.

Data Measurement and Modelling

Accurate and complete data is crucial for businesses to make informed decisions. Measuring data rigorously through scoring systems, allocating weight to the most important quality aspects, and ensuring consistency and equivalence across distributed data systems all help improve data quality. Universal representation and canonical modelling also aid in comparing different systems. Complete coverage of the data representation should be ensured through scoping and modelling. Working in small iterations makes data modelling more manageable and, over time, builds an understanding of the whole data universe.
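
As a simple illustration of weighted scoring (the dimensions, scores, and weights below are invented for the example), the overall result is just the weight-adjusted sum of the individual measurements:

```sql
-- Illustrative weighted data quality score; weights should sum to 1.00.
CREATE TABLE #QualityScore (
    Dimension varchar(50),
    Score     decimal(5, 2),   -- measured result, 0 to 100
    Weight    decimal(4, 2)    -- relative importance agreed with the business
);

INSERT INTO #QualityScore VALUES
    ('Completeness', 92.00, 0.40),
    ('Accuracy',     85.00, 0.40),
    ('Consistency',  70.00, 0.20);

SELECT CAST(SUM(Score * Weight) AS decimal(5, 2)) AS OverallQualityScore
FROM #QualityScore;

DROP TABLE #QualityScore;
```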

If you would like to join the discussion, please visit our community platform, the Data Professional Expedition.

Additionally, if you would like to be a guest speaker on a future webinar, kindly contact Debbie (social@modelwaresystems.com)

Don’t forget to join our exciting LinkedIn and Meetup data communities so you don’t miss out!
