How-to Store Data for Data Managers

Executive Summary

‘How-to Store Data for Data Managers’ provides an overview of key concepts and technologies in data management, including transactional and distributed databases, consistency in data lake houses, data replication, federated database systems, the CAP theorem, and lambda architecture. Howard Diesel covers blockchain and data management, data lake houses and Snowflake, microservices and data mesh, cloud storage and data management, and process control networks and data security. He also offers insights on cloud cost management, naming conventions, and choosing between a hosted Azure SQL instance and a VM for a small company. The webinar provides a glimpse into the challenges and features of data lakes and the data mesh, infrastructure, edge computing, and distributed databases.

Webinar Details

Title: How-to Store Data for Data Managers

Date: 08 December 2021

Presenter: Howard Diesel

Meetup Group: Data Managers

Write-up Author: Howard Diesel

Contents

Executive Summary

Webinar Details

Introduction to Data Storage and Architecture

Essential Concepts and Processing Types in Data Management

Transactional and Distributed Databases

Differences between ACID and BASE in data management

Consistency and Architecture in Data Lake House

Airline Ticketing and Architecture Types

Data Replication and Federated Database Systems

CAP Theorem and Lambda Architecture

The Process of Blockchain and Data Management

The concept of a data lake house

Overview of Data Lakehouse and Snowflake

Data Lakes and Data Integration

Challenges and Features of Data Lakes and the Mesh

Microservices and Data Mesh

Overview of Infrastructure, Edge Computing, and Distributed Databases

Data Management and Storage

Cloud Storage and Data Management

Cloud Cost Management and Naming Strategy

Choosing between Azure SQL-hosted instance and a VM for a small company

Process Control Networks and Data Security

Introduction to Data Storage and Architecture

Howard Diesel, an experienced database administrator (DBA), emphasises the importance of understanding data storage and operations in a rapidly growing field, much like metadata, with technologies such as data mesh and data lake house emerging. Managing a growing data storage environment and finding the right architecture for storing data is challenging, and moving to the cloud is not a simple solution given the multiple options available. Howard intends to explain the differences between data lakes and data warehouses and discuss the challenges of combining them. The existing practices of data warehousing, data marts, and business intelligence (BI) are still relevant but must coexist with newer storage concepts.

Essential Concepts and Processing Types in Data Management

The focus is on essential concepts in data management and their evolution over time. The DMBOK (Data Management Body of Knowledge) concepts are explored, including ACID (atomicity, consistency, isolation, durability) and BASE (basically available, soft state, eventually consistent) processing. While these concepts can be challenging to grasp, they are crucial for ensuring data integrity; consistency, in particular, keeps data accurate and coherent across the system.

Database Processing Types: ACID vs BASE

Figure 1 Database Processing Types: ACID vs BASE

Transactional and Distributed Databases

To maintain consistency, transactional databases require that data updates or inserts adhere to the rules defined in the data model, including foreign keys and check constraints. Transactions must be isolated from each other to prevent interference or deadlocks, where transactions updating the same data simultaneously lock each other out and leave that data inaccessible until one of them is rolled back. Once a transaction has been committed, its changes are durable and cannot be undone unless a database restore is performed. Distributed databases, such as cryptocurrency or federated databases, use a distributed ledger where data gradually becomes consistent across all nodes, improving availability and allowing the system to continue functioning even if certain nodes fail. While always-on systems, fault tolerance, mirroring, and replication can provide availability in relational or transactional databases, they do not guarantee that transactions are updated everywhere.
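
As a minimal illustration of these transactional guarantees, the Python sketch below uses the built-in sqlite3 module (not a tool from the webinar) to show atomicity and rollback; the account table, names, and check constraint are purely illustrative.

    # Minimal sketch of ACID-style transactional behaviour using Python's
    # built-in sqlite3 module. Table and account names are illustrative.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts ("
                 "name TEXT PRIMARY KEY, "
                 "balance INTEGER NOT NULL CHECK (balance >= 0))")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
    conn.commit()

    def transfer(conn, src, dst, amount):
        """Move funds atomically: either both updates apply, or neither does."""
        try:
            with conn:  # opens a transaction; commits on success, rolls back on error
                conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
                conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        except sqlite3.IntegrityError:
            # The CHECK constraint rejected an overdraft, so the whole transfer was rolled back.
            print(f"Transfer of {amount} from {src} rejected; balances unchanged.")

    transfer(conn, "alice", "bob", 30)   # succeeds and is committed (durable)
    transfer(conn, "alice", "bob", 500)  # violates the constraint and is rolled back
    print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 70, 'bob': 80}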

Differences between ACID and BASE in data management

ACID and BASE are two data management approaches with different priorities and implications. ACID prioritises consistency, ensuring all transactions are completed or rolled back to maintain data integrity. On the other hand, BASE prioritises availability over consistency, allowing for flexible and dynamic data models. ACID is typically used in transactional systems like banking, while BASE is often applied in document management systems or cryptocurrency. The differences between ACID and BASE have implications for system availability, response time, and handling of data consistency. When choosing the appropriate approach for a specific data management scenario, it is important to consider these factors.

Implications: ACID vs BASE

Figure 2 Implications: ACID vs BASE

Consistency and Architecture in Data Lake House

A data mesh uses different consistency modes, including strong and eventual consistency. Strong consistency ensures all nodes have up-to-date and consistent information, while eventual consistency allows some nodes to be temporarily out of date, with updates propagating until all nodes converge. An Enterprise Data Warehouse (EDW) is typically ACID compliant and supports transactional updates or inserts, whereas a Data Lake is typically BASE compliant and allows more flexibility in the data it stores. A data lake house combines both, supporting ACID and BASE in a unified environment. The decision between ACID and BASE processing depends on the need for consistency: ACID is chosen when consistency is crucial, and BASE for data that doesn't require immediate consistency.
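
The short Python sketch below illustrates the eventual-consistency side of this trade-off with an in-memory simulation; the replica names and the sync step are illustrative assumptions, not a description of any particular product.

    # In-memory sketch of eventual consistency: a write lands on one replica
    # first and is propagated to the others afterwards.
    class Replica:
        def __init__(self, name):
            self.name = name
            self.data = {}

    replicas = [Replica("node-1"), Replica("node-2"), Replica("node-3")]

    def write(key, value):
        # Only the first replica is updated synchronously (high availability, stale reads possible).
        replicas[0].data[key] = value

    def sync():
        # Anti-entropy step: copy the latest state to every replica so they converge.
        latest = replicas[0].data
        for r in replicas[1:]:
            r.data.update(latest)

    write("balance:alice", 70)
    print([r.data.get("balance:alice") for r in replicas])  # [70, None, None] -> inconsistent reads
    sync()
    print([r.data.get("balance:alice") for r in replicas])  # [70, 70, 70]    -> eventually consistent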

Database Processing Types: ACID vs BASE

Figure 3 Database Processing Types: ACID vs BASE

Airline Ticketing and Architecture Types

Airline companies commonly overbook flights to account for no-shows and maximise seat occupancy, even though there is only one seat per passenger; this practice is prevalent across all airlines. In database management, the architecture types include centralised, distributed, and federated. The old mainframe model uses a centralised architecture with a single location and database, while a distributed architecture involves multiple locations, with automatic reconnection to a mirror database through mirroring. ACID is required for Online Transaction Processing (OLTP), though there may be some variation depending on the type of processing.

Database Architecture Types

Figure 4 Database Architecture Types

Data Replication and Federated Database Systems

Data replication and federated database systems are two important approaches in database management. Replication is achieved through mirroring and is used for performance purposes, such as offloading reporting workloads from the OLTP system. Federated databases, on the other hand, use partitioning to distribute data among different locations; SQL queries can be directed to specific locations to retrieve only the relevant data, reducing the need to load all records into memory. Hardwired and loosely coupled views are two common approaches within federated databases. Both replication and federation are commonly used in data warehousing scenarios where incremental loads are performed and each location represents a different day in the load period. Finally, the CAP theorem is important in database systems, presenting the trade-off between consistency, availability, and partition tolerance.
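
As a rough illustration of directing queries to the location that holds the relevant partition, the hypothetical Python sketch below routes a query by load date; the node names and the one-day-per-location rule are assumptions taken from the scenario described above.

    # Hypothetical routing of queries in a partitioned/federated setup where
    # each "location" holds one day of an incremental load.
    from datetime import date

    partitions = {
        date(2021, 12, 6): "warehouse_node_a",
        date(2021, 12, 7): "warehouse_node_b",
        date(2021, 12, 8): "warehouse_node_c",
    }

    def route_query(load_date: date) -> str:
        """Direct the query to the node holding that day's partition,
        so only the relevant records are read instead of the full table."""
        try:
            return partitions[load_date]
        except KeyError:
            raise ValueError(f"No partition registered for {load_date}") from None

    print(route_query(date(2021, 12, 8)))  # -> warehouse_node_c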

Database Architecture Types Continued

Figure 5 Database Architecture Types Continued

Distributed Systems: CAP Theorem

Figure 6 Distributed Systems: CAP Theorem

CAP Theorem and Lambda Architecture

The CAP theorem outlines the trade-offs in distributed systems: it is impossible to guarantee consistency, availability, and partition tolerance simultaneously, and ACID compliance becomes harder to maintain as a distributed system grows. The Lambda Architecture combines real-time and batch processing to balance partition tolerance, availability, and consistency. It consists of a speed layer for immediate viewing with potential inconsistency, a batch layer that ensures consistency over time, and a serving layer that retrieves data from the appropriate layer. Project management has a similar concept of choosing two out of three constraints: cost, time, and quality. A blockchain is a federated network without a central server, where every node is independent and the system continues to function even if a node goes down.
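
A minimal sketch of the lambda architecture described here, assuming simple in-memory dictionaries for the speed and batch layers, might look like the following; the event names are illustrative.

    # Minimal lambda architecture sketch: speed, batch, and serving layers.
    speed_layer = {}   # recent events, available immediately but possibly incomplete
    batch_layer = {}   # recomputed periodically, consistent but delayed

    def ingest(event_id, value):
        speed_layer[event_id] = value          # visible right away

    def run_batch(all_events):
        batch_layer.clear()
        batch_layer.update(all_events)         # authoritative recomputation
        speed_layer.clear()                    # recent data is now covered by the batch view

    def serve(event_id):
        # Serving layer: prefer the consistent batch view, fall back to the speed view.
        if event_id in batch_layer:
            return batch_layer[event_id]
        return speed_layer.get(event_id)

    ingest("order-42", {"status": "placed"})
    print(serve("order-42"))                   # served from the speed layer (fresh, maybe inconsistent)
    run_batch({"order-42": {"status": "placed"}})
    print(serve("order-42"))                   # served from the batch layer (consistent)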

Federated Database: Blockchain

Figure 7 Federated Database: Blockchain

The Process of Blockchain and Data Management

Blockchain enables the transfer of money from one party to another by creating a transaction, which is represented as a block and broadcast to all parties in the network. Two or three miners approve the block, and once verified, it is added to the chain, which is immutable. Miners hold a copy of the blockchain, but their copies may be outdated if they were not part of a specific transaction, so consistency across federated databases is not immediate. The traditional data management approach involves an enterprise database or data marts for operational and external data with reporting capabilities. Databricks is a company that specialises in ETL processes for data management, and the concept of a data lake involves adding different types of data to the lake, followed by preparation outside the lake (e.g., ETL, in-memory processing) before accessing it through data marts or a real-time database.
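
To make the idea of an immutable, hash-linked chain concrete, the following Python sketch uses the standard hashlib module; it deliberately omits mining, consensus, and broadcasting, so it illustrates the chaining principle rather than a working blockchain.

    # Minimal hash-linked chain: tampering with any block breaks later links.
    import hashlib
    import json

    def make_block(transaction: dict, previous_hash: str) -> dict:
        payload = json.dumps({"tx": transaction, "prev": previous_hash}, sort_keys=True)
        return {
            "tx": transaction,
            "prev": previous_hash,
            "hash": hashlib.sha256(payload.encode()).hexdigest(),
        }

    chain = [make_block({"note": "genesis"}, previous_hash="0" * 64)]
    chain.append(make_block({"from": "alice", "to": "bob", "amount": 10}, chain[-1]["hash"]))

    def is_valid(chain) -> bool:
        """Check that every block points at the hash of the block before it."""
        for prev, block in zip(chain, chain[1:]):
            if block["prev"] != prev["hash"]:
                return False
        return True

    print(is_valid(chain))            # True
    chain[0]["hash"] = "tampered"
    print(is_valid(chain))            # False -> the chain is no longer consistent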

DW to Data Lake to LakeHouse

Figure 8 DW to Data Lake to LakeHouse

The concept of a data lake house

Integrating a data warehouse and a data lake creates a data lake house, which handles structured and unstructured data. This integrated structure operates with an indexing and caching layer and a catalogue, allowing analysis, visualisation, business intelligence, data science, and machine learning to work off that layer. Databricks, Amazon, Google, and Microsoft support this concept, with Azure introducing a data mesh and a data lake house. The data lake house not only understands the data but also incorporates ETL, analytical processing, caching, and indexing on top of documents and unstructured data, and it helps to avoid the data swamp issue that data lakes often experience.

Data LakeHouse Layers

Figure 9 Data LakeHouse Layers

DW to Data Lake to LakeHouse continued

Figure 10 DW to Data Lake to LakeHouse continued

Overview of Data Lakehouse and Snowflake

A data lakehouse is a unified data storage system that combines the features of data warehouses and data lakes, eliminating the need for separate storage areas. It offers the best of both worlds, providing consistent, quality data while still allowing the storage of less critical data. Snowflake is a native data lakehouse solution offering various ways to interact with data using both BASE and ACID approaches, with the Snowpipe ingestion process feeding data into Snowflake. Howard notes that Bill Inmon discussed integrating information in a single layer within a data lakehouse. A data lake holds data in a BASE fashion without an enforced schema, while a data warehouse enforces business rules and consistency.

Data Lakes and Data Integration

Data lakes are built on Hadoop or AWS and allow the storage and analysis of large amounts of raw, unstructured data. However, the reconstitution of data into a structured format for reporting is a challenge, and the complexity of implementing ACID principles in data pipelines is a concern. Referential integrity is often removed at the source, causing difficulties in reconstituting data accurately on the other side. The Lake House concept aims to address these challenges by adding data integration capabilities, which allow changes or updates to the data in the lake. This feature may require additional tools or technologies, but it is recognised as a significant topic in the discussion.

Additional Features to Data Lake

Figure 11 Additional Features to Data Lake

Challenges and Features of Data Lakes and the Mesh

A data lake presents challenges, including schema evolution and the lack of ACID properties and time travel. To address these issues, data virtualisation or a data warehouse can bring the data back into the correct format, and schema drift can be prevented by enforcing schema rules (a simple example is sketched below). In addition, lambda architecture supports batch, streaming, and real-time processing in data lakes. After discussing the challenges of data lakes, the data mesh is introduced as a more comprehensive solution.
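
One simple way to enforce schema rules before records land in the lake is sketched below; the expected schema and field names are illustrative assumptions rather than a recommendation of a specific tool.

    # Sketch of a pre-load schema check to catch schema drift early.
    EXPECTED_SCHEMA = {"customer_id": int, "event_type": str, "amount": float}

    def validate(record: dict) -> list[str]:
        """Return a list of schema violations; an empty list means the record conforms."""
        problems = [f"missing field '{f}'" for f in EXPECTED_SCHEMA if f not in record]
        problems += [
            f"field '{f}' should be {t.__name__}, got {type(record[f]).__name__}"
            for f, t in EXPECTED_SCHEMA.items()
            if f in record and not isinstance(record[f], t)
        ]
        return problems

    print(validate({"customer_id": 1, "event_type": "purchase", "amount": 9.99}))  # []
    print(validate({"customer_id": "1", "event_type": "purchase"}))  # drifted record is flagged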

Microservices and Data Mesh

Microservices are based on the idea that each domain operates independently and has its own data products, warehouse structure, and database system. Communication between different domains and their data products is essential, and the harmonised mesh in Azure helps to bring together different domains and data products, providing global governance and open standards. This approach requires a lot of infrastructure and technology to support the diverse domains and data products, and the harmonised mesh allows for data virtualisation and integration between different areas, enabling multiple data lake warehouses worldwide. By ensuring consistent access control, connectivity, and governance, the data mesh connects different silos with diverse infrastructure technologies.
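
Conceptually, a domain-owned data product might be registered in a shared catalogue so that discovery and global governance work across otherwise independent domains; the Python sketch below is a hypothetical illustration of that idea, not any vendor's API.

    # Hypothetical data-product registration in a central catalogue.
    from dataclasses import dataclass

    @dataclass
    class DataProduct:
        domain: str          # the owning domain (e.g. sales, finance)
        name: str
        owner: str
        endpoint: str        # where consumers can read the product
        classification: str  # governance label applied consistently across domains

    catalogue: dict[str, DataProduct] = {}

    def register(product: DataProduct) -> None:
        catalogue[f"{product.domain}.{product.name}"] = product

    register(DataProduct("sales", "daily_orders", "sales-team",
                         "https://example.invalid/sales/daily_orders", "internal"))
    print(sorted(catalogue))   # -> ['sales.daily_orders']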

Data Mesh

Figure 12 Data Mesh

Data Mesh Components

Figure 13 Data Mesh Components

Harmonized Mesh (Azure)

Figure 14 Harmonized Mesh (Azure)

Data Mesh Conclusion

Figure 15 Data Mesh Conclusion

Overview of Infrastructure, Edge Computing, and Distributed Databases

The adoption of multi-cloud solutions and hybrid infrastructure is increasing among companies today. Edge computing, which decentralises traditional data centres by bringing nodes closer to end users, has emerged to reduce latency and improve bandwidth and throughput. It has been implemented in various ways, including building data centres near stock markets and placing them in the ocean, and Microsoft deploys its cloud infrastructure in different areas partly to deal with heat-related issues. Another concept, mesh networking, combines microservices with a well-defined canonical model provided by the Enterprise Data Warehouse (EDW). Distributed and federated databases themselves can be organised in several ways, including hierarchical (tree-structured) and relational (schema-oriented) systems such as DB2.

Edge Computing

Figure 16 Edge Computing

Database Organisation

Figure 17 Database Organisation

Data Management and Storage

Data modelling approaches have evolved from third normal form designs to non-relational, schema-on-read approaches that accommodate large documents and diverse database formats. The role of a data manager includes assessing business continuity requirements, evaluating deployment criteria, and implementing storage management strategies that ensure secure data, compliance, and optimal performance while addressing data management challenges and reducing IT expenses.
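
A small sketch of the schema-on-read idea, assuming raw JSON documents and illustrative field names, is shown below: the structure is imposed only when the documents are read, not when they are stored.

    # Schema-on-read sketch: store documents as-is, shape them at read time.
    import json

    raw_documents = [
        '{"id": 1, "customer": {"name": "Acme"}, "total": "120.50"}',
        '{"id": 2, "customer": {"name": "Globex"}}',              # no total recorded
    ]

    def read_orders(documents):
        """Apply a structure at read time instead of enforcing it on write."""
        for doc in documents:
            record = json.loads(doc)
            yield {
                "id": record["id"],
                "customer_name": record.get("customer", {}).get("name"),
                "total": float(record["total"]) if "total" in record else None,
            }

    print(list(read_orders(raw_documents)))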

Database Organisation Continued

Figure 18 Database Organisation Continued

Data Storage and Operations: Activities for Data Managers

Figure 19 Data Storage and Operations: Activities for Data Managers

Figure 20 Data Storage Management

Cloud Storage and Data Management

Organisations must ensure their management systems adhere to the Health Insurance Portability and Accountability Act (HIPAA) regulations, and flags to ensure data deletion before the database takes over are essential. With cloud storage, organisations gain a better user experience, ongoing availability, reliability, lower costs, and efficient operations, and compute nodes can be spun up, used for processing, and shut down programmatically. Multi-cloud and hybrid approaches can help optimise costs and application performance. Key focus areas for managing data in the cloud include automation, orchestration, migration, configuration management, security, governance, compliance, and access control. While cloud vendors offer a certain level of security, organisations still need to take care of encryption, hashing, obfuscation, and compliance measures, and performance monitoring and hosting remain crucial considerations.
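
As one hedged example of the hashing and obfuscation measures mentioned above, the sketch below pseudonymises an identifier with a salted SHA-256 hash before the record is sent to the cloud; the field names are illustrative, and a real deployment would manage the salt or key in a secret store.

    # Pseudonymise a direct identifier before data leaves the premises.
    import hashlib
    import os

    salt = os.urandom(16)  # in practice this would come from a managed secret store

    def pseudonymise(identifier: str) -> str:
        """Replace a direct identifier with a salted hash so raw values are never sent."""
        return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()

    record = {"patient_id": "P-10293", "diagnosis_code": "J45"}
    outbound = {**record, "patient_id": pseudonymise(record["patient_id"])}
    print(outbound)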

Benefits of a data storage management strategy

Figure 21 Benefits of a data storage management strategy

Data Storage Management Trend

Figure 22 Data Storage Management Trend

Cloud management components

Figure 23 Cloud management components

Cloud Cost Management and Naming Strategy

When working with cloud technology, monitoring performance and managing costs are crucial. Over-expanding can result in dangerously high costs, while implementing a cloud business office and moving on-premises resources to the cloud can face pushback. Unique naming strategies are essential when moving large amounts of data and resources, and billing for cloud services can be complex and potentially expensive. Creating significant cloud infrastructure without proper knowledge or strategy can result in unexpected costs, although providers may offer assistance in certain cases.
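
A hypothetical example of applying a naming convention programmatically is shown below; the segment order, abbreviations, and length limit are assumptions for illustration, not an established standard.

    # Build a consistent resource name so cloud resources stay identifiable for cost reporting.
    def resource_name(org: str, env: str, region: str, workload: str, resource_type: str) -> str:
        parts = [org, env, region, workload, resource_type]
        name = "-".join(p.lower().replace(" ", "") for p in parts)
        if len(name) > 63:  # many cloud resource names are limited to 63 characters
            raise ValueError(f"Name too long ({len(name)} chars): {name}")
        return name

    print(resource_name("mws", "prod", "weu", "billing", "sqldb"))  # -> mws-prod-weu-billing-sqldb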

Choosing between Azure SQL-hosted instance and a VM for a small company

When choosing between a hosted instance of Azure SQL and a VM, small companies should consider the cost difference and other factors such as Linux versus Windows and spot versus always-available instances. For customers with varying usage requirements, the database might only be needed when the Power BI file is refreshed, and incorrect hosting choices can lead to a significant increase in costs. It is also important to note that throttling parameters for network switches may reduce the quality of service after a call exceeds a certain duration. Amazon's Well-Architected Framework includes operational excellence, security, reliability, performance efficiency, and cost optimisation as pillars, with the security pillar highlighting Amazon's responsibilities in ensuring security.

Process Control Networks and Data Security

It is generally believed that on-premises databases and systems are less secure than cloud-based systems. To ensure security, computer systems such as those in a department or process control network are usually not connected to the outside world, and process control networks are typically firewalled off from the business. SQL Server replication transfers process data from on-premises systems to the cloud, because performance and security concerns prevent business users from directly querying the process data.

If you want to receive the recording, kindly contact Debbie (social@modelwaresystems.com)

Don’t forget to join our exciting LinkedIn and Meetup data communities so you don’t miss out!
