Data-as-an-Asset

Can Big Data Replace an EDW?

By Anand.K | July 23, 2019 | 7 min read

Table of Content

Can Big Data replace an EDW?

Overview

Data Warehousing has been the buzzword for the past two or three decades, and big data is the new trend in technology. A question that often arises in our minds is, “Are they similar and will Big Data replace a Data Warehouse?” The reason being both have similarities like holding data, used for reporting purposes, and managed by electronic storage devices. There is an underlying difference between the two, namely, Big Data Solution is a technology, whereas Data Warehousing is an architectural concept in data computing.

Modern data stack tools have also changed this conversation, because organizations now expect warehouse data to be operationalized through reverse ETL, exposed through a semantic layer, and monitored with data observability platforms. As a result, the debate is less about replacement and more about how the EDW fits into a broader data architecture.

An organization can have different combinations, such as Big Data or Data warehouse solution only, or Big Data and Data Warehouse solutions based on the four consideration factors, such as: Data Structure, Data Volume, Unstructured Data, and Schema-on-Read.

This blog post tries to bring out the similarities and differences between the two and illustrates with a Use Case example.

Difference Between Big Data and Data Warehouse

What is a Data Warehouse?

A data warehouse is a conceptual architecture that helps to store structured, subject-oriented, time-variant, non-volatile data for decision making. A data warehouse typically stores the historical data, a copy of transaction data specifically structured for query and analysis. The physical data consolidation has been shifting to a more logical one, which accommodates real-time data as well. Data from the sources is transformed (cleansed, applying business rules, enhanced), and analysis is done in the ETL/ELT phase to load into a structured form (Can be relational, dimensional, hybrid, etc.).

In many modern environments, the EDW is no longer just a reporting layer. It often sits within a broader analytics stack that includes metrics layers, reverse ETL, and governance controls designed to keep business users aligned on trusted definitions.

Traditional EDW Architecture

Figure 1: Traditional EDW Architecture

A traditional Data Warehouse integrates data from many transactional and operational systems to present the resulting information as a “single integrated version of the truth” to decision makers at all levels of the organization. The design of the data warehouse, if done properly, allows us to access, report and analyze that information from all the relevant and possible angles, which drives consistent and accurate information as a result.

What is Big Data?

Big data is a technology that is used to store the unstructured data from various sources and to manage a huge volume of data in Exabytes (the size of west coast states) and Zettabytes (the size of the Pacific Ocean). Big Data is capable of storing structured, semi-structured, and unstructured data comprising video, audio, unstructured text, etc., using less expensive storage devices. The processing of data is decentralized and distributed across multiple servers for faster processing. There is no schema or modelling of the data stored, and the data is stored in its native format. The actual usage will be done by applying rules to this data, and a report will be obtained.

Big data platforms are also increasingly integrated with the modern data stack, especially where raw data must be transformed, validated, and activated downstream. This is where semantic layers and data observability tools help bridge scale with usability.

Big Data Warehouse Architecture

Figure 2: Big Data Warehouse Architecture

Comparison table of Data Warehouse to Big Data

Big Data Warehouse Architecture

Choosing Data Warehouse or Big Data:

Current day's data is not only large in size but characterized by 4Vs (Volume, Velocity, Variety, Veracity) that have changed the way the data is consumed radically. To cite an example, Facebook reported that nearly 2.5 billion different items are shared each day and their data is growing at a rate of 500TB daily and they claim to capture each user’s click in their storage space. Typically, while dealing with such big data analytics projects, failures are also common.

The decision today is often influenced by whether teams need a centralized EDW or a more distributed model such as a data mesh. While centralized governance works well for consistency, data mesh introduces domain ownership and federated accountability across teams.

Data Warehouse to Big Data

Figure 3: Data Warehouse to Big Data

So, as organizations have grown, there emerges the challenge to store and extract valuable information from these data, which involves cost, quality, accuracy and maintenance. The traditional data warehouse is typically implemented in a single or multiple relational databases serving as a central repository. Unlike traditional data warehouses, massively parallel analytic databases such as Netezza, Teradata, and EMC GreenPlum are capable of quickly ingesting large amounts of mainly structured data with minimal requirement of data modeling, and scale out to accommodate multiple terabytes to petabytes of data. Most importantly for end-users, massively parallel analytic databases support near real-time results to complex SQL queries. And, it is good with ELT instead of ETL.

In contrast, Big Data technologies are designed to span multiple machines and handle huge volumes of data irrespective of structured, semi-structured or unstructured data with high performance in a cloud-based environment or distributed servers using Hadoop, HDFS, NOSQL etc.

With respect to the business usage view, the reports are readily accessible by businesses from the EDW, but limited to only structured and transactional data. Also, the information present at all levels in DW can be fetched based on needs because of the structural arrangement of data. If business requires additional information that is available in social media logs then it requires rebuilding of the DW depending on the requirement. The business analyzes each piece of raw data and requires separate transformations on the Big Data to build a report. This involves cost and extra effort. The retrieval of information present in Big Data is difficult as the data present is unstructured and not organized.

Use Case example

Financial services companies generate structured data, such as customer demographics and transaction history, and unstructured data, such as customer behavior on websites and social media. In cases where organizations rely on time-sensitive data analysis, a traditional database DWH is a better fit for structured customer demographics and transaction history. On the other hand, where fast performance isn’t critical, Big Data analysis is fit for all structured and unstructured customer transaction or behavioral data.

Can Big Data/Hadoop and EDW share the same umbrella?

Increasingly, organizations understand that they have a business requirement to combine traditional data warehouses with their historical business data sources at one end and less structured and big data sources at the other end. A hybrid model, supporting traditional and big data sources, can thus help accomplish these business goals.

In this hybrid model, the highly structured optimized operational data remains in the tightly controlled data warehouse, while the data that is highly distributed and subject to change in real time is controlled by a Hadoop-based infrastructure. Teradata Aster Big Analytics Appliance is the first of its kind to embed SQL and big data analytics processing to allow deeper insights on multi-structured data sources with high performance and scalability.

Hybrid DWH Model

Figure 4: Hybrid DWH Model

Moreover the hybrid approach allows firms to protect their investment in their respective DWH infrastructure and extend it to accommodate the Big Data environment. As Hadoop is a family of products, each with multiple capabilities, there are multiple areas in data warehouse architectures where Hadoop products can contribute, like Data Staging, Data Archiving, Schema Flexibility, etc. Hadoop seems most compelling as a data platform for capturing and storing big data within an extended DW environment, in addition to processing that data for analytic purposes on other platforms.

One of the approaches to amplify DWH in an Enterprise with Hadoop/Big Data cluster is as follows:

Continue to store structured data from OLTP and back office systems into the DWH.
Store unstructured data, i.e., all the communication with customers from phone logs, customer feedback, GPS locations, photos, tweets, emails, text messages into Hadoop/NoSQL that does not fit nicely into tables.
Correlate data in DWH with the data in the Hadoop cluster (can be loaded into ODS also) to get better insight about customers, products, equipment, etc. Organizations can now run ad-hoc analytics and clustering and targeting models against this correlated data in Hadoop, which is otherwise computationally very intensive.

This hybrid approach is now even more relevant with reverse ETL and metrics layers, which allow governed warehouse data to flow back into operational systems and business-facing dashboards. In practice, this lets teams keep the EDW as the trusted system of record while using big data platforms for scale and flexibility.

Modern data stack perspective

The modern data stack extends the value of both EDW and big data by connecting storage, transformation, governance, and activation layers. Tools for semantic modeling, reverse ETL, and data observability help organizations turn trusted data into action across operational systems and analytics workflows.

Conclusion

While Big Data technologies are focused on advanced analytics, a modernization strategy for data archives, the EDW environment was mostly built for reporting, OLAP and performance Management. Hence, we can rightly state that Big Data is a complement, not a replacement for a data warehouse. They co-exist based on the business requirements.

In a modern data architecture, the EDW, big data platforms, and the modern data stack are increasingly complementary rather than competing layers. With observability, semantic modeling, and activation tools in place, organizations can combine centralized governance with flexible data consumption.

Hadoop will not replace a data warehouse because the data and its platform are two non-equivalent layers in Data warehouse architecture. However, there is a greater probability of Hadoop replacing an equivalent data platform, such as a relational database management system.

Anand.K

Associate Architect

Anand has 7.5 years of experience in the design & development of various data integration and warehouse projects.