Why Do Big Data Projects Fail?

By Jim Strickland | March 4, 2020 | 5 min read

When you read the marketing spin on Big Data and the tools available today, you might conclude that implementing a Big Data project has plenty of upside and little downside. You will quickly find that this is not the case. It’s not as simple as typing “apt-get install Hadoop” into a Linux command window, watching everything install, asking it complex questions, and receiving sage advice. Getting a Big Data project working is tough.

A successful Big Data initiative depends less on hype and more on readiness, data quality, and execution discipline. Without those foundations, even the best platform will struggle to deliver business value.

This post is not trying to dissuade you from attempting a Big Data project. If done correctly, it can give you critical information about your customers and products that can separate you from the pack and make your company a leader in the marketplace.

However, you need to go into the project understanding what you are getting into, with the right resources assigned to give you the highest chance of success.

Here are four reasons why I think Big Data projects typically fail:

1. Underestimating how complex it is to get data, clean it, store it, and then analyze it.

One of the top reasons a big data project fails is that companies don’t know where all their data is.

In many cases, the real challenge is not the technology itself but the end-to-end data pipeline, from source discovery to transformation and analysis. That is why Big Data projects need strong architecture and realistic timelines from day one.

Consequently, one of the first things you will need to do is understand where all of your data lives, develop connectors to extract that data from its sources, clean the data, and put it in a form that can be analyzed. If you are using Hadoop for this, then once you have identified the data sources, you will need to build connectors to those sources and extract the data. You then have to make sure the data is relatively free of inconsistencies such as multiple records for the same customer, misspellings, and so on. Then that data must be put into a standard format (JSON, XML, …) and stored in a high-performance file system such as HDFS. All of this requires some sophisticated programming in a language such as Java. You will also have to build out a large cluster of servers to host the MapReduce nodes, and this infrastructure must be configured, maintained, and monitored.
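To make the cleaning step concrete, here is a minimal sketch in plain Python, not a Hadoop connector. It assumes hypothetical customer records with "name" and "email" fields and treats the normalized email as the customer's identity, which is an illustrative choice rather than a rule from any particular system.

```python
import json

def normalize(record):
    """Collapse stray whitespace and lowercase the email so duplicates compare equal."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }

def deduplicate(records):
    """Keep the first record seen for each normalized email (assumed identity key)."""
    seen = {}
    for record in records:
        clean = normalize(record)
        seen.setdefault(clean["email"], clean)
    return list(seen.values())

def to_json_lines(records):
    """Serialize the cleaned records to a standard format: one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Hypothetical raw extract containing the same customer twice.
raw = [
    {"name": "  ada  lovelace", "email": "Ada@Example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "grace hopper", "email": "grace@example.com"},
]
customers = deduplicate(raw)
print(to_json_lines(customers))
```

Real customer matching is usually fuzzier than an exact key comparison, which is part of why this stage tends to take longer than planned.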

2. You don’t know where the data is or what condition it is in.

As noted, Big Data means exactly that: up to petabytes of data stored in one place, where analytics operations can be run against it. But to gather and store this data, you need to know where it is in the first place, and it lurks in places you may not always think of. For example, almost every company has a CRM or point-of-sale system that keeps track of its customer data. It is usually a relational database, used daily to transact business with the company’s customers. And because you cannot know in advance what data you will need for analytics, you will need to extract all of it and put it in the analytics “Data Lake.” But there is much more data available to analyze: web log files, spreadsheets, text messages with customers, feedback from Facebook, Instagram posts and comments, and much more. All of this data needs to go into the data lake in a standard form, and extraction APIs need to be built to get it.

A mature Big Data program also needs a clear data inventory and source ownership before ingestion begins. Otherwise, teams spend too much time chasing data instead of analyzing it.
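A data inventory does not need to be elaborate to be useful. The sketch below assumes a hypothetical DataSource record with a name, a kind, and an owner, and simply flags the sources nobody has claimed yet. The source names and owners are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSource:
    name: str
    kind: str
    owner: str  # an empty string means nobody has claimed this source

def unowned(sources):
    """Return the names of sources that have no owner recorded."""
    return [s.name for s in sources if not s.owner.strip()]

# Hypothetical inventory; a real one would also track location and condition.
inventory = [
    DataSource("crm", "relational database", "sales-ops"),
    DataSource("web_logs", "log files", ""),
    DataSource("pos", "relational database", "retail-it"),
]
blocked = unowned(inventory)
print(blocked)
```

Under this approach, anything in the unowned list would be resolved before ingestion starts.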

3. You don’t know what questions to ask (the data).

For a big data project to have any value to a company, it needs to provide the team with actionable information. But it’s still a dumb computer, so it must be told explicitly what questions you want answered. Because of the amount of data available in a big data ecosystem, you can ask some pretty interesting questions, from something as broad as “what is the age of most of my current customers” to something as specific as “what zip code do most customers that buy blue caps live in.” You can also look for patterns of behavior and lots of other exciting things, and the computer can learn as more and more data is analyzed and results are stored. Still, the company’s operations and marketing teams need to be able to articulate to the data analysts and programmers what questions they want to study.
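Once a question is stated that precisely, it translates almost directly into code. As a small illustration, the blue caps question might look like this in plain Python, assuming hypothetical purchase records with "customer", "product", and "zip" fields:

```python
from collections import Counter

def top_zip_for_product(purchases, product):
    """Return the zip code where the most distinct buyers of `product` live."""
    # Count each customer once, even if they bought the product several times.
    buyers = {(p["customer"], p["zip"]) for p in purchases if p["product"] == product}
    counts = Counter(zip_code for _, zip_code in buyers)
    if not counts:
        return None  # nobody bought this product
    return counts.most_common(1)[0][0]

# Hypothetical sample data.
purchases = [
    {"customer": "c1", "product": "blue cap", "zip": "30301"},
    {"customer": "c1", "product": "blue cap", "zip": "30301"},
    {"customer": "c2", "product": "blue cap", "zip": "30301"},
    {"customer": "c3", "product": "blue cap", "zip": "10001"},
    {"customer": "c4", "product": "red cap", "zip": "10001"},
]
answer = top_zip_for_product(purchases, "blue cap")
print(answer)
```

The code is the easy part. Deciding that this question matters, and that "most customers" means distinct customers rather than total purchases, is the business team's job.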

This is where business alignment becomes critical, because Big Data only creates value when the use cases are tied to real operational or strategic questions. Without that clarity, projects can become technically impressive but commercially irrelevant.

4. It is expensive and takes time (Java programmers, lots of hardware, complex administration, expensive analysts).

I hope you have gathered from the first three challenges that it’s not easy to get a big data project off the ground, much less keep it running. So, the last reason a project like this fails is the lack of a skilled team that is given enough time and budget. If you have never looked at what a MapReduce study looks like, you may want to take a look at the classic word count example, which does nothing more than count words and return the results. To add to the complexity, you need to develop in Java (other programming languages will work, but Java is the de facto standard environment), and you probably want to use an IDE such as Eclipse to manage things. All of this requires some pretty skilled developers who understand how to write MapReduce studies and can work with analysts to learn what they need to program. These are not junior programmers. For the actual analysis, your data analytics team will need to understand how to design the complex study algorithms and be able to articulate them to the developers to implement. These are senior analysts with a deep understanding of the data. And finally, the team needs to be led by a strong project manager who either is a senior executive or has the sponsorship of one.
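To give a feel for the shape of a word count study, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code and uses no Hadoop API. The function names are illustrative, and everything runs in one process, whereas a real job distributes each phase across the cluster.

```python
import re
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]

def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())

result = word_count(["Big data is big", "Data is data"])
print(result)
```

Even this toy version needs three distinct steps to count words. The Java equivalent adds job configuration, type declarations, and cluster deployment on top.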

The cost challenge is not just infrastructure; it also includes specialized talent, governance, and ongoing maintenance. That is why Big Data projects need executive sponsorship and a clear ROI case before they start.

All of this costs both time and money. The good news is that once the initial project is finished, subsequent studies and changes to data sources can become somewhat routine, and the costs will decrease. But you still need to understand that maintenance will not be without cost or effort.

For a big data project to be successful, you must have the right team in place before you start a project, have defined expectations, and have the money and patience to let the project play out. If done correctly, a big data project can pay off big time. It’s just quite a climb to reach the top.

Data quality and governance

Poor data quality is one of the fastest ways to derail a Big Data initiative. Governance rules, validation, and stewardship should be defined early so that the data lake does not become a dumping ground for unusable information.
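As a rough illustration of validation at the door of the data lake, the sketch below checks each incoming record against two rules and sets the failures aside instead of loading them. The field names and the five-digit zip rule are assumptions made for the example, not a standard.

```python
import re

ZIP_PATTERN = re.compile(r"\d{5}")  # assumed rule: five-digit US zip code

def validate(record):
    """Return a list of rule violations for one record (empty if it is clean)."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    if not ZIP_PATTERN.fullmatch(str(record.get("zip", ""))):
        errors.append("invalid zip")
    return errors

def partition(records):
    """Split records into those safe to load and those needing a steward's review."""
    accepted, rejected = [], []
    for record in records:
        errors = validate(record)
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    return accepted, rejected

# Hypothetical incoming batch.
batch = [
    {"customer_id": "c1", "zip": "30301"},
    {"customer_id": "", "zip": "3030"},
]
accepted, rejected = partition(batch)
print(len(accepted), len(rejected))
```

Keeping the rejected records together with their reasons gives data stewards something concrete to act on.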

Modern platform choices

Today, teams often reduce complexity by using cloud-native analytics platforms instead of building everything manually on Hadoop and MapReduce. This can lower operational overhead and speed up time to insight.

Conclusion

The most successful Big Data programs are the ones that start with a business problem, a realistic scope, and the right people. When those pieces are in place, the project has a much better chance of paying off.

Jim Strickland

Head – Global IT & Systems

Jim Strickland, a seasoned IT executive, leads Global IT & Systems with a 20+ year record of building high-performing teams and driving growth through transformational system integrations. His expertise spans IT strategy, digital transformation, and cloud computing, making him a key player in technology-driven business success.