Decoding big data buzzwords

Posted by admin updated on 16 Jul, 2015

This is a reproduction of the original article that was published on CIO. The original article can be found here.

When popular trends emerge, be it in the business world or elsewhere, it seems inevitable that along with said trend comes a vernacular of its own, language that is best understood by those who have the closest association. Such is the case in the world of big data as well. As the popularity of data analytics has proliferated across the business world, so too has the language spoken by the practitioners. So for those somewhat new to the game or simply looking to bone up on the latest terms and slang, let us identify and attempt to explain the latest and greatest in Big Data buzzwords:

Big data
First, let’s analyze the popular (or notorious) term itself, “big data.” What does it really mean? There isn’t a clear definition, but the term usually refers to the large amount of data that organizations have access to and that has the potential to be put to “good” use. It is often used interchangeably with data analytics or predictive modeling approaches. If you really want to be specific, try this definition built around three “V’s” that I like (usually credited to Doug Laney, an analyst at META Group, which Gartner later acquired):

  • Volume: Large datasets (think terabytes or more)
  • Velocity: Data that needs to be ingested quickly and, in some cases, acted upon quickly as well (think of data sent by a telematics device in a car)
  • Variety: Data that differs in format and source, such as audio files, database records, and social media comments

Note that there are multiple other versions, including ones with four, five, or even seven “V’s”. The number of “V’s” used to describe big data seems to be growing faster than the number of blades that guys need to get that perfect shave.

Business Intelligence (BI)
Broadly speaking, BI refers to the approaches, tools, and mechanisms that organizations use to keep a finger on the pulse of their businesses. It also goes by less sexy names: “dashboarding”, “MIS” or “reporting.” A development in this space is that today’s BI tools also allow a non-technical person to segment reports by different groups, perform basic root-cause analysis, and develop customized charts on the fly.

Data visualization
Refers to the approaches and tools used to visually understand the insights from data as well as all of its interconnections. There is usually a gold mine of insights hidden in vast volumes of data – the art and science of data visualization involves converting this data into a visual through which the insight leaps out. While there are plenty of tools available, there is a certain art to selecting the most appropriate visual to convey an insight or prove/disprove a hypothesis.

Predictive analytics
Refers to using statistical modeling techniques to predict outcomes. Think of developing an algorithm that predicts whether a borrower will pay back a loan, or how likely a newsletter subscriber is to open an email. Techniques include regression models, decision trees, neural networks, and other methods. Done right, predictive analytics can lead to very smart strategic choices for a business. But be cautious here: you are implying that you can predict the future.
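
To make the loan example a little more concrete, here is a minimal sketch of one such technique, a logistic regression fitted by plain gradient descent. The training data, the single debt-to-income feature, and the function names are all made up for illustration; a real model would use far more data, more features, and a tested library.

```python
import math

# Made-up history: (debt_to_income_ratio, repaid), where repaid is 1 or 0.
history = [
    (0.1, 1), (0.2, 1), (0.3, 1), (0.4, 1),
    (0.6, 0), (0.7, 0), (0.8, 0), (0.9, 0),
]

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(data, learning_rate=0.5, epochs=5000):
    """Fit p(repaid) = sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Average gradient of the log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict_repayment(w, b, ratio):
    """Estimated probability that a borrower with this ratio repays."""
    return sigmoid(w * ratio + b)

w, b = fit_logistic(history)
low_risk = predict_repayment(w, b, 0.15)   # borrower with little debt
high_risk = predict_repayment(w, b, 0.85)  # borrower with a lot of debt
print(f"P(repay | ratio=0.15) = {low_risk:.2f}")
print(f"P(repay | ratio=0.85) = {high_risk:.2f}")
```

Note that the output is a probability rather than a yes/no answer, which is a useful reminder of the caution above: the model estimates likelihoods from past data, it does not know the future.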

Descriptive analytics
This concept is defined as using data to explain what has already happened (compare with predictive analytics). How were sales this month, split by category, region, store segment, etc.? How does loan performance compare by cohort, over time, by product, etc.? Most descriptive analytics involves relatively simple techniques, but some sophisticated approaches also belong here (e.g. clustering).
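
As a rough illustration of the "sales split by region or category" question, the sketch below totals a handful of invented sales records by one field at a time. The records and the `total_by` helper are hypothetical; in practice this is the kind of thing a BI tool or a SQL GROUP BY would do for you.

```python
from collections import defaultdict

# Made-up sales records: (region, category, amount).
sales = [
    ("North", "Shoes", 120.0),
    ("North", "Hats", 30.0),
    ("South", "Shoes", 200.0),
    ("South", "Hats", 50.0),
    ("North", "Shoes", 80.0),
]

def total_by(records, key_index):
    """Sum the amount (third field), grouped by the field at key_index."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)

sales_by_region = total_by(sales, 0)
sales_by_category = total_by(sales, 1)
print(sales_by_region)
print(sales_by_category)
```

Nothing here predicts anything; it only summarizes what has already happened, which is the whole point of the term.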

Real-time analytics
As the name suggests, this phrase refers to approaches that use data and analytics in “real time.” This might mean the ability of an organization to pre-process data in real time and only use processed data going forward, or the ability to continuously predict outcomes in real time as the input data and context change (e.g. predicting weather), or even the ability of users to perform BI and other analytical tasks on data that is coming in real time.
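
One simple way to picture the "pre-process data as it arrives" case is a statistic that is updated with every new reading instead of being recomputed from scratch. The sketch below keeps a running mean over a stream; the speed readings and the `running_mean` name are invented for illustration, and a production system would sit on top of a streaming platform rather than a Python list.

```python
def running_mean(readings):
    """Yield the mean of all readings seen so far, updated one reading at a time."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        # Incremental update: no need to store or re-scan earlier readings.
        mean += (value - mean) / count
        yield mean

# Made-up vehicle speeds arriving one by one.
speeds = [60.0, 62.0, 58.0, 64.0]
means = list(running_mean(speeds))
print(means)
```

Because the function only needs the previous mean and a counter, the same idea works when the input is an endless feed rather than a fixed list.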

Hadoop
Whether deserved or not, this is the word that gets talked about the most when conversations about big data come up. Hadoop refers to a way to store large volumes of data using off-the-shelf hardware products. Key features include: scalability (keep adding new hardware and the system can keep taking more data), redundancy (the storage model can deal with hardware breakdown), an open source license (and hence no license cost), and the ability to ingest data in different formats (video, traditional data tables, Facebook comments, etc.). Given how commonly “Hadoop” and “big data” are spoken together, it is worth noting that Hadoop is not an essential ingredient of a big data strategy. In fact, a Hadoop-based architecture makes sense for some big data use cases and not for others. So do not equate analytics or big data with Hadoop.

Unstructured data
This is an imprecise term, but it broadly refers to data that doesn’t fit into the usual data model of rows and columns. Think of a collection of video files, text documents, weblog data, or email content. Most of this data won’t neatly fit into the construct of usual data warehouses, although there might be very relevant insights and patterns hiding in it. Most unstructured data needs to be converted into some structure to unlock such insights.
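
Here is a small sketch of what "converting into some structure" can look like for weblog data: raw text lines go in, and rows with named fields come out. The log lines are made up, and the pattern is a simplified take on the Common Log Format, so treat it as an assumption rather than a complete parser.

```python
import re

# Made-up weblog text; the last line is deliberately malformed.
raw_log = """\
203.0.113.5 - - [16/Jul/2015:10:01:22 +0000] "GET /pricing HTTP/1.1" 200 5120
203.0.113.9 - - [16/Jul/2015:10:01:25 +0000] "GET /missing HTTP/1.1" 404 312
this line is garbage and will be skipped
"""

# Simplified pattern: host, two ignored fields, timestamp, request, status, size.
LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log(text):
    """Turn raw log text into a list of dicts, skipping lines that do not match."""
    rows = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            continue
        row = match.groupdict()
        row["status"] = int(row["status"])
        row["size"] = int(row["size"])
        rows.append(row)
    return rows

rows = parse_log(raw_log)
for row in rows:
    print(row["ip"], row["path"], row["status"])
```

Once the text is in this shape, it fits the rows-and-columns model after all, and the usual descriptive or predictive techniques can be applied to it.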

Machine Learning
This is a broad topic that covers approaches that use a machine to help discover insights and linkages, make predictions, and recommend decisions. Predictive modeling is a subset of machine learning, and so are clustering and segmentation. Whenever I hear someone mention “machine learning”, I always ask them to be more specific: are they referring to predictive or descriptive analytics, to supervised or unsupervised learning, and so on? By itself, machine learning is too broad a term to be helpful in discussing capabilities or an approach to solving a problem.

[Other notable mentions that nearly made it to the above list: R, Python, NoSQL, and Neural Networks]

So there you have it. It seems there’s no shortage of chatter related to big data these days. Hopefully, these basic definitions will help any big data newcomer better understand what everyone is talking about.