From online to offline

The 3 Vs of Big Data (volume, velocity, and variety) are finding their way into every aspect of life, from government work to online dating, improving the performance of existing systems or replacing them entirely. The revolution started by online retail is unfolding with our cooperation. Smartphones, fitness equipment, even office badges are no longer just inanimate objects, but vectors of transformation. In this article we will look at the main Big Data algorithms and some of their applications.

Numerous devices capable of generating and storing Big Data are already in place. Structured data can come from transactions, invoices, bank statements and reports, GPS positioning, lab tests, and RFID tags. Semi-structured data can be found online in directories, social media accounts or blogs, and weather portals, while unstructured data comes from video surveillance cameras, product reviews and so on.

To be useful in building and calibrating models, such data needs to be cleaned and transformed. First, the data needs to be checked for consistency, with any discrepancies eliminated and missing values filled in or approximated. Then categories need to be defined, and the data normalized and divided into “learning data” and “calibration data.”
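The preparation steps above can be sketched in a few lines. This is only a minimal illustration, assuming purely numeric rows in which a missing value is stored as `None`; the function name `prepare`, the `train_fraction` parameter, and the choice of mean imputation and min-max scaling are assumptions made for the example, not something the article prescribes.

```python
import random
from statistics import mean


def prepare(rows, train_fraction=0.8, seed=0):
    """Fill missing values, normalize each column to [0, 1], then split."""
    n_cols = len(rows[0])

    # Approximate each missing value with the mean of the values present
    # in the same column (one of several possible strategies).
    col_means = [
        mean(r[c] for r in rows if r[c] is not None) for c in range(n_cols)
    ]
    filled = [
        [col_means[c] if r[c] is None else r[c] for c in range(n_cols)]
        for r in rows
    ]

    # Min-max normalization; a constant column is mapped to 0.0.
    lows = [min(r[c] for r in filled) for c in range(n_cols)]
    highs = [max(r[c] for r in filled) for c in range(n_cols)]
    scaled = [
        [
            (r[c] - lows[c]) / (highs[c] - lows[c]) if highs[c] > lows[c] else 0.0
            for c in range(n_cols)
        ]
        for r in filled
    ]

    # Shuffle reproducibly, then cut into "learning" and "calibration" data.
    shuffled = scaled[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

In a real pipeline the column means and ranges would usually be computed from the learning data only, so that nothing from the calibration data leaks into the model.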

Standard algorithms used for Big Data analysis and applications


Clustering

Clustering means finding the natural groups that occur in data by identifying cluster centers and the objects that gravitate toward them. A variant is fuzzy clustering, in which an object can belong to several groups at once, each with a certain degree of membership.
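One common way to find such centers is k-means, sketched below. The article does not name a specific clustering algorithm, so this choice is an assumption, as are the names `assign` and `k_means` and the fixed number of iterations. Points are plain tuples of numbers.

```python
import math
import random


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]


def k_means(points, k, iterations=20, seed=0):
    """Plain k-means: alternate between assigning points and moving centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centers)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a group ends up empty
                centers[i] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centers, assign(points, centers)
```

Calling `k_means(points, 2)` returns the two centers together with a label for each point. Fuzzy clustering would replace the hard label with a membership weight per center.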

Data clustering can be used in retail to define customer segments, in healthcare to support diagnosis by grouping symptoms, and in entertainment to make recommendations based on the preferences of the group a client belongs to. In Fintech, it can help identify similar stocks and build portfolios that aim for the best possible performance.

Anomaly Detection

A method derived from clustering is detecting those records that don’t seem to fit any of the natural groups. These are outliers or anomalies, and they lie at a significant distance from all of the existing cluster centers. After an algorithm has been trained on real data sets, it can detect the items that don’t conform to the rules.
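The idea of "a significant distance from any cluster center" can be made concrete as follows. This is a sketch under stated assumptions: the cluster centers are taken as already known, and "significant" is defined as more than `n_sigmas` standard deviations above the mean distance observed in the training data. That threshold rule and all function names are illustrative choices, not the only way to do it.

```python
import math
from statistics import mean, pstdev


def nearest_distance(point, centers):
    """Distance from a point to the closest cluster center."""
    return min(math.dist(point, c) for c in centers)


def fit_threshold(training_points, centers, n_sigmas=3.0):
    """Derive a distance cut-off from data assumed to be mostly normal."""
    distances = [nearest_distance(p, centers) for p in training_points]
    return mean(distances) + n_sigmas * pstdev(distances)


def find_anomalies(points, centers, threshold):
    """Return the points that lie farther than the threshold from every center."""
    return [p for p in points if nearest_distance(p, centers) > threshold]
```

What counts as an anomaly depends entirely on the centers and the threshold, which mirrors the point that the result depends on how the acceptable cluster is defined in the first place.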

Such an approach can be used to identify credit card fraud or illegal use of online accounts. In healthcare, a patient whose readings are outliers could either be sick or be showing improvement. It all depends on the initial definition of the acceptable cluster. High-crime areas in a city can also be identified as outliers. An anomaly can be positive too, such as a blockbuster movie, and Big Data can help work out what made it succeed.

Association rules (patterns)

We know that correlation does not mean causation, but some interesting links between records can uncover new information. When a group of connections keeps repeating across a large dataset, it is worth investigating.
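"Keeps repeating" is usually quantified with support, confidence, and lift. The sketch below computes them for pairs of items only; the function name `pair_rules` and the default thresholds are assumptions made for illustration, and real systems typically handle larger itemsets with algorithms such as Apriori or FP-Growth.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Find rules 'lhs -> rhs' between single items.

    Returns tuples (lhs, rhs, support, confidence, lift).
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # P(rhs | lhs)
            if confidence >= min_confidence:
                # Lift above 1 means the items co-occur more than chance.
                lift = confidence / (item_counts[rhs] / n)
                rules.append((lhs, rhs, support, confidence, lift))
    return rules
```

A rule can have high confidence and still a lift below 1, simply because the right-hand item is popular on its own. That is one practical reason a repeating connection deserves investigation rather than immediate trust.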

In retail, association rules are the basis of cross-selling, while in medicine they could signal a disease. Associations can also be used to make city traffic less congested by redirecting it at peak hours. Even HR can benefit from this way of studying Big Data by creating the “ideal candidate” profile or scanning for a great leader by associating traits with outcomes. Fitness applications use associations from a large pool of users to create customized recommendations related to diet, exercise, and sleep.

Forecasting

This method relies on established statistical techniques to estimate the value of a variable at a later moment based on existing data. It deals with time series, determining the trend, the seasonality and the effect of noise in the data. It is derived from general data analysis techniques and adapted for Big Data.
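A very small sketch of a trend-plus-seasonality estimate is shown below. It assumes an additive model with a linear trend and a fixed season length; the function name `forecast` and the `period` parameter are hypothetical, and production systems generally rely on richer models such as exponential smoothing or ARIMA.

```python
from statistics import mean


def forecast(series, period, steps):
    """Extrapolate a series assuming linear trend + repeating seasonal profile.

    Requires more observations than one full period.
    """
    # Trend: average change between observations one full season apart,
    # which cancels out the seasonal effect.
    diffs = [series[t + period] - series[t] for t in range(len(series) - period)]
    slope = mean(diffs) / period

    # Removing the trend leaves level + seasonal effect + noise.
    detrended = [y - slope * t for t, y in enumerate(series)]

    # Averaging the same position across seasons damps the noise.
    profile = [mean(detrended[i::period]) for i in range(period)]

    n = len(series)
    return [slope * t + profile[t % period] for t in range(n, n + steps)]
```

For example, `forecast(monthly_sales, period=12, steps=6)` would project six months ahead from monthly data, where `monthly_sales` is any list of numbers covering more than one year.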

Such an approach can be useful for estimating sales volumes and revenue, anticipating person-hours based on company growth, or computing future energy consumption and adjusting production accordingly. Even traffic or the number of customers in a retail store can be predicted by taking normal variations into consideration. Using this method, transportation companies can adjust the size of their fleet and educational facilities can estimate the number of full-time teachers and assistants they need.

Sequence analysis

A sequence is an ordered set of individual items, usually following a timeline. In Big Data, analyzing a sequence is like looking at a time series, but focusing on the links between the steps. The aim is to find out which step leads to another.
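One simple way to find out which step leads to another is to count transitions between consecutive steps, which amounts to a first-order Markov model. The sketch below takes that approach; it is an assumption rather than the article's stated method, and the function names and the store example are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next step | current step) from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair every step with the one that immediately follows it.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        step: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for step, followers in counts.items()
    }


def most_likely_next(probabilities, step):
    """Return the most frequent successor of a step, or None if unseen."""
    followers = probabilities.get(step)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical customer paths through a store.
paths = [
    ["entrance", "produce", "dairy", "checkout"],
    ["entrance", "produce", "bakery", "checkout"],
    ["entrance", "dairy", "checkout"],
    ["entrance", "produce", "dairy", "checkout"],
]
probabilities = transition_probabilities(paths)
```

This looks only one step back. Longer dependencies, as in genomics, usually call for higher-order models or dedicated sequence-mining algorithms.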

Typical applications are in genomics, but also in looking at the customer’s path in a store. Building an investment portfolio is also done in steps, as is getting a degree or climbing the hierarchical ladder. Practically, any evolution process can be subject to this technique, and Big Data can help identify desirable successions and wrong steps.

Computer vision

Artificial intelligence is becoming able to process video footage in a way similar to the human brain. This kind of analysis belongs to the discipline of computer vision, which offers a comprehensive way to make use of hours of CCTV or YouTube videos. The good news is that it is not limited to existing footage, which can also serve as a starting point for AR and VR applications.

In the future, computer vision could replace showrooms and fitting rooms, and it could be used to increase the personal security of citizens. Some applications are already being tested in medicine to show the expected results of plastic surgery and to help medical students learn about the human body without cadavers.

Future challenges and opportunities

These are some of the most important algorithms for Big Data analysis, yet most of them are just adaptations of existing statistical models. As the world of Big Data grows, there will be more proprietary algorithms designed to deal with specific challenges.

The opportunities offered by using Big Data to power AI are endless: there is hardly a domain in which at least one process could not benefit from such an approach. In this decade, the role of Big Data analysts is to create new architectures and algorithms that are better suited to supporting the 3 Vs.