
Big Data: A case for building an Analytics Platform.


In my previous post on this subject, I talked about a few of the common problems that plague traditional data warehousing initiatives. A few of my friends, after reading the article, asked me whether I am questioning the relevance of traditional data warehousing. The answer to that question is a resounding "No". Data warehousing does offer significant benefits to the business, but my belief is that practitioners of data warehousing infrastructure need to evolve beyond the traditional models and build platforms that leverage the benefits of the traditional warehouse while complementing it with systems that provide quick, agile and adaptive solutions to meet the needs of the business.

In this post, I am making a proposal for one such system, which I am calling an "Analytics Platform".

What is an Analytics Platform?

I would define an Analytics Platform as an engine built with the underlying objective of offering "self service" to the users of the system. I visualize the platform as a pipeline, connected by a set of tools, that facilitates easy ingestion of diverse data sets, handles massive volumes and high velocity of data, ensures the quality of the ingested data based on a set of simple rules that can be specified by the users, and provides users with a set of capabilities to visualize and analyze the data ingested and stored.
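To make the "pipeline with user-specified quality rules" idea a little more concrete, here is a minimal, purely illustrative sketch in Python. The function and rule names are hypothetical assumptions for the example, not part of any specific product; the point is only that quality rules are declared by the user as simple predicates rather than coded into the platform.

# Hypothetical sketch: ingestion step that applies user-declared quality rules.
import csv
import io

def ingest(raw_text, rules):
    """Parse CSV text, keep rows that satisfy every user-declared rule,
    and report the rows that were rejected."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        if all(rule(row) for rule in rules):
            accepted.append(row)
        else:
            rejected.append(row)
    return accepted, rejected

# User-specified quality rules: plain predicates, no platform code changes needed.
rules = [
    lambda r: r["user_id"].strip() != "",                  # mandatory field
    lambda r: r["amount"].replace(".", "", 1).isdigit(),   # numeric amount
]

sample = "user_id,amount\nu1,10.5\n,3.0\nu2,abc\n"
good, bad = ingest(sample, rules)
print(len(good), "rows accepted,", len(bad), "rows rejected")

Running this against the sample accepts one row and rejects two, and the rejected rows can be surfaced back to the user as part of the self-service feedback loop.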

What are the features that can make such a platform very useful and powerful?


Here goes my wish list:

a) This will be a platform that is essentially built around self service. Users will be able to seamlessly interact with the platform to ingest data, ensure data quality and integrity, and visualize/analyze the data.

b) The platform will facilitate ingestion of a wide variety of data, manage and store massive volumes of data, and be able to ingest data that arrives in bursts or at high velocity.

c) The platform will be able to handle structured, semi-structured and unstructured data with equal ease.

d) The ingestion mechanism will be fast and will focus on ensuring that data is available for users to analyze within minutes. (Real-time data availability is a desired goal, but what I am talking about is data being available for analysis within an hour at most, not after days.)

e) The data, in its raw form as well as in aggregated form, will be stored and available for analysis for an extended period of time; say a minimum of a month. The retention duration can be configured based on need.

f) The platform will provide a seamless mechanism to discover the data stored on it. Users can easily navigate and discover the data sets stored, their nature, their recency and retention, and the constraints associated with the data (a small sketch of such a catalog follows this list).

g) The platform will offer the ability to dynamically link data across the different data sets loaded into it, to facilitate joins and holistic analysis across the data sets stored.

h) The platform will expose the data in a seamless manner so that users can plug in the tools they are familiar with to process, aggregate and analyze the data available. The interfaces to access the data will be standardized, thereby allowing external tools to plug in, interact with, and compute over the data.

i) The platform can act as a bridge between the data store and traditional legacy data warehouses. The platform will essentially complement the existing warehouses rather than be an alternative to them. It will help the business address the gaps in the traditional data warehouses without mandating the replacement of a well-established existing system with a new one.

j) The platform should offer the ability to perform ad hoc analysis on the data stored, support regular and scheduled reporting, and allow extraction into traditional tools used for analysis.

k) The platform will guarantee good performance at scale and ensure high availability and reliability, both for the data stored on the platform and for the processing of that data.
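As a small illustration of items (e) and (f) above, here is a hypothetical Python sketch of what a self-service data set catalog could look like: each entry carries its own description, freshness, retention and constraints, so users can discover what is stored and when it expires. The class, field names and sample data are assumptions made for the example, not a reference to any existing tool.

# Hypothetical sketch: a self-describing catalog entry for each data set.
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class DatasetEntry:
    name: str
    description: str
    last_loaded: date
    retention_days: int = 30                            # item (e): configurable, a month by default
    constraints: list = field(default_factory=list)     # item (f): documented constraints

    def expires_on(self):
        return self.last_loaded + timedelta(days=self.retention_days)

# Hypothetical catalog contents, for illustration only.
catalog = [
    DatasetEntry("clickstream_raw", "Raw web click events", date(2013, 5, 1),
                 retention_days=45, constraints=["user_id is never null"]),
    DatasetEntry("orders_daily", "Aggregated daily orders", date(2013, 5, 2)),
]

# Item (f): simple discovery - list what is stored, how fresh it is, and when it expires.
for entry in catalog:
    print(entry.name, "| loaded:", entry.last_loaded, "| expires:", entry.expires_on())

The design choice being illustrated is that discovery and retention are metadata owned by the data set itself, so a user browsing the platform never has to ask the platform team what is available or how long it will stay.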


Can such a platform be built, and if so, can it be built quickly and in a cost-effective way?

At this juncture, I do not have the answer; I am in the process of finding out. But from the initial data points I have been able to gather, it looks quite feasible. During my research, I have found various concepts, systems, tools, and modules that address one or another of the features I have listed above. What is more gratifying is that many of these tools and systems are open source, with a decent level of support infrastructure. But at this point, I have no idea whether all such tools can be stitched together to create the Analytics Platform. This is what I intend to find out, and I will keep posting as my research progresses.
