In my previous post on this subject, I talked about a few of the common problems that plague traditional data warehousing initiatives. A few of my friends, after reading through the article, asked me whether I am questioning the relevance of traditional data warehousing. The answer to that question is a resounding "No". Data warehousing does offer significant benefits to the business, but my belief is that the practitioners of data warehousing infrastructure need to evolve beyond the traditional models and build platforms that leverage the benefits of the traditional warehouse while complementing it with system(s) that provide quick, agile and adaptive solutions to meet the needs of the business.
In this post, I am making a proposal for one such system and I am calling it an "Analytics Platform".
What is an Analytics Platform?
I would define an Analytics Platform as an engine built with the underlying objective of offering "self-service" to the users of the system. I visualize the platform as a pipeline connected by a set of tools that facilitates easy ingestion of diverse data sets, handles massive volumes and high velocity of data, ensures the quality of the data ingested based on a set of simple rules that users can specify, and provides users with a set of capabilities to visualize and analyze the data ingested and stored.
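To make the idea of user-specified quality rules a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration: the `Rule` class, the `validate` helper, and the `user_id` and `amount` fields are assumptions of mine, not part of any existing tool. The point is only that a rule can be as simple as a name, a field and a check.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Rule:
    """A hypothetical user-defined quality rule: a name, a field, a check."""
    name: str
    field: str
    check: Callable[[Any], bool]


def validate(record: dict, rules: List[Rule]) -> List[str]:
    """Return the names of the rules that the record fails."""
    failed = []
    for rule in rules:
        value = record.get(rule.field)
        try:
            # A missing field fails the rule; so does a check that raises.
            ok = value is not None and rule.check(value)
        except (TypeError, ValueError):
            ok = False
        if not ok:
            failed.append(rule.name)
    return failed


# Example rules a user might declare (field names are assumptions).
rules = [
    Rule("user_id present", "user_id", lambda v: str(v).strip() != ""),
    Rule("amount non-negative", "amount", lambda v: float(v) >= 0),
]

good = {"user_id": "u42", "amount": "19.99"}
bad = {"user_id": "", "amount": "-5"}

print(validate(good, rules))
print(validate(bad, rules))
```

In a real platform, records that fail a rule would probably be routed to a quarantine area rather than dropped, so that users can inspect and fix them.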
What are the features that can make such a platform very useful and powerful?
Here goes my wish list:
a) This will be a platform built essentially to offer self-service to users. Users will be able to interact seamlessly with the platform for data ingestion, for ensuring data quality and integrity, and for visualization/analysis.
b) The platform will facilitate ingestion of a wide variety of data, manage and store massive volumes of data, and be able to ingest data that is bursty or arrives at high velocity.
c) The platform will be able to handle structured, semi-structured and unstructured data with equal ease.
d) The ingestion mechanism will be fast and will focus on ensuring that data is available for the users to analyze within minutes. (Note that real-time data availability is a desired goal, but what I am talking about is data being available for analysis within an hour at most, and not after days.)
e) The data in its raw form (as well as in aggregated form) will be stored and be available for analysis for an extended period of time, say for at least a month. The retention duration can be configured based on need.
f) The platform will provide a seamless mechanism to discover the data stored on it. Users can easily navigate the platform to find the data sets stored, the nature of each data set, its recency and retention, and the constraints associated with the data.
g) The platform will offer the ability to dynamically link the different data sets loaded into the platform, to facilitate joins and holistic analysis across the data sets stored.
h) The platform will expose the data in a seamless manner, so that users will be able to plug in the tools they are familiar with to process, aggregate and analyze the data available. The interfaces to access the data will be standardized, thereby allowing external tools to plug in, interact with and compute on the data.
i) The platform can act as a bridge between the data store and traditional legacy data warehouses. The platform will essentially complement the existing warehouses and will not be an alternative to them. It will help the business address the gaps in the traditional data warehouses without mandating the replacement of a well-established existing system with a new one.
j) The platform should offer the ability to perform ad hoc analysis on the data stored, support regular and scheduled reporting, and allow extraction into traditional tools used for analysis.
k) The platform will guarantee good performance at scale, and ensure high availability and reliability for both the data stored on the platform and the processing of that data.
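Items (e), (f) and (g) in the wish list together hint at some kind of catalog sitting on top of the stored data. As a rough sketch only, here is what a tiny in-memory version could look like in Python. The `DatasetEntry` and `Catalog` classes, their method names, and the sample data sets are all hypothetical, invented for illustration; the 30-day default simply mirrors the "at least a month" retention mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


@dataclass
class DatasetEntry:
    """Hypothetical catalog record describing one stored data set."""
    name: str
    description: str
    last_loaded: datetime
    retention_days: int = 30          # configurable retention, item (e)
    columns: Tuple[str, ...] = ()


class Catalog:
    """A toy, in-memory catalog; a real one would be persistent."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Discovery, item (f): match a keyword on name or description."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        )

    def expires_on(self, name: str) -> datetime:
        """Date until which the most recent load is retained."""
        e = self._entries[name]
        return e.last_loaded + timedelta(days=e.retention_days)

    def linkable(self, left: str, right: str) -> List[str]:
        """Linking, item (g): shared columns are candidate join keys."""
        shared = set(self._entries[left].columns) & set(self._entries[right].columns)
        return sorted(shared)


catalog = Catalog()
catalog.register(DatasetEntry(
    "web_clicks", "Raw clickstream events",
    datetime(2024, 1, 10), 30, ("user_id", "session_id")))
catalog.register(DatasetEntry(
    "orders", "Order transactions from the billing system",
    datetime(2024, 1, 12), 90, ("user_id", "order_id")))

print(catalog.search("click"))
print(catalog.linkable("web_clicks", "orders"))
print(catalog.expires_on("web_clicks"))
```

Matching on shared column names is a deliberately naive way to suggest joins; it stands in for whatever metadata a real platform would use to link data sets.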
Can such a platform be built, and if so, can it be built quickly and in a cost-effective way?
At this juncture, I do not have the answer, and I am in the process of finding out. But from the initial data points I have been able to gather, it looks quite feasible. During my research, I have been able to find various concepts, systems, tools and modules that address one or another of the features I have listed above. What is more gratifying is that many of these tools and systems are open source, with a decent level of support infrastructure. At this point, however, I have no idea whether all such tools can be stitched together to create the Analytics Platform. This is what I intend to find out, and I will keep posting as my research progresses.