
Posts

Showing posts with the label BigData

RDBMS to NoSQL - A story about (r)Evolution in Databases

Over the last few days, a few of my friends have been asking me about NoSQL and its relevance. I have been casually trying to answer this, but have always felt that the subject needs to be elaborated. Through the slide deck, I have tried to provide a perspective on this subject: RDBMS to NoSQL. An overview (slides by Girish Raghavan). Hope it helps. P.S: Some of the slides ended up looking very busy. My apologies for that. This is a result of trying to balance content (without sufficient audio explanations) against the size of the deck.

Eight Fallacies of Distributed Computing.

By Peter Deutsch and James Gosling. Essentially everyone, when they first build a distributed application, makes the following eight assumptions. All prove to be false in the long run, and all cause big trouble and painful learning experiences:

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

There is a great article by Arnon Rotem-Gal-Oz explaining these in detail. Read it if you are interested. (Ref: http://nighthacks.com/roller/jag/resource/Fallacies.html )
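To make the first fallacy concrete, here is a minimal sketch of the defensive pattern it forces on us: retrying a remote call with exponential backoff. The function and the simulated flaky call are hypothetical names for illustration, not part of any real library.

```python
import time

def call_with_retries(op, attempts=3, base_delay=0.01):
    """Retry an unreliable operation with exponential backoff.

    Guards against fallacy #1 ("the network is reliable"): any remote
    call can fail transiently, so callers must plan for failure.
    """
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(base_delay * (2 ** attempt))  # back off, then retry

# A simulated "remote call" that fails the first two times.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network failure")
    return "payload"

print(call_with_retries(flaky_fetch))  # → payload (succeeds on attempt 3)
```

Real clients layer more on top of this (jitter, timeouts, circuit breakers), but even this toy version shows that code which assumes a reliable network has no place to put such logic.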

Big Data: Understanding CAP Theorem.

Definition: In theoretical computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency (C), Availability (A), and Partition tolerance (P). According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three. (Reference: Wikipedia) Relevance and Importance: It has been over twelve years since Eric Brewer, then a scientist at the University of California, Berkeley, made the conjecture that led to what we now universally acknowledge as the CAP theorem. Over these years, the CAP theorem has changed the rules and proved to be one of the significant seeds in determining how a highly scalable and distributed computing platform can be built. Over these twelve years, this theorem has ended up as a primary read for anyone who is involve...
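The trade-off is easiest to see in a toy model. The sketch below, under purely illustrative assumptions (a single key-value replica cut off from its peer; no real database's API), shows the two choices a system has once a partition occurs: a "CP" replica refuses requests it cannot confirm, while an "AP" replica keeps serving possibly stale data.

```python
class Replica:
    """Toy key-value replica illustrating the CAP trade-off.

    During a network partition, the replica must choose:
    - "CP": refuse reads/writes it cannot confirm with its peer
      (consistency over availability), or
    - "AP": keep serving, accepting that data may be stale
      (availability over consistency).
    """
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.data = {}
        self.partitioned = False  # True while the network is split

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm replication")
        self.data[key] = value    # AP: accept now, reconcile later
        return "ok"

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: value may be stale")
        return self.data.get(key)  # AP: may return stale data

cp, ap = Replica("CP"), Replica("AP")
for r in (cp, ap):
    r.write("x", 1)
    r.partitioned = True  # the partition happens; P is not optional

print(ap.read("x"))       # AP stays available, possibly stale
try:
    cp.read("x")
except RuntimeError as e:
    print(e)              # CP gives up availability instead
```

Since partitions cannot be prevented in a real network, the practical reading of the theorem is that designers choose between C and A only for the duration of a partition, which is exactly the design axis along which most NoSQL systems differ.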

Big Data: A case for building an Analytics Platform.

In my previous post on this subject, I talked about a few of the common problems that plague traditional data warehousing initiatives. A few of my friends, after reading through the article, asked me whether I am questioning the relevance of traditional data warehousing. The answer to that question is a resounding "No". Data warehousing does offer significant benefits to the business, but my belief is that the practitioners of the data warehousing infrastructure need to evolve beyond the traditional models and build platforms that leverage the benefits of the traditional warehouse while complementing it with system(s) that provide quick, agile and adaptive solutions to meet the needs of the business. In this post, I am making a proposal for one such system, and I am calling it an "Analytics Platform". What is an Analytics Platform? I would define an Analytics Platform as an engine which is built with a underly...

Big Data: Why Traditional Data warehouses fail?

Over the years, I have been involved with a few data warehousing efforts. As a concept, I believe that having a functional and active data warehouse is essential for an organization. Data warehouses facilitate easy analysis and help analysts in gathering insights about the business. But my practical experience suggests that the reality is far from the expectations. Many data warehousing initiatives end up as high-cost, long-gestation projects with questionable end results. I have spoken to a few of my associates who are involved in the area, and it appears that quite a few of them share my view. When I query the users and intended users of the data warehouses, I hear issues like: The system is inflexible and is not able to quickly adapt to changing business needs. By the time the changes get implemented on the system, the analytical need for which the changes were introduced is no longer relevant. The implementors of...

Overview of Hadoop Ecosystem

Of late, I have been looking into the Big Data space and Hadoop in particular. When I started looking into it, I found that there are many products and tools related to Hadoop. In this post, I summarize my discoveries about the Hadoop ecosystem. Hadoop Ecosystem A small overview of each is listed below: Data Collection - The primary objective of these tools is to move data into a Hadoop cluster. Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Developed by Cloudera and currently being incubated at the Apache Software Foundation. The details about the same can be found here. Scribe: Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. Developed by Facebook and can be found here. Chukwa: Chuk...