Skip to main content

Big Data: Understanding CAP Theorem.


In theoretical computer science, the CAP Theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  • Consistency (C)
  • Availability (A)
  • Partition Tolerance (P)

According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all the three.

(Reference: Wikipedia)

Relevance and Importance:

It has been over twelve years since, Eric Brewer, then a scientist at University of California Berkeley, made the conjuncture which led to what we now universally acknowledge as CAP Theorem.  But over these years, CAP theorem has changed the rules and proved to be one of the significant seeds on determining how a highly scalable and distributed computing platform can be built.  Over these twelve years, this theorem has ended up as one of the primary read for anyone who is involved in building a distributed system.

The importance of CAP theorem is realized when the applications scale.  For example, at low volume,  delays in the transaction completion to ensure consistency is acceptable, but when the transaction volume increases, the trade-offs on latency to ensure consistency can have a significant impact on availability of the services and on the business as a whole.

Understanding and Implication:

To understand what this is all about, let us try to get a grasp on what the three attributes on which the theorem delves on in the world of distributed systems.

Consistency - This  implies that all the users of the system, from where ever they are connecting from and to which our node of a distributed system they are connecting to, will get the same result for a particular query. It further implies that whenever one of the users updates a value in one of the nodes, all the other users of the system spread across the distributed nodes, will see the updated value.

Easiest way to understand this concept is to look at a real life example as an analogy.   Let us say, that you are using the ATM to withdraw money from your bank account.  Technically speaking, as soon as you have completed the withdrawal, your account will be updated and the updated information will be available to anyone who can check your account status, irrespective of whether they do it and from which part of the world they do it.

Availability:   This implies that the system would ensure that the operations will continue even when some parts of the distributed systems are down either due to a planned maintenance or because of an error. The services to the users of the system, will not be impacted even when some parts of the system is down. Just want to stress the point, Availability is just not a protection against hardware failures, but also accounts for the software and ensures that factors like load balancing, load distribution, etc are accounted for.

Extending the previous analogy, you would expect that you would always be able to withdraw money from the ATM and  need not be bothered whether all the backend systems connected to ATM is up and working full gear.  You would regard the ATM service to be highly available, if and only if, that it is able to dispense cash whenever you want it.

Partition Tolerance:   Invariably distributed systems are connected over the network.  One of the ways to improve efficiency and utilization is to distribute the load across the nodes of the network.  By partition tolerance, it is implied that, even when few of these nodes (or cluster of them) is down; the system is able to operate well.

Extending the ATM analogy further, let us model a case.  Of the many models available for distributed system load management, one model that is widely used is Sharding. At a very broad level, the model proposes to group a set of entities together. In this example,  let us say, all the Users of ATM from Bangalore are located on a particular node.  When everything is stable,  irrespective of the point of access, the users identified to be original belonging to Bangalore will be routed the Bangalore node.  But partition tolerance is broken, when the link to the Bangalore node is broken, and because of this, a set of users (Bangalore based users) are not able to avail the ATM services.

What the CAP Theorem postulates is -  while building a distributed system, two of the above three can be guaranteed, but it would not be possible to guarantee all the three attributes to the same level of reliability. The implication is that, it would be necessary to drop one of the three attributes, while building a highly scalable distributed system.  

Update: Brewer did put up a new post on clarifying the 2 out of 3 rule that seem to be an inherent inference out of CAP Theorem and on strategies to handle partitions while ensuring Consistency and Availability.  The note can be found here



  1. Security Intelligence Solution provides one-click access to a comprehensive forensic trail and analytics in the same solution to simplify and accelerate threat discovery and incident investigation. To know more, visit Hadoop Training Bangalore

  2. Very good information. we need learn from real time examples and for this we choose good training institute, who were interested to know about Big Data which is quite interesting. We need a good training institute for my learning .. so people making use of the free demo classes.
    Many training institute provides free demo classes. One of the best training institute in Bangalore is Apponix Technologies.


Post a Comment

Popular posts from this blog

Dilbert on Agile Programing

Dilbert on Agile and Extreme Programming -  Picked up from - Scott Adams.

Big Data: Why Traditional Data warehouses fail?

Over the years, have been involved with few of the data warehousing efforts.   As a concept, I believe that having a functional and active data  ware house is essential for an organization. Data warehouses facilitate easy analysis and help analysts in gathering insights about the business.   But my practical experiences suggest that the reality is far from the expectations. Many of the data warehousing initiatives end up as a high  cost, long gestation projects with questionable end results.   I have spoken to few of my associates who are involved in the area and it appears that  quite a few of them share my view. When I query the users and intended users of the data warehouses, I hear issues like: The system is inflexible and is not able to quickly adapt to changing business needs.  By the time, the changes get implemented on the system, the analytical need for which the changes were introduced is no longer relevant. The implementors of the datawarehouse are always look

Overview of Hadoop Ecosystem

Of late, have been looking into the Big Data space and Hadoop in particular.  When I started looking into it, found that there are so many products and tools related to Haddop.   Using this post summarize my discovery about Hadoop Ecosystem. Hadoop Ecosystem A small overview on each is listed below: Data Collection  - Primary objective of these is to move data into a Hadoop cluster Flume : - Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Developed by cloudera and currently being incubated at Apache software foundaton. The details about the same can be found here . Scribe : Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. Dveloped by Facebook and can be found here .  Chuckwa : Chukwa is a Hadoop subproject dev