
Overview of Hadoop Ecosystem

Of late, I have been looking into the Big Data space and Hadoop in particular. When I started looking into it, I found that there are many products and tools related to Hadoop. In this post, I summarize my discoveries about the Hadoop ecosystem.

Hadoop Ecosystem

A short overview of each is given below:

Data Collection - The primary objective of these tools is to move data into a Hadoop cluster.

Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It was developed by Cloudera and is currently being incubated at the Apache Software Foundation. Details can be found here.

Scribe: Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures. It was developed by Facebook and can be found here.

Chukwa: Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and MapReduce framework and inherits Hadoop's scalability and robustness. Details about Chukwa can be found here.

Kafka: Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale website. Kafka is being incubated at the Apache Software Foundation, and details about Kafka can be found here.
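
As a rough illustration of the publish side, here is a minimal sketch using Kafka's Java producer client. Note this uses the modern org.apache.kafka.clients API rather than the incubation-era client, and the broker address, topic name, and event payload are all placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityEventProducer {
    public static void main(String[] args) {
        // Minimal producer configuration; the broker address is a placeholder.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one activity event to a hypothetical "page-views" topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /home"));
        }
    }
}
```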

Sqoop - Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from an external system onto HDFS and populate tables in Hive and HBase. Cloudera is the creator of Sqoop, and Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on this project can be found here.

HIHO - HIHO aims to integrate Hadoop with existing data-centric systems. There is a need to connect Hadoop to different systems, such as databases and report display tools, to fully leverage the functionality offered by each; the HIHO project attempts to do so. Details on HIHO can be found here.

Core Engine - These form the core engine for storage and processing.

HDFS - HDFS (Hadoop Distributed File System) is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. HDFS is part of the core Hadoop framework and is a subproject of the Apache Hadoop project. Additional information pertaining to HDFS can be found here.
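
To give a feel for how applications talk to HDFS programmatically, here is a minimal sketch using Hadoop's Java FileSystem API to copy a local file into the cluster. The paths are placeholders, and the NameNode address is assumed to come from the standard core-site.xml on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; both paths are placeholders.
        fs.copyFromLocalFile(new Path("/tmp/events.log"),
                             new Path("/user/demo/events.log"));
        fs.close();
    }
}
```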

MapReduce - Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. A good starting point to understand MapReduce would be here.
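
The canonical example is word count: the map phase emits a (word, 1) pair for every token, and the reduce phase sums the counts per word. A condensed sketch using the Hadoop Java API, with input and output paths passed as command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```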

Data Storage - Specialized storage on Hadoop.

HBase - HBase is the Hadoop database. We can think of it as a distributed, scalable, big data store. The most common usage scenario for HBase is when you need random, realtime read/write access to your Big Data. HBase is a subproject of the Apache Hadoop project, and information on HBase can be found here.
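
As a small illustration of that random read/write access, here is a sketch using the HBase Java client (the newer Connection/Table style API). It assumes a table named "metrics" with a column family "d" already exists; both names are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // The "metrics" table with a "d" column family is assumed to exist.
             Table table = connection.getTable(TableName.valueOf("metrics"))) {

            // Random write: one cell addressed by row key, column family, and qualifier.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes("42"));
            table.put(put);

            // Random read of the same cell.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```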

Support Extensions - These help in extending and supporting core activities on Hadoop

Avro - Avro is an addition to the Apache family of products and addresses the area of data serialization. Apache Avro is a data serialization system that provides rich data structures; a compact, fast, binary data format; a container file to store persistent data; remote procedure call (RPC); and integration with dynamic languages. Information on Avro can be found here.
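
A brief sketch of the idea in Java, assuming a made-up two-field record schema: Avro schemas are plain JSON, and the container file written below carries the schema alongside the data:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // An illustrative schema defined inline; Avro schemas are plain JSON.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Build a record against the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Write it to an Avro container file; the schema travels with the data.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}
```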

Zookeeper - ZooKeeper is part of the Apache family of products and is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications like Hadoop. Details about ZooKeeper can be found here.
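
A minimal sketch of the configuration-store use case with the ZooKeeper Java client; the ensemble address, znode path, and config value are placeholders, and error handling and watches are omitted:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (placeholder) ensemble with a 3-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Store a small piece of shared configuration under a znode...
        zk.create("/app-config", "batch.size=100".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // ...and read it back (no watch set, no Stat requested).
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```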

Oozie - Oozie is a workflow/coordination system to manage Apache Hadoop jobs. Oozie is being incubated at the Apache Software Foundation, and details regarding Oozie can be found here.

Thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages. Thrift also provides a set of APIs to connect with HBase. Details on Thrift can be found here.

Scripting Languages - These provide easy ways to facilitate processing on Hadoop.

Jaql - Jaql is a high-level scripting language for the JavaScript Object Notation (JSON). It is able to run on Hadoop and breaks most requests down into Map/Reduce tasks. Jaql borrows heavily from SQL, XQuery, Lisp, Pig Latin, JavaScript, and Unix pipes. It was developed primarily inside IBM and is part of the BigInsights project. A good starting point for Jaql is here.

Pig - Pig is a high-level scripting language for data transformation. It is a Hadoop subproject and can be looked at as a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating these programs. Details about Pig can be found here.

Hive - Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive is a subproject of the Apache Hadoop project, and information on Hive can be found here.
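
Because Hive speaks a SQL dialect (HiveQL), one common way to query it from Java is through the HiveServer2 JDBC driver. A hedged sketch, assuming a running HiveServer2 endpoint, a hypothetical web_logs table, and the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, credentials, and table are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement();
             // Hive compiles the query down to one or more MapReduce jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```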

Data Analytics - These are specialized software packages that perform analytics on data stored on Hadoop.

Intellicus: Intellicus, with a variety of powerful features, extends its ad hoc reporting and dashboarding solution to support Big Data. Intellicus provides data source connectivity through Hive to variants of Hadoop such as Apache and Cloudera. It provides both background batch processing and online query processing for ad hoc reporting using HiveQL. Intellicus has a basic free edition but also has feature-rich paid editions. The website for Intellicus is https://www.intellicus.com.

Karmasphere - Karmasphere provides self-service access to Big Data and analytic functions for faster, more efficient analysis and collaboration. It lets you visually explore data for patterns and trends and iteratively analyze data using familiar queries and skills. Look up details of Karmasphere at https://karmasphere.com.

Monitoring and Management - These help to monitor and manage the jobs and data on Hadoop.

Ganglia: Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. Ganglia is a BSD-licensed open-source project that grew out of the University of California, Berkeley Millennium Project. The details of Ganglia can be found here.

Hue: Hue is both a web UI for Hadoop and a framework to create interactive web applications. It features a FileBrowser for accessing HDFS, JobSub and JobBrowser applications for submitting and viewing MapReduce jobs, and a Beeswax application for interacting with Hive. Hue grew out of Cloudera, and details pertaining to Hue can be found here.


Cacti: Cacti is a complete network graphing solution designed to harness the power of RRDTool's data storage and graphing functionality. Cacti provides a fast poller, advanced graph templating, multiple data acquisition methods, and user management features out of the box. There is a good set of Hadoop Cacti templates that can be used to visualize the data exposed by Hadoop over JMX. Look for details on Cacti here.

Hadoop High-Level Applications - These are applications/frameworks which leverage Hadoop for specialized purposes.

Search Related - These are software packages which leverage Hadoop for web crawling and indexing.

Nutch: Apache Nutch is an open-source web-search software project of the Apache Software Foundation. Nutch is a web crawler written in Java; using it, we can find web page hyperlinks in an automated manner. Nutch can run on Hadoop clusters, and a good source to begin reading on this would be here.

Solr: Solr is the popular, blazing-fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr integrates easily with Nutch: Nutch crawls the web, and Solr indexes the crawled data. Details of Solr can be found here.


Machine Learning - Specialized software that implements machine learning algorithms over Hadoop.

Mahout - Mahout is a subproject in the Apache family of products on Hadoop. The Apache Mahout library's goal is to build scalable machine learning libraries. Mahout's core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. A starting point for information on Mahout would be here.
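
Alongside the distributed Hadoop jobs, Mahout 0.x also ships an embeddable "Taste" collaborative filtering API that is handy for getting a feel for the library. A small sketch of a user-based recommender, assuming a placeholder ratings.csv file of userID,itemID,preference lines:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv (a placeholder) holds lines of the form userID,itemID,preference.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}
```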


Comments

Hi Girish,

I am working with Intellicus, one of the products that you referred to in your blog. I would like to share the progress made by Intellicus in the Big Data space during the last year. The latest release of Intellicus provides data source connectivity to Hive, Pig, HBase, custom MapReduce jobs, HDFS, AWS, and S3. One can also invoke an R server from Intellicus and create interactive charts in Intellicus from the response received from the R server. Intellicus also supports ad hoc reporting over Greenplum, Vertica, Teradata, GridSQL, and Sand.

Intellicus also provides a unique solution, MOLAP on Hadoop, which takes the execution of the cube to Hadoop rather than bringing data out of Hadoop.
