Big Data: Difference between revisions

Revision as of 20:30, 15 September 2017

https://www.infoworld.com/article/2606657/open-source-software/119424-Bossie-Awards-2013-The-best-open-source-big-data-tools.html

Apache Bigtop (packaging, testing, and configuration project for the Hadoop ecosystem; bundles related software so that the components work together with Hadoop)

Apache Hadoop (cluster distributed-data framework, distributes data among nodes in a cluster, useful for data-intensive computing)

Apache Spark (cluster computing framework, performs computations on data; separate layer from Hadoop that can sit on top of Hadoop or use some other cluster distributed-data framework; fast because it is designed to read data from the cluster, perform operations while keeping intermediate results in memory, and write results to the cluster, all in one pass)

Apache Hadoop MapReduce (Hadoop's batch processing engine; similar to Spark, but operates differently: reads data from the cluster, performs an operation, writes results to the cluster, reads the updated data from the cluster, performs the next operation, writes the next results to the cluster, etc.)

Apache Pig

Apache Hive

Apache HBase

Apache Mahout (general machine learning library, like R but for big data sets; its set of ML algorithms is not comprehensive; check Apache Spark MLlib for algorithms not implemented by Mahout)

Cassandra (distributed NoSQL database)

Apache TinkerPop/Gremlin (Apache TinkerPop and Gremlin are to graph databases what JDBC and SQL are to relational databases)

MongoDB (NoSQL document-based database)

Apache CloudStack (software that enables management/deployment of large numbers of nodes or virtual machines; basically, this is the back-end software used to run a cloud service provider)

Apache Sqoop

Talend

Apache Hama

Cloudera Impala

Apache Drill

Gephi

Neo4j

Couchbase

Paradigm4 SciDB
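
The Spark vs. MapReduce notes above can be illustrated with a rough sketch in plain Python. It uses neither Spark nor Hadoop: the function names mapreduce_style and spark_style, the JSON files standing in for cluster storage, and the toy word-count job are all assumptions made for illustration. It only mimics the access pattern described in the notes (write intermediate results out and read them back between steps, versus keeping them in memory for a single pass).

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def mapreduce_style(lines, workdir):
    """Two chained steps; each step persists its output before the next reads it."""
    workdir = Path(workdir)

    # Step 1: count words, then write the counts to "cluster storage" (a file here).
    counts = Counter(word for line in lines for word in line.split())
    stage1 = workdir / "stage1.json"
    stage1.write_text(json.dumps(counts))

    # Step 2: read the stored counts back, keep words seen more than once, write again.
    loaded = json.loads(stage1.read_text())
    frequent = {word: n for word, n in loaded.items() if n > 1}
    stage2 = workdir / "stage2.json"
    stage2.write_text(json.dumps(frequent))

    # The final result is read back from storage as well.
    return json.loads(stage2.read_text())


def spark_style(lines):
    """Same two steps, but the intermediate counts never leave memory."""
    counts = Counter(word for line in lines for word in line.split())
    return {word: n for word, n in counts.items() if n > 1}


if __name__ == "__main__":
    data = ["big data tools", "big graph tools", "graph database"]
    with tempfile.TemporaryDirectory() as tmp:
        print(mapreduce_style(data, tmp))
    print(spark_style(data))
```

Both functions return the same result; the difference is only in how many times the data goes to storage and back, which is the cost the notes attribute to MapReduce.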

@@ Line 1: / Line 1: @@
 https://www.infoworld.com/article/2606657/open-source-software/119424-Bossie-Awards-2013-The-best-open-source-big-data-tools.html
-Apache Hadoop
+Apache Bigtop (packaging, testing, and configuration project for the Hadoop ecosystem; bundles related software so that the components work together with Hadoop)
+Apache Hadoop (cluster distributed-data framework, distributes data among nodes in a cluster, useful for data-intensive computing)
+Apache Spark (cluster computing framework, performs computations on data; separate layer from Hadoop that can sit on top of Hadoop or use some other cluster distributed-data framework; fast because it is designed to read data from the cluster, perform operations while keeping intermediate results in memory, and write results to the cluster, all in one pass)
+Apache Hadoop MapReduce (Hadoop's batch processing engine; similar to Spark, but operates differently: reads data from the cluster, performs an operation, writes results to the cluster, reads the updated data from the cluster, performs the next operation, writes the next results to the cluster, etc.)
+Apache Pig
+Apache Hive
+Apache HBase
+Apache Mahout (general machine learning library, like R but for big data sets; its set of ML algorithms is not comprehensive; check Apache Spark MLlib for algorithms not implemented by Mahout)
+Cassandra (distributed NoSQL database)
+Apache TinkerPop/Gremlin (Apache TinkerPop and Gremlin are to graph databases what JDBC and SQL are to relational databases)
+MongoDB (NoSQL document-based database)
+Apache CloudStack (software that enables management/deployment of large numbers of nodes or virtual machines; basically, this is the back-end software used to run a cloud service provider)
 Apache Sqoop
@@ Line 16: / Line 38: @@
 Neo4j
-MongoDB
 Couchbase
 Paradigm4 SciDB
-Cassandra
-Apache Spark
-Apache TinkerPop/Gremlin (Apache TinkerPop and Gremlin are to graph databases what JDBC and SQL are to relational databases)
-Apache CloudStack (software that enables management/deployment of large numbers of nodes or virtual machines; basically, this is the back-end software used to run a cloud service provider)

From charlesreid1