Sunday, June 26, 2016

Frameworks

Accumulo Apache Accumulo is an open-source distributed NoSQL database based on Google's BigTable. It is used to efficiently perform CRUD (Create Read Update Delete) operations on extremely large data sets (often referred to as Big Data). Accumulo is preferred over other similar distributed databases (such as HBase or CouchDB) if a project requires fine-grained security in the form of cell-level access control.
Amazon SQS Amazon Simple Queue Service (Amazon SQS) is a distributed queue messaging service introduced by Amazon.com in late 2004. It supports programmatic sending of messages via web service applications as a way to communicate over the Internet.
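As a sketch of that programmatic access, the following assumes the AWS SDK for Java on the classpath and credentials available in the environment; the queue name and message body are invented for the example:

    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;

    public class SqsDemo {
        public static void main(String[] args) {
            // Picks up credentials and region from the environment
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            String queueUrl = sqs.createQueue("demo-queue").getQueueUrl();

            sqs.sendMessage(queueUrl, "order-1234 placed");

            // Consumers poll the queue and delete messages to acknowledge them
            for (Message m : sqs.receiveMessage(queueUrl).getMessages()) {
                System.out.println(m.getBody());
                sqs.deleteMessage(queueUrl, m.getReceiptHandle());
            }
        }
    }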
Ambari Web Interface, Provisioning/Managing/Monitoring Hadoop Clusters
Avro Data serialization/Remote Procedure Call, encoding
Azkaban Workflow engine
BigTop packaging/testing
Camel Apache Camel is a rule-based routing and mediation engine that provides a Java object-based implementation of the Enterprise Integration Patterns using an API (or declarative Java Domain Specific Language) to configure routing and mediation rules. The domain-specific language means that Apache Camel can support type-safe smart completion of routing rules in an integrated development environment using regular Java code without large amounts of XML configuration files, though XML configuration inside Spring is also supported.
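As an illustration of that Java DSL, here is a minimal sketch of a route; the directory names and the file-name filter are assumptions chosen for the example:

    import org.apache.camel.CamelContext;
    import org.apache.camel.builder.RouteBuilder;
    import org.apache.camel.impl.DefaultCamelContext;

    public class FileMoveRoute {
        public static void main(String[] args) throws Exception {
            CamelContext context = new DefaultCamelContext();
            context.addRoutes(new RouteBuilder() {
                @Override
                public void configure() {
                    // Poll data/inbox and route XML files on to data/outbox
                    from("file:data/inbox")
                        .filter(header("CamelFileName").endsWith(".xml"))
                        .to("file:data/outbox");
                }
            });
            context.start();
            Thread.sleep(10000); // let the route poll for a while
            context.stop();
        }
    }

Note how the routing rule is ordinary, type-safe Java code; the same route could equally be declared in Spring XML.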
Cascading Cascading is a software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs.
Cassandra Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Chef Chef is a configuration management tool written in Ruby and Erlang (and also the name of the company behind it). It uses a pure-Ruby domain-specific language (DSL) for writing system configuration "recipes". Chef is used to streamline the task of configuring and maintaining a company's servers, and can integrate with cloud-based platforms such as Rackspace, Internap, Amazon EC2, Google Cloud Platform, OpenStack, SoftLayer, and Microsoft Azure to automatically provision and configure new machines. Chef contains solutions for both small and large scale systems, with features and pricing for the respective ranges.
CouchDB CouchDB is a database that completely embraces the web. Data is stored as JSON documents, which can be accessed and queried over HTTP, even directly from a web browser, and indexed, combined, and transformed with JavaScript. CouchDB works well with modern web and mobile apps, can even serve web apps directly, and can distribute data or apps efficiently using its incremental replication. CouchDB supports master-master setups with automatic conflict detection.
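Since everything goes over HTTP, a plain Java HTTP client is enough to store a document; this sketch assumes a local CouchDB on the default port and a database named books that has already been created:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CouchPut {
        public static void main(String[] args) throws Exception {
            // PUT /<db>/<docid> creates or updates a JSON document
            URL url = new URL("http://localhost:5984/books/doc-1");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("PUT");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "application/json");
            try (OutputStream out = con.getOutputStream()) {
                out.write("{\"title\":\"Dune\",\"year\":1965}".getBytes("UTF-8"));
            }
            System.out.println("HTTP " + con.getResponseCode()); // 201 on success
        }
    }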
Drill/Impala SQL-like, no MapReduce conversion, real-time querying of data in HDFS/HBase; Drill - DrQL/MongoDB Query Language
DynamoDB Amazon DynamoDB is a fully managed proprietary NoSQL database service that is offered by Amazon.com as part of the Amazon Web Services portfolio. DynamoDB exposes a similar data model and derives its name from Dynamo, but has a different underlying implementation.
Elasticsearch Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
EMR Amazon Elastic MapReduce (EMR) is a web service that uses Hadoop, an open-source framework, to quickly & cost-effectively process vast amounts of data.
Flink Apache Flink is an open source platform for distributed stream and batch data processing.
Flume/Chukwa Data Aggregation Tool, distributed + parallel
Hadoop HDFS + MapReduce, Repository, sequential data access, structured/unstructured, stores data as flat files, offline batch processing
HBase NoSQL, column-oriented datastore, random real-time read/write access, stores data as key/value pairs, low latency, distributed, multi-dimensional, sorted-map, scalable, transactional processing - insert/update/delete, flexible schema, good with sparse tables, structured/semi-structured data, immediate consistency, accessible through Java/Thrift/REST APIs, data compression, automatic sharding, row key/column family/column/timestamp, key components are Zookeeper/HMaster/Region Server, indexed by rowkey
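A minimal sketch of that key/value access through the Java API, assuming an HBase 1.x client on the classpath and a table named users with a column family info (both invented for the example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell: row key + column family + column + value
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                              Bytes.toBytes("a@example.com"));
                table.put(put);
                // Random read by row key
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
                System.out.println(Bytes.toString(email));
            }
        }
    }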
Hcatalog Apache™ HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.
Hive Declarative SQLish language, Flat relational data model, Schema is required, Data Warehouse, integrated to HDFS/Hbase tables, analytical tool, accessible through HiveQL/Compiler, Used by data analysts for BI reporting, supports partitions(sharding), thrift server, eventual consistency, data structures - tables/partitions/buckets(HDFS directories), Supports only structured data, batch processing, fixed schema, bitmap index
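HiveQL can be issued from Java through the HiveServer2 JDBC driver; a minimal sketch, where the host, credentials, and the weblogs table are assumptions for the example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = conn.createStatement();
            // A typical BI-style aggregation, compiled to batch jobs under the hood
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
            conn.close();
        }
    }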
HSQLDB (Hyper SQL Database) HSQLDB is the default metastore for Sqoop. It is a relational database management system written in Java. It has a JDBC driver and supports a large subset of the SQL-92 and SQL:2008 standards. It offers a fast, small (around 1300 kilobytes in version 2.2) database engine with both in-memory and disk-based tables. Both embedded and server modes are available.
HttpFS HttpFS is one of several tools available to interact with HDFS over HTTP. A single node acts as a "gateway" and is the single point of data transfer to the client node. HttpFS can therefore be choked during a large file transfer, but the advantage is that it minimizes the footprint required to access HDFS.
Hue User Interface framework and SDK for visual Hadoop applications. Hue is an open-source Web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license.
Impala Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.
Kafka Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. It is fast: a single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
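A minimal sketch of publishing to a topic with the Java producer client; the broker address, topic name, key, and message are assumptions chosen for the example:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LogPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Append a keyed record to the "logs" topic (the commit log)
                producer.send(new ProducerRecord<>("logs", "host1", "disk usage at 83%"));
            }
        }
    }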
Kinesis Amazon Kinesis is a platform for streaming data on AWS, offering powerful services to make it easy to load and analyze streaming data, and also providing the ability for you to build custom streaming data applications for specialized needs. Web applications, mobile devices, wearables, industrial sensors, and many software applications and services can generate staggering amounts of streaming data – sometimes TBs per hour – that need to be collected, stored, and processed continuously. Amazon Kinesis services enable you to do that simply and at a low cost.
Mahout Data Mining Library, clustering/regression testing/statistical modeling
MapReduce Supports Structured/Semi-structured/Unstructured data, can be executed in stand-alone/pseudo distributed/fully-distributed mode
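The canonical illustration of the model is word count: the map phase emits (word, 1) pairs and the reduce phase sums them. A sketch against the Hadoop 2.x Java API, with input and output paths taken from the command line:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit (word, 1) for every token in the input line
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same jar runs unchanged in stand-alone, pseudo-distributed, or fully-distributed mode; only the cluster configuration differs.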
MarkLogic MarkLogic is a multi-model NoSQL database that has evolved from its XML database roots to also natively store JSON documents and RDF triples, the data model for semantics. In addition to having a flexible data model, MarkLogic uses a distributed, scale-out architecture that can handle hundreds of billions of documents and hundreds of Terabytes of data. Unlike other NoSQL databases, MarkLogic maintains ACID consistency for transactions, and has focused on building enterprise features into every release, including a robust security model certified according to the Common Criteria, and enterprise-grade high availability and disaster recovery. MarkLogic is designed to run on-premise or in the cloud on Amazon Web Services.
Maven Maven is a project management and comprehension tool. Maven provides developers with a complete build lifecycle framework. Development teams can automate a project's build infrastructure in almost no time, as Maven uses a standard directory layout and a default build lifecycle.
Mesos Apache Mesos is an open-source cluster manager that was developed at the University of California, Berkeley. It "provides efficient resource isolation and sharing across distributed applications, or frameworks". The software enables resource sharing in a fine-grained manner, improving cluster utilization.
MongoDB MongoDB (from humongous) is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software.
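A minimal sketch of those dynamic-schema documents using the MongoDB Java driver (3.x); the database name, collection, and document fields are invented for the example:

    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;
    import static com.mongodb.client.model.Filters.eq;

    public class MongoDemo {
        public static void main(String[] args) {
            MongoClient client = new MongoClient("localhost", 27017);
            MongoDatabase db = client.getDatabase("shop");
            MongoCollection<Document> orders = db.getCollection("orders");
            // Documents have dynamic schemas: no table definition needed
            orders.insertOne(new Document("customer", "alice").append("total", 42.50));
            Document found = orders.find(eq("customer", "alice")).first();
            System.out.println(found.toJson());
            client.close();
        }
    }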
Oozie Workflow engine (to run Directed Acyclic Graphs [DAGs])/coordinator (scheduler)/bundle (to group coordinators), written in hPDL (Hadoop Process Definition Language), MapReduce + Pig + Hive + Sqoop imports + Java programs
Phoenix Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
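Because Phoenix ships as a JDBC driver, using it looks like plain JDBC; this sketch assumes a ZooKeeper quorum on localhost and invents a metrics table for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixDemo {
        public static void main(String[] args) throws Exception {
            // The JDBC URL points at the ZooKeeper quorum of the HBase cluster
            Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
            Statement stmt = conn.createStatement();
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS metrics " +
                               "(host VARCHAR NOT NULL PRIMARY KEY, cpu DECIMAL)");
            stmt.executeUpdate("UPSERT INTO metrics VALUES ('host1', 0.85)");
            conn.commit(); // Phoenix buffers mutations until commit
            ResultSet rs = stmt.executeQuery("SELECT host, cpu FROM metrics WHERE cpu > 0.5");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
            }
            conn.close();
        }
    }

Phoenix uses UPSERT rather than INSERT because writes map onto HBase puts, which overwrite by design.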
Pig Procedural DataFlow Language, Nested relational data model, Schema is optional, accessible through PigLatin/PigInterpreter, used by programmers/researchers for ETL process, data pipelining, atom/tuple/bag/map, No Partitions(sharding), No Thrift, Supports Structured/Unstructured data, Suitable for ad-hoc processing, can be used as script/grunt/embedded, can be executed in local/hdfs mode, batch processing
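The embedded mode mentioned above runs Pig Latin from Java through PigServer; a minimal sketch in local mode, where access.log and its two-column layout are assumptions for the example:

    import java.util.Iterator;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class PigEmbedded {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.LOCAL); // or MAPREDUCE on a cluster
            // Pig Latin: a dataflow of LOAD -> GROUP -> FOREACH
            pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') " +
                              "AS (ip:chararray, url:chararray);");
            pig.registerQuery("by_ip = GROUP logs BY ip;");
            pig.registerQuery("hits = FOREACH by_ip GENERATE group, COUNT(logs);");
            Iterator<Tuple> it = pig.openIterator("hits");
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        }
    }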
R Connectors Statistics
Redshift Amazon Redshift is a hosted data warehouse product, which is part of the larger cloud computing platform Amazon Web Services. It is built on top of technology from the massively parallel processing (MPP) data warehouse ParAccel by Actian.
SBT (Simple Build Tool) sbt uses a small number of concepts to support flexible and powerful build definitions.
Scala Like Java, Scala is object-oriented, and uses a curly-brace syntax reminiscent of the C programming language. Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching.
Sentry Apache Sentry (incubating) is a highly modular system for providing fine-grained role based authorization to both data and metadata stored on an Apache Hadoop cluster. It currently works out of the box with Apache Hive and Cloudera Impala.
Shark (Hive on Spark) for Hive, increased processing speed
Solr Solr (pronounced "solar") is an open source enterprise search platform, written in Java, from the Apache Lucene project.
Spark for HDFS/HBase/Amazon S3/Avro, increased processing speed
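A minimal word-count sketch with the Spark 2.x Java API; the master setting and HDFS paths are placeholders for the example:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b); // in-memory aggregation: the speedup
            counts.saveAsTextFile("hdfs:///data/output");
            sc.stop();
        }
    }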
Splunk Basically, Splunk takes in all of your text-based log data and provides an easy way to search through it. It started out as "Google for your logs", but it has become far more than that as capabilities have been added: you can now pull in all sorts of data, perform all kinds of interesting statistical analysis on it, and present it in a variety of formats, from simple searches for specific patterns to all manner of graphical reports.
Sqoop (SQL-to-Hadoop) Data Exchange for RDBMS
Storm Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Tez Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.
WebHDFS WebHDFS is a REST API built into HDFS. The client needs access to all nodes of the cluster: when data is read, it is transmitted directly from the node that stores it.
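A sketch of reading a file over that REST API with plain Java; the NameNode address, port, user, and file path are placeholders (op=OPEN is the documented read operation, answered with a redirect to a DataNode):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsRead {
        public static void main(String[] args) throws Exception {
            // GET /webhdfs/v1/<path>?op=OPEN ; the NameNode redirects to a DataNode
            URL url = new URL("http://namenode:50070/webhdfs/v1/user/demo/sample.txt"
                              + "?op=OPEN&user.name=demo");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setInstanceFollowRedirects(true);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }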
Whirr Configuring Hadoop can be tricky and complicated. Apache Whirr can help in setting up a Hadoop cluster from scratch from a client terminal, but its use is not limited to Hadoop. Perhaps inspired by Chef, it uses configuration files or "whirr recipes" for running different services in a cloud-neutral way. For instance, one can use it to launch a Hadoop cluster on the Amazon Cloud. In a Whirr recipe for Hadoop, you can specify how many nodes you want, what Amazon Machine Image (AMI) you want to use, what version of Hadoop you want to install on the nodes and so on.
YARN (Yet Another Resource Negotiator) It is a cluster management technology. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
ZooKeeper Coordination/synchronization, znodes, distributed service - master/slave, stores configuration information
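A minimal sketch of working with znodes through the ZooKeeper Java client; the server address, znode path, and config payload are invented, and the parent path /app is assumed to already exist:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfig {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
            // Store shared configuration in a persistent znode (parent /app must exist)
            zk.create("/app/config", "batch.size=500".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            // Any coordinating process can read the same znode back
            byte[] data = zk.getData("/app/config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }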