Showing posts with the label Hadoop

Introduction to Hadoop and MapReduce Notes

Class Listing on Udacity

What is big data? It is a subjective term, but it mostly refers to data that is difficult to process on a single small machine, which does not necessarily mean a large amount of data. The challenges are that data arrives very quickly and from multiple places.

The three V's: Volume, Variety, Velocity

References
When to use HBase and when to use Hive - Stack Overflow
Apache Flume – Architecture of Flume NG | Cloudera Developer Blog
CDH - distribution of Apache Hadoop and related projects.
Hadoop Streaming
Hadoop Storing format.
Introducing Parquet: Efficient Columnar Storage for Apache Hadoop | Cloudera Developer Blog
hadoop - Storage format in HDFS - Stack Overflow

Terms
NameNode
MapReduce
Shuffle and Sort
Apache Sqoop
Apache Nutch

The Final
Much of the information below is in a Google doc that was somewhat hidden in the course wiki but not provided in the final's instructions. The doc can also be found

Sunyit's Project BITS Documentations

The following is a collection of documents that I created solely for myself and colleagues in order to meet standards for implementing a Hadoop cloud service. That said, a lot of the information is specific to the systems used and customized to work only for those who were a part of the project. The objective for Project "Bits" can be found here in this link. All IP addresses have been marked with x's and URLs generalized in order to protect the SunyIT network system. I continue to study the systems used here and am releasing the documents in the hope that others might take up the project and implement it at their own university or college.

Back-Bone of Bits Project
This is the server BitsGW, which features a VPN connection across multiple colleges. It creates VMs of BitsHP (the Hadoop machines) so that new projects can scale, and it also provides an LDAP connection service.

BitsGW has the following users: afassett, admin
BitsHP: Pxe server for machines b