Introduction to Hadoop and MapReduce Notes

February 10, 2014

Class Listing on Udacity

What is big data? It is a subjective term, but it usually means data that is too difficult to process on a single small machine, which is not necessarily the same thing as a large amount of data.

The challenges are that the data arrives very quickly and comes from many different sources.

The three V's

Volume, Variety, Velocity

References
When to use HBase and when to use Hive - Stack Overflow

Apache Flume – Architecture of Flume NG | Cloudera Developer Blog

CDH - distribution of Apache Hadoop and related projects.

Hadoop Streaming
Hadoop storage formats
Introducing Parquet: Efficient Columnar Storage for Apache Hadoop | Cloudera Developer Blog
hadoop - Storage format in HDFS - Stack Overflow

Terms

The Final
Much of the information below comes from a Google doc that was somewhat hidden in the course wiki and was not mentioned in the final's instructions. The doc can also be found in the forums for the class, but rather than simply reference it I wanted to take some notes from it, since it did not completely help me. Avoid the pain: just download a VM of Hadoop and follow the steps below.

Using Oracle VirtualBox

~~Download it from http://content.udacity-data.com/courses/ud617/Cloudera-Udacity-Training-VM-4.1.1.c.zip. Warning - the zipped file size is 1.7 GB~~
~~MD5sum file can be found here http://content.udacity-data.com/courses/ud617/Cloudera-Udacity-Training-VM-4.1.1.c.zip.md5~~
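
Since an MD5sum file is published next to a 1.7 GB download, it is worth checking the zip before unzipping it. Below is a minimal sketch of that check in Python. It assumes the `.md5` file uses the usual `md5sum` layout (hex digest first, then the file name), which I have not confirmed for this particular file; the function names `md5_of_file` and `matches_md5` are my own, not part of the course material.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so a multi-gigabyte zip does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, md5_line):
    """Compare a file against one line of an md5sum-style file.

    Assumption: the line looks like '<hex digest>  <file name>',
    so the first whitespace-separated field is the expected digest.
    """
    expected = md5_line.split()[0].lower()
    return md5_of_file(path) == expected
```

To use it, read the first line of the downloaded `.md5` file and pass it together with the path of the zip to `matches_md5`; a `False` result suggests the download is corrupt or incomplete.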
Unzip it. Warning - the unzipped size is 4.2 GB
Download and unzip data sets from:

Download and install VirtualBox from https://www.virtualbox.org/wiki/Downloads
Create a new virtual machine:

Press the ‘New’ button.
Choose a name and set ‘Type’ to ‘Linux’.
Press Next
Select memory size for the VM.
Press Next
Select ‘Use an existing virtual hard drive file’, click the button to browse to the directory where you unzipped the provided VM image, and press ‘Create’.
Start the VM!
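
The same steps can probably be scripted with VirtualBox's `VBoxManage` command-line tool instead of clicking through the GUI. The sketch below only builds and prints the commands and does not run anything. The VM name, the disk path, the 2048 MB memory size, the `Linux_64` OS type, and the controller name `SATA` are all assumptions of mine for illustration; adjust them to match the image you actually unzipped.

```python
import shlex


def build_vboxmanage_commands(vm_name, disk_path, memory_mb=2048):
    """Return VBoxManage invocations that mirror the GUI steps above.

    Nothing is executed here; each command is a list of arguments
    that could be passed to subprocess.run or pasted into a shell.
    """
    controller = "SATA"  # hypothetical controller name chosen for this sketch
    return [
        # 'New' button, name, and Type: Linux (Linux_64 is an assumed OS type)
        ["VBoxManage", "createvm", "--name", vm_name,
         "--ostype", "Linux_64", "--register"],
        # Select memory size for the VM
        ["VBoxManage", "modifyvm", vm_name, "--memory", str(memory_mb)],
        # 'Use an existing virtual hard drive file'
        ["VBoxManage", "storagectl", vm_name,
         "--name", controller, "--add", "sata"],
        ["VBoxManage", "storageattach", vm_name, "--storagectl", controller,
         "--port", "0", "--device", "0", "--type", "hdd",
         "--medium", disk_path],
        # Start the VM
        ["VBoxManage", "startvm", vm_name],
    ]


if __name__ == "__main__":
    # Both arguments are placeholders, not the real file names.
    for cmd in build_vboxmanage_commands("hadoop-training",
                                         "/path/to/unzipped/vm-disk.vmdk"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands first makes it easy to review them before running them by hand, which seemed safer to me than executing them directly from the script.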