Introduction to Hadoop and MapReduce Notes

What is big data? It is a subjective term, but it mostly means data that is difficult to process on a single small machine. It is not necessarily defined by sheer size alone.
The main challenges are that data arrives very quickly and from many different sources.

The three V's 
Volume, Variety, Velocity

When to use HBase and when to use Hive - Stack Overflow
Apache Flume – Architecture of Flume NG | Cloudera Developer Blog
CDH - distribution of Apache Hadoop and related projects.
Hadoop Streaming
Hadoop storage formats
Introducing Parquet: Efficient Columnar Storage for Apache Hadoop | Cloudera Developer Blog
hadoop - Storage format in HDFS - Stack Overflow

The Final
Much of the information below comes from a Google doc that was somewhat hidden in the course wiki and was not mentioned in the final's instructions. The doc can also be found in the forums for the class, but rather than simply reference it I wanted to take some notes from it, since it did not completely help me. Avoid the pain: just download a VM of Hadoop and follow the steps below.
Using Oracle VirtualBox

  1. Download the VM image. Warning: the zipped file size is 1.7 GB.
  2. Unzip it. Warning: the unzipped size is 4.2 GB.
  3. Download and unzip the data sets.

  1. Download and install VirtualBox from
  2. Create a new Virtual machine:
    1. Create a new virtual machine by pressing the ‘New’ button:
    2. Choose a name, use ‘Type’: ‘Linux’:
    3. Press Next
    4. Select memory size for the VM.
    5. Press Next
    6. Select ‘Use an existing virtual hard drive file’, click the button to browse to the directory where you unzipped the provided VM image, and press ‘Create’.
    7. Start the VM!

Python Resources
Example use of "continue" statement in Python? - Stack Overflow
4. More Control Flow Tools — Python v2.7.6 documentation
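
The "continue" statement is handy in a Hadoop Streaming mapper, where you often want to skip malformed input lines instead of crashing the job. Below is a minimal sketch of that idea, not the course's actual solution. The tab-separated layout, the six-field count, the column positions, and the function name map_lines are all assumptions made up for illustration.

```python
def map_lines(lines, expected_fields=6, key_index=2, value_index=4):
    """Yield (key, value) pairs from tab-separated lines.

    The field count and column positions are hypothetical defaults;
    adjust them to match the real data set.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != expected_fields:
            # Malformed record: skip it and move on to the next line.
            continue
        yield fields[key_index], fields[value_index]


# Small in-memory demo with made-up records (one of them malformed).
sample = [
    "2012-01-01\t09:00\tSan Jose\tMusic\t12.50\tVisa\n",
    "this line is malformed\n",
    "2012-01-01\t09:05\tReno\tToys\t3.00\tCash\n",
]
for key, value in map_lines(sample):
    # Streaming mappers emit key and value separated by a tab.
    print("{0}\t{1}".format(key, value))
```

In a real streaming job you would pass sys.stdin to map_lines instead of the sample list, since Hadoop Streaming feeds input lines to the mapper on standard input and reads its tab-separated output from standard output.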
