In the last post we saw how to load and query data using Hive; this time we are going to use the .NET SDK for Hadoop for the same purpose.
The SDK lets us write Map/Reduce jobs to query data in a distributed file system that can span hundreds or thousands of nodes. MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of computers.
In a typical Map/Reduce program we find two classes: the mapper and the reducer.
The Mapper: this is the data collection phase. In this phase the mapper breaks up large pieces of work into smaller ones and then takes action on each piece.
The Reducer: this is the processing phase. The reducer combines the many results from the map step into a single output.
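To make the two phases concrete, here is a minimal sketch in plain Python that runs entirely in memory, with no Hadoop involved. The rows and the category column are made up for illustration and are not the real layout of Products.csv; the function names are my own as well.

```python
from collections import defaultdict

# Hypothetical rows: (product name, category). Not the real Products.csv schema.
ROWS = [
    ("Road Bike", "Bikes"),
    ("Helmet", "Accessories"),
    ("Mountain Bike", "Bikes"),
    ("Water Bottle", "Accessories"),
    ("Touring Bike", "Bikes"),
]

def map_row(row):
    """Map step: emit one (key, value) pair for each input record."""
    _name, category = row
    yield category, 1

def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_group(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(rows):
    """Run map, shuffle and reduce in sequence over an in-memory data set."""
    pairs = (pair for row in rows for pair in map_row(row))
    return dict(reduce_group(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(run_job(ROWS))
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the grouping step is handled by Hadoop itself; only the map and reduce logic is ours to write.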
The first thing we will do is load the data. In this case I am going to use the same Products.csv file from my last post and load it into the Hadoop file system.
To do this, open the Hadoop command line and type:
>hadoop fs -copyFromLocal C:\BigData\Products.csv input/Products/Products.csv
We can see the file we loaded by opening the Hadoop NameNode status page and browsing the file system.
Now we need to create a Class Library project in Visual Studio 2012. Once the project has been created, we need to add references to the Microsoft Hadoop DLLs, and for this we are going to use NuGet packages. If you right-click the project you will find the Manage NuGet Packages option (if not, you need to install this Visual Studio add-in).
When the Manage NuGet Packages dialog opens, type hadoop in the search box and install all the packages.
The project has three classes:
MyMapReduceAPP: implements the SDK's HadoopJob type, is the entry point of the job, and names the mapper and reducer classes.
ProductMapper: collects and processes the data.
Reducer: aggregates the data.
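The real classes are written in C# against the SDK and are not reproduced here. As a rough, language-neutral picture of how the three roles fit together, here is a small Python sketch. The column layout (id, category, price), the output path, and the run_locally helper are all assumptions made up for illustration; only the input path comes from the copy command above.

```python
class ProductMapper:
    """Mapper role: turn one CSV line into (key, value) pairs."""

    def map(self, line):
        # Hypothetical layout: id,category,price
        _id, category, price = line.split(",")
        yield category, int(price)


class ProductReducer:
    """Reducer role: aggregate every value seen for one key."""

    def reduce(self, key, values):
        return key, sum(values)


class MyMapReduceApp:
    """Job role: the entry point that names the paths, mapper and reducer."""

    input_path = "input/Products/Products.csv"  # path used in the copy step
    output_path = "output/Products"             # hypothetical output folder
    mapper = ProductMapper
    reducer = ProductReducer


def run_locally(job, lines):
    """Hypothetical stand-in for the cluster: run the job over in-memory lines."""
    mapper, reducer = job.mapper(), job.reducer()
    grouped = {}
    for line in lines:
        for key, value in mapper.map(line):
            grouped.setdefault(key, []).append(value)
    return dict(reducer.reduce(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["1,Bikes,500", "2,Bikes,700", "3,Helmets,40"]
    print(run_locally(MyMapReduceApp, sample))
```

The point of the layout is that the job class holds no processing logic of its own: it only tells the runner where the data lives and which mapper and reducer to use.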
To test our DLL we need to execute it with mrrunner.exe. In the Hadoop command line, type:
>cd C:\Users\Administrator.SHAREPOINT2013L\Documents\visual studio 2012\Projects\MapReduceAPP\MapReduceAPP\mrlib
>mrrunner -dll "C:\Users\Administrator.SHAREPOINT2013L\Documents\visual studio 2012\Projects\MapReduceAPP\MapReduceAPP\bin\Debug\MapReduceAPP.dll"
We can see the result by opening the Hadoop NameNode status page and browsing the file system.
As you can see, it is the same result we got before using Hive.
