
Apache Spark and Big Data

Apache Spark is one of the latest projects from Apache that makes it really easy to write MapReduce-style tasks in both single-cluster and multi-cluster mode. I have not tried a multi-cluster setup, but according to the documentation, all it requires is some configuration.

Installation

All you need is a JDK installed on your machine.

Apache Spark does not need any installation in single-node local mode. It is available in several languages, including Java, Python, and Scala. I am using Java here.
Create a Maven Java project and add the following dependency to your pom.xml file.
<dependency>
 <groupId>org.apache.spark</groupId>
 <artifactId>spark-core_2.10</artifactId>
 <version>1.3.1</version>
</dependency>

If you are creating a non-Maven project, just download the spark-core_&lt;version&gt;.jar file and add it to the project's build path. Then try one of the word count examples available online, such as the sketch below. You can run the class directly from Eclipse, in which case it will expect the input file on your local filesystem.
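
For reference, here is a minimal word count sketch in Java, assuming the package com.spark and the class name WordCount used in the spark-submit command further down; the input and output paths are taken from the command-line arguments.

package com.spark;

import java.util.Arrays;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

public class WordCount {
    public static void main(String[] args) {
        // args[0] = input path, args[1] = output path.
        // When running directly from Eclipse (no spark-submit), also call
        // setMaster("local[2]") on the SparkConf, since no master is passed in that case.
        SparkConf conf = new SparkConf().setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);

        // Split each line into words.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
                return Arrays.asList(line.split("\\s+"));
            }
        });

        // Map each word to a (word, 1) pair.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Sum the counts for each word and write the result out.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) {
                return a + b;
            }
        });

        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}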

You can also download a prebuilt Spark distribution, extract it to a directory, and use it to run your project. In this case, you need to package the classes in your project and then run the spark-submit command.
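If you set up the project with Maven as described above, packaging is just a matter of running the following from the project root (assuming the default Maven layout), which produces the snapshot jar under target/ that is passed to spark-submit below.

mvn package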

spark-1.3.1-bin-hadoop2.6/bin/spark-submit --class com.spark.WordCount --master local[2] ./target/ApacheSparkExample-0.0.1-SNAPSHOT.jar ipsuminput/*.txt ipsumoutput


Spark will expect the input and output locations to be in HDFS in this case.
