Apache* Spark*

This tutorial describes how to install, configure, and run Apache Spark on Clear Linux* OS. Apache Spark is a fast, general-purpose cluster computing system with the following features:

  • Provides high-level APIs in Java*, Scala*, Python*, and R*.
  • Includes an optimized engine that supports general execution graphs.
  • Supports high-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.

In this tutorial, you will install Spark on a single machine that runs both the master daemon and a worker daemon.

Prerequisites

This tutorial assumes you have installed Clear Linux OS on your host system. For detailed instructions on installing Clear Linux OS on a bare metal system, visit the bare metal installation guide.

Before you install any new packages, update Clear Linux OS with the following command:

sudo swupd update

Install Apache Spark

Apache Spark is included in the big-data-basic bundle. To install the framework, enter:

sudo swupd bundle-add big-data-basic
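
To verify the installation, print the Spark version; the spark-submit tool is also used to run the example at the end of this tutorial:

spark-submit --version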

Configure Apache Spark

  1. Create the configuration directory with the command:

    sudo mkdir /etc/spark
    
  2. Copy the default templates from /usr/share/defaults/spark to /etc/spark with the command:

    sudo cp /usr/share/defaults/spark/* /etc/spark
    

    Note

    Since Clear Linux OS is a stateless system, you should never modify the files under the /usr/share/defaults directory. The software updater overwrites those files.
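
    To confirm the copy, list the new directory; it should contain the template files used in the next step:

    ls /etc/spark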

  3. Copy the template files below to create custom configuration files:

    sudo cp /etc/spark/spark-defaults.conf.template /etc/spark/spark-defaults.conf
    sudo cp /etc/spark/spark-env.sh.template /etc/spark/spark-env.sh
    sudo cp /etc/spark/log4j.properties.template /etc/spark/log4j.properties
    
  4. Edit the /etc/spark/spark-env.sh file and add the SPARK_MASTER_HOST variable. Replace the example address below with your host's IP address, which you can find with the hostname -I command.

    SPARK_MASTER_HOST="10.30.200.100"
    

    Note

    This optional step lets you use the master's web user interface to view information needed later in this tutorial.
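
    If you are unsure which address to use, list the addresses assigned to your host:

    hostname -I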

  5. Edit the /etc/spark/spark-defaults.conf file and update the spark.master variable with the SPARK_MASTER_HOST address and port 7077.

    spark.master    spark://10.30.200.100:7077
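
    To confirm the change, print the setting back from the file:

    grep spark.master /etc/spark/spark-defaults.conf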
    

Start the master server and a worker daemon

  1. Start the master server using:

    sudo /usr/share/apache-spark/sbin/start-master.sh
    
  2. Start one worker daemon and connect it to the master using the spark.master variable defined earlier:

    sudo /usr/share/apache-spark/sbin/start-slave.sh spark://10.30.200.100:7077
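
    To confirm that both daemons are running, check the process list; you should see one master and one worker process:

    ps -ef | grep -i spark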
    
  3. Open a web browser and view the worker daemon information using the master's IP address and port 8080:

    http://10.30.200.100:8080
    

Run the Spark wordcount example
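
The wordcount example reads a text file and counts the occurrences of each word. If you do not have a suitable file on hand, create a small sample; the path ~/Documents/example_file is only an example and matches the command in the next step:

mkdir -p ~/Documents
printf 'apache spark counts words\nspark counts words quickly\n' > ~/Documents/example_file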

  1. Run the wordcount example using a file on your local host (such as the sample file created above) and redirect the results to a new file:

    sudo spark-submit /usr/share/apache-spark/examples/src/main/python/wordcount.py ~/Documents/example_file > ~/Documents/results
    
  2. Open a web browser and view the application information using the master's IP address and port 8080:

    http://10.30.200.100:8080
    
  3. View the results of the wordcount application in the ~/Documents/results file.
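
    For example, print the first lines of the output, which list each word with its count:

    head ~/Documents/results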

Congratulations!

You successfully installed and set up a standalone Apache Spark cluster. Additionally, you ran a simple wordcount example.