Apache* Spark*

This tutorial describes how to install, configure, and run Apache Spark on Clear Linux* OS. Apache Spark is a fast, general-purpose cluster computing system with the following features:

  • Provides high-level APIs in Java*, Scala*, Python*, and R*.
  • Includes an optimized engine that supports general execution graphs.
  • Supports high-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.

In this tutorial, you will install Spark on a single machine that runs both the master daemon and a worker daemon.

Prerequisites

This tutorial assumes you have installed Clear Linux OS on your host system. For detailed instructions on installing Clear Linux OS on a bare metal system, visit the bare metal installation guide.

Before you install any new packages, update Clear Linux OS with the following command:

sudo swupd update

Install Apache Spark

Apache Spark is included in the big-data-basic bundle. To install the framework, enter:

sudo swupd bundle-add big-data-basic
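
To verify the installation, print the Spark version; the spark-submit tool is also used to run the example at the end of this tutorial:

spark-submit --version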

Configure Apache Spark

  1. Create the configuration directory with the command:

    sudo mkdir /etc/spark
    
  2. Copy the default templates from /usr/share/defaults/spark to /etc/spark with the command:

    sudo cp /usr/share/defaults/spark/* /etc/spark
    

    Note

    Since Clear Linux OS is a stateless system, you should never modify the files under the /usr/share/defaults directory. The software updater overwrites those files.
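
    To confirm the copy, list the new directory; it should contain the template files used in the next step:

    ls /etc/spark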

  3. Copy the template files below to create custom configuration files:

    sudo cp /etc/spark/spark-defaults.conf.template /etc/spark/spark-defaults.conf
    sudo cp /etc/spark/spark-env.sh.template /etc/spark/spark-env.sh
    sudo cp /etc/spark/log4j.properties.template /etc/spark/log4j.properties
    
  4. Edit the /etc/spark/spark-env.sh file and add the SPARK_MASTER_HOST variable. Replace the example address below with your host's IP address, which you can find with the hostname -I command.

    SPARK_MASTER_HOST="10.30.200.100"
    

    Note

    This optional step lets you use the master's web user interface to view information needed later in this tutorial.
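
    If you are unsure which address to use, list the addresses assigned to your host:

    hostname -I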

  5. Edit the /etc/spark/spark-defaults.conf file and update the spark.master variable with the SPARK_MASTER_HOST address and port 7077.

    spark.master    spark://10.30.200.100:7077
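
    To confirm the change, print the setting back from the file:

    grep spark.master /etc/spark/spark-defaults.conf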
    

Start the master server and a worker daemon

  1. Start the master server using:

    sudo /usr/share/apache-spark/sbin/start-master.sh
    
  2. Start one worker daemon and connect it to the master using the spark.master variable defined earlier:

    sudo /usr/share/apache-spark/sbin/start-slave.sh spark://10.30.200.100:7077
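
    To confirm that both daemons are running, check the process list; you should see one master and one worker process:

    ps -ef | grep -i spark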
    
  3. Open a web browser and view the worker daemon information using the master's IP address and port 8080:

    http://10.30.200.100:8080
    

Run the Spark wordcount example
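
The wordcount example reads a text file and counts the occurrences of each word. If you do not have a suitable file on hand, create a small sample; the path ~/Documents/example_file is only an example and matches the command in the next step:

mkdir -p ~/Documents
printf 'apache spark counts words\nspark counts words quickly\n' > ~/Documents/example_file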

  1. Run the wordcount example using a file on your local host (such as the sample file created above) and redirect the results to a new file:

    sudo spark-submit /usr/share/apache-spark/examples/src/main/python/wordcount.py ~/Documents/example_file > ~/Documents/results
    
  2. Open a web browser and view the application information using the master's IP address and port 8080:

    http://10.30.200.100:8080
    
  3. View the results of the wordcount application in the ~/Documents/results file.
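
    For example, print the first lines of the output, which list each word with its count:

    head ~/Documents/results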

Congratulations!

You successfully installed and set up a standalone Apache Spark cluster. Additionally, you ran a simple wordcount example.