
Configuring Process Parameters

Scenario

There are three types of processes in Spark on YARN mode: driver, ApplicationMaster, and executor. During task scheduling and execution, the driver and executors carry most of the workload, while the ApplicationMaster is responsible only for starting and stopping containers.

Therefore, the parameter settings of the driver and executor greatly affect the execution of Spark applications. You can perform the following operations to optimize Spark cluster performance.

Procedure

  1. Configure driver memory.

    The driver schedules tasks and communicates with the executor and ApplicationMaster. When the number of tasks and the task parallelism increase, the driver memory needs to be increased accordingly.

    You can set a proper memory for the driver based on the actual number of tasks.

    • Set spark.driver.memory in spark-defaults.conf or SPARK_DRIVER_MEMORY in spark-env.sh to a proper value.
    • When you run the spark-submit command, add the --driver-memory MEM parameter to set the memory.
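
    For example, a minimal sketch of both methods (the 4 GB value is illustrative; size it according to your actual number of tasks, and the class and JAR names are placeholders):

      # In spark-defaults.conf
      spark.driver.memory  4g

      # Or when submitting the application
      spark-submit --driver-memory 4g --class <main_class> <application_jar>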

  2. Configure the number of executors.

    Each executor core can run one task at a time, so increasing the number of executors increases task concurrency. When resources are sufficient, you can increase the number of executors to improve running efficiency.

    • Set spark.executor.instances in spark-defaults.conf or SPARK_EXECUTOR_INSTANCES in spark-env.sh to a proper value. Alternatively, you can enable dynamic resource scheduling so that Spark adjusts the number of executors automatically.
    • When you run the spark-submit command, add the --num-executors NUM parameter to set the number of executors.
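
    For example, a sketch in spark-defaults.conf (the executor counts are illustrative; dynamic resource scheduling normally also requires the external shuffle service, and should not be combined with a fixed instance count):

      # Fixed number of executors
      spark.executor.instances  10

      # Or let Spark scale the number of executors with the load instead
      spark.dynamicAllocation.enabled       true
      spark.shuffle.service.enabled         true
      spark.dynamicAllocation.minExecutors  2
      spark.dynamicAllocation.maxExecutors  10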

  3. Configure the number of executor cores.

    Multiple cores of an executor can run multiple tasks at the same time, which increases the task concurrency. However, because all cores share the memory of an executor, you need to balance the memory and the number of cores.

    • Set spark.executor.cores in spark-defaults.conf or SPARK_EXECUTOR_CORES in spark-env.sh to a proper value.
    • When you run the spark-submit command, add the --executor-cores NUM parameter to set the number of executor cores.
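
    For example, with 4 cores sharing 4 GB of executor memory, each concurrently running task gets roughly 1 GB (both values are illustrative; the class and JAR names are placeholders):

      spark-submit --executor-cores 4 --executor-memory 4g --class <main_class> <application_jar>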

  4. Configure executor memory.

    Executor memory is used for task execution and communication. Increase the memory for large tasks that need more resources; reduce it for small, fast tasks so that more executors can run concurrently.

    • Set spark.executor.memory in spark-defaults.conf or SPARK_EXECUTOR_MEMORY in spark-env.sh to a proper value.
    • When you run the spark-submit command, add the --executor-memory MEM parameter to set the memory.
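
    For example, in spark-env.sh (the 6 GB value is illustrative):

      export SPARK_EXECUTOR_MEMORY=6g

      # Equivalent setting in spark-defaults.conf
      # spark.executor.memory  6g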

Examples

  • During a Spark WordCount calculation, the data volume is 1.6 TB and the number of executors is 250.

    Under the default configuration, the execution fails with "Futures timed out" and OOM errors.

    The causes are as follows: the data volume is large and the job contains many tasks. Each WordCount task is small and completes quickly, so as the number of tasks grows, some objects on the driver side grow large. In addition, the executor communicates with the driver each time a task completes. As a result, driver memory becomes insufficient and communication between the processes is interrupted.

    When the driver memory is set to 4 GB, the application is successfully executed.
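
    A sketch of the corrected submission (the class, JAR, and path arguments are placeholders):

      spark-submit --driver-memory 4g --num-executors 250 --class <main_class> <wordcount_jar> <input_path> <output_path>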

  • When the TPC-DS test suite is executed through Thrift Server, many errors such as Executor Lost are reported under the default parameter configuration. With 30 GB of driver memory, 2 executor cores, 125 executors, and 6 GB of executor memory, all tasks can be executed successfully.
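
    Expressed as command-line options, this configuration corresponds roughly to the following sketch (start-thriftserver.sh accepts the same options as spark-submit; any remaining site-specific options are omitted):

      start-thriftserver.sh --driver-memory 30g --executor-cores 2 --num-executors 125 --executor-memory 6g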