Updated on 2022-06-01 GMT+08:00

Using the External Shuffle Service to Improve Performance

Scenario

When Spark runs an application that involves a shuffle, each executor process not only runs tasks but also writes shuffle data and serves that data to other executors. If an executor is heavily loaded and garbage collection (GC) occurs, it cannot serve shuffle data to other executors, which affects the tasks running on them.

The external shuffle service is a long-running auxiliary service in NodeManager. It serves shuffle data on behalf of executors, which reduces their load. If GC occurs on one executor, tasks on other executors are not affected.

Procedure

  1. Enable the external shuffle service on NodeManager.
    1. On MRS Manager (for details about how to log in to MRS Manager, see Login to MRS Manager), choose Services > Yarn > Service Configuration and choose Yarn > Customize to add the following configuration items to yarn-site.xml:
      <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>spark_shuffle</value>
      </property>
      <property>
          <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
          <value>org.apache.spark.network.yarn.YarnShuffleService</value>
      </property>

      Parameter                                          Description
      yarn.nodemanager.aux-services                      A long-term auxiliary service in NodeManager for improving shuffle computing performance
      yarn.nodemanager.aux-services.spark_shuffle.class  Class of an auxiliary service in NodeManager

    2. Add a dependency JAR file.

      Copy ${SPARK_HOME}/lib/spark-1.5.1-yarn-shuffle.jar to the ${HADOOP_HOME}/share/hadoop/yarn/lib/ directory.

    3. Restart the NodeManager process so that the external shuffle service is started.
  2. Apply the external shuffle service to Spark applications.
    • Add the following configuration items to the Spark/spark/conf/spark-defaults.conf file in the client installation directory:
      spark.shuffle.service.enabled   true
      spark.shuffle.service.port      7337

      Parameter                      Description
      spark.shuffle.service.enabled  Whether Spark applications use the long-term auxiliary service in NodeManager to improve shuffle computing performance. The default value is false, indicating that this function is disabled.
      spark.shuffle.service.port     Port on which the shuffle service listens for requests to obtain data. This parameter is optional and its default value is 7337.

      1. If the yarn.nodemanager.aux-services configuration item exists, add spark_shuffle to its value. Use a comma to separate this value from other values.

      2. The value of spark.shuffle.service.port in spark-defaults.conf must be the same as the value of spark.shuffle.service.port in the yarn-site.xml file.
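
The two notes above can be illustrated with a minimal yarn-site.xml sketch. It assumes that the cluster already has the mapreduce_shuffle auxiliary service configured and that the shuffle port is set explicitly in yarn-site.xml. Both are assumptions made for illustration and may not match your cluster, so adapt the existing value rather than copying this one.

```xml
<!-- Sketch only: assumes mapreduce_shuffle is the auxiliary service already present. -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <!-- spark_shuffle is appended to the existing value, separated by a comma. -->
    <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
    <!-- Assumption: the port is set here explicitly. It must match
         spark.shuffle.service.port in spark-defaults.conf. -->
    <name>spark.shuffle.service.port</name>
    <value>7337</value>
</property>
```

If the port is left at its default on both sides, the two values already agree and the last property can be omitted.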