SparkSQL-on-YARN Dynamic Executor Pool Configuration
Official configuration docs
Hands-on
Modify the YARN configuration to add the spark-yarn-shuffle jar, sync the config and the jar to every NodeManager node, then restart YARN.
```
[hadoop@hadoop-master1 hadoop-2.6.3]$ vi etc/hadoop/yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

[hadoop@hadoop-master1 hadoop-2.6.3]$ cp ~/spark-1.6.0-bin-2.6.3/lib/spark-1.6.0-yarn-shuffle.jar share/hadoop/yarn/
for h in `cat etc/hadoop/slaves` ; do rsync -az share $h:~/hadoop-2.6.3/ ; done
for h in `cat etc/hadoop/slaves` ; do rsync -az etc $h:~/hadoop-2.6.3/ ; done
rsync -vaz etc hadoop-master2:~/hadoop-2.6.3/
rsync -vaz share hadoop-master2:~/hadoop-2.6.3/

[hadoop@hadoop-master1 hadoop-2.6.3]$ sbin/stop-yarn.sh
[hadoop@hadoop-master1 hadoop-2.6.3]$ sbin/start-yarn.sh
```
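Before restarting YARN it may be worth double-checking that the jar actually reached every NodeManager. A minimal sketch, reusing the slaves file and the directory layout from the commands above:

```
# Check every slave for the copied yarn-shuffle jar (paths assumed from the setup above)
for h in `cat etc/hadoop/slaves` ; do
  echo -n "$h: "
  ssh $h "ls ~/hadoop-2.6.3/share/hadoop/yarn/spark-1.6.0-yarn-shuffle.jar 2>/dev/null || echo MISSING"
done
```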
Hive 1.2.1 is already deployed (it matches the Hive version bundled with Spark 1.6.0), so simply symlink hive-site.xml into spark/conf:
```
[hadoop@hadoop-master1 spark-1.6.0-bin-2.6.3]$ cd conf/
[hadoop@hadoop-master1 conf]$ ln -s ~/hive/conf/hive-site.xml
[hadoop@hadoop-master1 spark-1.6.0-bin-2.6.3]$ ll conf/hive-site.xml
lrwxrwxrwx. 1 hadoop hadoop 36 Mar 25 11:30 conf/hive-site.xml -> /home/hadoop/hive/conf/hive-site.xml
```
Note: if Tez was configured previously, comment out the hive.execution.engine property in hive-site.xml, or switch the engine at launch time: bin/spark-sql --master yarn-client --hiveconf hive.execution.engine=mr
Modify the Spark configuration
spark-defaults.conf
```
[hadoop@hadoop-master1 conf]$ cat spark-defaults.conf
spark.yarn.jar                                  hdfs:///spark/spark-assembly-1.6.0-hadoop2.6.3-ext-2.1.jar

spark.dynamicAllocation.enabled                 true
spark.shuffle.service.enabled                   true
spark.dynamicAllocation.executorIdleTimeout     600s
spark.dynamicAllocation.minExecutors            160
spark.dynamicAllocation.maxExecutors            1800
spark.dynamicAllocation.schedulerBacklogTimeout 5s

spark.driver.maxResultSize                      0

spark.eventLog.enabled                          true
spark.eventLog.compress                         true
spark.eventLog.dir                              hdfs:///spark-eventlogs
spark.yarn.historyServer.address                hadoop-master2:18080

spark.serializer                                org.apache.spark.serializer.KryoSerializer
```
- spark.yarn.jar: once set, Spark uses this file on HDFS directly as the executors' main jar, so the ~180 MB spark assembly no longer has to be uploaded for every application (uploading it every time is no small cost); a one-time upload sketch follows this list
- spark.dynamicAllocation.enabled / spark.shuffle.service.enabled: both must be set to true for dynamic allocation to work
- executorIdleTimeout: how long an executor may sit idle before it is shut down
- schedulerBacklogTimeout: how long tasks may stay backlogged before additional executors are requested
- minExecutors: the minimum number of executors kept alive
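The assembly referenced by spark.yarn.jar has to exist on HDFS before the first job runs. A one-time upload sketch (the local lib/ path is an assumption; use wherever your assembly jar actually lives):

```
# Hypothetical one-time upload of the Spark assembly referenced by spark.yarn.jar
hadoop fs -mkdir -p /spark
hadoop fs -put lib/spark-assembly-1.6.0-hadoop2.6.3-ext-2.1.jar /spark/
```

These dynamic-allocation settings can also be overridden per job with --conf on the command line if a particular workload needs different bounds.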
spark-env.sh environment variables
```
[hadoop@hadoop-master1 conf]$ cat spark-env.sh
SPARK_CLASSPATH=/home/hadoop/hive/lib/mysql-connector-java-5.1.21-bin.jar:$SPARK_CLASSPATH
HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
SPARK_DRIVER_MEMORY=30g
SPARK_PID_DIR=/home/hadoop/tmp/pids
```
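With hive-site.xml linked and the defaults in place, a quick smoke test from the CLI confirms the metastore connection works before the thrift server goes up. A sketch; any trivial query will do:

```
# Run a throwaway query on YARN against the Hive metastore
bin/spark-sql --master yarn-client -e "show databases;"
```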
Startup
```
[hadoop@hadoop-master1 spark-1.6.0-bin-2.6.3]$ sbin/start-thriftserver.sh --master yarn-client

[hadoop@hadoop-master2 spark-1.6.0-bin-2.6.3]$ sbin/start-history-server.sh hdfs:///spark-eventlogs

# For a Spark build that does not bundle the hadoop jars, write a wrapper script
# that puts the hadoop dependencies on the classpath yourself:
[hadoop@hadoop-master2 spark-1.3.1-bin-hadoop2.6.3-without-hive]$ cat start-historyserver.sh
#!/bin/sh
bin=`dirname $0`
cd $bin

source $HADOOP_HOME/libexec/hadoop-config.sh
export SPARK_PID_DIR=/home/hadoop/tmp/pids
export SPARK_CLASSPATH=`hadoop classpath`

# http://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
export SPARK_HISTORY_OPTS="-Dspark.history.fs.update.interval=30s -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.logDirectory=hdfs:///spark-eventlogs"

sbin/start-history-server.sh
```
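Once the thrift server is running, clients can connect with the bundled beeline. A sketch assuming the default HiveServer2 thrift port 10000 and the thrift server running on hadoop-master1:

```
# Connect to the Spark thrift server via JDBC
bin/beeline -u jdbc:hive2://hadoop-master1:10000 -n hadoop
```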
That's it.
To recap, the whole process is: add the Spark shuffle service to YARN, configure the Spark parameters, and restart the relevant services (YARN / the thrift server).
–END