Mixing YARN and HDFS from different Hadoop versions, with spark-yarn environment configuration
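Hadoop has two main functions: storage and computation. Since Hadoop 2, HDFS has been more attractive than in Hadoop 1, both in stability and in features such as HA. The HDFS in hadoop-2.2.0 is already fairly stable, but newer YARN releases offer a richer feature set. This post focuses on viewing logs under spark-yarn and on configuring spark-yarn-dynamic.
The starting point is an environment where hadoop-2.2.0 HDFS is already in use. On top of it, yarn-2.6.0 and spark-1.3.0-bin-2.2.0 are set up and run.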
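- Build
I built inside a virtual machine that shares the host machine's Maven repository. See 【VMware Shared Directories】 and 【VMware-Centos6 Build hadoop-2.6】. Watch out for the cmake_symlink_library error: Linux symbolic links cannot be created inside a shared Windows directory.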
tar zxvf ~/hadoop-2.6.0-src.tar.gz
cd hadoop-2.6.0-src/
mvn package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true
# hadoop-hdfs is still 2.2, so Spark must be built against 2.2 here!
# Building against 2.6 leads to [UnsatisfiedLinkError:org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray ](http://blog.csdn.net/zeng_84_long/article/details/44340441)
cd spark-1.3.0
export MAVEN_OPTS="-Xmx3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=512m"
mvn clean package -Phadoop-2.2 -Pyarn -Phive -Phive-thriftserver -Dmaven.test.skip=true -Dmaven.javadoc.skip=true -DskipTests
vi make-distribution.sh # Comment out the BUILD_COMMAND line so that package is not run a second time!
./make-distribution.sh --mvn `which mvn` --tgz --skip-java-test -Phadoop-2.2 -Pyarn -Dmaven.test.skip=true -Dmaven.javadoc.skip=true -DskipTests
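The cmake_symlink_library error mentioned above comes from a build directory that cannot hold symbolic links. A minimal sketch of a pre-build check follows; `supports_symlinks` is a hypothetical helper name, not part of Hadoop, Maven, or CMake, and it only probes whether `ln -s` works in the given directory.

```shell
# supports_symlinks DIR
# Returns 0 if a symbolic link can be created in DIR, 1 if it cannot,
# and 2 if DIR is not writable at all. Probe files are removed afterwards.
supports_symlinks() {
  probe_target="$1/.symlink_probe_target.$$"
  probe_link="$1/.symlink_probe_link.$$"
  probe_status=1
  touch "$probe_target" 2>/dev/null || return 2
  if ln -s "$probe_target" "$probe_link" 2>/dev/null && [ -L "$probe_link" ]; then
    probe_status=0
  fi
  rm -f "$probe_link" "$probe_target"
  return "$probe_status"
}
```

Running `supports_symlinks "$PWD" || echo "unpack the source on a native Linux filesystem"` before `mvn package -Pdist,native` should catch the problem early. If the check fails, one option is to keep only the Maven repository on the shared directory and unpack the source tree inside the virtual machine's own disk.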
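- Configuration notes
- Do not copy the original core-site wholesale; carry over only the main settings.
- In yarn-site, yarn.resourcemanager.webapp.address must be set to a concrete address; 0.0.0.0 does not work.
- In yarn-site, add the spark_shuffle service to yarn.nodemanager.aux-services. See https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
- Copy or symlink the hive-site file into Spark's conf directory.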
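The two yarn-site points above might look like the fragment printed by the sketch below. The helper name `yarn_site_fragment` and the host argument are hypothetical; 8088 is YARN's default ResourceManager web UI port, `mapreduce_shuffle` is assumed to be the service already configured, and the `spark_shuffle` class name is the one given in the Spark documentation linked above.

```shell
# yarn_site_fragment RM_HOST
# Prints the yarn-site.xml properties discussed in the notes.
# RM_HOST is a placeholder for the real ResourceManager hostname.
yarn_site_fragment() {
  rm_host=$1
  cat <<EOF
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>${rm_host}:8088</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
EOF
}
```

For example, `yarn_site_fragment bigdatamgr1` prints properties that can be pasted inside the `<configuration>` element of yarn-site.xml. The same Spark documentation also asks for the Spark YARN shuffle jar to be on the classpath of every NodeManager, followed by a NodeManager restart.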
- spark-yarn-dynamic configuration: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
[esw@bigdatamgr1 spark-1.3.0-bin-2.2.0]$ cat conf/spark-defaults.conf
# spark.master spark://bigdatamgr1:7077,bigdata8:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# spark.executor.extraJavaOptions -Xmx16g -Xms16g -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:ParallelGCThreads=10
spark.driver.memory 48g
spark.executor.memory 48g
spark.sql.shuffle.partitions 200
#spark.scheduler.mode FAIR
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize 8g
#spark.kryoserializer.buffer.max.mb 2048
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 4
spark.shuffle.service.enabled true
[esw@bigdatamgr1 conf]$ cat spark-env.sh
#!/usr/bin/env bash
JAVA_HOME=/home/esw/jdk1.7.0_60
# log4j
__add_to_classpath() {
root=$1
if [ -d "$root" ] ; then
for f in `ls $root/*.jar | grep -v -E '/hive.*.jar'` ; do
if [ -n "$SPARK_DIST_CLASSPATH" ] ; then
export SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$f
else
export SPARK_DIST_CLASSPATH=$f
fi
done
fi
}
# Append the Hive jars to the end of SPARK_DIST_CLASSPATH
__add_to_classpath "/home/esw/apache-hive-0.13.1/lib"
#export HADOOP_CONF_DIR=/data/opt/ibm/biginsights/hadoop-2.2.0/etc/hadoop
export HADOOP_CONF_DIR=/home/esw/hadoop-2.6.0/etc/hadoop
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/esw/spark-1.3.0-bin-2.2.0/conf:$HADOOP_CONF_DIR
# HA
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=bi-00-01.bi.domain.com:2181 -Dspark.deploy.zookeeper.dir=/spark"
SPARK_PID_DIR=${SPARK_HOME}/pids
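In the spark-defaults.conf above, `spark.dynamicAllocation.enabled` and `spark.shuffle.service.enabled` are both set to true; dynamic allocation depends on the external shuffle service. A minimal sketch of a sanity check for that pairing follows. The helper names `conf_value` and `check_dynamic_allocation` are hypothetical, and the sketch assumes the whitespace-separated `key value` layout shown in the listing (it does not parse `key=value` lines).

```shell
# conf_value FILE KEY
# Prints the last uncommented value assigned to KEY, or nothing if KEY is unset.
conf_value() {
  awk -v key="$2" '$1 == key { val = $2 } END { if (val != "") print val }' "$1"
}

# check_dynamic_allocation FILE
# Returns 1 if dynamic allocation is enabled without the shuffle service.
check_dynamic_allocation() {
  conf=$1
  if [ "$(conf_value "$conf" spark.dynamicAllocation.enabled)" = "true" ] &&
     [ "$(conf_value "$conf" spark.shuffle.service.enabled)" != "true" ]; then
    echo "spark.dynamicAllocation.enabled requires spark.shuffle.service.enabled=true" >&2
    return 1
  fi
  return 0
}
```

A call such as `check_dynamic_allocation conf/spark-defaults.conf` can be run before starting any Spark service. Lines beginning with `#` are ignored because their first field never equals the key.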
- Sync
for h in `cat slaves ` ; do rsync -vaz hadoop-2.6.0 $h:~/ --delete --exclude=work --exclude=logs --exclude=metastore_db --exclude=data --exclude=pids ; done
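Because the loop above uses `--delete`, it may be worth previewing what each host would receive first. The sketch below only prints the commands, with rsync's `--dry-run` flag added; `sync_commands` is a hypothetical helper name, and skipping blank or `#`-prefixed lines in the slaves file is an added assumption, not something the original loop does.

```shell
# sync_commands SLAVES_FILE DIR
# Prints one rsync preview command per host listed in SLAVES_FILE.
# Nothing is copied: the commands are only echoed, and each carries --dry-run.
sync_commands() {
  slaves_file=$1
  dir=$2
  # Skip blank lines and lines starting with '#'.
  grep -v -E '^[[:space:]]*(#|$)' "$slaves_file" | while read -r host; do
    echo "rsync -vaz --dry-run $dir $host:~/ --delete --exclude=work --exclude=logs --exclude=metastore_db --exclude=data --exclude=pids"
  done
}
```

`sync_commands slaves hadoop-2.6.0` lists the commands for review; piping the output to `sh` runs the previews, and removing `--dry-run` turns them back into the real sync.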
- Start the Spark Hive Thrift server
./sbin/start-thriftserver.sh --executor-memory 29g --master yarn-client
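On a cluster that runs many jobs, automatic dynamic allocation (similar to a resource pool) makes better use of resources. You can watch the executor processes come and go under 【All Applications】-【ApplicationMaster】-【Executors】.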
–END