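First Look
A MapReduce job in stock Hadoop 2 can hand intermediate data along only once: a single job can aggregate (reduce) only once, and the result must then be written to disk. The Tez project extends the existing YARN architecture and uses a DAG (directed acyclic graph) to implement MRR (map-reduce-reduce) style job pipelines.
In the figure above, the MR job on the left has to store its data after finishing one step before another job can run the second reduce. Tez, by contrast, can run a further reduce directly after a reduce, which cuts the intermediate IO and the MapReduce startup time.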
Environment Integration
- Install/Deploy
- hadoop-2.2.0(umcc97-44:hdfs, umcc97-79:yarn)
- Built on Windows using Cygwin
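Download and Build Tez
First download tez-0.4.0-incubating.tar.gz. The protoc program is also required (see the notes on building Hadoop from source).
After extracting the archive, build it with mvn.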
Administrator@winseliu /cygdrive/e/local/libs/big
$ tar zxvf tez-0.4.0-incubating.tar.gz
Administrator@winseliu /cygdrive/e/local/libs/big
$ cd tez-0.4.0-incubating/
Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating
$ mvn install -DskipTests -Dmaven.javadoc.skip
...
[INFO] Reactor Summary:
[INFO]
[INFO] tez ............................................... SUCCESS [1.518s]
[INFO] tez-api ........................................... SUCCESS [8.890s]
[INFO] tez-common ........................................ SUCCESS [0.725s]
[INFO] tez-runtime-internals ............................. SUCCESS [2.529s]
[INFO] tez-runtime-library ............................... SUCCESS [5.100s]
[INFO] tez-mapreduce ..................................... SUCCESS [3.666s]
[INFO] tez-mapreduce-examples ............................ SUCCESS [2.692s]
[INFO] tez-dag ........................................... SUCCESS [13.943s]
[INFO] tez-tests ......................................... SUCCESS [1.691s]
[INFO] tez-dist .......................................... SUCCESS [14.370s]
[INFO] Tez ............................................... SUCCESS [0.245s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 55.791s
[INFO] Finished at: Tue Jun 17 17:33:45 CST 2014
[INFO] Final Memory: 35M/151M
[INFO] ------------------------------------------------------------------------
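Since the build depends on protoc, it may be worth checking the installed version before running mvn. The sketch below is only an illustration: `protoc_version_ok` is a hypothetical helper, and the default of 2.5.0 is an assumption inferred from the protobuf-java-2.5.0.jar that the build places among the Tez libs.

```shell
# Sketch: check that a `protoc --version` string matches the expected release.
# protoc_version_ok is a hypothetical helper; 2.5.0 is an assumed default.
protoc_version_ok() {
    # $1: full output of `protoc --version`, e.g. "libprotoc 2.5.0"
    # $2: expected version (optional, defaults to 2.5.0)
    expected="${2:-2.5.0}"
    actual="${1#libprotoc }"   # strip the leading "libprotoc " label
    [ "$actual" = "$expected" ]
}

# Usage on a machine with protoc installed:
# protoc_version_ok "$(protoc --version)" || echo "unexpected protoc version" >&2
```

Passing the version string as an argument, rather than calling protoc inside the function, keeps the check usable on machines where protoc is not on the PATH.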
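Upload the Tez Jars to HDFS
To keep things simple, I uploaded the Tez jars straight to the development cluster for testing. The procedure on a local cluster should be similar.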
Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating
$ cd tez-dist/
Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating/tez-dist
$ cd target/
Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating/tez-dist/target
$ export HADOOP_USER_NAME=hadoop
Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating/tez-dist/target
$ hadoop dfs -put tez-0.4.0-incubating/tez-0.4.0-incubating/ hdfs://umcc97-44:9000/apps/
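Configure the Cluster Environment
First look at the cluster's existing classpath. It already includes the etc/hadoop directory, so I put tez-site.xml directly into that directory. I also copied the Tez libs into the share/hadoop/tez directory and added that directory to the HADOOP_CLASSPATH environment variable.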
[hadoop@umcc97-79 hadoop]$ hadoop classpath
/home/hadoop/hadoop-2.2.0/etc/hadoop:/home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.2.0/contrib/capacity-scheduler/*.jar
# Used for map/reduce
[hadoop@umcc97-79 hadoop]$ cat tez-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>tez.lib.uris</name>
<value>${fs.default.name}/apps/tez-0.4.0-incubating,${fs.default.name}/apps/tez-0.4.0-incubating/lib/</value>
</property>
</configuration>
[hadoop@umcc97-79 hadoop]$ cd ~/hadoop-2.2.0/share/hadoop/tez/
[hadoop@umcc97-79 tez]$ ll
total 9616
-rw-r--r-- 1 hadoop hadoop 303139 Jun 17 17:33 avro-1.7.4.jar
-rw-r--r-- 1 hadoop hadoop 41123 Jun 17 17:33 commons-cli-1.2.jar
-rw-r--r-- 1 hadoop hadoop 610259 Jun 17 17:33 commons-collections4-4.0.jar
-rw-r--r-- 1 hadoop hadoop 1648200 Jun 17 17:33 guava-11.0.2.jar
-rw-r--r-- 1 hadoop hadoop 710492 Jun 17 17:33 guice-3.0.jar
-rw-r--r-- 1 hadoop hadoop 656365 Jun 17 17:33 hadoop-mapreduce-client-common-2.2.0.jar
-rw-r--r-- 1 hadoop hadoop 1455001 Jun 17 17:33 hadoop-mapreduce-client-core-2.2.0.jar
-rw-r--r-- 1 hadoop hadoop 21537 Jun 17 17:33 hadoop-mapreduce-client-shuffle-2.2.0.jar
-rw-r--r-- 1 hadoop hadoop 81743 Jun 17 17:33 jettison-1.3.4.jar
-rw-r--r-- 1 hadoop hadoop 533455 Jun 17 17:33 protobuf-java-2.5.0.jar
-rw-r--r-- 1 hadoop hadoop 995968 Jun 17 17:33 snappy-java-1.0.4.1.jar
-rw-r--r-- 1 hadoop hadoop 749917 Jun 17 17:33 tez-api-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop 34049 Jun 17 17:33 tez-common-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop 970987 Jun 17 17:33 tez-dag-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop 246409 Jun 17 17:33 tez-mapreduce-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop 199934 Jun 17 17:33 tez-mapreduce-examples-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop 114692 Jun 17 17:33 tez-runtime-internals-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop 352177 Jun 17 17:33 tez-runtime-library-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop 6845 Jun 17 17:33 tez-tests-0.4.0-incubating.jar
# MR configuration, used when the client submits jobs
[hadoop@umcc97-79 hadoop]$ grep HADOOP_CLASSPATH hadoop-env.sh
export HADOOP_CLASSPATH=${HADOOP_HOME}/share/hadoop/tez/*:${HADOOP_HOME}/share/hadoop/tez/lib/*:$HADOOP_CLASSPATH
[hadoop@umcc97-79 hadoop]$ sed -n 19,23p mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn-tez</value>
</property>
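The tez.lib.uris value in the listing above names both the uploaded Tez directory and its lib/ subdirectory. As a rough sketch, the file could be generated for a given directory name like this; `make_tez_site` is a hypothetical helper rather than anything shipped with Tez or Hadoop, and it assumes the jars were uploaded under /apps as in this post.

```shell
# Sketch: print a tez-site.xml for a Tez directory uploaded under /apps.
# make_tez_site is a hypothetical helper, not part of Tez or Hadoop.
make_tez_site() {
    # $1: directory name under /apps, e.g. tez-0.4.0-incubating
    # The \$ keeps ${fs.default.name} literal so Hadoop expands it, not the shell.
    base="\${fs.default.name}/apps/$1"
    cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${base},${base}/lib/</value>
  </property>
</configuration>
EOF
}

# Usage: make_tez_site tez-0.4.0-incubating > etc/hadoop/tez-site.xml
```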
Sync and Restart YARN
for h in `cat hadoop-2.2.0/etc/hadoop/slaves ` ; do
rsync -vaz --exclude=logs --exclude=pid --exclude=tmp hadoop-2.2.0 $h:~/ ;
done
# Sync to the secondary namenode
rsync -vaz --exclude=logs --exclude=pid --exclude=tmp hadoop-2.2.0 umcc97-44:~/
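Before pushing the directory to every node, it can be useful to preview which rsync commands the loop above would run. The following is a minimal dry-run sketch; `print_sync_commands` is a hypothetical helper, and skipping blank lines and `#` comments in the slaves file is an assumption added here for robustness.

```shell
# Sketch: print (not run) one rsync command per host listed in a slaves file.
# print_sync_commands is a hypothetical helper.
print_sync_commands() {
    # $1: path to the slaves file, $2: directory to sync (e.g. hadoop-2.2.0)
    while IFS= read -r host || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;   # ignore empty lines and comments
        esac
        echo "rsync -vaz --exclude=logs --exclude=pid --exclude=tmp $2 ${host}:~/"
    done < "$1"
}

# Usage: print_sync_commands hadoop-2.2.0/etc/hadoop/slaves hadoop-2.2.0
```

Once the files are in place, YARN still has to be restarted; on a stock hadoop-2.2.0 layout that would typically be `sbin/stop-yarn.sh` followed by `sbin/start-yarn.sh`.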
Test
[hadoop@umcc97-79 ~]$ hadoop classpath
/home/hadoop/hadoop-2.2.0/etc/hadoop:/home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.2.0/share/hadoop/tez/*:/home/hadoop/hadoop-2.2.0/share/hadoop/tez/lib/*:/home/hadoop/hadoop-2.2.0/contrib/capacity-scheduler/*.jar
[hadoop@umcc97-79 ~]$ cd hadoop-2.2.0/share/hadoop/mapreduce/
[hadoop@umcc97-79 mapreduce]$ hadoop jar hadoop-mapreduce-client-jobclient-2.2.0-tests.jar sleep -mt 1 -rt 1 -m 1 -r 1
cd ~/hadoop-2.2.0/share/hadoop/tez/
hadoop fs -put ~/hadoop-2.2.0/logs/yarn-hadoop-resourcemanager-umcc97-79.* /hello/in
hadoop fs -rmr /hello/out
hadoop jar tez-mapreduce-examples-0.4.0-incubating.jar orderedwordcount /hello/in /hello/out
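The first command in the test above is there to confirm that the Tez directories now appear on the classpath. Scanning that long line by eye is error-prone, so a small check along these lines may help; `classpath_has_tez` is a hypothetical helper and assumes the share/hadoop/tez layout used in this post.

```shell
# Sketch: succeed if a colon-separated classpath contains a share/hadoop/tez entry.
# classpath_has_tez is a hypothetical helper.
classpath_has_tez() {
    # $1: classpath string, e.g. the output of `hadoop classpath`
    printf '%s\n' "$1" | tr ':' '\n' | grep -q '/share/hadoop/tez/'
}

# Usage: classpath_has_tez "$(hadoop classpath)" && echo "tez is on the classpath"
```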
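Rollback: Set the Environment Variable Only When Needed
After switching to Tez, hive-0.12.0 could no longer run. Because colleagues need Hive, I had to revert all of the configuration. [For upgrading Hive, see "Using Tez in hive-0.13".]
Set the framework to yarn in the configuration file. To use Tez, pass the configuration parameter when submitting the job.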
export HADOOP_CLASSPATH=${HADOOP_HOME}/share/hadoop/tez/*:${HADOOP_HOME}/share/hadoop/tez/lib/*:$HADOOP_CLASSPATH
hadoop jar hadoop-2.2.0/share/hadoop/tez/tez-mapreduce-examples-0.4.0-incubating.jar orderedwordcount \
-Dmapreduce.framework.name=yarn-tez /hello/in /hello/out
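The export above changes HADOOP_CLASSPATH for the rest of the shell session. If the Tez jars should be visible to a single command only, one option is to wrap the command in a subshell. This is a sketch under the same directory layout; `with_tez` is a hypothetical helper and expects HADOOP_HOME to be set.

```shell
# Sketch: run one command with the Tez jars prepended to HADOOP_CLASSPATH.
# with_tez is a hypothetical helper; it expects HADOOP_HOME to be set.
with_tez() {
    # The subshell keeps the caller's HADOOP_CLASSPATH untouched.
    (
        HADOOP_CLASSPATH="${HADOOP_HOME}/share/hadoop/tez/*:${HADOOP_HOME}/share/hadoop/tez/lib/*:${HADOOP_CLASSPATH}"
        export HADOOP_CLASSPATH
        "$@"
    )
}

# Usage:
# with_tez hadoop jar hadoop-2.2.0/share/hadoop/tez/tez-mapreduce-examples-0.4.0-incubating.jar \
#     orderedwordcount -Dmapreduce.framework.name=yarn-tez /hello/in /hello/out
```

Other jobs started from the same shell, Hive included, then keep seeing the original classpath.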
org.apache.tez.mapreduce.examples.OrderedWordCount not only computes the word counts but also sorts the results by count.
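To sanity-check the job output on a small sample, a local pipeline can produce a comparable result. This sketch is not the Tez implementation: `ordered_wordcount` is a hypothetical helper, and both the whitespace tokenization and the ascending sort order are assumptions that may differ from what the example job does.

```shell
# Sketch: count words on stdin and print "<count> <word>" sorted by ascending count.
# ordered_wordcount is a hypothetical helper; tokenization and order are assumptions.
ordered_wordcount() {
    tr -s '[:space:]' '\n' |   # one word per line
        grep -v '^$' |         # drop empty lines
        sort | uniq -c |       # first stage: count each word
        sort -n -k1,1 |        # second stage: order by count
        awk '{print $1, $2}'   # normalize the padding added by uniq -c
}

# Usage: hadoop fs -cat '/hello/in/*' | ordered_wordcount
```

The two sort stages mirror the map-reduce-reduce shape: the first groups and counts, the second reorders the counts without writing an intermediate file.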
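Open question: I have not yet figured out how the history of Tez jobs works. Starting the historyserver seems to have no effect?
Version 0.6 already has a UI.
Ongoing Updates
I had planned to build tez-0.6 and drop it straight into hive-0.13, but hit a wall: hive-0.13 does not support it!
Before building Tez for integration with Hive, first download the Hive source and check which Tez version its pom.xml actually uses, and only then build Tez.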
apache-hive-1.1.0-src.tar.gz/pom.xml
<tez.version>0.5.2</tez.version>
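Looking up the version by hand works, but the check is easy to script. A minimal sketch, assuming the property sits on a single line as in the excerpt above; `tez_version_from_pom` is a hypothetical helper.

```shell
# Sketch: print the first <tez.version> value found in a pom.xml.
# tez_version_from_pom is a hypothetical helper; it assumes the
# opening and closing tags are on the same line.
tez_version_from_pom() {
    # $1: path to the pom.xml of the extracted Hive source
    sed -n 's:.*<tez\.version>\(.*\)</tez\.version>.*:\1:p' "$1" | head -n 1
}

# Usage: tez_version_from_pom pom.xml
```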
Building tez-0.6 against hadoop-2.2:
E:\local\opt\bigdata\apache-tez-0.6.0-src>mvn package -Dhadoop.version=2.2.0 -DskipTests -Dmaven.javadoc.skip=true -DskipATS
vi tez-dist/pom.xml
<profile>
<id>hadoop26</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
–END