Winse Blog

Every stop along the way is scenery; the hustle and bustle all heads toward the best; the busy days are all for tomorrow. What is there to fear?

Deploying YARN on an Existing HDFS

Original Environment

[biadmin@bigdatamgr1 IHC]$ pwd
/data/opt/ibm/biginsights/IHC

[biadmin@bigdatamgr1 biginsights]$ ll conf/ hadoop-conf
conf/:
total 64
-rwxr-xr-x 1 biadmin biadmin  2886 Jan 30 15:09 biginsights-env.sh
...

hadoop-conf:
total 108
-rw-rw-r-- 1 biadmin biadmin  7698 Mar 12 17:57 capacity-scheduler.xml
-rw-rw-r-- 1 biadmin biadmin   535 Mar 12 17:57 configuration.xsl
-rw-rw-r-- 1 biadmin biadmin   872 Mar 12 17:57 console-site.xml
-rw-rw-r-- 1 biadmin biadmin  3744 Mar 24 16:51 core-site.xml
-rw-rw-r-- 1 biadmin biadmin   569 Mar 12 17:57 fair-scheduler.xml
-rw-rw-r-- 1 biadmin biadmin   410 Mar 12 17:57 flex-scheduler.xml
-rwxrwxr-x 1 biadmin biadmin  5027 Mar 12 17:57 hadoop-env.sh
-rw-rw-r-- 1 biadmin biadmin  1859 Mar 12 17:57 hadoop-metrics2.properties
-rw-rw-r-- 1 biadmin biadmin  4886 Mar 12 17:57 hadoop-policy.xml
-rw-rw-r-- 1 biadmin biadmin  3836 Mar 12 17:57 hdfs-site.xml
-rw-rw-r-- 1 biadmin biadmin  2678 Mar 12 17:57 ibm-hadoop.properties
-rw-rw-r-- 1 biadmin biadmin   207 Mar 12 17:57 includes
-rw-rw-r-- 1 biadmin biadmin 10902 Mar 12 17:57 log4j.properties
-rw-rw-r-- 1 biadmin biadmin   610 Mar 12 17:57 mapred-queue-acls.xml
-rw-rw-r-- 1 biadmin biadmin  6951 Mar 23 17:24 mapred-site.xml
-rw-rw-r-- 1 biadmin biadmin    44 Mar 12 17:57 masters
-rw-rw-r-- 1 biadmin biadmin   207 Mar 12 17:57 slaves
-rw-rw-r-- 1 biadmin biadmin  1243 Mar 12 17:57 ssl-client.xml.example
-rw-rw-r-- 1 biadmin biadmin  1195 Mar 12 17:57 ssl-server.xml.example
-rw-rw-r-- 1 biadmin biadmin   301 Mar 12 17:57 taskcontroller.cfg
-rw-rw-r-- 1 biadmin biadmin   172 Mar 12 17:57 zk-jaas.conf

[root@bigdatamgr1 ~]# cat /etc/profile
...
for i in /etc/profile.d/*.sh ; do
    if [ -r "$i" ]; then
        if [ "${-#*i}" != "$-" ]; then
            . "$i"
        else
            . "$i" >/dev/null 2>&1
        fi
    fi
done


[root@bigdatamgr1 ~]# ll /etc/profile.d/
total 60
lrwxrwxrwx  1 root root   49 Jan 30 15:10 biginsights-env.sh -> /data/opt/ibm/biginsights/conf/biginsights-env.sh
...

[biadmin@bigdatamgr1 biginsights]$ cat hadoop-conf/hadoop-env.sh
...
# include biginsights-env.sh
if [ -r "/data/opt/ibm/biginsights/hdm/../conf/biginsights-env.sh" ]; then
        source "/data/opt/ibm/biginsights/hdm/../conf/biginsights-env.sh"
fi
...
export HADOOP_LOG_DIR=/data/var/ibm/biginsights/hadoop/logs
...
export HADOOP_PID_DIR=/data/var/ibm/biginsights/hadoop/pids
...

The HDFS here is 2.x, but MapReduce is still 1.x. What a pain!!

Deploying a New YARN Separately

BigInsights sets up a whole suite of environment variables that get initialized when the profile is loaded, so we need a new user whose own login scripts clear those values; this also keeps the new deployment clearly separated from the original one.

[esw@bigdatamgr1 ~]$ cat .bash_profile 
...
for i in ~/conf/*.sh ; do
  if [ -r "$i" ] ; then
    . "$i"
  fi
done

[esw@bigdatamgr1 ~]$ ll conf/
total 4
-rwxr-xr-x 1 esw biadmin 292 Mar 24 20:48 reset-biginsights-env.sh

As biadmin, stop the original JobTracker and TaskTrackers.

[biadmin@bigdatamgr1 IHC]$ ssh `hdfs getconf -confKey mapreduce.jobtracker.address | sed 's/:.*//' ` "sudo -u mapred /data/opt/ibm/biginsights/IHC/sbin/hadoop-daemon.sh  stop jobtracker"

[biadmin@bigdatamgr1 biginsights]$ for h in `cat hadoop-conf/slaves ` ; do ssh $h "sudo -u mapred /data/opt/ibm/biginsights/IHC/sbin/hadoop-daemon.sh  stop tasktracker" ; done

Using a while loop here does not work, and I'm not sure why!?
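Most likely (an assumption, not verified at the time) ssh reads from stdin and swallows the rest of the slaves list that the while loop is iterating over. A sketch of the while variant with -n, which keeps ssh away from the loop's stdin:

cat hadoop-conf/slaves | while read h ; do
  # -n: do not let ssh consume the loop's stdin
  ssh -n "$h" "sudo -u mapred /data/opt/ibm/biginsights/IHC/sbin/hadoop-daemon.sh stop tasktracker"
done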

Deploy the new hadoop-2.2.0. As root, create a directory and assign it to the esw user:

usermod -g biadmin esw
mkdir /data/opt/ibm/biginsights/hadoop-2.2.0
chown esw:biadmin hadoop-2.2.0

As root, sync it to every slave node:

[root@bigdatamgr1 biginsights]# for line in `cat hadoop-conf/slaves` ; do ssh $line "usermod -g biadmin esw" ; done

[root@bigdatamgr1 biginsights]# cat hadoop-conf/slaves | while read line ; do rsync -vazXog hadoop-2.2.0 $line:/data/opt/ibm/biginsights/ ; done

[esw@bigdatamgr1 hadoop-2.2.0]$ cd etc/hadoop/
[esw@bigdatamgr1 hadoop]$ ll
total 116
-rw-r--r-- 1 esw biadmin 3560 Feb 15  2014 capacity-scheduler.xml
-rw-r--r-- 1 esw biadmin 1335 Feb 15  2014 configuration.xsl
-rw-r--r-- 1 esw biadmin  318 Feb 15  2014 container-executor.cfg
-rw-r--r-- 1 esw biadmin  713 Mar 24 23:31 core-site.xml
-rwxr-xr-x 1 esw biadmin 3614 Mar 24 22:45 hadoop-env.sh
-rw-r--r-- 1 esw biadmin 1774 Feb 15  2014 hadoop-metrics2.properties
-rw-r--r-- 1 esw biadmin 2490 Feb 15  2014 hadoop-metrics.properties
-rw-r--r-- 1 esw biadmin 9257 Feb 15  2014 hadoop-policy.xml
lrwxrwxrwx 1 esw biadmin   51 Mar 24 21:33 hdfs-site.xml -> /data/opt/ibm/biginsights/hadoop-conf/hdfs-site.xml
-rwxr-xr-x 1 esw biadmin 1180 Feb 15  2014 httpfs-env.sh
-rw-r--r-- 1 esw biadmin 1657 Feb 15  2014 httpfs-log4j.properties
-rw-r--r-- 1 esw biadmin   21 Feb 15  2014 httpfs-signature.secret
-rw-r--r-- 1 esw biadmin  620 Feb 15  2014 httpfs-site.xml
-rw-rw-r-- 1 esw biadmin   75 Feb 15  2014 journalnodes
-rw-r--r-- 1 esw biadmin 9116 Feb 15  2014 log4j.properties
-rwxr-xr-x 1 esw biadmin 1383 Feb 15  2014 mapred-env.sh
-rw-r--r-- 1 esw biadmin 4113 Feb 15  2014 mapred-queues.xml.template
-rw-rw-r-- 1 esw biadmin 1508 Mar 24 21:42 mapred-site.xml
-rw-r--r-- 1 esw biadmin  758 Feb 15  2014 mapred-site.xml.template
lrwxrwxrwx 1 esw biadmin   44 Mar 24 21:34 slaves -> /data/opt/ibm/biginsights/hadoop-conf/slaves
-rw-r--r-- 1 esw biadmin 2316 Feb 15  2014 ssl-client.xml.example
-rw-r--r-- 1 esw biadmin 2251 Feb 15  2014 ssl-server.xml.example
lrwxrwxrwx 1 esw biadmin   16 Mar 25 16:10 tez-site.xml -> tez-site.xml-0.4
-rw-r--r-- 1 esw biadmin  282 Mar 25 15:37 tez-site.xml-0.4
-rw-r--r-- 1 esw biadmin  347 Mar 25 15:49 tez-site.xml-0.6
-rwxr-xr-x 1 esw biadmin 4039 Mar 24 22:26 yarn-env.sh
-rw-r--r-- 1 esw biadmin 1826 Mar 24 21:42 yarn-site.xml

Configure the properties (for hdfs-site.xml and slaves you can reuse the originals, just create symlinks), then start everything with sbin/start-yarn.sh.
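A minimal sketch of that setup, matching the symlink targets shown in the listing above:

cd /data/opt/ibm/biginsights/hadoop-2.2.0/etc/hadoop
ln -s /data/opt/ibm/biginsights/hadoop-conf/hdfs-site.xml hdfs-site.xml   # reuse the existing HDFS settings
ln -s /data/opt/ibm/biginsights/hadoop-conf/slaves slaves                 # reuse the existing slaves list
cd ../..
sbin/start-yarn.sh    # starts the ResourceManager here and NodeManagers on the slaves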

Other Commands

[esw@bigdatamgr1 hadoop-2.2.0]$ for line in `cat etc/hadoop/slaves` ; do echo "================$line" ; ssh $line "top -u esw -n 1 -b | grep java | xargs -I{}  kill {} "   ; done

Things worth borrowing from how the IBM BigInsights (bigsql) suite is deployed:

  • One admin user performs the deployment, while each application runs as its own user
[root@bigdatamgr1 ~]# cat /etc/sudoers
biadmin ALL=(ALL)   NOPASSWD: ALL

[root@bigdatamgr1 ~]# cat /etc/passwd
biadmin:x:200:501::/home/biadmin:/bin/bash
avahi-autoipd:x:170:170:Avahi IPv4LL Stack:/var/lib/avahi-autoipd:/sbin/nologin
hive:x:205:501::/home/hive:/bin/bash
oozie:x:206:501::/home/oozie:/bin/bash
monitoring:x:220:501::/home/monitoring:/bin/bash
alert:x:225:501::/home/alert:/bin/bash
catalog:x:224:501::/home/catalog:/bin/bash
hdfs:x:201:501::/home/hdfs:/bin/bash
httpfs:x:221:501::/home/httpfs:/bin/bash
bigsql:x:222:501::/home/bigsql:/bin/bash
console:x:223:501::/home/console:/bin/bash
mapred:x:202:501::/home/mapred:/bin/bash
orchestrator:x:226:501::/home/orchestrator:/bin/bash
hbase:x:204:501::/home/hbase:/bin/bash
zookeeper:x:203:501::/home/zookeeper:/bin/bash

For operations, the admin user runs sudo -u XXX COMMAND.

  • One place to deploy/start all applications
[biadmin@bigdatamgr1 biginsights]$ bin/start.sh -h
Usage: start.sh <component>...
    Start one or more BigInsights components. Start all components if 'all' is
    specified. If a component is already started, this command does nothing to it.
    
    For example:
        start.sh all
          - Starts all components.
        start.sh hadoop zookeeper
          - Starts hadoop and zookeeper daemons.

OPTIONS:
    -ex=<component>
        Exclude a component, often used together with 'all'. I.e. 
        `stop.sh all -ex=console` stops all components but the mgmt console.

    -h, --help
        Get help information.
  • Jars depended on in several places are managed through symlinks
[biadmin@bigdatamgr1 lib]$ ll
total 50336
-rw-r--r-- 1 biadmin biadmin   303042 Jan 30 15:22 avro-1.7.4.jar
lrwxrwxrwx 1 biadmin biadmin       60 Jan 30 15:22 biginsights-gpfs-2.2.0.jar -> /data/opt/ibm/biginsights/IHC/lib/biginsights-gpfs-2.2.0.jar
-rw-r--r-- 1 biadmin biadmin    15322 Jan 30 15:22 findbugs-annotations-1.3.9-1.jar
lrwxrwxrwx 1 biadmin biadmin       48 Jan 30 15:22 guardium-proxy.jar -> /data/opt/ibm/biginsights/lib/guardium-proxy.jar
-rw-r--r-- 1 biadmin biadmin  1795932 Jan 30 15:22 guava-12.0.1.jar
-rw-r--r-- 1 biadmin biadmin   710492 Jan 30 15:22 guice-3.0.jar
-rw-r--r-- 1 biadmin biadmin    65012 Jan 30 15:22 guice-servlet-3.0.jar
lrwxrwxrwx 1 biadmin biadmin       45 Jan 30 15:22 hadoop-core.jar -> /data/opt/ibm/biginsights/IHC/hadoop-core.jar
lrwxrwxrwx 1 biadmin biadmin       76 Jan 30 15:22 hadoop-distcp-2.2.0.jar -> /data/opt/ibm/biginsights/IHC/share/hadoop/tools/lib/hadoop-distcp-2.2.0.jar

–END

Hadoop DistCp

The cp that HDFS provides is single-threaded; for large volumes of data we want the copy to run in parallel. Hadoop Tools provides DistCp, which uses MapReduce to copy in parallel.

First, a look at what hdfs cp does:

Usage: hdfs dfs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>

[hadoop@hadoop-master2 hadoop-2.6.0]$ hadoop fs -cp /cp /cp-not-exists
[hadoop@hadoop-master2 hadoop-2.6.0]$ hadoop fs -mkdir /cp-exists
[hadoop@hadoop-master2 hadoop-2.6.0]$ hadoop fs -cp /cp /cp-exists
[hadoop@hadoop-master2 hadoop-2.6.0]$ hadoop fs -cp /cp /cp-not-exists2/
cp: `/cp-not-exists2/': No such file or directory
[hadoop@hadoop-master2 hadoop-2.6.0]$ hadoop fs -ls -R /
drwxr-xr-x   - hadoop supergroup          0 2015-03-14 19:55 /cp
-rw-r--r--   1 hadoop supergroup       1366 2015-03-14 19:55 /cp/README.1.txt
-rw-r--r--   1 hadoop supergroup       1366 2015-03-14 19:54 /cp/README.txt
drwxr-xr-x   - hadoop supergroup          0 2015-03-14 20:17 /cp-exists
drwxr-xr-x   - hadoop supergroup          0 2015-03-14 20:17 /cp-exists/cp
-rw-r--r--   1 hadoop supergroup       1366 2015-03-14 20:17 /cp-exists/cp/README.1.txt
-rw-r--r--   1 hadoop supergroup       1366 2015-03-14 20:17 /cp-exists/cp/README.txt
drwxr-xr-x   - hadoop supergroup          0 2015-03-14 20:17 /cp-not-exists
-rw-r--r--   1 hadoop supergroup       1366 2015-03-14 20:17 /cp-not-exists/README.1.txt
-rw-r--r--   1 hadoop supergroup       1366 2015-03-14 20:17 /cp-not-exists/README.txt

Basic usage of DistCp (distributed copy):

[hadoop@hadoop-master2 hadoop-2.6.0]$ bin/hadoop distcp /cp /cp-distcp

Needing something distributed usually means the scale is not small: lots of data and long-running operations. DistCp provides a set of options to control the job:

DistCpOptionSwitch | command-line flag | description
LOG_PATH | -log <logdir> | Directory for the map output logs. Defaults to JobStagingDir/_logs; DistCp#configureOutputFormat passes this path to CopyOutputFormat#setOutputPath.
SOURCE_FILE_LISTING | -f <urilist_uri> | Read the source paths to copy from this file.
MAX_MAPS | -m <num_maps> | Maximum number of maps, 20 by default; added to the job configuration as JobContext.NUM_MAPS when the job is created.
ATOMIC_COMMIT | -atomic | Atomic commit: either everything is copied successfully or the whole job fails. Incompatible with the SYNC_FOLDERS and DELETE_MISSING options.
WORK_PATH | -tmp <tmp_dir> | Used together with -atomic; directory for intermediate data, moved to the target path in one go by CopyCommitter on success.
SYNC_FOLDERS | -update | Create or update files; a file is skipped when its size and block size (and CRC) already match.
DELETE_MISSING | -delete | Delete files under the target path that do not exist under the source paths; usually used together with SYNC_FOLDERS.
BLOCKING | -async | Run asynchronously; in effect the job is submitted without calling job.waitForCompletion(true), so no progress logs are printed.
BANDWIDTH | -bandwidth <n> (MB/s) | Maximum copy rate per map, enforced with ThrottledInputStream, which is set up in RetriableFileCopyCommand.
COPY_STRATEGY | -strategy {dynamic|uniformsize} | How the file listing is partitioned among the maps, i.e. which data each map copies. Covered later; there are static and dynamic variants.

There are also two newer options, skipcrccheck (SKIP_CRC) and append (APPEND). The preserve attributes and the ssl option are not used here for now, so I won't cover them; I'll fill them in when I actually use them.
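To make the options concrete, here is an illustrative invocation combining a few of them (the paths and numbers are made up, not taken from the cluster above):

hadoop distcp -update -delete -m 50 -bandwidth 10 -strategy dynamic /src /backup/src
# -update/-delete: make /backup/src mirror /src
# -m 50: at most 50 maps; -bandwidth 10: roughly 10 MB/s per map
# -strategy dynamic: let faster maps grab more chunks (see DynamicInputFormat below)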

DistCp Source Code

It lives under hadoop-2.6.0-src\hadoop-tools\hadoop-distcp.

mvn eclipse:eclipse 

If the network is fine, this normally generates the .classpath and .project files Eclipse needs; then just import the project into Eclipse. The project contains four directories.

Before diving in, let's walk through the overall DistCp flow to see how it actually runs.

[hadoop@hadoop-master2 ~]$ export HADOOP_CLIENT_OPTS="-Dhadoop.root.logger=debug,console -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8071"
[hadoop@hadoop-master2 ~]$ hadoop distcp /cp /cp-distcp
Listening for transport dt_socket at address: 8071

Start remote debugging from Eclipse, connect to port 8071 on the server, and set a breakpoint in DistCp's run method; you can then step through to understand how it runs. Switching log4j to debug prints far more detailed logs of the execution flow.

It's best if the server's JDK version matches the JDK used by the local Eclipse; debugging goes much more smoothly that way.

Driver

Execution starts in the main method of DistCp (the driver). DistCp extends Configured and implements the Tool interface.

Step 1: parse the arguments

  1. ToolRunner.run invokes GenericOptionsParser to parse the -D properties into the Configuration instance;
  2. Inside run, OptionsParser.parse turns the arguments into a DistCpOptions instance. This part is fairly self-contained, mainly involving the DistCpOptionSwitch and DistCpOptions classes.

Step 2: prepare the MapReduce Job instance

  1. Create metaFolderPath (where the sequence file listing the files to copy is stored later: StagingDir/_distcp[RAND]), corresponding to CONF_LABEL_META_FOLDER;
  2. Create the Job and set its name, the InputFormat (UniformSizeInputFormat | DynamicInputFormat), the Map class CopyMapper, the number of maps (default 20), the number of reduces (0), the output key/value classes, MAP_SPECULATIVE (disabled; RetriableCommand is used instead), and the CopyOutputFormat;
  3. Write the command-line options into the Configuration.
metaFolderPath /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp-1344594636

One interesting point: the InputFormat is looked up via DistCpUtils#getStrategy, yet nowhere in the code is a *.strategy.impl key put into the Configuration. Why? This is something worth learning from: the setting lives in the distcp-default.xml configuration file, which decouples the code from the concrete implementations.

  public static Class<? extends InputFormat> getStrategy(Configuration conf,
                                                                 DistCpOptions options) {
    String confLabel = "distcp." +
        options.getCopyStrategy().toLowerCase(Locale.getDefault()) + ".strategy.impl";
    return conf.getClass(confLabel, UniformSizeInputFormat.class, InputFormat.class);
  }

// configuration in distcp-default.xml
    <property>
        <name>distcp.dynamic.strategy.impl</name>
        <value>org.apache.hadoop.tools.mapred.lib.DynamicInputFormat</value>
        <description>Implementation of dynamic input format</description>
    </property>

    <property>
        <name>distcp.static.strategy.impl</name>
        <value>org.apache.hadoop.tools.mapred.UniformSizeInputFormat</value>
        <description>Implementation of static input format</description>
    </property>

When configuring CopyOutputFormat, three paths are set:

  • WorkingDirectory (intermediate staging directory: the tmp path when -atomic is used, otherwise the target path);
  • CommitDirectory (the final destination of the copy, i.e. the target path);
  • OutputPath (where the map writes its log records).

One question about the command-line options: looking at the Call Hierarchy in Eclipse, nothing seems to call the DistCpOptions#getXXX methods, so how do these settings make it into the Configuration? Each option's confLabel is defined in the DistCpOptionSwitch enum, and DistCpOptions#appendToConf pushes all of them into the Configuration in one place. [Centralized configuration]!!

  public void appendToConf(Configuration conf) {
    DistCpOptionSwitch.addToConf(conf, DistCpOptionSwitch.ATOMIC_COMMIT,
        String.valueOf(atomicCommit));
    DistCpOptionSwitch.addToConf(conf, DistCpOptionSwitch.IGNORE_FAILURES,
        String.valueOf(ignoreFailures));
...

Step 3: build the list of files to copy

This part is genuinely well designed: the work is planned out up front. The list of data to copy is ultimately written to [metaFolder]/fileList.seq (key: the path relative to the source path, value: the file's CopyListingFileStatus), corresponding to CONF_LABEL_LISTING_FILE_PATH; this is also the map input (consumed by the custom InputFormat).

Three CopyListing implementations are involved: FileBasedCopyListing (-f), GlobbedCopyListing, and SimpleCopyListing. All of them end up using SimpleCopyListing to write the files and empty directories into fileList.seq; finally a check for duplicate file names is performed, and a DuplicateFileException is thrown if any are found.

/tmp/hadoop-yarn/staging/hadoop/.staging/_distcp179796572/fileList.seq

At the same time the number of records and the total size in bytes to copy are computed, corresponding to CONF_LABEL_TOTAL_BYTES_TO_BE_COPIED and CONF_LABEL_TOTAL_NUMBER_OF_RECORDS.

Step 4: submit the job, then wait, wait, and wait some more.

Alternatively, with the async option set, the driver finishes as soon as the job is submitted.

Mapper

First, setup reads the settings from the Configuration: sync (update), ignore failures (-i), CRC checking, overWrite, workPath, finalPath.

Then the prepared sourcepath -> CopyListingFileStatus key/value pairs are read from the CONF_LABEL_LISTING_FILE_PATH file as the map input.

In practice the only thing really used from CopyListingFileStatus is the original path; it's not clear why it carries so many other attributes. After the original path is obtained, a fresh CopyListingFileStatus is instantiated as sourceCurrStatus.

  • If the source path is a directory, createTargetDirsWithRetry (RetriableDirectoryCreateCommand) creates it on the target, the COPY counter is incremented, and the record is done.
  • If the source path is a file but checkUpdate decides to skip it (file size and block size match), the SKIP counter is incremented, BYTESSKIPPED is increased by the length of sourceCurrStatus, the record is written to the map output, and the record is done.
  • If the source path is a file and is not skipped, copyFileWithRetry (RetriableFileCopyCommand) copies it; BYTESEXPECTED is increased by the length of sourceCurrStatus, BYTESCOPIED by the number of bytes actually copied, the COPY counter is incremented, and the record is done.
  • If preserving file/directory attributes is configured, the target's attributes are updated accordingly.

So the mapper takes entries from the CopyListing and copies data through the FileSystem IO interfaces (with ThrottledInputStream wrapped around the raw stream for rate limiting). Along the way it handles checks such as the source being a directory while the target is not, update versus overwrite, preserving file attributes, and updating the map counters.

InputFormat

A custom InputFormat, UniformSizeInputFormat, does the splitting into FileSplits: the total size of the files recorded in the CONF_LABEL_LISTING_FILE_PATH file is divided evenly into as many pieces as there are maps, and a FileSplit is built for each map from the record offsets. At execution time the RecordReader reads the key/value pairs described by its FileSplit and feeds them to map.

Newer versions add DynamicInputFormat, which lets the faster workers take on more of the work. Let's look at some real logs first to see the effect:

[hadoop@hadoop-master2 ~]$ export HADOOP_CLIENT_OPTS="-Dhadoop.root.logger=debug,console -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8071"
[hadoop@hadoop-master2 ~]$ hadoop distcp "-Dmapreduce.map.java.opts=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8090" -strategy dynamic -m 2 /cp /cp-distcp-dynamic

# chunks created
[hadoop@hadoop-master2 ~]$ hadoop fs -ls -R /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446
-rw-r--r--   1 hadoop supergroup        506 2015-03-20 00:40 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/fileList.seq
-rw-r--r--   1 hadoop supergroup        446 2015-03-20 00:40 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/fileList.seq_sorted
[hadoop@hadoop-master2 ~]$ hadoop fs -ls -R /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446
drwx------   - hadoop supergroup          0 2015-03-20 00:41 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/chunkDir
-rw-r--r--   1 hadoop supergroup        198 2015-03-20 00:41 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/chunkDir/fileList.seq.chunk.00000
-rw-r--r--   1 hadoop supergroup        224 2015-03-20 00:41 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/chunkDir/fileList.seq.chunk.00001
-rw-r--r--   1 hadoop supergroup        220 2015-03-20 00:41 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/chunkDir/fileList.seq.chunk.00002
-rw-r--r--   1 hadoop supergroup        506 2015-03-20 00:40 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/fileList.seq
-rw-r--r--   1 hadoop supergroup        446 2015-03-20 00:40 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/fileList.seq_sorted

# chunks after assignment
[hadoop@hadoop-master2 ~]$ hadoop fs -ls -R /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446
drwx------   - hadoop supergroup          0 2015-03-20 00:41 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/chunkDir
-rw-r--r--   1 hadoop supergroup        220 2015-03-20 00:41 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/chunkDir/fileList.seq.chunk.00002
-rw-r--r--   1 hadoop supergroup        198 2015-03-20 00:41 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/chunkDir/task_1426773672048_0006_m_000000
-rw-r--r--   1 hadoop supergroup        224 2015-03-20 00:41 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/chunkDir/task_1426773672048_0006_m_000001
-rw-r--r--   1 hadoop supergroup        506 2015-03-20 00:40 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/fileList.seq
-rw-r--r--   1 hadoop supergroup        446 2015-03-20 00:40 /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446/fileList.seq_sorted

# after a map acquires its chunk
[hadoop@hadoop-master2 ~]$  ssh -g -L 8090:hadoop-slaver1:8090 hadoop-slaver1
# after each chunk is copied (and when the map finally finishes), the previously completed chunk file is deleted
# once the job completes, the data in the temporary directory is cleaned up
[hadoop@hadoop-master2 ~]$ hadoop fs -ls -R /tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446
ls: `/tmp/hadoop-yarn/staging/hadoop/.staging/_distcp1568928446': No such file or directory

Because the map count was set to 2, one chunk is left unassigned and only handed out at execution time, which is exactly the dynamic part of the strategy. The chunk renamed to ..._m_000000 is pre-assigned to map 0 (the others likewise); the chunks that were not handed out are left for whichever map gets to them first.

First the InputFormat creates the FileSplits; as part of this, the files listed in CONF_LABEL_LISTING_FILE_PATH are divided into chunks by record count. (See the source for the details; the numEntriesPerChunk calculation of how many files go into each chunk is the slightly tricky part.)

Each chunk again holds sourcepath -> CopyListingFileStatus key/value pairs, stored as a sequence file. The interesting part is how DynamicInputChunk#acquire(TaskAttemptContext) reads data: once a map finishes the chunk assigned to it in the driver phase, it dynamically grabs one of the remaining chunks, so the faster maps do more of the work.

  public static DynamicInputChunk acquire(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
    if (!areInvariantsInitialized())
        initializeChunkInvariants(taskAttemptContext.getConfiguration());

    String taskId = taskAttemptContext.getTaskAttemptID().getTaskID().toString();
    Path acquiredFilePath = new Path(chunkRootPath, taskId);

    if (fs.exists(acquiredFilePath)) {
      LOG.info("Acquiring pre-assigned chunk: " + acquiredFilePath);
      return new DynamicInputChunk(acquiredFilePath, taskAttemptContext);
    }

    for (FileStatus chunkFile : getListOfChunkFiles()) {
      if (fs.rename(chunkFile.getPath(), acquiredFilePath)) {
        LOG.info(taskId + " acquired " + chunkFile.getPath());
        return new DynamicInputChunk(acquiredFilePath, taskAttemptContext);
      }
      else
        LOG.warn(taskId + " could not acquire " + chunkFile.getPath());
    }

    return null;
  }

OutputFormat & Committer

The custom CopyOutputFormat adds get/set methods for the working/commit/output paths and specifies a custom OutputCommitter, CopyCommitter.

Normally the application master calls CopyCommitter#commitJob to finish up: it updates file attributes when attribute preservation is requested, moves the working directory to the commit path in the atomic case, and removes extra files from the target directory in the delete case. Finally it cleans up the temporary directories.

After reading DistCp I went back and looked at DistCpV1. The functionality is similar, but to map it onto the new version you still have to read the DistCp code. That is what good code is like: you understand it naturally and easily, without jumping back and forth, or memorizing a block of code just to avoid jumping back and forth. (In the old version the classes are too big, the methods too long, variables are defined far from where they are used, and a single variable stays in scope too long and is reassigned too many times.)


–END

Windows Build hadoop-2.6

Environment

C:\Users\winse>java -version
java version "1.7.0_02"
Java(TM) SE Runtime Environment (build 1.7.0_02-b13)
Java HotSpot(TM) Client VM (build 22.0-b10, mixed mode, sharing)

C:\Users\winse>protoc --version
libprotoc 2.5.0

winse@Lenovo-PC ~
$ cygcheck -c cygwin
Cygwin Package Information
Package              Version        Status
cygwin               1.7.33-1       OK

Steps

On Windows, hadoop-2.6 cannot build the x86 (32-bit) native DLL out of the box. You have to deal with it yourself by applying the HADOOP-9922 patch, but the patch attached to the JIRA does not match 2.6.0-src. So, do it yourself: change every x64 reference to Win32. A patch that builds successfully is available for download: hadoop-2.6.0-common-native-win32-diff.patch (access code: 08fd).
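The change itself just flips the target platform in the Visual Studio project files of the native/winutils code. A rough sketch of that idea (not the actual patch; the file locations under src/main/winutils and src/main/native are assumptions), run from hadoop-common-project\hadoop-common with the cygwin tools on PATH:

# assumption: the VS project/solution files live under src/main/winutils and src/main/native
grep -rl x64 src/main/winutils src/main/native | xargs sed -i 's/x64/Win32/g'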

  • Open the Visual Studio 2010 x86 command prompt:
Visual Studio Command Prompt (2010)

Setting environment for using Microsoft Visual Studio 2010 x86 tools.
  • Change into the hadoop source directory, apply the patch, and build. Also add the protobuf directory and cygwin\bin to PATH:
cd hadoop-2.6.0-src
cd hadoop-common-project\hadoop-common
patch -p0 < hadoop-2.6.0-common-native-win32-diff.patch

set PATH=%PATH%;E:\local\home\Administrator\bin;c:\cygwin\bin

mvn package -Pdist,native-win -DskipTests -Dtar -Dmaven.javadoc.skip=true
  • After the build finishes, copy the contents of hadoop-common\target\bin into your Hadoop installation's bin directory.

On Windows, a Java program's java.library.path defaults to searching PATH. That is exactly why you need to define the HADOOP_HOME environment variable and add %HADOOP_HOME%\bin to PATH!

HADOOP_HOME=E:\local\libs\big\hadoop-2.2.0 
PATH=%HADOOP_HOME%\bin;%PATH%
  • Configuration pitfall:
winse@Lenovo-PC /cygdrive/e/local/opt/bigdata/hadoop-2.6.0
$ find . -name "*-default.xml" | xargs -I{} grep "hadoop.tmp.dir" {}
  <value>${hadoop.tmp.dir}/mapred/local</value>
  <value>${hadoop.tmp.dir}/mapred/system</value>
  <value>${hadoop.tmp.dir}/mapred/staging</value>
  <value>${hadoop.tmp.dir}/mapred/temp</value>
  <value>${hadoop.tmp.dir}/mapred/history/recoverystore</value>
  <name>hadoop.tmp.dir</name>
  <value>${hadoop.tmp.dir}/io/local</value>
  <value>${hadoop.tmp.dir}/s3</value>
  <value>${hadoop.tmp.dir}/s3a</value>
  <value>file://${hadoop.tmp.dir}/dfs/name</value>
  <value>file://${hadoop.tmp.dir}/dfs/data</value>
  <value>file://${hadoop.tmp.dir}/dfs/namesecondary</value>
    <value>${hadoop.tmp.dir}/yarn/system/rmstore</value>
    <value>${hadoop.tmp.dir}/nm-local-dir</value>
    <value>${hadoop.tmp.dir}/yarn-nm-recovery</value>
    <value>${hadoop.tmp.dir}/yarn/timeline</value>

Only the dfs entries get a file:// prefix in front!

So on Windows, if you only configure hadoop.tmp.dir (file:///e:/tmp/hadoop), you also have to configure:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>${hadoop.tmp.dir}/dfs/data</value>
</property>

<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
</property>

After that, formatting and starting up work the same as usual.
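For reference, a minimal sketch of that "as usual" part from a Windows command prompt (the .cmd scripts ship with the Windows build):

bin\hdfs namenode -format
sbin\start-dfs.cmd
sbin\start-yarn.cmd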

Other

Debugging, downloading Maven sources, and so on:

set HADOOP_NAMENODE_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8090"

mvn dependency:resolve -Dclassifier=sources

mvn eclipse:eclipse -DdownloadSources -DdownloadJavadocs 

mvn dependency:sources 
mvn dependency:resolve -Dclassifier=javadoc

/* working with HDFS */
set HADOOP_ROOT_LOGGER=DEBUG,console

–END

VMware-Centos6 Build hadoop-2.6

Building hadoop(-common) is nerve-wracking every single time; it never goes smoothly! This time my own laziness (building in a VMware shared Windows folder) caused yet another disaster~~

Also, you often cannot choose the production environment and have to build hadoop for whatever Linux it runs on; with docker on an existing Linux dev box it is very convenient to spin up all kinds of distributions. Here a docker centos5 instance is set up on centos6 to build hadoop.

Environment

  • Operating system
[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# cat /etc/redhat-release 
CentOS release 6.5 (Final)
  • VMware Shared Folders map maven and hadoop-2.6.0-src from the host machine. (Do not build directly in the mapped source directory; copy it onto the Linux disk first!!)
[root@localhost ~]# ll -a hadoop-2.6.0-src maven
lrwxrwxrwx. 1 root root 26 Mar  7 22:47 hadoop-2.6.0-src -> /mnt/hgfs/hadoop-2.6.0-src
lrwxrwxrwx. 1 root root 15 Mar  7 22:47 maven -> /mnt/hgfs/maven

Steps

# install maven and the jdk
cat apache-maven-3.2.3-bin.tar.gz | ssh root@192.168.154.130 "cat - | tar zxv "

tar zxvf jdk-7u60-linux-x64.gz -C ~/
vi .bash_profile 

# build toolchain
yum install gcc glibc-headers gcc-c++ zlib-devel
yum install openssl-devel

# install protobuf
tar zxvf protobuf-2.5.0.tar.gz 
cd protobuf-2.5.0
./configure 
make && make install

## build hadoop-common
# copy hadoop-common from the shared folder to the linux filesystem, then build it there
cd hadoop-2.6.0-src/hadoop-common-project/hadoop-common/
cd ..
cp -r  hadoop-common ~/  #Q: why copy it? explained under [Problems Encountered]
cd ~/hadoop-common
mvn install
mvn -X clean package -Pdist,native -Dmaven.test.skip=true -Dmaven.javadoc.skip=true

## build everything; this takes a while, go grab a meal ^v^
cp -r /mnt/hgfs/hadoop-2.6.0-src ~/
mvn package -Pdist,native -DskipTests -Dmaven.javadoc.skip=true #Q: why can't maven.test.skip be used here?

$$TAG centos5 20160402

  • Building hadoop-2.6.3 with docker (faster than setting up another VM)

Production actually requires centos5, so the build is done on centos5 here. For other CentOS versions, download the corresponding image; the steps are the same.

[hadoop@cu2 ~]$ cat /etc/redhat-release 
CentOS release 6.6 (Final)

# download and import the centos5 image
[root@cu2 shm]# unzip sig-cloud-instance-images-centos-5.zip 
[root@cu2 shm]# cd sig-cloud-instance-images-c8d1a81b0516bca0f20434be8d0fac4f7d58a04a/docker/
[root@cu2 docker]# cat centos-5-20150304_1234-docker.tar.xz | docker import - centos:centos5
[root@cu2 ~]# docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
centos              centos5             a3f6a632c5ec        27 seconds ago      284.1 MB

# reuse resources already on the host, e.g. maven/repo/jdk/hadoop
[root@cu2 ~]# docker run -ti -v /home/hadoop:/home/hadoop -v /opt:/opt -v /data:/data centos:centos5 /bin/bash

export JAVA_HOME=/opt/jdk1.7.0_17
export MAVEN_HOME=/opt/apache-maven-3.3.9
export PATH=$JAVA_HOME/bin:$MAVEN_HOME/bin:$PATH

yum install lrzsz zlib-devel make which gcc gcc-c++ cmake openssl openssl-devel -y

cd protobuf-2.5.0
./configure 
make && make install
which protoc

cd hadoop-2.6.3-src/
mvn clean package -Dmaven.javadoc.skip=true -DskipTests -Pdist,native 

cd hadoop-dist/target/hadoop-2.6.3/lib/native/
cd ..
tar zcvf native-hadoop2.6.3-centos5.tar.gz native

----

Compiling snappy-1.1.3 on centos5 simply would not go through: **Makefile.am:4: Libtool library used but `LIBTOOL' is undefined**
None of the material found online helped; in the end the snappy libraries built on centos6 worked fine. Oh well, as long as something works.

[root@8fb11f6b3ced hadoop-2.6.3-src]# mvn package -Dmaven.javadoc.skip=true -DskipTests -Pdist,native  -Drequire.snappy=true  -Dsnappy.prefix=/home/hadoop/snappy
[root@8fb11f6b3ced hadoop-2.6.3-src]# cd hadoop-dist/target/hadoop-2.6.3/
[root@8fb11f6b3ced hadoop-2.6.3]# pwd
/home/hadoop/sources/hadoop-2.6.3-src/hadoop-dist/target/hadoop-2.6.3
[root@8fb11f6b3ced hadoop-2.6.3]# cd lib/native/
[root@8fb11f6b3ced native]# tar zxvf /home/hadoop/snappy/snappy-libs.tar.gz 

[root@8fb11f6b3ced native]# cd /home/hadoop/sources/hadoop-2.6.3-src/hadoop-dist/target/hadoop-2.6.3
[root@8fb11f6b3ced hadoop-2.6.3]# bin/hadoop checknative -a

# package it for the production environment
[root@8fb11f6b3ced hadoop-2.6.3]# cd lib/
[root@8fb11f6b3ced lib]# tar zcvf native-hadoop2.6.3-centos5-with-snappy.tar.gz native

$$END TAG centos5 20160402

Problems Encountered

  • The first problem is, of course, the missing C toolchain; install gcc.

  • configure: error: C++ preprocessor "/lib/cpp" fails sanity check: install the C++ compiler.

-> configure: error: C++ preprocessor “/lib/cpp” fails sanity check

  • Unknown lifecycle phase "c": follow the link at the end of the error message for the fix, i.e. run mvn install first.

-> Error running the first Maven example: Unknown lifecycle phase “complile”. -> LifecyclePhaseNotFoundException

  • CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:108 (message): Could NOT find ZLIB (missing: ZLIB_INCLUDE_DIR): zlib-devel is missing.

-> CMake error: Could NOT find ImageMagick

  • cmake_symlink_library: System Error: Operation not supported: Linux symlinks cannot be created inside a shared Windows directory.

-> See reply #9 in the linked thread

If creating the links fails, check that the current account has permission to create links in the build directory.

For example, if you build in a shared directory on a Windows machine, links cannot be created and the build fails. Copy the source to a local directory and build it there and the problem goes away.

  • When building everything, only skipTests works; maven.test.skip does not.
main:
     [echo] Running test_libhdfs_threaded
     [exec] nmdCreate: NativeMiniDfsCluster#Builder#Builder error:
     [exec] java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/MiniDFSCluster$Builder
     [exec] Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.MiniDFSCluster$Builder
     [exec]     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
     [exec]     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
     [exec]     at java.security.AccessController.doPrivileged(Native Method)
     [exec]     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
     [exec]     at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
     [exec]     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
     [exec]     at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
     [exec] TEST_ERROR: failed on /root/hadoop-2.6.0-src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/libhdfs/test_libhdfs_threaded.c:326 (errno: 2): got NULL from tlhCluster
  • Could NOT find OpenSSL, try to set the path to OpenSSL root folder in the ...: install openssl-devel.
main:
    [mkdir] Created dir: /root/hadoop-2.6.0-src/hadoop-tools/hadoop-pipes/target/native
     [exec] -- The C compiler identification is GNU 4.4.7
     [exec] -- The CXX compiler identification is GNU 4.4.7
     [exec] -- Check for working C compiler: /usr/bin/cc
     [exec] -- Check for working C compiler: /usr/bin/cc -- works
     [exec] -- Detecting C compiler ABI info
     [exec] -- Detecting C compiler ABI info - done
     [exec] -- Check for working CXX compiler: /usr/bin/c++
     [exec] -- Check for working CXX compiler: /usr/bin/c++ -- works
     [exec] -- Detecting CXX compiler ABI info
     [exec] -- Detecting CXX compiler ABI info - done
     [exec] -- Configuring incomplete, errors occurred!
     [exec] See also "/root/hadoop-2.6.0-src/hadoop-tools/hadoop-pipes/target/native/CMakeFiles/CMakeOutput.log".
     [exec] CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:108 (message):
     [exec]   Could NOT find OpenSSL, try to set the path to OpenSSL root folder in the
     [exec]   system variable OPENSSL_ROOT_DIR (missing: OPENSSL_LIBRARIES
     [exec]   OPENSSL_INCLUDE_DIR)
     [exec] Call Stack (most recent call first):
     [exec]   /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:315 (_FPHSA_FAILURE_MESSAGE)
     [exec]   /usr/share/cmake/Modules/FindOpenSSL.cmake:313 (find_package_handle_standard_args)
     [exec]   CMakeLists.txt:20 (find_package)
     [exec] 
     [exec] 

Success

[INFO] Executed tasks
[INFO] 
[INFO] --- maven-javadoc-plugin:2.8.1:jar (module-javadocs) @ hadoop-dist ---
[INFO] Skipping javadoc generation
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [ 43.005 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [ 25.511 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [ 21.177 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [ 11.728 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [ 51.274 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [ 35.625 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [ 21.936 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 24.665 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [ 17.058 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [06:07 min]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 41.279 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 59.186 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [  7.216 s]
[INFO] Apache Hadoop HDFS ................................. SUCCESS [04:29 min]
[INFO] Apache Hadoop HttpFS ............................... SUCCESS [ 52.883 s]
[INFO] Apache Hadoop HDFS BookKeeper Journal .............. SUCCESS [ 28.972 s]
[INFO] Apache Hadoop HDFS-NFS ............................. SUCCESS [ 24.901 s]
[INFO] Apache Hadoop HDFS Project ......................... SUCCESS [  7.486 s]
[INFO] hadoop-yarn ........................................ SUCCESS [  7.466 s]
[INFO] hadoop-yarn-api .................................... SUCCESS [ 32.970 s]
[INFO] hadoop-yarn-common ................................. SUCCESS [ 25.549 s]
[INFO] hadoop-yarn-server ................................. SUCCESS [  6.709 s]
[INFO] hadoop-yarn-server-common .......................... SUCCESS [ 25.292 s]
[INFO] hadoop-yarn-server-nodemanager ..................... SUCCESS [ 29.555 s]
[INFO] hadoop-yarn-server-web-proxy ....................... SUCCESS [ 12.800 s]
[INFO] hadoop-yarn-server-applicationhistoryservice ....... SUCCESS [ 14.025 s]
[INFO] hadoop-yarn-server-resourcemanager ................. SUCCESS [ 21.121 s]
[INFO] hadoop-yarn-server-tests ........................... SUCCESS [ 24.019 s]
[INFO] hadoop-yarn-client ................................. SUCCESS [ 18.949 s]
[INFO] hadoop-yarn-applications ........................... SUCCESS [  7.586 s]
[INFO] hadoop-yarn-applications-distributedshell .......... SUCCESS [  8.428 s]
[INFO] hadoop-yarn-applications-unmanaged-am-launcher ..... SUCCESS [ 12.671 s]
[INFO] hadoop-yarn-site ................................... SUCCESS [  7.518 s]
[INFO] hadoop-yarn-registry ............................... SUCCESS [ 18.518 s]
[INFO] hadoop-yarn-project ................................ SUCCESS [ 38.781 s]
[INFO] hadoop-mapreduce-client ............................ SUCCESS [ 13.133 s]
[INFO] hadoop-mapreduce-client-core ....................... SUCCESS [ 23.772 s]
[INFO] hadoop-mapreduce-client-common ..................... SUCCESS [ 22.815 s]
[INFO] hadoop-mapreduce-client-shuffle .................... SUCCESS [ 16.810 s]
[INFO] hadoop-mapreduce-client-app ........................ SUCCESS [ 14.404 s]
[INFO] hadoop-mapreduce-client-hs ......................... SUCCESS [ 18.157 s]
[INFO] hadoop-mapreduce-client-jobclient .................. SUCCESS [ 14.637 s]
[INFO] hadoop-mapreduce-client-hs-plugins ................. SUCCESS [  9.190 s]
[INFO] Apache Hadoop MapReduce Examples ................... SUCCESS [  9.037 s]
[INFO] hadoop-mapreduce ................................... SUCCESS [ 59.280 s]
[INFO] Apache Hadoop MapReduce Streaming .................. SUCCESS [ 26.724 s]
[INFO] Apache Hadoop Distributed Copy ..................... SUCCESS [ 31.503 s]
[INFO] Apache Hadoop Archives ............................. SUCCESS [ 19.867 s]
[INFO] Apache Hadoop Rumen ................................ SUCCESS [ 27.401 s]
[INFO] Apache Hadoop Gridmix .............................. SUCCESS [ 20.102 s]
[INFO] Apache Hadoop Data Join ............................ SUCCESS [ 20.382 s]
[INFO] Apache Hadoop Ant Tasks ............................ SUCCESS [ 12.207 s]
[INFO] Apache Hadoop Extras ............................... SUCCESS [ 24.069 s]
[INFO] Apache Hadoop Pipes ................................ SUCCESS [ 31.975 s]
[INFO] Apache Hadoop OpenStack support .................... SUCCESS [ 32.225 s]
[INFO] Apache Hadoop Amazon Web Services support .......... SUCCESS [02:45 min]
[INFO] Apache Hadoop Client ............................... SUCCESS [01:38 min]
[INFO] Apache Hadoop Mini-Cluster ......................... SUCCESS [ 15.450 s]
[INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [ 46.489 s]
[INFO] Apache Hadoop Tools Dist ........................... SUCCESS [01:31 min]
[INFO] Apache Hadoop Tools ................................ SUCCESS [  7.603 s]
[INFO] Apache Hadoop Distribution ......................... SUCCESS [ 32.967 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 39:30 min
[INFO] Finished at: 2015-03-08T10:55:47+08:00
[INFO] Final Memory: 102M/340M
[INFO] ------------------------------------------------------------------------

Copy the files under the native directory produced by the build into the Hadoop cluster's installation directory:

[hadoop@hadoop-master1 lib]$ scp -r root@172.17.42.1:~/hadoop-2.6.0-src/hadoop-dist/target/hadoop-2.6.0/lib/native ./
[hadoop@hadoop-master1 lib]$ cd native/
[hadoop@hadoop-master1 native]$ ll
total 4356
-rw-r--r--. 1 hadoop hadoop 1119518 Mar  8 03:11 libhadoop.a
-rw-r--r--. 1 hadoop hadoop 1486964 Mar  8 03:11 libhadooppipes.a
lrwxrwxrwx. 1 hadoop hadoop      18 Mar  3 21:08 libhadoop.so -> libhadoop.so.1.0.0
-rwxr-xr-x. 1 hadoop hadoop  671237 Mar  8 03:11 libhadoop.so.1.0.0
-rw-r--r--. 1 hadoop hadoop  581944 Mar  8 03:11 libhadooputils.a
-rw-r--r--. 1 hadoop hadoop  359490 Mar  8 03:11 libhdfs.a
lrwxrwxrwx. 1 hadoop hadoop      16 Mar  3 21:08 libhdfs.so -> libhdfs.so.0.0.0
-rwxr-xr-x. 1 hadoop hadoop  228451 Mar  8 03:11 libhdfs.so.0.0.0

Before and after adding the compiled native libraries:

[hadoop@hadoop-master1 hadoop-2.6.0]$ hadoop fs -ls /
15/03/08 03:09:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 hadoop supergroup       1366 2015-03-06 16:49 /README.txt
drwx------   - hadoop supergroup          0 2015-03-06 16:54 /tmp
drwxr-xr-x   - hadoop supergroup          0 2015-03-06 16:54 /user

# with the compiled libraries in place, the warning is gone
[hadoop@hadoop-master1 hadoop-2.6.0]$ hadoop fs -ls /
Found 3 items
-rw-r--r--   1 hadoop supergroup       1366 2015-03-06 16:49 /README.txt
drwx------   - hadoop supergroup          0 2015-03-06 16:54 /tmp
drwxr-xr-x   - hadoop supergroup          0 2015-03-06 16:54 /user

–END

VMware Shared Folders

VMware's shared-folders feature lets the guest access files on the host machine.

  1. Choose the directory to share: select [Edit virtual machine settings], open the [Options] tab in the dialog, choose [Shared Folders], and click the [Add] button on the right to add the local directory to map (maven).
  2. Install VMware Tools
    • Start the Linux VM, open the [VM] menu, and choose [Install VMware Tools…]. Once the download finishes, it is automatically made available to the VM via the cdrom.
    • Log into the Linux VM and run the following commands:
cd /mnt
mkdir cdrom
mount /dev/cdrom cdrom
cd cdrom/
mkdir ~/vmware
tar zxvf VMwareTools-9.2.0-799703.tar.gz -C ~/vmware

cd ~/vmware
cd vmware-tools-distrib/
./vmware-install.pl 
reboot

cd /mnt/hgfs/maven

The maven directory now maps to the directory on the host machine.

[root@localhost maven]# ll -a
total 3
drwxrwxrwx. 1 root root    0 Dec 28  2012 .
dr-xr-xr-x. 1 root root 4192 Mar  7 22:41 ..
drwxrwxrwx. 1 root root    0 Dec 28  2012 .m2

–END