Pdsh

Working with Hadoop always means wrangling quite a few machines. Even just running rsync everywhere is tedious, and sometimes you need to exclude a few machines while checking memory usage across a whole batch of them, and so on. I used to do all this with expect combined with a for-in loop, which was simple and worked well enough.

Recently, though, upgrading hadoop and tez and installing ganglia wore me down, and I got tired of copy-pasting for loops. I had read an introduction to pdsh before, but back then I was only running 4-5 machines. While digging through blog posts about a Ganglia installation problem I ran into pdsh again, and it felt both familiar and pleasantly concise. Installing and using it once more is what led to this post.

Installation

[root@bigdatamgr1 pdsh-2.29]# umask 0022
[root@bigdatamgr1 pdsh-2.29]# ./configure -h
[root@bigdatamgr1 pdsh-2.29]# ./configure --with-dshgroups  --with-exec --with-ssh 
[root@bigdatamgr1 pdsh-2.29]# make && make install

There are quite a few configure options; --with-dshgroups plus ssh is about enough for now, and I'll study the rest whenever that stops being sufficient.

Of course, the simpler way to install it is with yum: yum install pdsh -y

Basic usage

A prerequisite for managing machines with pdsh is that passwordless SSH to the target machines is already in place, and setting that up for N machines still calls for expect (unless, of course, you're happy typing yes and the password one host at a time)!

  • Loaded modules
# list the rcmd modules that were built in (ssh/exec)
[esw@bigdatamgr1 ~]$ pdsh -L

# set the default module through an environment variable
[esw@bigdatamgr1 ~]$ export PDSH_RCMD_TYPE=exec
[esw@bigdatamgr1 ~]$ pdsh -w bigdata[1-2] ssh %h hostname
bigdata2: bigdata2
bigdata1: bigdata1

# specify the module on the command line
[esw@bigdatamgr1 ~]$ pdsh -R ssh -w bigdata1,bigdata2 hostname
bigdata2: bigdata2
bigdata1: bigdata1

# specify the module per host
[esw@bigdatamgr1 ~]$ pdsh -w ssh:bigdata1,ssh:bigdata2 hostname
bigdata2: bigdata2
bigdata1: bigdata1
[esw@bigdatamgr1 ~]$ pdsh -w ssh:bigdata[1,2] hostname
bigdata2: bigdata2
bigdata1: bigdata1
  • Specifying hosts
[esw@bigdatamgr1 ~]$ pdsh -w bigdata[1-2,5,6-8] -X nodes hostname
bigdata5: bigdata5
bigdata6: bigdata6
bigdata2: bigdata2
bigdata8: bigdata8
bigdata7: bigdata7

Besides specifying the host list with -w, pdsh can also read it from files, e.g. the one given at build time via --with-machines, or from files in default locations. Enabling --with-dshgroups at build time lets you write a group of hosts into a file under ~/.dsh/group or /etc/dsh/group on the local host and then refer to it with the -g option. -X groupname, in turn, excludes the hosts that belong to group groupname from the host list (groups are covered just below).

[esw@bigdatamgr1 ~]$ export PDSH_RCMD_TYPE=ssh

[esw@bigdatamgr1 ~]$ mkdir -p .dsh/group
[esw@bigdatamgr1 ~]$ cd .dsh/group/
[esw@bigdatamgr1 group]$ vi nodes
bigdata1
bigdata3

[esw@bigdatamgr1 ~]$ pdsh -g nodes hostname
bigdata3: bigdata3
bigdata1: bigdata1

[esw@bigdatamgr1 ~]$ pdsh -w bigdata[1-8] -X nodes hostname
bigdata2: bigdata2
bigdata8: bigdata8
bigdata5: bigdata5
bigdata6: bigdata6
bigdata4: bigdata4
bigdata7: bigdata7

The -w option can also read the host list from a file, and it can be combined with other rules for filtering (see the man page for details). -x filters hosts out of an existing host list (yet another way to do exclusion).

[esw@bigdatamgr1 ~]$ cat slaves | head -2
bigdata1
bigdata2

[esw@bigdatamgr1 ~]$ pdsh -w ^slaves hostname | head -5
bigdata8: bigdata8
bigdata6: bigdata6
bigdata5: bigdata5
bigdata2: bigdata2
bigdata3: bigdata3

[esw@bigdatamgr1 ~]$ pdsh -w ^slaves,-bigdata[2-8]
pdsh> hostname
bigdata1: bigdata1
pdsh> 
pdsh> exit
[esw@bigdatamgr1 ~]$ pdsh -w ^slaves,-/bigdata.?/
pdsh@bigdatamgr1: no remote hosts specified

[esw@bigdatamgr1 ~]$ pdsh -w ^slaves -x bigdata[1-7] hostname
bigdata8: bigdata8
  • Formatting the output

When a host produces more than one line of output, pdsh's interleaved output is hard to read. Use dshbak to format it:

[esw@bigdatamgr1 ~]$ pdsh -w bigdata[1-2] free -m  | dshbak -c
----------------
bigdata1
----------------
             total       used       free     shared    buffers     cached
Mem:         64405      59207       5198          0        429      31356
-/+ buffers/cache:      27420      36985
Swap:        65535         57      65478
----------------
bigdata2
----------------
             total       used       free     shared    buffers     cached
Mem:         64405      58192       6213          0        505      29847
-/+ buffers/cache:      27838      36566
Swap:        65535         58      65477

Batch passwordless SSH setup

[hadoop@hadoop-master4 ~]$ cat ssh-copy-id.expect 
#!/usr/bin/expect  

## Usage $0 [user@]host password

set host [lrange $argv 0 0];
set password [lrange $argv 1 1] ;

set timeout 30;

spawn ssh-copy-id $host ;

expect {
  "(yes/no)?" { send yes\n; exp_continue; }
  "password:" { send $password\n; exp_continue; }
}

exec sleep 1;

[hadoop@hadoop-master4 ~]$ pdsh -w ^slaves ./ssh-copy-id.expect %h 'PASSWD'

# verify that every host succeeded
[hadoop@hadoop-master4 ~]# pdsh -w ^slaves -x hadoop-slaver[1-16] -R ssh hostname
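For completeness: ssh-copy-id needs a local key pair to exist before it can distribute it. A minimal sketch using the default key path (skip it if ~/.ssh/id_rsa is already there):

# generate an RSA key pair with an empty passphrase, only if one does not exist yet
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa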

References

pdsh -w ssh:user00[1-10] "date"
This runs the date command on user001 through user010.
pdsh -w ssh:user0[10-31],/1$/ "uptime"
This uses a regular expression when selecting the remote hosts: out of user010 through user031 it picks the hostnames ending in 1, i.e. uptime runs on user011, user021 and user031.

-l    specifies the user name to use on the remote hosts. For example:
pdsh -R ssh -l opsuser -w user00[1-9] "date"

For -g groups, simply write the hosts into a file under /etc/dsh/group/ or ~/.dsh/group/.

[root@dispatch1 ~]# pdsh -w dispatch1,search1,horizon1 -l bigendian jps 
[root@dispatch1 ~]# vi servers
dispatch1
search1
horizon1
[root@dispatch1 ~]# pdsh -w ^servers -l bigendian hostname 
dispatch1: dispatch1
horizon1: horizon1
search1: search1

-f    sets the number of remote hosts connected to concurrently (the fanout)

dshbak formats the output

pdcp -R ssh -g userhosts /home/opsuser/mysqldb.tar.gz /home/opsuser # copy a file
Some quick tips on how to get started using pdsh:
Set up your environment:
export PDSH_SSH_ARGS_APPEND="-o ConnectTimeout=5 -o CheckHostIP=no -o StrictHostKeyChecking=no" (Add this to your .bashrc to save time.)
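Putting those tips together, a small hedged example (the hostnames reuse the bigdata nodes above; the fanout value is arbitrary):

export PDSH_SSH_ARGS_APPEND="-o ConnectTimeout=5 -o CheckHostIP=no -o StrictHostKeyChecking=no"
# -f limits pdsh to 4 concurrent ssh connections
pdsh -R ssh -f 4 -w bigdata[1-8] uptime | dshbak -c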

–END

Installing and Configuring Ganglia (2)

The previous post covered a fully manual Ganglia installation, in a fairly simple test environment. I followed the steps found online, saw the graphs and assumed I understood it, without really grasping Ganglia's basic multicast/unicast concepts.

This time I had the chance to install Ganglia in a production environment; the network is more complex, so new problems came up, and I got to understand Ganglia a bit better along the way.

On the backend, Gmetad (ganglia meta daemon) and Gmond (ganglia monitoring daemon) are Ganglia's two components.

Gmetad collects the data of each cluster and writes it into the rrd databases; Gmond broadcasts the local host's data over UDP (or unicasts it to a designated machine) and at the same time gathers the data of the cluster nodes for Gmetad to read. Gmetad does not aggregate the monitoring itself: it processes the already-collected data and stores it in the rrdtool database.

Setting up a yum environment

Since the production environment has no Internet access, the installation DVD has to be copied onto a machine and used as a local yum repository.

mount -t iso9660 -o loop rhel-server-6.4-x86_64-dvd\[ED2000.COM\].iso iso/
ln -s iso rhel6.4

vi /etc/yum.repos.d/rhel.repo 
[os]
name = Linux OS Packages
baseurl = file:///opt/rhel6.4
enabled=1
gpgcheck = 0

In the more extreme case where even yum itself is not installed, go to the Packages directory on the ISO and install yum* with rpm.

After installing httpd, symlink the rhel6.4 repository into /var/www/html/rhel6.4 so that the other machines can also install packages from it.
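A minimal sketch of that step (the paths follow the repo file above; adjust them to where the ISO is actually mounted):

yum install -y httpd
ln -s /opt/rhel6.4 /var/www/html/rhel6.4   # expose the local repo over http
service httpd start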

cat /etc/yum.repos.d/rhel.repo
[http]
name=LOCAL YUM server
baseurl = http://cu-omc1/rhel6.4
enabled=1
gpgcheck=0

Note: with a CentOS ISO there are two discs; add both locations to baseurl separated by a comma (the same applies to the http form):

[centos-local]
name=Centos Local
baseurl=file:///mnt/cdrom,file:///mnt/cdrom2 
failovermethod=priority
enabled=1
gpgcheck=0

Installing dependencies with yum

yum install -y gcc gd httpd php php-devel php-mysql php-pear php-common php-gd php-mbstring php-cli 

yum install -y rrdtool 

yum install -y apr*

# pcre can be skipped if Ganglia is built with --with-libpcre=no
yum install -y pcre*

# yum install -y zlib-devel

Compiling and installing Ganglia from source (the official site no longer recommends building some of this by hand)

Download the following packages (they are not available through yum):

Install:

umask 0022 # temporary change; otherwise permission problems show up later

rpm -ivh rrdtool-devel-1.3.8-6.el6.x86_64.rpm 

# if yum can install it: yum install -y libconfuse*
tar zxf confuse-2.7.tar.gz
cd confuse-2.7
./configure CFLAGS=-fPIC --disable-nls
make && make install

tar zxf ganglia-3.7.2.tar.gz 
cd ganglia-3.7.2
./configure --with-gmetad --enable-gexec --enable-status --prefix=/usr/local/ganglia
# optional: `--sysconfdir=/etc/ganglia` sets the default config location

make && make install

cp gmetad/gmetad.init /etc/init.d/gmetad
chkconfig gmetad on
# check gmetad's status
chkconfig --list | grep gm

df -h # put the rrds directory on the largest partition, then link it under the data directory
mkdir -p /data/ganglia/rrds
chown nobody:nobody /data/ganglia/rrds
ln -s /usr/local/ganglia/sbin/gmetad /usr/sbin/gmetad

gmetad -h # shows the default config location. Do step A or step B below, depending on whether --sysconfdir was set
# step A
# cp gmetad/gmetad.conf /etc/ganglia/
# step B
vi /etc/init.d/gmetad 
  /usr/local/ganglia/etc/gmetad.conf # change the original default config path

cd ganglia-3.7.2/gmond/
ln -s /usr/local/ganglia/sbin/gmond /usr/sbin/gmond
cp gmond.init /etc/init.d/gmond
chkconfig gmond on
chkconfig --list gmond

gmond -h # shows the default config location.
./gmond -t >/usr/local/ganglia/etc/gmond.conf
vi /etc/init.d/gmond 
  /usr/local/ganglia/etc/gmond.conf # change the original default config path

Configuration

  • Ganglia configuration
vi /usr/local/ganglia/etc/gmetad.conf
  datasource "HADOOP" hadoop-master1
  datasource "CU" cu-ud1
  rrd_rootdir "/data/ganglia/rrds"
  gridname "bigdata"

vi /usr/local/ganglia/etc/gmond.conf
  cluster {
   name = "CU"

  udp_send_channel {
   bind_hostname = yes

http://ixdba.blog.51cto.com/2895551/1149003

Ganglia's data collection can work in unicast or multicast mode; multicast is the default.

  • Unicast: a node sends the monitoring data it has collected to one or a few specific machines; this can cross subnets.
  • Multicast: a node sends its monitoring data to every machine in the same subnet and at the same time collects the monitoring data sent by all machines in that subnet. Because the data goes out as broadcast-style packets, the nodes must be on the same subnet, though within one subnet different send channels can still be defined.

On hosts with multiple NICs (multiple IPs) you need to bind to a specific IP; use bind_hostname to control which address is bound. With a single IP this is not a concern.

Multicast only works within a single subnet. If the cluster spans several subnets, either split it into multiple sub-clusters (data_source entries) or configure unicast. If you want the simpler configuration, go with multiple data_source entries.

  • data_source "cluster-db" node1 node2 defines the cluster name and the nodes from which that cluster's monitoring data is fetched. In multicast mode every gmond node already holds the monitoring data of all nodes in its cluster, so there is no need to list every node. node1 and node2 are alternatives: node2 is only tried when node1 cannot be reached, so both should be nodes of the same cluster holding the same data.
  • cluster.name states which cluster this node belongs to and must match the data_source name.
  • host.location plays a role similar to the hostname.
  • udp_send_channel.mcast_join/host is the multicast address; here the channel is 239.2.11.71. For unicast, write host = node1 instead; in unicast mode several udp_send_channel blocks can be configured.
  • udp_recv_channel.mcast_join

An approach for reference (not tried in practice): multiple subnets can be handled with unicast, and if a single subnet needs several data_source entries (clusters), just give each one a different multicast port!
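A hedged sketch of what a unicast setup might look like, written as shell commands that append example fragments (using cu-omc1 as the collector and appending instead of editing are both assumptions; in a real config the existing mcast_join lines would also be commented out):

# on every monitored node: send metrics straight to the collector instead of multicasting
cat >> /usr/local/ganglia/etc/gmond.conf <<'EOF'
udp_send_channel {
  host = cu-omc1
  port = 8649
  ttl = 1
}
EOF

# on the collector node: listen for the unicast packets
cat >> /usr/local/ganglia/etc/gmond.conf <<'EOF'
udp_recv_channel {
  port = 8649
}
EOF

# gmetad then polls the collector as usual, e.g. in gmetad.conf:
#   data_source "CU" cu-omc1:8649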

Starting and testing

service httpd restart
service gmetad start
service gmond start

[root@cu-omc1 ganglia]# netstat -anp | grep gm
tcp        0      0 0.0.0.0:8649                0.0.0.0:*                   LISTEN      916/gmond           
tcp        0      0 0.0.0.0:8651                0.0.0.0:*                   LISTEN      12776/gmetad        
tcp        0      0 0.0.0.0:8652                0.0.0.0:*                   LISTEN      12776/gmetad        
udp        0      0 239.2.11.71:8649            0.0.0.0:*                               916/gmond           
udp        0      0 192.168.31.11:60126         239.2.11.71:8649            ESTABLISHED 916/gmond           
unix  2      [ ]         DGRAM                    1331526917 12776/gmetad        
[root@cu-omc1 ganglia]# bin/gstat -a
CLUSTER INFORMATION
       Name: CU
      Hosts: 0
Gexec Hosts: 0
 Dead Hosts: 0
  Localtime: Wed Jun 15 20:17:36 2016

There are no hosts up at this time



netstat -anp | grep -E "gmond|gmetad"

# if startup fails, run in debug mode to track down the problem
/usr/sbin/gmetad -d 10

/usr/local/ganglia/bin/gstat -a
/usr/local/ganglia/bin/gstat -a -i hadoop-master1   # check the data on master1

telnet localhost 8649 - only with deaf = no configured does gmond bind the local IP and serve data here
telnet localhost 8651

Problem: binding the multicast address fails

If telnet to 8649 returns no data, check whether the routing table has a route from [the IP the hostname resolves to] to [239.2.11.71]! (On multi-NIC/multi-IP hosts the default route may not go out through the address the hostname resolves to.)
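One quick way to see which interface and source address the kernel would use for the multicast address (plain iproute2, offered here only as a diagnostic aid):

ip route get 239.2.11.71   # the dev/src in the output should match the host's primary IP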

http://llydmissile.blog.51cto.com/7784666/1411239 http://www.cnblogs.com/Cherise/p/4350581.html

During testing you may hit the following error: Error creating multicast server mcast_join=239.2.11.71 port=8649 mcast_if=NULL family='inet4'. Will try again... The system cannot reach the multicast address, so add it to the routing table with route add -host 239.2.11.71 dev eth0, and put that command into /etc/rc.d/rc.local so the fix survives reboots.

[root@hadoop-master4 ~]# gmond -d 10
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module
udp_recv_channel mcast_join=239.2.11.71 mcast_if=NULL port=8649 bind=239.2.11.71 buffer=0
Error creating multicast server mcast_join=239.2.11.71 port=8649 mcast_if=NULL family='inet4'.  Will try again...

The environment's default route had been cleaned away (or the gateway is not on the same subnet as the host). A route to the NIC has to be added manually.

[root@hadoop-master4 ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.32.0    *               255.255.255.0   U     0      0        0 bond0
192.168.31.0    192.168.32.254  255.255.255.0   UG    0      0        0 bond0
link-local      *               255.255.0.0     U     1006   0        0 bond0
[root@hadoop-master4 ~]# route add -host 239.2.11.71 dev bond0
[root@hadoop-master4 ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
239.2.11.71     *               255.255.255.255 UH    0      0        0 bond0
192.168.32.0    *               255.255.255.0   U     0      0        0 bond0
192.168.31.0    192.168.32.254  255.255.255.0   UG    0      0        0 bond0
link-local      *               255.255.0.0     U     1006   0        0 bond0
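To make the fix permanent across reboots, as suggested above, the route can be appended to rc.local (the interface name bond0 matches this host; adjust it to yours):

grep -q "route add -host 239.2.11.71" /etc/rc.d/rc.local || \
  echo "route add -host 239.2.11.71 dev bond0" >> /etc/rc.d/rc.local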

And don't forget the firewall!!!

Installing GWeb

cd ~/ganglia-web-3.7.1
vi Makefile # configure everything here once, so conf_default.php never needs editing afterwards
  GDESTDIR = /var/www/html/ganglia
  GCONFDIR = /usr/local/ganglia/etc/
  GWEB_STATEDIR = /var/www/html/ganglia
  # Gmetad rootdir (parent location of rrd folder)
  GMETAD_ROOTDIR = /data/ganglia
  APACHE_USER = apache
make install

# Note: on an internal network conf_default.php still needs tweaking for the pile of jquery js references.
# If the web UI cannot be reached, check the firewall and SELinux
  • httpd login password configuration
htpasswd -c /var/www/html/ganglia/etc/htpasswd.users gangliaadmin 

vi /etc/httpd/conf/httpd.conf 

  <Directory "/var/www/html/ganglia">
  #  SSLRequireSSL
     Options None
     AllowOverride None
     <IfVersion >= 2.3>
        <RequireAll>
           Require all granted
  #        Require host 127.0.0.1

           AuthName "Ganglia Access"
           AuthType Basic
           AuthUserFile /var/www/html/ganglia/etc/htpasswd.users
           Require valid-user
        </RequireAll>
     </IfVersion>
     <IfVersion < 2.3>
        Order allow,deny
        Allow from all
  #     Order deny,allow
  #     Deny from all
  #     Allow from 127.0.0.1

        AuthName "Ganglia Access"
        AuthType Basic
        AuthUserFile /var/www/html/ganglia/etc/htpasswd.users
        Require valid-user
     </IfVersion>
  </Directory>

service httpd restart

If the graphs don't show up, check httpd's error log!!!

If the access control is done in nginx instead, it is just as simple:

location /ganglia {
      proxy_pass http://localhost/ganglia;
      auth_basic "Ganglia Access";
      auth_basic_user_file "/var/www/html/ganglia/etc/htpasswd.users";
}

Cluster configuration

cd /usr/local 
# for h in cu-ud{1,2} hadoop-master{1,2} ; do echo $h ; done
for h in cu-ud1 cu-ud2 hadoop-master1 hadoop-master2 ; do 
  cd /usr/local;
  rsync -vaz  ganglia $h:/usr/local/ ;
  ssh $h ln -s /usr/local/ganglia/sbin/gmond /usr/sbin/gmond ;
  scp /etc/init.d/gmond $h:/etc/init.d/ ;
  ssh $h "chkconfig gmond on" ;
  ssh $h "yum install apr* -y" ; 
  ssh $h "service gmond start" ; 
done

# for a different cluster, cluster.name in gmond.conf has to be changed

telnet hadoop-master1 8649
netstat -anp | grep gm

If the cluster changes, adding nodes is fine, but removing them leaves the old data behind and the page keeps reporting those machines as down. Delete the data of the removed nodes under the corresponding cluster in the rrds directory, then restart gmetad/httpd.
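A hedged sketch of that cleanup (the node name hadoop-slaver9 is made up; the paths follow the rrd_rootdir and cluster name configured above):

rm -rf /data/ganglia/rrds/CU/hadoop-slaver9   # drop the rrd files of the removed node
service gmetad restart
service httpd restart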

References

Notes

Firewall rule setup
iptables -I INPUT 3 -p tcp -m tcp --dport 80 -j ACCEPT
iptables -I INPUT 3 -p udp -m udp --dport 8649 -j ACCEPT

service iptables save
service iptables restart

Disable selinux
vi /etc/selinux/config
SELINUX=disabled
setenforce 0

In real deployments the machines to be monitored often sit on different subnets; gmond's default multicast transport (which only works within one subnet) can then no longer be used and unicast is required.

gmond nodes can be configured as a cluster in which they exchange their monitoring data with each other. Every gmond node therefore actually holds the monitoring data of all nodes in the cluster, and gmetad only needs to fetch the data from any one of them.

The web front-end is a web-based monitoring UI, usually installed on the same node as Gmetad (it remains to be confirmed whether it can live elsewhere, since the PHP config file seems to allow setting gmetad's address and port). It pulls data from Gmetad, reads the rrd databases, generates the graphs and renders them.

gmetad periodically polls data from gmond nodes or from other gmetad nodes. One gmetad can define multiple datasources, each with several backups, so if one host fails the data can still be fetched from another. Gmetad only uses TCP: it sends requests to its datasources, and it also publishes the XML it has collected on a TCP port, 8651 by default. So gmetad can obtain XML data either from gmond or from another gmetad.

In terms of I/O, Gmetad fetches XML from gmond every 15 seconds by default; if gmond and gmetad run on the same node that amounts to local I/O. After fetching the XML, gmetad still has to parse it, which with the defaults means parsing an XML file on the order of 10 MB every 15 seconds, so CPU pressure is high. On top of that it writes the RRD databases and also reads them while serving requests from web clients. The I/O, CPU and network load is therefore considerable, so this node should at least be an otherwise idle, reasonably powerful machine.

  • Multicast configuration: this is the default, needs essentially no changes to the config file, and every node uses the same configuration. The advantage is that gmond on every node has the complete data set, so gmetad can connect to any of them to obtain the monitoring data of the whole cluster, which is very convenient. The one parameter you may need to change is mcast_if, which selects the network interface used for multicast; with multiple NICs, point it at the right internal interface.
  • Unicast configuration: the receive channel on the monitoring host. We use UDP unicast, which is very simple. Part of our cluster sits in another machine room, so we listen on 0.0.0.0; if the whole cluster is on one internal network, it is better to bind only the internal address. If there is a firewall, open the relevant ports.
  • The most important setting is data_source: data_source "my-cluster" localhost:8648. If the default port 8649 is used, the port part can be omitted. With multiple clusters, specify multiple data_source entries, one per line.
  • Finally, gridname names the whole Grid.
  • https://github.com/ganglia/gmond_python_modules

Links

–END

Set

Set<BlockedInfo> diffs = new HashSet<>();
diffs.addAll(oldBlockedList);
diffs.addAll(newBlockedList);
Iterator<BlockedInfo> iterator = diffs.iterator();
while (iterator.hasNext()) {
  BlockedInfo i = iterator.next();
  if (oldBlockedList.contains(i) && newBlockedList.contains(i)) {
      iterator.remove();
  }
}

The snippet above is meant to find the differences between the old and new lists, i.e. an XOR (symmetric difference). But... why write it this way? Think about it.

With the Guava library it is a one-liner:

Sets.difference(Sets.union(oldBlockedList, newBlockedList), Sets.intersection(oldBlockedList, newBlockedList))

Hive table creation

Because fs.defaultFS and hive.metastore.warehouse.dir do not match, tables created this way cannot be dropped.

<property>
        <name>fs.defaultFS</name>
        <value>hdfs://zfcluster</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://zfcluster:8020/hive/warehousedir</value>
</property>

Dropping the table fails:

hive> drop table es_t_house_monitor2;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.IllegalArgumentException: Wrong FS: hdfs://zfcluster:8020/hive/warehousedir/es_t_house_monitor2, expected: hdfs://zfcluster)

The table was created with the wrong location, so it has to be fixed first! Then make hive.metastore.warehouse.dir consistent with fs.defaultFS.

In Hive's metastore database, edit the SDS table: find the rows whose LOCATION uses that path and change them. After that the table can be dropped.
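A hedged sketch of that metastore fix, assuming the metastore lives in a MySQL database named hive (database name and credentials are assumptions; back the metastore up before touching it):

mysql -u hive -p hive -e "
  UPDATE SDS
     SET LOCATION = REPLACE(LOCATION, 'hdfs://zfcluster:8020/', 'hdfs://zfcluster/')
   WHERE LOCATION LIKE 'hdfs://zfcluster:8020/%';"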

Curly braces in nginx rewrite

Because curly braces mark the start and end of a configuration block, writing {} directly in a rewrite rule raises an error. Rules containing {} have to be wrapped in double quotes.

Note: curly braces ( { and } ) are used both inside rewrite regular expressions and as block delimiters in the configuration file; to avoid the conflict, a regular expression containing curly braces should be enclosed in double (or single) quotes.
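A minimal sketch (the location, regex and target are made up purely for illustration):

cat > /etc/nginx/conf.d/rewrite-braces.conf <<'EOF'
server {
    listen 8080;
    # the {4} quantifier contains braces, so the whole regex must be quoted
    rewrite "^/item/([0-9]{4})$" /show.php?id=$1 last;
}
EOF
nginx -t   # check that the configuration parses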

–END

Configuring TEZ-UI

tez-ui has been around for a long time and I let plenty of time slip by; I only got it configured today. The result is quite nice, roughly comparable to the Spark web UI.

This post records configuring tez-0.7.0 on hive-1.2.1, running the hadoop-2.6.3 timeline server, and adding the tez-ui feature to tez.

Building tez-0.7.0

[hadoop@cu2 apache-tez-0.7.0-src]$ mvn package -Dhadoop.version=2.6.3 -DskipTests

[INFO] Reactor Summary:
[INFO] 
[INFO] tez ................................................ SUCCESS [  0.831 s]
[INFO] tez-api ............................................ SUCCESS [  6.580 s]
[INFO] tez-common ......................................... SUCCESS [  0.124 s]
[INFO] tez-runtime-internals .............................. SUCCESS [  0.676 s]
[INFO] tez-runtime-library ................................ SUCCESS [  1.378 s]
[INFO] tez-mapreduce ...................................... SUCCESS [  0.989 s]
[INFO] tez-examples ....................................... SUCCESS [  0.105 s]
[INFO] tez-dag ............................................ SUCCESS [  2.391 s]
[INFO] tez-tests .......................................... SUCCESS [  0.187 s]
[INFO] tez-ui ............................................. SUCCESS [02:23 min]
[INFO] tez-plugins ........................................ SUCCESS [  0.017 s]
[INFO] tez-yarn-timeline-history .......................... SUCCESS [  0.595 s]
[INFO] tez-yarn-timeline-history-with-acls ................ SUCCESS [  0.316 s]
[INFO] tez-mbeans-resource-calculator ..................... SUCCESS [  0.189 s]
[INFO] tez-tools .......................................... SUCCESS [  0.017 s]
[INFO] tez-dist ........................................... SUCCESS [ 16.554 s]
[INFO] Tez ................................................ SUCCESS [  0.015 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:55 min
[INFO] Finished at: 2016-01-12T19:08:50+08:00
[INFO] Final Memory: 63M/756M
[INFO] ------------------------------------------------------------------------

Hooking tez into hive

// upload the tez tarball to hdfs
[hadoop@cu2 ~]$ cd sources/apache-tez-0.7.0-src/tez-dist/target/
[hadoop@cu2 target]$ hadoop fs -mkdir -p /apps/tez-0.7.0
[hadoop@cu2 target]$ hadoop fs -put tez-0.7.0.tar.gz /apps/tez-0.7.0/

// TEZ_CONF_DIR = HADOOP_CONF_DIR
[hadoop@cu2 ~]$ cd hadoop-2.6.3/etc/hadoop/
[hadoop@cu2 hadoop]$ vi tez-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>tez.lib.uris</name>
<value>${fs.defaultFS}/apps/tez-0.7.0/tez-0.7.0.tar.gz</value>
</property>

</configuration>

// add the local tez jars to HADOOP_CLASSPATH
[hadoop@cu2 apache-tez-0.7.0-src]$ cd tez-dist/target/
archive-tmp/              maven-archiver/           tez-0.7.0/                tez-0.7.0-minimal.tar.gz  tez-0.7.0.tar.gz          tez-dist-0.7.0-tests.jar  
[hadoop@cu2 apache-tez-0.7.0-src]$ cd tez-dist/target/
[hadoop@cu2 target]$ mv tez-0.7.0 ~/

[hadoop@cu2 ~]$ vi apache-hive-1.2.1-bin/conf/hive-env.sh

// multiple jline versions: http://stackoverflow.com/questions/28997441/hive-startup-error-terminal-initialization-failed-falling-back-to-unsupporte
export HADOOP_USER_CLASSPATH_FIRST=true
export TEZ_HOME=/home/hadoop/tez-0.7.0
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$TEZ_HOME/*:$TEZ_HOME/lib/*

// http://stackoverflow.com/questions/26988388/hive-0-14-0-not-starting [/tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x]
// hive.metastore.warehouse.dir  hive.exec.scratchdir
[hadoop@cu2 hive]$ rm -rf /tmp/hive
[hadoop@cu2 hive]$ hadoop fs -rmr /tmp/hive
// or fix the permissions instead: hadoop fs -chmod 777 /tmp/hive

Enabling/using tez

[hadoop@cu2 hadoop]$ cat ~/hive/conf/hive-site.xml 
...
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>

</configuration>

[hadoop@cu2 hive]$ bin/hive
...
hive> select count(*) from t_ods_access_log2;
Query ID = hadoop_20160112200359_f8be3d1c-9adc-42c0-abb9-2643dfef2cc7
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1452600034599_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 20.83 s    
--------------------------------------------------------------------------------
OK
67
Time taken: 27.823 seconds, Fetched: 1 row(s)

Deploying/starting the hadoop timeline server

[hadoop@cu2 hadoop]$ vi etc/hadoop/yarn-site.xml 
...
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>

<property>
  <name>yarn.timeline-service.hostname</name>
  <value>hadoop-master2</value>
</property>

<property>
  <name>yarn.timeline-service.http-cross-origin.enabled</name>
  <value>true</value>
</property>

<property>
  <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
  <value>true</value>
</property>

[hadoop@cu2 hadoop]$ for h in hadoop-slaver1 hadoop-slaver2 hadoop-slaver3 ; do rsync -vaz --exclude=logs hadoop-2.6.3 $h:~/ ; done

[hadoop@cu2 hadoop]$ sbin/yarn-daemon.sh start timelineserver

[hadoop@cu2 hadoop]$ sbin/stop-all.sh
[hadoop@cu2 hadoop]$ sbin/start-all.sh

Deploying tez-ui

// grab the tez-ui build output
[hadoop@cu2 target]$ cd ../../tez-ui/
[hadoop@cu2 tez-ui]$ cd target/
[hadoop@cu2 target]$ ll
total 1476
drwxrwxr-x 3 hadoop hadoop    4096 Jan 12 19:08 classes
drwxrwxr-x 2 hadoop hadoop    4096 Jan 12 19:08 maven-archiver
drwxrwxr-x 8 hadoop hadoop    4096 Jan 12 19:08 tez-ui-0.7.0
-rw-rw-r-- 1 hadoop hadoop    3058 Jan 12 19:08 tez-ui-0.7.0-tests.jar
-rw-rw-r-- 1 hadoop hadoop 1491321 Jan 12 19:08 tez-ui-0.7.0.war
[hadoop@cu2 target]$ mv tez-ui-0.7.0 ~/

// deploy tez-ui into tomcat
[hadoop@cu2 ~]$ cd apache-tomcat-7.0.67/conf/
// change the connector port to 9999
[hadoop@cu2 apache-tomcat-7.0.67]$ vi conf/server.xml 

[hadoop@cu2 apache-tomcat-7.0.67]$ cd conf/Catalina/localhost/
[hadoop@cu2 localhost]$ vi tez-ui.xml
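// a minimal context definition might look like this (an assumption, not shown in the original post;
// it points docBase at the exploded war that was moved to ~/tez-ui-0.7.0 above)
<Context docBase="/home/hadoop/tez-ui-0.7.0" />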

[hadoop@cu2 apache-tomcat-7.0.67]$ bin/startup.sh 

// add the tez-ui integration to tez
[hadoop@cu2 hive]$ vi ~/hadoop-2.6.3/etc/hadoop/tez-site.xml 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>tez.lib.uris</name>
<value>${fs.defaultFS}/apps/tez-0.7.0/tez-0.7.0.tar.gz</value>
</property>

<property>
<name>tez.history.logging.service.class</name>
<value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
</property>

<property>
<name>tez.tez-ui.history-url.base</name>
<value>http://hadoop-master2:9999/tez-ui/</value>
</property>

</configuration>

Run hive again and execute one or two SQL queries.

The final result:

References

–END

Hadoop Installation and Upgrade (4): HA Upgrade

The official docs [HDFSHighAvailabilityWithQJM.html] and [HdfsRollingUpgrade.html] (note that rolling upgrade is supported only from Hadoop-2.4.0 onwards) are very detailed, but there is no end-to-end example. Here I tidy up and record the actual operations.

  1. Stop all namenodes and deploy the new hadoop version.
  2. Start all of the journalnodes, and that really means all of them!! Upgrading the namenode also upgrades every journalnode!!
  3. Start one namenode with the -upgrade option. This namenode goes straight into the active state, upgrades its local metadata and, at the same time, upgrades the shared edit log (i.e. the journalnodes' data).
  4. Start the other namenodes with -bootstrapStandby to sync them up. Do not use the -upgrade option there! (I haven't tried it, so I don't know what would happen.)

Stop the cluster and deploy the new hadoop version

[hadoop@hadoop-master1 hadoop-2.2.0]$ sbin/stop-dfs.sh
16/01/08 09:10:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [hadoop-master1 hadoop-master2]
hadoop-master2: stopping namenode
hadoop-master1: stopping namenode
hadoop-slaver1: stopping datanode
hadoop-slaver2: stopping datanode
hadoop-slaver3: stopping datanode
Stopping journal nodes [hadoop-master1]
hadoop-master1: stopping journalnode
16/01/08 09:10:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping ZK Failover Controllers on NN hosts [hadoop-master1 hadoop-master2]
hadoop-master1: stopping zkfc
hadoop-master2: stopping zkfc
[hadoop@hadoop-master1 hadoop-2.2.0]$ 

[hadoop@hadoop-master1 hadoop-2.2.0]$ cd ~/hadoop-2.6.3
[hadoop@hadoop-master1 hadoop-2.6.3]$ ll
total 52
drwxr-xr-x 2 hadoop hadoop  4096 Dec 18 01:52 bin
lrwxrwxrwx 1 hadoop hadoop    32 Jan  8 06:05 etc -> /home/hadoop/hadoop-2.2.0/ha-etc
drwxr-xr-x 2 hadoop hadoop  4096 Dec 18 01:52 include
drwxr-xr-x 3 hadoop hadoop  4096 Dec 18 01:52 lib
drwxr-xr-x 2 hadoop hadoop  4096 Dec 18 01:52 libexec
-rw-r--r-- 1 hadoop hadoop 15429 Dec 18 01:52 LICENSE.txt
drwxrwxr-x 2 hadoop hadoop  4096 Jan  8 03:37 logs
-rw-r--r-- 1 hadoop hadoop   101 Dec 18 01:52 NOTICE.txt
-rw-r--r-- 1 hadoop hadoop  1366 Dec 18 01:52 README.txt
drwxr-xr-x 2 hadoop hadoop  4096 Dec 18 01:52 sbin
drwxr-xr-x 3 hadoop hadoop  4096 Jan  7 08:00 share

#// sync to the other nodes
[hadoop@hadoop-master1 ~]$ for h in hadoop-master2 hadoop-slaver1 hadoop-slaver2 hadoop-slaver3 ; do rsync -vaz --delete --exclude=logs ~/hadoop-2.6.3 $h:~/ ; done

Start all journalnodes

2.6 and 2.2 share a single configuration! etc is a symlink to 2.2's ha-etc config.

[hadoop@hadoop-master1 hadoop-2.6.3]$ sbin/hadoop-daemons.sh --hostnames "hadoop-master1" --script /home/hadoop/hadoop-2.2.0/bin/hdfs start journalnode
hadoop-master1: starting journalnode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-journalnode-hadoop-master1.out
[hadoop@hadoop-master1 hadoop-2.6.3]$ jps
31047 JournalNode
244 QuorumPeerMain
31097 Jps

Upgrade one namenode

[hadoop@hadoop-master1 hadoop-2.6.3]$ bin/hdfs namenode -upgrade
...
16/01/08 09:13:54 INFO namenode.NameNode: createNameNode [-upgrade]
...
16/01/08 09:13:57 INFO namenode.FSImage: Starting upgrade of local storage directories.
   old LV = -47; old CTime = 0.
   new LV = -60; new CTime = 1452244437060
16/01/08 09:13:57 INFO namenode.NNUpgradeUtil: Starting upgrade of storage directory /data/tmp/dfs/name
16/01/08 09:13:57 INFO namenode.FSImageTransactionalStorageInspector: No version file in /data/tmp/dfs/name
16/01/08 09:13:57 INFO namenode.NNUpgradeUtil: Performing upgrade of storage directory /data/tmp/dfs/name
16/01/08 09:13:57 INFO namenode.FSNamesystem: Need to save fs image? false (staleImage=false, haEnabled=true, isRollingUpgrade=false)
...

The official documentation says that besides the namenode's local metadata, the shared edit log is upgraded as well.

The journalnode log confirms that the journalnode was indeed upgraded too:

[hadoop@hadoop-master1 hadoop-2.6.3]$ less logs/hadoop-hadoop-journalnode-hadoop-master1.log 
...
2016-01-08 09:13:57,070 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Starting upgrade of edits directory /data/journal/zfcluster
2016-01-08 09:13:57,072 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Starting upgrade of storage directory /data/journal/zfcluster
2016-01-08 09:13:57,185 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Starting upgrade of edits directory: .
   old LV = -47; old CTime = 0.
   new LV = -60; new CTime = 1452244437060
2016-01-08 09:13:57,185 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Performing upgrade of storage directory /data/journal/zfcluster
2016-01-08 09:13:57,222 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Updating lastWriterEpoch from 2 to 3 for client /172.17.0.1
2016-01-08 09:16:57,731 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Updating lastPromisedEpoch from 3 to 4 for client /172.17.0.1
2016-01-08 09:16:57,735 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/data/journal/zfcluster)
...

The upgrading namenode runs in the foreground; do not kill that process. Next, bring the other namenode in sync.

Sync the other namenode

[hadoop@hadoop-master2 hadoop-2.6.3]$ bin/hdfs namenode -bootstrapStandby
...
=====================================================
About to bootstrap Standby ID nn2 from:
           Nameservice ID: zfcluster
        Other Namenode ID: nn1
  Other NN's HTTP address: http://hadoop-master1:50070
  Other NN's IPC  address: hadoop-master1/172.17.0.1:8020
             Namespace ID: 639021326
            Block pool ID: BP-1695500896-172.17.0.1-1452152050513
               Cluster ID: CID-7d5c31d8-5cd4-46c8-8e04-49151578e5bb
           Layout version: -60
       isUpgradeFinalized: false
=====================================================
16/01/08 09:15:19 INFO ha.BootstrapStandby: The active NameNode is in Upgrade. Prepare the upgrade for the standby NameNode as well.
16/01/08 09:15:19 INFO common.Storage: Lock on /data/tmp/dfs/name/in_use.lock acquired by nodename 5008@hadoop-master2
16/01/08 09:15:21 INFO namenode.TransferFsImage: Opening connection to http://hadoop-master1:50070/imagetransfer?getimage=1&txid=1126&storageInfo=-60:639021326:1452244437060:CID-7d5c31d8-5cd4-46c8-8e04-49151578e5bb
16/01/08 09:15:21 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000 milliseconds
16/01/08 09:15:21 INFO namenode.TransferFsImage: Transfer took 0.00s at 0.00 KB/s
16/01/08 09:15:21 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000001126 size 977 bytes.
16/01/08 09:15:21 INFO namenode.NNUpgradeUtil: Performing upgrade of storage directory /data/tmp/dfs/name
...

Restart the cluster

Ctrl+C the -upgrade namenode on hadoop-master1, then start the whole cluster.

[hadoop@hadoop-master1 hadoop-2.6.3]$ sbin/start-dfs.sh
16/01/08 09:16:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop-master1 hadoop-master2]
hadoop-master1: starting namenode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-namenode-hadoop-master1.out
hadoop-master2: starting namenode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-namenode-hadoop-master2.out
hadoop-slaver3: starting datanode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-datanode-hadoop-slaver3.out
hadoop-slaver2: starting datanode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-datanode-hadoop-slaver2.out
hadoop-slaver1: starting datanode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-datanode-hadoop-slaver1.out
Starting journal nodes [hadoop-master1]
hadoop-master1: journalnode running as process 31047. Stop it first.
16/01/08 09:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting ZK Failover Controllers on NN hosts [hadoop-master1 hadoop-master2]
hadoop-master2: starting zkfc, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-zkfc-hadoop-master2.out
hadoop-master1: starting zkfc, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-zkfc-hadoop-master1.out
[hadoop@hadoop-master1 hadoop-2.6.3]$ jps
31047 JournalNode
244 QuorumPeerMain
31596 DFSZKFailoverController
31655 Jps
31294 NameNode
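
One step the transcript does not show: the bootstrap output above still reports isUpgradeFinalized: false. Once the upgraded cluster has been verified, the upgrade should be finalized (after this, rolling back to 2.2 is no longer possible):

[hadoop@hadoop-master1 hadoop-2.6.3]$ bin/hdfs dfsadmin -finalizeUpgrade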

Postscript: resetting the journalnode

After switching back and forth between HA and non-HA setups, the master would not start the last time HA was brought up, and running bootstrapStandby did not help either.

2016-01-08 06:15:36,746 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [172.17.0.1:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 1/1. 1 exceptions thrown:
172.17.0.1:8485: Asked for firstTxId 1022 which is in the middle of file /data/journal/zfcluster/current/edits_0000000000000001021-0000000000000001022
        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager.getRemoteEditLogs(FileJournalManager.java:198)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:640)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:181)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:203)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:17453)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

Stop the cluster, start the journalnode, go over to the healthy namenode and run the initializeSharedEdits command, then re-initialize on the problematic namenode!

[hadoop@hadoop-master1 hadoop-2.2.0]$ sbin/hadoop-daemon.sh start journalnode

[hadoop@hadoop-master2 hadoop-2.2.0]$ bin/hdfs namenode -initializeSharedEdits

[hadoop@hadoop-master2 hadoop-2.2.0]$ sbin/hadoop-daemon.sh start namenode

[hadoop@hadoop-master1 hadoop-2.2.0]$ bin/hdfs namenode -bootstrapStandby

[hadoop@hadoop-master1 hadoop-2.2.0]$ sbin/start-dfs.sh

Afterthought: in the HA upgrade steps above, if the journalnodes were not started during the upgrade and things went wrong, resetting the journalnode like this should probably work as well (be careful: untested, just a guess).

References

–END