scala wordcount on hadoop2
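After first learning about Scala, and then running into it again through Spark, I decided to study the language properly. Functional programming is where things are heading; the syntax is concise, and the collection operations are more abstract and pleasant to use. It is a native JVM language, and its interoperability with Java gives it a bright future. It has room to shine in both cloud computing and mobile (Android) development.
Most of my time at work goes into writing MapReduce, so during a gap between projects I tried putting Scala on Hadoop. Overall, writing a hello world in Scala is fairly simple; only some of the details are tedious. Even after several years with Eclipse, scala-ide still takes some getting used to! The Scala plugin for IDEA is not as good as people say either; it falls far short of WebStorm.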
{% gist winse/5df39f77e8bd59348a7a %}
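The main reasons for using Scala:
- JavaBeans are simpler and more convenient to write
- Multiple return values need no dedicated Result entity class
- The more abstract collection methods are really nice to use
- Traits make it easier to aggregate at the level of operations: separate the operations out, then combine them to get new functionality. Isn't that just the decorator pattern? Decorators in Java are such a hassle; adding anything takes far too much work!
The Scala code in the gist above is fairly similar to the Java version; the main differences are in the collection operations, and variable definitions are simpler.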
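The four reasons listed above can be sketched in a few lines of plain Scala. This is only an illustration: WordStat, Cleaner, Trimmed, LowerCased and ScalaReasons are made-up names, not taken from the gist, and nothing here depends on Hadoop.

```scala
// Illustrative names only; none of them come from the gist.

// 1. A case class stands in for a hand-written JavaBean
//    (constructor, accessors, equals, hashCode and toString are generated).
case class WordStat(word: String, count: Int)

// 4. Traits as stackable operations: each one decorates `clean`.
trait Cleaner { def clean(s: String): String = s }
trait Trimmed extends Cleaner {
  override def clean(s: String): String = super.clean(s).trim
}
trait LowerCased extends Cleaner {
  override def clean(s: String): String = super.clean(s).toLowerCase
}

object ScalaReasons {
  // 2. Multiple return values as a tuple; no Result class is needed.
  def minMax(xs: Seq[Int]): (Int, Int) = (xs.min, xs.max)

  // 3. Collection methods: a word count as one chained expression.
  def wordCount(lines: Seq[String], cleaner: Cleaner): Seq[WordStat] =
    lines
      .flatMap(_.split("\\s+").toSeq)
      .map(cleaner.clean)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (w, ws) => WordStat(w, ws.size) }
      .toSeq
      .sortBy(_.word)

  def main(args: Array[String]): Unit = {
    // Combine the operations by mixing the traits together.
    val cleaner = new Cleaner with Trimmed with LowerCased {}
    wordCount(Seq("Hello Scala", "hello hadoop"), cleaner).foreach(println)
    val (lo, hi) = minMax(Seq(3, 1, 2))
    println(lo + ".." + hi)
  }
}
```

Mixing in another trait adds another step to clean without touching the existing ones, which is the decorator-style composition the last bullet describes.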
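Once the code is written, the next step is running and debugging it.
As mentioned in earlier posts, mapreduce.framework.name defaults to local, so running the job directly is just like running an ordinary local Java program. I won't go into that here.
This post focuses on how to package the code and run it on a real cluster, and on which extra steps are needed compared with the Java version.
The project's Maven pom shows that the only addition is the scala-library dependency (groupId org.scala-lang); everything else is the common packages that ship with Hadoop.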

So running the program only requires adding the scala-library jar to the runtime classpath. After packaging with Maven, the project layout looks like this:
[hadoop@master1 scalamapred-1.0.5]$ cd lib/
[hadoop@master1 lib]$ ls -l
total 8
drwxrwxr-x. 2 hadoop hadoop 4096 Sep 11 23:10 common
drwxrwxr-x. 2 hadoop hadoop 4096 Sep 11 23:56 core
[hadoop@master1 lib]$ ll core/
total 12
-rw-r--r--. 1 hadoop hadoop 11903 Sep 11 23:55 scalamapred-1.0.5.jar
[hadoop@master1 lib]$ ls common/
activation-1.1.jar commons-lang-2.6.jar hadoop-hdfs-2.2.0.jar jaxb-api-2.2.2.jar log4j-1.2.17.jar
aopalliance-1.0.jar commons-logging-1.1.1.jar hadoop-mapreduce-client-common-2.2.0.jar jaxb-impl-2.2.3-1.jar management-api-3.0.0-b012.jar
asm-3.1.jar commons-math-2.1.jar hadoop-mapreduce-client-core-2.2.0.jar jersey-client-1.9.jar netty-3.6.2.Final.jar
avro-1.7.4.jar commons-net-3.1.jar hadoop-yarn-api-2.2.0.jar jersey-core-1.9.jar paranamer-2.3.jar
commons-beanutils-1.7.0.jar gmbal-api-only-3.0.0-b023.jar hadoop-yarn-client-2.2.0.jar jersey-grizzly2-1.9.jar protobuf-java-2.5.0.jar
commons-beanutils-core-1.8.0.jar grizzly-framework-2.1.2.jar hadoop-yarn-common-2.2.0.jar jersey-guice-1.9.jar scala-library-2.10.4.jar
commons-cli-1.2.jar grizzly-http-2.1.2.jar hadoop-yarn-server-common-2.2.0.jar jersey-json-1.9.jar servlet-api-2.5.jar
commons-codec-1.4.jar grizzly-http-server-2.1.2.jar jackson-core-asl-1.8.8.jar jersey-server-1.9.jar slf4j-api-1.7.1.jar
commons-collections-3.2.1.jar grizzly-http-servlet-2.1.2.jar jackson-jaxrs-1.8.3.jar jersey-test-framework-core-1.9.jar slf4j-log4j12-1.7.1.jar
commons-compress-1.4.1.jar grizzly-rcm-2.1.2.jar jackson-mapper-asl-1.8.8.jar jersey-test-framework-grizzly2-1.9.jar snappy-java-1.0.4.1.jar
commons-configuration-1.6.jar guava-17.0.jar jackson-xc-1.8.3.jar jets3t-0.6.1.jar stax-api-1.0.1.jar
commons-daemon-1.0.13.jar guice-3.0.jar jasper-compiler-5.5.23.jar jettison-1.1.jar xmlenc-0.52.jar
commons-digester-1.8.jar guice-servlet-3.0.jar jasper-runtime-5.5.23.jar jetty-6.1.26.jar xz-1.0.jar
commons-el-1.0.jar hadoop-annotations-2.2.0.jar javax.inject-1.jar jetty-util-6.1.26.jar zookeeper-3.4.5.jar
commons-httpclient-3.1.jar hadoop-auth-2.2.0.jar javax.servlet-3.1.jar jsch-0.1.42.jar
commons-io-2.1.jar hadoop-common-2.2.0.jar javax.servlet-api-3.0.1.jar jsp-api-2.1.jar
[hadoop@master1 lib]$
The full content of pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.winse</groupId>
<version>1.0</version>
<artifactId>scalamapred</artifactId>
<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>build-helper-maven-plugin</artifactId>
<version>1.8</version>
<executions>
<execution>
<id>add-scala-sources</id>
<phase>generate-sources</phase>
<goals>
<goal>add-source</goal>
</goals>
<configuration>
<sources>
<source>${basedir}/src/main/scala</source>
</sources>
</configuration>
</execution>
<execution>
<id>add-scala-test-sources</id>
<phase>generate-test-sources</phase>
<goals>
<goal>add-test-source</goal>
</goals>
<configuration>
<sources>
<source>${basedir}/src/test/scala</source>
</sources>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
</dependencies>
<properties>
<scala.version>2.10.4</scala.version>
<hadoop.version>2.2.0</hadoop.version>
</properties>
<profiles>
<profile>
<id>tar</id>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
</project>
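The lib folder contains two jar folders, common and core: common holds the project's dependencies, and core holds the jar built from the project's own code.
Next, run the program, using -libjars to add the scala-library jar to the MapReduce runtime classpath. Alternatively, scala-library can be added to mapreduce.application.classpath (default: $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*).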
[hadoop@master1 scalamapred-1.0.5]$ for j in `find . -name "*.jar"` ; do export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$j ; done
or
[hadoop@master1 scalamapred-1.0.5]$ export HADOOP_CLASSPATH=
[hadoop@master1 scalamapred-1.0.5]$ export HADOOP_CLASSPATH=/home/hadoop/scalamapred-1.0.5/lib/core/*:/home/hadoop/scalamapred-1.0.5/lib/common/*
[hadoop@master1 scalamapred-1.0.5]$ hadoop com.github.winse.hadoop.HelloScalaMapRed -libjars lib/common/scala-library-2.10.4.jar
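The two settings used above take different separators: HADOOP_CLASSPATH is an ordinary colon-separated classpath for the client JVM, while -libjars takes a comma-separated list of jars to ship with the job. The sketch below only assembles the strings; JobClasspath and its methods are hypothetical helpers. It also assumes the driver parses Hadoop's generic options (for example through ToolRunner), which the commands above suggest but the post does not show.

```scala
// Hypothetical helper; it only builds strings and does not call Hadoop.
object JobClasspath {
  // HADOOP_CLASSPATH: classpath entries joined with ':' (Linux separator).
  def hadoopClasspath(entries: Seq[String]): String =
    entries.mkString(":")

  // -libjars: one option followed by a comma-separated jar list.
  def libJarsOption(jars: Seq[String]): Seq[String] =
    if (jars.isEmpty) Seq.empty else Seq("-libjars", jars.mkString(","))

  // Generic options such as -libjars go before the job's own arguments.
  def command(mainClass: String, libJars: Seq[String], jobArgs: Seq[String]): Seq[String] =
    Seq("hadoop", mainClass) ++ libJarsOption(libJars) ++ jobArgs

  def main(args: Array[String]): Unit = {
    val home = "/home/hadoop/scalamapred-1.0.5"
    println("export HADOOP_CLASSPATH=" +
      hadoopClasspath(Seq(home + "/lib/core/*", home + "/lib/common/*")))
    println(command(
      "com.github.winse.hadoop.HelloScalaMapRed",
      Seq("lib/common/scala-library-2.10.4.jar"),
      Seq.empty).mkString(" "))
  }
}
```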
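# Troubleshooting
Without -libjars in the command above, an exception is thrown on the NodeManager side (the log further down shows it comes from the MRAppMaster container)! I had assumed that leaving out the dependency would only stop the code inside map and reduce from running. So where is the root cause?
Add a remote-debugging option to the job, then run it and track the problem down step by step (if one run doesn't find it, debug a few more times).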
[hadoop@master1 scalamapred-1.0.5]$ hadoop com.github.winse.hadoop.HelloScalaMapRed -Dyarn.app.mapreduce.am.command-opts="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=18090"
// There is only one slave here; go to that machine and look at the running processes
[hadoop@slaver1 nmPrivate]$ ps axu|grep java
hadoop 1427 0.6 10.5 1562760 106344 ? Sl Sep11 0:45 /opt/jdk1.7.0_60//bin/java -Dproc_datanode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hadoop/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/hadoop/hadoop-2.2.0 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/hadoop/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hadoop/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hadoop-datanode-slaver1.log -Dhadoop.home.dir=/home/hadoop/hadoop-2.2.0 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/home/hadoop/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
hadoop 2874 2.5 11.7 1599312 118980 ? Sl 00:08 0:57 /opt/jdk1.7.0_60//bin/java -Dproc_nodemanager -Xmx1000m -Dhadoop.log.dir=/home/hadoop/hadoop-2.2.0/logs -Dyarn.log.dir=/home/hadoop/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hadoop-nodemanager-slaver1.log -Dyarn.log.file=yarn-hadoop-nodemanager-slaver1.log -Dyarn.home.dir= -Dyarn.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/home/hadoop/hadoop-2.2.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -server -Dhadoop.log.dir=/home/hadoop/hadoop-2.2.0/logs -Dyarn.log.dir=/home/hadoop/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hadoop-nodemanager-slaver1.log -Dyarn.log.file=yarn-hadoop-nodemanager-slaver1.log -Dyarn.home.dir=/home/hadoop/hadoop-2.2.0 -Dhadoop.home.dir=/home/hadoop/hadoop-2.2.0 -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/home/hadoop/hadoop-2.2.0/lib/native -classpath /home/hadoop/hadoop-2.2.0/etc/hadoop:/home/hadoop/hadoop-2.2.0/etc/hadoop:/home/hadoop/hadoop-2.2.0/etc/hadoop:/home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/etc/hadoop/nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager
hadoop 3750 0.0 0.1 106104 1200 ? Ss 00:43 0:00 /bin/bash -c /opt/jdk1.7.0_60//bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/hadoop/hadoop-2.2.0/logs/userlogs/application_1410453720744_0007/container_1410453720744_0007_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=18090 org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/home/hadoop/hadoop-2.2.0/logs/userlogs/application_1410453720744_0007/container_1410453720744_0007_01_000001/stdout 2>/home/hadoop/hadoop-2.2.0/logs/userlogs/application_1410453720744_0007/container_1410453720744_0007_01_000001/stderr
hadoop 3759 0.1 1.8 737648 18232 ? Sl 00:43 0:00 /opt/jdk1.7.0_60//bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/hadoop/hadoop-2.2.0/logs/userlogs/application_1410453720744_0007/container_1410453720744_0007_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=18090 org.apache.hadoop.mapreduce.v2.app.MRAppMaster
hadoop 3778 0.0 0.0 103256 832 pts/0 S+ 00:45 0:00 grep java
// Go to the corresponding directory and look at the launch_container.sh script
// appmaster launcher
[hadoop@slaver1 nm-local-dir]$ cd nmPrivate/application_1410453720744_0007/
[hadoop@slaver1 application_1410453720744_0007]$ ll
total 4
drwxrwxr-x. 2 hadoop hadoop 4096 Sep 12 00:43 container_1410453720744_0007_01_000001
[hadoop@slaver1 application_1410453720744_0007]$ less container_1410453720744_0007_01_000001/
container_1410453720744_0007_01_000001.tokens launch_container.sh
.container_1410453720744_0007_01_000001.tokens.crc .launch_container.sh.crc
[hadoop@slaver1 application_1410453720744_0007]$ less container_1410453720744_0007_01_000001/launch_container.sh
#!/bin/bash
export NM_HTTP_PORT="8042"
export LOCAL_DIRS="/home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007"
export HADOOP_COMMON_HOME="/home/hadoop/hadoop-2.2.0"
export JAVA_HOME="/opt/jdk1.7.0_60/"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
"
export HADOOP_YARN_HOME="/home/hadoop/hadoop-2.2.0"
export CLASSPATH="$PWD:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:$PWD/*"
export HADOOP_TOKEN_FILE_LOCATION="/home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007/container_1410453720744_0007_01_000001/container_tokens"
export NM_HOST="slaver1"
export APPLICATION_WEB_PROXY_BASE="/proxy/application_1410453720744_0007"
export JVM_PID="$$"
export USER="hadoop"
export HADOOP_HDFS_HOME="/home/hadoop/hadoop-2.2.0"
export PWD="/home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007/container_1410453720744_0007_01_000001"
export CONTAINER_ID="container_1410453720744_0007_01_000001"
export HOME="/home/"
export NM_PORT="40888"
export LOGNAME="hadoop"
export APP_SUBMIT_TIME_ENV="1410455811401"
export MAX_APP_ATTEMPTS="2"
export HADOOP_CONF_DIR="/home/hadoop/hadoop-2.2.0/etc/hadoop"
export MALLOC_ARENA_MAX="4"
export LOG_DIRS="/home/hadoop/hadoop-2.2.0/logs/userlogs/application_1410453720744_0007/container_1410453720744_0007_01_000001"
ln -sf "/home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007/filecache/10/job.jar" "job.jar"
ln -sf "/home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007/filecache/13/job.xml" "job.xml"
mkdir -p jobSubmitDir
ln -sf "/home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007/filecache/11/job.splitmetainfo" "jobSubmitDir/job.splitmetainfo"
mkdir -p jobSubmitDir
ln -sf "/home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007/filecache/12/job.split" "jobSubmitDir/job.split"
exec /bin/bash -c "$JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/hadoop/hadoop-2.2.0/logs/userlogs/application_1410453720744_0007/container_1410453720744_0007_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=18090 org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/home/hadoop/hadoop-2.2.0/logs/userlogs/application_1410453720744_0007/container_1410453720744_0007_01_000001/stdout 2>/home/hadoop/hadoop-2.2.0/logs/userlogs/application_1410453720744_0007/container_1410453720744_0007_01_000001/stderr "
// Go to the corresponding directory under TMP and look at the MRAppMaster root directory
[hadoop@slaver1 ~]$ cd /home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007/container_1410453720744_0007_01_000001
[hadoop@slaver1 container_1410453720744_0007_01_000001]$ ll
total 28
-rw-r--r--. 1 hadoop hadoop 95 Sep 12 00:43 container_tokens
-rwx------. 1 hadoop hadoop 468 Sep 12 00:43 default_container_executor.sh
lrwxrwxrwx. 1 hadoop hadoop 108 Sep 12 00:43 job.jar -> /home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007/filecache/10/job.jar
drwxrwxr-x. 2 hadoop hadoop 4096 Sep 12 00:43 jobSubmitDir
lrwxrwxrwx. 1 hadoop hadoop 108 Sep 12 00:43 job.xml -> /home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0007/filecache/13/job.xml
-rwx------. 1 hadoop hadoop 3005 Sep 12 00:43 launch_container.sh
drwx--x---. 2 hadoop hadoop 4096 Sep 12 00:43 tmp
[hadoop@slaver1 container_1410453720744_0007_01_000001]$
For comparison, here is the listing of the TMP directory when -libjars is added:
[hadoop@master1 scalamapred-1.0.5]$ hadoop com.github.winse.hadoop.HelloScalaMapRed -Dyarn.app.mapreduce.am.command-opts="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=18090" -libjars lib/common/scala-library-2.10.4.jar
[hadoop@slaver1 container_1410453720744_0007_01_000001]$ cd /home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0008/container_1410453720744_0008_01_000001
[hadoop@slaver1 container_1410453720744_0008_01_000001]$ ll
total 32
-rw-r--r--. 1 hadoop hadoop 95 Sep 12 00:49 container_tokens
-rwx------. 1 hadoop hadoop 468 Sep 12 00:49 default_container_executor.sh
lrwxrwxrwx. 1 hadoop hadoop 108 Sep 12 00:49 job.jar -> /home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0008/filecache/10/job.jar
drwxrwxr-x. 2 hadoop hadoop 4096 Sep 12 00:49 jobSubmitDir
lrwxrwxrwx. 1 hadoop hadoop 108 Sep 12 00:49 job.xml -> /home/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1410453720744_0008/filecache/13/job.xml
-rwx------. 1 hadoop hadoop 3127 Sep 12 00:49 launch_container.sh
lrwxrwxrwx. 1 hadoop hadoop 85 Sep 12 00:49 scala-library-2.10.4.jar -> /home/hadoop/data/nm-local-dir/usercache/hadoop/filecache/10/scala-library-2.10.4.jar
drwx--x---. 2 hadoop hadoop 4096 Sep 12 00:49 tmp
[hadoop@slaver1 container_1410453720744_0008_01_000001]$
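One reading of the two listings: the CLASSPATH exported in launch_container.sh ends with $PWD/*, and with -libjars the jar is symlinked into the container's working directory, so that last wildcard entry is what makes it visible to the MRAppMaster. The sketch below checks which classpath entries would cover a given jar path. ContainerClasspath is a hypothetical helper, it works on strings only, and the paths in main are shortened placeholders rather than the real ones.

```scala
// Hypothetical helper: string-level check, no file system access.
object ContainerClasspath {
  // Split a classpath on ':' (Linux separator), dropping empty entries.
  def entries(classpath: String): Seq[String] =
    classpath.split(":").toSeq.filter(_.nonEmpty)

  // A "dir/*" entry covers every .jar/.JAR file directly inside dir;
  // any other entry has to name the jar itself.
  def covers(entry: String, jarPath: String): Boolean =
    if (entry.endsWith("/*")) {
      val dir = entry.dropRight(2)
      val slash = jarPath.lastIndexOf('/')
      slash >= 0 &&
        jarPath.substring(0, slash) == dir &&
        (jarPath.endsWith(".jar") || jarPath.endsWith(".JAR"))
    } else entry == jarPath

  def coveringEntries(classpath: String, jarPath: String): Seq[String] =
    entries(classpath).filter(covers(_, jarPath))

  def main(args: Array[String]): Unit = {
    val pwd = "/appcache/app_0008/container_0008_01_000001" // placeholder path
    val cp = pwd + ":/hadoop/share/hadoop/mapreduce/*:job.jar/job.jar:job.jar/lib/*:" + pwd + "/*"
    println(coveringEntries(cp, pwd + "/scala-library-2.10.4.jar"))
  }
}
```

In the run without -libjars the working directory has no scala-library symlink, so none of the entries provides the Scala classes.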
On the local Windows machine, use Eclipse to attach remotely and step through the code.

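At this point the status can be checked on the web page at port 8088: one MRAppMaster is currently running, and if the first attempt fails, a second attempt is made.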

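After several debugging runs, the problem was finally pinned down. The application master checks whether the job is a chained MR job, and loading the mapper class at that moment also requires loading its parent class, i.e. a Scala class, so the exception is thrown here.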
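The check described above can be approximated as follows. This is a sketch of the idea, not Hadoop's code: ChainBase, PlainMapper and MyChainMapper are stand-ins (in Hadoop the base type would be ChainMapper), and ChainCheck is a hypothetical helper. The point is that resolving a class by name can fail with a LinkageError such as NoClassDefFoundError when a type it needs is missing, even though the caller only wanted a yes or no answer.

```scala
import scala.util.{Failure, Success, Try}

// Stand-in types for illustration only.
class ChainBase
class PlainMapper
class MyChainMapper extends ChainBase

object ChainCheck {
  // Resolve the class by name, then test whether it derives from `base`.
  // Class.forName loads and links the class, which can pull in its
  // superclass, interfaces and other referenced types. If one of them is
  // not on the classpath, a NoClassDefFoundError (a LinkageError) is thrown.
  // scala.util.Try(...) does not catch LinkageError, hence the explicit catch.
  def derivesFrom(base: Class[_], className: String): Try[Boolean] =
    try Success(base.isAssignableFrom(Class.forName(className)))
    catch {
      case e: ClassNotFoundException => Failure(e)
      case e: LinkageError           => Failure(e)
    }

  def main(args: Array[String]): Unit = {
    println(derivesFrom(classOf[ChainBase], classOf[MyChainMapper].getName))
    println(derivesFrom(classOf[ChainBase], classOf[PlainMapper].getName))
    println(derivesFrom(classOf[ChainBase], "no.such.Mapper"))
  }
}
```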

Looking at the program's run logs shows the exception that was thrown: NoClassDefFoundError.
[hadoop@slaver1 ~]$ less /home/hadoop/hadoop-2.2.0/logs/userlogs/application_1410448728371_0003/*/syslog
2014-09-11 22:55:12,616 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1410448728371_0003_000001
...
2014-09-11 22:55:18,677 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Adding job token for job_1410448728371_0003 to jobTokenSecretManager
2014-09-11 22:55:19,119 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.NoClassDefFoundError: scala/Function1
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.isChainJob(JobImpl.java:1277)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.makeUberDecision(JobImpl.java:1217)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.access$3700(JobImpl.java:135)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1420)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1358)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:972)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:134)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1227)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1035)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1445)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1441)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1374)
Caused by: java.lang.ClassNotFoundException: scala.Function1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 22 more
2014-09-11 22:55:19,130 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler.
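The trace above names the missing class twice, in two notations: NoClassDefFoundError uses the internal form with slashes (scala/Function1), while ClassNotFoundException uses the dotted form (scala.Function1). A small sketch for pulling such names out of a syslog, normalised to the dotted form, might look like this; MissingClasses is a hypothetical helper and only these two message shapes are handled.

```scala
// Hypothetical helper for scanning container logs for missing classes.
object MissingClasses {
  private val NoClassDef = """.*java\.lang\.NoClassDefFoundError: (\S+).*""".r
  private val NotFound   = """.*java\.lang\.ClassNotFoundException: (\S+).*""".r

  // Extract the class name from one log line, if the line reports one.
  def fromLogLine(line: String): Option[String] = line match {
    case NoClassDef(name) => Some(name.replace('/', '.')) // scala/Function1 -> scala.Function1
    case NotFound(name)   => Some(name)
    case _                => None
  }

  // All distinct missing classes, in order of first appearance.
  def fromLog(lines: Seq[String]): Seq[String] =
    lines.flatMap(l => fromLogLine(l).toSeq).distinct

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "java.lang.NoClassDefFoundError: scala/Function1",
      "Caused by: java.lang.ClassNotFoundException: scala.Function1")
    fromLog(sample).foreach(println)
  }
}
```

A name under the scala package points at scala-library, which is the jar the -libjars option in the earlier command supplies.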
# Unexpected finds
- The initialization code for speculative execution

- The code where OutputFormat obtains its Committer

–END