rsync与scp优势

今天在做flume写kafka数据时，数据从其他目录cp拷贝过来，flume采集程序报错 程序采集的时刻文件发生了改变。

07 Mar 2016 16:46:05,535 ERROR [pool-3-thread-1] (org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run:256)  - FATAL: Spool Directory source s1: { spoolDir: /home/hadoop/flume/data/ }: Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.
java.lang.IllegalStateException: File has changed size since being read: /home/hadoop/flume/data/hbase-hadoop-master-cu2.log
        at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.retireCurrentFile(ReliableSpoolingFileEventReader.java:326)
        at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(ReliableSpoolingFileEventReader.java:259)

联想到scp和rsync，好像rsync是有重命名这样的步骤的。网上也有很多对比这个两个工具的资料。

http://stackoverflow.com/questions/20244585/how-does-scp-differ-from-rsync
http://superuser.com/questions/193952/why-is-rsync-avz-faster-than-scp-r
rsync可以增量复制，并且只复制内容不同的部分
rsync可以压缩，通过有断点续传 -P
rsync有各种参数： exclude等
SCP也可以增加压缩参数： scp -C -o 'CompressionLevel 9' -o 'IPQoS throughput' -c arcfour machine:file .
rsync会先写临时文件，复制完成后再重命名！

这里只关注最后一点，对于按照名称来采集的程序非常关键！下面使用inotify监控目录的操作，在进行scp和rsync时发生的操作：

[hadoop@cu2 test]$ scp -r source target/
[hadoop@cu2 test]$ rm target/source/1234
[hadoop@cu2 test]$ rsync -vaz source target/
sending incremental file list
source/
source/1234

sent 141 bytes  received 35 bytes  352.00 bytes/sec
total size is 34  speedup is 0.19

对应的inotify的输出为：

[hadoop@cu2 test]$ inotifywait -m target/source/ # yum install -y inotify*
Setting up watches.
Watches established.
target/source/ CREATE 1234
target/source/ OPEN 1234
target/source/ MODIFY 1234
target/source/ CLOSE_WRITE,CLOSE 1234

target/source/ DELETE 1234

target/source/ ATTRIB,ISDIR 
target/source/ CREATE .1234.ARUg56
target/source/ OPEN .1234.ARUg56
target/source/ ATTRIB .1234.ARUg56
target/source/ MODIFY .1234.ARUg56
target/source/ CLOSE_WRITE,CLOSE .1234.ARUg56
target/source/ ATTRIB .1234.ARUg56
target/source/ MOVED_FROM .1234.ARUg56
target/source/ MOVED_TO 1234

rsync会先写把内容复制到一个临时文件，复制完成后，再重命名为正式的名称。

在生产环境尽量使用rsync来进行文件(夹)的复制/同步操作，即快键有安全。

当然还有奇葩的快速删除海量文件夹的方式也用的是rsync：

rsync --delete-before -d /data/blank/ /var/spool/clientmqueue/ 

rsync --delete-before -a -H -v --progress --stats /tmp/test/ log/

–END

Discussion

在 GitHub 上讨论

欢迎通过 GitHub Issue 留言或反馈。每条讨论都会关联到对应文章的源文件路径。

2016-03-07-rsync-vs-scp.md

查看已有讨论新建 Issue

杀鸡焉用牛刀：DuckDB 正取代部分 Spark 场景
2026-02-16
基于对象存储的 Spark 数据读写实战：从末尾追加到任意更新
2025-10-28
认真的博客
2021-12-08
视频自动翻译
2018-08-25

在 GitHub 上讨论

Related posts