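rsync vs. scp: the advantages of rsync
Today, while setting up Flume to write data to Kafka, I copied the data files in from another directory with cp, and the Flume collector failed with an error: the file had changed while it was being read.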
07 Mar 2016 16:46:05,535 ERROR [pool-3-thread-1] (org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run:256) - FATAL: Spool Directory source s1: { spoolDir: /home/hadoop/flume/data/ }: Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.
java.lang.IllegalStateException: File has changed size since being read: /home/hadoop/flume/data/hbase-hadoop-master-cu2.log
at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.retireCurrentFile(ReliableSpoolingFileEventReader.java:326)
at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(ReliableSpoolingFileEventReader.java:259)
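This reminded me of scp and rsync: as I recalled, rsync has a rename step. There is also plenty of material online comparing the two tools.
- http://stackoverflow.com/questions/20244585/how-does-scp-differ-from-rsync
- http://superuser.com/questions/193952/why-is-rsync-avz-faster-than-scp-r
- rsync can copy incrementally, transferring only the parts of a file that differ.
- rsync can compress data in transit, and with -P (--partial --progress) it can resume interrupted transfers.
- rsync has many options, such as --exclude.
- scp can also enable compression:
  scp -C -o 'CompressionLevel 9' -o 'IPQoS throughput' -c arcfour machine:file .
- rsync writes to a temporary file first and renames it once the copy is complete!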
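Only the last point matters here, and it is critical for any program that collects files by name! Below, inotify monitors the target directory to show which operations occur during an scp and during an rsync: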
[hadoop@cu2 test]$ scp -r source target/
[hadoop@cu2 test]$ rm target/source/1234
[hadoop@cu2 test]$ rsync -vaz source target/
sending incremental file list
source/
source/1234
sent 141 bytes received 35 bytes 352.00 bytes/sec
total size is 34 speedup is 0.19
The corresponding inotify output is:
[hadoop@cu2 test]$ inotifywait -m target/source/ # yum install -y inotify*
Setting up watches.
Watches established.
target/source/ CREATE 1234
target/source/ OPEN 1234
target/source/ MODIFY 1234
target/source/ CLOSE_WRITE,CLOSE 1234
target/source/ DELETE 1234
target/source/ ATTRIB,ISDIR
target/source/ CREATE .1234.ARUg56
target/source/ OPEN .1234.ARUg56
target/source/ ATTRIB .1234.ARUg56
target/source/ MODIFY .1234.ARUg56
target/source/ CLOSE_WRITE,CLOSE .1234.ARUg56
target/source/ ATTRIB .1234.ARUg56
target/source/ MOVED_FROM .1234.ARUg56
target/source/ MOVED_TO 1234
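In the output above, scp creates 1234 under its final name and writes into it directly (CREATE, MODIFY, CLOSE_WRITE), so a collector can see the file while it is still growing. rsync instead copies the content into a temporary file (.1234.ARUg56) and only renames it to the final name once the copy is complete (MOVED_FROM, MOVED_TO).
In production, prefer rsync for copying or synchronizing files and directories: it is both fast and safe.
There is also a quirky trick for quickly deleting a huge number of files and directories, and it uses rsync as well: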
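When rsync is not an option, the same write-then-rename pattern can be approximated with cp and mv. The following is only a minimal sketch: `atomic_copy` is a hypothetical helper name, not something from the tools above, and it assumes the collector skips hidden (dot-prefixed) files in the watched directory. If that is not the case, the temporary file could be staged in a sibling directory on the same filesystem instead.

```shell
#!/usr/bin/env bash
# atomic_copy SRC DEST_DIR   (hypothetical helper, for illustration only)
# Copy SRC into DEST_DIR under a hidden temporary name, then rename it into
# place, so the final name never appears with partial content.
atomic_copy() {
    local src="$1" dest_dir="$2"
    local name tmp
    name="$(basename "$src")"
    # The temp file is created inside DEST_DIR because a rename is only
    # atomic when source and target are on the same filesystem.
    # Note: mktemp creates the file with mode 0600, which the copy keeps.
    tmp="$(mktemp "$dest_dir/.$name.XXXXXX")" || return 1
    if cp "$src" "$tmp" && mv -f "$tmp" "$dest_dir/$name"; then
        return 0
    fi
    # Clean up the partial temp file if the copy or the rename failed.
    rm -f "$tmp"
    return 1
}
```

It would be called as `atomic_copy hbase-hadoop-master-cu2.log /home/hadoop/flume/data`, with the paths taken from the error message above purely as an example.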
rsync --delete-before -d /data/blank/ /var/spool/clientmqueue/
rsync --delete-before -a -H -v --progress --stats /tmp/test/ log/
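The idea behind both commands is to synchronize an empty directory onto the target, so that everything in the target counts as extraneous and gets deleted. A rough sketch of that idea as a reusable function follows; `empty_dir_with_rsync` is a hypothetical name, and it uses `-r` rather than the `-d` or `-a` shown above so that only recursion is enabled and the target directory's own permissions and timestamps are left alone.

```shell
#!/usr/bin/env bash
# empty_dir_with_rsync TARGET   (hypothetical helper, for illustration only)
# Delete everything inside TARGET by syncing a freshly created empty
# directory onto it. TARGET itself is kept.
empty_dir_with_rsync() {
    local target="$1"
    local blank status
    [ -d "$target" ] || return 1
    blank="$(mktemp -d)" || return 1
    # Trailing slashes matter: they sync the *contents* of the empty
    # directory, not the directory entry itself.
    rsync -r --delete-before "$blank"/ "$target"/
    status=$?
    rmdir "$blank"
    return "$status"
}
```

Because this removes data irreversibly, it is worth double-checking the target path before running something like `empty_dir_with_rsync /var/spool/clientmqueue`.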
–END