Best practices for recovering corrupted HDFS blocks in production
1. Upload the file hello.txt
[[email protected] apps]# hdfs dfs -mkdir /blockrecover
[[email protected] apps]# echo "hello word" > hello.txt
[[email protected] apps]# hdfs dfs -put hello.txt /blockrecover
[[email protected] apps]# hdfs dfs -ls /blockrecover
Found 1 items
-rw-r--r-- 2 root supergroup 11 2019-03-03 18:26 /blockrecover/hello.txt
[[email protected] apps]# hdfs fsck /
Connecting to namenode via http://cdh-node01:50070/fsck?ugi=root&path=%2F
FSCK started by root (auth:SIMPLE) from /192.168.17.20 for path / at Sun Mar 03 18:27:50 CST 2019
Status: HEALTHY
Number of data-nodes: 3
Number of racks: 1
Total dirs: 40
Total symlinks: 0
Replicated Blocks:
Total size: 108216 B
Total files: 35
Total blocks (validated): 25 (avg. block size 4328 B)
Minimally replicated blocks: 25 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
FSCK ended at Sun Mar 03 18:27:50 CST 2019 in 65 milliseconds
The filesystem under path '/' is HEALTHY
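For a scripted check, the overall verdict can be pulled out of the fsck report instead of reading it by eye. A minimal sketch, assuming the report keeps the `Status:` line format shown above; `fsck_status` is a hypothetical helper name, not part of the HDFS CLI.

```shell
# fsck_status: hypothetical helper. Reads an fsck report on stdin and prints
# the word that follows "Status:" (HEALTHY or CORRUPT in the reports shown here).
fsck_status() {
  awk '$1 == "Status:" { print $2; exit }'
}

# Typical use on a cluster node (requires the hdfs CLI):
#   hdfs fsck / | fsck_status
```

Because the helper only reads stdin, it can also be run against a saved fsck report.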
2. Delete one replica of one of the file's blocks directly on a DN node (2 replicas)
Locate the block file and its meta file on the DN, then delete both:
[[email protected] subdir0]# rm -rf blk_1073741874 blk_1073741874_1065.meta
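The transcript does not show how the block and meta files were located. One possible way is to read the block name from `hdfs fsck /blockrecover/hello.txt -files -blocks -locations` and then search the DN data directory for it. The sketch below assumes the `blk_<id>_<genstamp>` naming seen in this article; `block_files` and `find_block_files` are hypothetical helper names, and the data directory argument is whatever `dfs.datanode.data.dir` points to on your DN.

```shell
# block_files: hypothetical helper. Given a block name as fsck prints it
# (blk_<id>_<genstamp>), print the two on-disk file names kept for it.
block_files() {
  name="$1"                 # e.g. blk_1073741874_1065
  id="${name%_*}"           # drop the generation stamp -> blk_1073741874
  printf '%s\n%s.meta\n' "$id" "$name"
}

# find_block_files: hypothetical helper. Search a DataNode data directory
# (first argument, an assumption about your layout) for both files.
find_block_files() {
  data_dir="$1"
  name="$2"
  find "$data_dir" -type f \( -name "${name%_*}" -o -name "${name}.meta" \)
}

# Example (the path is a placeholder, not taken from the article):
#   find_block_files /path/to/dfs/data blk_1073741874_1065
```

Listing the two paths before running `rm` makes it easier to confirm that only one replica of the intended block is removed.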
Restart HDFS so that the damage takes effect immediately, then check with fsck:
[[email protected] ~]# hdfs fsck /
Connecting to namenode via http://cdh-node01:50070/fsck?ugi=root&path=%2F
FSCK started by root (auth:SIMPLE) from /192.168.17.20 for path / at Sun Mar 03 19:48:31 CST 2019
/blockrecover/hello.txt: Under replicated BP-794681415-192.168.17.20-1548403311677:blk_1073741874_1065. Target Replicas is 2 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
/user/root/.Trash/Current/blockrecover/hello.txt: MISSING 1 blocks of total size 11 B.
Status: CORRUPT
Number of data-nodes: 3
Number of racks: 1
Total dirs: 45
Total symlinks: 0
Replicated Blocks:
Total size: 108227 B
Total files: 36
Total blocks (validated): 26 (avg. block size 4162 B)
********************************
UNDER MIN REPL'D BLOCKS: 1 (3.8461537 %)
MINIMAL BLOCK REPLICATION: 1
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 11 B
********************************
Minimally replicated blocks: 25 (96.15385 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 1 (3.8461537 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 1.8846154
Missing blocks: 1
Corrupt blocks: 0
Missing replicas: 1 (1.9230769 %)
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
FSCK ended at Sun Mar 03 19:48:31 CST 2019 in 100 milliseconds
The filesystem under path '/' is CORRUPT
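On a large cluster the interesting part of this report is the list of affected paths. `hdfs fsck / -list-corruptfileblocks` prints such a list directly; as an alternative that also works on a saved report, the sketch below filters the `<path>: MISSING ...` lines. It assumes the line format shown above, and `missing_paths` is a hypothetical helper name.

```shell
# missing_paths: hypothetical helper. Reads an fsck report on stdin and prints
# the path of every file reported with MISSING blocks, one per line.
missing_paths() {
  sed -n 's|^\(/[^:]*\): MISSING .*|\1|p'
}

# Typical use on a cluster node (requires the hdfs CLI):
#   hdfs fsck / | missing_paths
```

Lines that only report under-replication are intentionally skipped, since those blocks still have a live replica.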
3. Manual repair with hdfs debug
Repair command:
[[email protected] apps]# hdfs debug recoverLease -path /blockrecover/hello.txt -retries 10
recoverLease SUCCEEDED on /blockrecover/hello.txt
Check directly on the DN node: the block file and the meta file have been restored:
[[email protected] subdir0]# ll
total 8
-rw-r--r-- 1 root root 11 Mar 4 10:38 blk_1073741874
-rw-r--r-- 1 root root 11 Mar 4 10:38 blk_1073741874_1065.meta
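If the repair is run from a script, the success message can be checked before moving on to a follow-up fsck. A minimal sketch, assuming the `recoverLease SUCCEEDED on <path>` message shown above; `lease_recovered` is a hypothetical helper name.

```shell
# lease_recovered: hypothetical helper. Reads command output on stdin and
# succeeds only if it contains the success message for the given path.
lease_recovered() {
  grep -qF "recoverLease SUCCEEDED on $1"
}

# Typical use on a cluster node (requires the hdfs CLI):
#   hdfs debug recoverLease -path /blockrecover/hello.txt -retries 10 \
#     | lease_recovered /blockrecover/hello.txt \
#     && hdfs fsck /blockrecover/hello.txt
```

The follow-up fsck on the single path is cheaper than scanning `/` and confirms whether the replica count is back to the target.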
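4. Automatic repair
After a block is damaged, the DN does not notice it until it runs its next directoryscan.
The directoryscan runs every 6h (the value is in seconds):
dfs.datanode.directoryscan.interval : 21600
Likewise, the block is not recovered until the DN sends its next blockreport to the NN.
The blockreport is also sent every 6h (the value is in milliseconds):
dfs.blockreport.intervalMsec : 21600000
Only after the NN receives the blockreport does it start the recovery.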
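The two settings use different units, which is easy to misread. The sketch below converts both to hours; the effective values on a cluster can be read with `hdfs getconf -confKey <key>`. The helper names are hypothetical, and stripping a trailing `s` is an assumption to tolerate values written with a seconds suffix.

```shell
# seconds_to_hours: hypothetical helper for dfs.datanode.directoryscan.interval.
# Accepts a plain number of seconds, optionally followed by "s".
seconds_to_hours() {
  echo $(( ${1%s} / 3600 ))
}

# millis_to_hours: hypothetical helper for dfs.blockreport.intervalMsec.
millis_to_hours() {
  echo $(( $1 / 3600000 ))
}

seconds_to_hours 21600      # prints 6
millis_to_hours 21600000    # prints 6

# On a cluster node (requires the hdfs CLI):
#   seconds_to_hours "$(hdfs getconf -confKey dfs.datanode.directoryscan.interval)"
#   millis_to_hours "$(hdfs getconf -confKey dfs.blockreport.intervalMsec)"
```

With both at 6h, a deleted replica may stay unnoticed for hours, which is why the walkthrough above restarts HDFS to make the damage visible at once.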
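Summary:
In production I generally prefer the manual repair approach, but it requires manually deleting the corrupted block first.
Remember: delete the corrupted block file and its meta file on the DN, not the HDFS file.
Alternatively, download the file with get, delete it from HDFS, and then upload it again.
Never run hdfs fsck / -delete for this: it deletes the corrupted files themselves, so the data is gone. Use it only if losing the data does not matter, or if you are confident you can backfill the data into HDFS from another source!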
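The download, delete, re-upload alternative can be sketched as a small function. This is an illustration under assumptions, not a tested production script: `reupload` and the `HDFS_CMD` variable are hypothetical names, and the approach only works while at least one healthy replica can still be read. Each step runs only if the previous one succeeded, so the HDFS copy is never deleted before the local copy exists.

```shell
# HDFS_CMD: hypothetical override. Defaults to the real CLI; set it to
# "echo hdfs" to print the commands instead of running them (dry run).
HDFS_CMD="${HDFS_CMD:-hdfs}"

# reupload: hypothetical helper. Arguments: HDFS path, local temporary file.
reupload() {
  hdfs_path="$1"
  local_copy="$2"
  $HDFS_CMD dfs -get "$hdfs_path" "$local_copy" &&
    $HDFS_CMD dfs -rm "$hdfs_path" &&
    $HDFS_CMD dfs -put "$local_copy" "$hdfs_path"
}

# Dry run:
#   HDFS_CMD="echo hdfs"
#   reupload /blockrecover/hello.txt /tmp/hello.txt
```

Running it once in dry-run mode first shows exactly which three commands would be issued for the path.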