Parquet HDFS getmerge recovery

Problem description:

The only related thread I found is here, but it does not solve my problem.

The problem is that we made a local backup of a Parquet dataset with:

$ hadoop fs -getmerge /dir/on/hdfs /local/dir 

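Our mistake was to assume that Parquet's multi-file layout was a side effect of writing to HDFS. We did not realize that this is the normal organization of a Parquet dataset. So (not very cleverly) we used HDFS getmerge to make the backup. The problem is that the original data has since been deleted, and we are now trying to recover it from the merged file.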

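While analyzing Parquet (and reading the docs), we found that each file consists of a block of data plus metadata enclosed between two occurrences of the magic number "PAR1". In addition to these data files there are two metadata files, _metadata and _common_metadata.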
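As a quick sanity check of that layout, here is a minimal Python sketch that tests whether a file begins and ends with the 4-byte magic number. The name has_parquet_magic is made up for illustration, and the check says nothing about whether the bytes between the two markers are valid.

```python
import io
import os

MAGIC = b"PAR1"  # 4-byte magic number found at both ends of a Parquet file


def has_parquet_magic(path):
    """Return True if the file at `path` starts and ends with MAGIC.

    Hypothetical helper for illustration only; it does not validate the
    data or the footer stored between the two markers.
    """
    # A leading and a trailing marker need at least 8 bytes in total.
    if os.path.getsize(path) < 2 * len(MAGIC):
        return False
    with io.open(path, mode="rb") as f:
        head = f.read(len(MAGIC))
        f.seek(-len(MAGIC), os.SEEK_END)
        tail = f.read(len(MAGIC))
    return head == MAGIC and tail == MAGIC
```

Running it on each file recovered from the merged backup gives a cheap first filter before trying to load anything in Spark.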

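By observing the order in which getmerge processes the files (of the original parquet directory on HDFS), I came up with a script that extracts the data between each pair of 'PAR1' markers as one chunk file. The first two files it builds are _common_metadata and _metadata.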

filePrefix='part-' 
finalFilePrefix='part-r-' 

awk 'NR%2==0{ print $0 > "part-"i++ }' RS='PAR1' $1 

nbFiles=$(ls -lah | grep 'part-' | wc -l) 

for num in $(seq 0 $nbFiles) 
     do 
     fileName="$filePrefix$num" 
     lastName="" 
     if [ "$num" -eq "0" ]; then 
       lastName="_common_metadata" 
       awk '{print "PAR1" $0 "PAR1"}' $fileName > $lastName 
     else  

       if [ "$num" -eq "1" ]; then 
         lastName="_metadata" 
         awk '{print "PAR1" $0 "PAR1"}' $fileName > $lastName 
       else  
         if [ -e $fileName ]; then 
           count=$(printf "%05d" $(($num-2))) 
           lastName="$finalFilePrefix$count.gz.parquet" 
           awk '{print "PAR1" $0 "PAR1"}' $fileName > $lastName 
         fi  
       fi    
     fi 
     echo $lastName 
     truncate --size=-1 $lastName 
     rm -f "$fileName" 
done 

mv $1 $1.backup 
mkdir $1 
mv _* $1 
mv part* $1 

Some remarks about the script:

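  1. It takes one argument: the "getmerged" parquet file.
  2. It moves all the parts it creates into a directory named after the original file (the original file is first renamed to filename.backup).
  3. One byte has to be removed from the end of each file (the truncate call). This was found empirically, because otherwise spark sc.load.parquet() cannot read the metadata files.
  4. Finally, we upload the result to HDFS with hadoop fs -put.
  5. As I said, the _metadata file (and apparently the _common_metadata file too) is read fine and the DataFrame is created, but we still get an error when the blocks are loaded: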

Code:

val newDataDF = sqlContext.read.parquet("/tmp/userActionLog2-leclerc-culturel-2016.09.04.parquet") 
newDataDF.take(1) 

Error:

newDataDF: org.apache.spark.sql.DataFrame = [bson: binary] 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, hdp-node4.affinytix.com): java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 13 
at org.apache.parquet.format.Util.read(Util.java:216) 
at org.apache.parquet.format.Util.readPageHeader(Util.java:65) 
at org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:668) 
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546) 
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496) 
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.checkEndOfRowGroup(UnsafeRowParquetRecordReader.java:604) 
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.loadBatch(UnsafeRowParquetRecordReader.java:218) 
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.nextKeyValue(UnsafeRowParquetRecordReader.java:196) 
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194) 
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) 
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212) 
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212) 
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1881) 
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1881) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
at org.apache.spark.scheduler.Task.run(Task.scala:89) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 13 
at parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806) 
at parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500) 
at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158) 
at org.apache.parquet.format.PageHeader.read(PageHeader.java:828) 
at org.apache.parquet.format.Util.read(Util.java:213) 
... 32 more 

Given that our data is at stake, if anyone has an idea that could help, I thank them sincerely in advance.

Bye

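I have answered my own question.

My basic idea at the start was sound. The problem was that awk (in the script from the question) added a number of extra characters, so the parquet blocks were unreadable afterwards.

The solution is to manipulate the merged file programmatically (Python, Perl, ...). Here is the Python solution I came up with. It is equivalent to the previous script but does not add spurious characters.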

Code:

import io
import locale
import os
import re
import sys

print("create parquet script.")
filename = sys.argv[1]
currencode = locale.getpreferredencoding()

print("=====================================================================")
print("Create parquet from: {}".format(filename))
print("default buffer size: {}".format(io.DEFAULT_BUFFER_SIZE))
print("default encoding of the system: {}".format(currencode))
print("=====================================================================")


def write_to_binfile(path, data):
    # This helper is not shown in the post; it is assumed to write the raw
    # bytes unchanged (binary mode, no newline or encoding added).
    with io.open(path, mode='wb') as out:
        out.write(data)


magicnum = b"PAR1"
with io.open(filename, mode='rb') as f:
    content = f.read()

# Split on the magic number, drop the empty pieces between adjacent markers,
# and put the magic number back around each chunk.
res = [magicnum + chunk + magicnum
       for chunk in re.split(magicnum, content) if chunk != b""]

# getmerge order: _common_metadata, _metadata, then the part files.
szcontent = len(res[2:])
for i in range(0, szcontent):
    si = str(i)
    write_to_binfile("part-r-{}.gz.parquet".format(si.zfill(5)), res[i + 2])

write_to_binfile("_common_metadata", res[0])
write_to_binfile("_metadata", res[1])

# Rebuild the directory layout (Unix shell commands).
os.system("mv {} {}.backup".format(filename, filename))
os.system("mkdir {}".format(filename))
os.system("mv _* {}".format(filename))
os.system("mv part* {}".format(filename))

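Remarks: the merged parquet file must not be too large, because the Python code loads the whole thing into memory as a single string (a few tens of megabytes is fine)! The script must be run on Linux/Unix, because the final system calls are Unix-specific.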
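For a merged file too large to hold in memory, one possible variant is to memory-map it and record only the byte offsets of the markers, then copy each range out in bounded buffers. This is a rough sketch, not what was actually run: the names find_chunk_ranges and copy_range are made up for illustration. Like the script above, it assumes that the byte sequence PAR1 never occurs inside a chunk's payload, and it assumes the file is not empty.

```python
import io
import mmap

MAGIC = b"PAR1"


def find_chunk_ranges(path):
    """Return (start, end) byte ranges, each covering MAGIC + payload + MAGIC.

    Hypothetical helper: markers are paired two by two in file order,
    and the file is memory-mapped instead of read into one string.
    """
    offsets = []
    with io.open(path, mode="rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            pos = mm.find(MAGIC)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(MAGIC, pos + len(MAGIC))
        finally:
            mm.close()
    # Pair opening and closing markers: (0, 1), (2, 3), ...
    return [(offsets[i], offsets[i + 1] + len(MAGIC))
            for i in range(0, len(offsets) - 1, 2)]


def copy_range(src_path, dst_path, start, end, bufsize=1 << 20):
    """Copy bytes [start, end) of src_path to dst_path in bounded buffers."""
    with io.open(src_path, "rb") as src, io.open(dst_path, "wb") as dst:
        src.seek(start)
        remaining = end - start
        while remaining > 0:
            block = src.read(min(bufsize, remaining))
            if not block:
                break
            dst.write(block)
            remaining -= len(block)
```

The ranges come back in file order, so the first two would be written out as _common_metadata and _metadata and the rest as part-r-00000.gz.parquet and so on, following the same naming as the script.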

Bye