Reposted: InputFormat, the Input Data Formatter for a Job

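Hadoop is designed to process massive amounts of data. That data can be structured, semi-structured, or even unstructured text, and it may live in HDFS files or in a database. The core of its processing is the map-reduce model, and both map and reduce take their input and emit their output as key-value pairs, which we can regard as structured data. The input of reduce is simply the output of map, and the output types of both map and reduce are defined directly in the map and reduce functions. That leaves only the input of map. In other words: how does Hadoop wrap the input files into key-value pairs and hand them to map, and how does it cut a job's input files so that different TaskTrackers can process them at the same time? These two questions are the focus of this post: the job's input formatter (InputFormat).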

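In Hadoop's Map-Reduce design, a job's input formatter consists of two components: a record reader (RecordReader) and a splitter. The splitter cuts all of the job's input data into splits, and the job gets exactly one map task per split. The record reader reads the data in a split and wraps it, in some format, into individual key-value pairs. In the implementation the input formatter is defined as an abstract class, which leaves both how to split the input and how to read it into key-value pairs to the user, because only the user knows how the input data is organized and what key-value pairs the map function needs as input. The input formatter corresponds to the class org.apache.hadoop.mapreduce.InputFormat: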

public abstract class InputFormat<K, V> {

  /** 
   * Logically split the set of input files for the job.  
   */
  public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
  
  /**
   * Create a record reader for a given split. The framework will call
   */
  public abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;

}

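Clearly, in the InputFormat class, getSplits() lets the user define how the job's input data is split, and createRecordReader() defines how the input data is read and wrapped into key-value pairs; that is, it defines a record reader. The information about one input split (the length of the data and the DataNodes that hold it) is stored in a corresponding InputSplit object. It is also worth mentioning that when JobClient calls InputFormat's getSplits(), it wraps the returned InputSplit objects once more in JobClient.RawSplit and serializes them to a file. Next, let's look at the default implementations Hadoop ships with.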
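To make the division of labor concrete, here is a minimal toy model of the contract described above. It does not use the Hadoop API: ToySplit, ToyReader and runMapPhase are hypothetical stand-ins invented for this sketch, and the "map function" simply upper-cases each value. It only illustrates the calling pattern: one map task per split, and one reader per split producing key-value pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ToyInputFormatDemo {

    /** Hypothetical stand-in for InputSplit: a range of indices into the input list. */
    static final class ToySplit {
        final int start;
        final int length;
        ToySplit(int start, int length) { this.start = start; this.length = length; }
    }

    /** Hypothetical stand-in for RecordReader<Integer, String>. */
    static final class ToyReader {
        private final List<String> input;
        private final int end;
        private int pos;
        private Integer key;
        private String value;

        ToyReader(List<String> input, ToySplit split) {
            this.input = input;
            this.pos = split.start;
            this.end = split.start + split.length;
        }

        /** Advances to the next record; returns false once the split is exhausted. */
        boolean nextKeyValue() {
            if (pos >= end) {
                return false;
            }
            key = pos;
            value = input.get(pos);
            pos++;
            return true;
        }

        Integer getCurrentKey() { return key; }
        String getCurrentValue() { return value; }
    }

    /** Plays the role of getSplits(): cut `total` records into chunks of `splitSize`. */
    static List<ToySplit> getSplits(int total, int splitSize) {
        List<ToySplit> splits = new ArrayList<>();
        for (int start = 0; start < total; start += splitSize) {
            splits.add(new ToySplit(start, Math.min(splitSize, total - start)));
        }
        return splits;
    }

    /** Plays the role of the framework: one "map task" per split, fed by a reader. */
    static List<String> runMapPhase(List<String> input, int splitSize) {
        List<String> out = new ArrayList<>();
        for (ToySplit split : getSplits(input.size(), splitSize)) {
            ToyReader reader = new ToyReader(input, split);
            while (reader.nextKeyValue()) {
                // The toy map function: emit key=VALUE in upper case.
                out.add(reader.getCurrentKey() + "=" + reader.getCurrentValue().toUpperCase());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runMapPhase(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

In real Hadoop the splits are computed on the client and each map task runs on a different node; the loop here only mimics the order of calls.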

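As the class diagram in the original post shows, Hadoop implements a file-based splitter in the abstract class FileInputFormat, so I will focus on how it works first. Here is the source: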

protected long getFormatMinSplitSize() {
    return 1;
}

public static long getMinSplitSize(JobContext job) {
    return job.getConfiguration().getLong("mapred.min.split.size", 1L);
}

public static long getMaxSplitSize(JobContext context) {
    return context.getConfiguration().getLong("mapred.max.split.size", Long.MAX_VALUE);
}

protected long computeSplitSize(long blockSize, long minSize,long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}

public List<InputSplit> getSplits(JobContext job) throws IOException {
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // minimum allowed split size
    long maxSize = getMaxSplitSize(job); // maximum allowed split size

    // generate splits
    LOG.debug("start to split all input files for Job["+job.getJobName()+"]");
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file: listStatus(job)) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      long length = file.getLen();

      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(job, path)) { 
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize); // final size of one split for this input file
        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize, blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }
        
        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,blkLocations[blkLocations.length-1].getHosts()));
        }
      } else if (length != 0) {
        splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
      } else { 
        //Create empty hosts array for zero length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
      }
    }
    
    LOG.debug("Total # of splits in Job["+job.getJobName()+"]'s input files: " + splits.size());
    
    return splits;
  }

/* Whether a file is allowed to be split */
protected boolean isSplitable(JobContext context, Path filename) {
    return true;
}
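
The arithmetic in computeSplitSize() and the loop in getSplits() above can be tried in isolation. The following is a minimal sketch with no Hadoop dependency: the class name SplitSizeDemo and the helper splitFile are made up for illustration, and the value 1.1 for SPLIT_SLOP is an assumption taken from Hadoop's FileInputFormat, since the listing above does not show the constant's definition.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSizeDemo {

    // Assumed value: Hadoop's FileInputFormat defines SPLIT_SLOP as 1.1.
    static final double SPLIT_SLOP = 1.1;

    /** Same formula as in the listing: clamp the block size into [minSize, maxSize]. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Returns {offset, length} pairs for one splittable file, mirroring the loop in getSplits(). */
    static List<long[]> splitFile(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Keep cutting full-size splits while more than 1.1 splits' worth of data remains.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // The tail (up to 1.1 x splitSize) becomes the last split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(64 * mb, 1L, Long.MAX_VALUE);
        for (long fileSize : new long[] {200 * mb, 70 * mb}) {
            System.out.println("file of " + fileSize / mb + " MB:");
            for (long[] s : splitFile(fileSize, splitSize)) {
                System.out.println("  offset=" + s[0] / mb + " MB, length=" + s[1] / mb + " MB");
            }
        }
    }
}
```

With the default minimum and maximum the split size equals the block size, so under these assumptions a 200 MB file with 64 MB blocks yields four splits (64, 64, 64 and 8 MB), while a 70 MB file stays a single split because 70/64 is below the slop factor.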

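The splitter above supports multiple input files. Pay particular attention to how it computes the size of a split, because in many cases the split size has a decisive effect on a job's performance; at the very least, the number of splits determines the number of map tasks. Consider: if three data blocks are cut into two splits rather than three, which case spends more time on network I/O? When a job does not configure an input formatter, the default is TextInputFormat; the corresponding configuration key is mapreduce.inputformat.class.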
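
One way to reason about the network I/O question is to count, for each split, the bytes that lie outside the block containing the split's start offset, since getSplits() takes a split's hosts from that block only. The sketch below does this under simplifying assumptions: fixed-size splits, no SPLIT_SLOP handling, and every block treated as if it lived on a different node (HDFS replication may make the real picture better). The class and method names are hypothetical.

```java
public class SplitLocalityDemo {

    /**
     * Bytes of a split that fall outside the block containing its start offset.
     * These bytes may have to be fetched from another DataNode.
     */
    static long bytesOutsideFirstBlock(long offset, long length, long blockSize) {
        long firstBlockEnd = (offset / blockSize + 1) * blockSize;
        long splitEnd = offset + length;
        return Math.max(0L, splitEnd - firstBlockEnd);
    }

    /** Sums the potentially remote bytes over all fixed-size splits of one file. */
    static long potentiallyRemoteBytes(long fileLength, long splitSize, long blockSize) {
        long total = 0;
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            total += bytesOutsideFirstBlock(offset, length, blockSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;
        long fileLength = 3 * blockSize; // three blocks

        // Three splits, one per block: every split is fully inside its own block.
        System.out.println("3 splits: "
                + potentiallyRemoteBytes(fileLength, 64 * mb, blockSize) / mb + " MB potentially remote");
        // Two splits of 96 MB: each split spills over into a neighboring block.
        System.out.println("2 splits: "
                + potentiallyRemoteBytes(fileLength, 96 * mb, blockSize) / mb + " MB potentially remote");
    }
}
```

Under these assumptions, three splits keep every byte in the split's own block, while two splits of 96 MB leave 32 MB + 64 MB = 96 MB that may have to cross the network.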

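Finally, let's take LineRecordReader as an example to briefly explain how a record reader is implemented. This record reader reads a text file line by line. In each key-value pair the key is the byte offset at which the line starts in the file (not a line number, since the code calls key.set(pos)), and the value is the text of the line.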

public class LineRecordReader extends RecordReader<LongWritable, Text> {
	
  private static final Log LOG = LogFactory.getLog(LineRecordReader.class);

  private CompressionCodecFactory compressionCodecs = null;
  private long start;
  private long pos;
  private long end;
  private LineReader in;
  private int maxLineLength;
  private LongWritable key = null;
  private Text value = null;

  public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
	  	 
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
    
    start = split.getStart();
    end = start + split.getLength();
    
    final Path file = split.getPath();
    compressionCodecs = new CompressionCodecFactory(job);
    final CompressionCodec codec = compressionCodecs.getCodec(file);

    // open the file and seek to the start of the split
    FileSystem fs = file.getFileSystem(job);
    Path _inputFile = split.getPath();

    FSDataInputStream fileIn = fs.open(_inputFile);
    boolean skipFirstLine = false;
    if (codec != null) {
      in = new LineReader(codec.createInputStream(fileIn), job);
      end = Long.MAX_VALUE;
    } else {
      if (start != 0) {
        skipFirstLine = true;
        --start;
        fileIn.seek(start);
      }
      in = new LineReader(fileIn, job);
    }
    
    if (skipFirstLine) {  // skip first line and re-establish "start".
      start += in.readLine(new Text(), 0,(int)Math.min((long)Integer.MAX_VALUE, end - start));
    }
    
    this.pos = start;
  }
  
  public boolean nextKeyValue() throws IOException {
	  
    if (key == null) {
      key = new LongWritable();
    }
    key.set(pos);
    
    if (value == null) {
      value = new Text();
    } 
    int newSize = 0;
    while (pos < end) {
      newSize = in.readLine(value, maxLineLength,Math.max((int)Math.min(Integer.MAX_VALUE, end-pos), maxLineLength));
      if (newSize == 0) {
        break;
      }
      pos += newSize;
      if (newSize < maxLineLength) {
        break;
      }

      // line too long. try again
      LOG.debug("Skipped this line because the line is too long: lineLength["+newSize+"]>maxLineLength["+maxLineLength+"] at position[" + (pos - newSize)+"].");
    }
    
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
      
    } 
    else {
      return true;
    }
  }

  @Override
  public LongWritable getCurrentKey() {
    return key;
  }

  @Override
  public Text getCurrentValue() {
    return value;
  }

  /**
   * Get the progress within the split
   */
  public float getProgress() {
    if (start == end) {
      return 0.0f;
    } else {
      return Math.min(1.0f, (pos - start) / (float)(end - start));
    }
  }
  
  public synchronized void close() throws IOException {
    if (in != null) {
      in.close(); 
    }
  }
}

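In a record reader, getProgress() reports how far reading of the input has progressed, because Hadoop provides clients with an API for viewing the progress of a running job. Also, because LineRecordReader reads by line while the splitter cuts at byte offsets, a line may straddle two splits; that is why initialization includes a step that decides whether to skip the first line.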
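
The skip-first-line rule is easier to see on a small in-memory example. The sketch below imitates the logic of initialize() and nextKeyValue() on a byte array instead of an HDFS stream; it is a simplified model, not Hadoop code, and it ignores compression and maxLineLength. The class LineSplitDemo and its helpers are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineSplitDemo {

    /** Returns the index just past the next '\n' at or after pos, or data.length if there is none. */
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    /** Reads the records of one split as "byteOffset:lineText" strings. */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard everything through the first newline.
            // If the split starts exactly on a line boundary, only that newline is discarded.
            pos = endOfLine(data, start - 1);
        }
        List<String> records = new ArrayList<>();
        // A line that starts before `end` is read in full, even if it runs past `end`.
        while (pos < end && pos < data.length) {
            int next = endOfLine(data, pos);
            int textEnd = (data[next - 1] == '\n') ? next - 1 : next;
            records.add(pos + ":" + new String(data, pos, textEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // Three splits of at most 8 bytes; the cuts fall in the middle of lines.
        int[][] splits = { {0, 8}, {8, 8}, {16, 4} };
        for (int[] s : splits) {
            System.out.println("split[" + s[0] + "," + (s[0] + s[1]) + ") -> " + readSplit(data, s[0], s[1]));
        }
    }
}
```

In this example the first split reads "alpha" and "bravo" (the latter runs past byte 8), the second split skips the tail of "bravo" and reads "charlie", and the third split produces nothing, so every line is read exactly once.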

Reposted from: http://blog.csdn.net/xhh198781/article/details/7290979
