Hadoop7days-4: Implementing an Inverted Index with MapReduce
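Building an inverted index here means taking words that are spread across different files and counting how many times each word occurs in each file. The result should have the form
"hello","a.txt->3,b.txt->2,c.txt->2"
To get there we need more than one mapper class and more than one reducer class. Working backwards from the desired output is a convenient way to decide what each mapper and reducer has to do. The steps are as follows: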
The mapper's output is:
context.write("hello->a.txt","1");
context.write("hello->a.txt","1");
context.write("hello->a.txt","1");
After the shuffle this becomes:
<"hello->a.txt" , {1,1,1}>
------------------------------reducer
The reducer's input:
<"hello->a.txt" , {1,1,1}>
The reducer's output should be:
"hello","a.txt->3"
"hello","b.txt->2"
"hello","c.txt->2"
------------------------------second mapper
The mapper's input should be:
"hello","a.txt->3"
"hello","b.txt->2"
"hello","c.txt->2"
The mapper's output should be:
context.write("hello","a.txt->3");
context.write("hello","b.txt->2");
context.write("hello","c.txt->2");
After the shuffle this becomes:
<"hello",{"a.txt->3","b.txt->2","c.txt->2"}>
-----------------------------final reducer output
The reducer's input should be:
<"hello",{"a.txt->3","b.txt->2","c.txt->2"}>
The reducer's output:
context.write("hello","a.txt->3 b.txt->2 c.txt->2");
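The backward walk-through above can be checked with a small plain-Java simulation. This is only a sketch of the data flow, not the Hadoop job itself: it uses no Hadoop classes, and the class name InvertedIndexSketch, the method names, the whitespace tokenization, and the tab between key and value are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the two stages (hypothetical names, no Hadoop API).
class InvertedIndexSketch {

    // First mapper: emit one ("word->file", "1") pair per token.
    // Splitting on whitespace is an assumption about the input format.
    static List<String[]> mapStageOne(String fileName, String line) {
        List<String[]> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new String[] {word + "->" + fileName, "1"});
            }
        }
        return out;
    }

    // Shuffle + first reducer: group by "word->file", sum the 1s,
    // then move the file name from the key into the value.
    static List<String> reduceStageOne(List<String[]> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] kv : pairs) {
            counts.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] parts = e.getKey().split("->", 2); // [word, file]
            out.add(parts[0] + "\t" + parts[1] + "->" + e.getValue());
        }
        return out;
    }

    // Second mapper + shuffle + final reducer: regroup the "file->count"
    // postings by word and join them with single spaces.
    static List<String> runStageTwo(List<String> stageOneLines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : stageOneLines) {
            String[] kv = line.split("\t", 2);
            if (kv.length < 2) {
                continue; // skip lines that lack the assumed tab separator
            }
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            out.add(e.getKey() + "\t" + String.join(" ", e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(mapStageOne("a.txt", "hello hello hello"));
        pairs.addAll(mapStageOne("b.txt", "hello hello"));
        pairs.addAll(mapStageOne("c.txt", "hello hello"));
        // Prints one line: hello<TAB>a.txt->3 b.txt->2 c.txt->2
        runStageTwo(reduceStageOne(pairs)).forEach(System.out::println);
    }
}
```

The simulation keeps the postings in input order. In a real job the order in which values reach a reducer is not guaranteed, so the file order inside the final string may differ.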
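Now for the design:
The first mapper should turn each file into pairs of the form "word->name","1".
The first reducer should turn "word->name","1" into "word","name->count", where count is the sum of the 1s for that word and file. We add a combiner and let the combiner do this job.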
reducer: