Implementing an Inverted Index with Hadoop MapReduce
1. Background and requirements:
Build an inverted index (term -> document list) that counts each word's occurrences per file, and print the output in the format "word filepath->count;filepath->count;...", for example:
Aili hdfs://192.168.59.128:9000/inverseindex/b.txt->1;hdfs://192.168.59.128:9000/inverseindex/a.txt->1;
baidu hdfs://192.168.59.128:9000/inverseindex/a.txt->1;hdfs://192.168.59.128:9000/inverseindex/c.txt->1;
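The value for each term is just its per-file counts joined with "->" and ";". As a minimal plain-Java sketch of how one such output line could be assembled (the class and method names here are illustrative, not taken from the original code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexLineFormat {
    // Join per-file counts into the "word<TAB>path->n;path->n;" line format.
    static String formatLine(String word, Map<String, Integer> countsByFile) {
        StringBuilder sb = new StringBuilder(word).append("\t");
        for (Map.Entry<String, Integer> e : countsByFile.entrySet()) {
            sb.append(e.getKey()).append("->").append(e.getValue()).append(";");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("hdfs://192.168.59.128:9000/inverseindex/a.txt", 1);
        counts.put("hdfs://192.168.59.128:9000/inverseindex/c.txt", 1);
        System.out.println(formatLine("baidu", counts));
    }
}
```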
Corpus format:
a.txt:
baidu top1
aili top2
tengxun top3
xiaomi top4
ultrapower top5
java top6
python top7
b.txt:
c top2
java top1
python top5
c++ top4
aili top0
tengxun top1
c++ top5
c.txt:
java top1
baidu top2
c top3
java top0
2. Implementation
2.1 Driver (main) code
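The original driver code is not reproduced here. As a sketch of what such a main class might look like, assuming the standard `org.apache.hadoop.mapreduce` API (the component class names `InverseIndexMapper`/`InverseIndexCombiner`/`InverseIndexReducer` are placeholders, not the original names); it wires the three components together and takes the input and output paths from the command line, matching the `hadoop jar ... /inverseindex /inverseindexout` invocation later in this post:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InverseIndexDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inverse index");
        job.setJarByClass(InverseIndexDriver.class);

        job.setMapperClass(InverseIndexMapper.class);
        job.setCombinerClass(InverseIndexCombiner.class);
        job.setReducerClass(InverseIndexReducer.class);

        // In the common single-job version of this example the combiner
        // re-keys from "word->path" to "word", so map output and final
        // output are both Text/Text.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // args[0] = /inverseindex (input), args[1] = /inverseindexout (output)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This sketch only configures the job; it needs the Hadoop client jars and a running cluster to execute.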
2.2 Mapper code
2.3 Combiner code
2.4 Reducer code
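Since the component code is not reproduced above, here is a self-contained plain-Java simulation of the mapper/combiner/reducer data flow (all names are illustrative; the real job implements Hadoop's `Mapper` and `Reducer` classes): the mapper emits `"word->path" : 1` for every word, the combiner sums the counts per (word, file) and re-keys on the word alone, and the reducer joins each word's per-file counts into one output line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InverseIndexSim {
    // Mapper logic: for each word in a file, emit key "word->path", value 1.
    static List<String[]> map(String path, String contents) {
        List<String[]> out = new ArrayList<>();
        for (String line : contents.split("\n")) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) out.add(new String[]{word + "->" + path, "1"});
            }
        }
        return out;
    }

    // Combiner logic: sum counts per "word->path" key, then re-key on the
    // word so the reducer sees one group of "path->n" strings per term.
    static TreeMap<String, List<String>> combineAndGroup(List<String[]> pairs) {
        TreeMap<String, Integer> sums = new TreeMap<>();
        for (String[] kv : pairs) sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            int arrow = e.getKey().indexOf("->");
            String word = e.getKey().substring(0, arrow);
            String path = e.getKey().substring(arrow + 2);
            grouped.computeIfAbsent(word, k -> new ArrayList<>())
                   .add(path + "->" + e.getValue());
        }
        return grouped;
    }

    // Reducer logic: concatenate the per-file counts for one word.
    static String reduce(String word, List<String> fileCounts) {
        return word + "\t" + String.join(";", fileCounts) + ";";
    }

    public static void main(String[] args) {
        List<String[]> mapped = new ArrayList<>();
        mapped.addAll(map("a.txt", "baidu top1\njava top6"));
        mapped.addAll(map("c.txt", "java top1\nbaidu top2\njava top0"));
        for (Map.Entry<String, List<String>> e : combineAndGroup(mapped).entrySet())
            System.out.println(reduce(e.getKey(), e.getValue()));
    }
}
```

In the real job the combiner runs per map task, so the reducer may still receive several partial counts for the same file; the concatenation step is the same either way.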
3. Upload the files and code
Package the program into a jar, and upload the corpus files a.txt, b.txt, c.txt to a Linux directory:
4. Upload the files to HDFS and run the jar
Create the directory:
[root@naidong sbin]# hadoop fs -mkdir /inverseindex
Upload the files:
[root@naidong jurf_temp_data]# hadoop fs -put a.txt b.txt c.txt /inverseindex
[root@naidong jurf_temp_data]# hadoop jar hadoop-demo-inverseindex.jar /inverseindex /inverseindexout
2019-01-16 21:12:40,752 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2019-01-16 21:12:43,883 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2019-01-16 21:12:44,027 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1547639631603_0002
2019-01-16 21:12:48,398 INFO input.FileInputFormat: Total input files to process : 3
2019-01-16 21:12:50,021 INFO mapreduce.JobSubmitter: number of splits:3
2019-01-16 21:12:50,520 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2019-01-16 21:12:51,580 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547639631603_0002
2019-01-16 21:12:51,584 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-01-16 21:12:52,803 INFO conf.Configuration: resource-types.xml not found
2019-01-16 21:12:52,804 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-01-16 21:12:53,408 INFO impl.YarnClientImpl: Submitted application application_1547639631603_0002
2019-01-16 21:12:53,844 INFO mapreduce.Job: The url to track the job: http://naidong:8088/proxy/application_1547639631603_0002/
2019-01-16 21:12:53,845 INFO mapreduce.Job: Running job: job_1547639631603_0002
2019-01-16 21:13:28,404 INFO mapreduce.Job: Job job_1547639631603_0002 running in uber mode : false
2019-01-16 21:13:28,431 INFO mapreduce.Job: map 0% reduce 0%
2019-01-16 21:14:46,181 INFO mapreduce.Job: map 67% reduce 0%
2019-01-16 21:14:47,873 INFO mapreduce.Job: map 100% reduce 0%
2019-01-16 21:15:42,011 INFO mapreduce.Job: map 100% reduce 100%
2019-01-16 21:15:45,096 INFO mapreduce.Job: Job job_1547639631603_0002 completed successfully
2019-01-16 21:15:45,734 INFO mapreduce.Job: Counters: 53
    File System Counters
        FILE: Number of bytes read=1811
        FILE: Number of bytes written=856475
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=534
        HDFS: Number of bytes written=1683
        HDFS: Number of read operations=14
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=3
        Launched reduce tasks=1
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=451496
        Total time spent by all reduces in occupied slots (ms)=76626
        Total time spent by all map tasks (ms)=225748
        Total time spent by all reduce tasks (ms)=25542
        Total vcore-milliseconds taken by all map tasks=225748
        Total vcore-milliseconds taken by all reduce tasks=25542
        Total megabyte-milliseconds taken by all map tasks=462331904
        Total megabyte-milliseconds taken by all reduce tasks=78465024
    Map-Reduce Framework
        Map input records=18
        Map output records=36
        Map output bytes=1956
        Map output materialized bytes=1823
        Input split bytes=330
        Combine input records=36
        Combine output records=32
        Reduce input groups=18
        Reduce shuffle bytes=1823
        Reduce input records=32
        Reduce output records=18
        Spilled Records=64
        Shuffled Maps =3
        Failed Shuffles=0
        Merged Map outputs=3
        GC time elapsed (ms)=2263
        CPU time spent (ms)=10430
        Physical memory (bytes) snapshot=745897984
        Virtual memory (bytes) snapshot=12588302336
        Total committed heap usage (bytes)=436482048
        Peak Map Physical memory (bytes)=207892480
        Peak Map Virtual memory (bytes)=2748080128
        Peak Reduce Physical memory (bytes)=134983680
        Peak Reduce Virtual memory (bytes)=4361895936
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=204
    File Output Format Counters
        Bytes Written=1683
2019-01-16 21:15:55,761 WARN util.ShutdownHookManager: ShutdownHook '' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:68)
5. View the results
Extracted results:
Full document: Baidu Netdisk: 大数据资料/2019大数据资料