Using Java to Count the Distinct Chinese Characters in a Text and How Often Each One Appears

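1. Approach

I wanted to feed in an article and get back each character together with the number of times it appears in the article. That brought to mind a database table, which has a key and the value stored under that key, like this: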

key	value
key1	value1
key2	value2
…	…

So, for this problem:

Character	Occurrences
Character 1	XX times
Character 2	XX times

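In Java, this maps naturally onto the HashMap class, which stores exactly this kind of key-to-value pairing and has the added advantage of fast lookups and updates. For background, see: A Brief Introduction to HashMap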
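
To make the idea concrete, here is a minimal sketch of counting with a HashMap. The names CountSketch and countChars are hypothetical and chosen only for illustration; the sketch counts every character in the string, not only Chinese ones, and assumes Java 8 or later for Map.merge.

```java
import java.util.HashMap;
import java.util.Map;

class CountSketch {
    // Hypothetical helper: maps each character of text to the number of times it occurs.
    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            // merge stores 1 for a new key, otherwise adds 1 to the stored count
            counts.merge(text.charAt(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = countChars("道可道非常道");
        System.out.println(counts.get('道')); // 3
        System.out.println(counts.size());    // 4 distinct characters
    }
}
```

The character plays the role of the key column and the running count plays the role of the value column in the table above.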

2. Code and explanation

Here is the code first:

import java.io.*;
import java.util.*;

class Run {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis(); // start timestamp
        File fileSource = new File(args[0]);
        File fileResult = new File("result.txt");
        // Check both files before opening any stream; otherwise a missing
        // source file throws FileNotFoundException before the check runs.
        if (!fileSource.exists()) {
            System.out.println(args[0] + " does not exist!");
            System.exit(1);
        }
        if (fileResult.exists()) {
            System.out.println("result.txt already exists!");
            System.exit(1);
        }
        // New HashMap: the key is the character (Character), the value is its count (Integer)
        Map<Character, Integer> map = new HashMap<Character, Integer>();
        // Convert the input byte stream to a character stream (platform default
        // charset) and wrap it in a buffer; try-with-resources closes the reader.
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileSource)))) {
            String str;
            while ((str = bufferedReader.readLine()) != null) {
                for (int i = 0; i < str.length(); i++) {
                    char c = str.charAt(i);
                    // Encode one character at a time so the byte test always
                    // refers to the character being counted
                    byte[] bytes = String.valueOf(c).getBytes("GBK");
                    // Two GBK bytes whose lead byte is below -95 (0xA1) or
                    // above -87 (0xA9): treat the character as a Chinese character
                    if (bytes.length == 2 && (bytes[0] < -95 || bytes[0] > -87)) {
                        // Store 1 for a new character, otherwise add 1 to its count
                        map.merge(c, 1, Integer::sum);
                    }
                }
            }
        }
        List<Map.Entry<Character, Integer>> list =
                new ArrayList<Map.Entry<Character, Integer>>(map.entrySet());
        // Sort by value in descending order
        Collections.sort(list, new Comparator<Map.Entry<Character, Integer>>() {
            public int compare(Map.Entry<Character, Integer> o1, Map.Entry<Character, Integer> o2) {
                return Integer.compare(o2.getValue(), o1.getValue());
            }
        });
        // Write one "character:count" line per entry; closing the writer
        // flushes the data, without which it may never reach the file
        try (PrintWriter output = new PrintWriter(fileResult)) {
            for (Map.Entry<Character, Integer> item : list) {
                output.println(item.getKey() + ":" + item.getValue());
            }
        }
        long end = System.currentTimeMillis(); // end timestamp
        System.out.println(map.size() + " distinct Chinese characters in total");
        System.out.println("Elapsed: " + (end - start) + " ms");
    }
}

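The while loop that reads the file is the core: it decides whether each character is a Chinese character and records its count in the map. The if-test on the GBK lead byte is what separates Chinese characters from everything else.
The Collections.sort call sorts the contents of the map by count in descending order.
The final loop writes the sorted contents to a txt file for easy viewing.
See the comments for the details, and my earlier article for background: Java: Counting the Chinese Characters, Punctuation Marks, and Total Characters in an Article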
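
The lead-byte test ties the program to the GBK encoding. As an alternative sketch, and not the method used in the program above, the standard Character.UnicodeScript API can classify a code point as Han directly, independent of any byte encoding. The names HanSketch, isHan, and countHan are hypothetical; the keys are strings here so that characters outside the Basic Multilingual Plane are kept whole.

```java
import java.util.HashMap;
import java.util.Map;

class HanSketch {
    // True if Unicode assigns the code point to the Han script.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Hypothetical helper: counts only the Han characters in text.
    static Map<String, Integer> countHan(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isHan(cp)) {
                counts.merge(new String(Character.toChars(cp)), 1, Integer::sum);
            }
            // Advance by one or two chars, depending on the code point
            i += Character.charCount(cp);
        }
        return counts;
    }
}
```

With this test, full-width punctuation and Latin letters are skipped because Unicode does not assign them to the Han script, so the two approaches may disagree on a few symbols.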

3. Test results

My input was the full original text of Journey to the West (《西游记》).

The contents written automatically to the txt file:

…

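Conclusion:

The ten most frequent Chinese characters, from most to least frequent, are: “道”, “不”, “一”, “了”, “那”, “我”, “是”, “来”, “他”, “个”.
There are 4496 distinct Chinese characters in total.
Elapsed time: 100 ms.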
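
A top-ten list like this can also be produced in code rather than read off the result file. The following is a sketch only, assuming a map of counts like the one built by the program and Java 8 streams; TopSketch and topN are hypothetical names, and characters with equal counts come out in no particular order.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TopSketch {
    // Hypothetical helper: returns the n keys with the highest counts, highest first.
    static List<Character> topN(Map<Character, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<Character, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Calling topN(map, 10) on the map from the program would return the ten most frequent characters.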
