【高性能计算】任意线程数并行化快排结合Merge排序100w--10线程下只用0.06s

简述

通过这个标题也大概能猜测出算法的思想。

  • Merge操作是O(n)的(出自于MergeSort归并排序)
  • 根据线程数将数据划分为thread_count块(较为均匀点就好了)
  • 每段上用qsort(快排)
  • 之后再用一个merge将所有的快排结果合并起来

算法思想很简单,但是效果却不错。

  • 之前写过一篇用双进程写快排然后实现100w的排序只用0.2s

这里测试过用两个的情况,效果类似

PS D:\Code\C++\repo\openMP_test\x64\Debug> ./openMP_test.exe 2
reading finish
sort finished
Total time Serial: 0.223s

代码

  • 只有一个线程的时候不进行~
#include <iostream>
#include <omp.h>
#include <fstream>
#include <ctime>
using namespace std;
#pragma warning(disable : 4996)
int *a, n, thread_count, *output;

void QSort(int begin, int end) {
	if (begin >= end) return;
	// 枢轴元素就需求最前里面的那个
	int i = begin, j = end, key = a[begin]; // 坑在begin点 
	while (i < j) {
		while (i < j && a[j] >= key) --j;
		if (i < j) a[i] = a[j];
		while (i < j && a[i] <= key) ++i;
		if (i < j) a[j] = a[i];
	}
	a[i] = key;
	QSort(begin, i - 1);
	QSort(i + 1, end);
}
void sort() {
	// 分成thread_count段;
	int *keys = new int[thread_count];
	bool *finished = new bool[thread_count];
	int localn = n / thread_count, i;
	for (i = 0; i < thread_count; ++i) {
		keys[i] = i * localn; // begin keys
		finished[i] = false;
	}
#pragma omp parallel for num_threads (thread_count) default(none) shared(a, n,localn) private(i)
	for (i = 0; i < thread_count; ++i) {
		if (i < thread_count - 1) {
			QSort(i*localn, (i + 1)*localn - 1);
		}
		else {
			QSort(i*localn, n - 1);
		}
	}


	bool allfinished = false, first=true;
	int min=0, minkey_index, globalindex=0;
	while (!allfinished) {
		allfinished = false;
		first = true;
		// choose min index
		for (i = 0; i < thread_count; ++i) {
			if (finished[i]) continue;
			if (first) { 
				min = a[keys[i]]; first = false; minkey_index = i;
			} else if (a[keys[i]] < min){
				min = a[keys[i]]; minkey_index = i;
			}
		}
		if (first) { break; }
		
		output[globalindex++] = min;

		keys[minkey_index]++;
		if (minkey_index == thread_count - 1 && keys[minkey_index] == n) finished[minkey_index] = true;
		else if (minkey_index < thread_count - 1 && keys[minkey_index] == (minkey_index + 1) * localn) 
		{ finished[minkey_index] = true;  }
	}

	delete[]finished;
	delete[]keys;
}

int main(int argc, char **argv) {
	if (argc < 2) return 0;
	thread_count = strtol(argv[1], NULL, 10);
	int i;
	ifstream cin("input.txt");
	cin >> n;
	a = new int[n];
	output = new int[n];
	for (i = 0; i < n; ++i) cin >> a[i];
	cout << "reading finish\n";
	clock_t starttime, endtime;
	starttime = clock();
	sort();
	endtime = clock();
	cout << "sort finished\n";
	ofstream out("output.txt");
	for (i = 0; i < n; ++i) out << output[i] << " ";
	cout << "Total time sorted: " << (double)(endtime - starttime) / CLOCKS_PER_SEC << "s" << endl;
	delete[] a;
	delete[]output;
}
  • 调用之前写好的生成数据的脚本来生成数据
  • 然后用生成出来的exe来进行排序,后面的数值是线程数目
  • 运行代码其实并不消耗多少时间,主要的问题是在写入文件的时候消耗了点时间,IO操作嘛毕竟

【高性能计算】任意线程数并行化快排结合Merge排序100w--10线程下只用0.06s

PS D:\Code\C++\repo\openMP_test\x64\Debug> python generate.py 1000000
PS D:\Code\C++\repo\openMP_test\x64\Debug> ./openMP_test.exe 2
reading finish
sort finished
Total time Serial: 0.223s
PS D:\Code\C++\repo\openMP_test\x64\Debug> ./openMP_test.exe 10
reading finish
sort finished
Total time Serial: 0.068s
PS D:\Code\C++\repo\openMP_test\x64\Debug> ./openMP_test.exe 10
reading finish
sort finished
Total time sorted: 0.06s
PS D:\Code\C++\repo\openMP_test\x64\Debug> python check.py
right
PS D:\Code\C++\repo\openMP_test\x64\Debug>

生成数据的脚本

import sys
import numpy as np

N = int(sys.argv[1])
a = np.random.randint(- int(np.sqrt(N)), int(np.sqrt(N)), N)
c = sorted(a)
with open("input.txt", 'w') as f:
    f.write(str(N) + '\n')
    f.write(' '.join(list(map(str, a)))+'\n')
with open("ans.txt", 'w') as f:
    for i in range(N):
        f.write(str(c[i]) + ' ')

检查结果的代码

with open('ans.txt', 'r')as f, open('output.txt', 'r') as r:
    a = f.readline()
    b = r.readline()
    print('right' if a == b else 'wrong')