【高性能计算】任意线程数并行化快排结合Merge排序100w--10线程下只用0.06s
简述
通过这个标题也大概能猜测出算法的思想。
- Merge操作是O(n)的(出自于MergeSort归并排序)
- 根据线程数将数据划分为thread_count块(较为均匀点就好了)
- 每段上用qsort(快排)
- 之后再用一个merge将所有的快排结果合并起来
算法思想很简单,但是效果却不错。
- 之前写过一篇用双进程写快排然后实现100w的排序只用0.2s
这里测试过用两个的情况,效果类似
PS D:\Code\C++\repo\openMP_test\x64\Debug> ./openMP_test.exe 2
reading finish
sort finished
Total time Serial: 0.223s
代码
- 只有一个线程的时候不进行~
#include <iostream>
#include <omp.h>
#include <fstream>
#include <ctime>
using namespace std;
#pragma warning(disable : 4996)
int *a, n, thread_count, *output;
void QSort(int begin, int end) {
if (begin >= end) return;
// 枢轴元素就需求最前里面的那个
int i = begin, j = end, key = a[begin]; // 坑在begin点
while (i < j) {
while (i < j && a[j] >= key) --j;
if (i < j) a[i] = a[j];
while (i < j && a[i] <= key) ++i;
if (i < j) a[j] = a[i];
}
a[i] = key;
QSort(begin, i - 1);
QSort(i + 1, end);
}
void sort() {
// 分成thread_count段;
int *keys = new int[thread_count];
bool *finished = new bool[thread_count];
int localn = n / thread_count, i;
for (i = 0; i < thread_count; ++i) {
keys[i] = i * localn; // begin keys
finished[i] = false;
}
#pragma omp parallel for num_threads (thread_count) default(none) shared(a, n,localn) private(i)
for (i = 0; i < thread_count; ++i) {
if (i < thread_count - 1) {
QSort(i*localn, (i + 1)*localn - 1);
}
else {
QSort(i*localn, n - 1);
}
}
bool allfinished = false, first=true;
int min=0, minkey_index, globalindex=0;
while (!allfinished) {
allfinished = false;
first = true;
// choose min index
for (i = 0; i < thread_count; ++i) {
if (finished[i]) continue;
if (first) {
min = a[keys[i]]; first = false; minkey_index = i;
} else if (a[keys[i]] < min){
min = a[keys[i]]; minkey_index = i;
}
}
if (first) { break; }
output[globalindex++] = min;
keys[minkey_index]++;
if (minkey_index == thread_count - 1 && keys[minkey_index] == n) finished[minkey_index] = true;
else if (minkey_index < thread_count - 1 && keys[minkey_index] == (minkey_index + 1) * localn)
{ finished[minkey_index] = true; }
}
delete[]finished;
delete[]keys;
}
int main(int argc, char **argv) {
if (argc < 2) return 0;
thread_count = strtol(argv[1], NULL, 10);
int i;
ifstream cin("input.txt");
cin >> n;
a = new int[n];
output = new int[n];
for (i = 0; i < n; ++i) cin >> a[i];
cout << "reading finish\n";
clock_t starttime, endtime;
starttime = clock();
sort();
endtime = clock();
cout << "sort finished\n";
ofstream out("output.txt");
for (i = 0; i < n; ++i) out << output[i] << " ";
cout << "Total time sorted: " << (double)(endtime - starttime) / CLOCKS_PER_SEC << "s" << endl;
delete[] a;
delete[]output;
}
- 调用之前写好的生成数据的脚本来生成数据
- 然后用生成出来的exe来进行排序,后面的数值是线程数目
- 运行代码其实并不消耗多少时间,主要的问题是在写入文件的时候消耗了点时间,IO操作嘛毕竟
PS D:\Code\C++\repo\openMP_test\x64\Debug> python generate.py 1000000
PS D:\Code\C++\repo\openMP_test\x64\Debug> ./openMP_test.exe 2
reading finish
sort finished
Total time Serial: 0.223s
PS D:\Code\C++\repo\openMP_test\x64\Debug> ./openMP_test.exe 10
reading finish
sort finished
Total time Serial: 0.068s
PS D:\Code\C++\repo\openMP_test\x64\Debug> ./openMP_test.exe 10
reading finish
sort finished
Total time sorted: 0.06s
PS D:\Code\C++\repo\openMP_test\x64\Debug> python check.py
right
PS D:\Code\C++\repo\openMP_test\x64\Debug>
生成数据的脚本
import sys
import numpy as np
N = int(sys.argv[1])
a = np.random.randint(- int(np.sqrt(N)), int(np.sqrt(N)), N)
c = sorted(a)
with open("input.txt", 'w') as f:
f.write(str(N) + '\n')
f.write(' '.join(list(map(str, a)))+'\n')
with open("ans.txt", 'w') as f:
for i in range(N):
f.write(str(c[i]) + ' ')
检查结果的代码
with open('ans.txt', 'r')as f, open('output.txt', 'r') as r:
a = f.readline()
b = r.readline()
print('right' if a == b else 'wrong')