2013-06-04 16:13:32

分类：架构设计与优化

1.1 参考资料

《嵌入式Linux性能详解》，史子旺

https://perf.wiki.kernel.org/index.php/Main_Page，perf主页

http://www.docin.com/p-619608212.html，Linux 的系统级性能剖析工具-perf （一） - 淘宝内核组

http://www.ibm.com/developerworks/cn/linux/l-cn-perf1/，Perf -- Linux下的系统性能调优工具，第 1 部分

2 测试优化概述

2.1 性能指标

性能指标最好以数值的形式来体现，如下表所示

场景	理想(ms)	下限(ms)	优化前(ms)	优化后(ms)	提升（倍）
移动窗口延迟	0	1000

2.2 测试工具

测试工具是十分重要，影响到测试结果的准确性以及性能瓶颈定位。

测试工具要符合以下几点：

l 支持输出函数调用关系图

l 支持输出函数被调用的次数

l 支持输出调用函数的时间占所有时间的百分比

l 支持统计库函数

下面这些是可选的，有这些功能更好。

l 支持分析cache命中率

l 支持源码级分析

l 支持汇编级分析

2.3 测试优化基本流程

2.4 测试优化基础知识

程序性能的问题，有很多原因，总结如下三点：

1、程序的运算量很大，导致 CPU过于繁忙，CPU是瓶颈。

2、程序需要做大量的 I/O，读写文件、内存操作等等，CPU 更多的是处于等待，I/O部分称为程序性能的瓶颈。

3、程序之间相互等待，结果 CPU 利用率很低，但运行速度依然很慢，事务间的共享与死锁制约了程序的性能。

3 perf测试工具简介

Perf是内置于Linux内核源码树中的性能剖析工具。它基于事件采样原理，以性能事件为基础，支持针对处理器相关性能指标与操作系统相关性能指标的性能剖析。可用于性能瓶颈的查找与热点代码的定位。

通过它，应用程序可以利用 PMU，tracepoint 和内核中的特殊计数器来进行性能统计。它不但可以分析指定应用程序的性能问题 (per thread)，也可以用来分析内核的性能问题，当然也可以同时分析应用代码和内核，从而全面理解应用程序中的性能瓶颈。

最初的时候，它叫做 Performance counter，在 2.6.31 中第一次亮相。此后他成为内核开发最为活跃的一个领域。在 2.6.32 中它正式改名为 Performance Event，因为 perf 已不再仅仅作为 PMU 的抽象，而是能够处理所有的性能相关的事件。

使用 perf，您可以分析程序运行期间发生的硬件事件，比如 instructions retired ，processor clock cycles 等；您也可以分析软件事件，比如 Page Fault 和进程切换。这使得 Perf 拥有了众多的性能分析能力，举例来说，使用 Perf 可以计算每个时钟周期内的指令数，称为 IPC，IPC 偏低表明代码没有很好地利用 CPU。Perf 还可以对程序进行函数级别的采样，从而了解程序的性能瓶颈究竟在哪里等等。

3.1 移植安装perf

1. 内核支持perf，配置内核支持如下选项

CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_USE_VMALLOC=y
#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
CONFIG_PERF_COUNTERS=y

2. 编译perf

进入内核源码根目录

$cd tools/perf

$make

生成perf可执行文件，拷贝该文件即可。

3.2 优化示例

//test.c
void longa(int a[])
{
int i,j;
for(i = 0; i < 1000000; i++)
j=i;
*a = j;
}
void foo2(int a[])
{
int i;
for(i=0 ; i < 10; i++)
longa(&a[i]);
}
void foo1(int a[])
{
int i;
for(i = 0; i< 100; i++)
longa(&a[i]);
}
int main(void)
{
int a[100];
foo1(a);
foo2(a);
}

$ gcc test.c –g –o t1

3.2.1 定义性能指标

性能指标最好以数值的形式来体现，如下表所示

场景	理想(ms)	下限(ms)	优化前(ms)	优化后(ms)	提升（倍）
程序运行总时间	10	100

3.2.2 判断程序是CPU繁忙型还是IO等待型

$ perf stat ./t1

Performance counter stats for './t1':
262.738415 task-clock-msecs # 0.991 CPUs
2 context-switches # 0.000 M/sec
1 CPU-migrations # 0.000 M/sec
81 page-faults # 0.000 M/sec
9478851 cycles # 36.077 M/sec (scaled from 98.24%)
6771 instructions # 0.001 IPC (scaled from 98.99%)
111114049 branches # 422.908 M/sec (scaled from 99.37%)
8495 branch-misses # 0.008 % (scaled from 95.91%)
12152161 cache-references # 46.252 M/sec (scaled from 96.16%)
7245338 cache-misses # 27.576 M/sec (scaled from 95.49%)
0.265238069 seconds time elapsed

上面告诉我们，程序 t1 是一个 CPU bound 型，因为 task-clock-msecs 接近 1。

Task-clock-msecs：CPU 利用率，该值高，说明程序的多数时间花费在 CPU 计算上而非 IO。
Context-switches：进程切换次数，记录了程序运行过程中发生了多少次进程切换，频繁的进程切换是应该避免的。
Cache-misses：程序运行过程中总体的 cache 利用情况，如果该值过高，说明程序的 cache 利用不好
CPU-migrations：表示进程 t1 运行过程中发生了多少次 CPU 迁移，即被调度器从一个 CPU 转移到另外一个 CPU 上运行。
Cycles：处理器时钟，一条机器指令可能需要多个 cycles，
Instructions: 机器指令数目。
IPC：是 Instructions/Cycles 的比值，该值越大越好，说明程序充分利用了处理器的特性。
Cache-references:cache 命中的次数
Cache-misses:cache 失效的次数。

3.2.3 找出占用时间最多的函数

l 使用perf record对程序采样，

$ perf record –r0 –F100000 –g –f ./t1

-r 0实时采样

-F 100000 每秒采样100000次

-g 记录函数调用流程图

-f 覆盖以前的采样记录

l 输入采样结果，并打印函数调用流程图

$perf report –g

# Samples: 867913977
#
# Overhead Command Shared Object Symbol
# ........ ....... ............................................... ....
#
93.32% a.out a.out [.] longa
|
--- longa
|
|--93.01%-- foo1
| main
| __libc_start_main
| _start
|
|--6.99%-- foo2
| main
| __libc_start_main
| _start
--0.00%-- [...]
0.41% a.out [kernel.kallsyms] [k] perf_pending_interrupt
|
--- perf_pending_interrupt
|
|--91.40%-- longa
| |
| |--97.01%-- foo1
| | main
| | __libc_start_main
| | _start
| |
| --2.99%-- foo2
| main
| __libc_start_main
| _start
|
|--3.57%-- do_softirq
| irq_exit
| smp_apic_timer_interrupt
| apic_timer_interrupt
| |
| |--95.45%-- longa
| | foo1
| | main
| | __libc_start_main
| | _start
| |
| --4.55%-- do_filp_open
| do_sys_open
| sys_open
| syscall_call
| 0xb77ad994
| 0xb779d62b

从函数调用流程图可以看出，占用CPU时间最多的函数是longa（），占用程序运行时间的93.32%。 [.]表示用户态程序

93.32% a.out a.out [.] longa

3.2.4 定位修改代码

93.32% a.out a.out [.] longa
|
--- longa
|
|--93.01%-- foo1
| main
| __libc_start_main
| _start

可以看出，foo1()函数最终调用longa()函数，占用了大部份时间。因此着手分析main()函数，foo1()函数，longa()函数。

分析代码，得出foo1（）调用了100次longa(),而longa()中的for循环没有做什么有意义的事情，因此删除该循环。

void longa(int a[])
{
int i,j;
for(i = 0; i < 1000000; i++)
j=i;
*a = j;
}

修改后的代码如下

void longa(int a[])
{
*a = 999999;
}

3.2.5 重新测试验证

l 程序运行时间对比

bash-4.0# perf stat ./a.out

Performance counter stats for './a.out':
1.698595 task-clock-msecs # 0.580 CPUs
22 context-switches # 0.013 M/sec
0 CPU-migrations # 0.000 M/sec
82 page-faults # 0.048 M/sec
2339630 cycles # 1377.391 M/sec
607298 instructions # 0.260 IPC
107550 branches # 63.317 M/sec
12025 branch-misses # 11.181 %
<not counted> cache-references
<not counted> cache-misses
0.002929202 seconds time elapse

现在总的运行时间是0.002929202s，对比优化前的运行时间0.265238069s，运行速度提升了100倍。

l 占用时间最多的函数

bash-4.0# perf report -g | head-n 100

# Samples: 217690
#
# Overhead Command Shared Object Symbol
# ........ ....... ....................................... ......
#
5.51% a.out [kernel.kallsyms] [k] intel_pmu_enable_all
|
--- intel_pmu_enable_all
hw_perf_enable
perf_enable
|
|--62.88%-- ctx_sched_in
| perf_event_task_sched_in
| |
| |--99.88%-- finish_task_switch
| | schedule
| | |
| | |--50.00%-- work_resched
| | | 0xffffe430
| | | run_builtin
| | | main
| | | __libc_start_main
| | | _start
| | |
| | --50.00%-- __cond_resched
| | _cond_resched
| | wait_for_common
| | wait_for_completion
| | sched_exec
| | do_execve
| | sys_execve
| | ptregs_execve
| | 0xffffe430
| | __cmd_record
| | run_builtin
| | main
| | __libc_start_main
| | _start
| --0.12%-- [...]
|
|--37.04%-- perf_ctx_adjust_freq
| perf_event_task_tick
| scheduler_tick
| update_process_times
| tick_sched_timer
| __run_hrtimer
| hrtimer_interrupt
| smp_apic_timer_interrupt
| apic_timer_interrupt
| |
| |--84.78%-- search_binary_handler
| | do_execve
| | sys_execve
| | ptregs_execve
| | 0xffffe430

最耗时的函数是内核态程序intel_pmu_enable_all（），而是其运行时间只占程序总运行时间的5%,优化可以结束。

3.2.6 总结

场景	理想(ms)	下限(ms)	优化前(ms)	优化后(ms)	提升（倍）
程序运行总时间	10	100	265.238069	2.929202	91

Linux应用程序性能优化教程