这两天在处理一个内核资源泄漏的问题,整个分析过程比较典型,记在这里以便以后参考。

  这个问题发生在测试部同事的真实机上,于是抓内核DMP用的是sysinternals的livekd,命令行很简单:

livekd -o dump-file-path

  注:为了分析,我开了GFlags中的kernel pool tagging功能。

  分析过程如下:

 1 0: kd> !vm
 2 
 3 *** Virtual Memory Usage ***
 4  Physical Memory:      842859 (   3371,436 Kb)
 5  ...(Boxcounter:此处省略PagingFile相关输出)
 6  Available Pages:      582485 (   2329,940 Kb)
 7  ResAvail Pages:       756136 (   3024,544 Kb)
 8  Locked IO Pages:          51 (       204 Kb)
 9  Free System PTEs:      32523 (    130,092 Kb)
10  Free NP PTEs:          32767 (    131,068 Kb)
11  Free Special NP:       94800 (    379,200 Kb)[/font]
12  Modified Pages:        17626 (     70,504 Kb)
13  Modified PF Pages:     17131 (     68,524 Kb)
14  NonPagedPool Usage:     4096 (     16,384 Kb) <<
15  NonPagedPoolNx Usage:   5076 (     20,304 Kb)
16  NonPagedPool Max:      65536 (    262,144 Kb)
17  PagedPool 0 Usage:      7665 (     30,660 Kb)
18  PagedPool 1 Usage:     16753 (     67,012 Kb)
19  PagedPool 2 Usage:     16728 (     66,912 Kb)
20  PagedPool 3 Usage:     16766 (     67,064 Kb)
21  PagedPool 4 Usage:     16725 (     66,900 Kb)
22  PagedPool Usage:       74637 (    298,548 Kb) <<
23  PagedPool Maximum:     92160 (    368,640 Kb)
24  Session Commit:         2095 (      8,380 Kb)
25  Shared Commit:         13172 (     52,688 Kb)
26  Special Pool:           1175 (      4,700 Kb)
27  Shared Process:         3457 (     13,828 Kb)
28  PagedPool Commit:      74583 (    298,332 Kb) <<
29  Driver Commit:          3840 (     15,360 Kb)
30  Committed pages:      196936 (    787,744 Kb)
31  Commit limit:        2327741 (   9310,964 Kb)
32 
33  Total Private:         95118 (    380,472 Kb)
34          0c40 procexp.exe      19592 (     78,368 Kb)
35          0860 360Tray.exe      14027 (     56,108 Kb)
36          070c svchost.exe      13336 (     53,344 Kb)
37          068c explorer.exe      7362 (     29,448 Kb)
38          08b8 XDict.exe         7102 (     28,408 Kb)
39          0470 RTHDCPL.EXE       5785 (     23,140 Kb)
40          075c svchost.exe       4462 (     17,848 Kb)
41          06d4 SEye.exe          2513 (     10,052 Kb)
42          03b8 winlogon.exe      1739 (      6,956 Kb)
43          019c ZhuDongFangYu.e   1519 (      6,076 Kb)
44          ... (Boxcounter:省略剩余不重要的进程信息)

  这段输出需要关注的重点是第16、22和28行。
  看上去,nonpaged pool使用情况比较正常,主要嫌疑在paged pool上。
  看看paged pool吧。

 1 0: kd> !poolused 5
 2    Sorting by  Paged Pool Consumed
 3 
 4   Pool Used:
 5             NonPaged                    Paged
 6  Tag    Allocs    Frees     Diff     Used   Allocs    Frees     Diff     Used
 7  prpt        0        0        0        0  4191169  1396976  2794193  203,976,160    UNKNOWN pooltag 'prpt', please update pooltag.txt
 8  otpu        0        0        0        0   349352       28   349324   30,269,696    UNKNOWN pooltag 'otpu', please update pooltag.txt
 9  INTC    38799    38410      389    95952  1355200  1352980     2220    9,142,856    Intel video driver 
10  Gh05        0        0        0        0   561321   560728      593    4,213,584    GDITAG_HMGR_SPRITE_TYPE , Binary: win32k.sys
11  UlHT        0        0        0        0        1        0        1    4,198,400    Hash Table , Binary: http.sys
12  MmSt        0        0        0        0  1463662  1462178     1484    3,680,824    Mm section object prototype ptes , Binary: nt!mm
13  VMON        2        0        2       72      648        0      648    2,646,432    Volume Manager , Binary: volmgr.sys
14  CM35        0        0        0        0       39       13       26    2,629,632    Internal Configuration manager allocations , Binary: nt!cm
15  Gcac        0        0        0        0      214      157       57    2,096,640    Gdi glyph cache 
16  Ttfd        0        0        0        0     3713     3023      690    1,275,080    TrueType Font driver 
17  Ifs   7166549  7166547        2      176  5361522  5360467     1055    1,131,376    Default file system allocations (user's of ntifs.h) 
18  Ddk   9649715  9649684       31   931176     2684      790     1894    1,087,824    Default for driver allocated memory (user's of ntddk.h) 
19  CMDa        0        0        0        0   216579   214244     2335      530,480    value data cache pool tag , Binary: nt!cm
20  Ntff       18        0       18     3744    84222    83589      633      526,656    FCB_DATA , Binary: ntfs.sys
21  Gla1        1        0        1      112     3961     3652      309      494,400    GDITAG_HMGR_LOOKASIDE_DC_TYPE , Binary: win32k.sys
22  Gla5        1        0        1      112     7449     6223     1226      480,592    GDITAG_HMGR_LOOKASIDE_SURF_TYPE , Binary: win32k.sys
23  logb        0        0        0        0     1959     1954        5      413,728    UNKNOWN pooltag 'logb', please update pooltag.txt
24  pipt        0        0        0        0   353979   351340     2639      407,664    UNKNOWN pooltag 'pipt', please update pooltag.txt
25  CM25        0        0        0        0      972      947       25      327,680    Internal Configuration manager allocations , Binary: nt!cm
26  Obtb        0        0        0        0   362905   362778      127      315,376    object tables via EX handle.c , Binary: nt!ob

  重点是第7、8行这两个windbg不认识的tag。绝大部分的paged pool内存都是它们的。看看它们是谁。

C:\WINDOWS\system32\drivers>findstr /m /l prpt *sys
qutmdrv.sys

C:\WINDOWS\system32\drivers>findstr /m /l otpu *sys
qutmdrv.sys

  找到这个驱动,看了下属性,安全卫士的,该驱动的版本是7.2.1.1022。

  

  分析到这里我想起之前看过的一个类似的样例

  现在找到了一个嫌疑人,为了确认没有误伤,卸掉我的驱动重启,等了10来分钟,看内存和cpu占用都稳定了,启动纯应用层测试工具做压力测试。
  在测试前抓了个dmp,当时状态是这样的:

0: kd> .time
Debug session time: Wed Aug 15 10:37:44.500 2012 (UTC + 8:00)
System Uptime: 0 days 0:13:07.406

 1 0: kd> !vm
 2 
 3 *** Virtual Memory Usage ***
 4  Physical Memory:      842859 (   3371436 Kb)
 5  ...(Boxcounter:此处省略PagingFile相关输出)
 6  Available Pages:      703681 (   2814,724 Kb)
 7  ResAvail Pages:       757240 (   3028,960 Kb)
 8  Locked IO Pages:          50 (       200 Kb)
 9  Free System PTEs:     104176 (    416,704 Kb)
10  Free NP PTEs:          32767 (    131,068 Kb)
11  Free Special NP:       95598 (    382,392 Kb)
12  Modified Pages:         1698 (      6,792 Kb)
13  Modified PF Pages:       545 (      2,180 Kb)
14  NonPagedPool Usage:     4096 (     16,384 Kb) <<
15  NonPagedPoolNx Usage:   3164 (     12,656 Kb)
16  NonPagedPool Max:      65536 (    262,144 Kb)
17  PagedPool 0 Usage:      8987 (     35,948 Kb)
18  PagedPool 1 Usage:       934 (      3,736 Kb)
19  PagedPool 2 Usage:       934 (      3,736 Kb)
20  PagedPool 3 Usage:       923 (      3,692 Kb)
21  PagedPool 4 Usage:       944 (      3,776 Kb)
22  PagedPool Usage:       12722 (     50,888 Kb) <<
23  PagedPool Maximum:     92160 (    368,640 Kb)
24  Shared Commit:         11888 (     47,552 Kb)
25  Special Pool:            273 (      1,092 Kb)
26  Shared Process:         3503 (     14,012 Kb)
27  PagedPool Commit:      12690 (     50,760 Kb) <<
28  Driver Commit:          3807 (     15,228 Kb)
29  Committed pages:      111988 (    447,952 Kb)
30  Commit limit:        2327741 (   9310,964 Kb)

  依然着重看第15、22和27行。

  再重点看看两个嫌疑人的状态。

0: kd> !poolused 1 prpt
   Sorting by  Tag

  Pool Used:
            NonPaged                    Paged
Tag    Allocs    Frees     Diff     Used   Allocs    Frees     Diff     Used
prpt        0        0        0        0     1081      296      785    57,376  UNKNOWN pooltag 'prpt', please update pooltag.txt
otpu        0        0        0        0      184       30      154    11,104  UNKNOWN pooltag 'otpu', please update pooltag.txt

  这个输出我做了一些调整,!poolused命令不支持同时查询多个tag,我使用了两次,然后把结果合并了。

  看看经过几个小时的测试后的状态。

0: kd> .time
Debug session time: Wed Aug 15 15:26:00.875 2012 (UTC + 8:00)
System Uptime: 0 days 5:01:23.703

 1 0: kd> !vm
 2 
 3 *** Virtual Memory Usage ***
 4  Physical Memory:      842859 (   3,371,436 Kb)
 5     ...(Boxcounter:此处省略PagingFile相关输出)
 6  Available Pages:      604366 (   2,417,464 Kb)
 7  ResAvail Pages:       757320 (   3,029,280 Kb)
 8  Locked IO Pages:          50 (       200 Kb)
 9  Free System PTEs:     103840 (    415,360 Kb)
10  Free NP PTEs:          32767 (    131,068 Kb)
11  Free Special NP:       95604 (    382,416 Kb)
12  Modified Pages:         1883 (      7,532 Kb)
13  Modified PF Pages:       711 (      2,844 Kb)
14  NonPagedPool Usage:     4096 (     16,384 Kb) <<
15  NonPagedPoolNx Usage:   3144 (     12,576 Kb)
16  NonPagedPool Max:      65536 (    262,144 Kb)
17  PagedPool 0 Usage:      9867 (     39,468 Kb)
18  PagedPool 1 Usage:     11069 (     44,276 Kb)
19  PagedPool 2 Usage:     11059 (     44,236 Kb)
20  PagedPool 3 Usage:     11054 (     44,216 Kb)
21  PagedPool 4 Usage:     11077 (     44,308 Kb)
22  PagedPool Usage:       54126 (    216,504 Kb) <<
23  PagedPool Maximum:     92160 (    368,640 Kb)
24  Session Commit:         2107 (      8,428 Kb)
25  Shared Commit:         12882 (     51,528 Kb)
26  Special Pool:            267 (      1,068 Kb)
27  Shared Process:         3272 (     13,088 Kb)
28  PagedPool Commit:      54058 (    216,232 Kb) <<
29  Driver Commit:          3807 (     15,228 Kb)
30  Committed pages:      163773 (    655,092 Kb)
31  Commit limit:        2327741 (   9310,964 Kb)
32 
33  Total Private:         84063 (    336,252 Kb)
34          061c 360Tray.exe      14005 (     56,020 Kb)
35          0390 svchost.exe      13331 (     53,324 Kb)
36          0460 explorer.exe     10689 (     42,756 Kb)
37          0e18 procexp.exe       9525 (     38,100 Kb)
38          0658 XDict.exe         6997 (     27,988 Kb)
39          ...(Boxcounter:省略其余不重要输出)

  再来看看tag情况。

 1 0: kd> !poolused 5
 2    Sorting by  Paged Pool Consumed
 3 
 4   Pool Used:
 5             NonPaged                    Paged
 6  Tag    Allocs    Frees     Diff     Used   Allocs    Frees     Diff     Used
 7  prpt        0        0        0        0  2681313   893728  1787585 130,493,776 UNKNOWN pooltag 'prpt', please update pooltag.txt
 8  otpu        0        0        0        0   223536       32   223504  19,366,280 UNKNOWN pooltag 'otpu', please update pooltag.txt
 9  MmSt        0        0        0        0   740248   737979     2269  12,516,920 Mm section object prototype ptes , Binary: nt!mm
10  INTC   289851   289462      389    95952  1200152  1197885     2267   9,146,056 Intel video driver 
11  Gh05        0        0        0        0   350733   350137      596   4,870,704 GDITAG_HMGR_SPRITE_TYPE , Binary: win32k.sys
12  UlHT        0        0        0        0        1        0        1   4,198,400 Hash Table , Binary: http.sys
13  VMON        2        0        2       72      648        0      648   2,646,432 Volume Manager , Binary: volmgr.sys
14  CM35        0        0        0        0       39       13       26   2,629,632 Internal Configuration manager allocations , Binary: nt!cm
15  Gcac        0        0        0        0      343      277       66   2,356,704 Gdi glyph cache 
16  Ttfd        0        0        0        0     5625     4930      695   1,313,168 TrueType Font driver 
17  Ifs   3897847  3897844        3      240  3283105  3282050     1055   1,131,376 Default file system allocations (user's of ntifs.h) 
18  Ddk    622677   622646       31   931176     2654      760     1894   1,087,824 Default for driver allocated memory (user's of ntddk.h) 
19  Ntff       18        0       18     3744    19431    18491      940     782,080 FCB_DATA , Binary: ntfs.sys
20  pipt        0        0        0        0   229268   226053     3215     499,136 UNKNOWN pooltag 'pipt', please update pooltag.txt
21  ...(Boxcounter:省略其余不重要输出)

  第7、8行验证了前面的判断。目前不太清楚这是qutmdrv.sys的资源泄漏BUG,还是它的缓存机制。(最开始的情况更严重是因为那时已经测试约莫11个小时,而后面这次测试时间较短)

  分析过程不长,命令也就那么几个。不过确实耗了我不少时间,主要麻烦是问题不好重现和思维方式转不过弯(之前我的驱动正好刚出现过一次资源泄漏问题,所以这次一看到现象就​认为是我的驱动的问题……)。