share/doc/papers/kerntune/3.t

   1 .\" Copyright (c) 1984 M. K. McKusick
   2 .\" Copyright (c) 1984 The Regents of the University of California.
   3 .\" All rights reserved.
   4 .\"
   5 .\" Redistribution and use in source and binary forms, with or without
   6 .\" modification, are permitted provided that the following conditions
   7 .\" are met:
   8 .\" 1. Redistributions of source code must retain the above copyright
   9 .\"    notice, this list of conditions and the following disclaimer.
  10 .\" 2. Redistributions in binary form must reproduce the above copyright
  11 .\"    notice, this list of conditions and the following disclaimer in the
  12 .\"    documentation and/or other materials provided with the distribution.
  13 .\" 3. Neither the name of the University nor the names of its contributors
  14 .\"    may be used to endorse or promote products derived from this software
  15 .\"    without specific prior written permission.
  16 .\"
  17 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  18 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  19 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  20 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  21 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  22 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  23 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  24 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  25 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  26 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  27 .\" SUCH DAMAGE.
  28 .\"
  29 .\"     @(#)3.t 1.2 (Berkeley) 11/8/90
  30 .\"
  31 .ds RH Techniques for Improving Performance
  32 .NH 1
  33 Techniques for Improving Performance
  34 .PP
  35 This section gives several hints on general optimization techniques.
  36 It then proceeds with an example of how they can be
  37 applied to the 4.2BSD kernel to improve its performance.
  38 .NH 2
  39 Using the Profiler
  40 .PP
  41 The profiler is a useful tool for improving
  42 a set of routines that implement an abstraction.
  43 It can be helpful in identifying poorly coded routines,
  44 and in evaluating the new algorithms and code that replace them.
  45 Taking full advantage of the profiler
  46 requires a careful examination of the call graph profile,
  47 and a thorough knowledge of the abstractions underlying
  48 the kernel.
  49 .PP
  50 The easiest optimization that can be performed
  51 is a small change
  52 to a control construct or data structure.
  53 An obvious starting point
  54 is to expand a small frequently called routine inline.
  55 The drawback to inline expansion is that the data abstractions
  56 in the kernel may become less parameterized,
  57 hence less clearly defined.
  58 The profiling will also become less useful since the loss of
  59 routines will make its output more granular.
  60 .PP
  61 Further potential for optimization lies in routines that
  62 implement data abstractions whose total execution
  63 time is long.
  64 If the data abstraction function cannot easily be speeded up,
  65 it may be advantageous to cache its results,
  66 and eliminate the need to rerun
  67 it for identical inputs.
  68 These and other ideas for program improvement are discussed in
  69 [Bentley81].
  70 .PP
  71 This tool is best used in an iterative approach:
  72 profiling the kernel,
  73 eliminating one bottleneck,
  74 then finding some other part of the kernel
  75 that begins to dominate execution time.
  76 .PP
  77 A completely different use of the profiler is to analyze the control
  78 flow of an unfamiliar section of the kernel.
  79 By running an example that exercises the unfamiliar section of the kernel,
  80 and then using \fIgprof\fR, you can get a view of the
  81 control structure of the unfamiliar section.
  82 .NH 2
  83 An Example of Tuning
  84 .PP
  85 The first step is to come up with a method for generating
  86 profile data.
  87 We prefer to run a profiling system for about a one day
  88 period on one of our general timesharing machines.
  89 While this is not as reproducible as a synthetic workload,
  90 it certainly represents a realistic test.
  91 We have run one day profiles on several
  92 occasions over a three month period.
  93 Despite the long period of time that elapsed
  94 between the test runs the shape of the profiles,
  95 as measured by the number of times each system call
  96 entry point was called, were remarkably similar.
  97 .PP
  98 A second alternative is to write a small benchmark
  99 program to repeated exercise a suspected bottleneck.
 100 While these benchmarks are not useful as a long term profile
 101 they can give quick feedback on whether a hypothesized
 102 improvement is really having an effect.
 103 It is important to realize that the only real assurance
 104 that a change has a beneficial effect is through
 105 long term measurements of general timesharing.
 106 We have numerous examples where a benchmark program
 107 suggests vast improvements while the change
 108 in the long term system performance is negligible,
 109 and conversely examples in which the benchmark program run more slowly,
 110 but the long term system performance improves significantly.
 111 .PP
 112 An investigation of our long term profiling showed that
 113 the single most expensive function performed by the kernel
 114 is path name translation.
 115 We find that our general time sharing systems do about
 116 500,000 name translations per day.
 117 The cost of doing name translation in the original 4.2BSD
 118 is 24.2 milliseconds,
 119 representing 40% of the time processing system calls,
 120 which is 19% of the total cycles in the kernel,
 121 or 11% of all cycles executed on the machine.
 122 The times are shown in Figure 3.
 123 .KF
 124 .DS L
 125 .TS
 126 center box;
 127 l r r.
 128 part    time    % of kernel
 129 _
 130 self    14.3 ms/call    11.3%
 131 child   9.9 ms/call     7.9%
 132 _
 133 total   24.2 ms/call    19.2%
 134 .TE
 135 .ce
 136 Figure 3. Call times for \fInamei\fP.
 137 .DE
 138 .KE
 139 .PP
 140 The system measurements collected showed the
 141 pathname translation routine, \fInamei\fP,
 142 was clearly worth optimizing.
 143 An inspection of \fInamei\fP shows that
 144 it consists of two nested loops.
 145 The outer loop is traversed once per pathname component.
 146 The inner loop performs a linear search through a directory looking
 147 for a particular pathname component.
 148 .PP
 149 Our first idea was to observe that many programs
 150 step through a directory performing an operation on
 151 each entry in turn.
 152 This caused us to modify \fInamei\fP to cache
 153 the directory offset of the last pathname
 154 component looked up by a process.
 155 The cached offset is then used
 156 as the point at which a search in the same directory
 157 begins.  Changing directories invalidates the cache, as
 158 does modifying the directory.
 159 For programs that step sequentially through a directory with
 160 $N$ files, search time decreases from $O ( N sup 2 )$
 161 to $O(N)$.
 162 .PP
 163 The cost of the cache is about 20 lines of code
 164 (about 0.2 kilobytes)
 165 and 16 bytes per process, with the cached data
 166 stored in a process's \fIuser\fP vector.
 167 .PP
 168 As a quick benchmark to verify the effectiveness of the
 169 cache we ran ``ls \-l''
 170 on a directory containing 600 files.
 171 Before the per-process cache this command
 172 used 22.3 seconds of system time.
 173 After adding the cache the program used the same amount
 174 of user time, but the system time dropped to 3.3 seconds.
 175 .PP
 176 This change prompted our rerunning a profiled system
 177 on a machine containing the new \fInamei\fP.
 178 The results showed that the time in \fInamei\fP
 179 dropped by only 2.6 ms/call and
 180 still accounted for 36% of the system call time,
 181 18% of the kernel, or about 10% of all the machine cycles.
 182 This amounted to a drop in system time from 57% to about 55%.
 183 The results are shown in Figure 4.
 184 .KF
 185 .DS L
 186 .TS
 187 center box;
 188 l r r.
 189 part    time    % of kernel
 190 _
 191 self    11.0 ms/call    9.2%
 192 child   10.6 ms/call    8.9%
 193 _
 194 total   21.6 ms/call    18.1%
 195 .TE
 196 .ce
 197 Figure 4. Call times for \fInamei\fP with per-process cache.
 198 .DE
 199 .KE
 200 .PP
 201 The small performance improvement
 202 was caused by a low cache hit ratio.
 203 Although the cache was 90% effective when hit,
 204 it was only usable on about 25% of the names being translated.
 205 An additional reason for the small improvement was that
 206 although the amount of time spent in \fInamei\fP itself
 207 decreased substantially,
 208 more time was spent in the routines that it called
 209 since each directory had to be accessed twice;
 210 once to search from the middle to the end,
 211 and once to search from the beginning to the middle.
 212 .PP
 213 Most missed names were caused by path name components
 214 other than the last.
 215 Thus Robert Elz introduced a system wide cache of most recent
 216 name translations.
 217 The cache is keyed on a name and the
 218 inode and device number of the directory that contains it.
 219 Associated with each entry is a pointer to the corresponding
 220 entry in the inode table.
 221 This has the effect of short circuiting the outer loop of \fInamei\fP.
 222 For each path name component,
 223 \fInamei\fP first looks in its cache of recent translations
 224 for the needed name.
 225 If it exists, the directory search can be completely eliminated.
 226 If the name is not recognized,
 227 then the per-process cache may still be useful in
 228 reducing the directory search time.
 229 The two cacheing schemes complement each other well.
 230 .PP
 231 The cost of the name cache is about 200 lines of code
 232 (about 1.2 kilobytes)
 233 and 44 bytes per cache entry.
 234 Depending on the size of the system,
 235 about 200 to 1000 entries will normally be configured,
 236 using 10-44 kilobytes of physical memory.
 237 The name cache is resident in memory at all times.
 238 .PP
 239 After adding the system wide name cache we reran ``ls \-l''
 240 on the same directory.
 241 The user time remained the same,
 242 however the system time rose slightly to 3.7 seconds.
 243 This was not surprising as \fInamei\fP
 244 now had to maintain the cache,
 245 but was never able to make any use of it.
 246 .PP
 247 Another profiled system was created and measurements
 248 were collected over a one day period.  These measurements
 249 showed a 6 ms/call decrease in \fInamei\fP, with
 250 \fInamei\fP accounting for only 31% of the system call time,
 251 16% of the time in the kernel,
 252 or about 7% of all the machine cycles.
 253 System time dropped from 55% to about 49%.
 254 The results are shown in Figure 5.
 255 .KF
 256 .DS L
 257 .TS
 258 center box;
 259 l r r.
 260 part    time    % of kernel
 261 _
 262 self    9.5 ms/call     9.6%
 263 child   6.1 ms/call     6.1%
 264 _
 265 total   15.6 ms/call    15.7%
 266 .TE
 267 .ce
 268 Figure 5.  Call times for \fInamei\fP with both caches.
 269 .DE
 270 .KE
 271 .PP
 272 Statistics on the performance of both caches show
 273 the large performance improvement is
 274 caused by the high hit ratio.
 275 On the profiled system a 60% hit rate was observed in
 276 the system wide cache.  This, coupled with the 25%
 277 hit rate in the per-process offset cache yielded an
 278 effective cache hit rate of 85%.
 279 While the system wide cache reduces both the amount of time in
 280 the routines that \fInamei\fP calls as well as \fInamei\fP itself
 281 (since fewer directories need to be accessed or searched),
 282 it is interesting to note that the actual percentage of system
 283 time spent in \fInamei\fP itself increases even though the
 284 actual time per call decreases.
 285 This is because less total time is being spent in the kernel,
 286 hence a smaller absolute time becomes a larger total percentage.