share/doc/papers/sysperf/2.t

   1 .\" Copyright (c) 1985 The Regents of the University of California.
   2 .\" All rights reserved.
   3 .\"
   4 .\" Redistribution and use in source and binary forms, with or without
   5 .\" modification, are permitted provided that the following conditions
   6 .\" are met:
   7 .\" 1. Redistributions of source code must retain the above copyright
   8 .\"    notice, this list of conditions and the following disclaimer.
   9 .\" 2. Redistributions in binary form must reproduce the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer in the
  11 .\"    documentation and/or other materials provided with the distribution.
  12 .\" 3. All advertising materials mentioning features or use of this software
  13 .\"    must display the following acknowledgement:
  14 .\"     This product includes software developed by the University of
  15 .\"     California, Berkeley and its contributors.
  16 .\" 4. Neither the name of the University nor the names of its contributors
  17 .\"    may be used to endorse or promote products derived from this software
  18 .\"    without specific prior written permission.
  19 .\"
  20 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  21 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  22 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  23 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  24 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  25 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  26 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  27 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  28 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  29 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  30 .\" SUCH DAMAGE.
  31 .\"
  32 .\"     @(#)2.t 5.1 (Berkeley) 4/17/91
  33 .\"
  34 .ds RH Observation techniques
  35 .NH
  36 Observation techniques
  37 .PP
  38 There are many tools available for monitoring the performance
  39 of the system.
  40 Those that we found most useful are described below.
  41 .NH 2
  42 System maintenance tools
  43 .PP
  44 Several standard maintenance programs are invaluable in
  45 observing the basic actions of the system.
  46 The \fIvmstat\fP\|(1)
  47 program is designed to be an aid to monitoring
  48 systemwide activity.  Together with the
  49 \fIps\fP\|(1)
  50 command (as in ``ps av''), it can be used to investigate systemwide
  51 virtual memory activity.
  52 By running \fIvmstat\fP
  53 when the system is active you can judge the system activity in several
  54 dimensions: job distribution, virtual memory load, paging and swapping
  55 activity, and disk and cpu utilization.
  56 Ideally, for the system to be balanced while active,
  57 there should be few blocked (b) jobs,
  58 there should be little paging or swapping activity, there should
  59 be available bandwidth on the disk devices (most single arms peak
  60 out at 25-35 tps in practice), and the user cpu utilization (us) should
  61 be high (above 50%).
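.PP
The balance criteria above can be written down as a simple check.
The following is a minimal sketch, not part of \fIvmstat\fP itself:
the record layout and the limits for ``few'' blocked jobs and ``little''
paging are assumptions chosen for illustration; only the 25 tps disk figure
and the 50% user cpu figure come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class VmstatSample:
    """One interval of vmstat-style figures (hypothetical record layout)."""
    blocked: int                                   # jobs blocked (b)
    paging: int                                    # page-ins plus page-outs per second
    disk_tps: list = field(default_factory=list)   # transfers per second, per disk
    user_cpu: int = 0                              # user cpu utilization (us), percent

MAX_BLOCKED = 2      # assumption: what counts as "few" blocked jobs
MAX_PAGING = 10      # assumption: what counts as "little" paging
DISK_PEAK_TPS = 25   # from the text: single arms peak at 25-35 tps
MIN_USER_CPU = 50    # from the text: user cpu should be above 50%

def imbalances(s):
    """Return a list of the balance criteria that the sample violates."""
    problems = []
    if s.blocked > MAX_BLOCKED:
        problems.append("blocked jobs")
    if s.paging > MAX_PAGING:
        problems.append("paging")
    if any(tps >= DISK_PEAK_TPS for tps in s.disk_tps):
        problems.append("disk bandwidth")
    if s.user_cpu <= MIN_USER_CPU:
        problems.append("user cpu")
    return problems
```

.PP
An empty result suggests a balanced system; each name returned points
at the dimension to investigate first.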
  62 .PP
  63 If the system is busy, then the count of active jobs may be large,
  64 and several of these jobs may often be blocked (b).  If the virtual
  65 memory is active, then the paging demon will be running (sr will
  66 be non-zero).  It is healthy for the paging demon to free pages when
  67 the virtual memory gets active; it is triggered by the amount of free
  68 memory dropping below a threshold and increases its pace as free memory
  69 goes to zero.
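.PP
The behavior of the paging demon described above can be pictured with a
small model.
This is only a sketch: the linear ramp, the function name, and all
parameter names are assumptions made for illustration and are not taken
from the kernel source.

```python
def scan_rate(free_pages, threshold, slow_rate, fast_rate):
    """Pages examined per second by a modelled paging demon.

    The demon is idle while free memory is at or above the threshold.
    Below it, the rate rises linearly (an assumption) from slow_rate
    toward fast_rate, which is reached when free memory is zero.
    """
    if free_pages >= threshold:
        return 0.0
    free_pages = max(free_pages, 0)
    shortfall = (threshold - free_pages) / threshold   # 0.0 .. 1.0
    return slow_rate + (fast_rate - slow_rate) * shortfall
```

.PP
With a threshold of 256 pages, a slow rate of 100, and a fast rate of 1000,
the model is idle at 512 free pages, scans 550 pages per second at 128 free
pages, and scans 1000 pages per second when no memory is free.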
  70 .PP
  71 If you run \fIvmstat\fP
  72 when the system is busy (a ``vmstat 5'' gives all the
  73 numbers computed by the system), you can find
  74 imbalances by noting abnormal job distributions.  If many
  75 processes are blocked (b), then the disk subsystem
  76 is overloaded or imbalanced.  If you have several non-DMA
  77 devices or open teletype lines that are ``ringing'', or user programs
  78 that are doing high-speed non-buffered input/output, then the system
  79 time may become high (60-80% or more).
  80 It is often possible to pin down the cause of high system time by
  81 looking to see if there is excessive context switching (cs), interrupt
  82 activity (in) or system call activity (sy).  Long term measurements
  83 on one of
  84 our large machines show
  85 an average of 60 context switches and interrupts
  86 per second and an average of 90 system calls per second.
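.PP
One way to apply these long term averages is to compare a sample against
them and report the counters that stand out.
The sketch below is illustrative only: the factor of three is an arbitrary
assumption, and treating the figure of 60 as a separate baseline for
context switches and for interrupts is our reading of the measurement,
not something the measurement itself settles.

```python
# Long term averages per second from the text.  Using 60 for both
# "cs" and "in" individually is an assumption.
BASELINE = {"cs": 60, "in": 60, "sy": 90}

def suspects(sample, factor=3.0):
    """Return counters exceeding factor times baseline, worst first.

    sample maps "cs", "in" and "sy" to observed rates per second.
    """
    ratios = {name: sample[name] / BASELINE[name] for name in BASELINE}
    ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    return [name for name, ratio in ranked if ratio > factor]
```

.PP
For example, a sample with 70 context switches, 900 interrupts, and 400
system calls per second yields the interrupt and system call counters,
in that order, as the places to look.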
  87 .PP
  88 If the system is heavily loaded, or if you have little memory
  89 for your load (1 megabyte is little in our environment), then the system
  90 may be forced to swap.  This is likely to be accompanied by a noticeable
  91 reduction in the system responsiveness and long pauses when interactive
  92 jobs such as editors swap out.
  93 .PP
  94 A second important program is \fIiostat\fP\|(1).
  95 \fIIostat\fP
  96 iteratively reports the number of characters read and written to terminals,
  97 and, for each disk, the number of transfers per second, kilobytes
  98 transferred per second,
  99 and the average number of milliseconds per seek.
 100 It also gives the percentage of time the system has
 101 spent in user mode, in user mode running low priority (niced) processes,
 102 in system mode, and idling.
 103 .PP
 104 To compute this information, for each disk, seeks and data transfer completions
 105 and the number of words transferred are counted;
 106 for terminals collectively, the numbers
 107 of input and output characters are counted.
 108 Also, every 100 ms,
 109 the state of each disk is examined
 110 and a tally is made if the disk is active.
 111 From these numbers and the transfer rates
 112 of the devices it is possible to determine
 113 average seek times for each device.
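.PP
The arithmetic behind the seek figure can be sketched as follows.
The function and parameter names are hypothetical, and the sketch assumes
that all busy time not accounted for by data transfer is positioning time;
it illustrates the calculation and is not the \fIiostat\fP source.

```python
SAMPLE_INTERVAL = 0.1   # seconds; the disk state is examined every 100 ms

def average_seek_ms(busy_tallies, seeks, words, seconds_per_word):
    """Estimate the average milliseconds per seek for one disk.

    busy_tallies     -- number of 100 ms samples that found the disk active
    seeks            -- number of seeks counted
    words            -- number of words transferred
    seconds_per_word -- inverse of the device transfer rate
    """
    busy_time = busy_tallies * SAMPLE_INTERVAL
    transfer_time = words * seconds_per_word
    # Assumption: whatever busy time is not data transfer is positioning.
    positioning_time = max(busy_time - transfer_time, 0.0)
    if seeks == 0:
        return 0.0
    return 1000.0 * positioning_time / seeks
```

.PP
A disk found busy in 300 samples (30 seconds) that moved 5,000,000 words
at 2 microseconds per word spent about 10 seconds transferring data;
with 600 seeks counted, the remaining 20 seconds give roughly 33 ms per seek.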
 114 .PP
 115 When filesystems are poorly placed on the available
 116 disks, figures reported by \fIiostat\fP can be used
 117 to pinpoint bottlenecks.  Under heavy system load, disk
 118 traffic should be spread out among the drives with
 119 higher traffic expected to the devices where the root, swap, and
 120 /tmp filesystems are located.  When multiple disk drives are
 121 attached to the same controller, the system will
 122 attempt to overlap seek operations with I/O transfers.  When
 123 seeks are performed, \fIiostat\fP will show
 124 non-zero average seek times.  Most modern disk drives should
 125 exhibit an average seek time of 25-35 ms.
 126 .PP
 127 Terminal traffic reported by \fIiostat\fP should be heavily
 128 output oriented unless terminal lines are being used for
 129 data transfer by programs such as \fIuucp\fP.  Input and
 130 output rates are system specific.  Screen editors
 131 such as \fIvi\fP and \fIemacs\fP tend to exhibit output/input
 132 ratios of anywhere from 5/1 to 8/1.  On one of our largest
 133 systems, with 88 terminal lines plus 32 pseudo terminals, we observed
 134 an average of 180 characters/second input and 450 characters/second
 135 output over 4 days of operation.
 136 .NH 2
 137 Kernel profiling
 138 .PP
 139 A 4.2BSD kernel can be built to automatically
 140 collect profiling information as it operates simply by specifying the
 141 .B \-p
 142 option to \fIconfig\fP\|(8) when configuring a kernel.
 143 The program counter sampling can be driven by the system clock,
 144 or by an alternate real time clock.
 145 The latter is highly recommended as use of the system clock results
 146 in statistical anomalies in accounting for
 147 the time spent in the kernel clock routine.
 148 .PP
 149 Once a profiling system has been booted, statistics gathering is
 150 handled by \fIkgmon\fP\|(8).
 151 \fIKgmon\fP allows profiling to be started and stopped
 152 and the internal state of the profiling buffers to be dumped.
 153 \fIKgmon\fP can also be used to reset the state of the internal
 154 buffers to allow multiple experiments to be run without
 155 rebooting the machine.
 156 .PP
 157 The profiling data is processed with \fIgprof\fP\|(1)
 158 to obtain information regarding the system's operation.
 159 Profiled systems maintain histograms of the kernel program counter,
 160 the number of invocations of each routine,
 161 and a dynamic call graph of the executing system.
 162 The postprocessing propagates the time spent in each
 163 routine along the arcs of the call graph.
 164 \fIGprof\fP then generates a listing for each routine in the kernel,
 165 sorted according to the time it uses,
 166 including the time of its call graph descendants.
 167 Below each routine entry are shown its (direct) call graph children,
 168 and how their times are propagated to this routine.
 169 A similar display above the routine shows how this routine's time and the
 170 time of its descendants are propagated to its (direct) call graph parents.
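.PP
The propagation of time along call graph arcs can be illustrated with a
short sketch.
It assumes an acyclic call graph and shares a routine's total time among
its parents in proportion to their call counts; the function name and data
layout are hypothetical, and the real postprocessor must also cope with
cycles, which this sketch does not.

```python
def propagate(self_time, calls):
    """Return {routine: self time plus time inherited from descendants}.

    self_time -- {routine: seconds sampled in the routine itself}
    calls     -- {(parent, child): number of calls along that arc}
    Assumes the call graph has no cycles.
    """
    calls_into = {}     # total calls made to each routine
    children = {}       # parent -> list of (child, calls on that arc)
    for (parent, child), count in calls.items():
        calls_into[child] = calls_into.get(child, 0) + count
        children.setdefault(parent, []).append((child, count))

    total = {}

    def visit(routine):
        if routine not in total:
            time = self_time.get(routine, 0.0)
            for child, count in children.get(routine, []):
                # Each parent receives the share of the child's total
                # time matching its share of the calls to the child.
                time += visit(child) * count / calls_into[child]
            total[routine] = time
        return total[routine]

    for routine in self_time:
        visit(routine)
    return total
```

.PP
If a routine with 4 seconds of its own time is called three times by one
parent and once by another, the first parent inherits 3 seconds and the
second inherits 1 second.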
 171 .PP
 172 A profiled system is about 5-10% larger in its text space because of
 173 the calls to count the subroutine invocations.
 174 When the system executes,
 175 the profiling data is stored in a buffer that is 1.2
 176 times the size of the text space.
 177 Because all the information is summarized in memory,
 178 it is not necessary to have a trace file
 179 being continuously dumped to disk.
 180 The overhead for running a profiled system varies;
 181 under normal load we see anywhere from 5 to 25%
 182 of the system time spent in the profiling code.
 183 Thus the system is noticeably slower than an unprofiled system,
 184 yet not so slow that it cannot be used in a production environment.
 185 This is important since it allows us to gather data
 186 in a real environment rather than trying to
 187 devise synthetic work loads.
 188 .NH 2
 189 Kernel tracing
 190 .PP
 191 The kernel can be configured to trace certain operations by
 192 specifying ``options TRACE'' in the configuration file.  This
 193 forces the inclusion of code that records the occurrence of
 194 events in \fItrace records\fP in a circular buffer in kernel
 195 memory.  Events may be enabled/disabled selectively while the
 196 system is operating.  Each trace record contains a time stamp
 197 (taken from the VAX hardware time of day clock register), an
 198 event identifier, and additional information that is interpreted
 199 according to the event type.  Buffer cache operations, such as
 200 initiating a read, include
 201 the disk drive, block number, and transfer size in the trace record.
 202 Virtual memory operations, such as a pagein completing, include
 203 the virtual address and process id in the trace record.  The circular
 204 buffer is normally configured to hold 256 16-byte trace records.\**
 205 .FS
 206 \** The standard trace facilities distributed with 4.2BSD
 207 differ slightly from those described here.  The time stamp in the
 208 distributed system is calculated from the kernel's time of day
 209 variable instead of the VAX hardware register, and the buffer cache
 210 trace points do not record the transfer size.
 211 .FE
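.PP
A circular buffer of fixed-size trace records can be modelled as shown
below.
This is a sketch under stated assumptions: the record is laid out as four
32-bit fields (time stamp, event identifier, and two event-specific values),
which is a guess made to fill 16 bytes and is not the kernel's actual
structure, and the class and method names are hypothetical.

```python
import struct

# Assumed layout: time stamp, event id, two event-specific values.
RECORD = struct.Struct("<IIII")    # 4 fields of 4 bytes = 16 bytes
NRECORDS = 256                     # records held by the circular buffer

class TraceBuffer:
    def __init__(self):
        self.buf = bytearray(RECORD.size * NRECORDS)
        self.written = 0           # total records written so far
        self.enabled = set()       # events currently being traced

    def enable(self, event):
        self.enabled.add(event)

    def disable(self, event):
        self.enabled.discard(event)

    def trace(self, timestamp, event, arg1=0, arg2=0):
        """Record an event if enabled, overwriting the oldest record."""
        if event not in self.enabled:
            return
        slot = self.written % NRECORDS
        RECORD.pack_into(self.buf, slot * RECORD.size,
                         timestamp, event, arg1, arg2)
        self.written += 1

    def snapshot(self):
        """Return the surviving records as tuples, oldest first."""
        first = max(self.written - NRECORDS, 0)
        return [RECORD.unpack_from(self.buf, (i % NRECORDS) * RECORD.size)
                for i in range(first, self.written)]
```

.PP
Because old records are overwritten, a reader must sample the buffer
often enough that no more than 256 events occur between samples.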
 212 .PP
 213 Several user programs were written to sample and interpret the
 214 tracing information.  One program runs in the background and
 215 periodically reads the circular buffer of trace records.  The
 216 trace information is compressed, in some instances interpreted
 217 to generate additional information, and a summary is written to a
 218 file.  The sampling program can also record
 219 information from other kernel data structures, such as those
 220 interpreted by the \fIvmstat\fP program.  Data written out to
 221 a file is further buffered to minimize I/O load.
 222 .PP
 223 Once a trace log has been created, programs that compress
 224 and interpret the data may be run to generate graphs showing the
 225 data and relationships between traced events and
 226 system load.
 227 .PP
 228 The trace package was used mainly to investigate the operation of
 229 the file system buffer cache.  The sampling program maintained a
 230 history of read-ahead blocks and used the trace information to
 231 calculate, for example, the percentage of read-ahead blocks used.
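.PP
The read-ahead calculation can be sketched as follows.
The event names and the tuple layout are hypothetical stand-ins for the
trace records, and the sketch counts a read-ahead block as used the first
time an ordinary read refers to the same device and block.

```python
def readahead_used_percent(events):
    """Return the percentage of read-ahead blocks later read.

    events -- iterable of (kind, device, block) tuples in time order,
              where kind is "readahead" or "read" (hypothetical names).
    """
    pending = set()     # read-ahead blocks not yet referenced
    issued = 0
    used = 0
    for kind, device, block in events:
        key = (device, block)
        if kind == "readahead":
            issued += 1
            pending.add(key)
        elif kind == "read" and key in pending:
            used += 1
            pending.discard(key)
    if issued == 0:
        return 0.0
    return 100.0 * used / issued
```

.PP
A low percentage suggests that read-ahead is fetching blocks that are
discarded before any process asks for them.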
 232 .NH 2
 233 Benchmark programs
 234 .PP
 235 Benchmark programs were used in two ways.  First, a suite of
 236 programs was constructed to calculate the cost of certain basic
 237 system operations.  Operations such as system call overhead and
 238 context switching time are critically important in evaluating the
 239 overall performance of a system.  Because of the drastic changes in
 240 the system between 4.1BSD and 4.2BSD, it was important to verify
 241 that the overhead of these low-level operations had not changed appreciably.
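.PP
A benchmark of this kind typically times a trivial system call in a
tight loop.
The sketch below shows the shape of such a measurement; it is not one of
the programs in the suite, the function names are hypothetical, and in an
interpreted language the loop overhead is a large part of the result,
so a do-nothing call is timed alongside for comparison.

```python
import os
import time

def per_call_seconds(fn, iterations=100_000):
    """Return the mean wall-clock seconds per call of fn."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations

def noop():
    """Do nothing; used to estimate loop and call overhead."""
    return None

def syscall_overhead(iterations=100_000):
    """Time a cheap system call (getpid) next to an empty call.

    Returns (seconds per getpid call, seconds per empty call).
    """
    call_cost = per_call_seconds(os.getpid, iterations)
    loop_cost = per_call_seconds(noop, iterations)
    return call_cost, loop_cost
```

.PP
Running the same measurement on two versions of the system, on otherwise
idle machines, gives a rough comparison of their system call cost.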
 242 .PP
 243 The second use of benchmarks was in exercising
 244 suspected bottlenecks.
 245 When we suspected a specific problem with the system,
 246 a small benchmark program was written to repeatedly use
 247 the facility.
 248 While these benchmarks are not useful as a general tool,
 249 they can give quick feedback on whether a hypothesized
 250 improvement is really having an effect.
 251 It is important to realize that the only real assurance
 252 that a change has a beneficial effect is through
 253 long term measurements of general timesharing.
 254 We have numerous examples where a benchmark program
 255 suggests vast improvements while the change
 256 in the long term system performance is negligible,
 257 and conversely examples in which the benchmark program runs more slowly,
 258 but the long term system performance improves significantly.