share/doc/papers/malloc/performance.ms

   1 .\"
   2 .\" ----------------------------------------------------------------------------
   3 .\" "THE BEER-WARE LICENSE" (Revision 42):
   4 .\" <phk@FreeBSD.org> wrote this file.  As long as you retain this notice you
   5 .\" can do whatever you want with this stuff. If we meet some day, and you think
   6 .\" this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
   7 .\" ----------------------------------------------------------------------------
   8 .\"
   9 .\" $FreeBSD$
  10 .\"
  11 .ds RH Performance
  12 .NH
  13 Performance
  14 .PP
   15 The performance of a malloc(3) implementation comes down to two variables:
  16 .IP
   17 A: How much time it spends searching and manipulating data structures.
  18 We will refer to this as ``overhead time''.
  19 .IP
   20 B: How well it manages the storage.
  21 This rather vague metric we call ``quality of allocation''.
  22 .PP
   23 The overhead time is easy to measure: just do a lot of malloc/free calls
   24 of various kinds and combinations, and compare the results.
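.PP
A minimal sketch of such a measurement follows.
The helper name
.I time_malloc_free ,
the block sizes and the free order are assumptions chosen for illustration,
not the benchmark used for this paper.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define MAX_BLOCKS 1024

/*
 * Hypothetical helper: run `rounds` passes that allocate `n` blocks
 * (n <= MAX_BLOCKS) of mixed sizes and then free them again.
 * Returns the CPU time consumed, in seconds.
 */
double
time_malloc_free(size_t n, int rounds)
{
	void *blocks[MAX_BLOCKS];
	clock_t start, end;
	size_t i;
	int r;

	assert(n <= MAX_BLOCKS);
	start = clock();
	for (r = 0; r < rounds; r++) {
		for (i = 0; i < n; i++) {
			/* Sizes cycle through 16, 32, ..., 2048 bytes. */
			blocks[i] = malloc((size_t)16 << (i % 8));
			if (blocks[i] == NULL)
				abort();
			/* Touch the block so the allocation is really used. */
			*(volatile char *)blocks[i] = 1;
		}
		/* Free in a different order than allocated: evens, then odds. */
		for (i = 0; i < n; i += 2)
			free(blocks[i]);
		for (i = 1; i < n; i += 2)
			free(blocks[i]);
	}
	end = clock();
	return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Running the same call against two malloc(3) implementations and comparing
the returned times gives a rough figure for the overhead time; it says
nothing about the quality of allocation.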
  25 .PP
  26 The quality of allocation is not quite as simple as that.
   27 One measure of quality is the size of the process, which should obviously
  28 be minimized.
  29 Another measure is the execution time of the process.
  30 This is not an obvious indicator of quality, but people will generally
  31 agree that it should be minimized as well, and if malloc(3) can do
   32 anything to help, it should.
   33 An explanation of why it is still a good metric follows:
  34 .PP
  35 In a traditional segment/swap kernel, the desirable behavior of a process
  36 is to keep the brk(2) as low as possible, thus minimizing the size of the
  37 data/bss/heap segment, which in turn translates to a smaller process and
   38 a smaller probability of the process being swapped out, hence faster
   39 execution time on average.
  40 .PP
  41 In a paging environment this is not a bad choice for a default, but
   42 a couple of details need to be looked at much more carefully.
  43 .PP
  44 First of all, the size of a process becomes a more vague concept since
  45 only the pages that are actually used need to be in primary storage
  46 for execution to progress, and they only need to be there when used.
  47 That implies that many more processes can fit in the same amount of
  48 primary storage, since most processes have a high degree of locality
  49 of reference and thus only need some fraction of their pages to actually
  50 do their job.
  51 .PP
   52 From this it follows that the interesting size of the process is some
  53 subset of the total amount of virtual memory occupied by the process.
   54 This number isn't a constant; it varies depending on the whereabouts
  55 of the process, and it may indeed fluctuate wildly over the lifetime
  56 of the process.
  57 .PP
  58 One of the names for this vague concept is ``current working set''.
  59 It has been defined many different ways over the years, mostly to
  60 satisfy and support claims in marketing or benchmark contexts.
  61 .PP
  62 For now we can simply say that it is the number of pages the process
  63 needs in order to run at a sufficiently low paging rate in a congested
  64 primary storage.
  65 (If primary storage isn't congested, this is not really important
  66 of course, but most systems would be better off using the pages for
  67 disk-cache or similar functions, so from that perspective it will
  68 always be congested.)
  69 If the number of pages is too small, the process will wait for its
   70 pages to be read from secondary storage much of the time; if it's too
  71 big, the space could be used better for something else.
  72 .PP
  73 From the view of any single process, this number of pages is
   74 ``all of my pages'', but from the point of view of the OS it should
  75 be tuned to maximise the total throughput of all the processes on
  76 the machine at the time.
  77 This is usually done using various kinds of least-recently-used
  78 replacement algorithms to select page candidates for replacement.
  79 .PP
  80 With this knowledge, can we decide what the performance goal is for
   81 a modern malloc(3)?
  82 Well, it's almost as simple as it used to be:
  83 .B
  84 Minimize the number of pages accessed.
  85 .R
  86 .PP
  87 This really is the core of it all.
  88 If the number of accessed pages is smaller, then locality of reference is
  89 higher, and all kinds of caches (which is essentially what the
  90 primary storage is in a VM system) work better.
  91 .PP
  92 It's interesting to notice that the classical malloc fails on this one
  93 because the information about free chunks is kept with the free
  94 chunks themselves.  In some of the benchmarks this came out as all the
  95 pages being paged in every time a malloc call was made, because malloc
  96 had to traverse the free list to find a suitable chunk for the allocation.
  97 If memory is not in use, then you shouldn't access it.
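.PP
The effect can be illustrated with a small simulation.
This is a sketch under stated assumptions, not the classical malloc itself:
the 4096-byte page size, the
.I chunk
layout and both function names are hypothetical.
Each free chunk carries its own header and sits on its own page, so every
header read during a first-fit search touches another page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u		/* assumed page size, for illustration only */

/* Free-list header kept inside the free chunk itself. */
struct chunk {
	size_t size;
	struct chunk *next;
};

/*
 * First-fit search.  *touched receives the number of headers read,
 * which equals the number of pages touched when every chunk lives
 * on a page of its own.
 */
struct chunk *
first_fit(struct chunk *head, size_t want, size_t *touched)
{
	struct chunk *c;

	*touched = 0;
	for (c = head; c != NULL; c = c->next) {
		(*touched)++;
		if (c->size >= want)
			return c;
	}
	return NULL;
}

/*
 * Build a free list with one 64-byte chunk per page and a single
 * page-sized chunk at the end, search it for `want` bytes, and
 * return how many pages the search touched.
 */
size_t
pages_touched_by_search(size_t npages, size_t want)
{
	char *arena;
	struct chunk *head = NULL, *c;
	size_t i, touched;

	assert(npages > 0);
	arena = malloc(npages * PAGE_SIZE);
	if (arena == NULL)
		abort();
	/* Link from the last page to the first so the list is in address order. */
	for (i = npages; i-- > 0; ) {
		c = (struct chunk *)(arena + i * PAGE_SIZE);
		c->size = (i == npages - 1) ? PAGE_SIZE : 64;
		c->next = head;
		head = c;
	}
	(void)first_fit(head, want, &touched);
	free(arena);
	return touched;
}
```

With 64 pages, a request for 32 bytes is satisfied by the first header,
while a request for 1024 bytes reads all 64 headers and thereby touches
every page of otherwise unused memory.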
  98 .PP
  99 The secondary goal is more evident:
 100 .B
 101 Try to work in pages.
 102 .R
 103 .PP
 104 That makes it easier for the kernel, and wastes less virtual memory.
 105 Most modern implementations do this when they interact with the
 106 kernel, but few try to avoid objects spanning pages.
 107 .PP
 108 If an object's size
 109 is less than or equal to a page, there is no reason for it to span two pages.
  110 Having an object span pages means that two pages must be
  111 paged in whenever that object is accessed.
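.PP
The arithmetic behind this can be sketched as follows.
The function name
.I pages_spanned
and the fixed 4096-byte page size are assumptions for illustration; a real
implementation would ask the system for the page size, for instance with
getpagesize(3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u		/* assumed page size, for illustration only */

/*
 * Number of pages covered by an object of `size` bytes (size > 0)
 * that starts at address `addr`.
 */
size_t
pages_spanned(uintptr_t addr, size_t size)
{
	uintptr_t first, last;

	assert(size > 0);
	first = addr / PAGE_SIZE;		/* page holding the first byte */
	last = (addr + size - 1) / PAGE_SIZE;	/* page holding the last byte */
	return (size_t)(last - first + 1);
}
```

A 200-byte object placed at offset 4000 covers two pages, while the same
object placed at offset 4096 covers only one.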
 112 .PP
  113 With this analysis in hand, we can start coding.