share/doc/papers/timecounter/timecounter.ms

   1 .EQ
   2 delim øø
   3 .EN
   4 .\"
   5 .\" ----------------------------------------------------------------------------
   6 .\" "THE BEER-WARE LICENSE" (Revision 42):
   7 .\" <phk@login.dknet.dk> wrote this file.  As long as you retain this notice you
   8 .\" can do whatever you want with this stuff. If we meet some day, and you think
   9 .\" this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
  10 .\" ----------------------------------------------------------------------------
  11 .\"
  12 .\" $FreeBSD$
  13 .\"
  14 .if n .ND
  15 .TI
  16 Timecounters: Efficient and precise timekeeping in SMP kernels.
  17 .AA
  18 .A "Poul-Henning Kamp" "The FreeBSD Project"
  19 .AB
  20 .PP
  21 The FreeBSD timecounters are an architecture-independent implementation
  22 of a binary timescale using whatever hardware support is at hand
  23 for tracking time.  The binary timescale converts using simple
  24 multiplication to canonical timescales based on micro- or nano-seconds
  25 and can interface seamlessly to the NTP PLL/FLL facilities for clock
  26 synchronisation.  Timecounters are implemented using lock-less
  27 stable-storage based primitives which scale efficiently in SMP
  28 systems.  The math and implementation behind timecounters will
  29 be detailed as well as the mechanisms used for synchronisation. \**
  30 .AE
  31 .FS
  32 This paper was presented at the EuroBSDcon 2002 conference in Amsterdam.
  33 .FE
  34 .SH
  35 Introduction
  36 .PP
  37 Despite digging around for it, I have not been able to positively
  38 identify the first computer which knew the time of day.
  39 The feature probably arrived either from the commercial side
  40 so service centres could bill computer cycles to customers or from
  41 the technical side so computers could timestamp external events,
  42 but I have not been able to conclusively nail the first implementation down.
  43 .LP
  44 But there is no doubt that it happened very early in the development
  45 of computers
  46 and if systems like the ``SAGE'' [SAGE] did not know what time
  47 it was I would be amazed.
  48 .LP
  49 On the other hand, it took a long time for a real time clock to
  50 become a standard feature:
  51 .LP
  52 The ``Apple ]['' computer
  53 had neither in hardware or software any notion what time it was.
  54 .LP
  55 The original ``IBM PC'' did know what time it was, provided you typed
  56 it in when you booted it, but it forgot when you turned it off.
  57 .LP
  58 One of the ``advanced technologies'' in the ``IBM PC/AT'' was a battery
  59 backed CMOS chip which kept track of time even when the computer
  60 was powered off.
  61 .LP
  62 Today we expect our computers to know the time, and with network
  63 protocols like NTP we will usually find that they do, give and
  64 take some milliseconds.
  65 .LP
  66 This article is about the code in the FreeBSD kernel which keeps
  67 track of time.
  68 .SH
  69 Time and timescale basics
  70 .PP
  71 Despite the fact that time is the physical quantity (or maybe entity
  72 ?) about which we know the least, it is at the same time [sic!] what we
  73 can measure with the highest precision of all physical quantities.
  74 .LP
  75 The current crop of atomic clocks will neither gain nor lose a
  76 second in the next couple hundred million years, provided we
  77 stick to the preventative maintenance schedules.  This is a feat
  78 roughly in line with to knowing the circumference of the Earth
  79 with one micrometer precision, in real time.
  80 .LP
  81 While it is possible to measure time by means other than oscillations,
  82 for instance transport or consumption of a substance at a well-known
  83 rate, such designs play no practical role in time measurement because
  84 their performance is significantly inferior to oscillation based
  85 designs.
  86 .LP
  87 In other words, it is pretty fair to say that all relevant
  88 timekeeping is based on oscillating phenomena:
  89 .TS
  90 center;
  91 l l.
  92 sun-dial        Earths rotation about its axis.
  93 calendar        Ditto + Earths orbit around the sun.
  94 clockwork       Mechanical oscillation of pendulum.
  95 crystals        Mechanical resonance in quartz.
  96 atomic  Quantum-state transitions in atoms.
  97 .TE
  98 .LP
  99 We can therefore with good fidelity define ``a clock'' to be the
 100 combination of an oscillator and a counting mechanism:
 101 .LP
 102 .if t .PSPIC fig3.eps
 103 .LP
 104 The standard second is currently defined as
 105 .QP
 106 The duration of  9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium 133 atom.
 107 .LP
 108 and we have frequency standards which are able to mark a sequence of
 109 such seconds
 110 with an error less than ø2 cdot 10 sup{-15}ø [DMK2001] with commercially
 111 available products doing better than ø1 cdot 10 sup{-14}ø [AG2002].
 112 .LP
 113 Unlike other physical units with a conventionally defined origin,
 114 longitude for instance, the ephemeral nature of time prevents us
 115 from putting a stake in the ground, so to speak, and measure from
 116 there.  For measuring time we have to rely on ``dead reckoning'',
 117 just like the navigators before Harrison built his clocks [RGO2002]:
 118 We have to tally how far we went from our reference point, keeping a
 119 running total at all times, and use that as our estimated position.
 120 .LP
 121 The upshot of this is, that we cannot define a timescale by any
 122 other means than some other timescale(s).
 123 .LP
 124 ``Relative time'' is a time interval between two events, and for this
 125 we only need to agree on the rate of the oscillator.
 126 .LP
 127 ``Absolute time'' consists of a well defined point in time and the
 128 time interval since then, this is a bit more tricky.
 129 .LP
 130 The Internationally agreed upon TAI and the UTC timescales
 131 starts at (from a physics point of view) arbitrary points in time
 132 and progresses in integral intervals of the standard second, with the
 133 difference being that UTC does tricks to the counting to stay roughly
 134 in sync with Earths rotation \**.
 135 .FS
 136 The first atomic based definition actually operated in a different way:
 137 each year would have its own value determined for the frequency of the
 138 caesium resonance, selected so as to match the revolution rate of the
 139 Earth.  This resulted in time-intervals being very unwieldy business,
 140 and more and more scientists realized that that the caesium resonance
 141 was many times more stable than the angular momentum of the Earth.
 142 Eventually the new leap-second method were introduced in 1972.
 143 It is interesting to note that the autumn leaves falling on the
 144 northern hemisphere affects the angular momentum enough to change
 145 the Earths rotational rate measurably.
 146 .FE
 147 .LP
 148 TAI is defined as a sequence of standard seconds (the first timescale),
 149 counted from January 1st 1958 (the second timescale).
 150 .LP
 151 UTC is defined basically the same way, but every so often a leap-second
 152 is inserted (or theoretically deleted) to keep UTC synchronised
 153 with Earths rotation.
 154 .LP
 155 Both the implementation of these two, and a few others speciality
 156 timescales are the result of the
 157 combined efforts of several hundred atomic frequency standards in
 158 various laboratories and institutions throughout the world, all
 159 reporting to the BIPM in Paris who calculate the ``paper clock'' which
 160 TAI and UTC really are using a carefully designed weighting algorithm \**.
 161 .FS
 162 The majority of these clocks are model 5071A from Agilent (the test
 163 and measurement company formerly known as ``Hewlett-Packard'') which
 164 count for as much as 85% of the combined weight.
 165 A fact the company deservedly is proud of.
 166 The majority of the remaining weight is assigned to a handful of big
 167 custom-design units like the PTB2 and NIST7.
 168 .FE
 169 .LP
 170 Leap seconds are typically announced six to nine months in advance,
 171 based on precise observations of median transit times of stars and VLBI
 172 radio astronomy of very distant quasars.
 173 .LP
 174 The perceived wisdom of leap-seconds have been gradually decreasing
 175 in recent years, as devices and products with built-in calendar
 176 functionality becomes more and more common and people realize that
 177 user input or software upgrades are necessary to instruct the
 178 calendar functionality about upcoming leap seconds.
 179 .SH
 180 UNIX timescales
 181 .PP
 182 UNIX systems use a timescale which pretends to be UTC, but defined
 183 as the count of standard seconds since 00:00:00 01-01-1970 UTC,
 184 ignoring the leap-seconds.  This definition has never been perceived
 185 as wise.
 186 .LP
 187 Ignoring leap seconds means that unless some trickery is performed
 188 when a leap second happens on the UTC scale, UNIX clocks would be
 189 one second off.  Another implication is that the length of a
 190 time interval calculated on UNIX time_t variables, can be up to 22
 191 (and counting) seconds wrong relative to the same time interval
 192 measured on the UTC timescale.
 193 .LP
 194 Recent efforts have tried to make the NTP protocol make up for this
 195 deficiency by transmitting the UTC-TAI offset as part of the protocol.
 196 [MILLS2000A]
 197 .LP
 198 Fractional seconds are represented two ways in UNIX, ``timeval'' and
 199 ``timespec''.  Both of these formats are two-component structures
 200 which record the number of seconds, and the number of microseconds
 201 or nanoseconds respectively.
 202 .LP
 203 This unfortunate definition makes arithmetic on these two formats
 204 quite expensive to perform in terms of computer instructions:
 205 .DS
 206 .ps -1
 207 /* Subtract timeval from timespec */
 208 t3.tv_sec = t1.tv_sec - t2.tv_sec;
 209 t3.tv_nsec = t1.tv_nsec -
 210              t2.tv_usec * 1000;
 211 if (t3.tv_nsec >= 1000000000) {
 212     t3.tv_sec++;
 213     t3.tv_nsec -= 1000000000;
 214 } else if (t3.tv_nsec < 0) {
 215     t3.tv_sec--;
 216     t3.tv_nsec += 1000000000;
 217 }
 218 .ps +1
 219 .DE
 220 .LP
 221 While nanoseconds will probably be enough for most timestamping
 222 tasks faced by UNIX computers for a number of years, it is an
 223 increasingly uncomfortable situation that CPU clock periods and
 224 instruction timings are already not representable in the standard
 225 time formats available on UNIX for consumer grade hardware,
 226 and the first POSIX mandated API, \fCclock_getres(3)\fP has
 227 already effectively reached end of life as a result of this.
 228 .LP
 229 Hopefully the various standards bodies will address this issue
 230 better in the future.
 231 .SH
 232 Precision, Stability and Resolution
 233 .PP
 234 Three very important terms in timekeeping are ``precision'',
 235 ``stability'' and ``resolution''.
 236 While the three words may seem to describe somewhat the
 237 same property in most uses, their use in timekeeping covers three
 238 very distinct and well defined properties of a clock.
 239 .LP
 240 Resolution in clocks is simply a matter of the step-size of the
 241 counter or in other words: the rate at which it steps.
 242 A counter running on a 1 MHz frequency will have a resolution
 243 of 1 microsecond.
 244 .LP
 245 Precision talks about how close to the intended rate the clock runs,
 246 stability about how much the rate varies and resolution about the
 247 size of the smallest timeinterval we can measure.
 248 .LP
 249 From a quality point of view, Stability is a much more
 250 valuable property than precision, this is probably best explained
 251 using a graphic illustration of the difference between the two
 252 concepts:
 253 .LP
 254 .if t .PSPIC fig1.eps
 255 .LP
 256 In the top row we have instability, the bullet holes are spread over
 257 a large fraction of the target area.
 258 In the bottom row, the bullets all hit in a very small area.
 259 .LP
 260 On the left side, we have lack of precision, the holes obviously are
 261 not centred on the target, a systematic offset exists.
 262 In the right side we have precision, the bullets are centred on
 263 the target \**.
 264 .FS
 265 We cannot easily get resolution into this analogy, the obvious
 266 representation as the diameter of the bullet-hole is not correct,
 267 it would have to be the grid or other pattern of locations where
 268 the bullet could possibly penetrate the target material, but this
 269 gets too quantum-mechanical-oid to serve the instructional purpose.
 270 .FE
 271 .LP
 272 Transposing these four targets to actual clocks, the situation
 273 could look like the following plots:
 274 .LP
 275 .if t .PSPIC fig2.eps
 276 .LP
 277 On the x-axis we have time and on the y-axis how wrong the clock
 278 was at a given point in time.
 279 .LP
 280 The reason atomic standards are such a big deal in timekeeping is
 281 that they are incredibly stable: they are able to generate an oscillation
 282 where the period varies by roughly a millionth of a billonth of a
 283 second in long term measurements.
 284 .LP
 285 They are in fact not nearly as precise as they are stable, but as
 286 one can see from the graphic above, a stable clock which is not
 287 precise can be easily corrected for the offset and thus calibrated
 288 is as good as any clock.
 289 .LP
 290 This lack of precision is not necessarily a flaw in these kinds of
 291 devices, once you get into the ø10 cdot 10 sup{-15}ø territory
 292 things like the blackbody spectrum at the particular absolute
 293 temperature of the clocks hardware and general relativistic
 294 effects mostly dependent on the altitude above earths center
 295 has to be corrected for \**.
 296 .FS
 297 This particularly becomes an issue with space-based atomic standards
 298 as those found on the ``Navstar'' GPS satellites.
 299 .FE
 300 .SH
 301 Design goals of timecounters
 302 .PP
 303 After this brief description of the major features of the local
 304 landscape, we can look at the design goals of timecounters in detail:
 305 .LP
 306 .I "Provide timestamps in timeval and timespec formats,"
 307 .IP
 308 This is obviously the basic task we have to solve, but as was noted
 309 earlier, this is in no way the performance requirement.
 310 .LP
 311 .I "on both the ``uptime'' and the POSIX timescales,"
 312 .IP
 313 The ``uptime'' timescale is convenient for time intervals which are
 314 not anchored in UTC time: the run time of processes, the access
 315 time of disks and similar.
 316 .IP
 317 The uptime timescale counts seconds starting from when the system
 318 is booted.  The POSIX/UTC timescale is implemented by adding an
 319 estimate of the POSIX time when the system booted to the uptime
 320 timescale.
 321 .LP
 322 .I "using whatever hardware we have available at the time,"
 323 .IP
 324 Which in a subtle way also implies ``be able to switch from one
 325 piece of hardware to another on the fly'' since we may not know
 326 right up front what hardware we have access to and which is
 327 preferable to use.
 328 .LP
 329 .I "while supporting time the NTP PLL/FLL discipline code,"
 330 .IP
 331 The NTP kernel PLL/FLL code allows the local clock and timescale
 332 to be synchronised or syntonised to an external timescale either
 333 via network packets or hardware connection.  This also implies
 334 that the rate and phase of the timescale must be manoeuvrable
 335 with sufficient resolution.
 336 .LP
 337 .I "and providing support for the RFC 2783 PPS API,"
 338 .IP
 339 This is mainly for the benefit of the NTPD daemons communication
 340 with external clock or frequency hardware, but it has many other
 341 interesting uses as well [PHK2001].
 342 .LP
 343 .I "in a SMP efficient way."
 344 .IP
 345 Timestamps are used many places in the kernel and often at pretty
 346 high rate so it is important that the timekeeping facility
 347 does not become a point of CPU or lock contention.
 348 .SH
 349 Timecounter timestamp format.
 350 .PP
 351 Choosing the fundamental timestamp format for the timecounters is
 352 mostly a question of the resolution and steer-ability requirements.
 353 .LP
 354 There are two basic options on contemporary hardware: use a 32 bit
 355 integer for the fractional part of seconds, or use a 64 bit which
 356 is computationally more expensive.
 357 .LP
 358 The question therefore reduced to the somewhat simpler: can we get
 359 away with using only 32 bit ?
 360 .LP
 361 Since 32 bits fractional seconds have a resolution of slightly
 362 better than quarter of a nanosecond (.2328 nsec) it can obviously
 363 be converted to nanosecond resolution struct timespec timestamps
 364 with no loss of precision, but unfortunately not with pure 32 bit
 365 arithmetic as that would result in unacceptable rounding errors.
 366 .LP
 367 But timecounters also need to represent the clock period of the
 368 chosen hardware and this hardware might be the GHz range CPU-clock.
 369 The list of clock frequencies we could support with 32 bits are:
 370 .TS
 371 center;
 372 l l n l.
 373 ø2 sup{32} / 1ø ø=ø     4.294   GHz
 374 ø2 sup{32} / 2ø ø=ø     2.147   GHz
 375 ø2 sup{32} / 3ø ø=ø     1.432   GHz
 376 \&...
 377 ø2 sup{32} / (2 sup{32}-1)ø     ø=ø     1.000   Hz
 378 .TE
 379 We can immediately see that 32 bit is insufficient to faithfully
 380 represent clock frequencies even in the low GHz area, much less in
 381 the range of frequencies which have already been vapourwared by
 382 both IBM, Intel and AMD.
 383 QED: 32 bit fractions are not enough.
 384 .LP
 385 With 64 bit fractions the same table looks like:
 386 .TS
 387 center;
 388 l l r l.
 389 ø2 sup{64} / 1ø ø=ø     ø 18.45 cdot 10 sup{9}ø GHz
 390 ø2 sup{64} / 2ø ø=ø     ø 9.223 cdot 10 sup{9}ø GHz
 391 \&...
 392 ø2 sup{64} / 2 sup{32}ø ø=ø     4.294   GHz
 393 \&...
 394 ø2 sup{64} / (2 sup{64}-1)ø     ø=ø     1.000   Hz
 395 .TE
 396 And the resolution in the 4 GHz frequency range is approximately one Hz.
 397 .LP
 398 The following format have therefore been chosen as the basic format
 399 for timecounters operations:
 400 .DS
 401 .ps -1
 402 struct bintime {
 403     time_t  sec;
 404     uint64_t frac;
 405 };
 406 .ps +1
 407 .DE
 408 Notice that the format will adapt to any size of time_t variable,
 409 keeping timecounters safely out of the ``We SHALL prepare for the
 410 Y2.038K problem'' war zone.
 411 .LP
 412 One beauty of the bintime format, compared to the timeval and
 413 timespec formats is that it is a binary number, not a pseudo-decimal
 414 number.  If compilers and standards allowed, the representation
 415 would have been ``int128_t'' or at least ``int96_t'', but since this
 416 is currently not possible, we have to express the simple concept
 417 of multiword addition in the C language which has no concept of a
 418 ``carry bit''.
 419 .LP
 420 To add two bintime values, the code therefore looks like this \**:
 421 .FS
 422 If the reader suspects the '>' is a typo, further study is suggested.
 423 .FE
 424 .LP
 425 .DS
 426 .ps -1
 427 uint64_t u;
 428
 429 u = bt1->frac;
 430 bt3->frac = bt1->frac + bt2->frac;
 431 bt3->sec  = bt1->sec  + bt2->sec;
 432 if (u > bt3->frac)
 433     bt3->sec += 1;
 434 .ps +1
 435 .DE
 436 .LP
 437 An important property of the bintime format is that it can be
 438 converted to and from timeval and timespec formats with simple
 439 multiplication and shift operations as shown in these two
 440 actual code fragments:
 441 .DS
 442 .ps -1
 443 void
 444 bintime2timespec(struct bintime *bt,
 445                  struct timespec *ts)
 446 {
 447
 448     ts->tv_sec = bt->sec;
 449     ts->tv_nsec =
 450       ((uint64_t)1000000000 *
 451       (uint32_t)(bt->frac >> 32)) >> 32;
 452 }
 453 .ps +1
 454 .DE
 455 .DS
 456 .ps -1
 457 void
 458 timespec2bintime(struct timespec *ts,
 459                  struct bintime *bt)
 460 {
 461
 462     bt->sec = ts->tv_sec;
 463     /* 18446744073 =
 464       int(2^64 / 1000000000) */
 465     bt->frac = ts->tv_nsec *
 466       (uint64_t)18446744073LL;
 467 }
 468 .ps +1
 469 .DE
 470 .LP
 471 .SH
 472 How timecounters work
 473 .PP
 474 To produce a current timestamp the timecounter code
 475 reads the hardware counter, subtracts a reference
 476 count to find the number of steps the counter has
 477 progressed since the reference timestamp.
 478 This number of steps is multiplied with a factor
 479 derived from the counters frequency, taking into account
 480 any corrections from the NTP PLL/FLL and this product
 481 is added to the reference timestamp to get a timestamp.
 482 .LP
 483 This timestamp is on the ``uptime'' time scale, so if
 484 UNIX/UTC time is requested, the estimated time of boot is
 485 added to the timestamp and finally it is scaled to the
 486 timeval or timespec if that is the desired format.
 487 .LP
 488 A fairly large number of functions are provided to produce
 489 timestamps, depending on the desired timescale and output
 490 format:
 491 .TS
 492 center;
 493 l r r.
 494 Desired uptime  UTC/POSIX
 495 Format  timescale       timescale
 496 _
 497 bintime binuptime()     bintime()
 498 timespec        nanouptime()    nanotime()
 499 timeval microuptime()   microtime()
 500 .TE
 501 .LP
 502 Some applications need to timestamp events, but are not
 503 particular picky about the precision.
 504 In many cases a precision of tenths or hundreds of
 505 seconds is sufficient.
 506 .LP
 507 A very typical case is UNIX file timestamps:
 508 There is little point in spending computational resources getting an
 509 exact nanosecond timestamp, when the data is written to
 510 a mechanical device which has several milliseconds of unpredictable
 511 delay before the operation is completed.
 512 .LP
 513 Therefore a complementary shadow family of timestamping functions
 514 with the prefix ``get'' have been added.
 515 .LP
 516 These functions return the reference
 517 timestamp from the current timehands structure without going to the
 518 hardware to determine how much time has elapsed since then.
 519 These timestamps are known to be correct to within rate at which
 520 the periodic update runs, which in practice means 1 to 10 milliseconds.
 521 .SH
 522 Timecounter math
 523 .LP
 524 The delta-count operation is straightforward subtraction, but we
 525 need to logically AND the result with a bit-mask with the same number
 526 (or less) bits as the counter implements,
 527 to prevent higher order bits from getting set when the counter rolls over:
 528 .DS
 529 .ce
 530 .EQ
 531 Delta Count = (Count sub{now} - Count sub{ref}) ~ BITAND ~ mask
 532 .EN
 533 .DE
 534 The scaling step is straightforward.
 535 .DS
 536 .ce
 537 .EQ
 538 T sub{now} = Delta Count cdot R sub{counter} + T sub{ref}
 539 .EN
 540 .DE
 541 The scaling factor øR sub{counter}ø will be described below.
 542 .LP
 543 At regular intervals, scheduled by \fChardclock()\fP, a housekeeping
 544 routine is run which does the following:
 545 .LP
 546 A timestamp with associated hardware counter reading is elevated
 547 to be the new reference timecount:
 548 .DS
 549
 550 .ce
 551 .EQ
 552 Delta Count = (Count sub{now} - Count sub{ref}) ~ BITAND ~ mask
 553 .EN
 554
 555 .ce
 556 .EQ
 557 T sub{now} = Delta Count cdot R sub{counter}
 558 .EN
 559
 560 .ce
 561 .EQ
 562 Count sub{ref} = Count sub{now}
 563 .EN
 564
 565 .ce
 566 .EQ
 567 T sub{ref} = T sub{now}
 568 .EN
 569 .DE
 570 .LP
 571 If a new second has started, the NTP processing routines are called
 572 and the correction they return and the counters frequency is used
 573 to calculate the new scaling factor øR sub{counter}ø:
 574 .DS
 575 .ce
 576 .EQ
 577 R sub{counter} = {2 sup{64} over Freq sub{counter}} cdot ( 1 + R sub{NTP} )
 578 .EN
 579 .DE
 580 Since we only have access to 64 bit arithmetic, dividing something
 581 into ø2 sup{64}ø is a problem, so in the name of code clarity
 582 and efficiency, we sacrifice the low order bit and instead calculate:
 583 .DS
 584 .ce
 585 .EQ
 586 R sub{counter} = 2 cdot {2 sup{63} over Freq sub{counter}} cdot ( 1 + R sub{NTP} )
 587 .EN
 588 .DE
 589 The øR sub{NTP}ø correct factor arrives as the signed number of
 590 nanoseconds (with 32 bit binary fractions) to adjust per second.
 591 This quasi-decimal number is a bit of a square peg in our round binary
 592 hole, and a conversion factor is needed.
 593 Ideally we want to multiply this factor by:
 594 .DS
 595 .ce
 596 .EQ
 597 2 sup {64} over {10 sup{9} cdot 2 sup{32}} = 4.294967296
 598 .EN
 599 .DE
 600 This is not a nice number to work with.
 601 Fortunately, the precision of this correction is not critical, we are
 602 within an factor of a million of the ø10 sup{-15}ø performance level
 603 of state of the art atomic clocks, so we can use an approximation
 604 on this term without anybody noticing.
 605 .LP
 606 Deciding which fraction to use as approximation needs to carefully
 607 consider any possible overflows that could happen.
 608 In this case the correction may be as large as \(+- 5000 PPM which
 609 leaves us room to multiply with about 850 in a multiply-before-divide
 610 setting.
 611 Unfortunately, there are no good fractions which multiply with less
 612 than 850 and at the same time divide by a power of two, which is
 613 desirable since it can be implemented as a binary shift instead of
 614 an expensive full division.
 615 .LP
 616 A divide-before-multiply approximation necessarily results in a loss
 617 of lower order bits, but in this case dividing by 512 and multiplying
 618 by 2199 gives a good approximation where the lower order bit loss is
 619 not a concern:
 620 .DE
 621 .EQ
 622 2199 over 512 = 4.294921875
 623 .EN
 624 .DE
 625 The resulting error is an systematic under compensation of 10.6PPM
 626 of the requested change, or ø1.06 cdot 10 sup -14ø per nanosecond
 627 of correction.
 628 This is perfectly acceptable.
 629 .LP
 630 Putting it all together, including the one bit we put on the alter for the
 631 Goddess of code clarity, the formula looks like this:
 632 .DS
 633 .ce
 634 .EQ
 635 R sub{counter} = 2 cdot {{2 sup{63} + 2199 cdot {R sub{NTP}} over 1024} over Freq sub{counter}}
 636 .EN
 637 .DE
 638 Presented here in slightly unorthodox format to show the component arithmetic
 639 operations as they are carried out in the code.
 640 .SH
 641 Frequency of the periodic update
 642 .PP
 643 The hardware counter should have a long enough
 644 period, ie, number of distinct counter values divided by
 645 frequency, to not roll over before our periodic update function
 646 has had a chance to update the reference timestamp data.
 647 .LP
 648 The periodic update function is called from \fChardclock()\fP which
 649 runs at a rate which is controlled by the kernel parameter
 650 .I HZ .
 651 .LP
 652 By default HZ is 100 which means that only hardware with a period
 653 longer than 10 msec is usable.
 654 If HZ is configured higher than 1000, an internal divider is
 655 activated to keep the timecounter periodic update running
 656 no more often than 2000 times per second.
 657 .LP
 658 Let us take an example:
 659 At HZ=100 a 16 bit counter can run no faster than:
 660 .DS
 661 .ce
 662 .EQ
 663 2 sup{16} cdot {100 Hz} = 6.5536 MHz
 664 .EN
 665 .DE
 666 Similarly, if the counter runs at 10MHz, the minimum HZ is
 667 .DS
 668 .ce
 669 .EQ
 670 {10 MHz} over {2 sup{16}} = 152.6 Hz
 671 .EN
 672 .DE
 673 .LP
 674 Some amount of margin is of course always advisable,
 675 and a factor two is considered prudent.
 676 .LP
 677 .SH
 678 Locking, lack of ...
 679 .PP
 680 Provided our hardware can be read atomically, that our arithmetic
 681 has enough bits to not roll over and that our clock frequency is
 682 perfectly, or at least sufficiently, stable, we could avoid the
 683 periodic update function, and consequently disregard the entire
 684 issue of locking.
 685 We are seldom that lucky in practice.
 686 .LP
 687 The straightforward way of dealing with meta data updates is to
 688 put a lock of some kind on the data and grab hold of that before
 689 doing anything.
 690 This would however be a very heavy-handed approach.  First of
 691 all, the updates are infrequent compared to simple references,
 692 second it is not important which particular state of meta data
 693 a consumer gets hold of, as long as it is consistent: as long
 694 as the øCount sub{ref}ø and øT sub{ref}ø are a matching pair,
 695 and not old enough to cause an ambiguity with hardware counter
 696 rollover, a valid timestamp can be derived from them.
 697 .LP
 698 A pseudo-stable-storage with generation count method has been
 699 chosen instead.
 700 A ring of ten ``timehands'' data structures are used to hold the
 701 state of the timecounter system, the periodic update function
 702 updates the next structure with the new reference data and
 703 scaling factor and makes it the current timehands.
 704 .LP
 705 The beauty of this arrangement lies in the fact that even though
 706 a particular ``timehands'' data structure has been bumped from being
 707 the ``currents state'' by its successor, it still contains valid data
 708 for some amount of time into the future.
 709 .LP
 710 Therefore, a process which has started the timestamping process but
 711 suffered an interrupt which resulted in the above periodic processing
 712 can continue unaware of this afterwards and not suffer corruption
 713 or miscalculation even though it holds no locks on the shared
 714 meta-data.
 715 .if t .PSPIC fig4.eps
 716 .LP
 717 This scheme has an inherent risk that a process may be de-scheduled for
 718 so long time that it will not manage to complete the timestamping
 719 process before the entire ring of timehands have been recycled.
 720 This case is covered by each timehand having a private generation number
 721 which is temporarily set to zero during the periodic processing, to
 722 mark inconsistent data, and incremented to one more than the
 723 previous value when the update has finished and the timehands
 724 is again consistent.
 725 .LP
 726 The timestamping code will grab a copy of this generation number and
 727 compare this copy to the generation in the timehands after completion
 728 and if they differ it will restart the timestamping calculation.
 729 .DS
 730 .ps -1
 731 do {
 732     th = timehands;
 733     gen = th->th_generation;
 734     /* calculate timestamp */
 735 } while (gen == 0 ||
 736    gen != th->th_generation);
 737 .ps +1
 738 .DE
 739 .LP
 740 Each hardware device supporting timecounting is represented by a
 741 small data structure called a timecounter, which documents the
 742 frequency, the number of bits implemented by the counter and a method
 743 function to read the counter.
 744 .LP
 745 Part of the state in the timehands structure is a pointer to the
 746 relevant timecounter structure, this makes it possible to change
 747 to a one piece of hardware to another ``on the fly'' by updating
 748 the current timehands pointer in a manner similar to the periodic
 749 update function.
 750 .LP
 751 In practice this can be done with sysctl(8):
 752 .DS
 753 .ps -1
 754 sysctl kern.timecounter.hardware=TSC
 755 .ps +1
 756 .DE
 757 .LP
 758 at any time while the system is running.
 759 .SH
 760 Suitable hardware
 761 .PP
 762 A closer look on ``suitable hardware'' is warranted
 763 at this point.
 764 It is obvious from the above description that the ideal hardware
 765 for timecounting is a wide binary counter running at a constant
 766 high frequency
 767 and atomically readable by all CPUs in the system with a fast
 768 instruction(-sequence).
 769 .LP
 770 When looking at the hardware support on the PC platform, one
 771 is somewhat tempted to sigh deeply and mutter ``so much for theory'',
 772 because none of the above parameters seems to have been on the
 773 drawing board together yet.
 774 .LP
 775 All IBM PC derivatives contain a device more or less compatible
 776 with the venerable Intel i8254 chip.
 777 This device contains 3 counters of 16 bits each,
 778 one of which is wired so it can interrupt the CPU when the
 779 programmable terminal count is reached.
 780 .LP
 781 The problem with this device is that it only has 8bit bus-width,
 782 so reading a 16 bit timestamp takes 3 I/O operations: one to latch
 783 the count in an internal register, and two to read the high and
 784 low parts of that register respectively.
 785 .LP
 786 Obviously, on multi-CPU systems this cannot be done without some
 787 kind of locking mechanism preventing the other CPUs from trying
 788 to do the same thing at the same time.
 789 .LP
 790 Less obviously we find it is even worse than that:
 791 Since a low priority kernel thread
 792 might be reading a timestamp when an interrupt comes in, and since
 793 the interrupt thread might also attempt to generate a timestamp,
 794 we need to totally block interrupts out while doing those three
 795 I/O instructions.
 796 .LP
 797 And just to make life even more complicated, FreeBSD uses the same
 798 counter to provide the periodic interrupts which schedule the
 799 \fChardclock()\fP routine, so in addition the code has to deal with the
 800 fact that the counter does not count down from a power of two and
 801 that an interrupt is generated right after the reloading of the
 802 counter when it reaches zero.
 803 .LP
 804 Ohh, and did I mention that the interrupt rate for hardclock() will
 805 be set to a higher frequency if profiling is active ? \**
 806 .FS
 807 I will not even mention the fact that it can be set also to ridiculous
 808 high frequencies in order to be able to use the binary driven ``beep''
 809 speaker in the PC in a PCM fashion to output ``real sounds''.
 810 .FE
 811 .LP
 812 It hopefully doesn't ever get more complicated than that, but it
 813 shows, in its own bizarre and twisted way, just how little help the
 814 timecounter code needs from the actual hardware.
 815 .LP
 816 The next kind of hardware support to materialise was the ``CPU clock
 817 counter'' called ``TSC'' in official data-sheets.
 818 This is basically a on-CPU counter, which counts at the rate
 819 of the CPU clock.
 820 .LP
 821 Unfortunately, the electrical power needed to run a CPU is pretty
 822 precisely proportional with the clock frequency for the
 823 prevailing CMOS chip technology, so
 824 the advent of computers powered by batteries prompted technologies
 825 like APM, ACPI, SpeedStep and others which varies or throttles the
 826 CPU clock to match computing demand in order to minimise the power
 827 consumption \**.
 828 .FS
 829 This technology also found ways into stationary computers from
 830 two different vectors.
 831 The first vector was technical: Cheaper cooling solutions can be used
 832 for the CPU if they are employed resulting in cheaper commodity
 833 hardware.
 834 The second vector was political: For reasons beyond reason, energy
 835 conservation became an issue with personal computers, despite the fact
 836 that practically all north American households contains 4 to 5 household
 837 items which through inefficient designs waste more power than a
 838 personal computer use.
 839 .FE
 840 .LP
 841 Another wiggle for the TSC is that it is not usable on multi-CPU
 842 systems because the counter is implemented inside the CPU and
 843 not readable from other CPUs in the system.
 844 .LP
 845 The counters on different CPUs are not guaranteed
 846 to run syntonously (ie: show the same count at the same time).
 847 For some architectures like the DEC/alpha architecture they do not even
 848 run synchronously (ie: at the same rate) because the CPU clock frequency
 849 is generated by a small SAW device on the chip which is very sensitive
 850 to temperature changes.
 851 .LP
 852 The ACPI specification finally brings some light:
 853 it postulates the existence of a 24 or 32 bit
 854 counter running at a standardised constant frequency and
 855 specifically notes that this is intended to be used for timekeeping.
 856 .LP
 857 The frequency chosen, 3.5795454... MHz\**
 858 .FS
 859 The reason for this odd-ball frequency has to be sought in the ghastly
 860 colours offered by the original IBM PC Color Graphics Adapter:  It
 861 delivered NTSC format output and therefore introduced the NTSC colour
 862 sync frequency into personal computers.
 863 .FE
 864  is not quite as high as one
 865 could have wished for, but it is certainly a big improvement over
 866 the i8254 hardware in terms of access path.
 867 .LP
 868 But trust it to Murphys Law: The majority of implementations so far
 869 have failed to provide latching suitable to avoid meta-stability
 870 problems, and several readings from the counter is necessary to
 871 get a reliable timestamp.
 872 In difference from the i8254 mentioned above, we do not need to
 873 any locking while doing so, since each individual read is atomic.
 874 .LP
 875 An initialization routine tries to test if the ACPI counter is properly
 876 latched by examining the width of a histogram over read delta-values.
 877 .LP
 878 Other architectures are similarly equipped with means for timekeeping,
 879 but generally more carefully thought out compared to the haphazard
 880 developments of the IBM PC architecture.
 881 .LP
 882 One final important wiggle of all this, is that it may not be possible
 883 to determine which piece of hardware is best suited for clock
 884 use until well into or even after the bootstrap process.
 885 .LP
 886 One example of this is the Loran-C receiver designed by Prof. Dave Mills
 887 [MILLS1992]
 888 which is unsuitable as timecounter until the daemon program which
 889 implements the software-half of the receiver has properly initialised
 890 and locked onto a Loran-C signal.
 891 .SH
 892 Ideal timecounter hardware
 893 .LP
 894 As proof of concept, a sort of an existentialist protest against
 895 the sorry state describe above, the author undertook a project to
 896 prove that it is possible to do better than that, since none of
 897 the standard hardware offered a way to fully validate the timecounter
 898 design.
 899 .LP
 900 Using a COTS product, ``HOT1'', from Virtual Computers Corporation
 901 [VCC2002] containing a FPGA chip on a PCI form factor card, a 26
 902 bit timecounter running at 100MHz was successfully implemented.
 903 .LP
 904 .if t .PSPIC fig5.eps
 905 .LP
 906 .LP
 907 In order to show that timestamping does not necessarily have to
 908 be done using unpredictable and uncalibratable interrupts, an
 909 array of latches were implemented as well, which allow up to 10
 910 external signals to latch the reading of the counter when
 911 an external PPS signal transitions from logic high to logic
 912 low or vice versa.
 913 .LP
 914 Using this setup, a standard 133 MHz Pentium based PC is able to
 915 timestamp the PPS output of the Motorola UT+ GPS receiver with
 916 a precision of \(+- 10 nanoseconds \(+- one count which in practice
 917 averages out to roughly \(+- 15 nanoseconds\**:
 918 .FS
 919 The reason the plot does not show a very distinct 10 nanosecond
 920 quantization is that the GPS receiver produces the PPS signal from
 921 a clock with a roughly 55 nanosecond period and then predicts in
 922 the serial data stream how many nanoseconds this will be offset
 923 from the ideal time.
 924 This plot shows the timestamps corrected for this ``negative
 925 sawtooth correction''.
 926 .FE
 927 .LP
 928 .if t .PSPIC gps.ps
 929 .LP
 930 It shold be noted that the author is no hardware wizard and
 931 a number of issues in the implementation results in less than
 932 ideal noise performance.
 933 .LP
 934 Now compare this to ``ideal'' timecounter to the normal setup
 935 where the PPS signal is used
 936 to trigger an interrupt via the DCD pin on a serial port, and
 937 the interrupt handler calls \fCnanotime()\fP to timestamp
 938 the external event \**:
 939 .FS
 940 In both cases, the computers clock frequency controlled
 941 with a Rubidium Frequency standard.
 942 The average quality of crystals used for computers would
 943 totally obscure the curves due to their temperature coefficient.
 944 .FE
 945 .LP
 946 .if t .PSPIC intr.ps
 947 .LP
 948 It is painfully obvious that the interrupt latency is the
 949 dominant noise factor in PPS timestamping in the second case.
 950 The asymetric distribution of the noise in the second plot
 951 also more or less entirely invalidates the design assumption
 952 in the NTP PLL/FLL kernel code that timestamps are dominated
 953 by gaussian noise with few spikes.
 954 .SH
 955 Status and availability
 956 .PP
 957 The timecounter code has been developed and used in FreeBSD
 958 for a number of years and has now reached maturity.
 959 The source-code is located almost entirely in the kernel source file
 960 kern_tc.c, with a few necessary adaptations in code which
 961 interfaces to it, primarily the NTP PLL/FLL code.
 962 .LP
 963 The code runs on all FreeBSD platforms including i386, alpha,
 964 PC98, sparc64, ia64 and s/390 and contains no wordsize or
 965 endianess issues not specifically handled in the sourcecode.
 966 .LP
 967 The timecounter implementation is distributed under the ``BSD''
 968 open source license or the even more free ``Beer-ware'' license.
 969 .LP
 970 While the ability to accurately model and compensate for
 971 inaccuracies typical of atomic frequency standards are not
 972 catering to the larger userbase, but this ability and precision
 973 of the code guarntees solid support for the widespread deployment
 974 of NTP as a time synchronization protocol, without rounding
 975 or accumulative errors.
 976 .LP
 977 Adding support for new hardware and platforms have been
 978 done several times by other developers without any input from the
 979 author, so this particular aspect of timecounters design
 980 seems to work very well.
 981 .SH
 982 Future work
 983 .PP
 984 At this point in time, no specific plans exist for further
 985 development of the timecounters code.
 986 .LP
 987 Various micro-optimizations, mostly to compensate for inadequate
 988 compiler optimization could be contemplated, but the author
 989 resists these on the basis that they significantly decrease
 990 the readability of the source code.
 991 .SH
 992 Acknowledgements
 993 .PP
 994 .EQ
 995 delim ññ
 996 .EN
 997 The author would like to thank:
 998 .LP
 999 Bruce Evans
1000 for his invaluable assistance
1001 in taming the evil i8254 timecounter, as well as the enthusiastic
1002 resistance he has provided throughout.
1003 .PP
1004 Professor Dave Mills of University of Delaware for his work on
1005 NTP, for lending out the neglected twin Loran-C receiver and for
1006 picking up the glove when timecounters made it clear
1007 that the old ``microkernel'' NTP timekeeping code were not up to snuff
1008 [MILLS2000B].
1009 .PP
1010 Tom Van Baak for helping out, despite the best efforts of the National
1011 Danish Posts center for Customs and Dues to prevent it.
1012 .PP
1013 Corby Dawson for helping with the care and feeding for caesium standards.
1014 .PP
1015 The staff at the NELS Loran-C control station in Bø, Norway for providing
1016 information about step-changes.
1017 .PP
1018 The staff at NELS Loran-C station Eiðe, Faeroe
1019 Islands for permission to tour their installation.
1020 .PP
1021 The FreeBSD users for putting up with ``micro uptime went backwards''.
1022 .SH
1023 References
1024 .LP
1025 [AG2002]
1026 Published specifications for Agilent model 5071A Primary Frequency
1027 Standard on
1028 .br
1029 http://www.agilent.com
1030 .LP
1031 [DMK2001]
1032 "Accuracy Evaluation of a Cesium Fountain Primary Frequency Standard at NIST."
1033 D. M. Meekhof, S. R. Jefferts, M. Stephanovic, and T. E. Parker
1034 IEEE Transactions on instrumentation and measurement, VOL. 50, NO. 2,
1035 APRIL 2001.
1036 .LP
1037 [PHK2001]
1038 "Monitoring Natural Gas Usage"
1039 Poul-Henning Kamp
1040 http://phk.freebsd.dk/Gasdims/
1041 .LP
1042 [MILLS1992]
1043 "A computer-controlled LORAN-C receiver for precision timekeeping."
1044 Mills, D.L.
1045 Electrical Engineering Department Report 92-3-1, University of Delaware, March 1992, 63 pp.
1046 .LP
1047 [MILLS2000A]
1048 Levine, J., and D. Mills. "Using the Network Time Protocol to transmit International Atomic Time (TAI)". Proc. Precision Time and Time Interval (PTTI) Applications and Planning Meeting (Reston VA, November 2000), 431-439.
1049 .LP
1050 [MILLS2000B]
1051 "The nanokernel."
1052 Mills, D.L., and P.-H. Kamp.
1053 Proc. Precision Time and Time Interval (PTTI) Applications and Planning Meeting (Reston VA, November 2000), 423-430.
1054 .LP
1055 [RGO2002]
1056 For an introduction to Harrison and his clocks, see for
1057 instance
1058 .br
1059 http://www.rog.nmm.ac.uk/museum/harrison/
1060 .br
1061 or for
1062 a more detailed and possibly better researched account: Dava
1063 Sobels excellent book, "Longitude: The True Story of a Lone
1064 Genius Who Solved the Greatest Scientific Problem of His
1065 Time" Penguin USA (Paper); ISBN: 0140258795.
1066 .LP
1067 [SAGE]
1068 This ``gee-wiz'' kind of article in Dr. Dobbs Journal is a good place to
1069 start:
1070 .br
1071 http://www.ddj.com/documents/s=1493/ddj0001hc/0085a.htm
1072 .LP
1073 [VCC2002]
1074 Please consult Virtual Computer Corporations homepage:
1075 .br
1076 http://www.vcc.com