MFC: r260457

The changes in r233781 (MFCed to stable/9 in r235515) attempted to make
logging during a machine check exception more readable.  In practice they
prevented all logging during a machine check exception on at least some
systems.  Specifically, when an uncorrected ECC error is detected in a DIMM
on a Nehalem/Westmere class machine, all CPUs receive a machine check
exception, but only CPUs on the same package as the memory controller for
the erroring DIMM log an error.  The CPUs on the other package would complete
the scan of their machine check banks and panic before the first set of CPUs
could log an error.  The end result was a clearer display during the panic
(no interleaved messages), but a crashdump without any useful info about
the error that occurred.

To handle this case, make all CPUs spin in the machine check handler
once they have completed their scan of their machine check banks until
at least one machine check error is logged.  I tried using a DELAY()
instead so that the CPUs would not potentially hang forever, but that
was not reliable in testing.
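
The ordering this spin is meant to enforce can be sketched as a small
userland simulation. This is not the kernel change itself: the sim_*
names, the step-based scheduler, and the round budget are all
hypothetical, chosen only to show that no simulated CPU reaches the
panic step before at least one error record has been logged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each CPU is a tiny state machine. */
enum sim_cpu_state { SIM_SCANNING, SIM_WAITING, SIM_PANICKED };

struct sim_cpu {
	enum sim_cpu_state state;
	bool has_error;		/* this CPU's banks hold the error */
	int scan_steps_left;	/* steps until its bank scan finishes */
};

struct sim_machine {
	int logged;		/* error records logged so far */
	int logged_at_panic;	/* value of logged at first panic, -1 if none */
};

/* Advance one simulated CPU by a single step. */
static void
sim_step(struct sim_machine *m, struct sim_cpu *c)
{
	switch (c->state) {
	case SIM_SCANNING:
		if (c->scan_steps_left > 0) {
			c->scan_steps_left--;
			break;
		}
		if (c->has_error)
			m->logged++;	/* scan done, record the error */
		c->state = SIM_WAITING;
		break;
	case SIM_WAITING:
		/* Spin: stay here until some CPU has logged an error. */
		if (m->logged == 0)
			break;
		if (m->logged_at_panic < 0)
			m->logged_at_panic = m->logged;
		c->state = SIM_PANICKED;
		break;
	case SIM_PANICKED:
		break;
	}
}

/*
 * Step all CPUs round-robin for max_rounds rounds.  Returns the number
 * of records that had been logged when the first CPU panicked, or -1
 * if no CPU ever got past the spin.
 */
static int
sim_run(struct sim_cpu *cpus, size_t ncpus, int max_rounds)
{
	struct sim_machine m = { 0, -1 };

	assert(cpus != NULL);
	for (int r = 0; r < max_rounds; r++)
		for (size_t i = 0; i < ncpus; i++)
			sim_step(&m, &cpus[i]);
	return (m.logged_at_panic);
}
```

In this model a CPU with nothing to report and a short scan waits for
the slower CPU that holds the error.  If no CPU ever logs anything,
the spin never ends, which is the potential hang mentioned above.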

While here, don't clear MCIP from MSR_MCG_STATUS before invoking panic.
Only clear it if the machine check handler does not panic and returns
to the interrupted thread.
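
A minimal sketch of that control flow, assuming a plain variable as a
stand-in for MSR_MCG_STATUS and a boolean return value as a stand-in
for panic().  The sim_* names are hypothetical; only the position of
the MCIP bit (bit 2) comes from the architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MCIP is bit 2 of the machine check global status register. */
#define	SIM_MCG_STATUS_MCIP	0x4ULL

/*
 * Hypothetical handler model.  *mcg_status stands in for the MSR.
 * Returns true when the handler would panic, false when it would
 * return to the interrupted thread.
 */
static bool
sim_mc_handler(uint64_t *mcg_status, bool unrecoverable)
{
	assert(mcg_status != NULL);

	/* Hardware sets MCIP when the exception is delivered. */
	*mcg_status |= SIM_MCG_STATUS_MCIP;

	if (unrecoverable)
		return (true);	/* panic path: MCIP is left set */

	/* Only the return path clears MCIP. */
	*mcg_status &= ~SIM_MCG_STATUS_MCIP;
	return (false);
}
```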

MFC: r263113

Correct type for malloc().

Submitted by: "Conrad Meyer" <conrad.meyer@isilon.com>

MFC: r269052, r269239, r269242

Intel desktop Haswell CPUs may report benign corrected parity errors (see
HSD131 erratum in [1]) at a considerable rate. So filter these by default,
unless logging is enabled. Unfortunately, there really is no better way to
reasonably implement suppressing these errors than to just skip them
in mca_log(). Given that they are reported for bank 0, they'd need to be
masked in MSR_MC0_CTL. However, P6 family processors require that register
to be set to either all 0s or all 1s, disabling way more than the one error
in question when using all 0s there. Alternatively, it could be masked for
the corresponding CMCI, but that still wouldn't keep the periodic scanner
from detecting these spurious errors. Apart from that, the register contents
of MSR_MC0_CTL{,2} don't seem to be publicly documented, either in the Intel
Architectures Developer's Manual or in the Haswell datasheets.
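
The kind of check this implies at the top of a logging routine might
look like the sketch below.  It is not the committed code: the sim_*
types, the log_spurious toggle, and in particular the status mask and
value are placeholders for illustration.  The real match pattern should
be taken from the diff itself.  Family 0x6, model 0x3c is assumed here
to identify desktop Haswell.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified views of a CPU identity and an MCA record. */
struct sim_cpu_ident {
	bool is_intel;
	unsigned family;
	unsigned model;
};

struct sim_mca_record {
	int bank;
	uint64_t status;
};

/* Placeholder pattern: low 16 bits hold the MCA error code. */
#define	SIM_SPURIOUS_MASK	0xffffULL
#define	SIM_SPURIOUS_VALUE	0x0005ULL	/* illustrative only */

/*
 * Return true if the record should be dropped before logging.
 * log_spurious models an administrative toggle that re-enables
 * logging of these records.
 */
static bool
sim_skip_record(const struct sim_cpu_ident *id,
    const struct sim_mca_record *rec, bool log_spurious)
{
	assert(id != NULL && rec != NULL);

	if (log_spurious)
		return (false);
	return (id->is_intel && id->family == 0x6 && id->model == 0x3c &&
	    rec->bank == 0 &&
	    (rec->status & SIM_SPURIOUS_MASK) == SIM_SPURIOUS_VALUE);
}
```

Filtering at this point covers records from both the CMCI path and the
periodic scanner, since in this sketch both are assumed to funnel
through the same logging routine.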

Note that while HSD131 actually is only about C0-stepping as of revision
014 of the Intel desktop 4th generation processor family specification
update, these corrected errors also have been observed with D0-stepping
aka "Haswell Refresh".

1: http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf

Reviewed by: jhb
Sponsored by: Bally Wulff Games & Entertainment GmbH

git-svn-id: svn://svn.freebsd.org/base/stable/9@269593 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f

author	marius <marius@ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f>
	Tue, 5 Aug 2014 16:30:13 +0000 (16:30 +0000)
committer	marius <marius@ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f>
	Tue, 5 Aug 2014 16:30:13 +0000 (16:30 +0000)
commit	d6fe970ae4401ac842ed30af99809023b1947e6b
tree	ded9f0ba46142d8303709af8837c6d339bfaeab7
parent	bd130232e2fa637e4052defa3b445b238ab56655