CyberLeo.Net >> Repos - FreeBSD/stable/10.git/commit

MFC: r260457

The changes in r233781 attempted to make logging during a machine check
exception more readable.  In practice they prevented all logging during
a machine check exception on at least some systems.  Specifically, when
an uncorrected ECC error is detected in a DIMM on a Nehalem/Westmere
class machine, all CPUs receive a machine check exception, but only
CPUs on the same package as the memory controller for the erroring DIMM
log an error.  The CPUs on the other package would complete the scan of
their machine check banks and panic before the first set of CPUs could
log an error.  The end result was a clearer display during the panic
(no interleaved messages), but a crashdump without any useful info about
the error that occurred.

To handle this case, make all CPUs spin in the machine check handler
once they have completed their scan of their machine check banks until
at least one machine check error is logged.  I tried using a DELAY()
instead so that the CPUs would not potentially hang forever, but that
was not reliable in testing.

While here, don't clear MCIP from MSR_MCG_STATUS before invoking panic.
Only clear it if the machine check handler does not panic and returns
to the interrupted thread.

MFC: r263113

Correct type for malloc().

Submitted by: "Conrad Meyer" <conrad.meyer@isilon.com>

MFC: r269052, r269239, r269242

Intel desktop Haswell CPUs may report benign corrected parity errors (see
HSD131 erratum in [1]) at a considerable rate. So filter these (default),
unless logging is enabled. Unfortunately, there really is no better way to
reasonably implement suppressing these errors than to just skipping them
in mca_log(). Given that they are reported for bank 0, they'd need to be
masked in MSR_MC0_CTL. However, P6 family processors require that register
to be set to either all 0s or all 1s, disabling way more than the one error
in question when using all 0s there. Alternatively, it could be masked for
the corresponding CMCI, but that still wouldn't keep the periodic scanner
from detecting these spurious errors. Apart from that, register contents of
MSR_MC0_CTL{,2} don't seem to be publicly documented, neither in the Intel
Architectures Developer's Manual nor in the Haswell datasheets.

Note that while HSD131 actually is only about C0-stepping as of revision
014 of the Intel desktop 4th generation processor family specification
update, these corrected errors also have been observed with D0-stepping
aka "Haswell Refresh".

1: http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf

Reviewed by: jhb
Sponsored by: Bally Wulff Games & Entertainment GmbH

git-svn-id: svn://svn.freebsd.org/base/stable/10@269592 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f

author	marius <marius@ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f>
	Tue, 5 Aug 2014 16:04:22 +0000 (16:04 +0000)
committer	marius <marius@ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f>
	Tue, 5 Aug 2014 16:04:22 +0000 (16:04 +0000)
commit	1f1c8e13070d727670df10a6c36ffc6f4c3f43fe
tree	3280c34d7accae0ea451ed806b8a3d8b5a22df7a	tree \| snapshot
parent	60d1e318aafcd2103133960d24a4f7e43b9e8205	commit \| diff