diff mbox series

rasdaemon: Change default corrected error threshold from 50/day to 2/day

Message ID 20220627210525.515705-1-tony.luck@intel.com (mailing list archive)
State New, archived

Commit Message

Tony Luck June 27, 2022, 9:05 p.m. UTC
A large-scale study of memory errors in data centers showed that it is
best to aggressively take pages with corrected errors offline. This is
the most effective way to use corrected errors as a predictor of future
uncorrected errors.

Lower the default PAGE_CE_THRESHOLD from 50 to 2 corrected errors per
PAGE_CE_REFRESH_CYCLE (24h).

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Here's the link to the study. I thought of putting it into the commit
comment, but these links sometimes change when a website is re-organised,
which would leave the link stale.

https://www.intel.com/content/dam/www/public/us/en/documents/intel-and-samsung-mrt-improving-memory-reliability-at-data-centers.pdf

Note that Boris has raised questions on my kernel patch that makes the
same change, asking whether an unconditional change to "2" is the right
value for other CPU & memory vendors.
	https://lore.kernel.org/all/YrnBWjkX82OhXAtL@zn.tnic/
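
For illustration only, here is a minimal sketch of how a per-page
threshold such as PAGE_CE_THRESHOLD interacts with a refresh cycle such
as PAGE_CE_REFRESH_CYCLE. It is not rasdaemon's code: the class and
function names are hypothetical, and resetting the count once the cycle
has elapsed is a simplifying assumption.

```python
from collections import defaultdict


class PageCECounter:
    """Hypothetical per-page corrected-error counter (not rasdaemon's code)."""

    def __init__(self, threshold, cycle_seconds):
        self.threshold = threshold          # e.g. PAGE_CE_THRESHOLD
        self.cycle = cycle_seconds          # e.g. PAGE_CE_REFRESH_CYCLE in seconds
        self.start = {}                     # page -> start time of its current window
        self.count = defaultdict(int)       # page -> errors seen in current window

    def record(self, page, now):
        """Record one corrected error; return True if the page hit the threshold."""
        # Assumption: the count simply restarts once the cycle has elapsed.
        if page not in self.start or now - self.start[page] >= self.cycle:
            self.start[page] = now
            self.count[page] = 0
        self.count[page] += 1
        return self.count[page] >= self.threshold


def errors_until_offline(threshold, timestamps, cycle_seconds=24 * 3600):
    """Return how many errors on one page occur before the threshold trips."""
    counter = PageCECounter(threshold, cycle_seconds)
    for n, t in enumerate(timestamps, start=1):
        if counter.record(0x1000, t):       # 0x1000 is an arbitrary example page
            return n
    return None                             # threshold never reached


if __name__ == "__main__":
    one_hour_apart = [0, 3600]
    print(errors_until_offline(2, one_hour_apart))
    print(errors_until_offline(50, one_hour_apart))
```

Under these assumptions, two corrected errors an hour apart on the same
page trip a threshold of "2" but are nowhere near a threshold of "50".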

---
 misc/rasdaemon.env | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Patch

diff --git a/misc/rasdaemon.env b/misc/rasdaemon.env
index 12fd76658430..5feab290cf01 100644
--- a/misc/rasdaemon.env
+++ b/misc/rasdaemon.env
@@ -14,7 +14,7 @@ 
 #
 # The two configs will only take no effect when PAGE_CE_ACTION is "off".
 PAGE_CE_REFRESH_CYCLE="24h"
-PAGE_CE_THRESHOLD="50"
+PAGE_CE_THRESHOLD="2"
 
 # Specify the internal action in rasdaemon to exceeding a page error threshold.
 #