Message ID | 20220627210525.515705-1-tony.luck@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | rasdaemon: Change default corrected error threshold from 50/day to 2/day |
diff --git a/misc/rasdaemon.env b/misc/rasdaemon.env
index 12fd76658430..5feab290cf01 100644
--- a/misc/rasdaemon.env
+++ b/misc/rasdaemon.env
@@ -14,7 +14,7 @@
 #
 # The two configs will only take no effect when PAGE_CE_ACTION is "off".
 PAGE_CE_REFRESH_CYCLE="24h"
-PAGE_CE_THRESHOLD="50"
+PAGE_CE_THRESHOLD="2"
 
 # Specify the internal action in rasdaemon to exceeding a page error threshold.
 #
A large-scale study of memory errors in data centers showed that it is best to aggressively take pages with corrected errors offline. This is the best strategy for using corrected errors as a predictor of future uncorrected errors.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Here's the link to the study. I thought of putting it into the commit comment, but these links sometimes change when a website is reorganised, making the link stale.

https://www.intel.com/content/dam/www/public/us/en/documents/intel-and-samsung-mrt-improving-memory-reliability-at-data-centers.pdf

Note that Boris has questions on my kernel patch that does the same, about whether an unconditional change to "2" is the right value for other CPU and memory vendors.

https://lore.kernel.org/all/YrnBWjkX82OhXAtL@zn.tnic/

---

 misc/rasdaemon.env | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
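
To illustrate what the two settings mean in practice, here is a minimal Python sketch of a per-page corrected-error counter with a refresh cycle. It is not rasdaemon's implementation: the class `PageCeTracker`, its method names, and the rule that the action fires once the count reaches the threshold within one cycle are all assumptions made for illustration.

```python
from collections import defaultdict

# Mirrors PAGE_CE_REFRESH_CYCLE="24h" from misc/rasdaemon.env.
REFRESH_CYCLE_SECONDS = 24 * 60 * 60


class PageCeTracker:
    """Hypothetical per-page corrected-error (CE) counter, for illustration only."""

    def __init__(self, threshold, cycle_seconds=REFRESH_CYCLE_SECONDS):
        self.threshold = threshold          # plays the role of PAGE_CE_THRESHOLD
        self.cycle_seconds = cycle_seconds  # plays the role of PAGE_CE_REFRESH_CYCLE
        self.window_start = {}              # page address -> start time of current cycle
        self.counts = defaultdict(int)      # page address -> CEs seen in current cycle

    def record(self, page, now):
        """Record one CE on `page` at time `now` (seconds).

        Returns True when the page has reached the threshold within the
        current cycle (assumed trigger condition), otherwise False.
        """
        start = self.window_start.get(page)
        if start is None or now - start >= self.cycle_seconds:
            # First error on this page, or the previous cycle has expired:
            # start a fresh counting window.
            self.window_start[page] = now
            self.counts[page] = 0
        self.counts[page] += 1
        return self.counts[page] >= self.threshold


if __name__ == "__main__":
    new_default = PageCeTracker(threshold=2)
    print(new_default.record(0x1000, now=0))     # first CE on the page
    print(new_default.record(0x1000, now=3600))  # second CE within the same cycle
```

Under these assumptions, a threshold of "2" flags a page on its second corrected error within a cycle, whereas a threshold of "50" would wait for fifty errors on the same page in that period.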