Message ID | 20210819054116.266126-3-shy828301@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2,1/3] mm: hwpoison: don't drop slab caches for offlining non-LRU page | expand |
On Wed, Aug 18, 2021 at 10:41:16PM -0700, Yang Shi wrote: > Currently just very simple message is shown for unhandlable page, e.g. > non-LRU page, like: > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () > > It is not very helpful for further debug, calling dump_page() could show > more useful information. > > Calling dump_page() in get_any_page() in order to not duplicate the call > in a couple of different places. It may be called with pcp disabled and > holding memory hotplug lock, it should be not a big deal since hwpoison > handler is not called very often. > > Suggested-by: Matthew Wilcox <willy@infradead.org> > Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> > Cc: Oscar Salvador <osalvador@suse.de> > Signed-off-by: Yang Shi <shy828301@gmail.com> > --- > mm/memory-failure.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 7cfa134b1370..60df8fcd0444 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1228,6 +1228,9 @@ static int get_any_page(struct page *p, unsigned long flags) > ret = -EIO; > } > out: > + if (ret == -EIO) > + dump_page(p, "hwpoison: unhandlable page"); > + I feel that 4 callers of get_hwpoison_page() are in the different context, so it might be better to consider them separately to add dump_page() or not. soft_offline_page() still prints out "%s: %#lx: unknown page type: %lx (%pGp)" message, which might be duplicate so this printk() may be dropped. In memory_failure_hugetlb() and memory_failure(), we can call dump_page() after action_result(). unpoison_memory() doesn't need dump_page() at all because it's related to already hwpoisoned page. Thanks, Naoya Horiguchi
On Thu, Aug 19, 2021 at 11:48 PM HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > > On Wed, Aug 18, 2021 at 10:41:16PM -0700, Yang Shi wrote: > > Currently just very simple message is shown for unhandlable page, e.g. > > non-LRU page, like: > > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () > > > > It is not very helpful for further debug, calling dump_page() could show > > more useful information. > > > > Calling dump_page() in get_any_page() in order to not duplicate the call > > in a couple of different places. It may be called with pcp disabled and > > holding memory hotplug lock, it should be not a big deal since hwpoison > > handler is not called very often. > > > > Suggested-by: Matthew Wilcox <willy@infradead.org> > > Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> > > Cc: Oscar Salvador <osalvador@suse.de> > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > --- > > mm/memory-failure.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > index 7cfa134b1370..60df8fcd0444 100644 > > --- a/mm/memory-failure.c > > +++ b/mm/memory-failure.c > > @@ -1228,6 +1228,9 @@ static int get_any_page(struct page *p, unsigned long flags) > > ret = -EIO; > > } > > out: > > + if (ret == -EIO) > > + dump_page(p, "hwpoison: unhandlable page"); > > + > > I feel that 4 callers of get_hwpoison_page() are in the different context, > so it might be better to consider them separately to add dump_page() or not. > soft_offline_page() still prints out "%s: %#lx: unknown page type: %lx (%pGp)" No strong opinion to keep or remove it. > message, which might be duplicate so this printk() may be dropped. > In memory_failure_hugetlb() and memory_failure(), we can call dump_page() after > action_result(). unpoison_memory() doesn't need dump_page() at all because > it's related to already hwpoisoned page. I don't have a strong opinion either to have the dump_page() called either before action or after action, it just moves around the dumped page information around that printk. For unpoison_memory(), I think it is harmless to have dump_page() called, right? If get_hwpoison_page() can't return -EIO, then the dump_page() won't be called at all, if it is possible then this is exactly why we call dump_page() to help debug. So IMHO calling dump_page() in get_any_page when -EIO is returned could work for all the cases well and avoid duplicating the call. > > Thanks, > Naoya Horiguchi
On Fri, Aug 20, 2021 at 11:40:24AM -0700, Yang Shi wrote: > On Thu, Aug 19, 2021 at 11:48 PM HORIGUCHI NAOYA(堀口 直也) > <naoya.horiguchi@nec.com> wrote: > > > > On Wed, Aug 18, 2021 at 10:41:16PM -0700, Yang Shi wrote: > > > Currently just very simple message is shown for unhandlable page, e.g. > > > non-LRU page, like: > > > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () > > > > > > It is not very helpful for further debug, calling dump_page() could show > > > more useful information. > > > > > > Calling dump_page() in get_any_page() in order to not duplicate the call > > > in a couple of different places. It may be called with pcp disabled and > > > holding memory hotplug lock, it should be not a big deal since hwpoison > > > handler is not called very often. > > > > > > Suggested-by: Matthew Wilcox <willy@infradead.org> > > > Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > Cc: Oscar Salvador <osalvador@suse.de> > > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > > --- > > > mm/memory-failure.c | 3 +++ > > > 1 file changed, 3 insertions(+) > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > index 7cfa134b1370..60df8fcd0444 100644 > > > --- a/mm/memory-failure.c > > > +++ b/mm/memory-failure.c > > > @@ -1228,6 +1228,9 @@ static int get_any_page(struct page *p, unsigned long flags) > > > ret = -EIO; > > > } > > > out: > > > + if (ret == -EIO) > > > + dump_page(p, "hwpoison: unhandlable page"); > > > + > > > > I feel that 4 callers of get_hwpoison_page() are in the different context, > > so it might be better to consider them separately to add dump_page() or not. > > soft_offline_page() still prints out "%s: %#lx: unknown page type: %lx (%pGp)" > > No strong opinion to keep or remove it. Reading the explanation below, I think that calling dump_page() in the original place is fine. So let's remove "else if (ret == 0)" block in soft_offline_page(). > > > message, which might be duplicate so this printk() may be dropped. > > In memory_failure_hugetlb() and memory_failure(), we can call dump_page() after > > action_result(). unpoison_memory() doesn't need dump_page() at all because > > it's related to already hwpoisoned page. > > I don't have a strong opinion either to have the dump_page() called > either before action or after action, it just moves around the dumped > page information around that printk. > > For unpoison_memory(), I think it is harmless to have dump_page() > called, right? If get_hwpoison_page() can't return -EIO, then the > dump_page() won't be called at all, if it is possible then this is > exactly why we call dump_page() to help debug. > > So IMHO calling dump_page() in get_any_page when -EIO is returned > could work for all the cases well and avoid duplicating the call. Fair enough. So could you repost 3/3 with the above change in soft_offline_page()? Thanks, Naoya Horiguchi
On Sun, Aug 22, 2021 at 10:05 PM HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > > On Fri, Aug 20, 2021 at 11:40:24AM -0700, Yang Shi wrote: > > On Thu, Aug 19, 2021 at 11:48 PM HORIGUCHI NAOYA(堀口 直也) > > <naoya.horiguchi@nec.com> wrote: > > > > > > On Wed, Aug 18, 2021 at 10:41:16PM -0700, Yang Shi wrote: > > > > Currently just very simple message is shown for unhandlable page, e.g. > > > > non-LRU page, like: > > > > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () > > > > > > > > It is not very helpful for further debug, calling dump_page() could show > > > > more useful information. > > > > > > > > Calling dump_page() in get_any_page() in order to not duplicate the call > > > > in a couple of different places. It may be called with pcp disabled and > > > > holding memory hotplug lock, it should be not a big deal since hwpoison > > > > handler is not called very often. > > > > > > > > Suggested-by: Matthew Wilcox <willy@infradead.org> > > > > Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > > > --- > > > > mm/memory-failure.c | 3 +++ > > > > 1 file changed, 3 insertions(+) > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > index 7cfa134b1370..60df8fcd0444 100644 > > > > --- a/mm/memory-failure.c > > > > +++ b/mm/memory-failure.c > > > > @@ -1228,6 +1228,9 @@ static int get_any_page(struct page *p, unsigned long flags) > > > > ret = -EIO; > > > > } > > > > out: > > > > + if (ret == -EIO) > > > > + dump_page(p, "hwpoison: unhandlable page"); > > > > + > > > > > > I feel that 4 callers of get_hwpoison_page() are in the different context, > > > so it might be better to consider them separately to add dump_page() or not. > > > soft_offline_page() still prints out "%s: %#lx: unknown page type: %lx (%pGp)" > > > > No strong opinion to keep or remove it. > > Reading the explanation below, I think that calling dump_page() in the > original place is fine. So let's remove "else if (ret == 0)" block in > soft_offline_page(). The "else if (ret == 0)" block is used to handle free page IIUC. I'm supposed you mean the "else if (ret == -EIO)" block which just calls printk. > > > > > > message, which might be duplicate so this printk() may be dropped. > > > In memory_failure_hugetlb() and memory_failure(), we can call dump_page() after > > > action_result(). unpoison_memory() doesn't need dump_page() at all because > > > it's related to already hwpoisoned page. > > > > I don't have a strong opinion either to have the dump_page() called > > either before action or after action, it just moves around the dumped > > page information around that printk. > > > > For unpoison_memory(), I think it is harmless to have dump_page() > > called, right? If get_hwpoison_page() can't return -EIO, then the > > dump_page() won't be called at all, if it is possible then this is > > exactly why we call dump_page() to help debug. > > > > So IMHO calling dump_page() in get_any_page when -EIO is returned > > could work for all the cases well and avoid duplicating the call. > > Fair enough. So could you repost 3/3 with the above change in soft_offline_page()? > > Thanks, > Naoya Horiguchi
On Mon, Aug 23, 2021 at 10:47:03AM -0700, Yang Shi wrote: > On Sun, Aug 22, 2021 at 10:05 PM HORIGUCHI NAOYA(堀口 直也) > <naoya.horiguchi@nec.com> wrote: > > > > On Fri, Aug 20, 2021 at 11:40:24AM -0700, Yang Shi wrote: > > > On Thu, Aug 19, 2021 at 11:48 PM HORIGUCHI NAOYA(堀口 直也) > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > On Wed, Aug 18, 2021 at 10:41:16PM -0700, Yang Shi wrote: > > > > > Currently just very simple message is shown for unhandlable page, e.g. > > > > > non-LRU page, like: > > > > > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () > > > > > > > > > > It is not very helpful for further debug, calling dump_page() could show > > > > > more useful information. > > > > > > > > > > Calling dump_page() in get_any_page() in order to not duplicate the call > > > > > in a couple of different places. It may be called with pcp disabled and > > > > > holding memory hotplug lock, it should be not a big deal since hwpoison > > > > > handler is not called very often. > > > > > > > > > > Suggested-by: Matthew Wilcox <willy@infradead.org> > > > > > Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > > > Cc: Oscar Salvador <osalvador@suse.de> > > > > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > > > > --- > > > > > mm/memory-failure.c | 3 +++ > > > > > 1 file changed, 3 insertions(+) > > > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > index 7cfa134b1370..60df8fcd0444 100644 > > > > > --- a/mm/memory-failure.c > > > > > +++ b/mm/memory-failure.c > > > > > @@ -1228,6 +1228,9 @@ static int get_any_page(struct page *p, unsigned long flags) > > > > > ret = -EIO; > > > > > } > > > > > out: > > > > > + if (ret == -EIO) > > > > > + dump_page(p, "hwpoison: unhandlable page"); > > > > > + > > > > > > > > I feel that 4 callers of get_hwpoison_page() are in the different context, > > > > so it might be better to consider them separately to add dump_page() or not. > > > > soft_offline_page() still prints out "%s: %#lx: unknown page type: %lx (%pGp)" > > > > > > No strong opinion to keep or remove it. > > > > Reading the explanation below, I think that calling dump_page() in the > > original place is fine. So let's remove "else if (ret == 0)" block in > > soft_offline_page(). > > The "else if (ret == 0)" block is used to handle free page IIUC. I'm > supposed you mean the "else if (ret == -EIO)" block which just calls > printk. Sorry, you're right. I miss-copied the line. - Naoya Horiguchi
On Wed, Aug 18, 2021 at 10:41:16PM -0700, Yang Shi wrote: > Currently just very simple message is shown for unhandlable page, e.g. > non-LRU page, like: > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () > > It is not very helpful for further debug, calling dump_page() could show > more useful information. Looks like your code already caught something. An error injection test may have injected into a shared library. Though I'm not sure that the refcount/mapcount in the dump agrees with that diagnosis from the author of this test. Here's what appeared on the console: [ 4817.622254] mce: Uncorrected hardware memory error in user-access at cef2747000 [ 4817.630520] page:000000003ab9dca4 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xcef2747 [ 4817.638651] mce: Uncorrected hardware memory error in user-access at cef2747000 [ 4817.646860] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff) [ 4818.025515] mce: Uncorrected hardware memory error in user-access at cef2747000 [ 4818.033689] raw: 0057ffffc0801000 ffd400033bc9d1c8 ffd400033bc9d1c8 0000000000000000 [ 4818.272435] mce: Uncorrected hardware memory error in user-access at cef2747000 [ 4818.280640] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 [ 4818.280658] mce: Uncorrected hardware memory error in user-access at cef2747000 [ 4818.313606] mce: Uncorrected hardware memory error in user-access at cef2747000 [ 4818.321804] page dumped because: hwpoison: unhandlable page [ 4818.564802] mce: Uncorrected hardware memory error in user-access at cef2747000 [ 4818.573043] Memory failure: 0xcef2747: recovery action for unknown page: Ignored [ 4818.595837] Memory failure: 0xcef2747: already hardware poisoned [ 4818.603245] Memory failure: 0xcef2747: Sending SIGBUS to multichase:67460 due to hardware memory corruption [ 4818.614297] Memory failure: 0xcef2747: already hardware poisoned -Tony
On Wed, Sep 22, 2021 at 12:37 PM Luck, Tony <tony.luck@intel.com> wrote: > > On Wed, Aug 18, 2021 at 10:41:16PM -0700, Yang Shi wrote: > > Currently just very simple message is shown for unhandlable page, e.g. > > non-LRU page, like: > > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () > > > > It is not very helpful for further debug, calling dump_page() could show > > more useful information. > > Looks like your code already caught something. An error injection > test may have injected into a shared library. Though I'm not sure that > the refcount/mapcount in the dump agrees with that diagnosis from the > author of this test. The messages from dump_page() are (unwind them from mce logs): [ 4817.630520] page:000000003ab9dca4 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xcef2747 [ 4817.646860] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff) [ 4818.033689] raw: 0057ffffc0801000 ffd400033bc9d1c8 ffd400033bc9d1c8 0000000000000000 [ 4818.280640] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 The page flags tell it is a "reserved" page and mapping is NULL. It doesn't seem like a user page or movable page, so hwpoision can't handle it so that the messages are dumped. > > Here's what appeared on the console: > > [ 4817.622254] mce: Uncorrected hardware memory error in user-access at cef2747000 > [ 4817.630520] page:000000003ab9dca4 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xcef2747 > [ 4817.638651] mce: Uncorrected hardware memory error in user-access at cef2747000 > [ 4817.646860] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff) > [ 4818.025515] mce: Uncorrected hardware memory error in user-access at cef2747000 > [ 4818.033689] raw: 0057ffffc0801000 ffd400033bc9d1c8 ffd400033bc9d1c8 0000000000000000 > [ 4818.272435] mce: Uncorrected hardware memory error in user-access at cef2747000 > [ 4818.280640] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > [ 4818.280658] mce: Uncorrected hardware memory error in user-access at cef2747000 > [ 4818.313606] mce: Uncorrected hardware memory error in user-access at cef2747000 > [ 4818.321804] page dumped because: hwpoison: unhandlable page > [ 4818.564802] mce: Uncorrected hardware memory error in user-access at cef2747000 > [ 4818.573043] Memory failure: 0xcef2747: recovery action for unknown page: Ignored > [ 4818.595837] Memory failure: 0xcef2747: already hardware poisoned > [ 4818.603245] Memory failure: 0xcef2747: Sending SIGBUS to multichase:67460 due to hardware memory corruption > [ 4818.614297] Memory failure: 0xcef2747: already hardware poisoned > > -Tony
On Wed, Sep 22, 2021 at 12:58 PM Yang Shi <shy828301@gmail.com> wrote: > > On Wed, Sep 22, 2021 at 12:37 PM Luck, Tony <tony.luck@intel.com> wrote: > > > > On Wed, Aug 18, 2021 at 10:41:16PM -0700, Yang Shi wrote: > > > Currently just very simple message is shown for unhandlable page, e.g. > > > non-LRU page, like: > > > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () > > > > > > It is not very helpful for further debug, calling dump_page() could show > > > more useful information. > > > > Looks like your code already caught something. An error injection > > test may have injected into a shared library. Though I'm not sure that > > the refcount/mapcount in the dump agrees with that diagnosis from the > > author of this test. > > The messages from dump_page() are (unwind them from mce logs): > > [ 4817.630520] page:000000003ab9dca4 refcount:1 mapcount:0 > mapping:0000000000000000 index:0x0 pfn:0xcef2747 > [ 4817.646860] flags: > 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff) > [ 4818.033689] raw: 0057ffffc0801000 ffd400033bc9d1c8 ffd400033bc9d1c8 > 0000000000000000 > [ 4818.280640] raw: 0000000000000000 0000000000000000 00000001ffffffff > 0000000000000000 Missed one line from the dump: [ 4818.321804] page dumped because: hwpoison: unhandlable page Anyway dump_page() is just called when unhandlable page is met. > > The page flags tell it is a "reserved" page and mapping is NULL. It > doesn't seem like a user page or movable page, so hwpoision can't > handle it so that the messages are dumped. > > > > > Here's what appeared on the console: > > > > [ 4817.622254] mce: Uncorrected hardware memory error in user-access at cef2747000 > > [ 4817.630520] page:000000003ab9dca4 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xcef2747 > > [ 4817.638651] mce: Uncorrected hardware memory error in user-access at cef2747000 > > [ 4817.646860] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff) > > [ 4818.025515] mce: Uncorrected hardware memory error in user-access at cef2747000 > > [ 4818.033689] raw: 0057ffffc0801000 ffd400033bc9d1c8 ffd400033bc9d1c8 0000000000000000 > > [ 4818.272435] mce: Uncorrected hardware memory error in user-access at cef2747000 > > [ 4818.280640] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > > [ 4818.280658] mce: Uncorrected hardware memory error in user-access at cef2747000 > > [ 4818.313606] mce: Uncorrected hardware memory error in user-access at cef2747000 > > [ 4818.321804] page dumped because: hwpoison: unhandlable page > > [ 4818.564802] mce: Uncorrected hardware memory error in user-access at cef2747000 > > [ 4818.573043] Memory failure: 0xcef2747: recovery action for unknown page: Ignored > > [ 4818.595837] Memory failure: 0xcef2747: already hardware poisoned > > [ 4818.603245] Memory failure: 0xcef2747: Sending SIGBUS to multichase:67460 due to hardware memory corruption > > [ 4818.614297] Memory failure: 0xcef2747: already hardware poisoned > > > > -Tony
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 7cfa134b1370..60df8fcd0444 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1228,6 +1228,9 @@ static int get_any_page(struct page *p, unsigned long flags) ret = -EIO; } out: + if (ret == -EIO) + dump_page(p, "hwpoison: unhandlable page"); + return ret; }
Currently just very simple message is shown for unhandlable page, e.g. non-LRU page, like: soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () It is not very helpful for further debug, calling dump_page() could show more useful information. Calling dump_page() in get_any_page() in order to not duplicate the call in a couple of different places. It may be called with pcp disabled and holding memory hotplug lock, it should be not a big deal since hwpoison handler is not called very often. Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Yang Shi <shy828301@gmail.com> --- mm/memory-failure.c | 3 +++ 1 file changed, 3 insertions(+)