
[RFC,3/6] kexec: export PG_offline to VMCOREINFO

Message ID 20181114211704.6381-4-david@redhat.com (mailing list archive)
State New, archived
Series mm/kdump: allow to exclude pages that are logically offline

Commit Message

David Hildenbrand Nov. 14, 2018, 9:17 p.m. UTC
Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
makedumpfile can directly skip pages that are logically offline and
whose content is therefore stale.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/crash_core.c | 2 ++
 1 file changed, 2 insertions(+)
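
For illustration, a minimal sketch of the consumer side, assuming a
makedumpfile-like tool that has already parsed the VMCOREINFO note and can
hand in the raw _mapcount (page_type) word it read from an old-kernel
struct page; the struct and helper names below are made up, not
makedumpfile's actual API:

#include <stdbool.h>
#include <stdint.h>

/* Values taken from the crashed kernel's VMCOREINFO note (illustrative). */
struct vmcoreinfo_numbers {
	int32_t page_buddy_mapcount_value;   /* NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)   */
	int32_t page_offline_mapcount_value; /* NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE) */
};

static bool page_is_logically_offline(int32_t mapcount,
				      const struct vmcoreinfo_numbers *vi)
{
	/*
	 * PG_offline is a page type stored in page->_mapcount; a page that
	 * is PageOffline() (and has no other page type set) holds exactly
	 * ~PG_offline there, which is the value exported in VMCOREINFO as
	 * PAGE_OFFLINE_MAPCOUNT_VALUE.
	 */
	return mapcount == vi->page_offline_mapcount_value;
}

static bool should_exclude_page(int32_t mapcount,
				const struct vmcoreinfo_numbers *vi)
{
	/* Treat offline pages like free (buddy) pages: their content is stale. */
	return page_is_logically_offline(mapcount, vi) ||
	       mapcount == vi->page_buddy_mapcount_value;
}

The check mirrors what is already done today for PAGE_BUDDY_MAPCOUNT_VALUE.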

Comments

Dave Young Nov. 15, 2018, 6:19 a.m. UTC | #1
Hi David,

On 11/14/18 at 10:17pm, David Hildenbrand wrote:
> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
> makedumpfile can directly skip pages that are logically offline and
> whose content is therefore stale.

It would be good to copy some background info from the cover letter to the
patch description so that we can get a better understanding of why this is
needed now.

BTW, Lianbo is working on documentation of the vmcoreinfo-exported
fields. Ccing him so that he is aware of this.

Also cc Boris; although I do not want the doc changes to block this,
he might have a different opinion :)

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  kernel/crash_core.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 933cb3e45b98..093c9f917ed0 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
>  	VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
>  #ifdef CONFIG_HUGETLB_PAGE
>  	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
> +#define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
> +	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
>  #endif
>  
>  	arch_crash_save_vmcoreinfo();
> -- 
> 2.17.2
> 

Thanks
Dave
David Hildenbrand Nov. 15, 2018, 9:23 a.m. UTC | #2
On 15.11.18 07:19, Dave Young wrote:
> Hi David,
> 
> On 11/14/18 at 10:17pm, David Hildenbrand wrote:
>> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
>> makedumpfile can directly skip pages that are logically offline and
>> whose content is therefore stale.
> 
> It would be good to copy some background info from the cover letter to the
> patch description so that we can get a better understanding of why this is
> needed now.

Yes, will add more detail!

> 
> BTW, Lianbo is working on documentation of the vmcoreinfo-exported
> fields. Ccing him so that he is aware of this.
> 
> Also cc Boris; although I do not want the doc changes to block this,
> he might have a different opinion :)

I'll be happy to help update the documentation (or update it myself if
the doc updates go in first).
Borislav Petkov Nov. 15, 2018, 11:10 a.m. UTC | #3
On Thu, Nov 15, 2018 at 02:19:23PM +0800, Dave Young wrote:
> It would be good to copy some background info from the cover letter to the
> patch description so that we can get a better understanding of why this is
> needed now.
> 
> BTW, Lianbo is working on documentation of the vmcoreinfo-exported
> fields. Ccing him so that he is aware of this.
> 
> Also cc Boris; although I do not want the doc changes to block this,
> he might have a different opinion :)

Yeah, my initial reaction is that exporting an mm-internal flag to
userspace is a no-no.

What would be better, IMHO, is having a general method of telling the
kdump kernel - maybe ranges of physical addresses - which to skip.

Because the moment there's a set of pages which do not have PG_offline
set but kdump would still like to skip, this breaks.

But that's mm guys' call.
David Hildenbrand Nov. 15, 2018, 11:20 a.m. UTC | #4
On 15.11.18 12:10, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 02:19:23PM +0800, Dave Young wrote:
>> It would be good to copy some background info from the cover letter to the
>> patch description so that we can get a better understanding of why this is
>> needed now.
>>
>> BTW, Lianbo is working on documentation of the vmcoreinfo-exported
>> fields. Ccing him so that he is aware of this.
>>
>> Also cc Boris; although I do not want the doc changes to block this,
>> he might have a different opinion :)
> 
> Yeah, my initial reaction is that exporting an mm-internal flag to
> userspace is a no-no.

Sorry to say, but that is the current practice without which
makedumpfile would not be able to work at all (exclude user pages,
exclude page cache, exclude buddy pages). Let's not reinvent the wheel
here. This is how dumping has worked forever.

Also see how hwpoisoned pages are handled.

(please note that most distributions only support dumping via
makedumpfile, for said reasons)

> 
> What would be better, IMHO, is having a general method of telling the
> kdump kernel - maybe ranges of physical addresses - which to skip.

And that has to be updated whenever we change a page flag. But then the
dump kernel would have to be aware of the "struct page" location and
format of the old kernel to be dumped. Let's just not even discuss going
down that path.

> 
> Because the moment there's a set of pages which do not have PG_offline
> set but kdump would still like to skip, this breaks.

I don't understand your concern. PG_offline is only an optimization for
sections that are online. Offline sections can already be completely
ignored when dumping.

I don't see how there should be "set of pages which do not have
PG_offline". I mean if they don't have PG_offline they will simply be
dumped like before. Which worked forever. Sorry, I don't get your point.
Borislav Petkov Nov. 15, 2018, 11:52 a.m. UTC | #5
On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
> Sorry to say, but that is the current practice without which
> makedumpfile would not be able to work at all (exclude user pages,
> exclude page cache, exclude buddy pages). Let's not reinvent the wheel
> here. This is how dumping has worked forever.

Sorry, but "we've always done this in the past" doesn't make it better.

> I don't see how there should be "set of pages which do not have
> PG_offline".

It doesn't have to be a set of pages. Think of a (mmconfig perhaps) region
which the kdump kernel should completely skip because poking at it in
the kdump kernel causes all kinds of havoc like machine checks, etc.
We've had and still have one issue like that.

But let me clarify my note: I don't want to be discussing with you the
design of makedumpfile and how it should or should not work - that ship
has already sailed. Apparently there are valid reasons to do it this
way.

I was *simply* stating that it feels wrong to export mm flags like that.

But as I said already, that is mm guys' call and looking at how we're
already exporting a bunch of stuff in the vmcoreinfo - including other
mm flags - I guess one more flag doesn't matter anymore.
David Hildenbrand Nov. 15, 2018, 12:01 p.m. UTC | #6
On 15.11.18 12:52, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
>> Sorry to say, but that is the current practice without which
>> makedumpfile would not be able to work at all (exclude user pages,
>> exclude page cache, exclude buddy pages). Let's not reinvent the wheel
>> here. This is how dumping has worked forever.
> 
> Sorry, but "we've always done this in the past" doesn't make it better.

Just saying that "I'm not the first to do it, don't hit me with a stick" :)

> 
>> I don't see how there should be "set of pages which do not have
>> PG_offline".
> 
> It doesn't have to be a set of pages. Think of a (mmconfig perhaps) region
> which the kdump kernel should completely skip because poking at it in
> the kdump kernel causes all kinds of havoc like machine checks, etc.
> We've had and still have one issue like that.

Indeed. And we still have without makedumpfile. I think you are aware of
this, but I'll explain it just for consistency: PG_hwpoison

At some point we detect a HW error and mark a page as PG_hwpoison.

makedumpfile knows how to treat that flag and can exclude the page from the
dump (== not access it). No crash.

kdump itself has no clue about old "struct pages". Especially:
a) Where they are located in memory (e.g. SPARSE)
b) What their format is ("where are the flags")
c) What the meaning of flags is ("what does bit X mean")

In order to know such information, we would have to parse quite a lot of
information inside the kdump kernel. Basically what makedumpfile
does right now. Is this feasible? I don't think so.
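
For scale, a rough sketch of the kind of VMCOREINFO-driven parsing this
implies for every single page; readmem() here is a stub and all names are
illustrative, not the real makedumpfile code:

#include <stdint.h>
#include <string.h>

/* (b) layout and (c) flag meaning, as read from the old kernel's VMCOREINFO. */
struct old_page_layout {
	uint64_t size_page;      /* SIZE(page)              */
	uint64_t off_flags;      /* OFFSET(page.flags)      */
	uint64_t nr_pg_hwpoison; /* NUMBER(PG_hwpoison) bit */
};

/* Illustrative stub: would read bytes out of the old kernel's memory image. */
static int readmem(uint64_t kvaddr, void *buf, uint64_t len)
{
	(void)kvaddr;
	memset(buf, 0, len);
	return 0;
}

/*
 * (a) location: the start of the old kernel's page array (vmemmap/mem_map),
 * itself obtained from VMCOREINFO symbols or the mem_section table.
 */
static uint64_t old_page_addr(uint64_t page_array_base,
			      const struct old_page_layout *l, uint64_t pfn)
{
	return page_array_base + pfn * l->size_page;
}

static int old_page_is_hwpoison(uint64_t page_array_base,
				const struct old_page_layout *l, uint64_t pfn)
{
	uint64_t flags = 0;

	if (readmem(old_page_addr(page_array_base, l, pfn) + l->off_flags,
		    &flags, sizeof(flags)) < 0)
		return 0; /* unreadable struct page: nothing we can decide */

	return (flags >> l->nr_pg_hwpoison) & 1;
}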

So, as you said, we would need another approach to communicate such
information. I can't think of one, but if anybody reading this has an idea,
please speak up. I am interested.


The *only* way we would have right now to handle such scenarios:

1. If we get a machine check while dumping memory, fake reading a zero
page instead of crashing.
2. If we get a fault while dumping memory, fake reading a zero page
instead of crashing.
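
A userspace-level sketch of idea 2, assuming the read path could be made to
return an error instead of taking the machine down (which is precisely the
hard part); the file handling and names are purely illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define DUMP_PAGE_SIZE 4096UL

/* Copy one page from the old memory image; on a failed read, emit zeroes. */
static int copy_page_or_zero(int in_fd, int out_fd, off_t offset)
{
	static const char zero_page[DUMP_PAGE_SIZE];
	char buf[DUMP_PAGE_SIZE];
	ssize_t n = pread(in_fd, buf, sizeof(buf), offset);

	if (n != (ssize_t)sizeof(buf)) {
		/* Unreadable page: fake a zero page instead of giving up. */
		return write(out_fd, zero_page, sizeof(zero_page)) ==
		       (ssize_t)sizeof(zero_page) ? 0 : -1;
	}
	return write(out_fd, buf, n) == n ? 0 : -1;
}

int main(int argc, char **argv)
{
	int in_fd, out_fd;
	off_t size, off;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <vmcore> <outfile>\n", argv[0]);
		return 1;
	}
	in_fd = open(argv[1], O_RDONLY);
	out_fd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (in_fd < 0 || out_fd < 0)
		return 1;

	/* For this sketch, assume the image size is page-aligned. */
	size = lseek(in_fd, 0, SEEK_END);
	for (off = 0; off < size; off += DUMP_PAGE_SIZE)
		if (copy_page_or_zero(in_fd, out_fd, off) < 0)
			return 1;

	close(in_fd);
	close(out_fd);
	return 0;
}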

> 
> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.

Indeed, and the basic design is to export these flags. (let's say
"unfortunately", being able to handle such stuff in kdump directly would
be the dream).

> I was *simply* stating that it feels wrong to export mm flags like that.
> 
> But as I said already, that is mm guys' call and looking at how we're
> already exporting a bunch of stuff in the vmcoreinfo - including other
> mm flags - I guess one more flag doesn't matter anymore.

Fair enough, noted. If you have an idea how to handle this in kdump,
please let me know.
Michal Hocko Nov. 15, 2018, 12:11 p.m. UTC | #7
On Thu 15-11-18 12:52:13, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
> > Sorry to say, but that is the current practice without which
> > makedumpfile would not be able to work at all (exclude user pages,
> > exclude page cache, exclude buddy pages). Let's not reinvent the wheel
> > here. This is how dumping has worked forever.
> 
> Sorry, but "we've always done this in the past" doesn't make it better.
> 
> > I don't see how there should be "set of pages which do not have
> > PG_offline".
> 
> > It doesn't have to be a set of pages. Think of a (mmconfig perhaps) region
> > which the kdump kernel should completely skip because poking at it in
> > the kdump kernel causes all kinds of havoc like machine checks, etc.
> > We've had and still have one issue like that.
> 
> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.
> 
> I was *simply* stating that it feels wrong to export mm flags like that.
> 
> But as I said already, that is mm guys' call and looking at how we're
> already exporting a bunch of stuff in the vmcoreinfo - including other
> mm flags - I guess one more flag doesn't matter anymore.

I am not familiar enough with kexec to judge this particular patch, but we
cannot simply define a range for these pages (same as for hwpoisoned
ones) because they can be almost anywhere in the available memory range,
and there can be countless such pages. There is no other way to rule them
out but to check the page state.

I am not really sure what the concern here is exactly. Kdump is so
closely tied to the specific kernel version that the API exported
specifically for its purpose cannot seriously be considered an ABI.
Kdump has to adapt all the time.
Borislav Petkov Nov. 15, 2018, 2:13 p.m. UTC | #8
On Thu, Nov 15, 2018 at 01:11:02PM +0100, Michal Hocko wrote:
> I am not familiar enough with kexec to judge this particular patch, but we
> cannot simply define a range for these pages (same as for hwpoisoned
> ones) because they can be almost anywhere in the available memory range,
> and there can be countless such pages. There is no other way to rule them
> out but to check the page state.

I guess, especially if it is a monster box with a lot of memory in it.

> I am not really sure what the concern here is exactly. Kdump is so
> closely tied to the specific kernel version that the API exported
> specifically for its purpose cannot seriously be considered an ABI.
> Kdump has to adapt all the time.

Right...

Except, when people start ogling vmcoreinfo for other things and start
exporting all kinds of kernel internals in there, my alarm bells start
ringing.

But ok, kdump *is* special and I guess that's fine.
Borislav Petkov Nov. 15, 2018, 5:58 p.m. UTC | #9
On Thu, Nov 15, 2018 at 01:01:17PM +0100, David Hildenbrand wrote:
> Just saying that "I'm not the first to do it, don't hit me with a stick" :)

:-)

> Indeed. And we still have without makedumpfile. I think you are aware of
> this, but I'll explain it just for consistency: PG_hwpoison

No, I appreciate an explanation very much! So thanks for that. :)

> At some point we detect a HW error and mark a page as PG_hwpoison.
> 
> makedumpfile knows how to treat that flag and can exclude the page from the
> dump (== not access it). No crash.
> 
> kdump itself has no clue about old "struct pages". Especially:
> a) Where they are located in memory (e.g. SPARSE)
> b) What their format is ("where are the flags")
> c) What the meaning of flags is ("what does bit X mean")
> 
> In order to know such information, we would have to parse quite a lot of
> information inside the kdump kernel. Basically what makedumpfile
> does right now. Is this feasible? I don't think so.
> 
> So, as you said, we would need another approach to communicate such
> information. I can't think of one, but if anybody reading this has an idea,
> please speak up. I am interested.

Yeah but that ship has sailed. And even if we had a great idea, we'd
have to support kdump before and after the idea. And that would be a
serious mess.

And if you have a huge box with gazillion piles of memory and an alpha
particle passes through a bunch of them on its way down to the earth's
core, and while doing so, flips a bunch of bits, you need to go and
collect all those regions and update some list which you then need to
shove into the second kernel.

And you probably need to do all that through, perhaps, a piece of memory
which is used for communication between the first and second kernels, and
that list had better fit in there, or you need to realloc. And that piece of
memory's layout needs to be properly defined so that the second kernel
can parse it correctly.

And so on...

> The *only* way we would have right now to handle such scenarios:
> 
> 1. If we get a machine check while dumping memory, fake reading a zero
> page instead of crashing.
> 2. If we get a fault while dumping memory, fake reading a zero page
> instead of crashing.

Yap.

> Indeed, and the basic design is to export these flags. (let's say
> "unfortunately", being able to handle such stuff in kdump directly would
> be the dream).

Well, AFAICT, the minimum work you always need to do before starting the
dump is to somehow generate that list of pages or ranges not to dump.
And that work needs to be done by either the first or the second kernel, I'd
say.

If the first kernel were to do it, then you'd probably have to have
callbacks on certain operations which go and add ranges or pages to
exclude to a list which is then readily accessible to the second
kernel. Which means, when you reserve memory for the second kernel,
you'd also have to reserve memory for such a list.

But then what do you do when that memory gets filled up...?

So I guess exporting those things in vmcoreinfo is probably the only
thing we *can* do in the end.

Oh well, enough rambling... :)

Patch

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 933cb3e45b98..093c9f917ed0 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,8 @@  static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
 #ifdef CONFIG_HUGETLB_PAGE
 	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
+#define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
+	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
 #endif
 
 	arch_crash_save_vmcoreinfo();