[v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo
diff mbox series

Message ID 1542318469-13699-1-git-send-email-bhsharma@redhat.com
State New
Headers show
Series
  • [v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo
Related show

Commit Message

Bhupesh Sharma Nov. 15, 2018, 9:47 p.m. UTC
x86_64 kernel uses 'page_offset_base' variable to point to the
start of direct mapping of all physical memory. This variable
is also updated for KASLR boot cases, so this can be exported
via vmcoreinfo as a standard ABI between kernel and user-space,
to allow user-space utilities to use the same for calculating
the start of direct mapping of all physical memory.

'arch/x86/kernel/head64.c' sets the same as:
   unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;

and also uses the same to indicate the base of KASLR regions on x86_64:
   static __initdata struct kaslr_memory_region {
	    unsigned long *base;
	        unsigned long size_tb;
   } kaslr_regions[] = {
	    { &page_offset_base, 0 },
   .. snip ..

Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
live-debugging of a running kernel via user-space utilities
like makedumpfile (see [1]).

Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
details), whose live debugging feature is broken with newer kernels
(I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
added to kcore, thus leading to an additional sections in the same, and
makedumpfile is not longer able to determine the start of direct
mapping of all physical memory, as it relies on traversing the PT_LOAD
segments inside kcore and using the last PT_LOAD segment
to determine the start of direct mapping.

Such user-space issues can be resolved if the user-space code instead
uses a standard ABI to read the kernel exposed machine specific
variables. With the kernel commit 23c85094fe1895caefdd
["proc/kcore: add vmcoreinfo note to /proc/kcore"]), it is
now possible to use the vmcoreinfo present inside kcore as the standard
ABI which can be used by the user-space utilities for reading
the machine specific information (and hence for debugging a
live kernel).

User-space utilities like makedumpfile, kexec-tools and crash
are either already using this ABI or are discussing patches
which look to add the same feature. This helps in simplifying the
overall code and also in reducing code-rewrite across the
user-space utilities for getting values of these kernel
symbols/variables.

Accordingly this patch allows appending 'page_offset_base' for
x86_64 platforms to vmcoreinfo, so that user-space tools can use the
same as a standard interface to determine the start of direct mapping
of all physical memory.

Testing:
-------
 - I tested this patch (rebased on 'linux-next') on a x86_64 machine
   using the modified 'makedumpfile' user-space code (see [3] for my
   github tree which contains the same) for determining how many pages
   are dumpable when different dump_level is specified (which is
   one use-case of live-debugging via 'makedumpfile').
 - I tested both the KASLR and non-KASLR boot cases with this patch.
 - Here is one sample log (for KASLR boot case) on my x86_64 machine:

   < snip..>
   The kernel doesn't support mmap(),read() will be used instead.

   TYPE		PAGES			EXCLUDABLE	DESCRIPTION
   ----------------------------------------------------------------------
   ZERO		21299           	yes		Pages filled
   with zero
   NON_PRI_CACHE	91785           	yes		Cache
   pages without private flag
   PRI_CACHE	1               	yes		Cache pages with
   private flag
   USER		14057           	yes		User process
   pages
   FREE		740346          	yes		Free pages
   KERN_DATA	58152           	no		Dumpable kernel
   data

   page size:		4096
   Total pages on system:	925640
   Total size on system:	3791421440       Byte

[1]. MAN pages -> MAKEDUMPFILE(8) and CRASH(8)
[2]. makedumpfile issue with latest kernels -> http://lists.infradead.org/pipermail/kexec/2018-October/021769.html
[3]. https://github.com/bhupesh-sharma/makedumpfile/tree/add-page-offset-base-to-vmcore-v1

Cc: Boris Petkov <bp@alien8.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: x86@kernel.org
Cc: kexec@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
---
Changes since v1:
 - Fixed the build issue reported by build bot and tested this version
   with 'make allmodconfig'.
 - Reworded most of the commit log to explain the intent behind the
   patch.

 arch/x86/kernel/machine_kexec_64.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Kazuhito Hagio Nov. 19, 2018, 9:07 p.m. UTC | #1
On 11/15/2018 4:47 PM, Bhupesh Sharma wrote:
> Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> live-debugging of a running kernel via user-space utilities
> like makedumpfile (see [1]).

I agree.

> Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
> details), whose live debugging feature is broken with newer kernels
> (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> added to kcore, thus leading to an additional sections in the same, and
> makedumpfile is not longer able to determine the start of direct
> mapping of all physical memory, as it relies on traversing the PT_LOAD
> segments inside kcore and using the last PT_LOAD segment
> to determine the start of direct mapping.

Actually, the KCORE_REMAP segments were already removed from kcore by
commit bf904d2762ee ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
and kcore's program headers got back to the previous ones, but this
fact shows that they are changeable.

So I think that if we have this NUMBER(page_offset_base) in vmcoreinfo
for x86_64, user-space tools (especially makedumpfile) would become
more stable against changes in kcore and vmcore, rather than depending
on their PT_LOAD values.

Thanks,
Kazu
Bhupesh Sharma Nov. 21, 2018, 7:37 a.m. UTC | #2
Hi Kazu,

On Tue, Nov 20, 2018 at 2:47 AM Kazuhito Hagio <k-hagio@ab.jp.nec.com> wrote:
>
> On 11/15/2018 4:47 PM, Bhupesh Sharma wrote:
> > Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> > live-debugging of a running kernel via user-space utilities
> > like makedumpfile (see [1]).
>
> I agree.
>
> > Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
> > details), whose live debugging feature is broken with newer kernels
> > (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> > added to kcore, thus leading to an additional sections in the same, and
> > makedumpfile is not longer able to determine the start of direct
> > mapping of all physical memory, as it relies on traversing the PT_LOAD
> > segments inside kcore and using the last PT_LOAD segment
> > to determine the start of direct mapping.
>
> Actually, the KCORE_REMAP segments were already removed from kcore by
> commit bf904d2762ee ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
> and kcore's program headers got back to the previous ones, but this
> fact shows that they are changeable.
>
> So I think that if we have this NUMBER(page_offset_base) in vmcoreinfo
> for x86_64, user-space tools (especially makedumpfile) would become
> more stable against changes in kcore and vmcore, rather than depending
> on their PT_LOAD values.

Thanks a lot for your review.

Hi x86 maintainers,

Ping, any comments on this patch. Since it is a useful addition to the
kernel-userspace ABI (as suggested by the user-space utility
maintainer as well), can we consider including this in the kernel?

Thanks,
Bhupesh
Borislav Petkov Nov. 21, 2018, 11:39 a.m. UTC | #3
+ Kees.

On Fri, Nov 16, 2018 at 03:17:49AM +0530, Bhupesh Sharma wrote:
> x86_64 kernel uses 'page_offset_base' variable to point to the
> start of direct mapping of all physical memory. This variable
> is also updated for KASLR boot cases, so this can be exported
> via vmcoreinfo as a standard ABI between kernel and user-space,
> to allow user-space utilities to use the same for calculating
> the start of direct mapping of all physical memory.
> 
> 'arch/x86/kernel/head64.c' sets the same as:
>    unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
> 
> and also uses the same to indicate the base of KASLR regions on x86_64:
>    static __initdata struct kaslr_memory_region {
> 	    unsigned long *base;
> 	        unsigned long size_tb;
>    } kaslr_regions[] = {
> 	    { &page_offset_base, 0 },
>    .. snip ..

Why is that detail needed in the commit message?

> Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> live-debugging of a running kernel via user-space utilities
> like makedumpfile (see [1]).
> 
> Recently, I saw an issue with the 'makedumpfile' utility (see [2] for

Use passive tone in your commit message: no "we" or "I", etc.

Also, pls read section "2) Describe your changes" in
Documentation/process/submitting-patches.rst.

> details), whose live debugging feature is broken with newer kernels
> (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> added to kcore, thus leading to an additional sections in the same, and
> makedumpfile is not longer able to determine the start of direct
> mapping of all physical memory, as it relies on traversing the PT_LOAD
> segments inside kcore and using the last PT_LOAD segment
> to determine the start of direct mapping.
> 
> Such user-space issues can be resolved if the user-space code instead
> uses a standard ABI to read the kernel exposed machine specific
> variables. With the kernel commit 23c85094fe1895caefdd
> ["proc/kcore: add vmcoreinfo note to /proc/kcore"]), it is

ERROR: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 23c85094fe18 ("proc/kcore: add vmcoreinfo note to /proc/kcore")'
#54: 
variables. With the kernel commit 23c85094fe1895caefdd

> now possible to use the vmcoreinfo present inside kcore as the standard
> ABI which can be used by the user-space utilities for reading
> the machine specific information (and hence for debugging a
> live kernel).
> 
> User-space utilities like makedumpfile, kexec-tools and crash
> are either already using this ABI or are discussing patches
> which look to add the same feature. This helps in simplifying the
> overall code and also in reducing code-rewrite across the
> user-space utilities for getting values of these kernel
> symbols/variables.

> Accordingly this patch allows appending 'page_offset_base' for
> x86_64 platforms to vmcoreinfo, so that user-space tools can use the
> same as a standard interface to determine the start of direct mapping
> of all physical memory.
> 
> Testing:
> -------
>  - I tested this patch (rebased on 'linux-next') on a x86_64 machine
>    using the modified 'makedumpfile' user-space code (see [3] for my
>    github tree which contains the same) for determining how many pages
>    are dumpable when different dump_level is specified (which is
>    one use-case of live-debugging via 'makedumpfile').
>  - I tested both the KASLR and non-KASLR boot cases with this patch.
>  - Here is one sample log (for KASLR boot case) on my x86_64 machine:
> 
>    < snip..>
>    The kernel doesn't support mmap(),read() will be used instead.
> 
>    TYPE		PAGES			EXCLUDABLE	DESCRIPTION
>    ----------------------------------------------------------------------
>    ZERO		21299           	yes		Pages filled
>    with zero
>    NON_PRI_CACHE	91785           	yes		Cache
>    pages without private flag
>    PRI_CACHE	1               	yes		Cache pages with
>    private flag
>    USER		14057           	yes		User process
>    pages
>    FREE		740346          	yes		Free pages
>    KERN_DATA	58152           	no		Dumpable kernel
>    data
> 
>    page size:		4096
>    Total pages on system:	925640
>    Total size on system:	3791421440       Byte
> 
> [1]. MAN pages -> MAKEDUMPFILE(8) and CRASH(8)
> [2]. makedumpfile issue with latest kernels -> http://lists.infradead.org/pipermail/kexec/2018-October/021769.html
> [3]. https://github.com/bhupesh-sharma/makedumpfile/tree/add-page-offset-base-to-vmcore-v1
> 
> Cc: Boris Petkov <bp@alien8.de>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
> Cc: Dave Anderson <anderson@redhat.com>
> Cc: James Morse <james.morse@arm.com>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: x86@kernel.org
> Cc: kexec@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
> ---
> Changes since v1:
>  - Fixed the build issue reported by build bot and tested this version
>    with 'make allmodconfig'.
>  - Reworded most of the commit log to explain the intent behind the
>    patch.
> 
>  arch/x86/kernel/machine_kexec_64.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 4c8acdfdc5a7..6161d77c5bfb 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
>  	VMCOREINFO_SYMBOL(init_top_pgt);
>  	vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
>  			pgtable_l5_enabled());
> +#ifdef CONFIG_RANDOMIZE_BASE
> +	VMCOREINFO_NUMBER(page_offset_base);
> +#endif
>  
>  #ifdef CONFIG_NUMA
>  	VMCOREINFO_SYMBOL(node_data);
> --

All above are only nitpicks though.

My opinion is this: people are exporting all kinds of kernel-internal
stuff in vmcoreinfo and frankly, I'm not crazy about this idea.

And AFAICT, this thing basically bypasses KASLR completely but you need
root for it so it probably doesn't really matter.

Now, on another thread we agreed more or less that what gets exported in
vmcoreinfo is so tightly coupled to the running kernel so that it is not
even considered an ABI. I guess that is debatable but whatever.

So my only request right now would be to have all those things being
exported, documented somewhere and I believe Lianbo is working on that.

But I'm sure others will have more to say about it.
Bhupesh Sharma Nov. 24, 2018, 8:06 p.m. UTC | #4
Hi Boris,

Thanks for your review. Please see my replies inline:

On Wed, Nov 21, 2018 at 5:10 PM Borislav Petkov <bp@alien8.de> wrote:
>
> + Kees.
>
> On Fri, Nov 16, 2018 at 03:17:49AM +0530, Bhupesh Sharma wrote:
> > x86_64 kernel uses 'page_offset_base' variable to point to the
> > start of direct mapping of all physical memory. This variable
> > is also updated for KASLR boot cases, so this can be exported
> > via vmcoreinfo as a standard ABI between kernel and user-space,
> > to allow user-space utilities to use the same for calculating
> > the start of direct mapping of all physical memory.
> >
> > 'arch/x86/kernel/head64.c' sets the same as:
> >    unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
> >
> > and also uses the same to indicate the base of KASLR regions on x86_64:
> >    static __initdata struct kaslr_memory_region {
> >           unsigned long *base;
> >               unsigned long size_tb;
> >    } kaslr_regions[] = {
> >           { &page_offset_base, 0 },
> >    .. snip ..
>
> Why is that detail needed in the commit message?

This (and similar) details were requested by Baoquan during the v1
review, that is why I added them to the v2 commit log.
Although personally I also think that such details are probably not
needed in a commit log (may be better suited for a cover letter, which
is maybe a overkill for this single patch).
Will drop this from v3.

> > Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> > live-debugging of a running kernel via user-space utilities
> > like makedumpfile (see [1]).
> >
> > Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
>
> Use passive tone in your commit message: no "we" or "I", etc.

Ok.

> Also, pls read section "2) Describe your changes" in
> Documentation/process/submitting-patches.rst.

Ok.

> > details), whose live debugging feature is broken with newer kernels
> > (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> > added to kcore, thus leading to an additional sections in the same, and
> > makedumpfile is not longer able to determine the start of direct
> > mapping of all physical memory, as it relies on traversing the PT_LOAD
> > segments inside kcore and using the last PT_LOAD segment
> > to determine the start of direct mapping.
> >
> > Such user-space issues can be resolved if the user-space code instead
> > uses a standard ABI to read the kernel exposed machine specific
> > variables. With the kernel commit 23c85094fe1895caefdd
> > ["proc/kcore: add vmcoreinfo note to /proc/kcore"]), it is
>
> ERROR: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 23c85094fe18 ("proc/kcore: add vmcoreinfo note to /proc/kcore")'
> #54:
> variables. With the kernel commit 23c85094fe1895caefdd

Ok.

> > now possible to use the vmcoreinfo present inside kcore as the standard
> > ABI which can be used by the user-space utilities for reading
> > the machine specific information (and hence for debugging a
> > live kernel).
> >
> > User-space utilities like makedumpfile, kexec-tools and crash
> > are either already using this ABI or are discussing patches
> > which look to add the same feature. This helps in simplifying the
> > overall code and also in reducing code-rewrite across the
> > user-space utilities for getting values of these kernel
> > symbols/variables.
>
> > Accordingly this patch allows appending 'page_offset_base' for
> > x86_64 platforms to vmcoreinfo, so that user-space tools can use the
> > same as a standard interface to determine the start of direct mapping
> > of all physical memory.
> >
> > Testing:
> > -------
> >  - I tested this patch (rebased on 'linux-next') on a x86_64 machine
> >    using the modified 'makedumpfile' user-space code (see [3] for my
> >    github tree which contains the same) for determining how many pages
> >    are dumpable when different dump_level is specified (which is
> >    one use-case of live-debugging via 'makedumpfile').
> >  - I tested both the KASLR and non-KASLR boot cases with this patch.
> >  - Here is one sample log (for KASLR boot case) on my x86_64 machine:
> >
> >    < snip..>
> >    The kernel doesn't support mmap(),read() will be used instead.
> >
> >    TYPE               PAGES                   EXCLUDABLE      DESCRIPTION
> >    ----------------------------------------------------------------------
> >    ZERO               21299                   yes             Pages filled
> >    with zero
> >    NON_PRI_CACHE      91785                   yes             Cache
> >    pages without private flag
> >    PRI_CACHE  1                       yes             Cache pages with
> >    private flag
> >    USER               14057                   yes             User process
> >    pages
> >    FREE               740346                  yes             Free pages
> >    KERN_DATA  58152                   no              Dumpable kernel
> >    data
> >
> >    page size:         4096
> >    Total pages on system:     925640
> >    Total size on system:      3791421440       Byte
> >
> > [1]. MAN pages -> MAKEDUMPFILE(8) and CRASH(8)
> > [2]. makedumpfile issue with latest kernels -> http://lists.infradead.org/pipermail/kexec/2018-October/021769.html
> > [3]. https://github.com/bhupesh-sharma/makedumpfile/tree/add-page-offset-base-to-vmcore-v1
> >
> > Cc: Boris Petkov <bp@alien8.de>
> > Cc: Baoquan He <bhe@redhat.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
> > Cc: Dave Anderson <anderson@redhat.com>
> > Cc: James Morse <james.morse@arm.com>
> > Cc: Omar Sandoval <osandov@fb.com>
> > Cc: x86@kernel.org
> > Cc: kexec@lists.infradead.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
> > ---
> > Changes since v1:
> >  - Fixed the build issue reported by build bot and tested this version
> >    with 'make allmodconfig'.
> >  - Reworded most of the commit log to explain the intent behind the
> >    patch.
> >
> >  arch/x86/kernel/machine_kexec_64.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 4c8acdfdc5a7..6161d77c5bfb 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> >       VMCOREINFO_SYMBOL(init_top_pgt);
> >       vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> >                       pgtable_l5_enabled());
> > +#ifdef CONFIG_RANDOMIZE_BASE
> > +     VMCOREINFO_NUMBER(page_offset_base);
> > +#endif
> >
> >  #ifdef CONFIG_NUMA
> >       VMCOREINFO_SYMBOL(node_data);
> > --
>
> All above are only nitpicks though.

Right the above are mostly nitpicks, but useful ones, so I will
address these in the v3.

> My opinion is this: people are exporting all kinds of kernel-internal
> stuff in vmcoreinfo and frankly, I'm not crazy about this idea.
>
> And AFAICT, this thing basically bypasses KASLR completely but you need
> root for it so it probably doesn't really matter.
>
> Now, on another thread we agreed more or less that what gets exported in
> vmcoreinfo is so tightly coupled to the running kernel so that it is not
> even considered an ABI. I guess that is debatable but whatever.

I understand.

> So my only request right now would be to have all those things being
> exported, documented somewhere and I believe Lianbo is working on that.

I will work with Lianbo to get this added to his documentation patch as well.

> But I'm sure others will have more to say about it.

Sure, I will wait for a couple of days more and then send out a v3 and
also make sure that
this variable being export'ed also gets added to the overall documentation patch
which Lianbo is creating.

Regards,
Bhupesh
Baoquan He Nov. 25, 2018, 10:19 a.m. UTC | #5
On 11/25/18 at 01:36am, Bhupesh Sharma wrote:
> Hi Boris,
> 
> Thanks for your review. Please see my replies inline:
> 
> On Wed, Nov 21, 2018 at 5:10 PM Borislav Petkov <bp@alien8.de> wrote:
> >
> > + Kees.
> >
> > On Fri, Nov 16, 2018 at 03:17:49AM +0530, Bhupesh Sharma wrote:
> > > x86_64 kernel uses 'page_offset_base' variable to point to the
> > > start of direct mapping of all physical memory. This variable
> > > is also updated for KASLR boot cases, so this can be exported
> > > via vmcoreinfo as a standard ABI between kernel and user-space,
> > > to allow user-space utilities to use the same for calculating
> > > the start of direct mapping of all physical memory.
> > >
> > > 'arch/x86/kernel/head64.c' sets the same as:
> > >    unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
> > >
> > > and also uses the same to indicate the base of KASLR regions on x86_64:
> > >    static __initdata struct kaslr_memory_region {
> > >           unsigned long *base;
> > >               unsigned long size_tb;
> > >    } kaslr_regions[] = {
> > >           { &page_offset_base, 0 },
> > >    .. snip ..
> >
> > Why is that detail needed in the commit message?
> 
> This (and similar) details were requested by Baoquan during the v1
> review, that is why I added them to the v2 commit log.

Hmm, Bhupesh, this is what I requested in your v1:
lkml.kernel.org/r/20181030020752.GB11408@MiWiFi-R3L-srv

You said x86 makedumpfile is broken because KCORE_REMAP is added. I
wanted to know why it's broken. Now I think I don't need to request it
in this thread, becasue Kazu has made it clear and fixed it:
4AE2DC15AC0B8543882A74EA0D43DBEC0355DFC9@BPXM09GP.gisp.nec.co.jp

Or could you abstract the sentences in which I requested you to add
above information into patch log so that I can notice my expression
and can improve on future reviewing and communicating?

And for problem solving and describing, if things are simple, just
tell it direclty; If things are complex, try to simplify it and make it
be simply told. For this page_offset_base exporting, we can save time on
'what' and 'how' since it's very simple and very easy to see what it is
and how it's done. You just need tell WHY it's needed. BUT I saw so many
words and not easy to get why. If simplifying problems is one kind of
ability, I don't know what to say about complicating one simple problem,
another kind of ability?

I have read your patch log when you sent out v2, honestly I don't like it.
I tried to not touch this patch, hope other people can review and give
comments. This can make you know I didn't intentionally embarrass you on
patch reviewing.

Honestly, if this patch is merged, no matter v2 or your old v1,
a small record will be made. On my x1 laptop, I need scroll down FOUR
full screens till the real patch content is seen. Are those really
needed? You can check other vmcoreinfo exporting patch.

I really don't want to review people's patch if they don't welcome my
words, just I was asked to do. Hope you know I had no any offence when
I started to read your patch, just I felt not good when my head was 
scratched to bleed when read. My english is not good, I don't know how to
express euphemistically sometime, so if I say so about one thing, that is
how I think so to the thing, without any personal prejudice or attack to
person. Please forgive me if any offence felt.

Thanks
Baoquan

> Although personally I also think that such details are probably not
> needed in a commit log (may be better suited for a cover letter, which
> is maybe a overkill for this single patch).
> Will drop this from v3.
Baoquan He Nov. 26, 2018, 1:28 a.m. UTC | #6
On 11/16/18 at 03:17am, Bhupesh Sharma wrote:
> Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> live-debugging of a running kernel via user-space utilities
> like makedumpfile (see [1]).
> 
> Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
> details), whose live debugging feature is broken with newer kernels

I think this paragraph explained why KCORE_REMAP adding caused the
mistake of page_offset calculation in makedumpfile. It can prove the
advantage of appending 'page_offset_base' to vmcoreinfo. The old way I
took in makedumpfile could be impacted by kernel code change, adding it
to vmcoreinfo can make it stable. The example is KCORE_REMAP adding, and
later it's removed.

But it's not live debugging feature of makedumpfile. Makedumpfile can't be
used to live debug. The feature is called '--mem-usage' in makedumpfile,
in fact it's used to estimate how big the vmcore could be so that customer
can deply an appropriate size of storage space to store it. Because both
kcore and vmcore are all elf files which the 1st kernel's memory is
mapped to, even though they are different, kcore is dynamically changing.
This is more likely a precision in order of of magnitude. This is a feature
required by redhat customer.

I thought you are talking about using DaveA's crash utility to live
debug the running kernel, like we usually do with gdb.

	gdb vmlinux /proc/kcore

Yes, this gdb live debugging is broken because of KASLR. We have bug about
this, while it has not been fixed. Using Crash utility to replace gdb is
one way if Crash code is adjusted.

> (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> added to kcore, thus leading to an additional sections in the same, and
> makedumpfile is not longer able to determine the start of direct
> mapping of all physical memory, as it relies on traversing the PT_LOAD
> segments inside kcore and using the last PT_LOAD segment
> to determine the start of direct mapping.
...
> Testing:
> -------

This one vmcoreinfo entry adding won't impact kernel performance. And
page_offset_base need be got during makedumpfile initialization, it
won't impact makedumpfile efficiency either, especially compared with
the later page filterring and writting out to storage space. I don't
think there's any need to provide a detailed test result here. If
possible, just mention it works in this way, maybe it's better in some
aspects, such as code simplicity, etc.

>  - I tested this patch (rebased on 'linux-next') on a x86_64 machine
>    using the modified 'makedumpfile' user-space code (see [3] for my
>    github tree which contains the same) for determining how many pages
>    are dumpable when different dump_level is specified (which is
>    one use-case of live-debugging via 'makedumpfile').
>  - I tested both the KASLR and non-KASLR boot cases with this patch.
>  - Here is one sample log (for KASLR boot case) on my x86_64 machine:
> 
>    < snip..>
>    The kernel doesn't support mmap(),read() will be used instead.
> 
>    TYPE		PAGES			EXCLUDABLE	DESCRIPTION
>    ----------------------------------------------------------------------
>    ZERO		21299           	yes		Pages filled
>    with zero
>    NON_PRI_CACHE	91785           	yes		Cache
>    pages without private flag
>    PRI_CACHE	1               	yes		Cache pages with
>    private flag
>    USER		14057           	yes		User process
>    pages
>    FREE		740346          	yes		Free pages
>    KERN_DATA	58152           	no		Dumpable kernel
>    data
> 
>    page size:		4096
>    Total pages on system:	925640
>    Total size on system:	3791421440       Byte
> 
...

> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 4c8acdfdc5a7..6161d77c5bfb 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
>  	VMCOREINFO_SYMBOL(init_top_pgt);
>  	vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
>  			pgtable_l5_enabled());
> +#ifdef CONFIG_RANDOMIZE_BASE

Finally, embracing it into CONFIG_RANDOMIZE_BASE ifdefery seems not
right. The latest kernel is using page_offset_base to do the dynamic
memory layout between level4 and level5 changing. This may not work in
5-level system with CONFIG_RANDOMIZE_BASE=n.

> +	VMCOREINFO_NUMBER(page_offset_base);
> +#endif
>  
>  #ifdef CONFIG_NUMA
>  	VMCOREINFO_SYMBOL(node_data);
> -- 
> 2.7.4
>
Bhupesh Sharma Nov. 26, 2018, 7:31 p.m. UTC | #7
On Mon, Nov 26, 2018 at 6:58 AM Baoquan He <bhe@redhat.com> wrote:
>
> On 11/16/18 at 03:17am, Bhupesh Sharma wrote:
> > Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> > live-debugging of a running kernel via user-space utilities
> > like makedumpfile (see [1]).
> >
> > Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
> > details), whose live debugging feature is broken with newer kernels
>
> I think this paragraph explained why KCORE_REMAP adding caused the
> mistake of page_offset calculation in makedumpfile. It can prove the
> advantage of appending 'page_offset_base' to vmcoreinfo. The old way I
> took in makedumpfile could be impacted by kernel code change, adding it
> to vmcoreinfo can make it stable. The example is KCORE_REMAP adding, and
> later it's removed.
>
> But it's not live debugging feature of makedumpfile. Makedumpfile can't be
> used to live debug. The feature is called '--mem-usage' in makedumpfile,
> in fact it's used to estimate how big the vmcore could be so that customer
> can deply an appropriate size of storage space to store it. Because both
> kcore and vmcore are all elf files which the 1st kernel's memory is
> mapped to, even though they are different, kcore is dynamically changing.
> This is more likely a precision in order of of magnitude. This is a feature
> required by redhat customer.

Indeed this is a live debugging feature - see we are running this in
the primary kernel
context, not in kdump context. We are trying to debug a kernel we are
presently running (in this case determining the page mapping)
hence the term live debugging.

Also, this feature is not limited to redhat - we are talking in
upstream makedumpfile context here - it is used by other projects as
well which can have even a simple busybox rootfs configuration (e.g.
qemu).

> I thought you are talking about using DaveA's crash utility to live
> debug the running kernel, like we usually do with gdb.
>
>         gdb vmlinux /proc/kcore
>
> Yes, this gdb live debugging is broken because of KASLR. We have bug about
> this, while it has not been fixed. Using Crash utility to replace gdb is
> one way if Crash code is adjusted.
>
> > (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> > added to kcore, thus leading to an additional sections in the same, and
> > makedumpfile is not longer able to determine the start of direct
> > mapping of all physical memory, as it relies on traversing the PT_LOAD
> > segments inside kcore and using the last PT_LOAD segment
> > to determine the start of direct mapping.
> ...
> > Testing:
> > -------
>
> This one vmcoreinfo entry adding won't impact kernel performance. And
> page_offset_base need be got during makedumpfile initialization, it
> won't impact makedumpfile efficiency either, especially compared with
> the later page filterring and writting out to storage space. I don't
> think there's any need to provide a detailed test result here. If
> possible, just mention it works in this way, maybe it's better in some
> aspects, such as code simplicity, etc.
>
> >  - I tested this patch (rebased on 'linux-next') on a x86_64 machine
> >    using the modified 'makedumpfile' user-space code (see [3] for my
> >    github tree which contains the same) for determining how many pages
> >    are dumpable when different dump_level is specified (which is
> >    one use-case of live-debugging via 'makedumpfile').
> >  - I tested both the KASLR and non-KASLR boot cases with this patch.
> >  - Here is one sample log (for KASLR boot case) on my x86_64 machine:
> >
> >    < snip..>
> >    The kernel doesn't support mmap(),read() will be used instead.
> >
> >    TYPE               PAGES                   EXCLUDABLE      DESCRIPTION
> >    ----------------------------------------------------------------------
> >    ZERO               21299                   yes             Pages filled
> >    with zero
> >    NON_PRI_CACHE      91785                   yes             Cache
> >    pages without private flag
> >    PRI_CACHE  1                       yes             Cache pages with
> >    private flag
> >    USER               14057                   yes             User process
> >    pages
> >    FREE               740346                  yes             Free pages
> >    KERN_DATA  58152                   no              Dumpable kernel
> >    data
> >
> >    page size:         4096
> >    Total pages on system:     925640
> >    Total size on system:      3791421440       Byte
> >
> ...
>
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 4c8acdfdc5a7..6161d77c5bfb 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> >       VMCOREINFO_SYMBOL(init_top_pgt);
> >       vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> >                       pgtable_l5_enabled());
> > +#ifdef CONFIG_RANDOMIZE_BASE
>
> Finally, embracing it into CONFIG_RANDOMIZE_BASE ifdefery seems not
> right. The latest kernel is using page_offset_base to do the dynamic
> memory layout between level4 and level5 changing. This may not work in
> 5-level system with CONFIG_RANDOMIZE_BASE=n.

I think you missed the v2 change log and the build-bot error on v1
(see here: <https://patchwork.kernel.org/patch/10657933/#22291691>).
With .config files which have CONFIG_RANDOMIZE_BASE=n, we get the
following compilation error
without the #ifdef jugglery:

 arch/x86/kernel/machine_kexec_64.o: In function `arch_crash_save_vmcoreinfo':
 arch/x86/kernel/machine_kexec_64.c:359: undefined reference to
`page_offset_base'
 arch/x86/kernel/machine_kexec_64.c:359: undefined reference to
`page_offset_base'

Anyways, with Kazu's and Boris's comments on the v2, I understand that
adding 'page_offset_base' variable to vmcoreinfo is useful for x86
kernel.
I will now work on the v3 to take into account review comments and
also work with Lianbo to get the same added to the overall vmcoreinfo
documentation he is preparing for x86.

Thanks,
Bhupesh


> > +     VMCOREINFO_NUMBER(page_offset_base);
> > +#endif
> >
> >  #ifdef CONFIG_NUMA
> >       VMCOREINFO_SYMBOL(node_data);
> > --
> > 2.7.4
> >
Baoquan He Nov. 27, 2018, 6:48 a.m. UTC | #8
On 11/27/18 at 01:01am, Bhupesh Sharma wrote:
> > But it's not live debugging feature of makedumpfile. Makedumpfile can't be
> > used to live debug. The feature is called '--mem-usage' in makedumpfile,
> > in fact it's used to estimate how big the vmcore could be so that customer
> > can deply an appropriate size of storage space to store it. Because both
> > kcore and vmcore are all elf files which the 1st kernel's memory is
> > mapped to, even though they are different, kcore is dynamically changing.
> > This is more likely a precision in order of of magnitude. This is a feature
> > required by redhat customer.
> 
> Indeed this is a live debugging feature - see we are running this in
> the primary kernel
> context, not in kdump context. We are trying to debug a kernel we are
> presently running (in this case determining the page mapping)
> hence the term live debugging.

Thanks for redefining '--mem-usage' of makedumpfile as a live debugging
feature. I am the author of this feature, I don't know it has this
awesome attribute. Can you post patch to refresh it's doc or man page
about this live debugging thing? I think it's important, I am
astonished. I noticed Boris said you are adding this to support live
debug, just not sure if he is misled by you, or he also think this
vmcore size estimating feature of makedumpfile is live debugging
feature. 

> 
> Also, this feature is not limited to redhat - we are talking in
> upstream makedumpfile context here - it is used by other projects as
> well which can have even a simple busybox rootfs configuration (e.g.
> qemu).

Yes, it's not limited to redhat, I am just saying its origin.

> 
> > I thought you are talking about using DaveA's crash utility to live
> > debug the running kernel, like we usually do with gdb.
> >
> >         gdb vmlinux /proc/kcore
> >
> > Yes, this gdb live debugging is broken because of KASLR. We have bug about
> > this, while it has not been fixed. Using Crash utility to replace gdb is
> > one way if Crash code is adjusted.

......

> > > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > > index 4c8acdfdc5a7..6161d77c5bfb 100644
> > > --- a/arch/x86/kernel/machine_kexec_64.c
> > > +++ b/arch/x86/kernel/machine_kexec_64.c
> > > @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> > >       VMCOREINFO_SYMBOL(init_top_pgt);
> > >       vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> > >                       pgtable_l5_enabled());
> > > +#ifdef CONFIG_RANDOMIZE_BASE
> >
> > Finally, embracing it into CONFIG_RANDOMIZE_BASE ifdefery seems not
> > right. The latest kernel is using page_offset_base to do the dynamic
> > memory layout between level4 and level5 changing. This may not work in
> > 5-level system with CONFIG_RANDOMIZE_BASE=n.
> 
> I think you missed the v2 change log and the build-bot error on v1
> (see here: <https://patchwork.kernel.org/patch/10657933/#22291691>).
> With .config files which have CONFIG_RANDOMIZE_BASE=n, we get the
> following compilation error
> without the #ifdef jugglery:

Which line of your v2 change log did I miss?

Ask again, what if I set
CONFIG_DYNAMIC_MEMORY_LAYOUT=y && CONFIG_RANDOMIZE_MEMORY=n
in a kernel with LEVEL5 enabled to make kernel be able to switch
into l4 and l5 dynamically?

Is there anything wrong if I add it like this:

#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
	VMCOREINFO_SYMBOL(node_data);
#endif

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
...
#endif

config DYNAMIC_MEMORY_LAYOUT
        bool
        ---help---
          This option makes base addresses of vmalloc and vmemmap as well as
          __PAGE_OFFSET movable during boot.

config RANDOMIZE_MEMORY
        bool "Randomize the kernel memory sections"
        depends on X86_64
        depends on RANDOMIZE_BASE
        select DYNAMIC_MEMORY_LAYOUT
        default RANDOMIZE_BASE
        ---help---
           Randomizes the base virtual address of kernel memory sections
           (physical memory mapping, vmalloc & vmemmap). This security feature
           makes exploits relying on predictable memory locations less reliable.

           The order of allocations remains unchanged. Entropy is generated in
           the same way as RANDOMIZE_BASE. Current implementation in the optimal
           configuration have in average 30,000 different possible virtual
           addresses for each memory section.

           If unsure, say Y.

> 
>  arch/x86/kernel/machine_kexec_64.o: In function `arch_crash_save_vmcoreinfo':
>  arch/x86/kernel/machine_kexec_64.c:359: undefined reference to
> `page_offset_base'
>  arch/x86/kernel/machine_kexec_64.c:359: undefined reference to
> `page_offset_base'
> 
> Anyways, with Kazu's and Boris's comments on the v2, I understand that
> adding 'page_offset_base' variable to vmcoreinfo is useful for x86
> kernel.

Please notice that the makedumpfile '--mem-usage' bug has been fixed by
below commit. Adding 'page_offset_base' into vmcoreinfo is 50:50 to
me. Adding it to vmcoreinfo to expose, or deducing it from kcore/vmcore
program segment, both is fine to me since both has pro and con. Adding
it you have to give a good log, based on good understanding, but
not those confusing self invented term and the useless test results,
and wrong code.

commit 1ea989bf6a93db377689c16b61e9c2c6a9bafa17 (HEAD -> devel, origin/devel)
Author: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Date:   Wed Nov 21 13:57:31 2018 -0500

    [PATCH] x86_64: fix failure of getting kcore vmcoreinfo on kernel 4.19

> I will now work on the v3 to take into account review comments and
> also work with Lianbo to get the same added to the overall vmcoreinfo
> documentation he is preparing for x86.

So please do not post new version directly without replying to confirm if
we have agreement on concerns, then saying that is requested by me or
other people.

Thanks
Baoquan
Baoquan He Nov. 27, 2018, 7:15 a.m. UTC | #9
On 11/27/18 at 02:48pm, Baoquan He wrote:
 
> #ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
> 	VMCOREINFO_SYMBOL(node_data);
	^^^ 

> #endif

Wrong copy
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
	VMCOREINFO_NUMBER(page_offset_base);

#endif
Kees Cook Nov. 27, 2018, 10:16 p.m. UTC | #10
On Wed, Nov 21, 2018 at 3:39 AM, Borislav Petkov <bp@alien8.de> wrote:
> + Kees.
>
> On Fri, Nov 16, 2018 at 03:17:49AM +0530, Bhupesh Sharma wrote:
>> x86_64 kernel uses 'page_offset_base' variable to point to the
>> start of direct mapping of all physical memory. This variable
>> is also updated for KASLR boot cases, so this can be exported
>> via vmcoreinfo as a standard ABI between kernel and user-space,
>> to allow user-space utilities to use the same for calculating
>> the start of direct mapping of all physical memory.

Why is KERNELOFFSET= not sufficient?

See commit b6085a865762 ("x86, kaslr: export offset in VMCOREINFO ELF notes")

+       vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
+                             (unsigned long)&_text - __START_KERNEL);

-Kees

>> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>> index 4c8acdfdc5a7..6161d77c5bfb 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
>>       VMCOREINFO_SYMBOL(init_top_pgt);
>>       vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
>>                       pgtable_l5_enabled());
>> +#ifdef CONFIG_RANDOMIZE_BASE
>> +     VMCOREINFO_NUMBER(page_offset_base);
>> +#endif
>>
>>  #ifdef CONFIG_NUMA
>>       VMCOREINFO_SYMBOL(node_data);
Baoquan He Nov. 27, 2018, 11:29 p.m. UTC | #11
On 11/27/18 at 02:16pm, Kees Cook wrote:
> Why is KERNELOFFSET= not sufficient?
> 
> See commit b6085a865762 ("x86, kaslr: export offset in VMCOREINFO ELF notes")
> 
> +       vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
> +                             (unsigned long)&_text - __START_KERNEL);

KERNELOFFSET is virtual address delta after kernel text KASLR, namely
the offset from the original default kernel text virtual address,
0xffffffff88000000.

While after memory region KASLR in kernel_randomize_memory(), the
starting address of the direct mapping of physical memory, PAGE_OFFSET,
is changed too. We need get it to analyze memory in makedumpfile/crash.
Currently we deduce it from elf program segment of kcore:
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
......

  LOAD           0x00000a62c0004000 0xffff8a62c0001000 0x0000000000001000
                 0x000000000009c000 0x000000000009c000  RWE    1000

page_offset = 0xffff8a62c0001000 - 0x0000000000001000;
Since we put the direct mapping segments at the bottom part of kcore, we
can always get page_offset right.

Thanks
Baoquan

> 
> -Kees
> 
> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
> >> --- a/arch/x86/kernel/machine_kexec_64.c
> >> +++ b/arch/x86/kernel/machine_kexec_64.c
> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> >>       VMCOREINFO_SYMBOL(init_top_pgt);
> >>       vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> >>                       pgtable_l5_enabled());
> >> +#ifdef CONFIG_RANDOMIZE_BASE
> >> +     VMCOREINFO_NUMBER(page_offset_base);
> >> +#endif
> >>
> >>  #ifdef CONFIG_NUMA
> >>       VMCOREINFO_SYMBOL(node_data);
> 
> -- 
> Kees Cook
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
Kees Cook Nov. 28, 2018, 12:39 a.m. UTC | #12
On Tue, Nov 27, 2018 at 3:29 PM, Baoquan He <bhe@redhat.com> wrote:
> On 11/27/18 at 02:16pm, Kees Cook wrote:
>> Why is KERNELOFFSET= not sufficient?
>>
>> See commit b6085a865762 ("x86, kaslr: export offset in VMCOREINFO ELF notes")
>>
>> +       vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
>> +                             (unsigned long)&_text - __START_KERNEL);
>
> KERNELOFFSET is virtual address delta after kernel text KASLR, namely
> the offset from the original default kernel text virtual address,
> 0xffffffff88000000.
>
> While after memory region KASLR in kernel_randomize_memory(), the
> starting address of the direct mapping of physical memory, PAGE_OFFSET,
> is changed too. We need get it to analyze memory in makedumpfile/crash.
> Currently we deduce it from elf program segment of kcore:
> Program Headers:
>   Type           Offset             VirtAddr           PhysAddr
>                  FileSiz            MemSiz              Flags  Align
> ......
>
>   LOAD           0x00000a62c0004000 0xffff8a62c0001000 0x0000000000001000
>                  0x000000000009c000 0x000000000009c000  RWE    1000
>
> page_offset = 0xffff8a62c0001000 - 0x0000000000001000;
> Since we put the direct mapping segments at the bottom part of kcore, we
> can always get page_offset right.
>
> Thanks
> Baoquan
>
>>
>> -Kees
>>
>> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
>> >> --- a/arch/x86/kernel/machine_kexec_64.c
>> >> +++ b/arch/x86/kernel/machine_kexec_64.c
>> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
>> >>       VMCOREINFO_SYMBOL(init_top_pgt);
>> >>       vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
>> >>                       pgtable_l5_enabled());
>> >> +#ifdef CONFIG_RANDOMIZE_BASE

Okay, gotcha. In that case, shouldn't this be CONFIG_RANDOMIZE_MEMORY?

-Kees

>> >> +     VMCOREINFO_NUMBER(page_offset_base);
>> >> +#endif
>> >>
>> >>  #ifdef CONFIG_NUMA
>> >>       VMCOREINFO_SYMBOL(node_data);
>>
>> --
>> Kees Cook
>>
>> _______________________________________________
>> kexec mailing list
>> kexec@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
Baoquan He Nov. 28, 2018, 1:39 a.m. UTC | #13
On 11/27/18 at 04:39pm, Kees Cook wrote:
> >> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> >> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
> >> >> --- a/arch/x86/kernel/machine_kexec_64.c
> >> >> +++ b/arch/x86/kernel/machine_kexec_64.c
> >> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> >> >>       VMCOREINFO_SYMBOL(init_top_pgt);
> >> >>       vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> >> >>                       pgtable_l5_enabled());
> >> >> +#ifdef CONFIG_RANDOMIZE_BASE
> 
> Okay, gotcha. In that case, shouldn't this be CONFIG_RANDOMIZE_MEMORY?

Currently, Kirill added level5 support to x86_64, and kernel with level5
enabled can be boot switched into level4 or level5 with kernel option
"no5lvl". So page_offset_base will be changed accordingly. You can see
code pasted at bottom, DYNAMIC_MEMORY_LAYOUT is added for this change,
not only KASLR, but 5LEVEL.

If only put it inside ifdef CONFIG_RANDOMIZE_MEMORY, change between l4
and l5 will force us to decide page_offset again if
CONFIG_RANDOMIZE_MEMORY or CONFIG_RANDOMIZE_BASE is not set. Besides, 
below commit change the starting address of the direct mapping again, if
only judge CONFIG_RANDOMIZE_MEMORY, in case KASLR is disabled, code in
userspace may need many if-else checking as below. So if we pass, better
pass it for all.

get_page_offset()
{
	if(get_page_offset_from_vmcoreinfo()) {
		xxx //in KASLR case
		return;
	} else if (check_5level_paging()) {
		if (version < 4.21) {
			page_offset = 0xff10000000000000;
		} else //version > = 4.21
		{
			page_offset = 0xff11000000000000;
		}
		
	} else { //4level
		if (version < 2.6.27) {
			page_offset = 0xffff810000000000;
		} else if (version < 4.21) {
			page_offset = 0xffff880000000000;
		} else //version > = 4.21
		{
			page_offset = 0xffff888000000000,;
		}
	}
}

Sign, seeing above code, I still think that deducing it from
kcore/vmcore elf header is better. Can't we be better to ourselves?

commit d52888aa2753e3063a9d3a0c9f72f94aa9809c15
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Fri Oct 26 15:28:54 2018 +0300

    x86/mm: Move LDT remap out of KASLR region on 5-level paging

[bhe@ linux]$ git describe --contains d52888aa2753e3063a9d3a0c9f72f94aa9809c15
v4.20-rc2~5^2~4


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;                                                                           
EXPORT_SYMBOL(page_offset_base);
unsigned long vmalloc_base __ro_after_init = __VMALLOC_BASE_L4;
EXPORT_SYMBOL(vmalloc_base);
unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
EXPORT_SYMBOL(vmemmap_base);
#endif


config DYNAMIC_MEMORY_LAYOUT                                                                                                                      
        bool
        ---help---
          This option makes base addresses of vmalloc and vmemmap as well as
          __PAGE_OFFSET movable during boot.

config RANDOMIZE_MEMORY
        bool "Randomize the kernel memory sections"
        depends on X86_64
        depends on RANDOMIZE_BASE
        select DYNAMIC_MEMORY_LAYOUT
        default RANDOMIZE_BASE

config X86_5LEVEL
        bool "Enable 5-level page tables support"
        select DYNAMIC_MEMORY_LAYOUT
        select SPARSEMEM_VMEMMAP
        depends on X86_64
Baoquan He Nov. 28, 2018, 1:57 a.m. UTC | #14
On 11/27/18 at 04:39pm, Kees Cook wrote:
> >> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> >> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
> >> >> --- a/arch/x86/kernel/machine_kexec_64.c
> >> >> +++ b/arch/x86/kernel/machine_kexec_64.c
> >> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> >> >>       VMCOREINFO_SYMBOL(init_top_pgt);
> >> >>       vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> >> >>                       pgtable_l5_enabled());
> >> >> +#ifdef CONFIG_RANDOMIZE_BASE
> 
> Okay, gotcha. In that case, shouldn't this be CONFIG_RANDOMIZE_MEMORY?

And yes, if we only care about KASLR, it should be
CONFIG_RANDOMIZE_MEMORY, but not CONFIG_RANDOMIZE_BASE.
Bhupesh Sharma Nov. 28, 2018, 4:26 a.m. UTC | #15
On Wed, Nov 28, 2018 at 7:27 AM Baoquan He <bhe@redhat.com> wrote:
>
> On 11/27/18 at 04:39pm, Kees Cook wrote:
> > >> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > >> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
> > >> >> --- a/arch/x86/kernel/machine_kexec_64.c
> > >> >> +++ b/arch/x86/kernel/machine_kexec_64.c
> > >> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> > >> >>       VMCOREINFO_SYMBOL(init_top_pgt);
> > >> >>       vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> > >> >>                       pgtable_l5_enabled());
> > >> >> +#ifdef CONFIG_RANDOMIZE_BASE
> >
> > Okay, gotcha. In that case, shouldn't this be CONFIG_RANDOMIZE_MEMORY?
>
> And yes, if we only care about KASLR, it should be
> CONFIG_RANDOMIZE_MEMORY, but not CONFIG_RANDOMIZE_BASE.
>

Have you tried building the changes with 'make allmodconfig' with all
the different CONFIG options you have suggested so far?
Dave Young Nov. 28, 2018, 11:38 a.m. UTC | #16
> > now possible to use the vmcoreinfo present inside kcore as the standard
> > ABI which can be used by the user-space utilities for reading
> > the machine specific information (and hence for debugging a
> > live kernel).
> > 
> > User-space utilities like makedumpfile, kexec-tools and crash
> > are either already using this ABI or are discussing patches
> > which look to add the same feature. This helps in simplifying the
> > overall code and also in reducing code-rewrite across the
> > user-space utilities for getting values of these kernel
> > symbols/variables.
> 
> > Accordingly this patch allows appending 'page_offset_base' for
> > x86_64 platforms to vmcoreinfo, so that user-space tools can use the
> > same as a standard interface to determine the start of direct mapping
> > of all physical memory.
> > 

[snip]

> All above are only nitpicks though.
> 
> My opinion is this: people are exporting all kinds of kernel-internal
> stuff in vmcoreinfo and frankly, I'm not crazy about this idea.
> 
> And AFAICT, this thing basically bypasses KASLR completely but you need
> root for it so it probably doesn't really matter.
> 
> Now, on another thread we agreed more or less that what gets exported in
> vmcoreinfo is so tightly coupled to the running kernel so that it is not
> even considered an ABI. I guess that is debatable but whatever.

We do not regard this strictly as an ABI, but we also carefully review
every new extra exported thing and only export when we have to do so eg.
something breaks.

Seems this change only make userspace tools handling on the kaslr case
easier but since everything works without this patch I would prefer not to
do it.

> 
> So my only request right now would be to have all those things being
> exported, documented somewhere and I believe Lianbo is working on that.
> 
> But I'm sure others will have more to say about it.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Thanks
Dave

Patch
diff mbox series

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 4c8acdfdc5a7..6161d77c5bfb 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -356,6 +356,9 @@  void arch_crash_save_vmcoreinfo(void)
 	VMCOREINFO_SYMBOL(init_top_pgt);
 	vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
 			pgtable_l5_enabled());
+#ifdef CONFIG_RANDOMIZE_BASE
+	VMCOREINFO_NUMBER(page_offset_base);
+#endif
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);