mbox series

[RFCv1,0/6] Page Detective

Message ID 20241116175922.3265872-1-pasha.tatashin@soleen.com (mailing list archive)
Headers show
Series Page Detective | expand

Message

Pasha Tatashin Nov. 16, 2024, 5:59 p.m. UTC
Page Detective is a new kernel debugging tool that provides detailed
information about the usage and mapping of physical memory pages.

It is often known that a particular page is corrupted, but it is hard to
extract more information about such a page from live system. Examples
are:

- Checksum failure during live migration
- Filesystem journal failure
- dump_page warnings on the console log
- Unexcpected segfaults

Page Detective helps to extract more information from the kernel, so it
can be used by developers to root cause the associated problem.

It operates through the Linux debugfs interface, with two files: "virt"
and "phys".

The "virt" file takes a virtual address and PID and outputs information
about the corresponding page.

The "phys" file takes a physical address and outputs information about
that page.

The output is presented via kernel log messages (can be accessed with
dmesg), and includes information such as the page's reference count,
mapping, flags, and memory cgroup. It also shows whether the page is
mapped in the kernel page table, and if so, how many times.

Pasha Tatashin (6):
  mm: Make get_vma_name() function public
  pagewalk: Add a page table walker for init_mm page table
  mm: Add a dump_page variant that accept log level argument
  misc/page_detective: Introduce Page Detective
  misc/page_detective: enable loadable module
  selftests/page_detective: Introduce self tests for Page Detective

 Documentation/misc-devices/index.rst          |   1 +
 Documentation/misc-devices/page_detective.rst |  78 ++
 MAINTAINERS                                   |   8 +
 drivers/misc/Kconfig                          |  11 +
 drivers/misc/Makefile                         |   1 +
 drivers/misc/page_detective.c                 | 808 ++++++++++++++++++
 fs/inode.c                                    |  18 +-
 fs/kernfs/dir.c                               |   1 +
 fs/proc/task_mmu.c                            |  61 --
 include/linux/fs.h                            |   5 +-
 include/linux/mmdebug.h                       |   1 +
 include/linux/pagewalk.h                      |   2 +
 kernel/pid.c                                  |   1 +
 mm/debug.c                                    |  53 +-
 mm/memcontrol.c                               |   1 +
 mm/oom_kill.c                                 |   1 +
 mm/pagewalk.c                                 |  32 +
 mm/vma.c                                      |  60 ++
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/page_detective/.gitignore       |   1 +
 .../testing/selftests/page_detective/Makefile |   7 +
 tools/testing/selftests/page_detective/config |   4 +
 .../page_detective/page_detective_test.c      | 727 ++++++++++++++++
 23 files changed, 1787 insertions(+), 96 deletions(-)
 create mode 100644 Documentation/misc-devices/page_detective.rst
 create mode 100644 drivers/misc/page_detective.c
 create mode 100644 tools/testing/selftests/page_detective/.gitignore
 create mode 100644 tools/testing/selftests/page_detective/Makefile
 create mode 100644 tools/testing/selftests/page_detective/config
 create mode 100644 tools/testing/selftests/page_detective/page_detective_test.c

Comments

Lorenzo Stoakes Nov. 18, 2024, 11:17 a.m. UTC | #1
On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
> Page Detective is a new kernel debugging tool that provides detailed
> information about the usage and mapping of physical memory pages.
>
> It is often known that a particular page is corrupted, but it is hard to
> extract more information about such a page from live system. Examples
> are:
>
> - Checksum failure during live migration
> - Filesystem journal failure
> - dump_page warnings on the console log
> - Unexcpected segfaults
>
> Page Detective helps to extract more information from the kernel, so it
> can be used by developers to root cause the associated problem.

I like the _concept_ of providing more information like this.

But you've bizarrely gone to great lengths to expose mm internal
implementation details to drivers in order to implement this as a driver.

This is _very clearly_ an mm thing, and _very clearly_ subject to change
depending on how mm changes. It should live under mm/ and not be a loadable
driver.

I am also very very much not in favour of re-implementing yet another page
table walker, this time in driver code (!). Please no.

So NACK in its current form. This has to be implemented within mm if we are
to take it.

I'm also concerned about its scalability and impact on the system, as it
takes every single mm lock in the system at once, which seems kinda unwise
or at least problematic, and not something we want happening outside of mm,
at any rate.

>
> It operates through the Linux debugfs interface, with two files: "virt"
> and "phys".
>
> The "virt" file takes a virtual address and PID and outputs information
> about the corresponding page.
>
> The "phys" file takes a physical address and outputs information about
> that page.
>
> The output is presented via kernel log messages (can be accessed with
> dmesg), and includes information such as the page's reference count,
> mapping, flags, and memory cgroup. It also shows whether the page is
> mapped in the kernel page table, and if so, how many times.

I mean, even though I'm not a huge fan of kernel pointer hashing etc. this
is obviously leaking as much information as you might want about kernel
internal state to the point of maybe making the whole kernel pointer
hashing thing moot.

I know this requires CAP_SYS_ADMIN, but there are things that also require
that which _still_ obscure kernel pointers.

And you're outputting it all to dmesg.

So yeah, a security person (Jann?) would be better placed to comment on
this than me, but are we sure we want to do this when not in a
CONFIG_DEBUG_VM* kernel?

>
> Pasha Tatashin (6):
>   mm: Make get_vma_name() function public
>   pagewalk: Add a page table walker for init_mm page table
>   mm: Add a dump_page variant that accept log level argument
>   misc/page_detective: Introduce Page Detective
>   misc/page_detective: enable loadable module
>   selftests/page_detective: Introduce self tests for Page Detective
>
>  Documentation/misc-devices/index.rst          |   1 +
>  Documentation/misc-devices/page_detective.rst |  78 ++
>  MAINTAINERS                                   |   8 +
>  drivers/misc/Kconfig                          |  11 +
>  drivers/misc/Makefile                         |   1 +
>  drivers/misc/page_detective.c                 | 808 ++++++++++++++++++
>  fs/inode.c                                    |  18 +-
>  fs/kernfs/dir.c                               |   1 +
>  fs/proc/task_mmu.c                            |  61 --
>  include/linux/fs.h                            |   5 +-
>  include/linux/mmdebug.h                       |   1 +
>  include/linux/pagewalk.h                      |   2 +
>  kernel/pid.c                                  |   1 +
>  mm/debug.c                                    |  53 +-
>  mm/memcontrol.c                               |   1 +
>  mm/oom_kill.c                                 |   1 +
>  mm/pagewalk.c                                 |  32 +
>  mm/vma.c                                      |  60 ++
>  tools/testing/selftests/Makefile              |   1 +
>  .../selftests/page_detective/.gitignore       |   1 +
>  .../testing/selftests/page_detective/Makefile |   7 +
>  tools/testing/selftests/page_detective/config |   4 +
>  .../page_detective/page_detective_test.c      | 727 ++++++++++++++++
>  23 files changed, 1787 insertions(+), 96 deletions(-)
>  create mode 100644 Documentation/misc-devices/page_detective.rst
>  create mode 100644 drivers/misc/page_detective.c
>  create mode 100644 tools/testing/selftests/page_detective/.gitignore
>  create mode 100644 tools/testing/selftests/page_detective/Makefile
>  create mode 100644 tools/testing/selftests/page_detective/config
>  create mode 100644 tools/testing/selftests/page_detective/page_detective_test.c
>
> --
> 2.47.0.338.g60cca15819-goog
>
Jann Horn Nov. 18, 2024, 12:53 p.m. UTC | #2
On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
> > It operates through the Linux debugfs interface, with two files: "virt"
> > and "phys".
> >
> > The "virt" file takes a virtual address and PID and outputs information
> > about the corresponding page.
> >
> > The "phys" file takes a physical address and outputs information about
> > that page.
> >
> > The output is presented via kernel log messages (can be accessed with
> > dmesg), and includes information such as the page's reference count,
> > mapping, flags, and memory cgroup. It also shows whether the page is
> > mapped in the kernel page table, and if so, how many times.
>
> I mean, even though I'm not a huge fan of kernel pointer hashing etc. this
> is obviously leaking as much information as you might want about kernel
> internal state to the point of maybe making the whole kernel pointer
> hashing thing moot.
>
> I know this requires CAP_SYS_ADMIN, but there are things that also require
> that which _still_ obscure kernel pointers.
>
> And you're outputting it all to dmesg.
>
> So yeah, a security person (Jann?) would be better placed to comment on
> this than me, but are we sure we want to do this when not in a
> CONFIG_DEBUG_VM* kernel?

I guess there are two parts to this - what root is allowed to do, and
what information we're fine with exposing to dmesg.

If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the
kernel allows root to read kernel memory through some interfaces - in
particular, BPF allows reading arbitrary kernel memory, and perf
allows reading at least some stuff (like kernel register states). With
lockdown in the most restrictive mode, the kernel tries to prevent
root from reading arbitrary kernel memory, but we don't really change
how much information goes into dmesg. (And I imagine you could
probably still get kernel pointers out of BPF somehow even in the most
restrictive lockdown mode, but that's probably not relevant.)

The main issue with dmesg is that some systems make its contents
available to code that is not running with root privileges; and I
think it is also sometimes stored persistently in unencrypted form
(like in EFI pstore) even when everything else on the system is
encrypted.
So on one hand, we definitely shouldn't print the contents of random
chunks of memory into dmesg without a good reason; on the other hand,
for example we do already print kernel register state on WARN() (which
often includes kernel pointers and could theoretically include more
sensitive data too).

So I think showing page metadata to root when requested is probably
okay as a tradeoff? And dumping that data into dmesg is maybe not
great, but acceptable as long as only root can actually trigger this?

I don't really have a strong opinion on this...


To me, a bigger issue is that dump_page() looks like it might be racy,
which is maybe not terrible in debugging code that only runs when
something has already gone wrong, but bad if it is in code that root
can trigger on demand? __dump_page() copies the given page with
memcpy(), which I don't think guarantees enough atomicity with
concurrent updates of page->mapping or such, so dump_mapping() could
probably run on a bogus pointer. Even without torn pointers, I think
there could be a UAF if the page's mapping is destroyed while we're
going through dump_page(), since the page might not be locked. And in
dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard
against concurrent renaming of the dentry, which I think again would
probably result in UAF.
So I think dump_page() in its current form is not something we should
expose to a userspace-reachable API.
Roman Gushchin Nov. 18, 2024, 7:11 p.m. UTC | #3
On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
> Page Detective is a new kernel debugging tool that provides detailed
> information about the usage and mapping of physical memory pages.
> 
> It is often known that a particular page is corrupted, but it is hard to
> extract more information about such a page from live system. Examples
> are:
> 
> - Checksum failure during live migration
> - Filesystem journal failure
> - dump_page warnings on the console log
> - Unexcpected segfaults
> 
> Page Detective helps to extract more information from the kernel, so it
> can be used by developers to root cause the associated problem.
> 
> It operates through the Linux debugfs interface, with two files: "virt"
> and "phys".
> 
> The "virt" file takes a virtual address and PID and outputs information
> about the corresponding page.
> 
> The "phys" file takes a physical address and outputs information about
> that page.
> 
> The output is presented via kernel log messages (can be accessed with
> dmesg), and includes information such as the page's reference count,
> mapping, flags, and memory cgroup. It also shows whether the page is
> mapped in the kernel page table, and if so, how many times.

This looks questionable both from the security and convenience points of view.
Given the request-response nature of the interface, the output can be
provided using a "normal" seq-based pseudo-file.

But I have a more generic question:
doesn't it make sense to implement it as a set of drgn scripts instead
of kernel code? This provides more flexibility, is safer (even if it's buggy,
you won't crash the host) and should be at least in theory equally
powerful.

Thanks!
Pasha Tatashin Nov. 18, 2024, 10:08 p.m. UTC | #4
On Mon, Nov 18, 2024 at 2:11 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
> > Page Detective is a new kernel debugging tool that provides detailed
> > information about the usage and mapping of physical memory pages.
> >
> > It is often known that a particular page is corrupted, but it is hard to
> > extract more information about such a page from live system. Examples
> > are:
> >
> > - Checksum failure during live migration
> > - Filesystem journal failure
> > - dump_page warnings on the console log
> > - Unexcpected segfaults
> >
> > Page Detective helps to extract more information from the kernel, so it
> > can be used by developers to root cause the associated problem.
> >
> > It operates through the Linux debugfs interface, with two files: "virt"
> > and "phys".
> >
> > The "virt" file takes a virtual address and PID and outputs information
> > about the corresponding page.
> >
> > The "phys" file takes a physical address and outputs information about
> > that page.
> >
> > The output is presented via kernel log messages (can be accessed with
> > dmesg), and includes information such as the page's reference count,
> > mapping, flags, and memory cgroup. It also shows whether the page is
> > mapped in the kernel page table, and if so, how many times.
>
> This looks questionable both from the security and convenience points of view.
> Given the request-response nature of the interface, the output can be
> provided using a "normal" seq-based pseudo-file.

We opted to use dmesg for output because it's the standard method for
capturing kernel information and is commonly included in bug reports.
Introducing a new file would require modifying existing data
collection scripts used for reporting, so this approach minimizes
disruption to existing workflows.

> But I have a more generic question:
> doesn't it make sense to implement it as a set of drgn scripts instead
> of kernel code? This provides more flexibility, is safer (even if it's buggy,
> you won't crash the host) and should be at least in theory equally
> powerful.

Regarding your suggestion, our plan is to perform reverse lookups in
all page tables: kernel, user, IOMMU, and KVM. Currently, we only
traverse the kernel and user page tables, but we intend to extend this
functionality to IOMMU and KVM tables in future updates, I am not sure
if drgn can provide this level of details within a reasonable amount
of time.

This approach will be helpful for debugging memory corruption
scenarios. Often, external mechanisms detect corruption but require
kernel-level information for root cause analysis. In our experience,
invalid mappings persist in page tables for a period after corruption,
providing a window to identify other users of the corrupted page via
timely reverse lookup.

Additionally, using crash/drgn is not feasible for us at this time, it
requires keeping external tools on our hosts, also it requires
approval and a security review for each script before deployment in
our fleet.

Thanks,
Pasha
Pasha Tatashin Nov. 18, 2024, 10:24 p.m. UTC | #5
On Mon, Nov 18, 2024 at 7:54 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
> > > It operates through the Linux debugfs interface, with two files: "virt"
> > > and "phys".
> > >
> > > The "virt" file takes a virtual address and PID and outputs information
> > > about the corresponding page.
> > >
> > > The "phys" file takes a physical address and outputs information about
> > > that page.
> > >
> > > The output is presented via kernel log messages (can be accessed with
> > > dmesg), and includes information such as the page's reference count,
> > > mapping, flags, and memory cgroup. It also shows whether the page is
> > > mapped in the kernel page table, and if so, how many times.
> >
> > I mean, even though I'm not a huge fan of kernel pointer hashing etc. this
> > is obviously leaking as much information as you might want about kernel
> > internal state to the point of maybe making the whole kernel pointer
> > hashing thing moot.
> >
> > I know this requires CAP_SYS_ADMIN, but there are things that also require
> > that which _still_ obscure kernel pointers.
> >
> > And you're outputting it all to dmesg.
> >
> > So yeah, a security person (Jann?) would be better placed to comment on
> > this than me, but are we sure we want to do this when not in a
> > CONFIG_DEBUG_VM* kernel?
>
> I guess there are two parts to this - what root is allowed to do, and
> what information we're fine with exposing to dmesg.
>
> If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the
> kernel allows root to read kernel memory through some interfaces - in
> particular, BPF allows reading arbitrary kernel memory, and perf
> allows reading at least some stuff (like kernel register states). With
> lockdown in the most restrictive mode, the kernel tries to prevent
> root from reading arbitrary kernel memory, but we don't really change
> how much information goes into dmesg. (And I imagine you could
> probably still get kernel pointers out of BPF somehow even in the most
> restrictive lockdown mode, but that's probably not relevant.)
>
> The main issue with dmesg is that some systems make its contents
> available to code that is not running with root privileges; and I
> think it is also sometimes stored persistently in unencrypted form
> (like in EFI pstore) even when everything else on the system is
> encrypted.
> So on one hand, we definitely shouldn't print the contents of random
> chunks of memory into dmesg without a good reason; on the other hand,
> for example we do already print kernel register state on WARN() (which
> often includes kernel pointers and could theoretically include more
> sensitive data too).
>
> So I think showing page metadata to root when requested is probably
> okay as a tradeoff? And dumping that data into dmesg is maybe not
> great, but acceptable as long as only root can actually trigger this?
>
> I don't really have a strong opinion on this...
>
>
> To me, a bigger issue is that dump_page() looks like it might be racy,
> which is maybe not terrible in debugging code that only runs when
> something has already gone wrong, but bad if it is in code that root
> can trigger on demand?

Hi Jann, thank you for reviewing this proposal.

Presumably, the interface should be used only when something has gone
wrong but has not been noticed by the kernel. That something is
usually checksums failures that are outside of the kernel: i.e. during
live migration, snapshotting, filesystem journaling, etc. We already
have interfaces that provide data from the live kernel that could be
racy, i.e. crash utility.

> __dump_page() copies the given page with
> memcpy(), which I don't think guarantees enough atomicity with
> concurrent updates of page->mapping or such, so dump_mapping() could
> probably run on a bogus pointer. Even without torn pointers, I think
> there could be a UAF if the page's mapping is destroyed while we're
> going through dump_page(), since the page might not be locked. And in
> dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard
> against concurrent renaming of the dentry, which I think again would
> probably result in UAF.

Since we are holding a reference on the page at the time of
dump_page(), the identity of the page should not really change, but
dentry can be renamed.

> So I think dump_page() in its current form is not something we should
> expose to a userspace-reachable API.

We use dump_page() all over WARN_ONs in MM code where pages might not
be locked, but this is a good point, that while even the existing
usage might be racy, providing a user-reachable API potentially makes
it worse. I will see if I could add some locking before dump_page(),
or make a dump_page variant that does not do dump_mapping().
Jann Horn Nov. 19, 2024, 12:39 a.m. UTC | #6
On Mon, Nov 18, 2024 at 11:24 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
> On Mon, Nov 18, 2024 at 7:54 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
> > > > It operates through the Linux debugfs interface, with two files: "virt"
> > > > and "phys".
> > > >
> > > > The "virt" file takes a virtual address and PID and outputs information
> > > > about the corresponding page.
> > > >
> > > > The "phys" file takes a physical address and outputs information about
> > > > that page.
> > > >
> > > > The output is presented via kernel log messages (can be accessed with
> > > > dmesg), and includes information such as the page's reference count,
> > > > mapping, flags, and memory cgroup. It also shows whether the page is
> > > > mapped in the kernel page table, and if so, how many times.
> > >
> > > I mean, even though I'm not a huge fan of kernel pointer hashing etc. this
> > > is obviously leaking as much information as you might want about kernel
> > > internal state to the point of maybe making the whole kernel pointer
> > > hashing thing moot.
> > >
> > > I know this requires CAP_SYS_ADMIN, but there are things that also require
> > > that which _still_ obscure kernel pointers.
> > >
> > > And you're outputting it all to dmesg.
> > >
> > > So yeah, a security person (Jann?) would be better placed to comment on
> > > this than me, but are we sure we want to do this when not in a
> > > CONFIG_DEBUG_VM* kernel?
> >
> > I guess there are two parts to this - what root is allowed to do, and
> > what information we're fine with exposing to dmesg.
> >
> > If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the
> > kernel allows root to read kernel memory through some interfaces - in
> > particular, BPF allows reading arbitrary kernel memory, and perf
> > allows reading at least some stuff (like kernel register states). With
> > lockdown in the most restrictive mode, the kernel tries to prevent
> > root from reading arbitrary kernel memory, but we don't really change
> > how much information goes into dmesg. (And I imagine you could
> > probably still get kernel pointers out of BPF somehow even in the most
> > restrictive lockdown mode, but that's probably not relevant.)
> >
> > The main issue with dmesg is that some systems make its contents
> > available to code that is not running with root privileges; and I
> > think it is also sometimes stored persistently in unencrypted form
> > (like in EFI pstore) even when everything else on the system is
> > encrypted.
> > So on one hand, we definitely shouldn't print the contents of random
> > chunks of memory into dmesg without a good reason; on the other hand,
> > for example we do already print kernel register state on WARN() (which
> > often includes kernel pointers and could theoretically include more
> > sensitive data too).
> >
> > So I think showing page metadata to root when requested is probably
> > okay as a tradeoff? And dumping that data into dmesg is maybe not
> > great, but acceptable as long as only root can actually trigger this?
> >
> > I don't really have a strong opinion on this...
> >
> >
> > To me, a bigger issue is that dump_page() looks like it might be racy,
> > which is maybe not terrible in debugging code that only runs when
> > something has already gone wrong, but bad if it is in code that root
> > can trigger on demand?
>
> Hi Jann, thank you for reviewing this proposal.
>
> Presumably, the interface should be used only when something has gone
> wrong but has not been noticed by the kernel. That something is
> usually checksums failures that are outside of the kernel: i.e. during
> live migration, snapshotting, filesystem journaling, etc. We already
> have interfaces that provide data from the live kernel that could be
> racy, i.e. crash utility.

Ah, yes, I'm drawing a distinction here between "something has gone
wrong internally in the kernel and the kernel does some kinda-broken
best-effort self-diagnostics" and "userspace thinks something is
broken and asks the kernel".

> > __dump_page() copies the given page with
> > memcpy(), which I don't think guarantees enough atomicity with
> > concurrent updates of page->mapping or such, so dump_mapping() could
> > probably run on a bogus pointer. Even without torn pointers, I think
> > there could be a UAF if the page's mapping is destroyed while we're
> > going through dump_page(), since the page might not be locked. And in
> > dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard
> > against concurrent renaming of the dentry, which I think again would
> > probably result in UAF.
>
> Since we are holding a reference on the page at the time of
> dump_page(), the identity of the page should not really change, but
> dentry can be renamed.

Can you point me to where a refcounted reference to the page comes
from when page_detective_metadata() calls dump_page_lvl()?

> > So I think dump_page() in its current form is not something we should
> > expose to a userspace-reachable API.
>
> We use dump_page() all over WARN_ONs in MM code where pages might not
> be locked, but this is a good point, that while even the existing
> usage might be racy, providing a user-reachable API potentially makes
> it worse. I will see if I could add some locking before dump_page(),
> or make a dump_page variant that does not do dump_mapping().

To be clear, I am not that strongly opposed to racily reading data
such that the data may not be internally consistent or such; but this
is a case of racy use-after-free reads that might end up dumping
entirely unrelated memory contents into dmesg. I think we should
properly protect against that in an API that userspace can invoke.
Otherwise, if we race, we might end up writing random memory contents
into dmesg; and if we are particularly unlucky, those random memory
contents could be PII or authentication tokens or such.

I'm not entirely sure what the right approach is here; I guess it
makes sense that when the kernel internally detects corruption,
dump_page doesn't take references on pages it accesses to avoid
corrupting things further. If you are looking at a page based on a
userspace request, I guess you could access the page with the
necessary locking to access its properties under the normal locking
rules?

(If anyone else has opinions either way on this line I'm trying to
draw between kernel-internal debug paths and userspace-triggerable
debugging, feel free to share; I hope my mental model makes sense but
I could imagine other folks having a different model of this?)
Greg Kroah-Hartman Nov. 19, 2024, 1:09 a.m. UTC | #7
On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> Additionally, using crash/drgn is not feasible for us at this time, it
> requires keeping external tools on our hosts, also it requires
> approval and a security review for each script before deployment in
> our fleet.

So it's ok to add a totally insecure kernel feature to your fleet
instead?  You might want to reconsider that policy decision :)

good luck!

greg k-h
Pasha Tatashin Nov. 19, 2024, 1:29 a.m. UTC | #8
> Can you point me to where a refcounted reference to the page comes
> from when page_detective_metadata() calls dump_page_lvl()?

I am sorry, I remembered incorrectly, we are getting reference right
after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I
will move the folio_try_get() to before dump_page_lvl().

> > > So I think dump_page() in its current form is not something we should
> > > expose to a userspace-reachable API.
> >
> > We use dump_page() all over WARN_ONs in MM code where pages might not
> > be locked, but this is a good point, that while even the existing
> > usage might be racy, providing a user-reachable API potentially makes
> > it worse. I will see if I could add some locking before dump_page(),
> > or make a dump_page variant that does not do dump_mapping().
>
> To be clear, I am not that strongly opposed to racily reading data
> such that the data may not be internally consistent or such; but this
> is a case of racy use-after-free reads that might end up dumping
> entirely unrelated memory contents into dmesg. I think we should
> properly protect against that in an API that userspace can invoke.
> Otherwise, if we race, we might end up writing random memory contents
> into dmesg; and if we are particularly unlucky, those random memory
> contents could be PII or authentication tokens or such.
>
> I'm not entirely sure what the right approach is here; I guess it
> makes sense that when the kernel internally detects corruption,
> dump_page doesn't take references on pages it accesses to avoid
> corrupting things further. If you are looking at a page based on a
> userspace request, I guess you could access the page with the
> necessary locking to access its properties under the normal locking
> rules?

I will take reference, as we already do that for memcg purpose, but
have not included dump_page().

Thank you,
Pasha
Jann Horn Nov. 19, 2024, 12:52 p.m. UTC | #9
On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
> > Can you point me to where a refcounted reference to the page comes
> > from when page_detective_metadata() calls dump_page_lvl()?
>
> I am sorry, I remembered incorrectly, we are getting reference right
> after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I
> will move the folio_try_get() to before dump_page_lvl().
>
> > > > So I think dump_page() in its current form is not something we should
> > > > expose to a userspace-reachable API.
> > >
> > > We use dump_page() all over WARN_ONs in MM code where pages might not
> > > be locked, but this is a good point, that while even the existing
> > > usage might be racy, providing a user-reachable API potentially makes
> > > it worse. I will see if I could add some locking before dump_page(),
> > > or make a dump_page variant that does not do dump_mapping().
> >
> > To be clear, I am not that strongly opposed to racily reading data
> > such that the data may not be internally consistent or such; but this
> > is a case of racy use-after-free reads that might end up dumping
> > entirely unrelated memory contents into dmesg. I think we should
> > properly protect against that in an API that userspace can invoke.
> > Otherwise, if we race, we might end up writing random memory contents
> > into dmesg; and if we are particularly unlucky, those random memory
> > contents could be PII or authentication tokens or such.
> >
> > I'm not entirely sure what the right approach is here; I guess it
> > makes sense that when the kernel internally detects corruption,
> > dump_page doesn't take references on pages it accesses to avoid
> > corrupting things further. If you are looking at a page based on a
> > userspace request, I guess you could access the page with the
> > necessary locking to access its properties under the normal locking
> > rules?
>
> I will take reference, as we already do that for memcg purpose, but
> have not included dump_page().

Note that taking a reference on the page does not make all of
dump_page() fine; in particular, my understanding is that
folio_mapping() requires that the page is locked in order to return a
stable pointer, and some of the code in dump_mapping() would probably
also require some other locks - probably at least on the inode and
maybe also on the dentry, I think? Otherwise the inode's dentry list
can probably change concurrently, and the dentry's name pointer can
change too.
Pasha Tatashin Nov. 19, 2024, 3:08 p.m. UTC | #10
On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> > Additionally, using crash/drgn is not feasible for us at this time, it
> > requires keeping external tools on our hosts, also it requires
> > approval and a security review for each script before deployment in
> > our fleet.
>
> So it's ok to add a totally insecure kernel feature to your fleet
> instead?  You might want to reconsider that policy decision :)

Hi Greg,

While some risk is inherent, we believe the potential for abuse here
is limited, especially given the existing  CAP_SYS_ADMIN requirement.
But, even with root access compromised, this tool presents a smaller
attack surface than alternatives like crash/drgn. It exposes less
sensitive information, unlike crash/drgn, which could potentially
allow reading all of kernel memory.

Pasha
Pasha Tatashin Nov. 19, 2024, 3:14 p.m. UTC | #11
On Tue, Nov 19, 2024 at 7:52 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> > > Can you point me to where a refcounted reference to the page comes
> > > from when page_detective_metadata() calls dump_page_lvl()?
> >
> > I am sorry, I remembered incorrectly, we are getting reference right
> > after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I
> > will move the folio_try_get() to before dump_page_lvl().
> >
> > > > > So I think dump_page() in its current form is not something we should
> > > > > expose to a userspace-reachable API.
> > > >
> > > > We use dump_page() all over WARN_ONs in MM code where pages might not
> > > > be locked, but this is a good point, that while even the existing
> > > > usage might be racy, providing a user-reachable API potentially makes
> > > > it worse. I will see if I could add some locking before dump_page(),
> > > > or make a dump_page variant that does not do dump_mapping().
> > >
> > > To be clear, I am not that strongly opposed to racily reading data
> > > such that the data may not be internally consistent or such; but this
> > > is a case of racy use-after-free reads that might end up dumping
> > > entirely unrelated memory contents into dmesg. I think we should
> > > properly protect against that in an API that userspace can invoke.
> > > Otherwise, if we race, we might end up writing random memory contents
> > > into dmesg; and if we are particularly unlucky, those random memory
> > > contents could be PII or authentication tokens or such.
> > >
> > > I'm not entirely sure what the right approach is here; I guess it
> > > makes sense that when the kernel internally detects corruption,
> > > dump_page doesn't take references on pages it accesses to avoid
> > > corrupting things further. If you are looking at a page based on a
> > > userspace request, I guess you could access the page with the
> > > necessary locking to access its properties under the normal locking
> > > rules?
> >
> > I will take reference, as we already do that for memcg purpose, but
> > have not included dump_page().
>
> Note that taking a reference on the page does not make all of
> dump_page() fine; in particular, my understanding is that
> folio_mapping() requires that the page is locked in order to return a
> stable pointer, and some of the code in dump_mapping() would probably
> also require some other locks - probably at least on the inode and
> maybe also on the dentry, I think? Otherwise the inode's dentry list
> can probably change concurrently, and the dentry's name pointer can
> change too.

Agreed, once reference is taken, the page identity cannot change (i.e.
if it is a named page it will stay a named page), but dentry can be
renamed. I will look into what can be done to guarantee consistency in
the next version. There is also a fallback if locking cannot be
reliably resolved (i.e. for performance reasons) where we can make
dump_mapping() optionally disabled from dump_page_lvl() with a new
argument flag.

Thank you,
Pasha
Jann Horn Nov. 19, 2024, 3:53 p.m. UTC | #12
On Tue, Nov 19, 2024 at 4:14 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
> On Tue, Nov 19, 2024 at 7:52 AM Jann Horn <jannh@google.com> wrote:
> > On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > > > Can you point me to where a refcounted reference to the page comes
> > > > from when page_detective_metadata() calls dump_page_lvl()?
> > >
> > > I am sorry, I remembered incorrectly, we are getting reference right
> > > after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I
> > > will move the folio_try_get() to before dump_page_lvl().
> > >
> > > > > > So I think dump_page() in its current form is not something we should
> > > > > > expose to a userspace-reachable API.
> > > > >
> > > > > We use dump_page() all over WARN_ONs in MM code where pages might not
> > > > > be locked, but this is a good point, that while even the existing
> > > > > usage might be racy, providing a user-reachable API potentially makes
> > > > > it worse. I will see if I could add some locking before dump_page(),
> > > > > or make a dump_page variant that does not do dump_mapping().
> > > >
> > > > To be clear, I am not that strongly opposed to racily reading data
> > > > such that the data may not be internally consistent or such; but this
> > > > is a case of racy use-after-free reads that might end up dumping
> > > > entirely unrelated memory contents into dmesg. I think we should
> > > > properly protect against that in an API that userspace can invoke.
> > > > Otherwise, if we race, we might end up writing random memory contents
> > > > into dmesg; and if we are particularly unlucky, those random memory
> > > > contents could be PII or authentication tokens or such.
> > > >
> > > > I'm not entirely sure what the right approach is here; I guess it
> > > > makes sense that when the kernel internally detects corruption,
> > > > dump_page doesn't take references on pages it accesses to avoid
> > > > corrupting things further. If you are looking at a page based on a
> > > > userspace request, I guess you could access the page with the
> > > > necessary locking to access its properties under the normal locking
> > > > rules?
> > >
> > > I will take reference, as we already do that for memcg purpose, but
> > > have not included dump_page().
> >
> > Note that taking a reference on the page does not make all of
> > dump_page() fine; in particular, my understanding is that
> > folio_mapping() requires that the page is locked in order to return a
> > stable pointer, and some of the code in dump_mapping() would probably
> > also require some other locks - probably at least on the inode and
> > maybe also on the dentry, I think? Otherwise the inode's dentry list
> > can probably change concurrently, and the dentry's name pointer can
> > change too.
>
> Agreed, once reference is taken, the page identity cannot change (i.e.
> if it is a named page it will stay a named page), but dentry can be
> renamed. I will look into what can be done to guarantee consistency in
> the next version. There is also a fallback if locking cannot be
> reliably resolved (i.e. for performance reasons) where we can make
> dump_mapping() optionally disabled from dump_page_lvl() with a new
> argument flag.

Yeah, I think if you don't need the details that dump_mapping() shows,
skipping that for user-requested dumps might be a reasonable option.
Roman Gushchin Nov. 19, 2024, 6:23 p.m. UTC | #13
On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote:
> On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> > > Additionally, using crash/drgn is not feasible for us at this time, it
> > > requires keeping external tools on our hosts, also it requires
> > > approval and a security review for each script before deployment in
> > > our fleet.
> >
> > So it's ok to add a totally insecure kernel feature to your fleet
> > instead?  You might want to reconsider that policy decision :)
> 
> Hi Greg,
> 
> While some risk is inherent, we believe the potential for abuse here
> is limited, especially given the existing  CAP_SYS_ADMIN requirement.
> But, even with root access compromised, this tool presents a smaller
> attack surface than alternatives like crash/drgn. It exposes less
> sensitive information, unlike crash/drgn, which could potentially
> allow reading all of kernel memory.

The problem here is with using dmesg for output. No security-sensitive
information should go there. Even exposing raw kernel pointers is not
considered safe.

I'm also not sure about what presents a bigger attack surface. Yes,
drgn allows to read more, but it's using /proc/kcore, so the in-kernel
code is much simpler. But I don't think it's a relevant discussion,
if a malicious user has a root access, there are better options than
both drgn and page detective.
Matthew Wilcox (Oracle) Nov. 19, 2024, 6:51 p.m. UTC | #14
On Tue, Nov 19, 2024 at 01:52:00PM +0100, Jann Horn wrote:
> > I will take reference, as we already do that for memcg purpose, but
> > have not included dump_page().
> 
> Note that taking a reference on the page does not make all of
> dump_page() fine; in particular, my understanding is that
> folio_mapping() requires that the page is locked in order to return a
> stable pointer, and some of the code in dump_mapping() would probably
> also require some other locks - probably at least on the inode and
> maybe also on the dentry, I think? Otherwise the inode's dentry list
> can probably change concurrently, and the dentry's name pointer can
> change too.

First important thing is that we snapshot the page.  So while we may
have a torn snapshot of the page, it can't change under us any more,
so we don't have to worry about it being swizzled one way and then
swizzled back.

Second thing is that I think using folio_mapping() is actually wrong.
We don't want the swap mapping if it's an anon page that's in the
swapcache.  We'd be fine just doing mapping = folio->mapping (we'd need
to add a check for movable, but I think that's fine).  Anyway, we know
the folio isn't ksm or anon at the point that we call dump_mapping()
because there's a chain of "else" statements.  So I think we're fine
because we can't switch between anon & file while holding a refcount.

Having a refcount on the folio will prevent the folio from being allocated
to anything else again.  It will not protect the mapping from being torn
down (the folio can be truncated from the mapping, then the mapping can
be freed, and the memory reused).  As you say, the dentry can be renamed
as well.

This patch series makes me nervous.  I'd rather see it done as a bpf
script or drgn script, but if it is going to be done in C, I'd really
like to see more auditing of the safety here.  It feels like the kind
of hack that one deploys internally to debug a hard-to-hit condition,
rather than the kind of code that we like to ship upstream.
Pasha Tatashin Nov. 19, 2024, 7:30 p.m. UTC | #15
On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote:
> > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> > > > Additionally, using crash/drgn is not feasible for us at this time, it
> > > > requires keeping external tools on our hosts, also it requires
> > > > approval and a security review for each script before deployment in
> > > > our fleet.
> > >
> > > So it's ok to add a totally insecure kernel feature to your fleet
> > > instead?  You might want to reconsider that policy decision :)
> >
> > Hi Greg,
> >
> > While some risk is inherent, we believe the potential for abuse here
> > is limited, especially given the existing  CAP_SYS_ADMIN requirement.
> > But, even with root access compromised, this tool presents a smaller
> > attack surface than alternatives like crash/drgn. It exposes less
> > sensitive information, unlike crash/drgn, which could potentially
> > allow reading all of kernel memory.
>
> The problem here is with using dmesg for output. No security-sensitive
> information should go there. Even exposing raw kernel pointers is not
> considered safe.

I am OK in writing the output to a debugfs file in the next version,
the only concern I have is that implies that dump_page() would need to
be basically duplicated, as it now outputs everything via printk's.
Yosry Ahmed Nov. 19, 2024, 7:35 p.m. UTC | #16
On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote:
> > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> > > > > Additionally, using crash/drgn is not feasible for us at this time, it
> > > > > requires keeping external tools on our hosts, also it requires
> > > > > approval and a security review for each script before deployment in
> > > > > our fleet.
> > > >
> > > > So it's ok to add a totally insecure kernel feature to your fleet
> > > > instead?  You might want to reconsider that policy decision :)
> > >
> > > Hi Greg,
> > >
> > > While some risk is inherent, we believe the potential for abuse here
> > > is limited, especially given the existing  CAP_SYS_ADMIN requirement.
> > > But, even with root access compromised, this tool presents a smaller
> > > attack surface than alternatives like crash/drgn. It exposes less
> > > sensitive information, unlike crash/drgn, which could potentially
> > > allow reading all of kernel memory.
> >
> > The problem here is with using dmesg for output. No security-sensitive
> > information should go there. Even exposing raw kernel pointers is not
> > considered safe.
>
> I am OK in writing the output to a debugfs file in the next version,
> the only concern I have is that implies that dump_page() would need to
> be basically duplicated, as it now outputs everything via printk's.

Perhaps you can refactor the code in dump_page() to use a seq_buf,
then have dump_page() printk that seq_buf using seq_buf_do_printk(),
and have page detective output that seq_buf to the debugfs file?

We do something very similar with memory_stat_format(). We use the
same function to generate the memcg stats in a seq_buf, then we use
that seq_buf to output the stats to memory.stat as well as the OOM
log.
Roman Gushchin Nov. 19, 2024, 8:57 p.m. UTC | #17
On Tue, Nov 19, 2024 at 11:35:47AM -0800, Yosry Ahmed wrote:
> On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote:
> > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> > > > > > Additionally, using crash/drgn is not feasible for us at this time, it
> > > > > > requires keeping external tools on our hosts, also it requires
> > > > > > approval and a security review for each script before deployment in
> > > > > > our fleet.
> > > > >
> > > > > So it's ok to add a totally insecure kernel feature to your fleet
> > > > > instead?  You might want to reconsider that policy decision :)
> > > >
> > > > Hi Greg,
> > > >
> > > > While some risk is inherent, we believe the potential for abuse here
> > > > is limited, especially given the existing  CAP_SYS_ADMIN requirement.
> > > > But, even with root access compromised, this tool presents a smaller
> > > > attack surface than alternatives like crash/drgn. It exposes less
> > > > sensitive information, unlike crash/drgn, which could potentially
> > > > allow reading all of kernel memory.
> > >
> > > The problem here is with using dmesg for output. No security-sensitive
> > > information should go there. Even exposing raw kernel pointers is not
> > > considered safe.
> >
> > I am OK in writing the output to a debugfs file in the next version,
> > the only concern I have is that implies that dump_page() would need to
> > be basically duplicated, as it now outputs everything via printk's.
> 
> Perhaps you can refactor the code in dump_page() to use a seq_buf,
> then have dump_page() printk that seq_buf using seq_buf_do_printk(),
> and have page detective output that seq_buf to the debugfs file?
> 
> We do something very similar with memory_stat_format(). We use the
> same function to generate the memcg stats in a seq_buf, then we use
> that seq_buf to output the stats to memory.stat as well as the OOM
> log.

+1

Thanks!
Andi Kleen Nov. 20, 2024, 3:29 p.m. UTC | #18
Pasha Tatashin <pasha.tatashin@soleen.com> writes:

> Page Detective is a new kernel debugging tool that provides detailed
> information about the usage and mapping of physical memory pages.
>
> It is often known that a particular page is corrupted, but it is hard to
> extract more information about such a page from live system. Examples
> are:
>
> - Checksum failure during live migration
> - Filesystem journal failure
> - dump_page warnings on the console log
> - Unexcpected segfaults
>
> Page Detective helps to extract more information from the kernel, so it
> can be used by developers to root cause the associated problem.
>
> It operates through the Linux debugfs interface, with two files: "virt"
> and "phys".
>
> The "virt" file takes a virtual address and PID and outputs information
> about the corresponding page.
>
> The "phys" file takes a physical address and outputs information about
> that page.
>
> The output is presented via kernel log messages (can be accessed with
> dmesg), and includes information such as the page's reference count,
> mapping, flags, and memory cgroup. It also shows whether the page is
> mapped in the kernel page table, and if so, how many times.

A lot of all that is already covered in /proc/kpage{flags,cgroup,count)
Also we already have /proc/pid/pagemap to resolve virtual addresses.

At a minimum you need to discuss why these existing mechanisms are not
suitable for you and how your new one is better.

If something particular is missing perhaps the existing mechanisms
can be extended?

Outputting in the dmesg seems rather clumpsy for a production mechanism.

I personally would just use live crash or live gdb on /proc/kcore to get
extra information, although I can see that might have races.

-Andi
Pasha Tatashin Nov. 20, 2024, 4:13 p.m. UTC | #19
On Tue, Nov 19, 2024 at 2:36 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote:
> > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> > > > > > Additionally, using crash/drgn is not feasible for us at this time, it
> > > > > > requires keeping external tools on our hosts, also it requires
> > > > > > approval and a security review for each script before deployment in
> > > > > > our fleet.
> > > > >
> > > > > So it's ok to add a totally insecure kernel feature to your fleet
> > > > > instead?  You might want to reconsider that policy decision :)
> > > >
> > > > Hi Greg,
> > > >
> > > > While some risk is inherent, we believe the potential for abuse here
> > > > is limited, especially given the existing  CAP_SYS_ADMIN requirement.
> > > > But, even with root access compromised, this tool presents a smaller
> > > > attack surface than alternatives like crash/drgn. It exposes less
> > > > sensitive information, unlike crash/drgn, which could potentially
> > > > allow reading all of kernel memory.
> > >
> > > The problem here is with using dmesg for output. No security-sensitive
> > > information should go there. Even exposing raw kernel pointers is not
> > > considered safe.
> >
> > I am OK in writing the output to a debugfs file in the next version,
> > the only concern I have is that implies that dump_page() would need to
> > be basically duplicated, as it now outputs everything via printk's.
>
> Perhaps you can refactor the code in dump_page() to use a seq_buf,
> then have dump_page() printk that seq_buf using seq_buf_do_printk(),
> and have page detective output that seq_buf to the debugfs file?

Good idea, I will look into modifying it this way.

> We do something very similar with memory_stat_format(). We use the

void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
{
        /* Use static buffer, for the caller is holding oom_lock. */
        static char buf[PAGE_SIZE];
        ....
        seq_buf_init(&s, buf, sizeof(buf));
        memory_stat_format(memcg, &s);
        seq_buf_do_printk(&s, KERN_INFO);
}

This is a callosal stack allocation, given that our fleet only has 8K
stacks. :-)

> same function to generate the memcg stats in a seq_buf, then we use
> that seq_buf to output the stats to memory.stat as well as the OOM
> log.
Pasha Tatashin Nov. 20, 2024, 4:40 p.m. UTC | #20
On Wed, Nov 20, 2024 at 10:29 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> Pasha Tatashin <pasha.tatashin@soleen.com> writes:
>
> > Page Detective is a new kernel debugging tool that provides detailed
> > information about the usage and mapping of physical memory pages.
> >
> > It is often known that a particular page is corrupted, but it is hard to
> > extract more information about such a page from live system. Examples
> > are:
> >
> > - Checksum failure during live migration
> > - Filesystem journal failure
> > - dump_page warnings on the console log
> > - Unexcpected segfaults
> >
> > Page Detective helps to extract more information from the kernel, so it
> > can be used by developers to root cause the associated problem.
> >
> > It operates through the Linux debugfs interface, with two files: "virt"
> > and "phys".
> >
> > The "virt" file takes a virtual address and PID and outputs information
> > about the corresponding page.
> >
> > The "phys" file takes a physical address and outputs information about
> > that page.
> >
> > The output is presented via kernel log messages (can be accessed with
> > dmesg), and includes information such as the page's reference count,
> > mapping, flags, and memory cgroup. It also shows whether the page is
> > mapped in the kernel page table, and if so, how many times.
>
> A lot of all that is already covered in /proc/kpage{flags,cgroup,count)
> Also we already have /proc/pid/pagemap to resolve virtual addresses.
>
> At a minimum you need to discuss why these existing mechanisms are not
> suitable for you and how your new one is better.

Hi Andi,

Thanks for your feedback! I will extend the cover letter in the next
version to address your comment about comparing with the existing
methods.

We periodically receive rare reports of page corruptions detected
through various methods (journaling, live migrations, crashes, etc.)
from userland. To effectively root cause these corruptions, we need to
automatically and quickly gather comprehensive data about the affected
pages from the kernel.

This includes:

- Obtain all metadata associated with a page.
- Quickly identify all user processes mapping a given page.
- Determine if and where the kernel maps the page, which is also
important given the opportunity to remove guest memory from the kernel
direct map (as discussed at LPC'24).

We also plan to extend this functionality to include KVM and IOMMU
page tables in the future.
<pagemap> provides an interface to traversing through user page
tables, but the other information cannot be extracted using the
existing interfaces.

To ensure data integrity, even when dealing with potential memory
corruptions, Page Detective minimizes reliance on kernel data
structures. Instead, it leverages direct access to hardware structures
like page tables, providing a more reliable view of page mappings.

> If something particular is missing perhaps the existing mechanisms
> can be extended?
> Outputting in the dmesg seems rather clumpsy for a production mechanism.

I am going to change the output to a file in the next version.

> I personally would just use live crash or live gdb on /proc/kcore to get
> extra information, although I can see that might have races.

For security reasons crash is currently not available on our
production fleet machines as it potentially provides access to all
kernel memory.

Thank you,
Pasha
Yosry Ahmed Nov. 20, 2024, 5:33 p.m. UTC | #21
On Wed, Nov 20, 2024 at 8:14 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Tue, Nov 19, 2024 at 2:36 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote:
> > > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > >
> > > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> > > > > > > Additionally, using crash/drgn is not feasible for us at this time, it
> > > > > > > requires keeping external tools on our hosts, also it requires
> > > > > > > approval and a security review for each script before deployment in
> > > > > > > our fleet.
> > > > > >
> > > > > > So it's ok to add a totally insecure kernel feature to your fleet
> > > > > > instead?  You might want to reconsider that policy decision :)
> > > > >
> > > > > Hi Greg,
> > > > >
> > > > > While some risk is inherent, we believe the potential for abuse here
> > > > > is limited, especially given the existing  CAP_SYS_ADMIN requirement.
> > > > > But, even with root access compromised, this tool presents a smaller
> > > > > attack surface than alternatives like crash/drgn. It exposes less
> > > > > sensitive information, unlike crash/drgn, which could potentially
> > > > > allow reading all of kernel memory.
> > > >
> > > > The problem here is with using dmesg for output. No security-sensitive
> > > > information should go there. Even exposing raw kernel pointers is not
> > > > considered safe.
> > >
> > > I am OK in writing the output to a debugfs file in the next version,
> > > the only concern I have is that implies that dump_page() would need to
> > > be basically duplicated, as it now outputs everything via printk's.
> >
> > Perhaps you can refactor the code in dump_page() to use a seq_buf,
> > then have dump_page() printk that seq_buf using seq_buf_do_printk(),
> > and have page detective output that seq_buf to the debugfs file?
>
> Good idea, I will look into modifying it this way.
>
> > We do something very similar with memory_stat_format(). We use the
>
> void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
> {
>         /* Use static buffer, for the caller is holding oom_lock. */
>         static char buf[PAGE_SIZE];
>         ....
>         seq_buf_init(&s, buf, sizeof(buf));
>         memory_stat_format(memcg, &s);
>         seq_buf_do_printk(&s, KERN_INFO);
> }
>
> This is a callosal stack allocation, given that our fleet only has 8K
> stacks. :-)

That's a static allocation though :)
Pasha Tatashin Nov. 20, 2024, 5:46 p.m. UTC | #22
> >         /* Use static buffer, for the caller is holding oom_lock. */
> >         static char buf[PAGE_SIZE];
> >         ....
> >         seq_buf_init(&s, buf, sizeof(buf));
> >         memory_stat_format(memcg, &s);
> >         seq_buf_do_printk(&s, KERN_INFO);
> > }
> >
> > This is a callosal stack allocation, given that our fleet only has 8K
> > stacks. :-)
>
> That's a static allocation though :)

Ah right, did not notice it was static (and ignored the comment)

Pasha
Andi Kleen Nov. 20, 2024, 7:14 p.m. UTC | #23
> - Quickly identify all user processes mapping a given page.

Can be done with /proc/*/pagemap today. Maybe it's not "quick"
because it won't use the rmap chains, but is that a serious
issue?

> - Determine if and where the kernel maps the page, which is also
> important given the opportunity to remove guest memory from the kernel
> direct map (as discussed at LPC'24).

At least x86 already has a kernel page table dumper in debugfs
that can be used for this. The value of a second redundant
one seems low.

> We also plan to extend this functionality to include KVM and IOMMU
> page tables in the future.

Yes dumpers for those would likely be useful.

(at least for the case when one hand is tied behind your back
by security policies forbidding /proc/kcore access)

> <pagemap> provides an interface to traversing through user page
> tables, but the other information cannot be extracted using the
> existing interfaces.

Like what? You mean the reference counts?

/proc/k* doesn't have any reference counts, and no space
for full counts, but I suspect usually all you need to know is a
few states like (>1, 1, 0, maybe negative) which could be mapped to a
few spare kpageflags bits.

That said I thought Willy wanted to move a lot of these
elsewhere anyways with the folio revolution, so it might 
be a short lived interface anyways.

-Andi