mm/page_table_check: Fix crash on ZONE_DEVICE

Message ID 20240605212146.994486-1-peterx@redhat.com
State New

Commit Message

Peter Xu June 5, 2024, 9:21 p.m. UTC
Not all pages are subject to the pgtable check.  One example is ZONE_DEVICE
pages: they map PFNs directly, and no page_ext is allocated for them even
though they have a struct page.  See devm_memremap_pages() for reference.

With both ZONE_DEVICE and page-table-check enabled, mapping dax memory
reliably triggers a kernel BUG when the kernel tries to install pfn
mappings on the dax device:

 kernel BUG at mm/page_table_check.c:55!

Since it is perfectly legal to use set_pxx_at() on ZONE_DEVICE pages when
resolving page faults, skip all the checks in the pgtable checker if
page_ext does not exist.  This applies to ZONE_DEVICE pages and possibly
to others.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/page_table_check.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Andrew Morton June 5, 2024, 10:05 p.m. UTC | #1
On Wed,  5 Jun 2024 17:21:46 -0400 Peter Xu <peterx@redhat.com> wrote:

> Not all pages may apply to pgtable check.  One example is ZONE_DEVICE
> pages: they map PFNs directly, and they don't allocate page_ext at all even
> if there's struct page around.  One may reference devm_memremap_pages().
> 
> When both ZONE_DEVICE and page-table-check enabled, then try to map some
> dax memories, one can trigger kernel bug constantly now when the kernel was
> trying to inject some pfn maps on the dax device:
> 
>  kernel BUG at mm/page_table_check.c:55!
> 
> While it's pretty legal to use set_pxx_at() for ZONE_DEVICE pages for page
> fault resolutions, skip all the checks if page_ext doesn't even exist in
> pgtable checker, which applies to ZONE_DEVICE but maybe more.

Do we have a Reported-by: for this one?

And a Fixes?  It looks like df4e817b7108?
Dan Williams June 5, 2024, 10:54 p.m. UTC | #2
[ add Alistair ]

Peter Xu wrote:
> Not all pages may apply to pgtable check.  One example is ZONE_DEVICE
> pages: they map PFNs directly, and they don't allocate page_ext at all even
> if there's struct page around.  One may reference devm_memremap_pages().
> 
> When both ZONE_DEVICE and page-table-check enabled, then try to map some
> dax memories, one can trigger kernel bug constantly now when the kernel was
> trying to inject some pfn maps on the dax device:
> 
>  kernel BUG at mm/page_table_check.c:55!
> 
> While it's pretty legal to use set_pxx_at() for ZONE_DEVICE pages for page
> fault resolutions, skip all the checks if page_ext doesn't even exist in
> pgtable checker, which applies to ZONE_DEVICE but maybe more.

This looks correct to me, and needed in the near term. You can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

In the long term, the page_ext check may not be needed. I.e. the reason
I added Alistair was in case his work to make DAX pages behave like
typical pages for reference counting would also make them behave the
same for the presence of page_ext.
Alistair Popple June 5, 2024, 10:58 p.m. UTC | #3
Dan Williams <dan.j.williams@intel.com> writes:

> [ add Alistair ]
>
> Peter Xu wrote:
>> Not all pages may apply to pgtable check.  One example is ZONE_DEVICE
>> pages: they map PFNs directly, and they don't allocate page_ext at all even
>> if there's struct page around.  One may reference devm_memremap_pages().
>> 
>> When both ZONE_DEVICE and page-table-check enabled, then try to map some
>> dax memories, one can trigger kernel bug constantly now when the kernel was
>> trying to inject some pfn maps on the dax device:
>> 
>>  kernel BUG at mm/page_table_check.c:55!
>> 
>> While it's pretty legal to use set_pxx_at() for ZONE_DEVICE pages for page
>> fault resolutions, skip all the checks if page_ext doesn't even exist in
>> pgtable checker, which applies to ZONE_DEVICE but maybe more.
>
> This looks correct to me, and needed in the near term. You can add:
>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>
> In the long term, the page_ext check may not be needed. I.e. the reason
> I added Alistair was in case his work to make DAX pages behave like
> typical pages for reference counting would also make them behave the
> same for the presence of page_ext.

It doesn't currently. However, I did run into this bug while developing
that work, so please add:

Reviewed-by: Alistair Popple <apopple@nvidia.com>
Pasha Tatashin June 6, 2024, 12:01 a.m. UTC | #4
On Wed, Jun 5, 2024 at 5:21 PM Peter Xu <peterx@redhat.com> wrote:
>
> Not all pages may apply to pgtable check.  One example is ZONE_DEVICE
> pages: they map PFNs directly, and they don't allocate page_ext at all even
> if there's struct page around.  One may reference devm_memremap_pages().
>
> When both ZONE_DEVICE and page-table-check enabled, then try to map some
> dax memories, one can trigger kernel bug constantly now when the kernel was
> trying to inject some pfn maps on the dax device:
>
>  kernel BUG at mm/page_table_check.c:55!
>
> While it's pretty legal to use set_pxx_at() for ZONE_DEVICE pages for page
> fault resolutions, skip all the checks if page_ext doesn't even exist in
> pgtable checker, which applies to ZONE_DEVICE but maybe more.

Thank you for reporting this bug. A few comments below:

>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/page_table_check.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> index 4169576bed72..509c6ef8de40 100644
> --- a/mm/page_table_check.c
> +++ b/mm/page_table_check.c
> @@ -73,6 +73,9 @@ static void page_table_check_clear(unsigned long pfn, unsigned long pgcnt)
>         page = pfn_to_page(pfn);
>         page_ext = page_ext_get(page);
>
> +       if (!page_ext)
> +               return;

I would replace the above with the following, here and in other places:

if (!page_ext) {
	WARN_ONCE(!is_zone_device_page(page),
		  "page_ext is missing for a non-device page\n");
	return;
}

> +
>         BUG_ON(PageSlab(page));
>         anon = PageAnon(page);
>
> @@ -110,6 +113,9 @@ static void page_table_check_set(unsigned long pfn, unsigned long pgcnt,
>         page = pfn_to_page(pfn);
>         page_ext = page_ext_get(page);
>
> +       if (!page_ext)
> +               return;
> +
>         BUG_ON(PageSlab(page));
>         anon = PageAnon(page);
>
> @@ -140,7 +146,10 @@ void __page_table_check_zero(struct page *page, unsigned int order)
>         BUG_ON(PageSlab(page));
>
>         page_ext = page_ext_get(page);
> -       BUG_ON(!page_ext);
> +
> +       if (!page_ext)
> +               return;
> +
>         for (i = 0; i < (1ul << order); i++) {
>                 struct page_table_check *ptc = get_page_table_check(page_ext);
>
> --
> 2.45.0
>
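
For illustration only, the warn-once variant suggested above can be modelled in userspace. This is a rough sketch under stated assumptions, not kernel code: struct page and struct page_ext here are hypothetical stand-ins, and model_warn_once() only approximates WARN_ONCE() (the real macro keeps its "already warned" state per call site and dumps a stack trace).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel structures, for illustration only. */
struct page_ext { int map_count; };
struct page {
	bool zone_device;	/* models is_zone_device_page(page) */
	struct page_ext *ext;	/* NULL when no page_ext was allocated */
};

/* Number of warnings actually emitted, so the behaviour can be observed. */
static int warn_count;

/* Rough model of WARN_ONCE(): report at most once when cond is true. */
static void model_warn_once(bool cond, const char *msg)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warn_count++;
		fputs(msg, stderr);
	}
}

/*
 * Models the guarded entry of page_table_check_clear(): returns true if the
 * page was accounted, false if the check was skipped.
 */
static bool model_check_clear(struct page *page)
{
	if (!page->ext) {
		/* Stay silent for device pages, complain once otherwise. */
		model_warn_once(!page->zone_device,
				"page_ext is missing for a non-device page\n");
		return false;
	}

	page->ext->map_count--;
	return true;
}
```

In this model a ZONE_DEVICE page without page_ext is skipped silently, while a non-device page without page_ext produces a single warning no matter how often it is seen.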
Dan Williams June 6, 2024, 12:20 a.m. UTC | #5
Pasha Tatashin wrote:
> On Wed, Jun 5, 2024 at 5:21 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Not all pages may apply to pgtable check.  One example is ZONE_DEVICE
> > pages: they map PFNs directly, and they don't allocate page_ext at all even
> > if there's struct page around.  One may reference devm_memremap_pages().
> >
> > When both ZONE_DEVICE and page-table-check enabled, then try to map some
> > dax memories, one can trigger kernel bug constantly now when the kernel was
> > trying to inject some pfn maps on the dax device:
> >
> >  kernel BUG at mm/page_table_check.c:55!
> >
> > While it's pretty legal to use set_pxx_at() for ZONE_DEVICE pages for page
> > fault resolutions, skip all the checks if page_ext doesn't even exist in
> > pgtable checker, which applies to ZONE_DEVICE but maybe more.
> 
> Thank you for reporting this bug. A few comments below:
> 
> >
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/page_table_check.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> > index 4169576bed72..509c6ef8de40 100644
> > --- a/mm/page_table_check.c
> > +++ b/mm/page_table_check.c
> > @@ -73,6 +73,9 @@ static void page_table_check_clear(unsigned long pfn, unsigned long pgcnt)
> >         page = pfn_to_page(pfn);
> >         page_ext = page_ext_get(page);
> >
> > +       if (!page_ext)
> > +               return;
> 
> I would replace the above with the following, here and in other places:
> 
> if (!page_ext) {
>   WARN_ONCE(!is_zone_device_page(page),
>                           "page_ext is missing for a non-device page\n");
>   return;
> }

Hmm, but this function is silent for the !pfn_valid(@pfn) case, and the
old code has BUG_ON(!page_ext). So we know the caller is not being
careful about @pfn, and existing code is likely avoiding the BUG_ON().

The justification for the WARN_ONCE(), or maybe VM_WARN_ONCE(), would
be if there is a high likelihood that ongoing kernel changes introduce
more pfn_valid() but not page_ext covered pages? Is that a realistic
scenario?
Pasha Tatashin June 6, 2024, 12:24 a.m. UTC | #6
On Wed, Jun 5, 2024 at 8:20 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Pasha Tatashin wrote:
> > On Wed, Jun 5, 2024 at 5:21 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Not all pages may apply to pgtable check.  One example is ZONE_DEVICE
> > > pages: they map PFNs directly, and they don't allocate page_ext at all even
> > > if there's struct page around.  One may reference devm_memremap_pages().
> > >
> > > When both ZONE_DEVICE and page-table-check enabled, then try to map some
> > > dax memories, one can trigger kernel bug constantly now when the kernel was
> > > trying to inject some pfn maps on the dax device:
> > >
> > >  kernel BUG at mm/page_table_check.c:55!
> > >
> > > While it's pretty legal to use set_pxx_at() for ZONE_DEVICE pages for page
> > > fault resolutions, skip all the checks if page_ext doesn't even exist in
> > > pgtable checker, which applies to ZONE_DEVICE but maybe more.
> >
> > Thank you for reporting this bug. A few comments below:
> >
> > >
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  mm/page_table_check.c | 11 ++++++++++-
> > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> > > index 4169576bed72..509c6ef8de40 100644
> > > --- a/mm/page_table_check.c
> > > +++ b/mm/page_table_check.c
> > > @@ -73,6 +73,9 @@ static void page_table_check_clear(unsigned long pfn, unsigned long pgcnt)
> > >         page = pfn_to_page(pfn);
> > >         page_ext = page_ext_get(page);
> > >
> > > +       if (!page_ext)
> > > +               return;
> >
> > I would replace the above with the following, here and in other places:
> >
> > if (!page_ext) {
> >   WARN_ONCE(!is_zone_device_page(page),
> >                           "page_ext is missing for a non-device page\n");
> >   return;
> > }
>
> Hmm, but this function is silent for the !pfn_valid(@pfn) case, and the
> old code has BUG_ON(!page_ext). So we know the caller is not being
> careful about @pfn, and existing code is likely avoiding the BUG_ON().
>
> The justification for the WARN_ONCE(), or maybe VM_WARN_ONCE(), would
> be if there is a high likelihood that ongoing kernel changes introduce
> more pfn_valid() but not page_ext covered pages? Is that a realistic
> scenario?

Good point, it is unlikely we will have scenarios without page_ext.
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Peter Xu June 6, 2024, 1:14 p.m. UTC | #7
On Wed, Jun 05, 2024 at 03:05:43PM -0700, Andrew Morton wrote:
> On Wed,  5 Jun 2024 17:21:46 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > Not all pages may apply to pgtable check.  One example is ZONE_DEVICE
> > pages: they map PFNs directly, and they don't allocate page_ext at all even
> > if there's struct page around.  One may reference devm_memremap_pages().
> > 
> > When both ZONE_DEVICE and page-table-check enabled, then try to map some
> > dax memories, one can trigger kernel bug constantly now when the kernel was
> > trying to inject some pfn maps on the dax device:
> > 
> >  kernel BUG at mm/page_table_check.c:55!
> > 
> > While it's pretty legal to use set_pxx_at() for ZONE_DEVICE pages for page
> > fault resolutions, skip all the checks if page_ext doesn't even exist in
> > pgtable checker, which applies to ZONE_DEVICE but maybe more.
> 
> Do we have a Reported-by: for this one?

Nope, I just hit it when I started looking at the dax issues.

> 
> And a Fixes?  It looks like df4e817b7108?

Yes, that commit should be the right one.

Thanks,
Patch

diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 4169576bed72..509c6ef8de40 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -73,6 +73,9 @@  static void page_table_check_clear(unsigned long pfn, unsigned long pgcnt)
 	page = pfn_to_page(pfn);
 	page_ext = page_ext_get(page);
 
+	if (!page_ext)
+		return;
+
 	BUG_ON(PageSlab(page));
 	anon = PageAnon(page);
 
@@ -110,6 +113,9 @@  static void page_table_check_set(unsigned long pfn, unsigned long pgcnt,
 	page = pfn_to_page(pfn);
 	page_ext = page_ext_get(page);
 
+	if (!page_ext)
+		return;
+
 	BUG_ON(PageSlab(page));
 	anon = PageAnon(page);
 
@@ -140,7 +146,10 @@  void __page_table_check_zero(struct page *page, unsigned int order)
 	BUG_ON(PageSlab(page));
 
 	page_ext = page_ext_get(page);
-	BUG_ON(!page_ext);
+
+	if (!page_ext)
+		return;
+
 	for (i = 0; i < (1ul << order); i++) {
 		struct page_table_check *ptc = get_page_table_check(page_ext);
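
The effect of the three hunks above can be sketched in a small userspace model. It is a minimal illustration under assumptions, not the kernel implementation: the structures and the model_* helpers are hypothetical stand-ins, and only the NULL page_ext guard mirrors the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct page_ext {
	int anon_map_count;
	int file_map_count;
};

struct page {
	struct page_ext *ext;	/* NULL when no page_ext was allocated */
};

/* Models page_ext_get(): may return NULL, e.g. for ZONE_DEVICE pages. */
static struct page_ext *model_page_ext_get(struct page *page)
{
	return page->ext;
}

/*
 * Models the guarded entry of page_table_check_set(): returns true if the
 * mapping was accounted, false if the check was skipped.
 */
static bool model_check_set(struct page *page)
{
	struct page_ext *ext = model_page_ext_get(page);

	/* Nothing to track without page_ext: skip instead of crashing. */
	if (!ext)
		return false;

	ext->file_map_count++;
	return true;
}
```

A page with a page_ext is accounted as before, while a page without one is simply skipped, which is the behaviour the patch gives ZONE_DEVICE pages.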