diff mbox series

[RFC] resource: Fix CXL node not populated issue

Message ID 20241202111941.2636613-1-raghavendra.kt@amd.com
State New
Headers show
Series [RFC] resource: Fix CXL node not populated issue | expand

Commit Message

Raghavendra K T Dec. 2, 2024, 11:19 a.m. UTC
Before:
~]$ numastat -m
...
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               128096.18       128838.48       256934.65

After:
$ numastat -m
.....
                          Node 0          Node 1          Node 2           Total
                 --------------- --------------- --------------- ---------------
MemTotal               128054.16       128880.51       129024.00       385958.67

Current patch reverts the effect of first commit where the issue is seen.

git bisect had led to below commit
Fixes: b4afe4183ec7 ("resource: fix region_intersects() vs add_memory_driver_managed()")
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: <ilpo.jarvinen@linux.intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Fontenot Nathan <Nathan.Fontenot@amd.com>
Cc: Wei Huang <wei.huang2@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 kernel/resource.c | 51 ++++++++---------------------------------------
 1 file changed, 8 insertions(+), 43 deletions(-)

Note: Posting the fix that works for me. But looks like an exact fix
might be different. Since I am not much familiar with the below code,
I will have to go back and look into more details.

Please let me know if more detail is needed. sorry if I had missed something
obvious.

Git bisec log looked like this:

# good: [f7feea289f9ae3a8fb56e9daa3832949bf742c53] mm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing meminfo
git bisect good f7feea289f9ae3a8fb56e9daa3832949bf742c53
# bad: [9852d85ec9d492ebef56dc5f229416c925758edc] Linux 6.12-rc1
git bisect bad 9852d85ec9d492ebef56dc5f229416c925758edc
# good: [a65b3c3ed49a3b8068c002e98c90f8594927ff25] Merge tag 'hid-for-linus-2024091602' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
git bisect good a65b3c3ed49a3b8068c002e98c90f8594927ff25
# bad: [486fd58af7ac1098b68370b1d4d9f94a2a1c7124] zram: don't free statically defined names
git bisect bad 486fd58af7ac1098b68370b1d4d9f94a2a1c7124
# bad: [7856a565416e0cf091f825b0e25c7a1b7abb650e] Merge tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 7856a565416e0cf091f825b0e25c7a1b7abb650e
# bad: [1868f9d0260e9afaf7c6436d14923ae12eaea465] Merge tag 'for-linux-6.12-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
git bisect bad 1868f9d0260e9afaf7c6436d14923ae12eaea465
# good: [e55ef65510a401862b902dc979441ea10ae25c61] Merge tag 'amd-drm-next-6.12-2024-08-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good e55ef65510a401862b902dc979441ea10ae25c61
# good: [f1a4dceeb2bd4b4478e4f0c77dac55569d153fb3] drm/xe: Fix missing conversion to xe_display_pm_runtime_resume
git bisect good f1a4dceeb2bd4b4478e4f0c77dac55569d153fb3
# bad: [839c4f596f898edc424070dc8b517381572f8502] Merge tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 839c4f596f898edc424070dc8b517381572f8502
# good: [726e2d0cf2bbc14e3bf38491cddda1a56fe18663] Merge tag 'dma-mapping-6.12-2024-09-19' of git://git.infradead.org/users/hch/dma-mapping
git bisect good 726e2d0cf2bbc14e3bf38491cddda1a56fe18663
# good: [992f9884626a0e6ab73a98ca4eb166d17675cae6] Merge patch series "NCR5380: Bug fixes and other improvements"
git bisect good 992f9884626a0e6ab73a98ca4eb166d17675cae6
# good: [adedd0f46c923f8d63aeb42d504c82431febed31] scsi: bnx2i: Remove unused declarations
git bisect good adedd0f46c923f8d63aeb42d504c82431febed31
# good: [a1d1eb2f57501b2e7e2076ce89b3f3a666ddbfdd] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect good a1d1eb2f57501b2e7e2076ce89b3f3a666ddbfdd
# good: [fb497d6db7c19c797cbd694b52d1af87c4eebcc6] mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu read lock
git bisect good fb497d6db7c19c797cbd694b52d1af87c4eebcc6
# bad: [2a058ab3286d6475b2082b90c2d2182d2fea4b39] mm: change vmf_anon_prepare() to __vmf_anon_prepare()
git bisect bad 2a058ab3286d6475b2082b90c2d2182d2fea4b39
# bad: [b4afe4183ec77f230851ea139d91e5cf2644c68b] resource: fix region_intersects() vs add_memory_driver_managed()
git bisect bad b4afe4183ec77f230851ea139d91e5cf2644c68b
# good: [6040f650c56862a4ac40b00c37ef6ab1ddfcebb5] zsmalloc: use unique zsmalloc caches names
git bisect good 6040f650c56862a4ac40b00c37ef6ab1ddfcebb5
# first bad commit: [b4afe4183ec77f230851ea139d91e5cf2644c68b] resource: fix region_intersects() vs add_memory_driver_managed()
#
#
#  git bisect good
b4afe4183ec77f230851ea139d91e5cf2644c68b is the first bad commit
commit b4afe4183ec77f230851ea139d91e5cf2644c68b


base-commit: e70140ba0d2b1a30467d4af6bcfe761327b9ec95

Comments

Andy Shevchenko Dec. 2, 2024, 3:34 p.m. UTC | #1
On Mon, Dec 02, 2024 at 11:19:41AM +0000, Raghavendra K T wrote:
> Before:
> ~]$ numastat -m
> ...
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128096.18       128838.48       256934.65
> 
> After:
> $ numastat -m
> .....
>                           Node 0          Node 1          Node 2           Total
>                  --------------- --------------- --------------- ---------------
> MemTotal               128054.16       128880.51       129024.00       385958.67
> 
> Current patch reverts the effect of first commit where the issue is seen.
> 
> git bisect had led to below commit

Missed blank line here.

> Fixes: b4afe4183ec7 ("resource: fix region_intersects() vs add_memory_driver_managed()")

> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alison Schofield <alison.schofield@intel.com>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: <ilpo.jarvinen@linux.intel.com>
> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> Cc: Fontenot Nathan <Nathan.Fontenot@amd.com>
> Cc: Wei Huang <wei.huang2@amd.com>

Isn't it too many to be included in the commit message? Note you may use the
same list with --cc in the command line with almost the same effect (almost --
no noise in the commit message).

> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---

...

> +		bool is_type = (((p->flags & flags) == flags) &&
> +				((desc == IORES_DESC_NONE) ||
> +				 (desc == p->desc)));
> +
> +		if (resource_overlaps(p, &res))
> +			is_type ? type++ : other++;

Instead (if you will end up with this approach) please still use is_type_match().
Raghavendra K T Dec. 3, 2024, 6:01 a.m. UTC | #2
++ Huang new address
Thank you for looking into patch Andy.
On 12/2/2024 9:04 PM, Andy Shevchenko wrote:
> On Mon, Dec 02, 2024 at 11:19:41AM +0000, Raghavendra K T wrote:
>> Before:
>> ~]$ numastat -m
>> ...
>>                            Node 0          Node 1           Total
>>                   --------------- --------------- ---------------
>> MemTotal               128096.18       128838.48       256934.65
>>
>> After:
>> $ numastat -m
>> .....
>>                            Node 0          Node 1          Node 2           Total
>>                   --------------- --------------- --------------- ---------------
>> MemTotal               128054.16       128880.51       129024.00       385958.67
>>
>> Current patch reverts the effect of first commit where the issue is seen.
>>
>> git bisect had led to below commit
> 
> Missed blank line here.
> 

Will add.

>> Fixes: b4afe4183ec7 ("resource: fix region_intersects() vs add_memory_driver_managed()")
> 
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Alison Schofield <alison.schofield@intel.com>
>> Cc: Vishal Verma <vishal.l.verma@intel.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: Baoquan He <bhe@redhat.com>
>> Cc: <ilpo.jarvinen@linux.intel.com>
>> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
>> Cc: Fontenot Nathan <Nathan.Fontenot@amd.com>
>> Cc: Wei Huang <wei.huang2@amd.com>
> 
> Isn't it too many to be included in the commit message? Note you may use the
> same list with --cc in the command line with almost the same effect (almost --
> no noise in the commit message).

Agree. Do not want to add in the final commit message. I took
the names from the bisected message. will Cc directly next time.

> 
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
> 
> ...
> 
>> +		bool is_type = (((p->flags & flags) == flags) &&
>> +				((desc == IORES_DESC_NONE) ||
>> +				 (desc == p->desc)));
>> +
>> +		if (resource_overlaps(p, &res))
>> +			is_type ? type++ : other++;
> 
> Instead (if you will end up with this approach) please still use is_type_match().
> 

Sure will do in next iteration unless someone comes with expected fix.

I bumped into this because I was not able to evaluate my patches
on 6.13 with CXL card. Thought of reporting what worked for me...

Link: : 
https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/ 


I may need some time to go back and understand some history and code.

Thanks and Regards
- Raghu
Huang, Ying Dec. 3, 2024, 6:26 a.m. UTC | #3
Raghavendra K T <raghavendra.kt@amd.com> writes:

> Before:
> ~]$ numastat -m
> ...
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128096.18       128838.48       256934.65
>
> After:
> $ numastat -m
> .....
>                           Node 0          Node 1          Node 2           Total
>                  --------------- --------------- --------------- ---------------
> MemTotal               128054.16       128880.51       129024.00       385958.67
>
> Current patch reverts the effect of first commit where the issue is seen.

This doesn't root cause the issue.  Please trace the code path
(e.g. return value of region_intersects(), in which code path) at least.
Without these information, we cannot come up with a proper fix to your issue.

> git bisect had led to below commit
> Fixes: b4afe4183ec7 ("resource: fix region_intersects() vs add_memory_driver_managed()")

This breaks you case, sorry about that.  But this also fixed a real bug
too.  So, it's not appropriate just to revert it blindly.

[snip]

---
Best Regards,
Huang, Ying
Andy Shevchenko Dec. 3, 2024, 1:41 p.m. UTC | #4
On Tue, Dec 03, 2024 at 02:26:52PM +0800, Huang, Ying wrote:
> Raghavendra K T <raghavendra.kt@amd.com> writes:

...

> > git bisect had led to below commit
> > Fixes: b4afe4183ec7 ("resource: fix region_intersects() vs add_memory_driver_managed()")
> 
> This breaks you case, sorry about that.  But this also fixed a real bug
> too.  So, it's not appropriate just to revert it blindly.

Linus was clear about this recently. Even if it fixes a bug, regression is
still regression and might (*) lead to a revert.
https://lwn.net/Articles/990599/

(*) in general fixes are better than reverts, but depends on the timing in
    the release cycle the revert may be the only option.
Andy Shevchenko Dec. 3, 2024, 10:07 p.m. UTC | #5
On Tue, Dec 03, 2024 at 11:56:35AM -0800, Gregory Price wrote:
> On Tue, Dec 3, 2024, 6:09 AM Andy Shevchenko <
> andriy.shevchenko@linux.intel.com> wrote:
> > On Tue, Dec 03, 2024 at 02:26:52PM +0800, Huang, Ying wrote:
> > > Raghavendra K T <raghavendra.kt@amd.com> writes:

...

> > > > git bisect had led to below commit
> > > > Fixes: b4afe4183ec7 ("resource: fix region_intersects() vs
> > add_memory_driver_managed()")
> > >
> > > This breaks you case, sorry about that.  But this also fixed a real bug
> > > too.  So, it's not appropriate just to revert it blindly.
> >
> > Linus was clear about this recently. Even if it fixes a bug, regression is
> > still regression
> 
> In my experience, region intersection issues often stem from bad CEDT
> entries.

Yeah, sometimes fixes may reveal the hidden issues.

> Have you dumped the ACPI tables on your system and validated the content?

I believe this is the Q to the original reporter.

> It would be a shame to revert a patch because the ACPI tables were wrong.

I agree with you and as I said in my reply "fixes are better than reverts".
I.o.w. I'm on the same page with Linus on this.
Huang, Ying Dec. 4, 2024, 2:07 a.m. UTC | #6
Hi, Andy,

Andy Shevchenko <andriy.shevchenko@linux.intel.com> writes:

> On Tue, Dec 03, 2024 at 02:26:52PM +0800, Huang, Ying wrote:
>> Raghavendra K T <raghavendra.kt@amd.com> writes:
>
> ...
>
>> > git bisect had led to below commit
>> > Fixes: b4afe4183ec7 ("resource: fix region_intersects() vs add_memory_driver_managed()")
>> 
>> This breaks you case, sorry about that.  But this also fixed a real bug
>> too.  So, it's not appropriate just to revert it blindly.
>
> Linus was clear about this recently. Even if it fixes a bug, regression is
> still regression and might (*) lead to a revert.
> https://lwn.net/Articles/990599/
>
> (*) in general fixes are better than reverts, but depends on the timing in
>     the release cycle the revert may be the only option.

I don't think that the timing is so tight that we should not work on
proper fix firstly.  I'm trying to work with the reporter on this.

BTW, the commit b4afe4183ec7 ("resource: fix region_intersects() vs
add_memory_driver_managed()") fixed a security related bug.  The bug
weakened the protection to prevent users read/write system memory via
/dev/mem.  So, IMO, we need to be more careful about this.

---
Best Regards,
Huang, Ying
Andy Shevchenko Dec. 4, 2024, 3:01 a.m. UTC | #7
On Wed, Dec 04, 2024 at 10:07:16AM +0800, Huang, Ying wrote:
> Andy Shevchenko <andriy.shevchenko@linux.intel.com> writes:
> > On Tue, Dec 03, 2024 at 02:26:52PM +0800, Huang, Ying wrote:
> >> Raghavendra K T <raghavendra.kt@amd.com> writes:

...

> >> > git bisect had led to below commit
> >> > Fixes: b4afe4183ec7 ("resource: fix region_intersects() vs add_memory_driver_managed()")
> >> 
> >> This breaks you case, sorry about that.  But this also fixed a real bug
> >> too.  So, it's not appropriate just to revert it blindly.
> >
> > Linus was clear about this recently. Even if it fixes a bug, regression is
> > still regression and might (*) lead to a revert.
> > https://lwn.net/Articles/990599/
> >
> > (*) in general fixes are better than reverts, but depends on the timing in
> >     the release cycle the revert may be the only option.
> 
> I don't think that the timing is so tight that we should not work on
> proper fix firstly.  I'm trying to work with the reporter on this.

I agree on this, please do.

> BTW, the commit b4afe4183ec7 ("resource: fix region_intersects() vs
> add_memory_driver_managed()") fixed a security related bug.  The bug
> weakened the protection to prevent users read/write system memory via
> /dev/mem.  So, IMO, we need to be more careful about this.

My point was that the regression is obvious and it needs to be fixed.
That's all. Revert is a last resort in this sense.
Dan Williams Dec. 4, 2024, 3:55 a.m. UTC | #8
[ add regressions@lists.linux.dev ]

Next time make the subject of the patch:

   Revert "resource: fix region_intersects() vs add_memory_driver_managed()"

...to make it clear that this is a revert, not a fix.

The revert should be applied if a fix does not materialize in the next few weeks.

Raghavendra K T wrote:
> Before:
> ~]$ numastat -m
> ...
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128096.18       128838.48       256934.65
> 
> After:
> $ numastat -m
> .....
>                           Node 0          Node 1          Node 2           Total
>                  --------------- --------------- --------------- ---------------
> MemTotal               128054.16       128880.51       129024.00       385958.67
> 
> Current patch reverts the effect of first commit where the issue is seen.

Might you be able to dig a bit further into the details like memory map
for this platform and ACPI SRAT tables? A dmesg comparison of the good
and bad cases would be useful (those can be shared via a github gist).
Even better would be some debug instrumentation to identify which call
to __region_intersects() started behaving differently resulting in a
whole node disappearing.

In terms of the urgency of fixing this it would also help to know how
prevalent the system this was found on is in the wild.
Raghavendra K T Dec. 4, 2024, 4:41 a.m. UTC | #9
On 12/4/2024 9:25 AM, Dan Williams wrote:
> [ add regressions@lists.linux.dev ]
> 
> Next time make the subject of the patch:
> 
>     Revert "resource: fix region_intersects() vs add_memory_driver_managed()"
> 
> ...to make it clear that this is a revert, not a fix.
> 
> The revert should be applied if a fix does not materialize in the next few weeks.
> 

Agreed regarding fix.
one thing to note is it is not exact revert.

> Raghavendra K T wrote:
>> Before:
>> ~]$ numastat -m
>> ...
>>                            Node 0          Node 1           Total
>>                   --------------- --------------- ---------------
>> MemTotal               128096.18       128838.48       256934.65
>>
>> After:
>> $ numastat -m
>> .....
>>                            Node 0          Node 1          Node 2           Total
>>                   --------------- --------------- --------------- ---------------
>> MemTotal               128054.16       128880.51       129024.00       385958.67
>>
>> Current patch reverts the effect of first commit where the issue is seen.
> 
> Might you be able to dig a bit further into the details like memory map
> for this platform and ACPI SRAT tables? A dmesg comparison of the good
> and bad cases would be useful (those can be shared via a github gist).
> Even better would be some debug instrumentation to identify which call
> to __region_intersects() started behaving differently resulting in a
> whole node disappearing.
> 
> In terms of the urgency of fixing this it would also help to know how
> prevalent the system this was found on is in the wild.

I have compared dmesg, proc/iomem of both success and fail case.

A. dmesg:

1. Address ranges is different
2. extra message about printing Demotion target

Fallback order for Node 0: 0 1 2
Fallback order for Node 1: 1 0 2
Fallback order for Node 2: 2 0 1
Built 3 zonelists, mobility grouping on.  Total pages: 66145521
Policy zone: Normal
....
Demotion targets for Node 0: preferred: 2, fallback: 2
Demotion targets for Node 1: preferred: 2, fallback: 2
Demotion targets for Node 2: null

B. /proc/iomem

$ vimdiff success fail

  4050000000-604fffffff : Soft Reserved 
   |  164 4050000000-604fffffff : Soft Reserved
   165   4050000000-604fffffff : CXL Window 0 
        |  165   4050000000-604fffffff : CXL Window 0
   166     4080000000-5fffffffff : dax1.0 
        | 
------------------------------------------------------------------------
   167       4080000000-5fffffffff : System RAM (kmem) 
        | 
--------------------------------------------------------------------


I will get more detail from ACPI SRAT table etc..

Thanks and Regards
  Raghu
Raghavendra K T Dec. 4, 2024, 4:44 a.m. UTC | #10
On 12/4/2024 8:31 AM, Andy Shevchenko wrote:
> On Wed, Dec 04, 2024 at 10:07:16AM +0800, Huang, Ying wrote:
>> Andy Shevchenko <andriy.shevchenko@linux.intel.com> writes:
>>> On Tue, Dec 03, 2024 at 02:26:52PM +0800, Huang, Ying wrote:
>>>> Raghavendra K T <raghavendra.kt@amd.com> writes:
> 
> ...
> 
>>>>> git bisect had led to below commit
>>>>> Fixes: b4afe4183ec7 ("resource: fix region_intersects() vs add_memory_driver_managed()")
>>>>
>>>> This breaks you case, sorry about that.  But this also fixed a real bug
>>>> too.  So, it's not appropriate just to revert it blindly.
>>>
>>> Linus was clear about this recently. Even if it fixes a bug, regression is
>>> still regression and might (*) lead to a revert.
>>> https://lwn.net/Articles/990599/
>>>
>>> (*) in general fixes are better than reverts, but depends on the timing in
>>>      the release cycle the revert may be the only option.
>>
>> I don't think that the timing is so tight that we should not work on
>> proper fix firstly.  I'm trying to work with the reporter on this.
> 
> I agree on this, please do.
> 
>> BTW, the commit b4afe4183ec7 ("resource: fix region_intersects() vs
>> add_memory_driver_managed()") fixed a security related bug.  The bug
>> weakened the protection to prevent users read/write system memory via
>> /dev/mem.  So, IMO, we need to be more careful about this.
> 
> My point was that the regression is obvious and it needs to be fixed.
> That's all. Revert is a last resort in this sense.
> 

I agree in general to both of your comment. (i.e. since this bisected
commit had security fix, we shall try to get better fix than a close to
revert).

  I am trying to work on this, but it is a bit slow on my side.


Thanks and Regards
- Raghu
Raghavendra K T Dec. 5, 2024, 5:45 a.m. UTC | #11
On 12/4/2024 10:14 AM, Raghavendra K T wrote:
> 
> 
>>>>
>>>> (*) in general fixes are better than reverts, but depends on the 
>>>> timing in
>>>>      the release cycle the revert may be the only option.
>>>
>>> I don't think that the timing is so tight that we should not work on
>>> proper fix firstly.  I'm trying to work with the reporter on this.
>>
>> I agree on this, please do.
>>
>>> BTW, the commit b4afe4183ec7 ("resource: fix region_intersects() vs
>>> add_memory_driver_managed()") fixed a security related bug.  The bug
>>> weakened the protection to prevent users read/write system memory via
>>> /dev/mem.  So, IMO, we need to be more careful about this.
>>
>> My point was that the regression is obvious and it needs to be fixed.
>> That's all. Revert is a last resort in this sense.
>>
> 
> I agree in general to both of your comment. (i.e. since this bisected
> commit had security fix, we shall try to get better fix than a close to
> revert).
> 
>   I am trying to work on this, but it is a bit slow on my side.
> 

I will try to get a fix that retains old bugfix and works well for me too.
Since it is reproduced only on this shared system, I will get hold of
this system next week, and work towards a fix.

Thanks and Regards
- Raghu
Dan Williams Dec. 6, 2024, 7:50 a.m. UTC | #12
Raghavendra K T wrote:
> 
> 
> On 12/4/2024 9:25 AM, Dan Williams wrote:
> > [ add regressions@lists.linux.dev ]
> > 
> > Next time make the subject of the patch:
> > 
> >     Revert "resource: fix region_intersects() vs add_memory_driver_managed()"
> > 
> > ...to make it clear that this is a revert, not a fix.
> > 
> > The revert should be applied if a fix does not materialize in the next few weeks.
> > 
> 
> Agreed regarding fix.
> one thing to note is it is not exact revert.
> 
> > Raghavendra K T wrote:
> >> Before:
> >> ~]$ numastat -m
> >> ...
> >>                            Node 0          Node 1           Total
> >>                   --------------- --------------- ---------------
> >> MemTotal               128096.18       128838.48       256934.65
> >>
> >> After:
> >> $ numastat -m
> >> .....
> >>                            Node 0          Node 1          Node 2           Total
> >>                   --------------- --------------- --------------- ---------------
> >> MemTotal               128054.16       128880.51       129024.00       385958.67
> >>
> >> Current patch reverts the effect of first commit where the issue is seen.
> > 
> > Might you be able to dig a bit further into the details like memory map
> > for this platform and ACPI SRAT tables? A dmesg comparison of the good
> > and bad cases would be useful (those can be shared via a github gist).
> > Even better would be some debug instrumentation to identify which call
> > to __region_intersects() started behaving differently resulting in a
> > whole node disappearing.
> > 
> > In terms of the urgency of fixing this it would also help to know how
> > prevalent the system this was found on is in the wild.
> 
> I have compared dmesg, proc/iomem of both success and fail case.
> 
> A. dmesg:
> 
> 1. Address ranges is different
> 2. extra message about printing Demotion target
> 
> Fallback order for Node 0: 0 1 2
> Fallback order for Node 1: 1 0 2
> Fallback order for Node 2: 2 0 1
> Built 3 zonelists, mobility grouping on.  Total pages: 66145521
> Policy zone: Normal
> ....
> Demotion targets for Node 0: preferred: 2, fallback: 2
> Demotion targets for Node 1: preferred: 2, fallback: 2
> Demotion targets for Node 2: null
> 
> B. /proc/iomem
> 
> $ vimdiff success fail
> 
>   4050000000-604fffffff : Soft Reserved 
>    |  164 4050000000-604fffffff : Soft Reserved
>    165   4050000000-604fffffff : CXL Window 0 
>         |  165   4050000000-604fffffff : CXL Window 0
>    166     4080000000-5fffffffff : dax1.0 
>         | 
> ------------------------------------------------------------------------
>    167       4080000000-5fffffffff : System RAM (kmem) 
>         | 
> --------------------------------------------------------------------

My eyes only know how to read unified diff (diff -u) format. Is this
saying that in the failure case the System RAM range for dax1.0 is
missing?
Raghavendra K T Dec. 6, 2024, 8:28 a.m. UTC | #13
On 12/6/2024 1:20 PM, Dan Williams wrote:
> Raghavendra K T wrote:
>>
[...]
>> B. /proc/iomem
>>
>> $ vimdiff success fail
>>
>>    4050000000-604fffffff : Soft Reserved
>>     |  164 4050000000-604fffffff : Soft Reserved
>>     165   4050000000-604fffffff : CXL Window 0
>>          |  165   4050000000-604fffffff : CXL Window 0
>>     166     4080000000-5fffffffff : dax1.0
>>          |
>> ------------------------------------------------------------------------
>>     167       4080000000-5fffffffff : System RAM (kmem)
>>          |
>> --------------------------------------------------------------------
> 
> My eyes only know how to read unified diff (diff -u) format. Is this
> saying that in the failure case the System RAM range for dax1.0 is
> missing?


Sorry for that formatting. I realized later.
Yes. dax1.0 missing in the failure case.

Looks like problem is that,

4050000000-604fffffff : Soft Reserved
   4050000000-604fffffff : CXL Window 0
      4080000000-5fffffffff : dax1.0
         4080000000-5fffffffff : System RAM (kmem)

this use case where,

|A<------------->D|  CXL window
|..|B<-->C|.......|    kmem

B->C range that falls within A->D as in above somehow
not covered after the patch.

Thanks and Regards
- Raghu
Dan Williams Dec. 11, 2024, 4:44 a.m. UTC | #14
Raghavendra K T wrote:
> 
> 
> On 12/6/2024 1:20 PM, Dan Williams wrote:
> > Raghavendra K T wrote:
> >>
> [...]
> >> B. /proc/iomem
> >>
> >> $ vimdiff success fail
> >>
> >>    4050000000-604fffffff : Soft Reserved
> >>     |  164 4050000000-604fffffff : Soft Reserved
> >>     165   4050000000-604fffffff : CXL Window 0
> >>          |  165   4050000000-604fffffff : CXL Window 0
> >>     166     4080000000-5fffffffff : dax1.0
> >>          |
> >> ------------------------------------------------------------------------
> >>     167       4080000000-5fffffffff : System RAM (kmem)
> >>          |
> >> --------------------------------------------------------------------
> > 
> > My eyes only know how to read unified diff (diff -u) format. Is this
> > saying that in the failure case the System RAM range for dax1.0 is
> > missing?
> 
> 
> Sorry for that formatting. I realized later.
> Yes. dax1.0 missing in the failure case.
> 
> Looks like problem is that,
> 
> 4050000000-604fffffff : Soft Reserved
>    4050000000-604fffffff : CXL Window 0
>       4080000000-5fffffffff : dax1.0
>          4080000000-5fffffffff : System RAM (kmem)
> 
> this use case where,
> 
> |A<------------->D|  CXL window
> |..|B<-->C|.......|    kmem
> 
> B->C range that falls within A->D as in above somehow
> not covered after the patch.

I was able to reproduce a similar set of conditions with the cxl_test
environment:

f010000000-f04fffffff : Soft Reserved
  f010000000-f04fffffff : CXL Window 0
    f020000000-f03fffffff : region3
      f020000000-f03fffffff : dax3.0
        f020000000-f03fffffff : System RAM (kmem)


...but that did not result in the bug. So there are some other details
missing. Can you proceed with providing the dmesg from the good and the
bad cases?

gist.github.com is useful for this.
Bharata B Rao Dec. 11, 2024, 6:40 a.m. UTC | #15
On 11-Dec-24 10:14 AM, Dan Williams wrote:
> 
> I was able to reproduce a similar set of conditions with the cxl_test
> environment:
> 
> f010000000-f04fffffff : Soft Reserved
>    f010000000-f04fffffff : CXL Window 0
>      f020000000-f03fffffff : region3
>        f020000000-f03fffffff : dax3.0
>          f020000000-f03fffffff : System RAM (kmem)
> 
> 
> ...but that did not result in the bug. So there are some other details
> missing. Can you proceed with providing the dmesg from the good and the
> bad cases?
> 
> gist.github.com is useful for this.

Sorry for the delay in providing the data. The system wasn't available 
for a while. Now I have put the good(6.11.0-rc6) and bad(6.13.0-rc1) 
dmesg and iomem logs at

https://gist.github.com/bharata/4a57db11e044fd1d313035f3dd5f763b

What I see is that in the bad case, we hit the below shown dev_dbg 
message (not seen in the logs that are provided)

static int hmem_register_device(struct device *host, int target_nid,
                                 const struct resource *res)
{
         struct platform_device *pdev;
         struct memregion_info info;
         long id;
         int rc;

         if (IS_ENABLED(CONFIG_CXL_REGION) &&
             region_intersects(res->start, resource_size(res), 
IORESOURCE_MEM,
                               IORES_DESC_CXL) != REGION_DISJOINT) {
                 dev_dbg(host, "deferring range to CXL: %pr\n", res); <--
                 return 0;
         }

With this, it appears that dev_dax_kmem_probe() won't happen for the CXL 
range and hence CXL memory doesn't get detected.

Let us know if you need more debug data.

Regards,
Bharata.
Dan Williams Dec. 11, 2024, 8:38 a.m. UTC | #16
Bharata B Rao wrote:
> On 11-Dec-24 10:14 AM, Dan Williams wrote:
> > 
> > I was able to reproduce a similar set of conditions with the cxl_test
> > environment:
> > 
> > f010000000-f04fffffff : Soft Reserved
> >    f010000000-f04fffffff : CXL Window 0
> >      f020000000-f03fffffff : region3
> >        f020000000-f03fffffff : dax3.0
> >          f020000000-f03fffffff : System RAM (kmem)
> > 
> > 
> > ...but that did not result in the bug. So there are some other details
> > missing. Can you proceed with providing the dmesg from the good and the
> > bad cases?
> > 
> > gist.github.com is useful for this.
> 
> Sorry for the delay in providing the data. The system wasn't available 
> for a while. Now I have put the good(6.11.0-rc6) and bad(6.13.0-rc1) 
> dmesg and iomem logs at
> 
> https://gist.github.com/bharata/4a57db11e044fd1d313035f3dd5f763b
> 
> What I see is that in the bad case, we hit the below shown dev_dbg 
> message (not seen in the logs that are provided)
> 
> static int hmem_register_device(struct device *host, int target_nid,
>                                  const struct resource *res)
> {
>          struct platform_device *pdev;
>          struct memregion_info info;
>          long id;
>          int rc;
> 
>          if (IS_ENABLED(CONFIG_CXL_REGION) &&
>              region_intersects(res->start, resource_size(res), 
> IORESOURCE_MEM,
>                                IORES_DESC_CXL) != REGION_DISJOINT) {
>                  dev_dbg(host, "deferring range to CXL: %pr\n", res); <--
>                  return 0;
>          }
> 
> With this, it appears that dev_dax_kmem_probe() won't happen for the CXL 
> range and hence CXL memory doesn't get detected.
> 
> Let us know if you need more debug data.

Ah, thanks for that!

So, it turns out Ying's patch is working as advertised. It is traversing
past the top-level entry of the the iomem_resource topology to discover that a
Soft Reserved range is described by CXL. Then the expectation is that
the CXL subsystem parses the topology and registers a dax device.

I missed that detail in my repro because I was not testing the HMEM
handoff.

Now, the dmesg shows that the CXL subsystem gives up early on the
CXL devices as they do not appear to be meeting the expecations of a "CXL
Memory Device" as described by the "PCI Header - Class Code Register
(Offset 09h)" of the CXL spec.

Specifically these messages:

    [    4.449072] cxl_pci 0000:9f:00.0: registers not found: status mbox memdev 

I think this situation is increasingly showing that the pace of
non-standard quirks being deployed is higher than CXL subsystem is able
to keep pace.

The immediate workaround to this problem that Linux discovered is to
disable the cxl_acpi driver. Can you confirm that preventing the
cxl_acpi driver from loading restores that missing node?

Longer term the urgency of Nathan's patch [1] needs to be escalated.

[1]: http://lore.kernel.org/20241202155542.22111-1-nathan.fontenot@amd.com
Bharata B Rao Dec. 11, 2024, 4:47 p.m. UTC | #17
On 11-Dec-24 2:08 PM, Dan Williams wrote:
> Ah, thanks for that!
> 
> So, it turns out Ying's patch is working as advertised. It is traversing
> past the top-level entry of the the iomem_resource topology to discover that a
> Soft Reserved range is described by CXL. Then the expectation is that
> the CXL subsystem parses the topology and registers a dax device.
> 
> I missed that detail in my repro because I was not testing the HMEM
> handoff.
> 
> Now, the dmesg shows that the CXL subsystem gives up early on the
> CXL devices as they do not appear to be meeting the expecations of a "CXL
> Memory Device" as described by the "PCI Header - Class Code Register
> (Offset 09h)" of the CXL spec.
> 
> Specifically these messages:
> 
>      [    4.449072] cxl_pci 0000:9f:00.0: registers not found: status mbox memdev
> 
> I think this situation is increasingly showing that the pace of
> non-standard quirks being deployed is higher than CXL subsystem is able
> to keep pace.
> 
> The immediate workaround to this problem that Linux discovered is to
> disable the cxl_acpi driver. Can you confirm that preventing the
> cxl_acpi driver from loading restores that missing node?

Yes, disabling CONFIG_CXL_ACPI gets the CXL node up with the memory. 
Thanks for this workaround.

> 
> Longer term the urgency of Nathan's patch [1] needs to be escalated.
> 
> [1]: http://lore.kernel.org/20241202155542.22111-1-nathan.fontenot@amd.com

I didn't find that patch helping though with CONFIG_CXL_ACPI kept enabled.

Regards,
Bharata.
Raghavendra K T Dec. 12, 2024, 1:02 a.m. UTC | #18
On 12/11/2024 10:17 PM, Bharata B Rao wrote:
> On 11-Dec-24 2:08 PM, Dan Williams wrote:
>> Ah, thanks for that!
>>
>> So, it turns out Ying's patch is working as advertised. It is traversing
>> past the top-level entry of the the iomem_resource topology to 
>> discover that a
>> Soft Reserved range is described by CXL. Then the expectation is that
>> the CXL subsystem parses the topology and registers a dax device.
>>
>> I missed that detail in my repro because I was not testing the HMEM
>> handoff.
>>
>> Now, the dmesg shows that the CXL subsystem gives up early on the
>> CXL devices as they do not appear to be meeting the expecations of a "CXL
>> Memory Device" as described by the "PCI Header - Class Code Register
>> (Offset 09h)" of the CXL spec.
>>
>> Specifically these messages:
>>
>>      [    4.449072] cxl_pci 0000:9f:00.0: registers not found: status 
>> mbox memdev
>>
>> I think this situation is increasingly showing that the pace of
>> non-standard quirks being deployed is higher than CXL subsystem is able
>> to keep pace.
>>
>> The immediate workaround to this problem that Linux discovered is to
>> disable the cxl_acpi driver. Can you confirm that preventing the
>> cxl_acpi driver from loading restores that missing node?
> 
> Yes, disabling CONFIG_CXL_ACPI gets the CXL node up with the memory. 
> Thanks for this workaround.
> 

Thank you Bharata for filling in.

Thank you Dan, Andy, Gregory, Ying for your input. (sorry for the delay 
because machine was not available as pointed by Bharata).

This helps to proceed with

CONFIG_CXL_ACPI=n for now in my case.

- Raghu
diff mbox series

Patch

diff --git a/kernel/resource.c b/kernel/resource.c
index c9fd26c06345..d4dcaa1831cd 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -556,55 +556,20 @@  static int __region_intersects(struct resource *parent, resource_size_t start,
 			       size_t size, unsigned long flags,
 			       unsigned long desc)
 {
+	struct resource res;
 	int type = 0; int other = 0;
-	struct resource *p, *dp;
-	struct resource res, o;
-	bool covered;
+	struct resource *p;
 
 	res.start = start;
 	res.end = start + size - 1;
 
 	for (p = parent->child; p ; p = p->sibling) {
-		if (!resource_intersection(p, &res, &o))
-			continue;
-		if (is_type_match(p, flags, desc)) {
-			type++;
-			continue;
-		}
-		/*
-		 * Continue to search in descendant resources as if the
-		 * matched descendant resources cover some ranges of 'p'.
-		 *
-		 * |------------- "CXL Window 0" ------------|
-		 * |-- "System RAM" --|
-		 *
-		 * will behave similar as the following fake resource
-		 * tree when searching "System RAM".
-		 *
-		 * |-- "System RAM" --||-- "CXL Window 0a" --|
-		 */
-		covered = false;
-		for_each_resource(p, dp, false) {
-			if (!resource_overlaps(dp, &res))
-				continue;
-			if (is_type_match(dp, flags, desc)) {
-				type++;
-				/*
-				 * Range from 'o.start' to 'dp->start'
-				 * isn't covered by matched resource.
-				 */
-				if (dp->start > o.start)
-					break;
-				if (dp->end >= o.end) {
-					covered = true;
-					break;
-				}
-				/* Remove covered range */
-				o.start = max(o.start, dp->end + 1);
-			}
-		}
-		if (!covered)
-			other++;
+		bool is_type = (((p->flags & flags) == flags) &&
+				((desc == IORES_DESC_NONE) ||
+				 (desc == p->desc)));
+
+		if (resource_overlaps(p, &res))
+			is_type ? type++ : other++;
 	}
 
 	if (type == 0)