
[v2,00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default

Message ID 167601992097.1924368.18291887895351917895.stgit@dwillia2-xfh.jf.intel.com

Message

Dan Williams Feb. 10, 2023, 9:05 a.m. UTC
Changes since v1: [1]
- Add a fix for memdev removal racing port removal (found by unit tests)
- Add a fix to unwind region target list updates on error in
  cxl_region_attach() (Jonathan)
- Move the passthrough decoder fix for submission for v6.2-final (Greg)
- Fix wrong initcall for cxl_core (Gregory and Davidlohr)
- Add an endpoint decoder state (CXL_DECODER_STATE_AUTO) to replace
  the flag CXL_DECODER_F_AUTO (Jonathan)
- Reflow cmp_decode_pos() to reduce levels of indentation (Jonathan)
- Fix a leaked reference count in cxl_add_to_region() (Jonathan)
- Make cxl_add_to_region() return an error (Jonathan)
- Fix several spurious whitespace changes (Jonathan)
- Cleanup some spurious changes from the tools/testing/cxl update
  (Jonathan)
- Test for == CXL_CONFIG_COMMIT rather than >= CXL_CONFIG_COMMIT
  (Jonathan)
- Add comment to clarify device_attach() return code expectation in
  cxl_add_to_region() (Jonathan)
- Add a patch to split cxl_port_probe() into switch and endpoint port
  probe calls (Jonathan)
- Collect reviewed-by and tested-by tags

[1]: http://lore.kernel.org/r/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com

---
Cover letter same as v1

Summary:
--------

CXL RAM support allows for the dynamic provisioning of new CXL RAM
regions, and more routinely, assembling a region from an existing
configuration established by platform-firmware. The latter is motivated
by CXL memory RAS (Reliability, Availability and Serviceability)
support, which requires associating device events with System Physical
Address ranges and vice versa.

The 'Soft Reserved' policy rework arranges for performance-differentiated
memory, like CXL-attached DRAM or high-bandwidth memory, to be
designated as 'System RAM' by default rather than the dedicated
device-dax access mode. The current device-dax default is confusing and
surprising for the majority of users, who do not expect memory to be
quarantined for dedicated access by default. Most users expect all
'System RAM'-capable memory to show up in free(1).


Details:
--------

Recall that the Linux 'Soft Reserved' designation for memory is a
reaction to platform-firmware, like EFI EDK2, delineating memory with
the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
alternative way to think of that attribute is that it marks memory that
is *not* part of the general-purpose memory pool. It is memory that may
be too precious for general usage, or not performant enough for some hot
data structures. However, in the absence of explicit policy it should
just be 'System RAM' by default.
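
(For reference, such ranges show up in /proc/iomem under the 'Soft
Reserved' resource name; the address range below is illustrative, not
taken from a real system:)

    # grep -i "soft reserved" /proc/iomem
    850000000-94fffffff : Soft Reserved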

Rather than require every distribution to ship a udev policy to assign
dax devices to dax_kmem (the device-memory hotplug driver), just make
that the kernel default. This is similar to the rationale in:

commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")
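
(For illustration, the kind of udev policy this avoids would look
something like the rule below; the match keys and daxctl invocation are
a sketch, not a rule any distribution is known to ship:)

    # route hotplugged dax devices to dax_kmem / 'System RAM'
    ACTION=="add", SUBSYSTEM=="dax", KERNEL=="dax*", \
        RUN+="/usr/bin/daxctl reconfigure-device -m system-ram %k"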

With this change, the relatively niche use case of accessing this memory
by mapping a device-dax instance can be achieved by building with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or by specifying
memhp_default_state=offline at boot, and then using:

    daxctl reconfigure-device $device -m devdax --force

...to shift the corresponding address range to device-dax access.
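
(Assuming the region surfaced as dax0.0; the device name is
illustrative. The result can be inspected, and later reverted, with
something like:)

    # check the resulting access mode
    daxctl list -u -d dax0.0
    # ...and hand the range back to the page allocator later, if desired
    daxctl reconfigure-device dax0.0 -m system-ram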

The process of assembling a device-dax instance for a given CXL region
device configuration is similar to the process of assembling a
Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
probing by the PCI and driver core enumerates all CXL endpoints and
their decoders. Then, once enough decoders have arrived to describe a
given region, that region is passed to the device-dax subsystem where it
is subject to the above 'dax_kmem' policy. This assignment and policy
choice is only possible if memory is set aside by the 'Soft Reserved'
designation. Otherwise, CXL memory that is mapped as 'System RAM' becomes
immutable to CXL driver mechanisms, but is still enumerated for RAS
purposes.
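
A rough way to confirm the end state, assuming the assembled region
surfaced as dax0.0 (names are illustrative):

    # which driver claimed the dax device backing the CXL region?
    # "kmem" indicates memory-hotplug handling, "device_dax" dedicated access
    basename "$(readlink /sys/bus/dax/devices/dax0.0/driver)"
    # the hotplugged capacity then appears as ordinary 'System RAM'
    grep "System RAM" /proc/iomem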

This series is also available via:

https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

...and has gone through some preview testing in various forms.

---

Dan Williams (20):
      cxl/memdev: Fix endpoint port removal
      cxl/Documentation: Update references to attributes added in v6.0
      cxl/region: Add a mode attribute for regions
      cxl/region: Support empty uuids for non-pmem regions
      cxl/region: Validate region mode vs decoder mode
      cxl/region: Add volatile region creation support
      cxl/region: Refactor attach_target() for autodiscovery
      cxl/region: Cleanup target list on attach error
      cxl/region: Move region-position validation to a helper
      kernel/range: Uplevel the cxl subsystem's range_contains() helper
      cxl/region: Enable CONFIG_CXL_REGION to be toggled
      cxl/port: Split endpoint and switch port probe
      cxl/region: Add region autodiscovery
      tools/testing/cxl: Define a fixed volatile configuration to parse
      dax/hmem: Move HMAT and Soft reservation probe initcall level
      dax/hmem: Drop unnecessary dax_hmem_remove()
      dax/hmem: Convey the dax range via memregion_info()
      dax/hmem: Move hmem device registration to dax_hmem.ko
      dax: Assign RAM regions to memory-hotplug by default
      cxl/dax: Create dax devices for CXL RAM regions


 Documentation/ABI/testing/sysfs-bus-cxl |   64 +-
 MAINTAINERS                             |    1 
 drivers/acpi/numa/hmat.c                |    4 
 drivers/cxl/Kconfig                     |   12 
 drivers/cxl/acpi.c                      |    3 
 drivers/cxl/core/core.h                 |    7 
 drivers/cxl/core/hdm.c                  |   25 +
 drivers/cxl/core/memdev.c               |    1 
 drivers/cxl/core/pci.c                  |    5 
 drivers/cxl/core/port.c                 |   92 ++-
 drivers/cxl/core/region.c               |  851 ++++++++++++++++++++++++++++---
 drivers/cxl/cxl.h                       |   57 ++
 drivers/cxl/cxlmem.h                    |    5 
 drivers/cxl/port.c                      |  113 +++-
 drivers/dax/Kconfig                     |   17 +
 drivers/dax/Makefile                    |    2 
 drivers/dax/bus.c                       |   53 +-
 drivers/dax/bus.h                       |   12 
 drivers/dax/cxl.c                       |   53 ++
 drivers/dax/device.c                    |    3 
 drivers/dax/hmem/Makefile               |    3 
 drivers/dax/hmem/device.c               |  102 ++--
 drivers/dax/hmem/hmem.c                 |  148 +++++
 drivers/dax/kmem.c                      |    1 
 include/linux/dax.h                     |    7 
 include/linux/memregion.h               |    2 
 include/linux/range.h                   |    5 
 lib/stackinit_kunit.c                   |    6 
 tools/testing/cxl/test/cxl.c            |  147 +++++
 29 files changed, 1484 insertions(+), 317 deletions(-)
 create mode 100644 drivers/dax/cxl.c

base-commit: 172738bbccdb4ef76bdd72fc72a315c741c39161

Comments

Dan Williams Feb. 10, 2023, 5:53 p.m. UTC | #1
Dan Williams wrote:
> Changes since v1: [1]
> - Add a fix for memdev removal racing port removal (found by unit tests)
> - Add a fix to unwind region target list updates on error in
>   cxl_region_attach() (Jonathan)
> - Move the passthrough decoder fix for submission for v6.2-final (Greg)
> - Fix wrong initcall for cxl_core (Gregory and Davidlohr)
> - Add an endpoint decoder state (CXL_DECODER_STATE_AUTO) to replace
>   the flag CXL_DECODER_F_AUTO (Jonathan)
> - Reflow cmp_decode_pos() to reduce levels of indentation (Jonathan)
> - Fix a leaked reference count in cxl_add_to_region() (Jonathan)
> - Make cxl_add_to_region() return an error (Jonathan)
> - Fix several spurious whitespace changes (Jonathan)
> - Cleanup some spurious changes from the tools/testing/cxl update
>   (Jonathan)
> - Test for == CXL_CONFIG_COMMIT rather than >= CXL_CONFIG_COMMIT
>   (Jonathan)
> - Add comment to clarify device_attach() return code expectation in
>   cxl_add_to_region() (Jonathan)
> - Add a patch to split cxl_port_probe() into switch and endpoint port
>   probe calls (Jonathan)
> - Collect reviewed-by and tested-by tags
> 
> [1]: http://lore.kernel.org/r/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com
> 
> ---
> Cover letter same as v1

Thanks for all the review so far! The outstanding backlog is still too
high to definitively say this will make v6.3:

http://lore.kernel.org/r/167601992789.1924368.8083994227892600608.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167601996980.1924368.390423634911157277.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167601999378.1924368.15071142145866277623.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167601999958.1924368.9366954455835735048.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167602000547.1924368.11613151863880268868.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167602001107.1924368.11562316181038595611.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167602002771.1924368.5653558226424530127.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167602003896.1924368.10335442077318970468.stgit@dwillia2-xfh.jf.intel.com

...what I plan to do is provisionally include it in -next and then make
a judgement call next Friday.

I am encouraged by Fan's test results:

http://lore.kernel.org/r/20230208173720.GA709329@bgt-140510-bm03

...and am reminded that there are some non-trivial TODOs pent up behind
region enumeration:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4dcf49878
Gregory Price Feb. 11, 2023, 2:04 p.m. UTC | #2
On Fri, Feb 10, 2023 at 09:53:35AM -0800, Dan Williams wrote:
> Dan Williams wrote:
> > Changes since v1: [1]
> > - Add a fix for memdev removal racing port removal (found by unit tests)
> > - Add a fix to unwind region target list updates on error in
> >   cxl_region_attach() (Jonathan)
> > - Move the passthrough decoder fix for submission for v6.2-final (Greg)
> > - Fix wrong initcall for cxl_core (Gregory and Davidlohr)
> > - Add an endpoint decoder state (CXL_DECODER_STATE_AUTO) to replace
> >   the flag CXL_DECODER_F_AUTO (Jonathan)
> > - Reflow cmp_decode_pos() to reduce levels of indentation (Jonathan)
> > - Fix a leaked reference count in cxl_add_to_region() (Jonathan)
> > - Make cxl_add_to_region() return an error (Jonathan)
> > - Fix several spurious whitespace changes (Jonathan)
> > - Cleanup some spurious changes from the tools/testing/cxl update
> >   (Jonathan)
> > - Test for == CXL_CONFIG_COMMIT rather than >= CXL_CONFIG_COMMIT
> >   (Jonathan)
> > - Add comment to clarify device_attach() return code expectation in
> >   cxl_add_to_region() (Jonathan)
> > - Add a patch to split cxl_port_probe() into switch and endpoint port
> >   probe calls (Jonathan)
> > - Collect reviewed-by and tested-by tags
> > 
> > [1]: http://lore.kernel.org/r/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com
> > 
> > ---
> > Cover letter same as v1
> 
> Thanks for all the review so far! The outstanding backlog is still too
> high to definitively say this will make v6.3:
> 
> http://lore.kernel.org/r/167601992789.1924368.8083994227892600608.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167601996980.1924368.390423634911157277.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167601999378.1924368.15071142145866277623.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167601999958.1924368.9366954455835735048.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167602000547.1924368.11613151863880268868.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167602001107.1924368.11562316181038595611.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167602002771.1924368.5653558226424530127.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167602003896.1924368.10335442077318970468.stgit@dwillia2-xfh.jf.intel.com
> 
> ...what I plan to do is provisionally include it in -next and then make
> a judgement call next Friday.
> 
> I am encouraged by Fan's test results:
> 
> http://lore.kernel.org/r/20230208173720.GA709329@bgt-140510-bm03
> 
> ...and am reminded that there are some non-trivial TODOs pent up behind
> region enumeration:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4dcf49878

I have also been testing with generally positive results; I've just been
traveling, so internet has been spotty.

I'll post a full breakdown of what I've been doing on Monday, including
poking at the new kernel boot parameter and such.

~Gregory
Gregory Price Feb. 13, 2023, 6:22 p.m. UTC | #3
On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> Changes since v1: [1]
> [... snip ...]

For a single attached device - I have been finding general success.

For multiple attached devices, I'm seeing some strange behaviors.

With multiple root ports, I got some stack traces before deciding
I needed multiple CFMWs to do this "correctly", and just attached
multiple pxb-cxl to the root bus.

Obviously this configuration is "not great", and some form of
"impossible in the real world", but it's worth examining, I think.

/opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/data/qemu/images/pool/pool1.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 4G,slots=4,maxmem=16G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-netdev bridge,id=hn0,br=virbr0 \
-device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:56 \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191 \
-device pxb-cxl,id=cxl.2,bus=pcie.0,bus_nr=230 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-device cxl-rp,id=rp1,bus=cxl.1,chassis=0,port=1,slot=1 \
-device cxl-rp,id=rp2,bus=cxl.2,chassis=0,port=2,slot=2 \
-object memory-backend-ram,id=mem0,size=4G \
-object memory-backend-ram,id=mem1,size=4G \
-object memory-backend-ram,id=mem2,size=4G \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-device cxl-type3,bus=rp1,volatile-memdev=mem1,id=cxl-mem1 \
-device cxl-type3,bus=rp2,volatile-memdev=mem2,id=cxl-mem2 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=cxl.2,cxl-fmw.2.size=4G

The goal here should be to have 3 different memory expanders get their
regions created and mapped to 3 different NUMA nodes.

One piece I'm not confident about is my CFMWs
(listed more readably):

-M cxl-fmw.0.targets.0=cxl.0,
   cxl-fmw.0.size=4G,
   cxl-fmw.1.targets.0=cxl.1,
   cxl-fmw.1.size=4G,
   cxl-fmw.2.targets.0=cxl.2,
   cxl-fmw.2.size=4G

should the targets in this case be targets.0/1/2, or should all of them be targets.0?

Either way, I would expect 3 root decoders and 3 memory devices:

[root@fedora ~]# ls /sys/bus/cxl/devices/
decoder0.0  decoder1.0  decoder4.0  endpoint4  mem0  nvdimm-bridge0  port3
decoder0.1  decoder2.0  decoder5.0  endpoint5  mem1  port1           root0
decoder0.2  decoder3.0  decoder6.0  endpoint6  mem2  port2

I see the devices I expect, but I would expect the following mappings
(cxl list output at the bottom):

decoder0.0 -> mem0
decoder0.1 -> mem1
decoder0.2 -> mem2

root0 -> [decoder0.0, 0.1, 0.2]
root0 -> [port1, 2, 3]
port1 -> mem0
port2 -> mem1
port3 -> mem2

Really, I see these decoder and device mappings set up:
port1 -> mem2
port2 -> mem1
port3 -> mem0

Therefore I should expect
decoder0.0 -> mem2
decoder0.1 -> mem1
decoder0.2 -> mem0

This bears out: attempting to use any other combination produces ndctl errors.

So the numbers are backwards; maybe that's relevant, maybe it's not.
The devices are otherwise completely the same, so for the most part
everything might "just work". Let's keep testing.


[root@fedora ~]# cat create_region.sh
./ndctl/build/cxl/cxl \
  create-region \
  -m \
  -t ram \
  -d decoder0.$1 \
  -w 1 \
  -g 4096 \
  mem$2

[root@fedora ~]# ./create_region.sh 2 0
[   34.424931] cxl_region region2: Bypassing cpu_cache_invalidate_memregion() for testing!
{
  "region":"region2",
  "resource":"0x790000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder4.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region

[   34.486668] Fallback order for Node 3: 3 0
[   34.487568] Built 1 zonelists, mobility grouping on.  Total pages: 979669
[   34.488206] Policy zone: Normal
[   34.501938] Fallback order for Node 0: 0 3
[   34.502405] Fallback order for Node 1: 1 3 0
[   34.502832] Fallback order for Node 2: 2 3 0
[   34.503251] Fallback order for Node 3: 3 0
[   34.503649] Built 2 zonelists, mobility grouping on.  Total pages: 1012437
[   34.504296] Policy zone: Normal



Cool, looks good. Let's try mem1.



[root@fedora ~]# ./create_region.sh 1 1

[   98.787029] Fallback order for Node 2: 2 3 0
[   98.787630] Built 2 zonelists, mobility grouping on.  Total pages: 2019798
[   98.788483] Policy zone: Normal
[  128.301580] Fallback order for Node 0: 0 2 3
[  128.302084] Fallback order for Node 1: 1 3 2 0
[  128.302547] Fallback order for Node 2: 2 3 0
[  128.303009] Fallback order for Node 3: 3 2 0
[  128.303436] Built 3 zonelists, mobility grouping on.  Total pages: 2052566
[  128.304071] Policy zone: Normal
[ .... wait 20-30 more seconds .... ]
{
  "region":"region1",
  "resource":"0x690000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem1",
      "decoder":"decoder5.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region


This takes a LONG time to complete. Maybe that's expected, I don't know.


Let's online mem2.


[root@fedora ~]# ./create_region.sh 0 2
extra data[7]: 0x0000000000000000
emulation failure
RAX=0000000000000000 RBX=ffff8a6f90006800 RCX=0000000000100001 RDX=0000000080100010
RSI=ffffca291a400000 RDI=0000000040000000 RBP=ffff9684c0017a60 RSP=ffff9684c0017a30
R8 =ffff8a6f90006800 R9 =0000000000100001 R10=0000000000000000 R11=0000000000000001
R12=ffffca291a400000 R13=0000000000100001 R14=0000000000000000 R15=0000000080100010
RIP=ffffffffb71c5831 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 00007fd03025db40 ffffffff 00c00000
GS =0000 ffff8a6a7bd00000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 fffffe46e6e25000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe46e6e23000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00005604371ab0c8 CR3=0000000102ece000 CR4=000006e0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=83 ec 08 81 e7 00 00 00 40 74 2c 48 89 d0 48 89 ca 4c 89 c9 <f0> 48 0f c7 4e 20 0f 84 85 00 00 00 f3 90 48 83 c4 08 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d


Well that seems bad lol.  I'm not sure what to make of this since my
scrollback cuts off and the machine completely locks up.  I have never
seen "emulation failure" before.


Reboot and attempt to online that region by itself:

[root@fedora ~]# ./create_region.sh 0 2
[   21.292598] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
[   21.341753] Fallback order for Node 1: 1 0
[   21.342462] Built 1 zonelists, mobility grouping on.  Total pages: 979670
[   21.343085] Policy zone: Normal
[   21.355166] Fallback order for Node 0: 0 1
[   21.355613] Fallback order for Node 1: 1 0
[   21.356009] Fallback order for Node 2: 2 1 0
[   21.356441] Fallback order for Node 3: 3 1 0
[   21.356874] Built 2 zonelists, mobility grouping on.  Total pages: 1012438
[   21.357501] Policy zone: Normal
{
  "region":"region0",
  "resource":"0x590000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem2",
      "decoder":"decoder6.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region


That works fine, and works just like onlining the first region (2,0).

This suggests the issue is actually creating multiple regions in this
topology.



Bonus round: booting with memhp_default_state=offline

All regions successfully get created without error.



I have a few guesses, but haven't dived in yet:

1) There's a QEMU error in the way this configuration routes to various
   components of the CXL structure, and/or multiple pxb-cxl's do bad
   things and I should feel bad for doing this configuration.

2) There's something going on when creating the topology, leading to the
   inverted [decoder0.2, mem0], [decoder0.1, mem1], [decoder0.0, mem2]
   mappings, leading to inconsistent device control.  Or I'm making a
   bad assumption and this is expected behavior.

3) The memory block creation / online code is getting hung up somewhere.
   Why does the second region take forever to online?

4) Something else completely.


My gut at the moment tells me my configuration is bad, but I have no
idea why. Anyone with an idea of what I should look for, let me know.


cxl list output for completeness:

[root@fedora ~]# ./ndctl/build/cxl/cxl list -vvvv
[
  {
    "bus":"root0",
    "provider":"ACPI.CXL",
    "nr_dports":3,
    "dports":[
      {
        "dport":"pci0000:e6",
        "alias":"ACPI0016:00",
        "id":230
      },
      {
        "dport":"pci0000:bf",
        "alias":"ACPI0016:01",
        "id":191
      },
      {
        "dport":"pci0000:34",
        "alias":"ACPI0016:02",
        "id":52
      }
    ],
    "ports:root0":[
      {
        "port":"port1",
        "host":"pci0000:e6",
        "depth":1,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:e6:00.0",
            "id":2
          }
        ],
        "endpoints:port1":[
          {
            "endpoint":"endpoint5",
            "host":"mem1",
            "depth":2,
            "memdev":{
              "memdev":"mem1",
              "ram_size":4294967296,
              "serial":0,
              "host":"0000:e7:00.0",
              "partition_info":{
                "total_size":4294967296,
                "volatile_only_size":4294967296,
                "persistent_only_size":0,
                "partition_alignment_size":0
              }
            },
            "decoders:endpoint5":[
              {
                "decoder":"decoder5.0",
                "interleave_ways":1,
                "state":"disabled"
              }
            ]
          }
        ],
        "decoders:port1":[
          {
            "decoder":"decoder1.0",
            "interleave_ways":1,
            "state":"disabled",
            "nr_targets":1,
            "targets":[
              {
                "target":"0000:e6:00.0",
                "position":0,
                "id":2
              }
            ]
          }
        ]
      },
      {
        "port":"port3",
        "host":"pci0000:34",
        "depth":1,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:34:00.0",
            "id":0
          }
        ],
        "endpoints:port3":[
          {
            "endpoint":"endpoint4",
            "host":"mem0",
            "depth":2,
            "memdev":{
              "memdev":"mem0",
              "ram_size":4294967296,
              "serial":0,
              "host":"0000:35:00.0",
              "partition_info":{
                "total_size":4294967296,
                "volatile_only_size":4294967296,
                "persistent_only_size":0,
                "partition_alignment_size":0
              }
            },
            "decoders:endpoint4":[
              {
                "decoder":"decoder4.0",
                "interleave_ways":1,
                "state":"disabled"
              }
            ]
          }
        ],
        "decoders:port3":[
          {
            "decoder":"decoder3.0",
            "interleave_ways":1,
            "state":"disabled",
            "nr_targets":1,
            "targets":[
              {
                "target":"0000:34:00.0",
                "position":0,
                "id":0
              }
            ]
          }
        ]
      },
      {
        "port":"port2",
        "host":"pci0000:bf",
        "depth":1,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:bf:00.0",
            "id":1
          }
        ],
        "endpoints:port2":[
          {
            "endpoint":"endpoint6",
            "host":"mem2",
            "depth":2,
            "memdev":{
              "memdev":"mem2",
              "ram_size":4294967296,
              "serial":0,
              "host":"0000:c0:00.0",
              "partition_info":{
                "total_size":4294967296,
                "volatile_only_size":4294967296,
                "persistent_only_size":0,
                "partition_alignment_size":0
              }
            },
            "decoders:endpoint6":[
              {
                "decoder":"decoder6.0",
                "interleave_ways":1,
                "state":"disabled"
              }
            ]
          }
        ],
        "decoders:port2":[
          {
            "decoder":"decoder2.0",
            "interleave_ways":1,
            "state":"disabled",
            "nr_targets":1,
            "targets":[
              {
                "target":"0000:bf:00.0",
                "position":0,
                "id":1
              }
            ]
          }
        ]
      }
    ],
    "decoders:root0":[
      {
        "decoder":"decoder0.0",
        "resource":23890755584,
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:34",
            "alias":"ACPI0016:02",
            "position":0,
            "id":52
          }
        ]
      },
      {
        "decoder":"decoder0.1",
        "resource":28185722880,
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:bf",
            "alias":"ACPI0016:01",
            "position":0,
            "id":191
          }
        ]
      },
      {
        "decoder":"decoder0.2",
        "resource":32480690176,
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:e6",
            "alias":"ACPI0016:00",
            "position":0,
            "id":230
          }
        ]
      }
    ]
  }
]
Gregory Price Feb. 13, 2023, 6:31 p.m. UTC | #4
On Mon, Feb 13, 2023 at 01:22:17PM -0500, Gregory Price wrote:
> On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> > Changes since v1: [1]
> > [... snip ...]
> [... snip ...]
> Really i see these decoders and device mappings setup:
> port1 -> mem2
> port2 -> mem1
> port3 -> mem0

small correction:
port1 -> mem1
port3 -> mem0
port2 -> mem2

> 
> Therefore I should expect
> decoder0.0 -> mem2
> decoder0.1 -> mem1
> decoder0.2 -> mem0
> 

This ends up mapping this way, which is still further jumbled.

Something feels like there's an off-by-one
Jonathan Cameron Feb. 14, 2023, 1:35 p.m. UTC | #5
On Mon, 13 Feb 2023 13:22:17 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> > Changes since v1: [1]
> > [... snip ...]  
> 
> For a single attached device - I have been finding general success.
> 
> For multiple attached devices, I'm seeing some strange behaviors.
> 
> With multiple root ports, I got some stack traces before deciding
> I needed multiple CMFW to do this "correctly", and just attached
> multiple pxb-cxl to the root bus.

Hmm. I should get on with doing multiple HDM decoders in the host
bridge at least. (also useful to support in switch and EP obviously)

> 
> Obviously this configuration is "not great", and some form of
> "impossible in the real world", but it's worth examining i think.

Seems a reasonable simplification of what you might see on a 3-socket
system: a host bridge and CFMWS per socket.

> 
> /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/data/qemu/images/pool/pool1.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 4G,slots=4,maxmem=16G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -netdev bridge,id=hn0,br=virbr0 \
> -device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:56 \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191 \
> -device pxb-cxl,id=cxl.2,bus=pcie.0,bus_nr=230 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
> -device cxl-rp,id=rp1,bus=cxl.1,chassis=0,port=1,slot=1 \
> -device cxl-rp,id=rp2,bus=cxl.2,chassis=0,port=2,slot=2 \
> -object memory-backend-ram,id=mem0,size=4G \
> -object memory-backend-ram,id=mem1,size=4G \
> -object memory-backend-ram,id=mem2,size=4G \
> -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> -device cxl-type3,bus=rp1,volatile-memdev=mem1,id=cxl-mem1 \
> -device cxl-type3,bus=rp2,volatile-memdev=mem2,id=cxl-mem2 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=cxl.2,cxl-fmw.2.size=4G
> 
> The goal here should be to have 3 different memory expanders have their
> regions created and mapped to 3 different numa nodes.
> 
> One piece i'm not confident about is my CFMW's
> (listed more readably:)
> 
> -M cxl-fmw.0.targets.0=cxl.0,
>    cxl-fmw.0.size=4G,
>    cxl-fmw.1.targets.0=cxl.1,
>    cxl-fmw.1.size=4G,
>    cxl-fmw.2.targets.0=cxl.2,
>    cxl-fmw.2.size=4G
> 
> should targets in this case be targets.0/1/2, or all of them targets.0?

targets.0 for all of them. Fairly sure it will moan at you if you don't
do that, as part of the targets array for each CFMWS will be unspecified.
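
For contrast, a single window interleaved across two host bridges would
name multiple targets on the same window index, along the lines of this
illustrative (sizes made up) option string:

-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.targets.1=cxl.1,cxl-fmw.0.size=8G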

> 
> Either way, i would expect 3 root decoders, and 3 memory devices
> 
> [root@fedora ~]# ls /sys/bus/cxl/devices/
> decoder0.0  decoder1.0  decoder4.0  endpoint4  mem0  nvdimm-bridge0  port3
> decoder0.1  decoder2.0  decoder5.0  endpoint5  mem1  port1           root0
> decoder0.2  decoder3.0  decoder6.0  endpoint6  mem2  port2
> 
> I see the devices I expect, but i would expect the following:
> (cxl list output at the bottom)
> 
> decoder0.0 -> mem0
> decoder0.1 -> mem1
> decoder0.2 -> mem2

I don't think there is any enforcement of ordering across the various elements.
It just depends on the exact ordering of the probe calls, which race with each other.


> 
> root0 -> [decoder0.0, 0.1, 0.2]
> root0 -> [port1, 2, 3]
> port1 -> mem0
> port2 -> mem1
> port3 -> mem2
> 
> Really i see these decoders and device mappings setup:
> port1 -> mem2
> port2 -> mem1
> port3 -> mem0
> 
> Therefore I should expect
> decoder0.0 -> mem2
> decoder0.1 -> mem1
> decoder0.2 -> mem0
> 
> This bears out: attempting to use any other combination produces ndctl errors.
> 
> So the numbers are backwards, maybe that's relevant, maybe it's not.
> The devices are otherwise completely the same, so for the most part
> everything might "just work".  Lets keep testing.
> 
> 
> [root@fedora ~]# cat create_region.sh
> ./ndctl/build/cxl/cxl \
>   create-region \
>   -m \
>   -t ram \
>   -d decoder0.$1 \
>   -w 1 \
>   -g 4096 \
>   mem$2
> 
> [root@fedora ~]# ./create_region.sh 2 0
> [   34.424931] cxl_region region2: Bypassing cpu_cache_invalidate_memregion() for testing!
> {
>   "region":"region2",
>   "resource":"0x790000000",
>   "size":"4.00 GiB (4.29 GB)",
>   "type":"ram",
>   "interleave_ways":1,
>   "interleave_granularity":4096,
>   "decode_state":"commit",
>   "mappings":[
>     {
>       "position":0,
>       "memdev":"mem0",
>       "decoder":"decoder4.0"
>     }
>   ]
> }
> cxl region: cmd_create_region: created 1 region
> 
> [   34.486668] Fallback order for Node 3: 3 0
> [   34.487568] Built 1 zonelists, mobility grouping on.  Total pages: 979669
> [   34.488206] Policy zone: Normal
> [   34.501938] Fallback order for Node 0: 0 3
> [   34.502405] Fallback order for Node 1: 1 3 0
> [   34.502832] Fallback order for Node 2: 2 3 0
> [   34.503251] Fallback order for Node 3: 3 0
> [   34.503649] Built 2 zonelists, mobility grouping on.  Total pages: 1012437
> [   34.504296] Policy zone: Normal
> 
> 
> 
> Cool, looks good.  Lets try mem1
> 
> 
> 
> [root@fedora ~]# ./create_region.sh 1 1
> 
> [   98.787029] Fallback order for Node 2: 2 3 0
> [   98.787630] Built 2 zonelists, mobility grouping on.  Total pages: 2019798
> [   98.788483] Policy zone: Normal
> [  128.301580] Fallback order for Node 0: 0 2 3
> [  128.302084] Fallback order for Node 1: 1 3 2 0
> [  128.302547] Fallback order for Node 2: 2 3 0
> [  128.303009] Fallback order for Node 3: 3 2 0
> [  128.303436] Built 3 zonelists, mobility grouping on.  Total pages: 2052566
> [  128.304071] Policy zone: Normal
> [ .... wait 20-30 more seconds .... ]
> {
>   "region":"region1",
>   "resource":"0x690000000",
>   "size":"4.00 GiB (4.29 GB)",
>   "type":"ram",
>   "interleave_ways":1,
>   "interleave_granularity":4096,
>   "decode_state":"commit",
>   "mappings":[
>     {
>       "position":0,
>       "memdev":"mem1",
>       "decoder":"decoder5.0"
>     }
>   ]
> }
> cxl region: cmd_create_region: created 1 region
> 
> 
> This takes a LONG time to complete. Maybe that's expected, I don't know.
> 
> 
> Lets online mem2.
> 
> 
> [root@fedora ~]# ./create_region.sh 0 2
> extra data[7]: 0x0000000000000000
> emulation failure
> RAX=0000000000000000 RBX=ffff8a6f90006800 RCX=0000000000100001 RDX=0000000080100010
> RSI=ffffca291a400000 RDI=0000000040000000 RBP=ffff9684c0017a60 RSP=ffff9684c0017a30
> R8 =ffff8a6f90006800 R9 =0000000000100001 R10=0000000000000000 R11=0000000000000001
> R12=ffffca291a400000 R13=0000000000100001 R14=0000000000000000 R15=0000000080100010
> RIP=ffffffffb71c5831 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0000 0000000000000000 ffffffff 00c00000
> CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0000 0000000000000000 ffffffff 00c00000
> FS =0000 00007fd03025db40 ffffffff 00c00000
> GS =0000 ffff8a6a7bd00000 ffffffff 00c00000
> LDT=0000 0000000000000000 ffffffff 00c00000
> TR =0040 fffffe46e6e25000 00004087 00008b00 DPL=0 TSS64-busy
> GDT=     fffffe46e6e23000 0000007f
> IDT=     fffffe0000000000 00000fff
> CR0=80050033 CR2=00005604371ab0c8 CR3=0000000102ece000 CR4=000006e0
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000fffe0ff0 DR7=0000000000000400
> EFER=0000000000000d01
> Code=83 ec 08 81 e7 00 00 00 40 74 2c 48 89 d0 48 89 ca 4c 89 c9 <f0> 48 0f c7 4e 20 0f 84 85 00 00 00 f3 90 48 83 c4 08 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d
> 
> 
> Well that seems bad lol.  I'm not sure what to make of this since my
> scrollback cuts off and the machine completely locks up.  I have never
> seen "emulation failure" before.
> 
> 
> Reboot and attempt to online that region by itself:
> 
> [root@fedora ~]# ./create_region.sh 0 2
> [   21.292598] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
> [   21.341753] Fallback order for Node 1: 1 0
> [   21.342462] Built 1 zonelists, mobility grouping on.  Total pages: 979670
> [   21.343085] Policy zone: Normal
> [   21.355166] Fallback order for Node 0: 0 1
> [   21.355613] Fallback order for Node 1: 1 0
> [   21.356009] Fallback order for Node 2: 2 1 0
> [   21.356441] Fallback order for Node 3: 3 1 0
> [   21.356874] Built 2 zonelists, mobility grouping on.  Total pages: 1012438
> [   21.357501] Policy zone: Normal
> {
>   "region":"region0",
>   "resource":"0x590000000",
>   "size":"4.00 GiB (4.29 GB)",
>   "type":"ram",
>   "interleave_ways":1,
>   "interleave_granularity":4096,
>   "decode_state":"commit",
>   "mappings":[
>     {
>       "position":0,
>       "memdev":"mem2",
>       "decoder":"decoder6.0"
>     }
>   ]
> }
> cxl region: cmd_create_region: created 1 region
> 
> 
> That works fine, and works just like onlining the first region (2,0).
> 
> This suggests the issue is actually creating multiple regions in this
> topology.
> 
> 
> 
> Bonus round: booting with memhp_default_state=offline
> 
> All regions successfully get created without error.
> 
> 
> 
> I have a few guesses, but haven't dived in yet:
> 
> 1) There's a QEMU error in the way this configuration routes to various
>    components of the CXL structure, and/or multiple pxb-cxl's do bad
>    things and I should feel bad for doing this configuration.

Nothing looks like it should be broken in your command line etc.
There may well be a bug in qemu though.
> 
> 2) There's something going on when creating the topology, leading to the
>    inverted [decoder0.2, mem0], [decoder0.1, mem1], [decoder0.0, mem2]
>    mappings, leading to inconsistent device control.  Or I'm making a
>    bad assumption and this is expected behavior.

Expected I think.

> 
> 3) The memory block creation / online code is getting hung up somewhere.
>    Why does the second region take forever to online?

Maybe try some smaller devices, just in case it's running out of memory somewhere.
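
For example (sizes illustrative), shrinking each backend to something like

-object memory-backend-ram,id=mem0,size=512M

might rule that out.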

> 
> 4) Something else completely.
> 
> 
> My gut at the moment tells me my configuration is bad, but i have no
> idea why.  Anyone with an idea on what I should look for, let me know.
> 
> 
> cxl list output for completeness:
> 
> [root@fedora ~]# ./ndctl/build/cxl/cxl list -vvvv
> [
>   {
>     "bus":"root0",
>     "provider":"ACPI.CXL",
>     "nr_dports":3,
>     "dports":[
>       {
>         "dport":"pci0000:e6",
>         "alias":"ACPI0016:00",
>         "id":230
>       },
>       {
>         "dport":"pci0000:bf",
>         "alias":"ACPI0016:01",
>         "id":191
>       },
>       {
>         "dport":"pci0000:34",
>         "alias":"ACPI0016:02",
>         "id":52
>       }
>     ],
>     "ports:root0":[
>       {
>         "port":"port1",
>         "host":"pci0000:e6",
>         "depth":1,
>         "nr_dports":1,
>         "dports":[
>           {
>             "dport":"0000:e6:00.0",
>             "id":2
>           }
>         ],
>         "endpoints:port1":[
>           {
>             "endpoint":"endpoint5",
>             "host":"mem1",
>             "depth":2,
>             "memdev":{
>               "memdev":"mem1",
>               "ram_size":4294967296,
>               "serial":0,
>               "host":"0000:e7:00.0",
>               "partition_info":{
>                 "total_size":4294967296,
>                 "volatile_only_size":4294967296,
>                 "persistent_only_size":0,
>                 "partition_alignment_size":0
>               }
>             },
>             "decoders:endpoint5":[
>               {
>                 "decoder":"decoder5.0",
>                 "interleave_ways":1,
>                 "state":"disabled"
>               }
>             ]
>           }
>         ],
>         "decoders:port1":[
>           {
>             "decoder":"decoder1.0",
>             "interleave_ways":1,
>             "state":"disabled",
>             "nr_targets":1,
>             "targets":[
>               {
>                 "target":"0000:e6:00.0",
>                 "position":0,
>                 "id":2
>               }
>             ]
>           }
>         ]
>       },
>       {
>         "port":"port3",
>         "host":"pci0000:34",
>         "depth":1,
>         "nr_dports":1,
>         "dports":[
>           {
>             "dport":"0000:34:00.0",
>             "id":0
>           }
>         ],
>         "endpoints:port3":[
>           {
>             "endpoint":"endpoint4",
>             "host":"mem0",
>             "depth":2,
>             "memdev":{
>               "memdev":"mem0",
>               "ram_size":4294967296,
>               "serial":0,
>               "host":"0000:35:00.0",
>               "partition_info":{
>                 "total_size":4294967296,
>                 "volatile_only_size":4294967296,
>                 "persistent_only_size":0,
>                 "partition_alignment_size":0
>               }
>             },
>             "decoders:endpoint4":[
>               {
>                 "decoder":"decoder4.0",
>                 "interleave_ways":1,
>                 "state":"disabled"
>               }
>             ]
>           }
>         ],
>         "decoders:port3":[
>           {
>             "decoder":"decoder3.0",
>             "interleave_ways":1,
>             "state":"disabled",
>             "nr_targets":1,
>             "targets":[
>               {
>                 "target":"0000:34:00.0",
>                 "position":0,
>                 "id":0
>               }
>             ]
>           }
>         ]
>       },
>       {
>         "port":"port2",
>         "host":"pci0000:bf",
>         "depth":1,
>         "nr_dports":1,
>         "dports":[
>           {
>             "dport":"0000:bf:00.0",
>             "id":1
>           }
>         ],
>         "endpoints:port2":[
>           {
>             "endpoint":"endpoint6",
>             "host":"mem2",
>             "depth":2,
>             "memdev":{
>               "memdev":"mem2",
>               "ram_size":4294967296,
>               "serial":0,
>               "host":"0000:c0:00.0",
>               "partition_info":{
>                 "total_size":4294967296,
>                 "volatile_only_size":4294967296,
>                 "persistent_only_size":0,
>                 "partition_alignment_size":0
>               }
>             },
>             "decoders:endpoint6":[
>               {
>                 "decoder":"decoder6.0",
>                 "interleave_ways":1,
>                 "state":"disabled"
>               }
>             ]
>           }
>         ],
>         "decoders:port2":[
>           {
>             "decoder":"decoder2.0",
>             "interleave_ways":1,
>             "state":"disabled",
>             "nr_targets":1,
>             "targets":[
>               {
>                 "target":"0000:bf:00.0",
>                 "position":0,
>                 "id":1
>               }
>             ]
>           }
>         ]
>       }
>     ],
>     "decoders:root0":[
>       {
>         "decoder":"decoder0.0",
>         "resource":23890755584,
>         "size":4294967296,
>         "interleave_ways":1,
>         "max_available_extent":4294967296,
>         "pmem_capable":true,
>         "volatile_capable":true,
>         "accelmem_capable":true,
>         "nr_targets":1,
>         "targets":[
>           {
>             "target":"pci0000:34",
>             "alias":"ACPI0016:02",
>             "position":0,
>             "id":52
>           }
>         ]
>       },
>       {
>         "decoder":"decoder0.1",
>         "resource":28185722880,
>         "size":4294967296,
>         "interleave_ways":1,
>         "max_available_extent":4294967296,
>         "pmem_capable":true,
>         "volatile_capable":true,
>         "accelmem_capable":true,
>         "nr_targets":1,
>         "targets":[
>           {
>             "target":"pci0000:bf",
>             "alias":"ACPI0016:01",
>             "position":0,
>             "id":191
>           }
>         ]
>       },
>       {
>         "decoder":"decoder0.2",
>         "resource":32480690176,
>         "size":4294967296,
>         "interleave_ways":1,
>         "max_available_extent":4294967296,
>         "pmem_capable":true,
>         "volatile_capable":true,
>         "accelmem_capable":true,
>         "nr_targets":1,
>         "targets":[
>           {
>             "target":"pci0000:e6",
>             "alias":"ACPI0016:00",
>             "position":0,
>             "id":230
>           }
>         ]
>       }
>     ]
>   }
> ]
Fan Ni Feb. 22, 2023, 9:41 p.m. UTC | #6
On Mon, Feb 13, 2023 at 01:31:17PM -0500, Gregory Price wrote:

> On Mon, Feb 13, 2023 at 01:22:17PM -0500, Gregory Price wrote:
> > On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> > > Changes since v1: [1]
> > > [... snip ...]
> > [... snip ...]
> > Really i see these decoders and device mappings setup:
> > port1 -> mem2
> > port2 -> mem1
> > port3 -> mem0
> 
> small correction:
> port1 -> mem1
> port3 -> mem0
> port2 -> mem2
> 
> > 
> > Therefore I should expect
> > decoder0.0 -> mem2
> > decoder0.1 -> mem1
> > decoder0.2 -> mem0
> > 
> 
> this end up mapping this way, which is still further jumbled.
> 
> Something feels like there's an off-by-one
> 

Currently, the naming of memdevs can be out of order for the following
two reasons:
1. On the kernel side, the cxl pci driver does asynchronous device
probing, which can change the memdev naming even within a single OS boot
and across multiple rounds of device enumeration. The pattern can be
observed with the following steps in the guest:
	while true; do
		modprobe cxl_xxx    # cxl_xxx is a placeholder module name
		cxl list            # memdev names change across iterations (e.g. mem0 -> mem1)
		rmmod cxl_xxx
	done
This behaviour can be avoided by forcing synchronous device probing with
the following change:
--------------------------------------------
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 258004f34281..f3f90fad62b5 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -663,7 +663,7 @@ static struct pci_driver cxl_pci_driver = {
 	.probe			= cxl_pci_probe,
 	.err_handler		= &cxl_error_handlers,
 	.driver	= {
-		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
+		.probe_type = PROBE_FORCE_SYNCHRONOUS,
 	},
 };
-------------------------------------------

With the above patch, you will see consistent memdev naming within one
OS boot; however, the order can still differ from what we expect given
the QEMU config options we use. We also need a change on the QEMU side,
as shown below.

2. Currently in QEMU, multiple components at the same topology level are
stored in a data structure called QLIST, as defined in
include/qemu/queue.h. When enqueuing a component, the current QEMU code
uses QLIST_INSERT_HEAD to insert the item at the head, but when
iterating, it uses QLIST_FOREACH/QLIST_FOREACH_SAFE, which also starts
from the head of the list. That is to say, if we enqueue items P1,P2,P3
in order, then when iterating we get P3,P2,P1. In a simple test with the
code change below (always insert at the list tail), the order issue is
fixed.

----------------------------------------------------------------------------
diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index e029e7bf66..15491960e1 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -130,7 +130,7 @@ struct {                                                                \
         (listelm)->field.le_prev = &(elm)->field.le_next;               \
 } while (/*CONSTCOND*/0)
 
-#define QLIST_INSERT_HEAD(head, elm, field) do {                        \
+#define QLIST_INSERT_HEAD_OLD(head, elm, field) do {                    \
         if (((elm)->field.le_next = (head)->lh_first) != NULL)          \
                 (head)->lh_first->field.le_prev = &(elm)->field.le_next;\
         (head)->lh_first = (elm);                                       \
@@ -146,6 +146,20 @@ struct {                                                                \
         (elm)->field.le_prev = NULL;                                    \
 } while (/*CONSTCOND*/0)
 
+#define QLIST_INSERT_TAIL(head, elm, field) do {                        \
+        typeof(elm) last_p = (head)->lh_first;                          \
+        while (last_p && last_p->field.le_next)                         \
+            last_p = last_p->field.le_next;                             \
+        if (last_p)                                                     \
+            QLIST_INSERT_AFTER(last_p, elm, field);                     \
+        else                                                            \
+            QLIST_INSERT_HEAD_OLD(head, elm, field);                    \
+} while (/*CONSTCOND*/0)
+
+#define QLIST_INSERT_HEAD(head, elm, field) do {                        \
+        QLIST_INSERT_TAIL(head, elm, field);                            \
+} while (/*CONSTCOND*/0)
+
 /*
  * Like QLIST_REMOVE() but safe to call when elm is not in a list
  */
-----------------------------------------------------------------------------

The memdev naming order can also cause confusion when creating regions
for multiple memdevs under different host bridges (HBs), since the kernel
enforces an HB check to ensure the target position matches the CFMWS
configuration. To avoid the confusion, we can use "cxl list -TD" to find
the target position for a memdev, but it is kind of annoying to do that
before creating each region.
Dan Williams Feb. 22, 2023, 10:18 p.m. UTC | #7
Fan Ni wrote:
> On Mon, Feb 13, 2023 at 01:31:17PM -0500, Gregory Price wrote:
> 
> > On Mon, Feb 13, 2023 at 01:22:17PM -0500, Gregory Price wrote:
> > > On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> > > > Changes since v1: [1]
> > > > [... snip ...]
> > > [... snip ...]
> > > Really i see these decoders and device mappings setup:
> > > port1 -> mem2
> > > port2 -> mem1
> > > port3 -> mem0
> > 
> > small correction:
> > port1 -> mem1
> > port3 -> mem0
> > port2 -> mem2
> > 
> > > 
> > > Therefore I should expect
> > > decoder0.0 -> mem2
> > > decoder0.1 -> mem1
> > > decoder0.2 -> mem0
> > > 
> > 
> > this end up mapping this way, which is still further jumbled.
> > 
> > Something feels like there's an off-by-one
> > 
> 
> Currently, the naming of memdevs can be out-of-order due to the
> following two reasons,
> 1. At kernel side, cxl port driver does async device probe, which can
> change the memdev naming even within a single OS boot and among multiple
> time of device enumeration. The pattern can be observed with following
> steps in the guest,
> 	loop(){
> 	a) modprobe cxl_xxx
> 	b)cxl list  --> you will see the memdev name changes (like mem0->mem1).
> 	c) rmmod cxl_xxx
> 	}
> This behaviour can be avoided by using sync device probe by making the
> following change
> --------------------------------------------
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 258004f34281..f3f90fad62b5 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -663,7 +663,7 @@ static struct pci_driver cxl_pci_driver = {
>  	.probe			= cxl_pci_probe,
>  	.err_handler		= &cxl_error_handlers,
>  	.driver	= {
> -		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
> +		.probe_type = PROBE_FORCE_SYNCHRONOUS,
>  	},
>  };
> -------------------------------------------
> 
> The above patch, you will see consistent memdev naming within one
> OS boot, however, the order can be still different from what we expect with
> the qemu config options we use. We need to make some change at QEMU side
> also as shown below.

This is by design. Kernel device name order is not guaranteed even with
synchronous probing, and the async probing acts to make sure these names
are always effectively random for memdevs. For a memdev, the
recommendation is to identify it by 'host'/'path' or by 'serial':

# cxl list -u -m 0000:35:00.0
{
  "memdev":"mem0",
  "pmem_size":"512.00 MiB (536.87 MB)",
  "serial":"0",
  "host":"0000:35:00.0"
}


# cxl list -u -s 0
{
  "memdev":"mem0",
  "pmem_size":"512.00 MiB (536.87 MB)",
  "serial":"0",
  "host":"0000:35:00.0"
}

Although, in real life a CXL device will have a non-zero unique serial
number.

> 2. Currently in Qemu, multiple components at the same topology level are
> stored in a data structure called QLIST as defined in
> include/qemu/queue.h. When enqueuing a component, current qemu code uses
> QLIST_INSERT_HEAD to insert the item at the head, but when iterating, it
> uses QLIST_FOREACH/QLIST_FOREACH_SAFE which is also from the head of the
> list. That is to say, if we enqueue items P1,P2,P3 in order, when iterating,
> we get P3,P2,P1. I have a simple test with the below code change(always
> insert to the list tail), the order issue is fixed.

Again, the kernel does not, and should not be expected to, guarantee
device name ordering. Perhaps this merits /dev/cxl/by-path and
/dev/cxl/by-id, similar to /dev/disk/by-path and /dev/disk/by-id, for
semi-persistent / persistent naming.
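
A minimal sketch of what a by-id rule could look like (hypothetical, not
an existing udev policy; it assumes the memdev 'serial' sysfs attribute
is exposed to udev):

ACTION=="add", SUBSYSTEM=="cxl", KERNEL=="mem[0-9]*", ATTR{serial}=="?*", \
    SYMLINK+="cxl/by-id/memdev-$attr{serial}"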

That's a conversation to have with the systemd-udev folks.