
[v7,00/46] CXL 2.0 emulation Support

Message ID 20220306174137.5707-1-Jonathan.Cameron@huawei.com

Message

Jonathan Cameron March 6, 2022, 5:40 p.m. UTC
Ideally I'd love it if we could start picking up the earlier
sections of this series, as I think those have been reasonably
well reviewed and should not be particularly controversial
(perhaps up to patch 15, in line with what Michael Tsirkin suggested
on v5).

There is one core memory handling related patch (34) marked as RFC.
Whilst its impact seems small to me, I'm not sure it is the best way
to meet our requirements wrt interleaving.

Changes since v6:

Thanks to all who have taken a look.
A small amount of reordering was necessary due to the LSA fix in patch 17.
The test moved forward to patch 22, so all intermediate patches
move -1 in the series.

(New stuff)
- Switch support.  Needed to support more interesting topologies.
(Ben Widawsky)
- Patch 17: Fix a reversed condition on the presence of an LSA that meant these
  never got properly initialized. A related change ensures the cxl_type3 test
  always provides an LSA. We can relax this later when adding volatile memory
  support.
(Markus Armbruster)
- Patch 27: Change -cxl-fixed-memory-window option handling to use
  qobject_input_visitor_new_str().  This changed the required handling of the
  targets parameter to require an array index, hence test and docs updates,
  e.g. targets.1=cxl_hb0,targets.2=cxl_hb1
  (Patches 38,40,42,43)
- Add missing structure element docs and a version number (optimistic :)
(Alex Bennée)
- Added Reviewed-by tags.  Thanks!
- Series-wide: switch to compiler.h QEMU_BUILD_BUG_ON/MSG, QEMU_PACKED
  and QEMU_ALIGNED as Alex suggested in patch 20.
- Patch 6: Dropped documentation for a non-existent lock.
           Added an error code suitable for unimplemented commands.
           Reordered code for better readability.
- Patch 9: Reorder as suggested to avoid a goto.
- Patch 16: Add LOG_UNIMP message where feature not yet implemented.
            Drop "Explain" comment that doesn't explain anything.
- Patch 18: Drop pointless void * cast.
            Add assertion as suggested (without divide).
- Patch 19: Use pstrcpy rather than snprintf for a fixed string.
            The compiler.h comment was in this patch but affects a
            number of other patches as well.
- Patch 20: Move structure CXLType3Dev to header when originally
            introduced so changes are more obvious in this patch.
- Patch 21: Substantial refactor to resolve unclear use of sizeof
            on the LSA command header. It now uses a variable-length
            last element so we can use offsetof().
- Patch 22: Use g_autoptr() to avoid the need for an explicit free in tests.
            Similar in later patches.
- Patch 29: Minor reorganization as suggested.

(Tidy up from me)
- Trivial stuff like moving header includes to patch where first used.
- Patch 17: Drop ifndef protections from TYPE_CXL_TYPE3_DEV as there
            doesn't seem to be a reason for them.

The series is organized to allow it to be taken in stages if the maintainers
prefer that approach. Most sets end with the addition of appropriate
tests (TBD for the final set).

Patches 0-15 - CXL PXB
Patches 16-22 - Type 3 Device, Root Port
Patches 23-40 - ACPI, board elements and interleave decoding to enable x86 hosts
Patches 41-42 - arm64 support on virt.
Patch 43 - Initial documentation
Patches 44-46 - Switch support.

Gitlab CI is proving challenging to get a completely clean bill of health,
as there seem to be some intermittent failures in common with the
main QEMU gitlab: in particular, an ASAN leak error that appears in some
upstream CI runs, and build-oss-fuzz timeouts.
Results are at http://gitlab.com/jic23/qemu cxl-v7-draft-2-for-test,
which also includes the DOE/CDAT patches and serial number support that
will form part of a future series.

Updated background info:

Looking in particular for:
* Review of the PCI interactions
* x86 and ARM machine interactions (particularly the memory maps)
* Review of the interleaving approach - is the basic idea
  acceptable?
* Review of the command line interface.
* CXL-related review welcome, but much of that got reviewed
  in earlier versions and hasn't changed substantially.
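For the command line interface item, here is a rough sketch of the kind of
invocation this series enables, based on the option syntax described in the
changelog above. All ids, paths and sizes are made up for illustration, and
the -cxl-fixed-memory-window syntax reflects this revision only, so it may
differ in later versions:

```shell
# Hypothetical topology: one CXL host bridge, one root port, one type 3
# device with persistent memory and LSA backends (names illustrative).
qemu-system-x86_64 -machine q35,cxl=on -m 4G \
  -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=256M \
  -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/cxl-lsa0,size=256M \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl_hb0 \
  -device cxl-rp,port=0,bus=cxl_hb0,id=root_port0 \
  -device cxl-type3,bus=root_port0,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
  -cxl-fixed-memory-window targets.0=cxl_hb0,size=4G
```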

Big TODOs:

* Volatile memory devices (easy but it's more code so left for now).
* Hotplug?  May not need much but it's not tested yet!
* More tests and tighter verification that values written to hardware
  are actually valid - stuff that real hardware would check.
* Testing, testing and more testing.  I have been running a basic
  set of ARM and x86 tests on this, but there is always room for
  more tests and greater automation.
* CFMWS flags as requested by Ben.

Why do we want QEMU emulation of CXL?

As Ben stated in v3, QEMU support has been critical to getting OS
software written given the lack of availability of hardware supporting the
latest CXL features (coupled with very high demand for support being
ready in a timely fashion). What has become clear since Ben's v3
is that this situation is an ongoing one. Whilst we can't talk about
them yet, CXL 3.0 features and OS support have been prototyped on
top of this support, and a lot of the ongoing kernel work is being
tested against these patches. The kernel CXL mocking code allows
some forms of testing, but QEMU provides a more versatile and
extensible platform.

Other features on the qemu-list that build on these include PCI-DOE
/CDAT support from the Avery Design team, further showing how this
code is useful. Whilst not directly related, this is also the test
platform for work on PCI IDE/CMA and related DMTF SPDM, as CXL both
utilizes and extends those technologies and is likely to be an early
adopter.
Refs:
CMA Kernel: https://lore.kernel.org/all/20210804161839.3492053-1-Jonathan.Cameron@huawei.com/
CMA Qemu: https://lore.kernel.org/qemu-devel/1624665723-5169-1-git-send-email-cbrowy@avery-design.com/
DOE Qemu: https://lore.kernel.org/qemu-devel/1623329999-15662-1-git-send-email-cbrowy@avery-design.com/

As can be seen, there is non-trivial interaction with other areas of
QEMU, particularly PCI, and keeping this set up to date is proving
a burden we'd rather do without :)

Ben mentioned a few other good reasons in v3:
https://lore.kernel.org/qemu-devel/20210202005948.241655-1-ben.widawsky@intel.com/

What we have here is about what you need for it to be useful for testing
current kernel code.  Note the kernel code is moving fast, so
since v4 some features have been introduced that we don't yet support in
QEMU (e.g. use of the PCIe serial number extended capability).

All comments welcome.

Additional info that was here in v5 is now in the documentation patch.

Thanks,

Jonathan

Ben Widawsky (24):
  hw/pci/cxl: Add a CXL component type (interface)
  hw/cxl/component: Introduce CXL components (8.1.x, 8.2.5)
  hw/cxl/device: Introduce a CXL device (8.2.8)
  hw/cxl/device: Implement the CAP array (8.2.8.1-2)
  hw/cxl/device: Implement basic mailbox (8.2.8.4)
  hw/cxl/device: Add memory device utilities
  hw/cxl/device: Add cheap EVENTS implementation (8.2.9.1)
  hw/cxl/device: Timestamp implementation (8.2.9.3)
  hw/cxl/device: Add log commands (8.2.9.4) + CEL
  hw/pxb: Use a type for realizing expanders
  hw/pci/cxl: Create a CXL bus type
  hw/pxb: Allow creation of a CXL PXB (host bridge)
  hw/cxl/rp: Add a root port
  hw/cxl/device: Add a memory device (8.2.8.5)
  hw/cxl/device: Implement MMIO HDM decoding (8.2.5.12)
  hw/cxl/device: Add some trivial commands
  hw/cxl/device: Plumb real Label Storage Area (LSA) sizing
  hw/cxl/device: Implement get/set Label Storage Area (LSA)
  hw/cxl/component: Implement host bridge MMIO (8.2.5, table 142)
  acpi/cxl: Add _OSC implementation (9.14.2)
  acpi/cxl: Create the CEDT (9.14.1)
  acpi/cxl: Introduce CFMWS structures in CEDT
  hw/cxl/component Add a dumb HDM decoder handler
  qtest/cxl: Add more complex test cases with CFMWs

Jonathan Cameron (22):
  MAINTAINERS: Add entry for Compute Express Link Emulation
  cxl: Machine level control on whether CXL support is enabled
  qtest/cxl: Introduce initial test for pxb-cxl only.
  qtests/cxl: Add initial root port and CXL type3 tests
  hw/cxl/component: Add utils for interleave parameter encoding/decoding
  hw/cxl/host: Add support for CXL Fixed Memory Windows.
  hw/pci-host/gpex-acpi: Add support for dsdt construction for pxb-cxl
  pci/pcie_port: Add pci_find_port_by_pn()
  CXL/cxl_component: Add cxl_get_hb_cstate()
  mem/cxl_type3: Add read and write functions for associated hostmem.
  cxl/cxl-host: Add memops for CFMWS region.
  RFC: softmmu/memory: Add ops to memory_region_ram_init_from_file
  i386/pc: Enable CXL fixed memory windows
  tests/acpi: q35: Allow addition of a CXL test.
  qtests/bios-tables-test: Add a test for CXL emulation.
  tests/acpi: Add tables for CXL emulation.
  hw/arm/virt: Basic CXL enablement on pci_expander_bridge instances
    pxb-cxl
  qtest/cxl: Add aarch64 virt test for CXL
  docs/cxl: Add initial Compute eXpress Link (CXL) documentation.
  pci-bridge/cxl_upstream: Add a CXL switch upstream port
  pci-bridge/cxl_downstream: Add a CXL switch downstream port
  cxl/cxl-host: Support interleave decoding with one level of switches.

 MAINTAINERS                         |   7 +
 docs/system/device-emulation.rst    |   1 +
 docs/system/devices/cxl.rst         | 302 +++++++++++++++++
 hw/Kconfig                          |   1 +
 hw/acpi/Kconfig                     |   5 +
 hw/acpi/cxl-stub.c                  |  12 +
 hw/acpi/cxl.c                       | 231 +++++++++++++
 hw/acpi/meson.build                 |   4 +-
 hw/arm/Kconfig                      |   1 +
 hw/arm/virt-acpi-build.c            |  33 ++
 hw/arm/virt.c                       |  40 ++-
 hw/core/machine.c                   |  28 ++
 hw/cxl/Kconfig                      |   3 +
 hw/cxl/cxl-component-utils.c        | 284 ++++++++++++++++
 hw/cxl/cxl-device-utils.c           | 265 +++++++++++++++
 hw/cxl/cxl-host-stubs.c             |  16 +
 hw/cxl/cxl-host.c                   | 262 +++++++++++++++
 hw/cxl/cxl-mailbox-utils.c          | 485 ++++++++++++++++++++++++++++
 hw/cxl/meson.build                  |  12 +
 hw/i386/acpi-build.c                |  57 +++-
 hw/i386/pc.c                        |  57 +++-
 hw/mem/Kconfig                      |   5 +
 hw/mem/cxl_type3.c                  | 352 ++++++++++++++++++++
 hw/mem/meson.build                  |   1 +
 hw/meson.build                      |   1 +
 hw/pci-bridge/Kconfig               |   5 +
 hw/pci-bridge/cxl_downstream.c      | 229 +++++++++++++
 hw/pci-bridge/cxl_root_port.c       | 231 +++++++++++++
 hw/pci-bridge/cxl_upstream.c        | 206 ++++++++++++
 hw/pci-bridge/meson.build           |   1 +
 hw/pci-bridge/pci_expander_bridge.c | 172 +++++++++-
 hw/pci-bridge/pcie_root_port.c      |   6 +-
 hw/pci-host/gpex-acpi.c             |  20 +-
 hw/pci/pci.c                        |  21 +-
 hw/pci/pcie_port.c                  |  25 ++
 include/hw/acpi/cxl.h               |  28 ++
 include/hw/arm/virt.h               |   1 +
 include/hw/boards.h                 |   2 +
 include/hw/cxl/cxl.h                |  54 ++++
 include/hw/cxl/cxl_component.h      | 207 ++++++++++++
 include/hw/cxl/cxl_device.h         | 270 ++++++++++++++++
 include/hw/cxl/cxl_pci.h            | 156 +++++++++
 include/hw/pci/pci.h                |  14 +
 include/hw/pci/pci_bridge.h         |  20 ++
 include/hw/pci/pci_bus.h            |   7 +
 include/hw/pci/pci_ids.h            |   1 +
 include/hw/pci/pcie_port.h          |   2 +
 qapi/machine.json                   |  18 ++
 qemu-options.hx                     |  38 +++
 scripts/device-crash-test           |   1 +
 softmmu/memory.c                    |   9 +
 softmmu/vl.c                        |  42 +++
 tests/data/acpi/q35/CEDT.cxl        | Bin 0 -> 184 bytes
 tests/data/acpi/q35/DSDT.cxl        | Bin 0 -> 9627 bytes
 tests/qtest/bios-tables-test.c      |  44 +++
 tests/qtest/cxl-test.c              | 181 +++++++++++
 tests/qtest/meson.build             |   5 +
 57 files changed, 4456 insertions(+), 25 deletions(-)
 create mode 100644 docs/system/devices/cxl.rst
 create mode 100644 hw/acpi/cxl-stub.c
 create mode 100644 hw/acpi/cxl.c
 create mode 100644 hw/cxl/Kconfig
 create mode 100644 hw/cxl/cxl-component-utils.c
 create mode 100644 hw/cxl/cxl-device-utils.c
 create mode 100644 hw/cxl/cxl-host-stubs.c
 create mode 100644 hw/cxl/cxl-host.c
 create mode 100644 hw/cxl/cxl-mailbox-utils.c
 create mode 100644 hw/cxl/meson.build
 create mode 100644 hw/mem/cxl_type3.c
 create mode 100644 hw/pci-bridge/cxl_downstream.c
 create mode 100644 hw/pci-bridge/cxl_root_port.c
 create mode 100644 hw/pci-bridge/cxl_upstream.c
 create mode 100644 include/hw/acpi/cxl.h
 create mode 100644 include/hw/cxl/cxl.h
 create mode 100644 include/hw/cxl/cxl_component.h
 create mode 100644 include/hw/cxl/cxl_device.h
 create mode 100644 include/hw/cxl/cxl_pci.h
 create mode 100644 tests/data/acpi/q35/CEDT.cxl
 create mode 100644 tests/data/acpi/q35/DSDT.cxl
 create mode 100644 tests/qtest/cxl-test.c

Comments

Michael S. Tsirkin March 6, 2022, 9:33 p.m. UTC | #1
On Sun, Mar 06, 2022 at 05:40:51PM +0000, Jonathan Cameron wrote:
> Ideally I'd love it if we could start picking up the earlier
> sections of this series as I think those have been reasonably
> well reviewed and should not be particularly controversial.
> (perhaps up to patch 15 inline with what Michael Tsirkin suggested
> on v5).

Well, true, but given we are entering freeze this will leave
us with half-baked devices which can't be used.
At this point, if we can't merge it up to documentation then
I think we should wait until after the release.

Jonathan Cameron March 7, 2022, 9:39 a.m. UTC | #2
On Sun, 6 Mar 2022 16:33:40 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Sun, Mar 06, 2022 at 05:40:51PM +0000, Jonathan Cameron wrote:
> > Ideally I'd love it if we could start picking up the earlier
> > sections of this series as I think those have been reasonably
> > well reviewed and should not be particularly controversial.
> > (perhaps up to patch 15 inline with what Michael Tsirkin suggested
> > on v5).  
> 
> Well, true, but given we are entering freeze this will leave
> us with half-baked devices which can't be used.
> At this point, if we can't merge it up to documentation then
> I think we should wait until after the release.

Makes sense.

If any of the memory maintainers can take a look at patch 34, that would
be great, as to my mind that and the related interleave decoding in general is
the big unknown in this set. I just realized I haven't cc'd everyone
I should have for that - I've added them here and will make sure to CC them
all on v8.

Thanks.

Jonathan

> 
> > There is one core memory handling related patch (34) marked as RFC.
> > Whilst it's impact seems small to me, I'm not sure it is the best way
> > to meet our requirements wrt interleaving.
> > 
> > Changes since v7:
> > 
> > Thanks to all who have taken a look.
> > Small amount of reordering was necessary due to LSA fix in patch 17.
> > Test moved forwards to patch 22 and so all intermediate patches
> > move -1 in the series.
> > 
> > (New stuff)
> > - Switch support.  Needed to support more interesting topologies.
> > (Ben Widawsky)
> > - Patch 17: Fix reversed condition on presence of LSA that meant these never
> >   got properly initialized. Related change needed to ensure test for cxl_type3
> >   always needs an LSA. We can relax this later when adding volatile memory
> >   support.
> > (Markus Armbuster)
> > - Patch 27: Change -cxl-fixed-memory-window option handling to use
> >   qobject_input_visitor_new_str().  This changed the required handling of
> >   targets parameter to require an array index and hence test and docs updates.
> >   e.g. targets.1=cxl_hb0,targets.2=cxl_hb1
> >   (Patches 38,40,42,43)
> > - Missing structure element docs and version number (optimisitic :)
> > (Alex Bennée)
> > - Added Reviewed-by tags.  Thanks!.
> > - Series wise: Switch to compiler.h QEMU_BUILD_BUG_ON/MSG QEMU_PACKED
> >   and QEMU_ALIGNED as Alex suggested in patch 20.
> > - Patch 6: Dropped documentation for a non-existent lock.
> >            Added error code suitable for unimplemented commands.
> > 	   Reordered code for better readability.
> > - Patch 9: Reorder as suggested to avoid a goto.
> > - Patch 16: Add LOG_UNIMP message where feature not yet implemented.
> >             Drop "Explain" comment that doesn't explain anything.
> > - Patch 18: Drop pointless void * cast.
> >             Add assertion as suggested (without divide)
> > - Patch 19: Use pstrcpy rather than snprintf for a fixed string.
> >             The compiler.h comment was in this patch but affects a
> > 	    number of other patches as well.
> > - Patch 20: Move structure CXLType3Dev to header when originally
> >             introduced so changes are more obvious in this patch.
> > - Patch 21: Substantial refactor to resolve unclear use of sizeof
> >             on the LSA command header. Now uses a variable length
> > 	    last element so we can use offsetof()
> > - Patch 22: Use g_autoptr() to avoid need for explicit free in tests
> >   	    Similar in later patches.
> > - Patch 29: Minor reorganziation as suggested.
> > 	    
> > (Tidy up from me)
> > - Trivial stuff like moving header includes to patch where first used.
> > - Patch 17: Drop ifndef protections from TYPE_CXL_TYPE3_DEV as there
> >             doesn't seem to be a reason.
> > 
> > Series organized to allow it to be taken in stages if the maintainers
> > prefer that approach. Most sets end with the addition of appropriate
> > tests (TBD for final set)
> > 
> > Patches 0-15 - CXL PXB
> > Patches 16-22 - Type 3 Device, Root Port
> > Patches 23-40 - ACPI, board elements and interleave decoding to enable x86 hosts
> > Patches 41-42 - arm64 support on virt.
> > Patch 43 - Initial documentation
> > Patches 44-46 - Switch support.
> > 
> > Gitlab CI is proving challenging to get a completely clean bill of health
> > as there seem to be some intermittent failures in common with the
> > main QEMU gitlab. In particular an ASAN leak error that appears in some
> > upstream CI runs and build-oss-fuzz timeouts.
> > Results at http://gitlab.com/jic23/qemu cxl-v7-draft-2-for-test
> > which also includes the DOE/CDAT patches serial number support which
> > will form part of a future series.
> > 
> > Updated background info:
> > 
> > Looking in particular for:
> > * Review of the PCI interactions
> > * x86 and ARM machine interactions (particularly the memory maps)
> > * Review of the interleaving approach - is the basic idea
> >   acceptable?
> > * Review of the command line interface.
> > * CXL related review welcome but much of that got reviewed
> >   in earlier versions and hasn't changed substantially.
> > 
> > Big TODOs:
> > 
> > * Volatile memory devices (easy but it's more code so left for now).
> > * Hotplug?  May not need much but it's not tested yet!
> > * More tests and tighter verification that values written to hardware
> >   are actually valid - stuff that real hardware would check.
> > * Testing, testing and more testing.  I have been running a basic
> >   set of ARM and x86 tests on this, but there is always room for
> >   more tests and greater automation.
> > * CFMWS flags as requested by Ben.
> > 
> > Why do we want QEMU emulation of CXL?
> > 
> > As Ben stated in v3, QEMU support has been critical to getting OS
> > software written given the lack of availability of hardware supporting
> > the latest CXL features (coupled with very high demand for support
> > being ready in a timely fashion). What has become clear since Ben's v3
> > is that this situation is an ongoing one. Whilst we can't talk about
> > them yet, CXL 3.0 features and OS support have been prototyped on
> > top of this support, and a lot of the ongoing kernel work is being
> > tested against these patches. The kernel CXL mocking code allows
> > some forms of testing, but QEMU provides a more versatile and
> > extensible platform.
> > 
> > Other features on the qemu-list that build on these include PCI-DOE
> > /CDAT support from the Avery Design team, further showing how this
> > code is useful. Whilst not directly related, this is also the test
> > platform for work on PCI IDE/CMA + related DMTF SPDM, as CXL both
> > utilizes and extends those technologies and is likely to be an early
> > adopter.
> > Refs:
> > CMA Kernel: https://lore.kernel.org/all/20210804161839.3492053-1-Jonathan.Cameron@huawei.com/
> > CMA Qemu: https://lore.kernel.org/qemu-devel/1624665723-5169-1-git-send-email-cbrowy@avery-design.com/
> > DOE Qemu: https://lore.kernel.org/qemu-devel/1623329999-15662-1-git-send-email-cbrowy@avery-design.com/
> > 
> > As can be seen there is non-trivial interaction with other areas of
> > QEMU, particularly PCI, and keeping this set up to date is proving
> > a burden we'd rather do without :)
> > 
> > Ben mentioned a few other good reasons in v3:
> > https://lore.kernel.org/qemu-devel/20210202005948.241655-1-ben.widawsky@intel.com/
> > 
> > What we have here is about what you need for it to be useful for testing
current kernel code.  Note the kernel code is moving fast so
> > since v4, some features have been introduced we don't yet support in
> > QEMU (e.g. use of the PCIe serial number extended capability).
> > 
> > All comments welcome.
> > 
> > Additional info that was here in v5 is now in the documentation patch.
> > 
> > Thanks,
> > 
> > Jonathan
> > 
> > Ben Widawsky (24):
> >   hw/pci/cxl: Add a CXL component type (interface)
> >   hw/cxl/component: Introduce CXL components (8.1.x, 8.2.5)
> >   hw/cxl/device: Introduce a CXL device (8.2.8)
> >   hw/cxl/device: Implement the CAP array (8.2.8.1-2)
> >   hw/cxl/device: Implement basic mailbox (8.2.8.4)
> >   hw/cxl/device: Add memory device utilities
> >   hw/cxl/device: Add cheap EVENTS implementation (8.2.9.1)
> >   hw/cxl/device: Timestamp implementation (8.2.9.3)
> >   hw/cxl/device: Add log commands (8.2.9.4) + CEL
> >   hw/pxb: Use a type for realizing expanders
> >   hw/pci/cxl: Create a CXL bus type
> >   hw/pxb: Allow creation of a CXL PXB (host bridge)
> >   hw/cxl/rp: Add a root port
> >   hw/cxl/device: Add a memory device (8.2.8.5)
> >   hw/cxl/device: Implement MMIO HDM decoding (8.2.5.12)
> >   hw/cxl/device: Add some trivial commands
> >   hw/cxl/device: Plumb real Label Storage Area (LSA) sizing
> >   hw/cxl/device: Implement get/set Label Storage Area (LSA)
> >   hw/cxl/component: Implement host bridge MMIO (8.2.5, table 142)
> >   acpi/cxl: Add _OSC implementation (9.14.2)
> >   acpi/cxl: Create the CEDT (9.14.1)
> >   acpi/cxl: Introduce CFMWS structures in CEDT
> >   hw/cxl/component Add a dumb HDM decoder handler
> >   qtest/cxl: Add more complex test cases with CFMWs
> > 
> > Jonathan Cameron (22):
> >   MAINTAINERS: Add entry for Compute Express Link Emulation
> >   cxl: Machine level control on whether CXL support is enabled
> >   qtest/cxl: Introduce initial test for pxb-cxl only.
> >   qtests/cxl: Add initial root port and CXL type3 tests
> >   hw/cxl/component: Add utils for interleave parameter encoding/decoding
> >   hw/cxl/host: Add support for CXL Fixed Memory Windows.
> >   hw/pci-host/gpex-acpi: Add support for dsdt construction for pxb-cxl
> >   pci/pcie_port: Add pci_find_port_by_pn()
> >   CXL/cxl_component: Add cxl_get_hb_cstate()
> >   mem/cxl_type3: Add read and write functions for associated hostmem.
> >   cxl/cxl-host: Add memops for CFMWS region.
> >   RFC: softmmu/memory: Add ops to memory_region_ram_init_from_file
> >   i386/pc: Enable CXL fixed memory windows
> >   tests/acpi: q35: Allow addition of a CXL test.
> >   qtests/bios-tables-test: Add a test for CXL emulation.
> >   tests/acpi: Add tables for CXL emulation.
> >   hw/arm/virt: Basic CXL enablement on pci_expander_bridge instances
> >     pxb-cxl
> >   qtest/cxl: Add aarch64 virt test for CXL
> >   docs/cxl: Add initial Compute eXpress Link (CXL) documentation.
> >   pci-bridge/cxl_upstream: Add a CXL switch upstream port
> >   pci-bridge/cxl_downstream: Add a CXL switch downstream port
> >   cxl/cxl-host: Support interleave decoding with one level of switches.
> > 
> >  MAINTAINERS                         |   7 +
> >  docs/system/device-emulation.rst    |   1 +
> >  docs/system/devices/cxl.rst         | 302 +++++++++++++++++
> >  hw/Kconfig                          |   1 +
> >  hw/acpi/Kconfig                     |   5 +
> >  hw/acpi/cxl-stub.c                  |  12 +
> >  hw/acpi/cxl.c                       | 231 +++++++++++++
> >  hw/acpi/meson.build                 |   4 +-
> >  hw/arm/Kconfig                      |   1 +
> >  hw/arm/virt-acpi-build.c            |  33 ++
> >  hw/arm/virt.c                       |  40 ++-
> >  hw/core/machine.c                   |  28 ++
> >  hw/cxl/Kconfig                      |   3 +
> >  hw/cxl/cxl-component-utils.c        | 284 ++++++++++++++++
> >  hw/cxl/cxl-device-utils.c           | 265 +++++++++++++++
> >  hw/cxl/cxl-host-stubs.c             |  16 +
> >  hw/cxl/cxl-host.c                   | 262 +++++++++++++++
> >  hw/cxl/cxl-mailbox-utils.c          | 485 ++++++++++++++++++++++++++++
> >  hw/cxl/meson.build                  |  12 +
> >  hw/i386/acpi-build.c                |  57 +++-
> >  hw/i386/pc.c                        |  57 +++-
> >  hw/mem/Kconfig                      |   5 +
> >  hw/mem/cxl_type3.c                  | 352 ++++++++++++++++++++
> >  hw/mem/meson.build                  |   1 +
> >  hw/meson.build                      |   1 +
> >  hw/pci-bridge/Kconfig               |   5 +
> >  hw/pci-bridge/cxl_downstream.c      | 229 +++++++++++++
> >  hw/pci-bridge/cxl_root_port.c       | 231 +++++++++++++
> >  hw/pci-bridge/cxl_upstream.c        | 206 ++++++++++++
> >  hw/pci-bridge/meson.build           |   1 +
> >  hw/pci-bridge/pci_expander_bridge.c | 172 +++++++++-
> >  hw/pci-bridge/pcie_root_port.c      |   6 +-
> >  hw/pci-host/gpex-acpi.c             |  20 +-
> >  hw/pci/pci.c                        |  21 +-
> >  hw/pci/pcie_port.c                  |  25 ++
> >  include/hw/acpi/cxl.h               |  28 ++
> >  include/hw/arm/virt.h               |   1 +
> >  include/hw/boards.h                 |   2 +
> >  include/hw/cxl/cxl.h                |  54 ++++
> >  include/hw/cxl/cxl_component.h      | 207 ++++++++++++
> >  include/hw/cxl/cxl_device.h         | 270 ++++++++++++++++
> >  include/hw/cxl/cxl_pci.h            | 156 +++++++++
> >  include/hw/pci/pci.h                |  14 +
> >  include/hw/pci/pci_bridge.h         |  20 ++
> >  include/hw/pci/pci_bus.h            |   7 +
> >  include/hw/pci/pci_ids.h            |   1 +
> >  include/hw/pci/pcie_port.h          |   2 +
> >  qapi/machine.json                   |  18 ++
> >  qemu-options.hx                     |  38 +++
> >  scripts/device-crash-test           |   1 +
> >  softmmu/memory.c                    |   9 +
> >  softmmu/vl.c                        |  42 +++
> >  tests/data/acpi/q35/CEDT.cxl        | Bin 0 -> 184 bytes
> >  tests/data/acpi/q35/DSDT.cxl        | Bin 0 -> 9627 bytes
> >  tests/qtest/bios-tables-test.c      |  44 +++
> >  tests/qtest/cxl-test.c              | 181 +++++++++++
> >  tests/qtest/meson.build             |   5 +
> >  57 files changed, 4456 insertions(+), 25 deletions(-)
> >  create mode 100644 docs/system/devices/cxl.rst
> >  create mode 100644 hw/acpi/cxl-stub.c
> >  create mode 100644 hw/acpi/cxl.c
> >  create mode 100644 hw/cxl/Kconfig
> >  create mode 100644 hw/cxl/cxl-component-utils.c
> >  create mode 100644 hw/cxl/cxl-device-utils.c
> >  create mode 100644 hw/cxl/cxl-host-stubs.c
> >  create mode 100644 hw/cxl/cxl-host.c
> >  create mode 100644 hw/cxl/cxl-mailbox-utils.c
> >  create mode 100644 hw/cxl/meson.build
> >  create mode 100644 hw/mem/cxl_type3.c
> >  create mode 100644 hw/pci-bridge/cxl_downstream.c
> >  create mode 100644 hw/pci-bridge/cxl_root_port.c
> >  create mode 100644 hw/pci-bridge/cxl_upstream.c
> >  create mode 100644 include/hw/acpi/cxl.h
> >  create mode 100644 include/hw/cxl/cxl.h
> >  create mode 100644 include/hw/cxl/cxl_component.h
> >  create mode 100644 include/hw/cxl/cxl_device.h
> >  create mode 100644 include/hw/cxl/cxl_pci.h
> >  create mode 100644 tests/data/acpi/q35/CEDT.cxl
> >  create mode 100644 tests/data/acpi/q35/DSDT.cxl
> >  create mode 100644 tests/qtest/cxl-test.c
> > 
> > -- 
> > 2.32.0  
> 
>
Peter Xu March 9, 2022, 8:15 a.m. UTC | #3
On Mon, Mar 07, 2022 at 09:39:18AM +0000, Jonathan Cameron via wrote:
> If any of the memory maintainers can take a look at patch 34 that would
> be great as to my mind that and the related interleave decoding in general is
> the big unknown in this set. I just realized I haven't cc'd everyone
> I should have for that - added them here and I'll make sure to CC them
> all on V8.

https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/

Having mr->ops set but with memory_access_is_direct() returning true sounds
weird to me.

Sorry to have no understanding of the whole picture, but.. could you share
more on what's the interleaving requirement on the proxying, and why it
can't be done with adding some IO memory regions as sub-regions upon the
file one?

Thanks,
Jonathan Cameron March 9, 2022, 11:28 a.m. UTC | #4
On Wed, 9 Mar 2022 16:15:24 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Mar 07, 2022 at 09:39:18AM +0000, Jonathan Cameron via wrote:
> > If any of the memory maintainers can take a look at patch 34 that would
> > be great as to my mind that and the related interleave decoding in general is
> > the big unknown in this set. I just realized I haven't cc'd everyone
> > I should have for that - added them here and I'll make sure to CC them
> > all on V8.  

Hi Peter,

> 
> https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
> 
> Having mr->ops set but with memory_access_is_direct() returning true sounds
> weird to me.
> 
> Sorry to have no understanding of the whole picture, but.. could you share
> more on what's the interleaving requirement on the proxying, and why it
> can't be done with adding some IO memory regions as sub-regions upon the
> file one?

The proxying requirement is simply a means to read/write to a computed address
within a memory region. There may well be a better way to do that.

If I understand your suggestion correctly you would need a very high
number of IO memory regions to be created dynamically when particular sets of
registers across multiple devices in the topology are all programmed.

The interleave can be 256 bytes across up to 16x, many terabyte, devices.
So assuming a simple set of 16 1TB devices I think you'd need about 4x10^9
IO regions.  Even for a minimal useful test case of largest interleave
set of 16x 256MB devices (256MB is minimum size the specification allows per
decoded region at the device) and 16 way interleave we'd need 10^6 IO regions.
Any idea if that approach would scale sensibly to this number of regions?

There are also complexities to getting all the information in one place to
work out which IO memory regions maps where in PA space. Current solution is
to do that mapping in the same way the hardware does which is hierarchical,
so we walk the path to the device, picking directions based on each interleave
decoder that we meet.
Obviously this is a bit slow but I only really care about correctness at the
moment.  I can think of various approaches to speeding it up but I'm not sure
if we will ever care about performance.

https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
has the logic for that and as you can see it's fairly simple because we are always
going down the topology following the decoders.

Below I have mapped out an algorithm I think would work for doing it with
IO memory regions as subregions.

We could fake the whole thing by limiting ourselves to small host
memory windows which are always directly backed, but then I wouldn't
achieve the main aim of this which is to provide a test base for the OS code.
To do that I need real interleave so I can seed the files with test patterns
and verify the accesses hit the correct locations. Emulating what the hardware
is actually doing on a device by device basis is the easiest way I have
come up with to do that.

Let me try to provide some more background so you hopefully don't have
to have read the specs to follow what is going on!
There is an example of a directly connected (no switches) topology in the
docs

https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst

The overall picture is we have a large number of CXL Type 3 memory devices,
which at runtime (by OS at boot/on hotplug) are configured into various
interleaving sets with hierarchical decoding at the host + host bridge
+ switch levels. For test setups I probably need to go to around 32 devices
so I can hit various configurations simultaneously.
No individual device has visibility of the full interleave setup - hence
the walk in the existing code through the various decoders to find the
final Device Physical address.

At the host level the host provides a set of Physical Address windows with
a fixed interleave decoding across the different host bridges in the system
(CXL Fixed Memory windows, CFMWs)
On a real system these have to be large enough to allow for any memory
devices that might be hotplugged and all possible configurations (so
with 2 host bridges you need at least 3 windows in the many TB range,
much worse as the number of host bridges goes up). It'll be worse than
this when we have QoS groups, but the current Qemu code just puts all
the windows in group 0.  Hence my first thought of just putting memory
behind those doesn't scale (a similar approach to this was in the
earliest versions of this patch set - though the full access path
wasn't wired up).

The granularity can be in powers of 2 from 256 bytes to 16 kbytes

Next each host bridge has programmable address decoders which take the
incoming (often already interleaved) memory access and direct them to
appropriate root ports.  The root ports can be connected to a switch
which has additional address decoders in the upstream port to decide
which downstream port to route to.  Note we currently only support 1 level
of switches but it's easy to make this algorithm recursive to support
multiple switch levels (currently the kernel proposals only support 1 level)

Finally the End Point with the actual memory receives the interleaved request,
takes the full address and (for power-of-2 decoding - we don't yet support
3, 6 and 12 way which is more complex and there is no kernel support yet)
drops a few address bits and adds an offset for the decoder used to
calculate its own device physical address.  Note a device will support
multiple interleave sets for different parts of its file once we add
multiple decoder support (on the todo list).

So the current solution is straightforward (with the exception of that
proxying) because it follows the same decoding as used in real hardware
to route the memory accesses. As a result we get a read/write to a
device physical address and hence proxy that.  If any of the decoders
along the path are not configured then we error out at that stage.

To create the equivalent as IO subregions I think we'd have to do the
following (this might be mediated by some central entity that
doesn't currently exist, or done on demand from whichever CXL device
happens to have its decoder set up last)

1) Wait for a decoder commit (enable) on any component. Goto 2.
2) Walk the topology (up to host decoder, down to memory device)
If a complete interleaving path has been configured -
   i.e. we have committed decoders all the way to the memory
   device goto step 3, otherwise return to step 1 to wait for
   more decoders to be committed.
3) For the memory region being supplied by the memory device,
   add subregions to map the device physical address (address
   in the file) for each interleave stride to the appropriate
   host Physical Address.
4) Return to step 1 to wait for more decoders to commit.

So summary is we can do it with IO regions, but there are a lot of them
and the setup is somewhat complex as we don't have one single point in
time where we know all the necessary information is available to compute
the right addresses.

Looking forward to your suggestions if I haven't caused more confusion!

Thanks,

Jonathan


> 
> Thanks,
>
Peter Xu March 10, 2022, 8:02 a.m. UTC | #5
On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:
> Hi Peter,

Hi, Jonathan,

> 
> > 
> > https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
> > 
> > Having mr->ops set but with memory_access_is_direct() returning true sounds
> > weird to me.
> > 
> > Sorry to have no understanding of the whole picture, but.. could you share
> > more on what's the interleaving requirement on the proxying, and why it
> > can't be done with adding some IO memory regions as sub-regions upon the
> > file one?
> 
> The proxying requirement is simply a means to read/write to a computed address
> within a memory region. There may well be a better way to do that.
> 
> If I understand your suggestion correctly you would need a very high
> number of IO memory regions to be created dynamically when particular sets of
> registers across multiple devices in the topology are all programmed.
> 
> The interleave can be 256 bytes across up to 16x, many terabyte, devices.
> So assuming a simple set of 16 1TB devices I think you'd need about 4x10^9
> IO regions.  Even for a minimal useful test case of largest interleave
> set of 16x 256MB devices (256MB is minimum size the specification allows per
> decoded region at the device) and 16 way interleave we'd need 10^6 IO regions.
> Any idea if that approach would scale sensibly to this number of regions?
> 
> There are also complexities to getting all the information in one place to
> work out which IO memory regions maps where in PA space. Current solution is
> to do that mapping in the same way the hardware does which is hierarchical,
> so we walk the path to the device, picking directions based on each interleave
> decoder that we meet.
> Obviously this is a bit slow but I only really care about correctness at the
> moment.  I can think of various approaches to speeding it up but I'm not sure
> if we will ever care about performance.
> 
> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
> has the logic for that and as you can see it's fairly simple because we are always
> going down the topology following the decoders.
> 
> Below I have mapped out an algorithm I think would work for doing it with
> IO memory regions as subregions.
> 
> We could fake the whole thing by limiting ourselves to small host
> memory windows which are always directly backed, but then I wouldn't
> achieve the main aim of this which is to provide a test base for the OS code.
> To do that I need real interleave so I can seed the files with test patterns
> and verify the accesses hit the correct locations. Emulating what the hardware
> is actually doing on a device by device basis is the easiest way I have
> come up with to do that.
> 
> Let me try to provide some more background so you hopefully don't have
> to have read the specs to follow what is going on!
> There is an example of a directly connected (no switches) topology in the
> docs
> 
> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
> 
> The overall picture is we have a large number of CXL Type 3 memory devices,
> which at runtime (by OS at boot/on hotplug) are configured into various
> interleaving sets with hierarchical decoding at the host + host bridge
> + switch levels. For test setups I probably need to go to around 32 devices
> so I can hit various configurations simultaneously.
> No individual device has visibility of the full interleave setup - hence
> the walk in the existing code through the various decoders to find the
> final Device Physical address.
> 
> At the host level the host provides a set of Physical Address windows with
> a fixed interleave decoding across the different host bridges in the system
> (CXL Fixed Memory windows, CFMWs)
> On a real system these have to be large enough to allow for any memory
> devices that might be hotplugged and all possible configurations (so
> with 2 host bridges you need at least 3 windows in the many TB range,
> much worse as the number of host bridges goes up). It'll be worse than
> this when we have QoS groups, but the current Qemu code just puts all
> the windows in group 0.  Hence my first thought of just putting memory
> behind those doesn't scale (a similar approach to this was in the
> earliest versions of this patch set - though the full access path
> wasn't wired up).
> 
> The granularity can be in powers of 2 from 256 bytes to 16 kbytes
> 
> Next each host bridge has programmable address decoders which take the
> incoming (often already interleaved) memory access and direct them to
> appropriate root ports.  The root ports can be connected to a switch
> which has additional address decoders in the upstream port to decide
> which downstream port to route to.  Note we currently only support 1 level
> of switches but it's easy to make this algorithm recursive to support
> multiple switch levels (currently the kernel proposals only support 1 level)
> 
> Finally the End Point with the actual memory receives the interleaved request and
> takes the full address and (for power of 2 decoding - we don't yet support
> 3,6 and 12 way which is more complex and there is no kernel support yet)
> it drops a few address bits and adds an offset for the decoder used to
> calculate its own device physical address.  Note a device will support
> multiple interleave sets for different parts of its file once we add
> multiple decoder support (on the todo list).
> 
> So the current solution is straightforward (with the exception of that
> proxying) because it follows the same decoding as used in real hardware
> to route the memory accesses. As a result we get a read/write to a
> device physical address and hence proxy that.  If any of the decoders
> along the path are not configured then we error out at that stage.
> 
> To create the equivalent as IO subregions I think we'd have to do the
> following (this might be mediated by some central entity that
> doesn't currently exist, or done on demand from whichever CXL device
> happens to have its decoder set up last)
> 
> 1) Wait for a decoder commit (enable) on any component. Goto 2.
> 2) Walk the topology (up to host decoder, down to memory device)
> If a complete interleaving path has been configured -
>    i.e. we have committed decoders all the way to the memory
>    device goto step 3, otherwise return to step 1 to wait for
>    more decoders to be committed.
> 3) For the memory region being supplied by the memory device,
>    add subregions to map the device physical address (address
>    in the file) for each interleave stride to the appropriate
>    host Physical Address.
> 4) Return to step 1 to wait for more decoders to commit.
> 
> So summary is we can do it with IO regions, but there are a lot of them
> and the setup is somewhat complex as we don't have one single point in
> time where we know all the necessary information is available to compute
> the right addresses.
> 
> Looking forward to your suggestions if I haven't caused more confusion!

Thanks for the write up - I must confess they're a lot! :)

I merely only learned what is CXL today, and I'm not very experienced on
device modeling either, so please bear with me with stupid questions..

IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
That's a normal IO region, which looks very reasonable.

However I'm confused why patch "RFC: softmmu/memory: Add ops to
memory_region_ram_init_from_file" helped.

Per my knowledge, all the memory accesses upon this CFMW window trapped
using this IO region already.  There can be multiple memory file objects
underneath, and when read/write happens the object will be decoded from
cxl_cfmws_find_device() as you referenced.

However I see nowhere that these memory objects got mapped as sub-regions
into parent (CXLFixedWindow.mr).  Then I don't understand why they cannot
be trapped.

To ask in another way: what will happen if you simply revert this RFC
patch?  What will go wrong?

Thanks,
Jonathan Cameron March 16, 2022, 4:50 p.m. UTC | #6
On Thu, 10 Mar 2022 16:02:22 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:
> > Hi Peter,  
> 
> Hi, Jonathan,
> 
> >   
> > > 
> > > https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
> > > 
> > > Having mr->ops set but with memory_access_is_direct() returning true sounds
> > > weird to me.
> > > 
> > > Sorry to have no understanding of the whole picture, but.. could you share
> > > more on what's the interleaving requirement on the proxying, and why it
> > > can't be done with adding some IO memory regions as sub-regions upon the
> > > file one?  
> > 
> > The proxying requirement is simply a means to read/write to a computed address
> > within a memory region. There may well be a better way to do that.
> > 
> > If I understand your suggestion correctly you would need a very high
> > number of IO memory regions to be created dynamically when particular sets of
> > registers across multiple devices in the topology are all programmed.
> > 
> > The interleave can be 256 bytes across up to 16x, many terabyte, devices.
> > So assuming a simple set of 16 1TB devices I think you'd need about 4x10^9
> > IO regions.  Even for a minimal useful test case of largest interleave
> > set of 16x 256MB devices (256MB is minimum size the specification allows per
> > decoded region at the device) and 16 way interleave we'd need 10^6 IO regions.
> > Any idea if that approach would scale sensibly to this number of regions?
> > 
> > There are also complexities to getting all the information in one place to
> > work out which IO memory regions maps where in PA space. Current solution is
> > to do that mapping in the same way the hardware does which is hierarchical,
> > so we walk the path to the device, picking directions based on each interleave
> > decoder that we meet.
> > Obviously this is a bit slow but I only really care about correctness at the
> > moment.  I can think of various approaches to speeding it up but I'm not sure
> > if we will ever care about performance.
> > 
> > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
> > has the logic for that and as you can see it's fairly simple because we are always
> > going down the topology following the decoders.
> > 
> > Below I have mapped out an algorithm I think would work for doing it with
> > IO memory regions as subregions.
> > 
> > We could fake the whole thing by limiting ourselves to small host
> > memory windows which are always directly backed, but then I wouldn't
> > achieve the main aim of this which is to provide a test base for the OS code.
> > To do that I need real interleave so I can seed the files with test patterns
> > and verify the accesses hit the correct locations. Emulating what the hardware
> > is actually doing on a device by device basis is the easiest way I have
> > come up with to do that.
> > 
> > Let me try to provide some more background so you hopefully don't have
> > to have read the specs to follow what is going on!
> > There is an example of a directly connected (no switches) topology in the
> > docs
> > 
> > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
> > 
> > The overall picture is we have a large number of CXL Type 3 memory devices,
> > which at runtime (by OS at boot/on hotplug) are configured into various
> > interleaving sets with hierarchical decoding at the host + host bridge
> > + switch levels. For test setups I probably need to go to around 32 devices
> > so I can hit various configurations simultaneously.
> > No individual device has visibility of the full interleave setup - hence
> > the walk in the existing code through the various decoders to find the
> > final Device Physical address.
> > 
> > At the host level the host provides a set of Physical Address windows with
> > a fixed interleave decoding across the different host bridges in the system
> > (CXL Fixed Memory windows, CFMWs)
> > On a real system these have to be large enough to allow for any memory
> > devices that might be hotplugged and all possible configurations (so
> > with 2 host bridges you need at least 3 windows in the many TB range,
> > much worse as the number of host bridges goes up). It'll be worse than
> > this when we have QoS groups, but the current Qemu code just puts all
> > the windows in group 0.  Hence my first thought of just putting memory
> > behind those doesn't scale (a similar approach to this was in the
> > earliest versions of this patch set - though the full access path
> > wasn't wired up).
> > 
> > The granularity can be in powers of 2 from 256 bytes to 16 kbytes
> > 
> > Next each host bridge has programmable address decoders which take the
> > incoming (often already interleaved) memory access and direct them to
> > appropriate root ports.  The root ports can be connected to a switch
> > which has additional address decoders in the upstream port to decide
> > which downstream port to route to.  Note we currently only support 1 level
> > of switches but it's easy to make this algorithm recursive to support
> > multiple switch levels (currently the kernel proposals only support 1 level)
> > 
> > Finally the End Point with the actual memory receives the interleaved request and
> > takes the full address and (for power of 2 decoding - we don't yet support
> > 3,6 and 12 way which is more complex and there is no kernel support yet)
> > it drops a few address bits and adds an offset for the decoder used to
> > calculate its own device physical address.  Note a device will support
> > multiple interleave sets for different parts of its file once we add
> > multiple decoder support (on the todo list).
> > 
> > So the current solution is straightforward (with the exception of that
> > proxying) because it follows the same decoding as used in real hardware
> > to route the memory accesses. As a result we get a read/write to a
> > device physical address and hence proxy that.  If any of the decoders
> > along the path are not configured then we error out at that stage.
> > 
> > To create the equivalent with IO subregions I think we'd have to do the
> > following (this might be mediated by some central entity that
> > doesn't currently exist, or done on demand from whichever CXL device
> > happens to have its decoder set up last):
> > 
> > 1) Wait for a decoder commit (enable) on any component. Goto 2.
> > 2) Walk the topology (up to host decoder, down to memory device)
> > If a complete interleaving path has been configured -
> >    i.e. we have committed decoders all the way to the memory
> >    device goto step 3, otherwise return to step 1 to wait for
> >    more decoders to be committed.
> > 3) For the memory region being supplied by the memory device,
> >    add subregions to map the device physical address (address
> >    in the file) for each interleave stride to the appropriate
> >    host Physical Address.
> > 4) Return to step 1 to wait for more decoders to commit.
> > 
> > So in summary, we can do it with IO regions, but there are a lot of them
> > and the setup is somewhat complex as we don't have one single point in
> > time where we know all the necessary information is available to compute
> > the right addresses.
> > 
> > Looking forward to your suggestions if I haven't caused more confusion!  

Hi Peter,

> 
> Thanks for the write up - I must confess they're a lot! :)
> 
> I merely only learned what is CXL today, and I'm not very experienced on
> device modeling either, so please bear with me with stupid questions..
> 
> IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
> That's a normal IO region, which looks very reasonable.
> 
> However I'm confused why patch "RFC: softmmu/memory: Add ops to
> memory_region_ram_init_from_file" helped.
> 
> Per my knowledge, all the memory accesses upon this CFMW window trapped
> using this IO region already.  There can be multiple memory file objects
> underneath, and when read/write happens the object will be decoded from
> cxl_cfmws_find_device() as you referenced.

Yes.

> 
> However I see nowhere that these memory objects got mapped as sub-regions
> into parent (CXLFixedWindow.mr).  Then I don't understand why they cannot
> be trapped.

As you note, they aren't mapped into the parent mr, hence we are trapping.
The parent mem_ops are responsible for decoding the 'which device' +
'what address in device memory space'. Once we've gotten that info
the question is how do I actually do the access?

Mapping as subregions seems unwise due to the huge number required.

> 
> To ask in another way: what will happen if you simply revert this RFC
> patch?  What will go wrong?

The call to memory_region_dispatch_read()
https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/mem/cxl_type3.c#L556

would call memory_region_access_valid(), which calls
mr->ops->valid.accepts(); that is set to
unassigned_mem_accepts() and hence...
you get a MEMTX_DECODE_ERROR back and an exception in the
guest.

That wouldn't happen with a non-proxied access to the RAM, as
those paths never use the ops: memory_access_is_direct() is called
and a simple memcpy is used without any involvement of the ops.

Is there a better way to proxy those writes to the backing files?

I was fishing a bit in the dark here and saw the existing ops defined
for a different purpose for VFIO

4a2e242bbb ("memory: Don't use memcpy for ram_device regions")

and those allowed the use of memory_region_dispatch_write() to work.

Hence the RFC marking on that patch :)

Thanks,

Jonathan



> 
> Thanks,
>
Mark Cave-Ayland March 16, 2022, 5:16 p.m. UTC | #7
On 16/03/2022 16:50, Jonathan Cameron via wrote:

> On Thu, 10 Mar 2022 16:02:22 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
>> On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:
>>> Hi Peter,
>>
>> Hi, Jonathan,
>>
>>>    
>>>>
>>>> https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
>>>>
>>>> Having mr->ops set but with memory_access_is_direct() returning true sounds
>>>> weird to me.
>>>>
>>>> Sorry to have no understanding of the whole picture, but.. could you share
>>>> more on what's the interleaving requirement on the proxying, and why it
>>>> can't be done with adding some IO memory regions as sub-regions upon the
>>>> file one?
>>>
>>> The proxying requirement is simply a means to read/write to a computed address
>>> within a memory region. There may well be a better way to do that.
>>>
>>> If I understand your suggestion correctly you would need a very high
>>> number of IO memory regions to be created dynamically when particular sets of
>>> registers across multiple devices in the topology are all programmed.
>>>
>>> The interleave can be as fine as 256 bytes across up to 16 many-terabyte devices.
>>> So assuming a simple set of 16 1TB devices I think you'd need about 4x10^9
>>> IO regions.  Even for a minimal useful test case of the largest interleave
>>> set - 16x 256MB devices (256MB is the minimum size the specification allows per
>>> decoded region at the device) at 16-way interleave - we'd need 10^6 IO regions.
>>> Any idea if that approach would scale sensibly to this number of regions?
>>>
>>> There are also complexities to getting all the information in one place to
>>> work out which IO memory region maps where in PA space. The current solution is
>>> to do that mapping in the same way the hardware does which is hierarchical,
>>> so we walk the path to the device, picking directions based on each interleave
>>> decoder that we meet.
>>> Obviously this is a bit slow but I only really care about correctness at the
>>> moment.  I can think of various approaches to speeding it up but I'm not sure
>>> if we will ever care about performance.
>>>
>>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
>>> has the logic for that and as you can see it's fairly simple because we are always
>>> going down the topology following the decoders.
>>>
>>> Below I have mapped out an algorithm I think would work for doing it with
>>> IO memory regions as subregions.
>>>
>>> We could fake the whole thing by limiting ourselves to small host
>>> memory windows which are always directly backed, but then I wouldn't
>>> achieve the main aim of this which is to provide a test base for the OS code.
>>> To do that I need real interleave so I can seed the files with test patterns
>>> and verify the accesses hit the correct locations. Emulating what the hardware
>>> is actually doing on a device by device basis is the easiest way I have
>>> come up with to do that.
>>>
>>> Let me try to provide some more background so you hopefully don't have
>>> to have read the specs to follow what is going on!
>>> There is an example of a directly connected (no switches) topology in the
>>> docs:
>>>
>>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
>>>
>>> The overall picture is we have a large number of CXL Type 3 memory devices,
>>> which at runtime (by OS at boot/on hotplug) are configured into various
>>> interleaving sets with hierarchical decoding at the host + host bridge
>>> + switch levels. For test setups I probably need to go to around 32 devices
>>> so I can hit various configurations simultaneously.
>>> No individual device has visibility of the full interleave setup - hence
>>> the walk in the existing code through the various decoders to find the
>>> final Device Physical address.
>>>
>>> At the host level the host provides a set of Physical Address windows with
>>> a fixed interleave decoding across the different host bridges in the system
>>> (CXL Fixed Memory windows, CFMWs)
>>> On a real system these have to be large enough to allow for any memory
>>> devices that might be hotplugged and all possible configurations (so
>>> with 2 host bridges you need at least 3 windows in the many TB range,
>>> much worse as the number of host bridges goes up). It'll be worse than
>>> this when we have QoS groups, but the current Qemu code just puts all
>>> the windows in group 0.  Hence my first thought of just putting memory
>>> behind those doesn't scale (a similar approach to this was in the
>>> earliest versions of this patch set - though the full access path
>>> wasn't wired up).
>>>
>>> The granularity can be in powers of 2 from 256 bytes to 16 kbytes
>>>
>>> Next each host bridge has programmable address decoders which take the
>>> incoming (often already interleaved) memory access and direct them to
>>> appropriate root ports.  The root ports can be connected to a switch
>>> which has additional address decoders in the upstream port to decide
>>> which downstream port to route to.  Note we currently only support 1 level
>>> of switches but it's easy to make this algorithm recursive to support
>>> multiple switch levels (currently the kernel proposals only support 1 level)
>>>
>>> Finally the End Point with the actual memory receives the interleaved request,
>>> takes the full address and (for power of 2 decoding - we don't yet support
>>> 3, 6 and 12 way, which is more complex and has no kernel support yet)
>>> drops a few address bits and adds an offset for the decoder used, to
>>> calculate its own device physical address.  Note a device will support
>>> multiple interleave sets for different parts of its file once we add
>>> multiple decoder support (on the todo list).
>>>
>>> So the current solution is straightforward (with the exception of that
>>> proxying) because it follows the same decoding as used in real hardware
>>> to route the memory accesses. As a result we get a read/write to a
>>> device physical address and hence proxy that.  If any of the decoders
>>> along the path are not configured then we error out at that stage.
>>>
>>> To create the equivalent with IO subregions I think we'd have to do the
>>> following (this might be mediated by some central entity that
>>> doesn't currently exist, or done on demand from whichever CXL device
>>> happens to have its decoder set up last):
>>>
>>> 1) Wait for a decoder commit (enable) on any component. Goto 2.
>>> 2) Walk the topology (up to host decoder, down to memory device)
>>> If a complete interleaving path has been configured -
>>>     i.e. we have committed decoders all the way to the memory
>>>     device goto step 3, otherwise return to step 1 to wait for
>>>     more decoders to be committed.
>>> 3) For the memory region being supplied by the memory device,
>>>     add subregions to map the device physical address (address
>>>     in the file) for each interleave stride to the appropriate
>>>     host Physical Address.
>>> 4) Return to step 1 to wait for more decoders to commit.
>>>
>>> So in summary, we can do it with IO regions, but there are a lot of them
>>> and the setup is somewhat complex as we don't have one single point in
>>> time where we know all the necessary information is available to compute
>>> the right addresses.
>>>
>>> Looking forward to your suggestions if I haven't caused more confusion!
> 
> Hi Peter,
> 
>>
>> Thanks for the write up - I must confess they're a lot! :)
>>
>> I merely only learned what is CXL today, and I'm not very experienced on
>> device modeling either, so please bear with me with stupid questions..
>>
>> IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
>> That's a normal IO region, which looks very reasonable.
>>
>> However I'm confused why patch "RFC: softmmu/memory: Add ops to
>> memory_region_ram_init_from_file" helped.
>>
>> Per my knowledge, all the memory accesses upon this CFMW window trapped
>> using this IO region already.  There can be multiple memory file objects
>> underneath, and when read/write happens the object will be decoded from
>> cxl_cfmws_find_device() as you referenced.
> 
> Yes.
> 
>>
>> However I see nowhere that these memory objects got mapped as sub-regions
>> into parent (CXLFixedWindow.mr).  Then I don't understand why they cannot
>> be trapped.
> 
> As you note, they aren't mapped into the parent mr, hence we are trapping.
> The parent mem_ops are responsible for decoding the 'which device' +
> 'what address in device memory space'. Once we've gotten that info
> the question is how do I actually do the access?
> 
> Mapping as subregions seems unwise due to the huge number required.
> 
>>
>> To ask in another way: what will happen if you simply revert this RFC
>> patch?  What will go wrong?
> 
> The call to memory_region_dispatch_read()
> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/mem/cxl_type3.c#L556
> 
> would call memory_region_access_valid(), which calls
> mr->ops->valid.accepts(); that is set to
> unassigned_mem_accepts() and hence...
> you get a MEMTX_DECODE_ERROR back and an exception in the
> guest.
> 
> That wouldn't happen with a non-proxied access to the RAM, as
> those paths never use the ops: memory_access_is_direct() is called
> and a simple memcpy is used without any involvement of the ops.
> 
> Is there a better way to proxy those writes to the backing files?
> 
> I was fishing a bit in the dark here and saw the existing ops defined
> for a different purpose for VFIO
> 
> 4a2e242bbb ("memory: Don't use memcpy for ram_device regions")
> 
> and those allowed the use of memory_region_dispatch_write() to work.
> 
> Hence the RFC marking on that patch :)

FWIW I had a similar issue implementing manual aliasing in one of my q800 patches 
where I found that dispatching a read to a non-IO memory region didn't work with 
memory_region_dispatch_read(). The solution in my case was to switch to using the 
address space API instead, which, whilst requiring an absolute address for the target 
address space, handles the dispatch correctly across all different memory region types.

Have a look at 
https://gitlab.com/mcayland/qemu/-/commit/318e12579c7570196187652da13542db86b8c722 to 
see how I did this in macio_alias_read().

IIRC from my experiments in this area, my conclusion was that 
memory_region_dispatch_read() can only work correctly if mapping directly between 2 
IO memory regions, and for anything else you need to use the address space API.


ATB,

Mark.
Jonathan Cameron March 16, 2022, 5:58 p.m. UTC | #8
On Wed, 16 Mar 2022 17:16:55 +0000
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> wrote:

> On 16/03/2022 16:50, Jonathan Cameron via wrote:
> 
> > On Thu, 10 Mar 2022 16:02:22 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> >> On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:  
> >>> Hi Peter,  
> >>
> >> Hi, Jonathan,
> >>  
> >>>      
> >>>>
> >>>> https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
> >>>>
> >>>> Having mr->ops set but with memory_access_is_direct() returning true sounds
> >>>> weird to me.
> >>>>
> >>>> Sorry to have no understanding of the whole picture, but.. could you share
> >>>> more on what's the interleaving requirement on the proxying, and why it
> >>>> can't be done with adding some IO memory regions as sub-regions upon the
> >>>> file one?  
> >>>
> >>> The proxying requirement is simply a means to read/write to a computed address
> >>> within a memory region. There may well be a better way to do that.
> >>>
> >>> If I understand your suggestion correctly you would need a very high
> >>> number of IO memory regions to be created dynamically when particular sets of
> >>> registers across multiple devices in the topology are all programmed.
> >>>
> >>> The interleave can be as fine as 256 bytes across up to 16 many-terabyte devices.
> >>> So assuming a simple set of 16 1TB devices I think you'd need about 4x10^9
> >>> IO regions.  Even for a minimal useful test case of the largest interleave
> >>> set - 16x 256MB devices (256MB is the minimum size the specification allows per
> >>> decoded region at the device) at 16-way interleave - we'd need 10^6 IO regions.
> >>> Any idea if that approach would scale sensibly to this number of regions?
> >>>
> >>> There are also complexities to getting all the information in one place to
> >>> work out which IO memory region maps where in PA space. The current solution is
> >>> to do that mapping in the same way the hardware does which is hierarchical,
> >>> so we walk the path to the device, picking directions based on each interleave
> >>> decoder that we meet.
> >>> Obviously this is a bit slow but I only really care about correctness at the
> >>> moment.  I can think of various approaches to speeding it up but I'm not sure
> >>> if we will ever care about performance.
> >>>
> >>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
> >>> has the logic for that and as you can see it's fairly simple because we are always
> >>> going down the topology following the decoders.
> >>>
> >>> Below I have mapped out an algorithm I think would work for doing it with
> >>> IO memory regions as subregions.
> >>>
> >>> We could fake the whole thing by limiting ourselves to small host
> >>> memory windows which are always directly backed, but then I wouldn't
> >>> achieve the main aim of this which is to provide a test base for the OS code.
> >>> To do that I need real interleave so I can seed the files with test patterns
> >>> and verify the accesses hit the correct locations. Emulating what the hardware
> >>> is actually doing on a device by device basis is the easiest way I have
> >>> come up with to do that.
> >>>
> >>> Let me try to provide some more background so you hopefully don't have
> >>> to have read the specs to follow what is going on!
> >>> There is an example of a directly connected (no switches) topology in the
> >>> docs:
> >>>
> >>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
> >>>
> >>> The overall picture is we have a large number of CXL Type 3 memory devices,
> >>> which at runtime (by OS at boot/on hotplug) are configured into various
> >>> interleaving sets with hierarchical decoding at the host + host bridge
> >>> + switch levels. For test setups I probably need to go to around 32 devices
> >>> so I can hit various configurations simultaneously.
> >>> No individual device has visibility of the full interleave setup - hence
> >>> the walk in the existing code through the various decoders to find the
> >>> final Device Physical address.
> >>>
> >>> At the host level the host provides a set of Physical Address windows with
> >>> a fixed interleave decoding across the different host bridges in the system
> >>> (CXL Fixed Memory windows, CFMWs)
> >>> On a real system these have to be large enough to allow for any memory
> >>> devices that might be hotplugged and all possible configurations (so
> >>> with 2 host bridges you need at least 3 windows in the many TB range,
> >>> much worse as the number of host bridges goes up). It'll be worse than
> >>> this when we have QoS groups, but the current Qemu code just puts all
> >>> the windows in group 0.  Hence my first thought of just putting memory
> >>> behind those doesn't scale (a similar approach to this was in the
> >>> earliest versions of this patch set - though the full access path
> >>> wasn't wired up).
> >>>
> >>> The granularity can be in powers of 2 from 256 bytes to 16 kbytes
> >>>
> >>> Next each host bridge has programmable address decoders which take the
> >>> incoming (often already interleaved) memory access and direct them to
> >>> appropriate root ports.  The root ports can be connected to a switch
> >>> which has additional address decoders in the upstream port to decide
> >>> which downstream port to route to.  Note we currently only support 1 level
> >>> of switches but it's easy to make this algorithm recursive to support
> >>> multiple switch levels (currently the kernel proposals only support 1 level)
> >>>
> >>> Finally the End Point with the actual memory receives the interleaved request,
> >>> takes the full address and (for power of 2 decoding - we don't yet support
> >>> 3, 6 and 12 way, which is more complex and has no kernel support yet)
> >>> drops a few address bits and adds an offset for the decoder used, to
> >>> calculate its own device physical address.  Note a device will support
> >>> multiple interleave sets for different parts of its file once we add
> >>> multiple decoder support (on the todo list).
> >>>
> >>> So the current solution is straightforward (with the exception of that
> >>> proxying) because it follows the same decoding as used in real hardware
> >>> to route the memory accesses. As a result we get a read/write to a
> >>> device physical address and hence proxy that.  If any of the decoders
> >>> along the path are not configured then we error out at that stage.
> >>>
> >>> To create the equivalent with IO subregions I think we'd have to do the
> >>> following (this might be mediated by some central entity that
> >>> doesn't currently exist, or done on demand from whichever CXL device
> >>> happens to have its decoder set up last):
> >>>
> >>> 1) Wait for a decoder commit (enable) on any component. Goto 2.
> >>> 2) Walk the topology (up to host decoder, down to memory device)
> >>> If a complete interleaving path has been configured -
> >>>     i.e. we have committed decoders all the way to the memory
> >>>     device goto step 3, otherwise return to step 1 to wait for
> >>>     more decoders to be committed.
> >>> 3) For the memory region being supplied by the memory device,
> >>>     add subregions to map the device physical address (address
> >>>     in the file) for each interleave stride to the appropriate
> >>>     host Physical Address.
> >>> 4) Return to step 1 to wait for more decoders to commit.
> >>>
> >>> So in summary, we can do it with IO regions, but there are a lot of them
> >>> and the setup is somewhat complex as we don't have one single point in
> >>> time where we know all the necessary information is available to compute
> >>> the right addresses.
> >>>
> >>> Looking forward to your suggestions if I haven't caused more confusion!  
> > 
> > Hi Peter,
> >   
> >>
> >> Thanks for the write up - I must confess they're a lot! :)
> >>
> >> I merely only learned what is CXL today, and I'm not very experienced on
> >> device modeling either, so please bear with me with stupid questions..
> >>
> >> IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
> >> That's a normal IO region, which looks very reasonable.
> >>
> >> However I'm confused why patch "RFC: softmmu/memory: Add ops to
> >> memory_region_ram_init_from_file" helped.
> >>
> >> Per my knowledge, all the memory accesses upon this CFMW window trapped
> >> using this IO region already.  There can be multiple memory file objects
> >> underneath, and when read/write happens the object will be decoded from
> >> cxl_cfmws_find_device() as you referenced.  
> > 
> > Yes.
> >   
> >>
> >> However I see nowhere that these memory objects got mapped as sub-regions
> >> into parent (CXLFixedWindow.mr).  Then I don't understand why they cannot
> >> be trapped.  
> > 
> > As you note, they aren't mapped into the parent mr, hence we are trapping.
> > The parent mem_ops are responsible for decoding the 'which device' +
> > 'what address in device memory space'. Once we've gotten that info
> > the question is how do I actually do the access?
> > 
> > Mapping as subregions seems unwise due to the huge number required.
> >   
> >>
> >> To ask in another way: what will happen if you simply revert this RFC
> >> patch?  What will go wrong?  
> > 
> > The call to memory_region_dispatch_read()
> > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/mem/cxl_type3.c#L556
> > 
> > would call memory_region_access_valid(), which calls
> > mr->ops->valid.accepts(); that is set to
> > unassigned_mem_accepts() and hence...
> > you get a MEMTX_DECODE_ERROR back and an exception in the
> > guest.
> > 
> > That wouldn't happen with a non-proxied access to the RAM, as
> > those paths never use the ops: memory_access_is_direct() is called
> > and a simple memcpy is used without any involvement of the ops.
> > 
> > Is there a better way to proxy those writes to the backing files?
> > 
> > I was fishing a bit in the dark here and saw the existing ops defined
> > for a different purpose for VFIO
> > 
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions")
> > 
> > and those allowed the use of memory_region_dispatch_write() to work.
> > 
> > Hence the RFC marking on that patch :)  
> 
> FWIW I had a similar issue implementing manual aliasing in one of my q800 patches 
> where I found that dispatching a read to a non-IO memory region didn't work with 
> memory_region_dispatch_read(). The solution in my case was to switch to using the 
> address space API instead, which, whilst requiring an absolute address for the target 
> address space, handles the dispatch correctly across all different memory region types.
> 
> Have a look at 
> https://gitlab.com/mcayland/qemu/-/commit/318e12579c7570196187652da13542db86b8c722 to 
> see how I did this in macio_alias_read().
> 
> IIRC from my experiments in this area, my conclusion was that 
> memory_region_dispatch_read() can only work correctly if mapping directly between 2 
> IO memory regions, and for anything else you need to use the address space API.

Hi Mark,

I'd wondered about the address space API as an alternative approach.

From that reference it looks like you have the memory mapped into the system address
space and are providing an alias to that.  That's something I'd ideally like to
avoid doing, as there is no meaningful way to do it, so I'd just be hiding the memory
somewhere up high.  The memory should only be accessible through the one
route.

I think I could spin a separate address space for this purpose (one per CXL type 3
device probably) but that seems like another nasty hack to make. I'll try a quick
prototype of this tomorrow.

What do people think is the least horrible way to do this?

Thanks for the suggestion and I'm glad I'm not the only one trying to get this
sort of thing to work ;)

Jonathan

> 
> 
> ATB,
> 
> Mark.
>
Jonathan Cameron March 16, 2022, 6:26 p.m. UTC | #9
On Wed, 16 Mar 2022 17:58:46 +0000
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Wed, 16 Mar 2022 17:16:55 +0000
> Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> wrote:
> 
> > On 16/03/2022 16:50, Jonathan Cameron via wrote:
> >   
> > > On Thu, 10 Mar 2022 16:02:22 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >     
> > >> On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:    
> > >>> Hi Peter,    
> > >>
> > >> Hi, Jonathan,
> > >>    
> > >>>        
> > >>>>
> > >>>> https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
> > >>>>
> > >>>> Having mr->ops set but with memory_access_is_direct() returning true sounds
> > >>>> weird to me.
> > >>>>
> > >>>> Sorry to have no understanding of the whole picture, but.. could you share
> > >>>> more on what's the interleaving requirement on the proxying, and why it
> > >>>> can't be done with adding some IO memory regions as sub-regions upon the
> > >>>> file one?    
> > >>>
> > >>> The proxying requirement is simply a means to read/write to a computed address
> > >>> within a memory region. There may well be a better way to do that.
> > >>>
> > >>> If I understand your suggestion correctly you would need a very high
> > >>> number of IO memory regions to be created dynamically when particular sets of
> > >>> registers across multiple devices in the topology are all programmed.
> > >>>
> > >>> The interleave can be as fine as 256 bytes across up to 16 many-terabyte devices.
> > >>> So assuming a simple set of 16 1TB devices I think you'd need about 4x10^9
> > >>> IO regions.  Even for a minimal useful test case of the largest interleave
> > >>> set - 16x 256MB devices (256MB is the minimum size the specification allows per
> > >>> decoded region at the device) at 16-way interleave - we'd need 10^6 IO regions.
> > >>> Any idea if that approach would scale sensibly to this number of regions?
> > >>>
> > >>> There are also complexities to getting all the information in one place to
> > >>> work out which IO memory region maps where in PA space. The current solution is
> > >>> to do that mapping in the same way the hardware does which is hierarchical,
> > >>> so we walk the path to the device, picking directions based on each interleave
> > >>> decoder that we meet.
> > >>> Obviously this is a bit slow but I only really care about correctness at the
> > >>> moment.  I can think of various approaches to speeding it up but I'm not sure
> > >>> if we will ever care about performance.
> > >>>
> > >>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
> > >>> has the logic for that and as you can see it's fairly simple because we are always
> > >>> going down the topology following the decoders.
> > >>>
> > >>> Below I have mapped out an algorithm I think would work for doing it with
> > >>> IO memory regions as subregions.
> > >>>
> > >>> We could fake the whole thing by limiting ourselves to small host
> > >>> memory windows which are always directly backed, but then I wouldn't
> > >>> achieve the main aim of this which is to provide a test base for the OS code.
> > >>> To do that I need real interleave so I can seed the files with test patterns
> > >>> and verify the accesses hit the correct locations. Emulating what the hardware
> > >>> is actually doing on a device by device basis is the easiest way I have
> > >>> come up with to do that.
> > >>>
> > >>> Let me try to provide some more background so you hopefully don't have
> > >>> to have read the specs to follow what is going on!
> > >>> There is an example of a directly connected (no switches) topology in the
> > >>> docs:
> > >>>
> > >>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
> > >>>
> > >>> The overall picture is we have a large number of CXL Type 3 memory devices,
> > >>> which at runtime (by OS at boot/on hotplug) are configured into various
> > >>> interleaving sets with hierarchical decoding at the host + host bridge
> > >>> + switch levels. For test setups I probably need to go to around 32 devices
> > >>> so I can hit various configurations simultaneously.
> > >>> No individual device has visibility of the full interleave setup - hence
> > >>> the walk in the existing code through the various decoders to find the
> > >>> final Device Physical address.
> > >>>
> > >>> At the host level the host provides a set of Physical Address windows with
> > >>> a fixed interleave decoding across the different host bridges in the system
> > >>> (CXL Fixed Memory windows, CFMWs)
> > >>> On a real system these have to be large enough to allow for any memory
> > >>> devices that might be hotplugged and all possible configurations (so
> > >>> with 2 host bridges you need at least 3 windows in the many TB range,
> > >>> much worse as the number of host bridges goes up). It'll be worse than
> > >>> this when we have QoS groups, but the current Qemu code just puts all
> > >>> the windows in group 0.  Hence my first thought of just putting memory
> > >>> behind those doesn't scale (a similar approach to this was in the
> > >>> earliest versions of this patch set - though the full access path
> > >>> wasn't wired up).
> > >>>
> > >>> The granularity can be in powers of 2 from 256 bytes to 16 kbytes
> > >>>
> > >>> Next each host bridge has programmable address decoders which take the
> > >>> incoming (often already interleaved) memory access and direct them to
> > >>> appropriate root ports.  The root ports can be connected to a switch
> > >>> which has additional address decoders in the upstream port to decide
> > >>> which downstream port to route to.  Note we currently only support 1 level
> > >>> of switches but it's easy to make this algorithm recursive to support
> > >>> multiple switch levels (currently the kernel proposals only support 1 level)
> > >>>
> > >>> Finally the End Point with the actual memory receives the interleaved request,
> > >>> takes the full address and (for power-of-2 decoding - we don't yet support
> > >>> 3-, 6- and 12-way, which is more complex and has no kernel support yet)
> > >>> drops a few address bits and adds an offset for the decoder used to
> > >>> calculate its own device physical address.  Note the device will support
> > >>> multiple interleave sets for different parts of its file once we add
> > >>> multiple decoder support (on the todo list).
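The endpoint's power-of-2 decode described above can be sketched in plain C. This is an illustrative toy model, not QEMU code; the helper names and parameters are my own:

```c
#include <stdint.h>

/* Illustrative sketch (not QEMU code) of power-of-2 CXL interleave
 * decoding.  'granularity' (256 B..16 KB) and 'ways' must be powers of 2;
 * 'hpa_offset' is the offset of the access within the decoded window. */

/* Which interleave target services this host address - the routing
 * decision made at each decoder level. */
static unsigned interleave_target(uint64_t hpa_offset, uint64_t granularity,
                                  unsigned ways)
{
    return (hpa_offset / granularity) % ways;
}

/* Device physical address: drop the interleave-select address bits,
 * i.e. compact the per-device chunk index and keep the offset within
 * the granularity-sized chunk. */
static uint64_t hpa_to_dpa(uint64_t hpa_offset, uint64_t granularity,
                           unsigned ways)
{
    uint64_t chunk = hpa_offset / (granularity * (uint64_t)ways);
    return chunk * granularity + (hpa_offset % granularity);
}
```

For example, with 4-way interleave at 256-byte granularity, host offset 0x500 routes to target 1 and lands at device offset 0x100.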
> > >>>
> > >>> So the current solution is straightforward (with the exception of that
> > >>> proxying) because it follows the same decoding as used in real hardware
> > >>> to route the memory accesses. As a result we get a read/write to a
> > >>> device physical address and hence proxy that.  If any of the decoders
> > >>> along the path are not configured then we error out at that stage.
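That hierarchical walk can be modelled with a small recursive routine. Again this is illustrative only - the structures below are assumptions, not the types in cxl-host.c - but the same recursion is what would extend the walk to multiple switch levels:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the decoder walk: host decoder -> host bridge ->
 * (switch ->) endpoint.  Fields are assumptions for illustration,
 * not QEMU's actual types. */
struct cxl_node {
    bool committed;               /* decoder programmed and enabled */
    uint64_t granularity;         /* interleave granularity, power of 2 */
    unsigned ways;                /* interleave ways, power of 2 */
    struct cxl_node *targets[16]; /* next level down; all NULL at endpoint */
};

/* Follow committed decoders down to the endpoint servicing 'hpa_offset'.
 * Returns NULL if any decoder on the path is not committed, mirroring
 * the "error out at that stage" behaviour described above. */
static struct cxl_node *cxl_route(struct cxl_node *n, uint64_t hpa_offset)
{
    if (!n || !n->committed) {
        return NULL;
    }
    if (!n->targets[0]) {         /* endpoint reached */
        return n;
    }
    unsigned way = (hpa_offset / n->granularity) % n->ways;
    return cxl_route(n->targets[way], hpa_offset);
}
```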
> > >>>
> > >>> To create the equivalent as IO subregions I think we'd have to do the
> > >>> following (this might be mediated by some central entity that
> > >>> doesn't currently exist, or done on demand from whichever CXL device
> > >>> happens to have its decoder set up last):
> > >>>
> > >>> 1) Wait for a decoder commit (enable) on any component. Goto 2.
> > >>> 2) Walk the topology (up to host decoder, down to memory device)
> > >>> If a complete interleaving path has been configured -
> > >>>     i.e. we have committed decoders all the way to the memory
> > >>>     device goto step 3, otherwise return to step 1 to wait for
> > >>>     more decoders to be committed.
> > >>> 3) For the memory region being supplied by the memory device,
> > >>>     add subregions to map the device physical address (address
> > >>>     in the file) for each interleave stride to the appropriate
> > >>>     host Physical Address.
> > >>> 4) Return to step 1 to wait for more decoders to commit.
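For step 3, the stride-to-host mapping is simple arithmetic once the path is known: device chunk k of a device at interleave position 'way' maps to host offset (k * ways + way) * granularity. The helper below is an illustrative sketch, not a proposed implementation:

```c
#include <stdint.h>

/* Host-physical offset (within the decoded window) of device chunk 'k'
 * for a device occupying interleave position 'way' in a set of 'ways'
 * devices at the given granularity.  Each such chunk would need its own
 * subregion - hence the huge region counts discussed in this thread. */
static uint64_t chunk_to_hpa(uint64_t k, uint64_t granularity,
                             unsigned ways, unsigned way)
{
    return (k * ways + way) * granularity;
}
```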
> > >>>
> > >>> So the summary is: we can do it with IO regions, but there are a lot of them
> > >>> and the setup is somewhat complex as we don't have one single point in
> > >>> time where we know all the necessary information is available to compute
> > >>> the right addresses.
> > >>>
> > >>> Looking forward to your suggestions if I haven't caused more confusion!    
> > > 
> > > Hi Peter,
> > >     
> > >>
> > >> Thanks for the write-up - I must confess it's a lot! :)
> > >>
> > >> I only learned what CXL is today, and I'm not very experienced in
> > >> device modeling either, so please bear with me and my stupid questions..
> > >>
> > >> IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
> > >> That's a normal IO region, which looks very reasonable.
> > >>
> > >> However I'm confused why patch "RFC: softmmu/memory: Add ops to
> > >> memory_region_ram_init_from_file" helped.
> > >>
> > >> Per my knowledge, all the memory accesses upon this CFMW window are trapped
> > >> using this IO region already.  There can be multiple memory file objects
> > >> underneath, and when read/write happens the object will be decoded from
> > >> cxl_cfmws_find_device() as you referenced.    
> > > 
> > > Yes.
> > >     
> > >>
> > >> However I see nowhere that these memory objects got mapped as sub-regions
> > >> into the parent (CXLFixedWindow.mr).  Then I don't understand why they cannot
> > >> be trapped.    
> > > 
> > > As you note, they aren't mapped into the parent mr, hence we are trapping.
> > > The parent mem_ops are responsible for decoding the 'which device' +
> > > 'what address in device memory space'. Once we've gotten that info
> > > the question is how do I actually do the access?
> > > 
> > > Mapping as subregions seems unwise due to the huge number required.
> > >     
> > >>
> > >> To ask in another way: what will happen if you simply revert this RFC
> > >> patch?  What will go wrong?    
> > > 
> > > The call to memory_region_dispatch_read()
> > > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/mem/cxl_type3.c#L556
> > > 
> > > would call memory_region_access_valid() that calls
> > > mr->ops->valid.accepts() which is set to
> > > unassigned_mem_accepts() and hence...
> > > you get a MEMTX_DECODE_ERROR back and an exception in the
> > > guest.
> > > 
> > > That wouldn't happen with a non-proxied access to the RAM, as
> > > those paths never use the ops: memory_access_is_direct() is called
> > > and memcpy is simply used without any involvement of the ops.
> > > 
> > > Is there a better way to proxy those writes to the backing files?
> > > 
> > > I was fishing a bit in the dark here and saw the existing ops defined
> > > for a different purpose for VFIO
> > > 
> > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions")
> > > 
> > > and those allowed the use of memory_region_dispatch_write() to work.
> > > 
> > > Hence the RFC marking on that patch :)    
> > 
> > FWIW I had a similar issue implementing manual aliasing in one of my q800 patches 
> > where I found that dispatching a read to a non-IO memory region didn't work with 
> > memory_region_dispatch_read(). The solution in my case was to switch to using the 
> > address space API instead, which, whilst requiring an absolute address for the target 
> > address space, handles the dispatch correctly across all different memory region types.
> > 
> > Have a look at 
> > https://gitlab.com/mcayland/qemu/-/commit/318e12579c7570196187652da13542db86b8c722 to 
> > see how I did this in macio_alias_read().
> > 
> > IIRC from my experiments in this area, my conclusion was that 
> > memory_region_dispatch_read() can only work correctly if mapping directly between 2 
> > IO memory regions, and for anything else you need to use the address space API.  
> 
> Hi Mark,
> 
> I'd wondered about the address space API as an alternative approach.
> 
> From that reference it looks like you have the memory mapped into the system address
> space and are providing an alias to that.  That's something I'd ideally like to
> avoid doing, as there is no meaningful way to do it, so I'd just be hiding the memory
> somewhere up high.  The memory should only be accessible through the one
> route.
> 
> I think I could spin a separate address space for this purpose (one per CXL type 3
> device probably) but that seems like another nasty hack to make. I'll try a quick
> prototype of this tomorrow.

Turned out to be trivial so already done.  Will send out as v8 unless anyone
feeds back that there is a major disadvantage to just spinning up one address space
per CXL type3 device.  That will also mean dropping the RFC patch, as it is no
longer used :)

Thanks for the hint Mark.

Jonathan

> 
> What do people think is the least horrible way to do this?
> 
> Thanks for the suggestion and I'm glad I'm not the only one trying to get this
> sort of thing to work ;)
> 
> Jonathan
> 
> > 
> > 
> > ATB,
> > 
> > Mark.
> >   
>
Mark Cave-Ayland March 17, 2022, 8:12 a.m. UTC | #10
On 16/03/2022 18:26, Jonathan Cameron via wrote:
> On Wed, 16 Mar 2022 17:58:46 +0000
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> 
>> On Wed, 16 Mar 2022 17:16:55 +0000
>> Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> wrote:
>>
>>> On 16/03/2022 16:50, Jonathan Cameron via wrote:
>>>    
>>>> On Thu, 10 Mar 2022 16:02:22 +0800
>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>      
>>>>> On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:
>>>>>> Hi Peter,
>>>>>
>>>>> Hi, Jonathan,
>>>>>     
>>>>>>         
>>>>>>>
>>>>>>> https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
>>>>>>>
>>>>>>> Having mr->ops set but with memory_access_is_direct() returning true sounds
>>>>>>> weird to me.
>>>>>>>
>>>>>>> Sorry to have no understanding of the whole picture, but.. could you share
>>>>>>> more on what's the interleaving requirement on the proxying, and why it
>>>>>>> can't be done with adding some IO memory regions as sub-regions upon the
>>>>>>> file one?
>>>>>>
>>>>>> The proxying requirement is simply a means to read/write to a computed address
>>>>>> within a memory region. There may well be a better way to do that.
>>>>>>
>>>>>> If I understand your suggestion correctly you would need a very high
>>>>>> number of IO memory regions to be created dynamically when particular sets of
>>>>>> registers across multiple devices in the topology are all programmed.
>>>>>>
>>>>>> The interleave can be 256 bytes across up to 16x, many terabyte, devices.
>>>>>> So assuming a simple set of 16 1TB devices I think you'd need about 4x10^9
>>>>>> IO regions.  Even for a minimal useful test case of largest interleave
>>>>>> set of 16x 256MB devices (256MB is minimum size the specification allows per
>>>>>> decoded region at the device) and 16 way interleave we'd need 10^6 IO regions.
>>>>>> Any idea if that approach would scale sensibly to this number of regions?
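Reading those figures as per-device counts, they check out: one subregion per granularity-sized stride chunk gives size/granularity regions per device. A back-of-envelope sketch, assuming the 256-byte worst-case granularity:

```c
#include <stdint.h>

/* One IO subregion per interleave stride chunk: a device of 'dev_size'
 * bytes interleaved at 'granularity' bytes needs this many subregions. */
static uint64_t regions_per_device(uint64_t dev_size, uint64_t granularity)
{
    return dev_size / granularity;
}
```

So a 1 TB device at 256-byte granularity needs about 4x10^9 subregions, and a 256 MB device about 10^6, matching the figures above.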
>>>>>>
>>>>>> There are also complexities in getting all the information in one place to
>>>>>> work out which IO memory region maps where in PA space. The current solution is
>>>>>> to do that mapping in the same way the hardware does, which is hierarchical,
>>>>>> so we walk the path to the device, picking directions based on each interleave
>>>>>> decoder that we meet.
>>>>>> Obviously this is a bit slow but I only really care about correctness at the
>>>>>> moment.  I can think of various approaches to speeding it up but I'm not sure
>>>>>> if we will ever care about performance.
>>>>>>
>>>>>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
>>>>>> has the logic for that and as you can see it's fairly simple because we are always
>>>>>> going down the topology following the decoders.
>>>>>>
>>>>>> Below I have mapped out an algorithm I think would work for doing it with
>>>>>> IO memory regions as subregions.
>>>>>>
>>>>>> We could fake the whole thing by limiting ourselves to small host
>>>>>> memory windows which are always directly backed, but then I wouldn't
>>>>>> achieve the main aim of this which is to provide a test base for the OS code.
>>>>>> To do that I need real interleave so I can seed the files with test patterns
>>>>>> and verify the accesses hit the correct locations. Emulating what the hardware
>>>>>> is actually doing on a device by device basis is the easiest way I have
>>>>>> come up with to do that.
>>>>>>
>>>>>> Let me try to provide some more background so you hopefully don't have
>>>>>> to have read the specs to follow what is going on!
>>>>>> There is an example of a directly connected (no switches) topology in the
>>>>>> docs:
>>>>>>
>>>>>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
>>>>>>
>>>>>> The overall picture is we have a large number of CXL Type 3 memory devices,
>>>>>> which at runtime (by OS at boot/on hotplug) are configured into various
>>>>>> interleaving sets with hierarchical decoding at the host + host bridge
>>>>>> + switch levels. For test setups I probably need to go to around 32 devices
>>>>>> so I can hit various configurations simultaneously.
>>>>>> No individual device has visibility of the full interleave setup - hence
>>>>>> the walk in the existing code through the various decoders to find the
>>>>>> final Device Physical address.
>>>>>>
>>>>>> [snip - remainder of the earlier quote, duplicated above, trimmed]
> 
> Turned out to be trivial so already done.  Will send out as v8 unless anyone
> feeds back that there is a major disadvantage to just spinning up one address space
> per CXL type3 device.  That will also mean dropping the RFC patch, as it is no
> longer used :)
> 
> Thanks for the hint Mark.
> 
> Jonathan

Ah great! As you've already noticed, my particular case was performing partial 
decoding on a memory region, but there are no issues if you need to dispatch to 
another existing address space such as PCI/IOMMU. Creating a separate address space 
per device shouldn't be an issue either, as that's effectively how the PCI bus master 
requests are handled.

The address spaces are visible in "info mtree" so if you haven't already, I would 
recommend generating a dynamic name for the address space based upon the device 
name/address to make it easier for development and debugging.


ATB,

Mark.
Jonathan Cameron March 17, 2022, 4:47 p.m. UTC | #11
On Thu, 17 Mar 2022 08:12:56 +0000
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> wrote:

> On 16/03/2022 18:26, Jonathan Cameron via wrote:
> > On Wed, 16 Mar 2022 17:58:46 +0000
> > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> >> [snip - full quote of the earlier discussion, duplicated above, trimmed]
> > 
> > Turned out to be trivial, so that's already done.  Will send out as v8 unless anyone
> > feeds back that there is a major disadvantage to just spinning up one address space
> > per CXL type 3 device.  That will mean dropping the RFC patch as well, as it is
> > no longer needed :)
> > 
> > Thanks for the hint Mark.
> > 
> > Jonathan  
> 
> Ah great! As you've already noticed my particular case was performing partial 
> decoding on a memory region, but there are no issues if you need to dispatch to 
> another existing address space such as PCI/IOMMU. Creating a separate address space 
> per device shouldn't be an issue either, as that's effectively how the PCI bus master 
> requests are handled.
> 
> The address spaces are visible in "info mtree" so if you haven't already, I would 
> recommend generating a dynamic name for the address space based upon the device 
> name/address to make it easier for development and debugging.
info mtree already provides the following with a static name:
address-space: cxl-type3-dpa-space
  0000000000000000-000000000fffffff (prio 0, nv-ram): cxl-mem2

So the device association is there anyway.  Hence I'm not sure a dynamic name adds
much on this occasion, and the code is simpler without making it dynamic.
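(For reference, the per-device space amounts to roughly the following against the QEMU memory API. This is a sketch only; the ct3d fields, pmem_mr, and the container region are illustrative rather than lifted from the series:)

```c
/* Give each CXL type 3 device its own address space rooted at a
 * device-private container region, so accesses can be dispatched
 * with the address space API instead of memory_region_dispatch_*(). */
memory_region_init(&ct3d->dpa_root, OBJECT(ct3d), "cxl-dpa-root",
                   memory_region_size(pmem_mr));
memory_region_add_subregion(&ct3d->dpa_root, 0, pmem_mr);
address_space_init(&ct3d->hostmem_as, &ct3d->dpa_root,
                   "cxl-type3-dpa-space");

/* A read then routes through the usual dispatch machinery: */
MemTxResult res = address_space_read(&ct3d->hostmem_as, dpa_offset,
                                     MEMTXATTRS_UNSPECIFIED, &data, size);
```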

Thanks,

Jonathan


> 
> 
> ATB,
> 
> Mark.
>
Mark Cave-Ayland March 18, 2022, 8:14 a.m. UTC | #12
On 17/03/2022 16:47, Jonathan Cameron wrote:

>> Ah great! As you've already noticed my particular case was performing partial
>> decoding on a memory region, but there are no issues if you need to dispatch to
>> another existing address space such as PCI/IOMMU. Creating a separate address space
>> per device shouldn't be an issue either, as that's effectively how the PCI bus master
>> requests are handled.
>>
>> The address spaces are visible in "info mtree" so if you haven't already, I would
>> recommend generating a dynamic name for the address space based upon the device
>> name/address to make it easier for development and debugging.
> info mtree already provides the following with a static name
> address-space: cxl-type3-dpa-space
>    0000000000000000-000000000fffffff (prio 0, nv-ram): cxl-mem2
> 
> So the device association is there anyway.  Hence I'm not sure a dynamic name adds
> a lot on this occasion and code is simpler without making it dynamic.

Is this using a single address space for multiple memory devices, or one per device 
as you were suggesting in the thread? If it is one per device and cxl-mem2 is the 
value of the -device id parameter, I still think it is worth adding the same device 
id into the address space name for the sake of a g_strdup_printf() and corresponding 
g_free().

Alas I don't currently have the time (and enough knowledge of CXL!) to do a more 
comprehensive review of the patches, but a quick skim of the series suggests it is 
quite mature. The only thing I noticed was that there don't seem to be any 
trace-events added, which I think may be useful to aid driver developers if they need 
to debug some of the memory access routing.

Finally I should point out that there are a number of more experienced PCI developers 
on the CC list than me, and they should have the final say on patch review. So please 
consider these comments as recommendations based upon my development work on QEMU, 
and not as a NAK for proceeding with the series :)


ATB,

Mark.
Jonathan Cameron March 18, 2022, 10:08 a.m. UTC | #13
On Fri, 18 Mar 2022 08:14:58 +0000
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> wrote:

> On 17/03/2022 16:47, Jonathan Cameron wrote:
> 
> >> Ah great! As you've already noticed my particular case was performing partial
> >> decoding on a memory region, but there are no issues if you need to dispatch to
> >> another existing address space such as PCI/IOMMU. Creating a separate address space
> >> per device shouldn't be an issue either, as that's effectively how the PCI bus master
> >> requests are handled.
> >>
> >> The address spaces are visible in "info mtree" so if you haven't already, I would
> >> recommend generating a dynamic name for the address space based upon the device
> >> name/address to make it easier for development and debugging.  
> > info mtree already provides the following with a static name
> > address-space: cxl-type3-dpa-space
> >    0000000000000000-000000000fffffff (prio 0, nv-ram): cxl-mem2
> > 
> > So the device association is there anyway.  Hence I'm not sure a dynamic name adds
> > a lot on this occasion and code is simpler without making it dynamic.  
> 
> Is this using a single address space for multiple memory devices, or one per device 
> as you were suggesting in the thread? If it is one per device and cxl-mem2 is the 
> value of the -device id parameter, I still think it is worth adding the same device 
> id into the address space name for the sake of a g_strdup_printf() and corresponding 
> g_free().

One per device.  Ultimately, when I add volatile memory support, we'll probably end up
having to add an mr as a container for the two hostmem mrs.   Looking again, the name
above is actually the id of the mr, not the type 3 device. Probably better to optionally
use the type 3 device name if available.

I'll make the name something like cxl-type3-dpa-space-cxl-pmem3 if id available
and fall back to cxl-type3-dpa-space as before if not.
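(In plain C the fallback is just the following; a standalone illustration rather than the QEMU code itself, which would use g_strdup_printf():)

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the DPA address space name: append the type 3 device id when
 * one is available, otherwise fall back to the static name. */
static void dpa_space_name(char *buf, size_t len, const char *dev_id)
{
    if (dev_id) {
        snprintf(buf, len, "cxl-type3-dpa-space-%s", dev_id);
    } else {
        snprintf(buf, len, "cxl-type3-dpa-space");
    }
}
```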

> 
> Alas I don't currently have the time (and enough knowledge of CXL!) to do a more 
> comprehensive review of the patches, but a quick skim of the series suggests it seems 
> quite mature. The only thing that I noticed was that there doesn't seem to be any 
> trace-events added, which I think may be useful to aid driver developers if they need 
> to debug some of the memory access routing.

Good suggestion.  I'm inclined to add them in a follow-up patch though, because
this patch set is already somewhat unmanageable from a review point of view.
I already have a number of other patches queued up for a second series adding
more functionality.
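(For the record, I'm thinking of something along these lines in hw/mem/trace-events at the dispatch points; hypothetical event names, nothing from the series yet:)

```
# hw/mem/trace-events (sketch)
cxl_type3_read(uint64_t dpa, unsigned size) "dpa 0x%" PRIx64 " size %u"
cxl_type3_write(uint64_t dpa, unsigned size) "dpa 0x%" PRIx64 " size %u"
```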

> 
> Finally I should point out that there are a number of more experienced PCI developers 
> on the CC list than me, and they should have the final say on patch review. So please 
> consider these comments as recommendations based upon my development work on QEMU, 
> and not as a NAK for proceeding with the series :)

No problem and thanks for your help as (I think) you've solved the biggest open issue :)

Jonathan

> 
> 
> ATB,
> 
> Mark.