[RFC,00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough

Message ID 20240920223446.1908673-1-zhiw@nvidia.com (mailing list archive)

Message

Zhi Wang Sept. 20, 2024, 10:34 p.m. UTC
Hi folks:

As promised at LPC, here is everything you need (patches, repos, a guiding
video, kernel configs) to build an environment to test the vfio-cxl-core.

Thanks so much for the discussions! Enjoy and see you in the next one.

Background
==========

Compute Express Link (CXL) is an open standard interconnect built upon the
industry-standard PCI layers to enhance the performance and efficiency of
data centers by enabling high-speed, low-latency communication between CPUs
and various types of devices such as accelerators and memory devices.

It supports three key protocols: CXL.io as the control protocol, CXL.cache
as the cache-coherent host-device data transfer protocol, and CXL.mem as
the memory expansion protocol. CXL Type 2 devices leverage all three
protocols to seamlessly integrate with host CPUs, providing a unified and
efficient interface for high-speed data transfer and memory sharing. This
integration is crucial for heterogeneous computing environments where
accelerators, such as GPUs and other specialized processors, are used to
handle intensive workloads.

Goal
====

Although CXL is built upon the PCI layers, passing through a CXL type-2
device differs from passing through a PCI device according to the CXL
specification[1]:

- CXL type-2 device initialization. A CXL type-2 device requires an
additional initialization sequence besides the PCI device initialization.
This sequence can be fairly complicated due to the device's hierarchy of
register interfaces. Thus, the standard CXL type-2 driver initialization
sequence provided by the kernel CXL core is used.

- Create a CXL region and map it into the VM. A mapping between HPA and
DPA (Device PA) needs to be created to access the device memory directly.
HDM decoders in the CXL topology need to be configured level by level to
manage the mapping. After the region is created, it needs to be mapped to
GPA according to the virtual HDM decoders configured by the VM (see the
sketch after this list).

- CXL reset. The CXL device reset is different from the PCI device reset.
A dedicated CXL reset sequence is defined by the CXL spec.

- Emulate CXL DVSECs. The CXL spec defines a set of DVSEC registers in the
configuration space for device enumeration and device control (e.g. whether
a device is capable of CXL.mem/CXL.cache, and enabling/disabling those
capabilities). They are owned by the kernel CXL core, and the VM cannot
modify them.

- Emulate CXL MMIO registers. The CXL spec defines a set of CXL MMIO
registers that can sit in a PCI BAR. The location of each register group
within the PCI BAR is indicated by the Register Locator in the CXL DVSECs.
These registers are also owned by the kernel CXL core, and some of them
need to be emulated.
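
To make the mapping above more concrete, here is a minimal sketch of the
offset math behind the virtual HDM decoder emulation (illustration only,
not code from the series; every name is made up): a GPA programmed into
the guest's virtual decoder is turned into an offset inside the
host-created CXL region, whose HPA range in turn maps to the device DPA.

  /*
   * Illustration only; every name here is hypothetical.
   *
   * The host creates one CXL region mapping
   *   HPA [hpa_base, hpa_base + size) -> DPA [dpa_base, dpa_base + size)
   * and the guest commits a virtual HDM decoder claiming
   *   GPA [gpa_base, gpa_base + size).
   * The VMM only needs the offset of a GPA inside the virtual decoder to
   * know which offset of the host region (and thus which DPA) to expose.
   */
  #include <stdbool.h>
  #include <stdint.h>

  struct host_cxl_region {
          uint64_t hpa_base;      /* start of the CXL region in host PA */
          uint64_t dpa_base;      /* device PA backing the region */
          uint64_t size;
  };

  struct virt_hdm_decoder {
          uint64_t gpa_base;      /* base committed by the guest */
          uint64_t size;
  };

  /* Translate a GPA into an offset within the host CXL region. */
  static bool gpa_to_region_offset(const struct virt_hdm_decoder *vhdm,
                                   const struct host_cxl_region *region,
                                   uint64_t gpa, uint64_t *offset)
  {
          if (gpa < vhdm->gpa_base || gpa - vhdm->gpa_base >= vhdm->size)
                  return false;   /* GPA not claimed by this decoder */
          if (vhdm->size > region->size)
                  return false;   /* guest committed more than we have */

          *offset = gpa - vhdm->gpa_base; /* same offset into HPA and DPA */
          return true;
  }

The VMM then maps hpa_base + offset (through the new VFIO device region
described below) at the GPA chosen by the guest's decoder programming,
which is the GPA->HPA->DPA chain discussed later in this thread.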

Design
======

To achieve the goals above, the vfio-cxl-core is introduced to host the
common routines that a variant driver requires for device passthrough.
Similar to the vfio-pci-core, the vfio-cxl-core provides common
vfio_device_ops routines for the variant driver to hook, and performs the
CXL routines behind them.

Besides, several extra APIs are introduced for the variant driver to
provide the necessary information to the kernel CXL core so that it can
initialize the CXL device, e.g. the device DPA.

CXL is built upon the PCI layers, but with differences. Thus, the
vfio-pci-core is re-used as much as possible, with added awareness of
operating on a CXL device.

A new VFIO device region is introduced to expose the CXL region to the
userspace. A new CXL VFIO device cap has also been introduced to convey
the necessary CXL device information to the userspace.
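
As a rough illustration of the intended split, a variant driver would fill
its vfio_device_ops mostly with vfio-cxl-core entry points and add its own
device-specific hooks. The sketch below is based on assumptions: only
vfio_cxl_core_{read,write}() appear in the patch subjects, and the other
vfio_cxl_core_*/my_accel_* names and all signatures are made up here for
illustration.

  /* Hedged sketch of a variant driver hooking vfio-cxl-core. */
  #include <linux/vfio_pci_core.h>

  /* Hands the device DPA size to vfio-cxl-core, which then runs the CXL
   * initialization sequence and creates the CXL region. */
  static int my_accel_open_device(struct vfio_device *vdev);

  /* vfio-cxl-core tears the CXL region down again. */
  static void my_accel_close_device(struct vfio_device *vdev);

  static const struct vfio_device_ops my_accel_cxl_ops = {
          .name         = "my-accel-vfio-cxl",
          .open_device  = my_accel_open_device,
          .close_device = my_accel_close_device,
          .read         = vfio_cxl_core_read,  /* emulates CXL MMIO such as
                                                * the HDM decoder registers;
                                                * everything else falls
                                                * through to vfio-pci-core */
          .write        = vfio_cxl_core_write,
          .ioctl        = vfio_pci_core_ioctl, /* plain PCI parts reused */
          .mmap         = vfio_pci_core_mmap,
  };

The point is only the layering: the CXL-specific emulation lives in
vfio-cxl-core, while everything else remains the existing vfio-pci-core.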

Patches
=======

The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
rebase them to V3 once the cxl-type2 support v3 patch review is done.

PATCH 1-3: Expose the necessary routines required by vfio-cxl.

PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
initialization and CXL region creation.

PATCH 5: Expose the CXL region to the userspace.

PATCH 6-7: Prepare to emulate the HDM decoder registers.

PATCH 8: Emulate the HDM decoder registers.

PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.

PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.

PATCH 11: Expose the CXL device information that userspace needs.

PATCH 12: An example variant driver to demonstrate the usage of
vfio-cxl-core from the perspective of the VFIO variant driver.

PATCH 13: A workaround that needs suggestions.

Test
====

To test the patches and hack around, a virtual passthrough approach with
nested virtualization is used.

The host QEMU emulates a CXL type-2 accel device based on Ira's patches[3],
with changes to emulate HDM decoders.

When running vfio-cxl in the L1 guest, an example VFIO variant
driver is used to attach to the QEMU CXL accel device.

The L2 guest can be booted via QEMU with the vfio-cxl support in the
VFIOStub.

In the L2 guest, a dummy CXL device driver is provided to attach to the
virtual pass-thru device.

The dummy CXL type-2 device driver can successfully be loaded with the
kernel CXL core type-2 support, and can create a CXL region by requesting
the CXL core to allocate HPA and DPA and to configure the HDM decoders.
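
For reference, the dummy driver's probe path boils down to something like
the sketch below. This is an illustration only: every helper name is a
placeholder, not the actual API of the type-2 support in [2].

  static int dummy_cxl_probe(struct pci_dev *pdev,
                             const struct pci_device_id *id)
  {
          void *accel_state;
          int ret;

          /* 1. Register with the kernel CXL core as an accel (type-2)
           *    endpoint and run the CXL initialization sequence.
           *    (Placeholder name.) */
          accel_state = dummy_cxl_accel_init(pdev);
          if (IS_ERR(accel_state))
                  return PTR_ERR(accel_state);

          /* 2. Ask the core to allocate HPA and DPA and to program the
           *    HDM decoders level by level; in the L2 guest these are
           *    the virtual decoders emulated underneath by vfio-cxl and
           *    QEMU. (Placeholder name.) */
          ret = dummy_cxl_create_region(accel_state, DUMMY_DPA_SIZE);
          if (ret)
                  dummy_cxl_accel_fini(accel_state);      /* placeholder */

          return ret;
  }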

To make sure everyone can test the patches, the kernel configs of L1 and
L2 are provided in the repos; the required kernel command-line params and
QEMU command line can be found in the demonstration video[4].

Repos
=====

QEMU host: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
L1 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
L1 QEMU: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2

[1] https://computeexpresslink.org/cxl-specification/
[2] https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
[3] https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
[4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7

Feedback expected
=================

- Architecture split between vfio-pci-core and vfio-cxl-core.
- Variant driver requirements from more hardware vendors.
- vfio-cxl-core UABI to QEMU.

Moving forward
==============

- Rebase the patches on top of Alejandro's PATCH v3.
- Get Ira's type-2 emulated device patch into upstream, as both the CXL
  folks and the RH folks came to talk and expect this. I had a chat with
  Ira and he expected me to take it over. Will start a discussion in the
  CXL discord group for the design of V1.
- Sparse map in vfio-cxl-core.

Known issues
============

- Teardown path. The missing teardown paths have been implemented in
  Alejandro's PATCH v3. This should be solved after the rebase.

- Power down the L1 guest instead of rebooting it. The QEMU reset handler
  is missing in Ira's patch, so when rebooting L1, many CXL registers are
  not reset. This will be addressed in the formal review of the emulated
  CXL type-2 device support.

Zhi Wang (13):
  cxl: allow a type-2 device not to have memory device registers
  cxl: introduce cxl_get_hdm_info()
  cxl: introduce cxl_find_comp_reglock_offset()
  vfio: introduce vfio-cxl core preludes
  vfio/cxl: expose CXL region to the usersapce via a new VFIO device
    region
  vfio/pci: expose vfio_pci_rw()
  vfio/cxl: introduce vfio_cxl_core_{read, write}()
  vfio/cxl: emulate HDM decoder registers
  vfio/pci: introduce CXL device awareness
  vfio/pci: emulate CXL DVSEC registers in the configuration space
  vfio/cxl: introduce VFIO CXL device cap
  vfio/cxl: VFIO variant driver for QEMU CXL accel device
  vfio/cxl: workaround: don't take resource region when cxl is enabled.

 drivers/cxl/core/pci.c              |  28 ++
 drivers/cxl/core/regs.c             |  22 +
 drivers/cxl/cxl.h                   |   1 +
 drivers/cxl/cxlpci.h                |   3 +
 drivers/cxl/pci.c                   |  14 +-
 drivers/vfio/pci/Kconfig            |   6 +
 drivers/vfio/pci/Makefile           |   5 +
 drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
 drivers/vfio/pci/cxl-accel/Makefile |   3 +
 drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
 drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c  |  10 +
 drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
 drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
 include/linux/cxl_accel_mem.h       |   3 +
 include/linux/cxl_accel_pci.h       |   6 +
 include/linux/vfio_pci_core.h       |  53 +++
 include/uapi/linux/vfio.h           |  14 +
 18 files changed, 992 insertions(+), 32 deletions(-)
 create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
 create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
 create mode 100644 drivers/vfio/pci/cxl-accel/main.c
 create mode 100644 drivers/vfio/pci/vfio_cxl_core.c

Comments

Tian, Kevin Sept. 23, 2024, 8 a.m. UTC | #1
> From: Zhi Wang <zhiw@nvidia.com>
> Sent: Saturday, September 21, 2024 6:35 AM
> 
[...]
> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> (Device PA) needs to be created to access the device memory directly. HDM
> decoders in the CXL topology need to be configured level by level to
> manage the mapping. After the region is created, it needs to be mapped to
> GPA in the virtual HDM decoders configured by the VM.

Any time a new address space is introduced, it's worth providing more
context to help people who have no CXL background better understand
the mechanism and think through any potential holes.

At a glance looks we are talking about a mapping tier:

  GPA->HPA->DPA

The location/size of HPA/DPA for a cxl region are decided and mapped
at @open_device and the HPA range is mapped to GPA at @mmap.

In addition the guest also manages a virtual HDM decoder:

  GPA->vDPA

Ideally the vDPA range selected by the guest is a subset of the physical
cxl region, so based on the offset and the vHDM the VMM may figure out
which offset in the cxl region should be mmaped for the corresponding
GPA (which in the end maps to the desired DPA).

Is this understanding correct?

btw is one cxl device only allowed to create one region? If multiple
regions are possible how will they be exposed to the guest?

> 
> - CXL reset. The CXL device reset is different from the PCI device reset.
> A CXL reset sequence is introduced by the CXL spec.
> 
> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
> configuration for device enumeration and device control. (E.g. if a device
> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
> by the kernel CXL core, and the VM can not modify them.

any side effect from emulating it purely in software (patch 10), e.g. when
the guest-desired configuration differs from the physical one?

> 
> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
> that can sit in a PCI BAR. The location of register groups sit in the PCI
> BAR is indicated by the register locator in the CXL DVSECs. They are also
> owned by the kernel CXL core. Some of them need to be emulated.

ditto

> 
> In the L2 guest, a dummy CXL device driver is provided to attach to the
> virtual pass-thru device.
> 
> The dummy CXL type-2 device driver can successfully be loaded with the
> kernel cxl core type2 support, create CXL region by requesting the CXL
> core to allocate HPA and DPA and configure the HDM decoders.

It'd be good to see a real cxl device working to add confidence on
the core design.
Zhi Wang Sept. 24, 2024, 8:30 a.m. UTC | #2
On 23/09/2024 11.00, Tian, Kevin wrote:
> 
>> From: Zhi Wang <zhiw@nvidia.com>
>> Sent: Saturday, September 21, 2024 6:35 AM
>>
> [...]
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (Device PA) needs to be created to access the device memory directly. HDM
>> decoders in the CXL topology need to be configured level by level to
>> manage the mapping. After the region is created, it needs to be mapped to
>> GPA in the virtual HDM decoders configured by the VM.
> 
> Any time when a new address space is introduced it's worthy of more
> context to help people who have no CXL background better understand
> the mechanism and think any potential hole.
> 
> At a glance looks we are talking about a mapping tier:
> 
>    GPA->HPA->DPA
> 
> The location/size of HPA/DPA for a cxl region are decided and mapped
> at @open_device and the HPA range is mapped to GPA at @mmap.
> 
> In addition the guest also manages a virtual HDM decoder:
> 
>    GPA->vDPA
> 
> Ideally the vDPA range selected by guest is a subset of the physical
> cxl region so based on offset and vHDM the VMM may figure out
> which offset in the cxl region to be mmaped for the corresponding
> GPA (which in the end maps to the desired DPA).
> 
> Is this understanding correct?
> 

Yes. Many thanks for summarizing this. It is a design decision from a
discussion in the CXL discord channel.

> btw is one cxl device only allowed to create one region? If multiple
> regions are possible how will they be exposed to the guest?
>

It is not (and shouldn't be) an enforced requirement from the VFIO cxl
core. It is really requirement-driven. I am waiting to see what kind of
real-world use cases need multiple CXL regions in the host and then pass
multiple regions to the guest.

Presumably, the host creates one large CXL region that covers the entire
DPA, while QEMU can virtually partition it into different regions and
map them to different virtual CXL regions if QEMU presents multiple HDM
decoders to the guest.
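
As an illustration of that partitioning (sketch only; the helper and the
offset handling below are made up, not QEMU code), QEMU would carve one
window out of the single pCXL device region per virtual HDM decoder and
map it at the GPA base the guest committed:

  #include <stdint.h>
  #include <sys/mman.h>

  /* region_base_offset: where the pCXL device region starts within the
   * VFIO device fd (from region info); window_offset/window_size: the
   * slice of the pCXL region backing one virtual HDM decoder. */
  static void *map_vhdm_window(int device_fd, uint64_t region_base_offset,
                               uint64_t window_offset, uint64_t window_size)
  {
          /* The returned mapping is then installed in the guest at the
           * GPA base of that virtual HDM decoder, e.g. as a memory
           * region alias. */
          return mmap(NULL, window_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                      device_fd, region_base_offset + window_offset);
  }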

>>
>> - CXL reset. The CXL device reset is different from the PCI device reset.
>> A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
>> configuration for device enumeration and device control. (E.g. if a device
>> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
>> by the kernel CXL core, and the VM can not modify them.
> 
> any side effect from emulating it purely in software (patch10), e.g. when
> the guest desired configuration is different from the physical one?
> 

This should come with a summary, and later it can be decided whether
mediated passthrough is needed. In this RFC, the goal is just to prevent
the guest from modifying the pRegs.

>>
>> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
>> that can sit in a PCI BAR. The location of register groups sit in the PCI
>> BAR is indicated by the register locator in the CXL DVSECs. They are also
>> owned by the kernel CXL core. Some of them need to be emulated.
> 
> ditto
> 
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-thru device.
>>
>> The dummy CXL type-2 device driver can successfully be loaded with the
>> kernel cxl core type2 support, create CXL region by requesting the CXL
>> core to allocate HPA and DPA and configure the HDM decoders.
> 
> It'd be good to see a real cxl device working to add confidence on
> the core design.

To leverage the opportunity for F2F discussion at LPC, I proposed this
patchset to start the discussion and meanwhile offered an environment
for people to try and hack around. Also, patches are a good base for
discussion. We'll see what we get. :)

There are devices already there and on the way. AMD's SFC (patches are
under review) will, I think, be the first variant driver that uses the
core. NVIDIA's device is also coming, and NVIDIA's variant driver is
going upstream for sure. Plus this emulated device, so I assume we will
have three in-tree variant drivers talking to the CXL core.

Thanks,
Zhi.
Alejandro Lucero Palau Sept. 25, 2024, 10:11 a.m. UTC | #3
On 9/20/24 23:34, Zhi Wang wrote:
> Hi folks:
>
> As promised in the LPC, here are all you need (patches, repos, guiding
> video, kernel config) to build a environment to test the vfio-cxl-core.
>
> Thanks so much for the discussions! Enjoy and see you in the next one.
>
> Background
> ==========
>
> Compute Express Link (CXL) is an open standard interconnect built upon
> industrial PCI layers to enhance the performance and efficiency of data
> centers by enabling high-speed, low-latency communication between CPUs
> and various types of devices such as accelerators, memory.
>
> It supports three key protocols: CXL.io as the control protocol, CXL.cache
> as the cache-coherent host-device data transfer protocol, and CXL.mem as
> memory expansion protocol. CXL Type 2 devices leverage the three protocols
> to seamlessly integrate with host CPUs, providing a unified and efficient
> interface for high-speed data transfer and memory sharing. This integration
> is crucial for heterogeneous computing environments where accelerators,
> such as GPUs, and other specialized processors, are used to handle
> intensive workloads.
>
> Goal
> ====
>
> Although CXL is built upon the PCI layers, passing a CXL type-2 device can
> be different than PCI devices according to CXL specification[1]:
>
> - CXL type-2 device initialization. CXL type-2 device requires an
> additional initialization sequence besides the PCI device initialization.
> CXL type-2 device initialization can be pretty complicated due to its
> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
> initialization sequence provided by the kernel CXL core is used.
>
> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> (Device PA) needs to be created to access the device memory directly. HDM
> decoders in the CXL topology need to be configured level by level to
> manage the mapping. After the region is created, it needs to be mapped to
> GPA in the virtual HDM decoders configured by the VM.
>
> - CXL reset. The CXL device reset is different from the PCI device reset.
> A CXL reset sequence is introduced by the CXL spec.
>
> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
> configuration for device enumeration and device control. (E.g. if a device
> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
> by the kernel CXL core, and the VM can not modify them.
>
> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
> that can sit in a PCI BAR. The location of register groups sit in the PCI
> BAR is indicated by the register locator in the CXL DVSECs. They are also
> owned by the kernel CXL core. Some of them need to be emulated.
>
> Design
> ======
>
> To achieve the purpose above, the vfio-cxl-core is introduced to host the
> common routines that variant driver requires for device passthrough.
> Similar with the vfio-pci-core, the vfio-cxl-core provides common
> routines of vfio_device_ops for the variant driver to hook and perform the
> CXL routines behind it.
>
> Besides, several extra APIs are introduced for the variant driver to
> provide the necessary information the kernel CXL core to initialize
> the CXL device. E.g., Device DPA.
>
> CXL is built upon the PCI layers but with differences. Thus, the
> vfio-pci-core is aimed to be re-used as much as possible with the
> awareness of operating on a CXL device.
>
> A new VFIO device region is introduced to expose the CXL region to the
> userspace. A new CXL VFIO device cap has also been introduced to convey
> the necessary CXL device information to the userspace.



Hi Zhi,


As you know, I was confused by this work, but after looking at the
patchset and thinking about all this, it makes sense now. FWIW, the most
confusing point was using the CXL core inside the VM for creating the
region, which implies commits to the CXL root complex/switch and any other
switch in the path. I realize now it will happen, but on emulated
hardware with no implication for the real one, which was updated with any
necessary changes, like those commits, by the vfio-cxl code in the host
(the L1 VM in your tests).


The only problem I can see with this approach is that the CXL
initialization is left unconditionally to the hypervisor. I guess most of
the time it will be fine, but the driver might not always be mapping/using
the whole CXL mem. I know this could be awkward, but it is possible
depending on device state unrelated to CXL itself. In other words, this
approach assumes beforehand something which might not be true. What I had
in mind was to have VM exits for any action on CXL configuration on behalf
of that device/driver inside the VM.


This is all more problematic with CXL.cache, and I think the same
approach cannot be followed. I'm writing a document trying to share all
my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover
this for sure. As a quick note, while DMA/IOMMU has no limitations
regarding the amount of memory to use, it is unlikely the same can be
done here due to scarce host snoop cache resources; therefore the
CXL.cache mappings will likely need to be explicitly done by the driver
and approved by the CXL core (along with DMA/IOMMU), and for a driver
inside a VM that implies VM exits.


Thanks.

Alejandro.

> Patches
> =======
>
> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>
> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>
> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
> initialization, CXL region creation.
>
> PATCH 5: Expose the CXL region to the userspace.
>
> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>
> PATCH 8: Emulate the HDM decoder registers.
>
> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>
> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>
> PATCH 11: Expose the CXL device information that userspace needs.
>
> PATCH 12: An example variant driver to demonstrate the usage of
> vfio-cxl-core from the perspective of the VFIO variant driver.
>
> PATCH 13: A workaround needs suggestions.
>
> Test
> ====
>
> To test the patches and hack around, a virtual passthrough with nested
> virtualization approach is used.
>
> The host QEMU emulates a CXL type-2 accel device based on Ira's patches
> with the changes to emulate HDM decoders.
>
> While running the vfio-cxl in the L1 guest, an example VFIO variant
> driver is used to attach with the QEMU CXL access device.
>
> The L2 guest can be booted via the QEMU with the vfio-cxl support in the
> VFIOStub.
>
> In the L2 guest, a dummy CXL device driver is provided to attach to the
> virtual pass-thru device.
>
> The dummy CXL type-2 device driver can successfully be loaded with the
> kernel cxl core type2 support, create CXL region by requesting the CXL
> core to allocate HPA and DPA and configure the HDM decoders.
>
> To make sure everyone can test the patches, the kernel config of L1 and
> L2 are provided in the repos, the required kernel command params and
> qemu command line can be found from the demostration video.[5]
>
> Repos
> =====
>
> QEMU host: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
> L1 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
> L1 QEMU: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>
> [1] https://computeexpresslink.org/cxl-specification/
> [2] https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
> [3] https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>
> Feedback expected
> =================
>
> - Archtiecture level between vfio-pci-core and vfio-cxl-core.
> - Variant driver requirements from more hardware vendors.
> - vfio-cxl-core UABI to QEMU.
>
> Moving foward
> =============
>
> - Rebase the patches on top of Alejandro's PATCH v3.
> - Get Ira's type-2 emulated device patch into upstream as CXL folks and RH
>    folks both came to talk and expect this. I had a chat with Ira and he
>    expected me to take it over. Will start a discussion in the CXL discord
>    group for the desgin of V1.
> - Sparse map in vfio-cxl-core.
>
> Known issues
> ============
>
> - Teardown path. Missing teardown paths have been implements in Alejandor's
>    PATCH v3. It should be solved after the rebase.
>
> - Powerdown L1 guest instead of reboot it. The QEMU reset handler is missing
>    in the Ira's patch. When rebooting L1, many CXL registers are not reset.
>    This will be addressed in the formal review of emulated CXL type-2 device
>    support.
>
> Zhi Wang (13):
>    cxl: allow a type-2 device not to have memory device registers
>    cxl: introduce cxl_get_hdm_info()
>    cxl: introduce cxl_find_comp_reglock_offset()
>    vfio: introduce vfio-cxl core preludes
>    vfio/cxl: expose CXL region to the usersapce via a new VFIO device
>      region
>    vfio/pci: expose vfio_pci_rw()
>    vfio/cxl: introduce vfio_cxl_core_{read, write}()
>    vfio/cxl: emulate HDM decoder registers
>    vfio/pci: introduce CXL device awareness
>    vfio/pci: emulate CXL DVSEC registers in the configuration space
>    vfio/cxl: introduce VFIO CXL device cap
>    vfio/cxl: VFIO variant driver for QEMU CXL accel device
>    vfio/cxl: workaround: don't take resource region when cxl is enabled.
>
>   drivers/cxl/core/pci.c              |  28 ++
>   drivers/cxl/core/regs.c             |  22 +
>   drivers/cxl/cxl.h                   |   1 +
>   drivers/cxl/cxlpci.h                |   3 +
>   drivers/cxl/pci.c                   |  14 +-
>   drivers/vfio/pci/Kconfig            |   6 +
>   drivers/vfio/pci/Makefile           |   5 +
>   drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>   drivers/vfio/pci/cxl-accel/Makefile |   3 +
>   drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>   drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
>   drivers/vfio/pci/vfio_pci_config.c  |  10 +
>   drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>   drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>   include/linux/cxl_accel_mem.h       |   3 +
>   include/linux/cxl_accel_pci.h       |   6 +
>   include/linux/vfio_pci_core.h       |  53 +++
>   include/uapi/linux/vfio.h           |  14 +
>   18 files changed, 992 insertions(+), 32 deletions(-)
>   create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>   create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>   create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>   create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>
Jonathan Cameron Sept. 25, 2024, 1:05 p.m. UTC | #4
On Tue, 24 Sep 2024 08:30:17 +0000
Zhi Wang <zhiw@nvidia.com> wrote:

> On 23/09/2024 11.00, Tian, Kevin wrote:
> >   
> >> From: Zhi Wang <zhiw@nvidia.com>
> >> Sent: Saturday, September 21, 2024 6:35 AM
> >>  
> > [...]  
> >> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> >> (Device PA) needs to be created to access the device memory directly. HDM
> >> decoders in the CXL topology need to be configured level by level to
> >> manage the mapping. After the region is created, it needs to be mapped to
> >> GPA in the virtual HDM decoders configured by the VM.  
> > 
> > Any time when a new address space is introduced it's worthy of more
> > context to help people who have no CXL background better understand
> > the mechanism and think any potential hole.
> > 
> > At a glance looks we are talking about a mapping tier:
> > 
> >    GPA->HPA->DPA
> > 
> > The location/size of HPA/DPA for a cxl region are decided and mapped
> > at @open_device and the HPA range is mapped to GPA at @mmap.
> > 
> > In addition the guest also manages a virtual HDM decoder:
> > 
> >    GPA->vDPA
> > 
> > Ideally the vDPA range selected by guest is a subset of the physical
> > cxl region so based on offset and vHDM the VMM may figure out
> > which offset in the cxl region to be mmaped for the corresponding
> > GPA (which in the end maps to the desired DPA).
> > 
> > Is this understanding correct?
> >   
> 
> Yes. Many thanks to summarize this. It is a design decision from a 
> discussion in the CXL discord channel.
> 
> > btw is one cxl device only allowed to create one region? If multiple
> > regions are possible how will they be exposed to the guest?
> >  
> 
> It is not an (shouldn't be) enforced requirement from the VFIO cxl core. 
> It is really requirement-driven. I am expecting what kind of use cases 
> in reality that needs multiple CXL regions in the host and then passing 
> multiple regions to the guest.

A mix of back-invalidate and non-back-invalidate supporting device memory
maybe?  A bounce region for p2p traffic would be the obvious reason to do
this without paying the cost of large snoop filters. If anyone puts PMEM
on the device, then maybe a mix of that and volatile. In theory you might
do separate regions for QoS reasons, but that seems unlikely to me...

Anyhow, not an immediate problem, as I don't know of any
BI-capable hosts yet and doubt anyone (other than Dan) cares about PMEM :)


> 
> Presumably, the host creates one large CXL region that covers the entire 
> DPA, while QEMU can virtually partition it into different regions and 
> map them to different virtual CXL region if QEMU presents multiple HDM 
> decoders to the guest.

I'm not sure why it would do that. Can't think why you'd break up
a host region - maybe I'm missing something.

...

> >> In the L2 guest, a dummy CXL device driver is provided to attach to the
> >> virtual pass-thru device.
> >>
> >> The dummy CXL type-2 device driver can successfully be loaded with the
> >> kernel cxl core type2 support, create CXL region by requesting the CXL
> >> core to allocate HPA and DPA and configure the HDM decoders.  
> > 
> > It'd be good to see a real cxl device working to add confidence on
> > the core design.  
> 
> To leverage the opportunity of F2F discussion in LPC, I proposed this 
> patchset to start the discussion and meanwhile offered an environment 
> for people to try and hack around. Also patches is good base for 
> discussion. We see what we will get. :)
> 
> There are devices already there and on-going. AMD's SFC (patches are 
> under review) and I think they are going to be the first variant driver 
> that use the core. NVIDIA's device is also coming and NVIDIA's variant 
> driver is going upstream for sure. Plus this emulated device, I assume 
> we will have three in-tree variant drivers talks to the CXL core.
Nice.
> 
> Thanks,
> Zhi.
Tian, Kevin Sept. 26, 2024, 6:55 a.m. UTC | #5
> From: Zhi Wang <zhiw@nvidia.com>
> Sent: Tuesday, September 24, 2024 4:30 PM
> 
> On 23/09/2024 11.00, Tian, Kevin wrote:
> >
> >> From: Zhi Wang <zhiw@nvidia.com>
> >> Sent: Saturday, September 21, 2024 6:35 AM
> >>
> > [...]
> >> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> >> (Device PA) needs to be created to access the device memory directly. HDM
> >> decoders in the CXL topology need to be configured level by level to
> >> manage the mapping. After the region is created, it needs to be mapped to
> >> GPA in the virtual HDM decoders configured by the VM.
> >
> > Any time when a new address space is introduced it's worthy of more
> > context to help people who have no CXL background better understand
> > the mechanism and think any potential hole.
> >
> > At a glance looks we are talking about a mapping tier:
> >
> >    GPA->HPA->DPA
> >
> > The location/size of HPA/DPA for a cxl region are decided and mapped
> > at @open_device and the HPA range is mapped to GPA at @mmap.
> >
> > In addition the guest also manages a virtual HDM decoder:
> >
> >    GPA->vDPA
> >
> > Ideally the vDPA range selected by guest is a subset of the physical
> > cxl region so based on offset and vHDM the VMM may figure out
> > which offset in the cxl region to be mmaped for the corresponding
> > GPA (which in the end maps to the desired DPA).
> >
> > Is this understanding correct?
> >
> 
> Yes. Many thanks to summarize this. It is a design decision from a
> discussion in the CXL discord channel.
> 
> > btw is one cxl device only allowed to create one region? If multiple
> > regions are possible how will they be exposed to the guest?
> >
> 
> It is not an (shouldn't be) enforced requirement from the VFIO cxl core.
> It is really requirement-driven. I am expecting what kind of use cases
> in reality that needs multiple CXL regions in the host and then passing
> multiple regions to the guest.
> 
> Presumably, the host creates one large CXL region that covers the entire
> DPA, while QEMU can virtually partition it into different regions and
> map them to different virtual CXL region if QEMU presents multiple HDM
> decoders to the guest.

Non-CXL guys have no idea what a region is and how it is associated
with the backing hardware resource, e.g. it's created by software, so
when the virtual CXL device is composed, how is that software-decided
region translated back to a set of virtual CXL hw resources enumerable
to the guest, etc.

In your description, QEMU, as the virtual platform, maps the VFIO CXL
region into different virtual CXL regions. This kind of suggests regions
are created by hw, conflicting with the point of having sw create them.

We need a full picture to connect the relevant knowledge points in CXL
so the proposal can be better reviewed on the VFIO side.
Zhi Wang Sept. 27, 2024, 7:18 a.m. UTC | #6
On 25/09/2024 15.05, Jonathan Cameron wrote:
> 
> On Tue, 24 Sep 2024 08:30:17 +0000
> Zhi Wang <zhiw@nvidia.com> wrote:
> 
>> On 23/09/2024 11.00, Tian, Kevin wrote:
>>>
>>>> From: Zhi Wang <zhiw@nvidia.com>
>>>> Sent: Saturday, September 21, 2024 6:35 AM
>>>>
>>> [...]
>>>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>>>> (Device PA) needs to be created to access the device memory directly. HDM
>>>> decoders in the CXL topology need to be configured level by level to
>>>> manage the mapping. After the region is created, it needs to be mapped to
>>>> GPA in the virtual HDM decoders configured by the VM.
>>>
>>> Any time when a new address space is introduced it's worthy of more
>>> context to help people who have no CXL background better understand
>>> the mechanism and think any potential hole.
>>>
>>> At a glance looks we are talking about a mapping tier:
>>>
>>>     GPA->HPA->DPA
>>>
>>> The location/size of HPA/DPA for a cxl region are decided and mapped
>>> at @open_device and the HPA range is mapped to GPA at @mmap.
>>>
>>> In addition the guest also manages a virtual HDM decoder:
>>>
>>>     GPA->vDPA
>>>
>>> Ideally the vDPA range selected by guest is a subset of the physical
>>> cxl region so based on offset and vHDM the VMM may figure out
>>> which offset in the cxl region to be mmaped for the corresponding
>>> GPA (which in the end maps to the desired DPA).
>>>
>>> Is this understanding correct?
>>>
>>
>> Yes. Many thanks to summarize this. It is a design decision from a
>> discussion in the CXL discord channel.
>>
>>> btw is one cxl device only allowed to create one region? If multiple
>>> regions are possible how will they be exposed to the guest?
>>>
>>
>> It is not an (shouldn't be) enforced requirement from the VFIO cxl core.
>> It is really requirement-driven. I am expecting what kind of use cases
>> in reality that needs multiple CXL regions in the host and then passing
>> multiple regions to the guest.
> 
> Mix of back invalidate and non back invalidate supporting device memory
> maybe?  A bounce region for p2p traffic would the obvious reason to do
> this without paying the cost of large snoop filters. If anyone puts PMEM
> on the device, then maybe mix of that at volatile. In theory you might
> do separate regions for QoS reasons but seems unlikely to me...
> 
> Anyhow not an immediately problem as I don't know of any
> BI capable hosts yet and doubt anyone (other than Dan) cares about PMEM :)
> 

Got it.
> 
>>
>> Presumably, the host creates one large CXL region that covers the entire
>> DPA, while QEMU can virtually partition it into different regions and
>> map them to different virtual CXL region if QEMU presents multiple HDM
>> decoders to the guest.
> 
> I'm not sure why it would do that. Can't think why you'd break up
> a host region - maybe I'm missing something.
> 

It is mostly about the fact that a device can have multiple HDM decoders.
In the current design, a large physical CXL (pCXL) region covering the
whole DPA will be passed to userspace. Given that the guest will see
multiple virtual HDM decoders, which SW usually asks for, the guest SW
might create multiple virtual CXL regions. In that case QEMU needs to map
them into different regions of the pCXL region.

> ...
> 
>>>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>>>> virtual pass-thru device.
>>>>
>>>> The dummy CXL type-2 device driver can successfully be loaded with the
>>>> kernel cxl core type2 support, create CXL region by requesting the CXL
>>>> core to allocate HPA and DPA and configure the HDM decoders.
>>>
>>> It'd be good to see a real cxl device working to add confidence on
>>> the core design.
>>
>> To leverage the opportunity of F2F discussion in LPC, I proposed this
>> patchset to start the discussion and meanwhile offered an environment
>> for people to try and hack around. Also patches is good base for
>> discussion. We see what we will get. :)
>>
>> There are devices already there and on-going. AMD's SFC (patches are
>> under review) and I think they are going to be the first variant driver
>> that use the core. NVIDIA's device is also coming and NVIDIA's variant
>> driver is going upstream for sure. Plus this emulated device, I assume
>> we will have three in-tree variant drivers talks to the CXL core.
> Nice.
>>
>> Thanks,
>> Zhi.
>
Zhi Wang Sept. 27, 2024, 7:38 a.m. UTC | #7
On 25/09/2024 12.11, Alejandro Lucero Palau wrote:
> 
> On 9/20/24 23:34, Zhi Wang wrote:
>> Hi folks:
>>
>> As promised in the LPC, here are all you need (patches, repos, guiding
>> video, kernel config) to build a environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> industrial PCI layers to enhance the performance and efficiency of data
>> centers by enabling high-speed, low-latency communication between CPUs
>> and various types of devices such as accelerators, memory.
>>
>> It supports three key protocols: CXL.io as the control protocol, 
>> CXL.cache
>> as the cache-coherent host-device data transfer protocol, and CXL.mem as
>> memory expansion protocol. CXL Type 2 devices leverage the three 
>> protocols
>> to seamlessly integrate with host CPUs, providing a unified and efficient
>> interface for high-speed data transfer and memory sharing. This 
>> integration
>> is crucial for heterogeneous computing environments where accelerators,
>> such as GPUs, and other specialized processors, are used to handle
>> intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing a CXL type-2 device 
>> can
>> be different than PCI devices according to CXL specification[1]:
>>
>> - CXL type-2 device initialization. CXL type-2 device requires an
>> additional initialization sequence besides the PCI device initialization.
>> CXL type-2 device initialization can be pretty complicated due to its
>> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
>> initialization sequence provided by the kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (Device PA) needs to be created to access the device memory directly. HDM
>> decoders in the CXL topology need to be configured level by level to
>> manage the mapping. After the region is created, it needs to be mapped to
>> GPA in the virtual HDM decoders configured by the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device reset.
>> A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
>> configuration for device enumeration and device control. (E.g. if a 
>> device
>> is capable of CXL.mem CXL.cache, enable/disable capability) They are 
>> owned
>> by the kernel CXL core, and the VM can not modify them.
>>
>> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO 
>> registers
>> that can sit in a PCI BAR. The location of register groups sit in the PCI
>> BAR is indicated by the register locator in the CXL DVSECs. They are also
>> owned by the kernel CXL core. Some of them need to be emulated.
>>
>> Design
>> ======
>>
>> To achieve the purpose above, the vfio-cxl-core is introduced to host the
>> common routines that variant driver requires for device passthrough.
>> Similar with the vfio-pci-core, the vfio-cxl-core provides common
>> routines of vfio_device_ops for the variant driver to hook and perform 
>> the
>> CXL routines behind it.
>>
>> Besides, several extra APIs are introduced for the variant driver to
>> provide the necessary information the kernel CXL core to initialize
>> the CXL device. E.g., Device DPA.
>>
>> CXL is built upon the PCI layers but with differences. Thus, the
>> vfio-pci-core is aimed to be re-used as much as possible with the
>> awareness of operating on a CXL device.
>>
>> A new VFIO device region is introduced to expose the CXL region to the
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to the userspace.
> 
> 
> 
> Hi Zhi,
> 
> 
> As you know, I was confused with this work but after looking at the
> patchset and thinking about all this, it makes sense now. FWIW, the most
> confusing point was to use the CXL core inside the VM for creating the
> region what implies commits to the CXL root switch/complex and any other
> switch in the path. I realize now it will happen but on emulated
> hardware with no implication to the real one, which was updated with any
> necessary change like those commits by the vfio cxl code in the host (L1
> VM in your tests).
> 
> 
> The only problem I can see with this approach is the CXL initialization
> is left unconditionally to the hypervisor. I guess most of the time will
> be fine, but the driver could not be mapping/using the whole CXL mem
> always.  I know this could be awkward, but possible depending on the
> device state unrelated to CXL itself. 

Will this device state be a one-time on/off state or a runtime
configuration state that the guest needs to poke all the time?

There can be two paths for handling these states in a vendor-specific
variant driver: 1) the vfio_device->fops->open() path, which suits a
one-time on/off state; 2) the vfio_device->fops->{read, write}() path,
i.e. the VM exit->QEMU->variant driver path. The vendor-specific driver
can configure the HW based on the register accesses from the guest.

It would be nice to know more about this, e.g. how many registers the
vendor-specific driver would like to handle, so that the VFIO CXL core
can provide common helpers.
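
For path 2), a minimal sketch of the idea (assumptions only: whether
vfio_cxl_core_write() has exactly this signature, and the my_accel_*
helpers, are made up here):

  /* Trap one vendor-specific control register on the VM-exit path and
   * forward everything else to vfio-cxl-core. */
  static ssize_t my_accel_write(struct vfio_device *vdev,
                                const char __user *buf, size_t count,
                                loff_t *ppos)
  {
          if (my_accel_is_vendor_ctrl_reg(*ppos, count)) {
                  /* configure the HW based on the guest's register write */
                  return my_accel_handle_ctrl_write(vdev, buf, count, ppos);
          }

          /* HDM decoder emulation, pass-through MMIO, config space, ... */
          return vfio_cxl_core_write(vdev, buf, count, ppos);
  }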

> In other words, this approach
> assumes beforehand something which could not be true. What I had in mind
> was to have VM exits for any action on CXL configuration on behalf of
> that device/driver inside the device.
> 

Initially, this was an idea from Dan. I think this would be a good topic
for the next CXL open-source collaboration meeting. Kevin also commented
on this.

> 
> This is all more problematic with CXL.cache, and I think the same
> approach can not be followed. I'm writing a document trying to share all
> my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover
> this for sure. As a quick note, while DMA/IOMMU has no limitations
> regarding the amount of memory to use, it is unlikely the same can be
> done due to scarce host snoop cache resources, therefore the CXL.cache
> mappings will likely need to be explicitly done by the driver and
> approved by the CXL core (along with DMA/IOMMU), and for a driver inside
> a VM that implies VM exits.
> 

Good to hear. Please CC me as well. Many thanks.

> 
> Thanks.
> 
> Alejandro.
> 
>> Patches
>> =======
>>
>> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
>> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>>
>> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>>
>> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
>> initialization, CXL region creation.
>>
>> PATCH 5: Expose the CXL region to the userspace.
>>
>> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>>
>> PATCH 8: Emulate the HDM decoder registers.
>>
>> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>>
>> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>>
>> PATCH 11: Expose the CXL device information that userspace needs.
>>
>> PATCH 12: An example variant driver to demonstrate the usage of
>> vfio-cxl-core from the perspective of the VFIO variant driver.
>>
>> PATCH 13: A workaround needs suggestions.
>>
>> Test
>> ====
>>
>> To test the patches and hack around, a virtual passthrough with nested
>> virtualization approach is used.
>>
>> The host QEMU emulates a CXL type-2 accel device based on Ira's patches
>> with the changes to emulate HDM decoders.
>>
>> While running the vfio-cxl in the L1 guest, an example VFIO variant
>> driver is used to attach with the QEMU CXL access device.
>>
>> The L2 guest can be booted via the QEMU with the vfio-cxl support in the
>> VFIOStub.
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-thru device.
>>
>> The dummy CXL type-2 device driver can successfully be loaded with the
>> kernel cxl core type2 support, create CXL region by requesting the CXL
>> core to allocate HPA and DPA and configure the HDM decoders.
>>
>> To make sure everyone can test the patches, the kernel config of L1 and
>> L2 are provided in the repos, the required kernel command params and
>> qemu command line can be found from the demostration video.[5]
>>
>> Repos
>> =====
>>
>> QEMU host: 
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
>> L1 Kernel: 
>> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
>> L1 QEMU: 
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
>> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>>
>> [1] https://computeexpresslink.org/cxl-specification/
>> [2] 
>> https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
>> [3] 
>> https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
>> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>>
>> Feedback expected
>> =================
>>
>> - Archtiecture level between vfio-pci-core and vfio-cxl-core.
>> - Variant driver requirements from more hardware vendors.
>> - vfio-cxl-core UABI to QEMU.
>>
>> Moving foward
>> =============
>>
>> - Rebase the patches on top of Alejandro's PATCH v3.
>> - Get Ira's type-2 emulated device patch into upstream as CXL folks 
>> and RH
>>    folks both came to talk and expect this. I had a chat with Ira and he
>>    expected me to take it over. Will start a discussion in the CXL 
>> discord
>>    group for the desgin of V1.
>> - Sparse map in vfio-cxl-core.
>>
>> Known issues
>> ============
>>
>> - Teardown path. Missing teardown paths have been implements in 
>> Alejandor's
>>    PATCH v3. It should be solved after the rebase.
>>
>> - Powerdown L1 guest instead of reboot it. The QEMU reset handler is 
>> missing
>>    in the Ira's patch. When rebooting L1, many CXL registers are not 
>> reset.
>>    This will be addressed in the formal review of emulated CXL type-2 
>> device
>>    support.
>>
>> Zhi Wang (13):
>>    cxl: allow a type-2 device not to have memory device registers
>>    cxl: introduce cxl_get_hdm_info()
>>    cxl: introduce cxl_find_comp_reglock_offset()
>>    vfio: introduce vfio-cxl core preludes
>>    vfio/cxl: expose CXL region to the usersapce via a new VFIO device
>>      region
>>    vfio/pci: expose vfio_pci_rw()
>>    vfio/cxl: introduce vfio_cxl_core_{read, write}()
>>    vfio/cxl: emulate HDM decoder registers
>>    vfio/pci: introduce CXL device awareness
>>    vfio/pci: emulate CXL DVSEC registers in the configuration space
>>    vfio/cxl: introduce VFIO CXL device cap
>>    vfio/cxl: VFIO variant driver for QEMU CXL accel device
>>    vfio/cxl: workaround: don't take resource region when cxl is enabled.
>>
>>   drivers/cxl/core/pci.c              |  28 ++
>>   drivers/cxl/core/regs.c             |  22 +
>>   drivers/cxl/cxl.h                   |   1 +
>>   drivers/cxl/cxlpci.h                |   3 +
>>   drivers/cxl/pci.c                   |  14 +-
>>   drivers/vfio/pci/Kconfig            |   6 +
>>   drivers/vfio/pci/Makefile           |   5 +
>>   drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>>   drivers/vfio/pci/cxl-accel/Makefile |   3 +
>>   drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>>   drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
>>   drivers/vfio/pci/vfio_pci_config.c  |  10 +
>>   drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>>   drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>>   include/linux/cxl_accel_mem.h       |   3 +
>>   include/linux/cxl_accel_pci.h       |   6 +
>>   include/linux/vfio_pci_core.h       |  53 +++
>>   include/uapi/linux/vfio.h           |  14 +
>>   18 files changed, 992 insertions(+), 32 deletions(-)
>>   create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>>   create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>>   create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>>   create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>>
Zhi Wang Sept. 27, 2024, 7:38 a.m. UTC | #8
On 25/09/2024 12.11, Alejandro Lucero Palau wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 9/20/24 23:34, Zhi Wang wrote:
>> Hi folks:
>>
>> As promised in the LPC, here are all you need (patches, repos, guiding
>> video, kernel config) to build a environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> industrial PCI layers to enhance the performance and efficiency of data
>> centers by enabling high-speed, low-latency communication between CPUs
>> and various types of devices such as accelerators, memory.
>>
>> It supports three key protocols: CXL.io as the control protocol, 
>> CXL.cache
>> as the cache-coherent host-device data transfer protocol, and CXL.mem as
>> memory expansion protocol. CXL Type 2 devices leverage the three 
>> protocols
>> to seamlessly integrate with host CPUs, providing a unified and efficient
>> interface for high-speed data transfer and memory sharing. This 
>> integration
>> is crucial for heterogeneous computing environments where accelerators,
>> such as GPUs, and other specialized processors, are used to handle
>> intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing a CXL type-2 device 
>> can
>> be different than PCI devices according to CXL specification[1]:
>>
>> - CXL type-2 device initialization. CXL type-2 device requires an
>> additional initialization sequence besides the PCI device initialization.
>> CXL type-2 device initialization can be pretty complicated due to its
>> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
>> initialization sequence provided by the kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (Device PA) needs to be created to access the device memory directly. HDM
>> decoders in the CXL topology need to be configured level by level to
>> manage the mapping. After the region is created, it needs to be mapped to
>> GPA in the virtual HDM decoders configured by the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device reset.
>> A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
>> configuration for device enumeration and device control. (E.g. if a 
>> device
>> is capable of CXL.mem CXL.cache, enable/disable capability) They are 
>> owned
>> by the kernel CXL core, and the VM can not modify them.
>>
>> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO 
>> registers
>> that can sit in a PCI BAR. The location of register groups sit in the PCI
>> BAR is indicated by the register locator in the CXL DVSECs. They are also
>> owned by the kernel CXL core. Some of them need to be emulated.
>>
>> Design
>> ======
>>
>> To achieve the purpose above, the vfio-cxl-core is introduced to host the
>> common routines that variant driver requires for device passthrough.
>> Similar with the vfio-pci-core, the vfio-cxl-core provides common
>> routines of vfio_device_ops for the variant driver to hook and perform 
>> the
>> CXL routines behind it.
>>
>> Besides, several extra APIs are introduced for the variant driver to
>> provide the necessary information the kernel CXL core to initialize
>> the CXL device. E.g., Device DPA.
>>
>> CXL is built upon the PCI layers but with differences. Thus, the
>> vfio-pci-core is aimed to be re-used as much as possible with the
>> awareness of operating on a CXL device.
>>
>> A new VFIO device region is introduced to expose the CXL region to the
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to the userspace.
> 
> 
> 
> Hi Zhi,
> 
> 
> As you know, I was confused with this work but after looking at the
> patchset and thinking about all this, it makes sense now. FWIW, the most
> confusing point was to use the CXL core inside the VM for creating the
> region what implies commits to the CXL root switch/complex and any other
> switch in the path. I realize now it will happen but on emulated
> hardware with no implication to the real one, which was updated with any
> necessary change like those commits by the vfio cxl code in the host (L1
> VM in your tests).
> 
> 
> The only problem I can see with this approach is the CXL initialization
> is left unconditionally to the hypervisor. I guess most of the time will
> be fine, but the driver could not be mapping/using the whole CXL mem
> always.  I know this could be awkward, but possible depending on the
> device state unrelated to CXL itself. 

Will this device states be one-time on/off state or a runtime 
configuration state that a guest need to poke all the time?

There can be two paths for handling these states in a vendor-specific 
variant driver: 1) vfio_device->fops->open() path, it suits for one-time 
on/off state 2) vfio_device->fops->{read, write}(), the VM 
exit->QEMU->variant driver path. The vendor-specific driver can 
configure the HW based on the register access from the guest.

It would be nice to know more about this, like how many registers the 
vendor-specific driver would like to handle. Thus, the VFIO CXL core can 
provide common helpers.

In other words, this approach
> assumes beforehand something which could not be true. What I had in mind
> was to have VM exits for any action on CXL configuration on behalf of
> that device/driver inside the device.
> 

Initially, this was a idea from Dan. I think this would be a good topic 
for the next CXL open-source collaboration meeting. Kevn also commented 
for this.

> 
> This is all more problematic with CXL.cache, and I think the same
> approach can not be followed. I'm writing a document trying to share all
> my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover
> this for sure. As a quick note, while DMA/IOMMU has no limitations
> regarding the amount of memory to use, it is unlikely the same can be
> done due to scarce host snoop cache resources, therefore the CXL.cache
> mappings will likely need to be explicitly done by the driver and
> approved by the CXL core (along with DMA/IOMMU), and for a driver inside
> a VM that implies VM exits.
> 

Good to hear. Please CCme as well. Many thanks.

> 
> Thanks.
> 
> Alejandro.
> 
Jonathan Cameron Oct. 4, 2024, 11:40 a.m. UTC | #9
> >   
> >>
> >> Presumably, the host creates one large CXL region that covers the entire
> >> DPA, while QEMU can virtually partition it into different regions and
> >> map them to different virtual CXL region if QEMU presents multiple HDM
> >> decoders to the guest.  
> > 
> > I'm not sure why it would do that. Can't think why you'd break up
> > a host region - maybe I'm missing something.
> >   
> 
> It is mostly concerning about a device can have multiple HDM decoders. 
> In the current design, a large physical CXL (pCXL) region with the whole 
> DPA will be passed to the userspace. Thinking that the guest will see 
> the virtual multiple HDM decoders, which usually SW is asking for, the 
> guest SW might create multiple virtual CXL regions. In that case QEMU 
> needs to map them into different regions of the pCXL region.

Don't let the guest see multiple HDM decoders?

There is no obvious reason why it would want them other than type
differences.

Why is it useful for a type 2 device to be set up for multiple CXL regions?
It shouldn't be a performance thing. It might be convenient for management,
I guess, but the driver can layer its own allocator etc. on top of a single
region, so I'm not sure I see a reason to do this...
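
(For illustration, "layer its own allocator on top of a single region" 
could be as simple as the genalloc sketch below; the my_accel structure 
and the per-context chunking are made up for the example.)

#include <linux/genalloc.h>
#include <linux/numa.h>

/* one pool backed by the single CXL region the type-2 driver created */
static int my_accel_init_dpa_pool(struct my_accel *acc, unsigned long start,
				  size_t size)
{
	acc->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
	if (!acc->pool)
		return -ENOMEM;

	return gen_pool_add(acc->pool, start, size, NUMA_NO_NODE);
}

/* hand out per-context chunks from that one region */
static unsigned long my_accel_alloc_chunk(struct my_accel *acc, size_t size)
{
	return gen_pool_alloc(acc->pool, size);
}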

Jonathan
Zhi Wang Oct. 19, 2024, 5:30 a.m. UTC | #10
On 04/10/2024 14.40, Jonathan Cameron wrote:
> 
>>>
>>>>
>>>> Presumably, the host creates one large CXL region that covers the entire
>>>> DPA, while QEMU can virtually partition it into different regions and
>>>> map them to different virtual CXL region if QEMU presents multiple HDM
>>>> decoders to the guest.
>>>
>>> I'm not sure why it would do that. Can't think why you'd break up
>>> a host region - maybe I'm missing something.
>>>
>>
>> It is mostly concerning about a device can have multiple HDM decoders.
>> In the current design, a large physical CXL (pCXL) region with the whole
>> DPA will be passed to the userspace. Thinking that the guest will see
>> the virtual multiple HDM decoders, which usually SW is asking for, the
>> guest SW might create multiple virtual CXL regions. In that case QEMU
>> needs to map them into different regions of the pCXL region.
> 
> Don't let the guest see multiple HDM decoders?
> 
> There is no obvious reason why it would want them other than type
> differences.
> 
> Why is it useful for a type 2 device to be setup for multiple CXL regions?
> It shouldn't be a performance thing. Might be convenient for management
> I guess, but the driver can layer it's own allocator etc on top of a single
> region so I'm not sure I see a reason to do this...
> 

Sorry for the late reply as I was confirming this requirement with 
folks. It makes sense to have only one HDM decoder for the guest CXL 
type-2 device driver. I think it is similar to efx_cxl according to the 
code. Alejandro, it would be nice if you could confirm this.

Thanks,
Zhi.
> Jonathan
> 
>
Zhi Wang Oct. 21, 2024, 10:49 a.m. UTC | #11
On 21/09/2024 1.34, Zhi Wang wrote:

Hi folks:

Thanks so much for the comments and discussions in the mail and the 
collaboration meeting. Here is an update on the major opens raised and 
the conclusions/next steps:

1) It is not necessary to support multiple virtual HDM decoders for a 
CXL type-2 device. (Jonathan)

I was asking SW folks about the requirement for multiple HDM decoders 
in a CXL type-2 device driver. It seems one is enough, which is 
reasonable, because the CXL region created by the type-2 device driver 
is mostly kept for its own private use.

2) Pre-created vs post-created CXL region for the guest. 
(Dan/Kevin/Alejandro)

There has been a discussion about when to create the CXL region for the guest.

Option a: The pCXL region is pre-created before the VM boots. When a guest 
creates the CXL region via its virtual HDM decoder, QEMU maps the pCXL 
region to the virtual CXL region configured by the guest. Changes and 
re-configuration of the pCXL region are not expected.

Option b: The pCXL region will be (re)created when a guest creates the 
CXL region via its virtual HDM decoder. QEMU traps the HDM decoder 
commits, triggers the pCXL region creation, and maps the pCXL region to 
the virtual CXL region (a rough sketch of this trap path follows the 
next-step list below).

Alejandro (option b):
- Will write a doc to elaborate the problem of CXL.cache and why option 
b should be chosen.

Kevin (option b):
- CXL region is a SW concept, it should be controlled by the guest SW.

Dan (option a):
- Error handling when creating the pCXL region at runtime, e.g. the 
available HPA in the FWMS on the host running out when creating the 
pCXL region.
- CXL.cache might need extra handling which cannot be done at runtime. 
(Need to check Alejandro's doc)

Next step:

- Will check with Alejandro and start from his doc about CXL.cache concerns.
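
To make option b concrete, here is a rough sketch of the trap path on the 
QEMU side. The VFIO_DEVICE_CXL_CREATE_REGION ioctl, the VCXLDevice state 
and all vcxl_*() helpers are hypothetical placeholders; they are not part 
of the posted patches or of any agreed vfio-cxl UABI:

/* MemoryRegionOps write handler for the virtual HDM decoder registers */
static void vcxl_hdm_decoder_write(void *opaque, hwaddr addr,
                                   uint64_t val, unsigned size)
{
    VCXLDevice *d = opaque;
    struct vcxl_create_region arg;

    vcxl_update_vreg(d, addr, val, size);

    /* act only when the guest commits its virtual HDM decoder */
    if (addr != VCXL_HDM_DECODER0_CTRL || !(val & VCXL_HDM_COMMIT)) {
        return;
    }

    arg.dpa_base = vcxl_decoder_dpa_base(d);
    arg.size = vcxl_decoder_size(d);

    /* ask the host vfio-cxl core to (re)create the pCXL region ... */
    if (ioctl(d->vfio_device_fd, VFIO_DEVICE_CXL_CREATE_REGION, &arg) < 0) {
        vcxl_set_decoder_error(d);   /* surface the failure to the guest */
        return;
    }

    /* ... then map it at the GPA the guest programmed into the decoder */
    vcxl_map_pcxl_region(d, vcxl_decoder_hpa_base(d), arg.size);
    vcxl_set_decoder_committed(d);
}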

3) Is this exclusively a type2 extension or how do you envision type1/3
devices with vfio? (Alex)

For type-3 device passthrough, due to its nature as a memory expander, CXL 
folks have decided to use either virtio-mem or another stub driver in 
QEMU to manage/map the memory to the guest.

For type-1 devices, I am not aware of any on the market. 
Dan commented in the CXL discord group:

"my understanding is that some of the CXL FPGA kits offer Type-1 flows, 
but those are for custom solutions not open-market. I am aware of some 
private deployments of such hardware, but nothing with an upstream driver."

My take is that we don't need to consider supporting type-1 device 
passthrough for now.

Z.

Alejandro Lucero Palau Oct. 21, 2024, 11:07 a.m. UTC | #12
On 10/19/24 06:30, Zhi Wang wrote:
> On 04/10/2024 14.40, Jonathan Cameron wrote:
>>
>>>>> Presumably, the host creates one large CXL region that covers the entire
>>>>> DPA, while QEMU can virtually partition it into different regions and
>>>>> map them to different virtual CXL region if QEMU presents multiple HDM
>>>>> decoders to the guest.
>>>> I'm not sure why it would do that. Can't think why you'd break up
>>>> a host region - maybe I'm missing something.
>>>>
>>> It is mostly concerning about a device can have multiple HDM decoders.
>>> In the current design, a large physical CXL (pCXL) region with the whole
>>> DPA will be passed to the userspace. Thinking that the guest will see
>>> the virtual multiple HDM decoders, which usually SW is asking for, the
>>> guest SW might create multiple virtual CXL regions. In that case QEMU
>>> needs to map them into different regions of the pCXL region.
>> Don't let the guest see multiple HDM decoders?
>>
>> There is no obvious reason why it would want them other than type
>> differences.
>>
>> Why is it useful for a type 2 device to be setup for multiple CXL regions?
>> It shouldn't be a performance thing. Might be convenient for management
>> I guess, but the driver can layer it's own allocator etc on top of a single
>> region so I'm not sure I see a reason to do this...
>>
> Sorry for the late reply as I were confirming the this requirement with
> folks. It make sense to have only one HDM decoder for the guest CXL
> type-2 device driver. I think it is similar to efx_cxl according to the
> code. Alejandro, it would be nice you can confirm this.


Not sure if how sfc does this should be taken as any confirmation of 
expected behavior/needs, but we plan to have just one HDM decoder and 
one region covering it all.


But maybe it is important to make clear the distinction between an HDM 
decoder and CXL regions linked to it. In other words, I can foresee 
different regions programmed by the driver because the way coherency is 
going to be handled by the device changes for each region. So supporting 
multiple regions makes sense with just one decoder. Not sure if we 
should rely on a use case for supporting more than one region ...


> Thanks,
> Zhi.
>> Jonathan
>>
>>
Alejandro Lucero Palau Oct. 21, 2024, 1:10 p.m. UTC | #13
Hi Zhi,


Some comments below.


On 10/21/24 11:49, Zhi Wang wrote:
> On 21/09/2024 1.34, Zhi Wang wrote:
>
> Hi folks:
>
> Thanks so much for the comments and discussions in the mail and
> collaboration meeting. Here are the update of the major opens raised and
> conclusion/next steps:
>
> 1) It is not necessary to support the multiple virtual HDM decoders for
> CXL type-2 device. (Jonathan)
>
> Was asking SW folks around about the requirement of multiple HDM
> decoders in a CXL type-2 device driver. It seems one is enough, which is
> reasonable, because the CXL region created by the type-2 device driver
> is mostly kept for its own private use.
>
> 2) Pre-created vs post-created CXL region for the guest.
> (Dan/Kevin/Alejandro)
>
> There has been a discussion about when to create CXL region for the guest.
>
> Option a: The pCXL region is pre-created before VM boots. When a guest
> creates the CXL region via its virtual HDM decoder, QEMU maps the pCXL
> region to the virtual CXL region configured by the guest. Changes and
> re-configuration of the pCXL region is not expected.
>
> Option b: The pCXL region will be (re)created when a guest creates the
> CXL region via its virtual HDM decoder. QEMU traps the HDM decoder
> commits, triggers the pCXL region creation, maps the pCXL to the virtual
> CXL region.
>
> Alejandro (option b):
> - Will write a doc to elaborate the problem of CXL.cache and why option
> b should be chosen.
>
> Kevin (option b):
> - CXL region is a SW concept, it should be controlled by the guest SW.
>
> Dan (option a):
> - Error handling when creating the pCXL region at runtime. E.g.
> Available HPA in the FWMS in the host is running out when creating the
> pCXL region


I think there is nothing option b cannot do, including any error 
handling. Available HPA can change, but this is no different from how it 
is handled for host devices trying to get an HPA range concurrently.


> - CXL.cache might need extra handling which cannot be done at runtime.
> (Need to check Alejandro's doc)
>
> Next step:
>
> - Will check with Alejandro and start from his doc about CXL.cache concerns.


Working on it. Hopefully a first draft next week.


> 3) Is this exclusively a type2 extension or how do you envision type1/3
> devices with vfio? (Alex)
>
> For type-3 device passthrough, due to its nature of memory expander, CXL
> folks have decided to use either virtio-mem or another stub driver in
> QEMU to manage/map the memory to the guest.
>
> For type-1 device, I am not aware of any type-1 device in the market.
> Dan commented in the CXL discord group:
>
> "my understanding is that some of the CXL FPGA kits offer Type-1 flows,
> but those are for custom solutions not open-market. I am aware of some
> private deployments of such hardware, but nothing with an upstream driver."
>
> My take is that we don't need to consider support type-1 device
> passthrough so far.


I cannot see a difference between Type1 and Type2 regarding CXL.cache 
support. Once we have a solution for Type2, that should be fine for Type1.

Thanks,

Alejandro


>
> Z.
>
Zhi Wang Oct. 30, 2024, 11:56 a.m. UTC | #14
On Fri, 20 Sep 2024 15:34:33 -0700
Zhi Wang <zhiw@nvidia.com> wrote:

Hi folks:

As Kevin and Alex pointed out, we need to discuss the virtualization
policies for the registers that are handled by the vfio-cxl core. Here
is my summary of the virtualization policies for the configuration
space registers.

For CXL MMIO registers, I am leaning towards the vfio-cxl core respecting
the BAR Virtualization ACL Register Block (8.2.6 BAR Virtualization ACL
Register Block), which lists the register ranges that can be safely
passed through to the guest. For registers that are not in the
Virtualization ACL Register Block, or if a device doesn't present a BAR
Virtualization ACL Register Block, the handling is left to the variant
driver.
Besides, 

Feel free to comment.

Z.

----

8.1 Configuration Space Registers
=================================

8.1.3 PCIE DVSEC for CXL Devices
==================================

- DVS Header 1 (0x4)
All bits are RO, emulated as RO and initial values are from the HW.

- DVS Header 2 (0x8)
All bits are RO, emulated as RO and initial values are from the HW.

- DVSEC CXL Capability (0xa)

Override values from the HW:

Bit [5:4] RO - HDM_Count:
Support only one HDM decoder

Bit [12] RO - TSP Capable:
Not supported in v1

Other bits are RO, emulated, and initial values are from the HW.

- DVSEC CXL Control (0xc)

Bit [0] RWL - Cache_Enable:
Emulated as RO if CONFIG_LOCK bit is set, otherwise RW.

Bit [1] RO - IO_Enable:
Emulated as RO.

Bit [12:2] RWL:
Emulated as RO when CONFIG_LOCK bit is set, otherwise RW.

Bit [13] Rsvd:
Emulated as RO.

Bit [14] RWL:
Emulated as RO when CONFIG_LOCK bit is set, otherwise RW.

Bit [15] Rsvd:
Emulated as RO.

- DVSEC CXL Status (0xe)

Bit [14] RW1CS - Viral_Status:
Emulated as a write-1-to-clear bit.

Other bits are Rsvd and emulated as RO, initial values are from the HW.

- DVSEC CXL Control 2 (0x10)

Bit [0] RW - Disable Caching:
Disable caching on the HW when the VM writes 1 to this bit.

Bit [1] RW - Initiate Cache Write Back and Invalidation:
Trigger the cache writeback and invalidation via the Linux CXL core, and
update the Cache Invalid bit in DVSEC CXL Status 2.

Bit [2] RW - Initiate CXL Reset:
Trigger the CXL reset via the Linux CXL core, and update the CXL Reset
Complete and CXL Reset Error bits in DVSEC CXL Status 2.

Bit [3] RW - CXL Reset Mem Clr Enable:
Used as a parameter when triggering the CXL reset via the Linux CXL core.

Bit [4] Desired Volatile HDM State after Hot Reset - RWS/RO
Write the bit on the HW or via Linux CXL core.

Bit [5] Modified Completion Enable - RW/RO
Write the bit on the HW or via Linux CXL core.

Other bits are Rsvd, emulated as RO and initial values are from the HW.

- DVSEC CXL Status 2 (0x12)
Bit [0] RO - Cache Invalid:
Updated when emulating DVSEC CXL Control 2.

Bit [1] RO - CXL Reset Complete:
Updated when emulating DVSEC CXL Control 2.

Bit [2] RO - CXL Reset Error:
Updated when emulating DVSEC CXL Control 2.

Bit [3] RW1CS/RsvdZ - Volatile HDM Preservation Error:
Read the bit from the HW.

Bit [14:4] Rsvd:
Emulated as RO and initial values are from the HW.

Bit [15] RO - Power Management Initialization Complete:
Read the bit from the HW.

DVSEC CXL Capability2 (0x16)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 1 Size High (0x18)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 1 Size Low (0x1c)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 1 Base High (0x20)
Emulated as RW

DVSEC CXL Range 1 Base Low (0x24)
Emulated as RW

DVSEC CXL Range 2 Size High (0x28)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 2 Size Low (0x2c)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 2 Base High (0x30)
Emulated as RW

DVSEC CXL Range 2 Base Low (0x34)
Emulated as RW

DVSEC CXL Capability 3 (0x38)
All bits are RO, emulated as RO and initial values are from the HW.

8.1.4 Non-CXL Function Map DVSEC
================================
Not supported

8.1.5 CXL Extensions DVSEC for Ports
====================================
For root port/switches, no need to support in type-2 device passthrough.

8.1.6 GPF DVSEC for CXL Port
============================
For root port/switches, no need to support in type-2 device passthrough.

8.1.7 GPF DVSEC for CXL Device
==============================

DVS Header 1 (0x4)
All bits are RO, emulated as RO and initial values are from the HW.

DVS Header 2 (0x8)
All bits are RO, emulated as RO and initial values are from the HW.

GPF Phase 2 Duration (0xa)
All bits are RO, emulated as RO and initial values are from the HW.

GPF Phase 2 Power (0xc)
All bits are RO, emulated as RO and initial values are from the HW.

8.1.8 PCIE DVSEC for Flex Bus Port
==================================
For root port/switches, no need to support in type-2 device passthrough.

8.1.9 Register Locator DVSEC
============================

DVS Header 1 (0x4)
All bits are RO, emulated as RO and initial values are from the HW.

DVS Header 2 (0x8)
All bits are RO, emulated as RO and initial values are from the HW.

Register Block 1-3 (Varies)
All bits are RO, emulated as RO and initial values are from the HW.

8.1.10 MLD DVSEC
================
Not supported. Mostly this is for type-3 device.

8.1.11 Table Access DOE
=======================
Coupled with QEMU DOE emulation.

8.1.12 Memory Device Configuration Space Layout
===============================================
Not supported. Mostly this is for type-3 devices.

8.1.13 Switch Mailbox CCI Configuration Space Layout
====================================================
Not supported. This is for switches.
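
As an illustration of how such a per-register policy could be table-driven
in the vfio-cxl core, here is a sketch for DVSEC CXL Control (0xc). Every
name below is made up for the example; none of it comes from the posted
patches:

#include <linux/types.h>

enum cxl_vreg_policy {
	CXL_V_RO,	/* emulated read-only, initial value from the HW */
	CXL_V_RW,	/* emulated read-write, never reaches the HW */
	CXL_V_RW1CS,	/* write-1-to-clear, sticky */
	CXL_V_RWL,	/* RW until CONFIG_LOCK is set, then RO */
	CXL_V_HOOK,	/* needs a callback, e.g. triggering a CXL reset */
};

struct cxl_vreg_field {
	u16 offset;			/* offset within the CXL DVSEC */
	u8 first_bit, last_bit;		/* bit range the policy covers */
	enum cxl_vreg_policy policy;
};

/* DVSEC CXL Control (0xc), following the summary above */
static const struct cxl_vreg_field dvsec_cxl_control[] = {
	{ 0xc,  0,  0, CXL_V_RWL },	/* Cache_Enable */
	{ 0xc,  1,  1, CXL_V_RO  },	/* IO_Enable */
	{ 0xc,  2, 12, CXL_V_RWL },
	{ 0xc, 13, 13, CXL_V_RO  },	/* Rsvd */
	{ 0xc, 14, 14, CXL_V_RWL },
	{ 0xc, 15, 15, CXL_V_RO  },	/* Rsvd */
};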

