
[RFC,0/4] cxl: introduce CXL Virtualization module

Message ID 20231228060510.1178981-1-dongsheng.yang@easystack.cn

Message

Dongsheng Yang Dec. 28, 2023, 6:05 a.m. UTC
Hi all:
	This patchset introduces the cxlv module, which allows users to
create virtual CXL devices. It is based on linux 6.7-rc5; you can
get the code from https://github.com/DataTravelGuide/linux

	As real CXL devices are not widely available yet, we need a
virtual CXL device for upper-layer software development and
testing. QEMU is good for functional testing, but not well suited
for some performance testing.

	The new CXLV module allows users to use reserved RAM[1] to
create virtual CXL devices. When the cxlv module is loaded, it will
create a directory named "cxl_virt" under /sys/devices/virtual:

	"/sys/devices/virtual/cxl_virt/"

That is the top-level device for all cxlv devices.
At the same time, the cxlv module will create a debugfs directory:

/sys/kernel/debug/cxl/cxlv
├── create
└── remove

The create and remove debugfs files are the cxlv entry points for
creating and removing cxlv devices.
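
As an illustrative sketch of removal (hypothetical: the exact token the
remove file expects is defined by the patchset; "cxlv0" is an assumed
device name):

 $ echo "cxlv0" > /sys/kernel/debug/cxl/cxlv/remove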

	Each cxlv device has its own virtual PCI bridge and bus. cxlv
creates a new root_port for the new cxlv device and sets up CXL ports
for the dport and the nvdimm-bridge. After that, it adds the virtual
PCI device, which goes through cxl_pci_probe to set up a new memdev.
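
The resulting virtual hierarchy can be inspected like any other CXL
topology, for example (a sketch; output will vary):

 $ cxl list -vvv    # buses, ports, decoders and memdevs, including the virtual ones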

	Then we can see the CXL device with cxl list and use it like a
real CXL device.

 $ echo "memstart=$((8*1024*1024*1024)),cxltype=3,pmem=1,memsize=$((2*1024*1024*1024))" > /sys/kernel/debug/cxl/cxlv/create
 $ cxl list
[
  {
    "memdev":"mem0",
    "pmem_size":1879048192,
    "serial":0,
    "numa_node":0,
    "host":"0010:01:00.0"
  }
]
 $ cxl create-region -m mem0 -d decoder0.0 -t pmem
{
  "region":"region0",
  "resource":"0x210000000",
  "size":"1792.00 MiB (1879.05 MB)",
  "type":"pmem",
  "interleave_ways":1,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region

 $ ndctl create-namespace -r region0 -m fsdax --map dev -t pmem -b 0
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"1762.00 MiB (1847.59 MB)",
  "uuid":"686fd289-a252-42cf-a3a5-95a39ed5c9d5",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}

 $ mkfs.xfs -f /dev/pmem0 
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=112768 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=451072, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
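
A typical next step (not part of the transcript above; a sketch assuming
a /mnt mount point) is to mount the filesystem with DAX enabled:

 $ mount -o dax /dev/pmem0 /mnt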

Any comments are welcome!

TODO: implement a cxlv command in ndctl for cxlv device management.

[1]: Add the kernel command line argument "memmap=nn[KMG]$ss[KMG]";
details are in Documentation/driver-api/cxl/memory-devices.rst
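
For example, reserving 8 GiB of RAM starting at the 8 GiB physical offset
(matching the memstart used in the create example above) looks like:

 memmap=8G$8G

Depending on the bootloader, the "$" may need escaping (e.g. "memmap=8G\$8G"
in a GRUB config file).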

Thanx

Dongsheng Yang (4):
  cxl: move some function from acpi module to core module
  cxl/port: allow dport host to be driver-less device
  cxl/port: introduce cxl_disable_port() function
  cxl: introduce CXL Virtualization module

 MAINTAINERS                         |   6 +
 drivers/cxl/Kconfig                 |  11 +
 drivers/cxl/Makefile                |   1 +
 drivers/cxl/acpi.c                  | 143 +-----
 drivers/cxl/core/port.c             | 231 ++++++++-
 drivers/cxl/cxl.h                   |   6 +
 drivers/cxl/cxl_virt/Makefile       |   5 +
 drivers/cxl/cxl_virt/cxlv.h         |  87 ++++
 drivers/cxl/cxl_virt/cxlv_debugfs.c | 260 ++++++++++
 drivers/cxl/cxl_virt/cxlv_device.c  | 311 ++++++++++++
 drivers/cxl/cxl_virt/cxlv_main.c    |  67 +++
 drivers/cxl/cxl_virt/cxlv_pci.c     | 710 ++++++++++++++++++++++++++++
 drivers/cxl/cxl_virt/cxlv_pci.h     | 549 +++++++++++++++++++++
 drivers/cxl/cxl_virt/cxlv_port.c    | 149 ++++++
 14 files changed, 2388 insertions(+), 148 deletions(-)
 create mode 100644 drivers/cxl/cxl_virt/Makefile
 create mode 100644 drivers/cxl/cxl_virt/cxlv.h
 create mode 100644 drivers/cxl/cxl_virt/cxlv_debugfs.c
 create mode 100644 drivers/cxl/cxl_virt/cxlv_device.c
 create mode 100644 drivers/cxl/cxl_virt/cxlv_main.c
 create mode 100644 drivers/cxl/cxl_virt/cxlv_pci.c
 create mode 100644 drivers/cxl/cxl_virt/cxlv_pci.h
 create mode 100644 drivers/cxl/cxl_virt/cxlv_port.c

Comments

Ira Weiny Jan. 3, 2024, 5:22 p.m. UTC | #1
Dongsheng Yang wrote:
> Hi all:
> 	This patchset introduce cxlv module to allow user to
> create virtual cxl device. it's based linux6.7-rc5, you can
> get the code from https://github.com/DataTravelGuide/linux
> 
> 	As the real CXL device is not widely available now, we need
> some virtual cxl device to do uplayer software developing or
> testing. Qemu is good for functional testing, but not good
> for some performance testing.

Do you have more details on what performance is missing from Qemu and why
this solution is better than a solution to fix Qemu?

Long term it seems better to fix Qemu for this type of work.

Are there other advantages to having this additional test infrastructure
in the kernel?  We already have cxl_test.

Ira

Dan Williams Jan. 3, 2024, 8:48 p.m. UTC | #2
Dongsheng Yang wrote:
> Hi all:
> 	This patchset introduce cxlv module to allow user to
> create virtual cxl device. it's based linux6.7-rc5, you can
> get the code from https://github.com/DataTravelGuide/linux
> 
> 	As the real CXL device is not widely available now, we need
> some virtual cxl device to do uplayer software developing or
> testing. Qemu is good for functional testing, but not good
> for some performance testing.

How is it performance testing if it's just using host-DRAM? Is the use
case something like pinning the benchmark on Socket0 and target DRAM on
Socket1 as emulated CXL to approximate CXL bus latency?
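
A minimal sketch of that pinning approach, assuming numactl on a
two-socket host (the benchmark name is a placeholder):

 $ numactl --cpunodebind=0 --membind=1 ./benchmark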

> 
> 	The new CXLV module allow user to use the reserved RAM[1], to
> create virtual cxl device. When the cxlv module load, it will
> create a directory named as "cxl_virt" under /sys/devices/virtual:
> 
> 	"/sys/devices/virtual/cxl_virt/"
> 
> that's the top level device for all cxlv devices.
> At the same time, cxlv module will create a debugfs directory:
> 
> /sys/kernel/debug/cxl/cxlv
> ├── create
> └── remove
> 
> the create and remove debugfs file is the cxlv entry to create or remove
> a cxlv device.
> 
> 	Each cxlv device have its owned virtual pci related bridge and bus, cxlv
> will create a new root_port for the new cxlv device, setup cxl ports for
> dport and nvdimm-bridge. After that, we will add the virtual pci device,
> that will go into the cxl_pci_probe to setup new memdev.
> 
> 	Then we can see the cxl device with cxl list and use it as a real cxl
> device.
> 
>  $ echo "memstart=$((8*1024*1024*1024)),cxltype=3,pmem=1,memsize=$((2*1024*1024*1024))" > /sys/kernel/debug/cxl/cxlv/create

Are these ranges reserved out of the mmap at boot time? 

[..]
>  14 files changed, 2388 insertions(+), 148 deletions(-)

This seems like a lot of code for something that is mostly already
supported by tools/testing/cxl/ (cxl_test). That too creates virtual CXL
devices that support ABI flows that are difficult to support in QEMU.
The only thing missing for "performance / functional emulation" testing
today is backing the memory regions with accessible memory rather than
unusable address space.

It is also the case that the static nature of cxl_test topology
definition has already started to prove too limiting for some tests. So
an enhancement to make cxl_test more dynamic like your proposed command
interface is appealing.

One change to get cxl_test to emulate with DRAM rather than fake
address space is to fill cxl_mock_pool with addresses backed by DRAM
rather than addresses from an unused portion of the physical address
map.

Currently cxl_test defines address ranges that may be larger than what a
host or VM can support, so this would be a new cxl_test mode limited by
available / reserved memory capacity.

See cxl-create-region.sh for an example of the virtual CXL regions that
cxl_test creates today:

https://github.com/pmem/ndctl/blob/main/test/cxl-create-region.sh
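
A minimal sketch of exercising that existing infrastructure (assuming a
kernel with the tools/testing/cxl mock modules built and installed):

 $ modprobe cxl_test
 $ cxl list -M -b cxl_test    # memdevs on the emulated "cxl_test" bus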
Jonathan Cameron Jan. 8, 2024, 12:28 p.m. UTC | #3
On Wed, 3 Jan 2024 09:22:36 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> Dongsheng Yang wrote:
> > Hi all:
> > 	This patchset introduce cxlv module to allow user to
> > create virtual cxl device. it's based linux6.7-rc5, you can
> > get the code from https://github.com/DataTravelGuide/linux
> > 
> > 	As the real CXL device is not widely available now, we need
> > some virtual cxl device to do uplayer software developing or
> > testing. Qemu is good for functional testing, but not good
> > for some performance testing.  
> 
> Do you have more details on what performance is missing from Qemu and why
> this solution is better than a solution to fix Qemu?
> 
> Long term it seems better to fix Qemu for this type of work.

I plan to look at this sometime soon, but note that the fix will cover special
cases only (no interleave!) in the short term.  Emulating interleave
is always going to be costly - it can probably be done better than we have
today for large granularity (pages), but I'm not sure we will ever care enough
to implement that.

For virtualization use cases, if we go with CXL emulation as the path for
DCD then we'll just emulate direct-connected devices and patch up the
perf characteristics to cover interleave, switches etc.

With such limitations we can get QEMU to perform well. I'm not keen on
separating QEMU for functional testing from QEMU for workload testing
but meh, we can at least make it automatic to use a higher perf root
if the interleave config allows it.

Dongsheng Yang Jan. 10, 2024, 2:07 a.m. UTC | #4
On Thursday, Jan 4, 2024 at 1:22 AM, Ira Weiny wrote:
> Dongsheng Yang wrote:
>> Hi all:
>> 	This patchset introduce cxlv module to allow user to
>> create virtual cxl device. it's based linux6.7-rc5, you can
>> get the code from https://github.com/DataTravelGuide/linux
>>
>> 	As the real CXL device is not widely available now, we need
>> some virtual cxl device to do uplayer software developing or
>> testing. Qemu is good for functional testing, but not good
>> for some performance testing.
> 
> Do you have more details on what performance is missing from Qemu and why
> this solution is better than a solution to fix Qemu?
> 
> Long term it seems better to fix Qemu for this type of work.
> 
> Are their other advantages to having this additional test infrastructure
> in the kernel?  We already have cxl_test.

Hi Ira,
	Let me explain more about what I mean by "qemu is not good for some
performance testing". cxlv is not designed to test the cxl driver itself;
it is used for performance testing of upper-layer software. I can
give an example with performance data:

(1) fio against /dev/dax0.0 in qemu
Run qemu with memory-backend-ram, create a region and a namespace in
devdax mode, and then run fio with the dev-dax ioengine, using the fio job file below[1].
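
A sketch of that setup (assuming a pmem-capable memdev as in the cover
letter; device and decoder names will vary):

 $ cxl create-region -m mem0 -d decoder0.0 -t pmem
 $ ndctl create-namespace -r region0 -m devdax    # yields /dev/dax0.0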

The fio result in qemu is detailed in [2]; the average IOPS is avg=1919.26.

(2) fio against /dev/dax0.0 on the native host with cxlv
Use cxlv to create a cxl device, create a region and a namespace in
devdax mode, then run fio with the same fio job file [1].

The fio result on the host is detailed in [3]; the average IOPS is avg=1510391.68.


Now you can see the resulting IOPS is about 1500K vs 1.9K.

I can explain more about why this matters. I am doing another project in
the block device layer, named cbd (cxl block device). It uses a cxl memdev
as a cache in front of another block device used as backing storage. It
works similarly to bcache, but is newly designed for cxl memory devices,
which are byte addressable and have very low latency. So I need a fast cxl
device to verify that my design in the upper layer is working well, e.g.
indexing. qemu is too slow for this kind of performance testing. I don't
think we can "fix" that; it's not what qemu needs to do.

So when I say qemu is not good for performance testing, I am not saying I
want some performance improvement of the cxl implementation in qemu; I mean
that the whole qemu approach is not suitable for latency-sensitive testing.

Thanx




[1]:
[global]
bs=1K
ioengine=dev-dax
norandommap
time_based
runtime=10
group_reporting
disable_lat=1
disable_slat=1
disable_clat=1
clat_percentiles=0
cpus_allowed_policy=split

# For the dev-dax engine:
#
#   IOs always complete immediately
#   IOs are always direct
#
iodepth=1
direct=0
thread
numjobs=1
#
# The dev-dax engine does IO to DAX device that are special character
# devices exported by the kernel (e.g. /dev/dax0.0). The device is
# opened normally and then the region is accessible via mmap. We do
# not use the O_DIRECT flag because the device is naturally direct
# access. The O_DIRECT flags will result in failure. The engine
# access the underlying NVDIMM directly once the mmapping is setup.
#
# Check the alignment requirement of your DAX device. Currently the default
# should be 2M. Blocksize (bs) should meet alignment requirement.
#
# An example of creating a dev dax device node from pmem:
# ndctl create-namespace --reconfig=namespace0.0 --mode=dax --force
#
filename=/dev/dax0.0

[dev-dax-write]
rw=randwrite
stonewall

[2]:
# fio ./dax.fio
dev-dax-write: (g=0): rw=randwrite, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) 1024B-1024B, ioengine=dev-dax, iodepth=1
fio-3.36
Starting 1 thread
Jobs: 1 (f=1): [w(1)][100.0%][w=1929KiB/s][w=1929 IOPS][eta 00m:00s]

dev-dax-write: (groupid=0, jobs=1): err= 0: pid=1198: Tue Jan  9 10:17:21 2024
   write: IOPS=1917, BW=1918KiB/s (1964kB/s)(18.7MiB/10001msec); 0 zone resets
    bw (  KiB/s): min= 1700, max= 1944, per=100.00%, avg=1919.26, stdev=54.14, samples=19
    iops        : min= 1700, max= 1944, avg=1919.26, stdev=54.14, samples=19
   cpu          : usr=99.97%, sys=0.00%, ctx=12, majf=0, minf=126
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,19181,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=1918KiB/s (1964kB/s), 1918KiB/s-1918KiB/s (1964kB/s-1964kB/s), io=18.7MiB (19.6MB), run=10001-10001msec

[3]:
# fio ./dax.fio
dev-dax-write: (g=0): rw=randwrite, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) 1024B-1024B, ioengine=dev-dax, iodepth=1
fio-3.36
Starting 1 thread
Jobs: 1 (f=1): [w(1)][100.0%][w=1480MiB/s][w=1515k IOPS][eta 00m:00s]

dev-dax-write: (groupid=0, jobs=1): err= 0: pid=41999: Tue Jan  9 18:11:18 2024
   write: IOPS=1510k, BW=1474MiB/s (1546MB/s)(14.4GiB/10000msec); 0 zone resets
    bw (  MiB/s): min= 1418, max= 1480, per=100.00%, avg=1474.99, stdev=13.83, samples=19
    iops        : min=1452406, max=1515908, avg=1510391.68, stdev=14156.58, samples=19
   cpu          : usr=99.82%, sys=0.00%, ctx=22, majf=0, minf=899
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,15096228,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=1474MiB/s (1546MB/s), 1474MiB/s-1474MiB/s (1546MB/s-1546MB/s), io=14.4GiB (15.5GB), run=10000-10000msec
Dan Williams Jan. 25, 2024, 3:49 a.m. UTC | #5
Dongsheng Yang wrote:
> 
> 
> > On Thursday, Jan 4, 2024 at 4:48 AM, Dan Williams wrote:
> > Dongsheng Yang wrote:
> >> Hi all:
> >> 	This patchset introduce cxlv module to allow user to
> >> create virtual cxl device. it's based linux6.7-rc5, you can
> >> get the code from https://github.com/DataTravelGuide/linux
> >>
> >> 	As the real CXL device is not widely available now, we need
> >> some virtual cxl device to do uplayer software developing or
> >> testing. Qemu is good for functional testing, but not good
> >> for some performance testing.
> > 
> > How is it performance testing if it's just using host-DRAM? Is the use
> > case something like pinning the benchmark on Socket0 and target DRAM on
> > Socket1 as emulated CXL to approximate CXL bus latency?
> 
> Hi Dan,
> 	I give an example as below, please check it inline.
> > 
> >>
> >> 	The new CXLV module allow user to use the reserved RAM[1], to
> >> create virtual cxl device. When the cxlv module load, it will
> >> create a directory named as "cxl_virt" under /sys/devices/virtual:
> >>
> >> 	"/sys/devices/virtual/cxl_virt/"
> >>
> >> that's the top level device for all cxlv devices.
> >> At the same time, cxlv module will create a debugfs directory:
> >>
> >> /sys/kernel/debug/cxl/cxlv
> >> ├── create
> >> └── remove
> >>
> >> the create and remove debugfs file is the cxlv entry to create or remove
> >> a cxlv device.
> >>
> >> 	Each cxlv device have its owned virtual pci related bridge and bus, cxlv
> >> will create a new root_port for the new cxlv device, setup cxl ports for
> >> dport and nvdimm-bridge. After that, we will add the virtual pci device,
> >> that will go into the cxl_pci_probe to setup new memdev.
> >>
> >> 	Then we can see the cxl device with cxl list and use it as a real cxl
> >> device.
> >>
> >>   $ echo "memstart=$((8*1024*1024*1024)),cxltype=3,pmem=1,memsize=$((2*1024*1024*1024))" > /sys/kernel/debug/cxl/cxlv/create
> > 
> > Are these ranges reserved out of the mmap at boot time?
> 
> Yes, it is reserved by memmap option in boot cmdline. I use memmap=8G$8G.

A faster way to get to a device-dax interface fronting reserved memory
is to use the efi_fake_mem= command line option.

For example:

    efi_fake_mem=4G@13G:0x40000

...assigns 4GB of System-RAM starting at the 13G physical offset with
the EFI_MEMORY_SP attribute. By default the kernel creates device-dax
devices for that dedicated memory.

For dax mapping performance testing you don't need any of the CXL
driver infrastructure since the CXL driver has nothing to do with the
data path.
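
A minimal sketch of that flow (assuming daxctl and fio are installed; the
dax device name will vary):

 $ daxctl list
 $ fio --name=dax-test --ioengine=dev-dax --filename=/dev/dax0.0 \
       --rw=randwrite --bs=1k --thread --time_based --runtime=10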
Dongsheng Yang Jan. 25, 2024, 6:49 a.m. UTC | #6
On Thursday, Jan 25, 2024 at 11:49 AM, Dan Williams wrote:
> Dongsheng Yang wrote:
>>
>>
>> On Thursday, Jan 4, 2024 at 4:48 AM, Dan Williams wrote:
>>> Dongsheng Yang wrote:
>>>> Hi all:
>>>> 	This patchset introduce cxlv module to allow user to
>>>> create virtual cxl device. it's based linux6.7-rc5, you can
>>>> get the code from https://github.com/DataTravelGuide/linux
>>>>
>>>> 	As the real CXL device is not widely available now, we need
>>>> some virtual cxl device to do uplayer software developing or
>>>> testing. Qemu is good for functional testing, but not good
>>>> for some performance testing.
>>>
>>> How is it performance testing if it's just using host-DRAM? Is the use
>>> case something like pinning the benchmark on Socket0 and target DRAM on
>>> Socket1 as emulated CXL to approximate CXL bus latency?
>>
>> Hi Dan,
>> 	I give an example as below, please check it inline.
>>>
>>>>
>>>> 	The new CXLV module allow user to use the reserved RAM[1], to
>>>> create virtual cxl device. When the cxlv module load, it will
>>>> create a directory named as "cxl_virt" under /sys/devices/virtual:
>>>>
>>>> 	"/sys/devices/virtual/cxl_virt/"
>>>>
>>>> that's the top level device for all cxlv devices.
>>>> At the same time, cxlv module will create a debugfs directory:
>>>>
>>>> /sys/kernel/debug/cxl/cxlv
>>>> ├── create
>>>> └── remove
>>>>
>>>> the create and remove debugfs file is the cxlv entry to create or remove
>>>> a cxlv device.
>>>>
>>>> 	Each cxlv device have its owned virtual pci related bridge and bus, cxlv
>>>> will create a new root_port for the new cxlv device, setup cxl ports for
>>>> dport and nvdimm-bridge. After that, we will add the virtual pci device,
>>>> that will go into the cxl_pci_probe to setup new memdev.
>>>>
>>>> 	Then we can see the cxl device with cxl list and use it as a real cxl
>>>> device.
>>>>
>>>>    $ echo "memstart=$((8*1024*1024*1024)),cxltype=3,pmem=1,memsize=$((2*1024*1024*1024))" > /sys/kernel/debug/cxl/cxlv/create
>>>
>>> Are these ranges reserved out of the mmap at boot time?
>>
>> Yes, it is reserved by memmap option in boot cmdline. I use memmap=8G$8G.
> 
> A faster way to get to a device-dax interface fronting reserved memory
> is to use the efi_fake_mem= command line option.
> 
> For example:
> 
>      efi_fake_mem=4G@13G:0x40000
> 
> ...assigns 4GB of System-RAM starting at the 13G physical offset with
> the EFI_MEMORY_SP attribute. By default the kernel creates device-dax
> devices for that dedicated memory.
> 
> For dax mapping performance testing you don't need any of the CXL
> driver infrastructure since the CXL driver has nothing to do with the
> data path.

Thanx for your information. I created cxlv because I think there could be
some other use cases, besides device-dax, for using a cxl memdev. In that
case, we would need to emulate the cxl memdev at the cxl driver level.

If we always use a cxl memdev by creating a region and then creating a
device-dax, I agree we don't need to emulate it at the cxl driver level.

Thanx
>
Dan Williams Jan. 25, 2024, 7:46 a.m. UTC | #7
Dongsheng Yang wrote:
[..]
> > A faster way to get to a device-dax interface fronting reserved memory
> > is to use the efi_fake_mem= command line option.
> > 
> > For example:
> > 
> >      efi_fake_mem=4G@13G:0x40000
> > 
> > ...assigns 4GB of System-RAM starting at the 13G physical offset with
> > the EFI_MEMORY_SP attribute. By default the kernel creates device-dax
> > devices for that dedicated memory.
> > 
> > For dax mapping performance testing you don't need any of the CXL
> > driver infrastructure since the CXL driver has nothing to do with the
> > data path.
> 
> Thanx for your information, I create cxlv because I think there could be 
> some other use cases other than device-dax to use cxl memdev. In that 
> way, we need emulate cxl memdev in cxl driver level.
> 
> If we always use cxl memdev by creating a region and creating a 
> device-dax, I agree we dont need to emulate it in cxl driver level.

Upstream Linux has no apparent demand for cxlv as cxl_test, QEMU CXL,
and/or EFI_MEMORY_SP enumeration of device-dax already covers the need.
Hyeongtak Ji May 3, 2024, 5:12 a.m. UTC | #8
Hello Dongsheng,

Thank you for sharing this work!  I might be a little late, but it would be
helpful if you could answer a few questions below.

On Thu, 28 Dec 2023 06:05:06 +0000 Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> Hi all:
> 	This patchset introduce cxlv module to allow user to
> create virtual cxl device. it's based linux6.7-rc5, you can
> get the code from https://github.com/DataTravelGuide/linux
> 
> 	As the real CXL device is not widely available now, we need
> some virtual cxl device to do uplayer software developing or
> testing. Qemu is good for functional testing, but not good
> for some performance testing.
> 
> 	The new CXLV module allow user to use the reserved RAM[1], to
> create virtual cxl device. When the cxlv module load, it will
> create a directory named as "cxl_virt" under /sys/devices/virtual:
> 
> 	"/sys/devices/virtual/cxl_virt/"
> 
> that's the top level device for all cxlv devices.
> At the same time, cxlv module will create a debugfs directory:
> 
> /sys/kernel/debug/cxl/cxlv
> ├── create
> └── remove
> 
> the create and remove debugfs file is the cxlv entry to create or remove
> a cxlv device.
> 
> 	Each cxlv device have its owned virtual pci related bridge and bus, cxlv
> will create a new root_port for the new cxlv device, setup cxl ports for
> dport and nvdimm-bridge. After that, we will add the virtual pci device,
> that will go into the cxl_pci_probe to setup new memdev.
> 
> 	Then we can see the cxl device with cxl list and use it as a real cxl
> device.
> 
>  $ echo "memstart=$((8*1024*1024*1024)),cxltype=3,pmem=1,memsize=$((2*1024*1024*1024))" > /sys/kernel/debug/cxl/cxlv/create

I tried following your usage (w/ "memmap=8G$8G") but it does not seem to work
well. After creation I got logs like below:

  [   35.484764] PCI host bridge to bus 0010:00
  [   35.485015] pci_bus 0010:00: root bus resource [io  0x0000-0xffff]
  [   35.485446] pci_bus 0010:00: root bus resource [mem 0x00000000-0x7fffffffff]
  [   35.485817] pci_bus 0010:00: root bus resource [bus 00-ff]
  [   35.486126] pci 0010:00:00.0: [7c73:9a6c] type 01 class 0x060400
  [   35.486436] pci 0010:00:00.0: reg 0x10: [mem 0x200100000-0x2001fffff 64bit pref]
  [   35.486875] pci 0010:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
  [   35.487300] pci 0010:01:00.0: [7c73:9a6c] type 00 class 0x050210
  [   35.487745] pci 0010:01:00.0: reg 0x10: [mem 0x200000000-0x2001fffff 64bit pref]
  [   35.488171] pci 0010:00:00.0: PCI bridge to [bus 01-ff]
  [   35.488438] pci 0010:00:00.0:   bridge window [io  0x0000-0x0fff]
  [   35.488756] pci 0010:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
  [   35.489101] pci 0010:00:00.0:   bridge window [mem 0x00000000-0x000fffff pref]
  [   35.489462] pci_bus 0010:01: busn_res: [bus 01-ff] end is updated to 01
  [   35.511966] pcieport 0010:00:00.0: enabling device (0000 -> 0003)
  [   35.512403] pci 0010:00:00.0: enabling device (0000 -> 0003)
  [   35.512738] cxl_pci 0010:01:00.0: enabling device (0000 -> 0002)
  [   35.517755] cxl_virt cxlv0: unsupported cmd: 0x301
  [   35.542026] cxl_virt cxlv0: unsupported cmd: 0x4500
  [   35.543738] cxl_virt cxlv0: unsupported cmd: 0x4500

Is it normal to get "unsupported cmd" here?

>  $ cxl list
> [
>   {
>     "memdev":"mem0",
>     "pmem_size":1879048192,
>     "serial":0,
>     "numa_node":0,
>     "host":"0010:01:00.0"
>   }
> ]

I got the exact same result against `cxl list`.

>  $ cxl create-region -m mem0 -d decoder0.0 -t pmem
> {
>   "region":"region0",
>   "resource":"0x210000000",
>   "size":"1792.00 MiB (1879.05 MB)",
>   "type":"pmem",
>   "interleave_ways":1,
>   "interleave_granularity":256,
>   "decode_state":"commit",
>   "mappings":[
>     {
>       "position":0,
>       "memdev":"mem0",
>       "decoder":"decoder2.0"
>     }
>   ]
> }
> cxl region: cmd_create_region: created 1 region

Instead of a successful region creation, what I got was

  $ cxl create-region -m mem0 -d decoder0.0 -t pmem
  cxl region: create_region: region0: failed to commit decode: No such device or address
  cxl region: cmd_create_region: created 0 regions

How can I reproduce the usage shown in your cover letter?

...snip...

Kind regards,
Hyeongtak