mbox series

[RFC,00/15] Region driver

Message ID 20220413183720.2444089-1-ben.widawsky@intel.com (mailing list archive)
Headers show
Series Region driver | expand

Message

Ben Widawsky April 13, 2022, 6:37 p.m. UTC
Spring cleaning is here and we're starting fresh so I won't be referencing
previous postings and I've removed revision history from commit messages.

This patch series introduces the CXL region driver as well as associated APIs in
CXL core to create and configure regions. Regions are defined by the CXL 2.0
specification [1], a summary follows.

A region surfaces a swath of RAM (persistent or volatile) that appears as normal
memory to the operating system. The memory, unless programmed by BIOS, or a
previous Operating System, is inaccessible until the CXL driver creates a region
for it.A region may be strided (interleave granularity) across multiple devices
(interleave ways). The interleaving may traverse multiple levels of the CXL
hierarchy.

+-------------------------+      +-------------------------+
|                         |      |                         |
|   CXL 2.0 Host Bridge   |      |   CXL 2.0 Host Bridge   |
|                         |      |                         |
|  +------+     +------+  |      |  +------+     +------+  |
|  |  RP  |     |  RP  |  |      |  |  RP  |     |  RP  |  |
+--+------+-----+------+--+      +--+------+-----+------+--+
      |            |                   |               \--
      |            |                   |        +-------+-\--+------+
   +------+    +-------+            +-------+   |       |USP |      |
   |Type 3|    |Type 3 |            |Type 3 |   |       +----+      |
   |Device|    |Device |            |Device |   |     CXL Switch    |
   +------+    +-------+            +-------+   | +----+     +----+ |
                                                | |DSP |     |DSP | |
                                                +-+-|--+-----+-|--+-+
                                                    |          |
                                                +------+    +-------+
                                                |Type 3|    |Type 3 |
                                                |Device|    |Device |
                                                +------+    +-------+

Region verification and programming state are owned by the cxl_region driver
(implemented in the cxl_region module). Much of the region driver is an
implementation of algorithms described in the CXL Type 3 Memory Device Software
Guide [2].

The region driver is responsible for configuring regions found on persistent
capacities in the Label Storage Area (LSA), it will also enumerate regions
configured by BIOS, usually volatile capacities, and will allow for dynamic
region creation (which can then be stored in the LSA). Only dynamically created
regions are implemented thus far.

Dan has previously stated that he doesn't want to merge ABI until the whole
series is posted and reviewed, to make sure we have no gaps. As such, the goal
of posting this series is *not* to discuss the ABI specifically, feedback is of
course welcome. In other wordsIt has been discussed previously. The goal is to find
architectural flaws in the implementation of the ABI that may pose problematic
for cases we haven't yet conceived.

Since region creation is done via sysfs, it is left to userspace to prevent
racing for resource usage. Here is an overview for creating a x1 256M
dynamically created region programming to be used by userspace clients. In this
example, the following topology is used (cropped for brevity):
/sys/bus/cxl/devices/
├── decoder0.0 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.0
├── decoder0.1 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.1
├── decoder1.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/decoder1.0
├── decoder2.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/decoder2.0
├── decoder3.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3/decoder3.0
├── decoder4.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4/decoder4.0
├── decoder5.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5/decoder5.0
├── decoder6.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6/decoder6.0
├── endpoint3 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3
├── endpoint4 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4
├── endpoint5 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5
├── endpoint6 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6
...

1. Select a Root Decoder whose interleave spans the desired interleave config
   - devices, IG, IW, Large enough address space.
   - ie. pick decoder0.0
2. Program the decoders for the endpoints comprising the interleave set.
   - ie. echo $((256 << 20)) > /sys/bus/cxl/devices/decoder3.0
3. Create a region
   - ie. echo $(cat create_pmem_region) >| create_pmem_region
4. Configure a region
   - ie. echo 256 >| interleave_granularity
	 echo 1 >| interleave_ways
	 echo $((256 << 20)) >| size
	 echo decoder3.0 >| target0
5. Bind the region driver to the region
   - ie. echo region0 > /sys/bus/cxl/drivers/cxl_region/bind


[1]: https://www.computeexpresslink.org/download-the-specification
[2]: https://cdrdv2.intel.com/v1/dl/getContent/643805?wapkw=CXL%20memory%20device%20sw%20guide

Ben Widawsky (15):
  cxl/core: Use is_endpoint_decoder
  cxl/core/hdm: Bail on endpoint init fail
  Revert "cxl/core: Convert decoder range to resource"
  cxl/core: Create distinct decoder structs
  cxl/acpi: Reserve CXL resources from request_free_mem_region
  cxl/acpi: Manage root decoder's address space
  cxl/port: Surface ram and pmem resources
  cxl/core/hdm: Allocate resources from the media
  cxl/core/port: Add attrs for size and volatility
  cxl/core: Extract IW/IG decoding
  cxl/acpi: Use common IW/IG decoding
  cxl/region: Add region creation ABI
  cxl/core/port: Add attrs for root ways & granularity
  cxl/region: Introduce configuration
  cxl/region: Introduce a cxl_region driver

 Documentation/ABI/testing/sysfs-bus-cxl       |  96 ++-
 .../driver-api/cxl/memory-devices.rst         |  14 +
 drivers/cxl/Kconfig                           |  10 +
 drivers/cxl/Makefile                          |   2 +
 drivers/cxl/acpi.c                            |  83 ++-
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   4 +
 drivers/cxl/core/hdm.c                        |  44 +-
 drivers/cxl/core/port.c                       | 363 ++++++++--
 drivers/cxl/core/region.c                     | 669 ++++++++++++++++++
 drivers/cxl/cxl.h                             | 168 ++++-
 drivers/cxl/mem.c                             |   7 +-
 drivers/cxl/region.c                          | 333 +++++++++
 drivers/cxl/region.h                          | 105 +++
 include/linux/ioport.h                        |   1 +
 kernel/resource.c                             |  11 +-
 tools/testing/cxl/Kbuild                      |   1 +
 tools/testing/cxl/test/cxl.c                  |   2 +-
 18 files changed, 1810 insertions(+), 104 deletions(-)
 create mode 100644 drivers/cxl/core/region.c
 create mode 100644 drivers/cxl/region.c
 create mode 100644 drivers/cxl/region.h


base-commit: 7dc1d11d7abae52aada5340fb98885f0ddbb7c37

Comments

Jonathan Cameron May 20, 2022, 4:23 p.m. UTC | #1
On Wed, 13 Apr 2022 11:37:05 -0700
Ben Widawsky <ben.widawsky@intel.com> wrote:

> Spring cleaning is here and we're starting fresh so I won't be referencing
> previous postings and I've removed revision history from commit messages.
> 
> This patch series introduces the CXL region driver as well as associated APIs in
> CXL core to create and configure regions. Regions are defined by the CXL 2.0
> specification [1], a summary follows.
> 
> A region surfaces a swath of RAM (persistent or volatile) that appears as normal
> memory to the operating system. The memory, unless programmed by BIOS, or a
> previous Operating System, is inaccessible until the CXL driver creates a region
> for it.A region may be strided (interleave granularity) across multiple devices
> (interleave ways). The interleaving may traverse multiple levels of the CXL
> hierarchy.
> 
> +-------------------------+      +-------------------------+
> |                         |      |                         |
> |   CXL 2.0 Host Bridge   |      |   CXL 2.0 Host Bridge   |
> |                         |      |                         |
> |  +------+     +------+  |      |  +------+     +------+  |
> |  |  RP  |     |  RP  |  |      |  |  RP  |     |  RP  |  |
> +--+------+-----+------+--+      +--+------+-----+------+--+
>       |            |                   |               \--
>       |            |                   |        +-------+-\--+------+
>    +------+    +-------+            +-------+   |       |USP |      |
>    |Type 3|    |Type 3 |            |Type 3 |   |       +----+      |
>    |Device|    |Device |            |Device |   |     CXL Switch    |
>    +------+    +-------+            +-------+   | +----+     +----+ |
>                                                 | |DSP |     |DSP | |
>                                                 +-+-|--+-----+-|--+-+
>                                                     |          |
>                                                 +------+    +-------+
>                                                 |Type 3|    |Type 3 |
>                                                 |Device|    |Device |
>                                                 +------+    +-------+
> 
> Region verification and programming state are owned by the cxl_region driver
> (implemented in the cxl_region module). Much of the region driver is an
> implementation of algorithms described in the CXL Type 3 Memory Device Software
> Guide [2].
> 
> The region driver is responsible for configuring regions found on persistent
> capacities in the Label Storage Area (LSA), it will also enumerate regions
> configured by BIOS, usually volatile capacities, and will allow for dynamic
> region creation (which can then be stored in the LSA). Only dynamically created
> regions are implemented thus far.
> 
> Dan has previously stated that he doesn't want to merge ABI until the whole
> series is posted and reviewed, to make sure we have no gaps. As such, the goal
> of posting this series is *not* to discuss the ABI specifically, feedback is of
> course welcome. In other wordsIt has been discussed previously. The goal is to find
> architectural flaws in the implementation of the ABI that may pose problematic
> for cases we haven't yet conceived.
> 
> Since region creation is done via sysfs, it is left to userspace to prevent
> racing for resource usage. Here is an overview for creating a x1 256M
> dynamically created region programming to be used by userspace clients. In this
> example, the following topology is used (cropped for brevity):
> /sys/bus/cxl/devices/
> ├── decoder0.0 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.0
> ├── decoder0.1 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.1
> ├── decoder1.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/decoder1.0
> ├── decoder2.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/decoder2.0
> ├── decoder3.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3/decoder3.0
> ├── decoder4.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4/decoder4.0
> ├── decoder5.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5/decoder5.0
> ├── decoder6.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6/decoder6.0
> ├── endpoint3 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3
> ├── endpoint4 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4
> ├── endpoint5 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5
> ├── endpoint6 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6
> ...
> 
> 1. Select a Root Decoder whose interleave spans the desired interleave config
>    - devices, IG, IW, Large enough address space.
>    - ie. pick decoder0.0
> 2. Program the decoders for the endpoints comprising the interleave set.
>    - ie. echo $((256 << 20)) > /sys/bus/cxl/devices/decoder3.0
> 3. Create a region
>    - ie. echo $(cat create_pmem_region) >| create_pmem_region
> 4. Configure a region
>    - ie. echo 256 >| interleave_granularity
> 	 echo 1 >| interleave_ways
> 	 echo $((256 << 20)) >| size
> 	 echo decoder3.0 >| target0
> 5. Bind the region driver to the region
>    - ie. echo region0 > /sys/bus/cxl/drivers/cxl_region/bind
> 
Hi Ben,

I finally got around to actually trying this out on top of Dan's recent fix set
(I rebased it from the cxl/preview branch on kernel.org).

I'm not having much luck actually bring up a region.

The patch set refers to configuring the end point decoders, but all their
sysfs attributes are read only.  Am I missing a dependency somewhere or
is the intent that this series is part of the solution only?

I'm confused!

Jonathan

> 
> [1]: https://www.computeexpresslink.org/download-the-specification
> [2]: https://cdrdv2.intel.com/v1/dl/getContent/643805?wapkw=CXL%20memory%20device%20sw%20guide
> 
> Ben Widawsky (15):
>   cxl/core: Use is_endpoint_decoder
>   cxl/core/hdm: Bail on endpoint init fail
>   Revert "cxl/core: Convert decoder range to resource"
>   cxl/core: Create distinct decoder structs
>   cxl/acpi: Reserve CXL resources from request_free_mem_region
>   cxl/acpi: Manage root decoder's address space
>   cxl/port: Surface ram and pmem resources
>   cxl/core/hdm: Allocate resources from the media
>   cxl/core/port: Add attrs for size and volatility
>   cxl/core: Extract IW/IG decoding
>   cxl/acpi: Use common IW/IG decoding
>   cxl/region: Add region creation ABI
>   cxl/core/port: Add attrs for root ways & granularity
>   cxl/region: Introduce configuration
>   cxl/region: Introduce a cxl_region driver
> 
>  Documentation/ABI/testing/sysfs-bus-cxl       |  96 ++-
>  .../driver-api/cxl/memory-devices.rst         |  14 +
>  drivers/cxl/Kconfig                           |  10 +
>  drivers/cxl/Makefile                          |   2 +
>  drivers/cxl/acpi.c                            |  83 ++-
>  drivers/cxl/core/Makefile                     |   1 +
>  drivers/cxl/core/core.h                       |   4 +
>  drivers/cxl/core/hdm.c                        |  44 +-
>  drivers/cxl/core/port.c                       | 363 ++++++++--
>  drivers/cxl/core/region.c                     | 669 ++++++++++++++++++
>  drivers/cxl/cxl.h                             | 168 ++++-
>  drivers/cxl/mem.c                             |   7 +-
>  drivers/cxl/region.c                          | 333 +++++++++
>  drivers/cxl/region.h                          | 105 +++
>  include/linux/ioport.h                        |   1 +
>  kernel/resource.c                             |  11 +-
>  tools/testing/cxl/Kbuild                      |   1 +
>  tools/testing/cxl/test/cxl.c                  |   2 +-
>  18 files changed, 1810 insertions(+), 104 deletions(-)
>  create mode 100644 drivers/cxl/core/region.c
>  create mode 100644 drivers/cxl/region.c
>  create mode 100644 drivers/cxl/region.h
> 
> 
> base-commit: 7dc1d11d7abae52aada5340fb98885f0ddbb7c37
Dan Williams May 20, 2022, 4:41 p.m. UTC | #2
On Fri, May 20, 2022 at 9:23 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 13 Apr 2022 11:37:05 -0700
> Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> > Spring cleaning is here and we're starting fresh so I won't be referencing
> > previous postings and I've removed revision history from commit messages.
> >
> > This patch series introduces the CXL region driver as well as associated APIs in
> > CXL core to create and configure regions. Regions are defined by the CXL 2.0
> > specification [1], a summary follows.
> >
> > A region surfaces a swath of RAM (persistent or volatile) that appears as normal
> > memory to the operating system. The memory, unless programmed by BIOS, or a
> > previous Operating System, is inaccessible until the CXL driver creates a region
> > for it.A region may be strided (interleave granularity) across multiple devices
> > (interleave ways). The interleaving may traverse multiple levels of the CXL
> > hierarchy.
> >
> > +-------------------------+      +-------------------------+
> > |                         |      |                         |
> > |   CXL 2.0 Host Bridge   |      |   CXL 2.0 Host Bridge   |
> > |                         |      |                         |
> > |  +------+     +------+  |      |  +------+     +------+  |
> > |  |  RP  |     |  RP  |  |      |  |  RP  |     |  RP  |  |
> > +--+------+-----+------+--+      +--+------+-----+------+--+
> >       |            |                   |               \--
> >       |            |                   |        +-------+-\--+------+
> >    +------+    +-------+            +-------+   |       |USP |      |
> >    |Type 3|    |Type 3 |            |Type 3 |   |       +----+      |
> >    |Device|    |Device |            |Device |   |     CXL Switch    |
> >    +------+    +-------+            +-------+   | +----+     +----+ |
> >                                                 | |DSP |     |DSP | |
> >                                                 +-+-|--+-----+-|--+-+
> >                                                     |          |
> >                                                 +------+    +-------+
> >                                                 |Type 3|    |Type 3 |
> >                                                 |Device|    |Device |
> >                                                 +------+    +-------+
> >
> > Region verification and programming state are owned by the cxl_region driver
> > (implemented in the cxl_region module). Much of the region driver is an
> > implementation of algorithms described in the CXL Type 3 Memory Device Software
> > Guide [2].
> >
> > The region driver is responsible for configuring regions found on persistent
> > capacities in the Label Storage Area (LSA), it will also enumerate regions
> > configured by BIOS, usually volatile capacities, and will allow for dynamic
> > region creation (which can then be stored in the LSA). Only dynamically created
> > regions are implemented thus far.
> >
> > Dan has previously stated that he doesn't want to merge ABI until the whole
> > series is posted and reviewed, to make sure we have no gaps. As such, the goal
> > of posting this series is *not* to discuss the ABI specifically, feedback is of
> > course welcome. In other wordsIt has been discussed previously. The goal is to find
> > architectural flaws in the implementation of the ABI that may pose problematic
> > for cases we haven't yet conceived.
> >
> > Since region creation is done via sysfs, it is left to userspace to prevent
> > racing for resource usage. Here is an overview for creating a x1 256M
> > dynamically created region programming to be used by userspace clients. In this
> > example, the following topology is used (cropped for brevity):
> > /sys/bus/cxl/devices/
> > ├── decoder0.0 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.0
> > ├── decoder0.1 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.1
> > ├── decoder1.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/decoder1.0
> > ├── decoder2.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/decoder2.0
> > ├── decoder3.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3/decoder3.0
> > ├── decoder4.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4/decoder4.0
> > ├── decoder5.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5/decoder5.0
> > ├── decoder6.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6/decoder6.0
> > ├── endpoint3 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3
> > ├── endpoint4 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4
> > ├── endpoint5 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5
> > ├── endpoint6 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6
> > ...
> >
> > 1. Select a Root Decoder whose interleave spans the desired interleave config
> >    - devices, IG, IW, Large enough address space.
> >    - ie. pick decoder0.0
> > 2. Program the decoders for the endpoints comprising the interleave set.
> >    - ie. echo $((256 << 20)) > /sys/bus/cxl/devices/decoder3.0
> > 3. Create a region
> >    - ie. echo $(cat create_pmem_region) >| create_pmem_region
> > 4. Configure a region
> >    - ie. echo 256 >| interleave_granularity
> >        echo 1 >| interleave_ways
> >        echo $((256 << 20)) >| size
> >        echo decoder3.0 >| target0
> > 5. Bind the region driver to the region
> >    - ie. echo region0 > /sys/bus/cxl/drivers/cxl_region/bind
> >
> Hi Ben,
>
> I finally got around to actually trying this out on top of Dan's recent fix set
> (I rebased it from the cxl/preview branch on kernel.org).
>
> I'm not having much luck actually bring up a region.
>
> The patch set refers to configuring the end point decoders, but all their
> sysfs attributes are read only.  Am I missing a dependency somewhere or
> is the intent that this series is part of the solution only?
>
> I'm confused!

There's a new series that's being reviewed internally before going to the list:

https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3

Given the proximity to the merge window opening and the need to get
the "mem_enabled" series staged, I asked Ben to hold it back from the
list for now.

There are some changes I am folding into it, but I hope to send it out
in the next few days after "mem_enabled" is finalized.
Jonathan Cameron May 31, 2022, 12:21 p.m. UTC | #3
....

> > Hi Ben,
> >
> > I finally got around to actually trying this out on top of Dan's recent fix set
> > (I rebased it from the cxl/preview branch on kernel.org).
> >
> > I'm not having much luck actually bring up a region.
> >
> > The patch set refers to configuring the end point decoders, but all their
> > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > is the intent that this series is part of the solution only?
> >
> > I'm confused!  
> 
> There's a new series that's being reviewed internally before going to the list:
> 
> https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> 
> Given the proximity to the merge window opening and the need to get
> the "mem_enabled" series staged, I asked Ben to hold it back from the
> list for now.
> 
> There are some changes I am folding into it, but I hope to send it out
> in the next few days after "mem_enabled" is finalized.

Hi Dan,

I switched from an earlier version of the region code over to a rebase of the tree.
Two issues below you may already have fixed.

The second is a carry over from an earlier set so I haven't tested
without it but looks like it's still valid.

Anyhow, thought it might save some cycles to preempt you sending
out the series if these issues are still present.

Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
devices, but once you post I'll test more extensively.  I've not
really thought about the below much, so might not be best way to fix.

Found a bug in QEMU code as well (missing write masks for the
target list registers) - will post fix for that shortly.

Thanks,

Jonathan


From fa31f37214fcb121428be1ceb87ae335209fa4cc Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: Tue, 31 May 2022 13:13:51 +0100
Subject: [PATCH] Fixes for region code

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/region.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index e81c5b1339ec..fbbf084004d9 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -229,10 +229,10 @@ static struct cxl_decoder *stage_decoder(struct cxl_region *cxlr,

        return cxld;
 }
-
+// calculating whilst working down the tree - so divide granularity of previous level by local ways.
 static int calculate_ig(struct cxl_decoder *pcxld)
 {
-       return cxl_to_interleave_granularity(cxl_from_granularity(pcxld->cip.g) + cxl_from_ways(pcxld->cip.w));
+       return cxl_to_interleave_granularity(cxl_from_granularity(pcxld->cip.g) - cxl_from_ways(pcxld->cip.w));
 }

 static void unstage_decoder(struct cxl_decoder *cxld)
@@ -302,7 +302,8 @@ static struct cxl_decoder *stage_hb(struct cxl_region *cxlr,
                          t->nr_targets))
                return NULL;

-       t->target[port_grouping] = root_port;
+       //      t->target[port_grouping] = root_port;
+       t->target[hbd->cip.w] = root_port;
        hbd->cip.w++;

        /* If no switch, root port is connected to memdev */
--
2.32.0
Dan Williams June 23, 2022, 5:40 a.m. UTC | #4
Jonathan Cameron wrote:
> ....
> 
> > > Hi Ben,
> > >
> > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > (I rebased it from the cxl/preview branch on kernel.org).
> > >
> > > I'm not having much luck actually bring up a region.
> > >
> > > The patch set refers to configuring the end point decoders, but all their
> > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > is the intent that this series is part of the solution only?
> > >
> > > I'm confused!  
> > 
> > There's a new series that's being reviewed internally before going to the list:
> > 
> > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > 
> > Given the proximity to the merge window opening and the need to get
> > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > list for now.
> > 
> > There are some changes I am folding into it, but I hope to send it out
> > in the next few days after "mem_enabled" is finalized.
> 
> Hi Dan,
> 
> I switched from an earlier version of the region code over to a rebase of the tree.
> Two issues below you may already have fixed.
> 
> The second is a carry over from an earlier set so I haven't tested
> without it but looks like it's still valid.
> 
> Anyhow, thought it might save some cycles to preempt you sending
> out the series if these issues are still present.
> 
> Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> devices, but once you post I'll test more extensively.  I've not
> really thought about the below much, so might not be best way to fix.
> 
> Found a bug in QEMU code as well (missing write masks for the
> target list registers) - will post fix for that shortly.

Hi Jonathan,

Tomorrow I'll post the tranche to the list, but wanted to let you and
others watching that that the 'preview' branch [1] now has the proposed
initial region support. Once the bots give the thumbs up I'll send it
along.

To date I've only tested it with cxl_test and an internal test vehicle.
The cxl_test script I used to setup and teardown a x8 interleave across
x2 host bridges and x4 switches is:

---

#!/bin/bash
modprobe cxl_test
udevadm settle
decoder=$(cxl list -b cxl_test -D -d root | jq -r ".[] |
          select(.pmem_capable == true) | 
          select(.nr_targets == 2) |
          .decoder")

readarray -t mem < <(cxl list -M -d $decoder | jq -r ".[].memdev")
readarray -t endpoint < <(cxl reserve-dpa -t pmem ${mem[*]} -s $((256<<20)) |
                          jq -r ".[] | .decoder.decoder")
region=$(cat /sys/bus/cxl/devices/$decoder/create_pmem_region)
echo $region > /sys/bus/cxl/devices/$decoder/create_pmem_region
uuidgen > /sys/bus/cxl/devices/$region/uuid
nr_targets=${#endpoint[@]}
echo $nr_targets > /sys/bus/cxl/devices/$region/interleave_ways
g=$(cat /sys/bus/cxl/devices/$decoder/interleave_granularity)
echo $g > /sys/bus/cxl/devices/$region/interleave_granularity
echo $((nr_targets * (256<<20))) > /sys/bus/cxl/devices/$region/size
port_dev0=$(cxl list -T -d $decoder | jq -r ".[] |
            .targets | .[] | select(.position == 0) | .target")
port_dev1=$(cxl list -T -d $decoder | jq -r ".[] |
            .targets | .[] | select(.position == 1) | .target")
readarray -t mem_sort0 < <(cxl list -M -p $port_dev0 | jq -r ".[] | .memdev")
readarray -t mem_sort1 < <(cxl list -M -p $port_dev1 | jq -r ".[] | .memdev")
mem_sort=()
mem_sort[0]=${mem_sort0[0]}
mem_sort[1]=${mem_sort1[0]}
mem_sort[2]=${mem_sort0[2]}
mem_sort[3]=${mem_sort1[2]}
mem_sort[4]=${mem_sort0[1]}
mem_sort[5]=${mem_sort1[1]}
mem_sort[6]=${mem_sort0[3]}
mem_sort[7]=${mem_sort1[3]}

#mem_sort[2]=${mem_sort0[0]}
#mem_sort[1]=${mem_sort1[0]}
#mem_sort[0]=${mem_sort0[2]}
#mem_sort[3]=${mem_sort1[2]}
#mem_sort[4]=${mem_sort0[1]}
#mem_sort[5]=${mem_sort1[1]}
#mem_sort[6]=${mem_sort0[3]}
#mem_sort[7]=${mem_sort1[3]}

endpoint=()
for i in ${mem_sort[@]}
do
        readarray -O ${#endpoint[@]} -t endpoint < <(cxl list -Di -d endpoint -m $i | jq -r ".[] |
                                                     select(.mode == \"pmem\") | .decoder")
done
pos=0
for i in ${endpoint[@]}
do
        echo $i > /sys/bus/cxl/devices/$region/target$pos
        pos=$((pos+1))
done
echo "$region added ${#endpoint[@]} targets: ${endpoint[@]}"

echo 1 > /sys/bus/cxl/devices/$region/commit
echo 0 > /sys/bus/cxl/devices/$region/commit

pos=0
for i in ${endpoint[@]}
do
        echo "" > /sys/bus/cxl/devices/$region/target$pos
        pos=$((pos+1))
done
readarray -t endpoint < <(cxl free-dpa -t pmem ${mem[*]} |
                          jq -r ".[] | .decoder.decoder")
echo "$region released ${#endpoint[@]} targets: ${endpoint[@]}"

---

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=preview
Jonathan Cameron June 23, 2022, 3:08 p.m. UTC | #5
On Wed, 22 Jun 2022 22:40:48 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > ....
> >   
> > > > Hi Ben,
> > > >
> > > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > > (I rebased it from the cxl/preview branch on kernel.org).
> > > >
> > > > I'm not having much luck actually bring up a region.
> > > >
> > > > The patch set refers to configuring the end point decoders, but all their
> > > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > > is the intent that this series is part of the solution only?
> > > >
> > > > I'm confused!    
> > > 
> > > There's a new series that's being reviewed internally before going to the list:
> > > 
> > > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > > 
> > > Given the proximity to the merge window opening and the need to get
> > > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > > list for now.
> > > 
> > > There are some changes I am folding into it, but I hope to send it out
> > > in the next few days after "mem_enabled" is finalized.  
> > 
> > Hi Dan,
> > 
> > I switched from an earlier version of the region code over to a rebase of the tree.
> > Two issues below you may already have fixed.
> > 
> > The second is a carry over from an earlier set so I haven't tested
> > without it but looks like it's still valid.
> > 
> > Anyhow, thought it might save some cycles to preempt you sending
> > out the series if these issues are still present.
> > 
> > Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> > devices, but once you post I'll test more extensively.  I've not
> > really thought about the below much, so might not be best way to fix.
> > 
> > Found a bug in QEMU code as well (missing write masks for the
> > target list registers) - will post fix for that shortly.  
> 
> Hi Jonathan,
> 
> Tomorrow I'll post the tranche to the list, but wanted to let you and
> others watching that that the 'preview' branch [1] now has the proposed
> initial region support. Once the bots give the thumbs up I'll send it
> along.
> 
> To date I've only tested it with cxl_test and an internal test vehicle.
> The cxl_test script I used to setup and teardown a x8 interleave across
> x2 host bridges and x4 switches is:

Thanks.  Trivial feedback from a very quick play (busy day).

Bit odd that regionX/size is once write - get an error even if
writing same value to it twice.

Also not debugged yet but on just got a null pointer dereference on

echo decoder3.0 > target0

Beyond a stacktrace pointing at store_targetN and dereference is of
0x00008 no idea yet.

I was testing with a slightly modified version of a nasty script
I was using to test with Ben's code previously.  Might well be
doing something wrong but obviously need to fix that crash anyway!
Will move to your nicer script below at somepoint as I've been lazy
enough I'm still hand editing a few lines depending on number on
a particular run.

Should have some time tomorrow to debug, but definitely 'here be
dragons' at the moment.

Jonathan

> 
> ---
> 
> #!/bin/bash
> modprobe cxl_test
> udevadm settle
> decoder=$(cxl list -b cxl_test -D -d root | jq -r ".[] |
>           select(.pmem_capable == true) | 
>           select(.nr_targets == 2) |
>           .decoder")
> 
> readarray -t mem < <(cxl list -M -d $decoder | jq -r ".[].memdev")
> readarray -t endpoint < <(cxl reserve-dpa -t pmem ${mem[*]} -s $((256<<20)) |
>                           jq -r ".[] | .decoder.decoder")
> region=$(cat /sys/bus/cxl/devices/$decoder/create_pmem_region)
> echo $region > /sys/bus/cxl/devices/$decoder/create_pmem_region
> uuidgen > /sys/bus/cxl/devices/$region/uuid
> nr_targets=${#endpoint[@]}
> echo $nr_targets > /sys/bus/cxl/devices/$region/interleave_ways
> g=$(cat /sys/bus/cxl/devices/$decoder/interleave_granularity)
> echo $g > /sys/bus/cxl/devices/$region/interleave_granularity
> echo $((nr_targets * (256<<20))) > /sys/bus/cxl/devices/$region/size
> port_dev0=$(cxl list -T -d $decoder | jq -r ".[] |
>             .targets | .[] | select(.position == 0) | .target")
> port_dev1=$(cxl list -T -d $decoder | jq -r ".[] |
>             .targets | .[] | select(.position == 1) | .target")
> readarray -t mem_sort0 < <(cxl list -M -p $port_dev0 | jq -r ".[] | .memdev")
> readarray -t mem_sort1 < <(cxl list -M -p $port_dev1 | jq -r ".[] | .memdev")
> mem_sort=()
> mem_sort[0]=${mem_sort0[0]}
> mem_sort[1]=${mem_sort1[0]}
> mem_sort[2]=${mem_sort0[2]}
> mem_sort[3]=${mem_sort1[2]}
> mem_sort[4]=${mem_sort0[1]}
> mem_sort[5]=${mem_sort1[1]}
> mem_sort[6]=${mem_sort0[3]}
> mem_sort[7]=${mem_sort1[3]}
> 
> #mem_sort[2]=${mem_sort0[0]}
> #mem_sort[1]=${mem_sort1[0]}
> #mem_sort[0]=${mem_sort0[2]}
> #mem_sort[3]=${mem_sort1[2]}
> #mem_sort[4]=${mem_sort0[1]}
> #mem_sort[5]=${mem_sort1[1]}
> #mem_sort[6]=${mem_sort0[3]}
> #mem_sort[7]=${mem_sort1[3]}
> 
> endpoint=()
> for i in ${mem_sort[@]}
> do
>         readarray -O ${#endpoint[@]} -t endpoint < <(cxl list -Di -d endpoint -m $i | jq -r ".[] |
>                                                      select(.mode == \"pmem\") | .decoder")
> done
> pos=0
> for i in ${endpoint[@]}
> do
>         echo $i > /sys/bus/cxl/devices/$region/target$pos
>         pos=$((pos+1))
> done
> echo "$region added ${#endpoint[@]} targets: ${endpoint[@]}"
> 
> echo 1 > /sys/bus/cxl/devices/$region/commit
> echo 0 > /sys/bus/cxl/devices/$region/commit
> 
> pos=0
> for i in ${endpoint[@]}
> do
>         echo "" > /sys/bus/cxl/devices/$region/target$pos
>         pos=$((pos+1))
> done
> readarray -t endpoint < <(cxl free-dpa -t pmem ${mem[*]} |
>                           jq -r ".[] | .decoder.decoder")
> echo "$region released ${#endpoint[@]} targets: ${endpoint[@]}"
> 
> ---
> 
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=preview
Dan Williams June 23, 2022, 5:33 p.m. UTC | #6
Jonathan Cameron wrote:
> On Wed, 22 Jun 2022 22:40:48 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Jonathan Cameron wrote:
> > > ....
> > >   
> > > > > Hi Ben,
> > > > >
> > > > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > > > (I rebased it from the cxl/preview branch on kernel.org).
> > > > >
> > > > > I'm not having much luck actually bring up a region.
> > > > >
> > > > > The patch set refers to configuring the end point decoders, but all their
> > > > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > > > is the intent that this series is part of the solution only?
> > > > >
> > > > > I'm confused!    
> > > > 
> > > > There's a new series that's being reviewed internally before going to the list:
> > > > 
> > > > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > > > 
> > > > Given the proximity to the merge window opening and the need to get
> > > > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > > > list for now.
> > > > 
> > > > There are some changes I am folding into it, but I hope to send it out
> > > > in the next few days after "mem_enabled" is finalized.  
> > > 
> > > Hi Dan,
> > > 
> > > I switched from an earlier version of the region code over to a rebase of the tree.
> > > Two issues below you may already have fixed.
> > > 
> > > The second is a carry over from an earlier set so I haven't tested
> > > without it but looks like it's still valid.
> > > 
> > > Anyhow, thought it might save some cycles to preempt you sending
> > > out the series if these issues are still present.
> > > 
> > > Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> > > devices, but once you post I'll test more extensively.  I've not
> > > really thought about the below much, so might not be best way to fix.
> > > 
> > > Found a bug in QEMU code as well (missing write masks for the
> > > target list registers) - will post fix for that shortly.  
> > 
> > Hi Jonathan,
> > 
> > Tomorrow I'll post the tranche to the list, but wanted to let you and
> > others watching that that the 'preview' branch [1] now has the proposed
> > initial region support. Once the bots give the thumbs up I'll send it
> > along.
> > 
> > To date I've only tested it with cxl_test and an internal test vehicle.
> > The cxl_test script I used to setup and teardown a x8 interleave across
> > x2 host bridges and x4 switches is:
> 
> Thanks.  Trivial feedback from a very quick play (busy day).
> 
> Bit odd that regionX/size is once write - get an error even if
> writing same value to it twice.

Ah true, that should just silently succeed.

> Also not debugged yet but on just got a null pointer dereference on
> 
> echo decoder3.0 > target0
> 
> Beyond a stacktrace pointing at store_targetN and dereference is of
> 0x00008 no idea yet.

The compiler unfortunately does a good job inlining the entirety of all the
leaf functions beneath store_targetN() so I have found myself needing to
sprinkle "noinline" to get better back traces.

> 
> I was testing with a slightly modified version of a nasty script
> I was using to test with Ben's code previously.  Might well be
> doing something wrong but obviously need to fix that crash anyway!

Most definitely.

> Will move to your nicer script below at somepoint as I've been lazy
> enough I'm still hand editing a few lines depending on number on
> a particular run.
> 
> Should have some time tomorrow to debug, but definitely 'here be
> dragons' at the moment.

Yes. Even before this posting I had shaken out a few crash scenarios just from
moving from my old QEMU baseline to "jic123/cxl-rework-draft-2" which did
things like collide PCI MMIO with cxl_test fake CXL ranges. By the way, is
there a "latest" tag I should be following to stay in sync with what you are
running for QEMU+CXL?  If only to reproduce the same crash scenarios.
Dan Williams June 23, 2022, 11:44 p.m. UTC | #7
Dan Williams wrote:
> Jonathan Cameron wrote:
> > On Wed, 22 Jun 2022 22:40:48 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > Jonathan Cameron wrote:
> > > > ....
> > > >   
> > > > > > Hi Ben,
> > > > > >
> > > > > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > > > > (I rebased it from the cxl/preview branch on kernel.org).
> > > > > >
> > > > > > I'm not having much luck actually bring up a region.
> > > > > >
> > > > > > The patch set refers to configuring the end point decoders, but all their
> > > > > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > > > > is the intent that this series is part of the solution only?
> > > > > >
> > > > > > I'm confused!    
> > > > > 
> > > > > There's a new series that's being reviewed internally before going to the list:
> > > > > 
> > > > > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > > > > 
> > > > > Given the proximity to the merge window opening and the need to get
> > > > > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > > > > list for now.
> > > > > 
> > > > > There are some changes I am folding into it, but I hope to send it out
> > > > > in the next few days after "mem_enabled" is finalized.  
> > > > 
> > > > Hi Dan,
> > > > 
> > > > I switched from an earlier version of the region code over to a rebase of the tree.
> > > > Two issues below you may already have fixed.
> > > > 
> > > > The second is a carry over from an earlier set so I haven't tested
> > > > without it but looks like it's still valid.
> > > > 
> > > > Anyhow, thought it might save some cycles to preempt you sending
> > > > out the series if these issues are still present.
> > > > 
> > > > Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> > > > devices, but once you post I'll test more extensively.  I've not
> > > > really thought about the below much, so might not be best way to fix.
> > > > 
> > > > Found a bug in QEMU code as well (missing write masks for the
> > > > target list registers) - will post fix for that shortly.  
> > > 
> > > Hi Jonathan,
> > > 
> > > Tomorrow I'll post the tranche to the list, but wanted to let you and
> > > others watching that that the 'preview' branch [1] now has the proposed
> > > initial region support. Once the bots give the thumbs up I'll send it
> > > along.
> > > 
> > > To date I've only tested it with cxl_test and an internal test vehicle.
> > > The cxl_test script I used to setup and teardown a x8 interleave across
> > > x2 host bridges and x4 switches is:
> > 
> > Thanks.  Trivial feedback from a very quick play (busy day).
> > 
> > Bit odd that regionX/size is once write - get an error even if
> > writing same value to it twice.
> 
> Ah true, that should just silently succeed.

I fixed this one.

> 
> > Also not debugged yet but on just got a null pointer dereference on
> > 
> > echo decoder3.0 > target0
> > 
> > Beyond a stacktrace pointing at store_targetN and dereference is of
> > 0x00008 no idea yet.
> 
> The compiler unfortunately does a good job inlining the entirety of all the
> leaf functions beneath store_targetN() so I have found myself needing to
> sprinkle "noinline" to get better back traces.
> 
> > 
> > I was testing with a slightly modified version of a nasty script
> > I was using to test with Ben's code previously.  Might well be
> > doing something wrong but obviously need to fix that crash anyway!
> 
> Most definitely.

I tried to reproduce this one, but unfortunately, "worked for me". So
send along more reproduction details when you get a chance, but I'll
proceed with posting the series for now. I tried the following on my
QEMU config to reproduce:

# cxl reserve-dpa -t pmem mem0
{
  "memdev":"mem0",
  "pmem_size":"512.00 MiB (536.87 MB)",
  "ram_size":0,
  "serial":"0",
  "host":"0000:35:00.0",
  "decoder":{
    "decoder":"decoder2.0",
    "state":"disabled",
    "dpa_size":"512.00 MiB (536.87 MB)",
    "mode":"pmem"
  }
}
# echo region1 > /sys/bus/cxl/devices/decoder0.0/create_pmem_region
# echo 1 > /sys/bus/cxl/devices/region1/interleave_ways 
# echo 256 > /sys/bus/cxl/devices/region1/interleave_granularity 
# echo $((512 << 20)) > /sys/bus/cxl/devices/region1/size
# echo decoder2.0 > /sys/bus/cxl/devices/region1/target0
Jonathan Cameron June 24, 2022, 9:08 a.m. UTC | #8
On Thu, 23 Jun 2022 10:33:38 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Wed, 22 Jun 2022 22:40:48 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >   
> > > Jonathan Cameron wrote:  
> > > > ....
> > > >     
> > > > > > Hi Ben,
> > > > > >
> > > > > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > > > > (I rebased it from the cxl/preview branch on kernel.org).
> > > > > >
> > > > > > I'm not having much luck actually bring up a region.
> > > > > >
> > > > > > The patch set refers to configuring the end point decoders, but all their
> > > > > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > > > > is the intent that this series is part of the solution only?
> > > > > >
> > > > > > I'm confused!      
> > > > > 
> > > > > There's a new series that's being reviewed internally before going to the list:
> > > > > 
> > > > > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > > > > 
> > > > > Given the proximity to the merge window opening and the need to get
> > > > > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > > > > list for now.
> > > > > 
> > > > > There are some changes I am folding into it, but I hope to send it out
> > > > > in the next few days after "mem_enabled" is finalized.    
> > > > 
> > > > Hi Dan,
> > > > 
> > > > I switched from an earlier version of the region code over to a rebase of the tree.
> > > > Two issues below you may already have fixed.
> > > > 
> > > > The second is a carry over from an earlier set so I haven't tested
> > > > without it but looks like it's still valid.
> > > > 
> > > > Anyhow, thought it might save some cycles to preempt you sending
> > > > out the series if these issues are still present.
> > > > 
> > > > Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> > > > devices, but once you post I'll test more extensively.  I've not
> > > > really thought about the below much, so might not be best way to fix.
> > > > 
> > > > Found a bug in QEMU code as well (missing write masks for the
> > > > target list registers) - will post fix for that shortly.    
> > > 
> > > Hi Jonathan,
> > > 
> > > Tomorrow I'll post the tranche to the list, but wanted to let you and
> > > others watching that that the 'preview' branch [1] now has the proposed
> > > initial region support. Once the bots give the thumbs up I'll send it
> > > along.
> > > 
> > > To date I've only tested it with cxl_test and an internal test vehicle.
> > > The cxl_test script I used to setup and teardown a x8 interleave across
> > > x2 host bridges and x4 switches is:  
> > 
> > Thanks.  Trivial feedback from a very quick play (busy day).
> > 
> > Bit odd that regionX/size is once write - get an error even if
> > writing same value to it twice.  
> 
> Ah true, that should just silently succeed.
> 
> > Also not debugged yet but on just got a null pointer dereference on
> > 
> > echo decoder3.0 > target0
> > 
> > Beyond a stacktrace pointing at store_targetN and dereference is of
> > 0x00008 no idea yet.  
> 
> The compiler unfortunately does a good job inlining the entirety of all the
> leaf functions beneath store_targetN() so I have found myself needing to
> sprinkle "noinline" to get better back traces.
> 
> > 
> > I was testing with a slightly modified version of a nasty script
> > I was using to test with Ben's code previously.  Might well be
> > doing something wrong but obviously need to fix that crash anyway!  
> 
> Most definitely.
> 
> > Will move to your nicer script below at somepoint as I've been lazy
> > enough I'm still hand editing a few lines depending on number on
> > a particular run.
> > 
> > Should have some time tomorrow to debug, but definitely 'here be
> > dragons' at the moment.  
> 
> Yes. Even before this posting I had shaken out a few crash scenarios just from
> moving from my old QEMU baseline to "jic123/cxl-rework-draft-2" which did
> things like collide PCI MMIO with cxl_test fake CXL ranges. By the way, is
> there a "latest" tag I should be following to stay in sync with what you are
> running for QEMU+CXL? 

For this particular feature should just be mainline QEMU now.
Switch support was picked up a few days ago and I haven't pushed out a rebase
on top of that yet. Famous last words, but I don't 'think' that anything
that isn't yet in upstream QEMU should effect the region code.

I am testing on ARM (which requires the arch and board support which is
awaiting review) but doubt that causes this problem...

> If only to reproduce the same crash scenarios.