[00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

Message ID	20240324-dcd-type2-upstream-v1-0-b7b00d623625@intel.com
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.13]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1F39715E812; Sun, 24 Mar 2024 23:18:14 +0000 (UTC) From: ira.weiny@intel.com Subject: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD) Date: Sun, 24 Mar 2024 16:18:03 -0700 Message-Id: <20240324-dcd-type2-upstream-v1-0-b7b00d623625@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit To: Dave Jiang <dave.jiang@intel.com>, Fan Ni <fan.ni@samsung.com>, Jonathan Cameron <Jonathan.Cameron@huawei.com>, Navneet Singh <navneet.singh@intel.com> Cc: Dan Williams <dan.j.williams@intel.com>, Davidlohr Bueso <dave@stgolabs.net>, Alison Schofield <alison.schofield@intel.com>, Vishal Verma <vishal.l.verma@intel.com>, Ira Weiny <ira.weiny@intel.com>, linux-btrfs@vger.kernel.org, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org, Jonathan Cameron <Jonathan.Cameron@Huawei.com>, Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, David Sterba <dsterba@suse.com>
Series	DCD: Add support for Dynamic Capacity Devices (DCD) \| expand [00/26] DCD: Add support for Dynamic Capacity Devices (DCD) [01/26] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) [02/26] cxl/core: Separate region mode from decoder mode [03/26] cxl/mem: Read dynamic capacity configuration from the device [04/26] cxl/region: Add dynamic capacity decoder and region modes [05/26] cxl/core: Simplify cxl_dpa_set_mode() [06/26] cxl/port: Add Dynamic Capacity mode support to endpoint decoders [07/26] cxl/port: Add dynamic capacity size support to endpoint decoders [08/26] cxl/mem: Expose device dynamic capacity capabilities [09/26] cxl/region: Add Dynamic Capacity CXL region support [10/26] cxl/events: Factor out event msgnum configuration [11/26] cxl/pci: Delay event buffer allocation [12/26] cxl/pci: Factor out interrupt policy check [13/26] cxl/mem: Configure dynamic capacity interrupts [14/26] cxl/region: Read existing extents on region creation [15/26] range: Add range_overlaps() [16/26] cxl/extent: Realize extent devices [17/26] dax/region: Create extent resources on DAX region driver load [18/26] cxl/mem: Handle DCD add & release capacity events. [19/26] dax/bus: Factor out dev dax resize logic [20/26] dax: Document dax dev range tuple [21/26] dax/region: Prevent range mapping allocation on sparse regions [22/26] dax/region: Support DAX device creation on sparse DAX regions [23/26] cxl/mem: Trace Dynamic capacity Event Record [24/26] tools/testing/cxl: Make event logs dynamic [25/26] tools/testing/cxl: Add DC Regions to mock mem data [26/26] tools/testing/cxl: Add Dynamic Capacity events

Ira Weiny March 24, 2024, 11:18 p.m. UTC

A git tree of this series can be found here:

	https://github.com/weiny2/linux-kernel/tree/dcd-2024-03-24

Pre-requisite:
==============

	The locking introduced by Vishal for DAX regions:
	https://lore.kernel.org/all/20240124-vv-dax_abi-v7-1-20d16cb8d23d@intel.com/T/#u

Background
==========

A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
device that allows the memory capacity to change dynamically, without
the need for resetting the device, reconfiguring HDM decoders, or
reconfiguring software DAX regions.

One of the biggest use cases for Dynamic Capacity is to allow hosts to
share memory dynamically within a data center without increasing the
per-host attached memory.

The general flow for the addition or removal of memory is to have an
orchestrator coordinate the use of the memory.  Generally there are 5
actors in such a system, the Orchestrator, Fabric Manager, the Device
the host sees, the Host Kernel, and a Host User.

Typical work flows are shown below.

Orchestrator      FM         Device       Host Kernel    Host User

    |             |           |            |              |
    |-------------- Create region ----------------------->|
    |             |           |            |              |
    |             |           |            |<-- Create ---|
    |             |           |            |    Region    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Accept -|<- Accept  -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Create --->|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |             |           |            |              |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Remove -->|- Release->|- Release ->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Accept -|<- Accept  -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Create ----|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Remove -->|- Release->|- Release ->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |<- Create ----|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |-- Remove -->|- Release->|- Release ->|              |   |
    |  Capacity   |  Extent   |   Extent   |              |   |
    |             |           |            |              |   |
    |             |           |     (Release Ignored)     |   |
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |             |- Release->|- Release ->|              |
    |             |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Destroy ---|
    |             |           |            |   Region     |
    |             |           |            |              |

Previous RFCs of this series[0] resulted in significant architectural
comments.  Previous versions allowed memory capacity to be accepted by
the host regardless of the existence of a software region being mapped.

With this new patch set the order of the create region and DAX device
creation must be synchronized with the Orchestrator adding/removing
capacity.  The host kernel will reject an add extent event if the region
is not created yet.  It will also ignore a release if the DAX device is
created and referencing an extent.

Neither of these synchronizations are anticipated to be an issue with
real applications.

In order to allow for capacity to be added and removed a new concept of
a sparse DAX region is introduced.  A sparse DAX region may have 0 or
more bytes of available space.  The total space depends on the number
and size of the extents which have been added.

Initially it is anticipated that users of the memory will carefully
coordinate the surfacing of additional capacity with the creation of DAX
devices which use that capacity.  Therefore, the allocation of the
memory to DAX devices does not allow for specific associations between
DAX device and extent.  This keeps allocations very similar to existing
DAX region behavior.

Great care was taken to greatly simplify extent tracking.  Specifically,
in comparison to previous versions of the patch set, all extent tracking
xarrays have been eliminated from the code.  In addition, most of the
extra software objects and associated referenced counts have been
eliminated.

In this version, extents are tracked purely as sub-devices of the
region.  This ensures that the region destruction cleans up all extent
allocations properly.  Device managed callbacks are wired to ensure any
additional data required for DAX device references are handled
correctly.

Due to these major changes I'm setting this new series to V1.

In summary the major functionality of this series includes:

- Getting the dynamic capacity (DC) configuration information from cxl
  devices

- Configuring the DC regions reported by hardware

- Enhancing the CXL and DAX regions for dynamic capacity support
	a. Maintain a logical separation between hardware extents and
	   software managed region extents.  This provides an
	   abstraction between the layers and should allow for
	   interleaving in the future

- Get hardware extent lists for endpoint decoders upon
  region creation.

- Adjust extent/region memory available on the following events.
        a. Add capacity Events
	b. Release capacity events

- Host response for add capacity
	a. do not accept the extent if:
		If the region does not exist
		or an error occurs realizing the extent
	B. If the region does exist
		realize a DAX region extent with 1:1 mapping (no
		interleave yet)

- Host response for remove capacity
	a. If no DAX devices reference the extent release the extent
	b. If a reference does exist, ignore the request.
	   (Require FM to issue release again.)

- Modify DAX device creation/resize to account for extents within a
  sparse DAX region

- Trace Dynamic Capacity events for debugging

- Add cxl-test infrastructure to allow for faster unit testing
  (See new ndctl branch for cxl-dcd.sh test[1])

Fan Ni's latest v5 of Qemu DCD was used for testing.[2]

Remaining work:

	1) Integrate the QoS work from Dave Jiang
	2) Interleave support

Possible additional work depending on requirements:

	1) Allow mapping to specific extents (perhaps based on
	   label/tag)
	2) Release extents when DAX devices are released if a release
	   was previously seen from the device
	3) Accept a new extent which extends (but overlaps) an existing
	   extent(s)

[0] RFC v2: https://lore.kernel.org/r/20230604-dcd-type2-upstream-v2-0-f740c47e7916@intel.com
[1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-03-22
[2] https://lore.kernel.org/all/20240304194331.1586191-1-nifan.cxl@gmail.com/

---
Changes for v1:
- iweiny: Largely new series
- iweiny: Remove review tags due to the series being a major rework
- iweiny: Fix authorship for Navneet patches
- iweiny: Remove extent xarrays
- iweiny: Remove kreferences, replace with 1 use count protected under dax_rwsem
- iweiny: Mark all sysfs entries for the 6.10 June 2024 kernel
- iweiny: Remove gotos
- iweiny: Fix 0day issues
- Jonathan Cameron: address comments
- Navneet Singh: address comments
- Dan Williams: address comments
- Dave Jiang: address comments
- Fan Ni: address comments
- Jørgen Hansen: address comments
- Link to RFC v2: https://lore.kernel.org/r/20230604-dcd-type2-upstream-v2-0-f740c47e7916@intel.com

---
Ira Weiny (12):
      cxl/core: Simplify cxl_dpa_set_mode()
      cxl/events: Factor out event msgnum configuration
      cxl/pci: Delay event buffer allocation
      cxl/pci: Factor out interrupt policy check
      range: Add range_overlaps()
      dax/bus: Factor out dev dax resize logic
      dax: Document dax dev range tuple
      dax/region: Prevent range mapping allocation on sparse regions
      dax/region: Support DAX device creation on sparse DAX regions
      tools/testing/cxl: Make event logs dynamic
      tools/testing/cxl: Add DC Regions to mock mem data
      tools/testing/cxl: Add Dynamic Capacity events

Navneet Singh (14):
      cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
      cxl/core: Separate region mode from decoder mode
      cxl/mem: Read dynamic capacity configuration from the device
      cxl/region: Add dynamic capacity decoder and region modes
      cxl/port: Add Dynamic Capacity mode support to endpoint decoders
      cxl/port: Add dynamic capacity size support to endpoint decoders
      cxl/mem: Expose device dynamic capacity capabilities
      cxl/region: Add Dynamic Capacity CXL region support
      cxl/mem: Configure dynamic capacity interrupts
      cxl/region: Read existing extents on region creation
      cxl/extent: Realize extent devices
      dax/region: Create extent resources on DAX region driver load
      cxl/mem: Handle DCD add & release capacity events.
      cxl/mem: Trace Dynamic capacity Event Record

 Documentation/ABI/testing/sysfs-bus-cxl |  60 ++-
 drivers/cxl/core/Makefile               |   1 +
 drivers/cxl/core/core.h                 |  10 +
 drivers/cxl/core/extent.c               | 145 +++++
 drivers/cxl/core/hdm.c                  | 254 +++++++--
 drivers/cxl/core/mbox.c                 | 591 ++++++++++++++++++++-
 drivers/cxl/core/memdev.c               |  76 +++
 drivers/cxl/core/port.c                 |  19 +
 drivers/cxl/core/region.c               | 334 +++++++++++-
 drivers/cxl/core/trace.h                |  65 +++
 drivers/cxl/cxl.h                       | 127 ++++-
 drivers/cxl/cxlmem.h                    | 114 ++++
 drivers/cxl/mem.c                       |  45 ++
 drivers/cxl/pci.c                       | 122 +++--
 drivers/dax/bus.c                       | 353 +++++++++---
 drivers/dax/bus.h                       |   4 +-
 drivers/dax/cxl.c                       | 127 ++++-
 drivers/dax/dax-private.h               |  40 +-
 drivers/dax/hmem/hmem.c                 |   2 +-
 drivers/dax/pmem.c                      |   2 +-
 fs/btrfs/ordered-data.c                 |  10 +-
 include/linux/cxl-event.h               |  31 ++
 include/linux/range.h                   |   7 +
 tools/testing/cxl/Kbuild                |   1 +
 tools/testing/cxl/test/mem.c            | 914 ++++++++++++++++++++++++++++----
 25 files changed, 3152 insertions(+), 302 deletions(-)
---
base-commit: dff54316795991e88a453a095a9322718a34034a
change-id: 20230604-dcd-type2-upstream-0cd15f6216fd

Best regards,

fan March 25, 2024, 7:24 p.m. UTC | #1

On Sun, Mar 24, 2024 at 04:18:03PM -0700, ira.weiny@intel.com wrote:
> A git tree of this series can be found here:
> 
> 	https://github.com/weiny2/linux-kernel/tree/dcd-2024-03-24
> 
> Pre-requisite:
> ==============
> 
> 	The locking introduced by Vishal for DAX regions:
> 	https://lore.kernel.org/all/20240124-vv-dax_abi-v7-1-20d16cb8d23d@intel.com/T/#u
> 
> Background
> ==========
> 
> A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> device that allows the memory capacity to change dynamically, without
> the need for resetting the device, reconfiguring HDM decoders, or
> reconfiguring software DAX regions.
> 
> One of the biggest use cases for Dynamic Capacity is to allow hosts to
> share memory dynamically within a data center without increasing the
> per-host attached memory.
> 
> The general flow for the addition or removal of memory is to have an
> orchestrator coordinate the use of the memory.  Generally there are 5
> actors in such a system, the Orchestrator, Fabric Manager, the Device
> the host sees, the Host Kernel, and a Host User.
> 
> Typical work flows are shown below.
> 
> Orchestrator      FM         Device       Host Kernel    Host User
> 
>     |             |           |            |              |
>     |-------------- Create region ----------------------->|
>     |             |           |            |              |
>     |             |           |            |<-- Create ---|
>     |             |           |            |    Region    |
>     |<------------- Signal done --------------------------|
>     |             |           |            |              |
>     |-- Add ----->|-- Add --->|--- Add --->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Accept -|<- Accept  -|              |
>     |             |   Extent  |   Extent   |              |
>     |             |           |            |<- Create --->|
>     |             |           |            |   DAX dev    |-- Use memory
>     |             |           |            |              |   |
>     |             |           |            |              |   |
>     |             |           |            |<- Release ---| <-+
>     |             |           |            |   DAX dev    |
>     |             |           |            |              |
>     |<------------- Signal done --------------------------|
>     |             |           |            |              |
>     |-- Remove -->|- Release->|- Release ->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Release-|<- Release -|              |
>     |             |   Extent  |   Extent   |              |
>     |             |           |            |              |
>     |-- Add ----->|-- Add --->|--- Add --->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Accept -|<- Accept  -|              |
>     |             |   Extent  |   Extent   |              |
>     |             |           |            |<- Create ----|
>     |             |           |            |   DAX dev    |-- Use memory
>     |             |           |            |              |   |
>     |             |           |            |<- Release ---| <-+
>     |             |           |            |   DAX dev    |
>     |<------------- Signal done --------------------------|
>     |             |           |            |              |
>     |-- Remove -->|- Release->|- Release ->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Release-|<- Release -|              |
>     |             |   Extent  |   Extent   |              |
>     |             |           |            |              |
>     |-- Add ----->|-- Add --->|--- Add --->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |<- Create ----|
>     |             |           |            |   DAX dev    |-- Use memory
>     |             |           |            |              |   |
>     |-- Remove -->|- Release->|- Release ->|              |   |
>     |  Capacity   |  Extent   |   Extent   |              |   |
>     |             |           |            |              |   |
>     |             |           |     (Release Ignored)     |   |
>     |             |           |            |              |   |
>     |             |           |            |<- Release ---| <-+
>     |             |           |            |   DAX dev    |
>     |<------------- Signal done --------------------------|
>     |             |           |            |              |
>     |             |- Release->|- Release ->|              |
>     |             |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Release-|<- Release -|              |
>     |             |   Extent  |   Extent   |              |
>     |             |           |            |<- Destroy ---|
>     |             |           |            |   Region     |
>     |             |           |            |              |
> 
> Previous RFCs of this series[0] resulted in significant architectural
> comments.  Previous versions allowed memory capacity to be accepted by
> the host regardless of the existence of a software region being mapped.
> 
> With this new patch set the order of the create region and DAX device
> creation must be synchronized with the Orchestrator adding/removing
> capacity.  The host kernel will reject an add extent event if the region
> is not created yet.  It will also ignore a release if the DAX device is
> created and referencing an extent.
> 
> Neither of these synchronizations are anticipated to be an issue with
> real applications.
> 
> In order to allow for capacity to be added and removed a new concept of
> a sparse DAX region is introduced.  A sparse DAX region may have 0 or
> more bytes of available space.  The total space depends on the number
> and size of the extents which have been added.
> 
> Initially it is anticipated that users of the memory will carefully
> coordinate the surfacing of additional capacity with the creation of DAX
> devices which use that capacity.  Therefore, the allocation of the
> memory to DAX devices does not allow for specific associations between
> DAX device and extent.  This keeps allocations very similar to existing
> DAX region behavior.
> 
> Great care was taken to greatly simplify extent tracking.  Specifically,
> in comparison to previous versions of the patch set, all extent tracking
> xarrays have been eliminated from the code.  In addition, most of the
> extra software objects and associated referenced counts have been
> eliminated.
> 
> In this version, extents are tracked purely as sub-devices of the
> region.  This ensures that the region destruction cleans up all extent
> allocations properly.  Device managed callbacks are wired to ensure any
> additional data required for DAX device references are handled
> correctly.
> 
> Due to these major changes I'm setting this new series to V1.
> 
> In summary the major functionality of this series includes:
> 
> - Getting the dynamic capacity (DC) configuration information from cxl
>   devices
> 
> - Configuring the DC regions reported by hardware
> 
> - Enhancing the CXL and DAX regions for dynamic capacity support
> 	a. Maintain a logical separation between hardware extents and
> 	   software managed region extents.  This provides an
> 	   abstraction between the layers and should allow for
> 	   interleaving in the future
> 
> - Get hardware extent lists for endpoint decoders upon
>   region creation.
> 
> - Adjust extent/region memory available on the following events.
>         a. Add capacity Events
> 	b. Release capacity events
> 
> - Host response for add capacity
> 	a. do not accept the extent if:
> 		If the region does not exist
> 		or an error occurs realizing the extent
> 	B. If the region does exist
> 		realize a DAX region extent with 1:1 mapping (no
> 		interleave yet)
> 
> - Host response for remove capacity
> 	a. If no DAX devices reference the extent release the extent
> 	b. If a reference does exist, ignore the request.
> 	   (Require FM to issue release again.)
> 
> - Modify DAX device creation/resize to account for extents within a
>   sparse DAX region
> 
> - Trace Dynamic Capacity events for debugging
> 
> - Add cxl-test infrastructure to allow for faster unit testing
>   (See new ndctl branch for cxl-dcd.sh test[1])
> 
> Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
> 
> Remaining work:
> 
> 	1) Integrate the QoS work from Dave Jiang
> 	2) Interleave support
> 
> Possible additional work depending on requirements:
> 
> 	1) Allow mapping to specific extents (perhaps based on
> 	   label/tag)
> 	2) Release extents when DAX devices are released if a release
> 	   was previously seen from the device
> 	3) Accept a new extent which extends (but overlaps) an existing
> 	   extent(s)
> 
> [0] RFC v2: https://lore.kernel.org/r/20230604-dcd-type2-upstream-v2-0-f740c47e7916@intel.com
> [1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-03-22
> [2] https://lore.kernel.org/all/20240304194331.1586191-1-nifan.cxl@gmail.com/
> 
> ---
> Changes for v1:
> - iweiny: Largely new series
> - iweiny: Remove review tags due to the series being a major rework
> - iweiny: Fix authorship for Navneet patches
> - iweiny: Remove extent xarrays
> - iweiny: Remove kreferences, replace with 1 use count protected under dax_rwsem
> - iweiny: Mark all sysfs entries for the 6.10 June 2024 kernel
> - iweiny: Remove gotos
> - iweiny: Fix 0day issues
> - Jonathan Cameron: address comments
> - Navneet Singh: address comments
> - Dan Williams: address comments
> - Dave Jiang: address comments
> - Fan Ni: address comments
> - Jørgen Hansen: address comments
> - Link to RFC v2: https://lore.kernel.org/r/20230604-dcd-type2-upstream-v2-0-f740c47e7916@intel.com
> 

Hi Ira,
Have not got a chance to check the code yet, but I noticed one thing
when testing with my DCD emulation code.
Currently, if we do partial release, it seems the whole extent will be
removed. Is it designed intentionally?

Fan

> ---
> Ira Weiny (12):
>       cxl/core: Simplify cxl_dpa_set_mode()
>       cxl/events: Factor out event msgnum configuration
>       cxl/pci: Delay event buffer allocation
>       cxl/pci: Factor out interrupt policy check
>       range: Add range_overlaps()
>       dax/bus: Factor out dev dax resize logic
>       dax: Document dax dev range tuple
>       dax/region: Prevent range mapping allocation on sparse regions
>       dax/region: Support DAX device creation on sparse DAX regions
>       tools/testing/cxl: Make event logs dynamic
>       tools/testing/cxl: Add DC Regions to mock mem data
>       tools/testing/cxl: Add Dynamic Capacity events
> 
> Navneet Singh (14):
>       cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
>       cxl/core: Separate region mode from decoder mode
>       cxl/mem: Read dynamic capacity configuration from the device
>       cxl/region: Add dynamic capacity decoder and region modes
>       cxl/port: Add Dynamic Capacity mode support to endpoint decoders
>       cxl/port: Add dynamic capacity size support to endpoint decoders
>       cxl/mem: Expose device dynamic capacity capabilities
>       cxl/region: Add Dynamic Capacity CXL region support
>       cxl/mem: Configure dynamic capacity interrupts
>       cxl/region: Read existing extents on region creation
>       cxl/extent: Realize extent devices
>       dax/region: Create extent resources on DAX region driver load
>       cxl/mem: Handle DCD add & release capacity events.
>       cxl/mem: Trace Dynamic capacity Event Record
> 
>  Documentation/ABI/testing/sysfs-bus-cxl |  60 ++-
>  drivers/cxl/core/Makefile               |   1 +
>  drivers/cxl/core/core.h                 |  10 +
>  drivers/cxl/core/extent.c               | 145 +++++
>  drivers/cxl/core/hdm.c                  | 254 +++++++--
>  drivers/cxl/core/mbox.c                 | 591 ++++++++++++++++++++-
>  drivers/cxl/core/memdev.c               |  76 +++
>  drivers/cxl/core/port.c                 |  19 +
>  drivers/cxl/core/region.c               | 334 +++++++++++-
>  drivers/cxl/core/trace.h                |  65 +++
>  drivers/cxl/cxl.h                       | 127 ++++-
>  drivers/cxl/cxlmem.h                    | 114 ++++
>  drivers/cxl/mem.c                       |  45 ++
>  drivers/cxl/pci.c                       | 122 +++--
>  drivers/dax/bus.c                       | 353 +++++++++---
>  drivers/dax/bus.h                       |   4 +-
>  drivers/dax/cxl.c                       | 127 ++++-
>  drivers/dax/dax-private.h               |  40 +-
>  drivers/dax/hmem/hmem.c                 |   2 +-
>  drivers/dax/pmem.c                      |   2 +-
>  fs/btrfs/ordered-data.c                 |  10 +-
>  include/linux/cxl-event.h               |  31 ++
>  include/linux/range.h                   |   7 +
>  tools/testing/cxl/Kbuild                |   1 +
>  tools/testing/cxl/test/mem.c            | 914 ++++++++++++++++++++++++++++----
>  25 files changed, 3152 insertions(+), 302 deletions(-)
> ---
> base-commit: dff54316795991e88a453a095a9322718a34034a
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
> 
> Best regards,
> -- 
> Ira Weiny <ira.weiny@intel.com>
>

Ira Weiny March 28, 2024, 5:20 a.m. UTC | #2

fan wrote:
> On Sun, Mar 24, 2024 at 04:18:03PM -0700, ira.weiny@intel.com wrote:
> > A git tree of this series can be found here:

[snip]

> > 
> 
> Hi Ira,
> Have not got a chance to check the code yet, but I noticed one thing
> when testing with my DCD emulation code.
> Currently, if we do partial release, it seems the whole extent will be
> removed. Is it designed intentionally?
> 

Yes that is my intent.  I specifically called that out in patch 18.

https://lore.kernel.org/all/20240324-dcd-type2-upstream-v1-18-b7b00d623625@intel.com/

I thought we discussed this in one of the collaboration calls.  Mainly
this is to simplify by not attempting any split of the extents the host is
tracking.  It really is expected that the FM/device is going to keep those
extents offered and release them in their entirety.  I understand this may
complicate the device because it may see a release of memory prior to the
request of that release.  And perhaps this complicates the device.  But in
that case it (or the FM really) should not attempt to release partial
extents.

Ira

[snip]

Jonathan Cameron April 3, 2024, 8:39 p.m. UTC | #3

On Wed, 27 Mar 2024 22:20:45 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> fan wrote:
> > On Sun, Mar 24, 2024 at 04:18:03PM -0700, ira.weiny@intel.com wrote:  
> > > A git tree of this series can be found here:  
> 
> [snip]
> 
> > >   
> > 
> > Hi Ira,
> > Have not got a chance to check the code yet, but I noticed one thing
> > when testing with my DCD emulation code.
> > Currently, if we do partial release, it seems the whole extent will be
> > removed. Is it designed intentionally?
> >   
> 
> Yes that is my intent.  I specifically called that out in patch 18.
> 
> https://lore.kernel.org/all/20240324-dcd-type2-upstream-v1-18-b7b00d623625@intel.com/
> 
> I thought we discussed this in one of the collaboration calls.  Mainly
> this is to simplify by not attempting any split of the extents the host is
> tracking.  It really is expected that the FM/device is going to keep those
> extents offered and release them in their entirety.  I understand this may
> complicate the device because it may see a release of memory prior to the
> request of that release.  And perhaps this complicates the device.  But in
> that case it (or the FM really) should not attempt to release partial
> extents.

It was discussed at some point as you say. Feels like something that might not
be set in stone for ever, but for now it is a reasonable simplifying assumption.
The device might not maintain the separation of neighboring extents
but the FM probably will.  If it turns out real use models are different,
then we 'guessed' wrong and get to write more complex code. 

Device always has to cope with unsolicited release so don't think this adds
any burden.  That includes a race where the host releases capacity when
it hasn't yet seen the event the device has sent to release part of the same
capacity.  There is text about async release always being possible in the
spec to cover these overlapping cases but upshot of that one is it must be
permissible to release a containing capacity as you are doing.

Jonathan

> Ira
> 
> [snip]

Jonathan Cameron April 4, 2024, 10:20 a.m. UTC | #4

I haven't quite gone through the whole patch set yet, so ignore me
if you already have this somewhere, but I'd like to see this info
captured somewhere other than the cover letter. Either some Documentation
files or maybe a bit code comment in appropriate location?

> A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> device that allows the memory capacity to change dynamically, without
> the need for resetting the device, reconfiguring HDM decoders, or
> reconfiguring software DAX regions.
> 
> One of the biggest use cases for Dynamic Capacity is to allow hosts to
> share memory dynamically within a data center without increasing the

Probably good to rephrase to avoid share - given 'sharing' isn't
yet supported. 
"to access a common pool of memory dynamically ..." maybe?

> per-host attached memory.
> 
> The general flow for the addition or removal of memory is to have an
> orchestrator coordinate the use of the memory.  Generally there are 5
> actors in such a system, the Orchestrator, Fabric Manager, the Device
> the host sees, the Host Kernel, and a Host User.
> 
> Typical work flows are shown below.
> 
> Orchestrator      FM         Device       Host Kernel    Host User
> 
>     |             |           |            |              |
>     |-------------- Create region ----------------------->|
>     |             |           |            |              |
>     |             |           |            |<-- Create ---|
>     |             |           |            |    Region    |
>     |<------------- Signal done --------------------------|
>     |             |           |            |              |
>     |-- Add ----->|-- Add --->|--- Add --->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Accept -|<- Accept  -|              |
>     |             |   Extent  |   Extent   |              |
>     |             |           |            |<- Create --->|
>     |             |           |            |   DAX dev    |-- Use memory
>     |             |           |            |              |   |
>     |             |           |            |              |   |
>     |             |           |            |<- Release ---| <-+
>     |             |           |            |   DAX dev    |
>     |             |           |            |              |
>     |<------------- Signal done --------------------------|
>     |             |           |            |              |
>     |-- Remove -->|- Release->|- Release ->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Release-|<- Release -|              |
>     |             |   Extent  |   Extent   |              |

Missing a signal from the FM to orchestrator to say release done.

>     |             |           |            |              |
>     |-- Add ----->|-- Add --->|--- Add --->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Accept -|<- Accept  -|              |
>     |             |   Extent  |   Extent   |              |
>     |             |           |            |<- Create ----|
>     |             |           |            |   DAX dev    |-- Use memory
>     |             |           |            |              |   |
>     |             |           |            |<- Release ---| <-+
>     |             |           |            |   DAX dev    |
>     |<------------- Signal done --------------------------|
>     |             |           |            |              |
>     |-- Remove -->|- Release->|- Release ->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Release-|<- Release -|              |
>     |             |   Extent  |   Extent   |              |

As above, need to let the Orchestrator know it's done (the FM
knows so can pass the info onwards

>     |             |           |            |              |
>     |-- Add ----->|-- Add --->|--- Add --->|              |
>     |  Capacity   |  Extent   |   Extent   |              |
>     |             |           |            |<- Create ----|
>     |             |           |            |   DAX dev    |-- Use memory
>     |             |           |            |              |   |
>     |-- Remove -->|- Release->|- Release ->|              |   |
>     |  Capacity   |  Extent   |   Extent   |              |   |
>     |             |           |            |              |   |
>     |             |           |     (Release Ignored)     |   |
>     |             |           |            |              |   |
>     |             |           |            |<- Release ---| <-+
>     |             |           |            |   DAX dev    |
>     |<------------- Signal done --------------------------|
>     |             |           |            |              |
>     |             |- Release->|- Release ->|              |
>     |             |  Extent   |   Extent   |              |
>     |             |           |            |              |
>     |             |<- Release-|<- Release -|              |
>     |             |   Extent  |   Extent   |              |

My guess is FM would let the orchestrator know it had some capacity
back?

>     |             |           |            |<- Destroy ---|
>     |             |           |            |   Region     |
>     |             |           |            |              |

No path for async release yet?  I think we will want to add that
soon.  Host rebooting etc may not care to talk directly to the
orchestrator.

> 
> Previous RFCs of this series[0] resulted in significant architectural
> comments.  Previous versions allowed memory capacity to be accepted by
> the host regardless of the existence of a software region being mapped.
> 
> With this new patch set the order of the create region and DAX device
> creation must be synchronized with the Orchestrator adding/removing
> capacity.  The host kernel will reject an add extent event if the region
> is not created yet.  It will also ignore a release if the DAX device is
> created and referencing an extent.
> 
> Neither of these synchronizations are anticipated to be an issue with
> real applications.
> 
> In order to allow for capacity to be added and removed a new concept of
> a sparse DAX region is introduced.  A sparse DAX region may have 0 or
> more bytes of available space.  The total space depends on the number
> and size of the extents which have been added.
> 
> Initially it is anticipated that users of the memory will carefully
> coordinate the surfacing of additional capacity with the creation of DAX
> devices which use that capacity.  Therefore, the allocation of the
> memory to DAX devices does not allow for specific associations between
> DAX device and extent.  This keeps allocations very similar to existing
> DAX region behavior.

I don't quite follow.  Is the point that there is only one active dax
dev to which any new extents are added?  Or that a particular set
of extents offered together get put into a new device and next set get
another new device?

> 
> Great care was taken to greatly simplify extent tracking.  Specifically,
> in comparison to previous versions of the patch set, all extent tracking
> xarrays have been eliminated from the code.  In addition, most of the
> extra software objects and associated referenced counts have been
> eliminated.
> 
> In this version, extents are tracked purely as sub-devices of the
> region.  This ensures that the region destruction cleans up all extent
> allocations properly.  Device managed callbacks are wired to ensure any
> additional data required for DAX device references are handled
> correctly.
> 
> Due to these major changes I'm setting this new series to V1.
> 
> In summary the major functionality of this series includes:
> 
> - Getting the dynamic capacity (DC) configuration information from cxl
>   devices
> 
> - Configuring the DC regions reported by hardware
> 
> - Enhancing the CXL and DAX regions for dynamic capacity support
> 	a. Maintain a logical separation between hardware extents and
> 	   software managed region extents.  This provides an
> 	   abstraction between the layers and should allow for
> 	   interleaving in the future
> 
> - Get hardware extent lists for endpoint decoders upon
>   region creation.
> 
> - Adjust extent/region memory available on the following events.
>         a. Add capacity Events
> 	b. Release capacity events
Trivial but fix the indent

> 
> - Host response for add capacity
> 	a. do not accept the extent if:
> 		If the region does not exist
> 		or an error occurs realizing the extent
> 	B. If the region does exist
b.
> 		realize a DAX region extent with 1:1 mapping (no
> 		interleave yet)
> 
> - Host response for remove capacity
> 	a. If no DAX devices reference the extent release the extent
> 	b. If a reference does exist, ignore the request.
> 	   (Require FM to issue release again.)
> 
> - Modify DAX device creation/resize to account for extents within a
>   sparse DAX region
> 
> - Trace Dynamic Capacity events for debugging
> 
> - Add cxl-test infrastructure to allow for faster unit testing
>   (See new ndctl branch for cxl-dcd.sh test[1])
> 
> Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
> 
> Remaining work:
> 
> 	1) Integrate the QoS work from Dave Jiang
> 	2) Interleave support
> 
> Possible additional work depending on requirements:
> 
> 	1) Allow mapping to specific extents (perhaps based on
> 	   label/tag)
> 	2) Release extents when DAX devices are released if a release
> 	   was previously seen from the device
> 	3) Accept a new extent which extends (but overlaps) an existing
> 	   extent(s)

> 
> [0] RFC v2: https://lore.kernel.org/r/20230604-dcd-type2-upstream-v2-0-f740c47e7916@intel.com
> [1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-03-22
> [2] https://lore.kernel.org/all/20240304194331.1586191-1-nifan.cxl@gmail.com/
> 
> ---
> Changes for v1:
> - iweiny: Largely new series
> - iweiny: Remove review tags due to the series being a major rework
> - iweiny: Fix authorship for Navneet patches
> - iweiny: Remove extent xarrays
> - iweiny: Remove kreferences, replace with 1 use count protected under dax_rwsem
> - iweiny: Mark all sysfs entries for the 6.10 June 2024 kernel
> - iweiny: Remove gotos
> - iweiny: Fix 0day issues
> - Jonathan Cameron: address comments
> - Navneet Singh: address comments
> - Dan Williams: address comments
> - Dave Jiang: address comments
> - Fan Ni: address comments
> - Jørgen Hansen: address comments
> - Link to RFC v2: https://lore.kernel.org/r/20230604-dcd-type2-upstream-v2-0-f740c47e7916@intel.com
> 
> ---
> Ira Weiny (12):
>       cxl/core: Simplify cxl_dpa_set_mode()
>       cxl/events: Factor out event msgnum configuration
>       cxl/pci: Delay event buffer allocation
>       cxl/pci: Factor out interrupt policy check
>       range: Add range_overlaps()
>       dax/bus: Factor out dev dax resize logic
>       dax: Document dax dev range tuple
>       dax/region: Prevent range mapping allocation on sparse regions
>       dax/region: Support DAX device creation on sparse DAX regions
>       tools/testing/cxl: Make event logs dynamic
>       tools/testing/cxl: Add DC Regions to mock mem data
>       tools/testing/cxl: Add Dynamic Capacity events
> 
> Navneet Singh (14):
>       cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
>       cxl/core: Separate region mode from decoder mode
>       cxl/mem: Read dynamic capacity configuration from the device
>       cxl/region: Add dynamic capacity decoder and region modes
>       cxl/port: Add Dynamic Capacity mode support to endpoint decoders
>       cxl/port: Add dynamic capacity size support to endpoint decoders
>       cxl/mem: Expose device dynamic capacity capabilities
>       cxl/region: Add Dynamic Capacity CXL region support
>       cxl/mem: Configure dynamic capacity interrupts
>       cxl/region: Read existing extents on region creation
>       cxl/extent: Realize extent devices
>       dax/region: Create extent resources on DAX region driver load
>       cxl/mem: Handle DCD add & release capacity events.
>       cxl/mem: Trace Dynamic capacity Event Record
> 
>  Documentation/ABI/testing/sysfs-bus-cxl |  60 ++-
>  drivers/cxl/core/Makefile               |   1 +
>  drivers/cxl/core/core.h                 |  10 +
>  drivers/cxl/core/extent.c               | 145 +++++
>  drivers/cxl/core/hdm.c                  | 254 +++++++--
>  drivers/cxl/core/mbox.c                 | 591 ++++++++++++++++++++-
>  drivers/cxl/core/memdev.c               |  76 +++
>  drivers/cxl/core/port.c                 |  19 +
>  drivers/cxl/core/region.c               | 334 +++++++++++-
>  drivers/cxl/core/trace.h                |  65 +++
>  drivers/cxl/cxl.h                       | 127 ++++-
>  drivers/cxl/cxlmem.h                    | 114 ++++
>  drivers/cxl/mem.c                       |  45 ++
>  drivers/cxl/pci.c                       | 122 +++--
>  drivers/dax/bus.c                       | 353 +++++++++---
>  drivers/dax/bus.h                       |   4 +-
>  drivers/dax/cxl.c                       | 127 ++++-
>  drivers/dax/dax-private.h               |  40 +-
>  drivers/dax/hmem/hmem.c                 |   2 +-
>  drivers/dax/pmem.c                      |   2 +-
>  fs/btrfs/ordered-data.c                 |  10 +-
>  include/linux/cxl-event.h               |  31 ++
>  include/linux/range.h                   |   7 +
>  tools/testing/cxl/Kbuild                |   1 +
>  tools/testing/cxl/test/mem.c            | 914 ++++++++++++++++++++++++++++----
>  25 files changed, 3152 insertions(+), 302 deletions(-)
> ---
> base-commit: dff54316795991e88a453a095a9322718a34034a
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
> 
> Best regards,

Jonathan Cameron April 4, 2024, 5:49 p.m. UTC | #5

> 
> Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
Hi Ira, Navneet.
> 
> Remaining work:
> 
> 	1) Integrate the QoS work from Dave Jiang
> 	2) Interleave support


More flag.  This one I think is potentially important and don't
see any handling in here.

Whilst an FM could in theory be careful to avoid sending a
sparse set of extents, if the device is managing the memory range
(which is possible all it supports) and the FM issues an Initiate Dynamic
Capacity Add with Free (again may be all device supports) then we
can't stop the device issuing a bunch of sparse extents.

Now it won't be broken as such without this, but every time we
accept the first extent that will implicitly reject the rest.
That will look very ugly to an FM which has to poke potentially many
times to successfully allocate memory to a host.

I also don't think it will be that hard to support, but maybe I'm
missing something? 

My first thought is it's just a loop in cxl_handle_dcd_add_extent()
over a list of extents passed in then slightly more complex response
generation.

I don't want this to block getting initial DCD support in but it
will be a bit ugly if we quickly support the more flag and then end
up with just one kernel that an FM has to be careful with...

Jonathan

Alison Schofield April 10, 2024, 6:01 p.m. UTC | #6

On Sun, Mar 24, 2024 at 04:18:03PM -0700, Ira Weiny wrote:
> A git tree of this series can be found here:
> 
> 	https://github.com/weiny2/linux-kernel/tree/dcd-2024-03-24

This would benefit from another checkpatch run and cleanup.

--Alison

Ira Weiny May 1, 2024, 11:49 p.m. UTC | #7

Jonathan Cameron wrote:
> 
> > 
> > Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
> Hi Ira, Navneet.
> > 
> > Remaining work:
> > 
> > 	1) Integrate the QoS work from Dave Jiang
> > 	2) Interleave support
> 
> 
> More flag.  This one I think is potentially important and don't
> see any handling in here.

Nope I admit I missed the spec requirement.

> 
> Whilst an FM could in theory be careful to avoid sending a
> sparse set of extents, if the device is managing the memory range
> (which is possible all it supports) and the FM issues an Initiate Dynamic
> Capacity Add with Free (again may be all device supports) then we
> can't stop the device issuing a bunch of sparse extents.
> 
> Now it won't be broken as such without this, but every time we
> accept the first extent that will implicitly reject the rest.
> That will look very ugly to an FM which has to poke potentially many
> times to successfully allocate memory to a host.

This helps me to see see why the more bit is useful.

> 
> I also don't think it will be that hard to support, but maybe I'm
> missing something? 

Just a bunch of code and refactoring busy work.  ;-)  It's not rocket
science but does fundamentally change the arch again.

> 
> My first thought is it's just a loop in cxl_handle_dcd_add_extent()
> over a list of extents passed in then slightly more complex response
> generation.

Not exactly 'just a loop'.  No matter how I work this out there is the
possibility that some extents get surfaced and then the kernel tries to
remove them because it should not have.

To be most safe the cxl core is going to have to make 2 round trips to the
cxl region layer for each extent.  The first determines if the extent is
valid and creates the extent as much as possible.  The second actually
surfaces the extents.  However, if the surface fails then you might not
get the extents back.  So now we are in an invalid state.  :-/  WARN and
continue I guess?!??!

I think the safest way to handle this is add a new kernel notify event
called 'extent create' which stops short of surfacing the extent.  [I'm
not 100% sure how this is going to affect interleave.]

I think the safest logic for add is something like:

	cxl_handle_dcd_add_event()
		add_extent(squirl_list, extent);

		if (more bit) /* wait for more */
			return;

		/* Create extents to hedge the bets against failure */
		for_each(squirl_list)
			if (notify 'extent create' != ok)
				send_response(fail);
				return;

		for_each(squirl_list)
			if (notify 'surface' != ok)
				/*
				 * If the more bit was set, some extents
				 * have been surfaced and now need to be
				 * removed...
				 *
				 * Try to remove them and hope...
				 */
				WARN_ON('surface extents failed');
				for_each(squirl_list)
					notify 'remove without response'
				send_response(fail);
				return;

		send_response(squirl_list, accept);

The logic for remove is not changed AFAICS because the device must allow
for memory to be released at any time so the host is free to release each
of the extents individually despite the 'more' bit???

> 
> I don't want this to block getting initial DCD support in but it
> will be a bit ugly if we quickly support the more flag and then end
> up with just one kernel that an FM has to be careful with...

I'm not sure which is worse.  Given your use case above it seems like the
more bit may be more important for 'dumb' devices which want to add
extents in blocks before responding to the FM.  Thus complicating the FM.

It seems 'smarter' devices which could figure this out (not requiring the
more bit) are the ones which will be developed later.  So it seems the use
case time line is the opposite of what we need right now.

For that reason I'm inclined to try and get this in.

Ira

Jonathan Cameron May 3, 2024, 9:20 a.m. UTC | #8

On Wed, 1 May 2024 16:49:24 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Jonathan Cameron wrote:
> >   
> > > 
> > > Fan Ni's latest v5 of Qemu DCD was used for testing.[2]  
> > Hi Ira, Navneet.  
> > > 
> > > Remaining work:
> > > 
> > > 	1) Integrate the QoS work from Dave Jiang
> > > 	2) Interleave support  
> > 
> > 
> > More flag.  This one I think is potentially important and don't
> > see any handling in here.  
> 
> Nope I admit I missed the spec requirement.
> 
> > 
> > Whilst an FM could in theory be careful to avoid sending a
> > sparse set of extents, if the device is managing the memory range
> > (which is possible all it supports) and the FM issues an Initiate Dynamic
> > Capacity Add with Free (again may be all device supports) then we
> > can't stop the device issuing a bunch of sparse extents.
> > 
> > Now it won't be broken as such without this, but every time we
> > accept the first extent that will implicitly reject the rest.
> > That will look very ugly to an FM which has to poke potentially many
> > times to successfully allocate memory to a host.  
> 
> This helps me to see see why the more bit is useful.
> 
> > 
> > I also don't think it will be that hard to support, but maybe I'm
> > missing something?   
> 
> Just a bunch of code and refactoring busy work.  ;-)  It's not rocket
> science but does fundamentally change the arch again.
> 
> > 
> > My first thought is it's just a loop in cxl_handle_dcd_add_extent()
> > over a list of extents passed in then slightly more complex response
> > generation.  
> 
> Not exactly 'just a loop'.  No matter how I work this out there is the
> possibility that some extents get surfaced and then the kernel tries to
> remove them because it should not have.

Lets consider why it might need to back out.
1) Device sends an invalid set of extents - so maybe one in a later message
   overlaps with an already allocated extent.   Device bug, handling can
   be extremely inelegant - up to crashing the kernel.  Worst that happens
   due to race is probably a poison storm / machine check fun?  Not our
   responsibility to deal with something that broken (in my view!) Best effort
   only.

2) Host can't handle the extent for some reason and didn't know that until
   later - can just reject the ones it can't handle. 

> 
> To be most safe the cxl core is going to have to make 2 round trips to the
> cxl region layer for each extent.  The first determines if the extent is
> valid and creates the extent as much as possible.  The second actually
> surfaces the extents.  However, if the surface fails then you might not
> get the extents back.  So now we are in an invalid state.  :-/  WARN and
> continue I guess?!??!

Yes. Orchestrator can decide how to handle - probably reboot server in as
gentle a fashion as possible.


> 
> I think the safest way to handle this is add a new kernel notify event
> called 'extent create' which stops short of surfacing the extent.  [I'm
> not 100% sure how this is going to affect interleave.]
> 
> I think the safest logic for add is something like:
> 
> 	cxl_handle_dcd_add_event()
> 		add_extent(squirl_list, extent);
> 
> 		if (more bit) /* wait for more */
> 			return;
> 
> 		/* Create extents to hedge the bets against failure */
> 		for_each(squirl_list)
> 			if (notify 'extent create' != ok)
> 				send_response(fail);
> 				return;
> 		
> 		for_each(squirl_list)
> 			if (notify 'surface' != ok)
> 				/*
> 				 * If the more bit was set, some extents
> 				 * have been surfaced and now need to be
> 				 * removed...
> 				 *
> 				 * Try to remove them and hope...
> 				 */

If we failed to surface them all another option is just tell the device
that.   Responds with the extents that successfully surfaced and reject
all others (or all after the one that failed?)  So for the lower layers
send the device a response that says "thanks but I only took these ones"
and for the upper layers pretend "I was only offered these ones"

> 				WARN_ON('surface extents failed');
> 				for_each(squirl_list)
> 					notify 'remove without response'
> 				send_response(fail);
> 				return;
> 
> 		send_response(squirl_list, accept);
> 
> The logic for remove is not changed AFAICS because the device must allow
> for memory to be released at any time so the host is free to release each
> of the extents individually despite the 'more' bit???

Yes, but only after it accepted them - which needs to be done in one go.
So you can't just send releases before that (the device will return an
error and keep them in the pending list I think...)

> 
> > 
> > I don't want this to block getting initial DCD support in but it
> > will be a bit ugly if we quickly support the more flag and then end
> > up with just one kernel that an FM has to be careful with...  
> 
> I'm not sure which is worse.  Given your use case above it seems like the
> more bit may be more important for 'dumb' devices which want to add
> extents in blocks before responding to the FM.  Thus complicating the FM.
> 
> It seems 'smarter' devices which could figure this out (not requiring the
> more bit) are the ones which will be developed later.  So it seems the use
> case time line is the opposite of what we need right now.

Once we hit shareable capacity (which the smarter devices will use) then
this become the dominant approach to non contiguous allocations because
you can't add extents with a given tag in multiple goes.

So I'd expect the more flag to be more common not less over time.
> 
> For that reason I'm inclined to try and get this in.
> 

Great - but I'd not worry too much about bad effects if you get invalid
lists from the device.  If the only option is shout and panic, then fine
though I'd imagine we can do slightly better than that, so maybe warn
extensively and don't let the region be used.

Jonathan

> Ira
>

Ira Weiny May 6, 2024, 4:24 a.m. UTC | #9

Jonathan Cameron wrote:
> On Wed, 1 May 2024 16:49:24 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > Jonathan Cameron wrote:
> > >   
> > > > 
> > > > Fan Ni's latest v5 of Qemu DCD was used for testing.[2]  
> > > Hi Ira, Navneet.  
> > > > 
> > > > Remaining work:
> > > > 
> > > > 	1) Integrate the QoS work from Dave Jiang
> > > > 	2) Interleave support  
> > > 
> > > 
> > > More flag.  This one I think is potentially important and don't
> > > see any handling in here.  
> > 
> > Nope I admit I missed the spec requirement.
> > 
> > > 
> > > Whilst an FM could in theory be careful to avoid sending a
> > > sparse set of extents, if the device is managing the memory range
> > > (which is possible all it supports) and the FM issues an Initiate Dynamic
> > > Capacity Add with Free (again may be all device supports) then we
> > > can't stop the device issuing a bunch of sparse extents.
> > > 
> > > Now it won't be broken as such without this, but every time we
> > > accept the first extent that will implicitly reject the rest.
> > > That will look very ugly to an FM which has to poke potentially many
> > > times to successfully allocate memory to a host.  
> > 
> > This helps me to see see why the more bit is useful.
> > 
> > > 
> > > I also don't think it will be that hard to support, but maybe I'm
> > > missing something?   
> > 
> > Just a bunch of code and refactoring busy work.  ;-)  It's not rocket
> > science but does fundamentally change the arch again.
> > 
> > > 
> > > My first thought is it's just a loop in cxl_handle_dcd_add_extent()
> > > over a list of extents passed in then slightly more complex response
> > > generation.  
> > 
> > Not exactly 'just a loop'.  No matter how I work this out there is the
> > possibility that some extents get surfaced and then the kernel tries to
> > remove them because it should not have.
> 
> Lets consider why it might need to back out.
> 1) Device sends an invalid set of extents - so maybe one in a later message
>    overlaps with an already allocated extent.   Device bug, handling can
>    be extremely inelegant - up to crashing the kernel.  Worst that happens
>    due to race is probably a poison storm / machine check fun?  Not our
>    responsibility to deal with something that broken (in my view!) Best effort
>    only.
> 
> 2) Host can't handle the extent for some reason and didn't know that until
>    later - can just reject the ones it can't handle. 

3) Something in the host fails like ENOMEM on a later extent surface which
   requires the host to back out of all of them.

3 should be rare and I'm working toward it.  But it is possible this will
happen.

If you have a 'prepare' notify it should avoid most of these because the
extents will be mostly formed.  But there are some error paths on the actual
surface code path.

> 
> > 
> > To be most safe the cxl core is going to have to make 2 round trips to the
> > cxl region layer for each extent.  The first determines if the extent is
> > valid and creates the extent as much as possible.  The second actually
> > surfaces the extents.  However, if the surface fails then you might not
> > get the extents back.  So now we are in an invalid state.  :-/  WARN and
> > continue I guess?!??!
> 
> Yes. Orchestrator can decide how to handle - probably reboot server in as
> gentle a fashion as possible.
> 

Ok

> 
> > 
> > I think the safest way to handle this is add a new kernel notify event
> > called 'extent create' which stops short of surfacing the extent.  [I'm
> > not 100% sure how this is going to affect interleave.]
> > 
> > I think the safest logic for add is something like:
> > 
> > 	cxl_handle_dcd_add_event()
> > 		add_extent(squirl_list, extent);
> > 
> > 		if (more bit) /* wait for more */
> > 			return;
> > 
> > 		/* Create extents to hedge the bets against failure */
> > 		for_each(squirl_list)
> > 			if (notify 'extent create' != ok)
> > 				send_response(fail);
> > 				return;
> > 		
> > 		for_each(squirl_list)
> > 			if (notify 'surface' != ok)
> > 				/*
> > 				 * If the more bit was set, some extents
> > 				 * have been surfaced and now need to be
> > 				 * removed...
> > 				 *
> > 				 * Try to remove them and hope...
> > 				 */
> 
> If we failed to surface them all another option is just tell the device
> that.   Responds with the extents that successfully surfaced and reject
> all others (or all after the one that failed?)  So for the lower layers
> send the device a response that says "thanks but I only took these ones"
> and for the upper layers pretend "I was only offered these ones"
> 

But doesn't that basically break the more bit?  I'm willing to do that as it is
easier for the host.

> > 				WARN_ON('surface extents failed');
> > 				for_each(squirl_list)
> > 					notify 'remove without response'
> > 				send_response(fail);
> > 				return;
> > 
> > 		send_response(squirl_list, accept);
> > 
> > The logic for remove is not changed AFAICS because the device must allow
> > for memory to be released at any time so the host is free to release each
> > of the extents individually despite the 'more' bit???
> 
> Yes, but only after it accepted them - which needs to be done in one go.
> So you can't just send releases before that (the device will return an
> error and keep them in the pending list I think...)

:-(  OK so this more bit is really more...  no pun intended.  Because this
breaks the entire model I have if I have to treat these as a huge atomic unit.

Let me think on that a bit more.  Obviously it is just tagging an iterating the
extents to find those associated with a more bit on accept.  But it will take
some time to code up.

> 
> > 
> > > 
> > > I don't want this to block getting initial DCD support in but it
> > > will be a bit ugly if we quickly support the more flag and then end
> > > up with just one kernel that an FM has to be careful with...  
> > 
> > I'm not sure which is worse.  Given your use case above it seems like the
> > more bit may be more important for 'dumb' devices which want to add
> > extents in blocks before responding to the FM.  Thus complicating the FM.
> > 
> > It seems 'smarter' devices which could figure this out (not requiring the
> > more bit) are the ones which will be developed later.  So it seems the use
> > case time line is the opposite of what we need right now.
> 
> Once we hit shareable capacity (which the smarter devices will use) then
> this become the dominant approach to non contiguous allocations because
> you can't add extents with a given tag in multiple goes.

Why not?  Sharing is going to require some synchronization with the
orchestrator and can't the user app just report it did not get all it's memory
and wait for more?  With the same tag?

> 
> So I'd expect the more flag to be more common not less over time.
> > 
> > For that reason I'm inclined to try and get this in.
> > 
> 
> Great - but I'd not worry too much about bad effects if you get invalid
> lists from the device.  If the only option is shout and panic, then fine
> though I'd imagine we can do slightly better than that, so maybe warn
> extensively and don't let the region be used.

It is not just about invalid lists.  It is that setting up the extent devices
may fail and waiting for the devices to be set up means that they are user
visible.  So that is the chicken and the egg...

This is unlikely and perhaps the partials should just be surfaced and accept
whatever works.  Then let it all tear down later if it does not all go.

But I was trying to honor the accept 'all or nothing' as that is what has been
stated as the requirement of the more bit.

But it seems that it does not __have__ to be atomic.  Or at least the partials
can be cleaned up and all tried again.

Ira

> 
> Jonathan
> 
> > Ira
> > 
>

Jonathan Cameron May 8, 2024, 2:43 p.m. UTC | #10

On Sun, 5 May 2024 21:24:14 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Wed, 1 May 2024 16:49:24 -0700
> > Ira Weiny <ira.weiny@intel.com> wrote:
> >   
> > > Jonathan Cameron wrote:  
> > > >     
> > > > > 
> > > > > Fan Ni's latest v5 of Qemu DCD was used for testing.[2]    
> > > > Hi Ira, Navneet.    
> > > > > 
> > > > > Remaining work:
> > > > > 
> > > > > 	1) Integrate the QoS work from Dave Jiang
> > > > > 	2) Interleave support    
> > > > 
> > > > 
> > > > More flag.  This one I think is potentially important and don't
> > > > see any handling in here.    
> > > 
> > > Nope I admit I missed the spec requirement.
> > >   
> > > > 
> > > > Whilst an FM could in theory be careful to avoid sending a
> > > > sparse set of extents, if the device is managing the memory range
> > > > (which is possible all it supports) and the FM issues an Initiate Dynamic
> > > > Capacity Add with Free (again may be all device supports) then we
> > > > can't stop the device issuing a bunch of sparse extents.
> > > > 
> > > > Now it won't be broken as such without this, but every time we
> > > > accept the first extent that will implicitly reject the rest.
> > > > That will look very ugly to an FM which has to poke potentially many
> > > > times to successfully allocate memory to a host.    
> > > 
> > > This helps me to see see why the more bit is useful.
> > >   
> > > > 
> > > > I also don't think it will be that hard to support, but maybe I'm
> > > > missing something?     
> > > 
> > > Just a bunch of code and refactoring busy work.  ;-)  It's not rocket
> > > science but does fundamentally change the arch again.
> > >   
> > > > 
> > > > My first thought is it's just a loop in cxl_handle_dcd_add_extent()
> > > > over a list of extents passed in then slightly more complex response
> > > > generation.    
> > > 
> > > Not exactly 'just a loop'.  No matter how I work this out there is the
> > > possibility that some extents get surfaced and then the kernel tries to
> > > remove them because it should not have.  
> > 
> > Lets consider why it might need to back out.
> > 1) Device sends an invalid set of extents - so maybe one in a later message
> >    overlaps with an already allocated extent.   Device bug, handling can
> >    be extremely inelegant - up to crashing the kernel.  Worst that happens
> >    due to race is probably a poison storm / machine check fun?  Not our
> >    responsibility to deal with something that broken (in my view!) Best effort
> >    only.
> > 
> > 2) Host can't handle the extent for some reason and didn't know that until
> >    later - can just reject the ones it can't handle.   
> 
> 3) Something in the host fails like ENOMEM on a later extent surface which
>    requires the host to back out of all of them.
> 
> 3 should be rare and I'm working toward it.  But it is possible this will
> happen.
> 
> If you have a 'prepare' notify it should avoid most of these because the
> extents will be mostly formed.  But there are some error paths on the actual
> surface code path.

True. If these are really small allocations then elegant handling feels like
a nice to have rather than a requirement.

> 
> >   
> > > 
> > > To be most safe the cxl core is going to have to make 2 round trips to the
> > > cxl region layer for each extent.  The first determines if the extent is
> > > valid and creates the extent as much as possible.  The second actually
> > > surfaces the extents.  However, if the surface fails then you might not
> > > get the extents back.  So now we are in an invalid state.  :-/  WARN and
> > > continue I guess?!??!  
> > 
> > Yes. Orchestrator can decide how to handle - probably reboot server in as
> > gentle a fashion as possible.
> >   
> 
> Ok
> 
> >   
> > > 
> > > I think the safest way to handle this is add a new kernel notify event
> > > called 'extent create' which stops short of surfacing the extent.  [I'm
> > > not 100% sure how this is going to affect interleave.]
> > > 
> > > I think the safest logic for add is something like:
> > > 
> > > 	cxl_handle_dcd_add_event()
> > > 		add_extent(squirl_list, extent);
> > > 
> > > 		if (more bit) /* wait for more */
> > > 			return;
> > > 
> > > 		/* Create extents to hedge the bets against failure */
> > > 		for_each(squirl_list)
> > > 			if (notify 'extent create' != ok)
> > > 				send_response(fail);
> > > 				return;
> > > 		
> > > 		for_each(squirl_list)
> > > 			if (notify 'surface' != ok)
> > > 				/*
> > > 				 * If the more bit was set, some extents
> > > 				 * have been surfaced and now need to be
> > > 				 * removed...
> > > 				 *
> > > 				 * Try to remove them and hope...
> > > 				 */  
> > 
> > If we failed to surface them all another option is just tell the device
> > that.   Responds with the extents that successfully surfaced and reject
> > all others (or all after the one that failed?)  So for the lower layers
> > send the device a response that says "thanks but I only took these ones"
> > and for the upper layers pretend "I was only offered these ones"
> >   
> 
> But doesn't that basically break the more bit?  I'm willing to do that as it is
> easier for the host.

Don't think so.  We can always accept part of the offered extents in same
way we can accept part of a single offered extent if we like.
The more flag just means we only get to do that communication of what
we accepted once. So we have to reply with what we want and don't set
more flag in last message - thus indicating we don't want the rest.
(making sure we also tidy up the log for the ones we rejected)

> 
> > > 				WARN_ON('surface extents failed');
> > > 				for_each(squirl_list)
> > > 					notify 'remove without response'
> > > 				send_response(fail);
> > > 				return;
> > > 
> > > 		send_response(squirl_list, accept);
> > > 
> > > The logic for remove is not changed AFAICS because the device must allow
> > > for memory to be released at any time so the host is free to release each
> > > of the extents individually despite the 'more' bit???  
> > 
> > Yes, but only after it accepted them - which needs to be done in one go.
> > So you can't just send releases before that (the device will return an
> > error and keep them in the pending list I think...)  
> 
> :-(  OK so this more bit is really more...  no pun intended.  Because this
> breaks the entire model I have if I have to treat these as a huge atomic unit.
> 
> Let me think on that a bit more.  Obviously it is just tagging an iterating the
> extents to find those associated with a more bit on accept.  But it will take
> some time to code up.

The ability to give up at any point (though you need to read and clear the extents
that are left) should get around a lot of the complexity but sure it's
not a trivial thing to support.

I'd flip a 'something went wrong flag' on the the first failure, carry on the
walk not surfacing anything else, but clearing the logs etc, then finally reply
with what succeeded before that 'went wrong' flag was set.

> 
> >   
> > >   
> > > > 
> > > > I don't want this to block getting initial DCD support in but it
> > > > will be a bit ugly if we quickly support the more flag and then end
> > > > up with just one kernel that an FM has to be careful with...    
> > > 
> > > I'm not sure which is worse.  Given your use case above it seems like the
> > > more bit may be more important for 'dumb' devices which want to add
> > > extents in blocks before responding to the FM.  Thus complicating the FM.
> > > 
> > > It seems 'smarter' devices which could figure this out (not requiring the
> > > more bit) are the ones which will be developed later.  So it seems the use
> > > case time line is the opposite of what we need right now.  
> > 
> > Once we hit shareable capacity (which the smarter devices will use) then
> > this become the dominant approach to non contiguous allocations because
> > you can't add extents with a given tag in multiple goes.  
> 
> Why not?  Sharing is going to require some synchronization with the
> orchestrator and can't the user app just report it did not get all it's memory
> and wait for more?  With the same tag?

Hmm. I was sure the spec said sharing did not allow addition of capacity after
first creation, but now can't find it.  If you did do it though, fun
occurs when you then pass it on to the second device because you have
to do that via tag alone.

I believe this is about simplification on the device side because
offers of extents to other hosts are done by tag. If you allow extra ones
to turn up there are race conditions to potentially deal with.

7.6.7.6.5 Initiate Dynamic Capacity add.

"Enable shared Access" Enable access to extents previously added to another
host in a DC region that reports the "sharable" flag, as designated by the
specific tag value.

Note it is up to the device to offer the same capacity to all hosts for
which this is issued.  There is no extent list or length provided.


> 
> > 
> > So I'd expect the more flag to be more common not less over time.  
> > > 
> > > For that reason I'm inclined to try and get this in.
> > >   
> > 
> > Great - but I'd not worry too much about bad effects if you get invalid
> > lists from the device.  If the only option is shout and panic, then fine
> > though I'd imagine we can do slightly better than that, so maybe warn
> > extensively and don't let the region be used.  
> 
> It is not just about invalid lists.  It is that setting up the extent devices
> may fail and waiting for the devices to be set up means that they are user
> visible.  So that is the chicken and the egg...
> 
> This is unlikely and perhaps the partials should just be surfaced and accept
> whatever works.  Then let it all tear down later if it does not all go.
> 
> But I was trying to honor the accept 'all or nothing' as that is what has been
> stated as the requirement of the more bit.

That's not quite true - for shared it is all or nothing (after first host anyway) but
for other capacity it is 'accept X and reject Y in one go'.  You don't need to
take it all but you only get one go to say what you did accept.

> 
> But it seems that it does not __have__ to be atomic.  Or at least the partials
> can be cleaned up and all tried again.

With care you can accept up to a point, then give those back if you like - or
carry on and use them.

Jonathan

> 
> Ira
> 
> > 
> > Jonathan
> >   
> > > Ira
> > >   
> >   
> 
> 
>

[00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

Message

Comments