Message ID: 20210610185725.897541-1-ben.widawsky@intel.com
Series: Region Creation
On Thu, 10 Jun 2021 11:57:21 -0700 Ben Widawsky <ben.widawsky@intel.com> wrote: > CXL interleave sets and non-interleave sets are described via regions. A region > is specified in the CXL 2.0 specification and the purpose is to create a > standardized way to preserve the region across reboots. A specific section reference would be helpful. > > Introduced here is the basic mechanism to create and configure and delete a CXL > region. Configuring a region simply means giving it a size, offset within the > CFMWS window, UUID, and a target list. Enabling/activating a region, which > ultimately means programming the HDM decoders in the chain, is left for later > work. > > The patches are only minimally tested so far in QEMU emulation and so x1 > interleave is all that's supported. I'm guessing this is why it's an RFC rather than a final submission? If you can call out the RFC reasons in a cover letter it is helpful as saves people wondering what specifically you want comments on. > > Here is a sample topology (also in patch #4) > > decoder1.0 > ├── create_region > ├── delete_region > ├── devtype > ├── locked > ├── region1.0:0 > │ ├── offset > │ ├── size > │ ├── subsystem -> ../../../../../../../bus/cxl > │ ├── target0 > │ ├── uevent > │ ├── uuid > │ └── verify > ├── size > ├── start > ├── subsystem -> ../../../../../../bus/cxl > ├── target_list > ├── target_type > └── uevent > > Ben Widawsky (4): > cxl/region: Add region creation ABI > cxl/region: Create attribute structure / verify > cxl: Move cxl_memdev conversion helper to mem.h > cxl/region: Introduce concept of region configuration > > Documentation/ABI/testing/sysfs-bus-cxl | 59 +++ > .../driver-api/cxl/memory-devices.rst | 8 + > drivers/cxl/Makefile | 2 +- > drivers/cxl/core.c | 71 ++++ > drivers/cxl/cxl.h | 11 + > drivers/cxl/mem.h | 26 ++ > drivers/cxl/pci.c | 5 - > drivers/cxl/region.c | 400 ++++++++++++++++++ > 8 files changed, 576 insertions(+), 6 deletions(-) > create mode 100644 drivers/cxl/region.c >
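[Editorial note: for orientation, here is a rough sketch of how userspace might drive the proposed attributes. The sysfs paths, the create_region write semantics, and the "mem0" target value are assumptions inferred from the sample topology above, not the final ABI.]

/*
 * Hypothetical userspace flow for the proposed region ABI.  The decoder path,
 * the create_region semantics and the target name are assumptions based on
 * the sample topology in the cover letter, not the final interface.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_attr(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	const char *dec = "/sys/bus/cxl/devices/decoder1.0";
	char path[256];

	/* Ask the decoder to instantiate a new region (assumed semantics). */
	snprintf(path, sizeof(path), "%s/create_region", dec);
	if (write_attr(path, "region1.0:0"))
		perror("create_region");

	/* Configure the region: size, uuid, and a single target for x1. */
	snprintf(path, sizeof(path), "%s/region1.0:0/size", dec);
	write_attr(path, "0x10000000");
	snprintf(path, sizeof(path), "%s/region1.0:0/uuid", dec);
	write_attr(path, "ed93e4c8-8d4f-4f0e-9fdf-2d57ba146cb8");
	snprintf(path, sizeof(path), "%s/region1.0:0/target0", dec);
	write_attr(path, "mem0");

	return 0;
}

Per the cover letter, enabling/activating the region (programming the HDM decoders) would then be a separate, later step.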
On Fri, 11 Jun 2021 14:11:36 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > On Thu, 10 Jun 2021 11:57:21 -0700 > Ben Widawsky <ben.widawsky@intel.com> wrote: > > > CXL interleave sets and non-interleave sets are described via regions. A region > > is specified in the CXL 2.0 specification and the purpose is to create a > > standardized way to preserve the region across reboots. > > A specific section reference would be helpful. > > > > > Introduced here is the basic mechanism to create and configure and delete a CXL > > region. Configuring a region simply means giving it a size, offset within the > > CFMWS window, UUID, and a target list. Enabling/activating a region, which > > ultimately means programming the HDM decoders in the chain, is left for later > > work. > > > > The patches are only minimally tested so far in QEMU emulation and so x1 > > interleave is all that's supported. > > I'm guessing this is why it's an RFC rather than a final submission? > > If you can call out the RFC reasons in a cover letter it is helpful > as saves people wondering what specifically you want comments on. Hi Ben, Having read through them all, I think this needs more thought than I feel up to on a Friday afternoon. Will get back to you on v2 perhaps. Jonathan > > > > > Here is a sample topology (also in patch #4) > > > > decoder1.0 > > ├── create_region > > ├── delete_region > > ├── devtype > > ├── locked > > ├── region1.0:0 > > │ ├── offset > > │ ├── size > > │ ├── subsystem -> ../../../../../../../bus/cxl > > │ ├── target0 > > │ ├── uevent > > │ ├── uuid > > │ └── verify > > ├── size > > ├── start > > ├── subsystem -> ../../../../../../bus/cxl > > ├── target_list > > ├── target_type > > └── uevent > > > > Ben Widawsky (4): > > cxl/region: Add region creation ABI > > cxl/region: Create attribute structure / verify > > cxl: Move cxl_memdev conversion helper to mem.h > > cxl/region: Introduce concept of region configuration > > > > Documentation/ABI/testing/sysfs-bus-cxl | 59 +++ > > .../driver-api/cxl/memory-devices.rst | 8 + > > drivers/cxl/Makefile | 2 +- > > drivers/cxl/core.c | 71 ++++ > > drivers/cxl/cxl.h | 11 + > > drivers/cxl/mem.h | 26 ++ > > drivers/cxl/pci.c | 5 - > > drivers/cxl/region.c | 400 ++++++++++++++++++ > > 8 files changed, 576 insertions(+), 6 deletions(-) > > create mode 100644 drivers/cxl/region.c > > >
On 21-06-11 14:53:31, Jonathan Cameron wrote: > On Fri, 11 Jun 2021 14:11:36 +0100 > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > > > On Thu, 10 Jun 2021 11:57:21 -0700 > > Ben Widawsky <ben.widawsky@intel.com> wrote: > > > > > CXL interleave sets and non-interleave sets are described via regions. A region > > > is specified in the CXL 2.0 specification and the purpose is to create a > > > standardized way to preserve the region across reboots. > > > > A specific section reference would be helpful. > > > > > > > > Introduced here is the basic mechanism to create and configure and delete a CXL > > > region. Configuring a region simply means giving it a size, offset within the > > > CFMWS window, UUID, and a target list. Enabling/activating a region, which > > > ultimately means programming the HDM decoders in the chain, is left for later > > > work. > > > > > > The patches are only minimally tested so far in QEMU emulation and so x1 > > > interleave is all that's supported. > > > > I'm guessing this is why it's an RFC rather than a final submission? > > > > If you can call out the RFC reasons in a cover letter it is helpful > > as saves people wondering what specifically you want comments on. > > Hi Ben, > > Having read through them all, I think this needs more thought than > I feel up to on a Friday afternoon. Will get back to you on v2 > perhaps. > > Jonathan > Thanks for looking. Totally fair too :-) I'm mainly looking for feedback on the region creation and configuration from ABI perspective. Nitty gritty code review can happen with the v1 submission. > > > > > > > > Here is a sample topology (also in patch #4) > > > > > > decoder1.0 > > > ├── create_region > > > ├── delete_region > > > ├── devtype > > > ├── locked > > > ├── region1.0:0 > > > │ ├── offset > > > │ ├── size > > > │ ├── subsystem -> ../../../../../../../bus/cxl > > > │ ├── target0 > > > │ ├── uevent > > > │ ├── uuid > > > │ └── verify > > > ├── size > > > ├── start > > > ├── subsystem -> ../../../../../../bus/cxl > > > ├── target_list > > > ├── target_type > > > └── uevent > > > > > > Ben Widawsky (4): > > > cxl/region: Add region creation ABI > > > cxl/region: Create attribute structure / verify > > > cxl: Move cxl_memdev conversion helper to mem.h > > > cxl/region: Introduce concept of region configuration > > > > > > Documentation/ABI/testing/sysfs-bus-cxl | 59 +++ > > > .../driver-api/cxl/memory-devices.rst | 8 + > > > drivers/cxl/Makefile | 2 +- > > > drivers/cxl/core.c | 71 ++++ > > > drivers/cxl/cxl.h | 11 + > > > drivers/cxl/mem.h | 26 ++ > > > drivers/cxl/pci.c | 5 - > > > drivers/cxl/region.c | 400 ++++++++++++++++++ > > > 8 files changed, 576 insertions(+), 6 deletions(-) > > > create mode 100644 drivers/cxl/region.c > > > > > >
On Thu, Jun 10, 2021 at 11:58 AM Ben Widawsky <ben.widawsky@intel.com> wrote: > > CXL interleave sets and non-interleave sets are described via regions. A region > is specified in the CXL 2.0 specification and the purpose is to create a > standardized way to preserve the region across reboots. > > Introduced here is the basic mechanism to create and configure and delete a CXL > region. Configuring a region simply means giving it a size, offset within the > CFMWS window, UUID, and a target list. Enabling/activating a region, which > ultimately means programming the HDM decoders in the chain, is left for later > work. > > The patches are only minimally tested so far in QEMU emulation and so x1 > interleave is all that's supported. > > Here is a sample topology (also in patch #4) I'm just going to react to the attributes before looking at the implementation to make sure we're level set. > > decoder1.0 > ├── create_region > ├── delete_region > ├── devtype > ├── locked > ├── region1.0:0 > │ ├── offset Is this the region's offset relative to the next available free space in the parent decoder range? If this is output only I think it's ok, but I think the address space allocation decision belongs to the region driver at activation time. I.e. userspace does not have much of a chance at specifying this relative all the other dynamic operations that can be happening in the decoder. > │ ├── size > │ ├── subsystem -> ../../../../../../../bus/cxl > │ ├── target0 > │ ├── uevent > │ ├── uuid > │ └── verify I don't understand the role of a standalone @verify attribute, there is verification that can happen per attribute write, and there is final verification that can happen at region bind time. Either way anything verify would check is duplicated somewhere else, and the verification per attribute update is more precise. For example writes to @size can check for free space in parent decoder and fail if unavailable. Writes to targetX can fail if the memdev is not connected to this decoder's port topology, or the memdev is out of decoder resources. The final region bind will fail if mid-level switches are lacking decoder resources, or would require changing a decoder configuration that is pinned active.
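[Editorial note: to make the per-attribute alternative concrete, a minimal sketch of what a write-time check on @size could look like. The cxl_region/cxl_decoder fields and the free-space helper are invented for illustration, not taken from the posted patches.]

/*
 * Sketch only: validate a region size against the parent decoder's remaining
 * capacity at write time, so the error surfaces on the attribute write
 * itself rather than at bind().  to_cxl_region(), to_cxl_decoder() and
 * cxl_decoder_free_space() are assumed helpers.
 */
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

static ssize_t size_show(struct device *dev, struct device_attribute *attr,
			 char *buf)
{
	struct cxl_region *region = to_cxl_region(dev);

	return sysfs_emit(buf, "%#llx\n", region->size);
}

static ssize_t size_store(struct device *dev, struct device_attribute *attr,
			  const char *buf, size_t len)
{
	struct cxl_region *region = to_cxl_region(dev);
	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
	u64 size;
	int rc;

	rc = kstrtou64(buf, 0, &size);
	if (rc)
		return rc;

	device_lock(&cxld->dev);
	if (size > cxl_decoder_free_space(cxld))
		rc = -ENOSPC;	/* no room left in the parent decoder */
	else
		region->size = size;
	device_unlock(&cxld->dev);

	return rc ? rc : len;
}
static DEVICE_ATTR_RW(size);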
On Fri, 11 Jun 2021 17:44:02 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > On Thu, Jun 10, 2021 at 11:58 AM Ben Widawsky <ben.widawsky@intel.com> wrote: > > > > CXL interleave sets and non-interleave sets are described via regions. A region > > is specified in the CXL 2.0 specification and the purpose is to create a > > standardized way to preserve the region across reboots. > > > > Introduced here is the basic mechanism to create and configure and delete a CXL > > region. Configuring a region simply means giving it a size, offset within the > > CFMWS window, UUID, and a target list. Enabling/activating a region, which > > ultimately means programming the HDM decoders in the chain, is left for later > > work. > > > > The patches are only minimally tested so far in QEMU emulation and so x1 > > interleave is all that's supported. > > > > Here is a sample topology (also in patch #4) > > I'm just going to react to the attributes before looking at the > implementation to make sure we're level set. > > > > > decoder1.0 > > ├── create_region > > ├── delete_region > > ├── devtype > > ├── locked > > ├── region1.0:0 > > │ ├── offset > > Is this the region's offset relative to the next available free space > in the parent decoder range? If this is output only I think it's ok, > but I think the address space allocation decision belongs to the > region driver at activation time. I.e. userspace does not have much of > a chance at specifying this relative all the other dynamic operations > that can be happening in the decoder. > > > │ ├── size > > │ ├── subsystem -> ../../../../../../../bus/cxl > > │ ├── target0 > > │ ├── uevent > > │ ├── uuid > > │ └── verify > > I don't understand the role of a standalone @verify attribute, there > is verification that can happen per attribute write, and there is > final verification that can happen at region bind time. Either way > anything verify would check is duplicated somewhere else, and the > verification per attribute update is more precise. For example writes > to @size can check for free space in parent decoder and fail if > unavailable. This comes back to your question above on whether offset is writable and what it is with respect to. If it is writeable, then you can't really verify size and offset separately. I'm not against just doing it on commit. > Writes to targetX can fail if the memdev is not connected > to this decoder's port topology, or the memdev is out of decoder > resources. The final region bind will fail if mid-level switches are > lacking decoder resources, or would require changing a decoder > configuration that is pinned active.
On 21-06-11 17:44:02, Dan Williams wrote: > On Thu, Jun 10, 2021 at 11:58 AM Ben Widawsky <ben.widawsky@intel.com> wrote: > > > > CXL interleave sets and non-interleave sets are described via regions. A region > > is specified in the CXL 2.0 specification and the purpose is to create a > > standardized way to preserve the region across reboots. > > > > Introduced here is the basic mechanism to create and configure and delete a CXL > > region. Configuring a region simply means giving it a size, offset within the > > CFMWS window, UUID, and a target list. Enabling/activating a region, which > > ultimately means programming the HDM decoders in the chain, is left for later > > work. > > > > The patches are only minimally tested so far in QEMU emulation and so x1 > > interleave is all that's supported. > > > > Here is a sample topology (also in patch #4) > > I'm just going to react to the attributes before looking at the > implementation to make sure we're level set. > > > > > decoder1.0 > > ├── create_region > > ├── delete_region > > ├── devtype > > ├── locked > > ├── region1.0:0 > > │ ├── offset > > Is this the region's offset relative to the next available free space > in the parent decoder range? If this is output only I think it's ok, > but I think the address space allocation decision belongs to the > region driver at activation time. I.e. userspace does not have much of > a chance at specifying this relative all the other dynamic operations > that can be happening in the decoder. > This was my mistake. Offset will be determined by the driver and I intend for this to be read-only. > > │ ├── size > > │ ├── subsystem -> ../../../../../../../bus/cxl > > │ ├── target0 > > │ ├── uevent > > │ ├── uuid > > │ └── verify > > I don't understand the role of a standalone @verify attribute, there > is verification that can happen per attribute write, and there is > final verification that can happen at region bind time. Either way > anything verify would check is duplicated somewhere else, and the > verification per attribute update is more precise. For example writes > to @size can check for free space in parent decoder and fail if > unavailable. Writes to targetX can fail if the memdev is not connected > to this decoder's port topology, or the memdev is out of decoder > resources. The final region bind will fail if mid-level switches are > lacking decoder resources, or would require changing a decoder > configuration that is pinned active. I strongly believe verification per attribute write will get too fragile. I'm afraid it's going to require writing attributes in a specific order so that we can do said verification in a sane way. We can skip that and just check it all on bind. Most of that logic is what would be contained in verify(), so why not expose it for userspace that may want to test out various configs without actually trying to bind? Also, I like having ABI that helps userspace get details on the configuration failure reason. You mention in the other reply, TRACE_EVENT. I suppose userspace could use tracepoints, or scrape dmesg for this same info. Maybe it's 6 one way, a half dozen the other. I'd be interested to know if there are other examples of tracepoints being used by userspace in a way like this and what the experience is like. To summarize, I think we need an atomic way to do verification (which obviously happens at bind()), and I think we need UAPI to get the configuration error.
On Mon, Jun 14, 2021 at 9:12 AM Ben Widawsky <ben.widawsky@intel.com> wrote: > > On 21-06-11 17:44:02, Dan Williams wrote: > > On Thu, Jun 10, 2021 at 11:58 AM Ben Widawsky <ben.widawsky@intel.com> wrote: > > > > > > CXL interleave sets and non-interleave sets are described via regions. A region > > > is specified in the CXL 2.0 specification and the purpose is to create a > > > standardized way to preserve the region across reboots. > > > > > > Introduced here is the basic mechanism to create and configure and delete a CXL > > > region. Configuring a region simply means giving it a size, offset within the > > > CFMWS window, UUID, and a target list. Enabling/activating a region, which > > > ultimately means programming the HDM decoders in the chain, is left for later > > > work. > > > > > > The patches are only minimally tested so far in QEMU emulation and so x1 > > > interleave is all that's supported. > > > > > > Here is a sample topology (also in patch #4) > > > > I'm just going to react to the attributes before looking at the > > implementation to make sure we're level set. > > > > > > > > decoder1.0 > > > ├── create_region > > > ├── delete_region > > > ├── devtype > > > ├── locked > > > ├── region1.0:0 > > > │ ├── offset > > > > Is this the region's offset relative to the next available free space > > in the parent decoder range? If this is output only I think it's ok, > > but I think the address space allocation decision belongs to the > > region driver at activation time. I.e. userspace does not have much of > > a chance at specifying this relative all the other dynamic operations > > that can be happening in the decoder. > > > > This was my mistake. Offset will be determined by the driver and I intend for > this to be read-only. > > > > │ ├── size > > > │ ├── subsystem -> ../../../../../../../bus/cxl > > > │ ├── target0 > > > │ ├── uevent > > > │ ├── uuid > > > │ └── verify > > > > I don't understand the role of a standalone @verify attribute, there > > is verification that can happen per attribute write, and there is > > final verification that can happen at region bind time. Either way > > anything verify would check is duplicated somewhere else, and the > > verification per attribute update is more precise. For example writes > > to @size can check for free space in parent decoder and fail if > > unavailable. Writes to targetX can fail if the memdev is not connected > > to this decoder's port topology, or the memdev is out of decoder > > resources. The final region bind will fail if mid-level switches are > > lacking decoder resources, or would require changing a decoder > > configuration that is pinned active. > > I strongly believe verification per attribute write will get too fragile. I'm > afraid it's going to require writing attributes in a specific order so that we > can do said verification in a sane way. We can skip that and just check it all > on bind. Most of that logic is what would be contained in verify(), so why not > expose it for userspace that may want to test out various configs without > actually trying to bind? Because there's no harm in actually trying to bind. A verify attribute is at best redundant, or I am otherwise not understanding the proposed use case? > Also, I like having ABI that helps userspace get details on the configuration > failure reason. You mention in the other reply, TRACE_EVENT. I suppose userspace > could use tracepoints, or scrape dmesg for this same info. Maybe it's 6 one way, > a half dozen the other. 
I'd be interested to know if there are other examples of > tracepoints being used by userspace in a way like this and what the experience > is like. > > To summarize, I think we need an atomic way to do verification (which obviously > happens at bind()), and I think we need UAPI to get the configuration error. I expect higher order configuration error reporting and non-atomic pre-verification to come from user tooling. What the kernel can do at runtime in the absence of user tooling, or in the development of more aware tooling, has been debated in the past [1]. In this case the entire decoder resource topology is visible in userspace, and while userspace can't atomically predict what will happen, it also does not need to because the admin should not be racing resource querying and resource consumption if they want to get a reliable answer. The reason I recommended TRACE_EVENT() rather than dev_dbg() is due to being able to filter event messages by cpu, pid, tid, uid... etc. Another approach I have seen upstream is to emit extra variables with a KOBJ_CHANGE event, but that is more about event reporting than extra information about provisioning failure. [1]: https://lwn.net/Articles/657341/
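[Editorial note: for concreteness, a trace event along the lines suggested above might look roughly like this; the event name and fields are invented for this sketch, assumed to live in a driver-local trace.h.]

/*
 * Sketch only: a trace event for region configuration failures, as an
 * alternative to dev_dbg()/dmesg scraping.
 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM cxl

#if !defined(_CXL_REGION_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _CXL_REGION_TRACE_H

#include <linux/tracepoint.h>

TRACE_EVENT(cxl_region_config_fail,
	TP_PROTO(const char *region, const char *attr, int err),
	TP_ARGS(region, attr, err),
	TP_STRUCT__entry(
		__string(region, region)
		__string(attr, attr)
		__field(int, err)
	),
	TP_fast_assign(
		__assign_str(region, region);
		__assign_str(attr, attr);
		__entry->err = err;
	),
	TP_printk("%s: %s rejected: %d",
		  __get_str(region), __get_str(attr), __entry->err)
);

#endif /* _CXL_REGION_TRACE_H */

/* Boilerplate so define_trace.h can find this header (named trace.h here). */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .
#undef TRACE_INCLUDE_FILE
#define TRACE_INCLUDE_FILE trace
#include <trace/define_trace.h>

The driver would call trace_cxl_region_config_fail() wherever a configuration is rejected, and tooling could then filter the resulting events by cpu/pid/uid as described above.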
On 21-06-14 14:04:32, Dan Williams wrote: > On Mon, Jun 14, 2021 at 9:12 AM Ben Widawsky <ben.widawsky@intel.com> wrote: > > > > On 21-06-11 17:44:02, Dan Williams wrote: > > > On Thu, Jun 10, 2021 at 11:58 AM Ben Widawsky <ben.widawsky@intel.com> wrote: > > > > > > > > CXL interleave sets and non-interleave sets are described via regions. A region > > > > is specified in the CXL 2.0 specification and the purpose is to create a > > > > standardized way to preserve the region across reboots. > > > > > > > > Introduced here is the basic mechanism to create and configure and delete a CXL > > > > region. Configuring a region simply means giving it a size, offset within the > > > > CFMWS window, UUID, and a target list. Enabling/activating a region, which > > > > ultimately means programming the HDM decoders in the chain, is left for later > > > > work. > > > > > > > > The patches are only minimally tested so far in QEMU emulation and so x1 > > > > interleave is all that's supported. > > > > > > > > Here is a sample topology (also in patch #4) > > > > > > I'm just going to react to the attributes before looking at the > > > implementation to make sure we're level set. > > > > > > > > > > > decoder1.0 > > > > ├── create_region > > > > ├── delete_region > > > > ├── devtype > > > > ├── locked > > > > ├── region1.0:0 > > > > │ ├── offset > > > > > > Is this the region's offset relative to the next available free space > > > in the parent decoder range? If this is output only I think it's ok, > > > but I think the address space allocation decision belongs to the > > > region driver at activation time. I.e. userspace does not have much of > > > a chance at specifying this relative all the other dynamic operations > > > that can be happening in the decoder. > > > > > > > This was my mistake. Offset will be determined by the driver and I intend for > > this to be read-only. > > > > > > │ ├── size > > > > │ ├── subsystem -> ../../../../../../../bus/cxl > > > > │ ├── target0 > > > > │ ├── uevent > > > > │ ├── uuid > > > > │ └── verify > > > > > > I don't understand the role of a standalone @verify attribute, there > > > is verification that can happen per attribute write, and there is > > > final verification that can happen at region bind time. Either way > > > anything verify would check is duplicated somewhere else, and the > > > verification per attribute update is more precise. For example writes > > > to @size can check for free space in parent decoder and fail if > > > unavailable. Writes to targetX can fail if the memdev is not connected > > > to this decoder's port topology, or the memdev is out of decoder > > > resources. The final region bind will fail if mid-level switches are > > > lacking decoder resources, or would require changing a decoder > > > configuration that is pinned active. > > > > I strongly believe verification per attribute write will get too fragile. I'm > > afraid it's going to require writing attributes in a specific order so that we > > can do said verification in a sane way. We can skip that and just check it all > > on bind. Most of that logic is what would be contained in verify(), so why not > > expose it for userspace that may want to test out various configs without > > actually trying to bind? > > Because there's no harm in actually trying to bind. A verify attribute > is at best redundant, or I am otherwise not understanding the proposed > use case? > That's the use case. Though I don't consider it redundant. 
All bind() can return is errnos + what you mention below (and following LWN link). > > Also, I like having ABI that helps userspace get details on the configuration > > failure reason. You mention in the other reply, TRACE_EVENT. I suppose userspace > > could use tracepoints, or scrape dmesg for this same info. Maybe it's 6 one way, > > a half dozen the other. I'd be interested to know if there are other examples of > > tracepoints being used by userspace in a way like this and what the experience > > is like. > > > > To summarize, I think we need an atomic way to do verification (which obviously > > happens at bind()), and I think we need UAPI to get the configuration error. > > I expect higher order configuration error reporting and non-atomic > pre-verification to come from user tooling. But isn't that just duplicating code that we have to have in the kernel anyway? > As for what the kernel can do at runtime in the absence of user tooling, or in > the development of more aware tooling has been debated in the past [1]. In > this case the entire decoder resource topology is visible in userspace, an3d > while userspace can't atomically predict what will happen, it also does not > need to because the admin should not be racing resource querying and resource > consumption if they want to get a reliable answer. The reason I recommended > TRACE_EVENT() rather than dev_dbg() is due to being able to filter event > messages by cpu, pid, tid, uid... etc. Another approach I have seen upstream > is to emit extra variables with a KOBJ_CHANGE event, but that is more about > event reporting than extra information about provisioning failure. Interesting. Thanks for the link, it looks like it never landed. I think trace makes a good deal of sense considering all the options. I'm not convinced the interface is "at best redundant". I'll just drop verify(). I have no further arguments in favor and you don't sound convinced of the original ones. > > [1]: https://lwn.net/Articles/657341/
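[Editorial note: on the consumer side, tooling would not need to scrape dmesg. A sketch of reading such an event through tracefs, assuming the hypothetical cxl_region_config_fail event above and tracefs mounted at /sys/kernel/tracing.]

/*
 * Sketch of tooling consuming the hypothetical cxl_region_config_fail event
 * via tracefs instead of scraping dmesg.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/kernel/tracing/events/cxl/cxl_region_config_fail/enable",
		      O_WRONLY);
	FILE *tp;
	char line[512];

	if (fd >= 0) {
		write(fd, "1", 1);	/* enable the event */
		close(fd);
	}

	/* ... attempt region configuration / driver bind here ... */

	/* trace_pipe streams events as they arrive (reads block when idle). */
	tp = fopen("/sys/kernel/tracing/trace_pipe", "r");
	if (!tp)
		return 1;
	while (fgets(line, sizeof(line), tp))
		if (strstr(line, "cxl_region_config_fail"))
			fprintf(stderr, "region config failure: %s", line);
	fclose(tp);

	return 0;
}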
On Mon, Jun 14, 2021 at 2:54 PM Ben Widawsky <ben.widawsky@intel.com> wrote: > > On 21-06-14 14:04:32, Dan Williams wrote: > > On Mon, Jun 14, 2021 at 9:12 AM Ben Widawsky <ben.widawsky@intel.com> wrote: > > > > > > On 21-06-11 17:44:02, Dan Williams wrote: > > > > On Thu, Jun 10, 2021 at 11:58 AM Ben Widawsky <ben.widawsky@intel.com> wrote: > > > > > > > > > > CXL interleave sets and non-interleave sets are described via regions. A region > > > > > is specified in the CXL 2.0 specification and the purpose is to create a > > > > > standardized way to preserve the region across reboots. > > > > > > > > > > Introduced here is the basic mechanism to create and configure and delete a CXL > > > > > region. Configuring a region simply means giving it a size, offset within the > > > > > CFMWS window, UUID, and a target list. Enabling/activating a region, which > > > > > ultimately means programming the HDM decoders in the chain, is left for later > > > > > work. > > > > > > > > > > The patches are only minimally tested so far in QEMU emulation and so x1 > > > > > interleave is all that's supported. > > > > > > > > > > Here is a sample topology (also in patch #4) > > > > > > > > I'm just going to react to the attributes before looking at the > > > > implementation to make sure we're level set. > > > > > > > > > > > > > > decoder1.0 > > > > > ├── create_region > > > > > ├── delete_region > > > > > ├── devtype > > > > > ├── locked > > > > > ├── region1.0:0 > > > > > │ ├── offset > > > > > > > > Is this the region's offset relative to the next available free space > > > > in the parent decoder range? If this is output only I think it's ok, > > > > but I think the address space allocation decision belongs to the > > > > region driver at activation time. I.e. userspace does not have much of > > > > a chance at specifying this relative all the other dynamic operations > > > > that can be happening in the decoder. > > > > > > > > > > This was my mistake. Offset will be determined by the driver and I intend for > > > this to be read-only. > > > > > > > > │ ├── size > > > > > │ ├── subsystem -> ../../../../../../../bus/cxl > > > > > │ ├── target0 > > > > > │ ├── uevent > > > > > │ ├── uuid > > > > > │ └── verify > > > > > > > > I don't understand the role of a standalone @verify attribute, there > > > > is verification that can happen per attribute write, and there is > > > > final verification that can happen at region bind time. Either way > > > > anything verify would check is duplicated somewhere else, and the > > > > verification per attribute update is more precise. For example writes > > > > to @size can check for free space in parent decoder and fail if > > > > unavailable. Writes to targetX can fail if the memdev is not connected > > > > to this decoder's port topology, or the memdev is out of decoder > > > > resources. The final region bind will fail if mid-level switches are > > > > lacking decoder resources, or would require changing a decoder > > > > configuration that is pinned active. > > > > > > I strongly believe verification per attribute write will get too fragile. I'm > > > afraid it's going to require writing attributes in a specific order so that we > > > can do said verification in a sane way. We can skip that and just check it all > > > on bind. Most of that logic is what would be contained in verify(), so why not > > > expose it for userspace that may want to test out various configs without > > > actually trying to bind? > > > > Because there's no harm in actually trying to bind. 
A verify attribute > > is at best redundant, or I am otherwise not understanding the proposed > > use case? > > > > That's the use case. Though I don't consider it redundant. All bind() can return > is errnos + what you mention below (and following LWN link). > > > > Also, I like having ABI that helps userspace get details on the configuration > > > failure reason. You mention in the other reply, TRACE_EVENT. I suppose userspace > > > could use tracepoints, or scrape dmesg for this same info. Maybe it's 6 one way, > > > a half dozen the other. I'd be interested to know if there are other examples of > > > tracepoints being used by userspace in a way like this and what the experience > > > is like. > > > > > > To summarize, I think we need an atomic way to do verification (which obviously > > > happens at bind()), and I think we need UAPI to get the configuration error. > > > > I expect higher order configuration error reporting and non-atomic > > pre-verification to come from user tooling. > > But isn't that just duplicating code that we have to have in the kernel anyway? It is, but it's less constrained. The kernel could only tell you yes or no for a given region verification. The tooling can identify tradeoffs and potential resource collisions across regions. > > As for what the kernel can do at runtime in the absence of user tooling, or in > > the development of more aware tooling has been debated in the past [1]. In > > this case the entire decoder resource topology is visible in userspace, an3d > > while userspace can't atomically predict what will happen, it also does not > > need to because the admin should not be racing resource querying and resource > > consumption if they want to get a reliable answer. The reason I recommended > > TRACE_EVENT() rather than dev_dbg() is due to being able to filter event > > messages by cpu, pid, tid, uid... etc. Another approach I have seen upstream > > is to emit extra variables with a KOBJ_CHANGE event, but that is more about > > event reporting than extra information about provisioning failure. > > Interesting. Thanks for the link, it looks like it never landed. I think trace > makes a good deal of sense considering all the options. I'm not convinced the > interface is "at best redundant". I'll just drop verify(). I have no further > arguments in favor and you don't sound convinced of the original ones. My main concern is the atomicity of the verify response when two agents are racing through the resource tree. The user tooling can either explicitly coordinate with other user tooling, or the admin could implicitly avoid races. When the races happen they will likely be cases where bind() fails and verify() said everything was ok. In those cases we'll still need a mechanism to understand why bind() failed and by that time it seems everything verify() might have been able to offer has been reimplemented for tracing bind().