diff mbox series

[v3,02/14] cxl/region: Introduce concept of region configuration

Message ID 20220128002707.391076-3-ben.widawsky@intel.com (mailing list archive)
State New, archived
Headers show
Series CXL Region driver | expand

Commit Message

Ben Widawsky Jan. 28, 2022, 12:26 a.m. UTC
The region creation APIs create a vacant region. Configuring the region
works in the same way as similar subsystems such as devdax. Sysfs attrs
will be provided to allow userspace to configure the region.  Finally
once all configuration is complete, userspace may activate the region.

Introduced here are the most basic attributes needed to configure a
region. Details of these attribute are described in the ABI
Documentation. Sanity checking of configuration parameters are done at
region binding time. This consolidates all such logic in one place,
rather than being strewn across multiple places.

A example is provided below:

/sys/bus/cxl/devices/region0.0:0
├── interleave_granularity
├── interleave_ways
├── offset
├── size
├── subsystem -> ../../../../../../bus/cxl
├── target0
├── uevent
└── uuid

Reported-by: kernel test robot <lkp@intel.com> (v2)
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
 drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
 2 files changed, 340 insertions(+)

Comments

Dan Williams Jan. 29, 2022, 12:25 a.m. UTC | #1
On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> The region creation APIs create a vacant region. Configuring the region
> works in the same way as similar subsystems such as devdax. Sysfs attrs
> will be provided to allow userspace to configure the region.  Finally
> once all configuration is complete, userspace may activate the region.
>
> Introduced here are the most basic attributes needed to configure a
> region. Details of these attribute are described in the ABI

s/attribute/attributes/

> Documentation. Sanity checking of configuration parameters are done at
> region binding time. This consolidates all such logic in one place,
> rather than being strewn across multiple places.

I think that's too late for some of the validation. The complex
validation that the region driver does throughout the topology is
different from the basic input validation that can  be done at the
sysfs write time. For example ,this patch allows negative
interleave_granularity values to specified, just return -EINVAL. I
agree that sysfs should not validate everything, I disagree with
pushing all validation to cxl_region_probe().

>
> A example is provided below:
>
> /sys/bus/cxl/devices/region0.0:0
> ├── interleave_granularity
> ├── interleave_ways
> ├── offset
> ├── size
> ├── subsystem -> ../../../../../../bus/cxl
> ├── target0
> ├── uevent
> └── uuid

As mentioned off-list, it looks like devtype and modalias are missing.

>
> Reported-by: kernel test robot <lkp@intel.com> (v2)
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
>  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
>  2 files changed, 340 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index dcc728458936..50ba5018014d 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -187,3 +187,43 @@ Description:
>                 region driver before being deleted. The attributes expects a
>                 region in the form "regionX.Y:Z". The region's name, allocated
>                 by reading create_region, will also be released.
> +
> +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset

This is just another 'resource' attribute for the physical base
address of the region, right? 'offset' sounds like something that
would be relative instead of absolute.

> +Date:          August, 2021

Same date update comment here.

> +KernelVersion: v5.18
> +Contact:       linux-cxl@vger.kernel.org
> +Description:
> +               (RO) A region resides within an address space that is claimed by
> +               a decoder.

"A region is a contiguous partition of a CXL Root decoder address space."

>                  Region space allocation is handled by the driver, but

"Region capacity is allocated by writing to the size attribute, the
resulting physical address base determined by the driver is reflected
here."

> +               the offset may be read by userspace tooling in order to
> +               determine fragmentation, and available size for new regions.

I would also expect, before / along with these new region attributes,
there would be 'available' and 'max_extent_available' at the decoder
level to indicate how much free space the decoder has and how big the
next region creation can be. User tooling can walk  the decoder and
the regions together to determine fragmentation if necessary, but for
the most part the tool likely only cares about "how big can the next
region be?" and "how full is this decoder?".


> +
> +What:
> +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> +Date:          August, 2021
> +KernelVersion: v5.18
> +Contact:       linux-cxl@vger.kernel.org
> +Description:
> +               (RW) Configuring regions requires a minimal set of parameters in
> +               order for the subsequent bind operation to succeed. The
> +               following parameters are defined:

Let's split up the descriptions into individual sections. That can
also document the order that attributes must be written. For example,
doesn't size need to be set before targets are added so that targets
can be validated whether they have sufficient capacity?

> +
> +               ==      ========================================================
> +               interleave_granularity Mandatory. Number of consecutive bytes
> +                       each device in the interleave set will claim. The
> +                       possible interleave granularity values are determined by
> +                       the CXL spec and the participating devices.
> +               interleave_ways Mandatory. Number of devices participating in the
> +                       region. Each device will provide 1/interleave of storage
> +                       for the region.
> +               size    Manadatory. Phsyical address space the region will
> +                       consume.

s/Phsyical/Physical/

> +               target  Mandatory. Memory devices are the backing storage for a
> +                       region. There will be N targets based on the number of
> +                       interleave ways that the top level decoder is configured
> +                       for.

That doesn't sound right, IW at the root != IW at the endpoint level
and the region needs to record all the endpoint level targets.

> Each target must be set with a memdev device ie.
> +                       'mem1'. This attribute only becomes available after
> +                       setting the 'interleave' attribute.
> +               uuid    Optional. A unique identifier for the region. If none is
> +                       selected, the kernel will create one.

Let's drop the Mandatory / Optional distinction, or I am otherwise not
understanding what this is trying to document. For example 'uuid' is
"mandatory" for PMEM regions and "omitted" for volatile regions not
optional.

> +               ==      ========================================================
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 1a448543db0d..3b48e0469fc7 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3,9 +3,12 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/device.h>
>  #include <linux/module.h>
> +#include <linux/sizes.h>
>  #include <linux/slab.h>
> +#include <linux/uuid.h>
>  #include <linux/idr.h>
>  #include <region.h>
> +#include <cxlmem.h>
>  #include <cxl.h>
>  #include "core.h"
>
> @@ -18,11 +21,305 @@
>   * (programming the hardware) is handled by a separate region driver.
>   */
>
> +struct cxl_region *to_cxl_region(struct device *dev);
> +static const struct attribute_group region_interleave_group;
> +
> +static bool is_region_active(struct cxl_region *cxlr)
> +{
> +       /* TODO: Regions can't be activated yet. */
> +       return false;

This function seems redundant with just checking "cxlr->dev.driver !=
NULL"? The benefit of that is there is no need to carry a TODO in the
series.

> +}
> +
> +static void remove_target(struct cxl_region *cxlr, int target)
> +{
> +       struct cxl_memdev *cxlmd;
> +
> +       cxlmd = cxlr->config.targets[target];
> +       if (cxlmd)
> +               put_device(&cxlmd->dev);

A memdev can be a member of multiple regions at once, shouldn't this
be an endpoint decoder or similar, not the entire memdev?

Also, if memdevs autoremove themselves from regions at memdev
->remove() time then I don't think the region needs to hold references
on memdevs.

> +       cxlr->config.targets[target] = NULL;
> +}
> +
> +static ssize_t interleave_ways_show(struct device *dev,
> +                                   struct device_attribute *attr, char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> +}
> +
> +static ssize_t interleave_ways_store(struct device *dev,
> +                                    struct device_attribute *attr,
> +                                    const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       int ret, prev_iw;
> +       int val;

I would expect:

if (dev->driver)
   return -EBUSY;

...to shutdown configuration writes once the region is active. Might
also need a region-wide seqlock like target_list_show. So that region
probe drains  all active sysfs writers before assuming the
configuration is stable.

> +
> +       prev_iw = cxlr->config.interleave_ways;
> +       ret = kstrtoint(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> +               return -EINVAL;
> +
> +       cxlr->config.interleave_ways = val;
> +
> +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> +       if (ret < 0)
> +               goto err;
> +
> +       sysfs_notify(&dev->kobj, NULL, "target_interleave");

Why?

> +
> +       while (prev_iw > cxlr->config.interleave_ways)
> +               remove_target(cxlr, --prev_iw);

To make the kernel side simpler this attribute could just require that
setting interleave ways is a one way street, if you want to change it
you need to delete the region and start over.

> +
> +       return len;
> +
> +err:
> +       cxlr->config.interleave_ways = prev_iw;
> +       return ret;
> +}
> +static DEVICE_ATTR_RW(interleave_ways);
> +
> +static ssize_t interleave_granularity_show(struct device *dev,
> +                                          struct device_attribute *attr,
> +                                          char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> +}
> +
> +static ssize_t interleave_granularity_store(struct device *dev,
> +                                           struct device_attribute *attr,
> +                                           const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       int val, ret;
> +
> +       ret = kstrtoint(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +       cxlr->config.interleave_granularity = val;

This wants minimum input validation and synchronization against an
active region.

> +
> +       return len;
> +}
> +static DEVICE_ATTR_RW(interleave_granularity);
> +
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> +                          char *buf)
> +{
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       resource_size_t offset;
> +
> +       if (!cxlr->res)
> +               return sysfs_emit(buf, "\n");

Should be an error I would think. I.e. require size to be set before
s/offset/resource/ can be read.

> +
> +       offset = cxld->platform_res.start - cxlr->res->start;

Why make usersapce do the offset math?

> +
> +       return sysfs_emit(buf, "%pa\n", &offset);
> +}
> +static DEVICE_ATTR_RO(offset);

This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
information from non-root.

> +
> +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> +                        char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);

Perhaps no need to store size separately if this becomes:

sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));


...?

> +}
> +
> +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> +                         const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       unsigned long long val;
> +       ssize_t rc;
> +
> +       rc = kstrtoull(buf, 0, &val);
> +       if (rc)
> +               return rc;
> +
> +       device_lock(&cxlr->dev);
> +       if (is_region_active(cxlr))
> +               rc = -EBUSY;
> +       else
> +               cxlr->config.size = val;
> +       device_unlock(&cxlr->dev);

I think lockdep will complain about device_lock() usage in an
attribute. Try changing this to cxl_device_lock() with
CONFIG_PROVE_CXL_LOCKING=y.

> +
> +       return rc ? rc : len;
> +}
> +static DEVICE_ATTR_RW(size);
> +
> +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> +                        char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> +}
> +
> +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> +                         const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       ssize_t rc;
> +
> +       if (len != UUID_STRING_LEN + 1)
> +               return -EINVAL;
> +
> +       device_lock(&cxlr->dev);
> +       if (is_region_active(cxlr))
> +               rc = -EBUSY;
> +       else
> +               rc = uuid_parse(buf, &cxlr->config.uuid);
> +       device_unlock(&cxlr->dev);
> +
> +       return rc ? rc : len;
> +}
> +static DEVICE_ATTR_RW(uuid);
> +
> +static struct attribute *region_attrs[] = {
> +       &dev_attr_interleave_ways.attr,
> +       &dev_attr_interleave_granularity.attr,
> +       &dev_attr_offset.attr,
> +       &dev_attr_size.attr,
> +       &dev_attr_uuid.attr,
> +       NULL,
> +};
> +
> +static const struct attribute_group region_group = {
> +       .attrs = region_attrs,
> +};
> +
> +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> +{
> +       int ret;
> +
> +       device_lock(&cxlr->dev);
> +       if (!cxlr->config.targets[n])
> +               ret = sysfs_emit(buf, "\n");
> +       else
> +               ret = sysfs_emit(buf, "%s\n",
> +                                dev_name(&cxlr->config.targets[n]->dev));
> +       device_unlock(&cxlr->dev);

The component contribution of a memdev to a region is a DPA-span, not
the whole memdev. I would expect something like dax_mapping_attributes
or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
information about the component contribution of a memdev to a region.

> +
> +       return ret;
> +}
> +
> +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> +                         size_t len)
> +{
> +       struct device *memdev_dev;
> +       struct cxl_memdev *cxlmd;
> +
> +       device_lock(&cxlr->dev);
> +
> +       if (len == 1 || cxlr->config.targets[n])
> +               remove_target(cxlr, n);
> +
> +       /* Remove target special case */
> +       if (len == 1) {
> +               device_unlock(&cxlr->dev);
> +               return len;
> +       }
> +
> +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);

I think this wants to be an endpoint decoder, not a memdev. Because
it's the decoder that joins a memdev to a region, or at least a
decoder should be picked when the memdev is assigned so that the DPA
mapping can be registered. If all the decoders are allocated then fail
here.

> +       if (!memdev_dev) {
> +               device_unlock(&cxlr->dev);
> +               return -ENOENT;
> +       }
> +
> +       /* reference to memdev held until target is unset or region goes away */
> +
> +       cxlmd = to_cxl_memdev(memdev_dev);
> +       cxlr->config.targets[n] = cxlmd;
> +
> +       device_unlock(&cxlr->dev);
> +
> +       return len;
> +}
> +
> +#define TARGET_ATTR_RW(n)                                                      \
> +       static ssize_t target##n##_show(                                       \
> +               struct device *dev, struct device_attribute *attr, char *buf)  \
> +       {                                                                      \
> +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> +       }                                                                      \
> +       static ssize_t target##n##_store(struct device *dev,                   \
> +                                        struct device_attribute *attr,        \
> +                                        const char *buf, size_t len)          \
> +       {                                                                      \
> +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> +       }                                                                      \
> +       static DEVICE_ATTR_RW(target##n)
> +
> +TARGET_ATTR_RW(0);
> +TARGET_ATTR_RW(1);
> +TARGET_ATTR_RW(2);
> +TARGET_ATTR_RW(3);
> +TARGET_ATTR_RW(4);
> +TARGET_ATTR_RW(5);
> +TARGET_ATTR_RW(6);
> +TARGET_ATTR_RW(7);
> +TARGET_ATTR_RW(8);
> +TARGET_ATTR_RW(9);
> +TARGET_ATTR_RW(10);
> +TARGET_ATTR_RW(11);
> +TARGET_ATTR_RW(12);
> +TARGET_ATTR_RW(13);
> +TARGET_ATTR_RW(14);
> +TARGET_ATTR_RW(15);
> +
> +static struct attribute *interleave_attrs[] = {
> +       &dev_attr_target0.attr,
> +       &dev_attr_target1.attr,
> +       &dev_attr_target2.attr,
> +       &dev_attr_target3.attr,
> +       &dev_attr_target4.attr,
> +       &dev_attr_target5.attr,
> +       &dev_attr_target6.attr,
> +       &dev_attr_target7.attr,
> +       &dev_attr_target8.attr,
> +       &dev_attr_target9.attr,
> +       &dev_attr_target10.attr,
> +       &dev_attr_target11.attr,
> +       &dev_attr_target12.attr,
> +       &dev_attr_target13.attr,
> +       &dev_attr_target14.attr,
> +       &dev_attr_target15.attr,
> +       NULL,
> +};
> +
> +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> +{
> +       struct device *dev = container_of(kobj, struct device, kobj);
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       if (n < cxlr->config.interleave_ways)
> +               return a->mode;
> +       return 0;
> +}
> +
> +static const struct attribute_group region_interleave_group = {
> +       .attrs = interleave_attrs,
> +       .is_visible = visible_targets,
> +};
> +
> +static const struct attribute_group *region_groups[] = {
> +       &region_group,
> +       &region_interleave_group,
> +       NULL,
> +};
> +
>  static void cxl_region_release(struct device *dev);
>
>  static const struct device_type cxl_region_type = {
>         .name = "cxl_region",
>         .release = cxl_region_release,
> +       .groups = region_groups
>  };
>
>  static ssize_t create_region_show(struct device *dev,
> @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
>  {
>         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
>         struct cxl_region *cxlr = to_cxl_region(dev);
> +       int i;
>
>         ida_free(&cxld->region_ida, cxlr->id);
> +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> +               remove_target(cxlr, i);

Like the last patch this feels too late. I expect whatever unregisters
the region should have already handled removing the targets.

>         kfree(cxlr);
>  }
>
> --
> 2.35.0
>
Ben Widawsky Feb. 1, 2022, 2:59 p.m. UTC | #2
I will cut to the part that effects ABI so tool development can continue. I'll
get back to the other bits later.

On 22-01-28 16:25:34, Dan Williams wrote:

[snip]

> 
> > +
> > +       return ret;
> > +}
> > +
> > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > +                         size_t len)
> > +{
> > +       struct device *memdev_dev;
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       device_lock(&cxlr->dev);
> > +
> > +       if (len == 1 || cxlr->config.targets[n])
> > +               remove_target(cxlr, n);
> > +
> > +       /* Remove target special case */
> > +       if (len == 1) {
> > +               device_unlock(&cxlr->dev);
> > +               return len;
> > +       }
> > +
> > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> 
> I think this wants to be an endpoint decoder, not a memdev. Because
> it's the decoder that joins a memdev to a region, or at least a
> decoder should be picked when the memdev is assigned so that the DPA
> mapping can be registered. If all the decoders are allocated then fail
> here.
> 

You've put two points in here:

1. Handle decoder allocation at sysfs boundary. I'll respond to this when I come
back around to the rest of the review comments.

2. Take a decoder for target instead of a memdev. I don't agree with this
direction as it's asymmetric to how LSA processing works. The goal was to model
the LSA for configuration. The kernel will have to be in the business of
reserving and enumerating decoders out of memdevs for both LSA (where we have a
list of memdevs) and volatile (where we use the memdevs in the system to
enumerate populated decoders). I don't see much value in making userspace do the
same.

I'd like to ask you reconsider if you still think it's preferable to use
decoders as part of the ABI and if you still feel that way I can go change it
since it has minimal impact overall.

> > +       if (!memdev_dev) {
> > +               device_unlock(&cxlr->dev);
> > +               return -ENOENT;
> > +       }
> > +
> > +       /* reference to memdev held until target is unset or region goes away */
> > +
> > +       cxlmd = to_cxl_memdev(memdev_dev);
> > +       cxlr->config.targets[n] = cxlmd;
> > +
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return len;
> > +}
> > +

[snip]
Ben Widawsky Feb. 1, 2022, 11:11 p.m. UTC | #3
On 22-01-28 16:25:34, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > The region creation APIs create a vacant region. Configuring the region
> > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > will be provided to allow userspace to configure the region.  Finally
> > once all configuration is complete, userspace may activate the region.
> >
> > Introduced here are the most basic attributes needed to configure a
> > region. Details of these attribute are described in the ABI
> 
> s/attribute/attributes/
> 
> > Documentation. Sanity checking of configuration parameters are done at
> > region binding time. This consolidates all such logic in one place,
> > rather than being strewn across multiple places.
> 
> I think that's too late for some of the validation. The complex
> validation that the region driver does throughout the topology is
> different from the basic input validation that can  be done at the
> sysfs write time. For example ,this patch allows negative
> interleave_granularity values to specified, just return -EINVAL. I
> agree that sysfs should not validate everything, I disagree with
> pushing all validation to cxl_region_probe().
> 

Two points:c
1. How do we distinguish "basic input validation". It'd be good if we could
   define "basic input validation". For instance, when I first wrote these
   patches, x3 would have been EINVAL, but today it's allowed. Can you help
   enumerate what you consider basic.

2. I like the idea that all validation takes place in one place. Obviously you
   do not. So, see #1 and I will rework.

> >
> > A example is provided below:
> >
> > /sys/bus/cxl/devices/region0.0:0
> > ├── interleave_granularity
> > ├── interleave_ways
> > ├── offset
> > ├── size
> > ├── subsystem -> ../../../../../../bus/cxl
> > ├── target0
> > ├── uevent
> > └── uuid
> 
> As mentioned off-list, it looks like devtype and modalias are missing.
> 

Thanks.

> >
> > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> >  2 files changed, 340 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index dcc728458936..50ba5018014d 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -187,3 +187,43 @@ Description:
> >                 region driver before being deleted. The attributes expects a
> >                 region in the form "regionX.Y:Z". The region's name, allocated
> >                 by reading create_region, will also be released.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> 
> This is just another 'resource' attribute for the physical base
> address of the region, right? 'offset' sounds like something that
> would be relative instead of absolute.
> 

It is offset. I can change it to physical base if you'd like but I thought that
information wasn't critically important for userspace to have. Does userspace
care about the physical base?

> > +Date:          August, 2021
> 
> Same date update comment here.
> 
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               (RO) A region resides within an address space that is claimed by
> > +               a decoder.
> 
> "A region is a contiguous partition of a CXL Root decoder address space."
> 
> >                  Region space allocation is handled by the driver, but
> 
> "Region capacity is allocated by writing to the size attribute, the
> resulting physical address base determined by the driver is reflected
> here."
> 
> > +               the offset may be read by userspace tooling in order to
> > +               determine fragmentation, and available size for new regions.
> 
> I would also expect, before / along with these new region attributes,
> there would be 'available' and 'max_extent_available' at the decoder
> level to indicate how much free space the decoder has and how big the
> next region creation can be. User tooling can walk  the decoder and
> the regions together to determine fragmentation if necessary, but for
> the most part the tool likely only cares about "how big can the next
> region be?" and "how full is this decoder?".

Sounds good.

> 
> 
> > +
> > +What:
> > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > +Date:          August, 2021
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               (RW) Configuring regions requires a minimal set of parameters in
> > +               order for the subsequent bind operation to succeed. The
> > +               following parameters are defined:
> 
> Let's split up the descriptions into individual sections. That can
> also document the order that attributes must be written. For example,
> doesn't size need to be set before targets are added so that targets
> can be validated whether they have sufficient capacity?
> 

Okay. Order doesn't matter if you do validation all in one place as it is, but
sounds like we're changing that. So I can split it when we figure out what
validation is actually occurring at the sysfs attr boundary.

> > +
> > +               ==      ========================================================
> > +               interleave_granularity Mandatory. Number of consecutive bytes
> > +                       each device in the interleave set will claim. The
> > +                       possible interleave granularity values are determined by
> > +                       the CXL spec and the participating devices.
> > +               interleave_ways Mandatory. Number of devices participating in the
> > +                       region. Each device will provide 1/interleave of storage
> > +                       for the region.
> > +               size    Manadatory. Phsyical address space the region will
> > +                       consume.
> 
> s/Phsyical/Physical/
> 
> > +               target  Mandatory. Memory devices are the backing storage for a
> > +                       region. There will be N targets based on the number of
> > +                       interleave ways that the top level decoder is configured
> > +                       for.
> 
> That doesn't sound right, IW at the root != IW at the endpoint level
> and the region needs to record all the endpoint level targets.


Yes This is wrong. I thought I had fixed it, but I guess not.

> 
> > Each target must be set with a memdev device ie.
> > +                       'mem1'. This attribute only becomes available after
> > +                       setting the 'interleave' attribute.
> > +               uuid    Optional. A unique identifier for the region. If none is
> > +                       selected, the kernel will create one.
> 
> Let's drop the Mandatory / Optional distinction, or I am otherwise not
> understanding what this is trying to document. For example 'uuid' is
> "mandatory" for PMEM regions and "omitted" for volatile regions not
> optional.
> 

Well the kernel fills it in if userspace leaves it out. I'm guessing you're
going to ask me to change that, so I will remove Mandatory/Optional.

> > +               ==      ========================================================
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 1a448543db0d..3b48e0469fc7 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -3,9 +3,12 @@
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> >  #include <linux/device.h>
> >  #include <linux/module.h>
> > +#include <linux/sizes.h>
> >  #include <linux/slab.h>
> > +#include <linux/uuid.h>
> >  #include <linux/idr.h>
> >  #include <region.h>
> > +#include <cxlmem.h>
> >  #include <cxl.h>
> >  #include "core.h"
> >
> > @@ -18,11 +21,305 @@
> >   * (programming the hardware) is handled by a separate region driver.
> >   */
> >
> > +struct cxl_region *to_cxl_region(struct device *dev);
> > +static const struct attribute_group region_interleave_group;
> > +
> > +static bool is_region_active(struct cxl_region *cxlr)
> > +{
> > +       /* TODO: Regions can't be activated yet. */
> > +       return false;
> 
> This function seems redundant with just checking "cxlr->dev.driver !=
> NULL"? The benefit of that is there is no need to carry a TODO in the
> series.
> 

Yeah. I think checking driver bind status is sufficient to replace this.

> > +}
> > +
> > +static void remove_target(struct cxl_region *cxlr, int target)
> > +{
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       cxlmd = cxlr->config.targets[target];
> > +       if (cxlmd)
> > +               put_device(&cxlmd->dev);
> 
> A memdev can be a member of multiple regions at once, shouldn't this
> be an endpoint decoder or similar, not the entire memdev?

Is this referring to the later question about whether targets are decoders or
memdevs? The thought was each region would hold a reference to all memdevs in
the interleave set.

> 
> Also, if memdevs autoremove themselves from regions at memdev
> ->remove() time then I don't think the region needs to hold references
> on memdevs.
> 

I'll defer to you on that. I'll remove holding the reference, but I definitely
haven't solved the interaction when a memdev goes away. I had been thinking the
inverse originally, a memdev can't go away until the region is gone. According
to the spec, these devices can't be hot removed, only managed remove, so if
things blew up, not our problem. However, if we have decent infrastructure to
support better than that, we should.

> > +       cxlr->config.targets[target] = NULL;
> > +}
> > +
> > +static ssize_t interleave_ways_show(struct device *dev,
> > +                                   struct device_attribute *attr, char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > +}
> > +
> > +static ssize_t interleave_ways_store(struct device *dev,
> > +                                    struct device_attribute *attr,
> > +                                    const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int ret, prev_iw;
> > +       int val;
> 
> I would expect:
> 
> if (dev->driver)
>    return -EBUSY;
> 
> ...to shutdown configuration writes once the region is active. Might
> also need a region-wide seqlock like target_list_show. So that region
> probe drains  all active sysfs writers before assuming the
> configuration is stable.
> 

Okay.

> > +
> > +       prev_iw = cxlr->config.interleave_ways;
> > +       ret = kstrtoint(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > +               return -EINVAL;
> > +
> > +       cxlr->config.interleave_ways = val;
> > +
> > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > +       if (ret < 0)
> > +               goto err;
> > +
> > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> 
> Why?
> 

I copied it from another driver. I didn't check if it was actually needed or
not.

> > +
> > +       while (prev_iw > cxlr->config.interleave_ways)
> > +               remove_target(cxlr, --prev_iw);
> 
> To make the kernel side simpler this attribute could just require that
> setting interleave ways is a one way street, if you want to change it
> you need to delete the region and start over.
> 

I'm fine with that.

> > +
> > +       return len;
> > +
> > +err:
> > +       cxlr->config.interleave_ways = prev_iw;
> > +       return ret;
> > +}
> > +static DEVICE_ATTR_RW(interleave_ways);
> > +
> > +static ssize_t interleave_granularity_show(struct device *dev,
> > +                                          struct device_attribute *attr,
> > +                                          char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> > +}
> > +
> > +static ssize_t interleave_granularity_store(struct device *dev,
> > +                                           struct device_attribute *attr,
> > +                                           const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int val, ret;
> > +
> > +       ret = kstrtoint(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +       cxlr->config.interleave_granularity = val;
> 
> This wants minimum input validation and synchronization against an
> active region.
> 
> > +
> > +       return len;
> > +}
> > +static DEVICE_ATTR_RW(interleave_granularity);
> > +
> > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > +                          char *buf)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       resource_size_t offset;
> > +
> > +       if (!cxlr->res)
> > +               return sysfs_emit(buf, "\n");
> 
> Should be an error I would think. I.e. require size to be set before
> s/offset/resource/ can be read.
> 
> > +
> > +       offset = cxld->platform_res.start - cxlr->res->start;
> 
> Why make usersapce do the offset math?
> 
> > +
> > +       return sysfs_emit(buf, "%pa\n", &offset);
> > +}
> > +static DEVICE_ATTR_RO(offset);
> 
> This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
> information from non-root.
> 
> > +
> > +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> > +                        char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);
> 
> Perhaps no need to store size separately if this becomes:
> 
> sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));
> 
> 
> ...?
> 
> > +}
> > +
> > +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> > +                         const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       unsigned long long val;
> > +       ssize_t rc;
> > +
> > +       rc = kstrtoull(buf, 0, &val);
> > +       if (rc)
> > +               return rc;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (is_region_active(cxlr))
> > +               rc = -EBUSY;
> > +       else
> > +               cxlr->config.size = val;
> > +       device_unlock(&cxlr->dev);
> 
> I think lockdep will complain about device_lock() usage in an
> attribute. Try changing this to cxl_device_lock() with
> CONFIG_PROVE_CXL_LOCKING=y.
> 
> > +
> > +       return rc ? rc : len;
> > +}
> > +static DEVICE_ATTR_RW(size);
> > +
> > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > +                        char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > +}
> > +
> > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > +                         const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       ssize_t rc;
> > +
> > +       if (len != UUID_STRING_LEN + 1)
> > +               return -EINVAL;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (is_region_active(cxlr))
> > +               rc = -EBUSY;
> > +       else
> > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return rc ? rc : len;
> > +}
> > +static DEVICE_ATTR_RW(uuid);
> > +
> > +static struct attribute *region_attrs[] = {
> > +       &dev_attr_interleave_ways.attr,
> > +       &dev_attr_interleave_granularity.attr,
> > +       &dev_attr_offset.attr,
> > +       &dev_attr_size.attr,
> > +       &dev_attr_uuid.attr,
> > +       NULL,
> > +};
> > +
> > +static const struct attribute_group region_group = {
> > +       .attrs = region_attrs,
> > +};
> > +
> > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > +{
> > +       int ret;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (!cxlr->config.targets[n])
> > +               ret = sysfs_emit(buf, "\n");
> > +       else
> > +               ret = sysfs_emit(buf, "%s\n",
> > +                                dev_name(&cxlr->config.targets[n]->dev));
> > +       device_unlock(&cxlr->dev);
> 
> The component contribution of a memdev to a region is a DPA-span, not
> the whole memdev. I would expect something like dax_mapping_attributes
> or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> information about the component contribution of a memdev to a region.
> 

I had been thinking the kernel would manage the DPS spans of a memdev (and
create the mappings). I can make this look like dax_mapping_attributes.

> > +
> > +       return ret;
> > +}
> > +
> > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > +                         size_t len)
> > +{
> > +       struct device *memdev_dev;
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       device_lock(&cxlr->dev);
> > +
> > +       if (len == 1 || cxlr->config.targets[n])
> > +               remove_target(cxlr, n);
> > +
> > +       /* Remove target special case */
> > +       if (len == 1) {
> > +               device_unlock(&cxlr->dev);
> > +               return len;
> > +       }
> > +
> > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> 
> I think this wants to be an endpoint decoder, not a memdev. Because
> it's the decoder that joins a memdev to a region, or at least a
> decoder should be picked when the memdev is assigned so that the DPA
> mapping can be registered. If all the decoders are allocated then fail
> here.
> 

My preference is obviously how it is, using memdevs and having the decoders
allocated at bind time. I don't have an objective argument why one is better
than the other so I will change it. I will make the interface take a set of
decoders.

> > +       if (!memdev_dev) {
> > +               device_unlock(&cxlr->dev);
> > +               return -ENOENT;
> > +       }
> > +
> > +       /* reference to memdev held until target is unset or region goes away */
> > +
> > +       cxlmd = to_cxl_memdev(memdev_dev);
> > +       cxlr->config.targets[n] = cxlmd;
> > +
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return len;
> > +}
> > +
> > +#define TARGET_ATTR_RW(n)                                                      \
> > +       static ssize_t target##n##_show(                                       \
> > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > +       {                                                                      \
> > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > +       }                                                                      \
> > +       static ssize_t target##n##_store(struct device *dev,                   \
> > +                                        struct device_attribute *attr,        \
> > +                                        const char *buf, size_t len)          \
> > +       {                                                                      \
> > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > +       }                                                                      \
> > +       static DEVICE_ATTR_RW(target##n)
> > +
> > +TARGET_ATTR_RW(0);
> > +TARGET_ATTR_RW(1);
> > +TARGET_ATTR_RW(2);
> > +TARGET_ATTR_RW(3);
> > +TARGET_ATTR_RW(4);
> > +TARGET_ATTR_RW(5);
> > +TARGET_ATTR_RW(6);
> > +TARGET_ATTR_RW(7);
> > +TARGET_ATTR_RW(8);
> > +TARGET_ATTR_RW(9);
> > +TARGET_ATTR_RW(10);
> > +TARGET_ATTR_RW(11);
> > +TARGET_ATTR_RW(12);
> > +TARGET_ATTR_RW(13);
> > +TARGET_ATTR_RW(14);
> > +TARGET_ATTR_RW(15);
> > +
> > +static struct attribute *interleave_attrs[] = {
> > +       &dev_attr_target0.attr,
> > +       &dev_attr_target1.attr,
> > +       &dev_attr_target2.attr,
> > +       &dev_attr_target3.attr,
> > +       &dev_attr_target4.attr,
> > +       &dev_attr_target5.attr,
> > +       &dev_attr_target6.attr,
> > +       &dev_attr_target7.attr,
> > +       &dev_attr_target8.attr,
> > +       &dev_attr_target9.attr,
> > +       &dev_attr_target10.attr,
> > +       &dev_attr_target11.attr,
> > +       &dev_attr_target12.attr,
> > +       &dev_attr_target13.attr,
> > +       &dev_attr_target14.attr,
> > +       &dev_attr_target15.attr,
> > +       NULL,
> > +};
> > +
> > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > +{
> > +       struct device *dev = container_of(kobj, struct device, kobj);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       if (n < cxlr->config.interleave_ways)
> > +               return a->mode;
> > +       return 0;
> > +}
> > +
> > +static const struct attribute_group region_interleave_group = {
> > +       .attrs = interleave_attrs,
> > +       .is_visible = visible_targets,
> > +};
> > +
> > +static const struct attribute_group *region_groups[] = {
> > +       &region_group,
> > +       &region_interleave_group,
> > +       NULL,
> > +};
> > +
> >  static void cxl_region_release(struct device *dev);
> >
> >  static const struct device_type cxl_region_type = {
> >         .name = "cxl_region",
> >         .release = cxl_region_release,
> > +       .groups = region_groups
> >  };
> >
> >  static ssize_t create_region_show(struct device *dev,
> > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> >  {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> >         struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int i;
> >
> >         ida_free(&cxld->region_ida, cxlr->id);
> > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > +               remove_target(cxlr, i);
> 
> Like the last patch this feels too late. I expect whatever unregisters
> the region should have already handled removing the targets.
> 

I think I already explained why it works this way. I will change it.

> >         kfree(cxlr);
> >  }
> >
> > --
> > 2.35.0
> >
Dan Williams Feb. 3, 2022, 5:06 a.m. UTC | #4
On Tue, Feb 1, 2022 at 6:59 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> I will cut to the part that effects ABI so tool development can continue. I'll
> get back to the other bits later.
>
> On 22-01-28 16:25:34, Dan Williams wrote:
>
> [snip]
>
> >
> > > +
> > > +       return ret;
> > > +}
> > > +
> > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > +                         size_t len)
> > > +{
> > > +       struct device *memdev_dev;
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +
> > > +       if (len == 1 || cxlr->config.targets[n])
> > > +               remove_target(cxlr, n);
> > > +
> > > +       /* Remove target special case */
> > > +       if (len == 1) {
> > > +               device_unlock(&cxlr->dev);
> > > +               return len;
> > > +       }
> > > +
> > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> >
> > I think this wants to be an endpoint decoder, not a memdev. Because
> > it's the decoder that joins a memdev to a region, or at least a
> > decoder should be picked when the memdev is assigned so that the DPA
> > mapping can be registered. If all the decoders are allocated then fail
> > here.
> >
>
> You've put two points in here:
>
> 1. Handle decoder allocation at sysfs boundary. I'll respond to this when I come
> back around to the rest of the review comments.
>
> 2. Take a decoder for target instead of a memdev. I don't agree with this
> direction as it's asymmetric to how LSA processing works. The goal was to model
> the LSA for configuration. The kernel will have to be in the business of
> reserving and enumerating decoders out of memdevs for both LSA (where we have a
> list of memdevs) and volatile (where we use the memdevs in the system to
> enumerate populated decoders). I don't see much value in making userspace do the
> same.
>
> I'd like to ask you reconsider if you still think it's preferable to use
> decoders as part of the ABI and if you still feel that way I can go change it
> since it has minimal impact overall.

It's more than a preference. I think there are fundamental recovery
scenarios where the kernel needs userspace help to resolve decoder /
DPA assignment and conflicts.

PMEM interleaves behave similarly to RAID where you have multiple
devices in a set that can each fail independently, and because they
are hotplug capable the chances of migrating devices from one system
to another are higher than PMEM devices today where hotplug is mostly
non-existent. If you lurk on linux-raid long enough you will
inevitably encounter someone coming to the list saying, "help a drive
in my RAID array was dying. I managed to save it off, help me
reassemble my array". The story often gets worse when they say "I
managed to corrupt my metadata block, so I don't know what order the
drives are supposed to be in". There are several breadcrumbs and trial
and error steps that one takes to try to get the data back online:
https://raid.wiki.kernel.org/index.php/RAID_Recovery.

Now imagine that scenario with CXL where there are additional
complicating factors like label-storage-area can fail independently of
the data area, there are region labels with HPA fields that mandate
assembly at a given address, decoders must be programmed in increasing
DPA order, volatile memory and locked/fixed decoders complicate
decoder selection. Considering all the ways that CXL region assembly
can fail it seems inevitable that someone will get into a situation
where they need to pick the decoder and the DPA to map while also not
clobbering the LSA. I.e. I see a "CXL Recovery" wiki in our future.

The requirements that I think fall out from that are:

1/ Region assembly needs to be possible without updating labels. So,
in contrast to the way that nvdimm does it, instead of updating the
region label on every attribute write there would be a commit step
that realizes the current region configuration in the labels, or is
ommitted in recovery scenarios where you are not ready to clobber the
labels.

2/ Userspace needs the flexibility to be able to select/override which
DPA gets mapped in which decoder (kernel would handle DPA skip).

All the ways I can think of to augment the ABI to allow for this style
of recovery devolve to just assigning decoders to regions in the first
instance.
Dan Williams Feb. 3, 2022, 5:48 p.m. UTC | #5
On Tue, Feb 1, 2022 at 3:11 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-01-28 16:25:34, Dan Williams wrote:
> > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > The region creation APIs create a vacant region. Configuring the region
> > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > will be provided to allow userspace to configure the region.  Finally
> > > once all configuration is complete, userspace may activate the region.
> > >
> > > Introduced here are the most basic attributes needed to configure a
> > > region. Details of these attribute are described in the ABI
> >
> > s/attribute/attributes/
> >
> > > Documentation. Sanity checking of configuration parameters are done at
> > > region binding time. This consolidates all such logic in one place,
> > > rather than being strewn across multiple places.
> >
> > I think that's too late for some of the validation. The complex
> > validation that the region driver does throughout the topology is
> > different from the basic input validation that can  be done at the
> > sysfs write time. For example ,this patch allows negative
> > interleave_granularity values to specified, just return -EINVAL. I
> > agree that sysfs should not validate everything, I disagree with
> > pushing all validation to cxl_region_probe().
> >
>
> Two points:c
> 1. How do we distinguish "basic input validation". It'd be good if we could
>    define "basic input validation". For instance, when I first wrote these
>    patches, x3 would have been EINVAL, but today it's allowed. Can you help
>    enumerate what you consider basic.

I internalized this kernel design principle from Dave Miller many
years ago paraphrasing "push decision making out to leaf code as much
as possible", and centralizing all validation in cxl_region_probe()
violates. The software that makes the mistake does not know it made a
mistake until much later and "probe failed" is less descriptive than
"EINVAL writing interleave_ways" . I wish I could find the thread
because it also talked about his iteration process.

Basic input validation to me is things like:

- Don't allow writes while the region is active
- Check that values are in bound. So yes, the interleave-ways value of
3 would fail until the kernel supports it, and granularity values >
16K would also fail.
- Check that memdevs are actually downstream targets of the given decoder
- Check that the region uuid is unique
- Check that decoder has capacity
- Check that the memdev has capacity
- Check that the decoder to map the DPA is actually available given
decoders must be programmed in increasing DPA order

Essentially any validation short of walking the topology to program
upstream decoders since those errors are only resolved by racing
region probes that try to grab upstream decoder resources.

>
> 2. I like the idea that all validation takes place in one place. Obviously you
>    do not. So, see #1 and I will rework.

The validation helpers need to be written once, where they are called
does not much matter, does it?

>
> > >
> > > A example is provided below:
> > >
> > > /sys/bus/cxl/devices/region0.0:0
> > > ├── interleave_granularity
> > > ├── interleave_ways
> > > ├── offset
> > > ├── size
> > > ├── subsystem -> ../../../../../../bus/cxl
> > > ├── target0
> > > ├── uevent
> > > └── uuid
> >
> > As mentioned off-list, it looks like devtype and modalias are missing.
> >
>
> Thanks.
>
> > >
> > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > >  2 files changed, 340 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > index dcc728458936..50ba5018014d 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > @@ -187,3 +187,43 @@ Description:
> > >                 region driver before being deleted. The attributes expects a
> > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > >                 by reading create_region, will also be released.
> > > +
> > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> >
> > This is just another 'resource' attribute for the physical base
> > address of the region, right? 'offset' sounds like something that
> > would be relative instead of absolute.
> >
>
> It is offset. I can change it to physical base if you'd like but I thought that
> information wasn't critically important for userspace to have. Does userspace
> care about the physical base?

Yes, similar use case as /proc/iomem. Error handling comes to mind as
you can see physical address data in the messages like machine check
notification and go immediately match that to a CXL region. PCI,
NVDIMM, and DAX all emit a "resource" attribute to identify the
physical address base.

>
> > > +Date:          August, 2021
> >
> > Same date update comment here.
> >
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               (RO) A region resides within an address space that is claimed by
> > > +               a decoder.
> >
> > "A region is a contiguous partition of a CXL Root decoder address space."
> >
> > >                  Region space allocation is handled by the driver, but
> >
> > "Region capacity is allocated by writing to the size attribute, the
> > resulting physical address base determined by the driver is reflected
> > here."
> >
> > > +               the offset may be read by userspace tooling in order to
> > > +               determine fragmentation, and available size for new regions.
> >
> > I would also expect, before / along with these new region attributes,
> > there would be 'available' and 'max_extent_available' at the decoder
> > level to indicate how much free space the decoder has and how big the
> > next region creation can be. User tooling can walk  the decoder and
> > the regions together to determine fragmentation if necessary, but for
> > the most part the tool likely only cares about "how big can the next
> > region be?" and "how full is this decoder?".
>
> Sounds good.
>
> >
> >
> > > +
> > > +What:
> > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > +Date:          August, 2021
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > +               order for the subsequent bind operation to succeed. The
> > > +               following parameters are defined:
> >
> > Let's split up the descriptions into individual sections. That can
> > also document the order that attributes must be written. For example,
> > doesn't size need to be set before targets are added so that targets
> > can be validated whether they have sufficient capacity?
> >
>
> Okay. Order doesn't matter if you do validation all in one place as it is, but
> sounds like we're changing that. So I can split it when we figure out what
> validation is actually occurring at the sysfs attr boundary.

Forcing a write order simplifies the validation matrix. Consider the
reduction in test surface if the kernel is more strict about what it
allows into the kernel early. Let's make syzbot's job harder.

>
> > > +
> > > +               ==      ========================================================
> > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > +                       each device in the interleave set will claim. The
> > > +                       possible interleave granularity values are determined by
> > > +                       the CXL spec and the participating devices.
> > > +               interleave_ways Mandatory. Number of devices participating in the
> > > +                       region. Each device will provide 1/interleave of storage
> > > +                       for the region.
> > > +               size    Manadatory. Phsyical address space the region will
> > > +                       consume.
> >
> > s/Phsyical/Physical/
> >
> > > +               target  Mandatory. Memory devices are the backing storage for a
> > > +                       region. There will be N targets based on the number of
> > > +                       interleave ways that the top level decoder is configured
> > > +                       for.
> >
> > That doesn't sound right, IW at the root != IW at the endpoint level
> > and the region needs to record all the endpoint level targets.
>
>
> Yes This is wrong. I thought I had fixed it, but I guess not.
>
> >
> > > Each target must be set with a memdev device ie.
> > > +                       'mem1'. This attribute only becomes available after
> > > +                       setting the 'interleave' attribute.
> > > +               uuid    Optional. A unique identifier for the region. If none is
> > > +                       selected, the kernel will create one.
> >
> > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > understanding what this is trying to document. For example 'uuid' is
> > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > optional.
> >
>
> Well the kernel fills it in if userspace leaves it out. I'm guessing you're
> going to ask me to change that, so I will remove Mandatory/Optional.

Yeah, why carry unnecessary code in the kernel? Userspace is well
equipped to meet the requirement that it writes the UUID.

>
> > > +               ==      ========================================================
> > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > index 1a448543db0d..3b48e0469fc7 100644
> > > --- a/drivers/cxl/core/region.c
> > > +++ b/drivers/cxl/core/region.c
> > > @@ -3,9 +3,12 @@
> > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > >  #include <linux/device.h>
> > >  #include <linux/module.h>
> > > +#include <linux/sizes.h>
> > >  #include <linux/slab.h>
> > > +#include <linux/uuid.h>
> > >  #include <linux/idr.h>
> > >  #include <region.h>
> > > +#include <cxlmem.h>
> > >  #include <cxl.h>
> > >  #include "core.h"
> > >
> > > @@ -18,11 +21,305 @@
> > >   * (programming the hardware) is handled by a separate region driver.
> > >   */
> > >
> > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > +static const struct attribute_group region_interleave_group;
> > > +
> > > +static bool is_region_active(struct cxl_region *cxlr)
> > > +{
> > > +       /* TODO: Regions can't be activated yet. */
> > > +       return false;
> >
> > This function seems redundant with just checking "cxlr->dev.driver !=
> > NULL"? The benefit of that is there is no need to carry a TODO in the
> > series.
> >
>
> Yeah. I think checking driver bind status is sufficient to replace this.
>
> > > +}
> > > +
> > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > +{
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       cxlmd = cxlr->config.targets[target];
> > > +       if (cxlmd)
> > > +               put_device(&cxlmd->dev);
> >
> > A memdev can be a member of multiple regions at once, shouldn't this
> > be an endpoint decoder or similar, not the entire memdev?
>
> Is this referring to the later question about whether targets are decoders or
> memdevs?

Yes.

> The thought was each region would hold a reference to all memdevs in
> the interleave set.

It's not clear that a region needs to hold a reference if a memdev
self removes itself from the region before it is unregistered. I am
open to being convinced this is needed but it would need come with an
explanation of what can a region do with a memdev reference after that
memdev has experienced a ->remove() event.

> > Also, if memdevs autoremove themselves from regions at memdev
> > ->remove() time then I don't think the region needs to hold references
> > on memdevs.
> >
>
> I'll defer to you on that. I'll remove holding the reference, but I definitely
> haven't solved the interaction when a memdev goes away. I had been thinking the
> inverse originally, a memdev can't go away until the region is gone. According
> to the spec, these devices can't be hot removed, only managed remove, so if
> things blew up, not our problem. However, if we have decent infrastructure to
> support better than that, we should.

It turns out there's no such thing as "managed remove" as far as the
kernel is concerned, it's all up to userspace. ->remove() can happen
at any time and ->remove() can not fail. Per Bjorn Linux does not
support PCIe hotplug latching so there is no kernel mechanism to block
hotplug. Unless and until the PCIe native hotplug code picks up a
mechanism to deny unplug events CXL needs to be prepared for any
memdev to experience ->remove() regardless of region status.

For example, I expect the "managed remove" flow is something like this:

# cxl disable-memdev mem3
cxl memdev: action_disable: mem3 is part of an active region
cxl memdev: cmd_disable_memdev: disabled 0 mem

...where the tool tries to enforce safety, but if someone really wants
that device gone:

# cxl disable-memdev mem3 --force
cxl memdev: cmd_disable_memdev: disabled 1 mem

...and the CXL sub-system will need to trigger memory-failure across
all the regions that were impacted by that violent event.

>
> > > +       cxlr->config.targets[target] = NULL;
> > > +}
> > > +
> > > +static ssize_t interleave_ways_show(struct device *dev,
> > > +                                   struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > +}
> > > +
> > > +static ssize_t interleave_ways_store(struct device *dev,
> > > +                                    struct device_attribute *attr,
> > > +                                    const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       int ret, prev_iw;
> > > +       int val;
> >
> > I would expect:
> >
> > if (dev->driver)
> >    return -EBUSY;
> >
> > ...to shutdown configuration writes once the region is active. Might
> > also need a region-wide seqlock like target_list_show. So that region
> > probe drains  all active sysfs writers before assuming the
> > configuration is stable.
> >
>
> Okay.
>
> > > +
> > > +       prev_iw = cxlr->config.interleave_ways;
> > > +       ret = kstrtoint(buf, 0, &val);
> > > +       if (ret)
> > > +               return ret;
> > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > +               return -EINVAL;
> > > +
> > > +       cxlr->config.interleave_ways = val;
> > > +
> > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > +       if (ret < 0)
> > > +               goto err;
> > > +
> > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> >
> > Why?
> >
>
> I copied it from another driver. I didn't check if it was actually needed or
> not.

It's not needed since the agent that wrote interleave ways is also
expected to be the agent that is configuring the rest of the
parameters.

>
> > > +
> > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > +               remove_target(cxlr, --prev_iw);
> >
> > To make the kernel side simpler this attribute could just require that
> > setting interleave ways is a one way street, if you want to change it
> > you need to delete the region and start over.
> >
>
> I'm fine with that.
>
> > > +
> > > +       return len;
> > > +
> > > +err:
> > > +       cxlr->config.interleave_ways = prev_iw;
> > > +       return ret;
> > > +}
> > > +static DEVICE_ATTR_RW(interleave_ways);
> > > +
> > > +static ssize_t interleave_granularity_show(struct device *dev,
> > > +                                          struct device_attribute *attr,
> > > +                                          char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> > > +}
> > > +
> > > +static ssize_t interleave_granularity_store(struct device *dev,
> > > +                                           struct device_attribute *attr,
> > > +                                           const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       int val, ret;
> > > +
> > > +       ret = kstrtoint(buf, 0, &val);
> > > +       if (ret)
> > > +               return ret;
> > > +       cxlr->config.interleave_granularity = val;
> >
> > This wants minimum input validation and synchronization against an
> > active region.
> >
> > > +
> > > +       return len;
> > > +}
> > > +static DEVICE_ATTR_RW(interleave_granularity);
> > > +
> > > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > > +                          char *buf)
> > > +{
> > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       resource_size_t offset;
> > > +
> > > +       if (!cxlr->res)
> > > +               return sysfs_emit(buf, "\n");
> >
> > Should be an error I would think. I.e. require size to be set before
> > s/offset/resource/ can be read.
> >
> > > +
> > > +       offset = cxld->platform_res.start - cxlr->res->start;
> >
> > Why make usersapce do the offset math?
> >
> > > +
> > > +       return sysfs_emit(buf, "%pa\n", &offset);
> > > +}
> > > +static DEVICE_ATTR_RO(offset);
> >
> > This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
> > information from non-root.
> >
> > > +
> > > +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> > > +                        char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);
> >
> > Perhaps no need to store size separately if this becomes:
> >
> > sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));
> >
> >
> > ...?
> >
> > > +}
> > > +
> > > +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> > > +                         const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       unsigned long long val;
> > > +       ssize_t rc;
> > > +
> > > +       rc = kstrtoull(buf, 0, &val);
> > > +       if (rc)
> > > +               return rc;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (is_region_active(cxlr))
> > > +               rc = -EBUSY;
> > > +       else
> > > +               cxlr->config.size = val;
> > > +       device_unlock(&cxlr->dev);
> >
> > I think lockdep will complain about device_lock() usage in an
> > attribute. Try changing this to cxl_device_lock() with
> > CONFIG_PROVE_CXL_LOCKING=y.
> >
> > > +
> > > +       return rc ? rc : len;
> > > +}
> > > +static DEVICE_ATTR_RW(size);
> > > +
> > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > +                        char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > +}
> > > +
> > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > +                         const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       ssize_t rc;
> > > +
> > > +       if (len != UUID_STRING_LEN + 1)
> > > +               return -EINVAL;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (is_region_active(cxlr))
> > > +               rc = -EBUSY;
> > > +       else
> > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > +       device_unlock(&cxlr->dev);
> > > +
> > > +       return rc ? rc : len;
> > > +}
> > > +static DEVICE_ATTR_RW(uuid);
> > > +
> > > +static struct attribute *region_attrs[] = {
> > > +       &dev_attr_interleave_ways.attr,
> > > +       &dev_attr_interleave_granularity.attr,
> > > +       &dev_attr_offset.attr,
> > > +       &dev_attr_size.attr,
> > > +       &dev_attr_uuid.attr,
> > > +       NULL,
> > > +};
> > > +
> > > +static const struct attribute_group region_group = {
> > > +       .attrs = region_attrs,
> > > +};
> > > +
> > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > +{
> > > +       int ret;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (!cxlr->config.targets[n])
> > > +               ret = sysfs_emit(buf, "\n");
> > > +       else
> > > +               ret = sysfs_emit(buf, "%s\n",
> > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > +       device_unlock(&cxlr->dev);
> >
> > The component contribution of a memdev to a region is a DPA-span, not
> > the whole memdev. I would expect something like dax_mapping_attributes
> > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > information about the component contribution of a memdev to a region.
> >
>
> I had been thinking the kernel would manage the DPS spans of a memdev (and
> create the mappings). I can make this look like dax_mapping_attributes.

I think we get this for free by just linking the region to each
endpoint decoder in use and then userspace can walk that to get DPA
info from the decoder (modulo extending endpoint decoders with DPA
extent info).

>
> > > +
> > > +       return ret;
> > > +}
> > > +
> > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > +                         size_t len)
> > > +{
> > > +       struct device *memdev_dev;
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +
> > > +       if (len == 1 || cxlr->config.targets[n])
> > > +               remove_target(cxlr, n);
> > > +
> > > +       /* Remove target special case */
> > > +       if (len == 1) {
> > > +               device_unlock(&cxlr->dev);
> > > +               return len;
> > > +       }
> > > +
> > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> >
> > I think this wants to be an endpoint decoder, not a memdev. Because
> > it's the decoder that joins a memdev to a region, or at least a
> > decoder should be picked when the memdev is assigned so that the DPA
> > mapping can be registered. If all the decoders are allocated then fail
> > here.
> >
>
> My preference is obviously how it is, using memdevs and having the decoders
> allocated at bind time. I don't have an objective argument why one is better
> than the other so I will change it. I will make the interface take a set of
> decoders.

Let me know if the arguments I have made for why more granularity is
necessary have changed your preference. I'm open to hearing alternate
ideas.
Ben Widawsky Feb. 3, 2022, 10:23 p.m. UTC | #6
On 22-02-03 09:48:49, Dan Williams wrote:
> On Tue, Feb 1, 2022 at 3:11 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > On 22-01-28 16:25:34, Dan Williams wrote:
> > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > The region creation APIs create a vacant region. Configuring the region
> > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > will be provided to allow userspace to configure the region.  Finally
> > > > once all configuration is complete, userspace may activate the region.
> > > >
> > > > Introduced here are the most basic attributes needed to configure a
> > > > region. Details of these attribute are described in the ABI
> > >
> > > s/attribute/attributes/
> > >
> > > > Documentation. Sanity checking of configuration parameters are done at
> > > > region binding time. This consolidates all such logic in one place,
> > > > rather than being strewn across multiple places.
> > >
> > > I think that's too late for some of the validation. The complex
> > > validation that the region driver does throughout the topology is
> > > different from the basic input validation that can  be done at the
> > > sysfs write time. For example ,this patch allows negative
> > > interleave_granularity values to specified, just return -EINVAL. I
> > > agree that sysfs should not validate everything, I disagree with
> > > pushing all validation to cxl_region_probe().
> > >
> >
> > Two points:c
> > 1. How do we distinguish "basic input validation". It'd be good if we could
> >    define "basic input validation". For instance, when I first wrote these
> >    patches, x3 would have been EINVAL, but today it's allowed. Can you help
> >    enumerate what you consider basic.
> 
> I internalized this kernel design principle from Dave Miller many
> years ago paraphrasing "push decision making out to leaf code as much
> as possible", and centralizing all validation in cxl_region_probe()
> violates. The software that makes the mistake does not know it made a
> mistake until much later and "probe failed" is less descriptive than
> "EINVAL writing interleave_ways" . I wish I could find the thread
> because it also talked about his iteration process.

It would definitely be interesting to understand why pushing decision making
into the leaf code is a violation. Was it primary around the descriptiveness of
the error?

> 
> Basic input validation to me is things like:
> 
> - Don't allow writes while the region is active
> - Check that values are in bound. So yes, the interleave-ways value of
> 3 would fail until the kernel supports it, and granularity values >
> 16K would also fail.
> - Check that memdevs are actually downstream targets of the given decoder
> - Check that the region uuid is unique

These are obviously easy and informative at attr store time (in fact, active was
meant to be checked already for many cases). So if we agree to codify this at
probe via WARN, and add it to kdoc, I've no problem with it.

> - Check that decoder has capacity
> - Check that the memdev has capacity
> - Check that the decoder to map the DPA is actually available given
> decoders must be programmed in increasing DPA order
> 
> Essentially any validation short of walking the topology to program
> upstream decoders since those errors are only resolved by racing
> region probes that try to grab upstream decoder resources.
> 

I intentionally avoided doing a lot of these until probe because it seemed like
not a great policy to deny regions from being populated if another region
utilizing those resources hasn't been bound yes. For a simple example, if x1
region A is created and utilizes all of memdev ɑ's capacity you block out any
other region setup using memdev ɑ, even if region A wasn't bound. There's a
similar problem with specifying decoders as part of configuration.

I'll infer from your comment that you are fine with this tradeoff, or you have
some other way to manage this in mind.

I really see any validation which requires removal of resources from the system
to be more fit for bind time. I suppose if the proposal is to move the region
attributes to be DEVICE_ATTR_ADMIN, that pushes the problem onto the system
administrator. It just seemed like most of the interface could be non-root.

> >
> > 2. I like the idea that all validation takes place in one place. Obviously you
> >    do not. So, see #1 and I will rework.
> 
> The validation helpers need to be written once, where they are called
> does not much matter, does it?
> 

Somewhat addressed above too...

I think that depends on whether the full list is established as mentioned. If in
the region driver we can put several assertions that a variety of things don't
need [re]validation, then it doesn't matter. Without this, when trying to debug
or add code you need to figure out which place is doing the validation and which
place should do it.

At the very least I think the plan should be established in a kdoc.

> >
> > > >
> > > > A example is provided below:
> > > >
> > > > /sys/bus/cxl/devices/region0.0:0
> > > > ├── interleave_granularity
> > > > ├── interleave_ways
> > > > ├── offset
> > > > ├── size
> > > > ├── subsystem -> ../../../../../../bus/cxl
> > > > ├── target0
> > > > ├── uevent
> > > > └── uuid
> > >
> > > As mentioned off-list, it looks like devtype and modalias are missing.
> > >
> >
> > Thanks.
> >
> > > >
> > > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > > >  2 files changed, 340 insertions(+)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > index dcc728458936..50ba5018014d 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > @@ -187,3 +187,43 @@ Description:
> > > >                 region driver before being deleted. The attributes expects a
> > > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > > >                 by reading create_region, will also be released.
> > > > +
> > > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> > >
> > > This is just another 'resource' attribute for the physical base
> > > address of the region, right? 'offset' sounds like something that
> > > would be relative instead of absolute.
> > >
> >
> > It is offset. I can change it to physical base if you'd like but I thought that
> > information wasn't critically important for userspace to have. Does userspace
> > care about the physical base?
> 
> Yes, similar use case as /proc/iomem. Error handling comes to mind as
> you can see physical address data in the messages like machine check
> notification and go immediately match that to a CXL region. PCI,
> NVDIMM, and DAX all emit a "resource" attribute to identify the
> physical address base.
> 
> >
> > > > +Date:          August, 2021
> > >
> > > Same date update comment here.
> > >
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RO) A region resides within an address space that is claimed by
> > > > +               a decoder.
> > >
> > > "A region is a contiguous partition of a CXL Root decoder address space."
> > >
> > > >                  Region space allocation is handled by the driver, but
> > >
> > > "Region capacity is allocated by writing to the size attribute, the
> > > resulting physical address base determined by the driver is reflected
> > > here."
> > >
> > > > +               the offset may be read by userspace tooling in order to
> > > > +               determine fragmentation, and available size for new regions.
> > >
> > > I would also expect, before / along with these new region attributes,
> > > there would be 'available' and 'max_extent_available' at the decoder
> > > level to indicate how much free space the decoder has and how big the
> > > next region creation can be. User tooling can walk  the decoder and
> > > the regions together to determine fragmentation if necessary, but for
> > > the most part the tool likely only cares about "how big can the next
> > > region be?" and "how full is this decoder?".
> >
> > Sounds good.
> >
> > >
> > >
> > > > +
> > > > +What:
> > > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > > +Date:          August, 2021
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > > +               order for the subsequent bind operation to succeed. The
> > > > +               following parameters are defined:
> > >
> > > Let's split up the descriptions into individual sections. That can
> > > also document the order that attributes must be written. For example,
> > > doesn't size need to be set before targets are added so that targets
> > > can be validated whether they have sufficient capacity?
> > >
> >
> > Okay. Order doesn't matter if you do validation all in one place as it is, but
> > sounds like we're changing that. So I can split it when we figure out what
> > validation is actually occurring at the sysfs attr boundary.
> 
> Forcing a write order simplifies the validation matrix. Consider the
> reduction in test surface if the kernel is more strict about what it
> allows into the kernel early. Let's make syzbot's job harder.
> 
> >
> > > > +
> > > > +               ==      ========================================================
> > > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > > +                       each device in the interleave set will claim. The
> > > > +                       possible interleave granularity values are determined by
> > > > +                       the CXL spec and the participating devices.
> > > > +               interleave_ways Mandatory. Number of devices participating in the
> > > > +                       region. Each device will provide 1/interleave of storage
> > > > +                       for the region.
> > > > +               size    Manadatory. Phsyical address space the region will
> > > > +                       consume.
> > >
> > > s/Phsyical/Physical/
> > >
> > > > +               target  Mandatory. Memory devices are the backing storage for a
> > > > +                       region. There will be N targets based on the number of
> > > > +                       interleave ways that the top level decoder is configured
> > > > +                       for.
> > >
> > > That doesn't sound right, IW at the root != IW at the endpoint level
> > > and the region needs to record all the endpoint level targets.
> >
> >
> > Yes This is wrong. I thought I had fixed it, but I guess not.
> >
> > >
> > > > Each target must be set with a memdev device ie.
> > > > +                       'mem1'. This attribute only becomes available after
> > > > +                       setting the 'interleave' attribute.
> > > > +               uuid    Optional. A unique identifier for the region. If none is
> > > > +                       selected, the kernel will create one.
> > >
> > > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > > understanding what this is trying to document. For example 'uuid' is
> > > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > > optional.
> > >
> >
> > Well the kernel fills it in if userspace leaves it out. I'm guessing you're
> > going to ask me to change that, so I will remove Mandatory/Optional.
> 
> Yeah, why carry unnecessary code in the kernel? Userspace is well
> equipped to meet the requirement that it writes the UUID.
> 
> >
> > > > +               ==      ========================================================
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > index 1a448543db0d..3b48e0469fc7 100644
> > > > --- a/drivers/cxl/core/region.c
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -3,9 +3,12 @@
> > > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > > >  #include <linux/device.h>
> > > >  #include <linux/module.h>
> > > > +#include <linux/sizes.h>
> > > >  #include <linux/slab.h>
> > > > +#include <linux/uuid.h>
> > > >  #include <linux/idr.h>
> > > >  #include <region.h>
> > > > +#include <cxlmem.h>
> > > >  #include <cxl.h>
> > > >  #include "core.h"
> > > >
> > > > @@ -18,11 +21,305 @@
> > > >   * (programming the hardware) is handled by a separate region driver.
> > > >   */
> > > >
> > > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > > +static const struct attribute_group region_interleave_group;
> > > > +
> > > > +static bool is_region_active(struct cxl_region *cxlr)
> > > > +{
> > > > +       /* TODO: Regions can't be activated yet. */
> > > > +       return false;
> > >
> > > This function seems redundant with just checking "cxlr->dev.driver !=
> > > NULL"? The benefit of that is there is no need to carry a TODO in the
> > > series.
> > >
> >
> > Yeah. I think checking driver bind status is sufficient to replace this.
> >
> > > > +}
> > > > +
> > > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > > +{
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       cxlmd = cxlr->config.targets[target];
> > > > +       if (cxlmd)
> > > > +               put_device(&cxlmd->dev);
> > >
> > > A memdev can be a member of multiple regions at once, shouldn't this
> > > be an endpoint decoder or similar, not the entire memdev?
> >
> > Is this referring to the later question about whether targets are decoders or
> > memdevs?
> 
> Yes.
> 
> > The thought was each region would hold a reference to all memdevs in
> > the interleave set.
> 
> It's not clear that a region needs to hold a reference if a memdev
> self removes itself from the region before it is unregistered. I am
> open to being convinced this is needed but it would need come with an
> explanation of what can a region do with a memdev reference after that
> memdev has experienced a ->remove() event.
> 
> > > Also, if memdevs autoremove themselves from regions at memdev
> > > ->remove() time then I don't think the region needs to hold references
> > > on memdevs.
> > >
> >
> > I'll defer to you on that. I'll remove holding the reference, but I definitely
> > haven't solved the interaction when a memdev goes away. I had been thinking the
> > inverse originally, a memdev can't go away until the region is gone. According
> > to the spec, these devices can't be hot removed, only managed remove, so if
> > things blew up, not our problem. However, if we have decent infrastructure to
> > support better than that, we should.
> 
> It turns out there's no such thing as "managed remove" as far as the
> kernel is concerned, it's all up to userspace. ->remove() can happen
> at any time and ->remove() can not fail. Per Bjorn Linux does not
> support PCIe hotplug latching so there is no kernel mechanism to block
> hotplug. Unless and until the PCIe native hotplug code picks up a
> mechanism to deny unplug events CXL needs to be prepared for any
> memdev to experience ->remove() regardless of region status.
> 
> For example, I expect the "managed remove" flow is something like this:
> 
> # cxl disable-memdev mem3
> cxl memdev: action_disable: mem3 is part of an active region
> cxl memdev: cmd_disable_memdev: disabled 0 mem
> 
> ...where the tool tries to enforce safety, but if someone really wants
> that device gone:
> 
> # cxl disable-memdev mem3 --force
> cxl memdev: cmd_disable_memdev: disabled 1 mem
> 
> ...and the CXL sub-system will need to trigger memory-failure across
> all the regions that were impacted by that violent event.
> 
> >
> > > > +       cxlr->config.targets[target] = NULL;
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_show(struct device *dev,
> > > > +                                   struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_store(struct device *dev,
> > > > +                                    struct device_attribute *attr,
> > > > +                                    const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int ret, prev_iw;
> > > > +       int val;
> > >
> > > I would expect:
> > >
> > > if (dev->driver)
> > >    return -EBUSY;
> > >
> > > ...to shutdown configuration writes once the region is active. Might
> > > also need a region-wide seqlock like target_list_show. So that region
> > > probe drains  all active sysfs writers before assuming the
> > > configuration is stable.
> > >
> >
> > Okay.
> >
> > > > +
> > > > +       prev_iw = cxlr->config.interleave_ways;
> > > > +       ret = kstrtoint(buf, 0, &val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > > +               return -EINVAL;
> > > > +
> > > > +       cxlr->config.interleave_ways = val;
> > > > +
> > > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > > +       if (ret < 0)
> > > > +               goto err;
> > > > +
> > > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> > >
> > > Why?
> > >
> >
> > I copied it from another driver. I didn't check if it was actually needed or
> > not.
> 
> It's not needed since the agent that wrote interleave ways is also
> expected to be the agent that is configuring the rest of the
> parameters.
> 
> >
> > > > +
> > > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > > +               remove_target(cxlr, --prev_iw);
> > >
> > > To make the kernel side simpler this attribute could just require that
> > > setting interleave ways is a one way street, if you want to change it
> > > you need to delete the region and start over.
> > >
> >
> > I'm fine with that.
> >
> > > > +
> > > > +       return len;
> > > > +
> > > > +err:
> > > > +       cxlr->config.interleave_ways = prev_iw;
> > > > +       return ret;
> > > > +}
> > > > +static DEVICE_ATTR_RW(interleave_ways);
> > > > +
> > > > +static ssize_t interleave_granularity_show(struct device *dev,
> > > > +                                          struct device_attribute *attr,
> > > > +                                          char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> > > > +}
> > > > +
> > > > +static ssize_t interleave_granularity_store(struct device *dev,
> > > > +                                           struct device_attribute *attr,
> > > > +                                           const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int val, ret;
> > > > +
> > > > +       ret = kstrtoint(buf, 0, &val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +       cxlr->config.interleave_granularity = val;
> > >
> > > This wants minimum input validation and synchronization against an
> > > active region.
> > >
> > > > +
> > > > +       return len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(interleave_granularity);
> > > > +
> > > > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > > > +                          char *buf)
> > > > +{
> > > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       resource_size_t offset;
> > > > +
> > > > +       if (!cxlr->res)
> > > > +               return sysfs_emit(buf, "\n");
> > >
> > > Should be an error I would think. I.e. require size to be set before
> > > s/offset/resource/ can be read.
> > >
> > > > +
> > > > +       offset = cxld->platform_res.start - cxlr->res->start;
> > >
> > > Why make usersapce do the offset math?
> > >
> > > > +
> > > > +       return sysfs_emit(buf, "%pa\n", &offset);
> > > > +}
> > > > +static DEVICE_ATTR_RO(offset);
> > >
> > > This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
> > > information from non-root.
> > >
> > > > +
> > > > +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> > > > +                        char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);
> > >
> > > Perhaps no need to store size separately if this becomes:
> > >
> > > sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));
> > >
> > >
> > > ...?
> > >
> > > > +}
> > > > +
> > > > +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> > > > +                         const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       unsigned long long val;
> > > > +       ssize_t rc;
> > > > +
> > > > +       rc = kstrtoull(buf, 0, &val);
> > > > +       if (rc)
> > > > +               return rc;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               cxlr->config.size = val;
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > I think lockdep will complain about device_lock() usage in an
> > > attribute. Try changing this to cxl_device_lock() with
> > > CONFIG_PROVE_CXL_LOCKING=y.
> > >
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(size);
> > > > +
> > > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > > +                        char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > > +}
> > > > +
> > > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > > +                         const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       ssize_t rc;
> > > > +
> > > > +       if (len != UUID_STRING_LEN + 1)
> > > > +               return -EINVAL;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(uuid);
> > > > +
> > > > +static struct attribute *region_attrs[] = {
> > > > +       &dev_attr_interleave_ways.attr,
> > > > +       &dev_attr_interleave_granularity.attr,
> > > > +       &dev_attr_offset.attr,
> > > > +       &dev_attr_size.attr,
> > > > +       &dev_attr_uuid.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static const struct attribute_group region_group = {
> > > > +       .attrs = region_attrs,
> > > > +};
> > > > +
> > > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > > +{
> > > > +       int ret;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (!cxlr->config.targets[n])
> > > > +               ret = sysfs_emit(buf, "\n");
> > > > +       else
> > > > +               ret = sysfs_emit(buf, "%s\n",
> > > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > The component contribution of a memdev to a region is a DPA-span, not
> > > the whole memdev. I would expect something like dax_mapping_attributes
> > > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > > information about the component contribution of a memdev to a region.
> > >
> >
> > I had been thinking the kernel would manage the DPS spans of a memdev (and
> > create the mappings). I can make this look like dax_mapping_attributes.
> 
> I think we get this for free by just linking the region to each
> endpoint decoder in use and then userspace can walk that to get DPA
> info from the decoder (modulo extending endpoint decoders with DPA
> extent info).
> 
> >
> > > > +
> > > > +       return ret;
> > > > +}
> > > > +
> > > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > > +                         size_t len)
> > > > +{
> > > > +       struct device *memdev_dev;
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +
> > > > +       if (len == 1 || cxlr->config.targets[n])
> > > > +               remove_target(cxlr, n);
> > > > +
> > > > +       /* Remove target special case */
> > > > +       if (len == 1) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return len;
> > > > +       }
> > > > +
> > > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > >
> > > I think this wants to be an endpoint decoder, not a memdev. Because
> > > it's the decoder that joins a memdev to a region, or at least a
> > > decoder should be picked when the memdev is assigned so that the DPA
> > > mapping can be registered. If all the decoders are allocated then fail
> > > here.
> > >
> >
> > My preference is obviously how it is, using memdevs and having the decoders
> > allocated at bind time. I don't have an objective argument why one is better
> > than the other so I will change it. I will make the interface take a set of
> > decoders.
> 
> Let me know if the arguments I have made for why more granularity is
> necessary have changed your preference. I'm open to hearing alternate
> ideas.
Dan Williams Feb. 3, 2022, 11:27 p.m. UTC | #7
On Thu, Feb 3, 2022 at 2:23 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-03 09:48:49, Dan Williams wrote:
> > On Tue, Feb 1, 2022 at 3:11 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > >
> > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > once all configuration is complete, userspace may activate the region.
> > > > >
> > > > > Introduced here are the most basic attributes needed to configure a
> > > > > region. Details of these attribute are described in the ABI
> > > >
> > > > s/attribute/attributes/
> > > >
> > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > region binding time. This consolidates all such logic in one place,
> > > > > rather than being strewn across multiple places.
> > > >
> > > > I think that's too late for some of the validation. The complex
> > > > validation that the region driver does throughout the topology is
> > > > different from the basic input validation that can  be done at the
> > > > sysfs write time. For example ,this patch allows negative
> > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > agree that sysfs should not validate everything, I disagree with
> > > > pushing all validation to cxl_region_probe().
> > > >
> > >
> > > Two points:c
> > > 1. How do we distinguish "basic input validation". It'd be good if we could
> > >    define "basic input validation". For instance, when I first wrote these
> > >    patches, x3 would have been EINVAL, but today it's allowed. Can you help
> > >    enumerate what you consider basic.
> >
> > I internalized this kernel design principle from Dave Miller many
> > years ago paraphrasing "push decision making out to leaf code as much
> > as possible", and centralizing all validation in cxl_region_probe()
> > violates. The software that makes the mistake does not know it made a
> > mistake until much later and "probe failed" is less descriptive than
> > "EINVAL writing interleave_ways" . I wish I could find the thread
> > because it also talked about his iteration process.
>
> It would definitely be interesting to understand why pushing decision making
> into the leaf code is a violation. Was it primary around the descriptiveness of
> the error?

You mean the other way round, why is it a violation to move decision
making into the core? It was a comment about the inflexibility of the
core logic vs leaf logic, in the case of CXL it's about the
observability of errors at the right granularity which the core can
not do because the core is disconnected from the transaction that
injected the error.

> > Basic input validation to me is things like:
> >
> > - Don't allow writes while the region is active
> > - Check that values are in bound. So yes, the interleave-ways value of
> > 3 would fail until the kernel supports it, and granularity values >
> > 16K would also fail.
> > - Check that memdevs are actually downstream targets of the given decoder
> > - Check that the region uuid is unique
>
> These are obviously easy and informative at attr store time (in fact, active was
> meant to be checked already for many cases). So if we agree to codify this at
> probe via WARN, and add it to kdoc, I've no problem with it.

Why is WARN needed? Either the sysfs validation does it job correctly
or it doesn't. Also if sysfs didn't WARN when the bad input is
specified why would the core do anything higher than dev_err()?
Basically I think the bar for WARN is obvious kernel programming error
where only a kernel-developer will see it vs EINVAL at runtime
scenarios. I have seen Greg raise the bar for WARN in his reviews
given how many deployments turn on 'panic_on_warn'.

> > - Check that decoder has capacity
> > - Check that the memdev has capacity
> > - Check that the decoder to map the DPA is actually available given
> > decoders must be programmed in increasing DPA order
> >
> > Essentially any validation short of walking the topology to program
> > upstream decoders since those errors are only resolved by racing
> > region probes that try to grab upstream decoder resources.
> >
>
> I intentionally avoided doing a lot of these until probe because it seemed like
> not a great policy to deny regions from being populated if another region
> utilizing those resources hasn't been bound yes. For a simple example, if x1
> region A is created and utilizes all of memdev ɑ's capacity you block out any
> other region setup using memdev ɑ, even if region A wasn't bound. There's a
> similar problem with specifying decoders as part of configuration.
>
> I'll infer from your comment that you are fine with this tradeoff, or you have
> some other way to manage this in mind.

It comes back to observability if threadA allocates all the DPA then
yes all other threads should see -ENOSPC. No different than if 3 fdisk
threads all tried to create a partition, the first one to the kernel
wins. If threadA does not end up activating that regionA's capacity
that's userspace's fault, and the admin needs to make sure that
configuration does not race itself. The kernel allocating DPA
immediately lets those races be found early such that threadB finds
all the DPA gone and stops trying to create the region.

> I really see any validation which requires removal of resources from the system
> to be more fit for bind time. I suppose if the proposal is to move the region
> attributes to be DEVICE_ATTR_ADMIN, that pushes the problem onto the system
> administrator. It just seemed like most of the interface could be non-root.

None of the sysfs entries for CXL are writable by non-root.

DEVICE_ATTR_RW() is 0644
DEVICE_ATTR_ADMIN_RW() is 0600

Yes, pushing the problem onto the sysadmin is the only option. Only
CAP_SYS_ADMIN can be trusted to muck with the physical address layout
of the system. Even then CONFIG_LOCKDOWN_KERNEL wants to limit what
CAP_SYS_ADMIN can to do the memory configuration, so I don't see any
room for non-root to be considered in this ABI.

>
> > >
> > > 2. I like the idea that all validation takes place in one place. Obviously you
> > >    do not. So, see #1 and I will rework.
> >
> > The validation helpers need to be written once, where they are called
> > does not much matter, does it?
> >
>
> Somewhat addressed above too...
>
> I think that depends on whether the full list is established as mentioned. If in
> the region driver we can put several assertions that a variety of things don't
> need [re]validation, then it doesn't matter. Without this, when trying to debug
> or add code you need to figure out which place is doing the validation and which
> place should do it.

All I can say it has not been a problem in practice for NVDIMM debug
scenarios which does validation at probe for pre-existing namespaces
and validation at sysfs write for namespace creation.

> At the very least I think the plan should be established in a kdoc.

Sure, a "CXL Region: Theory of Operation" would be a good document to
lead into the patch series as a follow-on to "CXL Bus: Theory of
Operation".
Ben Widawsky Feb. 4, 2022, 12:19 a.m. UTC | #8
On 22-02-03 15:27:02, Dan Williams wrote:
> On Thu, Feb 3, 2022 at 2:23 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > On 22-02-03 09:48:49, Dan Williams wrote:
> > > On Tue, Feb 1, 2022 at 3:11 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > > >
> > > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > > once all configuration is complete, userspace may activate the region.
> > > > > >
> > > > > > Introduced here are the most basic attributes needed to configure a
> > > > > > region. Details of these attribute are described in the ABI
> > > > >
> > > > > s/attribute/attributes/
> > > > >
> > > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > > region binding time. This consolidates all such logic in one place,
> > > > > > rather than being strewn across multiple places.
> > > > >
> > > > > I think that's too late for some of the validation. The complex
> > > > > validation that the region driver does throughout the topology is
> > > > > different from the basic input validation that can  be done at the
> > > > > sysfs write time. For example ,this patch allows negative
> > > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > > agree that sysfs should not validate everything, I disagree with
> > > > > pushing all validation to cxl_region_probe().
> > > > >
> > > >
> > > > Two points:c
> > > > 1. How do we distinguish "basic input validation". It'd be good if we could
> > > >    define "basic input validation". For instance, when I first wrote these
> > > >    patches, x3 would have been EINVAL, but today it's allowed. Can you help
> > > >    enumerate what you consider basic.
> > >
> > > I internalized this kernel design principle from Dave Miller many
> > > years ago paraphrasing "push decision making out to leaf code as much
> > > as possible", and centralizing all validation in cxl_region_probe()
> > > violates. The software that makes the mistake does not know it made a
> > > mistake until much later and "probe failed" is less descriptive than
> > > "EINVAL writing interleave_ways" . I wish I could find the thread
> > > because it also talked about his iteration process.
> >
> > It would definitely be interesting to understand why pushing decision making
> > into the leaf code is a violation. Was it primary around the descriptiveness of
> > the error?
> 
> You mean the other way round, why is it a violation to move decision
> making into the core? It was a comment about the inflexibility of the
> core logic vs leaf logic, in the case of CXL it's about the
> observability of errors at the right granularity which the core can
> not do because the core is disconnected from the transaction that
> injected the error.

I did mean the other way around. The thing that gets tricky if you do it at the
sysfs boundary is you do have to start seeing the interface as stateful. Perhaps
the complexity I see arising from this won't materialize, so I'll try it and
see. It seems like it can get messy quickly though.

> 
> > > Basic input validation to me is things like:
> > >
> > > - Don't allow writes while the region is active
> > > - Check that values are in bound. So yes, the interleave-ways value of
> > > 3 would fail until the kernel supports it, and granularity values >
> > > 16K would also fail.
> > > - Check that memdevs are actually downstream targets of the given decoder
> > > - Check that the region uuid is unique
> >
> > These are obviously easy and informative at attr store time (in fact, active was
> > meant to be checked already for many cases). So if we agree to codify this at
> > probe via WARN, and add it to kdoc, I've no problem with it.
> 
> Why is WARN needed? Either the sysfs validation does it job correctly
> or it doesn't. Also if sysfs didn't WARN when the bad input is
> specified why would the core do anything higher than dev_err()?
> Basically I think the bar for WARN is obvious kernel programming error
> where only a kernel-developer will see it vs EINVAL at runtime
> scenarios. I have seen Greg raise the bar for WARN in his reviews
> given how many deployments turn on 'panic_on_warn'.

Ultimately some checking will need to occur in one form or another in region
probe(). Either explicit via conditional: if (!is_valid(interleave_ways)) return
-EINVAL, or implicitly, for example 1 << (rootd_ig - cxlr_ig) is some invalid
nonsense which later fails host bridge programming verification. Before
discussing further, which are you suggesting?

> 
> > > - Check that decoder has capacity
> > > - Check that the memdev has capacity
> > > - Check that the decoder to map the DPA is actually available given
> > > decoders must be programmed in increasing DPA order
> > >
> > > Essentially any validation short of walking the topology to program
> > > upstream decoders since those errors are only resolved by racing
> > > region probes that try to grab upstream decoder resources.
> > >
> >
> > I intentionally avoided doing a lot of these until probe because it seemed like
> > not a great policy to deny regions from being populated if another region
> > utilizing those resources hasn't been bound yes. For a simple example, if x1
> > region A is created and utilizes all of memdev ɑ's capacity you block out any
> > other region setup using memdev ɑ, even if region A wasn't bound. There's a
> > similar problem with specifying decoders as part of configuration.
> >
> > I'll infer from your comment that you are fine with this tradeoff, or you have
> > some other way to manage this in mind.
> 
> It comes back to observability if threadA allocates all the DPA then
> yes all other threads should see -ENOSPC. No different than if 3 fdisk
> threads all tried to create a partition, the first one to the kernel
> wins. If threadA does not end up activating that regionA's capacity
> that's userspace's fault, and the admin needs to make sure that
> configuration does not race itself. The kernel allocating DPA
> immediately lets those races be found early such that threadB finds
> all the DPA gone and stops trying to create the region.

Okay. I don't have a strong opinion on how userspace should or shouldn't use
this interface. It seems less friendly to do it this way, but per the following
comment, if it's root only, it doesn't really matter.

I was under the impression you expected userspace to manage the DPA as well. I
don't really see any reason why the kernel should manage it if userspace is
already handling all these other bits. Let userspace set the offset and size
(can make a single device attr for it), and upon doing so it gets reserved.

> 
> > I really see any validation which requires removal of resources from the system
> > to be more fit for bind time. I suppose if the proposal is to move the region
> > attributes to be DEVICE_ATTR_ADMIN, that pushes the problem onto the system
> > administrator. It just seemed like most of the interface could be non-root.
> 
> None of the sysfs entries for CXL are writable by non-root.
> 
> DEVICE_ATTR_RW() is 0644
> DEVICE_ATTR_ADMIN_RW() is 0600

My mistake. I forgot about that.

> 
> Yes, pushing the problem onto the sysadmin is the only option. Only
> CAP_SYS_ADMIN can be trusted to muck with the physical address layout
> of the system. Even then CONFIG_LOCKDOWN_KERNEL wants to limit what
> CAP_SYS_ADMIN can to do the memory configuration, so I don't see any
> room for non-root to be considered in this ABI.

That's fine. As the interface is today (before your requested changes) only
region->probe() is something that mucks with the physical address layout. It
could theoretically be entirely configured (not bound) by userspace.

> 
> >
> > > >
> > > > 2. I like the idea that all validation takes place in one place. Obviously you
> > > >    do not. So, see #1 and I will rework.
> > >
> > > The validation helpers need to be written once, where they are called
> > > does not much matter, does it?
> > >
> >
> > Somewhat addressed above too...
> >
> > I think that depends on whether the full list is established as mentioned. If in
> > the region driver we can put several assertions that a variety of things don't
> > need [re]validation, then it doesn't matter. Without this, when trying to debug
> > or add code you need to figure out which place is doing the validation and which
> > place should do it.
> 
> All I can say it has not been a problem in practice for NVDIMM debug
> scenarios which does validation at probe for pre-existing namespaces
> and validation at sysfs write for namespace creation.
> 
> > At the very least I think the plan should be established in a kdoc.
> 
> Sure, a "CXL Region: Theory of Operation" would be a good document to
> lead into the patch series as a follow-on to "CXL Bus: Theory of
> Operation".

Yeah. I will write it once we close this discussion.
Dan Williams Feb. 4, 2022, 2:45 a.m. UTC | #9
On Thu, Feb 3, 2022 at 4:20 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
[..]
> > > > Basic input validation to me is things like:
> > > >
> > > > - Don't allow writes while the region is active
> > > > - Check that values are in bound. So yes, the interleave-ways value of
> > > > 3 would fail until the kernel supports it, and granularity values >
> > > > 16K would also fail.
> > > > - Check that memdevs are actually downstream targets of the given decoder
> > > > - Check that the region uuid is unique
> > >
> > > These are obviously easy and informative at attr store time (in fact, active was
> > > meant to be checked already for many cases). So if we agree to codify this at
> > > probe via WARN, and add it to kdoc, I've no problem with it.
> >
> > Why is WARN needed? Either the sysfs validation does it job correctly
> > or it doesn't. Also if sysfs didn't WARN when the bad input is
> > specified why would the core do anything higher than dev_err()?
> > Basically I think the bar for WARN is obvious kernel programming error
> > where only a kernel-developer will see it vs EINVAL at runtime
> > scenarios. I have seen Greg raise the bar for WARN in his reviews
> > given how many deployments turn on 'panic_on_warn'.
>
> Ultimately some checking will need to occur in one form or another in region
> probe(). Either explicit via conditional: if (!is_valid(interleave_ways)) return
> -EINVAL, or implicitly, for example 1 << (rootd_ig - cxlr_ig) is some invalid
> nonsense which later fails host bridge programming verification. Before
> discussing further, which are you suggesting?

Explicit validation at probe in addition to the explicit validation at
the sysfs boundary (as much as possible to report errors early). The
"at probe time" validation does not know if this was a new region, or
one enumerated from LSA or the configuration that the BIOS specified.
So I do expect validation overlap, but there will also be distinct
checks for those different scenarios. For example, see how NVDIMM
validates namespace configuration writes via sysfs, but does not
validate the LSA because it's writing the label and had better be
prepared to read what it writes.

>
> >
> > > > - Check that decoder has capacity
> > > > - Check that the memdev has capacity
> > > > - Check that the decoder to map the DPA is actually available given
> > > > decoders must be programmed in increasing DPA order
> > > >
> > > > Essentially any validation short of walking the topology to program
> > > > upstream decoders since those errors are only resolved by racing
> > > > region probes that try to grab upstream decoder resources.
> > > >
> > >
> > > I intentionally avoided doing a lot of these until probe because it seemed like
> > > not a great policy to deny regions from being populated if another region
> > > utilizing those resources hasn't been bound yes. For a simple example, if x1
> > > region A is created and utilizes all of memdev ɑ's capacity you block out any
> > > other region setup using memdev ɑ, even if region A wasn't bound. There's a
> > > similar problem with specifying decoders as part of configuration.
> > >
> > > I'll infer from your comment that you are fine with this tradeoff, or you have
> > > some other way to manage this in mind.
> >
> > It comes back to observability if threadA allocates all the DPA then
> > yes all other threads should see -ENOSPC. No different than if 3 fdisk
> > threads all tried to create a partition, the first one to the kernel
> > wins. If threadA does not end up activating that regionA's capacity
> > that's userspace's fault, and the admin needs to make sure that
> > configuration does not race itself. The kernel allocating DPA
> > immediately lets those races be found early such that threadB finds
> > all the DPA gone and stops trying to create the region.
>
> Okay. I don't have a strong opinion on how userspace should or shouldn't use
> this interface. It seems less friendly to do it this way, but per the following
> comment, if it's root only, it doesn't really matter.
>
> I was under the impression you expected userspace to manage the DPA as well. I
> don't really see any reason why the kernel should manage it if userspace is
> already handling all these other bits. Let userspace set the offset and size
> (can make a single device attr for it), and upon doing so it gets reserved.

Userspace sense requests, kernel allocates or denies that request
after resolving races with other requesters. Yes, this makes the
interface stateful. Sysfs is not suited to stateless operation.
Ben Widawsky Feb. 17, 2022, 6:36 p.m. UTC | #10
Consolidating earlier discussions...

On 22-01-28 16:25:34, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > The region creation APIs create a vacant region. Configuring the region
> > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > will be provided to allow userspace to configure the region.  Finally
> > once all configuration is complete, userspace may activate the region.
> >
> > Introduced here are the most basic attributes needed to configure a
> > region. Details of these attribute are described in the ABI
> 
> s/attribute/attributes/
> 
> > Documentation. Sanity checking of configuration parameters are done at
> > region binding time. This consolidates all such logic in one place,
> > rather than being strewn across multiple places.
> 
> I think that's too late for some of the validation. The complex
> validation that the region driver does throughout the topology is
> different from the basic input validation that can  be done at the
> sysfs write time. For example ,this patch allows negative
> interleave_granularity values to specified, just return -EINVAL. I
> agree that sysfs should not validate everything, I disagree with
> pushing all validation to cxl_region_probe().
> 

Okay. It might save us some back and forth if you could outline everything you'd
expect to be validated, but I can also make an attempt to figure out the
reasonable set of things.

> >
> > A example is provided below:
> >
> > /sys/bus/cxl/devices/region0.0:0
> > ├── interleave_granularity
> > ├── interleave_ways
> > ├── offset
> > ├── size
> > ├── subsystem -> ../../../../../../bus/cxl
> > ├── target0
> > ├── uevent
> > └── uuid
> 
> As mentioned off-list, it looks like devtype and modalias are missing.
> 

Yep. This belongs in the previous patch though.

> >
> > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> >  2 files changed, 340 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index dcc728458936..50ba5018014d 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -187,3 +187,43 @@ Description:
> >                 region driver before being deleted. The attributes expects a
> >                 region in the form "regionX.Y:Z". The region's name, allocated
> >                 by reading create_region, will also be released.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> 
> This is just another 'resource' attribute for the physical base
> address of the region, right? 'offset' sounds like something that
> would be relative instead of absolute.
> 

It was meant to be relative. I can make it absolute if that's preferable but the
physical base is known at the decoder level already.

> > +Date:          August, 2021
> 
> Same date update comment here.
> 
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               (RO) A region resides within an address space that is claimed by
> > +               a decoder.
> 
> "A region is a contiguous partition of a CXL Root decoder address space."
> 
> >                  Region space allocation is handled by the driver, but
> 
> "Region capacity is allocated by writing to the size attribute, the
> resulting physical address base determined by the driver is reflected
> here."
> 
> > +               the offset may be read by userspace tooling in order to
> > +               determine fragmentation, and available size for new regions.
> 
> I would also expect, before / along with these new region attributes,
> there would be 'available' and 'max_extent_available' at the decoder
> level to indicate how much free space the decoder has and how big the
> next region creation can be. User tooling can walk  the decoder and
> the regions together to determine fragmentation if necessary, but for
> the most part the tool likely only cares about "how big can the next
> region be?" and "how full is this decoder?".
> 

Since this is the configuration part of the ABI, I'd rather add that information
when the plumbing to report them exists. I'm struggling to understand the
balance (as mentioned also earlier in this mail thread) as to what userspace
does and what the kernel does. I will add these as you request.

> 
> > +
> > +What:
> > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > +Date:          August, 2021
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               (RW) Configuring regions requires a minimal set of parameters in
> > +               order for the subsequent bind operation to succeed. The
> > +               following parameters are defined:
> 
> Let's split up the descriptions into individual sections. That can
> also document the order that attributes must be written. For example,
> doesn't size need to be set before targets are added so that targets
> can be validated whether they have sufficient capacity?
> 

Okay. Since we're moving toward making the sysfs ABI stateful, would you like me
to make the attrs only visible when they can actually be set?

> > +
> > +               ==      ========================================================
> > +               interleave_granularity Mandatory. Number of consecutive bytes
> > +                       each device in the interleave set will claim. The
> > +                       possible interleave granularity values are determined by
> > +                       the CXL spec and the participating devices.
> > +               interleave_ways Mandatory. Number of devices participating in the
> > +                       region. Each device will provide 1/interleave of storage
> > +                       for the region.
> > +               size    Manadatory. Phsyical address space the region will
> > +                       consume.
> 
> s/Phsyical/Physical/
> 
> > +               target  Mandatory. Memory devices are the backing storage for a
> > +                       region. There will be N targets based on the number of
> > +                       interleave ways that the top level decoder is configured
> > +                       for.
> 
> That doesn't sound right, IW at the root != IW at the endpoint level
> and the region needs to record all the endpoint level targets.

Correct.

> 
> > Each target must be set with a memdev device ie.
> > +                       'mem1'. This attribute only becomes available after
> > +                       setting the 'interleave' attribute.
> > +               uuid    Optional. A unique identifier for the region. If none is
> > +                       selected, the kernel will create one.
> 
> Let's drop the Mandatory / Optional distinction, or I am otherwise not
> understanding what this is trying to document. For example 'uuid' is
> "mandatory" for PMEM regions and "omitted" for volatile regions not
> optional.

Okay.

> 
> > +               ==      ========================================================
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 1a448543db0d..3b48e0469fc7 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -3,9 +3,12 @@
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> >  #include <linux/device.h>
> >  #include <linux/module.h>
> > +#include <linux/sizes.h>
> >  #include <linux/slab.h>
> > +#include <linux/uuid.h>
> >  #include <linux/idr.h>
> >  #include <region.h>
> > +#include <cxlmem.h>
> >  #include <cxl.h>
> >  #include "core.h"
> >
> > @@ -18,11 +21,305 @@
> >   * (programming the hardware) is handled by a separate region driver.
> >   */
> >
> > +struct cxl_region *to_cxl_region(struct device *dev);
> > +static const struct attribute_group region_interleave_group;
> > +
> > +static bool is_region_active(struct cxl_region *cxlr)
> > +{
> > +       /* TODO: Regions can't be activated yet. */
> > +       return false;
> 
> This function seems redundant with just checking "cxlr->dev.driver !=
> NULL"? The benefit of that is there is no need to carry a TODO in the
> series.
> 

The idea behind this was to give the reviewer somewhat of a bigger picture as to
how things should work in the code rather than in a commit message. I will
remove this.

> > +}
> > +
> > +static void remove_target(struct cxl_region *cxlr, int target)
> > +{
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       cxlmd = cxlr->config.targets[target];
> > +       if (cxlmd)
> > +               put_device(&cxlmd->dev);
> 
> A memdev can be a member of multiple regions at once, shouldn't this
> be an endpoint decoder or similar, not the entire memdev?
> 
> Also, if memdevs autoremove themselves from regions at memdev
> ->remove() time then I don't think the region needs to hold references
> on memdevs.
> 

This needs some work. The concern I have is region operations will need to
operate on memdevs/decoders at various points in time. When the memdev goes
away, the region will also need to go away. None of that plumbing was in place
in v3 and the reference on the memdev was just a half-hearted attempt at doing
the right thing.

For now if you prefer I remove the reference, but perhaps the decoder reference
would buy us some safety?

> > +       cxlr->config.targets[target] = NULL;
> > +}
> > +
> > +static ssize_t interleave_ways_show(struct device *dev,
> > +                                   struct device_attribute *attr, char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > +}
> > +
> > +static ssize_t interleave_ways_store(struct device *dev,
> > +                                    struct device_attribute *attr,
> > +                                    const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int ret, prev_iw;
> > +       int val;
> 
> I would expect:
> 
> if (dev->driver)
>    return -EBUSY;
> 
> ...to shutdown configuration writes once the region is active. Might
> also need a region-wide seqlock like target_list_show. So that region
> probe drains  all active sysfs writers before assuming the
> configuration is stable.

Initially my thought here is that this is a problem for userspace to deal with.
If userspace can't figure out how to synchronously configure and bind the
region, that's not a kernel problem. However, we've put some effort into
protecting userspace from itself in the create ABI, so it might be more in line
to do that here.

In summary, I'm fine to add it, but I think I really need to get more in your
brain about the userspace/kernel divide sooner rather than later.

> 
> > +
> > +       prev_iw = cxlr->config.interleave_ways;
> > +       ret = kstrtoint(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > +               return -EINVAL;
> > +
> > +       cxlr->config.interleave_ways = val;
> > +
> > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > +       if (ret < 0)
> > +               goto err;
> > +
> > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> 
> Why?
> 

copypasta

> > +
> > +       while (prev_iw > cxlr->config.interleave_ways)
> > +               remove_target(cxlr, --prev_iw);
> 
> To make the kernel side simpler this attribute could just require that
> setting interleave ways is a one way street, if you want to change it
> you need to delete the region and start over.
> 

Okay. One of the earlier versions did this implicitly since the #ways was needed
to create the region. I thought from the ABI perspective, flexibility was good.
Userspace may choose not to utilize it.

> > +
> > +       return len;
> > +
> > +err:
> > +       cxlr->config.interleave_ways = prev_iw;
> > +       return ret;
> > +}
> > +static DEVICE_ATTR_RW(interleave_ways);
> > +
> > +static ssize_t interleave_granularity_show(struct device *dev,
> > +                                          struct device_attribute *attr,
> > +                                          char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> > +}
> > +
> > +static ssize_t interleave_granularity_store(struct device *dev,
> > +                                           struct device_attribute *attr,
> > +                                           const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int val, ret;
> > +
> > +       ret = kstrtoint(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +       cxlr->config.interleave_granularity = val;
> 
> This wants minimum input validation and synchronization against an
> active region.
> 

Okay. Already comprehended above.

> > +
> > +       return len;
> > +}
> > +static DEVICE_ATTR_RW(interleave_granularity);
> > +
> > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > +                          char *buf)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       resource_size_t offset;
> > +
> > +       if (!cxlr->res)
> > +               return sysfs_emit(buf, "\n");
> 
> Should be an error I would think. I.e. require size to be set before
> s/offset/resource/ can be read.
> 

Okay. Already comprehended above.

> > +
> > +       offset = cxld->platform_res.start - cxlr->res->start;
> 
> Why make usersapce do the offset math?
> 

Okay. Already comprehended above.

> > +
> > +       return sysfs_emit(buf, "%pa\n", &offset);
> > +}
> > +static DEVICE_ATTR_RO(offset);
> 
> This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
> information from non-root.
> 
> > +
> > +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> > +                        char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);
> 
> Perhaps no need to store size separately if this becomes:
> 
> sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));

iirc that broke the 80 character limit so I opted to do it this way. I can
change it.

> 
> 
> ...?
> 
> > +}
> > +
> > +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> > +                         const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       unsigned long long val;
> > +       ssize_t rc;
> > +
> > +       rc = kstrtoull(buf, 0, &val);
> > +       if (rc)
> > +               return rc;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (is_region_active(cxlr))
> > +               rc = -EBUSY;
> > +       else
> > +               cxlr->config.size = val;
> > +       device_unlock(&cxlr->dev);
> 
> I think lockdep will complain about device_lock() usage in an
> attribute. Try changing this to cxl_device_lock() with
> CONFIG_PROVE_CXL_LOCKING=y.
> 

I might have messed it up, but I didn't seem to run into an issue. With the
driver bound check though, it can go away.

I think it would be really good to add this kind of detail to sysfs.rst. Quick
grep finds me arm64/kernel/mte and the nfit driver taking the device lock in an
attr.


> > +
> > +       return rc ? rc : len;
> > +}
> > +static DEVICE_ATTR_RW(size);
> > +
> > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > +                        char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > +}
> > +
> > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > +                         const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       ssize_t rc;
> > +
> > +       if (len != UUID_STRING_LEN + 1)
> > +               return -EINVAL;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (is_region_active(cxlr))
> > +               rc = -EBUSY;
> > +       else
> > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return rc ? rc : len;
> > +}
> > +static DEVICE_ATTR_RW(uuid);
> > +
> > +static struct attribute *region_attrs[] = {
> > +       &dev_attr_interleave_ways.attr,
> > +       &dev_attr_interleave_granularity.attr,
> > +       &dev_attr_offset.attr,
> > +       &dev_attr_size.attr,
> > +       &dev_attr_uuid.attr,
> > +       NULL,
> > +};
> > +
> > +static const struct attribute_group region_group = {
> > +       .attrs = region_attrs,
> > +};
> > +
> > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > +{
> > +       int ret;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (!cxlr->config.targets[n])
> > +               ret = sysfs_emit(buf, "\n");
> > +       else
> > +               ret = sysfs_emit(buf, "%s\n",
> > +                                dev_name(&cxlr->config.targets[n]->dev));
> > +       device_unlock(&cxlr->dev);
> 
> The component contribution of a memdev to a region is a DPA-span, not
> the whole memdev. I would expect something like dax_mapping_attributes
> or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> information about the component contribution of a memdev to a region.
> 

I think show_target should just return the chosen decoder and then the decoder
attributes will tell the rest, wouldn't they?

> > +
> > +       return ret;
> > +}
> > +
> > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > +                         size_t len)
> > +{
> > +       struct device *memdev_dev;
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       device_lock(&cxlr->dev);
> > +
> > +       if (len == 1 || cxlr->config.targets[n])
> > +               remove_target(cxlr, n);
> > +
> > +       /* Remove target special case */
> > +       if (len == 1) {
> > +               device_unlock(&cxlr->dev);
> > +               return len;
> > +       }
> > +
> > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> 
> I think this wants to be an endpoint decoder, not a memdev. Because
> it's the decoder that joins a memdev to a region, or at least a
> decoder should be picked when the memdev is assigned so that the DPA
> mapping can be registered. If all the decoders are allocated then fail
> here.
>

Per above, I think making this decoders makes sense. I could make it flexible
for ease of use, like if you specify memX, the kernel will pick a decoder for
you however I suspect you won't like that.

> > +       if (!memdev_dev) {
> > +               device_unlock(&cxlr->dev);
> > +               return -ENOENT;
> > +       }
> > +
> > +       /* reference to memdev held until target is unset or region goes away */
> > +
> > +       cxlmd = to_cxl_memdev(memdev_dev);
> > +       cxlr->config.targets[n] = cxlmd;
> > +
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return len;
> > +}
> > +
> > +#define TARGET_ATTR_RW(n)                                                      \
> > +       static ssize_t target##n##_show(                                       \
> > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > +       {                                                                      \
> > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > +       }                                                                      \
> > +       static ssize_t target##n##_store(struct device *dev,                   \
> > +                                        struct device_attribute *attr,        \
> > +                                        const char *buf, size_t len)          \
> > +       {                                                                      \
> > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > +       }                                                                      \
> > +       static DEVICE_ATTR_RW(target##n)
> > +
> > +TARGET_ATTR_RW(0);
> > +TARGET_ATTR_RW(1);
> > +TARGET_ATTR_RW(2);
> > +TARGET_ATTR_RW(3);
> > +TARGET_ATTR_RW(4);
> > +TARGET_ATTR_RW(5);
> > +TARGET_ATTR_RW(6);
> > +TARGET_ATTR_RW(7);
> > +TARGET_ATTR_RW(8);
> > +TARGET_ATTR_RW(9);
> > +TARGET_ATTR_RW(10);
> > +TARGET_ATTR_RW(11);
> > +TARGET_ATTR_RW(12);
> > +TARGET_ATTR_RW(13);
> > +TARGET_ATTR_RW(14);
> > +TARGET_ATTR_RW(15);
> > +
> > +static struct attribute *interleave_attrs[] = {
> > +       &dev_attr_target0.attr,
> > +       &dev_attr_target1.attr,
> > +       &dev_attr_target2.attr,
> > +       &dev_attr_target3.attr,
> > +       &dev_attr_target4.attr,
> > +       &dev_attr_target5.attr,
> > +       &dev_attr_target6.attr,
> > +       &dev_attr_target7.attr,
> > +       &dev_attr_target8.attr,
> > +       &dev_attr_target9.attr,
> > +       &dev_attr_target10.attr,
> > +       &dev_attr_target11.attr,
> > +       &dev_attr_target12.attr,
> > +       &dev_attr_target13.attr,
> > +       &dev_attr_target14.attr,
> > +       &dev_attr_target15.attr,
> > +       NULL,
> > +};
> > +
> > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > +{
> > +       struct device *dev = container_of(kobj, struct device, kobj);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       if (n < cxlr->config.interleave_ways)
> > +               return a->mode;
> > +       return 0;
> > +}
> > +
> > +static const struct attribute_group region_interleave_group = {
> > +       .attrs = interleave_attrs,
> > +       .is_visible = visible_targets,
> > +};
> > +
> > +static const struct attribute_group *region_groups[] = {
> > +       &region_group,
> > +       &region_interleave_group,
> > +       NULL,
> > +};
> > +
> >  static void cxl_region_release(struct device *dev);
> >
> >  static const struct device_type cxl_region_type = {
> >         .name = "cxl_region",
> >         .release = cxl_region_release,
> > +       .groups = region_groups
> >  };
> >
> >  static ssize_t create_region_show(struct device *dev,
> > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> >  {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> >         struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int i;
> >
> >         ida_free(&cxld->region_ida, cxlr->id);
> > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > +               remove_target(cxlr, i);
> 
> Like the last patch this feels too late. I expect whatever unregisters
> the region should have already handled removing the targets.
> 

Would remove() be more appropriate?

> >         kfree(cxlr);
> >  }
> >
> > --
> > 2.35.0
> >
Dan Williams Feb. 17, 2022, 7:57 p.m. UTC | #11
On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Consolidating earlier discussions...
>
> On 22-01-28 16:25:34, Dan Williams wrote:
> > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > The region creation APIs create a vacant region. Configuring the region
> > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > will be provided to allow userspace to configure the region.  Finally
> > > once all configuration is complete, userspace may activate the region.
> > >
> > > Introduced here are the most basic attributes needed to configure a
> > > region. Details of these attribute are described in the ABI
> >
> > s/attribute/attributes/
> >
> > > Documentation. Sanity checking of configuration parameters are done at
> > > region binding time. This consolidates all such logic in one place,
> > > rather than being strewn across multiple places.
> >
> > I think that's too late for some of the validation. The complex
> > validation that the region driver does throughout the topology is
> > different from the basic input validation that can  be done at the
> > sysfs write time. For example ,this patch allows negative
> > interleave_granularity values to specified, just return -EINVAL. I
> > agree that sysfs should not validate everything, I disagree with
> > pushing all validation to cxl_region_probe().
> >
>
> Okay. It might save us some back and forth if you could outline everything you'd
> expect to be validated, but I can also make an attempt to figure out the
> reasonable set of things.

Input validation. Every value that gets written to a sysfs attribute
should be checked for validity, more below:

>
> > >
> > > A example is provided below:
> > >
> > > /sys/bus/cxl/devices/region0.0:0
> > > ├── interleave_granularity

...validate granularity is within spec and can be supported by the root decoder.

> > > ├── interleave_ways

...validate ways is within spec and can be supported by the root decoder.

> > > ├── offset
> > > ├── size

...try to reserve decoder capacity to validate that there is available space.

> > > ├── subsystem -> ../../../../../../bus/cxl
> > > ├── target0

...validate that the target maps to the decoder.

> > > ├── uevent
> > > └── uuid

...validate that the uuid is unique relative to other regions.

> >
> > As mentioned off-list, it looks like devtype and modalias are missing.
> >
>
> Yep. This belongs in the previous patch though.
>
> > >
> > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > >  2 files changed, 340 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > index dcc728458936..50ba5018014d 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > @@ -187,3 +187,43 @@ Description:
> > >                 region driver before being deleted. The attributes expects a
> > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > >                 by reading create_region, will also be released.
> > > +
> > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> >
> > This is just another 'resource' attribute for the physical base
> > address of the region, right? 'offset' sounds like something that
> > would be relative instead of absolute.
> >
>
> It was meant to be relative. I can make it absolute if that's preferable but the
> physical base is known at the decoder level already.

Yes, but it saves userspace a step to get the absolute value here and
matches what happens in PCI sysfs where the user is not required to
look up the bridge base to calculate the absolute value.

>
> > > +Date:          August, 2021
> >
> > Same date update comment here.
> >
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               (RO) A region resides within an address space that is claimed by
> > > +               a decoder.
> >
> > "A region is a contiguous partition of a CXL Root decoder address space."
> >
> > >                  Region space allocation is handled by the driver, but
> >
> > "Region capacity is allocated by writing to the size attribute, the
> > resulting physical address base determined by the driver is reflected
> > here."
> >
> > > +               the offset may be read by userspace tooling in order to
> > > +               determine fragmentation, and available size for new regions.
> >
> > I would also expect, before / along with these new region attributes,
> > there would be 'available' and 'max_extent_available' at the decoder
> > level to indicate how much free space the decoder has and how big the
> > next region creation can be. User tooling can walk  the decoder and
> > the regions together to determine fragmentation if necessary, but for
> > the most part the tool likely only cares about "how big can the next
> > region be?" and "how full is this decoder?".
> >
>
> Since this is the configuration part of the ABI, I'd rather add that information
> when the plumbing to report them exists. I'm struggling to understand the
> balance (as mentioned also earlier in this mail thread) as to what userspace
> does and what the kernel does. I will add these as you request.

Userspace asks by DPA size and SPA size and the kernel validates /
performs the allocation on its behalf.

> > > +
> > > +What:
> > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > +Date:          August, 2021
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > +               order for the subsequent bind operation to succeed. The
> > > +               following parameters are defined:
> >
> > Let's split up the descriptions into individual sections. That can
> > also document the order that attributes must be written. For example,
> > doesn't size need to be set before targets are added so that targets
> > can be validated whether they have sufficient capacity?
> >
>
> Okay. Since we're moving toward making the sysfs ABI stateful,

sysfs is always stateful. Stateless would be an ioctl.

> would you like me
> to make the attrs only visible when they can actually be set?

No, that's a bit too much magic, and it would be racy.

>
> > > +
> > > +               ==      ========================================================
> > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > +                       each device in the interleave set will claim. The
> > > +                       possible interleave granularity values are determined by
> > > +                       the CXL spec and the participating devices.
> > > +               interleave_ways Mandatory. Number of devices participating in the
> > > +                       region. Each device will provide 1/interleave of storage
> > > +                       for the region.
> > > +               size    Manadatory. Phsyical address space the region will
> > > +                       consume.
> >
> > s/Phsyical/Physical/
> >
> > > +               target  Mandatory. Memory devices are the backing storage for a
> > > +                       region. There will be N targets based on the number of
> > > +                       interleave ways that the top level decoder is configured
> > > +                       for.
> >
> > That doesn't sound right, IW at the root != IW at the endpoint level
> > and the region needs to record all the endpoint level targets.
>
> Correct.
>
> >
> > > Each target must be set with a memdev device ie.
> > > +                       'mem1'. This attribute only becomes available after
> > > +                       setting the 'interleave' attribute.
> > > +               uuid    Optional. A unique identifier for the region. If none is
> > > +                       selected, the kernel will create one.
> >
> > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > understanding what this is trying to document. For example 'uuid' is
> > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > optional.
>
> Okay.
>
> >
> > > +               ==      ========================================================
> > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > index 1a448543db0d..3b48e0469fc7 100644
> > > --- a/drivers/cxl/core/region.c
> > > +++ b/drivers/cxl/core/region.c
> > > @@ -3,9 +3,12 @@
> > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > >  #include <linux/device.h>
> > >  #include <linux/module.h>
> > > +#include <linux/sizes.h>
> > >  #include <linux/slab.h>
> > > +#include <linux/uuid.h>
> > >  #include <linux/idr.h>
> > >  #include <region.h>
> > > +#include <cxlmem.h>
> > >  #include <cxl.h>
> > >  #include "core.h"
> > >
> > > @@ -18,11 +21,305 @@
> > >   * (programming the hardware) is handled by a separate region driver.
> > >   */
> > >
> > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > +static const struct attribute_group region_interleave_group;
> > > +
> > > +static bool is_region_active(struct cxl_region *cxlr)
> > > +{
> > > +       /* TODO: Regions can't be activated yet. */
> > > +       return false;
> >
> > This function seems redundant with just checking "cxlr->dev.driver !=
> > NULL"? The benefit of that is there is no need to carry a TODO in the
> > series.
> >
>
> The idea behind this was to give the reviewer somewhat of a bigger picture as to
> how things should work in the code rather than in a commit message. I will
> remove this.

They look premature to me.

>
> > > +}
> > > +
> > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > +{
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       cxlmd = cxlr->config.targets[target];
> > > +       if (cxlmd)
> > > +               put_device(&cxlmd->dev);
> >
> > A memdev can be a member of multiple regions at once, shouldn't this
> > be an endpoint decoder or similar, not the entire memdev?
> >
> > Also, if memdevs autoremove themselves from regions at memdev
> > ->remove() time then I don't think the region needs to hold references
> > on memdevs.
> >
>
> This needs some work. The concern I have is region operations will need to
> operate on memdevs/decoders at various points in time. When the memdev goes
> away, the region will also need to go away. None of that plumbing was in place
> in v3 and the reference on the memdev was just a half-hearted attempt at doing
> the right thing.
>
> For now if you prefer I remove the reference, but perhaps the decoder reference
> would buy us some safety?

So, I don't want to merge an interim solution. I think this series
needs to prove out the end to end final ABI with all the lifetime
issues worked out before committing to it upstream because lifetime
issues get much harder to fix when they also need to conform to a
legacy ABI.

>
> > > +       cxlr->config.targets[target] = NULL;
> > > +}
> > > +
> > > +static ssize_t interleave_ways_show(struct device *dev,
> > > +                                   struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > +}
> > > +
> > > +static ssize_t interleave_ways_store(struct device *dev,
> > > +                                    struct device_attribute *attr,
> > > +                                    const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       int ret, prev_iw;
> > > +       int val;
> >
> > I would expect:
> >
> > if (dev->driver)
> >    return -EBUSY;
> >
> > ...to shutdown configuration writes once the region is active. Might
> > also need a region-wide seqlock like target_list_show. So that region
> > probe drains  all active sysfs writers before assuming the
> > configuration is stable.
>
> Initially my thought here is that this is a problem for userspace to deal with.
> If userspace can't figure out how to synchronously configure and bind the
> region, that's not a kernel problem.

The kernel always needs to protect itself. Userspace is free to race
itself, but it can not be allowed to trigger a kernel race. So there
needs to be protection against userspace writing interleave_ways and
the kernel being able to trust that interleave_ways is now static for
the life of the region.

> However, we've put some effort into
> protecting userspace from itself in the create ABI, so it might be more in line
> to do that here.

That safety was about preventing userspace from leaking kernel memory,
not about protecting userspace from itself. It's still the case that
userspace racing itself will get horribly confused when it collides
region creation, but the kernel protects itself by resolving the race.

> In summary, I'm fine to add it, but I think I really need to get more in your
> brain about the userspace/kernel divide sooner rather than later.

Don't let userspace break the kernel, that's it.

>
> >
> > > +
> > > +       prev_iw = cxlr->config.interleave_ways;
> > > +       ret = kstrtoint(buf, 0, &val);
> > > +       if (ret)
> > > +               return ret;
> > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > +               return -EINVAL;
> > > +
> > > +       cxlr->config.interleave_ways = val;
> > > +
> > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > +       if (ret < 0)
> > > +               goto err;
> > > +
> > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> >
> > Why?
> >
>
> copypasta
>
> > > +
> > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > +               remove_target(cxlr, --prev_iw);
> >
> > To make the kernel side simpler this attribute could just require that
> > setting interleave ways is a one way street, if you want to change it
> > you need to delete the region and start over.
> >
>
> Okay. One of the earlier versions did this implicitly since the #ways was needed
> to create the region. I thought from the ABI perspective, flexibility was good.
> Userspace may choose not to utilize it.

More flexibility == more maintenance burden. If it's not strictly
necessary, don't expose it, so making this read-only seems simpler to
me.

[..]
> > > +       device_lock(&cxlr->dev);
> > > +       if (is_region_active(cxlr))
> > > +               rc = -EBUSY;
> > > +       else
> > > +               cxlr->config.size = val;
> > > +       device_unlock(&cxlr->dev);
> >
> > I think lockdep will complain about device_lock() usage in an
> > attribute. Try changing this to cxl_device_lock() with
> > CONFIG_PROVE_CXL_LOCKING=y.
> >
>
> I might have messed it up, but I didn't seem to run into an issue. With the
> driver bound check though, it can go away.
>
> I think it would be really good to add this kind of detail to sysfs.rst. Quick
> grep finds me arm64/kernel/mte and the nfit driver taking the device lock in an
> attr.

Yeah, CONFIG_PROVE_{NVDIMM,CXL}_LOCKING needs to annotate the
driver-core as well. I'm concerned there's a class of deadlocks that
lockdep just can't see.

>
>
> > > +
> > > +       return rc ? rc : len;
> > > +}
> > > +static DEVICE_ATTR_RW(size);
> > > +
> > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > +                        char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > +}
> > > +
> > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > +                         const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       ssize_t rc;
> > > +
> > > +       if (len != UUID_STRING_LEN + 1)
> > > +               return -EINVAL;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (is_region_active(cxlr))
> > > +               rc = -EBUSY;
> > > +       else
> > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > +       device_unlock(&cxlr->dev);
> > > +
> > > +       return rc ? rc : len;
> > > +}
> > > +static DEVICE_ATTR_RW(uuid);
> > > +
> > > +static struct attribute *region_attrs[] = {
> > > +       &dev_attr_interleave_ways.attr,
> > > +       &dev_attr_interleave_granularity.attr,
> > > +       &dev_attr_offset.attr,
> > > +       &dev_attr_size.attr,
> > > +       &dev_attr_uuid.attr,
> > > +       NULL,
> > > +};
> > > +
> > > +static const struct attribute_group region_group = {
> > > +       .attrs = region_attrs,
> > > +};
> > > +
> > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > +{
> > > +       int ret;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (!cxlr->config.targets[n])
> > > +               ret = sysfs_emit(buf, "\n");
> > > +       else
> > > +               ret = sysfs_emit(buf, "%s\n",
> > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > +       device_unlock(&cxlr->dev);
> >
> > The component contribution of a memdev to a region is a DPA-span, not
> > the whole memdev. I would expect something like dax_mapping_attributes
> > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > information about the component contribution of a memdev to a region.
> >
>
> I think show_target should just return the chosen decoder and then the decoder
> attributes will tell the rest, wouldn't they?

Given the conflicts that can arise between HDM decoders needing to map
increasing DPA values and other conflicts that there will be
situations where the kernel auto-picking a decoder will get in the
way. Exposing the decoder selection to userspace also gives one more
place to do leaf validation. I.e. at decoder-to-region assignment time
the kernel can validate that the DPA is available and can be mapped by
the given decoder given the state of other decoders on that device.

>
> > > +
> > > +       return ret;
> > > +}
> > > +
> > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > +                         size_t len)
> > > +{
> > > +       struct device *memdev_dev;
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +
> > > +       if (len == 1 || cxlr->config.targets[n])
> > > +               remove_target(cxlr, n);
> > > +
> > > +       /* Remove target special case */
> > > +       if (len == 1) {
> > > +               device_unlock(&cxlr->dev);
> > > +               return len;
> > > +       }
> > > +
> > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> >
> > I think this wants to be an endpoint decoder, not a memdev. Because
> > it's the decoder that joins a memdev to a region, or at least a
> > decoder should be picked when the memdev is assigned so that the DPA
> > mapping can be registered. If all the decoders are allocated then fail
> > here.
> >
>
> Per above, I think making this decoders makes sense. I could make it flexible
> for ease of use, like if you specify memX, the kernel will pick a decoder for
> you however I suspect you won't like that.

Right, put the user friendliness in the tooling, not sysfs ABI.

>
> > > +       if (!memdev_dev) {
> > > +               device_unlock(&cxlr->dev);
> > > +               return -ENOENT;
> > > +       }
> > > +
> > > +       /* reference to memdev held until target is unset or region goes away */
> > > +
> > > +       cxlmd = to_cxl_memdev(memdev_dev);
> > > +       cxlr->config.targets[n] = cxlmd;
> > > +
> > > +       device_unlock(&cxlr->dev);
> > > +
> > > +       return len;
> > > +}
> > > +
> > > +#define TARGET_ATTR_RW(n)                                                      \
> > > +       static ssize_t target##n##_show(                                       \
> > > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > > +       {                                                                      \
> > > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > > +       }                                                                      \
> > > +       static ssize_t target##n##_store(struct device *dev,                   \
> > > +                                        struct device_attribute *attr,        \
> > > +                                        const char *buf, size_t len)          \
> > > +       {                                                                      \
> > > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > > +       }                                                                      \
> > > +       static DEVICE_ATTR_RW(target##n)
> > > +
> > > +TARGET_ATTR_RW(0);
> > > +TARGET_ATTR_RW(1);
> > > +TARGET_ATTR_RW(2);
> > > +TARGET_ATTR_RW(3);
> > > +TARGET_ATTR_RW(4);
> > > +TARGET_ATTR_RW(5);
> > > +TARGET_ATTR_RW(6);
> > > +TARGET_ATTR_RW(7);
> > > +TARGET_ATTR_RW(8);
> > > +TARGET_ATTR_RW(9);
> > > +TARGET_ATTR_RW(10);
> > > +TARGET_ATTR_RW(11);
> > > +TARGET_ATTR_RW(12);
> > > +TARGET_ATTR_RW(13);
> > > +TARGET_ATTR_RW(14);
> > > +TARGET_ATTR_RW(15);
> > > +
> > > +static struct attribute *interleave_attrs[] = {
> > > +       &dev_attr_target0.attr,
> > > +       &dev_attr_target1.attr,
> > > +       &dev_attr_target2.attr,
> > > +       &dev_attr_target3.attr,
> > > +       &dev_attr_target4.attr,
> > > +       &dev_attr_target5.attr,
> > > +       &dev_attr_target6.attr,
> > > +       &dev_attr_target7.attr,
> > > +       &dev_attr_target8.attr,
> > > +       &dev_attr_target9.attr,
> > > +       &dev_attr_target10.attr,
> > > +       &dev_attr_target11.attr,
> > > +       &dev_attr_target12.attr,
> > > +       &dev_attr_target13.attr,
> > > +       &dev_attr_target14.attr,
> > > +       &dev_attr_target15.attr,
> > > +       NULL,
> > > +};
> > > +
> > > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > > +{
> > > +       struct device *dev = container_of(kobj, struct device, kobj);
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       if (n < cxlr->config.interleave_ways)
> > > +               return a->mode;
> > > +       return 0;
> > > +}
> > > +
> > > +static const struct attribute_group region_interleave_group = {
> > > +       .attrs = interleave_attrs,
> > > +       .is_visible = visible_targets,
> > > +};
> > > +
> > > +static const struct attribute_group *region_groups[] = {
> > > +       &region_group,
> > > +       &region_interleave_group,
> > > +       NULL,
> > > +};
> > > +
> > >  static void cxl_region_release(struct device *dev);
> > >
> > >  static const struct device_type cxl_region_type = {
> > >         .name = "cxl_region",
> > >         .release = cxl_region_release,
> > > +       .groups = region_groups
> > >  };
> > >
> > >  static ssize_t create_region_show(struct device *dev,
> > > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> > >  {
> > >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > >         struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       int i;
> > >
> > >         ida_free(&cxld->region_ida, cxlr->id);
> > > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > > +               remove_target(cxlr, i);
> >
> > Like the last patch this feels too late. I expect whatever unregisters
> > the region should have already handled removing the targets.
> >
>
> Would remove() be more appropriate?

->remove() does not seem a good fit since it may be the case that
someone wants do "echo $region >
/sys/bus/cxl/drivers/cxl_region/unbind; echo $region >
/sys/bus/cxl/drivers/cxl_region/bind;" without needing to go
reconfigure the targets. I am suggesting that before
device_unregister(&cxlr->dev) the targets are released.
Ben Widawsky Feb. 17, 2022, 8:20 p.m. UTC | #12
On 22-02-17 11:57:59, Dan Williams wrote:
> On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Consolidating earlier discussions...
> >
> > On 22-01-28 16:25:34, Dan Williams wrote:
> > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > The region creation APIs create a vacant region. Configuring the region
> > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > will be provided to allow userspace to configure the region.  Finally
> > > > once all configuration is complete, userspace may activate the region.
> > > >
> > > > Introduced here are the most basic attributes needed to configure a
> > > > region. Details of these attribute are described in the ABI
> > >
> > > s/attribute/attributes/
> > >
> > > > Documentation. Sanity checking of configuration parameters are done at
> > > > region binding time. This consolidates all such logic in one place,
> > > > rather than being strewn across multiple places.
> > >
> > > I think that's too late for some of the validation. The complex
> > > validation that the region driver does throughout the topology is
> > > different from the basic input validation that can  be done at the
> > > sysfs write time. For example ,this patch allows negative
> > > interleave_granularity values to specified, just return -EINVAL. I
> > > agree that sysfs should not validate everything, I disagree with
> > > pushing all validation to cxl_region_probe().
> > >
> >
> > Okay. It might save us some back and forth if you could outline everything you'd
> > expect to be validated, but I can also make an attempt to figure out the
> > reasonable set of things.
> 
> Input validation. Every value that gets written to a sysfs attribute
> should be checked for validity, more below:
> 
> >
> > > >
> > > > A example is provided below:
> > > >
> > > > /sys/bus/cxl/devices/region0.0:0
> > > > ├── interleave_granularity
> 
> ...validate granularity is within spec and can be supported by the root decoder.
> 
> > > > ├── interleave_ways
> 
> ...validate ways is within spec and can be supported by the root decoder.
> 
> > > > ├── offset
> > > > ├── size
> 
> ...try to reserve decoder capacity to validate that there is available space.
> 
> > > > ├── subsystem -> ../../../../../../bus/cxl
> > > > ├── target0
> 
> ...validate that the target maps to the decoder.
> 
> > > > ├── uevent
> > > > └── uuid
> 
> ...validate that the uuid is unique relative to other regions.
> 

Okay.

> > >
> > > As mentioned off-list, it looks like devtype and modalias are missing.
> > >
> >
> > Yep. This belongs in the previous patch though.
> >
> > > >
> > > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > > >  2 files changed, 340 insertions(+)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > index dcc728458936..50ba5018014d 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > @@ -187,3 +187,43 @@ Description:
> > > >                 region driver before being deleted. The attributes expects a
> > > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > > >                 by reading create_region, will also be released.
> > > > +
> > > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> > >
> > > This is just another 'resource' attribute for the physical base
> > > address of the region, right? 'offset' sounds like something that
> > > would be relative instead of absolute.
> > >
> >
> > It was meant to be relative. I can make it absolute if that's preferable but the
> > physical base is known at the decoder level already.
> 
> Yes, but it saves userspace a step to get the absolute value here and
> matches what happens in PCI sysfs where the user is not required to
> look up the bridge base to calculate the absolute value.
> 

Okay.

> >
> > > > +Date:          August, 2021
> > >
> > > Same date update comment here.
> > >
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RO) A region resides within an address space that is claimed by
> > > > +               a decoder.
> > >
> > > "A region is a contiguous partition of a CXL Root decoder address space."
> > >
> > > >                  Region space allocation is handled by the driver, but
> > >
> > > "Region capacity is allocated by writing to the size attribute, the
> > > resulting physical address base determined by the driver is reflected
> > > here."
> > >
> > > > +               the offset may be read by userspace tooling in order to
> > > > +               determine fragmentation, and available size for new regions.
> > >
> > > I would also expect, before / along with these new region attributes,
> > > there would be 'available' and 'max_extent_available' at the decoder
> > > level to indicate how much free space the decoder has and how big the
> > > next region creation can be. User tooling can walk  the decoder and
> > > the regions together to determine fragmentation if necessary, but for
> > > the most part the tool likely only cares about "how big can the next
> > > region be?" and "how full is this decoder?".
> > >
> >
> > Since this is the configuration part of the ABI, I'd rather add that information
> > when the plumbing to report them exists. I'm struggling to understand the
> > balance (as mentioned also earlier in this mail thread) as to what userspace
> > does and what the kernel does. I will add these as you request.
> 
> Userspace asks by DPA size and SPA size and the kernel validates /
> performs the allocation on its behalf.

Okay, I guess that addresses below regarding the tuple.

> 
> > > > +
> > > > +What:
> > > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > > +Date:          August, 2021
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > > +               order for the subsequent bind operation to succeed. The
> > > > +               following parameters are defined:
> > >
> > > Let's split up the descriptions into individual sections. That can
> > > also document the order that attributes must be written. For example,
> > > doesn't size need to be set before targets are added so that targets
> > > can be validated whether they have sufficient capacity?
> > >
> >
> > Okay. Since we're moving toward making the sysfs ABI stateful,
> 
> sysfs is always stateful. Stateless would be an ioctl.
> 
> > would you like me
> > to make the attrs only visible when they can actually be set?
> 
> No, that's a bit too much magic, and it would be racy.
> 
> >
> > > > +
> > > > +               ==      ========================================================
> > > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > > +                       each device in the interleave set will claim. The
> > > > +                       possible interleave granularity values are determined by
> > > > +                       the CXL spec and the participating devices.
> > > > +               interleave_ways Mandatory. Number of devices participating in the
> > > > +                       region. Each device will provide 1/interleave of storage
> > > > +                       for the region.
> > > > +               size    Manadatory. Phsyical address space the region will
> > > > +                       consume.
> > >
> > > s/Phsyical/Physical/
> > >
> > > > +               target  Mandatory. Memory devices are the backing storage for a
> > > > +                       region. There will be N targets based on the number of
> > > > +                       interleave ways that the top level decoder is configured
> > > > +                       for.
> > >
> > > That doesn't sound right, IW at the root != IW at the endpoint level
> > > and the region needs to record all the endpoint level targets.
> >
> > Correct.
> >
> > >
> > > > Each target must be set with a memdev device ie.
> > > > +                       'mem1'. This attribute only becomes available after
> > > > +                       setting the 'interleave' attribute.
> > > > +               uuid    Optional. A unique identifier for the region. If none is
> > > > +                       selected, the kernel will create one.
> > >
> > > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > > understanding what this is trying to document. For example 'uuid' is
> > > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > > optional.
> >
> > Okay.
> >
> > >
> > > > +               ==      ========================================================
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > index 1a448543db0d..3b48e0469fc7 100644
> > > > --- a/drivers/cxl/core/region.c
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -3,9 +3,12 @@
> > > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > > >  #include <linux/device.h>
> > > >  #include <linux/module.h>
> > > > +#include <linux/sizes.h>
> > > >  #include <linux/slab.h>
> > > > +#include <linux/uuid.h>
> > > >  #include <linux/idr.h>
> > > >  #include <region.h>
> > > > +#include <cxlmem.h>
> > > >  #include <cxl.h>
> > > >  #include "core.h"
> > > >
> > > > @@ -18,11 +21,305 @@
> > > >   * (programming the hardware) is handled by a separate region driver.
> > > >   */
> > > >
> > > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > > +static const struct attribute_group region_interleave_group;
> > > > +
> > > > +static bool is_region_active(struct cxl_region *cxlr)
> > > > +{
> > > > +       /* TODO: Regions can't be activated yet. */
> > > > +       return false;
> > >
> > > This function seems redundant with just checking "cxlr->dev.driver !=
> > > NULL"? The benefit of that is there is no need to carry a TODO in the
> > > series.
> > >
> >
> > The idea behind this was to give the reviewer somewhat of a bigger picture as to
> > how things should work in the code rather than in a commit message. I will
> > remove this.
> 
> They look premature to me.
> 

Given that you don't want me to reference the DWG, it is. The steps outlined
with TODOs were all based on the DWG's overall flow.

> >
> > > > +}
> > > > +
> > > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > > +{
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       cxlmd = cxlr->config.targets[target];
> > > > +       if (cxlmd)
> > > > +               put_device(&cxlmd->dev);
> > >
> > > A memdev can be a member of multiple regions at once, shouldn't this
> > > be an endpoint decoder or similar, not the entire memdev?
> > >
> > > Also, if memdevs autoremove themselves from regions at memdev
> > > ->remove() time then I don't think the region needs to hold references
> > > on memdevs.
> > >
> >
> > This needs some work. The concern I have is region operations will need to
> > operate on memdevs/decoders at various points in time. When the memdev goes
> > away, the region will also need to go away. None of that plumbing was in place
> > in v3 and the reference on the memdev was just a half-hearted attempt at doing
> > the right thing.
> >
> > For now if you prefer I remove the reference, but perhaps the decoder reference
> > would buy us some safety?
> 
> So, I don't want to merge an interim solution. I think this series
> needs to prove out the end to end final ABI with all the lifetime
> issues worked out before committing to it upstream because lifetime
> issues get much harder to fix when they also need to conform to a
> legacy ABI.
> 

I should have been clearer. I had been planning to send at least one more
version before promising to fix lifetime issues. I will skip that step.

> >
> > > > +       cxlr->config.targets[target] = NULL;
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_show(struct device *dev,
> > > > +                                   struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_store(struct device *dev,
> > > > +                                    struct device_attribute *attr,
> > > > +                                    const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int ret, prev_iw;
> > > > +       int val;
> > >
> > > I would expect:
> > >
> > > if (dev->driver)
> > >    return -EBUSY;
> > >
> > > ...to shutdown configuration writes once the region is active. Might
> > > also need a region-wide seqlock like target_list_show. So that region
> > > probe drains  all active sysfs writers before assuming the
> > > configuration is stable.
> >
> > Initially my thought here is that this is a problem for userspace to deal with.
> > If userspace can't figure out how to synchronously configure and bind the
> > region, that's not a kernel problem.
> 
> The kernel always needs to protect itself. Userspace is free to race
> itself, but it can not be allowed to trigger a kernel race. So there
> needs to be protection against userspace writing interleave_ways and
> the kernel being able to trust that interleave_ways is now static for
> the life of the region.

Yeah - originally I was relying on the device_lock for this, but that now
doesn't work. seqlock is fine. I could also copy all the config information at
the beginning of probe and simply use that.

If we're going the route of making interleave_ways write-once, why not make all
attributes the same?

> 
> > However, we've put some effort into
> > protecting userspace from itself in the create ABI, so it might be more in line
> > to do that here.
> 
> That safety was about preventing userspace from leaking kernel memory,
> not about protecting userspace from itself. It's still the case that
> userspace racing itself will get horribly confused when it collides
> region creation, but the kernel protects itself by resolving the race.
> 
> > In summary, I'm fine to add it, but I think I really need to get more in your
> > brain about the userspace/kernel divide sooner rather than later.
> 
> Don't let userspace break the kernel, that's it.
> 
> >
> > >
> > > > +
> > > > +       prev_iw = cxlr->config.interleave_ways;
> > > > +       ret = kstrtoint(buf, 0, &val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > > +               return -EINVAL;
> > > > +
> > > > +       cxlr->config.interleave_ways = val;
> > > > +
> > > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > > +       if (ret < 0)
> > > > +               goto err;
> > > > +
> > > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> > >
> > > Why?
> > >
> >
> > copypasta
> >
> > > > +
> > > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > > +               remove_target(cxlr, --prev_iw);
> > >
> > > To make the kernel side simpler this attribute could just require that
> > > setting interleave ways is a one way street, if you want to change it
> > > you need to delete the region and start over.
> > >
> >
> > Okay. One of the earlier versions did this implicitly since the #ways was needed
> > to create the region. I thought from the ABI perspective, flexibility was good.
> > Userspace may choose not to utilize it.
> 
> More flexibility == more maintenance burden. If it's not strictly
> necessary, don't expose it, so making this read-only seems simpler to
> me.
> 
> [..]
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               cxlr->config.size = val;
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > I think lockdep will complain about device_lock() usage in an
> > > attribute. Try changing this to cxl_device_lock() with
> > > CONFIG_PROVE_CXL_LOCKING=y.
> > >
> >
> > I might have messed it up, but I didn't seem to run into an issue. With the
> > driver bound check though, it can go away.
> >
> > I think it would be really good to add this kind of detail to sysfs.rst. Quick
> > grep finds me arm64/kernel/mte and the nfit driver taking the device lock in an
> > attr.
> 
> Yeah, CONFIG_PROVE_{NVDIMM,CXL}_LOCKING needs to annotate the
> driver-core as well. I'm concerned there's a class of deadlocks that
> lockdep just can't see.
> 
> >
> >
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(size);
> > > > +
> > > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > > +                        char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > > +}
> > > > +
> > > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > > +                         const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       ssize_t rc;
> > > > +
> > > > +       if (len != UUID_STRING_LEN + 1)
> > > > +               return -EINVAL;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(uuid);
> > > > +
> > > > +static struct attribute *region_attrs[] = {
> > > > +       &dev_attr_interleave_ways.attr,
> > > > +       &dev_attr_interleave_granularity.attr,
> > > > +       &dev_attr_offset.attr,
> > > > +       &dev_attr_size.attr,
> > > > +       &dev_attr_uuid.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static const struct attribute_group region_group = {
> > > > +       .attrs = region_attrs,
> > > > +};
> > > > +
> > > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > > +{
> > > > +       int ret;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (!cxlr->config.targets[n])
> > > > +               ret = sysfs_emit(buf, "\n");
> > > > +       else
> > > > +               ret = sysfs_emit(buf, "%s\n",
> > > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > The component contribution of a memdev to a region is a DPA-span, not
> > > the whole memdev. I would expect something like dax_mapping_attributes
> > > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > > information about the component contribution of a memdev to a region.
> > >
> >
> > I think show_target should just return the chosen decoder and then the decoder
> > attributes will tell the rest, wouldn't they?
> 
> Given the conflicts that can arise between HDM decoders needing to map
> increasing DPA values and other conflicts that there will be
> situations where the kernel auto-picking a decoder will get in the
> way. Exposing the decoder selection to userspace also gives one more
> place to do leaf validation. I.e. at decoder-to-region assignment time
> the kernel can validate that the DPA is available and can be mapped by
> the given decoder given the state of other decoders on that device.
> 

Okay, but per below, these are associated with setting the target. The attribute
show does only need to provide the decoder, then userspace can look at the
decoder to find if it's active/DPAs/etc.

> >
> > > > +
> > > > +       return ret;
> > > > +}
> > > > +
> > > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > > +                         size_t len)
> > > > +{
> > > > +       struct device *memdev_dev;
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +
> > > > +       if (len == 1 || cxlr->config.targets[n])
> > > > +               remove_target(cxlr, n);
> > > > +
> > > > +       /* Remove target special case */
> > > > +       if (len == 1) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return len;
> > > > +       }
> > > > +
> > > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > >
> > > I think this wants to be an endpoint decoder, not a memdev. Because
> > > it's the decoder that joins a memdev to a region, or at least a
> > > decoder should be picked when the memdev is assigned so that the DPA
> > > mapping can be registered. If all the decoders are allocated then fail
> > > here.
> > >
> >
> > Per above, I think making this decoders makes sense. I could make it flexible
> > for ease of use, like if you specify memX, the kernel will pick a decoder for
> > you however I suspect you won't like that.
> 
> Right, put the user friendliness in the tooling, not sysfs ABI.
> 

Okay.

> >
> > > > +       if (!memdev_dev) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return -ENOENT;
> > > > +       }
> > > > +
> > > > +       /* reference to memdev held until target is unset or region goes away */
> > > > +
> > > > +       cxlmd = to_cxl_memdev(memdev_dev);
> > > > +       cxlr->config.targets[n] = cxlmd;
> > > > +
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return len;
> > > > +}
> > > > +
> > > > +#define TARGET_ATTR_RW(n)                                                      \
> > > > +       static ssize_t target##n##_show(                                       \
> > > > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > > > +       {                                                                      \
> > > > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > > > +       }                                                                      \
> > > > +       static ssize_t target##n##_store(struct device *dev,                   \
> > > > +                                        struct device_attribute *attr,        \
> > > > +                                        const char *buf, size_t len)          \
> > > > +       {                                                                      \
> > > > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > > > +       }                                                                      \
> > > > +       static DEVICE_ATTR_RW(target##n)
> > > > +
> > > > +TARGET_ATTR_RW(0);
> > > > +TARGET_ATTR_RW(1);
> > > > +TARGET_ATTR_RW(2);
> > > > +TARGET_ATTR_RW(3);
> > > > +TARGET_ATTR_RW(4);
> > > > +TARGET_ATTR_RW(5);
> > > > +TARGET_ATTR_RW(6);
> > > > +TARGET_ATTR_RW(7);
> > > > +TARGET_ATTR_RW(8);
> > > > +TARGET_ATTR_RW(9);
> > > > +TARGET_ATTR_RW(10);
> > > > +TARGET_ATTR_RW(11);
> > > > +TARGET_ATTR_RW(12);
> > > > +TARGET_ATTR_RW(13);
> > > > +TARGET_ATTR_RW(14);
> > > > +TARGET_ATTR_RW(15);
> > > > +
> > > > +static struct attribute *interleave_attrs[] = {
> > > > +       &dev_attr_target0.attr,
> > > > +       &dev_attr_target1.attr,
> > > > +       &dev_attr_target2.attr,
> > > > +       &dev_attr_target3.attr,
> > > > +       &dev_attr_target4.attr,
> > > > +       &dev_attr_target5.attr,
> > > > +       &dev_attr_target6.attr,
> > > > +       &dev_attr_target7.attr,
> > > > +       &dev_attr_target8.attr,
> > > > +       &dev_attr_target9.attr,
> > > > +       &dev_attr_target10.attr,
> > > > +       &dev_attr_target11.attr,
> > > > +       &dev_attr_target12.attr,
> > > > +       &dev_attr_target13.attr,
> > > > +       &dev_attr_target14.attr,
> > > > +       &dev_attr_target15.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > > > +{
> > > > +       struct device *dev = container_of(kobj, struct device, kobj);
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       if (n < cxlr->config.interleave_ways)
> > > > +               return a->mode;
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static const struct attribute_group region_interleave_group = {
> > > > +       .attrs = interleave_attrs,
> > > > +       .is_visible = visible_targets,
> > > > +};
> > > > +
> > > > +static const struct attribute_group *region_groups[] = {
> > > > +       &region_group,
> > > > +       &region_interleave_group,
> > > > +       NULL,
> > > > +};
> > > > +
> > > >  static void cxl_region_release(struct device *dev);
> > > >
> > > >  static const struct device_type cxl_region_type = {
> > > >         .name = "cxl_region",
> > > >         .release = cxl_region_release,
> > > > +       .groups = region_groups
> > > >  };
> > > >
> > > >  static ssize_t create_region_show(struct device *dev,
> > > > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> > > >  {
> > > >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > >         struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int i;
> > > >
> > > >         ida_free(&cxld->region_ida, cxlr->id);
> > > > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > > > +               remove_target(cxlr, i);
> > >
> > > Like the last patch this feels too late. I expect whatever unregisters
> > > the region should have already handled removing the targets.
> > >
> >
> > Would remove() be more appropriate?
> 
> ->remove() does not seem a good fit since it may be the case that
> someone wants do "echo $region >
> /sys/bus/cxl/drivers/cxl_region/unbind; echo $region >
> /sys/bus/cxl/drivers/cxl_region/bind;" without needing to go
> reconfigure the targets. I am suggesting that before
> device_unregister(&cxlr->dev) the targets are released.

Okay.

Why would one want to do this? I acknowledge someone *may* do that. I'd like to
know what value you see there.
Dan Williams Feb. 17, 2022, 9:12 p.m. UTC | #13
On Thu, Feb 17, 2022 at 12:20 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
[..]
> > > > > +static bool is_region_active(struct cxl_region *cxlr)
> > > > > +{
> > > > > +       /* TODO: Regions can't be activated yet. */
> > > > > +       return false;
> > > >
> > > > This function seems redundant with just checking "cxlr->dev.driver !=
> > > > NULL"? The benefit of that is there is no need to carry a TODO in the
> > > > series.
> > > >
> > >
> > > The idea behind this was to give the reviewer somewhat of a bigger picture as to
> > > how things should work in the code rather than in a commit message. I will
> > > remove this.
> >
> > They look premature to me.
> >
>
> Given that you don't want me to reference the DWG, it is. The steps outlined
> with TODOs were all based on the DWG's overall flow.

Right, the DWG is a good bootstrap, but the Linux implementation is
going to go beyond it so might as well couch all the language in terms
of base spec references and Linux documentation and not have this
indirection to a 3rd document.

[..]
> > > > ...to shutdown configuration writes once the region is active. Might
> > > > also need a region-wide seqlock like target_list_show. So that region
> > > > probe drains  all active sysfs writers before assuming the
> > > > configuration is stable.
> > >
> > > Initially my thought here is that this is a problem for userspace to deal with.
> > > If userspace can't figure out how to synchronously configure and bind the
> > > region, that's not a kernel problem.
> >
> > The kernel always needs to protect itself. Userspace is free to race
> > itself, but it can not be allowed to trigger a kernel race. So there
> > needs to be protection against userspace writing interleave_ways and
> > the kernel being able to trust that interleave_ways is now static for
> > the life of the region.
>
> Yeah - originally I was relying on the device_lock for this, but that now
> doesn't work. seqlock is fine. I could also copy all the config information at
> the beginning of probe and simply use that.
>
> If we're going the route of making interleave_ways write-once, why not make all
> attributes the same?

Sure. It could always be relaxed later if there was a convincing need
to modify an existing region without tearing it down first. In fact,
that reconfigure flexibility was a source of bugs in NVDIMM sysfs ABI
that the tooling did not leverage because "ndctl create-namespace
--reconfigure" internally did: read namespace attributes, destroy
namepsace, create new namespace with saved attributes.

[..]
> > > > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > > > +{
> > > > > +       int ret;
> > > > > +
> > > > > +       device_lock(&cxlr->dev);
> > > > > +       if (!cxlr->config.targets[n])
> > > > > +               ret = sysfs_emit(buf, "\n");
> > > > > +       else
> > > > > +               ret = sysfs_emit(buf, "%s\n",
> > > > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > > > +       device_unlock(&cxlr->dev);
> > > >
> > > > The component contribution of a memdev to a region is a DPA-span, not
> > > > the whole memdev. I would expect something like dax_mapping_attributes
> > > > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > > > information about the component contribution of a memdev to a region.
> > > >
> > >
> > > I think show_target should just return the chosen decoder and then the decoder
> > > attributes will tell the rest, wouldn't they?
> >
> > Given the conflicts that can arise between HDM decoders needing to map
> > increasing DPA values and other conflicts that there will be
> > situations where the kernel auto-picking a decoder will get in the
> > way. Exposing the decoder selection to userspace also gives one more
> > place to do leaf validation. I.e. at decoder-to-region assignment time
> > the kernel can validate that the DPA is available and can be mapped by
> > the given decoder given the state of other decoders on that device.
> >
>
> Okay, but per below, these are associated with setting the target. The attribute
> show does only need to provide the decoder, then userspace can look at the
> decoder to find if it's active/DPAs/etc.

Yes.

>
> > >
> > > > > +
> > > > > +       return ret;
> > > > > +}
> > > > > +
> > > > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > > > +                         size_t len)
> > > > > +{
> > > > > +       struct device *memdev_dev;
> > > > > +       struct cxl_memdev *cxlmd;
> > > > > +
> > > > > +       device_lock(&cxlr->dev);
> > > > > +
> > > > > +       if (len == 1 || cxlr->config.targets[n])
> > > > > +               remove_target(cxlr, n);
> > > > > +
> > > > > +       /* Remove target special case */
> > > > > +       if (len == 1) {
> > > > > +               device_unlock(&cxlr->dev);
> > > > > +               return len;
> > > > > +       }
> > > > > +
> > > > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > > >
> > > > I think this wants to be an endpoint decoder, not a memdev. Because
> > > > it's the decoder that joins a memdev to a region, or at least a
> > > > decoder should be picked when the memdev is assigned so that the DPA
> > > > mapping can be registered. If all the decoders are allocated then fail
> > > > here.
> > > >
> > >
> > > Per above, I think making this decoders makes sense. I could make it flexible
> > > for ease of use, like if you specify memX, the kernel will pick a decoder for
> > > you however I suspect you won't like that.
> >
> > Right, put the user friendliness in the tooling, not sysfs ABI.
> >
>
> Okay.
>
> > >
> > > > > +       if (!memdev_dev) {
> > > > > +               device_unlock(&cxlr->dev);
> > > > > +               return -ENOENT;
> > > > > +       }
> > > > > +
> > > > > +       /* reference to memdev held until target is unset or region goes away */
> > > > > +
> > > > > +       cxlmd = to_cxl_memdev(memdev_dev);
> > > > > +       cxlr->config.targets[n] = cxlmd;
> > > > > +
> > > > > +       device_unlock(&cxlr->dev);
> > > > > +
> > > > > +       return len;
> > > > > +}
> > > > > +
> > > > > +#define TARGET_ATTR_RW(n)                                                      \
> > > > > +       static ssize_t target##n##_show(                                       \
> > > > > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > > > > +       {                                                                      \
> > > > > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > > > > +       }                                                                      \
> > > > > +       static ssize_t target##n##_store(struct device *dev,                   \
> > > > > +                                        struct device_attribute *attr,        \
> > > > > +                                        const char *buf, size_t len)          \
> > > > > +       {                                                                      \
> > > > > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > > > > +       }                                                                      \
> > > > > +       static DEVICE_ATTR_RW(target##n)
> > > > > +
> > > > > +TARGET_ATTR_RW(0);
> > > > > +TARGET_ATTR_RW(1);
> > > > > +TARGET_ATTR_RW(2);
> > > > > +TARGET_ATTR_RW(3);
> > > > > +TARGET_ATTR_RW(4);
> > > > > +TARGET_ATTR_RW(5);
> > > > > +TARGET_ATTR_RW(6);
> > > > > +TARGET_ATTR_RW(7);
> > > > > +TARGET_ATTR_RW(8);
> > > > > +TARGET_ATTR_RW(9);
> > > > > +TARGET_ATTR_RW(10);
> > > > > +TARGET_ATTR_RW(11);
> > > > > +TARGET_ATTR_RW(12);
> > > > > +TARGET_ATTR_RW(13);
> > > > > +TARGET_ATTR_RW(14);
> > > > > +TARGET_ATTR_RW(15);
> > > > > +
> > > > > +static struct attribute *interleave_attrs[] = {
> > > > > +       &dev_attr_target0.attr,
> > > > > +       &dev_attr_target1.attr,
> > > > > +       &dev_attr_target2.attr,
> > > > > +       &dev_attr_target3.attr,
> > > > > +       &dev_attr_target4.attr,
> > > > > +       &dev_attr_target5.attr,
> > > > > +       &dev_attr_target6.attr,
> > > > > +       &dev_attr_target7.attr,
> > > > > +       &dev_attr_target8.attr,
> > > > > +       &dev_attr_target9.attr,
> > > > > +       &dev_attr_target10.attr,
> > > > > +       &dev_attr_target11.attr,
> > > > > +       &dev_attr_target12.attr,
> > > > > +       &dev_attr_target13.attr,
> > > > > +       &dev_attr_target14.attr,
> > > > > +       &dev_attr_target15.attr,
> > > > > +       NULL,
> > > > > +};
> > > > > +
> > > > > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > > > > +{
> > > > > +       struct device *dev = container_of(kobj, struct device, kobj);
> > > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > > +
> > > > > +       if (n < cxlr->config.interleave_ways)
> > > > > +               return a->mode;
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > > +static const struct attribute_group region_interleave_group = {
> > > > > +       .attrs = interleave_attrs,
> > > > > +       .is_visible = visible_targets,
> > > > > +};
> > > > > +
> > > > > +static const struct attribute_group *region_groups[] = {
> > > > > +       &region_group,
> > > > > +       &region_interleave_group,
> > > > > +       NULL,
> > > > > +};
> > > > > +
> > > > >  static void cxl_region_release(struct device *dev);
> > > > >
> > > > >  static const struct device_type cxl_region_type = {
> > > > >         .name = "cxl_region",
> > > > >         .release = cxl_region_release,
> > > > > +       .groups = region_groups
> > > > >  };
> > > > >
> > > > >  static ssize_t create_region_show(struct device *dev,
> > > > > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> > > > >  {
> > > > >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > > >         struct cxl_region *cxlr = to_cxl_region(dev);
> > > > > +       int i;
> > > > >
> > > > >         ida_free(&cxld->region_ida, cxlr->id);
> > > > > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > > > > +               remove_target(cxlr, i);
> > > >
> > > > Like the last patch this feels too late. I expect whatever unregisters
> > > > the region should have already handled removing the targets.
> > > >
> > >
> > > Would remove() be more appropriate?
> >
> > ->remove() does not seem a good fit since it may be the case that
> > someone wants do "echo $region >
> > /sys/bus/cxl/drivers/cxl_region/unbind; echo $region >
> > /sys/bus/cxl/drivers/cxl_region/bind;" without needing to go
> > reconfigure the targets. I am suggesting that before
> > device_unregister(&cxlr->dev) the targets are released.
>
> Okay.
>
> Why would one want to do this? I acknowledge someone *may* do that. I'd like to
> know what value you see there.

There are several debug and error handling scenarios that say "quiesce
CXL.mem". Seems reasonable to map those to "cxl disable-region", and
seems unreasonable that "cxl disable-region" requires "cxl
create-region" to get back to operational state.
Ben Widawsky Feb. 23, 2022, 9:49 p.m. UTC | #14
On 22-02-17 11:57:59, Dan Williams wrote:
> On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Consolidating earlier discussions...
> >
> > On 22-01-28 16:25:34, Dan Williams wrote:
> > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > The region creation APIs create a vacant region. Configuring the region
> > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > will be provided to allow userspace to configure the region.  Finally
> > > > once all configuration is complete, userspace may activate the region.
> > > >
> > > > Introduced here are the most basic attributes needed to configure a
> > > > region. Details of these attribute are described in the ABI
> > >
> > > s/attribute/attributes/
> > >
> > > > Documentation. Sanity checking of configuration parameters are done at
> > > > region binding time. This consolidates all such logic in one place,
> > > > rather than being strewn across multiple places.
> > >
> > > I think that's too late for some of the validation. The complex
> > > validation that the region driver does throughout the topology is
> > > different from the basic input validation that can  be done at the
> > > sysfs write time. For example ,this patch allows negative
> > > interleave_granularity values to specified, just return -EINVAL. I
> > > agree that sysfs should not validate everything, I disagree with
> > > pushing all validation to cxl_region_probe().
> > >
> >
> > Okay. It might save us some back and forth if you could outline everything you'd
> > expect to be validated, but I can also make an attempt to figure out the
> > reasonable set of things.
> 
> Input validation. Every value that gets written to a sysfs attribute
> should be checked for validity, more below:
> 
> >
> > > >
> > > > A example is provided below:
> > > >
> > > > /sys/bus/cxl/devices/region0.0:0
> > > > ├── interleave_granularity
> 
> ...validate granularity is within spec and can be supported by the root decoder.
> 
> > > > ├── interleave_ways
> 
> ...validate ways is within spec and can be supported by the root decoder.

I'm not sure how to do this one. Validation requires device positions and we
can't set the targets until ways is set. Can you please provide some more
insight on what you'd like me to check in addition to the value being within
spec?

> 
> > > > ├── offset
> > > > ├── size
> 
> ...try to reserve decoder capacity to validate that there is available space.
> 
> > > > ├── subsystem -> ../../../../../../bus/cxl
> > > > ├── target0
> 
> ...validate that the target maps to the decoder.
> 
> > > > ├── uevent
> > > > └── uuid
> 
> ...validate that the uuid is unique relative to other regions.
> 
> > >
> > > As mentioned off-list, it looks like devtype and modalias are missing.
> > >
> >
> > Yep. This belongs in the previous patch though.
> >
> > > >
> > > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > > >  2 files changed, 340 insertions(+)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > index dcc728458936..50ba5018014d 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > @@ -187,3 +187,43 @@ Description:
> > > >                 region driver before being deleted. The attributes expects a
> > > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > > >                 by reading create_region, will also be released.
> > > > +
> > > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> > >
> > > This is just another 'resource' attribute for the physical base
> > > address of the region, right? 'offset' sounds like something that
> > > would be relative instead of absolute.
> > >
> >
> > It was meant to be relative. I can make it absolute if that's preferable but the
> > physical base is known at the decoder level already.
> 
> Yes, but it saves userspace a step to get the absolute value here and
> matches what happens in PCI sysfs where the user is not required to
> look up the bridge base to calculate the absolute value.
> 
> >
> > > > +Date:          August, 2021
> > >
> > > Same date update comment here.
> > >
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RO) A region resides within an address space that is claimed by
> > > > +               a decoder.
> > >
> > > "A region is a contiguous partition of a CXL Root decoder address space."
> > >
> > > >                  Region space allocation is handled by the driver, but
> > >
> > > "Region capacity is allocated by writing to the size attribute, the
> > > resulting physical address base determined by the driver is reflected
> > > here."
> > >
> > > > +               the offset may be read by userspace tooling in order to
> > > > +               determine fragmentation, and available size for new regions.
> > >
> > > I would also expect, before / along with these new region attributes,
> > > there would be 'available' and 'max_extent_available' at the decoder
> > > level to indicate how much free space the decoder has and how big the
> > > next region creation can be. User tooling can walk  the decoder and
> > > the regions together to determine fragmentation if necessary, but for
> > > the most part the tool likely only cares about "how big can the next
> > > region be?" and "how full is this decoder?".
> > >
> >
> > Since this is the configuration part of the ABI, I'd rather add that information
> > when the plumbing to report them exists. I'm struggling to understand the
> > balance (as mentioned also earlier in this mail thread) as to what userspace
> > does and what the kernel does. I will add these as you request.
> 
> Userspace asks by DPA size and SPA size and the kernel validates /
> performs the allocation on its behalf.
> 
> > > > +
> > > > +What:
> > > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > > +Date:          August, 2021
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > > +               order for the subsequent bind operation to succeed. The
> > > > +               following parameters are defined:
> > >
> > > Let's split up the descriptions into individual sections. That can
> > > also document the order that attributes must be written. For example,
> > > doesn't size need to be set before targets are added so that targets
> > > can be validated whether they have sufficient capacity?
> > >
> >
> > Okay. Since we're moving toward making the sysfs ABI stateful,
> 
> sysfs is always stateful. Stateless would be an ioctl.
> 
> > would you like me
> > to make the attrs only visible when they can actually be set?
> 
> No, that's a bit too much magic, and it would be racy.
> 
> >
> > > > +
> > > > +               ==      ========================================================
> > > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > > +                       each device in the interleave set will claim. The
> > > > +                       possible interleave granularity values are determined by
> > > > +                       the CXL spec and the participating devices.
> > > > +               interleave_ways Mandatory. Number of devices participating in the
> > > > +                       region. Each device will provide 1/interleave of storage
> > > > +                       for the region.
> > > > +               size    Manadatory. Phsyical address space the region will
> > > > +                       consume.
> > >
> > > s/Phsyical/Physical/
> > >
> > > > +               target  Mandatory. Memory devices are the backing storage for a
> > > > +                       region. There will be N targets based on the number of
> > > > +                       interleave ways that the top level decoder is configured
> > > > +                       for.
> > >
> > > That doesn't sound right, IW at the root != IW at the endpoint level
> > > and the region needs to record all the endpoint level targets.
> >
> > Correct.
> >
> > >
> > > > Each target must be set with a memdev device ie.
> > > > +                       'mem1'. This attribute only becomes available after
> > > > +                       setting the 'interleave' attribute.
> > > > +               uuid    Optional. A unique identifier for the region. If none is
> > > > +                       selected, the kernel will create one.
> > >
> > > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > > understanding what this is trying to document. For example 'uuid' is
> > > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > > optional.
> >
> > Okay.
> >
> > >
> > > > +               ==      ========================================================
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > index 1a448543db0d..3b48e0469fc7 100644
> > > > --- a/drivers/cxl/core/region.c
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -3,9 +3,12 @@
> > > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > > >  #include <linux/device.h>
> > > >  #include <linux/module.h>
> > > > +#include <linux/sizes.h>
> > > >  #include <linux/slab.h>
> > > > +#include <linux/uuid.h>
> > > >  #include <linux/idr.h>
> > > >  #include <region.h>
> > > > +#include <cxlmem.h>
> > > >  #include <cxl.h>
> > > >  #include "core.h"
> > > >
> > > > @@ -18,11 +21,305 @@
> > > >   * (programming the hardware) is handled by a separate region driver.
> > > >   */
> > > >
> > > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > > +static const struct attribute_group region_interleave_group;
> > > > +
> > > > +static bool is_region_active(struct cxl_region *cxlr)
> > > > +{
> > > > +       /* TODO: Regions can't be activated yet. */
> > > > +       return false;
> > >
> > > This function seems redundant with just checking "cxlr->dev.driver !=
> > > NULL"? The benefit of that is there is no need to carry a TODO in the
> > > series.
> > >
> >
> > The idea behind this was to give the reviewer somewhat of a bigger picture as to
> > how things should work in the code rather than in a commit message. I will
> > remove this.
> 
> They look premature to me.
> 
> >
> > > > +}
> > > > +
> > > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > > +{
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       cxlmd = cxlr->config.targets[target];
> > > > +       if (cxlmd)
> > > > +               put_device(&cxlmd->dev);
> > >
> > > A memdev can be a member of multiple regions at once, shouldn't this
> > > be an endpoint decoder or similar, not the entire memdev?
> > >
> > > Also, if memdevs autoremove themselves from regions at memdev
> > > ->remove() time then I don't think the region needs to hold references
> > > on memdevs.
> > >
> >
> > This needs some work. The concern I have is region operations will need to
> > operate on memdevs/decoders at various points in time. When the memdev goes
> > away, the region will also need to go away. None of that plumbing was in place
> > in v3 and the reference on the memdev was just a half-hearted attempt at doing
> > the right thing.
> >
> > For now if you prefer I remove the reference, but perhaps the decoder reference
> > would buy us some safety?
> 
> So, I don't want to merge an interim solution. I think this series
> needs to prove out the end to end final ABI with all the lifetime
> issues worked out before committing to it upstream because lifetime
> issues get much harder to fix when they also need to conform to a
> legacy ABI.
> 
> >
> > > > +       cxlr->config.targets[target] = NULL;
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_show(struct device *dev,
> > > > +                                   struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_store(struct device *dev,
> > > > +                                    struct device_attribute *attr,
> > > > +                                    const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int ret, prev_iw;
> > > > +       int val;
> > >
> > > I would expect:
> > >
> > > if (dev->driver)
> > >    return -EBUSY;
> > >
> > > ...to shutdown configuration writes once the region is active. Might
> > > also need a region-wide seqlock like target_list_show. So that region
> > > probe drains  all active sysfs writers before assuming the
> > > configuration is stable.
> >
> > Initially my thought here is that this is a problem for userspace to deal with.
> > If userspace can't figure out how to synchronously configure and bind the
> > region, that's not a kernel problem.
> 
> The kernel always needs to protect itself. Userspace is free to race
> itself, but it can not be allowed to trigger a kernel race. So there
> needs to be protection against userspace writing interleave_ways and
> the kernel being able to trust that interleave_ways is now static for
> the life of the region.
> 
> > However, we've put some effort into
> > protecting userspace from itself in the create ABI, so it might be more in line
> > to do that here.
> 
> That safety was about preventing userspace from leaking kernel memory,
> not about protecting userspace from itself. It's still the case that
> userspace racing itself will get horribly confused when it collides
> region creation, but the kernel protects itself by resolving the race.
> 
> > In summary, I'm fine to add it, but I think I really need to get more in your
> > brain about the userspace/kernel divide sooner rather than later.
> 
> Don't let userspace break the kernel, that's it.
> 
> >
> > >
> > > > +
> > > > +       prev_iw = cxlr->config.interleave_ways;
> > > > +       ret = kstrtoint(buf, 0, &val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > > +               return -EINVAL;
> > > > +
> > > > +       cxlr->config.interleave_ways = val;
> > > > +
> > > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > > +       if (ret < 0)
> > > > +               goto err;
> > > > +
> > > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> > >
> > > Why?
> > >
> >
> > copypasta
> >
> > > > +
> > > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > > +               remove_target(cxlr, --prev_iw);
> > >
> > > To make the kernel side simpler this attribute could just require that
> > > setting interleave ways is a one way street, if you want to change it
> > > you need to delete the region and start over.
> > >
> >
> > Okay. One of the earlier versions did this implicitly since the #ways was needed
> > to create the region. I thought from the ABI perspective, flexibility was good.
> > Userspace may choose not to utilize it.
> 
> More flexibility == more maintenance burden. If it's not strictly
> necessary, don't expose it, so making this read-only seems simpler to
> me.
> 
> [..]
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               cxlr->config.size = val;
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > I think lockdep will complain about device_lock() usage in an
> > > attribute. Try changing this to cxl_device_lock() with
> > > CONFIG_PROVE_CXL_LOCKING=y.
> > >
> >
> > I might have messed it up, but I didn't seem to run into an issue. With the
> > driver bound check though, it can go away.
> >
> > I think it would be really good to add this kind of detail to sysfs.rst. Quick
> > grep finds me arm64/kernel/mte and the nfit driver taking the device lock in an
> > attr.
> 
> Yeah, CONFIG_PROVE_{NVDIMM,CXL}_LOCKING needs to annotate the
> driver-core as well. I'm concerned there's a class of deadlocks that
> lockdep just can't see.
> 
> >
> >
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(size);
> > > > +
> > > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > > +                        char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > > +}
> > > > +
> > > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > > +                         const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       ssize_t rc;
> > > > +
> > > > +       if (len != UUID_STRING_LEN + 1)
> > > > +               return -EINVAL;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(uuid);
> > > > +
> > > > +static struct attribute *region_attrs[] = {
> > > > +       &dev_attr_interleave_ways.attr,
> > > > +       &dev_attr_interleave_granularity.attr,
> > > > +       &dev_attr_offset.attr,
> > > > +       &dev_attr_size.attr,
> > > > +       &dev_attr_uuid.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static const struct attribute_group region_group = {
> > > > +       .attrs = region_attrs,
> > > > +};
> > > > +
> > > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > > +{
> > > > +       int ret;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (!cxlr->config.targets[n])
> > > > +               ret = sysfs_emit(buf, "\n");
> > > > +       else
> > > > +               ret = sysfs_emit(buf, "%s\n",
> > > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > The component contribution of a memdev to a region is a DPA-span, not
> > > the whole memdev. I would expect something like dax_mapping_attributes
> > > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > > information about the component contribution of a memdev to a region.
> > >
> >
> > I think show_target should just return the chosen decoder and then the decoder
> > attributes will tell the rest, wouldn't they?
> 
> Given the conflicts that can arise between HDM decoders needing to map
> increasing DPA values and other conflicts that there will be
> situations where the kernel auto-picking a decoder will get in the
> way. Exposing the decoder selection to userspace also gives one more
> place to do leaf validation. I.e. at decoder-to-region assignment time
> the kernel can validate that the DPA is available and can be mapped by
> the given decoder given the state of other decoders on that device.
> 
> >
> > > > +
> > > > +       return ret;
> > > > +}
> > > > +
> > > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > > +                         size_t len)
> > > > +{
> > > > +       struct device *memdev_dev;
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +
> > > > +       if (len == 1 || cxlr->config.targets[n])
> > > > +               remove_target(cxlr, n);
> > > > +
> > > > +       /* Remove target special case */
> > > > +       if (len == 1) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return len;
> > > > +       }
> > > > +
> > > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > >
> > > I think this wants to be an endpoint decoder, not a memdev. Because
> > > it's the decoder that joins a memdev to a region, or at least a
> > > decoder should be picked when the memdev is assigned so that the DPA
> > > mapping can be registered. If all the decoders are allocated then fail
> > > here.
> > >
> >
> > Per above, I think making this decoders makes sense. I could make it flexible
> > for ease of use, like if you specify memX, the kernel will pick a decoder for
> > you however I suspect you won't like that.
> 
> Right, put the user friendliness in the tooling, not sysfs ABI.
> 
> >
> > > > +       if (!memdev_dev) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return -ENOENT;
> > > > +       }
> > > > +
> > > > +       /* reference to memdev held until target is unset or region goes away */
> > > > +
> > > > +       cxlmd = to_cxl_memdev(memdev_dev);
> > > > +       cxlr->config.targets[n] = cxlmd;
> > > > +
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return len;
> > > > +}
> > > > +
> > > > +#define TARGET_ATTR_RW(n)                                                      \
> > > > +       static ssize_t target##n##_show(                                       \
> > > > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > > > +       {                                                                      \
> > > > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > > > +       }                                                                      \
> > > > +       static ssize_t target##n##_store(struct device *dev,                   \
> > > > +                                        struct device_attribute *attr,        \
> > > > +                                        const char *buf, size_t len)          \
> > > > +       {                                                                      \
> > > > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > > > +       }                                                                      \
> > > > +       static DEVICE_ATTR_RW(target##n)
> > > > +
> > > > +TARGET_ATTR_RW(0);
> > > > +TARGET_ATTR_RW(1);
> > > > +TARGET_ATTR_RW(2);
> > > > +TARGET_ATTR_RW(3);
> > > > +TARGET_ATTR_RW(4);
> > > > +TARGET_ATTR_RW(5);
> > > > +TARGET_ATTR_RW(6);
> > > > +TARGET_ATTR_RW(7);
> > > > +TARGET_ATTR_RW(8);
> > > > +TARGET_ATTR_RW(9);
> > > > +TARGET_ATTR_RW(10);
> > > > +TARGET_ATTR_RW(11);
> > > > +TARGET_ATTR_RW(12);
> > > > +TARGET_ATTR_RW(13);
> > > > +TARGET_ATTR_RW(14);
> > > > +TARGET_ATTR_RW(15);
> > > > +
> > > > +static struct attribute *interleave_attrs[] = {
> > > > +       &dev_attr_target0.attr,
> > > > +       &dev_attr_target1.attr,
> > > > +       &dev_attr_target2.attr,
> > > > +       &dev_attr_target3.attr,
> > > > +       &dev_attr_target4.attr,
> > > > +       &dev_attr_target5.attr,
> > > > +       &dev_attr_target6.attr,
> > > > +       &dev_attr_target7.attr,
> > > > +       &dev_attr_target8.attr,
> > > > +       &dev_attr_target9.attr,
> > > > +       &dev_attr_target10.attr,
> > > > +       &dev_attr_target11.attr,
> > > > +       &dev_attr_target12.attr,
> > > > +       &dev_attr_target13.attr,
> > > > +       &dev_attr_target14.attr,
> > > > +       &dev_attr_target15.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > > > +{
> > > > +       struct device *dev = container_of(kobj, struct device, kobj);
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       if (n < cxlr->config.interleave_ways)
> > > > +               return a->mode;
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static const struct attribute_group region_interleave_group = {
> > > > +       .attrs = interleave_attrs,
> > > > +       .is_visible = visible_targets,
> > > > +};
> > > > +
> > > > +static const struct attribute_group *region_groups[] = {
> > > > +       &region_group,
> > > > +       &region_interleave_group,
> > > > +       NULL,
> > > > +};
> > > > +
> > > >  static void cxl_region_release(struct device *dev);
> > > >
> > > >  static const struct device_type cxl_region_type = {
> > > >         .name = "cxl_region",
> > > >         .release = cxl_region_release,
> > > > +       .groups = region_groups
> > > >  };
> > > >
> > > >  static ssize_t create_region_show(struct device *dev,
> > > > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> > > >  {
> > > >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > >         struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int i;
> > > >
> > > >         ida_free(&cxld->region_ida, cxlr->id);
> > > > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > > > +               remove_target(cxlr, i);
> > >
> > > Like the last patch this feels too late. I expect whatever unregisters
> > > the region should have already handled removing the targets.
> > >
> >
> > Would remove() be more appropriate?
> 
> ->remove() does not seem a good fit since it may be the case that
> someone wants do "echo $region >
> /sys/bus/cxl/drivers/cxl_region/unbind; echo $region >
> /sys/bus/cxl/drivers/cxl_region/bind;" without needing to go
> reconfigure the targets. I am suggesting that before
> device_unregister(&cxlr->dev) the targets are released.
Dan Williams Feb. 23, 2022, 10:24 p.m. UTC | #15
On Wed, Feb 23, 2022 at 1:50 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-17 11:57:59, Dan Williams wrote:
> > On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > Consolidating earlier discussions...
> > >
> > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > >
> > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > once all configuration is complete, userspace may activate the region.
> > > > >
> > > > > Introduced here are the most basic attributes needed to configure a
> > > > > region. Details of these attribute are described in the ABI
> > > >
> > > > s/attribute/attributes/
> > > >
> > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > region binding time. This consolidates all such logic in one place,
> > > > > rather than being strewn across multiple places.
> > > >
> > > > I think that's too late for some of the validation. The complex
> > > > validation that the region driver does throughout the topology is
> > > > different from the basic input validation that can  be done at the
> > > > sysfs write time. For example ,this patch allows negative
> > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > agree that sysfs should not validate everything, I disagree with
> > > > pushing all validation to cxl_region_probe().
> > > >
> > >
> > > Okay. It might save us some back and forth if you could outline everything you'd
> > > expect to be validated, but I can also make an attempt to figure out the
> > > reasonable set of things.
> >
> > Input validation. Every value that gets written to a sysfs attribute
> > should be checked for validity, more below:
> >
> > >
> > > > >
> > > > > A example is provided below:
> > > > >
> > > > > /sys/bus/cxl/devices/region0.0:0
> > > > > ├── interleave_granularity
> >
> > ...validate granularity is within spec and can be supported by the root decoder.
> >
> > > > > ├── interleave_ways
> >
> > ...validate ways is within spec and can be supported by the root decoder.
>
> I'm not sure how to do this one. Validation requires device positions and we
> can't set the targets until ways is set. Can you please provide some more
> insight on what you'd like me to check in addition to the value being within
> spec?

For example you could check that interleave_ways is >= to the root
level interleave. I.e. it would be invalid to attempt a x1 interleave
on a decoder that is x2 interleaved at the host-bridge level.
Ben Widawsky Feb. 23, 2022, 10:31 p.m. UTC | #16
On 22-02-23 14:24:00, Dan Williams wrote:
> On Wed, Feb 23, 2022 at 1:50 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > On 22-02-17 11:57:59, Dan Williams wrote:
> > > On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > Consolidating earlier discussions...
> > > >
> > > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > > >
> > > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > > once all configuration is complete, userspace may activate the region.
> > > > > >
> > > > > > Introduced here are the most basic attributes needed to configure a
> > > > > > region. Details of these attribute are described in the ABI
> > > > >
> > > > > s/attribute/attributes/
> > > > >
> > > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > > region binding time. This consolidates all such logic in one place,
> > > > > > rather than being strewn across multiple places.
> > > > >
> > > > > I think that's too late for some of the validation. The complex
> > > > > validation that the region driver does throughout the topology is
> > > > > different from the basic input validation that can  be done at the
> > > > > sysfs write time. For example ,this patch allows negative
> > > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > > agree that sysfs should not validate everything, I disagree with
> > > > > pushing all validation to cxl_region_probe().
> > > > >
> > > >
> > > > Okay. It might save us some back and forth if you could outline everything you'd
> > > > expect to be validated, but I can also make an attempt to figure out the
> > > > reasonable set of things.
> > >
> > > Input validation. Every value that gets written to a sysfs attribute
> > > should be checked for validity, more below:
> > >
> > > >
> > > > > >
> > > > > > A example is provided below:
> > > > > >
> > > > > > /sys/bus/cxl/devices/region0.0:0
> > > > > > ├── interleave_granularity
> > >
> > > ...validate granularity is within spec and can be supported by the root decoder.
> > >
> > > > > > ├── interleave_ways
> > >
> > > ...validate ways is within spec and can be supported by the root decoder.
> >
> > I'm not sure how to do this one. Validation requires device positions and we
> > can't set the targets until ways is set. Can you please provide some more
> > insight on what you'd like me to check in addition to the value being within
> > spec?
> 
> For example you could check that interleave_ways is >= to the root
> level interleave. I.e. it would be invalid to attempt a x1 interleave
> on a decoder that is x2 interleaved at the host-bridge level.

I tried to convince myself that that assertion always holds and didn't feel
super comfortable. If you do, I can add those kinds of checks.

Thanks.
Ben
Dan Williams Feb. 23, 2022, 10:42 p.m. UTC | #17
On Wed, Feb 23, 2022 at 2:31 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-23 14:24:00, Dan Williams wrote:
> > On Wed, Feb 23, 2022 at 1:50 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > On 22-02-17 11:57:59, Dan Williams wrote:
> > > > On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > >
> > > > > Consolidating earlier discussions...
> > > > >
> > > > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > > > >
> > > > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > > > once all configuration is complete, userspace may activate the region.
> > > > > > >
> > > > > > > Introduced here are the most basic attributes needed to configure a
> > > > > > > region. Details of these attribute are described in the ABI
> > > > > >
> > > > > > s/attribute/attributes/
> > > > > >
> > > > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > > > region binding time. This consolidates all such logic in one place,
> > > > > > > rather than being strewn across multiple places.
> > > > > >
> > > > > > I think that's too late for some of the validation. The complex
> > > > > > validation that the region driver does throughout the topology is
> > > > > > different from the basic input validation that can  be done at the
> > > > > > sysfs write time. For example ,this patch allows negative
> > > > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > > > agree that sysfs should not validate everything, I disagree with
> > > > > > pushing all validation to cxl_region_probe().
> > > > > >
> > > > >
> > > > > Okay. It might save us some back and forth if you could outline everything you'd
> > > > > expect to be validated, but I can also make an attempt to figure out the
> > > > > reasonable set of things.
> > > >
> > > > Input validation. Every value that gets written to a sysfs attribute
> > > > should be checked for validity, more below:
> > > >
> > > > >
> > > > > > >
> > > > > > > A example is provided below:
> > > > > > >
> > > > > > > /sys/bus/cxl/devices/region0.0:0
> > > > > > > ├── interleave_granularity
> > > >
> > > > ...validate granularity is within spec and can be supported by the root decoder.
> > > >
> > > > > > > ├── interleave_ways
> > > >
> > > > ...validate ways is within spec and can be supported by the root decoder.
> > >
> > > I'm not sure how to do this one. Validation requires device positions and we
> > > can't set the targets until ways is set. Can you please provide some more
> > > insight on what you'd like me to check in addition to the value being within
> > > spec?
> >
> > For example you could check that interleave_ways is >= to the root
> > level interleave. I.e. it would be invalid to attempt a x1 interleave
> > on a decoder that is x2 interleaved at the host-bridge level.
>
> I tried to convince myself that that assertion always holds and didn't feel
> super comfortable. If you do, I can add those kinds of checks.

The only way to support a x1 region on an x2 interleave is to have the
size be equal to interleave granularity so that accesses stay
contained to that one device.

In fact that's another validation step, which you might already have,
region size must be >= and aligned to interleave_granularity *
interleave_ways.
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index dcc728458936..50ba5018014d 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -187,3 +187,43 @@  Description:
 		region driver before being deleted. The attributes expects a
 		region in the form "regionX.Y:Z". The region's name, allocated
 		by reading create_region, will also be released.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
+Date:		August, 2021
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) A region resides within an address space that is claimed by
+		a decoder. Region space allocation is handled by the driver, but
+		the offset may be read by userspace tooling in order to
+		determine fragmentation, and available size for new regions.
+
+What:
+/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
+Date:		August, 2021
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RW) Configuring regions requires a minimal set of parameters in
+		order for the subsequent bind operation to succeed. The
+		following parameters are defined:
+
+		==	========================================================
+		interleave_granularity Mandatory. Number of consecutive bytes
+			each device in the interleave set will claim. The
+			possible interleave granularity values are determined by
+			the CXL spec and the participating devices.
+		interleave_ways Mandatory. Number of devices participating in the
+			region. Each device will provide 1/interleave of storage
+			for the region.
+		size	Manadatory. Phsyical address space the region will
+			consume.
+		target  Mandatory. Memory devices are the backing storage for a
+			region. There will be N targets based on the number of
+			interleave ways that the top level decoder is configured
+			for. Each target must be set with a memdev device ie.
+			'mem1'. This attribute only becomes available after
+			setting the 'interleave' attribute.
+		uuid	Optional. A unique identifier for the region. If none is
+			selected, the kernel will create one.
+		==	========================================================
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 1a448543db0d..3b48e0469fc7 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3,9 +3,12 @@ 
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/sizes.h>
 #include <linux/slab.h>
+#include <linux/uuid.h>
 #include <linux/idr.h>
 #include <region.h>
+#include <cxlmem.h>
 #include <cxl.h>
 #include "core.h"
 
@@ -18,11 +21,305 @@ 
  * (programming the hardware) is handled by a separate region driver.
  */
 
+struct cxl_region *to_cxl_region(struct device *dev);
+static const struct attribute_group region_interleave_group;
+
+static bool is_region_active(struct cxl_region *cxlr)
+{
+	/* TODO: Regions can't be activated yet. */
+	return false;
+}
+
+static void remove_target(struct cxl_region *cxlr, int target)
+{
+	struct cxl_memdev *cxlmd;
+
+	cxlmd = cxlr->config.targets[target];
+	if (cxlmd)
+		put_device(&cxlmd->dev);
+	cxlr->config.targets[target] = NULL;
+}
+
+static ssize_t interleave_ways_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
+}
+
+static ssize_t interleave_ways_store(struct device *dev,
+				     struct device_attribute *attr,
+				     const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	int ret, prev_iw;
+	int val;
+
+	prev_iw = cxlr->config.interleave_ways;
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+	if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
+		return -EINVAL;
+
+	cxlr->config.interleave_ways = val;
+
+	ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
+	if (ret < 0)
+		goto err;
+
+	sysfs_notify(&dev->kobj, NULL, "target_interleave");
+
+	while (prev_iw > cxlr->config.interleave_ways)
+		remove_target(cxlr, --prev_iw);
+
+	return len;
+
+err:
+	cxlr->config.interleave_ways = prev_iw;
+	return ret;
+}
+static DEVICE_ATTR_RW(interleave_ways);
+
+static ssize_t interleave_granularity_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
+}
+
+static ssize_t interleave_granularity_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+	cxlr->config.interleave_granularity = val;
+
+	return len;
+}
+static DEVICE_ATTR_RW(interleave_granularity);
+
+static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
+			   char *buf)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	resource_size_t offset;
+
+	if (!cxlr->res)
+		return sysfs_emit(buf, "\n");
+
+	offset = cxld->platform_res.start - cxlr->res->start;
+
+	return sysfs_emit(buf, "%pa\n", &offset);
+}
+static DEVICE_ATTR_RO(offset);
+
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%llu\n", cxlr->config.size);
+}
+
+static ssize_t size_store(struct device *dev, struct device_attribute *attr,
+			  const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	unsigned long long val;
+	ssize_t rc;
+
+	rc = kstrtoull(buf, 0, &val);
+	if (rc)
+		return rc;
+
+	device_lock(&cxlr->dev);
+	if (is_region_active(cxlr))
+		rc = -EBUSY;
+	else
+		cxlr->config.size = val;
+	device_unlock(&cxlr->dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(size);
+
+static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
+}
+
+static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
+			  const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	ssize_t rc;
+
+	if (len != UUID_STRING_LEN + 1)
+		return -EINVAL;
+
+	device_lock(&cxlr->dev);
+	if (is_region_active(cxlr))
+		rc = -EBUSY;
+	else
+		rc = uuid_parse(buf, &cxlr->config.uuid);
+	device_unlock(&cxlr->dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static struct attribute *region_attrs[] = {
+	&dev_attr_interleave_ways.attr,
+	&dev_attr_interleave_granularity.attr,
+	&dev_attr_offset.attr,
+	&dev_attr_size.attr,
+	&dev_attr_uuid.attr,
+	NULL,
+};
+
+static const struct attribute_group region_group = {
+	.attrs = region_attrs,
+};
+
+static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
+{
+	int ret;
+
+	device_lock(&cxlr->dev);
+	if (!cxlr->config.targets[n])
+		ret = sysfs_emit(buf, "\n");
+	else
+		ret = sysfs_emit(buf, "%s\n",
+				 dev_name(&cxlr->config.targets[n]->dev));
+	device_unlock(&cxlr->dev);
+
+	return ret;
+}
+
+static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
+			  size_t len)
+{
+	struct device *memdev_dev;
+	struct cxl_memdev *cxlmd;
+
+	device_lock(&cxlr->dev);
+
+	if (len == 1 || cxlr->config.targets[n])
+		remove_target(cxlr, n);
+
+	/* Remove target special case */
+	if (len == 1) {
+		device_unlock(&cxlr->dev);
+		return len;
+	}
+
+	memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
+	if (!memdev_dev) {
+		device_unlock(&cxlr->dev);
+		return -ENOENT;
+	}
+
+	/* reference to memdev held until target is unset or region goes away */
+
+	cxlmd = to_cxl_memdev(memdev_dev);
+	cxlr->config.targets[n] = cxlmd;
+
+	device_unlock(&cxlr->dev);
+
+	return len;
+}
+
+#define TARGET_ATTR_RW(n)                                                      \
+	static ssize_t target##n##_show(                                       \
+		struct device *dev, struct device_attribute *attr, char *buf)  \
+	{                                                                      \
+		return show_targetN(to_cxl_region(dev), buf, (n));             \
+	}                                                                      \
+	static ssize_t target##n##_store(struct device *dev,                   \
+					 struct device_attribute *attr,        \
+					 const char *buf, size_t len)          \
+	{                                                                      \
+		return set_targetN(to_cxl_region(dev), buf, (n), len);         \
+	}                                                                      \
+	static DEVICE_ATTR_RW(target##n)
+
+TARGET_ATTR_RW(0);
+TARGET_ATTR_RW(1);
+TARGET_ATTR_RW(2);
+TARGET_ATTR_RW(3);
+TARGET_ATTR_RW(4);
+TARGET_ATTR_RW(5);
+TARGET_ATTR_RW(6);
+TARGET_ATTR_RW(7);
+TARGET_ATTR_RW(8);
+TARGET_ATTR_RW(9);
+TARGET_ATTR_RW(10);
+TARGET_ATTR_RW(11);
+TARGET_ATTR_RW(12);
+TARGET_ATTR_RW(13);
+TARGET_ATTR_RW(14);
+TARGET_ATTR_RW(15);
+
+static struct attribute *interleave_attrs[] = {
+	&dev_attr_target0.attr,
+	&dev_attr_target1.attr,
+	&dev_attr_target2.attr,
+	&dev_attr_target3.attr,
+	&dev_attr_target4.attr,
+	&dev_attr_target5.attr,
+	&dev_attr_target6.attr,
+	&dev_attr_target7.attr,
+	&dev_attr_target8.attr,
+	&dev_attr_target9.attr,
+	&dev_attr_target10.attr,
+	&dev_attr_target11.attr,
+	&dev_attr_target12.attr,
+	&dev_attr_target13.attr,
+	&dev_attr_target14.attr,
+	&dev_attr_target15.attr,
+	NULL,
+};
+
+static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	if (n < cxlr->config.interleave_ways)
+		return a->mode;
+	return 0;
+}
+
+static const struct attribute_group region_interleave_group = {
+	.attrs = interleave_attrs,
+	.is_visible = visible_targets,
+};
+
+static const struct attribute_group *region_groups[] = {
+	&region_group,
+	&region_interleave_group,
+	NULL,
+};
+
 static void cxl_region_release(struct device *dev);
 
 static const struct device_type cxl_region_type = {
 	.name = "cxl_region",
 	.release = cxl_region_release,
+	.groups = region_groups
 };
 
 static ssize_t create_region_show(struct device *dev,
@@ -108,8 +405,11 @@  static void cxl_region_release(struct device *dev)
 {
 	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
 	struct cxl_region *cxlr = to_cxl_region(dev);
+	int i;
 
 	ida_free(&cxld->region_ida, cxlr->id);
+	for (i = 0; i < cxlr->config.interleave_ways; i++)
+		remove_target(cxlr, i);
 	kfree(cxlr);
 }