diff mbox series

[v3,21/25] dax/region: Create resources on sparse DAX regions

Message ID 20240816-dcd-type2-upstream-v3-21-7c9b96cba6d7@intel.com
State New
Headers show
Series DCD: Add support for Dynamic Capacity Devices (DCD) | expand

Commit Message

Ira Weiny Aug. 16, 2024, 2:44 p.m. UTC
From: Navneet Singh <navneet.singh@intel.com>

DAX regions which map dynamic capacity partitions require that memory be
allowed to come and go.  Recall sparse regions were created for this
purpose.  Now that extents can be realized within DAX regions the DAX
region driver can start tracking sub-resource information.

The tight relationship between DAX region operations and extent
operations require memory changes to be controlled synchronously with
the user of the region.  Synchronize through the dax_region_rwsem and by
having the region driver drive both the region device as well as the
extent sub-devices.

Recall requests to remove extents can happen at any time and that a host
is not obligated to release the memory until it is not being used.  If
an extent is not used allow a release response.

The DAX layer has no need for the details of the CXL memory extent
devices.  Expose extents to the DAX layer as device children of the DAX
region device.  A single callback from the driver aids the DAX layer to
determine if the child device is an extent.  The DAX layer also
registers a devres function to automatically clean up when the device is
removed from the region.

There is a race between extents being surfaced and the dax_cxl driver
being loaded.  The driver must therefore scan for any existing extents
while still under the device lock.

Respond to extent notifications.  Manage the DAX region resource tree
based on the extents lifetime.  Return the status of remove
notifications to lower layers such that it can manage the hardware
appropriately.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: patch reorder]
[iweiny: move hunks from other patches to clarify code changes and
         add/release flows WRT dax regions]
[iweiny: use %par]
[iweiny: clean up variable names]
[iweiny: Simplify sparse_ops]
[Fan: avoid open coding range_len()]
[djbw: s/reg_ext/region_extent]
---
 drivers/cxl/core/extent.c |  76 +++++++++++++--
 drivers/cxl/cxl.h         |   6 ++
 drivers/dax/bus.c         | 243 +++++++++++++++++++++++++++++++++++++++++-----
 drivers/dax/bus.h         |   3 +-
 drivers/dax/cxl.c         |  63 +++++++++++-
 drivers/dax/dax-private.h |  34 +++++++
 drivers/dax/hmem/hmem.c   |   2 +-
 drivers/dax/pmem.c        |   2 +-
 8 files changed, 391 insertions(+), 38 deletions(-)

Comments

Markus Elfring Aug. 18, 2024, 11:38 a.m. UTC | #1
> +++ b/drivers/cxl/core/extent.c
> @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
>  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
>  }
>
> +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> +			      struct region_extent *region_extent)
> +{> +	device_lock(dev);
> +	if (dev->driver) {> +	}
> +	device_unlock(dev);
> +	return rc;
> +}
…

Under which circumstances would you become interested to apply a statement
like “guard(device)(dev);”?
https://elixir.bootlin.com/linux/v6.11-rc3/source/include/linux/device.h#L1027

Regards,
Markus
Dave Jiang Aug. 19, 2024, 11:30 p.m. UTC | #2
On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> DAX regions which map dynamic capacity partitions require that memory be
> allowed to come and go.  Recall sparse regions were created for this
> purpose.  Now that extents can be realized within DAX regions the DAX
> region driver can start tracking sub-resource information.
> 
> The tight relationship between DAX region operations and extent
> operations require memory changes to be controlled synchronously with
> the user of the region.  Synchronize through the dax_region_rwsem and by
> having the region driver drive both the region device as well as the
> extent sub-devices.
> 
> Recall requests to remove extents can happen at any time and that a host
> is not obligated to release the memory until it is not being used.  If
> an extent is not used allow a release response.
> 
> The DAX layer has no need for the details of the CXL memory extent
> devices.  Expose extents to the DAX layer as device children of the DAX
> region device.  A single callback from the driver aids the DAX layer to
> determine if the child device is an extent.  The DAX layer also
> registers a devres function to automatically clean up when the device is
> removed from the region.
> 
> There is a race between extents being surfaced and the dax_cxl driver
> being loaded.  The driver must therefore scan for any existing extents
> while still under the device lock.
> 
> Respond to extent notifications.  Manage the DAX region resource tree
> based on the extents lifetime.  Return the status of remove
> notifications to lower layers such that it can manage the hardware
> appropriately.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [iweiny: patch reorder]
> [iweiny: move hunks from other patches to clarify code changes and
>          add/release flows WRT dax regions]
> [iweiny: use %par]
> [iweiny: clean up variable names]
> [iweiny: Simplify sparse_ops]
> [Fan: avoid open coding range_len()]
> [djbw: s/reg_ext/region_extent]
> ---
>  drivers/cxl/core/extent.c |  76 +++++++++++++--
>  drivers/cxl/cxl.h         |   6 ++
>  drivers/dax/bus.c         | 243 +++++++++++++++++++++++++++++++++++++++++-----
>  drivers/dax/bus.h         |   3 +-
>  drivers/dax/cxl.c         |  63 +++++++++++-
>  drivers/dax/dax-private.h |  34 +++++++
>  drivers/dax/hmem/hmem.c   |   2 +-
>  drivers/dax/pmem.c        |   2 +-
>  8 files changed, 391 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index d7d526a51e2b..103b0bec3a4a 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
>  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
>  }
>  
> +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> +			      struct region_extent *region_extent)
> +{
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxlr->cxlr_dax;
> +	dev = &cxlr_dax->dev;
> +	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
> +		event, &region_extent->hpa_range);
> +
> +	/*
> +	 * NOTE the lack of a driver indicates a notification has failed.  No
> +	 * user space coordiantion was possible.
> +	 */
> +	device_lock(dev);
> +	if (dev->driver) {
> +		struct cxl_driver *driver = to_cxl_drv(dev->driver);
> +		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
> +			.event = event,
> +			.region_extent = region_extent,
> +		};
> +
> +		if (driver->notify) {
> +			dev_dbg(dev, "Notify: type %d HPA %par\n",
> +				event, &region_extent->hpa_range);
> +			rc = driver->notify(dev, &notify_data);
> +		}
> +	}
> +	device_unlock(dev);

Maybe a cleaner version:
	guard(device)(dev);
	if (!dev->driver || !dev->driver->notify)
		return 0;

	dev_dbg(...);
	return driver->notify(dev, &notify_data);


> +	return rc;
> +}
> +
> +struct rm_data {
> +	struct cxl_region *cxlr;
> +	struct range *range;
> +};
> +
>  static int cxlr_rm_extent(struct device *dev, void *data)
>  {
>  	struct region_extent *region_extent = to_region_extent(dev);
> -	struct range *region_hpa_range = data;
> +	struct rm_data *rm_data = data;
> +	int rc;
>  
>  	if (!region_extent)
>  		return 0;
>  
>  	/*
> -	 * Any extent which 'touches' the released range is removed.
> +	 * Any extent which 'touches' the released range is attempted to be
> +	 * removed.
>  	 */
> -	if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
> +	if (range_overlaps(rm_data->range, &region_extent->hpa_range)) {
> +		struct cxl_region *cxlr = rm_data->cxlr;
> +
>  		dev_dbg(dev, "Remove region extent HPA %par\n",
>  			&region_extent->hpa_range);
> +		rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, region_extent);
> +		if (rc == -EBUSY)
> +			return 0;
> +		/* Extent not in use or error, remove it */
>  		region_rm_extent(region_extent);
>  	}
>  	return 0;
> @@ -312,8 +359,13 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
>  
>  	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
>  
> +	struct rm_data rm_data = {
> +		.cxlr = cxlr,
> +		.range = &hpa_range,
> +	};
> +
>  	/* Remove region extents which overlap */
> -	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> +	return device_for_each_child(&cxlr->cxlr_dax->dev, &rm_data,
>  				     cxlr_rm_extent);
>  }
>  
> @@ -338,8 +390,20 @@ static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
>  		return rc;
>  	}
>  
> -	/* device model handles freeing region_extent */
> -	return online_region_extent(region_extent);
> +	rc = online_region_extent(region_extent);
> +	/* device model handled freeing region_extent */
> +	if (rc)
> +		return rc;
> +
> +	rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
> +	/*
> +	 * The region device was breifly live but DAX layer ensures it was not
> +	 * used
> +	 */
> +	if (rc)
> +		region_rm_extent(region_extent);
> +
> +	return rc;
>  }
>  
>  /* Callers are expected to ensure cxled has been attached to a region */
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index c858e3957fd5..9abbfc68c6ad 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -916,10 +916,16 @@ bool is_cxl_region(struct device *dev);
>  
>  extern struct bus_type cxl_bus_type;
>  
> +struct cxl_notify_data {
> +	enum dc_event event;
> +	struct region_extent *region_extent;
> +};
> +
>  struct cxl_driver {
>  	const char *name;
>  	int (*probe)(struct device *dev);
>  	void (*remove)(struct device *dev);
> +	int (*notify)(struct device *dev, struct cxl_notify_data *notify_data);
>  	struct device_driver drv;
>  	int id;
>  };
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 975860371d9f..f14b0cfa7edd 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -183,6 +183,83 @@ static bool is_sparse(struct dax_region *dax_region)
>  	return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
>  }
>  
> +static void __dax_release_resource(struct dax_resource *dax_resource)
> +{
> +	struct dax_region *dax_region = dax_resource->region;
> +
> +	lockdep_assert_held_write(&dax_region_rwsem);
> +	dev_dbg(dax_region->dev, "Extent release resource %pr\n",
> +		dax_resource->res);
> +	if (dax_resource->res)
> +		__release_region(&dax_region->res, dax_resource->res->start,
> +				 resource_size(dax_resource->res));
> +	dax_resource->res = NULL;
> +}
> +
> +static void dax_release_resource(void *res)
> +{
> +	struct dax_resource *dax_resource = res;
> +
> +	guard(rwsem_write)(&dax_region_rwsem);
> +	__dax_release_resource(dax_resource);
> +	kfree(dax_resource);
> +}
> +
> +int dax_region_add_resource(struct dax_region *dax_region,
> +			    struct device *device,
> +			    resource_size_t start, resource_size_t length)
>
kdoc header?

 +{
> +	struct resource *new_resource;
> +	int rc;
> +
> +	struct dax_resource *dax_resource __free(kfree) =
> +				kzalloc(sizeof(*dax_resource), GFP_KERNEL);
> +	if (!dax_resource)
> +		return -ENOMEM;
> +
> +	guard(rwsem_write)(&dax_region_rwsem);
> +
> +	dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
> +	new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
> +	if (!new_resource) {
> +		dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
> +			&start, &length);
> +		return -ENOSPC;
> +	}
> +
> +	dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
> +	dax_resource->region = dax_region;
> +	dax_resource->res = new_resource;
> +	dev_set_drvdata(device, dax_resource);
> +	rc = devm_add_action_or_reset(device, dax_release_resource,
> +				      no_free_ptr(dax_resource));
> +	/*  On error; ensure driver data is cleared under semaphore */
> +	if (rc)
> +		dev_set_drvdata(device, NULL);
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_add_resource);
> +
> +int dax_region_rm_resource(struct dax_region *dax_region,
> +			   struct device *dev)

kdoc header
> +{
> +	struct dax_resource *dax_resource;
> +
> +	guard(rwsem_write)(&dax_region_rwsem);
> +
> +	dax_resource = dev_get_drvdata(dev);
> +	if (!dax_resource)
> +		return 0;
> +
> +	if (dax_resource->use_cnt)
> +		return -EBUSY;
> +
> +	/* avoid races with users trying to use the extent */
> +	__dax_release_resource(dax_resource);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_rm_resource);
> +
>  bool static_dev_dax(struct dev_dax *dev_dax)
>  {
>  	return is_static(dev_dax->region);
> @@ -296,19 +373,44 @@ static ssize_t region_align_show(struct device *dev,
>  static struct device_attribute dev_attr_region_align =
>  		__ATTR(align, 0400, region_align_show, NULL);
>  
> +#define for_each_child_resource(extent, res) \
> +	for (res = (extent)->child; res; res = res->sibling)
> +
> +resource_size_t
> +dax_avail_size(struct resource *dax_resource)
kdoc header

DJ

> +{
> +	resource_size_t rc;
> +	struct resource *used_res;
> +
> +	rc = resource_size(dax_resource);
> +	for_each_child_resource(dax_resource, used_res)
> +		rc -= resource_size(used_res);
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_avail_size);
> +
>  #define for_each_dax_region_resource(dax_region, res) \
>  	for (res = (dax_region)->res.child; res; res = res->sibling)
>  
>  static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>  {
> -	resource_size_t size = resource_size(&dax_region->res);
> +	resource_size_t size;
>  	struct resource *res;
>  
>  	lockdep_assert_held(&dax_region_rwsem);
>  
> -	if (is_sparse(dax_region))
> -		return 0;
> +	if (is_sparse(dax_region)) {
> +		/*
> +		 * Children of a sparse region represent available space not
> +		 * used space.
> +		 */
> +		size = 0;
> +		for_each_dax_region_resource(dax_region, res)
> +			size += dax_avail_size(res);
> +		return size;
> +	}
>  
> +	size = resource_size(&dax_region->res);
>  	for_each_dax_region_resource(dax_region, res)
>  		size -= resource_size(res);
>  	return size;
> @@ -449,15 +551,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
>  static void trim_dev_dax_range(struct dev_dax *dev_dax)
>  {
>  	int i = dev_dax->nr_range - 1;
> -	struct range *range = &dev_dax->ranges[i].range;
> +	struct dev_dax_range *dev_range = &dev_dax->ranges[i];
> +	struct range *range = &dev_range->range;
>  	struct dax_region *dax_region = dev_dax->region;
> +	struct resource *res = &dax_region->res;
>  
>  	lockdep_assert_held_write(&dax_region_rwsem);
>  	dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
>  		(unsigned long long)range->start,
>  		(unsigned long long)range->end);
>  
> -	__release_region(&dax_region->res, range->start, range_len(range));
> +	if (dev_range->dax_resource) {
> +		res = dev_range->dax_resource->res;
> +		dev_dbg(&dev_dax->dev, "Trim sparse extent %pr\n", res);
> +	}
> +
> +	__release_region(res, range->start, range_len(range));
> +
> +	if (dev_range->dax_resource)
> +		dev_range->dax_resource->use_cnt--;
> +
>  	if (--dev_dax->nr_range == 0) {
>  		kfree(dev_dax->ranges);
>  		dev_dax->ranges = NULL;
> @@ -640,7 +753,7 @@ static void dax_region_unregister(void *region)
>  
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct range *range, int target_node, unsigned int align,
> -		unsigned long flags)
> +		unsigned long flags, struct dax_sparse_ops *sparse_ops)
>  {
>  	struct dax_region *dax_region;
>  
> @@ -658,12 +771,16 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  			|| !IS_ALIGNED(range_len(range), align))
>  		return NULL;
>  
> +	if (!sparse_ops && (flags & IORESOURCE_DAX_SPARSE_CAP))
> +		return NULL;
> +
>  	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
>  	if (!dax_region)
>  		return NULL;
>  
>  	dev_set_drvdata(parent, dax_region);
>  	kref_init(&dax_region->kref);
> +	dax_region->sparse_ops = sparse_ops;
>  	dax_region->id = region_id;
>  	dax_region->align = align;
>  	dax_region->dev = parent;
> @@ -845,7 +962,8 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
>  }
>  
>  static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
> -			       u64 start, resource_size_t size)
> +			       u64 start, resource_size_t size,
> +			       struct dax_resource *dax_resource)
>  {
>  	struct device *dev = &dev_dax->dev;
>  	struct dev_dax_range *ranges;
> @@ -884,6 +1002,7 @@ static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
>  			.start = alloc->start,
>  			.end = alloc->end,
>  		},
> +		.dax_resource = dax_resource,
>  	};
>  
>  	dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
> @@ -966,7 +1085,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
>  	int i;
>  
>  	for (i = dev_dax->nr_range - 1; i >= 0; i--) {
> -		struct range *range = &dev_dax->ranges[i].range;
> +		struct dev_dax_range *dev_range = &dev_dax->ranges[i];
> +		struct range *range = &dev_range->range;
>  		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
>  		struct resource *adjust = NULL, *res;
>  		resource_size_t shrink;
> @@ -982,12 +1102,21 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
>  			continue;
>  		}
>  
> -		for_each_dax_region_resource(dax_region, res)
> -			if (strcmp(res->name, dev_name(dev)) == 0
> -					&& res->start == range->start) {
> -				adjust = res;
> -				break;
> -			}
> +		if (dev_range->dax_resource) {
> +			for_each_child_resource(dev_range->dax_resource->res, res)
> +				if (strcmp(res->name, dev_name(dev)) == 0
> +						&& res->start == range->start) {
> +					adjust = res;
> +					break;
> +				}
> +		} else {
> +			for_each_dax_region_resource(dax_region, res)
> +				if (strcmp(res->name, dev_name(dev)) == 0
> +						&& res->start == range->start) {
> +					adjust = res;
> +					break;
> +				}
> +		}
>  
>  		if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
>  					"failed to find matching resource\n"))
> @@ -1025,19 +1154,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
>  }
>  
>  /**
> - * dev_dax_resize_static - Expand the device into the unused portion of the
> - * region. This may involve adjusting the end of an existing resource, or
> - * allocating a new resource.
> + * __dev_dax_resize - Expand the device into the unused portion of the region.
> + * This may involve adjusting the end of an existing resource, or allocating a
> + * new resource.
>   *
>   * @parent: parent resource to allocate this range in
>   * @dev_dax: DAX device to be expanded
>   * @to_alloc: amount of space to alloc; must be <= space available in @parent
> + * @dax_resource: if sparse; the parent resource
>   *
>   * Return the amount of space allocated or -ERRNO on failure
>   */
> -static ssize_t dev_dax_resize_static(struct resource *parent,
> -				     struct dev_dax *dev_dax,
> -				     resource_size_t to_alloc)
> +static ssize_t __dev_dax_resize(struct resource *parent,
> +				struct dev_dax *dev_dax,
> +				resource_size_t to_alloc,
> +				struct dax_resource *dax_resource)
>  {
>  	struct resource *res, *first;
>  	int rc;
> @@ -1045,7 +1176,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
>  	first = parent->child;
>  	if (!first) {
>  		rc = alloc_dev_dax_range(parent, dev_dax,
> -					   parent->start, to_alloc);
> +					   parent->start, to_alloc,
> +					   dax_resource);
>  		if (rc)
>  			return rc;
>  		return to_alloc;
> @@ -1059,7 +1191,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
>  		if (res == first && res->start > parent->start) {
>  			alloc = min(res->start - parent->start, to_alloc);
>  			rc = alloc_dev_dax_range(parent, dev_dax,
> -						 parent->start, alloc);
> +						 parent->start, alloc,
> +						 dax_resource);
>  			if (rc)
>  				return rc;
>  			return alloc;
> @@ -1083,7 +1216,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
>  				return rc;
>  			return alloc;
>  		}
> -		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
> +		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
> +					 dax_resource);
>  		if (rc)
>  			return rc;
>  		return alloc;
> @@ -1094,6 +1228,54 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
>  	return 0;
>  }
>  
> +static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)
> +{
> +	return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
> +}
> +
> +static int find_free_extent(struct device *dev, void *data)
> +{
> +	struct dax_region *dax_region = data;
> +	struct dax_resource *dax_resource;
> +
> +	if (!dax_region->sparse_ops->is_extent(dev))
> +		return 0;
> +
> +	dax_resource = dev_get_drvdata(dev);
> +	if (!dax_resource || !dax_avail_size(dax_resource->res))
> +		return 0;
> +	return 1;
> +}
> +
> +static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)
> +{
> +	struct dax_resource *dax_resource;
> +	resource_size_t available_size;
> +	struct device *extent_dev;
> +	ssize_t alloc;
> +
> +	extent_dev = device_find_child(dax_region->dev, dax_region,
> +				       find_free_extent);
> +	if (!extent_dev)
> +		return 0;
> +
> +	dax_resource = dev_get_drvdata(extent_dev);
> +	if (!dax_resource)
> +		return 0;
> +
> +	available_size = dax_avail_size(dax_resource->res);
> +	to_alloc = min(available_size, to_alloc);
> +	alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
> +	if (alloc > 0)
> +		dax_resource->use_cnt++;
> +	put_device(extent_dev);
> +	return alloc;
> +}
> +
>  static ssize_t dev_dax_resize(struct dax_region *dax_region,
>  		struct dev_dax *dev_dax, resource_size_t size)
>  {
> @@ -1117,7 +1299,10 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
>  		return -ENXIO;
>  
>  retry:
> -	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
> +	if (is_sparse(dax_region))
> +		alloc = dev_dax_resize_sparse(dax_region, dev_dax, to_alloc);
> +	else
> +		alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
>  	if (alloc <= 0)
>  		return alloc;
>  	to_alloc -= alloc;
> @@ -1226,7 +1411,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
>  	to_alloc = range_len(&r);
>  	if (alloc_is_aligned(dev_dax, to_alloc))
>  		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
> -					 to_alloc);
> +					 to_alloc, NULL);
>  	up_write(&dax_dev_rwsem);
>  	up_write(&dax_region_rwsem);
>  
> @@ -1494,8 +1679,14 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
>  	device_initialize(dev);
>  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
>  
> +	if (is_sparse(dax_region) && data->size) {
> +		dev_err(parent, "Sparse DAX region devices are created initially with 0 size");
> +		rc = -EINVAL;
> +		goto err_id;
> +	}
> +
>  	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> -				 data->size);
> +				 data->size, NULL);
>  	if (rc)
>  		goto err_range;
>  
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 783bfeef42cc..ae5029ea6047 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -9,6 +9,7 @@ struct dev_dax;
>  struct resource;
>  struct dax_device;
>  struct dax_region;
> +struct dax_sparse_ops;
>  
>  /* dax bus specific ioresource flags */
>  #define IORESOURCE_DAX_STATIC BIT(0)
> @@ -17,7 +18,7 @@ struct dax_region;
>  
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct range *range, int target_node, unsigned int align,
> -		unsigned long flags);
> +		unsigned long flags, struct dax_sparse_ops *sparse_ops);
>  
>  struct dev_dax_data {
>  	struct dax_region *dax_region;
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 367e86b1c22a..bf3b82b0120d 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -5,6 +5,60 @@
>  
>  #include "../cxl/cxl.h"
>  #include "bus.h"
> +#include "dax-private.h"
> +
> +static int __cxl_dax_add_resource(struct dax_region *dax_region,
> +				  struct region_extent *region_extent)
> +{
> +	resource_size_t start, length;
> +	struct device *dev;
> +
> +	dev = &region_extent->dev;
> +	start = dax_region->res.start + region_extent->hpa_range.start;
> +	length = range_len(&region_extent->hpa_range);
> +	return dax_region_add_resource(dax_region, dev, start, length);
> +}
> +
> +static int cxl_dax_add_resource(struct device *dev, void *data)
> +{
> +	struct dax_region *dax_region = data;
> +	struct region_extent *region_extent;
> +
> +	region_extent = to_region_extent(dev);
> +	if (!region_extent)
> +		return 0;
> +
> +	dev_dbg(dax_region->dev, "Adding resource HPA %par\n",
> +		&region_extent->hpa_range);
> +
> +	return __cxl_dax_add_resource(dax_region, region_extent);
> +}
> +
> +static int cxl_dax_region_notify(struct device *dev,
> +				 struct cxl_notify_data *notify_data)
> +{
> +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> +	struct dax_region *dax_region = dev_get_drvdata(dev);
> +	struct region_extent *region_extent = notify_data->region_extent;
> +
> +	switch (notify_data->event) {
> +	case DCD_ADD_CAPACITY:
> +		return __cxl_dax_add_resource(dax_region, region_extent);
> +	case DCD_RELEASE_CAPACITY:
> +		return dax_region_rm_resource(dax_region, &region_extent->dev);
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +	default:
> +		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
> +			notify_data->event);
> +		break;
> +	}
> +
> +	return -ENXIO;
> +}
> +
> +struct dax_sparse_ops sparse_ops = {
> +	.is_extent = is_region_extent,
> +};
>  
>  static int cxl_dax_region_probe(struct device *dev)
>  {
> @@ -24,14 +78,16 @@ static int cxl_dax_region_probe(struct device *dev)
>  		flags |= IORESOURCE_DAX_SPARSE_CAP;
>  
>  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, flags);
> +				      PMD_SIZE, flags, &sparse_ops);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> -	if (cxlr->mode == CXL_REGION_DC)
> +	if (cxlr->mode == CXL_REGION_DC) {
> +		device_for_each_child(&cxlr_dax->dev, dax_region,
> +				      cxl_dax_add_resource);
>  		/* Add empty seed dax device */
>  		dev_size = 0;
> -	else
> +	} else
>  		dev_size = range_len(&cxlr_dax->hpa_range);
>  
>  	data = (struct dev_dax_data) {
> @@ -47,6 +103,7 @@ static int cxl_dax_region_probe(struct device *dev)
>  static struct cxl_driver cxl_dax_region_driver = {
>  	.name = "cxl_dax_region",
>  	.probe = cxl_dax_region_probe,
> +	.notify = cxl_dax_region_notify,
>  	.id = CXL_DEVICE_DAX_REGION,
>  	.drv = {
>  		.suppress_bind_attrs = true,
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index ccde98c3d4e2..9e9f98c85620 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -16,6 +16,36 @@ struct inode *dax_inode(struct dax_device *dax_dev);
>  int dax_bus_init(void);
>  void dax_bus_exit(void);
>  
> +/**
> + * struct dax_resource - For sparse regions; an active resource
> + * @region: dax_region this resources is in
> + * @res: resource
> + * @use_cnt: count the number of uses of this resource
> + *
> + * Changes to the dax_reigon and the dax_resources within it are protected by
> + * dax_region_rwsem
> + */
> +struct dax_resource {
> +	struct dax_region *region;
> +	struct resource *res;
> +	unsigned int use_cnt;
> +};
> +int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
> +			    resource_size_t start, resource_size_t length);
> +int dax_region_rm_resource(struct dax_region *dax_region,
> +			   struct device *dev);
> +resource_size_t dax_avail_size(struct resource *dax_resource);
> +
> +typedef int (*match_cb)(struct device *dev, resource_size_t *size_avail);
> +
> +/**
> + * struct dax_sparse_ops - Operations for sparse regions
> + * @is_extent: return if the device is an extent
> + */
> +struct dax_sparse_ops {
> +	bool (*is_extent)(struct device *dev);
> +};
> +
>  /**
>   * struct dax_region - mapping infrastructure for dax devices
>   * @id: kernel-wide unique region for a memory range
> @@ -27,6 +57,7 @@ void dax_bus_exit(void);
>   * @res: resource tree to track instance allocations
>   * @seed: allow userspace to find the first unbound seed device
>   * @youngest: allow userspace to find the most recently created device
> + * @sparse_ops: operations required for sparse regions
>   */
>  struct dax_region {
>  	int id;
> @@ -38,6 +69,7 @@ struct dax_region {
>  	struct resource res;
>  	struct device *seed;
>  	struct device *youngest;
> +	struct dax_sparse_ops *sparse_ops;
>  };
>  
>  struct dax_mapping {
> @@ -62,6 +94,7 @@ struct dax_mapping {
>   * @pgoff: page offset
>   * @range: resource-span
>   * @mapping: device to assist in interrogating the range layout
> + * @dax_resource: if not NULL; dax sparse resource containing this range
>   */
>  struct dev_dax {
>  	struct dax_region *region;
> @@ -79,6 +112,7 @@ struct dev_dax {
>  		unsigned long pgoff;
>  		struct range range;
>  		struct dax_mapping *mapping;
> +		struct dax_resource *dax_resource;
>  	} *ranges;
>  };
>  
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 5e7c53f18491..0eea65052874 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -28,7 +28,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
>  
>  	mri = dev->platform_data;
>  	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
> -				      mri->target_node, PMD_SIZE, flags);
> +				      mri->target_node, PMD_SIZE, flags, NULL);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index c8ebf4e281f2..f927e855f240 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -54,7 +54,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
>  	range.start += offset;
>  	dax_region = alloc_dax_region(dev, region_id, &range,
>  			nd_region->target_node, le32_to_cpu(pfn_sb->align),
> -			IORESOURCE_DAX_STATIC);
> +			IORESOURCE_DAX_STATIC, NULL);
>  	if (!dax_region)
>  		return ERR_PTR(-ENOMEM);
>  
>
Ira Weiny Aug. 23, 2024, 2:28 p.m. UTC | #3
Dave Jiang wrote:
> 
> 
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> > index d7d526a51e2b..103b0bec3a4a 100644
> > --- a/drivers/cxl/core/extent.c
> > +++ b/drivers/cxl/core/extent.c
> > @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
> >  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> >  }
> >  
> > +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> > +			      struct region_extent *region_extent)
> > +{
> > +	struct cxl_dax_region *cxlr_dax;
> > +	struct device *dev;
> > +	int rc = 0;
> > +
> > +	cxlr_dax = cxlr->cxlr_dax;
> > +	dev = &cxlr_dax->dev;
> > +	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
> > +		event, &region_extent->hpa_range);
> > +
> > +	/*
> > +	 * NOTE the lack of a driver indicates a notification has failed.  No
> > +	 * user space coordiantion was possible.
> > +	 */
> > +	device_lock(dev);
> > +	if (dev->driver) {
> > +		struct cxl_driver *driver = to_cxl_drv(dev->driver);
> > +		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
> > +			.event = event,
> > +			.region_extent = region_extent,
> > +		};
> > +
> > +		if (driver->notify) {
> > +			dev_dbg(dev, "Notify: type %d HPA %par\n",
> > +				event, &region_extent->hpa_range);
> > +			rc = driver->notify(dev, &notify_data);
> > +		}
> > +	}
> > +	device_unlock(dev);
> 
> Maybe a cleaner version:
> 	guard(device)(dev);
> 	if (!dev->driver || !dev->driver->notify)
> 		return 0;

There is no dev->driver->notify.  But this works.

        if (!dev->driver)
                return 0;
        driver = to_cxl_drv(dev->driver);
        if (!driver->notify)
                return 0;

Not quite as clean but I did miss the use of guard.

> 
> 	dev_dbg(...);
> 	return driver->notify(dev, &notify_data);
>

I've cleaned it up.

[snip]

> > +
> > +int dax_region_add_resource(struct dax_region *dax_region,
> > +			    struct device *device,
> > +			    resource_size_t start, resource_size_t length)
> >
> kdoc header?

Because dax_region_add_resource() is part of the DAX private interfaces
and not intended to be used outside the DAX subsystem I skipped the kdoc
here even though the function must be exported.  Same for
dax_region_rm_resource() and dax_avail_size().

This is similar to run_dax() in that it is designed as a 'generic
operation' within the dax subsystem but not generally useful to the
kernel.

For now I'll move their declarations in dax-private.h and make a similar
comment.

Ira

> 
>  +{
> > +	struct resource *new_resource;
> > +	int rc;
> > +
> > +	struct dax_resource *dax_resource __free(kfree) =
> > +				kzalloc(sizeof(*dax_resource), GFP_KERNEL);
> > +	if (!dax_resource)
> > +		return -ENOMEM;
> > +
> > +	guard(rwsem_write)(&dax_region_rwsem);
> > +
> > +	dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
> > +	new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
> > +	if (!new_resource) {
> > +		dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
> > +			&start, &length);
> > +		return -ENOSPC;
> > +	}
> > +
> > +	dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
> > +	dax_resource->region = dax_region;
> > +	dax_resource->res = new_resource;
> > +	dev_set_drvdata(device, dax_resource);
> > +	rc = devm_add_action_or_reset(device, dax_release_resource,
> > +				      no_free_ptr(dax_resource));
> > +	/*  On error; ensure driver data is cleared under semaphore */
> > +	if (rc)
> > +		dev_set_drvdata(device, NULL);
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_region_add_resource);
> > +
> > +int dax_region_rm_resource(struct dax_region *dax_region,
> > +			   struct device *dev)
> 
> kdoc header
> > +{
> > +	struct dax_resource *dax_resource;
> > +
> > +	guard(rwsem_write)(&dax_region_rwsem);
> > +
> > +	dax_resource = dev_get_drvdata(dev);
> > +	if (!dax_resource)
> > +		return 0;
> > +
> > +	if (dax_resource->use_cnt)
> > +		return -EBUSY;
> > +
> > +	/* avoid races with users trying to use the extent */
> > +	__dax_release_resource(dax_resource);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_region_rm_resource);
> > +
> >  bool static_dev_dax(struct dev_dax *dev_dax)
> >  {
> >  	return is_static(dev_dax->region);
> > @@ -296,19 +373,44 @@ static ssize_t region_align_show(struct device *dev,
> >  static struct device_attribute dev_attr_region_align =
> >  		__ATTR(align, 0400, region_align_show, NULL);
> >  
> > +#define for_each_child_resource(extent, res) \
> > +	for (res = (extent)->child; res; res = res->sibling)
> > +
> > +resource_size_t
> > +dax_avail_size(struct resource *dax_resource)
> kdoc header
> 
> DJ
> 

[snip]
Jonathan Cameron Aug. 27, 2024, 2:12 p.m. UTC | #4
On Fri, 16 Aug 2024 09:44:29 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> DAX regions which map dynamic capacity partitions require that memory be
> allowed to come and go.  Recall sparse regions were created for this
> purpose.  Now that extents can be realized within DAX regions the DAX
> region driver can start tracking sub-resource information.
> 
> The tight relationship between DAX region operations and extent
> operations require memory changes to be controlled synchronously with
> the user of the region.  Synchronize through the dax_region_rwsem and by
> having the region driver drive both the region device as well as the
> extent sub-devices.
> 
> Recall requests to remove extents can happen at any time and that a host
> is not obligated to release the memory until it is not being used.  If
> an extent is not used allow a release response.
> 
> The DAX layer has no need for the details of the CXL memory extent
> devices.  Expose extents to the DAX layer as device children of the DAX
> region device.  A single callback from the driver aids the DAX layer to
> determine if the child device is an extent.  The DAX layer also
> registers a devres function to automatically clean up when the device is
> removed from the region.
> 
> There is a race between extents being surfaced and the dax_cxl driver
> being loaded.  The driver must therefore scan for any existing extents
> while still under the device lock.
> 
> Respond to extent notifications.  Manage the DAX region resource tree
> based on the extents lifetime.  Return the status of remove
> notifications to lower layers such that it can manage the hardware
> appropriately.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
A few minor comments inline.

Jonathan

> 
> ---
> Changes:
> [iweiny: patch reorder]
> [iweiny: move hunks from other patches to clarify code changes and
>          add/release flows WRT dax regions]
> [iweiny: use %par]
> [iweiny: clean up variable names]
> [iweiny: Simplify sparse_ops]
> [Fan: avoid open coding range_len()]
> [djbw: s/reg_ext/region_extent]
> ---
>  drivers/cxl/core/extent.c |  76 +++++++++++++--
>  drivers/cxl/cxl.h         |   6 ++
>  drivers/dax/bus.c         | 243 +++++++++++++++++++++++++++++++++++++++++-----
>  drivers/dax/bus.h         |   3 +-
>  drivers/dax/cxl.c         |  63 +++++++++++-
>  drivers/dax/dax-private.h |  34 +++++++
>  drivers/dax/hmem/hmem.c   |   2 +-
>  drivers/dax/pmem.c        |   2 +-
>  8 files changed, 391 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index d7d526a51e2b..103b0bec3a4a 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
>  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
>  }
>  
> +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> +			      struct region_extent *region_extent)
> +{
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxlr->cxlr_dax;
> +	dev = &cxlr_dax->dev;
> +	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
> +		event, &region_extent->hpa_range);
> +
> +	/*
> +	 * NOTE the lack of a driver indicates a notification has failed.  No
> +	 * user space coordiantion was possible.
> +	 */
> +	device_lock(dev);

I'd use guard() for this as then can just return the notify result
and drop local variable rc.


> +	if (dev->driver) {
> +		struct cxl_driver *driver = to_cxl_drv(dev->driver);
> +		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
> +			.event = event,
> +			.region_extent = region_extent,
> +		};
> +
> +		if (driver->notify) {
> +			dev_dbg(dev, "Notify: type %d HPA %par\n",
> +				event, &region_extent->hpa_range);
> +			rc = driver->notify(dev, &notify_data);
> +		}
> +	}
> +	device_unlock(dev);
> +	return rc;
> +}
>
> @@ -338,8 +390,20 @@ static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
>  		return rc;
>  	}
>  
> -	/* device model handles freeing region_extent */
> -	return online_region_extent(region_extent);
> +	rc = online_region_extent(region_extent);
> +	/* device model handled freeing region_extent */
> +	if (rc)
> +		return rc;
> +
> +	rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
> +	/*
> +	 * The region device was breifly live but DAX layer ensures it was not

briefly

> +	 * used
> +	 */
> +	if (rc)
> +		region_rm_extent(region_extent);	
> +
> +	return rc;
>  }

> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 975860371d9f..f14b0cfa7edd 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c

> +EXPORT_SYMBOL_GPL(dax_region_add_resource);
> +
> +int dax_region_rm_resource(struct dax_region *dax_region,
> +			   struct device *dev)
> +{
> +	struct dax_resource *dax_resource;
> +
> +	guard(rwsem_write)(&dax_region_rwsem);
> +
> +	dax_resource = dev_get_drvdata(dev);
> +	if (!dax_resource)
> +		return 0;
> +
> +	if (dax_resource->use_cnt)
> +		return -EBUSY;
> +
> +	/* avoid races with users trying to use the extent */

Not obvious to me from local code, why does releasing the resource
here avoid a race?  Perhaps the comment needs expanding.

> +	__dax_release_resource(dax_resource);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_rm_resource);
> +


> +static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)
> +{
> +	struct dax_resource *dax_resource;
> +	resource_size_t available_size;
> +	struct device *extent_dev;
> +	ssize_t alloc;
> +
> +	extent_dev = device_find_child(dax_region->dev, dax_region,
> +				       find_free_extent);

There is a __free for put device and it will tidy this up a tiny bit.

> +	if (!extent_dev)
> +		return 0;
> +
> +	dax_resource = dev_get_drvdata(extent_dev);
> +	if (!dax_resource)
> +		return 0;
> +
> +	available_size = dax_avail_size(dax_resource->res);
> +	to_alloc = min(available_size, to_alloc);
I'd put those two inline and skip the local variables unless
they have more use in later patches.

	alloc = __dev_dax_resize(dax_resources->res, dev_dax,
				 min(dax_avail_size(dax_resources->res), to_alloc),
				 dax_resource);
	
				
> +	alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
> +	if (alloc > 0)
> +		dax_resource->use_cnt++;
> +	put_device(extent_dev);
> +	return alloc;
> +}
> +

> @@ -1494,8 +1679,14 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
>  	device_initialize(dev);
>  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
>  
> +	if (is_sparse(dax_region) && data->size) {
> +		dev_err(parent, "Sparse DAX region devices are created initially with 0 size");
must be created initially with 0 size.

Otherwise this error message says that they are, so why is it an error?

> +		rc = -EINVAL;
> +		goto err_id;
> +	}
> +
>  	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> -				 data->size);
> +				 data->size, NULL);
>  	if (rc)
>  		goto err_range;
>  

> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 367e86b1c22a..bf3b82b0120d 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -5,6 +5,60 @@

...

> +static int cxl_dax_region_notify(struct device *dev,
> +				 struct cxl_notify_data *notify_data)
> +{
> +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> +	struct dax_region *dax_region = dev_get_drvdata(dev);
> +	struct region_extent *region_extent = notify_data->region_extent;
> +
> +	switch (notify_data->event) {
> +	case DCD_ADD_CAPACITY:
> +		return __cxl_dax_add_resource(dax_region, region_extent);
> +	case DCD_RELEASE_CAPACITY:
> +		return dax_region_rm_resource(dax_region, &region_extent->dev);
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +	default:
> +		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
> +			notify_data->event);
> +		break;
Might as well return here and not below.
Makes it really really obvious this is the error path and currently the only
one that hits the return statement.
> +	}
> +
> +	return -ENXIO;
> +}

>  static int cxl_dax_region_probe(struct device *dev)
>  {
> @@ -24,14 +78,16 @@ static int cxl_dax_region_probe(struct device *dev)
>  		flags |= IORESOURCE_DAX_SPARSE_CAP;
>  
>  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, flags);
> +				      PMD_SIZE, flags, &sparse_ops);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> -	if (cxlr->mode == CXL_REGION_DC)
> +	if (cxlr->mode == CXL_REGION_DC) {
> +		device_for_each_child(&cxlr_dax->dev, dax_region,
> +				      cxl_dax_add_resource);
>  		/* Add empty seed dax device */
>  		dev_size = 0;
> -	else
> +	} else

Coding style says that you need brackets for all branches if
one needs them (as multiline).  Just above:
https://www.kernel.org/doc/html/v4.10/process/coding-style.html#spaces


>  		dev_size = range_len(&cxlr_dax->hpa_range);
>  
>  	data = (struct dev_dax_data) {
Ira Weiny Aug. 29, 2024, 9:54 p.m. UTC | #5
Jonathan Cameron wrote:
> On Fri, 16 Aug 2024 09:44:29 -0500
> ira.weiny@intel.com wrote:
> 
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> > index d7d526a51e2b..103b0bec3a4a 100644
> > --- a/drivers/cxl/core/extent.c
> > +++ b/drivers/cxl/core/extent.c
> > @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
> >  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> >  }
> >  
> > +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> > +			      struct region_extent *region_extent)
> > +{
> > +	struct cxl_dax_region *cxlr_dax;
> > +	struct device *dev;
> > +	int rc = 0;
> > +
> > +	cxlr_dax = cxlr->cxlr_dax;
> > +	dev = &cxlr_dax->dev;
> > +	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
> > +		event, &region_extent->hpa_range);
> > +
> > +	/*
> > +	 * NOTE the lack of a driver indicates a notification has failed.  No
> > +	 * user space coordiantion was possible.
> > +	 */
> > +	device_lock(dev);
> 
> I'd use guard() for this as then can just return the notify result
> and drop local variable rc

Yep Dave already mentioned this and it is done.

> 
> 
> > +	if (dev->driver) {
> > +		struct cxl_driver *driver = to_cxl_drv(dev->driver);
> > +		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
> > +			.event = event,
> > +			.region_extent = region_extent,
> > +		};
> > +
> > +		if (driver->notify) {
> > +			dev_dbg(dev, "Notify: type %d HPA %par\n",
> > +				event, &region_extent->hpa_range);
> > +			rc = driver->notify(dev, &notify_data);
> > +		}
> > +	}
> > +	device_unlock(dev);
> > +	return rc;
> > +}
> >
> > @@ -338,8 +390,20 @@ static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
> >  		return rc;
> >  	}
> >  
> > -	/* device model handles freeing region_extent */
> > -	return online_region_extent(region_extent);
> > +	rc = online_region_extent(region_extent);
> > +	/* device model handled freeing region_extent */
> > +	if (rc)
> > +		return rc;
> > +
> > +	rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
> > +	/*
> > +	 * The region device was breifly live but DAX layer ensures it was not
> 
> briefly

Fixed Thanks

> 
> > +	 * used
> > +	 */
> > +	if (rc)
> > +		region_rm_extent(region_extent);	
> > +
> > +	return rc;
> >  }
> 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 975860371d9f..f14b0cfa7edd 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> 
> > +EXPORT_SYMBOL_GPL(dax_region_add_resource);
> > +
> > +int dax_region_rm_resource(struct dax_region *dax_region,
> > +			   struct device *dev)
> > +{
> > +	struct dax_resource *dax_resource;
> > +
> > +	guard(rwsem_write)(&dax_region_rwsem);
> > +
> > +	dax_resource = dev_get_drvdata(dev);
> > +	if (!dax_resource)
> > +		return 0;
> > +
> > +	if (dax_resource->use_cnt)
> > +		return -EBUSY;
> > +
> > +	/* avoid races with users trying to use the extent */
> 
> Not obvious to me from local code, why does releasing the resource
> here avoid a race?  Perhaps the comment needs expanding.

We are under the dax_region_rwsem here.  So that avoids the users from seeing
the extent while the lower level code is going to remove it.

How about?

        /*
         * release the resource under dax_region_rwsem to avoid races with
         * users trying to use the extent
         */

> 
> > +	__dax_release_resource(dax_resource);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_region_rm_resource);
> > +
> 
> 
> > +static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
> > +				     struct dev_dax *dev_dax,
> > +				     resource_size_t to_alloc)
> > +{
> > +	struct dax_resource *dax_resource;
> > +	resource_size_t available_size;
> > +	struct device *extent_dev;
> > +	ssize_t alloc;
> > +
> > +	extent_dev = device_find_child(dax_region->dev, dax_region,
> > +				       find_free_extent);
> 
> There is a __free for put device and it will tidy this up a tiny bit.

And fix the bug...  :-/  Thanks.

> 
> > +	if (!extent_dev)
> > +		return 0;
> > +
> > +	dax_resource = dev_get_drvdata(extent_dev);
> > +	if (!dax_resource)
> > +		return 0;
> > +
> > +	available_size = dax_avail_size(dax_resource->res);
> > +	to_alloc = min(available_size, to_alloc);
> I'd put those two inline and skip the local variables unless
> they have more use in later patches.

Nope just made the line shorter and I tend to not embed calls like that but it
is fine your way too.  I'll change it.

> 
> 	alloc = __dev_dax_resize(dax_resources->res, dev_dax,
> 				 min(dax_avail_size(dax_resources->res), to_alloc),
> 				 dax_resource);
> 	
> 				
> > +	alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
> > +	if (alloc > 0)
> > +		dax_resource->use_cnt++;
> > +	put_device(extent_dev);
> > +	return alloc;
> > +}
> > +
> 
> > @@ -1494,8 +1679,14 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
> >  	device_initialize(dev);
> >  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
> >  
> > +	if (is_sparse(dax_region) && data->size) {
> > +		dev_err(parent, "Sparse DAX region devices are created initially with 0 size");
> must be created initially with 0 size.
> 
> Otherwise this error message says that they are, so why is it an error?

Indeed!

> 
> > +		rc = -EINVAL;
> > +		goto err_id;
> > +	}
> > +
> >  	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> > -				 data->size);
> > +				 data->size, NULL);
> >  	if (rc)
> >  		goto err_range;
> >  
> 
> > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > index 367e86b1c22a..bf3b82b0120d 100644
> > --- a/drivers/dax/cxl.c
> > +++ b/drivers/dax/cxl.c
> > @@ -5,6 +5,60 @@
> 
> ...
> 
> > +static int cxl_dax_region_notify(struct device *dev,
> > +				 struct cxl_notify_data *notify_data)
> > +{
> > +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> > +	struct dax_region *dax_region = dev_get_drvdata(dev);
> > +	struct region_extent *region_extent = notify_data->region_extent;
> > +
> > +	switch (notify_data->event) {
> > +	case DCD_ADD_CAPACITY:
> > +		return __cxl_dax_add_resource(dax_region, region_extent);
> > +	case DCD_RELEASE_CAPACITY:
> > +		return dax_region_rm_resource(dax_region, &region_extent->dev);
> > +	case DCD_FORCED_CAPACITY_RELEASE:
> > +	default:
> > +		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
> > +			notify_data->event);
> > +		break;
> Might as well return here and not below.
> Makes it really really obvious this is the error path and currently the only
> one that hits the return statement.
> > +	}

ok

> > +
> > +	return -ENXIO;
> > +}
> 
> >  static int cxl_dax_region_probe(struct device *dev)
> >  {
> > @@ -24,14 +78,16 @@ static int cxl_dax_region_probe(struct device *dev)
> >  		flags |= IORESOURCE_DAX_SPARSE_CAP;
> >  
> >  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> > -				      PMD_SIZE, flags);
> > +				      PMD_SIZE, flags, &sparse_ops);
> >  	if (!dax_region)
> >  		return -ENOMEM;
> >  
> > -	if (cxlr->mode == CXL_REGION_DC)
> > +	if (cxlr->mode == CXL_REGION_DC) {
> > +		device_for_each_child(&cxlr_dax->dev, dax_region,
> > +				      cxl_dax_add_resource);
> >  		/* Add empty seed dax device */
> >  		dev_size = 0;
> > -	else
> > +	} else
> 
> Coding style says that you need brackets for all branches if
> one needs them (as multiline).  Just above:
> https://www.kernel.org/doc/html/v4.10/process/coding-style.html#spaces
> 

Yep missed it.  thanks for the review!
Ira

[snip]
diff mbox series

Patch

diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index d7d526a51e2b..103b0bec3a4a 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -271,20 +271,67 @@  static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
 	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
 }
 
+static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+			      struct region_extent *region_extent)
+{
+	struct cxl_dax_region *cxlr_dax;
+	struct device *dev;
+	int rc = 0;
+
+	cxlr_dax = cxlr->cxlr_dax;
+	dev = &cxlr_dax->dev;
+	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
+		event, &region_extent->hpa_range);
+
+	/*
+	 * NOTE the lack of a driver indicates a notification has failed.  No
+	 * user space coordiantion was possible.
+	 */
+	device_lock(dev);
+	if (dev->driver) {
+		struct cxl_driver *driver = to_cxl_drv(dev->driver);
+		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
+			.event = event,
+			.region_extent = region_extent,
+		};
+
+		if (driver->notify) {
+			dev_dbg(dev, "Notify: type %d HPA %par\n",
+				event, &region_extent->hpa_range);
+			rc = driver->notify(dev, &notify_data);
+		}
+	}
+	device_unlock(dev);
+	return rc;
+}
+
+struct rm_data {
+	struct cxl_region *cxlr;
+	struct range *range;
+};
+
 static int cxlr_rm_extent(struct device *dev, void *data)
 {
 	struct region_extent *region_extent = to_region_extent(dev);
-	struct range *region_hpa_range = data;
+	struct rm_data *rm_data = data;
+	int rc;
 
 	if (!region_extent)
 		return 0;
 
 	/*
-	 * Any extent which 'touches' the released range is removed.
+	 * Any extent which 'touches' the released range is attempted to be
+	 * removed.
 	 */
-	if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
+	if (range_overlaps(rm_data->range, &region_extent->hpa_range)) {
+		struct cxl_region *cxlr = rm_data->cxlr;
+
 		dev_dbg(dev, "Remove region extent HPA %par\n",
 			&region_extent->hpa_range);
+		rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, region_extent);
+		if (rc == -EBUSY)
+			return 0;
+		/* Extent not in use or error, remove it */
 		region_rm_extent(region_extent);
 	}
 	return 0;
@@ -312,8 +359,13 @@  int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
 
 	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
 
+	struct rm_data rm_data = {
+		.cxlr = cxlr,
+		.range = &hpa_range,
+	};
+
 	/* Remove region extents which overlap */
-	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
+	return device_for_each_child(&cxlr->cxlr_dax->dev, &rm_data,
 				     cxlr_rm_extent);
 }
 
@@ -338,8 +390,20 @@  static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
 		return rc;
 	}
 
-	/* device model handles freeing region_extent */
-	return online_region_extent(region_extent);
+	rc = online_region_extent(region_extent);
+	/* device model handled freeing region_extent */
+	if (rc)
+		return rc;
+
+	rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
+	/*
+	 * The region device was breifly live but DAX layer ensures it was not
+	 * used
+	 */
+	if (rc)
+		region_rm_extent(region_extent);
+
+	return rc;
 }
 
 /* Callers are expected to ensure cxled has been attached to a region */
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index c858e3957fd5..9abbfc68c6ad 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -916,10 +916,16 @@  bool is_cxl_region(struct device *dev);
 
 extern struct bus_type cxl_bus_type;
 
+struct cxl_notify_data {
+	enum dc_event event;
+	struct region_extent *region_extent;
+};
+
 struct cxl_driver {
 	const char *name;
 	int (*probe)(struct device *dev);
 	void (*remove)(struct device *dev);
+	int (*notify)(struct device *dev, struct cxl_notify_data *notify_data);
 	struct device_driver drv;
 	int id;
 };
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 975860371d9f..f14b0cfa7edd 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -183,6 +183,83 @@  static bool is_sparse(struct dax_region *dax_region)
 	return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
 }
 
+static void __dax_release_resource(struct dax_resource *dax_resource)
+{
+	struct dax_region *dax_region = dax_resource->region;
+
+	lockdep_assert_held_write(&dax_region_rwsem);
+	dev_dbg(dax_region->dev, "Extent release resource %pr\n",
+		dax_resource->res);
+	if (dax_resource->res)
+		__release_region(&dax_region->res, dax_resource->res->start,
+				 resource_size(dax_resource->res));
+	dax_resource->res = NULL;
+}
+
+static void dax_release_resource(void *res)
+{
+	struct dax_resource *dax_resource = res;
+
+	guard(rwsem_write)(&dax_region_rwsem);
+	__dax_release_resource(dax_resource);
+	kfree(dax_resource);
+}
+
+int dax_region_add_resource(struct dax_region *dax_region,
+			    struct device *device,
+			    resource_size_t start, resource_size_t length)
+{
+	struct resource *new_resource;
+	int rc;
+
+	struct dax_resource *dax_resource __free(kfree) =
+				kzalloc(sizeof(*dax_resource), GFP_KERNEL);
+	if (!dax_resource)
+		return -ENOMEM;
+
+	guard(rwsem_write)(&dax_region_rwsem);
+
+	dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
+	new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
+	if (!new_resource) {
+		dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
+			&start, &length);
+		return -ENOSPC;
+	}
+
+	dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
+	dax_resource->region = dax_region;
+	dax_resource->res = new_resource;
+	dev_set_drvdata(device, dax_resource);
+	rc = devm_add_action_or_reset(device, dax_release_resource,
+				      no_free_ptr(dax_resource));
+	/*  On error; ensure driver data is cleared under semaphore */
+	if (rc)
+		dev_set_drvdata(device, NULL);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dax_region_add_resource);
+
+int dax_region_rm_resource(struct dax_region *dax_region,
+			   struct device *dev)
+{
+	struct dax_resource *dax_resource;
+
+	guard(rwsem_write)(&dax_region_rwsem);
+
+	dax_resource = dev_get_drvdata(dev);
+	if (!dax_resource)
+		return 0;
+
+	if (dax_resource->use_cnt)
+		return -EBUSY;
+
+	/* avoid races with users trying to use the extent */
+	__dax_release_resource(dax_resource);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_rm_resource);
+
 bool static_dev_dax(struct dev_dax *dev_dax)
 {
 	return is_static(dev_dax->region);
@@ -296,19 +373,44 @@  static ssize_t region_align_show(struct device *dev,
 static struct device_attribute dev_attr_region_align =
 		__ATTR(align, 0400, region_align_show, NULL);
 
+#define for_each_child_resource(extent, res) \
+	for (res = (extent)->child; res; res = res->sibling)
+
+resource_size_t
+dax_avail_size(struct resource *dax_resource)
+{
+	resource_size_t rc;
+	struct resource *used_res;
+
+	rc = resource_size(dax_resource);
+	for_each_child_resource(dax_resource, used_res)
+		rc -= resource_size(used_res);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dax_avail_size);
+
 #define for_each_dax_region_resource(dax_region, res) \
 	for (res = (dax_region)->res.child; res; res = res->sibling)
 
 static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
 {
-	resource_size_t size = resource_size(&dax_region->res);
+	resource_size_t size;
 	struct resource *res;
 
 	lockdep_assert_held(&dax_region_rwsem);
 
-	if (is_sparse(dax_region))
-		return 0;
+	if (is_sparse(dax_region)) {
+		/*
+		 * Children of a sparse region represent available space not
+		 * used space.
+		 */
+		size = 0;
+		for_each_dax_region_resource(dax_region, res)
+			size += dax_avail_size(res);
+		return size;
+	}
 
+	size = resource_size(&dax_region->res);
 	for_each_dax_region_resource(dax_region, res)
 		size -= resource_size(res);
 	return size;
@@ -449,15 +551,26 @@  EXPORT_SYMBOL_GPL(kill_dev_dax);
 static void trim_dev_dax_range(struct dev_dax *dev_dax)
 {
 	int i = dev_dax->nr_range - 1;
-	struct range *range = &dev_dax->ranges[i].range;
+	struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+	struct range *range = &dev_range->range;
 	struct dax_region *dax_region = dev_dax->region;
+	struct resource *res = &dax_region->res;
 
 	lockdep_assert_held_write(&dax_region_rwsem);
 	dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
 		(unsigned long long)range->start,
 		(unsigned long long)range->end);
 
-	__release_region(&dax_region->res, range->start, range_len(range));
+	if (dev_range->dax_resource) {
+		res = dev_range->dax_resource->res;
+		dev_dbg(&dev_dax->dev, "Trim sparse extent %pr\n", res);
+	}
+
+	__release_region(res, range->start, range_len(range));
+
+	if (dev_range->dax_resource)
+		dev_range->dax_resource->use_cnt--;
+
 	if (--dev_dax->nr_range == 0) {
 		kfree(dev_dax->ranges);
 		dev_dax->ranges = NULL;
@@ -640,7 +753,7 @@  static void dax_region_unregister(void *region)
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
-		unsigned long flags)
+		unsigned long flags, struct dax_sparse_ops *sparse_ops)
 {
 	struct dax_region *dax_region;
 
@@ -658,12 +771,16 @@  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 			|| !IS_ALIGNED(range_len(range), align))
 		return NULL;
 
+	if (!sparse_ops && (flags & IORESOURCE_DAX_SPARSE_CAP))
+		return NULL;
+
 	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
 	if (!dax_region)
 		return NULL;
 
 	dev_set_drvdata(parent, dax_region);
 	kref_init(&dax_region->kref);
+	dax_region->sparse_ops = sparse_ops;
 	dax_region->id = region_id;
 	dax_region->align = align;
 	dax_region->dev = parent;
@@ -845,7 +962,8 @@  static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
 }
 
 static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
-			       u64 start, resource_size_t size)
+			       u64 start, resource_size_t size,
+			       struct dax_resource *dax_resource)
 {
 	struct device *dev = &dev_dax->dev;
 	struct dev_dax_range *ranges;
@@ -884,6 +1002,7 @@  static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
 			.start = alloc->start,
 			.end = alloc->end,
 		},
+		.dax_resource = dax_resource,
 	};
 
 	dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
@@ -966,7 +1085,8 @@  static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 	int i;
 
 	for (i = dev_dax->nr_range - 1; i >= 0; i--) {
-		struct range *range = &dev_dax->ranges[i].range;
+		struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+		struct range *range = &dev_range->range;
 		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
 		struct resource *adjust = NULL, *res;
 		resource_size_t shrink;
@@ -982,12 +1102,21 @@  static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 			continue;
 		}
 
-		for_each_dax_region_resource(dax_region, res)
-			if (strcmp(res->name, dev_name(dev)) == 0
-					&& res->start == range->start) {
-				adjust = res;
-				break;
-			}
+		if (dev_range->dax_resource) {
+			for_each_child_resource(dev_range->dax_resource->res, res)
+				if (strcmp(res->name, dev_name(dev)) == 0
+						&& res->start == range->start) {
+					adjust = res;
+					break;
+				}
+		} else {
+			for_each_dax_region_resource(dax_region, res)
+				if (strcmp(res->name, dev_name(dev)) == 0
+						&& res->start == range->start) {
+					adjust = res;
+					break;
+				}
+		}
 
 		if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
 					"failed to find matching resource\n"))
@@ -1025,19 +1154,21 @@  static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
 }
 
 /**
- * dev_dax_resize_static - Expand the device into the unused portion of the
- * region. This may involve adjusting the end of an existing resource, or
- * allocating a new resource.
+ * __dev_dax_resize - Expand the device into the unused portion of the region.
+ * This may involve adjusting the end of an existing resource, or allocating a
+ * new resource.
  *
  * @parent: parent resource to allocate this range in
  * @dev_dax: DAX device to be expanded
  * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ * @dax_resource: if sparse; the parent resource
  *
  * Return the amount of space allocated or -ERRNO on failure
  */
-static ssize_t dev_dax_resize_static(struct resource *parent,
-				     struct dev_dax *dev_dax,
-				     resource_size_t to_alloc)
+static ssize_t __dev_dax_resize(struct resource *parent,
+				struct dev_dax *dev_dax,
+				resource_size_t to_alloc,
+				struct dax_resource *dax_resource)
 {
 	struct resource *res, *first;
 	int rc;
@@ -1045,7 +1176,8 @@  static ssize_t dev_dax_resize_static(struct resource *parent,
 	first = parent->child;
 	if (!first) {
 		rc = alloc_dev_dax_range(parent, dev_dax,
-					   parent->start, to_alloc);
+					   parent->start, to_alloc,
+					   dax_resource);
 		if (rc)
 			return rc;
 		return to_alloc;
@@ -1059,7 +1191,8 @@  static ssize_t dev_dax_resize_static(struct resource *parent,
 		if (res == first && res->start > parent->start) {
 			alloc = min(res->start - parent->start, to_alloc);
 			rc = alloc_dev_dax_range(parent, dev_dax,
-						 parent->start, alloc);
+						 parent->start, alloc,
+						 dax_resource);
 			if (rc)
 				return rc;
 			return alloc;
@@ -1083,7 +1216,8 @@  static ssize_t dev_dax_resize_static(struct resource *parent,
 				return rc;
 			return alloc;
 		}
-		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
+					 dax_resource);
 		if (rc)
 			return rc;
 		return alloc;
@@ -1094,6 +1228,54 @@  static ssize_t dev_dax_resize_static(struct resource *parent,
 	return 0;
 }
 
+static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
+				     struct dev_dax *dev_dax,
+				     resource_size_t to_alloc)
+{
+	return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
+}
+
+static int find_free_extent(struct device *dev, void *data)
+{
+	struct dax_region *dax_region = data;
+	struct dax_resource *dax_resource;
+
+	if (!dax_region->sparse_ops->is_extent(dev))
+		return 0;
+
+	dax_resource = dev_get_drvdata(dev);
+	if (!dax_resource || !dax_avail_size(dax_resource->res))
+		return 0;
+	return 1;
+}
+
+static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
+				     struct dev_dax *dev_dax,
+				     resource_size_t to_alloc)
+{
+	struct dax_resource *dax_resource;
+	resource_size_t available_size;
+	struct device *extent_dev;
+	ssize_t alloc;
+
+	extent_dev = device_find_child(dax_region->dev, dax_region,
+				       find_free_extent);
+	if (!extent_dev)
+		return 0;
+
+	dax_resource = dev_get_drvdata(extent_dev);
+	if (!dax_resource)
+		return 0;
+
+	available_size = dax_avail_size(dax_resource->res);
+	to_alloc = min(available_size, to_alloc);
+	alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
+	if (alloc > 0)
+		dax_resource->use_cnt++;
+	put_device(extent_dev);
+	return alloc;
+}
+
 static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		struct dev_dax *dev_dax, resource_size_t size)
 {
@@ -1117,7 +1299,10 @@  static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		return -ENXIO;
 
 retry:
-	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+	if (is_sparse(dax_region))
+		alloc = dev_dax_resize_sparse(dax_region, dev_dax, to_alloc);
+	else
+		alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
 	if (alloc <= 0)
 		return alloc;
 	to_alloc -= alloc;
@@ -1226,7 +1411,7 @@  static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
 	to_alloc = range_len(&r);
 	if (alloc_is_aligned(dev_dax, to_alloc))
 		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
-					 to_alloc);
+					 to_alloc, NULL);
 	up_write(&dax_dev_rwsem);
 	up_write(&dax_region_rwsem);
 
@@ -1494,8 +1679,14 @@  static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 	device_initialize(dev);
 	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
 
+	if (is_sparse(dax_region) && data->size) {
+		dev_err(parent, "Sparse DAX region devices are created initially with 0 size");
+		rc = -EINVAL;
+		goto err_id;
+	}
+
 	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
-				 data->size);
+				 data->size, NULL);
 	if (rc)
 		goto err_range;
 
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 783bfeef42cc..ae5029ea6047 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -9,6 +9,7 @@  struct dev_dax;
 struct resource;
 struct dax_device;
 struct dax_region;
+struct dax_sparse_ops;
 
 /* dax bus specific ioresource flags */
 #define IORESOURCE_DAX_STATIC BIT(0)
@@ -17,7 +18,7 @@  struct dax_region;
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
-		unsigned long flags);
+		unsigned long flags, struct dax_sparse_ops *sparse_ops);
 
 struct dev_dax_data {
 	struct dax_region *dax_region;
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 367e86b1c22a..bf3b82b0120d 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -5,6 +5,60 @@ 
 
 #include "../cxl/cxl.h"
 #include "bus.h"
+#include "dax-private.h"
+
+static int __cxl_dax_add_resource(struct dax_region *dax_region,
+				  struct region_extent *region_extent)
+{
+	resource_size_t start, length;
+	struct device *dev;
+
+	dev = &region_extent->dev;
+	start = dax_region->res.start + region_extent->hpa_range.start;
+	length = range_len(&region_extent->hpa_range);
+	return dax_region_add_resource(dax_region, dev, start, length);
+}
+
+static int cxl_dax_add_resource(struct device *dev, void *data)
+{
+	struct dax_region *dax_region = data;
+	struct region_extent *region_extent;
+
+	region_extent = to_region_extent(dev);
+	if (!region_extent)
+		return 0;
+
+	dev_dbg(dax_region->dev, "Adding resource HPA %par\n",
+		&region_extent->hpa_range);
+
+	return __cxl_dax_add_resource(dax_region, region_extent);
+}
+
+static int cxl_dax_region_notify(struct device *dev,
+				 struct cxl_notify_data *notify_data)
+{
+	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+	struct dax_region *dax_region = dev_get_drvdata(dev);
+	struct region_extent *region_extent = notify_data->region_extent;
+
+	switch (notify_data->event) {
+	case DCD_ADD_CAPACITY:
+		return __cxl_dax_add_resource(dax_region, region_extent);
+	case DCD_RELEASE_CAPACITY:
+		return dax_region_rm_resource(dax_region, &region_extent->dev);
+	case DCD_FORCED_CAPACITY_RELEASE:
+	default:
+		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
+			notify_data->event);
+		break;
+	}
+
+	return -ENXIO;
+}
+
+struct dax_sparse_ops sparse_ops = {
+	.is_extent = is_region_extent,
+};
 
 static int cxl_dax_region_probe(struct device *dev)
 {
@@ -24,14 +78,16 @@  static int cxl_dax_region_probe(struct device *dev)
 		flags |= IORESOURCE_DAX_SPARSE_CAP;
 
 	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
-				      PMD_SIZE, flags);
+				      PMD_SIZE, flags, &sparse_ops);
 	if (!dax_region)
 		return -ENOMEM;
 
-	if (cxlr->mode == CXL_REGION_DC)
+	if (cxlr->mode == CXL_REGION_DC) {
+		device_for_each_child(&cxlr_dax->dev, dax_region,
+				      cxl_dax_add_resource);
 		/* Add empty seed dax device */
 		dev_size = 0;
-	else
+	} else
 		dev_size = range_len(&cxlr_dax->hpa_range);
 
 	data = (struct dev_dax_data) {
@@ -47,6 +103,7 @@  static int cxl_dax_region_probe(struct device *dev)
 static struct cxl_driver cxl_dax_region_driver = {
 	.name = "cxl_dax_region",
 	.probe = cxl_dax_region_probe,
+	.notify = cxl_dax_region_notify,
 	.id = CXL_DEVICE_DAX_REGION,
 	.drv = {
 		.suppress_bind_attrs = true,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index ccde98c3d4e2..9e9f98c85620 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -16,6 +16,36 @@  struct inode *dax_inode(struct dax_device *dax_dev);
 int dax_bus_init(void);
 void dax_bus_exit(void);
 
+/**
+ * struct dax_resource - For sparse regions; an active resource
+ * @region: dax_region this resources is in
+ * @res: resource
+ * @use_cnt: count the number of uses of this resource
+ *
+ * Changes to the dax_reigon and the dax_resources within it are protected by
+ * dax_region_rwsem
+ */
+struct dax_resource {
+	struct dax_region *region;
+	struct resource *res;
+	unsigned int use_cnt;
+};
+int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
+			    resource_size_t start, resource_size_t length);
+int dax_region_rm_resource(struct dax_region *dax_region,
+			   struct device *dev);
+resource_size_t dax_avail_size(struct resource *dax_resource);
+
+typedef int (*match_cb)(struct device *dev, resource_size_t *size_avail);
+
+/**
+ * struct dax_sparse_ops - Operations for sparse regions
+ * @is_extent: return if the device is an extent
+ */
+struct dax_sparse_ops {
+	bool (*is_extent)(struct device *dev);
+};
+
 /**
  * struct dax_region - mapping infrastructure for dax devices
  * @id: kernel-wide unique region for a memory range
@@ -27,6 +57,7 @@  void dax_bus_exit(void);
  * @res: resource tree to track instance allocations
  * @seed: allow userspace to find the first unbound seed device
  * @youngest: allow userspace to find the most recently created device
+ * @sparse_ops: operations required for sparse regions
  */
 struct dax_region {
 	int id;
@@ -38,6 +69,7 @@  struct dax_region {
 	struct resource res;
 	struct device *seed;
 	struct device *youngest;
+	struct dax_sparse_ops *sparse_ops;
 };
 
 struct dax_mapping {
@@ -62,6 +94,7 @@  struct dax_mapping {
  * @pgoff: page offset
  * @range: resource-span
  * @mapping: device to assist in interrogating the range layout
+ * @dax_resource: if not NULL; dax sparse resource containing this range
  */
 struct dev_dax {
 	struct dax_region *region;
@@ -79,6 +112,7 @@  struct dev_dax {
 		unsigned long pgoff;
 		struct range range;
 		struct dax_mapping *mapping;
+		struct dax_resource *dax_resource;
 	} *ranges;
 };
 
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 5e7c53f18491..0eea65052874 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -28,7 +28,7 @@  static int dax_hmem_probe(struct platform_device *pdev)
 
 	mri = dev->platform_data;
 	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
-				      mri->target_node, PMD_SIZE, flags);
+				      mri->target_node, PMD_SIZE, flags, NULL);
 	if (!dax_region)
 		return -ENOMEM;
 
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index c8ebf4e281f2..f927e855f240 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -54,7 +54,7 @@  static struct dev_dax *__dax_pmem_probe(struct device *dev)
 	range.start += offset;
 	dax_region = alloc_dax_region(dev, region_id, &range,
 			nd_region->target_node, le32_to_cpu(pfn_sb->align),
-			IORESOURCE_DAX_STATIC);
+			IORESOURCE_DAX_STATIC, NULL);
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);