Message ID | 20240816-dcd-type2-upstream-v3-18-7c9b96cba6d7@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | DCD: Add support for Dynamic Capacity Devices (DCD) |
On 8/16/24 7:44 AM, ira.weiny@intel.com wrote: > From: Navneet Singh <navneet.singh@intel.com> > > A dynamic capacity device (DCD) sends events to signal the host for > changes in the availability of Dynamic Capacity (DC) memory. These > events contain extents describing a DPA range and meta data for memory > to be added or removed. Events may be sent from the device at any time. > > Three types of events can be signaled, Add, Release, and Force Release. > > On add, the host may accept or reject the memory being offered. If no > region exists, or the extent is invalid, the extent should be rejected. > Add extent events may be grouped by a 'more' bit which indicates those > extents should be processed as a group. > > On remove, the host can delay the response until the host is safely not > using the memory. If no region exists the release can be sent > immediately. The host may also release extents (or partial extents) at > any time. Thus the 'more' bit grouping of release events is of less > value and can be ignored in favor of sending multiple release capacity > responses for groups of release events. > > Force removal is intended as a mechanism between the FM and the device > and intended only when the host is unresponsive, out of sync, or > otherwise broken. Purposely ignore force removal events. > > Regions are made up of one or more devices which may be surfacing memory > to the host. Once all devices in a region have surfaced an extent the > region can expose a corresponding extent for the user to consume. > Without interleaving a device extent forms a 1:1 relationship with the > region extent. Immediately surface a region extent upon getting a > device extent. > > Per the specification the device is allowed to offer or remove extents > at any time. However, anticipated use cases can expect extents to be > offered, accepted, and removed in well defined chunks. > > Simplify extent tracking with the following restrictions. 
> > 1) Flag for removal any extent which overlaps a requested > release range. > 2) Refuse the offer of extents which overlap already accepted > memory ranges. > 3) Accept again a range which has already been accepted by the > host. (It is likely the device has an error because it > should already know that this range was accepted. But from > the host point of view it is safe to acknowledge that > acceptance again.) > > Management of the region extent devices must be synchronized with > potential uses of the memory within the DAX layer. Create region extent > devices as children of the cxl_dax_region device such that the DAX > region driver can co-drive them and synchronize with the DAX layer. > Synchronization and management is handled in a subsequent patch. > > Process DCD events and create region devices. > > Signed-off-by: Navneet Singh <navneet.singh@intel.com> > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> A few nits below, but in general Reviewed-by: Dave Jiang <dave.jiang@intel.com> > > --- > Changes: > [iweiny: combine this with the extent surface patches to better show the > lifetime extent objects in review] > [iweiny: clean up commit message.] > [iweiny: move extent verification of the 'read extents on region > creation' to this patch] > [iweiny: Provide for a common path for extent realization between an add > event and adding existing extents.] > [iweiny: Persist a check that an extent is within an endpoint decoder] > [iweiny: reduce exported and non-static calls] > [iweiny: use %par] > > <Combined comments from the old patches which were addressed> > > [Jonathan: implement the more bit with a simple algorithm which accepts > all extents it can. > Also include the response more bit to prevent payload > overflow] > [Fan: Do not error if a contained extent is added.] 
> [Jonathan: allocate ida after kzalloc] > [iweiny: fix ida resource leak] > [fan/djiang: remove unneeded memset] > [djiang: fix indentation] > [Jonathan: Fix indentation] > [Jonathan/djbw: make tag a uuid] > [djbw: create helper calc_hpa_range() straight away] > [djbw: Allow for multiple cxled_extents per region_extent] > [djbw: s/cxl_ed/cxled] > [djbw: s/cxl_release_ed_extent/cxled_release_extent/] > [djbw: s/reg_ext/region_extent/] > [djbw: s/dc_extent/extent/] > [Gregory/djbw: reject shared extents] > [iweiny: predicate extent.c compile on CONFIG_CXL_REGION] > --- > drivers/cxl/core/Makefile | 2 +- > drivers/cxl/core/core.h | 13 ++ > drivers/cxl/core/extent.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++ > drivers/cxl/core/mbox.c | 268 ++++++++++++++++++++++++++++++++++- > drivers/cxl/core/region.c | 6 + > drivers/cxl/cxl.h | 52 ++++++- > drivers/cxl/cxlmem.h | 26 ++++ > include/linux/cxl-event.h | 32 +++++ > tools/testing/cxl/Kbuild | 3 +- > 9 files changed, 743 insertions(+), 4 deletions(-) > > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile > index 9259bcc6773c..3b812515e725 100644 > --- a/drivers/cxl/core/Makefile > +++ b/drivers/cxl/core/Makefile > @@ -15,4 +15,4 @@ cxl_core-y += hdm.o > cxl_core-y += pmu.o > cxl_core-y += cdat.o > cxl_core-$(CONFIG_TRACING) += trace.o > -cxl_core-$(CONFIG_CXL_REGION) += region.o > +cxl_core-$(CONFIG_CXL_REGION) += region.o extent.o > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h > index 76c4153a9b2c..8dfc97b2e0a4 100644 > --- a/drivers/cxl/core/core.h > +++ b/drivers/cxl/core/core.h > @@ -44,12 +44,24 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa, > u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd, > u64 dpa); > > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent); > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent); > #else > static inline u64 cxl_dpa_to_hpa(struct 
cxl_region *cxlr, > const struct cxl_memdev *cxlmd, u64 dpa) > { > return ULLONG_MAX; > } > +static inline int cxl_add_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + return 0; > +} > +static inline int cxl_rm_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + return 0; > +} > static inline > struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa, > struct cxl_endpoint_decoder **cxled) > @@ -121,5 +133,6 @@ long cxl_pci_get_latency(struct pci_dev *pdev); > int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr, > enum access_coordinate_class access); > bool cxl_need_node_perf_attrs_update(int nid); > +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range); > > #endif /* __CXL_CORE_H__ */ > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c > new file mode 100644 > index 000000000000..34456594cdc3 > --- /dev/null > +++ b/drivers/cxl/core/extent.c > @@ -0,0 +1,345 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Copyright(c) 2024 Intel Corporation. All rights reserved. 
*/ > + > +#include <linux/device.h> > +#include <cxl.h> > + > +#include "core.h" > + > +static void cxled_release_extent(struct cxl_endpoint_decoder *cxled, > + struct cxled_extent *ed_extent) > +{ > + struct cxl_memdev_state *mds = cxled_to_mds(cxled); > + struct device *dev = &cxled->cxld.dev; > + > + dev_dbg(dev, "Remove extent %par (%*phC)\n", &ed_extent->dpa_range, > + CXL_EXTENT_TAG_LEN, ed_extent->tag); > + memdev_release_extent(mds, &ed_extent->dpa_range); > + kfree(ed_extent); > +} > + > +static void free_region_extent(struct region_extent *region_extent) > +{ > + struct cxled_extent *ed_extent; > + unsigned long index; > + > + /* > + * Remove from each endpoint decoder the extent which backs this region > + * extent > + */ > + xa_for_each(&region_extent->decoder_extents, index, ed_extent) > + cxled_release_extent(ed_extent->cxled, ed_extent); > + xa_destroy(&region_extent->decoder_extents); > + ida_free(&region_extent->cxlr_dax->extent_ida, region_extent->dev.id); > + kfree(region_extent); > +} > + > +static void region_extent_release(struct device *dev) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + > + free_region_extent(region_extent); > +} > + > +static const struct device_type region_extent_type = { > + .name = "extent", > + .release = region_extent_release, > +}; > + > +bool is_region_extent(struct device *dev) > +{ > + return dev->type == &region_extent_type; > +} > +EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL); > + > +static void region_extent_unregister(void *ext) > +{ > + struct region_extent *region_extent = ext; > + > + dev_dbg(&region_extent->dev, "DAX region rm extent HPA %par\n", > + &region_extent->hpa_range); > + device_unregister(&region_extent->dev); > +} > + > +static void region_rm_extent(struct region_extent *region_extent) > +{ > + struct device *region_dev = region_extent->dev.parent; > + > + devm_release_action(region_dev, region_extent_unregister, region_extent); > +} > + > +static struct region_extent * > 
+alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range, u8 *tag) > +{ > + int id; > + > + struct region_extent *region_extent __free(kfree) = > + kzalloc(sizeof(*region_extent), GFP_KERNEL); > + if (!region_extent) > + return ERR_PTR(-ENOMEM); > + > + id = ida_alloc(&cxlr_dax->extent_ida, GFP_KERNEL); > + if (id < 0) > + return ERR_PTR(-ENOMEM); > + > + region_extent->hpa_range = *hpa_range; > + region_extent->cxlr_dax = cxlr_dax; > + import_uuid(&region_extent->tag, tag); > + region_extent->dev.id = id; > + xa_init(&region_extent->decoder_extents); > + return no_free_ptr(region_extent); > +} > + > +static int online_region_extent(struct region_extent *region_extent) > +{ > + struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax; > + struct device *dev; > + int rc; > + > + dev = &region_extent->dev; Nit. You can move this up to when you declare 'dev'. > + device_initialize(dev); > + device_set_pm_not_required(dev); > + dev->parent = &cxlr_dax->dev; > + dev->type = &region_extent_type; > + rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id); > + if (rc) > + goto err; > + > + rc = device_add(dev); > + if (rc) > + goto err; > + > + dev_dbg(dev, "region extent HPA %par\n", &region_extent->hpa_range); > + return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister, > + region_extent); > + > +err: > + dev_err(&cxlr_dax->dev, "Failed to initialize region extent HPA %par\n", > + &region_extent->hpa_range); > + > + put_device(dev); > + return rc; > +} > + > +struct match_data { > + struct cxl_endpoint_decoder *cxled; > + struct range *new_range; > +}; > + > +static int match_contains(struct device *dev, void *data) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + struct match_data *md = data; > + struct cxled_extent *entry; > + unsigned long index; > + > + if (!region_extent) > + return 0; > + > + xa_for_each(&region_extent->decoder_extents, index, entry) { > + if (md->cxled == entry->cxled && > + 
range_contains(&entry->dpa_range, md->new_range)) > + return true; > + } > + return false; > +} > + > +static bool extents_contain(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct range *new_range) > +{ > + struct device *extent_device; > + struct match_data md = { > + .cxled = cxled, > + .new_range = new_range, > + }; > + > + extent_device = device_find_child(&cxlr_dax->dev, &md, match_contains); > + if (!extent_device) > + return false; > + > + put_device(extent_device); > + return true; > +} > + > +static int match_overlaps(struct device *dev, void *data) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + struct match_data *md = data; > + struct cxled_extent *entry; > + unsigned long index; > + > + if (!region_extent) > + return 0; > + > + xa_for_each(&region_extent->decoder_extents, index, entry) { > + if (md->cxled == entry->cxled && > + range_overlaps(&entry->dpa_range, md->new_range)) > + return true; > + } > + > + return false; > +} > + > +static bool extents_overlap(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct range *new_range) > +{ > + struct device *extent_device; > + struct match_data md = { > + .cxled = cxled, > + .new_range = new_range, > + }; > + > + extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps); > + if (!extent_device) > + return false; > + > + put_device(extent_device); > + return true; > +} > + > +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled, > + struct cxl_dax_region *cxlr_dax, > + struct range *dpa_range, > + struct range *hpa_range) > +{ > + resource_size_t dpa_offset, hpa; > + > + dpa_offset = dpa_range->start - cxled->dpa_res->start; > + hpa = cxled->cxld.hpa_range.start + dpa_offset; > + > + hpa_range->start = hpa - cxlr_dax->hpa_range.start; > + hpa_range->end = hpa_range->start + range_len(dpa_range) - 1; > +} > + > +static int cxlr_rm_extent(struct device *dev, void *data) > +{ > + struct region_extent 
*region_extent = to_region_extent(dev); > + struct range *region_hpa_range = data; > + > + if (!region_extent) > + return 0; > + > + /* > + * Any extent which 'touches' the released range is removed. > + */ > + if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) { > + dev_dbg(dev, "Remove region extent HPA %par\n", > + &region_extent->hpa_range); > + region_rm_extent(region_extent); > + } > + return 0; > +} > + > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > +{ > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > + struct cxl_endpoint_decoder *cxled; > + struct range hpa_range, dpa_range; > + struct cxl_region *cxlr; > + > + dpa_range = (struct range) { > + .start = start_dpa, > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > + }; > + > + guard(rwsem_read)(&cxl_region_rwsem); > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > + if (!cxlr) { > + memdev_release_extent(mds, &dpa_range); > + return -ENXIO; > + } > + > + calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range); > + > + /* Remove region extents which overlap */ > + return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range, > + cxlr_rm_extent); > +} > + > +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct cxled_extent *ed_extent) > +{ > + struct region_extent *region_extent; > + struct range hpa_range; > + int rc; > + > + calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range); > + > + region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag); > + if (IS_ERR(region_extent)) > + return PTR_ERR(region_extent); > + > + rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent, > + GFP_KERNEL); > + if (rc) { > + free_region_extent(region_extent); > + return rc; > + } > + > + /* device model handles freeing region_extent */ > + return online_region_extent(region_extent); > +} > + > +/* 
Callers are expected to ensure cxled has been attached to a region */ > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > +{ > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > + struct cxl_endpoint_decoder *cxled; > + struct range ed_range, ext_range; > + struct cxl_dax_region *cxlr_dax; > + struct cxled_extent *ed_extent; > + struct cxl_region *cxlr; > + struct device *dev; > + > + ext_range = (struct range) { > + .start = start_dpa, > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > + }; > + > + guard(rwsem_read)(&cxl_region_rwsem); > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > + if (!cxlr) > + return -ENXIO; > + > + cxlr_dax = cxled->cxld.region->cxlr_dax; > + dev = &cxled->cxld.dev; > + ed_range = (struct range) { > + .start = cxled->dpa_res->start, > + .end = cxled->dpa_res->end, > + }; > + > + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n", > + cxled->dpa_res, &ext_range); > + > + if (!range_contains(&ed_range, &ext_range)) { > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) is not fully in ED %par\n", > + &ext_range.start, CXL_EXTENT_TAG_LEN, > + extent->tag, &ed_range); > + return -ENXIO; > + } > + > + if (extents_contain(cxlr_dax, cxled, &ext_range)) > + return 0; > + > + if (extents_overlap(cxlr_dax, cxled, &ext_range)) > + return -ENXIO; > + > + ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL); > + if (!ed_extent) > + return -ENOMEM; > + > + ed_extent->cxled = cxled; > + ed_extent->dpa_range = ext_range; > + memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN); > + > + dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range, > + CXL_EXTENT_TAG_LEN, ed_extent->tag); > + > + return cxlr_add_extent(cxlr_dax, cxled, ed_extent); > +} > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 01a447aaa1b1..f629ad7488ac 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -882,6 
+882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds) > } > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); > > +static int cxl_validate_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + u64 start = le64_to_cpu(extent->start_dpa); > + u64 length = le64_to_cpu(extent->length); > + struct device *dev = mds->cxlds.dev; > + > + struct range ext_range = (struct range){ > + .start = start, > + .end = start + length - 1, > + }; > + > + if (le16_to_cpu(extent->shared_extn_seq) != 0) { > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) can not be shared\n", > + &ext_range.start, CXL_EXTENT_TAG_LEN, > + extent->tag); > + return -ENXIO; > + } > + > + /* Extents must not cross DC region boundary's */ > + for (int i = 0; i < mds->nr_dc_region; i++) { > + struct cxl_dc_region_info *dcr = &mds->dc_region[i]; > + struct range region_range = (struct range) { > + .start = dcr->base, > + .end = dcr->base + dcr->decode_len - 1, > + }; > + > + if (range_contains(&region_range, &ext_range)) { > + dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n", > + &ext_range, i, start - dcr->base, > + CXL_EXTENT_TAG_LEN, extent->tag); > + return 0; > + } > + } > + > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) is not in any DC region\n", > + &ext_range, CXL_EXTENT_TAG_LEN, extent->tag); > + return -ENXIO; > +} > + > void cxl_event_trace_record(const struct cxl_memdev *cxlmd, > enum cxl_event_log_type type, > enum cxl_event_type event_type, > @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds, > return rc; > } > > +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode, > + struct xarray *extent_array, int cnt) > +{ > + struct cxl_mbox_dc_response *p; > + struct cxl_mbox_cmd mbox_cmd; > + struct cxl_extent *extent; > + unsigned long index; > + u32 pl_index; > + int rc = 0; > + > + size_t pl_size = struct_size(p, extent_list, cnt); > + u32 max_extents = cnt; > + > + /* May have to use 
more bit on response. */ > + if (pl_size > mds->payload_size) { > + max_extents = (mds->payload_size - sizeof(*p)) / > + sizeof(struct updated_extent_list); > + pl_size = struct_size(p, extent_list, max_extents); > + } > + > + struct cxl_mbox_dc_response *response __free(kfree) = > + kzalloc(pl_size, GFP_KERNEL); > + if (!response) > + return -ENOMEM; > + > + pl_index = 0; > + xa_for_each(extent_array, index, extent) { > + > + response->extent_list[pl_index].dpa_start = extent->start_dpa; > + response->extent_list[pl_index].length = extent->length; > + pl_index++; > + response->extent_list_size = cpu_to_le32(pl_index); > + > + if (pl_index == max_extents) { > + mbox_cmd = (struct cxl_mbox_cmd) { > + .opcode = opcode, > + .size_in = struct_size(response, extent_list, > + pl_index), > + .payload_in = response, > + }; > + > + response->flags = 0; > + if (pl_index < cnt) > + response->flags &= CXL_DCD_EVENT_MORE; > + > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); > + if (rc) > + return rc; > + pl_index = 0; > + } > + } > + > + if (pl_index) { > + mbox_cmd = (struct cxl_mbox_cmd) { > + .opcode = opcode, > + .size_in = struct_size(response, extent_list, > + pl_index), > + .payload_in = response, > + }; > + > + response->flags = 0; > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); > + } > + > + return rc; > +} > + > +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range) > +{ > + struct device *dev = mds->cxlds.dev; > + struct xarray extent_list; > + > + struct cxl_extent extent = { > + .start_dpa = cpu_to_le64(range->start), > + .length = cpu_to_le64(range_len(range)), > + }; > + > + dev_dbg(dev, "Release response dpa %par\n", range); > + > + xa_init(&extent_list); > + if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) { > + dev_dbg(dev, "Failed to release %par\n", range); > + goto destroy; > + } > + > + if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1)) > + dev_dbg(dev, "Failed to release %par\n", range); > + > +destroy: > + 
xa_destroy(&extent_list); > +} > + > +static int validate_add_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + int rc; > + > + rc = cxl_validate_extent(mds, extent); > + if (rc) > + return rc; > + > + return cxl_add_extent(mds, extent); > +} > + > +static int cxl_add_pending(struct cxl_memdev_state *mds) > +{ > + struct device *dev = mds->cxlds.dev; > + struct cxl_extent *extent; > + unsigned long index; > + unsigned long cnt = 0; reverse xmas tree > + int rc; > + > + xa_for_each(&mds->pending_extents, index, extent) { > + if (validate_add_extent(mds, extent)) { > + dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n", > + le64_to_cpu(extent->start_dpa), > + le64_to_cpu(extent->length)); > + xa_erase(&mds->pending_extents, index); > + kfree(extent); > + continue; > + } > + cnt++; > + } > + rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE, > + &mds->pending_extents, cnt); > + xa_for_each(&mds->pending_extents, index, extent) { > + xa_erase(&mds->pending_extents, index); > + kfree(extent); > + } > + return rc; > +} > + > +static int handle_add_event(struct cxl_memdev_state *mds, > + struct cxl_event_dcd *event) > +{ > + struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL); for readability I would use *extent instead of *tmp > + struct device *dev = mds->cxlds.dev; > + > + if (!tmp) > + return -ENOMEM; > + > + memcpy(tmp, &event->extent, sizeof(*tmp)); > + if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp, > + GFP_KERNEL)) { > + kfree(tmp); > + return -ENOMEM; > + } > + > + if (event->flags & CXL_DCD_EVENT_MORE) { > + dev_dbg(dev, "more bit set; delay the surfacing of extent\n"); > + return 0; > + } > + > + /* extents are removed and free'ed in cxl_add_pending() */ > + return cxl_add_pending(mds); > +} > + > +static char *cxl_dcd_evt_type_str(u8 type) > +{ > + switch (type) { > + case DCD_ADD_CAPACITY: > + return "add"; > + case DCD_RELEASE_CAPACITY: > + return "release"; > + case DCD_FORCED_CAPACITY_RELEASE: 
> + return "force release"; > + default: > + break; > + } > + > + return "<unknown>"; > +} > + > +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds, > + struct cxl_event_record_raw *raw_rec) > +{ > + struct cxl_event_dcd *event = &raw_rec->event.dcd; > + struct cxl_extent *extent = &event->extent; > + struct device *dev = mds->cxlds.dev; > + uuid_t *id = &raw_rec->id; > + > + if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID)) > + return -EINVAL; > + > + dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n", > + cxl_dcd_evt_type_str(event->event_type), > + le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length)); > + > + switch (event->event_type) { > + case DCD_ADD_CAPACITY: > + return handle_add_event(mds, event); > + case DCD_RELEASE_CAPACITY: > + return cxl_rm_extent(mds, &event->extent); > + case DCD_FORCED_CAPACITY_RELEASE: > + dev_err_ratelimited(dev, "Forced release event ignored.\n"); > + return 0; > + default: > + return -EINVAL; > + } > +} > + > static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > enum cxl_event_log_type type) > { > @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > if (!nr_rec) > break; > > - for (i = 0; i < nr_rec; i++) > + for (i = 0; i < nr_rec; i++) { > __cxl_event_trace_record(cxlmd, type, > &payload->records[i]); > + if (type == CXL_EVENT_TYPE_DCD) { > + rc = cxl_handle_dcd_event_records(mds, > + &payload->records[i]); > + if (rc) > + dev_err_ratelimited(dev, "dcd event failed: %d\n", > + rc); > + } > + } > > if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW) > trace_cxl_overflow(cxlmd, type, payload); > @@ -1078,6 +1329,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status) > { > dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status); > > + if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD)) > + cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD); > if (status & CXLDEV_EVENT_STATUS_FATAL) > cxl_mem_get_records_log(mds, 
CXL_EVENT_TYPE_FATAL); > if (status & CXLDEV_EVENT_STATUS_FAIL) > @@ -1610,6 +1863,17 @@ int cxl_poison_state_init(struct cxl_memdev_state *mds) > } > EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL); > > +static void clear_pending_extents(void *_mds) > +{ > + struct cxl_memdev_state *mds = _mds; > + struct cxl_extent *extent; > + unsigned long index; > + > + xa_for_each(&mds->pending_extents, index, extent) > + kfree(extent); > + xa_destroy(&mds->pending_extents); > +} > + > struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > { > struct cxl_memdev_state *mds; > @@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > mds->cxlds.type = CXL_DEVTYPE_CLASSMEM; > mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID; > mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID; > + xa_init(&mds->pending_extents); > + devm_add_action_or_reset(dev, clear_pending_extents, mds); > > return mds; > } > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c > index 8e0884b52f84..8c9171f914fb 100644 > --- a/drivers/cxl/core/region.c > +++ b/drivers/cxl/core/region.c > @@ -3037,6 +3037,7 @@ static void cxl_dax_region_release(struct device *dev) > { > struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev); > > + ida_destroy(&cxlr_dax->extent_ida); > kfree(cxlr_dax); > } > > @@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr) > > dev = &cxlr_dax->dev; > cxlr_dax->cxlr = cxlr; > + cxlr->cxlr_dax = cxlr_dax; > + ida_init(&cxlr_dax->extent_ida); > device_initialize(dev); > lockdep_set_class(&dev->mutex, &cxl_dax_region_key); > device_set_pm_not_required(dev); > @@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr) > static void cxlr_dax_unregister(void *_cxlr_dax) > { > struct cxl_dax_region *cxlr_dax = _cxlr_dax; > + struct cxl_region *cxlr = cxlr_dax->cxlr; > > + cxlr->cxlr_dax = NULL; > + cxlr_dax->cxlr = NULL; > device_unregister(&cxlr_dax->dev); > } > 
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 16861c867537..c858e3957fd5 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -11,6 +11,7 @@ > #include <linux/log2.h> > #include <linux/node.h> > #include <linux/io.h> > +#include <linux/cxl-event.h> > > extern const struct nvdimm_security_ops *cxl_security_ops; > > @@ -169,11 +170,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw) > #define CXLDEV_EVENT_STATUS_WARN BIT(1) > #define CXLDEV_EVENT_STATUS_FAIL BIT(2) > #define CXLDEV_EVENT_STATUS_FATAL BIT(3) > +#define CXLDEV_EVENT_STATUS_DCD BIT(4) > > #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \ > CXLDEV_EVENT_STATUS_WARN | \ > CXLDEV_EVENT_STATUS_FAIL | \ > - CXLDEV_EVENT_STATUS_FATAL) > + CXLDEV_EVENT_STATUS_FATAL | \ > + CXLDEV_EVENT_STATUS_DCD) > > /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */ > #define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0) > @@ -444,6 +447,18 @@ enum cxl_decoder_state { > CXL_DECODER_STATE_AUTO, > }; > > +/** > + * struct cxled_extent - Extent within an endpoint decoder > + * @cxled: Reference to the endpoint decoder > + * @dpa_range: DPA range this extent covers within the decoder > + * @tag: Tag from device for this extent > + */ > +struct cxled_extent { > + struct cxl_endpoint_decoder *cxled; > + struct range dpa_range; > + u8 tag[CXL_EXTENT_TAG_LEN]; > +}; > + > /** > * struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder > * @cxld: base cxl_decoder_object > @@ -569,6 +584,7 @@ struct cxl_region_params { > * @type: Endpoint decoder target type > * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown > * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge > + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge > * @flags: Region state flags > * @params: active + config params for the region > * @coord: QoS access coordinates for the region > @@ -582,6 +598,7 @@ struct cxl_region { > enum cxl_decoder_type type; > struct cxl_nvdimm_bridge 
*cxl_nvb; > struct cxl_pmem_region *cxlr_pmem; > + struct cxl_dax_region *cxlr_dax; > unsigned long flags; > struct cxl_region_params params; > struct access_coordinate coord[ACCESS_COORDINATE_MAX]; > @@ -622,12 +639,45 @@ struct cxl_pmem_region { > struct cxl_pmem_region_mapping mapping[]; > }; > > +/* See CXL 3.0 8.2.9.2.1.5 */ > +enum dc_event { > + DCD_ADD_CAPACITY, > + DCD_RELEASE_CAPACITY, > + DCD_FORCED_CAPACITY_RELEASE, > + DCD_REGION_CONFIGURATION_UPDATED, > +}; > + > struct cxl_dax_region { > struct device dev; > struct cxl_region *cxlr; > struct range hpa_range; > + struct ida extent_ida; > }; > > +/** > + * struct region_extent - CXL DAX region extent > + * @dev: device representing this extent > + * @cxlr_dax: back reference to parent region device > + * @hpa_range: HPA range of this extent > + * @tag: tag of the extent > + * @decoder_extents: Endpoint decoder extents which make up this region extent > + */ > +struct region_extent { > + struct device dev; > + struct cxl_dax_region *cxlr_dax; > + struct range hpa_range; > + uuid_t tag; > + struct xarray decoder_extents; > +}; > + > +bool is_region_extent(struct device *dev); > +static inline struct region_extent *to_region_extent(struct device *dev) > +{ > + if (!is_region_extent(dev)) > + return NULL; > + return container_of(dev, struct region_extent, dev); > +} > + > /** > * struct cxl_port - logical collection of upstream port devices and > * downstream port devices to construct a CXL memory > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h > index d41bec5433db..3a40fe1f0be7 100644 > --- a/drivers/cxl/cxlmem.h > +++ b/drivers/cxl/cxlmem.h > @@ -497,6 +497,7 @@ struct cxl_dc_region_info { > * @pmem_perf: performance data entry matched to PMEM partition > * @nr_dc_region: number of DC regions implemented in the memory device > * @dc_region: array containing info about the DC regions > + * @pending_extents: array of extents pending during more bit processing > * @event: event log driver state > 
* @poison: poison driver state info > * @security: security driver state info > @@ -532,6 +533,7 @@ struct cxl_memdev_state { > > u8 nr_dc_region; > struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION]; > + struct xarray pending_extents; > > struct cxl_event_state event; > struct cxl_poison_state poison; > @@ -607,6 +609,21 @@ enum cxl_opcode { > UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \ > 0x40, 0x3d, 0x86) > > +/* > + * Add Dynamic Capacity Response > + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169 > + */ > +struct cxl_mbox_dc_response { > + __le32 extent_list_size; > + u8 flags; > + u8 reserved[3]; > + struct updated_extent_list { > + __le64 dpa_start; > + __le64 length; > + u8 reserved[8]; > + } __packed extent_list[]; > +} __packed; > + > struct cxl_mbox_get_supported_logs { > __le16 entries; > u8 rsvd[6]; > @@ -669,6 +686,14 @@ struct cxl_mbox_identify { > UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \ > 0x13, 0xb7, 0x74) > > +/* > + * Dynamic Capacity Event Record > + * CXL rev 3.1 section 8.2.9.2.1; Table 8-43 > + */ > +#define CXL_EVENT_DC_EVENT_UUID \ > + UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \ > + 0x10, 0x1a, 0x2a) > + > /* > * Get Event Records output payload > * CXL rev 3.0 section 8.2.9.2.2; Table 8-50 > @@ -694,6 +719,7 @@ enum cxl_event_log_type { > CXL_EVENT_TYPE_WARN, > CXL_EVENT_TYPE_FAIL, > CXL_EVENT_TYPE_FATAL, > + CXL_EVENT_TYPE_DCD, > CXL_EVENT_TYPE_MAX > }; > > diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h > index 0bea1afbd747..eeda8059d81a 100644 > --- a/include/linux/cxl-event.h > +++ b/include/linux/cxl-event.h > @@ -96,11 +96,43 @@ struct cxl_event_mem_module { > u8 reserved[0x3d]; Previous code, but 61 would be better than 0x3d to be consistent with rest of cxl code > } __packed; > > +/* > + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51 > + */ > +#define CXL_EXTENT_TAG_LEN 0x10 > +struct cxl_extent { > + __le64 
start_dpa; > + __le64 length; > + u8 tag[CXL_EXTENT_TAG_LEN]; > + __le16 shared_extn_seq; > + u8 reserved[0x6]; Why not just 6? In general I find it odd that this header uses hex for array indexing when the rest of the cxl code uses decimal. > +} __packed; > + > +/* > + * Dynamic Capacity Event Record > + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50 > + */ > +#define CXL_DCD_EVENT_MORE BIT(0) > +struct cxl_event_dcd { > + struct cxl_event_record_hdr hdr; > + u8 event_type; > + u8 validity_flags; > + __le16 host_id; > + u8 region_index; > + u8 flags; > + u8 reserved1[0x2]; also here, 2? > + struct cxl_extent extent; > + u8 reserved2[0x18]; 24? > + __le32 num_avail_extents; > + __le32 num_avail_tags; > +} __packed; > + > union cxl_event { > struct cxl_event_generic generic; > struct cxl_event_gen_media gen_media; > struct cxl_event_dram dram; > struct cxl_event_mem_module mem_module; > + struct cxl_event_dcd dcd; > /* dram & gen_media event header */ > struct cxl_event_media_hdr media_hdr; > } __packed; > diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild > index 030b388800f0..8238588fffdf 100644 > --- a/tools/testing/cxl/Kbuild > +++ b/tools/testing/cxl/Kbuild > @@ -61,7 +61,8 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o > cxl_core-y += $(CXL_CORE_SRC)/pmu.o > cxl_core-y += $(CXL_CORE_SRC)/cdat.o > cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o > -cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o > +cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \ > + $(CXL_CORE_SRC)/extent.o > cxl_core-y += config_check.o > cxl_core-y += cxl_core_test.o > cxl_core-y += cxl_core_exports.o >
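For readers following the `cxl_mbox_dc_response` layout quoted in this patch, here is a minimal userspace sketch of how a list of accepted extents might be split across several mailbox payloads, with a 'more' flag set on every payload except the last. It is not the patch's code. The names `dc_resp_hdr`, `dc_resp_entry`, `dc_chunk`, `max_extents_per_payload()` and `plan_dc_response()` are hypothetical, and `DC_RESP_MORE` is assumed to be bit 0 because the patch reuses `CXL_DCD_EVENT_MORE` (`BIT(0)`) for the response flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the response 'more' flag is bit 0, as the patch reuses BIT(0). */
#define DC_RESP_MORE 0x01u

/* Fixed part of the response, mirroring the fields of cxl_mbox_dc_response. */
struct dc_resp_hdr {
	uint32_t extent_list_size;
	uint8_t flags;
	uint8_t reserved[3];
} __attribute__((packed));

/* One entry of the updated extent list. */
struct dc_resp_entry {
	uint64_t dpa_start;
	uint64_t length;
	uint8_t reserved[8];
} __attribute__((packed));

/* Hypothetical description of one mailbox command to send. */
struct dc_chunk {
	uint32_t nr_extents;
	uint8_t flags;
};

/* Number of list entries that fit in one payload after the fixed header. */
static size_t max_extents_per_payload(size_t payload_size)
{
	if (payload_size < sizeof(struct dc_resp_hdr))
		return 0;
	return (payload_size - sizeof(struct dc_resp_hdr)) /
	       sizeof(struct dc_resp_entry);
}

/*
 * Split cnt extents into payload-sized chunks. Every chunk that is followed
 * by another one gets DC_RESP_MORE set with a bitwise OR style assignment.
 * Returns the number of chunks written to out (at most out_len).
 * cnt == 0 yields no chunks; whether an empty response should still be sent
 * is not addressed by this sketch.
 */
static size_t plan_dc_response(size_t payload_size, size_t cnt,
			       struct dc_chunk *out, size_t out_len)
{
	size_t max = max_extents_per_payload(payload_size);
	size_t n = 0;

	if (max == 0)
		return 0;

	while (cnt > 0 && n < out_len) {
		size_t take = cnt < max ? cnt : max;

		cnt -= take;
		out[n].nr_extents = (uint32_t)take;
		out[n].flags = cnt ? DC_RESP_MORE : 0;
		n++;
	}
	return n;
}
```

Planning the chunks up front keeps the "is there more to come" decision separate from the loop that fills the payload, which is one way to avoid tying the flag to a per-payload index that is reset after each send.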
Dave Jiang wrote:
>
>
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> >

[snip]

> >
> > Process DCD events and create region devices.
> >
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> A few nits below, but in general
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Thanks.

> > +
> > +static int online_region_extent(struct region_extent *region_extent)
> > +{
> > +	struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax;
> > +	struct device *dev;
> > +	int rc;
> > +
> > +	dev = &region_extent->dev;
>
> Nit. You can move this up to when you declare 'dev'.

[done.]

[snip]

> > +
> > +static int cxl_add_pending(struct cxl_memdev_state *mds)
> > +{
> > +	struct device *dev = mds->cxlds.dev;
> > +	struct cxl_extent *extent;
> > +	unsigned long index;
> > +	unsigned long cnt = 0;
> reverse xmas tree

yep. [done.]

[snip]

> > +
> > +static int handle_add_event(struct cxl_memdev_state *mds,
> > +			    struct cxl_event_dcd *event)
> > +{
> > +	struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
> for readability I would use *extent instead of *tmp

sure. [done.]

[snip]

> >
> > diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h
> > index 0bea1afbd747..eeda8059d81a 100644
> > --- a/include/linux/cxl-event.h
> > +++ b/include/linux/cxl-event.h
> > @@ -96,11 +96,43 @@ struct cxl_event_mem_module {
> >  	u8 reserved[0x3d];
> Previous code, but 61 would be better than 0x3d to be consistent with rest of cxl code

:-(

I get the rest of the code argument. However, the specification uses hex for
the number of bytes in the definitions. For this reason I prefer the use of
hex here so that one can better match the code to the spec.

>
> >  } __packed;
> >
> > +/*
> > + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> > + */
> > +#define CXL_EXTENT_TAG_LEN 0x10
> > +struct cxl_extent {
> > +	__le64 start_dpa;
> > +	__le64 length;
> > +	u8 tag[CXL_EXTENT_TAG_LEN];
> > +	__le16 shared_extn_seq;
> > +	u8 reserved[0x6];
>
> Why not just 6? In general I find it odd that this header uses hex for
> array indexing when the rest of the cxl code uses decimal.

I was just directly matching the spec.

>
> > +} __packed;
> > +
> > +/*
> > + * Dynamic Capacity Event Record
> > + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
> > + */
> > +#define CXL_DCD_EVENT_MORE BIT(0)
> > +struct cxl_event_dcd {
> > +	struct cxl_event_record_hdr hdr;
> > +	u8 event_type;
> > +	u8 validity_flags;
> > +	__le16 host_id;
> > +	u8 region_index;
> > +	u8 flags;
> > +	u8 reserved1[0x2];
>
> also here, 2?

Same... I know it is odd when the hex string == the decimal string.

>
> > +	struct cxl_extent extent;
> > +	u8 reserved2[0x18];
>
> 24?

same.

Ira

[snip]
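As an aside on the hex-versus-decimal point, the argument for matching the spec can be made mechanical with compile-time checks. The sketch below is a userspace approximation only: `extent_sketch` is a hypothetical stand-in for `struct cxl_extent` using `stdint.h` types instead of `__le64`/`__le16`, and the offsets are derived from the field sizes in the quoted struct, not copied from the spec table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXTENT_TAG_LEN 0x10

/* Hypothetical userspace mirror of the quoted struct cxl_extent. */
struct extent_sketch {
	uint64_t start_dpa;              /* byte offset 0x00 */
	uint64_t length;                 /* byte offset 0x08 */
	uint8_t tag[EXTENT_TAG_LEN];     /* byte offset 0x10 */
	uint16_t shared_extn_seq;        /* byte offset 0x20 */
	uint8_t reserved[0x6];           /* byte offset 0x22 */
} __attribute__((packed));

/*
 * Hex constants here can be compared directly against a table that lists
 * byte offsets and lengths in hex. A mismatch fails the build.
 */
static_assert(offsetof(struct extent_sketch, length) == 0x08, "length offset");
static_assert(offsetof(struct extent_sketch, tag) == 0x10, "tag offset");
static_assert(offsetof(struct extent_sketch, shared_extn_seq) == 0x20,
	      "shared_extn_seq offset");
static_assert(offsetof(struct extent_sketch, reserved) == 0x22,
	      "reserved offset");
static_assert(sizeof(struct extent_sketch) == 0x28, "total size");
```

Whether the array lengths themselves are spelled `0x6` or `6` does not change the layout; checks like these only make a transcription slip visible at build time, whichever spelling is chosen.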
On Fri, Aug 16, 2024 at 09:44:26AM -0500, ira.weiny@intel.com wrote: > From: Navneet Singh <navneet.singh@intel.com> > > A dynamic capacity device (DCD) sends events to signal the host for > changes in the availability of Dynamic Capacity (DC) memory. These > events contain extents describing a DPA range and meta data for memory > to be added or removed. Events may be sent from the device at any time. > > Three types of events can be signaled, Add, Release, and Force Release. > > On add, the host may accept or reject the memory being offered. If no > region exists, or the extent is invalid, the extent should be rejected. > Add extent events may be grouped by a 'more' bit which indicates those > extents should be processed as a group. > > On remove, the host can delay the response until the host is safely not > using the memory. If no region exists the release can be sent > immediately. The host may also release extents (or partial extents) at > any time. Thus the 'more' bit grouping of release events is of less > value and can be ignored in favor of sending multiple release capacity > responses for groups of release events. > > Force removal is intended as a mechanism between the FM and the device > and intended only when the host is unresponsive, out of sync, or > otherwise broken. Purposely ignore force removal events. > > Regions are made up of one or more devices which may be surfacing memory > to the host. Once all devices in a region have surfaced an extent the > region can expose a corresponding extent for the user to consume. > Without interleaving a device extent forms a 1:1 relationship with the > region extent. Immediately surface a region extent upon getting a > device extent. > > Per the specification the device is allowed to offer or remove extents > at any time. However, anticipated use cases can expect extents to be > offered, accepted, and removed in well defined chunks. > > Simplify extent tracking with the following restrictions. 
> > 1) Flag for removal any extent which overlaps a requested > release range. > 2) Refuse the offer of extents which overlap already accepted > memory ranges. > 3) Accept again a range which has already been accepted by the > host. (It is likely the device has an error because it > should already know that this range was accepted. But from > the host point of view it is safe to acknowledge that > acceptance again.) > > Management of the region extent devices must be synchronized with > potential uses of the memory within the DAX layer. Create region extent > devices as children of the cxl_dax_region device such that the DAX > region driver can co-drive them and synchronize with the DAX layer. > Synchronization and management is handled in a subsequent patch. > > Process DCD events and create region devices. > > Signed-off-by: Navneet Singh <navneet.singh@intel.com> > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > One minor change inline. > --- > Changes: > [iweiny: combine this with the extent surface patches to better show the > lifetime extent objects in review] > [iweiny: clean up commit message.] > [iweiny: move extent verification of the 'read extents on region > creation' to this patch] > [iweiny: Provide for a common path for extent realization between an add > event and adding existing extents.] > [iweiny: Persist a check that an extent is within an endpoint decoder] > [iweiny: reduce exported and non-static calls] > [iweiny: use %par] > > <Combined comments from the old patches which were addressed> > > [Jonathan: implement the more bit with a simple algorithm which accepts > all extents it can. > Also include the response more bit to prevent payload > overflow] > [Fan: Do not error if a contained extent is added.] 
> [Jonathan: allocate ida after kzalloc] > [iweiny: fix ida resource leak] > [fan/djiang: remove unneeded memset] > [djiang: fix indentation] > [Jonathan: Fix indentation] > [Jonathan/djbw: make tag a uuid] > [djbw: create helper calc_hpa_range() straight away] > [djbw: Allow for multiple cxled_extents per region_extent] > [djbw: s/cxl_ed/cxled] > [djbw: s/cxl_release_ed_extent/cxled_release_extent/] > [djbw: s/reg_ext/region_extent/] > [djbw: s/dc_extent/extent/] > [Gregory/djbw: reject shared extents] > [iweiny: predicate extent.c compile on CONFIG_CXL_REGION] > --- > drivers/cxl/core/Makefile | 2 +- > drivers/cxl/core/core.h | 13 ++ > drivers/cxl/core/extent.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++ > drivers/cxl/core/mbox.c | 268 ++++++++++++++++++++++++++++++++++- > drivers/cxl/core/region.c | 6 + > drivers/cxl/cxl.h | 52 ++++++- > drivers/cxl/cxlmem.h | 26 ++++ > include/linux/cxl-event.h | 32 +++++ > tools/testing/cxl/Kbuild | 3 +- > 9 files changed, 743 insertions(+), 4 deletions(-) > > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile > index 9259bcc6773c..3b812515e725 100644 > --- a/drivers/cxl/core/Makefile > +++ b/drivers/cxl/core/Makefile > @@ -15,4 +15,4 @@ cxl_core-y += hdm.o > cxl_core-y += pmu.o > cxl_core-y += cdat.o > cxl_core-$(CONFIG_TRACING) += trace.o > -cxl_core-$(CONFIG_CXL_REGION) += region.o > +cxl_core-$(CONFIG_CXL_REGION) += region.o extent.o > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h > index 76c4153a9b2c..8dfc97b2e0a4 100644 > --- a/drivers/cxl/core/core.h > +++ b/drivers/cxl/core/core.h > @@ -44,12 +44,24 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa, > u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd, > u64 dpa); > > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent); > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent); > #else > static inline u64 cxl_dpa_to_hpa(struct 
cxl_region *cxlr, > const struct cxl_memdev *cxlmd, u64 dpa) > { > return ULLONG_MAX; > } > +static inline int cxl_add_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + return 0; > +} > +static inline int cxl_rm_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + return 0; > +} > static inline > struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa, > struct cxl_endpoint_decoder **cxled) > @@ -121,5 +133,6 @@ long cxl_pci_get_latency(struct pci_dev *pdev); > int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr, > enum access_coordinate_class access); > bool cxl_need_node_perf_attrs_update(int nid); > +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range); > > #endif /* __CXL_CORE_H__ */ > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c > new file mode 100644 > index 000000000000..34456594cdc3 > --- /dev/null > +++ b/drivers/cxl/core/extent.c > @@ -0,0 +1,345 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Copyright(c) 2024 Intel Corporation. All rights reserved. 
*/ > + > +#include <linux/device.h> > +#include <cxl.h> > + > +#include "core.h" > + > +static void cxled_release_extent(struct cxl_endpoint_decoder *cxled, > + struct cxled_extent *ed_extent) > +{ > + struct cxl_memdev_state *mds = cxled_to_mds(cxled); > + struct device *dev = &cxled->cxld.dev; > + > + dev_dbg(dev, "Remove extent %par (%*phC)\n", &ed_extent->dpa_range, > + CXL_EXTENT_TAG_LEN, ed_extent->tag); > + memdev_release_extent(mds, &ed_extent->dpa_range); > + kfree(ed_extent); > +} > + > +static void free_region_extent(struct region_extent *region_extent) > +{ > + struct cxled_extent *ed_extent; > + unsigned long index; > + > + /* > + * Remove from each endpoint decoder the extent which backs this region > + * extent > + */ > + xa_for_each(&region_extent->decoder_extents, index, ed_extent) > + cxled_release_extent(ed_extent->cxled, ed_extent); > + xa_destroy(&region_extent->decoder_extents); > + ida_free(&region_extent->cxlr_dax->extent_ida, region_extent->dev.id); > + kfree(region_extent); > +} > + > +static void region_extent_release(struct device *dev) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + > + free_region_extent(region_extent); > +} > + > +static const struct device_type region_extent_type = { > + .name = "extent", > + .release = region_extent_release, > +}; > + > +bool is_region_extent(struct device *dev) > +{ > + return dev->type == &region_extent_type; > +} > +EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL); > + > +static void region_extent_unregister(void *ext) > +{ > + struct region_extent *region_extent = ext; > + > + dev_dbg(&region_extent->dev, "DAX region rm extent HPA %par\n", > + &region_extent->hpa_range); > + device_unregister(&region_extent->dev); > +} > + > +static void region_rm_extent(struct region_extent *region_extent) > +{ > + struct device *region_dev = region_extent->dev.parent; > + > + devm_release_action(region_dev, region_extent_unregister, region_extent); > +} > + > +static struct region_extent * >
+alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range, u8 *tag) > +{ > + int id; > + > + struct region_extent *region_extent __free(kfree) = > + kzalloc(sizeof(*region_extent), GFP_KERNEL); > + if (!region_extent) > + return ERR_PTR(-ENOMEM); > + > + id = ida_alloc(&cxlr_dax->extent_ida, GFP_KERNEL); > + if (id < 0) > + return ERR_PTR(-ENOMEM); > + > + region_extent->hpa_range = *hpa_range; > + region_extent->cxlr_dax = cxlr_dax; > + import_uuid(&region_extent->tag, tag); > + region_extent->dev.id = id; > + xa_init(&region_extent->decoder_extents); > + return no_free_ptr(region_extent); > +} > + > +static int online_region_extent(struct region_extent *region_extent) > +{ > + struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax; > + struct device *dev; > + int rc; > + > + dev = &region_extent->dev; > + device_initialize(dev); > + device_set_pm_not_required(dev); > + dev->parent = &cxlr_dax->dev; > + dev->type = &region_extent_type; > + rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id); > + if (rc) > + goto err; > + > + rc = device_add(dev); > + if (rc) > + goto err; > + > + dev_dbg(dev, "region extent HPA %par\n", &region_extent->hpa_range); > + return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister, > + region_extent); > + > +err: > + dev_err(&cxlr_dax->dev, "Failed to initialize region extent HPA %par\n", > + &region_extent->hpa_range); > + > + put_device(dev); > + return rc; > +} > + > +struct match_data { > + struct cxl_endpoint_decoder *cxled; > + struct range *new_range; > +}; > + > +static int match_contains(struct device *dev, void *data) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + struct match_data *md = data; > + struct cxled_extent *entry; > + unsigned long index; > + > + if (!region_extent) > + return 0; > + > + xa_for_each(&region_extent->decoder_extents, index, entry) { > + if (md->cxled == entry->cxled && > + range_contains(&entry->dpa_range, md->new_range)) > + return true; > +
} > + return false; > +} > + > +static bool extents_contain(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct range *new_range) > +{ > + struct device *extent_device; > + struct match_data md = { > + .cxled = cxled, > + .new_range = new_range, > + }; > + > + extent_device = device_find_child(&cxlr_dax->dev, &md, match_contains); > + if (!extent_device) > + return false; > + > + put_device(extent_device); > + return true; > +} > + > +static int match_overlaps(struct device *dev, void *data) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + struct match_data *md = data; > + struct cxled_extent *entry; > + unsigned long index; > + > + if (!region_extent) > + return 0; > + > + xa_for_each(&region_extent->decoder_extents, index, entry) { > + if (md->cxled == entry->cxled && > + range_overlaps(&entry->dpa_range, md->new_range)) > + return true; > + } > + > + return false; > +} > + > +static bool extents_overlap(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct range *new_range) > +{ > + struct device *extent_device; > + struct match_data md = { > + .cxled = cxled, > + .new_range = new_range, > + }; > + > + extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps); > + if (!extent_device) > + return false; > + > + put_device(extent_device); > + return true; > +} > + > +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled, > + struct cxl_dax_region *cxlr_dax, > + struct range *dpa_range, > + struct range *hpa_range) > +{ > + resource_size_t dpa_offset, hpa; > + > + dpa_offset = dpa_range->start - cxled->dpa_res->start; > + hpa = cxled->cxld.hpa_range.start + dpa_offset; > + > + hpa_range->start = hpa - cxlr_dax->hpa_range.start; > + hpa_range->end = hpa_range->start + range_len(dpa_range) - 1; > +} > + > +static int cxlr_rm_extent(struct device *dev, void *data) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + struct range
*region_hpa_range = data; > + > + if (!region_extent) > + return 0; > + > + /* > + * Any extent which 'touches' the released range is removed. > + */ > + if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) { > + dev_dbg(dev, "Remove region extent HPA %par\n", > + &region_extent->hpa_range); > + region_rm_extent(region_extent); > + } > + return 0; > +} > + > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > +{ > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > + struct cxl_endpoint_decoder *cxled; > + struct range hpa_range, dpa_range; > + struct cxl_region *cxlr; > + > + dpa_range = (struct range) { > + .start = start_dpa, > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > + }; > + > + guard(rwsem_read)(&cxl_region_rwsem); > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > + if (!cxlr) { > + memdev_release_extent(mds, &dpa_range); > + return -ENXIO; > + } > + > + calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range); > + > + /* Remove region extents which overlap */ > + return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range, > + cxlr_rm_extent); > +} > + > +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct cxled_extent *ed_extent) > +{ > + struct region_extent *region_extent; > + struct range hpa_range; > + int rc; > + > + calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range); > + > + region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag); > + if (IS_ERR(region_extent)) > + return PTR_ERR(region_extent); > + > + rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent, > + GFP_KERNEL); > + if (rc) { > + free_region_extent(region_extent); > + return rc; > + } > + > + /* device model handles freeing region_extent */ > + return online_region_extent(region_extent); > +} > + > +/* Callers are expected to ensure cxled has been attached to a
region */ > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > +{ > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > + struct cxl_endpoint_decoder *cxled; > + struct range ed_range, ext_range; > + struct cxl_dax_region *cxlr_dax; > + struct cxled_extent *ed_extent; > + struct cxl_region *cxlr; > + struct device *dev; > + > + ext_range = (struct range) { > + .start = start_dpa, > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > + }; > + > + guard(rwsem_read)(&cxl_region_rwsem); > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > + if (!cxlr) > + return -ENXIO; > + > + cxlr_dax = cxled->cxld.region->cxlr_dax; > + dev = &cxled->cxld.dev; > + ed_range = (struct range) { > + .start = cxled->dpa_res->start, > + .end = cxled->dpa_res->end, > + }; > + > + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n", > + cxled->dpa_res, &ext_range); > + > + if (!range_contains(&ed_range, &ext_range)) { > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) is not fully in ED %par\n", > + &ext_range.start, CXL_EXTENT_TAG_LEN, > + extent->tag, &ed_range); > + return -ENXIO; > + } > + > + if (extents_contain(cxlr_dax, cxled, &ext_range)) > + return 0; > + > + if (extents_overlap(cxlr_dax, cxled, &ext_range)) > + return -ENXIO; > + > + ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL); > + if (!ed_extent) > + return -ENOMEM; > + > + ed_extent->cxled = cxled; > + ed_extent->dpa_range = ext_range; > + memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN); > + > + dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range, > + CXL_EXTENT_TAG_LEN, ed_extent->tag); > + > + return cxlr_add_extent(cxlr_dax, cxled, ed_extent); > +} > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 01a447aaa1b1..f629ad7488ac 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds) > 
} > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); > > +static int cxl_validate_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + u64 start = le64_to_cpu(extent->start_dpa); > + u64 length = le64_to_cpu(extent->length); > + struct device *dev = mds->cxlds.dev; > + > + struct range ext_range = (struct range){ > + .start = start, > + .end = start + length - 1, > + }; > + > + if (le16_to_cpu(extent->shared_extn_seq) != 0) { > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) can not be shared\n", > + &ext_range.start, CXL_EXTENT_TAG_LEN, > + extent->tag); > + return -ENXIO; > + } > + > + /* Extents must not cross DC region boundary's */ > + for (int i = 0; i < mds->nr_dc_region; i++) { > + struct cxl_dc_region_info *dcr = &mds->dc_region[i]; > + struct range region_range = (struct range) { > + .start = dcr->base, > + .end = dcr->base + dcr->decode_len - 1, > + }; > + > + if (range_contains(&region_range, &ext_range)) { > + dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n", > + &ext_range, i, start - dcr->base, > + CXL_EXTENT_TAG_LEN, extent->tag); > + return 0; > + } > + } > + > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) is not in any DC region\n", > + &ext_range, CXL_EXTENT_TAG_LEN, extent->tag); > + return -ENXIO; > +} > + > void cxl_event_trace_record(const struct cxl_memdev *cxlmd, > enum cxl_event_log_type type, > enum cxl_event_type event_type, > @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds, > return rc; > } > > +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode, > + struct xarray *extent_array, int cnt) > +{ > + struct cxl_mbox_dc_response *p; > + struct cxl_mbox_cmd mbox_cmd; > + struct cxl_extent *extent; > + unsigned long index; > + u32 pl_index; > + int rc = 0; > + > + size_t pl_size = struct_size(p, extent_list, cnt); > + u32 max_extents = cnt; > + > + /* May have to use more bit on response.
*/ > + if (pl_size > mds->payload_size) { > + max_extents = (mds->payload_size - sizeof(*p)) / > + sizeof(struct updated_extent_list); > + pl_size = struct_size(p, extent_list, max_extents); > + } > + > + struct cxl_mbox_dc_response *response __free(kfree) = > + kzalloc(pl_size, GFP_KERNEL); > + if (!response) > + return -ENOMEM; > + > + pl_index = 0; > + xa_for_each(extent_array, index, extent) { > + > + response->extent_list[pl_index].dpa_start = extent->start_dpa; > + response->extent_list[pl_index].length = extent->length; > + pl_index++; > + response->extent_list_size = cpu_to_le32(pl_index); > + > + if (pl_index == max_extents) { > + mbox_cmd = (struct cxl_mbox_cmd) { > + .opcode = opcode, > + .size_in = struct_size(response, extent_list, > + pl_index), > + .payload_in = response, > + }; > + > + response->flags = 0; > + if (pl_index < cnt) > + response->flags &= CXL_DCD_EVENT_MORE; > + > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); > + if (rc) > + return rc; > + pl_index = 0; > + } > + } > + > + if (pl_index) { > + mbox_cmd = (struct cxl_mbox_cmd) { > + .opcode = opcode, > + .size_in = struct_size(response, extent_list, > + pl_index), > + .payload_in = response, > + }; > + > + response->flags = 0; > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); > + } > + > + return rc; > +} > + > +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range) > +{ > + struct device *dev = mds->cxlds.dev; > + struct xarray extent_list; > + > + struct cxl_extent extent = { > + .start_dpa = cpu_to_le64(range->start), > + .length = cpu_to_le64(range_len(range)), > + }; > + > + dev_dbg(dev, "Release response dpa %par\n", range); > + > + xa_init(&extent_list); > + if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) { > + dev_dbg(dev, "Failed to release %par\n", range); > + goto destroy; > + } > + > + if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1)) > + dev_dbg(dev, "Failed to release %par\n", range); > + > +destroy: > + 
xa_destroy(&extent_list); > +} > + > +static int validate_add_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + int rc; > + > + rc = cxl_validate_extent(mds, extent); > + if (rc) > + return rc; > + > + return cxl_add_extent(mds, extent); > +} > + > +static int cxl_add_pending(struct cxl_memdev_state *mds) > +{ > + struct device *dev = mds->cxlds.dev; > + struct cxl_extent *extent; > + unsigned long index; > + unsigned long cnt = 0; > + int rc; > + > + xa_for_each(&mds->pending_extents, index, extent) { > + if (validate_add_extent(mds, extent)) { > + dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n", > + le64_to_cpu(extent->start_dpa), > + le64_to_cpu(extent->length)); > + xa_erase(&mds->pending_extents, index); > + kfree(extent); > + continue; > + } > + cnt++; > + } > + rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE, > + &mds->pending_extents, cnt); > + xa_for_each(&mds->pending_extents, index, extent) { > + xa_erase(&mds->pending_extents, index); > + kfree(extent); > + } > + return rc; > +} > + > +static int handle_add_event(struct cxl_memdev_state *mds, > + struct cxl_event_dcd *event) > +{ > + struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL); > + struct device *dev = mds->cxlds.dev; > + > + if (!tmp) > + return -ENOMEM; > + > + memcpy(tmp, &event->extent, sizeof(*tmp)); > + if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp, > + GFP_KERNEL)) { > + kfree(tmp); > + return -ENOMEM; > + } > + > + if (event->flags & CXL_DCD_EVENT_MORE) { > + dev_dbg(dev, "more bit set; delay the surfacing of extent\n"); > + return 0; > + } > + > + /* extents are removed and free'ed in cxl_add_pending() */ > + return cxl_add_pending(mds); > +} > + > +static char *cxl_dcd_evt_type_str(u8 type) > +{ > + switch (type) { > + case DCD_ADD_CAPACITY: > + return "add"; > + case DCD_RELEASE_CAPACITY: > + return "release"; > + case DCD_FORCED_CAPACITY_RELEASE: > + return "force release"; > + default: > + break; > + } > + > + 
return "<unknown>"; > +} > + > +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds, > + struct cxl_event_record_raw *raw_rec) > +{ > + struct cxl_event_dcd *event = &raw_rec->event.dcd; > + struct cxl_extent *extent = &event->extent; > + struct device *dev = mds->cxlds.dev; > + uuid_t *id = &raw_rec->id; > + > + if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID)) > + return -EINVAL; > + > + dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n", > + cxl_dcd_evt_type_str(event->event_type), > + le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length)); > + > + switch (event->event_type) { > + case DCD_ADD_CAPACITY: > + return handle_add_event(mds, event); > + case DCD_RELEASE_CAPACITY: > + return cxl_rm_extent(mds, &event->extent); > + case DCD_FORCED_CAPACITY_RELEASE: > + dev_err_ratelimited(dev, "Forced release event ignored.\n"); > + return 0; > + default: > + return -EINVAL; > + } > +} > + > static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > enum cxl_event_log_type type) > { > @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > if (!nr_rec) > break; > > - for (i = 0; i < nr_rec; i++) > + for (i = 0; i < nr_rec; i++) { > __cxl_event_trace_record(cxlmd, type, > &payload->records[i]); > + if (type == CXL_EVENT_TYPE_DCD) { > + rc = cxl_handle_dcd_event_records(mds, > + &payload->records[i]); > + if (rc) > + dev_err_ratelimited(dev, "dcd event failed: %d\n", > + rc); > + } > + } > > if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW) > trace_cxl_overflow(cxlmd, type, payload); > @@ -1078,6 +1329,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status) > { > dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status); > > + if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD)) > + cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD); > if (status & CXLDEV_EVENT_STATUS_FATAL) > cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL); > if (status & CXLDEV_EVENT_STATUS_FAIL) > @@ 
-1610,6 +1863,17 @@ int cxl_poison_state_init(struct cxl_memdev_state *mds) > } > EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL); > > +static void clear_pending_extents(void *_mds) > +{ > + struct cxl_memdev_state *mds = _mds; > + struct cxl_extent *extent; > + unsigned long index; > + > + xa_for_each(&mds->pending_extents, index, extent) > + kfree(extent); > + xa_destroy(&mds->pending_extents); > +} > + > struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > { > struct cxl_memdev_state *mds; > @@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > mds->cxlds.type = CXL_DEVTYPE_CLASSMEM; > mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID; > mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID; > + xa_init(&mds->pending_extents); > + devm_add_action_or_reset(dev, clear_pending_extents, mds); > > return mds; > } > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c > index 8e0884b52f84..8c9171f914fb 100644 > --- a/drivers/cxl/core/region.c > +++ b/drivers/cxl/core/region.c > @@ -3037,6 +3037,7 @@ static void cxl_dax_region_release(struct device *dev) > { > struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev); > > + ida_destroy(&cxlr_dax->extent_ida); > kfree(cxlr_dax); > } > > @@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr) > > dev = &cxlr_dax->dev; > cxlr_dax->cxlr = cxlr; > + cxlr->cxlr_dax = cxlr_dax; > + ida_init(&cxlr_dax->extent_ida); > device_initialize(dev); > lockdep_set_class(&dev->mutex, &cxl_dax_region_key); > device_set_pm_not_required(dev); > @@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr) > static void cxlr_dax_unregister(void *_cxlr_dax) > { > struct cxl_dax_region *cxlr_dax = _cxlr_dax; > + struct cxl_region *cxlr = cxlr_dax->cxlr; > > + cxlr->cxlr_dax = NULL; > + cxlr_dax->cxlr = NULL; > device_unregister(&cxlr_dax->dev); > } > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 
16861c867537..c858e3957fd5 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -11,6 +11,7 @@ > #include <linux/log2.h> > #include <linux/node.h> > #include <linux/io.h> > +#include <linux/cxl-event.h> > > extern const struct nvdimm_security_ops *cxl_security_ops; > > @@ -169,11 +170,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw) > #define CXLDEV_EVENT_STATUS_WARN BIT(1) > #define CXLDEV_EVENT_STATUS_FAIL BIT(2) > #define CXLDEV_EVENT_STATUS_FATAL BIT(3) > +#define CXLDEV_EVENT_STATUS_DCD BIT(4) > > #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \ > CXLDEV_EVENT_STATUS_WARN | \ > CXLDEV_EVENT_STATUS_FAIL | \ > - CXLDEV_EVENT_STATUS_FATAL) > + CXLDEV_EVENT_STATUS_FATAL | \ > + CXLDEV_EVENT_STATUS_DCD) > > /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */ > #define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0) > @@ -444,6 +447,18 @@ enum cxl_decoder_state { > CXL_DECODER_STATE_AUTO, > }; > > +/** > + * struct cxled_extent - Extent within an endpoint decoder > + * @cxled: Reference to the endpoint decoder > + * @dpa_range: DPA range this extent covers within the decoder > + * @tag: Tag from device for this extent > + */ > +struct cxled_extent { > + struct cxl_endpoint_decoder *cxled; > + struct range dpa_range; > + u8 tag[CXL_EXTENT_TAG_LEN]; > +}; > + > /** > * struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder > * @cxld: base cxl_decoder_object > @@ -569,6 +584,7 @@ struct cxl_region_params { > * @type: Endpoint decoder target type > * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown > * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge > + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge > * @flags: Region state flags > * @params: active + config params for the region > * @coord: QoS access coordinates for the region > @@ -582,6 +598,7 @@ struct cxl_region { > enum cxl_decoder_type type; > struct cxl_nvdimm_bridge *cxl_nvb; > struct cxl_pmem_region *cxlr_pmem; > + struct 
cxl_dax_region *cxlr_dax; > unsigned long flags; > struct cxl_region_params params; > struct access_coordinate coord[ACCESS_COORDINATE_MAX]; > @@ -622,12 +639,45 @@ struct cxl_pmem_region { > struct cxl_pmem_region_mapping mapping[]; > }; > > +/* See CXL 3.0 8.2.9.2.1.5 */ Update the reference to reflect CXL 3.1. Fan > +enum dc_event { > + DCD_ADD_CAPACITY, > + DCD_RELEASE_CAPACITY, > + DCD_FORCED_CAPACITY_RELEASE, > + DCD_REGION_CONFIGURATION_UPDATED, > +}; > + > struct cxl_dax_region { > struct device dev; > struct cxl_region *cxlr; > struct range hpa_range; > + struct ida extent_ida; > }; > > +/** > + * struct region_extent - CXL DAX region extent > + * @dev: device representing this extent > + * @cxlr_dax: back reference to parent region device > + * @hpa_range: HPA range of this extent > + * @tag: tag of the extent > + * @decoder_extents: Endpoint decoder extents which make up this region extent > + */ > +struct region_extent { > + struct device dev; > + struct cxl_dax_region *cxlr_dax; > + struct range hpa_range; > + uuid_t tag; > + struct xarray decoder_extents; > +}; > + > +bool is_region_extent(struct device *dev); > +static inline struct region_extent *to_region_extent(struct device *dev) > +{ > + if (!is_region_extent(dev)) > + return NULL; > + return container_of(dev, struct region_extent, dev); > +} > + > /** > * struct cxl_port - logical collection of upstream port devices and > * downstream port devices to construct a CXL memory > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h > index d41bec5433db..3a40fe1f0be7 100644 > --- a/drivers/cxl/cxlmem.h > +++ b/drivers/cxl/cxlmem.h > @@ -497,6 +497,7 @@ struct cxl_dc_region_info { > * @pmem_perf: performance data entry matched to PMEM partition > * @nr_dc_region: number of DC regions implemented in the memory device > * @dc_region: array containing info about the DC regions > + * @pending_extents: array of extents pending during more bit processing > * @event: event log driver state > * @poison: 
poison driver state info > * @security: security driver state info > @@ -532,6 +533,7 @@ struct cxl_memdev_state { > > u8 nr_dc_region; > struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION]; > + struct xarray pending_extents; > > struct cxl_event_state event; > struct cxl_poison_state poison; > @@ -607,6 +609,21 @@ enum cxl_opcode { > UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \ > 0x40, 0x3d, 0x86) > > +/* > + * Add Dynamic Capacity Response > + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169 > + */ > +struct cxl_mbox_dc_response { > + __le32 extent_list_size; > + u8 flags; > + u8 reserved[3]; > + struct updated_extent_list { > + __le64 dpa_start; > + __le64 length; > + u8 reserved[8]; > + } __packed extent_list[]; > +} __packed; > + > struct cxl_mbox_get_supported_logs { > __le16 entries; > u8 rsvd[6]; > @@ -669,6 +686,14 @@ struct cxl_mbox_identify { > UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \ > 0x13, 0xb7, 0x74) > > +/* > + * Dynamic Capacity Event Record > + * CXL rev 3.1 section 8.2.9.2.1; Table 8-43 > + */ > +#define CXL_EVENT_DC_EVENT_UUID \ > + UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \ > + 0x10, 0x1a, 0x2a) > + > /* > * Get Event Records output payload > * CXL rev 3.0 section 8.2.9.2.2; Table 8-50 > @@ -694,6 +719,7 @@ enum cxl_event_log_type { > CXL_EVENT_TYPE_WARN, > CXL_EVENT_TYPE_FAIL, > CXL_EVENT_TYPE_FATAL, > + CXL_EVENT_TYPE_DCD, > CXL_EVENT_TYPE_MAX > }; > > diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h > index 0bea1afbd747..eeda8059d81a 100644 > --- a/include/linux/cxl-event.h > +++ b/include/linux/cxl-event.h > @@ -96,11 +96,43 @@ struct cxl_event_mem_module { > u8 reserved[0x3d]; > } __packed; > > +/* > + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51 > + */ > +#define CXL_EXTENT_TAG_LEN 0x10 > +struct cxl_extent { > + __le64 start_dpa; > + __le64 length; > + u8 tag[CXL_EXTENT_TAG_LEN]; > + __le16 shared_extn_seq; > + u8 
reserved[0x6]; > +} __packed; > + > +/* > + * Dynamic Capacity Event Record > + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50 > + */ > +#define CXL_DCD_EVENT_MORE BIT(0) > +struct cxl_event_dcd { > + struct cxl_event_record_hdr hdr; > + u8 event_type; > + u8 validity_flags; > + __le16 host_id; > + u8 region_index; > + u8 flags; > + u8 reserved1[0x2]; > + struct cxl_extent extent; > + u8 reserved2[0x18]; > + __le32 num_avail_extents; > + __le32 num_avail_tags; > +} __packed; > + > union cxl_event { > struct cxl_event_generic generic; > struct cxl_event_gen_media gen_media; > struct cxl_event_dram dram; > struct cxl_event_mem_module mem_module; > + struct cxl_event_dcd dcd; > /* dram & gen_media event header */ > struct cxl_event_media_hdr media_hdr; > } __packed; > diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild > index 030b388800f0..8238588fffdf 100644 > --- a/tools/testing/cxl/Kbuild > +++ b/tools/testing/cxl/Kbuild > @@ -61,7 +61,8 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o > cxl_core-y += $(CXL_CORE_SRC)/pmu.o > cxl_core-y += $(CXL_CORE_SRC)/cdat.o > cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o > -cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o > +cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \ > + $(CXL_CORE_SRC)/extent.o > cxl_core-y += config_check.o > cxl_core-y += cxl_core_test.o > cxl_core-y += cxl_core_exports.o > > -- > 2.45.2 >
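The struct cxl_extent layout quoted above (two little-endian 64-bit fields, a 16-byte tag, a little-endian 16-bit sequence number and 6 reserved bytes) can be decoded into the inclusive DPA range that the rest of the patch works with. The following is a minimal user-space sketch, not kernel code. The names extent_info, decode_extent, get_le64 and get_le16 are hypothetical, and the zero-length and wrap-around checks are assumptions of the sketch: the hunks quoted in this thread do not show such checks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXTENT_TAG_LEN  0x10
#define EXTENT_WIRE_LEN 40	/* 8 + 8 + 16 + 2 + 6 bytes, per the quoted struct */

/* Hypothetical host-endian view of one extent. */
struct extent_info {
	uint64_t start;			/* first DPA covered */
	uint64_t end;			/* last DPA covered (inclusive) */
	uint8_t tag[EXTENT_TAG_LEN];
	uint16_t shared_seq;
};

/* Read a little-endian 64-bit value regardless of host byte order. */
static uint64_t get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Read a little-endian 16-bit value regardless of host byte order. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Decode one on-wire extent. Returns 0 on success, -1 if the extent has
 * zero length or would wrap past the end of the 64-bit address space
 * (both checks are assumptions of this sketch).
 */
static int decode_extent(const uint8_t *wire, struct extent_info *out)
{
	uint64_t start = get_le64(wire);
	uint64_t length = get_le64(wire + 8);

	if (length == 0 || length - 1 > UINT64_MAX - start)
		return -1;

	out->start = start;
	out->end = start + length - 1;	/* same inclusive form as the patch */
	memcpy(out->tag, wire + 16, EXTENT_TAG_LEN);
	out->shared_seq = get_le16(wire + 32);
	return 0;
}
```

The inclusive end (start + length - 1) matches how cxl_rm_extent() and cxl_add_extent() build their struct range values in the patch.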
On Fri, 23 Aug 2024 14:32:32 -0700 Fan Ni <nifan.cxl@gmail.com> wrote: > On Fri, Aug 16, 2024 at 09:44:26AM -0500, ira.weiny@intel.com wrote: > > From: Navneet Singh <navneet.singh@intel.com> > > > > A dynamic capacity device (DCD) sends events to signal the host for > > changes in the availability of Dynamic Capacity (DC) memory. These > > events contain extents describing a DPA range and meta data for memory > > to be added or removed. Events may be sent from the device at any time. > > > > Three types of events can be signaled, Add, Release, and Force Release. > > > > On add, the host may accept or reject the memory being offered. If no > > region exists, or the extent is invalid, the extent should be rejected. > > Add extent events may be grouped by a 'more' bit which indicates those > > extents should be processed as a group. > > > > On remove, the host can delay the response until the host is safely not > > using the memory. If no region exists the release can be sent > > immediately. The host may also release extents (or partial extents) at > > any time. Thus the 'more' bit grouping of release events is of less > > value and can be ignored in favor of sending multiple release capacity > > responses for groups of release events. > > > > Force removal is intended as a mechanism between the FM and the device > > and intended only when the host is unresponsive, out of sync, or > > otherwise broken. Purposely ignore force removal events. > > > > Regions are made up of one or more devices which may be surfacing memory > > to the host. Once all devices in a region have surfaced an extent the > > region can expose a corresponding extent for the user to consume. > > Without interleaving a device extent forms a 1:1 relationship with the > > region extent. Immediately surface a region extent upon getting a > > device extent. > > > > Per the specification the device is allowed to offer or remove extents > > at any time. 
However, anticipated use cases can expect extents to be > > offered, accepted, and removed in well defined chunks. > > > > Simplify extent tracking with the following restrictions. > > > > 1) Flag for removal any extent which overlaps a requested > > release range. > > 2) Refuse the offer of extents which overlap already accepted > > memory ranges. > > 3) Accept again a range which has already been accepted by the > > host. (It is likely the device has an error because it > > should already know that this range was accepted. But from > > the host point of view it is safe to acknowledge that > > acceptance again.) > > > > Management of the region extent devices must be synchronized with > > potential uses of the memory within the DAX layer. Create region extent > > devices as children of the cxl_dax_region device such that the DAX > > region driver can co-drive them and synchronize with the DAX layer. > > Synchronization and management is handled in a subsequent patch. > > > > Process DCD events and create region devices. > > > > Signed-off-by: Navneet Singh <navneet.singh@intel.com> > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > > One minor change inline. Hi Fan, Crop please. I scanned past it 3 times when scrolling without noticing what you'd actually commented on. > > +/* See CXL 3.0 8.2.9.2.1.5 */ > > Update the reference to reflect CXL 3.1. > > Fan >
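The restrictions on offered extents listed in the commit message can be sketched as a small classifier. This is a user-space illustration only. The range helpers mirror the semantics of the kernel's range_contains() and range_overlaps(), while enum offer_result and classify_offer() are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range, mirroring the kernel's struct range (start/end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* True if r2 lies entirely within r1. */
static bool range_contains(const struct range *r1, const struct range *r2)
{
	return r1->start <= r2->start && r1->end >= r2->end;
}

/* True if r1 and r2 share at least one address. */
static bool range_overlaps(const struct range *r1, const struct range *r2)
{
	return r1->start <= r2->end && r1->end >= r2->start;
}

/* Hypothetical outcome names; not taken from the patch. */
enum offer_result {
	OFFER_ACCEPT_NEW,	/* no conflict: track and accept */
	OFFER_ACCEPT_AGAIN,	/* restriction 3: range already accepted */
	OFFER_REJECT,		/* restriction 2: overlaps accepted memory */
};

static enum offer_result classify_offer(const struct range *accepted,
					size_t nr_accepted,
					const struct range *offer)
{
	size_t i;

	assert(offer->start <= offer->end);

	/* Containment is tested before overlap, as in cxl_add_extent(). */
	for (i = 0; i < nr_accepted; i++)
		if (range_contains(&accepted[i], offer))
			return OFFER_ACCEPT_AGAIN;

	for (i = 0; i < nr_accepted; i++)
		if (range_overlaps(&accepted[i], offer))
			return OFFER_REJECT;

	return OFFER_ACCEPT_NEW;
}
```

Restriction 1 (flag for removal any extent which overlaps a requested release range) would use the same range_overlaps() test, applied to the release range instead of an offer.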
On Fri, 16 Aug 2024 09:44:26 -0500 ira.weiny@intel.com wrote: > From: Navneet Singh <navneet.singh@intel.com> > > A dynamic capacity device (DCD) sends events to signal the host for > changes in the availability of Dynamic Capacity (DC) memory. These > events contain extents describing a DPA range and meta data for memory > to be added or removed. Events may be sent from the device at any time. > > Three types of events can be signaled, Add, Release, and Force Release. > > On add, the host may accept or reject the memory being offered. If no > region exists, or the extent is invalid, the extent should be rejected. > Add extent events may be grouped by a 'more' bit which indicates those > extents should be processed as a group. > > On remove, the host can delay the response until the host is safely not > using the memory. If no region exists the release can be sent > immediately. The host may also release extents (or partial extents) at > any time. Thus the 'more' bit grouping of release events is of less > value and can be ignored in favor of sending multiple release capacity > responses for groups of release events. > > Force removal is intended as a mechanism between the FM and the device > and intended only when the host is unresponsive, out of sync, or > otherwise broken. Purposely ignore force removal events. > > Regions are made up of one or more devices which may be surfacing memory > to the host. Once all devices in a region have surfaced an extent the > region can expose a corresponding extent for the user to consume. > Without interleaving a device extent forms a 1:1 relationship with the > region extent. Immediately surface a region extent upon getting a > device extent. > > Per the specification the device is allowed to offer or remove extents > at any time. However, anticipated use cases can expect extents to be > offered, accepted, and removed in well defined chunks. > > Simplify extent tracking with the following restrictions. 
> > 1) Flag for removal any extent which overlaps a requested > release range. > 2) Refuse the offer of extents which overlap already accepted > memory ranges. > 3) Accept again a range which has already been accepted by the > host. (It is likely the device has an error because it > should already know that this range was accepted. But from > the host point of view it is safe to acknowledge that > acceptance again.) > > Management of the region extent devices must be synchronized with > potential uses of the memory within the DAX layer. Create region extent > devices as children of the cxl_dax_region device such that the DAX > region driver can co-drive them and synchronize with the DAX layer. > Synchronization and management is handled in a subsequent patch. > > Process DCD events and create region devices. > > Signed-off-by: Navneet Singh <navneet.singh@intel.com> > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > A few minor bits and pieces inline. Jonathan > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c > new file mode 100644 > index 000000000000..34456594cdc3 > --- /dev/null > +++ b/drivers/cxl/core/extent.c > +static int match_contains(struct device *dev, void *data) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + struct match_data *md = data; > + struct cxled_extent *entry; > + unsigned long index; > + > + if (!region_extent) > + return 0; > + > + xa_for_each(&region_extent->decoder_extents, index, entry) { > + if (md->cxled == entry->cxled && > + range_contains(&entry->dpa_range, md->new_range)) > + return true; As below, this returns int, so shouldn't be true or false.
> + } > + return false; > +} > +static int match_overlaps(struct device *dev, void *data) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + struct match_data *md = data; > + struct cxled_extent *entry; > + unsigned long index; > + > + if (!region_extent) > + return 0; > + > + xa_for_each(&region_extent->decoder_extents, index, entry) { > + if (md->cxled == entry->cxled && > + range_overlaps(&entry->dpa_range, md->new_range)) > + return true; returns int, so returning true or false is odd. > + } > + > + return false; > +} > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > +{ > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > + struct cxl_endpoint_decoder *cxled; > + struct range hpa_range, dpa_range; > + struct cxl_region *cxlr; > + > + dpa_range = (struct range) { > + .start = start_dpa, > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > + }; > + > + guard(rwsem_read)(&cxl_region_rwsem); > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > + if (!cxlr) { > + memdev_release_extent(mds, &dpa_range); How does this condition happen? Perhaps a comment needed.
> + return -ENXIO; > + } > + > + calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range); > + > + /* Remove region extents which overlap */ > + return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range, > + cxlr_rm_extent); > +} > + > +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct cxled_extent *ed_extent) > +{ > + struct region_extent *region_extent; > + struct range hpa_range; > + int rc; > + > + calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range); > + > + region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag); > + if (IS_ERR(region_extent)) > + return PTR_ERR(region_extent); > + > + rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent, I'd wrap that earlier to keep the line a bit shorter. > + GFP_KERNEL); > + if (rc) { > + free_region_extent(region_extent); > + return rc; > + } > + > + /* device model handles freeing region_extent */ > + return online_region_extent(region_extent); > +} > + > +/* Callers are expected to ensure cxled has been attached to a region */ > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > +{ > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > + struct cxl_endpoint_decoder *cxled; > + struct range ed_range, ext_range; > + struct cxl_dax_region *cxlr_dax; > + struct cxled_extent *ed_extent; > + struct cxl_region *cxlr; > + struct device *dev; > + > + ext_range = (struct range) { > + .start = start_dpa, > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > + }; > + > + guard(rwsem_read)(&cxl_region_rwsem); > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > + if (!cxlr) > + return -ENXIO; > + > + cxlr_dax = cxled->cxld.region->cxlr_dax; > + dev = &cxled->cxld.dev; > + ed_range = (struct range) { > + .start = cxled->dpa_res->start, > + .end = cxled->dpa_res->end, > + }; > + > + dev_dbg(&cxled->cxld.dev, "Checking ED 
(%pr) for extent %par\n", > + cxled->dpa_res, &ext_range); > + > + if (!range_contains(&ed_range, &ext_range)) { > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) is not fully in ED %par\n", > + &ext_range.start, CXL_EXTENT_TAG_LEN, > + extent->tag, &ed_range); > + return -ENXIO; > + } > + > + if (extents_contain(cxlr_dax, cxled, &ext_range)) This case confuses me. If the extents are already there I think we should error out or at least print something as that's very wrong. > + return 0; > + > + if (extents_overlap(cxlr_dax, cxled, &ext_range)) > + return -ENXIO; > + > + ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL); > + if (!ed_extent) > + return -ENOMEM; > + > + ed_extent->cxled = cxled; > + ed_extent->dpa_range = ext_range; > + memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN); > + > + dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range, > + CXL_EXTENT_TAG_LEN, ed_extent->tag); > + > + return cxlr_add_extent(cxlr_dax, cxled, ed_extent); > +} > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 01a447aaa1b1..f629ad7488ac 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds) > } > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); > > +static int cxl_validate_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + u64 start = le64_to_cpu(extent->start_dpa); > + u64 length = le64_to_cpu(extent->length); > + struct device *dev = mds->cxlds.dev; > + > + struct range ext_range = (struct range){ > + .start = start, > + .end = start + length - 1, > + }; > + > + if (le16_to_cpu(extent->shared_extn_seq) != 0) { That's not the 'main' way to tell if an extent is shared because we could have a single extent (so seq == 0). Should verify it's not in a DCD region that is shareable to make this decision. I've lost track on the region handling so maybe you already do this by not including those regions at all? 
> + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) can not be shared\n", > + &ext_range.start, CXL_EXTENT_TAG_LEN, > + extent->tag); > + return -ENXIO; > + } > + > + /* Extents must not cross DC region boundary's */ > + for (int i = 0; i < mds->nr_dc_region; i++) { > + struct cxl_dc_region_info *dcr = &mds->dc_region[i]; > + struct range region_range = (struct range) { > + .start = dcr->base, > + .end = dcr->base + dcr->decode_len - 1, > + }; > + > + if (range_contains(&region_range, &ext_range)) { > + dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n", > + &ext_range, i, start - dcr->base, > + CXL_EXTENT_TAG_LEN, extent->tag); > + return 0; > + } > + } > + > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) is not in any DC region\n", > + &ext_range, CXL_EXTENT_TAG_LEN, extent->tag); > + return -ENXIO; > +} > + > void cxl_event_trace_record(const struct cxl_memdev *cxlmd, > enum cxl_event_log_type type, > enum cxl_event_type event_type, > @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds, > return rc; > } > > +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode, > + struct xarray *extent_array, int cnt) > +{ > + struct cxl_mbox_dc_response *p; > + struct cxl_mbox_cmd mbox_cmd; > + struct cxl_extent *extent; > + unsigned long index; > + u32 pl_index; > + int rc = 0; > + > + size_t pl_size = struct_size(p, extent_list, cnt); > + u32 max_extents = cnt; > + What if cnt is zero? All extents rejected so none in the extent_array. Need to send a zero extent response to reject them all IIRC. > + /* May have to use more bit on response. 
*/ > + if (pl_size > mds->payload_size) { > + max_extents = (mds->payload_size - sizeof(*p)) / > + sizeof(struct updated_extent_list); > + pl_size = struct_size(p, extent_list, max_extents); > + } > + > + struct cxl_mbox_dc_response *response __free(kfree) = > + kzalloc(pl_size, GFP_KERNEL); > + if (!response) > + return -ENOMEM; > + > + pl_index = 0; > + xa_for_each(extent_array, index, extent) { > + > + response->extent_list[pl_index].dpa_start = extent->start_dpa; > + response->extent_list[pl_index].length = extent->length; > + pl_index++; > + response->extent_list_size = cpu_to_le32(pl_index); > + > + if (pl_index == max_extents) { > + mbox_cmd = (struct cxl_mbox_cmd) { > + .opcode = opcode, > + .size_in = struct_size(response, extent_list, > + pl_index), > + .payload_in = response, > + }; > + > + response->flags = 0; > + if (pl_index < cnt) > + response->flags &= CXL_DCD_EVENT_MORE; > + > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); > + if (rc) > + return rc; > + pl_index = 0; > + } > + } > + > + if (pl_index) { || !cnt I think so we send a nothing accepted message. > + mbox_cmd = (struct cxl_mbox_cmd) { > + .opcode = opcode, > + .size_in = struct_size(response, extent_list, > + pl_index), > + .payload_in = response, > + }; > + > + response->flags = 0; > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); if (rc) return rc; > + } > + return 0; So that reader doesn't have to check what rc was in !pl_index case and avoids assigning rc right at the top. 
> + return rc; > +} > +static int cxl_add_pending(struct cxl_memdev_state *mds) > +{ > + struct device *dev = mds->cxlds.dev; > + struct cxl_extent *extent; > + unsigned long index; > + unsigned long cnt = 0; > + int rc; > + > + xa_for_each(&mds->pending_extents, index, extent) { > + if (validate_add_extent(mds, extent)) { Add a comment here that not accepting an extent but accepting some or none means this one was rejected (I'd forgotten how that bit worked) > + dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n", > + le64_to_cpu(extent->start_dpa), > + le64_to_cpu(extent->length)); > + xa_erase(&mds->pending_extents, index); > + kfree(extent); > + continue; > + } > + cnt++; > + } > + rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE, > + &mds->pending_extents, cnt); > + xa_for_each(&mds->pending_extents, index, extent) { > + xa_erase(&mds->pending_extents, index); > + kfree(extent); > + } > + return rc; > +} > + > +static int handle_add_event(struct cxl_memdev_state *mds, > + struct cxl_event_dcd *event) > +{ > + struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL); > + struct device *dev = mds->cxlds.dev; > + > + if (!tmp) > + return -ENOMEM; > + > + memcpy(tmp, &event->extent, sizeof(*tmp)); kmemdup? 
> + if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp, > + GFP_KERNEL)) { > + kfree(tmp); > + return -ENOMEM; > + } > + > + if (event->flags & CXL_DCD_EVENT_MORE) { > + dev_dbg(dev, "more bit set; delay the surfacing of extent\n"); > + return 0; > + } > + > + /* extents are removed and free'ed in cxl_add_pending() */ > + return cxl_add_pending(mds); > +} > static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > enum cxl_event_log_type type) > { > @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > if (!nr_rec) > break; > > - for (i = 0; i < nr_rec; i++) > + for (i = 0; i < nr_rec; i++) { > __cxl_event_trace_record(cxlmd, type, > &payload->records[i]); > + if (type == CXL_EVENT_TYPE_DCD) { Bit of a deep indent so maybe flip logic? Logic wise it's a bit dubious as we might want to match other types in future though so up to you. if (type != CXL_EVENT_TYPE_DCD) continue; rc = > + rc = cxl_handle_dcd_event_records(mds, > + &payload->records[i]); > + if (rc) > + dev_err_ratelimited(dev, "dcd event failed: %d\n", > + rc); > + } > + } > > struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > { > struct cxl_memdev_state *mds; > @@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > mds->cxlds.type = CXL_DEVTYPE_CLASSMEM; > mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID; > mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID; > + xa_init(&mds->pending_extents); > + devm_add_action_or_reset(dev, clear_pending_extents, mds); Why don't you need to check if this failed? Definitely seems unlikely to leave things in a good state. Unlikely to fail of course, but you never know. 
> > return mds; > } > @@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr) > > dev = &cxlr_dax->dev; > cxlr_dax->cxlr = cxlr; > + cxlr->cxlr_dax = cxlr_dax; > + ida_init(&cxlr_dax->extent_ida); > device_initialize(dev); > lockdep_set_class(&dev->mutex, &cxl_dax_region_key); > device_set_pm_not_required(dev); > @@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr) > static void cxlr_dax_unregister(void *_cxlr_dax) > { > struct cxl_dax_region *cxlr_dax = _cxlr_dax; > + struct cxl_region *cxlr = cxlr_dax->cxlr; > > + cxlr->cxlr_dax = NULL; > + cxlr_dax->cxlr = NULL; cxlr_dax->cxlr was assigned before this patch. I'm not seeing any new checks on these being non null so why are they needed? If there is a good reason for this then a comment would be useful. > device_unregister(&cxlr_dax->dev); > } >
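The review comments on cxl_send_dc_response() above ask for two behaviours: a response must still go out when no extents were accepted, and the 'more' bit applies when the accepted list does not fit in one payload. The following is a minimal user-space sketch of that control flow, not the patch's implementation. The names DC_RESP_MORE, struct msg_log, send_msg() and send_dc_response() are hypothetical, and send_msg() merely records what a mailbox command would have carried. The zero-extent response follows the reviewer's recollection of the specification ("IIRC"), so treat it as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DC_RESP_MORE 0x1	/* hypothetical name for the response 'more' bit */

/* Hypothetical record of one message that would go to the device. */
struct sent_msg {
	uint32_t nr_extents;
	uint8_t flags;
};

struct msg_log {
	struct sent_msg msgs[16];
	size_t nr_msgs;
};

/* Stand-in for the mailbox send; returns 0 on success, -1 if the log is full. */
static int send_msg(struct msg_log *log, uint32_t nr_extents, uint8_t flags)
{
	if (log->nr_msgs >= 16)
		return -1;
	log->msgs[log->nr_msgs].nr_extents = nr_extents;
	log->msgs[log->nr_msgs].flags = flags;
	log->nr_msgs++;
	return 0;
}

/*
 * Send cnt accepted extents in messages of at most max_per_msg extents.
 * The do/while runs at least once, so cnt == 0 still produces one
 * empty message that tells the device nothing was accepted.
 */
static int send_dc_response(struct msg_log *log, uint32_t cnt,
			    uint32_t max_per_msg)
{
	uint32_t sent = 0;
	int rc;

	assert(max_per_msg > 0);

	do {
		uint32_t chunk = cnt - sent;
		uint8_t flags = 0;

		if (chunk > max_per_msg)
			chunk = max_per_msg;
		sent += chunk;

		/* OR the bit in; AND-ing into a zeroed field would leave it clear. */
		if (sent < cnt)
			flags |= DC_RESP_MORE;

		rc = send_msg(log, chunk, flags);
		if (rc)
			return rc;
	} while (sent < cnt);

	return 0;
}
```

Returning from inside the loop on error and ending with a plain return 0 follows the reviewer's suggestion that the reader should not have to track what rc held on the path where nothing was sent.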
On Tue, Aug 27, 2024 at 01:08:29PM +0100, Jonathan Cameron wrote: > On Fri, 23 Aug 2024 14:32:32 -0700 > Fan Ni <nifan.cxl@gmail.com> wrote: > > > On Fri, Aug 16, 2024 at 09:44:26AM -0500, ira.weiny@intel.com wrote: > > > From: Navneet Singh <navneet.singh@intel.com> > > > > > > A dynamic capacity device (DCD) sends events to signal the host for > > > changes in the availability of Dynamic Capacity (DC) memory. These > > > events contain extents describing a DPA range and meta data for memory > > > to be added or removed. Events may be sent from the device at any time. > > > > > > Three types of events can be signaled, Add, Release, and Force Release. > > > > > > On add, the host may accept or reject the memory being offered. If no > > > region exists, or the extent is invalid, the extent should be rejected. > > > Add extent events may be grouped by a 'more' bit which indicates those > > > extents should be processed as a group. > > > > > > On remove, the host can delay the response until the host is safely not > > > using the memory. If no region exists the release can be sent > > > immediately. The host may also release extents (or partial extents) at > > > any time. Thus the 'more' bit grouping of release events is of less > > > value and can be ignored in favor of sending multiple release capacity > > > responses for groups of release events. > > > > > > Force removal is intended as a mechanism between the FM and the device > > > and intended only when the host is unresponsive, out of sync, or > > > otherwise broken. Purposely ignore force removal events. > > > > > > Regions are made up of one or more devices which may be surfacing memory > > > to the host. Once all devices in a region have surfaced an extent the > > > region can expose a corresponding extent for the user to consume. > > > Without interleaving a device extent forms a 1:1 relationship with the > > > region extent. Immediately surface a region extent upon getting a > > > device extent. 
> > > > > > Per the specification the device is allowed to offer or remove extents > > > at any time. However, anticipated use cases can expect extents to be > > > offered, accepted, and removed in well defined chunks. > > > > > > Simplify extent tracking with the following restrictions. > > > > > > 1) Flag for removal any extent which overlaps a requested > > > release range. > > > 2) Refuse the offer of extents which overlap already accepted > > > memory ranges. > > > 3) Accept again a range which has already been accepted by the > > > host. (It is likely the device has an error because it > > > should already know that this range was accepted. But from > > > the host point of view it is safe to acknowledge that > > > acceptance again.) > > > > > > Management of the region extent devices must be synchronized with > > > potential uses of the memory within the DAX layer. Create region extent > > > devices as children of the cxl_dax_region device such that the DAX > > > region driver can co-drive them and synchronize with the DAX layer. > > > Synchronization and management is handled in a subsequent patch. > > > > > > Process DCD events and create region devices. > > > > > > Signed-off-by: Navneet Singh <navneet.singh@intel.com> > > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > > > > > One minor change inline. > Hi Fan, > > Crop please. I scanned past it 3 times when scrolling without noticing > what you'd actually commented on. Sure. I will crop in the future. Thanks for the tips, Jonathan. Fan > > > > +/* See CXL 3.0 8.2.9.2.1.5 */ > > > > Update the reference to reflect CXL 3.1. > > > > Fan > >
Jonathan Cameron wrote: > On Fri, 16 Aug 2024 09:44:26 -0500 > ira.weiny@intel.com wrote: > > > From: Navneet Singh <navneet.singh@intel.com> > > [snip] > > +static int match_contains(struct device *dev, void *data) > > +{ > > + struct region_extent *region_extent = to_region_extent(dev); > > + struct match_data *md = data; > > + struct cxled_extent *entry; > > + unsigned long index; > > + > > + if (!region_extent) > > + return 0; > > + > > + xa_for_each(&region_extent->decoder_extents, index, entry) { > > + if (md->cxled == entry->cxled && > > + range_contains(&entry->dpa_range, md->new_range)) > > + return true; > As below, this returns int, so shouldn't be true or false. Yep. Thanks. > > > + } > > + return false; > > +} > > > +static int match_overlaps(struct device *dev, void *data) > > +{ > > + struct region_extent *region_extent = to_region_extent(dev); > > + struct match_data *md = data; > > + struct cxled_extent *entry; > > + unsigned long index; > > + > > + if (!region_extent) > > + return 0; > > + > > + xa_for_each(&region_extent->decoder_extents, index, entry) { > > + if (md->cxled == entry->cxled && > > + range_overlaps(&entry->dpa_range, md->new_range)) > > + return true; > > returns int, so returning true or false is odd. Yep. > > > + } > > + > > + return false; > > +} > > > > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > > +{ > > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > > + struct cxl_endpoint_decoder *cxled; > > + struct range hpa_range, dpa_range; > > + struct cxl_region *cxlr; > > + > > + dpa_range = (struct range) { > > + .start = start_dpa, > > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > > + }; > > + > > + guard(rwsem_read)(&cxl_region_rwsem); > > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > > + if (!cxlr) { > > + memdev_release_extent(mds, &dpa_range); > > How does this condition happen? Perhaps a comment needed. Fair enough. 
Proposed comment. /* * No region can happen here for a few reasons: * * 1) Extents were accepted and the host crashed/rebooted * leaving them in an accepted state. On reboot the host * has not yet created a region to own them. * * 2) Region destruction won the race with the device releasing * all the extents. Here the release will be a duplicate of * the one sent via region destruction. * * 3) The device is confused and releasing extents for which no * region ever existed. * * In all these cases make sure the device knows we are not * using this extent. */ Item 2 is AFAICS ok with the spec. > > > + return -ENXIO; > > + } > > + > > + calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range); > > + > > + /* Remove region extents which overlap */ > > + return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range, > > + cxlr_rm_extent); > > +} > > + > > +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax, > > + struct cxl_endpoint_decoder *cxled, > > + struct cxled_extent *ed_extent) > > +{ > > + struct region_extent *region_extent; > > + struct range hpa_range; > > + int rc; > > + > > + calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range); > > + > > + region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag); > > + if (IS_ERR(region_extent)) > > + return PTR_ERR(region_extent); > > + > > + rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent, > > I'd wrap that earlier to keep the line a bit shorter. Done. 
> > > + GFP_KERNEL); > > + if (rc) { > > + free_region_extent(region_extent); > > + return rc; > > + } > > + > > + /* device model handles freeing region_extent */ > > + return online_region_extent(region_extent); > > +} > > + > > +/* Callers are expected to ensure cxled has been attached to a region */ > > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > > +{ > > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > > + struct cxl_endpoint_decoder *cxled; > > + struct range ed_range, ext_range; > > + struct cxl_dax_region *cxlr_dax; > > + struct cxled_extent *ed_extent; > > + struct cxl_region *cxlr; > > + struct device *dev; > > + > > + ext_range = (struct range) { > > + .start = start_dpa, > > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > > + }; > > + > > + guard(rwsem_read)(&cxl_region_rwsem); > > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > > + if (!cxlr) > > + return -ENXIO; > > + > > + cxlr_dax = cxled->cxld.region->cxlr_dax; > > + dev = &cxled->cxld.dev; > > + ed_range = (struct range) { > > + .start = cxled->dpa_res->start, > > + .end = cxled->dpa_res->end, > > + }; > > + > > + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n", > > + cxled->dpa_res, &ext_range); > > + > > + if (!range_contains(&ed_range, &ext_range)) { > > + dev_err_ratelimited(dev, > > + "DC extent DPA %par (%*phC) is not fully in ED %par\n", > > + &ext_range.start, CXL_EXTENT_TAG_LEN, > > + extent->tag, &ed_range); > > + return -ENXIO; > > + } > > + > > + if (extents_contain(cxlr_dax, cxled, &ext_range)) > > This case confuses me. If the extents are already there I think we should > error out or at least print something as that's very wrong. I thought we discussed this in one of the community meetings that it would be ok to accept these. We could certainly print a warning here. In all honesty I'm wondering if these restrictions are really needed anymore. 
But at the same time I really, really, really don't think anyone has a good use case to have to support these cases. So I'm keeping the code simple for now. > > > + return 0; > > + > > + if (extents_overlap(cxlr_dax, cxled, &ext_range)) > > + return -ENXIO; > > + > > + ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL); > > + if (!ed_extent) > > + return -ENOMEM; > > + > > + ed_extent->cxled = cxled; > > + ed_extent->dpa_range = ext_range; > > + memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN); > > + > > + dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range, > > + CXL_EXTENT_TAG_LEN, ed_extent->tag); > > + > > + return cxlr_add_extent(cxlr_dax, cxled, ed_extent); > > +} > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > > index 01a447aaa1b1..f629ad7488ac 100644 > > --- a/drivers/cxl/core/mbox.c > > +++ b/drivers/cxl/core/mbox.c > > @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds) > > } > > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); > > > > +static int cxl_validate_extent(struct cxl_memdev_state *mds, > > + struct cxl_extent *extent) > > +{ > > + u64 start = le64_to_cpu(extent->start_dpa); > > + u64 length = le64_to_cpu(extent->length); > > + struct device *dev = mds->cxlds.dev; > > + > > + struct range ext_range = (struct range){ > > + .start = start, > > + .end = start + length - 1, > > + }; > > + > > + if (le16_to_cpu(extent->shared_extn_seq) != 0) { > > That's not the 'main' way to tell if an extent is shared because > we could have a single extent (so seq == 0). > Should verify it's not in a DCD region that > is shareable to make this decision. Ah... :-/ > > I've lost track on the region handling so maybe you already do > this by not including those regions at all? I don't think so. I'll add the region check. I see now why I glossed over this though. The shared nature of a DCD partition is defined in the DSMAS. Is that correct? Or am I missing something in the spec? 
> > > + dev_err_ratelimited(dev, > > + "DC extent DPA %par (%*phC) can not be shared\n", > > + &ext_range.start, CXL_EXTENT_TAG_LEN, > > + extent->tag); > > + return -ENXIO; > > + } > > + > > + /* Extents must not cross DC region boundary's */ > > + for (int i = 0; i < mds->nr_dc_region; i++) { > > + struct cxl_dc_region_info *dcr = &mds->dc_region[i]; > > + struct range region_range = (struct range) { > > + .start = dcr->base, > > + .end = dcr->base + dcr->decode_len - 1, > > + }; > > + > > + if (range_contains(&region_range, &ext_range)) { > > + dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n", > > + &ext_range, i, start - dcr->base, > > + CXL_EXTENT_TAG_LEN, extent->tag); > > + return 0; > > + } > > + } > > + > > + dev_err_ratelimited(dev, > > + "DC extent DPA %par (%*phC) is not in any DC region\n", > > + &ext_range, CXL_EXTENT_TAG_LEN, extent->tag); > > + return -ENXIO; > > +} > > + > > void cxl_event_trace_record(const struct cxl_memdev *cxlmd, > > enum cxl_event_log_type type, > > enum cxl_event_type event_type, > > @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds, > > return rc; > > } > > > > +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode, > > + struct xarray *extent_array, int cnt) > > +{ > > + struct cxl_mbox_dc_response *p; > > + struct cxl_mbox_cmd mbox_cmd; > > + struct cxl_extent *extent; > > + unsigned long index; > > + u32 pl_index; > > + int rc = 0; > > + > > + size_t pl_size = struct_size(p, extent_list, cnt); > > + u32 max_extents = cnt; > > + > What if cnt is zero? All extents rejected so none in the > extent_array. Need to send a zero extent response to reject > them all IIRC. yes. I missed that thanks. > > > + /* May have to use more bit on response.
*/ > > + if (pl_size > mds->payload_size) { > > + max_extents = (mds->payload_size - sizeof(*p)) / > > + sizeof(struct updated_extent_list); > > + pl_size = struct_size(p, extent_list, max_extents); > > + } > > + > > + struct cxl_mbox_dc_response *response __free(kfree) = > > + kzalloc(pl_size, GFP_KERNEL); > > + if (!response) > > + return -ENOMEM; > > + > > + pl_index = 0; > > + xa_for_each(extent_array, index, extent) { > > + > > + response->extent_list[pl_index].dpa_start = extent->start_dpa; > > + response->extent_list[pl_index].length = extent->length; > > + pl_index++; > > + response->extent_list_size = cpu_to_le32(pl_index); > > + > > + if (pl_index == max_extents) { > > + mbox_cmd = (struct cxl_mbox_cmd) { > > + .opcode = opcode, > > + .size_in = struct_size(response, extent_list, > > + pl_index), > > + .payload_in = response, > > + }; > > + > > + response->flags = 0; > > + if (pl_index < cnt) > > + response->flags &= CXL_DCD_EVENT_MORE; > > + > > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); > > + if (rc) > > + return rc; > > + pl_index = 0; > > + } > > + } > > + > > + if (pl_index) { > || !cnt > > I think so we send a nothing accepted message. Yep. > > > + mbox_cmd = (struct cxl_mbox_cmd) { > > + .opcode = opcode, > > + .size_in = struct_size(response, extent_list, > > + pl_index), > > + .payload_in = response, > > + }; > > + > > + response->flags = 0; > > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); > if (rc) > return rc; > > + } > > + > > return 0; So that reader doesn't have to check what rc was in !pl_index > case and avoids assigning rc right at the top. Ah thanks. That might have been left over from something previous. 
> > > > + return rc; > > +} > > > > +static int cxl_add_pending(struct cxl_memdev_state *mds) > > +{ > > + struct device *dev = mds->cxlds.dev; > > + struct cxl_extent *extent; > > + unsigned long index; > > + unsigned long cnt = 0; > > + int rc; > > + > > + xa_for_each(&mds->pending_extents, index, extent) { > > + if (validate_add_extent(mds, extent)) { > > > Add a comment here that not accepting an extent but > accepting some or none means this one was rejected (I'd forgotten how > that bit worked) Ok yeah that may not be clear without reading the spec closely. /* * Any extents which are to be rejected are omitted from * the response. An empty response means all are * rejected. */ > > > + dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n", > > + le64_to_cpu(extent->start_dpa), > > + le64_to_cpu(extent->length)); > > + xa_erase(&mds->pending_extents, index); > > + kfree(extent); > > + continue; > > + } > > + cnt++; > > + } > > + rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE, > > + &mds->pending_extents, cnt); > > + xa_for_each(&mds->pending_extents, index, extent) { > > + xa_erase(&mds->pending_extents, index); > > + kfree(extent); > > + } > > + return rc; > > +} > > + > > +static int handle_add_event(struct cxl_memdev_state *mds, > > + struct cxl_event_dcd *event) > > +{ > > + struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL); > > + struct device *dev = mds->cxlds.dev; > > + > > + if (!tmp) > > + return -ENOMEM; > > + > > + memcpy(tmp, &event->extent, sizeof(*tmp)); > > kmemdup? yep. 
> > > + if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp, > > + GFP_KERNEL)) { > > + kfree(tmp); > > + return -ENOMEM; > > + } > > + > > + if (event->flags & CXL_DCD_EVENT_MORE) { > > + dev_dbg(dev, "more bit set; delay the surfacing of extent\n"); > > + return 0; > > + } > > + > > + /* extents are removed and free'ed in cxl_add_pending() */ > > + return cxl_add_pending(mds); > > +} > > > static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > > enum cxl_event_log_type type) > > { > > @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > > if (!nr_rec) > > break; > > > > - for (i = 0; i < nr_rec; i++) > > + for (i = 0; i < nr_rec; i++) { > > __cxl_event_trace_record(cxlmd, type, > > &payload->records[i]); > > + if (type == CXL_EVENT_TYPE_DCD) { > Bit of a deep indent so maybe flip logic? > > Logic wise it's a bit dubious as we might want to match other > types in future though so up to you. I was thinking more along these lines. But the rc is unneeded. That print can be in the handle function. 
Something like this: diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c index 88b823afe482..e86a483d80eb 100644 --- a/drivers/cxl/core/mbox.c +++ b/drivers/cxl/core/mbox.c @@ -1231,16 +1231,17 @@ static char *cxl_dcd_evt_type_str(u8 type) return "<unknown>"; } -static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds, +static void cxl_handle_dcd_event_records(struct cxl_memdev_state *mds, struct cxl_event_record_raw *raw_rec) { struct cxl_event_dcd *event = &raw_rec->event.dcd; struct cxl_extent *extent = &event->extent; struct device *dev = mds->cxlds.dev; uuid_t *id = &raw_rec->id; + int rc; if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID)) - return -EINVAL; + return; dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n", cxl_dcd_evt_type_str(event->event_type), @@ -1248,15 +1249,22 @@ static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds, switch (event->event_type) { case DCD_ADD_CAPACITY: - return handle_add_event(mds, event); + rc = handle_add_event(mds, event); + break; case DCD_RELEASE_CAPACITY: - return cxl_rm_extent(mds, &event->extent); + rc = cxl_rm_extent(mds, &event->extent); + break; case DCD_FORCED_CAPACITY_RELEASE: dev_err_ratelimited(dev, "Forced release event ignored.\n"); - return 0; + rc = 0; + break; default: - return -EINVAL; + rc = -EINVAL; + break; } + + if (rc) + dev_err_ratelimited(dev, "dcd event failed: %d\n", rc); } static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, @@ -1297,13 +1305,9 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, for (i = 0; i < nr_rec; i++) { __cxl_event_trace_record(cxlmd, type, &payload->records[i]); - if (type == CXL_EVENT_TYPE_DCD) { - rc = cxl_handle_dcd_event_records(mds, - &payload->records[i]); - if (rc) - dev_err_ratelimited(dev, "dcd event failed: %d\n", - rc); - } + if (type == CXL_EVENT_TYPE_DCD) + cxl_handle_dcd_event_records(mds, + &payload->records[i]); } if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW) <end diff> > > if (type != 
CXL_EVENT_TYPE_DCD) > continue; > > rc = > > > + rc = cxl_handle_dcd_event_records(mds, > > + &payload->records[i]); > > + if (rc) > > + dev_err_ratelimited(dev, "dcd event failed: %d\n", > > + rc); > > + } > > + } > > > > > struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > > { > > struct cxl_memdev_state *mds; > > @@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > > mds->cxlds.type = CXL_DEVTYPE_CLASSMEM; > > mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID; > > mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID; > > + xa_init(&mds->pending_extents); > > + devm_add_action_or_reset(dev, clear_pending_extents, mds); > > Why don't you need to check if this failed? Definitely seems unlikely > to leave things in a good state. Unlikely to fail of course, but you never know. yea good catch. > > > > > return mds; > > } > > > @@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr) > > > > dev = &cxlr_dax->dev; > > cxlr_dax->cxlr = cxlr; > > + cxlr->cxlr_dax = cxlr_dax; > > + ida_init(&cxlr_dax->extent_ida); > > device_initialize(dev); > > lockdep_set_class(&dev->mutex, &cxl_dax_region_key); > > device_set_pm_not_required(dev); > > @@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr) > > static void cxlr_dax_unregister(void *_cxlr_dax) > > { > > struct cxl_dax_region *cxlr_dax = _cxlr_dax; > > + struct cxl_region *cxlr = cxlr_dax->cxlr; > > > > + cxlr->cxlr_dax = NULL; > > + cxlr_dax->cxlr = NULL; > > cxlr_dax->cxlr was assigned before this patch. > > I'm not seeing any new checks on these being non null so why > are the needed? If there is a good reason for this then > a comment would be useful. I'm not sure anymore either. Perhaps this was left over from an earlier version. Or was something I thought I would need that ended up getting removed. I'll test without this hunk and remove it if I can. Thanks for the review, Ira [snip]
> > > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > > > +{ > > > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > > > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > > > + struct cxl_endpoint_decoder *cxled; > > > + struct range hpa_range, dpa_range; > > > + struct cxl_region *cxlr; > > > + > > > + dpa_range = (struct range) { > > > + .start = start_dpa, > > > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > > > + }; > > > + > > > + guard(rwsem_read)(&cxl_region_rwsem); > > > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > > > + if (!cxlr) { > > > + memdev_release_extent(mds, &dpa_range); > > > > How does this condition happen? Perhaps a comment needed. > > Fair enough. Proposed comment. > > /* > * No region can happen here for a few reasons: > * > * 1) Extents were accepted and the host crashed/rebooted > * leaving them in an accepted state. On reboot the host > * has not yet created a region to own them. > * > * 2) Region destruction won the race with the device releasing > * all the extents. Here the release will be a duplicate of > * the one sent via region destruction. > * > * 3) The device is confused and releasing extents for which no > * region ever existed. > * > * In all these cases make sure the device knows we are not > * using this extent. > */ > > Item 2 is AFAICS ok with the spec. I'm not sure I follow 2. Why would device be releasing extents if we haven't given them back? We aren't supporting the mess that is force removal. 
> > > > > > + return -ENXIO; > > > + } > > > + > > > + calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range); > > > + > > > + /* Remove region extents which overlap */ > > > + return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range, > > > + cxlr_rm_extent); > > > +} > > > + > > > +/* Callers are expected to ensure cxled has been attached to a region */ > > > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > > > +{ > > > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > > > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > > > + struct cxl_endpoint_decoder *cxled; > > > + struct range ed_range, ext_range; > > > + struct cxl_dax_region *cxlr_dax; > > > + struct cxled_extent *ed_extent; > > > + struct cxl_region *cxlr; > > > + struct device *dev; > > > + > > > + ext_range = (struct range) { > > > + .start = start_dpa, > > > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > > > + }; > > > + > > > + guard(rwsem_read)(&cxl_region_rwsem); > > > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > > > + if (!cxlr) > > > + return -ENXIO; > > > + > > > + cxlr_dax = cxled->cxld.region->cxlr_dax; > > > + dev = &cxled->cxld.dev; > > > + ed_range = (struct range) { > > > + .start = cxled->dpa_res->start, > > > + .end = cxled->dpa_res->end, > > > + }; > > > + > > > + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n", > > > + cxled->dpa_res, &ext_range); > > > + > > > + if (!range_contains(&ed_range, &ext_range)) { > > > + dev_err_ratelimited(dev, > > > + "DC extent DPA %par (%*phC) is not fully in ED %par\n", > > > + &ext_range.start, CXL_EXTENT_TAG_LEN, > > > + extent->tag, &ed_range); > > > + return -ENXIO; > > > + } > > > + > > > + if (extents_contain(cxlr_dax, cxled, &ext_range)) > > > > This case confuses me. If the extents are already there I think we should > > error out or at least print something as that's very wrong. 
> > I thought we discussed this in one of the community meetings that it would be > ok to accept these. We could certainly print a warning here. A warning probably does the job of indicating that 'something' odd is going on. A device should never resend an extent overlapping one it sent before (assuming no removal happened in between), so this should never happen, but who knows :( > > In all honesty I'm wondering if these restrictions are really needed anymore. > But at the same time I really, really, really don't think anyone has a good use > case to have to support these cases. So I'm keeping the code simple for now. Fair enough. > > > > > > + return 0; > > > + > > > + if (extents_overlap(cxlr_dax, cxled, &ext_range)) > > > + return -ENXIO; > > > + > > > + ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL); > > > + if (!ed_extent) > > > + return -ENOMEM; > > > + > > > + ed_extent->cxled = cxled; > > > + ed_extent->dpa_range = ext_range; > > > + memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN); > > > + > > > + dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range, > > > + CXL_EXTENT_TAG_LEN, ed_extent->tag); > > > + > > > + return cxlr_add_extent(cxlr_dax, cxled, ed_extent); > > > +} > > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > > > index 01a447aaa1b1..f629ad7488ac 100644 > > > --- a/drivers/cxl/core/mbox.c > > > +++ b/drivers/cxl/core/mbox.c > > > @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds) > > > } > > > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); > > > > > > +static int cxl_validate_extent(struct cxl_memdev_state *mds, > > > + struct cxl_extent *extent) > > > +{ > > > + u64 start = le64_to_cpu(extent->start_dpa); > > > + u64 length = le64_to_cpu(extent->length); > > > + struct device *dev = mds->cxlds.dev; > > > + > > > + struct range ext_range = (struct range){ > > > + .start = start, > > > + .end = start + length - 1, > > > + }; > > > + > > > + if (le16_to_cpu(extent->shared_extn_seq)
!= 0) { > > > > That's not the 'main' way to tell if an extent is shared because > > we could have a single extent (so seq == 0). > > Should verify it's not in a DCD region that > > is shareable to make this decision. > > Ah... :-/ > > > > > I've lost track on the region handling so maybe you already do > > this by not including those regions at all? > > I don't think so. > > I'll add the region check. I see now why I glossed over this though. The > shared nature of a DCD partition is defined in the DSMAS. > > Is that correct? Or am I missing something in the spec? Yes. That matches my understanding (I might also be missing something of course :) > > > +static int cxl_add_pending(struct cxl_memdev_state *mds) > > > +{ > > > + struct device *dev = mds->cxlds.dev; > > > + struct cxl_extent *extent; > > > + unsigned long index; > > > + unsigned long cnt = 0; > > > + int rc; > > > + > > > + xa_for_each(&mds->pending_extents, index, extent) { > > > + if (validate_add_extent(mds, extent)) { > > > > > > Add a comment here that not accepting an extent but > > accepting some or none means this one was rejected (I'd forgotten how > > that bit worked) > > Ok yeah that may not be clear without reading the spec closely. > > /* > * Any extents which are to be rejected are omitted from > * the response. An empty response means all are > * rejected. > */ Perfect.
> > > > > > + dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n", > > > + le64_to_cpu(extent->start_dpa), > > > + le64_to_cpu(extent->length)); > > > + xa_erase(&mds->pending_extents, index); > > > + kfree(extent); > > > + continue; > > > + } > > > + cnt++; > > > + } > > > + rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE, > > > + &mds->pending_extents, cnt); > > > + xa_for_each(&mds->pending_extents, index, extent) { > > > + xa_erase(&mds->pending_extents, index); > > > + kfree(extent); > > > + } > > > + return rc; > > > +} > > > + > > > static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > > > enum cxl_event_log_type type) > > > { > > > @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, > > > if (!nr_rec) > > > break; > > > > > > - for (i = 0; i < nr_rec; i++) > > > + for (i = 0; i < nr_rec; i++) { > > > __cxl_event_trace_record(cxlmd, type, > > > &payload->records[i]); > > > + if (type == CXL_EVENT_TYPE_DCD) { > > Bit of a deep indent so maybe flip logic? > > > > Logic wise it's a bit dubious as we might want to match other > > types in future though so up to you. > > I was thinking more along these lines. But the rc is unneeded. That print > can be in the handle function. > > > Something like this: Looks good to me. (cut to save on scrolling!) Jonathan
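The add-path policy debated in this reply (accept an already-accepted range again, refuse a partial overlap, otherwise add) can be illustrated with a minimal userspace sketch. It is not the driver code: every `sketch_*` name is hypothetical, the two helpers only restate the inclusive-range semantics of the kernel's range_contains() and range_overlaps(), and "contained" is assumed to mean that the offered range lies entirely inside one accepted range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range, modelled on the kernel's struct range (assumption). */
struct sketch_range {
	uint64_t start;
	uint64_t end;
};

/* True when 'inner' lies entirely within 'outer'. */
static bool sketch_contains(const struct sketch_range *outer,
			    const struct sketch_range *inner)
{
	return outer->start <= inner->start && outer->end >= inner->end;
}

/* True when the two ranges share at least one address. */
static bool sketch_overlaps(const struct sketch_range *a,
			    const struct sketch_range *b)
{
	return a->start <= b->end && a->end >= b->start;
}

/* Hypothetical verdicts for an offered extent. */
enum sketch_verdict {
	SKETCH_ADD_NEW,
	SKETCH_ALREADY_ACCEPTED,
	SKETCH_REFUSE_OVERLAP,
};

/*
 * Classify an offered range against the accepted ones. Containment is
 * checked across all accepted ranges before overlap, because a contained
 * range also overlaps and would otherwise be refused.
 */
enum sketch_verdict sketch_classify(const struct sketch_range *accepted,
				    size_t nr,
				    const struct sketch_range *offer)
{
	size_t i;

	assert(offer != NULL);

	for (i = 0; i < nr; i++)
		if (sketch_contains(&accepted[i], offer))
			return SKETCH_ALREADY_ACCEPTED;

	for (i = 0; i < nr; i++)
		if (sketch_overlaps(&accepted[i], offer))
			return SKETCH_REFUSE_OVERLAP;

	return SKETCH_ADD_NEW;
}
```

With accepted ranges 0x1000-0x1fff and 0x4000-0x4fff, an offer of 0x1000-0x1fff is treated as already accepted, 0x1800-0x27ff is refused as a partial overlap, and 0x2000-0x2fff is added as new. A warning on the already-accepted verdict, as suggested in the reply, would fit at the first return.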
On 8/16/2024 10:44 PM, ira.weiny@intel.com wrote: > From: Navneet Singh <navneet.singh@intel.com> > > A dynamic capacity device (DCD) sends events to signal the host for > changes in the availability of Dynamic Capacity (DC) memory. These > events contain extents describing a DPA range and meta data for memory > to be added or removed. Events may be sent from the device at any time. > > Three types of events can be signaled, Add, Release, and Force Release. > > On add, the host may accept or reject the memory being offered. If no > region exists, or the extent is invalid, the extent should be rejected. > Add extent events may be grouped by a 'more' bit which indicates those > extents should be processed as a group. > > On remove, the host can delay the response until the host is safely not > using the memory. If no region exists the release can be sent > immediately. The host may also release extents (or partial extents) at > any time. Thus the 'more' bit grouping of release events is of less > value and can be ignored in favor of sending multiple release capacity > responses for groups of release events. > > Force removal is intended as a mechanism between the FM and the device > and intended only when the host is unresponsive, out of sync, or > otherwise broken. Purposely ignore force removal events. > > Regions are made up of one or more devices which may be surfacing memory > to the host. Once all devices in a region have surfaced an extent the > region can expose a corresponding extent for the user to consume. > Without interleaving a device extent forms a 1:1 relationship with the > region extent. Immediately surface a region extent upon getting a > device extent. > > Per the specification the device is allowed to offer or remove extents > at any time. However, anticipated use cases can expect extents to be > offered, accepted, and removed in well defined chunks. > > Simplify extent tracking with the following restrictions. 
> > 1) Flag for removal any extent which overlaps a requested > release range. > 2) Refuse the offer of extents which overlap already accepted > memory ranges. > 3) Accept again a range which has already been accepted by the > host. (It is likely the device has an error because it > should already know that this range was accepted. But from > the host point of view it is safe to acknowledge that > acceptance again.) > > Management of the region extent devices must be synchronized with > potential uses of the memory within the DAX layer. Create region extent > devices as children of the cxl_dax_region device such that the DAX > region driver can co-drive them and synchronize with the DAX layer. > Synchronization and management is handled in a subsequent patch. > > Process DCD events and create region devices. > > Signed-off-by: Navneet Singh <navneet.singh@intel.com> > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > --- > Changes: > [iweiny: combine this with the extent surface patches to better show the > lifetime extent objects in review] > [iweiny: clean up commit message.] > [iweiny: move extent verification of the 'read extents on region > creation' to this patch] > [iweiny: Provide for a common path for extent realization between an add > event and adding existing extents.] > [iweiny: Persist a check that an extent is within an endpoint decoder] > [iweiny: reduce exported and non-static calls] > [iweiny: use %par] > > <Combined comments from the old patches which were addressed> > > [Jonathan: implement the more bit with a simple algorithm which accepts > all extents it can. > Also include the response more bit to prevent payload > overflow] > [Fan: Do not error if a contained extent is added.] 
> [Jonathan: allocate ida after kzalloc] > [iweiny: fix ida resource leak] > [fan/djiang: remove unneeded memset] > [djiang: fix indentation] > [Jonathan: Fix indentation] > [Jonathan/djbw: make tag a uuid] > [djbw: create helper calc_hpa_range() straight away] > [djbw: Allow for multiple cxled_extents per region_extent] > [djbw: s/cxl_ed/cxled] > [djbw: s/cxl_release_ed_extent/cxled_release_extent/] > [djbw: s/reg_ext/region_extent/] > [djbw: s/dc_extent/extent/] > [Gregory/djbw: reject shared extents] > [iweiny: predicate extent.c compile on CONFIG_CXL_REGION] > --- > drivers/cxl/core/Makefile | 2 +- > drivers/cxl/core/core.h | 13 ++ > drivers/cxl/core/extent.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++ > drivers/cxl/core/mbox.c | 268 ++++++++++++++++++++++++++++++++++- > drivers/cxl/core/region.c | 6 + > drivers/cxl/cxl.h | 52 ++++++- > drivers/cxl/cxlmem.h | 26 ++++ > include/linux/cxl-event.h | 32 +++++ > tools/testing/cxl/Kbuild | 3 +- > 9 files changed, 743 insertions(+), 4 deletions(-) [...] > + > +static bool extents_contain(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct range *new_range) > +{ > + struct device *extent_device; > + struct match_data md = { > + .cxled = cxled, > + .new_range = new_range, > + }; > + > + extent_device = device_find_child(&cxlr_dax->dev, &md, match_contains); Is it better to use __free(put_device) here to drop below 'put_device(extent_device)'? 
> + if (!extent_device) > + return false; > + > + put_device(extent_device); > + return true; > +} > + > +static int match_overlaps(struct device *dev, void *data) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + struct match_data *md = data; > + struct cxled_extent *entry; > + unsigned long index; > + > + if (!region_extent) > + return 0; > + > + xa_for_each(&region_extent->decoder_extents, index, entry) { > + if (md->cxled == entry->cxled && > + range_overlaps(&entry->dpa_range, md->new_range)) > + return true; > + } > + > + return false; > +} > + > +static bool extents_overlap(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct range *new_range) > +{ > + struct device *extent_device; > + struct match_data md = { > + .cxled = cxled, > + .new_range = new_range, > + }; > + > + extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps); Same as above. > + if (!extent_device) > + return false; > + > + put_device(extent_device); > + return true; > +} > + > +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled, > + struct cxl_dax_region *cxlr_dax, > + struct range *dpa_range, > + struct range *hpa_range) > +{ > + resource_size_t dpa_offset, hpa; > + > + dpa_offset = dpa_range->start - cxled->dpa_res->start; > + hpa = cxled->cxld.hpa_range.start + dpa_offset; > + > + hpa_range->start = hpa - cxlr_dax->hpa_range.start; > + hpa_range->end = hpa_range->start + range_len(dpa_range) - 1; > +} > + > +static int cxlr_rm_extent(struct device *dev, void *data) > +{ > + struct region_extent *region_extent = to_region_extent(dev); > + struct range *region_hpa_range = data; > + > + if (!region_extent) > + return 0; > + > + /* > + * Any extent which 'touches' the released range is removed.
> + */ > + if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) { > + dev_dbg(dev, "Remove region extent HPA %par\n", > + &region_extent->hpa_range); > + region_rm_extent(region_extent); > + } > + return 0; > +} > + > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > +{ > + u64 start_dpa = le64_to_cpu(extent->start_dpa); > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > + struct cxl_endpoint_decoder *cxled; > + struct range hpa_range, dpa_range; > + struct cxl_region *cxlr; > + > + dpa_range = (struct range) { > + .start = start_dpa, > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > + }; > + > + guard(rwsem_read)(&cxl_region_rwsem); > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > + if (!cxlr) { > + memdev_release_extent(mds, &dpa_range); > + return -ENXIO; > + } > + > + calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range); > + > + /* Remove region extents which overlap */ > + return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range, > + cxlr_rm_extent); > +} > + > +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax, > + struct cxl_endpoint_decoder *cxled, > + struct cxled_extent *ed_extent) > +{ > + struct region_extent *region_extent; > + struct range hpa_range; > + int rc; > + > + calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range); > + > + region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag); > + if (IS_ERR(region_extent)) > + return PTR_ERR(region_extent); > + > + rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent, > + GFP_KERNEL); > + if (rc) { > + free_region_extent(region_extent); > + return rc; > + } > + > + /* device model handles freeing region_extent */ > + return online_region_extent(region_extent); > +} > + > +/* Callers are expected to ensure cxled has been attached to a region */ > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) > +{ > + u64 start_dpa =
le64_to_cpu(extent->start_dpa); > + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; > + struct cxl_endpoint_decoder *cxled; > + struct range ed_range, ext_range; > + struct cxl_dax_region *cxlr_dax; > + struct cxled_extent *ed_extent; > + struct cxl_region *cxlr; > + struct device *dev; > + > + ext_range = (struct range) { > + .start = start_dpa, > + .end = start_dpa + le64_to_cpu(extent->length) - 1, > + }; > + > + guard(rwsem_read)(&cxl_region_rwsem); > + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); > + if (!cxlr) > + return -ENXIO; > + > + cxlr_dax = cxled->cxld.region->cxlr_dax; > + dev = &cxled->cxld.dev; > + ed_range = (struct range) { > + .start = cxled->dpa_res->start, > + .end = cxled->dpa_res->end, > + }; > + > + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n", > + cxled->dpa_res, &ext_range); > + > + if (!range_contains(&ed_range, &ext_range)) { > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) is not fully in ED %par\n", > + &ext_range.start, CXL_EXTENT_TAG_LEN, > + extent->tag, &ed_range); > + return -ENXIO; > + } > + > + if (extents_contain(cxlr_dax, cxled, &ext_range)) > + return 0; > + > + if (extents_overlap(cxlr_dax, cxled, &ext_range)) > + return -ENXIO; > + > + ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL); > + if (!ed_extent) > + return -ENOMEM; > + > + ed_extent->cxled = cxled; > + ed_extent->dpa_range = ext_range; > + memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN); > + > + dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range, > + CXL_EXTENT_TAG_LEN, ed_extent->tag); > + > + return cxlr_add_extent(cxlr_dax, cxled, ed_extent); > +} > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 01a447aaa1b1..f629ad7488ac 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds) > } > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); > > +static int cxl_validate_extent(struct cxl_memdev_state 
*mds, > + struct cxl_extent *extent) > +{ > + u64 start = le64_to_cpu(extent->start_dpa); > + u64 length = le64_to_cpu(extent->length); > + struct device *dev = mds->cxlds.dev; > + > + struct range ext_range = (struct range){ > + .start = start, > + .end = start + length - 1, > + }; > + > + if (le16_to_cpu(extent->shared_extn_seq) != 0) { > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) can not be shared\n", > + &ext_range.start, CXL_EXTENT_TAG_LEN, > + extent->tag); > + return -ENXIO; > + } > + > + /* Extents must not cross DC region boundary's */ > + for (int i = 0; i < mds->nr_dc_region; i++) { > + struct cxl_dc_region_info *dcr = &mds->dc_region[i]; > + struct range region_range = (struct range) { > + .start = dcr->base, > + .end = dcr->base + dcr->decode_len - 1, > + }; > + > + if (range_contains(&region_range, &ext_range)) { > + dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n", > + &ext_range, i, start - dcr->base, > + CXL_EXTENT_TAG_LEN, extent->tag); > + return 0; > + } > + } > + > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) is not in any DC region\n", > + &ext_range, CXL_EXTENT_TAG_LEN, extent->tag); > + return -ENXIO; > +} > + > void cxl_event_trace_record(const struct cxl_memdev *cxlmd, > enum cxl_event_log_type type, > enum cxl_event_type event_type, > @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds, > return rc; > } > > +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode, > + struct xarray *extent_array, int cnt) > +{ > + struct cxl_mbox_dc_response *p; > + struct cxl_mbox_cmd mbox_cmd; > + struct cxl_extent *extent; > + unsigned long index; > + u32 pl_index; > + int rc = 0; > + > + size_t pl_size = struct_size(p, extent_list, cnt); > + u32 max_extents = cnt; > + > + /* May have to use more bit on response.
*/ > + if (pl_size > mds->payload_size) { > + max_extents = (mds->payload_size - sizeof(*p)) / > + sizeof(struct updated_extent_list); > + pl_size = struct_size(p, extent_list, max_extents); > + } > + > + struct cxl_mbox_dc_response *response __free(kfree) = > + kzalloc(pl_size, GFP_KERNEL); > + if (!response) > + return -ENOMEM; > + > + pl_index = 0; > + xa_for_each(extent_array, index, extent) { > + > + response->extent_list[pl_index].dpa_start = extent->start_dpa; > + response->extent_list[pl_index].length = extent->length; > + pl_index++; > + response->extent_list_size = cpu_to_le32(pl_index); > + > + if (pl_index == max_extents) { > + mbox_cmd = (struct cxl_mbox_cmd) { > + .opcode = opcode, > + .size_in = struct_size(response, extent_list, > + pl_index), > + .payload_in = response, > + }; > + > + response->flags = 0; > + if (pl_index < cnt) > + response->flags &= CXL_DCD_EVENT_MORE; Should this be "response->flags |= CXL_DCD_EVENT_MORE"? Since flags was just set to 0, "&=" can never set the bit. There also seems to be a bug when the value of 'cnt' is double the value of 'max_extents': the response command is sent twice inside this xa_for_each() loop, and CXL_DCD_EVENT_MORE would be set both times, because 'pl_index < cnt' is always true at that point. > + > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); > + if (rc) > + return rc; > + pl_index = 0; > + } > + } > + > + if (pl_index) { > + mbox_cmd = (struct cxl_mbox_cmd) { > + .opcode = opcode, > + .size_in = struct_size(response, extent_list, > + pl_index), > + .payload_in = response, > + }; > + > + response->flags = 0; > + rc = cxl_internal_send_cmd(mds, &mbox_cmd); > + } > + > + return rc; > +} > +
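A minimal user-space sketch of the chunking concern raised in the review comment above, not the driver code itself. It assumes the intent is that every response chunk except the last one carries the 'more' bit, so the decision has to track how many extents have been sent in total rather than compare the per-chunk index against 'cnt'. The names chunk_flags() and plan_chunks() are hypothetical, and DCD_EVENT_MORE is a stand-in for CXL_DCD_EVENT_MORE.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for CXL_DCD_EVENT_MORE, which the patch defines as BIT(0). */
#define DCD_EVENT_MORE 0x01u

/*
 * Hypothetical helper: flags for one response chunk.
 * 'sent' is the number of extents already sent in earlier chunks,
 * 'chunk_len' the number in this chunk, 'total' the overall count.
 * The more bit is set with |= only while extents remain after this chunk.
 */
static uint8_t chunk_flags(size_t sent, size_t chunk_len, size_t total)
{
	uint8_t flags = 0;

	if (sent + chunk_len < total)
		flags |= DCD_EVENT_MORE;
	return flags;
}

/*
 * Hypothetical helper: split 'total' extents into chunks of at most
 * 'max_per_chunk' and record the flags of each chunk in out[].
 * Returns the number of chunks planned (bounded by out_len).
 */
static size_t plan_chunks(size_t total, size_t max_per_chunk,
			  uint8_t *out, size_t out_len)
{
	size_t sent = 0, n = 0;

	if (max_per_chunk == 0)
		return 0;

	while (sent < total && n < out_len) {
		size_t len = total - sent;

		if (len > max_per_chunk)
			len = max_per_chunk;
		out[n++] = chunk_flags(sent, len, total);
		sent += len;
	}
	return n;
}
```

With total = 4 and max_per_chunk = 2 this plans two chunks, the first with the more bit set and the second without, which is the case the reviewer describes where 'cnt' is double 'max_extents'.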
On Fri, Aug 16, 2024 at 09:44:26AM -0500, ira.weiny@intel.com wrote: > From: Navneet Singh <navneet.singh@intel.com> > > A dynamic capacity device (DCD) sends events to signal the host for > changes in the availability of Dynamic Capacity (DC) memory. These > events contain extents describing a DPA range and meta data for memory > to be added or removed. Events may be sent from the device at any time. > > Three types of events can be signaled, Add, Release, and Force Release. > > On add, the host may accept or reject the memory being offered. If no > region exists, or the extent is invalid, the extent should be rejected. > Add extent events may be grouped by a 'more' bit which indicates those > extents should be processed as a group. > > On remove, the host can delay the response until the host is safely not > using the memory. If no region exists the release can be sent > immediately. The host may also release extents (or partial extents) at > any time. Thus the 'more' bit grouping of release events is of less > value and can be ignored in favor of sending multiple release capacity > responses for groups of release events. > > Force removal is intended as a mechanism between the FM and the device > and intended only when the host is unresponsive, out of sync, or > otherwise broken. Purposely ignore force removal events. > > Regions are made up of one or more devices which may be surfacing memory > to the host. Once all devices in a region have surfaced an extent the > region can expose a corresponding extent for the user to consume. > Without interleaving a device extent forms a 1:1 relationship with the > region extent. Immediately surface a region extent upon getting a > device extent. > > Per the specification the device is allowed to offer or remove extents > at any time. However, anticipated use cases can expect extents to be > offered, accepted, and removed in well defined chunks. > > Simplify extent tracking with the following restrictions. 
> > 1) Flag for removal any extent which overlaps a requested > release range. > 2) Refuse the offer of extents which overlap already accepted > memory ranges. > 3) Accept again a range which has already been accepted by the > host. (It is likely the device has an error because it > should already know that this range was accepted. But from > the host point of view it is safe to acknowledge that > acceptance again.) > > Management of the region extent devices must be synchronized with > potential uses of the memory within the DAX layer. Create region extent > devices as children of the cxl_dax_region device such that the DAX > region driver can co-drive them and synchronize with the DAX layer. > Synchronization and management is handled in a subsequent patch. > > Process DCD events and create region devices. > > Signed-off-by: Navneet Singh <navneet.singh@intel.com> > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > One more minor inline. 
> +static int cxl_validate_extent(struct cxl_memdev_state *mds, > + struct cxl_extent *extent) > +{ > + u64 start = le64_to_cpu(extent->start_dpa); > + u64 length = le64_to_cpu(extent->length); > + struct device *dev = mds->cxlds.dev; > + > + struct range ext_range = (struct range){ > + .start = start, > + .end = start + length - 1, > + }; > + > + if (le16_to_cpu(extent->shared_extn_seq) != 0) { > + dev_err_ratelimited(dev, > + "DC extent DPA %par (%*phC) can not be shared\n", > + &ext_range.start, CXL_EXTENT_TAG_LEN, > + extent->tag); > + return -ENXIO; > + } > + > + /* Extents must not cross DC region boundary's */ > + for (int i = 0; i < mds->nr_dc_region; i++) { > + struct cxl_dc_region_info *dcr = &mds->dc_region[i]; > + struct range region_range = (struct range) { > + .start = dcr->base, > + .end = dcr->base + dcr->decode_len - 1, > + }; > + > + if (range_contains(&region_range, &ext_range)) { > + dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n", > + &ext_range, i, start - dcr->base, > + CXL_EXTENT_TAG_LEN, extent->tag); > + return 0; > + } > + } For extent validation, we may need to ensure the extent length is not 0, per the spec. Note that in testing I did not see an issue with this case: 0-sized extents pass this validation but are rejected later when we try to add them. Fan
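A minimal user-space sketch of the zero-length check suggested above, under stated assumptions. With length == 0 the computed end is start - 1, so a containment test of the form "outer.start <= inner.start && outer.end >= inner.end" can still pass, which matches the observation that such extents get through validation. struct range and range_contains() here are local stand-ins modeled on the kernel helpers, validate_extent_range() is a hypothetical name, and the overflow guard is an extra assumption that the review does not ask for.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the kernel's struct range (inclusive end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Stand-in for range_contains(): true if 'inner' lies fully within 'outer'. */
static bool range_contains(const struct range *outer, const struct range *inner)
{
	return outer->start <= inner->start && outer->end >= inner->end;
}

/*
 * Hypothetical validation helper: reject zero-length extents up front,
 * guard against start + length - 1 wrapping, then require the extent to
 * sit entirely inside one DC region.
 */
static int validate_extent_range(uint64_t start, uint64_t length,
				 const struct range *region,
				 struct range *out)
{
	if (length == 0)
		return -EINVAL;

	/* start + (length - 1) must not wrap around UINT64_MAX */
	if (start > UINT64_MAX - (length - 1))
		return -EINVAL;

	out->start = start;
	out->end = start + length - 1;

	return range_contains(region, out) ? 0 : -ENXIO;
}
```

Rejecting the zero length before building the range avoids relying on a later stage to catch an extent whose end is below its start.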
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile index 9259bcc6773c..3b812515e725 100644 --- a/drivers/cxl/core/Makefile +++ b/drivers/cxl/core/Makefile @@ -15,4 +15,4 @@ cxl_core-y += hdm.o cxl_core-y += pmu.o cxl_core-y += cdat.o cxl_core-$(CONFIG_TRACING) += trace.o -cxl_core-$(CONFIG_CXL_REGION) += region.o +cxl_core-$(CONFIG_CXL_REGION) += region.o extent.o diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h index 76c4153a9b2c..8dfc97b2e0a4 100644 --- a/drivers/cxl/core/core.h +++ b/drivers/cxl/core/core.h @@ -44,12 +44,24 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa, u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd, u64 dpa); +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent); +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent); #else static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd, u64 dpa) { return ULLONG_MAX; } +static inline int cxl_add_extent(struct cxl_memdev_state *mds, + struct cxl_extent *extent) +{ + return 0; +} +static inline int cxl_rm_extent(struct cxl_memdev_state *mds, + struct cxl_extent *extent) +{ + return 0; +} static inline struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa, struct cxl_endpoint_decoder **cxled) @@ -121,5 +133,6 @@ long cxl_pci_get_latency(struct pci_dev *pdev); int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr, enum access_coordinate_class access); bool cxl_need_node_perf_attrs_update(int nid); +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range); #endif /* __CXL_CORE_H__ */ diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c new file mode 100644 index 000000000000..34456594cdc3 --- /dev/null +++ b/drivers/cxl/core/extent.c @@ -0,0 +1,345 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2024 Intel Corporation. All rights reserved. 
*/ + +#include <linux/device.h> +#include <cxl.h> + +#include "core.h" + +static void cxled_release_extent(struct cxl_endpoint_decoder *cxled, + struct cxled_extent *ed_extent) +{ + struct cxl_memdev_state *mds = cxled_to_mds(cxled); + struct device *dev = &cxled->cxld.dev; + + dev_dbg(dev, "Remove extent %par (%*phC)\n", &ed_extent->dpa_range, + CXL_EXTENT_TAG_LEN, ed_extent->tag); + memdev_release_extent(mds, &ed_extent->dpa_range); + kfree(ed_extent); +} + +static void free_region_extent(struct region_extent *region_extent) +{ + struct cxled_extent *ed_extent; + unsigned long index; + + /* + * Remove from each endpoint decoder the extent which backs this region + * extent + */ + xa_for_each(&region_extent->decoder_extents, index, ed_extent) + cxled_release_extent(ed_extent->cxled, ed_extent); + xa_destroy(&region_extent->decoder_extents); + ida_free(&region_extent->cxlr_dax->extent_ida, region_extent->dev.id); + kfree(region_extent); +} + +static void region_extent_release(struct device *dev) +{ + struct region_extent *region_extent = to_region_extent(dev); + + free_region_extent(region_extent); +} + +static const struct device_type region_extent_type = { + .name = "extent", + .release = region_extent_release, +}; + +bool is_region_extent(struct device *dev) +{ + return dev->type == &region_extent_type; +} +EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL); + +static void region_extent_unregister(void *ext) +{ + struct region_extent *region_extent = ext; + + dev_dbg(&region_extent->dev, "DAX region rm extent HPA %par\n", + &region_extent->hpa_range); + device_unregister(&region_extent->dev); +} + +static void region_rm_extent(struct region_extent *region_extent) +{ + struct device *region_dev = region_extent->dev.parent; + + devm_release_action(region_dev, region_extent_unregister, region_extent); +} + +static struct region_extent * +alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range, u8 *tag) +{ + int id; + + struct region_extent *region_extent __free(kfree) = +
kzalloc(sizeof(*region_extent), GFP_KERNEL); + if (!region_extent) + return ERR_PTR(-ENOMEM); + + id = ida_alloc(&cxlr_dax->extent_ida, GFP_KERNEL); + if (id < 0) + return ERR_PTR(-ENOMEM); + + region_extent->hpa_range = *hpa_range; + region_extent->cxlr_dax = cxlr_dax; + import_uuid(&region_extent->tag, tag); + region_extent->dev.id = id; + xa_init(&region_extent->decoder_extents); + return no_free_ptr(region_extent); +} + +static int online_region_extent(struct region_extent *region_extent) +{ + struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax; + struct device *dev; + int rc; + + dev = &region_extent->dev; + device_initialize(dev); + device_set_pm_not_required(dev); + dev->parent = &cxlr_dax->dev; + dev->type = &region_extent_type; + rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id); + if (rc) + goto err; + + rc = device_add(dev); + if (rc) + goto err; + + dev_dbg(dev, "region extent HPA %par\n", &region_extent->hpa_range); + return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister, + region_extent); + +err: + dev_err(&cxlr_dax->dev, "Failed to initialize region extent HPA %par\n", + &region_extent->hpa_range); + + put_device(dev); + return rc; +} + +struct match_data { + struct cxl_endpoint_decoder *cxled; + struct range *new_range; +}; + +static int match_contains(struct device *dev, void *data) +{ + struct region_extent *region_extent = to_region_extent(dev); + struct match_data *md = data; + struct cxled_extent *entry; + unsigned long index; + + if (!region_extent) + return 0; + + xa_for_each(&region_extent->decoder_extents, index, entry) { + if (md->cxled == entry->cxled && + range_contains(&entry->dpa_range, md->new_range)) + return true; + } + return false; +} + +static bool extents_contain(struct cxl_dax_region *cxlr_dax, + struct cxl_endpoint_decoder *cxled, + struct range *new_range) +{ + struct device *extent_device; + struct match_data md = { + .cxled = cxled, + .new_range = new_range, + }; + + extent_device =
device_find_child(&cxlr_dax->dev, &md, match_contains); + if (!extent_device) + return false; + + put_device(extent_device); + return true; +} + +static int match_overlaps(struct device *dev, void *data) +{ + struct region_extent *region_extent = to_region_extent(dev); + struct match_data *md = data; + struct cxled_extent *entry; + unsigned long index; + + if (!region_extent) + return 0; + + xa_for_each(&region_extent->decoder_extents, index, entry) { + if (md->cxled == entry->cxled && + range_overlaps(&entry->dpa_range, md->new_range)) + return true; + } + + return false; +} + +static bool extents_overlap(struct cxl_dax_region *cxlr_dax, + struct cxl_endpoint_decoder *cxled, + struct range *new_range) +{ + struct device *extent_device; + struct match_data md = { + .cxled = cxled, + .new_range = new_range, + }; + + extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps); + if (!extent_device) + return false; + + put_device(extent_device); + return true; +} + +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled, + struct cxl_dax_region *cxlr_dax, + struct range *dpa_range, + struct range *hpa_range) +{ + resource_size_t dpa_offset, hpa; + + dpa_offset = dpa_range->start - cxled->dpa_res->start; + hpa = cxled->cxld.hpa_range.start + dpa_offset; + + hpa_range->start = hpa - cxlr_dax->hpa_range.start; + hpa_range->end = hpa_range->start + range_len(dpa_range) - 1; +} + +static int cxlr_rm_extent(struct device *dev, void *data) +{ + struct region_extent *region_extent = to_region_extent(dev); + struct range *region_hpa_range = data; + + if (!region_extent) + return 0; + + /* + * Any extent which 'touches' the released range is removed.
+ */ + if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) { + dev_dbg(dev, "Remove region extent HPA %par\n", + &region_extent->hpa_range); + region_rm_extent(region_extent); + } + return 0; +} + +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) +{ + u64 start_dpa = le64_to_cpu(extent->start_dpa); + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; + struct cxl_endpoint_decoder *cxled; + struct range hpa_range, dpa_range; + struct cxl_region *cxlr; + + dpa_range = (struct range) { + .start = start_dpa, + .end = start_dpa + le64_to_cpu(extent->length) - 1, + }; + + guard(rwsem_read)(&cxl_region_rwsem); + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); + if (!cxlr) { + memdev_release_extent(mds, &dpa_range); + return -ENXIO; + } + + calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range); + + /* Remove region extents which overlap */ + return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range, + cxlr_rm_extent); +} + +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax, + struct cxl_endpoint_decoder *cxled, + struct cxled_extent *ed_extent) +{ + struct region_extent *region_extent; + struct range hpa_range; + int rc; + + calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range); + + region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag); + if (IS_ERR(region_extent)) + return PTR_ERR(region_extent); + + rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent, + GFP_KERNEL); + if (rc) { + free_region_extent(region_extent); + return rc; + } + + /* device model handles freeing region_extent */ + return online_region_extent(region_extent); +} + +/* Callers are expected to ensure cxled has been attached to a region */ +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent) +{ + u64 start_dpa = le64_to_cpu(extent->start_dpa); + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd; + struct cxl_endpoint_decoder *cxled; + struct range ed_range, ext_range; +
struct cxl_dax_region *cxlr_dax; + struct cxled_extent *ed_extent; + struct cxl_region *cxlr; + struct device *dev; + + ext_range = (struct range) { + .start = start_dpa, + .end = start_dpa + le64_to_cpu(extent->length) - 1, + }; + + guard(rwsem_read)(&cxl_region_rwsem); + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled); + if (!cxlr) + return -ENXIO; + + cxlr_dax = cxled->cxld.region->cxlr_dax; + dev = &cxled->cxld.dev; + ed_range = (struct range) { + .start = cxled->dpa_res->start, + .end = cxled->dpa_res->end, + }; + + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n", + cxled->dpa_res, &ext_range); + + if (!range_contains(&ed_range, &ext_range)) { + dev_err_ratelimited(dev, + "DC extent DPA %par (%*phC) is not fully in ED %par\n", + &ext_range.start, CXL_EXTENT_TAG_LEN, + extent->tag, &ed_range); + return -ENXIO; + } + + if (extents_contain(cxlr_dax, cxled, &ext_range)) + return 0; + + if (extents_overlap(cxlr_dax, cxled, &ext_range)) + return -ENXIO; + + ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL); + if (!ed_extent) + return -ENOMEM; + + ed_extent->cxled = cxled; + ed_extent->dpa_range = ext_range; + memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN); + + dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range, + CXL_EXTENT_TAG_LEN, ed_extent->tag); + + return cxlr_add_extent(cxlr_dax, cxled, ed_extent); +} diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c index 01a447aaa1b1..f629ad7488ac 100644 --- a/drivers/cxl/core/mbox.c +++ b/drivers/cxl/core/mbox.c @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds) } EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); +static int cxl_validate_extent(struct cxl_memdev_state *mds, + struct cxl_extent *extent) +{ + u64 start = le64_to_cpu(extent->start_dpa); + u64 length = le64_to_cpu(extent->length); + struct device *dev = mds->cxlds.dev; + + struct range ext_range = (struct range){ + .start = start, + .end = start + length - 1, + }; + + if 
(le16_to_cpu(extent->shared_extn_seq) != 0) { + dev_err_ratelimited(dev, + "DC extent DPA %par (%*phC) can not be shared\n", + &ext_range.start, CXL_EXTENT_TAG_LEN, + extent->tag); + return -ENXIO; + } + + /* Extents must not cross DC region boundary's */ + for (int i = 0; i < mds->nr_dc_region; i++) { + struct cxl_dc_region_info *dcr = &mds->dc_region[i]; + struct range region_range = (struct range) { + .start = dcr->base, + .end = dcr->base + dcr->decode_len - 1, + }; + + if (range_contains(&region_range, &ext_range)) { + dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n", + &ext_range, i, start - dcr->base, + CXL_EXTENT_TAG_LEN, extent->tag); + return 0; + } + } + + dev_err_ratelimited(dev, + "DC extent DPA %par (%*phC) is not in any DC region\n", + &ext_range, CXL_EXTENT_TAG_LEN, extent->tag); + return -ENXIO; +} + void cxl_event_trace_record(const struct cxl_memdev *cxlmd, enum cxl_event_log_type type, enum cxl_event_type event_type, @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds, return rc; } +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode, + struct xarray *extent_array, int cnt) +{ + struct cxl_mbox_dc_response *p; + struct cxl_mbox_cmd mbox_cmd; + struct cxl_extent *extent; + unsigned long index; + u32 pl_index; + int rc = 0; + + size_t pl_size = struct_size(p, extent_list, cnt); + u32 max_extents = cnt; + + /* May have to use more bit on response.
*/ + if (pl_size > mds->payload_size) { + max_extents = (mds->payload_size - sizeof(*p)) / + sizeof(struct updated_extent_list); + pl_size = struct_size(p, extent_list, max_extents); + } + + struct cxl_mbox_dc_response *response __free(kfree) = + kzalloc(pl_size, GFP_KERNEL); + if (!response) + return -ENOMEM; + + pl_index = 0; + xa_for_each(extent_array, index, extent) { + + response->extent_list[pl_index].dpa_start = extent->start_dpa; + response->extent_list[pl_index].length = extent->length; + pl_index++; + response->extent_list_size = cpu_to_le32(pl_index); + + if (pl_index == max_extents) { + mbox_cmd = (struct cxl_mbox_cmd) { + .opcode = opcode, + .size_in = struct_size(response, extent_list, + pl_index), + .payload_in = response, + }; + + response->flags = 0; + if (pl_index < cnt) + response->flags &= CXL_DCD_EVENT_MORE; + + rc = cxl_internal_send_cmd(mds, &mbox_cmd); + if (rc) + return rc; + pl_index = 0; + } + } + + if (pl_index) { + mbox_cmd = (struct cxl_mbox_cmd) { + .opcode = opcode, + .size_in = struct_size(response, extent_list, + pl_index), + .payload_in = response, + }; + + response->flags = 0; + rc = cxl_internal_send_cmd(mds, &mbox_cmd); + } + + return rc; +} + +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range) +{ + struct device *dev = mds->cxlds.dev; + struct xarray extent_list; + + struct cxl_extent extent = { + .start_dpa = cpu_to_le64(range->start), + .length = cpu_to_le64(range_len(range)), + }; + + dev_dbg(dev, "Release response dpa %par\n", range); + + xa_init(&extent_list); + if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) { + dev_dbg(dev, "Failed to release %par\n", range); + goto destroy; + } + + if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1)) + dev_dbg(dev, "Failed to release %par\n", range); + +destroy: + xa_destroy(&extent_list); +} + +static int validate_add_extent(struct cxl_memdev_state *mds, + struct cxl_extent *extent) +{ + int rc; + + rc = cxl_validate_extent(mds, 
extent); + if (rc) + return rc; + + return cxl_add_extent(mds, extent); +} + +static int cxl_add_pending(struct cxl_memdev_state *mds) +{ + struct device *dev = mds->cxlds.dev; + struct cxl_extent *extent; + unsigned long index; + unsigned long cnt = 0; + int rc; + + xa_for_each(&mds->pending_extents, index, extent) { + if (validate_add_extent(mds, extent)) { + dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n", + le64_to_cpu(extent->start_dpa), + le64_to_cpu(extent->length)); + xa_erase(&mds->pending_extents, index); + kfree(extent); + continue; + } + cnt++; + } + rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE, + &mds->pending_extents, cnt); + xa_for_each(&mds->pending_extents, index, extent) { + xa_erase(&mds->pending_extents, index); + kfree(extent); + } + return rc; +} + +static int handle_add_event(struct cxl_memdev_state *mds, + struct cxl_event_dcd *event) +{ + struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL); + struct device *dev = mds->cxlds.dev; + + if (!tmp) + return -ENOMEM; + + memcpy(tmp, &event->extent, sizeof(*tmp)); + if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp, + GFP_KERNEL)) { + kfree(tmp); + return -ENOMEM; + } + + if (event->flags & CXL_DCD_EVENT_MORE) { + dev_dbg(dev, "more bit set; delay the surfacing of extent\n"); + return 0; + } + + /* extents are removed and free'ed in cxl_add_pending() */ + return cxl_add_pending(mds); +} + +static char *cxl_dcd_evt_type_str(u8 type) +{ + switch (type) { + case DCD_ADD_CAPACITY: + return "add"; + case DCD_RELEASE_CAPACITY: + return "release"; + case DCD_FORCED_CAPACITY_RELEASE: + return "force release"; + default: + break; + } + + return "<unknown>"; +} + +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds, + struct cxl_event_record_raw *raw_rec) +{ + struct cxl_event_dcd *event = &raw_rec->event.dcd; + struct cxl_extent *extent = &event->extent; + struct device *dev = mds->cxlds.dev; + uuid_t *id = &raw_rec->id; + + if (!uuid_equal(id, 
&CXL_EVENT_DC_EVENT_UUID)) + return -EINVAL; + + dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n", + cxl_dcd_evt_type_str(event->event_type), + le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length)); + + switch (event->event_type) { + case DCD_ADD_CAPACITY: + return handle_add_event(mds, event); + case DCD_RELEASE_CAPACITY: + return cxl_rm_extent(mds, &event->extent); + case DCD_FORCED_CAPACITY_RELEASE: + dev_err_ratelimited(dev, "Forced release event ignored.\n"); + return 0; + default: + return -EINVAL; + } +} + static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, enum cxl_event_log_type type) { @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds, if (!nr_rec) break; - for (i = 0; i < nr_rec; i++) + for (i = 0; i < nr_rec; i++) { __cxl_event_trace_record(cxlmd, type, &payload->records[i]); + if (type == CXL_EVENT_TYPE_DCD) { + rc = cxl_handle_dcd_event_records(mds, + &payload->records[i]); + if (rc) + dev_err_ratelimited(dev, "dcd event failed: %d\n", + rc); + } + } if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW) trace_cxl_overflow(cxlmd, type, payload); @@ -1078,6 +1329,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status) { dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status); + if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD)) + cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD); if (status & CXLDEV_EVENT_STATUS_FATAL) cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL); if (status & CXLDEV_EVENT_STATUS_FAIL) @@ -1610,6 +1863,17 @@ int cxl_poison_state_init(struct cxl_memdev_state *mds) } EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL); +static void clear_pending_extents(void *_mds) +{ + struct cxl_memdev_state *mds = _mds; + struct cxl_extent *extent; + unsigned long index; + + xa_for_each(&mds->pending_extents, index, extent) + kfree(extent); + xa_destroy(&mds->pending_extents); +} + struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) { 
struct cxl_memdev_state *mds; @@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) mds->cxlds.type = CXL_DEVTYPE_CLASSMEM; mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID; mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID; + xa_init(&mds->pending_extents); + devm_add_action_or_reset(dev, clear_pending_extents, mds); return mds; } diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index 8e0884b52f84..8c9171f914fb 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -3037,6 +3037,7 @@ static void cxl_dax_region_release(struct device *dev) { struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev); + ida_destroy(&cxlr_dax->extent_ida); kfree(cxlr_dax); } @@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr) dev = &cxlr_dax->dev; cxlr_dax->cxlr = cxlr; + cxlr->cxlr_dax = cxlr_dax; + ida_init(&cxlr_dax->extent_ida); device_initialize(dev); lockdep_set_class(&dev->mutex, &cxl_dax_region_key); device_set_pm_not_required(dev); @@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr) static void cxlr_dax_unregister(void *_cxlr_dax) { struct cxl_dax_region *cxlr_dax = _cxlr_dax; + struct cxl_region *cxlr = cxlr_dax->cxlr; + cxlr->cxlr_dax = NULL; + cxlr_dax->cxlr = NULL; device_unregister(&cxlr_dax->dev); } diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 16861c867537..c858e3957fd5 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -11,6 +11,7 @@ #include <linux/log2.h> #include <linux/node.h> #include <linux/io.h> +#include <linux/cxl-event.h> extern const struct nvdimm_security_ops *cxl_security_ops; @@ -169,11 +170,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw) #define CXLDEV_EVENT_STATUS_WARN BIT(1) #define CXLDEV_EVENT_STATUS_FAIL BIT(2) #define CXLDEV_EVENT_STATUS_FATAL BIT(3) +#define CXLDEV_EVENT_STATUS_DCD BIT(4) #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \ 
CXLDEV_EVENT_STATUS_WARN | \ CXLDEV_EVENT_STATUS_FAIL | \ - CXLDEV_EVENT_STATUS_FATAL) + CXLDEV_EVENT_STATUS_FATAL | \ + CXLDEV_EVENT_STATUS_DCD) /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */ #define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0) @@ -444,6 +447,18 @@ enum cxl_decoder_state { CXL_DECODER_STATE_AUTO, }; +/** + * struct cxled_extent - Extent within an endpoint decoder + * @cxled: Reference to the endpoint decoder + * @dpa_range: DPA range this extent covers within the decoder + * @tag: Tag from device for this extent + */ +struct cxled_extent { + struct cxl_endpoint_decoder *cxled; + struct range dpa_range; + u8 tag[CXL_EXTENT_TAG_LEN]; +}; + /** * struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder * @cxld: base cxl_decoder_object @@ -569,6 +584,7 @@ struct cxl_region_params { * @type: Endpoint decoder target type * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge * @flags: Region state flags * @params: active + config params for the region * @coord: QoS access coordinates for the region @@ -582,6 +598,7 @@ struct cxl_region { enum cxl_decoder_type type; struct cxl_nvdimm_bridge *cxl_nvb; struct cxl_pmem_region *cxlr_pmem; + struct cxl_dax_region *cxlr_dax; unsigned long flags; struct cxl_region_params params; struct access_coordinate coord[ACCESS_COORDINATE_MAX]; @@ -622,12 +639,45 @@ struct cxl_pmem_region { struct cxl_pmem_region_mapping mapping[]; }; +/* See CXL 3.0 8.2.9.2.1.5 */ +enum dc_event { + DCD_ADD_CAPACITY, + DCD_RELEASE_CAPACITY, + DCD_FORCED_CAPACITY_RELEASE, + DCD_REGION_CONFIGURATION_UPDATED, +}; + struct cxl_dax_region { struct device dev; struct cxl_region *cxlr; struct range hpa_range; + struct ida extent_ida; }; +/** + * struct region_extent - CXL DAX region extent + * @dev: device representing this extent + * @cxlr_dax: back reference to parent region device + * @hpa_range: HPA 
range of this extent + * @tag: tag of the extent + * @decoder_extents: Endpoint decoder extents which make up this region extent + */ +struct region_extent { + struct device dev; + struct cxl_dax_region *cxlr_dax; + struct range hpa_range; + uuid_t tag; + struct xarray decoder_extents; +}; + +bool is_region_extent(struct device *dev); +static inline struct region_extent *to_region_extent(struct device *dev) +{ + if (!is_region_extent(dev)) + return NULL; + return container_of(dev, struct region_extent, dev); +} + /** * struct cxl_port - logical collection of upstream port devices and * downstream port devices to construct a CXL memory diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h index d41bec5433db..3a40fe1f0be7 100644 --- a/drivers/cxl/cxlmem.h +++ b/drivers/cxl/cxlmem.h @@ -497,6 +497,7 @@ struct cxl_dc_region_info { * @pmem_perf: performance data entry matched to PMEM partition * @nr_dc_region: number of DC regions implemented in the memory device * @dc_region: array containing info about the DC regions + * @pending_extents: array of extents pending during more bit processing * @event: event log driver state * @poison: poison driver state info * @security: security driver state info @@ -532,6 +533,7 @@ struct cxl_memdev_state { u8 nr_dc_region; struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION]; + struct xarray pending_extents; struct cxl_event_state event; struct cxl_poison_state poison; @@ -607,6 +609,21 @@ enum cxl_opcode { UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \ 0x40, 0x3d, 0x86) +/* + * Add Dynamic Capacity Response + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169 + */ +struct cxl_mbox_dc_response { + __le32 extent_list_size; + u8 flags; + u8 reserved[3]; + struct updated_extent_list { + __le64 dpa_start; + __le64 length; + u8 reserved[8]; + } __packed extent_list[]; +} __packed; + struct cxl_mbox_get_supported_logs { __le16 entries; u8 rsvd[6]; @@ -669,6 +686,14 @@ struct cxl_mbox_identify { 
UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \ 0x13, 0xb7, 0x74) +/* + * Dynamic Capacity Event Record + * CXL rev 3.1 section 8.2.9.2.1; Table 8-43 + */ +#define CXL_EVENT_DC_EVENT_UUID \ + UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \ + 0x10, 0x1a, 0x2a) + /* * Get Event Records output payload * CXL rev 3.0 section 8.2.9.2.2; Table 8-50 @@ -694,6 +719,7 @@ enum cxl_event_log_type { CXL_EVENT_TYPE_WARN, CXL_EVENT_TYPE_FAIL, CXL_EVENT_TYPE_FATAL, + CXL_EVENT_TYPE_DCD, CXL_EVENT_TYPE_MAX }; diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h index 0bea1afbd747..eeda8059d81a 100644 --- a/include/linux/cxl-event.h +++ b/include/linux/cxl-event.h @@ -96,11 +96,43 @@ struct cxl_event_mem_module { u8 reserved[0x3d]; } __packed; +/* + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51 + */ +#define CXL_EXTENT_TAG_LEN 0x10 +struct cxl_extent { + __le64 start_dpa; + __le64 length; + u8 tag[CXL_EXTENT_TAG_LEN]; + __le16 shared_extn_seq; + u8 reserved[0x6]; +} __packed; + +/* + * Dynamic Capacity Event Record + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50 + */ +#define CXL_DCD_EVENT_MORE BIT(0) +struct cxl_event_dcd { + struct cxl_event_record_hdr hdr; + u8 event_type; + u8 validity_flags; + __le16 host_id; + u8 region_index; + u8 flags; + u8 reserved1[0x2]; + struct cxl_extent extent; + u8 reserved2[0x18]; + __le32 num_avail_extents; + __le32 num_avail_tags; +} __packed; + union cxl_event { struct cxl_event_generic generic; struct cxl_event_gen_media gen_media; struct cxl_event_dram dram; struct cxl_event_mem_module mem_module; + struct cxl_event_dcd dcd; /* dram & gen_media event header */ struct cxl_event_media_hdr media_hdr; } __packed; diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild index 030b388800f0..8238588fffdf 100644 --- a/tools/testing/cxl/Kbuild +++ b/tools/testing/cxl/Kbuild @@ -61,7 +61,8 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o cxl_core-y += $(CXL_CORE_SRC)/pmu.o cxl_core-y += 
$(CXL_CORE_SRC)/cdat.o cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o -cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o +cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \ + $(CXL_CORE_SRC)/extent.o cxl_core-y += config_check.o cxl_core-y += cxl_core_test.o cxl_core-y += cxl_core_exports.o
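A minimal user-space sketch of the add-side policy described in the commit message (restrictions 2 and 3) and implemented by the contain/overlap checks in cxl_add_extent(): a range already fully covered by an accepted extent is acknowledged again, a range that only partially overlaps accepted memory is refused, and anything else is new. This is a simplification under stated assumptions. It works on a flat array instead of the region extent devices, ignores the per-endpoint-decoder matching, and the names classify_add() and enum add_verdict are hypothetical. struct range and the two predicates are local stand-ins modeled on the kernel helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel's struct range (inclusive end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical verdicts for an offered extent. */
enum add_verdict {
	ADD_NEW,              /* no conflict, create a new extent */
	ADD_ALREADY_ACCEPTED, /* fully inside an accepted extent, accept again */
	ADD_REJECT_OVERLAP,   /* partial overlap with accepted memory, refuse */
};

static bool range_contains(const struct range *outer, const struct range *inner)
{
	return outer->start <= inner->start && outer->end >= inner->end;
}

static bool range_overlaps(const struct range *a, const struct range *b)
{
	return a->start <= b->end && a->end >= b->start;
}

/*
 * Check containment against every accepted extent first, then overlap,
 * mirroring the order of the extents_contain() and extents_overlap()
 * calls in the patch.
 */
static enum add_verdict classify_add(const struct range *accepted, size_t n,
				     const struct range *new_range)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (range_contains(&accepted[i], new_range))
			return ADD_ALREADY_ACCEPTED;

	for (i = 0; i < n; i++)
		if (range_overlaps(&accepted[i], new_range))
			return ADD_REJECT_OVERLAP;

	return ADD_NEW;
}
```

Checking containment before overlap matters because a contained range also overlaps; testing overlap first would refuse a re-offered extent that the commit message says is safe to acknowledge again.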