Message ID | 20211120000250.1663391-21-ben.widawsky@intel.com
---|---
State | Superseded
Series | Add drivers for CXL ports and mem devices
Hi Ben,

I love your patch! Yet something to improve:

[auto build test ERROR on 53989fad1286e652ea3655ae3367ba698da8d2ff]

url:    https://github.com/0day-ci/linux/commits/Ben-Widawsky/Add-drivers-for-CXL-ports-and-mem-devices/20211120-080513
base:   53989fad1286e652ea3655ae3367ba698da8d2ff
config: alpha-randconfig-r026-20211118 (attached as .config)
compiler: alpha-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/8ff43502e84dd4fa1296a131cb0cc82146389db4
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ben-Widawsky/Add-drivers-for-CXL-ports-and-mem-devices/20211120-080513
        git checkout 8ff43502e84dd4fa1296a131cb0cc82146389db4
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=alpha

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/cxl/port.c: In function 'get_decoder_size':
>> drivers/cxl/port.c:105:16: error: implicit declaration of function 'ioread64_hi_lo' [-Werror=implicit-function-declaration]
     105 |         return ioread64_hi_lo(hdm_decoder +
         |                ^~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/ioread64_hi_lo +105 drivers/cxl/port.c

    97
    98	static u64 get_decoder_size(void __iomem *hdm_decoder, int n)
    99	{
   100		u32 ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(n));
   101
   102		if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED)
   103			return 0;
   104
 > 105		return ioread64_hi_lo(hdm_decoder +
   106				      CXL_HDM_DECODER0_SIZE_LOW_OFFSET(n));
   107	}
   108

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi Ben,

I love your patch! Yet something to improve:

[auto build test ERROR on 53989fad1286e652ea3655ae3367ba698da8d2ff]

url:    https://github.com/0day-ci/linux/commits/Ben-Widawsky/Add-drivers-for-CXL-ports-and-mem-devices/20211120-080513
base:   53989fad1286e652ea3655ae3367ba698da8d2ff
config: arm-randconfig-r034-20211119 (attached as .config)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/8ff43502e84dd4fa1296a131cb0cc82146389db4
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ben-Widawsky/Add-drivers-for-CXL-ports-and-mem-devices/20211120-080513
        git checkout 8ff43502e84dd4fa1296a131cb0cc82146389db4
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=arm

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> drivers/cxl/port.c:105:9: error: implicit declaration of function 'ioread64_hi_lo' [-Werror,-Wimplicit-function-declaration]
           return ioread64_hi_lo(hdm_decoder +
                  ^
   drivers/cxl/port.c:191:11: error: implicit declaration of function 'ioread64_hi_lo' [-Werror,-Wimplicit-function-declaration]
                   base = ioread64_hi_lo(decoderN(BASE_LOW_OFFSET, i));
                          ^
   2 errors generated.


vim +/ioread64_hi_lo +105 drivers/cxl/port.c

    97
    98	static u64 get_decoder_size(void __iomem *hdm_decoder, int n)
    99	{
   100		u32 ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(n));
   101
   102		if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED)
   103			return 0;
   104
 > 105		return ioread64_hi_lo(hdm_decoder +
   106				      CXL_HDM_DECODER0_SIZE_LOW_OFFSET(n));
   107	}
   108

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
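[Editorial note] Both robot reports point at the same root cause: ioread64_hi_lo() is declared by <linux/io-64-nonatomic-hi-lo.h> on configurations that lack a native 64-bit MMIO read, and the posted port.c never includes that header. A minimal sketch of the likely fix follows; this is an assumption about how the series would be respun, not a confirmed follow-up patch:

	--- a/drivers/cxl/port.c
	+++ b/drivers/cxl/port.c
	 /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
	 #include <linux/device.h>
	+#include <linux/io-64-nonatomic-hi-lo.h>
	 #include <linux/module.h>
	 #include <linux/slab.h>

With that header in place, ioread64_hi_lo() is provided as two 32-bit reads (high word first, then low) wherever the architecture does not supply its own 64-bit accessor, which is exactly the case the alpha and arm randconfigs exercise.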
On Fri, 19 Nov 2021 16:02:47 -0800 Ben Widawsky <ben.widawsky@intel.com> wrote: > The CXL port driver is responsible for managing the decoder resources > contained within the port. It will also provide APIs that other drivers > will consume for managing these resources. > > There are 4 types of ports in a system: > 1. Platform port. This is a non-programmable entity. Such a port is > named rootX. It is enumerated by cxl_acpi in an ACPI based system. > 2. Hostbridge port. This ports register access is defined in a platform port's > specific way (CHBS for ACPI platforms). It has n downstream ports, > each of which are known as CXL 2.0 root ports. Once the platform > specific mechanism to get the offset to the registers is obtained it > operates just like other CXL components. The enumeration of this > component is started by cxl_acpi and completed by cxl_port. > 3. Switch port. A switch port is similar to a hostbridge port except > register access is defined in the CXL specification in a platform > agnostic way. The downstream ports for a switch are simply known as > downstream ports. The enumeration of these are entirely contained > within cxl_port. > 4. Endpoint port. Endpoint ports are similar to switch ports with the > exception that they have no downstream ports, only the underlying > media on the device. The enumeration of these are started by cxl_pci, > and completed by cxl_port. > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> A few comments inline including what looks to me memory on the stack which has gone out of scope when it's accessed. Jonathan > > --- > Changes since RFCv2: > - Reword commit message tense (Dan) > - Reword commit message > - Drop SOFTDEP since it's not needed yet (Dan) > - Add CONFIG_CXL_PORT (Dan) > - s/CXL_DECODER_F_EN/CXL_DECODER_F_ENABLE (Dan) > - rename cxl_hdm_decoder_ functions to "to_" (Dan) > - remove useless inline (Dan) > - Check endpoint decoder based on dport list instead of driver id (Dan) > - Use range instead of resource per dependent patch change > - Use clever union packing for target list (Dan) > - Only check NULL from devm_cxl_iomap_block (Dan) > - Properly parent the created cxl decoders > - Move bus rescanning from cxl_acpi to here (Dan) > - Remove references to "CFMWS" in core (Dan) > - Use macro to help keep within 80 character lines > --- > .../driver-api/cxl/memory-devices.rst | 5 + > drivers/cxl/Kconfig | 22 ++ > drivers/cxl/Makefile | 2 + > drivers/cxl/core/bus.c | 67 ++++ > drivers/cxl/core/regs.c | 6 +- > drivers/cxl/cxl.h | 34 +- > drivers/cxl/port.c | 323 ++++++++++++++++++ > 7 files changed, 450 insertions(+), 9 deletions(-) > create mode 100644 drivers/cxl/port.c > > diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst > index 3b8f41395f6b..fbf0393cdddc 100644 > --- a/Documentation/driver-api/cxl/memory-devices.rst > +++ b/Documentation/driver-api/cxl/memory-devices.rst > @@ -28,6 +28,11 @@ CXL Memory Device > .. kernel-doc:: drivers/cxl/pci.c > :internal: > > +CXL Port > +-------- > +.. kernel-doc:: drivers/cxl/port.c > + :doc: cxl port > + > CXL Core > -------- > .. kernel-doc:: drivers/cxl/cxl.h > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig > index ef05e96f8f97..3aeb33bba5a3 100644 > --- a/drivers/cxl/Kconfig > +++ b/drivers/cxl/Kconfig > @@ -77,4 +77,26 @@ config CXL_PMEM > provisioning the persistent memory capacity of CXL memory expanders. > > If unsure say 'm'. 
> + > +config CXL_MEM > + tristate "CXL.mem: Memory Devices" > + select CXL_PORT > + depends on CXL_PCI > + default CXL_BUS > + help > + The CXL.mem protocol allows a device to act as a provider of "System > + RAM" and/or "Persistent Memory" that is fully coherent as if the > + memory was attached to the typical CPU memory controller. This is > + known as HDM "Host-managed Device Memory". > + > + Say 'y/m' to enable a driver that will attach to CXL.mem devices for > + memory expansion and control of HDM. See Chapter 9.13 in the CXL 2.0 > + specification for a detailed description of HDM. > + > + If unsure say 'm'. > + > + > +config CXL_PORT > + tristate > + > endif > diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile > index cf07ae6cea17..56fcac2323cb 100644 > --- a/drivers/cxl/Makefile > +++ b/drivers/cxl/Makefile > @@ -3,7 +3,9 @@ obj-$(CONFIG_CXL_BUS) += core/ > obj-$(CONFIG_CXL_PCI) += cxl_pci.o > obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o > obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o > +obj-$(CONFIG_CXL_PORT) += cxl_port.o > > cxl_pci-y := pci.o > cxl_acpi-y := acpi.o > cxl_pmem-y := pmem.o > +cxl_port-y := port.o > diff --git a/drivers/cxl/core/bus.c b/drivers/cxl/core/bus.c > index 9e0d7d5d9298..46a06cfe79bd 100644 > --- a/drivers/cxl/core/bus.c > +++ b/drivers/cxl/core/bus.c > @@ -31,6 +31,8 @@ static DECLARE_RWSEM(root_port_sem); > > static struct device *cxl_topology_host; > > +static bool is_cxl_decoder(struct device *dev); > + > int cxl_register_topology_host(struct device *host) > { > down_write(&topology_host_sem); > @@ -75,6 +77,45 @@ static void put_cxl_topology_host(struct device *dev) > up_read(&topology_host_sem); > } > > +static int decoder_match(struct device *dev, void *data) > +{ > + struct resource *theirs = (struct resource *)data; > + struct cxl_decoder *cxld; > + > + if (!is_cxl_decoder(dev)) > + return 0; > + > + cxld = to_cxl_decoder(dev); > + if (theirs->start <= cxld->decoder_range.start && > + theirs->end >= cxld->decoder_range.end) > + return 1; > + > + return 0; > +} > + > +static struct cxl_decoder *cxl_find_root_decoder(resource_size_t base, > + resource_size_t size) > +{ > + struct cxl_decoder *cxld = NULL; > + struct device *cxldd; > + struct device *host; > + struct resource res = (struct resource){ > + .start = base, > + .end = base + size - 1, > + }; > + > + host = get_cxl_topology_host(); > + if (!host) > + return NULL; > + > + cxldd = device_find_child(host, &res, decoder_match); > + if (cxldd) > + cxld = to_cxl_decoder(cxldd); > + > + put_cxl_topology_host(host); > + return cxld; > +} > + > static ssize_t devtype_show(struct device *dev, struct device_attribute *attr, > char *buf) > { > @@ -280,6 +321,11 @@ bool is_root_decoder(struct device *dev) > } > EXPORT_SYMBOL_NS_GPL(is_root_decoder, CXL); > > +static bool is_cxl_decoder(struct device *dev) > +{ > + return dev->type->release == cxl_decoder_release; > +} > + > struct cxl_decoder *to_cxl_decoder(struct device *dev) > { > if (dev_WARN_ONCE(dev, dev->type->release != cxl_decoder_release, > @@ -327,6 +373,7 @@ struct cxl_port *to_cxl_port(struct device *dev) > return NULL; > return container_of(dev, struct cxl_port, dev); > } > +EXPORT_SYMBOL_NS_GPL(to_cxl_port, CXL); > > struct cxl_dport *cxl_get_root_dport(struct device *dev) > { > @@ -735,6 +782,24 @@ EXPORT_SYMBOL_NS_GPL(cxl_decoder_add, CXL); > > static void cxld_unregister(void *dev) > { > + struct cxl_decoder *plat_decoder, *cxld = to_cxl_decoder(dev); > + resource_size_t base, size; > + > + if (is_root_decoder(dev)) { > + 
device_unregister(dev); > + return; > + } > + > + base = cxld->decoder_range.start; > + size = range_len(&cxld->decoder_range); > + > + if (size) { > + plat_decoder = cxl_find_root_decoder(base, size); > + if (plat_decoder) > + __release_region(&plat_decoder->platform_res, base, > + size); > + } > + > device_unregister(dev); > } > > @@ -789,6 +854,8 @@ static int cxl_device_id(struct device *dev) > return CXL_DEVICE_NVDIMM_BRIDGE; > if (dev->type == &cxl_nvdimm_type) > return CXL_DEVICE_NVDIMM; > + if (dev->type == &cxl_port_type) > + return CXL_DEVICE_PORT; > return 0; > } > > diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c > index 41a0245867ea..f191b0c995a7 100644 > --- a/drivers/cxl/core/regs.c > +++ b/drivers/cxl/core/regs.c > @@ -159,9 +159,8 @@ void cxl_probe_device_regs(struct device *dev, void __iomem *base, > } > EXPORT_SYMBOL_NS_GPL(cxl_probe_device_regs, CXL); > > -static void __iomem *devm_cxl_iomap_block(struct device *dev, > - resource_size_t addr, > - resource_size_t length) > +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > + resource_size_t length) > { > void __iomem *ret_val; > struct resource *res; > @@ -180,6 +179,7 @@ static void __iomem *devm_cxl_iomap_block(struct device *dev, > > return ret_val; > } > +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); > > int cxl_map_component_regs(struct pci_dev *pdev, > struct cxl_component_regs *regs, > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 3962a5e6a950..24fa16157d5e 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -17,6 +17,9 @@ > * (port-driver, region-driver, nvdimm object-drivers... etc). > */ > > +/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */ > +#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K > + > /* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/ > #define CXL_CM_OFFSET 0x1000 > #define CXL_CM_CAP_HDR_OFFSET 0x0 > @@ -36,11 +39,22 @@ > #define CXL_HDM_DECODER_CAP_OFFSET 0x0 > #define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0) > #define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4) > -#define CXL_HDM_DECODER0_BASE_LOW_OFFSET 0x10 > -#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET 0x14 > -#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET 0x18 > -#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET 0x1c > -#define CXL_HDM_DECODER0_CTRL_OFFSET 0x20 > +#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8) > +#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9) > +#define CXL_HDM_DECODER_CTRL_OFFSET 0x4 > +#define CXL_HDM_DECODER_ENABLE BIT(1) > +#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10) > +#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14) > +#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18) > +#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c) > +#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20) > +#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0) > +#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4) > +#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9) > +#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10) > +#define CXL_HDM_DECODER0_CTRL_TYPE BIT(12) > +#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24) > +#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28) > > static inline int cxl_hdm_decoder_count(u32 cap_hdr) > { > @@ -148,6 +162,8 @@ int cxl_map_device_regs(struct pci_dev *pdev, > enum cxl_regloc_type; > int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, > struct cxl_register_map *map); > +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > + 
resource_size_t length); > > #define CXL_RESOURCE_NONE ((resource_size_t) -1) > #define CXL_TARGET_STRLEN 20 > @@ -165,7 +181,8 @@ void cxl_unregister_topology_host(struct device *host); > #define CXL_DECODER_F_TYPE2 BIT(2) > #define CXL_DECODER_F_TYPE3 BIT(3) > #define CXL_DECODER_F_LOCK BIT(4) > -#define CXL_DECODER_F_MASK GENMASK(4, 0) > +#define CXL_DECODER_F_ENABLE BIT(5) > +#define CXL_DECODER_F_MASK GENMASK(5, 0) > > enum cxl_decoder_type { > CXL_DECODER_ACCELERATOR = 2, > @@ -255,6 +272,8 @@ struct cxl_walk_context { > * @dports: cxl_dport instances referenced by decoders > * @decoder_ida: allocator for decoder ids > * @component_reg_phys: component register capability base address (optional) > + * @rescan_work: worker object for bus rescans after port additions > + * @data: opaque data with driver specific usage > */ > struct cxl_port { > struct device dev; > @@ -263,6 +282,8 @@ struct cxl_port { > struct list_head dports; > struct ida decoder_ida; > resource_size_t component_reg_phys; > + struct work_struct rescan_work; > + void *data; > }; > > /** > @@ -325,6 +346,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv); > > #define CXL_DEVICE_NVDIMM_BRIDGE 1 > #define CXL_DEVICE_NVDIMM 2 > +#define CXL_DEVICE_PORT 3 > > #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*") > #define CXL_MODALIAS_FMT "cxl:t%d" > diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c > new file mode 100644 > index 000000000000..3c03131517af > --- /dev/null > +++ b/drivers/cxl/port.c > @@ -0,0 +1,323 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* Copyright(c) 2021 Intel Corporation. All rights reserved. */ > +#include <linux/device.h> > +#include <linux/module.h> > +#include <linux/slab.h> > + > +#include "cxlmem.h" > + > +/** > + * DOC: cxl port > + * > + * The port driver implements the set of functionality needed to allow full > + * decoder enumeration and routing. A CXL port is an abstraction of a CXL > + * component that implements some amount of CXL decoding of CXL.mem traffic. > + * As of the CXL 2.0 spec, this includes: > + * > + * .. list-table:: CXL Components w/ Ports > + * :widths: 25 25 50 > + * :header-rows: 1 > + * > + * * - component > + * - upstream > + * - downstream > + * * - Hostbridge > + * - ACPI0016 > + * - root port > + * * - Switch > + * - Switch Upstream Port > + * - Switch Downstream Port > + * * - Endpoint > + * - Endpoint Port > + * - N/A > + * > + * The primary service this driver provides is enumerating HDM decoders and > + * presenting APIs to other drivers to utilize the decoders. 
> + */ > + > +static struct workqueue_struct *cxl_port_wq; > + > +struct cxl_port_data { > + struct cxl_component_regs regs; > + > + struct port_caps { > + unsigned int count; > + unsigned int tc; > + unsigned int interleave11_8; > + unsigned int interleave14_12; > + } caps; > +}; > + > +static inline int to_interleave_granularity(u32 ctrl) > +{ > + int val = FIELD_GET(CXL_HDM_DECODER0_CTRL_IG_MASK, ctrl); > + > + return 256 << val; > +} > + > +static inline int to_interleave_ways(u32 ctrl) > +{ > + int val = FIELD_GET(CXL_HDM_DECODER0_CTRL_IW_MASK, ctrl); > + > + return 1 << val; > +} > + > +static void get_caps(struct cxl_port *port, struct cxl_port_data *cpd) > +{ > + void __iomem *hdm_decoder = cpd->regs.hdm_decoder; > + struct port_caps *caps = &cpd->caps; > + u32 hdm_cap; > + > + hdm_cap = readl(hdm_decoder + CXL_HDM_DECODER_CAP_OFFSET); > + > + caps->count = cxl_hdm_decoder_count(hdm_cap); > + caps->tc = FIELD_GET(CXL_HDM_DECODER_TARGET_COUNT_MASK, hdm_cap); > + caps->interleave11_8 = > + FIELD_GET(CXL_HDM_DECODER_INTERLEAVE_11_8, hdm_cap); > + caps->interleave14_12 = > + FIELD_GET(CXL_HDM_DECODER_INTERLEAVE_14_12, hdm_cap); > +} > + > +static int map_regs(struct cxl_port *port, void __iomem *crb, > + struct cxl_port_data *cpd) > +{ > + struct cxl_register_map map; > + struct cxl_component_reg_map *comp_map = &map.component_map; > + > + cxl_probe_component_regs(&port->dev, crb, comp_map); > + if (!comp_map->hdm_decoder.valid) { > + dev_err(&port->dev, "HDM decoder registers invalid\n"); > + return -ENXIO; > + } > + > + cpd->regs.hdm_decoder = crb + comp_map->hdm_decoder.offset; > + > + return 0; > +} > + > +static u64 get_decoder_size(void __iomem *hdm_decoder, int n) > +{ > + u32 ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(n)); > + > + if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED) > + return 0; > + > + return ioread64_hi_lo(hdm_decoder + > + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(n)); > +} > + > +static bool is_endpoint_port(struct cxl_port *port) > +{ > + /* Endpoints can't be ports... yet! 
*/ > + return false; > +} > + > +static void rescan_ports(struct work_struct *work) > +{ > + if (bus_rescan_devices(&cxl_bus_type)) > + pr_warn("Failed to rescan\n"); > +} > + > +/* Minor layering violation */ > +static int dvsec_range_used(struct cxl_port *port) > +{ > + struct cxl_endpoint_dvsec_info *info; > + int i, ret = 0; > + > + if (!is_endpoint_port(port)) > + return 0; > + > + info = port->data; > + for (i = 0; i < info->ranges; i++) > + if (info->range[i].size) > + ret |= 1 << i; > + > + return ret; > +} > + > +static int enumerate_hdm_decoders(struct cxl_port *port, > + struct cxl_port_data *portdata) > +{ > + void __iomem *hdm_decoder = portdata->regs.hdm_decoder; > + bool global_enable; > + u32 global_ctrl; > + int i = 0; > + > + global_ctrl = readl(hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET); > + global_enable = global_ctrl & CXL_HDM_DECODER_ENABLE; > + if (!global_enable) { > + i = dvsec_range_used(port); > + if (i) { > + dev_err(&port->dev, > + "Couldn't add port because device is using DVSEC range registers\n"); > + return -EBUSY; > + } > + } > + > + for (i = 0; i < portdata->caps.count; i++) { > + int rc, target_count = portdata->caps.tc; > + struct cxl_decoder *cxld; > + int *target_map = NULL; > + u64 size; > + > + if (is_endpoint_port(port)) > + target_count = 0; > + > + cxld = cxl_decoder_alloc(port, target_count); > + if (IS_ERR(cxld)) { > + dev_warn(&port->dev, > + "Failed to allocate the decoder\n"); > + return PTR_ERR(cxld); > + } > + > + cxld->target_type = CXL_DECODER_EXPANDER; > + cxld->interleave_ways = 1; > + cxld->interleave_granularity = 0; > + > + size = get_decoder_size(hdm_decoder, i); > + if (size != 0) { > +#define decoderN(reg, n) hdm_decoder + CXL_HDM_DECODER0_##reg(n) Perhaps this block in the if (size != 0) would be more readable if broken out to a utility function? As normal I'm not keen on the macro magic if we can avoid it. > + int temp[CXL_DECODER_MAX_INTERLEAVE]; > + u64 base; > + u32 ctrl; > + int j; > + union { > + u64 value; > + char target_id[8]; > + } target_list; I thought we tried to avoid this union usage in kernel because of the whole thing about c compilers being able to do what they like with it... > + > + target_map = temp; This is set to a variable that goes out of scope whilst target_map is still in use. > + ctrl = readl(decoderN(CTRL_OFFSET, i)); > + base = ioread64_hi_lo(decoderN(BASE_LOW_OFFSET, i)); > + > + cxld->decoder_range = (struct range){ > + .start = base, > + .end = base + size - 1 > + }; > + > + cxld->flags = CXL_DECODER_F_ENABLE; > + cxld->interleave_ways = to_interleave_ways(ctrl); > + cxld->interleave_granularity = > + to_interleave_granularity(ctrl); > + > + if (FIELD_GET(CXL_HDM_DECODER0_CTRL_TYPE, ctrl) == 0) > + cxld->target_type = CXL_DECODER_ACCELERATOR; > + > + target_list.value = ioread64_hi_lo(decoderN(TL_LOW, i)); > + for (j = 0; j < cxld->interleave_ways; j++) > + target_map[j] = target_list.target_id[j]; > +#undef decoderN > + } > + > + rc = cxl_decoder_add_locked(cxld, target_map); > + if (rc) > + put_device(&cxld->dev); > + else > + rc = cxl_decoder_autoremove(&port->dev, cxld); > + if (rc) > + dev_err(&port->dev, "Failed to add decoder\n"); If that fails on the autoremove registration (unlikely) this message will be rather confusing - as the add was fine... This nest of carrying on when we have an error is getting ever deeper... 
> + else > + dev_dbg(&cxld->dev, "Added to port %s\n", > + dev_name(&port->dev)); > + } > + > + /* > + * Turn on global enable now since DVSEC ranges aren't being used and > + * we'll eventually want the decoder enabled. > + */ > + if (!global_enable) { > + dev_dbg(&port->dev, "Enabling HDM decode\n"); > + writel(global_ctrl | CXL_HDM_DECODER_ENABLE, hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET); > + } > + > + return 0; > +} > +
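[Editorial note] Jonathan's two code-level findings above (target_map left pointing at a stack array that has gone out of scope, and the union used to type-pun the target list) can be addressed together. A minimal sketch, reusing only the macros and helpers already in the patch and assuming one 8-bit target id per byte of the target list register:

	for (i = 0; i < portdata->caps.count; i++) {
		/* scratch array now lives as long as target_map is used */
		int temp[CXL_DECODER_MAX_INTERLEAVE];
		int *target_map = NULL;
		...
		size = get_decoder_size(hdm_decoder, i);
		if (size != 0) {
			u64 target_list;
			...
			target_map = temp;
			target_list = ioread64_hi_lo(hdm_decoder +
						     CXL_HDM_DECODER0_TL_LOW(i));
			/* target j occupies bits [8 * j + 7 : 8 * j] */
			for (j = 0; j < cxld->interleave_ways; j++)
				target_map[j] = (target_list >> (j * 8)) & 0xff;
		}

		rc = cxl_decoder_add_locked(cxld, target_map);
		...
	}

This sidesteps the union-aliasing question entirely and keeps target_map valid at the point cxl_decoder_add_locked() consumes it.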
On 21-11-22 17:41:32, Jonathan Cameron wrote: > On Fri, 19 Nov 2021 16:02:47 -0800 > Ben Widawsky <ben.widawsky@intel.com> wrote: > > > The CXL port driver is responsible for managing the decoder resources > > contained within the port. It will also provide APIs that other drivers > > will consume for managing these resources. > > > > There are 4 types of ports in a system: > > 1. Platform port. This is a non-programmable entity. Such a port is > > named rootX. It is enumerated by cxl_acpi in an ACPI based system. > > 2. Hostbridge port. This ports register access is defined in a platform > > port's > > > specific way (CHBS for ACPI platforms). It has n downstream ports, > > each of which are known as CXL 2.0 root ports. Once the platform > > specific mechanism to get the offset to the registers is obtained it > > operates just like other CXL components. The enumeration of this > > component is started by cxl_acpi and completed by cxl_port. > > 3. Switch port. A switch port is similar to a hostbridge port except > > register access is defined in the CXL specification in a platform > > agnostic way. The downstream ports for a switch are simply known as > > downstream ports. The enumeration of these are entirely contained > > within cxl_port. > > 4. Endpoint port. Endpoint ports are similar to switch ports with the > > exception that they have no downstream ports, only the underlying > > media on the device. The enumeration of these are started by cxl_pci, > > and completed by cxl_port. > > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> > A few comments inline including what looks to me memory on the stack which has > gone out of scope when it's accessed. > > Jonathan > > > > > --- > > Changes since RFCv2: > > - Reword commit message tense (Dan) > > - Reword commit message > > - Drop SOFTDEP since it's not needed yet (Dan) > > - Add CONFIG_CXL_PORT (Dan) > > - s/CXL_DECODER_F_EN/CXL_DECODER_F_ENABLE (Dan) > > - rename cxl_hdm_decoder_ functions to "to_" (Dan) > > - remove useless inline (Dan) > > - Check endpoint decoder based on dport list instead of driver id (Dan) > > - Use range instead of resource per dependent patch change > > - Use clever union packing for target list (Dan) > > - Only check NULL from devm_cxl_iomap_block (Dan) > > - Properly parent the created cxl decoders > > - Move bus rescanning from cxl_acpi to here (Dan) > > - Remove references to "CFMWS" in core (Dan) > > - Use macro to help keep within 80 character lines > > --- > > .../driver-api/cxl/memory-devices.rst | 5 + > > drivers/cxl/Kconfig | 22 ++ > > drivers/cxl/Makefile | 2 + > > drivers/cxl/core/bus.c | 67 ++++ > > drivers/cxl/core/regs.c | 6 +- > > drivers/cxl/cxl.h | 34 +- > > drivers/cxl/port.c | 323 ++++++++++++++++++ > > 7 files changed, 450 insertions(+), 9 deletions(-) > > create mode 100644 drivers/cxl/port.c > > > > diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst > > index 3b8f41395f6b..fbf0393cdddc 100644 > > --- a/Documentation/driver-api/cxl/memory-devices.rst > > +++ b/Documentation/driver-api/cxl/memory-devices.rst > > @@ -28,6 +28,11 @@ CXL Memory Device > > .. kernel-doc:: drivers/cxl/pci.c > > :internal: > > > > +CXL Port > > +-------- > > +.. kernel-doc:: drivers/cxl/port.c > > + :doc: cxl port > > + > > CXL Core > > -------- > > .. 
kernel-doc:: drivers/cxl/cxl.h > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig > > index ef05e96f8f97..3aeb33bba5a3 100644 > > --- a/drivers/cxl/Kconfig > > +++ b/drivers/cxl/Kconfig > > @@ -77,4 +77,26 @@ config CXL_PMEM > > provisioning the persistent memory capacity of CXL memory expanders. > > > > If unsure say 'm'. > > + > > +config CXL_MEM > > + tristate "CXL.mem: Memory Devices" > > + select CXL_PORT > > + depends on CXL_PCI > > + default CXL_BUS > > + help > > + The CXL.mem protocol allows a device to act as a provider of "System > > + RAM" and/or "Persistent Memory" that is fully coherent as if the > > + memory was attached to the typical CPU memory controller. This is > > + known as HDM "Host-managed Device Memory". > > + > > + Say 'y/m' to enable a driver that will attach to CXL.mem devices for > > + memory expansion and control of HDM. See Chapter 9.13 in the CXL 2.0 > > + specification for a detailed description of HDM. > > + > > + If unsure say 'm'. > > + > > + > > +config CXL_PORT > > + tristate > > + > > endif > > diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile > > index cf07ae6cea17..56fcac2323cb 100644 > > --- a/drivers/cxl/Makefile > > +++ b/drivers/cxl/Makefile > > @@ -3,7 +3,9 @@ obj-$(CONFIG_CXL_BUS) += core/ > > obj-$(CONFIG_CXL_PCI) += cxl_pci.o > > obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o > > obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o > > +obj-$(CONFIG_CXL_PORT) += cxl_port.o > > > > cxl_pci-y := pci.o > > cxl_acpi-y := acpi.o > > cxl_pmem-y := pmem.o > > +cxl_port-y := port.o > > diff --git a/drivers/cxl/core/bus.c b/drivers/cxl/core/bus.c > > index 9e0d7d5d9298..46a06cfe79bd 100644 > > --- a/drivers/cxl/core/bus.c > > +++ b/drivers/cxl/core/bus.c > > @@ -31,6 +31,8 @@ static DECLARE_RWSEM(root_port_sem); > > > > static struct device *cxl_topology_host; > > > > +static bool is_cxl_decoder(struct device *dev); > > + > > int cxl_register_topology_host(struct device *host) > > { > > down_write(&topology_host_sem); > > @@ -75,6 +77,45 @@ static void put_cxl_topology_host(struct device *dev) > > up_read(&topology_host_sem); > > } > > > > +static int decoder_match(struct device *dev, void *data) > > +{ > > + struct resource *theirs = (struct resource *)data; > > + struct cxl_decoder *cxld; > > + > > + if (!is_cxl_decoder(dev)) > > + return 0; > > + > > + cxld = to_cxl_decoder(dev); > > + if (theirs->start <= cxld->decoder_range.start && > > + theirs->end >= cxld->decoder_range.end) > > + return 1; > > + > > + return 0; > > +} > > + > > +static struct cxl_decoder *cxl_find_root_decoder(resource_size_t base, > > + resource_size_t size) > > +{ > > + struct cxl_decoder *cxld = NULL; > > + struct device *cxldd; > > + struct device *host; > > + struct resource res = (struct resource){ > > + .start = base, > > + .end = base + size - 1, > > + }; > > + > > + host = get_cxl_topology_host(); > > + if (!host) > > + return NULL; > > + > > + cxldd = device_find_child(host, &res, decoder_match); > > + if (cxldd) > > + cxld = to_cxl_decoder(cxldd); > > + > > + put_cxl_topology_host(host); > > + return cxld; > > +} > > + > > static ssize_t devtype_show(struct device *dev, struct device_attribute *attr, > > char *buf) > > { > > @@ -280,6 +321,11 @@ bool is_root_decoder(struct device *dev) > > } > > EXPORT_SYMBOL_NS_GPL(is_root_decoder, CXL); > > > > +static bool is_cxl_decoder(struct device *dev) > > +{ > > + return dev->type->release == cxl_decoder_release; > > +} > > + > > struct cxl_decoder *to_cxl_decoder(struct device *dev) > > { > > if (dev_WARN_ONCE(dev, 
dev->type->release != cxl_decoder_release, > > @@ -327,6 +373,7 @@ struct cxl_port *to_cxl_port(struct device *dev) > > return NULL; > > return container_of(dev, struct cxl_port, dev); > > } > > +EXPORT_SYMBOL_NS_GPL(to_cxl_port, CXL); > > > > struct cxl_dport *cxl_get_root_dport(struct device *dev) > > { > > @@ -735,6 +782,24 @@ EXPORT_SYMBOL_NS_GPL(cxl_decoder_add, CXL); > > > > static void cxld_unregister(void *dev) > > { > > + struct cxl_decoder *plat_decoder, *cxld = to_cxl_decoder(dev); > > + resource_size_t base, size; > > + > > + if (is_root_decoder(dev)) { > > + device_unregister(dev); > > + return; > > + } > > + > > + base = cxld->decoder_range.start; > > + size = range_len(&cxld->decoder_range); > > + > > + if (size) { > > + plat_decoder = cxl_find_root_decoder(base, size); > > + if (plat_decoder) > > + __release_region(&plat_decoder->platform_res, base, > > + size); > > + } > > + > > device_unregister(dev); > > } > > > > @@ -789,6 +854,8 @@ static int cxl_device_id(struct device *dev) > > return CXL_DEVICE_NVDIMM_BRIDGE; > > if (dev->type == &cxl_nvdimm_type) > > return CXL_DEVICE_NVDIMM; > > + if (dev->type == &cxl_port_type) > > + return CXL_DEVICE_PORT; > > return 0; > > } > > > > diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c > > index 41a0245867ea..f191b0c995a7 100644 > > --- a/drivers/cxl/core/regs.c > > +++ b/drivers/cxl/core/regs.c > > @@ -159,9 +159,8 @@ void cxl_probe_device_regs(struct device *dev, void __iomem *base, > > } > > EXPORT_SYMBOL_NS_GPL(cxl_probe_device_regs, CXL); > > > > -static void __iomem *devm_cxl_iomap_block(struct device *dev, > > - resource_size_t addr, > > - resource_size_t length) > > +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > > + resource_size_t length) > > { > > void __iomem *ret_val; > > struct resource *res; > > @@ -180,6 +179,7 @@ static void __iomem *devm_cxl_iomap_block(struct device *dev, > > > > return ret_val; > > } > > +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); > > > > int cxl_map_component_regs(struct pci_dev *pdev, > > struct cxl_component_regs *regs, > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > > index 3962a5e6a950..24fa16157d5e 100644 > > --- a/drivers/cxl/cxl.h > > +++ b/drivers/cxl/cxl.h > > @@ -17,6 +17,9 @@ > > * (port-driver, region-driver, nvdimm object-drivers... etc). 
> > */ > > > > +/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */ > > +#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K > > + > > /* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/ > > #define CXL_CM_OFFSET 0x1000 > > #define CXL_CM_CAP_HDR_OFFSET 0x0 > > @@ -36,11 +39,22 @@ > > #define CXL_HDM_DECODER_CAP_OFFSET 0x0 > > #define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0) > > #define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4) > > -#define CXL_HDM_DECODER0_BASE_LOW_OFFSET 0x10 > > -#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET 0x14 > > -#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET 0x18 > > -#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET 0x1c > > -#define CXL_HDM_DECODER0_CTRL_OFFSET 0x20 > > +#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8) > > +#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9) > > +#define CXL_HDM_DECODER_CTRL_OFFSET 0x4 > > +#define CXL_HDM_DECODER_ENABLE BIT(1) > > +#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10) > > +#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14) > > +#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18) > > +#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c) > > +#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20) > > +#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0) > > +#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4) > > +#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9) > > +#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10) > > +#define CXL_HDM_DECODER0_CTRL_TYPE BIT(12) > > +#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24) > > +#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28) > > > > static inline int cxl_hdm_decoder_count(u32 cap_hdr) > > { > > @@ -148,6 +162,8 @@ int cxl_map_device_regs(struct pci_dev *pdev, > > enum cxl_regloc_type; > > int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, > > struct cxl_register_map *map); > > +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > > + resource_size_t length); > > > > #define CXL_RESOURCE_NONE ((resource_size_t) -1) > > #define CXL_TARGET_STRLEN 20 > > @@ -165,7 +181,8 @@ void cxl_unregister_topology_host(struct device *host); > > #define CXL_DECODER_F_TYPE2 BIT(2) > > #define CXL_DECODER_F_TYPE3 BIT(3) > > #define CXL_DECODER_F_LOCK BIT(4) > > -#define CXL_DECODER_F_MASK GENMASK(4, 0) > > +#define CXL_DECODER_F_ENABLE BIT(5) > > +#define CXL_DECODER_F_MASK GENMASK(5, 0) > > > > enum cxl_decoder_type { > > CXL_DECODER_ACCELERATOR = 2, > > @@ -255,6 +272,8 @@ struct cxl_walk_context { > > * @dports: cxl_dport instances referenced by decoders > > * @decoder_ida: allocator for decoder ids > > * @component_reg_phys: component register capability base address (optional) > > + * @rescan_work: worker object for bus rescans after port additions > > + * @data: opaque data with driver specific usage > > */ > > struct cxl_port { > > struct device dev; > > @@ -263,6 +282,8 @@ struct cxl_port { > > struct list_head dports; > > struct ida decoder_ida; > > resource_size_t component_reg_phys; > > + struct work_struct rescan_work; > > + void *data; > > }; > > > > /** > > @@ -325,6 +346,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv); > > > > #define CXL_DEVICE_NVDIMM_BRIDGE 1 > > #define CXL_DEVICE_NVDIMM 2 > > +#define CXL_DEVICE_PORT 3 > > > > #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*") > > #define CXL_MODALIAS_FMT "cxl:t%d" > > diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c > > new file mode 100644 > > index 000000000000..3c03131517af > > --- 
/dev/null > > +++ b/drivers/cxl/port.c > > @@ -0,0 +1,323 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > +/* Copyright(c) 2021 Intel Corporation. All rights reserved. */ > > +#include <linux/device.h> > > +#include <linux/module.h> > > +#include <linux/slab.h> > > + > > +#include "cxlmem.h" > > + > > +/** > > + * DOC: cxl port > > + * > > + * The port driver implements the set of functionality needed to allow full > > + * decoder enumeration and routing. A CXL port is an abstraction of a CXL > > + * component that implements some amount of CXL decoding of CXL.mem traffic. > > + * As of the CXL 2.0 spec, this includes: > > + * > > + * .. list-table:: CXL Components w/ Ports > > + * :widths: 25 25 50 > > + * :header-rows: 1 > > + * > > + * * - component > > + * - upstream > > + * - downstream > > + * * - Hostbridge > > + * - ACPI0016 > > + * - root port > > + * * - Switch > > + * - Switch Upstream Port > > + * - Switch Downstream Port > > + * * - Endpoint > > + * - Endpoint Port > > + * - N/A > > + * > > + * The primary service this driver provides is enumerating HDM decoders and > > + * presenting APIs to other drivers to utilize the decoders. > > + */ > > + > > +static struct workqueue_struct *cxl_port_wq; > > + > > +struct cxl_port_data { > > + struct cxl_component_regs regs; > > + > > + struct port_caps { > > + unsigned int count; > > + unsigned int tc; > > + unsigned int interleave11_8; > > + unsigned int interleave14_12; > > + } caps; > > +}; > > + > > +static inline int to_interleave_granularity(u32 ctrl) > > +{ > > + int val = FIELD_GET(CXL_HDM_DECODER0_CTRL_IG_MASK, ctrl); > > + > > + return 256 << val; > > +} > > + > > +static inline int to_interleave_ways(u32 ctrl) > > +{ > > + int val = FIELD_GET(CXL_HDM_DECODER0_CTRL_IW_MASK, ctrl); > > + > > + return 1 << val; > > +} > > + > > +static void get_caps(struct cxl_port *port, struct cxl_port_data *cpd) > > +{ > > + void __iomem *hdm_decoder = cpd->regs.hdm_decoder; > > + struct port_caps *caps = &cpd->caps; > > + u32 hdm_cap; > > + > > + hdm_cap = readl(hdm_decoder + CXL_HDM_DECODER_CAP_OFFSET); > > + > > + caps->count = cxl_hdm_decoder_count(hdm_cap); > > + caps->tc = FIELD_GET(CXL_HDM_DECODER_TARGET_COUNT_MASK, hdm_cap); > > + caps->interleave11_8 = > > + FIELD_GET(CXL_HDM_DECODER_INTERLEAVE_11_8, hdm_cap); > > + caps->interleave14_12 = > > + FIELD_GET(CXL_HDM_DECODER_INTERLEAVE_14_12, hdm_cap); > > +} > > + > > +static int map_regs(struct cxl_port *port, void __iomem *crb, > > + struct cxl_port_data *cpd) > > +{ > > + struct cxl_register_map map; > > + struct cxl_component_reg_map *comp_map = &map.component_map; > > + > > + cxl_probe_component_regs(&port->dev, crb, comp_map); > > + if (!comp_map->hdm_decoder.valid) { > > + dev_err(&port->dev, "HDM decoder registers invalid\n"); > > + return -ENXIO; > > + } > > + > > + cpd->regs.hdm_decoder = crb + comp_map->hdm_decoder.offset; > > + > > + return 0; > > +} > > + > > +static u64 get_decoder_size(void __iomem *hdm_decoder, int n) > > +{ > > + u32 ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(n)); > > + > > + if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED) > > + return 0; > > + > > + return ioread64_hi_lo(hdm_decoder + > > + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(n)); > > +} > > + > > +static bool is_endpoint_port(struct cxl_port *port) > > +{ > > + /* Endpoints can't be ports... yet! 
*/ > > + return false; > > +} > > + > > +static void rescan_ports(struct work_struct *work) > > +{ > > + if (bus_rescan_devices(&cxl_bus_type)) > > + pr_warn("Failed to rescan\n"); > > +} > > + > > +/* Minor layering violation */ > > +static int dvsec_range_used(struct cxl_port *port) > > +{ > > + struct cxl_endpoint_dvsec_info *info; > > + int i, ret = 0; > > + > > + if (!is_endpoint_port(port)) > > + return 0; > > + > > + info = port->data; > > + for (i = 0; i < info->ranges; i++) > > + if (info->range[i].size) > > + ret |= 1 << i; > > + > > + return ret; > > +} > > + > > +static int enumerate_hdm_decoders(struct cxl_port *port, > > + struct cxl_port_data *portdata) > > +{ > > + void __iomem *hdm_decoder = portdata->regs.hdm_decoder; > > + bool global_enable; > > + u32 global_ctrl; > > + int i = 0; > > + > > + global_ctrl = readl(hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET); > > + global_enable = global_ctrl & CXL_HDM_DECODER_ENABLE; > > + if (!global_enable) { > > + i = dvsec_range_used(port); > > + if (i) { > > + dev_err(&port->dev, > > + "Couldn't add port because device is using DVSEC range registers\n"); > > + return -EBUSY; > > + } > > + } > > + > > + for (i = 0; i < portdata->caps.count; i++) { > > + int rc, target_count = portdata->caps.tc; > > + struct cxl_decoder *cxld; > > + int *target_map = NULL; > > + u64 size; > > + > > + if (is_endpoint_port(port)) > > + target_count = 0; > > + > > + cxld = cxl_decoder_alloc(port, target_count); > > + if (IS_ERR(cxld)) { > > + dev_warn(&port->dev, > > + "Failed to allocate the decoder\n"); > > + return PTR_ERR(cxld); > > + } > > + > > + cxld->target_type = CXL_DECODER_EXPANDER; > > + cxld->interleave_ways = 1; > > + cxld->interleave_granularity = 0; > > + > > + size = get_decoder_size(hdm_decoder, i); > > + if (size != 0) { > > +#define decoderN(reg, n) hdm_decoder + CXL_HDM_DECODER0_##reg(n) > > Perhaps this block in the if (size != 0) would be more readable if broken out > to a utility function? I don't get this comment, there is already get_decoder_size(). Can you please elaborate? > As normal I'm not keen on the macro magic if we can avoid it. > Yeah - just trying to not have to deal with wrapping long lines. > > > + int temp[CXL_DECODER_MAX_INTERLEAVE]; > > + u64 base; > > + u32 ctrl; > > + int j; > > + union { > > + u64 value; > > + char target_id[8]; > > + } target_list; > > I thought we tried to avoid this union usage in kernel because of the whole > thing about c compilers being able to do what they like with it... > I wasn't aware of the restriction. I can change it back if it's required. It does look a lot nicer this way. Is there a reference to this issue somewhere? > > + > > + target_map = temp; > > This is set to a variable that goes out of scope whilst target_map is still in > use. > Yikes. I'm pretty surprised the compiler didn't catch this. 
> > + ctrl = readl(decoderN(CTRL_OFFSET, i)); > > + base = ioread64_hi_lo(decoderN(BASE_LOW_OFFSET, i)); > > + > > + cxld->decoder_range = (struct range){ > > + .start = base, > > + .end = base + size - 1 > > + }; > > + > > + cxld->flags = CXL_DECODER_F_ENABLE; > > + cxld->interleave_ways = to_interleave_ways(ctrl); > > + cxld->interleave_granularity = > > + to_interleave_granularity(ctrl); > > + > > + if (FIELD_GET(CXL_HDM_DECODER0_CTRL_TYPE, ctrl) == 0) > > + cxld->target_type = CXL_DECODER_ACCELERATOR; > > + > > + target_list.value = ioread64_hi_lo(decoderN(TL_LOW, i)); > > + for (j = 0; j < cxld->interleave_ways; j++) > > + target_map[j] = target_list.target_id[j]; > > +#undef decoderN > > + } > > + > > + rc = cxl_decoder_add_locked(cxld, target_map); > > + if (rc) > > + put_device(&cxld->dev); > > + else > > + rc = cxl_decoder_autoremove(&port->dev, cxld); > > + if (rc) > > + dev_err(&port->dev, "Failed to add decoder\n"); > > If that fails on the autoremove registration (unlikely) this message > will be rather confusing - as the add was fine... > > This nest of carrying on when we have an error is getting ever deeper... > Yeah, this is not great. I will clean it up. Thanks. > > + else > > + dev_dbg(&cxld->dev, "Added to port %s\n", > > + dev_name(&port->dev)); > > + } > > + > > + /* > > + * Turn on global enable now since DVSEC ranges aren't being used and > > + * we'll eventually want the decoder enabled. > > + */ > > + if (!global_enable) { > > + dev_dbg(&port->dev, "Enabling HDM decode\n"); > > + writel(global_ctrl | CXL_HDM_DECODER_ENABLE, hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET); > > + } > > + > > + return 0; > > +} > > +
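[Editorial note] For readers following the interleave helpers quoted above: both fields are power-of-two encodings. A small worked example with a hypothetical ctrl value (not taken from real hardware) of what to_interleave_ways() and to_interleave_granularity() return:

	#include <linux/bitfield.h>

	/* IW field = 3 and IG field = 2, i.e. an 8-way, 1024-byte decode */
	u32 ctrl = FIELD_PREP(CXL_HDM_DECODER0_CTRL_IW_MASK, 3) |
		   FIELD_PREP(CXL_HDM_DECODER0_CTRL_IG_MASK, 2);

	to_interleave_ways(ctrl);        /* 1 << 3   == 8 ways      */
	to_interleave_granularity(ctrl); /* 256 << 2 == 1024 bytes  */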
On Mon, 22 Nov 2021 15:38:20 -0800 Ben Widawsky <ben.widawsky@intel.com> wrote: ... > > > +static int enumerate_hdm_decoders(struct cxl_port *port, > > > + struct cxl_port_data *portdata) > > > +{ > > > + void __iomem *hdm_decoder = portdata->regs.hdm_decoder; > > > + bool global_enable; > > > + u32 global_ctrl; > > > + int i = 0; > > > + > > > + global_ctrl = readl(hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET); > > > + global_enable = global_ctrl & CXL_HDM_DECODER_ENABLE; > > > + if (!global_enable) { > > > + i = dvsec_range_used(port); > > > + if (i) { > > > + dev_err(&port->dev, > > > + "Couldn't add port because device is using DVSEC range registers\n"); > > > + return -EBUSY; > > > + } > > > + } > > > + > > > + for (i = 0; i < portdata->caps.count; i++) { > > > + int rc, target_count = portdata->caps.tc; > > > + struct cxl_decoder *cxld; > > > + int *target_map = NULL; > > > + u64 size; > > > + > > > + if (is_endpoint_port(port)) > > > + target_count = 0; > > > + > > > + cxld = cxl_decoder_alloc(port, target_count); > > > + if (IS_ERR(cxld)) { > > > + dev_warn(&port->dev, > > > + "Failed to allocate the decoder\n"); > > > + return PTR_ERR(cxld); > > > + } > > > + > > > + cxld->target_type = CXL_DECODER_EXPANDER; > > > + cxld->interleave_ways = 1; > > > + cxld->interleave_granularity = 0; > > > + > > > + size = get_decoder_size(hdm_decoder, i); > > > + if (size != 0) { > > > +#define decoderN(reg, n) hdm_decoder + CXL_HDM_DECODER0_##reg(n) > > > > Perhaps this block in the if (size != 0) would be more readable if broken out > > to a utility function? > > I don't get this comment, there is already get_decoder_size(). Can you please > elaborate? Sure. Just talking about having something like if (size != 0) init_decoder() // or something better named as an alternative to this deep nesting. > > > As normal I'm not keen on the macro magic if we can avoid it. > > > > Yeah - just trying to not have to deal with wrapping long lines. > > > > > > + int temp[CXL_DECODER_MAX_INTERLEAVE]; > > > + u64 base; > > > + u32 ctrl; > > > + int j; > > > + union { > > > + u64 value; > > > + char target_id[8]; > > > + } target_list; > > > > I thought we tried to avoid this union usage in kernel because of the whole > > thing about c compilers being able to do what they like with it... > > > > I wasn't aware of the restriction. I can change it back if it's required. It > does look a lot nicer this way. Is there a reference to this issue somewhere? Hmm. Seems I was out of date on this. There is a mess in the c99 standard that contradicts itself on whether you can do this or not. https://davmac.wordpress.com/2010/02/26/c99-revisited/ The pull request for a patch form Andy got a Linus response... https://lore.kernel.org/all/CAJZ5v0jq45atkapwSjJ4DkHhB1bfOA-Sh1TiA3dPXwKyFTBheA@mail.gmail.com/ > > > > + > > > + target_map = temp; > > > > This is set to a variable that goes out of scope whilst target_map is still in > > use. > > > > Yikes. I'm pretty surprised the compiler didn't catch this. 
> > > > + ctrl = readl(decoderN(CTRL_OFFSET, i)); > > > + base = ioread64_hi_lo(decoderN(BASE_LOW_OFFSET, i)); > > > + > > > + cxld->decoder_range = (struct range){ > > > + .start = base, > > > + .end = base + size - 1 > > > + }; > > > + > > > + cxld->flags = CXL_DECODER_F_ENABLE; > > > + cxld->interleave_ways = to_interleave_ways(ctrl); > > > + cxld->interleave_granularity = > > > + to_interleave_granularity(ctrl); > > > + > > > + if (FIELD_GET(CXL_HDM_DECODER0_CTRL_TYPE, ctrl) == 0) > > > + cxld->target_type = CXL_DECODER_ACCELERATOR; > > > + > > > + target_list.value = ioread64_hi_lo(decoderN(TL_LOW, i)); > > > + for (j = 0; j < cxld->interleave_ways; j++) > > > + target_map[j] = target_list.target_id[j]; > > > +#undef decoderN > > > + } > > > + > > > + rc = cxl_decoder_add_locked(cxld, target_map); > > > + if (rc) > > > + put_device(&cxld->dev); > > > + else > > > + rc = cxl_decoder_autoremove(&port->dev, cxld); > > > + if (rc) > > > + dev_err(&port->dev, "Failed to add decoder\n"); > > > > If that fails on the autoremove registration (unlikely) this message > > will be rather confusing - as the add was fine... > > > > This nest of carrying on when we have an error is getting ever deeper... > > > > Yeah, this is not great. I will clean it up. > > Thanks. > > > > + else > > > + dev_dbg(&cxld->dev, "Added to port %s\n", > > > + dev_name(&port->dev)); > > > + } > > > + > > > + /* > > > + * Turn on global enable now since DVSEC ranges aren't being used and > > > + * we'll eventually want the decoder enabled. > > > + */ > > > + if (!global_enable) { > > > + dev_dbg(&port->dev, "Enabling HDM decode\n"); > > > + writel(global_ctrl | CXL_HDM_DECODER_ENABLE, hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET); > > > + } > > > + > > > + return 0; > > > +} > > > +
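[Editorial note] Jonathan's "init_decoder()" suggestion amounts to hoisting the body of the size != 0 branch into a helper so the loop in enumerate_hdm_decoders() stays shallow and the decoderN() macro goes away. A sketch of that shape, with a hypothetical helper name and the shift-based target list read from the earlier note:

	static void init_hdm_decoder(struct cxl_decoder *cxld,
				     void __iomem *hdm_decoder, int which,
				     u64 size, int *target_map)
	{
		u32 ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(which));
		u64 base = ioread64_hi_lo(hdm_decoder +
					  CXL_HDM_DECODER0_BASE_LOW_OFFSET(which));
		u64 target_list;
		int j;

		cxld->decoder_range = (struct range) {
			.start = base,
			.end = base + size - 1,
		};
		cxld->flags = CXL_DECODER_F_ENABLE;
		cxld->interleave_ways = to_interleave_ways(ctrl);
		cxld->interleave_granularity = to_interleave_granularity(ctrl);
		if (FIELD_GET(CXL_HDM_DECODER0_CTRL_TYPE, ctrl) == 0)
			cxld->target_type = CXL_DECODER_ACCELERATOR;

		target_list = ioread64_hi_lo(hdm_decoder +
					     CXL_HDM_DECODER0_TL_LOW(which));
		for (j = 0; j < cxld->interleave_ways; j++)
			target_map[j] = (target_list >> (j * 8)) & 0xff;
	}

	/* call site in enumerate_hdm_decoders() */
		size = get_decoder_size(hdm_decoder, i);
		if (size != 0) {
			target_map = temp;
			init_hdm_decoder(cxld, hdm_decoder, i, size, target_map);
		}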
On 21-11-23 11:38:23, Jonathan Cameron wrote: > On Mon, 22 Nov 2021 15:38:20 -0800 > Ben Widawsky <ben.widawsky@intel.com> wrote: > > ... > > > > > +static int enumerate_hdm_decoders(struct cxl_port *port, > > > > + struct cxl_port_data *portdata) > > > > +{ > > > > + void __iomem *hdm_decoder = portdata->regs.hdm_decoder; > > > > + bool global_enable; > > > > + u32 global_ctrl; > > > > + int i = 0; > > > > + > > > > + global_ctrl = readl(hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET); > > > > + global_enable = global_ctrl & CXL_HDM_DECODER_ENABLE; > > > > + if (!global_enable) { > > > > + i = dvsec_range_used(port); > > > > + if (i) { > > > > + dev_err(&port->dev, > > > > + "Couldn't add port because device is using DVSEC range registers\n"); > > > > + return -EBUSY; > > > > + } > > > > + } > > > > + > > > > + for (i = 0; i < portdata->caps.count; i++) { > > > > + int rc, target_count = portdata->caps.tc; > > > > + struct cxl_decoder *cxld; > > > > + int *target_map = NULL; > > > > + u64 size; > > > > + > > > > + if (is_endpoint_port(port)) > > > > + target_count = 0; > > > > + > > > > + cxld = cxl_decoder_alloc(port, target_count); > > > > + if (IS_ERR(cxld)) { > > > > + dev_warn(&port->dev, > > > > + "Failed to allocate the decoder\n"); > > > > + return PTR_ERR(cxld); > > > > + } > > > > + > > > > + cxld->target_type = CXL_DECODER_EXPANDER; > > > > + cxld->interleave_ways = 1; > > > > + cxld->interleave_granularity = 0; > > > > + > > > > + size = get_decoder_size(hdm_decoder, i); > > > > + if (size != 0) { > > > > +#define decoderN(reg, n) hdm_decoder + CXL_HDM_DECODER0_##reg(n) > > > > > > Perhaps this block in the if (size != 0) would be more readable if broken out > > > to a utility function? > > > > I don't get this comment, there is already get_decoder_size(). Can you please > > elaborate? > > Sure. Just talking about having something like > > if (size != 0) > init_decoder() // or something better named > > as an alternative to this deep nesting. > Sounds good. I can combine it with the similar initialization done in cxl_acpi. > > > > > As normal I'm not keen on the macro magic if we can avoid it. > > > > > > > Yeah - just trying to not have to deal with wrapping long lines. > > > > > > > > > + int temp[CXL_DECODER_MAX_INTERLEAVE]; > > > > + u64 base; > > > > + u32 ctrl; > > > > + int j; > > > > + union { > > > > + u64 value; > > > > + char target_id[8]; > > > > + } target_list; > > > > > > I thought we tried to avoid this union usage in kernel because of the whole > > > thing about c compilers being able to do what they like with it... > > > > > > > I wasn't aware of the restriction. I can change it back if it's required. It > > does look a lot nicer this way. Is there a reference to this issue somewhere? > > Hmm. Seems I was out of date on this. There is a mess in the c99 standard that > contradicts itself on whether you can do this or not. > > https://davmac.wordpress.com/2010/02/26/c99-revisited/ Thanks for the link. > > The pull request for a patch form Andy got a Linus response... > > https://lore.kernel.org/all/CAJZ5v0jq45atkapwSjJ4DkHhB1bfOA-Sh1TiA3dPXwKyFTBheA@mail.gmail.com/ > That was a fun read :-) I'll defer to Dan on this. This was actually his code that he suggested in review of the RFC. > > > > > > > + > > > > + target_map = temp; > > > > > > This is set to a variable that goes out of scope whilst target_map is still in > > > use. > > > > > > > Yikes. I'm pretty surprised the compiler didn't catch this. 
> > > > > > + ctrl = readl(decoderN(CTRL_OFFSET, i)); > > > > + base = ioread64_hi_lo(decoderN(BASE_LOW_OFFSET, i)); > > > > + > > > > + cxld->decoder_range = (struct range){ > > > > + .start = base, > > > > + .end = base + size - 1 > > > > + }; > > > > + > > > > + cxld->flags = CXL_DECODER_F_ENABLE; > > > > + cxld->interleave_ways = to_interleave_ways(ctrl); > > > > + cxld->interleave_granularity = > > > > + to_interleave_granularity(ctrl); > > > > + > > > > + if (FIELD_GET(CXL_HDM_DECODER0_CTRL_TYPE, ctrl) == 0) > > > > + cxld->target_type = CXL_DECODER_ACCELERATOR; > > > > + > > > > + target_list.value = ioread64_hi_lo(decoderN(TL_LOW, i)); > > > > + for (j = 0; j < cxld->interleave_ways; j++) > > > > + target_map[j] = target_list.target_id[j]; > > > > +#undef decoderN > > > > + } > > > > + > > > > + rc = cxl_decoder_add_locked(cxld, target_map); > > > > + if (rc) > > > > + put_device(&cxld->dev); > > > > + else > > > > + rc = cxl_decoder_autoremove(&port->dev, cxld); > > > > + if (rc) > > > > + dev_err(&port->dev, "Failed to add decoder\n"); > > > > > > If that fails on the autoremove registration (unlikely) this message > > > will be rather confusing - as the add was fine... > > > > > > This nest of carrying on when we have an error is getting ever deeper... > > > > > > > Yeah, this is not great. I will clean it up. > > > > Thanks. > > > > > > + else > > > > + dev_dbg(&cxld->dev, "Added to port %s\n", > > > > + dev_name(&port->dev)); > > > > + } > > > > + > > > > + /* > > > > + * Turn on global enable now since DVSEC ranges aren't being used and > > > > + * we'll eventually want the decoder enabled. > > > > + */ > > > > + if (!global_enable) { > > > > + dev_dbg(&port->dev, "Enabling HDM decode\n"); > > > > + writel(global_ctrl | CXL_HDM_DECODER_ENABLE, hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET); > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + >
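[Editorial note] One way to untangle the add/autoremove error handling that Ben agrees above to clean up, as a sketch only (whether to bail out of the loop or keep enumerating after a failed decoder is a policy choice the thread does not settle):

		rc = cxl_decoder_add_locked(cxld, target_map);
		if (rc) {
			dev_err(&port->dev, "Failed to add decoder\n");
			put_device(&cxld->dev);
			return rc;
		}

		rc = cxl_decoder_autoremove(&port->dev, cxld);
		if (rc) {
			dev_err(&port->dev,
				"Failed to register decoder auto-removal\n");
			return rc;
		}

		dev_dbg(&cxld->dev, "Added to port %s\n", dev_name(&port->dev));

Each failure now gets its own message, so a failed devm registration is no longer reported as a failed add.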
On Fri, Nov 19, 2021 at 04:02:47PM -0800, Ben Widawsky wrote: > The CXL port driver is responsible for managing the decoder resources > contained within the port. It will also provide APIs that other drivers > will consume for managing these resources. > > There are 4 types of ports in a system: > 1. Platform port. This is a non-programmable entity. Such a port is > named rootX. It is enumerated by cxl_acpi in an ACPI based system. Can you mention the ACPI source (static table, namespace PNP ID) here? > 2. Hostbridge port. Is "hostbridge" styled as a single word in the spec? I've only seen "host bridge" elsewhere. > This ports register access is defined in a platform > specific way (CHBS for ACPI platforms). s/This ports/This port's/ This doesn't really make sense, though. Are you saying the register access *mechanism* is platform specific? Or merely that the enumeration method (ACPI table, ACPI namespace, DT, etc) is platform-specific? I assume "CHBS" is an ACPI static table? > It has n downstream ports, > each of which are known as CXL 2.0 root ports. This sounds like a "host bridge port *contains* these root ports." That doesn't sound right. > Once the platform > specific mechanism to get the offset to the registers is obtained it > operates just like other CXL components. The enumeration of this > component is started by cxl_acpi and completed by cxl_port. > 3. Switch port. A switch port is similar to a hostbridge port except > register access is defined in the CXL specification in a platform > agnostic way. The downstream ports for a switch are simply known as > downstream ports. The enumeration of these are entirely contained > within cxl_port. In PCIe, "Downstream Port" includes both Root Ports and Switch Downstream Ports. It sounds like that's not the case for CXL? Well, except above you say that a Host Bridge Port has N Downstream Ports, so I guess "Downstream Port" probably includes both Host Bridge Ports and Switch Downstream Ports. Maybe you should just omit the "The downstream ports for a switch are simply known as downstream ports" sentence. > 4. Endpoint port. Endpoint ports are similar to switch ports with the > exception that they have no downstream ports, only the underlying > media on the device. The enumeration of these are started by cxl_pci, > and completed by cxl_port. Does CXL use an "Upstream Port" concept similar to PCIe? In PCIe, "Upstream Port" includes both Switch Upstream Ports and the Upstream Port on an Endpoint. I hope this driver is not modeled on the PCI portdrv. IMO, that was a design error, and the "port service drivers" (PME, hotplug, AER, etc) should be directly integrated into the PCI core instead of pretending to be independent drivers. > --- a/Documentation/driver-api/cxl/memory-devices.rst > +++ b/Documentation/driver-api/cxl/memory-devices.rst > @@ -28,6 +28,11 @@ CXL Memory Device > .. kernel-doc:: drivers/cxl/pci.c > :internal: > > +CXL Port > +-------- > +.. kernel-doc:: drivers/cxl/port.c > + :doc: cxl port s/cxl port/CXL Port/ ? I don't know exactly how this gets rendered by ReST. > CXL Core > -------- > .. kernel-doc:: drivers/cxl/cxl.h > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig > index ef05e96f8f97..3aeb33bba5a3 100644 > --- a/drivers/cxl/Kconfig > +++ b/drivers/cxl/Kconfig > @@ -77,4 +77,26 @@ config CXL_PMEM > provisioning the persistent memory capacity of CXL memory expanders. > > If unsure say 'm'. 
> + > +config CXL_MEM > + tristate "CXL.mem: Memory Devices" > + select CXL_PORT > + depends on CXL_PCI > + default CXL_BUS > + help > + The CXL.mem protocol allows a device to act as a provider of "System > + RAM" and/or "Persistent Memory" that is fully coherent as if the > + memory was attached to the typical CPU memory controller. This is > + known as HDM "Host-managed Device Memory". s/was attached/were attached/ > + Say 'y/m' to enable a driver that will attach to CXL.mem devices for > + memory expansion and control of HDM. See Chapter 9.13 in the CXL 2.0 > + specification for a detailed description of HDM. > + > + If unsure say 'm'. > + > + Spurious blank line. > +config CXL_PORT > + tristate > + > endif > --- /dev/null > +++ b/drivers/cxl/port.c > @@ -0,0 +1,323 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* Copyright(c) 2021 Intel Corporation. All rights reserved. */ > +#include <linux/device.h> > +#include <linux/module.h> > +#include <linux/slab.h> > + > +#include "cxlmem.h" > + > +/** > + * DOC: cxl port s/cxl port/CXL port/ (I'm assuming this should match usage below) or maybe "CXL Port" both places to match typical PCIe spec usage? Capitalization is a useful hint that this term is defined by a spec. > + * > + * The port driver implements the set of functionality needed to allow full > + * decoder enumeration and routing. A CXL port is an abstraction of a CXL > + * component that implements some amount of CXL decoding of CXL.mem traffic. > + * As of the CXL 2.0 spec, this includes: > + * > + * .. list-table:: CXL Components w/ Ports > + * :widths: 25 25 50 > + * :header-rows: 1 > + * > + * * - component > + * - upstream > + * - downstream > + * * - Hostbridge s/Hostbridge/Host bridge/ > + * - ACPI0016 > + * - root port s/root port/Root Port/ to match Switch Ports below (and spec usage). > + * * - Switch > + * - Switch Upstream Port > + * - Switch Downstream Port > + * * - Endpoint > + * - Endpoint Port > + * - N/A What does "N/A" mean here? Is it telling us something useful? > +static void rescan_ports(struct work_struct *work) > +{ > + if (bus_rescan_devices(&cxl_bus_type)) > + pr_warn("Failed to rescan\n"); Needs some context. A bare "Failed to rescan" in the dmesg log without a clue about who emitted it is really annoying. Maybe you defined pr_fmt() somewhere; I couldn't find it easily. Bjorn
On 21-11-23 12:21:28, Bjorn Helgaas wrote: > On Fri, Nov 19, 2021 at 04:02:47PM -0800, Ben Widawsky wrote: > > The CXL port driver is responsible for managing the decoder resources > > contained within the port. It will also provide APIs that other drivers > > will consume for managing these resources. > > > > There are 4 types of ports in a system: > > 1. Platform port. This is a non-programmable entity. Such a port is > > named rootX. It is enumerated by cxl_acpi in an ACPI based system. > > Can you mention the ACPI source (static table, namespace PNP ID) here? Done. > > > 2. Hostbridge port. > > Is "hostbridge" styled as a single word in the spec? I've only seen > "host bridge" elsewhere. > 2 words. I'm sadly inconsistent with this particular word. CXL spec capitalizes it. > > This ports register access is defined in a platform > > specific way (CHBS for ACPI platforms). > > s/This ports/This port's/ > > This doesn't really make sense, though. Are you saying the register > access *mechanism* is platform specific? Or merely that the > enumeration method (ACPI table, ACPI namespace, DT, etc) is > platform-specific? > > I assume "CHBS" is an ACPI static table? > The registers are spec defined. The base address of those registers is defined in a platform specific manner. Enumeration is a better word. CHBS is an ACPI static table, yes. > > It has n downstream ports, > > each of which are known as CXL 2.0 root ports. > > This sounds like a "host bridge port *contains* these root ports." > That doesn't sound right. > What sounds better? A CXL 2.0 Root Port is CXL capabilities on top of the PCIe component which has the PCI_EXP_TYPE_ROOT_PORT cap. In my mental model, a host bridge does contain the root ports. Perhaps I am wrong about that? > > Once the platform > > specific mechanism to get the offset to the registers is obtained it > > operates just like other CXL components. The enumeration of this > > component is started by cxl_acpi and completed by cxl_port. > > > 3. Switch port. A switch port is similar to a hostbridge port except > > register access is defined in the CXL specification in a platform > > agnostic way. The downstream ports for a switch are simply known as > > downstream ports. The enumeration of these are entirely contained > > within cxl_port. > > In PCIe, "Downstream Port" includes both Root Ports and Switch > Downstream Ports. It sounds like that's not the case for CXL? > In CXL 2.0, it's fairly close to true that switch downstream ports and root ports are equivalent. From today's driver perspective they are equivalent. Root ports have some capabilities which switch downstream ports do not. > Well, except above you say that a Host Bridge Port has N Downstream > Ports, so I guess "Downstream Port" probably includes both Host Bridge > Ports and Switch Downstream Ports. Yes, in that case I meant a port which is downstream - confusing indeed. > > Maybe you should just omit the "The downstream ports for a switch are > simply known as downstream ports" sentence. > Sounds good. > > 4. Endpoint port. Endpoint ports are similar to switch ports with the > > exception that they have no downstream ports, only the underlying > > media on the device. The enumeration of these are started by cxl_pci, > > and completed by cxl_port. > > Does CXL use an "Upstream Port" concept similar to PCIe? In PCIe, > "Upstream Port" includes both Switch Upstream Ports and the Upstream > Port on an Endpoint. Not really. 
Endpoints aren't known as ports in the spec and they have a decent amount of divergence from upstream ports. The main area where they do converge is in the memory decoding capabilities. In fact, it might come to a point where we find adding an endpoint as a port does not make sense, but for now it does. > > I hope this driver is not modeled on the PCI portdrv. IMO, that was a > design error, and the "port service drivers" (PME, hotplug, AER, etc) > should be directly integrated into the PCI core instead of pretending > to be independent drivers. I'd like to understand a bit about why you think it was a design error. The port driver is intended to be a port services driver, however I see the services provided as quite different than the ones you mention. The primary service cxl_port will provide from here on out is the ability to manage HDM decoder resources for a port. Other independent drivers that want to use HDM decoder resources would rely on the port driver for this. It could be in CXL core just the same, but I don't see too much of a benefit since the code would be almost identical. One nice aspect of having the port driver outside of CXL core is it would allow CXL devices which do not need port services (type-1 and probably type-2) to not need to load the port driver. We do not have examples of such devices today but there's good evidence they are being built. Whether or not they will even want CXL core is another topic up for debate however. I admit Dan and I did discuss putting this in either its own port driver, or within core as a port driver. My inclination is, less is more in CXL core; but perhaps you will be able to change my mind. > > > --- a/Documentation/driver-api/cxl/memory-devices.rst > > +++ b/Documentation/driver-api/cxl/memory-devices.rst > > @@ -28,6 +28,11 @@ CXL Memory Device > > .. kernel-doc:: drivers/cxl/pci.c > > :internal: > > > > +CXL Port > > +-------- > > +.. kernel-doc:: drivers/cxl/port.c > > + :doc: cxl port > > s/cxl port/CXL Port/ ? I don't know exactly how this gets rendered by > ReST. I believe this is the specific tag as specified in the .c file. I can capitalize it, but we haven't done this for other tags. > > > CXL Core > > -------- > > .. kernel-doc:: drivers/cxl/cxl.h > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig > > index ef05e96f8f97..3aeb33bba5a3 100644 > > --- a/drivers/cxl/Kconfig > > +++ b/drivers/cxl/Kconfig > > @@ -77,4 +77,26 @@ config CXL_PMEM > > provisioning the persistent memory capacity of CXL memory expanders. > > > > If unsure say 'm'. > > + > > +config CXL_MEM > > + tristate "CXL.mem: Memory Devices" > > + select CXL_PORT > > + depends on CXL_PCI > > + default CXL_BUS > > + help > > + The CXL.mem protocol allows a device to act as a provider of "System > > + RAM" and/or "Persistent Memory" that is fully coherent as if the > > + memory was attached to the typical CPU memory controller. This is > > + known as HDM "Host-managed Device Memory". > > s/was attached/were attached/ > > > + Say 'y/m' to enable a driver that will attach to CXL.mem devices for > > + memory expansion and control of HDM. See Chapter 9.13 in the CXL 2.0 > > + specification for a detailed description of HDM. > > + > > + If unsure say 'm'. > > + > > + > > Spurious blank line. > > > +config CXL_PORT > > + tristate > > + > > endif > > > --- /dev/null > > +++ b/drivers/cxl/port.c > > @@ -0,0 +1,323 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > +/* Copyright(c) 2021 Intel Corporation. All rights reserved. 
*/ > > +#include <linux/device.h> > > +#include <linux/module.h> > > +#include <linux/slab.h> > > + > > +#include "cxlmem.h" > > + > > +/** > > + * DOC: cxl port > > s/cxl port/CXL port/ (I'm assuming this should match usage below) > or maybe "CXL Port" both places to match typical PCIe spec usage? > > Capitalization is a useful hint that this term is defined by a spec. > As above, I don't mind changing this at all, but this would be inconsistent with the other tags we have defined. > > + * > > + * The port driver implements the set of functionality needed to allow full > > + * decoder enumeration and routing. A CXL port is an abstraction of a CXL > > + * component that implements some amount of CXL decoding of CXL.mem traffic. > > + * As of the CXL 2.0 spec, this includes: > > + * > > + * .. list-table:: CXL Components w/ Ports > > + * :widths: 25 25 50 > > + * :header-rows: 1 > > + * > > + * * - component > > + * - upstream > > + * - downstream > > + * * - Hostbridge > > s/Hostbridge/Host bridge/ > > > + * - ACPI0016 > > + * - root port > > s/root port/Root Port/ to match Switch Ports below (and spec usage). > > > + * * - Switch > > + * - Switch Upstream Port > > + * - Switch Downstream Port > > + * * - Endpoint > > + * - Endpoint Port > > + * - N/A > > What does "N/A" mean here? Is it telling us something useful? This gets rendered into a table that looks like the following: | component | upstream | downstream | | --------- | -------- | ---------- | | Hostbridge | ACPI0016 | Root Port | | Switch | Switch Upstream Port | Switch Downstream Port | | Endpoint | Endpoint Port | N/A | > > > +static void rescan_ports(struct work_struct *work) > > +{ > > + if (bus_rescan_devices(&cxl_bus_type)) > > + pr_warn("Failed to rescan\n"); > > Needs some context. A bare "Failed to rescan" in the dmesg log > without a clue about who emitted it is really annoying. > > Maybe you defined pr_fmt() somewhere; I couldn't find it easily. > I actually didn't know about pr_fmt() trick, so thanks for that. I'll improve this message to be more useful and contextual. > Bjorn
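Since pr_fmt() came up, here is the usual idiom as a minimal sketch, assuming it sits at the very top of drivers/cxl/port.c before any #include and that the object is built as cxl_port.ko:

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/device.h>
#include <linux/module.h>
#include <linux/workqueue.h>

#include "cxlmem.h"

static void rescan_ports(struct work_struct *work)
{
	if (bus_rescan_devices(&cxl_bus_type))
		pr_warn("Failed to rescan\n");	/* logs "cxl_port: Failed to rescan" */
}

With that in place every pr_*() call in the file carries the module name, which addresses the bare-message concern; whether the warning is worth keeping at all comes up further down the thread.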
On Tue, Nov 23, 2021 at 2:03 PM Ben Widawsky <ben.widawsky@intel.com> wrote: [..] > > I hope this driver is not modeled on the PCI portdrv. IMO, that was a > > design error, and the "port service drivers" (PME, hotplug, AER, etc) > > should be directly integrated into the PCI core instead of pretending > > to be independent drivers. > > I'd like to understand a bit about why you think it was a design error. The port > driver is intended to be a port services driver, however I see the services > provided as quite different than the ones you mention. The primary service > cxl_port will provide from here on out is the ability to manage HDM decoder > resources for a port. Other independent drivers that want to use HDM decoder > resources would rely on the port driver for this. > > It could be in CXL core just the same, but I don't see too much of a benefit > since the code would be almost identical. One nice aspect of having the port > driver outside of CXL core is it would allow CXL devices which do not need port > services (type-1 and probably type-2) to not need to load the port driver. We do > not have examples of such devices today but there's good evidence they are being > built. Whether or not they will even want CXL core is another topic up for > debate however. > > I admit Dan and I did discuss putting this in either its own port driver, or > within core as a port driver. My inclination is, less is more in CXL core; but > perhaps you will be able to change my mind. No, I don't think Bjorn is talking about whether the driver is integrated into cxl_core.ko vs its own cxl_port.ko. IIUC this goes back to the original contention about have /sys/bus/cxl at all: https://lore.kernel.org/r/CAPcyv4iu8D-hJoujLXw8a4myS7trOE1FcUhESLB_imGMECVfrg@mail.gmail.com Unlike pcieportdrv where the functionality is bounded to a single device instance with relatively simpler hierarchical coordination of the memory space and services. CXL interleaving is both a foreign concept to the PCIE core and an awkward fit for the pci_bus_type device model. CXL uses the cxl_bus_type and bus drivers to coordinate CXL operations that have cross PCI device implications. Outside of that I think the auxiliary device driver model, of which the PCIE portdrv model is an open-coded form, is a useful construct for separation of concerns. That said, I do want to hear more about what trouble it has caused though to make sure that CXL does not trip over the same issues longer term. [..] > > > +static void rescan_ports(struct work_struct *work) > > > +{ > > > + if (bus_rescan_devices(&cxl_bus_type)) > > > + pr_warn("Failed to rescan\n"); > > > > Needs some context. A bare "Failed to rescan" in the dmesg log > > without a clue about who emitted it is really annoying. > > > > Maybe you defined pr_fmt() somewhere; I couldn't find it easily. > > > > I actually didn't know about pr_fmt() trick, so thanks for that. I'll improve > this message to be more useful and contextual. I am skeptical that this implementation of the workqueue properly synchronizes with workqueue shutdown concerns, but I have not had a chance to dig in too deep on this patchset. Regardless, it is not worth reporting a rescan failure, because those are to be expected here. The rescan attempts to rescan when a constraint changes, there is no guarantee that all constraints are met just because one constraint changed, so rescan failures (device_attach() failures) are not interesting to report. The individual driver probe failure reporting is sufficient.
On 21-11-23 14:36:32, Dan Williams wrote: > On Tue, Nov 23, 2021 at 2:03 PM Ben Widawsky <ben.widawsky@intel.com> wrote: > [..] > > > I hope this driver is not modeled on the PCI portdrv. IMO, that was a > > > design error, and the "port service drivers" (PME, hotplug, AER, etc) > > > should be directly integrated into the PCI core instead of pretending > > > to be independent drivers. > > > > I'd like to understand a bit about why you think it was a design error. The port > > driver is intended to be a port services driver, however I see the services > > provided as quite different than the ones you mention. The primary service > > cxl_port will provide from here on out is the ability to manage HDM decoder > > resources for a port. Other independent drivers that want to use HDM decoder > > resources would rely on the port driver for this. > > > > It could be in CXL core just the same, but I don't see too much of a benefit > > since the code would be almost identical. One nice aspect of having the port > > driver outside of CXL core is it would allow CXL devices which do not need port > > services (type-1 and probably type-2) to not need to load the port driver. We do > > not have examples of such devices today but there's good evidence they are being > > built. Whether or not they will even want CXL core is another topic up for > > debate however. > > > > I admit Dan and I did discuss putting this in either its own port driver, or > > within core as a port driver. My inclination is, less is more in CXL core; but > > perhaps you will be able to change my mind. > > No, I don't think Bjorn is talking about whether the driver is > integrated into cxl_core.ko vs its own cxl_port.ko. IIUC this goes > back to the original contention about have /sys/bus/cxl at all: > > https://lore.kernel.org/r/CAPcyv4iu8D-hJoujLXw8a4myS7trOE1FcUhESLB_imGMECVfrg@mail.gmail.com > > Unlike pcieportdrv where the functionality is bounded to a single > device instance with relatively simpler hierarchical coordination of > the memory space and services. CXL interleaving is both a foreign > concept to the PCIE core and an awkward fit for the pci_bus_type > device model. CXL uses the cxl_bus_type and bus drivers to coordinate > CXL operations that have cross PCI device implications. > > Outside of that I think the auxiliary device driver model, of which > the PCIE portdrv model is an open-coded form, is a useful construct > for separation of concerns. That said, I do want to hear more about > what trouble it has caused though to make sure that CXL does not trip > over the same issues longer term. > > [..] > > > > +static void rescan_ports(struct work_struct *work) > > > > +{ > > > > + if (bus_rescan_devices(&cxl_bus_type)) > > > > + pr_warn("Failed to rescan\n"); > > > > > > Needs some context. A bare "Failed to rescan" in the dmesg log > > > without a clue about who emitted it is really annoying. > > > > > > Maybe you defined pr_fmt() somewhere; I couldn't find it easily. > > > > > > > I actually didn't know about pr_fmt() trick, so thanks for that. I'll improve > > this message to be more useful and contextual. > > I am skeptical that this implementation of the workqueue properly > synchronizes with workqueue shutdown concerns, but I have not had a > chance to dig in too deep on this patchset. Yeah, we discussed this. Please do check that. I'm happy to send out v2 with all of Jonathan's fixes first, so you don't have to duplicate effort with what he has already uncovered. 
> > Regardless, it is not worth reporting a rescan failure, because those > are to be expected here. The rescan attempts to rescan when a > constraint changes, there is no guarantee that all constraints are met > just because one constraint changed, so rescan failures > (device_attach() failures) are not interesting to report. The > individual driver probe failure reporting is sufficient. Agreed.
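On the workqueue shutdown concern, a minimal sketch of one way to keep the deferred rescan from racing module unload, assuming a single module-scoped work item (driver registration and the usual includes are elided):

static void rescan_ports(struct work_struct *work)
{
	/* failures are expected and not reported, per the discussion above */
	bus_rescan_devices(&cxl_bus_type);
}

/* all deferred rescans funnel through one work item */
static DECLARE_WORK(cxl_rescan_work, rescan_ports);

/* callers queue a rescan with schedule_work(&cxl_rescan_work) */

static void __exit cxl_port_exit(void)
{
	/* driver unregistration elided */

	/*
	 * Wait for any queued or in-flight rescan to finish before the
	 * module text (and rescan_ports() itself) goes away.
	 */
	cancel_work_sync(&cxl_rescan_work);
}
module_exit(cxl_port_exit);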
[+cc Christoph, since he has opinions about portdrv; Greg, Rafael, since they have good opinions about sysfs structure] On Tue, Nov 23, 2021 at 02:36:32PM -0800, Dan Williams wrote: > On Tue, Nov 23, 2021 at 2:03 PM Ben Widawsky <ben.widawsky@intel.com> wrote: > [..] > > > I hope this driver is not modeled on the PCI portdrv. IMO, that > > > was a design error, and the "port service drivers" (PME, > > > hotplug, AER, etc) should be directly integrated into the PCI > > > core instead of pretending to be independent drivers. > > > > I'd like to understand a bit about why you think it was a design > > error. The port driver is intended to be a port services driver, > > however I see the services provided as quite different than the > > ones you mention. The primary service cxl_port will provide from > > here on out is the ability to manage HDM decoder resources for a > > port. Other independent drivers that want to use HDM decoder > > resources would rely on the port driver for this. > > > > It could be in CXL core just the same, but I don't see too much of > > a benefit since the code would be almost identical. One nice > > aspect of having the port driver outside of CXL core is it would > > allow CXL devices which do not need port services (type-1 and > > probably type-2) to not need to load the port driver. We do not > > have examples of such devices today but there's good evidence they > > are being built. Whether or not they will even want CXL core is > > another topic up for debate however. > > > > I admit Dan and I did discuss putting this in either its own port > > driver, or within core as a port driver. My inclination is, less > > is more in CXL core; but perhaps you will be able to change my > > mind. > > No, I don't think Bjorn is talking about whether the driver is > integrated into cxl_core.ko vs its own cxl_port.ko. IIUC this goes > back to the original contention about have /sys/bus/cxl at all: > > https://lore.kernel.org/r/CAPcyv4iu8D-hJoujLXw8a4myS7trOE1FcUhESLB_imGMECVfrg@mail.gmail.com That question was about whether we want the same physical device to be represented both in the /sys/bus/pci hierarchy and the /sys/bus/cxl hierarchy. That still seems a little weird to me, but I don't know enough about the CXL constraints to really object to it. My question here is more about whether you're going to use something like the pcie_port_service_register() model for supporting multiple drivers attached to the same physical device. The PCIe portdrv creates a /sys/bus/pci_express hierarchy parallel to the /sys/bus/pci hierarchy. The pci_express hierarchy has a "device" for every service (hotplug, AER, DPC, PME, etc) (see pcie_device_init()). This device creation is quite complicated and depends on whether the Port advertises a Capability, whether the platform has granted control to the OS, whether support is compiled in, etc. I think that was a mistake because these hotplug, AER, DPC, PME "devices" are not independent things. They are optional features that can be added to a variety of devices, and those devices might have their own drivers. For example, we want to write drivers for vendor-specific functionality like PMUs in switches, but we can't do that cleanly because portdrv claims them. The portdrv features are fully specified by the PCIe spec, so nothing is vendor-specific. They share interrupts. They share power state. They cannot be reset independently. They are not addressable entities in the usual bus/device/function model. 
They can't be removed or added like the underlying device can. I wasn't there when they were designed, but from reading the spec, it seems like they were designed as optional features of a device, not as separate devices themselves. > Unlike pcieportdrv where the functionality is bounded to a single > device instance with relatively simpler hierarchical coordination of > the memory space and services. CXL interleaving is both a foreign > concept to the PCIE core and an awkward fit for the pci_bus_type > device model. CXL uses the cxl_bus_type and bus drivers to > coordinate CXL operations that have cross PCI device implications. > > Outside of that I think the auxiliary device driver model, of which > the PCIE portdrv model is an open-coded form, is a useful construct > for separation of concerns. That said, I do want to hear more about > what trouble it has caused though to make sure that CXL does not > trip over the same issues longer term. Bjorn
On Tue, Nov 23, 2021 at 3:56 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > [+cc Christoph, since he has opinions about portdrv; > Greg, Rafael, since they have good opinions about sysfs structure] > > On Tue, Nov 23, 2021 at 02:36:32PM -0800, Dan Williams wrote: > > On Tue, Nov 23, 2021 at 2:03 PM Ben Widawsky <ben.widawsky@intel.com> wrote: > > [..] > > > > I hope this driver is not modeled on the PCI portdrv. IMO, that > > > > was a design error, and the "port service drivers" (PME, > > > > hotplug, AER, etc) should be directly integrated into the PCI > > > > core instead of pretending to be independent drivers. > > > > > > I'd like to understand a bit about why you think it was a design > > > error. The port driver is intended to be a port services driver, > > > however I see the services provided as quite different than the > > > ones you mention. The primary service cxl_port will provide from > > > here on out is the ability to manage HDM decoder resources for a > > > port. Other independent drivers that want to use HDM decoder > > > resources would rely on the port driver for this. > > > > > > It could be in CXL core just the same, but I don't see too much of > > > a benefit since the code would be almost identical. One nice > > > aspect of having the port driver outside of CXL core is it would > > > allow CXL devices which do not need port services (type-1 and > > > probably type-2) to not need to load the port driver. We do not > > > have examples of such devices today but there's good evidence they > > > are being built. Whether or not they will even want CXL core is > > > another topic up for debate however. > > > > > > I admit Dan and I did discuss putting this in either its own port > > > driver, or within core as a port driver. My inclination is, less > > > is more in CXL core; but perhaps you will be able to change my > > > mind. > > > > No, I don't think Bjorn is talking about whether the driver is > > integrated into cxl_core.ko vs its own cxl_port.ko. IIUC this goes > > back to the original contention about have /sys/bus/cxl at all: > > > > https://lore.kernel.org/r/CAPcyv4iu8D-hJoujLXw8a4myS7trOE1FcUhESLB_imGMECVfrg@mail.gmail.com > > That question was about whether we want the same physical device to be > represented both in the /sys/bus/pci hierarchy and the /sys/bus/cxl > hierarchy. That still seems a little weird to me, but I don't know > enough about the CXL constraints to really object to it. > > My question here is more about whether you're going to use something > like the pcie_port_service_register() model for supporting multiple > drivers attached to the same physical device. > > The PCIe portdrv creates a /sys/bus/pci_express hierarchy parallel to > the /sys/bus/pci hierarchy. The pci_express hierarchy has a "device" > for every service (hotplug, AER, DPC, PME, etc) (see > pcie_device_init()). This device creation is quite complicated and > depends on whether the Port advertises a Capability, whether the > platform has granted control to the OS, whether support is compiled > in, etc. > > I think that was a mistake because these hotplug, AER, DPC, PME > "devices" are not independent things. They are optional features that > can be added to a variety of devices, and those devices might have > their own drivers. For example, we want to write drivers for > vendor-specific functionality like PMUs in switches, but we can't do > that cleanly because portdrv claims them. > > The portdrv features are fully specified by the PCIe spec, so nothing > is vendor-specific. 
They share interrupts. They share power state. > They cannot be reset independently. They are not addressable entities > in the usual bus/device/function model. They can't be removed or > added like the underlying device can. I wasn't there when they were > designed, but from reading the spec, it seems like they were designed > as optional features of a device, not as separate devices themselves. Let me ask a clarifying question coming from the other direction that resulted in the creation of the auxiliary bus architecture. Some background. RDMA is a protocol that may run on top of Ethernet. Consider the case where you have multiple generations of Ethernet adapter devices, but they all support common RDMA functionality. You only have the one PCI device to attach a unique Ethernet driver. What is an idiomatic way to deploy a module that automatically loads and attaches to the exported common functionality across adapters that otherwise have a unique native driver for the hardware device? Another example, the Native PCIe Enclosure Management (NPEM) specification defines a handful of registers that can appear anywhere in the PCIe hierarchy. How can you write a common driver that is generically applicable to any given NPEM instance? The auxiliary bus answer to those questions is to register an auxiliary device to be driven by a common auxiliary driver across hard-device generations from the same vendor or even across vendors. For your example about a PCIe port PMU driver it ultimately requires something to enumerate that capability and a library of code to exercise that functionality. What is a more natural fit than a Linux "device" and a Linux driver to coordinate attaching enabling code to a standalone hardware capability? PCIe portdrv may be awkward because there was never a real native driver for the device to begin with and it all could have handled by 'pcie_portdriver' directly without registering more Linux devices, but assigning self contained features to Linux devices otherwise seems a useful idiom to me. As for CXL, there is no analog of the PCIe portdrv pattern of just having a device act as a multiplexer of features to other Linux devices.
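As a concrete reference for the auxiliary bus pattern being described, a minimal sketch with made-up "vendorx" names (the real users are the Ethernet/RDMA drivers mentioned above); it assumes CONFIG_AUXILIARY_BUS and keeps error unwinding to the bare minimum:

#include <linux/auxiliary_bus.h>
#include <linux/pci.h>

/* Parent side, e.g. in the Ethernet driver's PCI probe: publish the RDMA function. */
static void vendorx_rdma_release(struct device *dev)
{
	/* nothing dynamically allocated in this sketch */
}

static int vendorx_publish_rdma(struct pci_dev *pdev, struct auxiliary_device *adev)
{
	int rc;

	adev->name = "rdma";			/* matched as "<parent module>.rdma" */
	adev->dev.parent = &pdev->dev;
	adev->dev.release = vendorx_rdma_release;

	rc = auxiliary_device_init(adev);
	if (rc)
		return rc;

	rc = auxiliary_device_add(adev);
	if (rc)
		auxiliary_device_uninit(adev);
	return rc;
}

/* Feature-driver side, a separate module shared across adapter generations: */
static int vendorx_rdma_probe(struct auxiliary_device *adev,
			      const struct auxiliary_device_id *id)
{
	return 0;	/* common RDMA setup, independent of the base Ethernet driver */
}

static const struct auxiliary_device_id vendorx_rdma_ids[] = {
	{ .name = "vendorx_eth.rdma" },
	{ }
};
MODULE_DEVICE_TABLE(auxiliary, vendorx_rdma_ids);

static struct auxiliary_driver vendorx_rdma_driver = {
	.probe = vendorx_rdma_probe,
	.id_table = vendorx_rdma_ids,
};
module_auxiliary_driver(vendorx_rdma_driver);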
On Tue, Nov 23, 2021 at 04:40:06PM -0800, Dan Williams wrote: > Let me ask a clarifying question coming from the other direction that > resulted in the creation of the auxiliary bus architecture. Some > background. RDMA is a protocol that may run on top of Ethernet. No, RDMA is a concept. Linux supports 2 and a half RDMA protocols that run over ethernet (RoCE v1 and v2 and iWarp). > Consider the case where you have multiple generations of Ethernet > adapter devices, but they all support common RDMA functionality. You > only have the one PCI device to attach a unique Ethernet driver. What > is an idiomatic way to deploy a module that automatically loads and > attaches to the exported common functionality across adapters that > otherwise have a unique native driver for the hardware device? The whole aux bus drama is mostly because the intel design for these is really fucked up. All the sane HCAs do not use this model. All this attchment crap really should not be there. > Another example, the Native PCIe Enclosure Management (NPEM) > specification defines a handful of registers that can appear anywhere > in the PCIe hierarchy. How can you write a common driver that is > generically applicable to any given NPEM instance? Another totally messed up spec. But then pretty much everything coming from the PCIe SIG in terms of interface tends to be really, really broken lately.
On Tue, Nov 23, 2021 at 10:33 PM Christoph Hellwig <hch@lst.de> wrote: > > On Tue, Nov 23, 2021 at 04:40:06PM -0800, Dan Williams wrote: > > Let me ask a clarifying question coming from the other direction that > > resulted in the creation of the auxiliary bus architecture. Some > > background. RDMA is a protocol that may run on top of Ethernet. > > No, RDMA is a concept. Linux supports 2 and a half RDMA protocols > that run over ethernet (RoCE v1 and v2 and iWarp). Yes, I was being too coarse, point taken. However, I don't think that changes the observation that multiple vendors are using aux bus to share a feature driver across multiple base Ethernet drivers. > > > Consider the case where you have multiple generations of Ethernet > > adapter devices, but they all support common RDMA functionality. You > > only have the one PCI device to attach a unique Ethernet driver. What > > is an idiomatic way to deploy a module that automatically loads and > > attaches to the exported common functionality across adapters that > > otherwise have a unique native driver for the hardware device? > > The whole aux bus drama is mostly because the intel design for these > is really fucked up. All the sane HCAs do not use this model. All > this attchment crap really should not be there. I am missing the counter proposal in both Bjorn's and your distaste for aux bus and PCIe portdrv? > > Another example, the Native PCIe Enclosure Management (NPEM) > > specification defines a handful of registers that can appear anywhere > > in the PCIe hierarchy. How can you write a common driver that is > > generically applicable to any given NPEM instance? > > Another totally messed up spec. But then pretty much everything coming > from the PCIe SIG in terms of interface tends to be really, really > broken lately. DVSEC and DOE is more of the same in terms of composing add-on features into devices. Hardware vendors want to mix multiple hard-IPs into a single device, aux bus is one response. Topology specific buses like /sys/bus/cxl are another. This CXL port driver is offering enumeration, link management, and memory decode setup services to the rest of the topology. I see it as similar to management protocol services offered by libsas.
On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote: > I am missing the counter proposal in both Bjorn's and your distaste > for aux bus and PCIe portdrv? Given that I've only brought in in the last mail I have no idea what the original proposal even is.
On Wed, Nov 24, 2021 at 08:28:24AM +0100, Christoph Hellwig wrote: > On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote: > > I am missing the counter proposal in both Bjorn's and your distaste > > for aux bus and PCIe portdrv? > > Given that I've only brought in in the last mail I have no idea what > the original proposal even is. Neither do I :(
On Tue, Nov 23, 2021 at 11:33 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Wed, Nov 24, 2021 at 08:28:24AM +0100, Christoph Hellwig wrote: > > On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote: > > > I am missing the counter proposal in both Bjorn's and your distaste > > > for aux bus and PCIe portdrv? > > > > Given that I've only brought in in the last mail I have no idea what > > the original proposal even is. > > Neither do I :( To be clear I am also trying to get to the root of Bjorn's concern. The proposal in $SUBJECT is to build on / treat a CXL topology as a Linux device topology on /sys/bus/cxl that references devices on /sys/bus/platform (CXL ACPI topology root and Host Bridges) and /sys/bus/pci (CXL Switches and Endpoints). This CXL port device topology has already been shipping for a few kernel cycles. What is on the table now is a driver for CXL port devices (a logical Linux construct). The driver handles discovery of "component registers" either by ACPI table or PCI DVSEC and offers services to provision CXL regions. CXL 'regions' are also proposed as Linux devices that represent an active CXL memory range which can interleave multiple endpoints across multiple switches and host bridges.
On Tue, Nov 23, 2021 at 11:54:03PM -0800, Dan Williams wrote: > On Tue, Nov 23, 2021 at 11:33 PM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Wed, Nov 24, 2021 at 08:28:24AM +0100, Christoph Hellwig wrote: > > > On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote: > > > > I am missing the counter proposal in both Bjorn's and your distaste > > > > for aux bus and PCIe portdrv? > > > > > > Given that I've only brought in in the last mail I have no idea what > > > the original proposal even is. > > > > Neither do I :( > > To be clear I am also trying to get to the root of Bjorn's concern. > > The proposal in $SUBJECT is to build on / treat a CXL topology as a > Linux device topology on /sys/bus/cxl that references devices on > /sys/bus/platform (CXL ACPI topology root and Host Bridges) and > /sys/bus/pci (CXL Switches and Endpoints). Wait, I am confused. A bus lists devices that are on that bus. It does not list devices that are on other busses. Now a device in a bus list can have a parent be on different types of busses, as the parent device does not matter, but the device itself can not be of different types. So I do not understand what you are describing here at all. Do you have an example output of sysfs that shows this situation? > This CXL port device topology has already been shipping for a few > kernel cycles. So it's always been broken? :) > What is on > the table now is a driver for CXL port devices (a logical Linux > construct). The driver handles discovery of "component registers" > either by ACPI table or PCI DVSEC and offers services to proxision CXL > regions. So a normal bus controller device that creates new devices of a bus type, right? What is special about that? > CXL 'regions' are also proposed as Linux devices that > represent an active CXL memory range which can interleave multiple > endpoints across multiple switches and host bridges. As long as these 'devices' have drivers and do not mess with the resources of any other device in the system, I do not understand the problem here. Or is the issue that you are again trying to carve up "real" devices into tiny devices that then somehow need to be aware of the resources being touched by different drivers at the same time? I'm still confused, sorry. greg k-h
On Wed, Nov 24, 2021 at 12:22 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Tue, Nov 23, 2021 at 11:54:03PM -0800, Dan Williams wrote: > > On Tue, Nov 23, 2021 at 11:33 PM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > > > > On Wed, Nov 24, 2021 at 08:28:24AM +0100, Christoph Hellwig wrote: > > > > On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote: > > > > > I am missing the counter proposal in both Bjorn's and your distaste > > > > > for aux bus and PCIe portdrv? > > > > > > > > Given that I've only brought in in the last mail I have no idea what > > > > the original proposal even is. > > > > > > Neither do I :( > > > > To be clear I am also trying to get to the root of Bjorn's concern. > > > > The proposal in $SUBJECT is to build on / treat a CXL topology as a > > Linux device topology on /sys/bus/cxl that references devices on > > /sys/bus/platform (CXL ACPI topology root and Host Bridges) and > > /sys/bus/pci (CXL Switches and Endpoints). > > Wait, I am confused. > > A bus lists devices that are on that bus. It does not list devices that > are on other busses. > > Now a device in a bus list can have a parent be on different types of > busses, as the parent device does not matter, but the device itself can > not be of different types. > > So I do not understand what you are describing here at all. Do you have > an example output of sysfs that shows this situation? Commit 40ba17afdfab ("cxl/acpi: Introduce cxl_decoder objects") ...has an example of the devices registered on the CXL bus by the cxl_acpi driver. > > This CXL port device topology has already been shipping for a few > > kernel cycles. > > So it's always been broken? :) Kidding aside, CXL has different moving pieces than PCI and the Linux device-driver model is how we are organizing that complexity. CXL is also symbiotically attached to PCI as it borrows the enumeration mechanism while building up a parallel CXL.mem universe to live alongside PCI.config and PCI.mmio. The CXL subsystem is similar to MD / DM where virtual devices are built up from other devices. > > What is on > > the table now is a driver for CXL port devices (a logical Linux > > construct). The driver handles discovery of "component registers" > > either by ACPI table or PCI DVSEC and offers services to proxision CXL > > regions. > > So a normal bus controller device that creates new devices of a bus > type, right? What is special about that? Yes, which is the root of my confusion about Bjorn's concern. > > CXL 'regions' are also proposed as Linux devices that > > represent an active CXL memory range which can interleave multiple > > endpoints across multiple switches and host bridges. > > As long as these 'devices' have drivers and do not mess with the > resources of any other device in the system, I do not understand the > problem here. Correct, these drivers manage CXL.mem resources and leave PCI.mmio resource management to PCI. > Or is the issue that you are again trying to carve up "real" devices > into tiny devices that then somehow need to be aware of the resources > being touched by different drivers at the same time? No, there's no multiplexing of resources across devices / drivers that requires cross-driver awareness. > I'm still confused, sorry. No worries, appreciate the attention to make sure this implementation is idiomatic and avoids any architectural dragons.
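To make the "bus controller device that creates new devices of a bus type" point concrete, a minimal sketch of a logical port device registered on a dedicated bus while its parent lives on another bus; the toy_* names are made up, and bus_register() at module init plus everything beyond the registration failure path is elided:

#include <linux/device.h>
#include <linux/err.h>
#include <linux/slab.h>

static struct bus_type toy_cxl_bus_type = {
	.name = "cxl",
};

static void toy_port_release(struct device *dev)
{
	kfree(dev);
}

static struct device *toy_add_port(struct device *parent, int id)
{
	struct device *dev = kzalloc(sizeof(*dev), GFP_KERNEL);

	if (!dev)
		return ERR_PTR(-ENOMEM);

	device_initialize(dev);
	dev->bus = &toy_cxl_bus_type;	/* the port itself is a cxl-bus device... */
	dev->parent = parent;		/* ...while its parent may be a platform or pci device */
	dev->release = toy_port_release;
	dev_set_name(dev, "port%d", id);

	if (device_add(dev)) {
		put_device(dev);	/* release() frees the allocation */
		return ERR_PTR(-ENODEV);
	}
	return dev;
}

This is the sense in which /sys/bus/cxl devices "reference" devices on /sys/bus/platform or /sys/bus/pci: only through parentage, not by listing them on the cxl bus itself.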
On Tue, Nov 23, 2021 at 02:03:15PM -0800, Ben Widawsky wrote: > On 21-11-23 12:21:28, Bjorn Helgaas wrote: > > On Fri, Nov 19, 2021 at 04:02:47PM -0800, Ben Widawsky wrote: > > > 2. Hostbridge port. > > > ... > > > It has n downstream ports, > > > each of which are known as CXL 2.0 root ports. > > > > This sounds like a "host bridge port *contains* these root ports." > > That doesn't sound right. > > What sounds better? A CXL 2.0 Root Port is CXL capabilities on top > of the PCIe component which has the PCI_EXP_TYPE_ROOT_PORT cap. In > my mental model, a host bridge does contain the root ports. Perhaps > I am wrong about that? "A host bridge contains the root ports" makes sense to me. "A host bridge *port* contains root ports" does not. The PCIe spec would describe this as a "Root Complex may support one or more [Root] Ports" (see PCIe r5.0, sec 1.3.1). In PCIe, a "Port" is "an interface between a component and a PCI Express Link." It doesn't contain other Ports. Sounds like CXL is the same here, and using the same terminology (assuming that's what the CXL spec does) will reduce confusion. > > > 3. Switch port. A switch port is similar to a hostbridge port except > > > register access is defined in the CXL specification in a platform > > > agnostic way. The downstream ports for a switch are simply known as > > > downstream ports. The enumeration of these are entirely contained > > > within cxl_port. > > > > In PCIe, "Downstream Port" includes both Root Ports and Switch > > Downstream Ports. It sounds like that's not the case for CXL? > > In CXL 2.0, it's fairly close to true that switch downstream ports > and root ports are equivalent. From today's driver perspective they > are equivalent. Root ports have some capabilities which switch > downstream ports do not. Same as PCIe. > > > 4. Endpoint port. Endpoint ports are similar to switch ports with the > > > exception that they have no downstream ports, only the underlying > > > media on the device. The enumeration of these are started by cxl_pci, > > > and completed by cxl_port. > > > > Does CXL use an "Upstream Port" concept similar to PCIe? In PCIe, > > "Upstream Port" includes both Switch Upstream Ports and the Upstream > > Port on an Endpoint. > > Not really. Endpoints aren't known as ports in the spec and they > have a decent amount of divergence from upstream ports. The main > area where they do converge is in the memory decoding capabilities. > In fact, it might come to a point where we find adding an endpoint > as a port does not make sense, but for now it does. Since a PCIe "Port" refers to the interface between a component and a Link, PCIe Endpoints have Upstream Ports just like switches do. I'm guessing CXL is the same. The part about "endpoint ports have no downstream ports" is what doesn't read well to me because ports don't have other ports. This section is about the four types of ports in a system. I'm trying to match those up with spec terms, either PCIe or CXL. It sounds like you intend bullet 3 to include both Switch Upstream Ports and Switch Downstream Ports, and bullet 4 to be only Endpoint Ports (which are Upstream Ports). > > I hope this driver is not modeled on the PCI portdrv. IMO, that > > was a design error, and the "port service drivers" (PME, hotplug, > > AER, etc) should be directly integrated into the PCI core instead > > of pretending to be independent drivers. > > I'd like to understand a bit about why you think it was a design > error. 
The port driver is intended to be a port services driver, > however I see the services provided as quite different than the ones > you mention. The primary service cxl_port will provide from here on > out is the ability to manage HDM decoder resources for a port. Other > independent drivers that want to use HDM decoder resources would > rely on the port driver for this. I'll continue this part in the later part of the thread. > > > + * * - Switch > > > + * - Switch Upstream Port > > > + * - Switch Downstream Port > > > + * * - Endpoint > > > + * - Endpoint Port > > > + * - N/A > > > > What does "N/A" mean here? Is it telling us something useful? > > This gets rendered into a table that looks like the following: > > | component | upstream | downstream | > | --------- | -------- | ---------- | > | Hostbridge | ACPI0016 | Root Port | > | Switch | Switch Upstream Port | Switch Downstream Port | > | Endpoint | Endpoint Port | N/A | Makes sense, thanks. I didn't know how to read the ReST and thought this was just a list, where N/A wouldn't make much sense. Bjorn
On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote: > On Tue, Nov 23, 2021 at 10:33 PM Christoph Hellwig <hch@lst.de> wrote: > > On Tue, Nov 23, 2021 at 04:40:06PM -0800, Dan Williams wrote: > > > Let me ask a clarifying question coming from the other direction that > > > resulted in the creation of the auxiliary bus architecture. Some > > > background. RDMA is a protocol that may run on top of Ethernet. > > > > No, RDMA is a concept. Linux supports 2 and a half RDMA protocols > > that run over ethernet (RoCE v1 and v2 and iWarp). > > Yes, I was being too coarse, point taken. However, I don't think that > changes the observation that multiple vendors are using aux bus to > share a feature driver across multiple base Ethernet drivers. > > > > Consider the case where you have multiple generations of Ethernet > > > adapter devices, but they all support common RDMA functionality. You > > > only have the one PCI device to attach a unique Ethernet driver. What > > > is an idiomatic way to deploy a module that automatically loads and > > > attaches to the exported common functionality across adapters that > > > otherwise have a unique native driver for the hardware device? > > > > The whole aux bus drama is mostly because the intel design for these > > is really fucked up. All the sane HCAs do not use this model. All > > this attchment crap really should not be there. > > I am missing the counter proposal in both Bjorn's and your distaste > for aux bus and PCIe portdrv? For the case of PCIe portdrv, the functionality involved is Power Management Events (PME), Advanced Error Reporting (AER), PCIe native hotplug, Downstream Port Containment (DPC), and Bandwidth Notifications. Currently each has a separate "port service driver" with .probe(), .remove(), .suspend(), .resume(), etc. The services share interrupt vectors. It's quite complicated to set them up, and it has to be done in the portdrv, not in the individual drivers. They also share power state (D0, D3hot, etc). In my mind these are not separate devices from the underlying PCI device, and I don't think splitting the support into "service drivers" made things better. I think it would be simpler if these were just added to pci_init_capabilities() like other optional pieces of PCI functionality. Sysfs looks like this: /sys/devices/pci0000:00/0000:00:1c.0/ # Root Port /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/ # AER "device" /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie010/ # BW notif /sys/bus/pci/devices/0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0/ /sys/bus/pci_express/devices/0000:00:1c.0:pcie002 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/ The "pcie002" names (hex for PCIE_PORT_SERVICE_AER, etc.) are unintelligible. I don't know why we have a separate /sys/bus/pci_express hierarchy. IIUC, CXL devices will be enumerated by the usual PCI enumeration, so there will be a struct pci_dev for them, and they will appear under /sys/devices/pci*/. They will have the usual PCI Power Management, MSI, AER, DPC, and similar Capabilites, so the PCI core will manage them. CXL devices have lots of fancy additional features. Does that merit making a separate struct device and a separate sysfs hierarchy for them? I don't know. > > > Another example, the Native PCIe Enclosure Management (NPEM) > > > specification defines a handful of registers that can appear anywhere > > > in the PCIe hierarchy. How can you write a common driver that is > > > generically applicable to any given NPEM instance? 
> > > > Another totally messed up spec. But then pretty much everything coming > > from the PCIe SIG in terms of interface tends to be really, really > > broken lately. Hotplug is more central to PCI than NPEM is, but NPEM is a little bit like PCIe native hotplug in concept: hotplug has a few registers that control downstream indicators, interlock, and power controller; NPEM has registers that control downstream indicators. Both are prescribed by the PCIe spec and presumably designed to work alongside the usual device-specific drivers for bridges, SSDs, etc. I would at least explore the idea of doing common support by integrating NPEM into the PCI core. There would have to be some hook for the enclosure-specific bits, but I think it's fair for the details of sending commands and polling for command completed to be part of the PCI core. > DVSEC and DOE is more of the same in terms of composing add-on > features into devices. Hardware vendors want to mix multiple hard-IPs > into a single device, aux bus is one response. Topology specific buses > like /sys/bus/cxl are another. VSEC and DVSEC are pretty much wild cards since the PCIe spec says nothing about what registers they may contain or how they should work. DOE *is* specified by PCIe, at least in terms of the data transfer protocol (interrupt usage, read/write mailbox, etc). I think that, and the fact that it's not specific to CXL, means we need some kind of PCI core interface to do the transfers. > This CXL port driver is offering enumeration, link management, and > memory decode setup services to the rest of the topology. I see it as > similar to management protocol services offered by libsas. Bjorn
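For the NPEM idea above, a rough sketch of the pci_init_capabilities() style of integration; pci_npem_init(), the npem_pos field, and the PCI_EXT_CAP_ID_NPEM define are assumptions for illustration, not existing interfaces:

/* Assumed define; 0x29 is the NPEM Extended Capability ID in the PCIe spec. */
#define PCI_EXT_CAP_ID_NPEM	0x29

/* Would be called from pci_init_capabilities() alongside pci_aer_init()/pci_dpc_init(). */
void pci_npem_init(struct pci_dev *dev)
{
	u16 pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_NPEM);

	if (!pos)
		return;

	/*
	 * Record where the capability lives; command submission and the
	 * "command completed" polling would sit in the PCI core, with an
	 * enclosure-specific hook deciding which indicators to drive.
	 */
	dev->npem_pos = pos;	/* assumed new struct pci_dev field */
}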
[ add Stuart for NPEM feedback ] On Thu, Dec 2, 2021 at 1:24 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote: > > On Tue, Nov 23, 2021 at 10:33 PM Christoph Hellwig <hch@lst.de> wrote: > > > On Tue, Nov 23, 2021 at 04:40:06PM -0800, Dan Williams wrote: > > > > Let me ask a clarifying question coming from the other direction that > > > > resulted in the creation of the auxiliary bus architecture. Some > > > > background. RDMA is a protocol that may run on top of Ethernet. > > > > > > No, RDMA is a concept. Linux supports 2 and a half RDMA protocols > > > that run over ethernet (RoCE v1 and v2 and iWarp). > > > > Yes, I was being too coarse, point taken. However, I don't think that > > changes the observation that multiple vendors are using aux bus to > > share a feature driver across multiple base Ethernet drivers. > > > > > > Consider the case where you have multiple generations of Ethernet > > > > adapter devices, but they all support common RDMA functionality. You > > > > only have the one PCI device to attach a unique Ethernet driver. What > > > > is an idiomatic way to deploy a module that automatically loads and > > > > attaches to the exported common functionality across adapters that > > > > otherwise have a unique native driver for the hardware device? > > > > > > The whole aux bus drama is mostly because the intel design for these > > > is really fucked up. All the sane HCAs do not use this model. All > > > this attchment crap really should not be there. > > > > I am missing the counter proposal in both Bjorn's and your distaste > > for aux bus and PCIe portdrv? > > For the case of PCIe portdrv, the functionality involved is Power > Management Events (PME), Advanced Error Reporting (AER), PCIe native > hotplug, Downstream Port Containment (DPC), and Bandwidth > Notifications. > > Currently each has a separate "port service driver" with .probe(), > .remove(), .suspend(), .resume(), etc. > > The services share interrupt vectors. It's quite complicated to set > them up, and it has to be done in the portdrv, not in the individual > drivers. > > They also share power state (D0, D3hot, etc). > > In my mind these are not separate devices from the underlying PCI > device, and I don't think splitting the support into "service drivers" > made things better. I think it would be simpler if these were just > added to pci_init_capabilities() like other optional pieces of PCI > functionality. > > Sysfs looks like this: > > /sys/devices/pci0000:00/0000:00:1c.0/ # Root Port > /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/ # AER "device" > /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie010/ # BW notif > > /sys/bus/pci/devices/0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0/ > /sys/bus/pci_express/devices/0000:00:1c.0:pcie002 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/ > > The "pcie002" names (hex for PCIE_PORT_SERVICE_AER, etc.) are > unintelligible. I don't know why we have a separate > /sys/bus/pci_express hierarchy. > > IIUC, CXL devices will be enumerated by the usual PCI enumeration, so > there will be a struct pci_dev for them, and they will appear under > /sys/devices/pci*/. > > They will have the usual PCI Power Management, MSI, AER, DPC, and > similar Capabilites, so the PCI core will manage them. > > CXL devices have lots of fancy additional features. Does that merit > making a separate struct device and a separate sysfs hierarchy for > them? I don't know. 
Thank you, this framing really helps. So, if I look at this through the lens of the "can this just be handled by pci_init_capabilities()?" guardband, where the capability is device-scoped and shares resources that a driver for the same device would use, then yes, PCIe port services do not merit a separate bus. CXL memory expansion is distinctly not in that category. CXL is a distinct specification (i.e. unlike PCIe port services which are literally in the PCI Sig purview), distinct transport/data layer (effectively a different physical connection on the wire), and is composable in ways that PCIe is not. For example, if you trigger FLR on a PCI device you would expect the entirety of pci_init_capabilities() needs to be saved / restored, CXL state is not affected by FLR. The separate link layer for CXL means that mere device visibility is not sufficient to determine CXL operation. Whereas with a PCI driver if you can see the device you know that the device is ready to be the target of config, io, and mmio cycles, CXL.mem operation may not be available when the device is visible to enumeration. The composability of CXL places the highest demand for an independent 'struct bus_type' and registering CXL devices for their corresponding PCIe devices. The bus is a rendezvous for all the moving pieces needed to validate and provision a CXL memory region that may span multiple endpoints, switches and host bridges. An action like reprogramming memory decode of an endpoint needs to be coordinated across the entire topology. The fact that the PCI core remains blissfully unaware of all these interdependencies is a feature because CXL is effectively a super-set of PCIe for fast-path CXL.mem operation, even though it is derivative for enumeration and slow-path manageability. So I hope you see CXL's need to create some PCIe-symbiotic objects in order to compose CXL objects (like interleaved memory regions) that have no PCIe analog. > > > > Another example, the Native PCIe Enclosure Management (NPEM) > > > > specification defines a handful of registers that can appear anywhere > > > > in the PCIe hierarchy. How can you write a common driver that is > > > > generically applicable to any given NPEM instance? > > > > > > Another totally messed up spec. But then pretty much everything coming > > > from the PCIe SIG in terms of interface tends to be really, really > > > broken lately. > > Hotplug is more central to PCI than NPEM is, but NPEM is a little bit > like PCIe native hotplug in concept: hotplug has a few registers that > control downstream indicators, interlock, and power controller; NPEM > has registers that control downstream indicators. > > Both are prescribed by the PCIe spec and presumably designed to work > alongside the usual device-specific drivers for bridges, SSDs, etc. > > I would at least explore the idea of doing common support by > integrating NPEM into the PCI core. There would have to be some hook > for the enclosure-specific bits, but I think it's fair for the details > of sending commands and polling for command completed to be part of > the PCI core. The problem with NPEM compared to hotplug LED signaling is that it has the strange property that an NPEM higher in the topology might override one lower in the topology. This is to support things like NVME enclosures where the NVME device itself may have an NPEM instance, but that instance is of course not suitable for signaling that the device is not present. 
So, instead the system may be designed to have an NPEM in the upstream switch port and disable the NPEM instances in the device. Platform firmware decides which NPEM is in use. It also has the "fun" property of additionally being overridden by ACPI. Stuart, have a look at collapsing the auxiliary-device approach into pci_init_capabilities() and whether that can still coordinate with the enclosure code. One of the nice properties of the auxiliary organization is well defined module boundaries. Will NPEM in the PCI core now force CONFIG_ENCLOSURE_SERVICES to be built-in? That seems an unwanted side effect to me. > > DVSEC and DOE is more of the same in terms of composing add-on > > features into devices. Hardware vendors want to mix multiple hard-IPs > > into a single device, aux bus is one response. Topology specific buses > > like /sys/bus/cxl are another. > > VSEC and DVSEC are pretty much wild cards since the PCIe spec says > nothing about what registers they may contain or how they should work. > > DOE *is* specified by PCIe, at least in terms of the data transfer > protocol (interrupt usage, read/write mailbox, etc). I think that, > and the fact that it's not specific to CXL, means we need some kind of > PCI core interface to do the transfers. DOE transfer code belongs in drivers/pci/ period, but does it really need to be in drivers/pci/built-in.a for everyone regardless of whether it is being used or not?
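For readers following the auxiliary bus discussion, the attachment model in question looks roughly like the sketch below: a parent PCI driver publishes an optional feature as an auxiliary device, and a separate module binds to it by name. The structures, names and the "doe" feature are hypothetical and only illustrate the pattern; the auxiliary bus calls themselves (auxiliary_device_init(), auxiliary_device_add(), module_auxiliary_driver()) are the stock API from include/linux/auxiliary_bus.h.

#include <linux/auxiliary_bus.h>
#include <linux/module.h>
#include <linux/slab.h>

/* Parent driver side: publish an optional feature as an auxiliary device. */
struct my_feature_adev {
	struct auxiliary_device adev;
	void __iomem *regs;		/* feature registers mapped by the parent */
};

static void my_feature_release(struct device *dev)
{
	struct auxiliary_device *adev = to_auxiliary_dev(dev);

	kfree(container_of(adev, struct my_feature_adev, adev));
}

static int my_publish_feature(struct device *parent, void __iomem *regs)
{
	struct my_feature_adev *fdev;
	int rc;

	fdev = kzalloc(sizeof(*fdev), GFP_KERNEL);
	if (!fdev)
		return -ENOMEM;

	fdev->regs = regs;
	fdev->adev.name = "doe";		/* matched as "<parent modname>.doe" */
	fdev->adev.dev.parent = parent;
	fdev->adev.dev.release = my_feature_release;

	rc = auxiliary_device_init(&fdev->adev);
	if (rc) {
		kfree(fdev);
		return rc;
	}

	rc = auxiliary_device_add(&fdev->adev);
	if (rc)
		auxiliary_device_uninit(&fdev->adev);	/* drops the ref, frees via release */
	return rc;
}

/*
 * Feature driver side: a separate module that binds to "cxl_pci.doe",
 * assuming here that the publishing module is named cxl_pci.
 */
static int my_feature_probe(struct auxiliary_device *adev,
			    const struct auxiliary_device_id *id)
{
	struct my_feature_adev *fdev =
		container_of(adev, struct my_feature_adev, adev);

	return fdev->regs ? 0 : -ENXIO;	/* real code would set up the feature here */
}

static const struct auxiliary_device_id my_feature_id_table[] = {
	{ .name = "cxl_pci.doe" },
	{ }
};
MODULE_DEVICE_TABLE(auxiliary, my_feature_id_table);

static struct auxiliary_driver my_feature_driver = {
	.probe = my_feature_probe,
	.id_table = my_feature_id_table,
};
module_auxiliary_driver(my_feature_driver);
MODULE_LICENSE("GPL");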
On Thu, Dec 02, 2021 at 05:38:17PM -0800, Dan Williams wrote: > [ add Stuart for NPEM feedback ] > > On Thu, Dec 2, 2021 at 1:24 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote: > > > On Tue, Nov 23, 2021 at 10:33 PM Christoph Hellwig <hch@lst.de> wrote: > > > > On Tue, Nov 23, 2021 at 04:40:06PM -0800, Dan Williams wrote: > > > > > Let me ask a clarifying question coming from the other direction that > > > > > resulted in the creation of the auxiliary bus architecture. Some > > > > > background. RDMA is a protocol that may run on top of Ethernet. > > > > > > > > No, RDMA is a concept. Linux supports 2 and a half RDMA protocols > > > > that run over ethernet (RoCE v1 and v2 and iWarp). > > > > > > Yes, I was being too coarse, point taken. However, I don't think that > > > changes the observation that multiple vendors are using aux bus to > > > share a feature driver across multiple base Ethernet drivers. > > > > > > > > Consider the case where you have multiple generations of Ethernet > > > > > adapter devices, but they all support common RDMA functionality. You > > > > > only have the one PCI device to attach a unique Ethernet driver. What > > > > > is an idiomatic way to deploy a module that automatically loads and > > > > > attaches to the exported common functionality across adapters that > > > > > otherwise have a unique native driver for the hardware device? > > > > > > > > The whole aux bus drama is mostly because the intel design for these > > > > is really fucked up. All the sane HCAs do not use this model. All > > > > this attchment crap really should not be there. > > > > > > I am missing the counter proposal in both Bjorn's and your distaste > > > for aux bus and PCIe portdrv? > > > > For the case of PCIe portdrv, the functionality involved is Power > > Management Events (PME), Advanced Error Reporting (AER), PCIe native > > hotplug, Downstream Port Containment (DPC), and Bandwidth > > Notifications. > > > > Currently each has a separate "port service driver" with .probe(), > > .remove(), .suspend(), .resume(), etc. > > > > The services share interrupt vectors. It's quite complicated to set > > them up, and it has to be done in the portdrv, not in the individual > > drivers. > > > > They also share power state (D0, D3hot, etc). > > > > In my mind these are not separate devices from the underlying PCI > > device, and I don't think splitting the support into "service drivers" > > made things better. I think it would be simpler if these were just > > added to pci_init_capabilities() like other optional pieces of PCI > > functionality. > > > > Sysfs looks like this: > > > > /sys/devices/pci0000:00/0000:00:1c.0/ # Root Port > > /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/ # AER "device" > > /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie010/ # BW notif > > > > /sys/bus/pci/devices/0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0/ > > /sys/bus/pci_express/devices/0000:00:1c.0:pcie002 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/ > > > > The "pcie002" names (hex for PCIE_PORT_SERVICE_AER, etc.) are > > unintelligible. I don't know why we have a separate > > /sys/bus/pci_express hierarchy. > > > > IIUC, CXL devices will be enumerated by the usual PCI enumeration, so > > there will be a struct pci_dev for them, and they will appear under > > /sys/devices/pci*/. 
> > > > They will have the usual PCI Power Management, MSI, AER, DPC, and > > similar Capabilites, so the PCI core will manage them. > > > > CXL devices have lots of fancy additional features. Does that merit > > making a separate struct device and a separate sysfs hierarchy for > > them? I don't know. > > Thank you, this framing really helps. > > So, if I look at this through the lens of the "can this just be > handled by pci_init_capabilities()?" guardband, where the capability > is device-scoped and shares resources that a driver for the same > device would use, then yes, PCIe port services do not merit a separate > bus. > > CXL memory expansion is distinctly not in that category. CXL is a > distinct specification (i.e. unlike PCIe port services which are > literally in the PCI Sig purview), distinct transport/data layer > (effectively a different physical connection on the wire), and is > composable in ways that PCIe is not. > > For example, if you trigger FLR on a PCI device you would expect the > entirety of pci_init_capabilities() needs to be saved / restored, CXL > state is not affected by FLR. > > The separate link layer for CXL means that mere device visibility is > not sufficient to determine CXL operation. Whereas with a PCI driver > if you can see the device you know that the device is ready to be the > target of config, io, and mmio cycles, Not strictly true. A PCI device must respond to config transactions within a short time after reset, but it does not respond to IO or MEM transactions until a driver sets PCI_COMMAND_IO or PCI_COMMAND_MEMORY. > ... CXL.mem operation may not be available when the device is > visible to enumeration. > The composability of CXL places the highest demand for an independent > 'struct bus_type' and registering CXL devices for their corresponding > PCIe devices. The bus is a rendezvous for all the moving pieces needed > to validate and provision a CXL memory region that may span multiple > endpoints, switches and host bridges. An action like reprogramming > memory decode of an endpoint needs to be coordinated across the entire > topology. I guess software uses the CXL DVSEC to distinguish a CXL device from a PCIe device, at least for 00.0. Not sure about non-Dev 0 Func 0 devices; maybe this implies other functions in the same device are part of the same CXL device? A CXL device may comprise several PCIe devices, and I think they may have non-CXL Capabilities, so I assume you need a struct pci_dev for each PCIe device? Looks like a single CXL DVSEC controls the entire CXL device (including several PCIe devices), so I assume you want some kind of struct cxl_dev to represent that aggregation? > The fact that the PCI core remains blissfully unaware of all these > interdependencies is a feature because CXL is effectively a super-set > of PCIe for fast-path CXL.mem operation, even though it is derivative > for enumeration and slow-path manageability. I don't know how blissfully unaware PCI can be. Can a user remove a PCIe device that's part of a CXL device via sysfs? Is the PCIe device available for drivers to claim? Do we need coordination between cxl_driver_register() and pci_register_driver()? Does CXL impose new constraints on PCI power management? > So I hope you see CXL's need to create some PCIe-symbiotic objects in > order to compose CXL objects (like interleaved memory regions) that > have no PCIe analog. 
> > Hotplug is more central to PCI than NPEM is, but NPEM is a little bit > > like PCIe native hotplug in concept: hotplug has a few registers that > > control downstream indicators, interlock, and power controller; NPEM > > has registers that control downstream indicators. > > > > Both are prescribed by the PCIe spec and presumably designed to work > > alongside the usual device-specific drivers for bridges, SSDs, etc. > > > > I would at least explore the idea of doing common support by > > integrating NPEM into the PCI core. There would have to be some hook > > for the enclosure-specific bits, but I think it's fair for the details > > of sending commands and polling for command completed to be part of > > the PCI core. > > The problem with NPEM compared to hotplug LED signaling is that it has > the strange property that an NPEM higher in the topology might > override one lower in the topology. This is to support things like > NVME enclosures where the NVME device itself may have an NPEM > instance, but that instance is of course not suitable for signaling > that the device is not present. I would envision the PCI core providing only a bare-bones "send this command and wait for it to complete" sort of interface. Everything else, including deciding which device to use, would go elsewhere. > So, instead the system may be designed to have an NPEM in the > upstream switch port and disable the NPEM instances in the device. > Platform firmware decides which NPEM is in use. Since you didn't mention a firmware interface to communicate which NPEM is in use, I assume firmware does this by preventing other devices from advertising NPEM support? > It also has the "fun" property of additionally being overridden by ACPI. > ... > One of the nice properties of the auxiliary organization is well > defined module boundaries. Will NPEM in the PCI core now force > CONFIG_ENCLOSURE_SERVICES to be built-in? That seems an unwanted side > effect to me. If the PCI core provides only the mechanism, and the smarts of NPEM are in something analogous to drivers/scsi/ses.c, I don't think it would force CONFIG_ENCLOSURE_SERVICES to be built-in. > > DOE *is* specified by PCIe, at least in terms of the data transfer > > protocol (interrupt usage, read/write mailbox, etc). I think that, > > and the fact that it's not specific to CXL, means we need some kind of > > PCI core interface to do the transfers. > > DOE transfer code belongs in drivers/pci/ period, but does it really > need to be in drivers/pci/built-in.a for everyone regardless of > whether it is being used or not? I think my opinion would depend on how small the DOE transfer code could be made and how much it would complicate things to make it a module. And also how we could enforce whatever mutual exclusion we need to make it safe. Bjorn
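A bare-bones helper of the sort proposed above might look like the sketch below. Nothing like this exists in the PCI core today; the function name is invented, and the capability ID, register offsets and bit positions are my reading of the NPEM Extended Capability layout in the PCIe spec, so they should be checked against the spec before being relied on.

#include <linux/delay.h>
#include <linux/pci.h>

/* NPEM Extended Capability layout, as I read the PCIe spec (please verify). */
#define MY_PCI_EXT_CAP_ID_NPEM	0x29
#define MY_NPEM_CAP		0x04	/* NPEM Capability register */
#define MY_NPEM_CTRL		0x08	/* NPEM Control register */
#define MY_NPEM_STATUS		0x0c	/* NPEM Status register */
#define MY_NPEM_CMD_COMPLETED	0x1	/* Status bit 0, write 1 to clear */

/* Write an NPEM Control value and poll for Command Completed. */
static int my_npem_send_command(struct pci_dev *pdev, u32 ctrl)
{
	u32 status;
	int pos, i;

	pos = pci_find_ext_capability(pdev, MY_PCI_EXT_CAP_ID_NPEM);
	if (!pos)
		return -ENODEV;

	pci_write_config_dword(pdev, pos + MY_NPEM_CTRL, ctrl);

	/* The spec allows the device on the order of a second to complete. */
	for (i = 0; i < 100; i++) {
		pci_read_config_dword(pdev, pos + MY_NPEM_STATUS, &status);
		if (status & MY_NPEM_CMD_COMPLETED) {
			pci_write_config_dword(pdev, pos + MY_NPEM_STATUS,
					       MY_NPEM_CMD_COMPLETED);
			return 0;
		}
		msleep(10);
	}

	return -ETIMEDOUT;
}

The enclosure-specific policy (which NPEM instance to use, how slots map to LEDs) would then live outside the PCI core, as suggested.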
On Fri, Dec 3, 2021 at 2:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Thu, Dec 02, 2021 at 05:38:17PM -0800, Dan Williams wrote: > > [ add Stuart for NPEM feedback ] > > > > On Thu, Dec 2, 2021 at 1:24 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote: > > > > On Tue, Nov 23, 2021 at 10:33 PM Christoph Hellwig <hch@lst.de> wrote: > > > > > On Tue, Nov 23, 2021 at 04:40:06PM -0800, Dan Williams wrote: > > > > > > Let me ask a clarifying question coming from the other direction that > > > > > > resulted in the creation of the auxiliary bus architecture. Some > > > > > > background. RDMA is a protocol that may run on top of Ethernet. > > > > > > > > > > No, RDMA is a concept. Linux supports 2 and a half RDMA protocols > > > > > that run over ethernet (RoCE v1 and v2 and iWarp). > > > > > > > > Yes, I was being too coarse, point taken. However, I don't think that > > > > changes the observation that multiple vendors are using aux bus to > > > > share a feature driver across multiple base Ethernet drivers. > > > > > > > > > > Consider the case where you have multiple generations of Ethernet > > > > > > adapter devices, but they all support common RDMA functionality. You > > > > > > only have the one PCI device to attach a unique Ethernet driver. What > > > > > > is an idiomatic way to deploy a module that automatically loads and > > > > > > attaches to the exported common functionality across adapters that > > > > > > otherwise have a unique native driver for the hardware device? > > > > > > > > > > The whole aux bus drama is mostly because the intel design for these > > > > > is really fucked up. All the sane HCAs do not use this model. All > > > > > this attchment crap really should not be there. > > > > > > > > I am missing the counter proposal in both Bjorn's and your distaste > > > > for aux bus and PCIe portdrv? > > > > > > For the case of PCIe portdrv, the functionality involved is Power > > > Management Events (PME), Advanced Error Reporting (AER), PCIe native > > > hotplug, Downstream Port Containment (DPC), and Bandwidth > > > Notifications. > > > > > > Currently each has a separate "port service driver" with .probe(), > > > .remove(), .suspend(), .resume(), etc. > > > > > > The services share interrupt vectors. It's quite complicated to set > > > them up, and it has to be done in the portdrv, not in the individual > > > drivers. > > > > > > They also share power state (D0, D3hot, etc). > > > > > > In my mind these are not separate devices from the underlying PCI > > > device, and I don't think splitting the support into "service drivers" > > > made things better. I think it would be simpler if these were just > > > added to pci_init_capabilities() like other optional pieces of PCI > > > functionality. > > > > > > Sysfs looks like this: > > > > > > /sys/devices/pci0000:00/0000:00:1c.0/ # Root Port > > > /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/ # AER "device" > > > /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie010/ # BW notif > > > > > > /sys/bus/pci/devices/0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0/ > > > /sys/bus/pci_express/devices/0000:00:1c.0:pcie002 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/ > > > > > > The "pcie002" names (hex for PCIE_PORT_SERVICE_AER, etc.) are > > > unintelligible. I don't know why we have a separate > > > /sys/bus/pci_express hierarchy. 
> > > > > > IIUC, CXL devices will be enumerated by the usual PCI enumeration, so > > > there will be a struct pci_dev for them, and they will appear under > > > /sys/devices/pci*/. > > > > > > They will have the usual PCI Power Management, MSI, AER, DPC, and > > > similar Capabilites, so the PCI core will manage them. > > > > > > CXL devices have lots of fancy additional features. Does that merit > > > making a separate struct device and a separate sysfs hierarchy for > > > them? I don't know. > > > > Thank you, this framing really helps. > > > > So, if I look at this through the lens of the "can this just be > > handled by pci_init_capabilities()?" guardband, where the capability > > is device-scoped and shares resources that a driver for the same > > device would use, then yes, PCIe port services do not merit a separate > > bus. > > > > CXL memory expansion is distinctly not in that category. CXL is a > > distinct specification (i.e. unlike PCIe port services which are > > literally in the PCI Sig purview), distinct transport/data layer > > (effectively a different physical connection on the wire), and is > > composable in ways that PCIe is not. > > > > For example, if you trigger FLR on a PCI device you would expect the > > entirety of pci_init_capabilities() needs to be saved / restored, CXL > > state is not affected by FLR. > > > > The separate link layer for CXL means that mere device visibility is > > not sufficient to determine CXL operation. Whereas with a PCI driver > > if you can see the device you know that the device is ready to be the > > target of config, io, and mmio cycles, > > Not strictly true. A PCI device must respond to config transactions > within a short time after reset, but it does not respond to IO or MEM > transactions until a driver sets PCI_COMMAND_IO or PCI_COMMAND_MEMORY. Right, what I was attempting to convey is that it's not like CXL.mem. While flipping a bit on the device turns on PCI.mmio target support, there's additional polling and status checking to be done after the device is enabled to be a target for CXL.mem. I.e. the CXL.mem configuration is done via PCI.mmio* (*for CXL 2.0 devices) and only after the device negotiates a CXL link beyond the base PCIe link. It is also the case that the device may not be ready for CXL.mem until all the devices that compose the same range are available as well. > > > ... CXL.mem operation may not be available when the device is > > visible to enumeration. > > > The composability of CXL places the highest demand for an independent > > 'struct bus_type' and registering CXL devices for their corresponding > > PCIe devices. The bus is a rendezvous for all the moving pieces needed > > to validate and provision a CXL memory region that may span multiple > > endpoints, switches and host bridges. An action like reprogramming > > memory decode of an endpoint needs to be coordinated across the entire > > topology. > > I guess software uses the CXL DVSEC to distinguish a CXL device from a > PCIe device, at least for 00.0. driver/cxl/pci.c attaches to: "PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL << 8 | CXL_MEMORY_PROGIF), ~0)" I am not aware of any restriction for that class code to appear at function0? > Not sure about non-Dev 0 Func 0 > devices; maybe this implies other functions in the same device are > part of the same CXL device? DVSEC is really only telling us the layout of the MMIO register space. A CXL Memory Device (CXL.mem endpoint) implements so-called "device" registers and "component" registers. 
The "device" registers control things that a traditional PCI driver would control like a mailbox interface and device local status registers. The "component" registers are what mediate access to the CXL.mem decode space. It is somewhat confusing because CXL 1.1 devices used the DVSEC configuration registers directly for CXL.mem configuration, but CXL 2.0 ditched that organization. The Linux driver is only targeting CXL 2.0+ devices as platform firmware owns setting up CXL 1.1 memory expanders. As far as Linux is concerned CXL 1.1 looks like DDR i.e. it's just address space set up by the BIOS and populated into EFI and ACPI tables. CXL 1.1 is also implementing a 1:1 device-to-memory-range association. CXL 2.0 allows for N:1 devices-to-memory-range and dynamic configuration of that by the OS. CXL 1.1 platform firmware locks out the OS from making configuration changes. > A CXL device may comprise several PCIe devices, and I think they may > have non-CXL Capabilities, so I assume you need a struct pci_dev for > each PCIe device? > > Looks like a single CXL DVSEC controls the entire CXL device > (including several PCIe devices), so I assume you want some kind of > struct cxl_dev to represent that aggregation? We have 3 classes of a 'cxl_dev' in drivers/cxl: "mem" devices (PCIe endpoints) "port" devices (PCIe Upstream Switch Ports, PCIe Root Ports, ACPI0016 Host Bridge devices, and ACPI0017 CXL Root devices) "region" devices (aggregate devices composed of multiple mem devices). The "mem" driver is tasked with enumerating all "ports" in the path between itself and the ACPI0017 root and validating that a CXL link is up at each hop before enumerating "region" devices. > > > The fact that the PCI core remains blissfully unaware of all these > > interdependencies is a feature because CXL is effectively a super-set > > of PCIe for fast-path CXL.mem operation, even though it is derivative > > for enumeration and slow-path manageability. > > I don't know how blissfully unaware PCI can be. Can a user remove a > PCIe device that's part of a CXL device via sysfs? Yes. If that PCIe device is an endpoint then the corresponding "mem" driver will get a ->remove() event because it is registered as a child of that parent PCIe device. That in turn will tear down the relevant part of the CXL port topology. However, a gap (deliberate design choice?) in the PCI hotplug implementation I currently see is an ability for an endpoint PCI driver to dynamically deny hotplug requests based on the state of the device. pci_ignore_hotplug() seems inadequate for the job of making sure that someone first disables all participation of a "mem" device in all regions before asking for its physical removal. However, if someone force removes a device the driver and CXL subsystem will do the right thing and go fail that memory region for all users. I'm working with other DAX developers on a range based memory_failure() api for this case. > Is the PCIe device available for drivers to claim? drivers/cxl/pci.c *is* a native PCI driver for CXL memory expanders. You might be thinking of CXL accelerator devices, where the CXL.cache and CXL.mem capabilities are incremental capabilities for the accelerator. So, for example, a GPU with CXL.mem capabilities, would be mostly ignored by the drivers/cxl/ stack by default. Only if that device+driver wanted to emulate a generic memory expander and share its memory space with the host might it instantiate mem, port, and region objects. 
Otherwise CXL for accelerators is mostly a transparent capability that the OS may not need to manage. Rather than copy data back and forth between the host a CXL enabled GPU's driver can just assign pointers to its local memory space and trust that it is coherent. That's a gross oversimplification, but I want to convey that there are CXL devices like accelerators that are the responsibility of each accelerator PCI driver, vs CXL memory expander devices which are generic providers of "System RAM" and "Persistent Memory" resources and need the "mem", "port", "region" schema. > Do we need coordination between cxl_driver_register() and pci_register_driver()? They do not collide, and I think this goes back to the concern about whether drivers/cxl/ is scanning for all CXL DVSECs. No, it only cares about the CXL DVSEC in CXL memory expander endpoints with the aforementioned class code, and the CXL DVSEC in every upstream switch port in the ancestry up to the CXL root device (ACPI0017). CXL accelerator drivers will always use pci_register_driver() and can decide to register their DVSEC with the CXL core, or keep it private to themselves. I imagine a GPU accelerator might register a "mem" device if it needs to get CXL.mem decode set up after a hotplug event, but if it's only using CXL.cache, or if its memory space was established by platform firmware then drivers/cxl/ is not involved. > Does CXL impose new constraints on PCI power management? Recall that CXL is built such that it could be supported by a legacy operating system where platform firmware owns the setup of devices, just like DDR memory does not need a driver. This is where CXL 1.1 played until CXL 2.0 added so much configuration complexity (hotplug, interleave, persistent memory) that it started to need OS help. The Linux PCIe PM code will not notice a difference, but behind the scenes the device, if it is offering coherent memory to the CPU, will be coordinating with the CPU like it was part of the CPU package and not a discrete device. I do not see new PM software enabling required in my reading of "Section 10.0 Power Management" of the CXL specification. > > So I hope you see CXL's need to create some PCIe-symbiotic objects in > > order to compose CXL objects (like interleaved memory regions) that > > have no PCIe analog. > > > > Hotplug is more central to PCI than NPEM is, but NPEM is a little bit > > > like PCIe native hotplug in concept: hotplug has a few registers that > > > control downstream indicators, interlock, and power controller; NPEM > > > has registers that control downstream indicators. > > > > > > Both are prescribed by the PCIe spec and presumably designed to work > > > alongside the usual device-specific drivers for bridges, SSDs, etc. > > > > > > I would at least explore the idea of doing common support by > > > integrating NPEM into the PCI core. There would have to be some hook > > > for the enclosure-specific bits, but I think it's fair for the details > > > of sending commands and polling for command completed to be part of > > > the PCI core. > > > > The problem with NPEM compared to hotplug LED signaling is that it has > > the strange property that an NPEM higher in the topology might > > override one lower in the topology. This is to support things like > > NVME enclosures where the NVME device itself may have an NPEM > > instance, but that instance is of course not suitable for signaling > > that the device is not present. 
> > I would envision the PCI core providing only a bare-bones "send this > command and wait for it to complete" sort of interface. Everything > else, including deciding which device to use, would go elsewhere. > > > So, instead the system may be designed to have an NPEM in the > > upstream switch port and disable the NPEM instances in the device. > > Platform firmware decides which NPEM is in use. > > Since you didn't mention a firmware interface to communicate which > NPEM is in use, I assume firmware does this by preventing other > devices from advertising NPEM support? That's also my assumption. If the OS sees a disabled NPEM in the topology it just needs to assume firmware knew what it was doing when it disabled it. I wish NPEM was better specified than "trust firmware to do the right thing via an ambiguous signal". > > > It also has the "fun" property of additionally being overridden by ACPI. > > ... > > One of the nice properties of the auxiliary organization is well > > defined module boundaries. Will NPEM in the PCI core now force > > CONFIG_ENCLOSURE_SERVICES to be built-in? That seems an unwanted side > > effect to me. > > If the PCI core provides only the mechanism, and the smarts of NPEM > are in something analogous to drivers/scsi/ses.c, I don't think it > would force CONFIG_ENCLOSURE_SERVICES to be built-in. What is that dynamic thing that glues CONFIG_ENCLOSURE_SERVICES to the PCI core that also does not require statically linking that glue to every driver that wants to talk to NPEM? I don't mind that the base hardware access mechanism library is in the PCI core, but what does NVME and the CXL memory expander driver register to get NPEM service and associate their block / memory device with an enclosure slot? To me that glue for these one-off ancillary features is what aux-bus is all about, but I'm open to it not being aux-bus in the end. Maybe Stuart has a proposal here? > > > > DOE *is* specified by PCIe, at least in terms of the data transfer > > > protocol (interrupt usage, read/write mailbox, etc). I think that, > > > and the fact that it's not specific to CXL, means we need some kind of > > > PCI core interface to do the transfers. > > > > DOE transfer code belongs in drivers/pci/ period, but does it really > > need to be in drivers/pci/built-in.a for everyone regardless of > > whether it is being used or not? > > I think my opinion would depend on how small the DOE transfer code > could be made and how much it would complicate things to make it a > module. And also how we could enforce whatever mutual exclusion we > need to make it safe. At least for the mutual exclusion aspect I'm thinking of typical request_region() style exclusion where the aux-driver claims the configuration address register range.
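To make the class-code attach point quoted above concrete, the whole ID table for such a driver is a single class match. A sketch, assuming PCI_CLASS_MEMORY_CXL (0x0502) from include/linux/pci_ids.h and taking 0x10 as the CXL 2.0 memory-device programming interface value that drivers/cxl defines for itself; names other than the match macro are illustrative.

#include <linux/module.h>
#include <linux/pci.h>

#define MY_CXL_MEMORY_PROGIF	0x10	/* CXL 2.0 memory device programming interface */

static int my_cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* map register blocks, then register a "mem" child on the CXL bus */
	return 0;
}

/* Class 0x0502, prog-if 0x10, any vendor/device: a CXL 2.0 memory expander. */
static const struct pci_device_id my_cxl_mem_pci_tbl[] = {
	{ PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL << 8 | MY_CXL_MEMORY_PROGIF), ~0) },
	{ /* terminate list */ },
};
MODULE_DEVICE_TABLE(pci, my_cxl_mem_pci_tbl);

static struct pci_driver my_cxl_pci_driver = {
	.name = KBUILD_MODNAME,
	.id_table = my_cxl_mem_pci_tbl,
	.probe = my_cxl_pci_probe,
};
module_pci_driver(my_cxl_pci_driver);
MODULE_LICENSE("GPL v2");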
On Fri, Dec 03, 2021 at 05:24:34PM -0800, Dan Williams wrote: > On Fri, Dec 3, 2021 at 2:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Thu, Dec 02, 2021 at 05:38:17PM -0800, Dan Williams wrote: I'm cutting out a bunch of details, not because they're unimportant, but because I don't know enough yet for them to make sense to me. > > I guess software uses the CXL DVSEC to distinguish a CXL device > > from a PCIe device, at least for 00.0. > > driver/cxl/pci.c attaches to: "PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL > << 8 | CXL_MEMORY_PROGIF), ~0)" > > I am not aware of any restriction for that class code to appear at > function0? Maybe it's not an actual restriction; I'm reading CXL r2.0, sec 8.1.3, where it says: The PCIe configuration space of Device 0, Function 0 shall include the CXL PCI Express Designated Vendor-Specific Extended Capability (DVSEC) as shown in Figure 126. The capability, status and control fields in Device 0, Function 0 DVSEC control the CXL functionality of the entire CXL device. ... Software may use the presence of this DVSEC to differentiate between a CXL device and a PCIe device. As such, a standard PCIe device must not expose this DVSEC. Sections 9.11.5 and 9.12.2 also talk about looking for CXL DVSEC on dev 0, func 0 to identify CXL devices. > > Not sure about non-Dev 0 Func 0 devices; maybe this implies other > > functions in the same device are part of the same CXL device? > > DVSEC is really only telling us the layout of the MMIO register > space. ... And DVSEC also apparently tells us that "this is a CXL device, not just an ordinary PCIe device"? It's not clear to me how you identify other PCIe functions that are also part of the same CXL device. > > > The fact that the PCI core remains blissfully unaware of all these > > > interdependencies is a feature because CXL is effectively a super-set > > > of PCIe for fast-path CXL.mem operation, even though it is derivative > > > for enumeration and slow-path manageability. > > > > I don't know how blissfully unaware PCI can be. Can a user remove a > > PCIe device that's part of a CXL device via sysfs? > > Yes. If that PCIe device is an endpoint then the corresponding "mem" > driver will get a ->remove() event because it is registered as a child > of that parent PCIe device. That in turn will tear down the relevant > part of the CXL port topology. The CXL device may include several PCIe devices. "mem" is a CXL driver that's bound to one of them (?) Is that what you mean by "mem" being a "child of the the parent PCIe device"? > However, a gap (deliberate design choice?) in the PCI hotplug > implementation I currently see is an ability for an endpoint PCI > driver to dynamically deny hotplug requests based on the state of the > device. ... PCI allows surprise remove, so drivers generally can't deny hot unplugs. PCIe *does* provide for an Electromechanical Interlock (see PCI_EXP_SLTCTL_EIC), but we don't currently support it. > > Is the PCIe device available for drivers to claim? > > drivers/cxl/pci.c *is* a native PCI driver for CXL memory expanders. > You might be thinking of CXL accelerator devices, where the CXL.cache > and CXL.mem capabilities are incremental capabilities for the > accelerator. ... No, I'm not nearly sophisticated enough to be thinking of specific types of CXL things :) > > Do we need coordination between cxl_driver_register() and > > pci_register_driver()? > > They do not collide, and I think this goes back to the concern about > whether drivers/cxl/ is scanning for all CXL DVSECs. ... 
Sorry, I don't remember what this concern was, and I don't know why they don't collide. I *would* know that if I knew that the set of things cxl_driver_register() binds to doesn't intersect the set of pci_devs, but it's not clear to me what those things are. The PCI core enumerates devices by doing config reads of the Vendor ID for every possible bus and device number. It allocs a pci_dev for each device it finds, and those are the things pci_register_driver() binds drivers to based on Vendor ID, Device ID, etc. How does CXL enumeration work? Do you envision it being integrated into PCI enumeration? Does it build a list/tree/etc of cxl_devs? cxl_driver_register() associates a driver with something. What exactly is the thing the driver is associated with? A pci_dev? A cxl_dev? Some kind of aggregate CXL device composed of several PCIe devices? I expected cxl_driver.probe() to take a "struct cxl_dev *" or similar, but it takes a "struct device *". I'm trying to apply my knowledge of how other buses work in Linux, but obviously it's not working very well. > > Does CXL impose new constraints on PCI power management? > > Recall that CXL is built such that it could be supported by a legacy > operating system where platform firmware owns the setup of devices, > just like DDR memory does not need a driver. This is where CXL 1.1 > played until CXL 2.0 added so much configuration complexity (hotplug, > interleave, persistent memory) that it started to need OS help. The > Linux PCIe PM code will not notice a difference, but behind the scenes > the device, if it is offering coherent memory to the CPU, will be > coordinating with the CPU like it was part of the CPU package and not > a discrete device. I do not see new PM software enabling required in > my reading of "Section 10.0 Power Management" of the CXL > specification. So if Linux PM decides to suspend a PCIe device that's part of a CXL device and put it in D3hot, this is not a problem for CXL? I guess if a CXL driver binds to the PCIe device, it can control what PM will do. But I thought CXL drivers would bind to a CXL thing, not a PCIe thing. I see lots of mentions of LTR in sec 10, which really scares me because I'm pretty confident that Linux handling of LTR is broken, and I have no idea how to fix it. > > > So, instead the system may be designed to have an NPEM in the > > > upstream switch port and disable the NPEM instances in the device. > > > Platform firmware decides which NPEM is in use. > > > > Since you didn't mention a firmware interface to communicate which > > NPEM is in use, I assume firmware does this by preventing other > > devices from advertising NPEM support? > > That's also my assumption. If the OS sees a disabled NPEM in the > topology it just needs to assume firmware knew what it was doing when > it disabled it. I wish NPEM was better specified than "trust firmware > to do the right thing via an ambiguous signal". If we enumerate a device with a capability that is disabled, we normally don't treat that as a signal that it cannot be enabled. There are lots of enable bits in PCI capabilities, and I don't know of any cases where Linux assumes "Enable == 0" means "don't use this feature." Absent some negotiation like _OSC or some documented restriction, e.g., in the PCI Firmware spec, Linux normally just enables features when it decides to use them. Bjorn
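On the "presence of this DVSEC" language quoted from the spec above, the PCI core already provides pci_find_dvsec_capability() for exactly this kind of walk, so the differentiation test reduces to something like the sketch below. The vendor ID (0x1e98, the CXL consortium's) and DVSEC ID 0 for the "PCIe DVSEC for CXL Device" reflect my reading of CXL 2.0 sec 8.1.3; the helper name is made up.

#include <linux/pci.h>

#define MY_CXL_DVSEC_VENDOR_ID		0x1e98	/* CXL consortium vendor ID */
#define MY_CXL_DVSEC_PCIE_DEVICE	0	/* "PCIe DVSEC for CXL Device" ID */

/*
 * CXL r2.0 sec 8.1.3: the CXL DVSEC is required on Device 0, Function 0,
 * and its presence is what distinguishes a CXL device from a plain PCIe
 * device.
 */
static bool my_is_cxl_device(struct pci_dev *pdev)
{
	if (pdev->devfn != PCI_DEVFN(0, 0))
		return false;

	return pci_find_dvsec_capability(pdev, MY_CXL_DVSEC_VENDOR_ID,
					 MY_CXL_DVSEC_PCIE_DEVICE) != 0;
}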
On Mon, Dec 6, 2021 at 6:56 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Dec 03, 2021 at 05:24:34PM -0800, Dan Williams wrote: > > On Fri, Dec 3, 2021 at 2:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > On Thu, Dec 02, 2021 at 05:38:17PM -0800, Dan Williams wrote: > > I'm cutting out a bunch of details, not because they're unimportant, > but because I don't know enough yet for them to make sense to me. > > > > I guess software uses the CXL DVSEC to distinguish a CXL device > > > from a PCIe device, at least for 00.0. > > > > driver/cxl/pci.c attaches to: "PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL > > << 8 | CXL_MEMORY_PROGIF), ~0)" > > > > I am not aware of any restriction for that class code to appear at > > function0? > > Maybe it's not an actual restriction; I'm reading CXL r2.0, sec 8.1.3, > where it says: > > The PCIe configuration space of Device 0, Function 0 shall include > the CXL PCI Express Designated Vendor-Specific Extended Capability > (DVSEC) as shown in Figure 126. The capability, status and control > fields in Device 0, Function 0 DVSEC control the CXL functionality > of the entire CXL device. > ... > Software may use the presence of this DVSEC to differentiate between > a CXL device and a PCIe device. As such, a standard PCIe device must > not expose this DVSEC. > > Sections 9.11.5 and 9.12.2 also talk about looking for CXL DVSEC on > dev 0, func 0 to identify CXL devices. Ah, I did not internalize that when reading because if the DVSEC shows up on other functions on the same device it should "just work" as far as Linux is concerned, but I guess it simplifies implementations to constrain where the capability appears. > > > Not sure about non-Dev 0 Func 0 devices; maybe this implies other > > > functions in the same device are part of the same CXL device? > > > > DVSEC is really only telling us the layout of the MMIO register > > space. ... > > And DVSEC also apparently tells us that "this is a CXL device, not > just an ordinary PCIe device"? It's not clear to me how you identify > other PCIe functions that are also part of the same CXL device. I have not encountered any flows where the driver would care if it was enabling another instance of the CXL DVSEC on the same device. > > > > > The fact that the PCI core remains blissfully unaware of all these > > > > interdependencies is a feature because CXL is effectively a super-set > > > > of PCIe for fast-path CXL.mem operation, even though it is derivative > > > > for enumeration and slow-path manageability. > > > > > > I don't know how blissfully unaware PCI can be. Can a user remove a > > > PCIe device that's part of a CXL device via sysfs? > > > > Yes. If that PCIe device is an endpoint then the corresponding "mem" > > driver will get a ->remove() event because it is registered as a child > > of that parent PCIe device. That in turn will tear down the relevant > > part of the CXL port topology. > > The CXL device may include several PCIe devices. "mem" is a CXL > driver that's bound to one of them (?) Is that what you mean by "mem" > being a "child of the the parent PCIe device"? Right, cxl_pci_probe() does device_add() of a "mem" device on the cxl_bus_type, and cxl_pci_remove() does device_del() of that same child device. Just like xhci_pci_probe() arranges for device_add() of a child USB device on the usb_dev_type. > > > However, a gap (deliberate design choice?) 
in the PCI hotplug > > implementation I currently see is an ability for an endpoint PCI > > driver to dynamically deny hotplug requests based on the state of the > > device. ... > > PCI allows surprise remove, so drivers generally can't deny hot > unplugs. PCIe *does* provide for an Electromechanical Interlock (see > PCI_EXP_SLTCTL_EIC), but we don't currently support it. Ok, I guess people will just need to be careful then. > > > Is the PCIe device available for drivers to claim? > > > > drivers/cxl/pci.c *is* a native PCI driver for CXL memory expanders. > > You might be thinking of CXL accelerator devices, where the CXL.cache > > and CXL.mem capabilities are incremental capabilities for the > > accelerator. ... > > No, I'm not nearly sophisticated enough to be thinking of specific > types of CXL things :) Ah, apologies I was reading too much into your line of questions. > > > > Do we need coordination between cxl_driver_register() and > > > pci_register_driver()? > > > > They do not collide, and I think this goes back to the concern about > > whether drivers/cxl/ is scanning for all CXL DVSECs. ... > > Sorry, I don't remember what this concern was, and I don't know why > they don't collide. I *would* know that if I knew that the set of > things cxl_driver_register() binds to doesn't intersect the set of > pci_devs, but it's not clear to me what those things are. cxl_driver_register() registers a driver on the cxl_bus_type, it can not bind to devices on the pci_bus_type. > The PCI core enumerates devices by doing config reads of the Vendor ID > for every possible bus and device number. It allocs a pci_dev for > each device it finds, and those are the things pci_register_driver() > binds drivers to based on Vendor ID, Device ID, etc. > > How does CXL enumeration work? Do you envision it being integrated > into PCI enumeration? Does it build a list/tree/etc of cxl_devs? The cxl_pci driver, native PCI driver, attaches to endpoints that emit the CXL Memory Device class code. That driver walks up the PCI topology and adds a cxl_port device for each CXL DVSEC it finds in each PCIE Upstream Port up to the hostbridge. Additionally there is a cxl_acpi driver that enumerates platform CXL root resources and is called the 'root' cxl_port. Once root-to-endpoint connectivity is established then the endpoint is considered ready for CXL.mem operation. > cxl_driver_register() associates a driver with something. What > exactly is the thing the driver is associated with? A pci_dev? A > cxl_dev? Some kind of aggregate CXL device composed of several PCIe > devices? The aggregate device is a cxl_region composed of 1 or more endpoint (registered by cxl_pci_probe()) devices. Like a RAID array it aggregates multiple devices to produce a new memory range. > I expected cxl_driver.probe() to take a "struct cxl_dev *" or similar, > but it takes a "struct device *". I'm trying to apply my knowledge of > how other buses work in Linux, but obviously it's not working very > well. There's no common attributes between "mem" "port" and "region" devices so the drivers just do container_of() on the device and skip defining a generic 'struct cxl_dev'. > > > Does CXL impose new constraints on PCI power management? > > > > Recall that CXL is built such that it could be supported by a legacy > > operating system where platform firmware owns the setup of devices, > > just like DDR memory does not need a driver. 
This is where CXL 1.1 > > played until CXL 2.0 added so much configuration complexity (hotplug, > > interleave, persistent memory) that it started to need OS help. The > > Linux PCIe PM code will not notice a difference, but behind the scenes > > the device, if it is offering coherent memory to the CPU, will be > > coordinating with the CPU like it was part of the CPU package and not > > a discrete device. I do not see new PM software enabling required in > > my reading of "Section 10.0 Power Management" of the CXL > > specification. > > So if Linux PM decides to suspend a PCIe device that's part of a CXL > device and put it in D3hot, this is not a problem for CXL? I guess if > a CXL driver binds to the PCIe device, it can control what PM will do. > But I thought CXL drivers would bind to a CXL thing, not a PCIe thing. Recall that this starts with a native PCI driver for endpoints, that driver can coordinate PM for its children on the CXL bus if it is needed. > I see lots of mentions of LTR in sec 10, which really scares me > because I'm pretty confident that Linux handling of LTR is broken, and > I have no idea how to fix it. Ok, I will keep that in mind. > > > > So, instead the system may be designed to have an NPEM in the > > > > upstream switch port and disable the NPEM instances in the device. > > > > Platform firmware decides which NPEM is in use. > > > > > > Since you didn't mention a firmware interface to communicate which > > > NPEM is in use, I assume firmware does this by preventing other > > > devices from advertising NPEM support? > > > > That's also my assumption. If the OS sees a disabled NPEM in the > > topology it just needs to assume firmware knew what it was doing when > > it disabled it. I wish NPEM was better specified than "trust firmware > > to do the right thing via an ambiguous signal". > > If we enumerate a device with a capability that is disabled, we > normally don't treat that as a signal that it cannot be enabled. > There are lots of enable bits in PCI capabilities, and I don't know of > any cases where Linux assumes "Enable == 0" means "don't use this > feature." Absent some negotiation like _OSC or some documented > restriction, e.g., in the PCI Firmware spec, Linux normally just > enables features when it decides to use them. If the LEDs are not routed to a given NPEM instance, no amount of enabling can get that instance operational.
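The parent/child relationship described above (cxl_pci_probe() adding a "mem" child on the CXL bus, analogous to xhci_pci_probe() adding a USB child) is ordinary driver-model code. A simplified sketch, not the actual drivers/cxl implementation: the struct, the helper names and the fixed device name are placeholders, and real code would also handle id allocation, cdev setup and locking.

#include <linux/device.h>
#include <linux/err.h>
#include <linux/pci.h>
#include <linux/slab.h>

extern struct bus_type cxl_bus_type;	/* exported by the CXL core */

static const struct device_type my_cxl_memdev_type = {
	.name = "cxl_memdev",
};

struct my_cxl_memdev {
	struct device dev;
	/* mailbox state, register mappings, etc. would live here */
};

static void my_cxl_memdev_release(struct device *dev)
{
	kfree(container_of(dev, struct my_cxl_memdev, dev));
}

/* Called from the PCI driver's probe(): add a "mem" child on the CXL bus. */
static struct my_cxl_memdev *my_cxl_add_memdev(struct pci_dev *pdev)
{
	struct my_cxl_memdev *cxlmd;
	int rc;

	cxlmd = kzalloc(sizeof(*cxlmd), GFP_KERNEL);
	if (!cxlmd)
		return ERR_PTR(-ENOMEM);

	device_initialize(&cxlmd->dev);
	cxlmd->dev.parent = &pdev->dev;	/* child of the pci_dev, as described above */
	cxlmd->dev.bus = &cxl_bus_type;	/* cxl_driver_register() drivers bind here */
	cxlmd->dev.type = &my_cxl_memdev_type;
	cxlmd->dev.release = my_cxl_memdev_release;
	dev_set_name(&cxlmd->dev, "mem%d", 0);	/* real code allocates an id */

	rc = device_add(&cxlmd->dev);
	if (rc) {
		put_device(&cxlmd->dev);
		return ERR_PTR(rc);
	}

	return cxlmd;
}

The matching cxl_driver then binds on the CXL bus by device type id, as the cxl_port_driver registration at the bottom of port.c in the patch below shows.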
diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst index 3b8f41395f6b..fbf0393cdddc 100644 --- a/Documentation/driver-api/cxl/memory-devices.rst +++ b/Documentation/driver-api/cxl/memory-devices.rst @@ -28,6 +28,11 @@ CXL Memory Device .. kernel-doc:: drivers/cxl/pci.c :internal: +CXL Port +-------- +.. kernel-doc:: drivers/cxl/port.c + :doc: cxl port + CXL Core -------- .. kernel-doc:: drivers/cxl/cxl.h diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig index ef05e96f8f97..3aeb33bba5a3 100644 --- a/drivers/cxl/Kconfig +++ b/drivers/cxl/Kconfig @@ -77,4 +77,26 @@ config CXL_PMEM provisioning the persistent memory capacity of CXL memory expanders. If unsure say 'm'. + +config CXL_MEM + tristate "CXL.mem: Memory Devices" + select CXL_PORT + depends on CXL_PCI + default CXL_BUS + help + The CXL.mem protocol allows a device to act as a provider of "System + RAM" and/or "Persistent Memory" that is fully coherent as if the + memory was attached to the typical CPU memory controller. This is + known as HDM "Host-managed Device Memory". + + Say 'y/m' to enable a driver that will attach to CXL.mem devices for + memory expansion and control of HDM. See Chapter 9.13 in the CXL 2.0 + specification for a detailed description of HDM. + + If unsure say 'm'. + + +config CXL_PORT + tristate + endif diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile index cf07ae6cea17..56fcac2323cb 100644 --- a/drivers/cxl/Makefile +++ b/drivers/cxl/Makefile @@ -3,7 +3,9 @@ obj-$(CONFIG_CXL_BUS) += core/ obj-$(CONFIG_CXL_PCI) += cxl_pci.o obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o +obj-$(CONFIG_CXL_PORT) += cxl_port.o cxl_pci-y := pci.o cxl_acpi-y := acpi.o cxl_pmem-y := pmem.o +cxl_port-y := port.o diff --git a/drivers/cxl/core/bus.c b/drivers/cxl/core/bus.c index 9e0d7d5d9298..46a06cfe79bd 100644 --- a/drivers/cxl/core/bus.c +++ b/drivers/cxl/core/bus.c @@ -31,6 +31,8 @@ static DECLARE_RWSEM(root_port_sem); static struct device *cxl_topology_host; +static bool is_cxl_decoder(struct device *dev); + int cxl_register_topology_host(struct device *host) { down_write(&topology_host_sem); @@ -75,6 +77,45 @@ static void put_cxl_topology_host(struct device *dev) up_read(&topology_host_sem); } +static int decoder_match(struct device *dev, void *data) +{ + struct resource *theirs = (struct resource *)data; + struct cxl_decoder *cxld; + + if (!is_cxl_decoder(dev)) + return 0; + + cxld = to_cxl_decoder(dev); + if (theirs->start <= cxld->decoder_range.start && + theirs->end >= cxld->decoder_range.end) + return 1; + + return 0; +} + +static struct cxl_decoder *cxl_find_root_decoder(resource_size_t base, + resource_size_t size) +{ + struct cxl_decoder *cxld = NULL; + struct device *cxldd; + struct device *host; + struct resource res = (struct resource){ + .start = base, + .end = base + size - 1, + }; + + host = get_cxl_topology_host(); + if (!host) + return NULL; + + cxldd = device_find_child(host, &res, decoder_match); + if (cxldd) + cxld = to_cxl_decoder(cxldd); + + put_cxl_topology_host(host); + return cxld; +} + static ssize_t devtype_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -280,6 +321,11 @@ bool is_root_decoder(struct device *dev) } EXPORT_SYMBOL_NS_GPL(is_root_decoder, CXL); +static bool is_cxl_decoder(struct device *dev) +{ + return dev->type->release == cxl_decoder_release; +} + struct cxl_decoder *to_cxl_decoder(struct device *dev) { if (dev_WARN_ONCE(dev, dev->type->release != cxl_decoder_release, 
@@ -327,6 +373,7 @@ struct cxl_port *to_cxl_port(struct device *dev) return NULL; return container_of(dev, struct cxl_port, dev); } +EXPORT_SYMBOL_NS_GPL(to_cxl_port, CXL); struct cxl_dport *cxl_get_root_dport(struct device *dev) { @@ -735,6 +782,24 @@ EXPORT_SYMBOL_NS_GPL(cxl_decoder_add, CXL); static void cxld_unregister(void *dev) { + struct cxl_decoder *plat_decoder, *cxld = to_cxl_decoder(dev); + resource_size_t base, size; + + if (is_root_decoder(dev)) { + device_unregister(dev); + return; + } + + base = cxld->decoder_range.start; + size = range_len(&cxld->decoder_range); + + if (size) { + plat_decoder = cxl_find_root_decoder(base, size); + if (plat_decoder) + __release_region(&plat_decoder->platform_res, base, + size); + } + device_unregister(dev); } @@ -789,6 +854,8 @@ static int cxl_device_id(struct device *dev) return CXL_DEVICE_NVDIMM_BRIDGE; if (dev->type == &cxl_nvdimm_type) return CXL_DEVICE_NVDIMM; + if (dev->type == &cxl_port_type) + return CXL_DEVICE_PORT; return 0; } diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c index 41a0245867ea..f191b0c995a7 100644 --- a/drivers/cxl/core/regs.c +++ b/drivers/cxl/core/regs.c @@ -159,9 +159,8 @@ void cxl_probe_device_regs(struct device *dev, void __iomem *base, } EXPORT_SYMBOL_NS_GPL(cxl_probe_device_regs, CXL); -static void __iomem *devm_cxl_iomap_block(struct device *dev, - resource_size_t addr, - resource_size_t length) +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, + resource_size_t length) { void __iomem *ret_val; struct resource *res; @@ -180,6 +179,7 @@ static void __iomem *devm_cxl_iomap_block(struct device *dev, return ret_val; } +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); int cxl_map_component_regs(struct pci_dev *pdev, struct cxl_component_regs *regs, diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 3962a5e6a950..24fa16157d5e 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -17,6 +17,9 @@ * (port-driver, region-driver, nvdimm object-drivers... etc). 
*/ +/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */ +#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K + /* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/ #define CXL_CM_OFFSET 0x1000 #define CXL_CM_CAP_HDR_OFFSET 0x0 @@ -36,11 +39,22 @@ #define CXL_HDM_DECODER_CAP_OFFSET 0x0 #define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0) #define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4) -#define CXL_HDM_DECODER0_BASE_LOW_OFFSET 0x10 -#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET 0x14 -#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET 0x18 -#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET 0x1c -#define CXL_HDM_DECODER0_CTRL_OFFSET 0x20 +#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8) +#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9) +#define CXL_HDM_DECODER_CTRL_OFFSET 0x4 +#define CXL_HDM_DECODER_ENABLE BIT(1) +#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10) +#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14) +#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18) +#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c) +#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20) +#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0) +#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4) +#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9) +#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10) +#define CXL_HDM_DECODER0_CTRL_TYPE BIT(12) +#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24) +#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28) static inline int cxl_hdm_decoder_count(u32 cap_hdr) { @@ -148,6 +162,8 @@ int cxl_map_device_regs(struct pci_dev *pdev, enum cxl_regloc_type; int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, struct cxl_register_map *map); +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, + resource_size_t length); #define CXL_RESOURCE_NONE ((resource_size_t) -1) #define CXL_TARGET_STRLEN 20 @@ -165,7 +181,8 @@ void cxl_unregister_topology_host(struct device *host); #define CXL_DECODER_F_TYPE2 BIT(2) #define CXL_DECODER_F_TYPE3 BIT(3) #define CXL_DECODER_F_LOCK BIT(4) -#define CXL_DECODER_F_MASK GENMASK(4, 0) +#define CXL_DECODER_F_ENABLE BIT(5) +#define CXL_DECODER_F_MASK GENMASK(5, 0) enum cxl_decoder_type { CXL_DECODER_ACCELERATOR = 2, @@ -255,6 +272,8 @@ struct cxl_walk_context { * @dports: cxl_dport instances referenced by decoders * @decoder_ida: allocator for decoder ids * @component_reg_phys: component register capability base address (optional) + * @rescan_work: worker object for bus rescans after port additions + * @data: opaque data with driver specific usage */ struct cxl_port { struct device dev; @@ -263,6 +282,8 @@ struct cxl_port { struct list_head dports; struct ida decoder_ida; resource_size_t component_reg_phys; + struct work_struct rescan_work; + void *data; }; /** @@ -325,6 +346,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv); #define CXL_DEVICE_NVDIMM_BRIDGE 1 #define CXL_DEVICE_NVDIMM 2 +#define CXL_DEVICE_PORT 3 #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*") #define CXL_MODALIAS_FMT "cxl:t%d" diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c new file mode 100644 index 000000000000..3c03131517af --- /dev/null +++ b/drivers/cxl/port.c @@ -0,0 +1,323 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright(c) 2021 Intel Corporation. All rights reserved. 
*/ +#include <linux/device.h> +#include <linux/module.h> +#include <linux/slab.h> + +#include "cxlmem.h" + +/** + * DOC: cxl port + * + * The port driver implements the set of functionality needed to allow full + * decoder enumeration and routing. A CXL port is an abstraction of a CXL + * component that implements some amount of CXL decoding of CXL.mem traffic. + * As of the CXL 2.0 spec, this includes: + * + * .. list-table:: CXL Components w/ Ports + * :widths: 25 25 50 + * :header-rows: 1 + * + * * - component + * - upstream + * - downstream + * * - Hostbridge + * - ACPI0016 + * - root port + * * - Switch + * - Switch Upstream Port + * - Switch Downstream Port + * * - Endpoint + * - Endpoint Port + * - N/A + * + * The primary service this driver provides is enumerating HDM decoders and + * presenting APIs to other drivers to utilize the decoders. + */ + +static struct workqueue_struct *cxl_port_wq; + +struct cxl_port_data { + struct cxl_component_regs regs; + + struct port_caps { + unsigned int count; + unsigned int tc; + unsigned int interleave11_8; + unsigned int interleave14_12; + } caps; +}; + +static inline int to_interleave_granularity(u32 ctrl) +{ + int val = FIELD_GET(CXL_HDM_DECODER0_CTRL_IG_MASK, ctrl); + + return 256 << val; +} + +static inline int to_interleave_ways(u32 ctrl) +{ + int val = FIELD_GET(CXL_HDM_DECODER0_CTRL_IW_MASK, ctrl); + + return 1 << val; +} + +static void get_caps(struct cxl_port *port, struct cxl_port_data *cpd) +{ + void __iomem *hdm_decoder = cpd->regs.hdm_decoder; + struct port_caps *caps = &cpd->caps; + u32 hdm_cap; + + hdm_cap = readl(hdm_decoder + CXL_HDM_DECODER_CAP_OFFSET); + + caps->count = cxl_hdm_decoder_count(hdm_cap); + caps->tc = FIELD_GET(CXL_HDM_DECODER_TARGET_COUNT_MASK, hdm_cap); + caps->interleave11_8 = + FIELD_GET(CXL_HDM_DECODER_INTERLEAVE_11_8, hdm_cap); + caps->interleave14_12 = + FIELD_GET(CXL_HDM_DECODER_INTERLEAVE_14_12, hdm_cap); +} + +static int map_regs(struct cxl_port *port, void __iomem *crb, + struct cxl_port_data *cpd) +{ + struct cxl_register_map map; + struct cxl_component_reg_map *comp_map = &map.component_map; + + cxl_probe_component_regs(&port->dev, crb, comp_map); + if (!comp_map->hdm_decoder.valid) { + dev_err(&port->dev, "HDM decoder registers invalid\n"); + return -ENXIO; + } + + cpd->regs.hdm_decoder = crb + comp_map->hdm_decoder.offset; + + return 0; +} + +static u64 get_decoder_size(void __iomem *hdm_decoder, int n) +{ + u32 ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(n)); + + if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED) + return 0; + + return ioread64_hi_lo(hdm_decoder + + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(n)); +} + +static bool is_endpoint_port(struct cxl_port *port) +{ + /* Endpoints can't be ports... yet! 
*/ + return false; +} + +static void rescan_ports(struct work_struct *work) +{ + if (bus_rescan_devices(&cxl_bus_type)) + pr_warn("Failed to rescan\n"); +} + +/* Minor layering violation */ +static int dvsec_range_used(struct cxl_port *port) +{ + struct cxl_endpoint_dvsec_info *info; + int i, ret = 0; + + if (!is_endpoint_port(port)) + return 0; + + info = port->data; + for (i = 0; i < info->ranges; i++) + if (info->range[i].size) + ret |= 1 << i; + + return ret; +} + +static int enumerate_hdm_decoders(struct cxl_port *port, + struct cxl_port_data *portdata) +{ + void __iomem *hdm_decoder = portdata->regs.hdm_decoder; + bool global_enable; + u32 global_ctrl; + int i = 0; + + global_ctrl = readl(hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET); + global_enable = global_ctrl & CXL_HDM_DECODER_ENABLE; + if (!global_enable) { + i = dvsec_range_used(port); + if (i) { + dev_err(&port->dev, + "Couldn't add port because device is using DVSEC range registers\n"); + return -EBUSY; + } + } + + for (i = 0; i < portdata->caps.count; i++) { + int rc, target_count = portdata->caps.tc; + struct cxl_decoder *cxld; + int *target_map = NULL; + u64 size; + + if (is_endpoint_port(port)) + target_count = 0; + + cxld = cxl_decoder_alloc(port, target_count); + if (IS_ERR(cxld)) { + dev_warn(&port->dev, + "Failed to allocate the decoder\n"); + return PTR_ERR(cxld); + } + + cxld->target_type = CXL_DECODER_EXPANDER; + cxld->interleave_ways = 1; + cxld->interleave_granularity = 0; + + size = get_decoder_size(hdm_decoder, i); + if (size != 0) { +#define decoderN(reg, n) hdm_decoder + CXL_HDM_DECODER0_##reg(n) + int temp[CXL_DECODER_MAX_INTERLEAVE]; + u64 base; + u32 ctrl; + int j; + union { + u64 value; + char target_id[8]; + } target_list; + + target_map = temp; + ctrl = readl(decoderN(CTRL_OFFSET, i)); + base = ioread64_hi_lo(decoderN(BASE_LOW_OFFSET, i)); + + cxld->decoder_range = (struct range){ + .start = base, + .end = base + size - 1 + }; + + cxld->flags = CXL_DECODER_F_ENABLE; + cxld->interleave_ways = to_interleave_ways(ctrl); + cxld->interleave_granularity = + to_interleave_granularity(ctrl); + + if (FIELD_GET(CXL_HDM_DECODER0_CTRL_TYPE, ctrl) == 0) + cxld->target_type = CXL_DECODER_ACCELERATOR; + + target_list.value = ioread64_hi_lo(decoderN(TL_LOW, i)); + for (j = 0; j < cxld->interleave_ways; j++) + target_map[j] = target_list.target_id[j]; +#undef decoderN + } + + rc = cxl_decoder_add_locked(cxld, target_map); + if (rc) + put_device(&cxld->dev); + else + rc = cxl_decoder_autoremove(&port->dev, cxld); + if (rc) + dev_err(&port->dev, "Failed to add decoder\n"); + else + dev_dbg(&cxld->dev, "Added to port %s\n", + dev_name(&port->dev)); + } + + /* + * Turn on global enable now since DVSEC ranges aren't being used and + * we'll eventually want the decoder enabled. 
+	 */
+	if (!global_enable) {
+		dev_dbg(&port->dev, "Enabling HDM decode\n");
+		writel(global_ctrl | CXL_HDM_DECODER_ENABLE,
+		       hdm_decoder + CXL_HDM_DECODER_CTRL_OFFSET);
+	}
+
+	return 0;
+}
+
+static int cxl_port_probe(struct device *dev)
+{
+	struct cxl_port *port = to_cxl_port(dev);
+	struct cxl_port_data *portdata;
+	void __iomem *crb;
+	int rc;
+
+	if (port->component_reg_phys == CXL_RESOURCE_NONE)
+		return 0;
+
+	portdata = devm_kzalloc(dev, sizeof(*portdata), GFP_KERNEL);
+	if (!portdata)
+		return -ENOMEM;
+
+	crb = devm_cxl_iomap_block(&port->dev, port->component_reg_phys,
+				   CXL_COMPONENT_REG_BLOCK_SIZE);
+	if (!crb) {
+		dev_err(&port->dev, "No component registers mapped\n");
+		return -ENXIO;
+	}
+
+	rc = map_regs(port, crb, portdata);
+	if (rc)
+		return rc;
+
+	get_caps(port, portdata);
+	if (portdata->caps.count == 0) {
+		dev_err(&port->dev, "Spec violation. Caps invalid\n");
+		return -ENXIO;
+	}
+
+	rc = enumerate_hdm_decoders(port, portdata);
+	if (rc) {
+		dev_err(&port->dev, "Couldn't enumerate decoders (%d)\n", rc);
+		return rc;
+	}
+
+	/*
+	 * Bus rescan is done in a workqueue so that we can do so with the
+	 * device lock dropped.
+	 *
+	 * Why do we need to rescan? There is a race between cxl_acpi and
+	 * cxl_mem (which depends on cxl_pci). cxl_mem will only create a port
+	 * if it can establish a path up to a root port, which is enumerated by
+	 * a platform specific driver (i.e. cxl_acpi) and bound by this driver.
+	 * While cxl_acpi could do the rescan, it makes sense to do it here as
+	 * other platform drivers might require the same functionality.
+	 */
+	INIT_WORK(&port->rescan_work, rescan_ports);
+	queue_work(cxl_port_wq, &port->rescan_work);
+
+	return 0;
+}
+
+static struct cxl_driver cxl_port_driver = {
+	.name = "cxl_port",
+	.probe = cxl_port_probe,
+	.id = CXL_DEVICE_PORT,
+};
+
+static __init int cxl_port_init(void)
+{
+	int rc;
+
+	cxl_port_wq = alloc_ordered_workqueue("cxl_port", 0);
+	if (!cxl_port_wq)
+		return -ENOMEM;
+
+	rc = cxl_driver_register(&cxl_port_driver);
+	if (rc) {
+		destroy_workqueue(cxl_port_wq);
+		return rc;
+	}
+
+	return 0;
+}
+
+static __exit void cxl_port_exit(void)
+{
+	/* Unbind ports before flushing and destroying the rescan workqueue */
+	cxl_driver_unregister(&cxl_port_driver);
+	destroy_workqueue(cxl_port_wq);
+}
+
+module_init(cxl_port_init);
+module_exit(cxl_port_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_IMPORT_NS(CXL);
+MODULE_ALIAS_CXL(CXL_DEVICE_PORT);
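
For reviewers unfamiliar with the HDM decoder CTRL encodings, the driver above
derives the interleave parameters as 256 << IG and 1 << IW. The following is a
minimal, self-contained sketch of that math; it is not part of the patch, the
function names are invented, and the field values are hypothetical, as if
already extracted from a decoder's CTRL register with FIELD_GET():

	#include <stdio.h>

	/* Granularity encoding: 0 -> 256B, 1 -> 512B, 2 -> 1024B, ... */
	static unsigned int granularity_bytes(unsigned int ig_field)
	{
		return 256u << ig_field;
	}

	/* Ways encoding: 0 -> x1, 1 -> x2, 2 -> x4, ... */
	static unsigned int interleave_ways(unsigned int iw_field)
	{
		return 1u << iw_field;
	}

	int main(void)
	{
		unsigned int ig = 2, iw = 1;	/* hypothetical field values */

		printf("granularity=%uB ways=%u\n",
		       granularity_bytes(ig), interleave_ways(iw));
		return 0;
	}

With the hypothetical fields IG=2 and IW=1 this prints a 1024B granularity and
a 2-way interleave, which is what to_interleave_granularity() and
to_interleave_ways() would report for the same CTRL value.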
The CXL port driver is responsible for managing the decoder resources
contained within the port. It also provides APIs that other drivers
consume for managing these resources.

There are 4 types of ports in a system:

1. Platform port. This is a non-programmable entity. Such a port is
   named rootX. It is enumerated by cxl_acpi in an ACPI based system.

2. Hostbridge port. This port's register access is defined in a
   platform specific way (CHBS for ACPI platforms). It has n downstream
   ports, each of which is known as a CXL 2.0 root port. Once the
   platform specific mechanism to get the offset to the registers is
   obtained, it operates just like other CXL components. The
   enumeration of this component is started by cxl_acpi and completed
   by cxl_port.

3. Switch port. A switch port is similar to a hostbridge port except
   register access is defined in the CXL specification in a platform
   agnostic way. The downstream ports for a switch are simply known as
   downstream ports. The enumeration of these is entirely contained
   within cxl_port.

4. Endpoint port. Endpoint ports are similar to switch ports with the
   exception that they have no downstream ports, only the underlying
   media on the device. The enumeration of these is started by cxl_pci
   and completed by cxl_port.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
Changes since RFCv2:
- Reword commit message tense (Dan)
- Reword commit message
- Drop SOFTDEP since it's not needed yet (Dan)
- Add CONFIG_CXL_PORT (Dan)
- s/CXL_DECODER_F_EN/CXL_DECODER_F_ENABLE (Dan)
- rename cxl_hdm_decoder_ functions to "to_" (Dan)
- remove useless inline (Dan)
- Check endpoint decoder based on dport list instead of driver id (Dan)
- Use range instead of resource per dependent patch change
- Use clever union packing for target list (Dan)
- Only check NULL from devm_cxl_iomap_block (Dan)
- Properly parent the created cxl decoders
- Move bus rescanning from cxl_acpi to here (Dan)
- Remove references to "CFMWS" in core (Dan)
- Use macro to help keep within 80 character lines
---
 .../driver-api/cxl/memory-devices.rst |   5 +
 drivers/cxl/Kconfig                   |  22 ++
 drivers/cxl/Makefile                  |   2 +
 drivers/cxl/core/bus.c                |  67 ++++
 drivers/cxl/core/regs.c               |   6 +-
 drivers/cxl/cxl.h                     |  34 +-
 drivers/cxl/port.c                    | 323 ++++++++++++++++++++++++++
 7 files changed, 450 insertions(+), 9 deletions(-)
 create mode 100644 drivers/cxl/port.c
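
A note on the "clever union packing for target list" item in the changelog:
the decoder's target list packs one 8-bit downstream port id per interleave
way into a 64-bit register value, and the union lets the driver index those
bytes directly instead of shifting and masking per way. Below is a
self-contained illustration of the idea; it is not part of the patch, the
example value and way count are invented, and the byte indexing assumes a
little-endian host, as the union access in the driver effectively does:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		union {
			uint64_t value;
			uint8_t target_id[8];	/* byte 0 is way 0 */
		} tl = {
			/* hypothetical: way0 -> port 3, way1 -> port 7 */
			.value = 0x0000000000000703ULL,
		};
		unsigned int ways = 2, i;

		for (i = 0; i < ways; i++)
			printf("way %u -> downstream port %u\n",
			       i, (unsigned int)tl.target_id[i]);
		return 0;
	}

Reading the register once with ioread64_hi_lo() and then indexing target_id[]
keeps the per-way mapping obvious while avoiding per-way shift/mask
arithmetic.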