Message ID | 20240809082750.3015641-1-ming4.li@intel.com |
---|---|
Headers | show |
Series | Fix get a wrong pci host bridge in cxl_setup_parent_dport() | expand |
Li Ming wrote: > The cxl_test unit test environment on qemu always hit below call trace > with KASAN enabled: > > BUG: KASAN: slab-out-of-bounds in cxl_setup_parent_dport+0x480/0x530 [cxl_core] > Read of size 1 at addr ff110000676014f8 by task (udev-worker)/676[ 24.424403] CPU: 2 PID: 676 Comm: (udev-worker) Tainted: G O N 6.10.0-qemucxl #1 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20240214-2.el9 02/14/2024 > Call Trace: > <TASK> > dump_stack_lvl+0xea/0x150 > print_report+0xce/0x610 > ? kasan_complete_mode_report_info+0x40/0x200 > kasan_report+0xcc/0x110 > __asan_report_load1_noabort+0x18/0x20 > cxl_setup_parent_dport+0x480/0x530 [cxl_core] > cxl_mem_probe+0x49b/0xaa0 [cxl_mem] > > The root cause is that a wrong host bridge was gotten from > dport->dport_dev in cxl_setup_parent_dport(). In > cxl_setup_parent_dport(), it always calls > to_pci_host_bridge(dport->dport_dev) to get a pci_host_bridge structure. > There are two issues in the implementation: > > * to_pci_host_bridge(dport->dport_dev) should be used only for RCH > cases, dport->dport_dev points to a pci device rather than a pci host > bridge in VH cases. The solution is checking if dport is in RCH mode > then calling to_pci_host_bridge(). > (Patch 1) > > * In cxl_test unit test environment, cxl_test will create a emulated CXL > topology with emulated dports, the dport_dev of a emulated dport > points to a platform device. to_pci_host_bridge(dport->dport_dev) also > gets a wrong pci host bridge in the case. The solution is implementing > a new wrap function called __wrap_cxl_setup_parent_dport() on cxl_test > side, the function will filter all emulated dports, make sure only > real dports can be handled by cxl_setup_parent_dport(). > (Patch 2) > > v1 link: https://lore.kernel.org/linux-cxl/ZrHR+0w3bwM1Ik8h@xpf.sh.intel.com/T/#med6200e54ec12f09fdcc04571516adda261c9561 > > v2: > - Add call trace log into changelog > - Remove 'dev_is_platform(dport->dport_dev)' checking out of cxl driver > scope. Check if dport is emulated in cxl_test. I am suprised we got this far without this being a problem earlier. For the series you can add: Reviewed-by: Dan Williams <dan.j.williams@intel.com> ...can you also follow along with a rename, documentation, and cleanup patch? cxl_setup_parent_dport() tells the reader absolutely nothing about what the function does, and that it is specifically initializing PCI AER operation. It should be called something like cxl_dport_init_aer(), and it should have kernel-doc associated with it. The cleanup opportunity I see is to consolidate the ->native_aer check for the cxl_rcrb_to_aer() and cxl_disable_rch_root_ints() into one location. It is a bit silly that cxl_disable_rch_root_ints() needs to check if dport->regs.dport_aer was initialized when it was just setup a few lines above. I.e. there are just too many helper functions and it all could be consolidated in a bigger cxl_dport_init_aer() function. > > Li Ming (2): > cxl/pci: Get AER capability address from RCRB only for RCH dport > cxl/test: Skip cxl_setup_parent_dport() for emulated dports This is the right thing to do whenever possible. The usage of dev_is_platform() only makes sense for scenarios where a function is called by the cxl_core that needs to avoid getting confused by cxl_test. In this case, since cxl_mem is making the call to the exported cxl_setup_parent_dport() symbol, then mocking it for cxl_test is appropriate.
Dan Williams wrote: > Li Ming wrote: > > The cxl_test unit test environment on qemu always hit below call trace > > with KASAN enabled: > > > > BUG: KASAN: slab-out-of-bounds in cxl_setup_parent_dport+0x480/0x530 [cxl_core] > > Read of size 1 at addr ff110000676014f8 by task (udev-worker)/676[ 24.424403] CPU: 2 PID: 676 Comm: (udev-worker) Tainted: G O N 6.10.0-qemucxl #1 > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20240214-2.el9 02/14/2024 > > Call Trace: > > <TASK> > > dump_stack_lvl+0xea/0x150 > > print_report+0xce/0x610 > > ? kasan_complete_mode_report_info+0x40/0x200 > > kasan_report+0xcc/0x110 > > __asan_report_load1_noabort+0x18/0x20 > > cxl_setup_parent_dport+0x480/0x530 [cxl_core] > > cxl_mem_probe+0x49b/0xaa0 [cxl_mem] > > > > The root cause is that a wrong host bridge was gotten from > > dport->dport_dev in cxl_setup_parent_dport(). In > > cxl_setup_parent_dport(), it always calls > > to_pci_host_bridge(dport->dport_dev) to get a pci_host_bridge structure. > > There are two issues in the implementation: > > > > * to_pci_host_bridge(dport->dport_dev) should be used only for RCH > > cases, dport->dport_dev points to a pci device rather than a pci host > > bridge in VH cases. The solution is checking if dport is in RCH mode > > then calling to_pci_host_bridge(). > > (Patch 1) > > > > * In cxl_test unit test environment, cxl_test will create a emulated CXL > > topology with emulated dports, the dport_dev of a emulated dport > > points to a platform device. to_pci_host_bridge(dport->dport_dev) also > > gets a wrong pci host bridge in the case. The solution is implementing > > a new wrap function called __wrap_cxl_setup_parent_dport() on cxl_test > > side, the function will filter all emulated dports, make sure only > > real dports can be handled by cxl_setup_parent_dport(). > > (Patch 2) > > > > v1 link: https://lore.kernel.org/linux-cxl/ZrHR+0w3bwM1Ik8h@xpf.sh.intel.com/T/#med6200e54ec12f09fdcc04571516adda261c9561 > > > > v2: > > - Add call trace log into changelog > > - Remove 'dev_is_platform(dport->dport_dev)' checking out of cxl driver > > scope. Check if dport is emulated in cxl_test. > > I am suprised we got this far without this being a problem earlier. > > For the series you can add: > > Reviewed-by: Dan Williams <dan.j.williams@intel.com> This fixes some of the issues I was seeing in the cxl-test failures. So I'm going to tag this too. That said, on patch 1 I would like to see further clean up. Diging into the details of the dport->rcrb.aer_cap usage this check and code should be moved to cxl_dport_map_rch_aer() and make aer_cap a local and remove struct cxl_rcrb_info completely. For now, for the series. Reviewed-by: Ira Weiny <ira.weiny@intel.com> > > ...can you also follow along with a rename, documentation, and cleanup > patch? cxl_setup_parent_dport() tells the reader absolutely nothing > about what the function does, and that it is specifically initializing > PCI AER operation. > > It should be called something like cxl_dport_init_aer(), and it should > have kernel-doc associated with it. The cleanup opportunity I see is to > consolidate the ->native_aer check for the cxl_rcrb_to_aer() and > cxl_disable_rch_root_ints() into one location. It is a bit silly that > cxl_disable_rch_root_ints() needs to check if dport->regs.dport_aer was > initialized when it was just setup a few lines above. I.e. there are > just too many helper functions and it all could be consolidated in a > bigger cxl_dport_init_aer() function. > > > > > Li Ming (2): > > cxl/pci: Get AER capability address from RCRB only for RCH dport > > cxl/test: Skip cxl_setup_parent_dport() for emulated dports > > This is the right thing to do whenever possible. The usage of > dev_is_platform() only makes sense for scenarios where a function is > called by the cxl_core that needs to avoid getting confused by cxl_test. > > In this case, since cxl_mem is making the call to the exported > cxl_setup_parent_dport() symbol, then mocking it for cxl_test is > appropriate. >
On 8/10/2024 5:36 AM, Dan Williams wrote: > Li Ming wrote: >> The cxl_test unit test environment on qemu always hit below call trace >> with KASAN enabled: >> >> BUG: KASAN: slab-out-of-bounds in cxl_setup_parent_dport+0x480/0x530 [cxl_core] >> Read of size 1 at addr ff110000676014f8 by task (udev-worker)/676[ 24.424403] CPU: 2 PID: 676 Comm: (udev-worker) Tainted: G O N 6.10.0-qemucxl #1 >> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20240214-2.el9 02/14/2024 >> Call Trace: >> <TASK> >> dump_stack_lvl+0xea/0x150 >> print_report+0xce/0x610 >> ? kasan_complete_mode_report_info+0x40/0x200 >> kasan_report+0xcc/0x110 >> __asan_report_load1_noabort+0x18/0x20 >> cxl_setup_parent_dport+0x480/0x530 [cxl_core] >> cxl_mem_probe+0x49b/0xaa0 [cxl_mem] >> >> The root cause is that a wrong host bridge was gotten from >> dport->dport_dev in cxl_setup_parent_dport(). In >> cxl_setup_parent_dport(), it always calls >> to_pci_host_bridge(dport->dport_dev) to get a pci_host_bridge structure. >> There are two issues in the implementation: >> >> * to_pci_host_bridge(dport->dport_dev) should be used only for RCH >> cases, dport->dport_dev points to a pci device rather than a pci host >> bridge in VH cases. The solution is checking if dport is in RCH mode >> then calling to_pci_host_bridge(). >> (Patch 1) >> >> * In cxl_test unit test environment, cxl_test will create a emulated CXL >> topology with emulated dports, the dport_dev of a emulated dport >> points to a platform device. to_pci_host_bridge(dport->dport_dev) also >> gets a wrong pci host bridge in the case. The solution is implementing >> a new wrap function called __wrap_cxl_setup_parent_dport() on cxl_test >> side, the function will filter all emulated dports, make sure only >> real dports can be handled by cxl_setup_parent_dport(). >> (Patch 2) >> >> v1 link: https://lore.kernel.org/linux-cxl/ZrHR+0w3bwM1Ik8h@xpf.sh.intel.com/T/#med6200e54ec12f09fdcc04571516adda261c9561 >> >> v2: >> - Add call trace log into changelog >> - Remove 'dev_is_platform(dport->dport_dev)' checking out of cxl driver >> scope. Check if dport is emulated in cxl_test. > I am suprised we got this far without this being a problem earlier. > > For the series you can add: > > Reviewed-by: Dan Williams <dan.j.williams@intel.com> > > ...can you also follow along with a rename, documentation, and cleanup > patch? cxl_setup_parent_dport() tells the reader absolutely nothing > about what the function does, and that it is specifically initializing > PCI AER operation. > > It should be called something like cxl_dport_init_aer(), and it should > have kernel-doc associated with it. The cleanup opportunity I see is to > consolidate the ->native_aer check for the cxl_rcrb_to_aer() and > cxl_disable_rch_root_ints() into one location. It is a bit silly that > cxl_disable_rch_root_ints() needs to check if dport->regs.dport_aer was > initialized when it was just setup a few lines above. I.e. there are > just too many helper functions and it all could be consolidated in a > bigger cxl_dport_init_aer() function. Sure, I will do that, thanks for review. > >> Li Ming (2): >> cxl/pci: Get AER capability address from RCRB only for RCH dport >> cxl/test: Skip cxl_setup_parent_dport() for emulated dports > This is the right thing to do whenever possible. The usage of > dev_is_platform() only makes sense for scenarios where a function is > called by the cxl_core that needs to avoid getting confused by cxl_test. > > In this case, since cxl_mem is making the call to the exported > cxl_setup_parent_dport() symbol, then mocking it for cxl_test is > appropriate.