Message ID | 20230113170131.5086-1-a.antonovitch@gmail.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | PCI/hotplug: Replaced down_write_nested with hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock. | expand |
Hi Anatoli, On Fri, Jan 13, 2023 at 12:01:31PM -0500, Anatoli Antonovitch wrote: > It is to avoid any potential issues when S3 resume but at the same > time we want to hot-unplug. > > To fix the race between pciehp and AER reported in > https://bugzilla.kernel.org/show_bug.cgi?id=215590 I've just submitted an alternative patch to fix this, could you give it a spin and see if the issue goes away? https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/ That alternative approach is preferable IMO because it also solves the problem that marking devices as permanently offline isn't possible concurrently to driver bind/unbind at the moment. Additionally, the alternative patch simplifies locking and reduces code size. Thanks and sorry for my belated response. Lukas
Hi Lukas, The patch has been tested on the same setup. Unfortunately, this alternative approach with optimization and simplified locking is not resolve the issue in https://bugzilla.kernel.org/show_bug.cgi?id=215590 I have uploaded the log dmesg_6_2_rc4_hotadd_aer_fix_a6bd101b8f84.txt into the bugzilla for your patch. Thanks, Anatoli On 2023-01-20 04:28, Lukas Wunner wrote: > I've just submitted an alternative patch to fix this, could you give > it a spin and see if the issue goes away? > > https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/
On Fri, Jan 20, 2023 at 04:35:01PM -0500, Anatoli Antonovitch wrote: > On 2023-01-20 04:28, Lukas Wunner wrote: > > I've just submitted an alternative patch to fix this, could you give > > it a spin and see if the issue goes away? > > > > https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/ > > The patch has been tested on the same setup. > Unfortunately, this alternative approach with optimization and simplified > locking is not resolve the issue in > https://bugzilla.kernel.org/show_bug.cgi?id=215590 > > I have uploaded the log dmesg_6_2_rc4_hotadd_aer_fix_a6bd101b8f84.txt into > the bugzilla for your patch. You're now getting a different deadlock. That one is addressed by this old patch (it's already linked from the bugzilla): https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ If you apply that patch plus the new one, do you still see a deadlock? Thanks, Lukas
I do not see a deadlock, when applying the following old patch: https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ after merge for the kernel 6.2.0-rc5, and applied the alternative patch: https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/ I have uploaded the merged patch and the system log for the upstream kernel. Anatoli On 2023-01-21 02:21, Lukas Wunner wrote: > You're now getting a different deadlock. That one is addressed by this > old patch (it's already linked from the bugzilla): > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ > > If you apply that patch plus the new one, do you still see a deadlock?
Hi Lukas, Can we revisit the patches again to get a fix? The issue still reproduce and visible in the kernel 6.2.0-rc8. Thanks, Anatoli On 2023-01-23 14:30, Anatoli Antonovitch wrote: > I do not see a deadlock, when applying the following old patch: > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ > > after merge for the kernel 6.2.0-rc5, and applied the alternative patch: > https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/ > > > I have uploaded the merged patch and the system log for the upstream > kernel. > > Anatoli > > > On 2023-01-21 02:21, Lukas Wunner wrote: >> You're now getting a different deadlock. That one is addressed by this >> old patch (it's already linked from the bugzilla): >> >> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ >> >> >> If you apply that patch plus the new one, do you still see a deadlock?
On Mon, Feb 13, 2023 at 09:59:52AM -0500, Anatoli Antonovitch wrote: > Hi Lukas, > > Can we revisit the patches again to get a fix? > The issue still reproduce and visible in the kernel 6.2.0-rc8. > On 2023-01-23 14:30, Anatoli Antonovitch wrote: > > I do not see a deadlock, when applying the following old patch: > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ This old patch would need to be updated and reposted. There was a 0-day bot issue and a question to be resolved. Maybe this is all already resolved, but it needs to be posted and tested with a current kernel. > > after merge for the kernel 6.2.0-rc5, and applied the alternative patch: > > https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/ This one is on track to appear in v6.3-rc1: https://git.kernel.org/cgit/linux/kernel/git/pci/pci.git/commit/?id=74ff8864cc84 > > I have uploaded the merged patch and the system log for the upstream > > kernel. > > > > Anatoli > > > > > > On 2023-01-21 02:21, Lukas Wunner wrote: > > > You're now getting a different deadlock. That one is addressed by this > > > old patch (it's already linked from the bugzilla): > > > > > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ > > > > > > > > > If you apply that patch plus the new one, do you still see a deadlock?
[Public] > -----Original Message----- > From: Bjorn Helgaas <helgaas@kernel.org> > Sent: Friday, February 17, 2023 11:04 AM > To: Antonovitch, Anatoli <Anatoli.Antonovitch@amd.com> > Cc: Lukas Wunner <lukas@wunner.de>; Anatoli Antonovitch > <a.antonovitch@gmail.com>; linux-pci@vger.kernel.org; > bhelgaas@google.com; Deucher, Alexander > <Alexander.Deucher@amd.com>; Koenig, Christian > <Christian.Koenig@amd.com> > Subject: Re: [PATCH] PCI/hotplug: Replaced down_write_nested with > hotplug_slot_rwsem if ctrl->depth > 0 when taking the ctrl->reset_lock. > > On Mon, Feb 13, 2023 at 09:59:52AM -0500, Anatoli Antonovitch wrote: > > Hi Lukas, > > > > Can we revisit the patches again to get a fix? > > The issue still reproduce and visible in the kernel 6.2.0-rc8. > > > On 2023-01-23 14:30, Anatoli Antonovitch wrote: > > > I do not see a deadlock, when applying the following old patch: > > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73 > > > dab4b6.1595329748.git.lukas@wunner.de/ > > This old patch would need to be updated and reposted. There was a 0-day > bot issue and a question to be resolved. Maybe this is all already resolved, > but it needs to be posted and tested with a current kernel. Lukas, can you resend that patch? We can test it. Alex > > > > after merge for the kernel 6.2.0-rc5, and applied the alternative patch: > > > https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e3 > > > 7d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/ > > This one is on track to appear in v6.3-rc1: > https://git.kernel.org/cgit/linux/kernel/git/pci/pci.git/commit/?id=74ff8864c > c84 > > > > I have uploaded the merged patch and the system log for the upstream > > > kernel. > > > > > > Anatoli > > > > > > > > > On 2023-01-21 02:21, Lukas Wunner wrote: > > > > You're now getting a different deadlock. That one is addressed by > > > > this old patch (it's already linked from the bugzilla): > > > > > > > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d > > > > 73dab4b6.1595329748.git.lukas@wunner.de/ > > > > > > > > > > > > If you apply that patch plus the new one, do you still see a deadlock?
On Fri, Feb 17, 2023 at 06:37:54PM +0000, Deucher, Alexander wrote: > > From: Bjorn Helgaas <helgaas@kernel.org> > > On Mon, Feb 13, 2023 at 09:59:52AM -0500, Anatoli Antonovitch wrote: > > > On 2023-01-23 14:30, Anatoli Antonovitch wrote: > > > > I do not see a deadlock, when applying the following old patch: > > > > https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ > > > > > > Can we revisit the patches again to get a fix? > > > The issue still reproduce and visible in the kernel 6.2.0-rc8. > > > > This old patch would need to be updated and reposted. There was a 0-day > > bot issue and a question to be resolved. Maybe this is all already resolved, > > but it needs to be posted and tested with a current kernel. > > Lukas, can you resend that patch? We can test it. I'm working on a patch which aims to solve these deadlocks differently, by reducing the critical sections for which the reset_lock is held. Please stand by. Thanks, Lukas
Thanks Lukas. The patch has been tested with current kernel 6.3.0-rc5 on the same setup. The deadlock between reset_lock and device_lock has been fixed. See details in the dmesg log: dmesg_6.3.0-rc5_fix.txt in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215590 Thanks, Anatoli On 2023-02-19 15:21, Lukas Wunner wrote: > On Fri, Feb 17, 2023 at 06:37:54PM +0000, Deucher, Alexander wrote: >>> From: Bjorn Helgaas <helgaas@kernel.org> >>> On Mon, Feb 13, 2023 at 09:59:52AM -0500, Anatoli Antonovitch wrote: >>>> On 2023-01-23 14:30, Anatoli Antonovitch wrote: >>>>> I do not see a deadlock, when applying the following old patch: >>>>> https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/ >>>> Can we revisit the patches again to get a fix? >>>> The issue still reproduce and visible in the kernel 6.2.0-rc8. >>> This old patch would need to be updated and reposted. There was a 0-day >>> bot issue and a question to be resolved. Maybe this is all already resolved, >>> but it needs to be posted and tested with a current kernel. >> Lukas, can you resend that patch? We can test it. > I'm working on a patch which aims to solve these deadlocks differently, > by reducing the critical sections for which the reset_lock is held. > Please stand by. > > Thanks, > > Lukas
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 10e9670eea0b..b1084e67f798 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -27,6 +27,8 @@ #include "../pci.h" #include "pciehp.h" +static DECLARE_RWSEM(hotplug_slot_rwsem); + static const struct dmi_system_id inband_presence_disabled_dmi_table[] = { /* * Match all Dell systems, as some Dell systems have inband @@ -911,7 +913,10 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, bool probe) if (probe) return 0; - down_write_nested(&ctrl->reset_lock, ctrl->depth); + if (ctrl->depth > 0) + down_write_nested(&ctrl->reset_lock, ctrl->depth); + else + down_write(&hotplug_slot_rwsem); if (!ATTN_BUTTN(ctrl)) { ctrl_mask |= PCI_EXP_SLTCTL_PDCE; @@ -931,7 +936,11 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, bool probe) ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__, pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask); - up_write(&ctrl->reset_lock); + if (ctrl->depth > 0) + up_write(&ctrl->reset_lock); + else + up_write(&hotplug_slot_rwsem); + return rc; }