Message ID | 20200512031315.189865.15477.stgit@awfm-01.aw.intel.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | minor hfi and qib fixes | expand |
On Mon, May 11, 2020 at 11:13:15PM -0400, Dennis Dalessandro wrote: > From: Kaike Wan <kaike.wan@intel.com> > > The workqueue hfi1_wq is destroyed in function shutdown_device(), which > is called by either shutdown_one() or remove_one(). The function > shutdown_one() is called when the kernel is rebooted while remove_one() > is called when the hfi1 driver is unloaded. When the kernel is rebooted, > hfi1_wq is destroyed while all qps are still active, leading to a > kernel crash: I was under impression that kernel reboot should follow same logic as module removal. This is what graceful reboot will do anyway. Can you please give me a link where I can read about difference in those flows? Thanks
> -----Original Message----- > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > owner@vger.kernel.org> On Behalf Of Leon Romanovsky > Sent: Tuesday, May 12, 2020 1:55 AM > To: Dalessandro, Dennis <dennis.dalessandro@intel.com> > Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org; > Marciniszyn, Mike <mike.marciniszyn@intel.com>; stable@vger.kernel.org; > Wan, Kaike <kaike.wan@intel.com> > Subject: Re: [PATCH for-rc or next 1/3] IB/hfi1: Do not destroy hfi1_wq when > the device is shut down > > On Mon, May 11, 2020 at 11:13:15PM -0400, Dennis Dalessandro wrote: > > From: Kaike Wan <kaike.wan@intel.com> > > > > The workqueue hfi1_wq is destroyed in function shutdown_device(), > > which is called by either shutdown_one() or remove_one(). The function > > shutdown_one() is called when the kernel is rebooted while > > remove_one() is called when the hfi1 driver is unloaded. When the > > kernel is rebooted, hfi1_wq is destroyed while all qps are still > > active, leading to a kernel crash: > > I was under impression that kernel reboot should follow same logic as > module removal. This is what graceful reboot will do anyway. Can you please > give me a link where I can read about difference in those flows? > I used to think the same. However, by adding traces to the hfi driver, I found out that the shutdown function of the pci_driver was called when typing "reboot" while the remove function of the pci_driver was called when typing "modprobe -r hfi1". I am not an expert on kernel reboot and can someone give some hints? Kaike
On Tue, May 12, 2020 at 11:52:34AM +0000, Wan, Kaike wrote: > > > > -----Original Message----- > > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > > owner@vger.kernel.org> On Behalf Of Leon Romanovsky > > Sent: Tuesday, May 12, 2020 1:55 AM > > To: Dalessandro, Dennis <dennis.dalessandro@intel.com> > > Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org; > > Marciniszyn, Mike <mike.marciniszyn@intel.com>; stable@vger.kernel.org; > > Wan, Kaike <kaike.wan@intel.com> > > Subject: Re: [PATCH for-rc or next 1/3] IB/hfi1: Do not destroy hfi1_wq when > > the device is shut down > > > > On Mon, May 11, 2020 at 11:13:15PM -0400, Dennis Dalessandro wrote: > > > From: Kaike Wan <kaike.wan@intel.com> > > > > > > The workqueue hfi1_wq is destroyed in function shutdown_device(), > > > which is called by either shutdown_one() or remove_one(). The function > > > shutdown_one() is called when the kernel is rebooted while > > > remove_one() is called when the hfi1 driver is unloaded. When the > > > kernel is rebooted, hfi1_wq is destroyed while all qps are still > > > active, leading to a kernel crash: > > > > I was under impression that kernel reboot should follow same logic as > > module removal. This is what graceful reboot will do anyway. Can you please > > give me a link where I can read about difference in those flows? > > > I used to think the same. However, by adding traces to the hfi driver, I found out that the shutdown function of the pci_driver was called when typing "reboot" while the remove function of the pci_driver was called when typing "modprobe -r hfi1". I took a look on what mlx5_core is doing in shutdown flow and it can be summarized in the following: 1. Drain workqueues 2. Close PCI 3. Don't release anything. So maybe you didn't flush the hfi1_wq? > > I am not an expert on kernel reboot and can someone give some hints? > > Kaike > > > >
> -----Original Message----- > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > owner@vger.kernel.org> On Behalf Of Leon Romanovsky > Sent: Wednesday, May 13, 2020 3:59 AM > To: Wan, Kaike <kaike.wan@intel.com> > Cc: Dalessandro, Dennis <dennis.dalessandro@intel.com>; jgg@ziepe.ca; > dledford@redhat.com; linux-rdma@vger.kernel.org; Marciniszyn, Mike > <mike.marciniszyn@intel.com>; stable@vger.kernel.org > Subject: Re: [PATCH for-rc or next 1/3] IB/hfi1: Do not destroy hfi1_wq when > the device is shut down > > On Tue, May 12, 2020 at 11:52:34AM +0000, Wan, Kaike wrote: > > > > > > > -----Original Message----- > > > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > > > owner@vger.kernel.org> On Behalf Of Leon Romanovsky > > > Sent: Tuesday, May 12, 2020 1:55 AM > > > To: Dalessandro, Dennis <dennis.dalessandro@intel.com> > > > Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org; > > > Marciniszyn, Mike <mike.marciniszyn@intel.com>; > > > stable@vger.kernel.org; Wan, Kaike <kaike.wan@intel.com> > > > Subject: Re: [PATCH for-rc or next 1/3] IB/hfi1: Do not destroy > > > hfi1_wq when the device is shut down > > > > > > On Mon, May 11, 2020 at 11:13:15PM -0400, Dennis Dalessandro wrote: > > > > From: Kaike Wan <kaike.wan@intel.com> > > > > > > > > The workqueue hfi1_wq is destroyed in function shutdown_device(), > > > > which is called by either shutdown_one() or remove_one(). The > > > > function > > > > shutdown_one() is called when the kernel is rebooted while > > > > remove_one() is called when the hfi1 driver is unloaded. When the > > > > kernel is rebooted, hfi1_wq is destroyed while all qps are still > > > > active, leading to a kernel crash: > > > > > > I was under impression that kernel reboot should follow same logic > > > as module removal. This is what graceful reboot will do anyway. Can > > > you please give me a link where I can read about difference in those > flows? > > > > > I used to think the same. However, by adding traces to the hfi driver, I > found out that the shutdown function of the pci_driver was called when > typing "reboot" while the remove function of the pci_driver was called > when typing "modprobe -r hfi1". > > I took a look on what mlx5_core is doing in shutdown flow and it can be > summarized in the following: > 1. Drain workqueues > 2. Close PCI > 3. Don't release anything. > > So maybe you didn't flush the hfi1_wq? Will add the flush. Thanks, Kaike > > > > > >
On Mon, May 11, 2020 at 11:13:15PM -0400, Dennis Dalessandro wrote: > From: Kaike Wan <kaike.wan@intel.com> > > The workqueue hfi1_wq is destroyed in function shutdown_device(), which > is called by either shutdown_one() or remove_one(). The function > shutdown_one() is called when the kernel is rebooted while remove_one() > is called when the hfi1 driver is unloaded. When the kernel is rebooted, > hfi1_wq is destroyed while all qps are still active, leading to a > kernel crash: AFAIK the purpose of shutdown is to stop all in progress DMAs. If devices are wildly doing DMA during the shutdown process then all manner of things can fail, including kexecing into another kernel. Do you achive that with these shutdown handlers? It does make sense that the work queue would not be destroyed in shutdown, but I'm surprised it doesn't flush it? Jason
> -----Original Message----- > From: Jason Gunthorpe <jgg@ziepe.ca> > Sent: Tuesday, May 19, 2020 8:27 PM > To: Dalessandro, Dennis <dennis.dalessandro@intel.com> > Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Marciniszyn, Mike > <mike.marciniszyn@intel.com>; stable@vger.kernel.org; Wan, Kaike > <kaike.wan@intel.com> > Subject: Re: [PATCH for-rc or next 1/3] IB/hfi1: Do not destroy hfi1_wq when > the device is shut down > > On Mon, May 11, 2020 at 11:13:15PM -0400, Dennis Dalessandro wrote: > > From: Kaike Wan <kaike.wan@intel.com> > > > > The workqueue hfi1_wq is destroyed in function shutdown_device(), > > which is called by either shutdown_one() or remove_one(). The function > > shutdown_one() is called when the kernel is rebooted while > > remove_one() is called when the hfi1 driver is unloaded. When the > > kernel is rebooted, hfi1_wq is destroyed while all qps are still > > active, leading to a kernel crash: > > AFAIK the purpose of shutdown is to stop all in progress DMAs. > > If devices are wildly doing DMA during the shutdown process then all manner > of things can fail, including kexecing into another kernel. > > Do you achive that with these shutdown handlers? We did try to shut down the hardware and will address some software issues in next revision. > > It does make sense that the work queue would not be destroyed in > shutdown, but I'm surprised it doesn't flush it? Will do. Kaike
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c index 3759d92..10fb5c6 100644 --- a/drivers/infiniband/hw/hfi1/init.c +++ b/drivers/infiniband/hw/hfi1/init.c @@ -829,6 +829,25 @@ static int create_workqueues(struct hfi1_devdata *dd) } /** + * destroy_workqueues - destroy per port workqueues + * @dd: the hfi1_ib device + */ +static void destroy_workqueues(struct hfi1_devdata *dd) +{ + int pidx; + struct hfi1_pportdata *ppd; + + for (pidx = 0; pidx < dd->num_pports; ++pidx) { + ppd = dd->pport + pidx; + + if (ppd->hfi1_wq) { + destroy_workqueue(ppd->hfi1_wq); + ppd->hfi1_wq = NULL; + } + } +} + +/** * enable_general_intr() - Enable the IRQs that will be handled by the * general interrupt handler. * @dd: valid devdata @@ -1102,10 +1121,6 @@ static void shutdown_device(struct hfi1_devdata *dd) */ hfi1_quiet_serdes(ppd); - if (ppd->hfi1_wq) { - destroy_workqueue(ppd->hfi1_wq); - ppd->hfi1_wq = NULL; - } if (ppd->link_wq) { destroy_workqueue(ppd->link_wq); ppd->link_wq = NULL; @@ -1757,6 +1772,7 @@ static void remove_one(struct pci_dev *pdev) * clear dma engines, etc. */ shutdown_device(dd); + destroy_workqueues(dd); stop_timers(dd);