diff mbox

Commit f5d9b7f0f9 (fix r600_enable_sclk_control()) causes kexec issues

Message ID 20130730184642.GA352@x4 (mailing list archive)
State New, archived
Headers show

Commit Message

Markus Trippelsdorf July 30, 2013, 6:46 p.m. UTC
On 2013.07.30 at 10:53 -0400, Alex Deucher wrote:
> On Tue, Jul 30, 2013 at 7:27 AM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > On 2013.07.29 at 15:53 -0400, Alex Deucher wrote:
> >> On Mon, Jul 29, 2013 at 2:10 PM, Eric W. Biederman
> >> <ebiederm@xmission.com> wrote:
> >> > Alex Deucher <alexdeucher@gmail.com> writes:
> >> >
> >> >> On Mon, Jul 29, 2013 at 11:50 AM, Eric W. Biederman
> >> >> <ebiederm@xmission.com> wrote:
> >> >>>
> >> >>>
> >> >>> Alex Deucher <alexdeucher@gmail.com> wrote:
> >> >>>>On Mon, Jul 29, 2013 at 10:09 AM, Markus Trippelsdorf
> >> >>>><markus@trippelsdorf.de> wrote:
> >> >>>>> On 2013.07.29 at 09:58 -0400, Alex Deucher wrote:
> >> >>>>>> On Mon, Jul 29, 2013 at 3:51 AM, Markus Trippelsdorf
> >> >>>>>> <markus@trippelsdorf.de> wrote:
> >> >>>>>> > On my test machine Xorg doesn't start anymore when I kexec into a
> >> >>>>>> > 3.11.0-rc3 kernel.
> >> >>>>>>
> >> >>>>>> With kexec, dpm doesn't get torn down properly which can result in a
> >> >>>>>> bad hardware state when the driver loads again.  Does the attached
> >> >>>>>> patch help?  It attempts to disable dpm at startup in case it wasn't
> >> >>>>>> torn down properly previously.
> >> >>>>>
> >> >>>>> dpm initialization now works, but unfortunately GPU acceleration
> >> >>>>still gets
> >> >>>>> disabled:
> >> >>>>
> >> >>>>Stupid kexec complicates things.  We need to make sure dpm is torn
> >> >>>>down before we init the rest of the GPU, but dpm needs get initialized
> >> >>>>later in the init process since it depends on certain other state from
> >> >>>>the driver.  I need to think about this for a bit.  I'm not sure of a
> >> >>>>good way to handle this.
> >> >>>
> >> >>> You might just want to implement a shutdown method that cleans things up properly.   At least as a first pass until you worry about things like kexec on panic.
> >> >>>
> >> >>> Or can you not shutdown the graphics stack on reboot because of the need to display the kernels shutdown progress?
> >> >>
> >> >> Does kexec actually call this shutdown method?  The driver implements
> >> >> appropriate clean-up measures if it's shutdown properly.
> >> >
> >> > Absoltuely.  All parts of the reboot path call ->shutdown.  Including
> >> > kexec.
> >> >
> >> > You don't get a device remove/hotunplug but unless this is a kexec on
> >> > panic ->shutdown is most definitely called.  Now I am talking about the
> >> > device layer/pci layer shutdown method I don't know how gpu drivers are
> >> > wired up.  GPU land was a little strange last I looked.  Hopefully it
> >> > isn't so strange that there is a method named shutdown that is not wired
> >> > up.
> >>
> >> It doesn't look like the drm infrastructure has a shutdown callback.
> >> The drm drivers register a drm_driver callback struct that includes an
> >> unload callback which takes care of all the device teardown (if you
> >> unload the module for example).  I don't know that it actually gets
> >> called at kexec time however.  I don't know enough about how kexec
> >> works.
> >
> > BTW there is r100_restore_sanity() in drm/radeon/r100.c that explicitly
> > handles the kexec case during init. So maybe an r600_restore_sanity()
> > function is needed?
> >
> > (One of the advantages of using kexec (besides the much quicker boot
> > time) is that the monitor is not switched off and then on during boot.
> > I guess that advantage would be lost if the unload callback would be
> > called.)
> 
> r100_restore_sanity() is basically a set of hacks (that gets called at
> driver startup) to work around the fact that with kexec the drm driver
> is not torn down correctly.  So we could add a bunch more asic
> specific tear down sequences to deal with dpm (and all the other
> engines on the GPU that may potentially cause problems if they are not
> torn down properly), but that will just turn into a mess.  All of
> these hacks also add latency to the driver load.  I think the best
> solution would probably be to figure how to hook up the drm unload
> callback to the shutdown method that kexec uses.

FYI the following (ugly) hack works for me. 
(If I don't comment out radeon_fbdev_fini(rdev) kexec will hang.)

Comments

Markus Trippelsdorf July 30, 2013, 8:56 p.m. UTC | #1
On 2013.07.30 at 20:46 +0200, Markus Trippelsdorf wrote:
> On 2013.07.30 at 10:53 -0400, Alex Deucher wrote:
> > On Tue, Jul 30, 2013 at 7:27 AM, Markus Trippelsdorf
> > <markus@trippelsdorf.de> wrote:
> > > On 2013.07.29 at 15:53 -0400, Alex Deucher wrote:
> > >> On Mon, Jul 29, 2013 at 2:10 PM, Eric W. Biederman
> > >> <ebiederm@xmission.com> wrote:
> > >> > Alex Deucher <alexdeucher@gmail.com> writes:
> > >> >
> > >> >> On Mon, Jul 29, 2013 at 11:50 AM, Eric W. Biederman
> > >> >> <ebiederm@xmission.com> wrote:
> > >> >>>
> > >> >>>
> > >> >>> Alex Deucher <alexdeucher@gmail.com> wrote:
> > >> >>>>On Mon, Jul 29, 2013 at 10:09 AM, Markus Trippelsdorf
> > >> >>>><markus@trippelsdorf.de> wrote:
> > >> >>>>> On 2013.07.29 at 09:58 -0400, Alex Deucher wrote:
> > >> >>>>>> On Mon, Jul 29, 2013 at 3:51 AM, Markus Trippelsdorf
> > >> >>>>>> <markus@trippelsdorf.de> wrote:
> > >> >>>>>> > On my test machine Xorg doesn't start anymore when I kexec into a
> > >> >>>>>> > 3.11.0-rc3 kernel.
> > >> >>>>>>
> > >> >>>>>> With kexec, dpm doesn't get torn down properly which can result in a
> > >> >>>>>> bad hardware state when the driver loads again.  Does the attached
> > >> >>>>>> patch help?  It attempts to disable dpm at startup in case it wasn't
> > >> >>>>>> torn down properly previously.
> > >> >>>>>
> > >> >>>>> dpm initialization now works, but unfortunately GPU acceleration
> > >> >>>>still gets
> > >> >>>>> disabled:
> > >> >>>>
> > >> >>>>Stupid kexec complicates things.  We need to make sure dpm is torn
> > >> >>>>down before we init the rest of the GPU, but dpm needs get initialized
> > >> >>>>later in the init process since it depends on certain other state from
> > >> >>>>the driver.  I need to think about this for a bit.  I'm not sure of a
> > >> >>>>good way to handle this.
> > >> >>>
> > >> >>> You might just want to implement a shutdown method that cleans things up properly.   At least as a first pass until you worry about things like kexec on panic.
> > >> >>>
> > >> >>> Or can you not shutdown the graphics stack on reboot because of the need to display the kernels shutdown progress?
> > >> >>
> > >> >> Does kexec actually call this shutdown method?  The driver implements
> > >> >> appropriate clean-up measures if it's shutdown properly.
> > >> >
> > >> > Absoltuely.  All parts of the reboot path call ->shutdown.  Including
> > >> > kexec.
> > >> >
> > >> > You don't get a device remove/hotunplug but unless this is a kexec on
> > >> > panic ->shutdown is most definitely called.  Now I am talking about the
> > >> > device layer/pci layer shutdown method I don't know how gpu drivers are
> > >> > wired up.  GPU land was a little strange last I looked.  Hopefully it
> > >> > isn't so strange that there is a method named shutdown that is not wired
> > >> > up.
> > >>
> > >> It doesn't look like the drm infrastructure has a shutdown callback.
> > >> The drm drivers register a drm_driver callback struct that includes an
> > >> unload callback which takes care of all the device teardown (if you
> > >> unload the module for example).  I don't know that it actually gets
> > >> called at kexec time however.  I don't know enough about how kexec
> > >> works.
> > >
> > > BTW there is r100_restore_sanity() in drm/radeon/r100.c that explicitly
> > > handles the kexec case during init. So maybe an r600_restore_sanity()
> > > function is needed?
> > >
> > > (One of the advantages of using kexec (besides the much quicker boot
> > > time) is that the monitor is not switched off and then on during boot.
> > > I guess that advantage would be lost if the unload callback would be
> > > called.)
> > 
> > r100_restore_sanity() is basically a set of hacks (that gets called at
> > driver startup) to work around the fact that with kexec the drm driver
> > is not torn down correctly.  So we could add a bunch more asic
> > specific tear down sequences to deal with dpm (and all the other
> > engines on the GPU that may potentially cause problems if they are not
> > torn down properly), but that will just turn into a mess.  All of
> > these hacks also add latency to the driver load.  I think the best
> > solution would probably be to figure how to hook up the drm unload
> > callback to the shutdown method that kexec uses.
> 
> FYI the following (ugly) hack works for me. 

No. It still fails, although much more infrequently (~ on every 6-8 boot).

I begin to wonder if:
 [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD) 
is an simple initialization bug that doesn't directly depend on kexec at
all.
Joshua C. July 31, 2013, 9:19 a.m. UTC | #2
2013/7/30 Markus Trippelsdorf <markus@trippelsdorf.de>:
>
> I begin to wonder if:
>  [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
> is an simple initialization bug that doesn't directly depend on kexec at
> all.
>
> --
> Markus

I bet on this (because I see the same error in another context)
diff mbox

Patch

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 75349cd..13e2988 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -3947,20 +3947,6 @@  void r100_fini(struct radeon_device *rdev)
  */
 void r100_restore_sanity(struct radeon_device *rdev)
 {
-	u32 tmp;
-
-	tmp = RREG32(RADEON_CP_CSQ_CNTL);
-	if (tmp) {
-		WREG32(RADEON_CP_CSQ_CNTL, 0);
-	}
-	tmp = RREG32(RADEON_CP_RB_CNTL);
-	if (tmp) {
-		WREG32(RADEON_CP_RB_CNTL, 0);
-	}
-	tmp = RREG32(RADEON_SCRATCH_UMSK);
-	if (tmp) {
-		WREG32(RADEON_SCRATCH_UMSK, 0);
-	}
 }
 
 int r100_init(struct radeon_device *rdev)
diff --git a/drivers/gpu/drm/radeon/radeon_display.c b/drivers/gpu/drm/radeon/radeon_display.c
index c2b67b4..79b38e2 100644
--- a/drivers/gpu/drm/radeon/radeon_display.c
+++ b/drivers/gpu/drm/radeon/radeon_display.c
@@ -1405,7 +1405,7 @@  int radeon_modeset_init(struct radeon_device *rdev)
 
 void radeon_modeset_fini(struct radeon_device *rdev)
 {
-	radeon_fbdev_fini(rdev);
+//	radeon_fbdev_fini(rdev);
 	kfree(rdev->mode_info.bios_hardcoded_edid);
 	radeon_pm_fini(rdev);
 
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 29876b1..e705e8c 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -388,6 +388,15 @@  static const struct file_operations radeon_driver_kms_fops = {
 #endif
 };
 
+
+static void
+radeon_kexec_shutdown(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+
+	radeon_driver_unload_kms(dev);
+}
+
 static struct drm_driver kms_driver = {
 	.driver_features =
 	    DRIVER_USE_AGP | DRIVER_USE_MTRR | DRIVER_PCI_DMA | DRIVER_SG |
@@ -463,6 +472,7 @@  static struct pci_driver radeon_kms_pci_driver = {
 	.remove = radeon_pci_remove,
 	.suspend = radeon_pci_suspend,
 	.resume = radeon_pci_resume,
+	.shutdown = radeon_kexec_shutdown,
 };
 
 static int __init radeon_init(void)