diff mbox

[0/3] drm/radeon kexec fixes

Message ID 20130911090135.GB359@x4 (mailing list archive)
State New, archived
Headers show

Commit Message

Markus Trippelsdorf Sept. 11, 2013, 9:01 a.m. UTC
On 2013.09.09 at 11:38 +0200, Christian König wrote:
> Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:
> > On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:
> >> Markus Trippelsdorf <markus@trippelsdorf.de> writes:
> >>
> >>> Here are a couple of patches that get kexec working with radeon devices.
> >>> I've tested this on my RS780.
> >>> Comments or flames are welcome.
> >>> Thanks.
> >> A couple of high level comments.
> >>
> >> This looks promising for the usual case.
> >>
> >> Removing the printk at the end of the kexec path seems a little dubious,
> >> what of other cpus, interrupt handlers, etc.  Basically estabilishing a
> >> new rule on when printk is allowed seems a little dubious at this point,
> >> even if it is a useful debugging trick.
> > OK. I will drop this patch. It doesn't seem to be necessary, because I
> > cannot reproduce the printk related hang anymore.
> >
> >> Having a clean shutdown of the radeon definitely seems worth doing,
> >> because the cases where we care abouty video are when a person is in
> >> front of the system.
> > Yes. But please note that even with radeon_pci_shutdown implemented, I
> > still get ring test failures on roughly every eighth kexec boot:
> >
> >   [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
> >   radeon 0000:01:05.0: disabling GPU acceleration
> >
> > That's definitely better than the current state of affairs, with ring
> > test failures on every second boot. But I haven't figured out the reason
> > for these failures yet. It's curious that once a ring test failure
> > occurs, it will reliably fail after each kexec invocation, no matter how
> > often repeated. Only a reboot brings the machine back to normal.
> 
> The main problem here is that the AMD gfx hardware doesn't really 
> support being reinitialized once booted (for various reasons). It's a 
> (intended) limitation of the hardware design that you can only 
> initialize certain blocks once every power cycle, so the whole approach 
> actually will never work 100% reliable.
> 
> All you can hope for is that stopping the hardware while shutting down 
> the old kernel and starting it again results in exactly the same 
> hardware parameters (offsets, clock etc...) otherwise starting the 
> blocks will just fail and you end up with disabled acceleration like above.
> 
> Sorry, but there isn't much we can do about this,

I've tested this further and it turned out that if I revert commit
f5d9b7f0f9 on top of my "drm/radeon: Implement radeon_pci_shutdown"
patch, the initialization failures seem to go away completely.

Any idea what's going on?

Here's the patch:

Comments

Christian König Sept. 11, 2013, 9:10 a.m. UTC | #1
Am 11.09.2013 11:01, schrieb Markus Trippelsdorf:
> On 2013.09.09 at 11:38 +0200, Christian König wrote:
>> Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:
>>> On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:
>>>> Markus Trippelsdorf <markus@trippelsdorf.de> writes:
>>>>
>>>>> Here are a couple of patches that get kexec working with radeon devices.
>>>>> I've tested this on my RS780.
>>>>> Comments or flames are welcome.
>>>>> Thanks.
>>>> A couple of high level comments.
>>>>
>>>> This looks promising for the usual case.
>>>>
>>>> Removing the printk at the end of the kexec path seems a little dubious,
>>>> what of other cpus, interrupt handlers, etc.  Basically estabilishing a
>>>> new rule on when printk is allowed seems a little dubious at this point,
>>>> even if it is a useful debugging trick.
>>> OK. I will drop this patch. It doesn't seem to be necessary, because I
>>> cannot reproduce the printk related hang anymore.
>>>
>>>> Having a clean shutdown of the radeon definitely seems worth doing,
>>>> because the cases where we care abouty video are when a person is in
>>>> front of the system.
>>> Yes. But please note that even with radeon_pci_shutdown implemented, I
>>> still get ring test failures on roughly every eighth kexec boot:
>>>
>>>    [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
>>>    radeon 0000:01:05.0: disabling GPU acceleration
>>>
>>> That's definitely better than the current state of affairs, with ring
>>> test failures on every second boot. But I haven't figured out the reason
>>> for these failures yet. It's curious that once a ring test failure
>>> occurs, it will reliably fail after each kexec invocation, no matter how
>>> often repeated. Only a reboot brings the machine back to normal.
>> The main problem here is that the AMD gfx hardware doesn't really
>> support being reinitialized once booted (for various reasons). It's a
>> (intended) limitation of the hardware design that you can only
>> initialize certain blocks once every power cycle, so the whole approach
>> actually will never work 100% reliable.
>>
>> All you can hope for is that stopping the hardware while shutting down
>> the old kernel and starting it again results in exactly the same
>> hardware parameters (offsets, clock etc...) otherwise starting the
>> blocks will just fail and you end up with disabled acceleration like above.
>>
>> Sorry, but there isn't much we can do about this,
> I've tested this further and it turned out that if I revert commit
> f5d9b7f0f9 on top of my "drm/radeon: Implement radeon_pci_shutdown"
> patch, the initialization failures seem to go away completely.
>
> Any idea what's going on?

Well DPM is mostly Alex domain, but if I have to guess I would say that 
the SCLK is gated by the hardware when the driver is unloaded and since 
DPM initialized only later not ungated when the driver loads again.

> Here's the patch:
>
> diff --git a/drivers/gpu/drm/radeon/r600_dpm.c b/drivers/gpu/drm/radeon/r600_dpm.c
> index fa0de46..4e8c1988 100644
> --- a/drivers/gpu/drm/radeon/r600_dpm.c
> +++ b/drivers/gpu/drm/radeon/r600_dpm.c
> @@ -296,9 +296,9 @@ bool r600_dynamicpm_enabled(struct radeon_device *rdev)
>   void r600_enable_sclk_control(struct radeon_device *rdev, bool enable)
>   {
>   	if (enable)
> -		WREG32_P(SCLK_PWRMGT_CNTL, 0, ~SCLK_PWRMGT_OFF);
> +		WREG32_P(GENERAL_PWRMGT, 0, ~SCLK_PWRMGT_OFF);
>   	else
> -		WREG32_P(SCLK_PWRMGT_CNTL, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
> +		WREG32_P(GENERAL_PWRMGT, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
>   }
>   
>   void r600_enable_mclk_control(struct radeon_device *rdev, bool enable)

The patch just breaks SCLK gating on R6xx again, so no gain here.

Christian.
Alex Deucher Sept. 11, 2013, 1:30 p.m. UTC | #2
On Wed, Sep 11, 2013 at 5:01 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2013.09.09 at 11:38 +0200, Christian König wrote:
>> Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:
>> > On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:
>> >> Markus Trippelsdorf <markus@trippelsdorf.de> writes:
>> >>
>> >>> Here are a couple of patches that get kexec working with radeon devices.
>> >>> I've tested this on my RS780.
>> >>> Comments or flames are welcome.
>> >>> Thanks.
>> >> A couple of high level comments.
>> >>
>> >> This looks promising for the usual case.
>> >>
>> >> Removing the printk at the end of the kexec path seems a little dubious,
>> >> what of other cpus, interrupt handlers, etc.  Basically estabilishing a
>> >> new rule on when printk is allowed seems a little dubious at this point,
>> >> even if it is a useful debugging trick.
>> > OK. I will drop this patch. It doesn't seem to be necessary, because I
>> > cannot reproduce the printk related hang anymore.
>> >
>> >> Having a clean shutdown of the radeon definitely seems worth doing,
>> >> because the cases where we care abouty video are when a person is in
>> >> front of the system.
>> > Yes. But please note that even with radeon_pci_shutdown implemented, I
>> > still get ring test failures on roughly every eighth kexec boot:
>> >
>> >   [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
>> >   radeon 0000:01:05.0: disabling GPU acceleration
>> >
>> > That's definitely better than the current state of affairs, with ring
>> > test failures on every second boot. But I haven't figured out the reason
>> > for these failures yet. It's curious that once a ring test failure
>> > occurs, it will reliably fail after each kexec invocation, no matter how
>> > often repeated. Only a reboot brings the machine back to normal.
>>
>> The main problem here is that the AMD gfx hardware doesn't really
>> support being reinitialized once booted (for various reasons). It's a
>> (intended) limitation of the hardware design that you can only
>> initialize certain blocks once every power cycle, so the whole approach
>> actually will never work 100% reliable.
>>
>> All you can hope for is that stopping the hardware while shutting down
>> the old kernel and starting it again results in exactly the same
>> hardware parameters (offsets, clock etc...) otherwise starting the
>> blocks will just fail and you end up with disabled acceleration like above.
>>
>> Sorry, but there isn't much we can do about this,
>
> I've tested this further and it turned out that if I revert commit
> f5d9b7f0f9 on top of my "drm/radeon: Implement radeon_pci_shutdown"
> patch, the initialization failures seem to go away completely.
>
> Any idea what's going on?
>

You are disabling dynamic power management with that patch reverted.
The patch fixed a copy paste typo in the register.  Bit 0
(SCLK_PWRMGT_OFF) of register SCLK_PWRMGT_CNTL controls whether
dynamic engine clock control is enabled.  Bit 0 (GLOBAL_PWRMGT_EN) of
register GENERAL_PWRMGT controls whether dynamic power management
(dynamic engine/memory/voltage, controls etc.) is enabled at all.

Alex

> Here's the patch:
>
> diff --git a/drivers/gpu/drm/radeon/r600_dpm.c b/drivers/gpu/drm/radeon/r600_dpm.c
> index fa0de46..4e8c1988 100644
> --- a/drivers/gpu/drm/radeon/r600_dpm.c
> +++ b/drivers/gpu/drm/radeon/r600_dpm.c
> @@ -296,9 +296,9 @@ bool r600_dynamicpm_enabled(struct radeon_device *rdev)
>  void r600_enable_sclk_control(struct radeon_device *rdev, bool enable)
>  {
>         if (enable)
> -               WREG32_P(SCLK_PWRMGT_CNTL, 0, ~SCLK_PWRMGT_OFF);
> +               WREG32_P(GENERAL_PWRMGT, 0, ~SCLK_PWRMGT_OFF);
>         else
> -               WREG32_P(SCLK_PWRMGT_CNTL, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
> +               WREG32_P(GENERAL_PWRMGT, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
>  }
>
>  void r600_enable_mclk_control(struct radeon_device *rdev, bool enable)
>
>
> --
> Markus
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
diff mbox

Patch

diff --git a/drivers/gpu/drm/radeon/r600_dpm.c b/drivers/gpu/drm/radeon/r600_dpm.c
index fa0de46..4e8c1988 100644
--- a/drivers/gpu/drm/radeon/r600_dpm.c
+++ b/drivers/gpu/drm/radeon/r600_dpm.c
@@ -296,9 +296,9 @@  bool r600_dynamicpm_enabled(struct radeon_device *rdev)
 void r600_enable_sclk_control(struct radeon_device *rdev, bool enable)
 {
 	if (enable)
-		WREG32_P(SCLK_PWRMGT_CNTL, 0, ~SCLK_PWRMGT_OFF);
+		WREG32_P(GENERAL_PWRMGT, 0, ~SCLK_PWRMGT_OFF);
 	else
-		WREG32_P(SCLK_PWRMGT_CNTL, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
+		WREG32_P(GENERAL_PWRMGT, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
 }
 
 void r600_enable_mclk_control(struct radeon_device *rdev, bool enable)