diff mbox

Looking for a start point for fixing a bug

Message ID CAH1Y=qqccf7WmfQpwD5oVTeM_sJaKb1AQrXnoWsh=kmkyYekjg@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

?????? ?????? Aug. 11, 2014, 4:03 a.m. UTC
2014-08-11 1:53 GMT+04:00 Niels Ole Salscheider
<niels_ole@salscheider-online.de>:
> On Monday 11 August 2014, 01:19:32, ?????? ?????? wrote:
>> Hello again, hope you are still reading my texts...
>>
>> I digged through the code and narrowed down the issue I wanted to fix.
>> It appears to be related to the `bool thermal_active` dpm struct
>> member and this piece of code:
>>
>> if (rdev->asic->dpm.force_performance_level) {
>>         if (rdev->pm.dpm.thermal_active) {
>>             enum radeon_dpm_forced_level level = rdev->pm.dpm.forced_level;
>>             /* force low perf level for thermal */
>>             radeon_dpm_force_performance_level(rdev,
>> RADEON_DPM_FORCED_LEVEL_LOW);
>>             /* save the user's level */
>>             rdev->pm.dpm.forced_level = level;
>>         } else {
>>             /* otherwise, user selected level */
>>             radeon_dpm_force_performance_level(rdev,
>> rdev->pm.dpm.forced_level); }
>>     }
>>
>> I did a double check here - at boot `thermal_active` is `false` and
>> thus, performance level is properly initiated. But at resume from
>> suspend `thermal_active` is true and performance level is strictly
>> bound to low profile.
>> Besides you cannot change it via echo 1 > /sys/.../force_dpm_level,
>> again thanks to `thermal_active` checked there.
>>
>> Could you explain meaning of this small boolean to me? I'd like to
>> make a small neat patch fixing this, but I'm scared of doing it in
>> wrong way.
>> Sorry if I'm being too persistent.
>
> I think thermal_active means that the temperature got too high so that low
> clocks have to be used.
>
> Just some idea, but thermal.work only gets scheduled when the high to low
> temperature interrupt occurs. When the temperature is too high before suspend
> (so that thermal_active is true) and it gets low during standby this interrupt
> will not occur. thermal.work is therefore not scheduled...
>
> Ole
>

You were right, Ole. The driver thinks the temperature is too high.
Thanks a lot!
It seems the function ci_set_thermal_temperature_range is missing some lines:




All other similar callbacks for different families of cards have these
lines. I wonder if there is any specific case for not doing this...
How do I propose it as a patch anyway?

>> Thanks,
>> Oleg
>>
>> 2014-07-22 20:05 GMT+04:00 Alex Deucher <alexdeucher@gmail.com>:
>> > On Tue, Jul 22, 2014 at 8:39 AM, ?????? ?????? <algonkvel@gmail.com>
> wrote:
>> >> Hello all!
>> >>
>> >> I have some spare time and knowledge in C to try to fix some bugs I am
>> >> seeing on my machine.
>> >> So I've checked out and compiled all git trees that I may need and now
>> >> I'm
>> >> beginning to read articles.
>> >>
>> >> And this is the point from where I don't know where to go. I want to fix
>> >> particular bug #79806 [1].
>> >> For me there are many places where this bug can hide - mesa? dri? radeon
>> >> kernel module? and I just don't know whether should I start reading
>> >> articles about mesa hacking or about dri architecture or about kernel
>> >> module development.
>> >>
>> >> Now I think the best thing for me is to start looking through radeon
>> >> kernel
>> >> module code (I've got ingenious idea that power management resides there)
>> >> and read more about its architecture. Is this right? I mean, I just want
>> >> to
>> >> find out, is this a right place to start looking at for this bug?
>> >
>> > The power management is handled in the kernel driver.  See radeon_pm.c
>> > and the relevant *_dpm.c files depending on what asic you have.
>> >
>> > Alex
>> >
>> >> P.S. Sorry for my English in case it's not good, I'm learning it now
>> >>
>> >> P.P.S. And thanks for your hard work!
>> >>
>> >> -------------------------------------------
>> >> [1] https://bugs.freedesktop.org/show_bug.cgi?id=79806
>> >>
>> >> _______________________________________________
>> >> dri-devel mailing list
>> >> dri-devel@lists.freedesktop.org
>> >> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>

Comments

Niels Ole Salscheider Aug. 11, 2014, 6:50 a.m. UTC | #1
On Monday 11 August 2014, 08:03:57, ?????? ?????? wrote:
> 2014-08-11 1:53 GMT+04:00 Niels Ole Salscheider
> 
> <niels_ole@salscheider-online.de>:
> > On Monday 11 August 2014, 01:19:32, ?????? ?????? wrote:
> >> Hello again, hope you are still reading my texts...
> >> 
> >> I digged through the code and narrowed down the issue I wanted to fix.
> >> It appears to be related to the `bool thermal_active` dpm struct
> >> member and this piece of code:
> >> 
> >> if (rdev->asic->dpm.force_performance_level) {
> >> 
> >>         if (rdev->pm.dpm.thermal_active) {
> >>         
> >>             enum radeon_dpm_forced_level level =
> >>             rdev->pm.dpm.forced_level;
> >>             /* force low perf level for thermal */
> >>             radeon_dpm_force_performance_level(rdev,
> >> 
> >> RADEON_DPM_FORCED_LEVEL_LOW);
> >> 
> >>             /* save the user's level */
> >>             rdev->pm.dpm.forced_level = level;
> >>         
> >>         } else {
> >>         
> >>             /* otherwise, user selected level */
> >>             radeon_dpm_force_performance_level(rdev,
> >> 
> >> rdev->pm.dpm.forced_level); }
> >> 
> >>     }
> >> 
> >> I did a double check here - at boot `thermal_active` is `false` and
> >> thus, performance level is properly initiated. But at resume from
> >> suspend `thermal_active` is true and performance level is strictly
> >> bound to low profile.
> >> Besides you cannot change it via echo 1 > /sys/.../force_dpm_level,
> >> again thanks to `thermal_active` checked there.
> >> 
> >> Could you explain meaning of this small boolean to me? I'd like to
> >> make a small neat patch fixing this, but I'm scared of doing it in
> >> wrong way.
> >> Sorry if I'm being too persistent.
> > 
> > I think thermal_active means that the temperature got too high so that low
> > clocks have to be used.
> > 
> > Just some idea, but thermal.work only gets scheduled when the high to low
> > temperature interrupt occurs. When the temperature is too high before
> > suspend (so that thermal_active is true) and it gets low during standby
> > this interrupt will not occur. thermal.work is therefore not scheduled...
> > 
> > Ole
> 
> You were right, Ole. The driver thinks the temperature is too high.
> Thanks a lot!
> It seems the function ci_set_thermal_temperature_range is missing some
> lines:
> 
> 
> diff --git a/drivers/gpu/drm/radeon/ci_dpm.c
> b/drivers/gpu/drm/radeon/ci_dpm.c index 584090a..102a4bc 100644
> --- a/radeon/ci_dpm.c
> +++ b/radeon/ci_dpm.c
> @@ -869,6 +869,9 @@ static int ci_set_thermal_temperature_range(struct
> radeon_device *rdev,
>         WREG32_SMC(CG_THERMAL_CTRL, tmp);
>  #endif
> 
> +    rdev->pm.dpm.thermal.min_temp = low_temp;
> +    rdev->pm.dpm.thermal.max_temp = high_temp;
> +
>         return 0;
>  }
> 
> 
> All other similar callbacks for different families of cards have these
> lines. I wonder if there is any specific case for not doing this...

This seems to be wrong indeed. I wonder why it happens after a suspend - 
resume cycle, though. Does the hardware generate a corresponding interrupt 
after resume?

There is however still an XXX comment in that function... Maybe Alex can 
comment on that.

> How do I propose it as a patch anyway?

Fix it in your local git checkout, commit it (don't forget to pass -s to get 
your signed-of-by line) and use git format-patch / git send-email to send it 
to this list...

> >> Thanks,
> >> Oleg
> >> 
> >> 2014-07-22 20:05 GMT+04:00 Alex Deucher <alexdeucher@gmail.com>:
> >> > On Tue, Jul 22, 2014 at 8:39 AM, ?????? ?????? <algonkvel@gmail.com>
> > 
> > wrote:
> >> >> Hello all!
> >> >> 
> >> >> I have some spare time and knowledge in C to try to fix some bugs I am
> >> >> seeing on my machine.
> >> >> So I've checked out and compiled all git trees that I may need and now
> >> >> I'm
> >> >> beginning to read articles.
> >> >> 
> >> >> And this is the point from where I don't know where to go. I want to
> >> >> fix
> >> >> particular bug #79806 [1].
> >> >> For me there are many places where this bug can hide - mesa? dri?
> >> >> radeon
> >> >> kernel module? and I just don't know whether should I start reading
> >> >> articles about mesa hacking or about dri architecture or about kernel
> >> >> module development.
> >> >> 
> >> >> Now I think the best thing for me is to start looking through radeon
> >> >> kernel
> >> >> module code (I've got ingenious idea that power management resides
> >> >> there)
> >> >> and read more about its architecture. Is this right? I mean, I just
> >> >> want
> >> >> to
> >> >> find out, is this a right place to start looking at for this bug?
> >> > 
> >> > The power management is handled in the kernel driver.  See radeon_pm.c
> >> > and the relevant *_dpm.c files depending on what asic you have.
> >> > 
> >> > Alex
> >> > 
> >> >> P.S. Sorry for my English in case it's not good, I'm learning it now
> >> >> 
> >> >> P.P.S. And thanks for your hard work!
> >> >> 
> >> >> -------------------------------------------
> >> >> [1] https://bugs.freedesktop.org/show_bug.cgi?id=79806
> >> >> 
> >> >> _______________________________________________
> >> >> dri-devel mailing list
> >> >> dri-devel@lists.freedesktop.org
> >> >> http://lists.freedesktop.org/mailman/listinfo/dri-devel
> >> 
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel@lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/dri-devel
Alex Deucher Aug. 11, 2014, 2:49 p.m. UTC | #2
On Mon, Aug 11, 2014 at 12:03 AM, ?????? ?????? <algonkvel@gmail.com> wrote:
> 2014-08-11 1:53 GMT+04:00 Niels Ole Salscheider
> <niels_ole@salscheider-online.de>:
>> On Monday 11 August 2014, 01:19:32, ?????? ?????? wrote:
>>> Hello again, hope you are still reading my texts...
>>>
>>> I digged through the code and narrowed down the issue I wanted to fix.
>>> It appears to be related to the `bool thermal_active` dpm struct
>>> member and this piece of code:
>>>
>>> if (rdev->asic->dpm.force_performance_level) {
>>>         if (rdev->pm.dpm.thermal_active) {
>>>             enum radeon_dpm_forced_level level = rdev->pm.dpm.forced_level;
>>>             /* force low perf level for thermal */
>>>             radeon_dpm_force_performance_level(rdev,
>>> RADEON_DPM_FORCED_LEVEL_LOW);
>>>             /* save the user's level */
>>>             rdev->pm.dpm.forced_level = level;
>>>         } else {
>>>             /* otherwise, user selected level */
>>>             radeon_dpm_force_performance_level(rdev,
>>> rdev->pm.dpm.forced_level); }
>>>     }
>>>
>>> I did a double check here - at boot `thermal_active` is `false` and
>>> thus, performance level is properly initiated. But at resume from
>>> suspend `thermal_active` is true and performance level is strictly
>>> bound to low profile.
>>> Besides you cannot change it via echo 1 > /sys/.../force_dpm_level,
>>> again thanks to `thermal_active` checked there.
>>>
>>> Could you explain meaning of this small boolean to me? I'd like to
>>> make a small neat patch fixing this, but I'm scared of doing it in
>>> wrong way.
>>> Sorry if I'm being too persistent.
>>
>> I think thermal_active means that the temperature got too high so that low
>> clocks have to be used.
>>
>> Just some idea, but thermal.work only gets scheduled when the high to low
>> temperature interrupt occurs. When the temperature is too high before suspend
>> (so that thermal_active is true) and it gets low during standby this interrupt
>> will not occur. thermal.work is therefore not scheduled...
>>
>> Ole
>>
>
> You were right, Ole. The driver thinks the temperature is too high.
> Thanks a lot!
> It seems the function ci_set_thermal_temperature_range is missing some lines:
>
>
> diff --git a/drivers/gpu/drm/radeon/ci_dpm.c b/drivers/gpu/drm/radeon/ci_dpm.c
> index 584090a..102a4bc 100644
> --- a/radeon/ci_dpm.c
> +++ b/radeon/ci_dpm.c
> @@ -869,6 +869,9 @@ static int ci_set_thermal_temperature_range(struct
> radeon_device *rdev,
>         WREG32_SMC(CG_THERMAL_CTRL, tmp);
>  #endif
>
> +    rdev->pm.dpm.thermal.min_temp = low_temp;
> +    rdev->pm.dpm.thermal.max_temp = high_temp;
> +
>         return 0;
>  }
>
>
> All other similar callbacks for different families of cards have these
> lines. I wonder if there is any specific case for not doing this...
> How do I propose it as a patch anyway?

You analysis is correct.  Thanks for tracking this down.

Alex

>
>>> Thanks,
>>> Oleg
>>>
>>> 2014-07-22 20:05 GMT+04:00 Alex Deucher <alexdeucher@gmail.com>:
>>> > On Tue, Jul 22, 2014 at 8:39 AM, ?????? ?????? <algonkvel@gmail.com>
>> wrote:
>>> >> Hello all!
>>> >>
>>> >> I have some spare time and knowledge in C to try to fix some bugs I am
>>> >> seeing on my machine.
>>> >> So I've checked out and compiled all git trees that I may need and now
>>> >> I'm
>>> >> beginning to read articles.
>>> >>
>>> >> And this is the point from where I don't know where to go. I want to fix
>>> >> particular bug #79806 [1].
>>> >> For me there are many places where this bug can hide - mesa? dri? radeon
>>> >> kernel module? and I just don't know whether should I start reading
>>> >> articles about mesa hacking or about dri architecture or about kernel
>>> >> module development.
>>> >>
>>> >> Now I think the best thing for me is to start looking through radeon
>>> >> kernel
>>> >> module code (I've got ingenious idea that power management resides there)
>>> >> and read more about its architecture. Is this right? I mean, I just want
>>> >> to
>>> >> find out, is this a right place to start looking at for this bug?
>>> >
>>> > The power management is handled in the kernel driver.  See radeon_pm.c
>>> > and the relevant *_dpm.c files depending on what asic you have.
>>> >
>>> > Alex
>>> >
>>> >> P.S. Sorry for my English in case it's not good, I'm learning it now
>>> >>
>>> >> P.P.S. And thanks for your hard work!
>>> >>
>>> >> -------------------------------------------
>>> >> [1] https://bugs.freedesktop.org/show_bug.cgi?id=79806
>>> >>
>>> >> _______________________________________________
>>> >> dri-devel mailing list
>>> >> dri-devel@lists.freedesktop.org
>>> >> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
diff mbox

Patch

diff --git a/drivers/gpu/drm/radeon/ci_dpm.c b/drivers/gpu/drm/radeon/ci_dpm.c
index 584090a..102a4bc 100644
--- a/radeon/ci_dpm.c
+++ b/radeon/ci_dpm.c
@@ -869,6 +869,9 @@  static int ci_set_thermal_temperature_range(struct
radeon_device *rdev,
        WREG32_SMC(CG_THERMAL_CTRL, tmp);
 #endif

+    rdev->pm.dpm.thermal.min_temp = low_temp;
+    rdev->pm.dpm.thermal.max_temp = high_temp;
+
        return 0;
 }