nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

Message ID 5135D375.9060006@free.fr (mailing list archive)
State New, archived

Commit Message

Martin Peres March 5, 2013, 11:13 a.m. UTC
On 04/03/2013 22:41, Konrad Rzeszutek Wilk wrote:
> Pls CC me in case you would like me also to test them with the mdelay 
> patch. 

Hi Konrad,

Marcin suggested another explanation for the issue you are seeing, and 
it made me look at the code again.

I don't have enough nv4x hardware to test all the conditions, but with the 
attached patches you may get saner behaviour than a computer that shuts 
down whenever you turn it on (like a "most useless machine ever").
The most important patch is the 8th one.

Please try applying them on top of your 3.9-rc1 kernel and send me back 
your kernel logs + sensors output.

Cheers,
Martin

PS: The attached patches are part of my current thermal-related queue. 
I'll post them to the list soon.
- http://gitorious.org/linux-nouveau-pm/linux-nouveau-pm/commits/thermal

Comments

Konrad Rzeszutek Wilk March 11, 2013, 12:38 p.m. UTC | #1
> With that I am still getting the issues (even with an insane delay of 100 seconds).
> Here is the serial log with various runs.

Any thoughts?
> [   13.523878] initcall init_sg+0x0/0x1000 [sg] returned 0 after 5355 usecs
> ^G^G[   13.621376] nouveau  [  PTHERM][0000:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
> [   13.630487] nouveau 39079] nouveau  [  PTHERM][0000:00:0d.0] Thermal management: automatic
> [   13.646028] nouveau  [  PTHERM][0000:00:0d.0] temperature (218 C) hit the 'downclock' threshold
> [   13.654702] nouveau  [  PTHERM][0000:00:0d.0] temperature (218 C) hit the 'critical' threshold
> [   13.663296] nouveau  [  PTHERM][0000:00:0d.0] temperature (218 C) hit the 'shutdown' threshold
> [   13.671992] [TTM] Zone  kernel: Available graphics memory: 1963774 kiB

Perhaps I've some insanely stupid BIOS?

Martin Peres March 11, 2013, 11 p.m. UTC | #2
On 11/03/2013 13:38, Konrad Rzeszutek Wilk wrote:
>> With that I am still getting the issues (even with an insane delay of 100 seconds).
>> Here is the serial log with various runs.
> Any thoughts?
Sorry for taking so long to answer, but I had the flu for a week and still 
had to do my research duties :s

Anyway, as a matter of fact, I do have some thoughts. If you don't mind, 
the tests I would like you to run are listed at the end of this message.
>> [   13.523878] initcall init_sg+0x0/0x1000 [sg] returned 0 after 5355 usecs
>> ^G^G[   13.621376] nouveau  [  PTHERM][0000:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
>> [   13.630487] nouveau 39079] nouveau  [  PTHERM][0000:00:0d.0] Thermal management: automatic
>> [   13.646028] nouveau  [  PTHERM][0000:00:0d.0] temperature (218 C) hit the 'downclock' threshold
>> [   13.654702] nouveau  [  PTHERM][0000:00:0d.0] temperature (218 C) hit the 'critical' threshold
>> [   13.663296] nouveau  [  PTHERM][0000:00:0d.0] temperature (218 C) hit the 'shutdown' threshold
>> [   13.671992] [TTM] Zone  kernel: Available graphics memory: 1963774 kiB
> Perhaps I've some insanely stupid BIOS?

So, first of all, I would indeed like to see your vbios, and I would also 
like to know the bitfields of some registers.

The easiest way to do both is to grab and compile envytools[0].

To grab your vbios, please do the following:
nvagetbios > nv4c_vbios.rom

To get the bitfield of the thermal-related regs:
nvascan 15b0 10 > nv4c_therm_scan

Please send me both of these files and I'll see what I can do.

Sorry again for the very late answer (I'm slowly getting better).

Martin

[0] https://github.com/pathscale/envytools

Martin Peres March 15, 2013, 3:48 p.m. UTC | #3
Hi everyone,

As a follow-up, Konrad sent me his vbios privately and the issue turned 
out to be trivial.
It behaved this way because his vbios didn't have sensor 
calibration values.
The fix is available here: 
http://gitorious.org/linux-nouveau-pm/linux-nouveau-pm/commit/59b4006b5b30828bbd094dffe3937333b43d1e12

This fix is part of a pull request I sent to Ben.
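
To illustrate why missing calibration values matter, here is a minimal sketch in C. It assumes a linear calibration (raw reading scaled by a slope, plus an offset) and it is not taken from the linked commit. The struct, its field names, and TEMP_UNAVAILABLE are hypothetical. The idea is simply to refuse to report a temperature when the vbios supplies no usable calibration, which would be consistent with the "internal sensor: no" line in the log after the fix.

```c
#include <stdbool.h>

/* Hypothetical calibration record; field names are illustrative only. */
struct sensor_calib {
	int slope_mult;
	int slope_div;
	int offset_num;
	int offset_den;
	int offset_constant;
};

/* Hypothetical sentinel meaning "do not trust this sensor". */
#define TEMP_UNAVAILABLE (-1)

/* A zero multiplier or divisor means no usable calibration was provided. */
bool calib_is_usable(const struct sensor_calib *c)
{
	return c->slope_mult != 0 && c->slope_div != 0 && c->offset_den != 0;
}

/*
 * Convert a raw sensor reading to degrees Celsius, assuming a linear
 * calibration. Without calibration the raw value is meaningless, so
 * report the sensor as unavailable instead of comparing garbage
 * against the thermal thresholds.
 */
int raw_to_celsius(int raw, const struct sensor_calib *c)
{
	int temp;

	if (!calib_is_usable(c))
		return TEMP_UNAVAILABLE;

	temp = raw * c->slope_mult / c->slope_div;
	temp += c->offset_num / c->offset_den;
	temp += c->offset_constant;

	/* Keep negative values reserved for errors. */
	if (temp < 0)
		temp = 0;

	return temp;
}
```

With a guard like this, a caller that gets TEMP_UNAVAILABLE can skip threshold checks entirely rather than triggering 'downclock', 'critical' and 'shutdown' on a bogus reading.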

Thanks again Konrad for reporting and testing the patches, I'll add you 
as a tester to this patch :)

Cheers,
Mupuf

PS: For the record, here is a forward of our private conversation.

-------- Original Message --------
Subject: 	Re: nouveau shuts the machine down with v3.9-rc1 (temperature 
(72 C) hit the 'shutdown' threshold).
Date: 	Fri, 15 Mar 2013 11:16:17 -0400
From: 	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: 	Martin Peres <martin.peres@free.fr>


On Fri, Mar 15, 2013 at 02:30:44AM +0100, Martin Peres wrote:
> On 13/03/2013 03:20, Konrad Rzeszutek Wilk wrote:
> >>Ah ah, what challenge? The reason why the temperature is messed up
> >>is ... trivial.
> >>
> >>Will send a patch for that!
> >Heh. Pls CC me so I can test it and add the Tested-by flag:
> >>Thanks for reporting the bug!
> >Of course.
> >>Martin
> Hey Konrad,
>
> Here are the thermal patches I sent to Ben Skeggs for review. The
> patch that should solve your problem is the patch 6.
>
> Let me know if it solves your issue (that I managed to reproduce by
> faking a different vbios).
>

> dmesg | grep nou
[   12.177930] calling  nouveau_drm_init+0x0/0x1000 [nouveau] @ 1488
[   12.330206] nouveau 0000:00:0d.0: setting latency timer to 64
[   12.353307] nouveau  [  DEVICE][0000:00:0d.0] BOOT0  : 0x04c000a2
[   12.359398] nouveau  [  DEVICE][0000:00:0d.0] Chipset: C61 (NV4C)
[   12.365477] nouveau  [  DEVICE][0000:00:0d.0] Family : NV40
[   12.371621] nouveau  [   VBIOS][0000:00:0d.0] checking PRAMIN for image...
[   12.416327] nouveau  [   VBIOS][0000:00:0d.0] ... appears to be valid
[   12.422758] nouveau  [   VBIOS][0000:00:0d.0] using image from PRAMIN
[   12.429324] nouveau  [   VBIOS][0000:00:0d.0] BIT signature found
[   12.429326] nouveau  [   VBIOS][0000:00:0d.0] version 05.61.32.22.01
[   12.443160] nouveau  [     PFB][0000:00:0d.0] RAM type: unknown
[   12.443161] nouveau  [     PFB][0000:00:0d.0] RAM size: 128 MiB
[   12.443162] nouveau  [     PFB][0000:00:0d.0]    ZCOMP: 0 tags
[   12.507777] nouveau  [  PTHERM][0000:00:0d.0] FAN control: none / external
[   12.514647] nouveau  [  PTHERM][0000:00:0d.0] fan management: disabled
[   12.521161] nouveau  [  PTHERM][0000:00:0d.0] internal sensor: no
[   12.547272] nouveau  [  PTHERM][0000:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
[   12.573758] nouveau  [     DRM] VRAM: 125 MiB
[   12.579153] nouveau  [     DRM] GART: 512 MiB
[   12.584887] nouveau  [     DRM] TMDS table version 1.1
[   12.590018] nouveau  [     DRM] DCB version 3.0
[   12.594555] nouveau  [     DRM] DCB outp 00: 01000310 00000023
[   12.601754] nouveau  [     DRM] DCB outp 01: 00110204 97e50000
[   12.607585] nouveau  [     DRM] DCB conn 00: 0000
[   12.612424] nouveau  [     DRM] Saving VGA fonts
[   12.656034] nouveau W[     DRM] DCB type 4 not known
[   12.660991] nouveau W[     DRM] Unknown-1 has no encoders, removing
[   12.681157] nouveau  [     DRM] 1 available performance level(s)
[   12.687714] nouveau  [     DRM] 0: core 425MHz shader 425MHz fanspeed 100%
[   12.694575] nouveau  [     DRM] c:
[   12.699270] nouveau  [     DRM] MM: using M2MF for buffer copies
[   12.738742] nouveau 0000:00:0d.0: No connectors reported connected with modes
[   12.752063] nouveau  [     DRM] allocated 1024x768 fb: 0x9000, bo ffff88012dffbc00
[   12.763397] fbcon: nouveaufb (fb0) is primary device
[   12.780410] nouveau 0000:00:0d.0: fb0: nouveaufb frame buffer device
[   12.786754] nouveau 0000:00:0d.0: registered panic notifier
[   12.792330] [drm] Initialized nouveau 1.1.0 20120801 for 0000:00:0d.0 on minor 0
[   12.800071] initcall nouveau_drm_init+0x0/0x1000 [nouveau] returned 0 after 602409 usecs


and no poweroffs :-)

So definitely Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
for all of the patches.

Thanks!
> Cheers,
> Martin

Rafał Miłecki March 22, 2013, 4:55 p.m. UTC | #4
2013/3/15 Martin Peres <martin.peres@free.fr>
> As a follow up, Konrad sent me in private his vbios and the issue turned out to be trivial.
> The reason why it behaved this way was that his vbios didn't have sensor calibration values.
> The fix is available here: http://gitorious.org/linux-nouveau-pm/linux-nouveau-pm/commit/59b4006b5b30828bbd094dffe3937333b43d1e12
>
> This fix is part of a pull request I sent to Ben.
>
> Thanks again Konrad for reporting and testing the patches, I'll add you as a tester to this patch :)

Thanks guys for debugging, analyzing, and fixing this. I got the same problem on
00:05.0 VGA compatible controller [0300]: NVIDIA Corporation C51G
[GeForce 6100] [10de:0242] (rev a2)
and now it's fixed.

It seems it wasn't just one single BIOS like that in the world ;)

--
Rafał
8698080ee092bdbd6ee2cd5e7f707ceea2812bd8
Merge branch 'drm-nouveau-fixes-3.9' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-next
Regression fixes and oops fixes for nouveau.
[   76.082597] nouveau  [  DEVICE][0000:00:05.0] BOOT0  : 0x04e000a2
[   76.082605] nouveau  [  DEVICE][0000:00:05.0] Chipset: C51 (NV4E)
[   76.082609] nouveau  [  DEVICE][0000:00:05.0] Family : NV40
[   76.084534] nouveau  [   VBIOS][0000:00:05.0] checking PRAMIN for image...
[   76.125409] nouveau  [   VBIOS][0000:00:05.0] ... appears to be valid
[   76.125418] nouveau  [   VBIOS][0000:00:05.0] using image from PRAMIN
[   76.125658] nouveau  [   VBIOS][0000:00:05.0] BIT signature found
[   76.125663] nouveau  [   VBIOS][0000:00:05.0] version 05.51.22.28.10
[   76.128699] nouveau  [     PFB][0000:00:05.0] RAM type: stolen system memory
[   76.128708] nouveau  [     PFB][0000:00:05.0] RAM size: 64 MiB
[   76.128711] nouveau  [     PFB][0000:00:05.0]    ZCOMP: 0 tags
[   76.781036] nouveau  [  PTHERM][0000:00:05.0] FAN control: none / external
[   76.781053] nouveau  [  PTHERM][0000:00:05.0] Thermal management: disabled
[   76.781057] nouveau  [  PTHERM][0000:00:05.0] internal sensor: yes
[   76.791261] nouveau  [  PTHERM][0000:00:05.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
[   76.791267] nouveau  [  PTHERM][0000:00:05.0] temperature (154 C) hit the 'fanboost' threshold
[   76.791271] nouveau  [  PTHERM][0000:00:05.0] Thermal management: automatic
[   76.791277] nouveau  [  PTHERM][0000:00:05.0] temperature (154 C) hit the 'downclock' threshold
[   76.791281] nouveau  [  PTHERM][0000:00:05.0] temperature (154 C) hit the 'critical' threshold
[   76.791285] nouveau  [  PTHERM][0000:00:05.0] temperature (154 C) hit the 'shutdown' threshold

cf9a625fae3d0ce8dffab53b2758d7c0cf4a5ad4
Merge branch 'drm-nouveau-fixes-3.9' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-next
Lots of thermal fixes and fix a lockdep warning we've been seeing.
[   55.668598] nouveau  [  DEVICE][0000:00:05.0] BOOT0  : 0x04e000a2
[   55.668606] nouveau  [  DEVICE][0000:00:05.0] Chipset: C51 (NV4E)
[   55.668609] nouveau  [  DEVICE][0000:00:05.0] Family : NV40
[   55.670533] nouveau  [   VBIOS][0000:00:05.0] checking PRAMIN for image...
[   55.711390] nouveau  [   VBIOS][0000:00:05.0] ... appears to be valid
[   55.711399] nouveau  [   VBIOS][0000:00:05.0] using image from PRAMIN
[   55.711639] nouveau  [   VBIOS][0000:00:05.0] BIT signature found
[   55.711644] nouveau  [   VBIOS][0000:00:05.0] version 05.51.22.28.10
[   55.714712] nouveau  [     PFB][0000:00:05.0] RAM type: stolen system memory
[   55.714721] nouveau  [     PFB][0000:00:05.0] RAM size: 64 MiB
[   55.714724] nouveau  [     PFB][0000:00:05.0]    ZCOMP: 0 tags
[   56.367033] nouveau  [  PTHERM][0000:00:05.0] FAN control: none / external
[   56.367052] nouveau  [  PTHERM][0000:00:05.0] fan management: disabled
[   56.367056] nouveau  [  PTHERM][0000:00:05.0] internal sensor: no
[   56.387298] nouveau  [  PTHERM][0000:00:05.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]

Patch

From 60dce3447342d7bb1122e90c3f0aa63573e0a9b4 Mon Sep 17 00:00:00 2001
From: Martin Peres <martin.peres@labri.fr>
Date: Tue, 5 Mar 2013 10:38:37 +0100
Subject: [PATCH 8/8] drm/nv40/therm: <DO NOT PUSH> move nv4c to the newer
 temperature-reading style

This is a guess made by joi and that may quite likely be true
---
 drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
index d546ada..2b24667 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
@@ -41,13 +41,13 @@  nv40_is_older_style_sensor(struct nouveau_therm *therm)
 	case 0x44:
 	case 0x4a:
 	case 0x47:
+	case 0x4c:
 		return OLD_STYLE;
 
 	case 0x46:
 	case 0x49:
 	case 0x4b:
 	case 0x4e:
-	case 0x4c:
 	case 0x67:
 	case 0x68:
 	case 0x63:
-- 
1.8.1.5
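
For readers following the hunk, here is a standalone sketch of the chipset classification as it would look with this patch applied. It includes only the cases visible in the diff context. The enum, the function name, the NEW_STYLE return for the second group, and the default branch are assumptions, since the hunk does not show them. Note that, as shown, the hunk places 0x4c in the group that returns OLD_STYLE.

```c
/* Hypothetical enum; only OLD_STYLE appears in the hunk itself. */
enum sensor_style {
	INVALID_STYLE = -1,
	OLD_STYLE = 0,
	NEW_STYLE = 1,
};

/* Classify a chipset id using only the cases visible in the diff. */
enum sensor_style chipset_sensor_style(int chipset)
{
	switch (chipset) {
	case 0x44:
	case 0x4a:
	case 0x47:
	case 0x4c: /* added to this group by the patch */
		return OLD_STYLE;

	case 0x46:
	case 0x49:
	case 0x4b:
	case 0x4e:
	case 0x67:
	case 0x68:
	case 0x63:
		/* Assumption: the return for this group is outside the hunk. */
		return NEW_STYLE;

	default:
		/* Assumption: unknown chipsets are treated as unsupported. */
		return INVALID_STYLE;
	}
}
```

Because the patch is marked <DO NOT PUSH> and described as a guess, this sketch only shows the experiment being tested, not a settled classification.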