Message ID | alpine.LRH.2.02.1402261551010.9913@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi, now testing 3.14-rc7 here (r128 hardware rather than MGA), and I seem to still be experiencing the same or very similar crash as you here: agpgart-intel 0000:00:00.0: AGP 2.0 bridge agpgart-intel 0000:00:00.0: putting AGP V2 device into 4x mode pci 0000:01:00.0: putting AGP V2 device into 4x mode Registering platform device 'r128_cce.0'. Parent at platform device: 'r128_cce.0': device_add bus: 'platform': add device r128_cce.0 PM: Adding info for platform:r128_cce.0 __allocate_fw_buf: fw-r128/r128_cce.bin buf=dd9ec800 platform r128_cce.0: firmware: direct-loading firmware r128/r128_cce.bin fw_set_page_data: fw-r128/r128_cce.bin buf=dd9ec800 data=e07f8000 size=2048 bus: 'platform': remove device r128_cce.0 PM: Removing info for platform:r128_cce.0 fw_name_devm_release: fw_name-r128/r128_cce.bin devm-dd9ccfcc released __fw_free_buf: fw-r128/r128_cce.bin buf=dd9ec800 data=e07f8000 size=2048 evbug: Event. Dev: input7, Type: 2, Code: 0, Value: 1 evbug: Event. Dev: input7, Type: 2, Code: 1, Value: 1 evbug: Event. Dev: input7, Type: 0, Code: 0, Value: 0 evbug: Event. Dev: input7, Type: 2, Code: 0, Value: 2 evbug: Event. Dev: input7, Type: 0, Code: 0, Value: 0 BUG: unable to handle kernel paging request at e07f0040 IP: [<e293fdb8>] r128_driver_irq_uninstall+0x18/0x1d [r128] *pde = 1f414067 *pte = 00000000 Oops: 0002 [#1] Modules linked in: lp r128 drm uinput nls_iso8859_1 nls_cp437 vfat fat radeonfb cfbfillrect cfbimgblt cfbcopyarea i2c_algo_bit fb_ddc i2c_core fb fbdev ppdev lo op fuse firewire_sbp2 mcs7830 usbnet usb_storage mii iTCO_wdt iTCO_vendor_suppor t snd_maestro3 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq _midi snd_rawmidi snd_seq_oss pcmcia snd_seq_midi_event snd_seq snd_seq_device s nd_timer microcode firewire_ohci dell_laptop sg dcdbas yenta_socket snd firewire _core sr_mod pcmcia_rsrc psmouse crc_itu_t cdrom pcmcia_core pcspkr video backli ght evbug evdev uhci_hcd floppy rtc_cmos ehci_hcd intel_agp intel_gtt usbcore us b_common lpc_ich mfd_core CPU: 0 PID: 4674 Comm: Xorg Not tainted 3.14.0-rc7+ #9 Hardware name: Dell Computer Corporation Inspiron 8000 /Inspir on 8000 , BIOS A23 01/21/2004 task: ded082f0 ti: da6c4000 task.ti: da6c4000 EIP: 0060:[<e293fdb8>] EFLAGS: 00213246 CPU: 0 EIP is at r128_driver_irq_uninstall+0x18/0x1d [r128] EAX: 00000000 EBX: dd9eb400 ECX: 00000000 EDX: e07f0000 ESI: 00000001 EDI: dd9ccd40 EBP: da6c5d48 ESP: da6c5d48 DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068 CR0: 80050033 CR2: e07f0040 CR3: 1da53000 CR4: 000007d0 Stack: da6c5d78 e284dc79 00000000 e28696c5 e285fddc e28696bd 0000000b 01cc54c0 00203202 dd9eb400 dd9eb400 00000000 da6c5d90 e293b163 dd9ec8e0 dd9ec8e8 dd9eb400 dd9eb400 da6c5d98 e293fc9c da6c5dc0 e284c542 00000001 e2869575 Call Trace: [<e284dc79>] drm_irq_uninstall+0x119/0x13b [drm] [<e293b163>] r128_do_cleanup_cce+0x15/0xb3 [r128] [<e293fc9c>] r128_driver_lastclose+0x8/0xa [r128] [<e284c542>] drm_lastclose+0x40/0x143 [drm] [<e284ca37>] drm_release+0x3f2/0x419 [drm] [<c10b4c07>] __fput+0xca/0x185 [<c10b4ce8>] ____fput+0x8/0xa [<c103c213>] task_work_run+0x4f/0x60 [<c102a4fc>] do_exit+0x27f/0x6bb [<c1032bc0>] ? __sigqueue_free+0x2c/0x2f [<c102b4df>] do_group_exit+0x2e/0x65 [<c1034963>] get_signal_to_deliver+0x420/0x45b [<c1033788>] ? __send_signal.constprop.34+0x15a/0x234 [<c10014a2>] do_signal+0x34/0x6d0 [<c1033fdf>] ? do_send_specific+0x4a/0x74 [<c1001b69>] do_notify_resume+0x2b/0x52 [<c12c5a33>] work_notifysig+0x24/0x29 Code: 50 10 b8 01 00 00 00 89 42 44 5d c3 55 31 c0 89 e5 5d c3 55 8b 80 e8 00 00 00 89 e5 85 c0 74 0e 8b 80 94 00 00 00 8b 50 10 31 c0 <89> 42 40 5d c3 55 ba f0 0c 94 e2 89 e5 b8 68 0d 94 e2 e8 55 15 EIP: [<e293fdb8>] r128_driver_irq_uninstall+0x18/0x1d [r128] SS:ESP 0068:da6c5d48 CR2: 00000000e07f0040 ---[ end trace 018ccfcd552fb6cf ]--- Fixing recursive fault but reboot is needed! device: '254:0': device_add Applying your (probably experimental?) posted patch (thanks for having done the necessary debugging work for me :) upon next boot made this dump go away, but I got greeted with a relatively similar: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013). [drm] No driver support for vblank timestamp query. [drm] Initialized r128 2.5.0 20030725 for 0000:01:00.0 on minor 0 agpgart-intel 0000:00:00.0: AGP 2.0 bridge agpgart-intel 0000:00:00.0: putting AGP V2 device into 4x mode pci 0000:01:00.0: putting AGP V2 device into 4x mode Registering platform device 'r128_cce.0'. Parent at platform device: 'r128_cce.0': device_add bus: 'platform': add device r128_cce.0 PM: Adding info for platform:r128_cce.0 __allocate_fw_buf: fw-r128/r128_cce.bin buf=dda02aa0 platform r128_cce.0: firmware: direct-loading firmware r128/r128_cce.bin fw_set_page_data: fw-r128/r128_cce.bin buf=dda02aa0 data=e080c000 size=2048 bus: 'platform': remove device r128_cce.0 PM: Removing info for platform:r128_cce.0 fw_name_devm_release: fw_name-r128/r128_cce.bin devm-dda289cc released __fw_free_buf: fw-r128/r128_cce.bin buf=dda02aa0 data=e080c000 size=2048 device class 'printer': registering lp: driver loaded but no devices found BUG: unable to handle kernel NULL pointer dereference at 00000058 IP: [<e2a11cab>] r128_get_vblank_counter+0xd/0x16 [r128] *pde = 00000000 Oops: 0000 [#1] Modules linked in: lp r128 drm uinput nls_iso8859_1 nls_cp437 vfat fat radeonfb cfbfillrect cfbimgblt cfbcopyarea i2c_algo_bit fb_ddc i2c_core fb fbdev ppdev loop fuse firewire_sbp2 mcs7830 usbnet mii usb_storage iTCO_wdt iTCO_vendor_support snd_maestro3 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_rawmidi microcode snd_seq_oss dell_laptop snd_seq_midi_event dcdbas snd_seq snd_seq_device pcmcia snd_timer sg firewire_ohci sr_mod psmouse pcspkr firewire_core yenta_socket crc_itu_t snd cdrom pcmcia_rsrc pcmcia_core video backlight rtc_cmos floppy uhci_hcd ehci_hcd evbug evdev intel_agp lpc_ich intel_gtt mfd_core usbcore usb_common CPU: 0 PID: 3842 Comm: Xorg Not tainted 3.14.0-rc7+ #9 Hardware name: Dell Computer Corporation Inspiron 8000 /Inspiron 8000 , BIOS A23 01/21/2004 task: df796230 ti: dd9ee000 task.ti: dd9ee000 EIP: 0060:[<e2a11cab>] EFLAGS: 00213046 CPU: 0 EIP is at r128_get_vblank_counter+0xd/0x16 [r128] EAX: 00000000 EBX: decd0c00 ECX: e2a12d68 EDX: 00000000 ESI: 00000001 EDI: dda28400 EBP: dd9efebc ESP: dd9efebc DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068 CR0: 80050033 CR2: 00000058 CR3: 1da04000 CR4: 000007d0 Stack: dd9efeec e29bbbe6 00000000 c0000000 bfc2f150 00000009 dd9efee4 00a0f964 00203202 dda02b00 decd0c00 00000000 dd9eff08 e29bd7bb 00000000 decd0c40 dd9a2cc0 decd0c00 00000000 dd9eff14 e29bd893 dd9a2c80 dd9eff54 e29ba9c6 Call Trace: [<e29bbbe6>] drm_irq_uninstall+0x86/0x126 [drm] [<e29bd7bb>] drm_master_destroy+0x31/0xee [drm] [<e29bd893>] drm_master_put+0x1b/0x1d [drm] [<e29ba9c6>] drm_release+0x381/0x419 [drm] [<c10b4c07>] __fput+0xca/0x185 [<c10b4ce8>] ____fput+0x8/0xa [<c103c213>] task_work_run+0x4f/0x60 [<c1001b8c>] do_notify_resume+0x4e/0x52 [<c12c5a33>] work_notifysig+0x24/0x29 Code: 10 8b 80 14 07 00 00 c7 43 48 00 00 00 00 83 c4 14 5b 5d c3 55 89 e5 e8 b2 b4 ff ff 5d c3 55 85 d2 8b 80 e8 00 00 00 89 e5 75 05 <8b> 40 58 eb 02 31 c0 5d c3 55 89 e5 53 8b 8a e8 00 00 00 89 d3 EIP: [<e2a11cab>] r128_get_vblank_counter+0xd/0x16 [r128] SS:ESP 0068:dd9efebc CR2: 0000000000000058 ---[ end trace 0180925feaa449a7 ]--- device: 'vcs9': device_add and non-responsive graphics mode. 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Rage Mobility 128 AGP 4X/Mobility M4 ii libdrm-dev 2.4.45-3 i386 Userspace interface to kernel DRM services -- development files ii libdrm-intel1:i386 2.4.45-3 i386 Userspace interface to intel-specific kernel DRM services -- runtime rc libdrm-nouveau1 2.4.21-1~squeeze3 i386 Userspace interface to nouveau-specific kernel DRM services -- runtime ii libdrm-nouveau2:i386 2.4.45-3 i386 Userspace interface to nouveau-specific kernel DRM services -- runtime ii libdrm-radeon1:i386 2.4.45-3 i386 Userspace interface to radeon-specific kernel DRM services -- runtime ii libdrm2:i386 2.4.45-3 i386 Userspace interface to kernel DRM services -- runtime So, "Hmm?" :) Thanks, Andreas Mohr
Hi, On Sun, Mar 23, 2014 at 12:43:17AM +0100, Andreas Mohr wrote: > Hi, > > now testing 3.14-rc7 here (r128 hardware rather than MGA), > and I seem to still be experiencing the same or very similar crash as you here: I decided to do some more experimentation: I added a Section "Module" Disable "dri" EndSection which did reliably disable dri according to 3.12.0-rc2+ xdpyinfo / Xorg.0.log On -rc7 however, this changed the issue from a kernel OOPS into merely a Xorg.0.log trace and abort (no such issue on working kernel given my userspace configuration!): [ 68.334] Backtrace: [ 68.335] 0: /usr/bin/X (xorg_backtrace+0x49) [0xb7767769] [ 68.335] 1: /usr/bin/X (0xb75ea000+0x181186) [0xb776b186] [ 68.335] 2: linux-gate.so.1 (__kernel_rt_sigreturn+0x0) [0xffffe40c] [ 68.335] 3: /usr/lib/i386-linux-gnu/libpixman-1.so.0 (0xb7430000+0x727c0) [0xb74a27c0] [ 68.335] 4: /usr/lib/i386-linux-gnu/libpixman-1.so.0 (0xb7430000+0x57abf) [0xb7487abf] [ 68.335] 5: /usr/lib/i386-linux-gnu/libpixman-1.so.0 (pixman_blt+0x7d) [0xb74367ad] [ 68.335] 6: /usr/lib/xorg/modules/libfb.so (fbCopyNtoN+0x2af) [0xb6de066f] [ 68.335] 7: /usr/bin/X (miCopyRegion+0x17c) [0xb7743f6c] [ 68.336] 8: /usr/bin/X (miDoCopy+0x4f0) [0xb77445a0] [ 68.336] 9: /usr/lib/xorg/modules/libfb.so (fbCopyArea+0x7e) [0xb6de089e] [ 68.336] 10: /usr/lib/xorg/modules/libxaa.so (0xb6f04000+0xacf3) [0xb6f0ecf3] [ 68.336] 11: /usr/lib/xorg/modules/libxaa.so (0xb6f04000+0x548d3) [0xb6f588d3] [ 68.336] 12: /usr/bin/X (0xb75ea000+0x10929d) [0xb76f329d] [ 68.336] 13: /usr/bin/X (0xb75ea000+0x15b60f) [0xb774560f] [ 68.336] 14: /usr/bin/X (0xb75ea000+0x16cd21) [0xb7756d21] [ 68.336] 15: /usr/bin/X (0xb75ea000+0x16d50c) [0xb775750c] [ 68.337] 16: /usr/bin/X (0xb75ea000+0xbe0b7) [0xb76a80b7] [ 68.337] 17: /usr/bin/X (miPointerUpdateSprite+0x2a7) [0xb7751687] [ 68.337] 18: /usr/bin/X (0xb75ea000+0x16791a) [0xb775191a] [ 68.337] 19: /usr/bin/X (0xb75ea000+0xcd1c4) [0xb76b71c4] [ 68.337] 20: /usr/bin/X (0xb75ea000+0x101bfe) [0xb76ebbfe] [ 68.337] 21: /usr/bin/X (0xb75ea000+0x4437b) [0xb762e37b] [ 68.337] 22: /usr/bin/X (WindowHasNewCursor+0x3b) [0xb762f6db] [ 68.337] 23: /usr/bin/X (ChangeWindowAttributes+0xb0c) [0xb7655e1c] [ 68.338] 24: /usr/bin/X (0xb75ea000+0x35da7) [0xb761fda7] [ 68.338] 25: /usr/bin/X (0xb75ea000+0x3c375) [0xb7626375] [ 68.338] 26: /usr/bin/X (0xb75ea000+0x29e95) [0xb7613e95] [ 68.338] 27: /lib/i386-linux-gnu/i686/cmov/libc.so.6 (__libc_start_main+0xf5) [0xb71ef8f5] [ 68.338] 28: /usr/bin/X (0xb75ea000+0x2a1e9) [0xb76141e9] [ 68.338] [ 68.338] Bus error at address 0xb4f1ed4e [ 68.338] Fatal server error: [ 68.339] Caught signal 7 (Bus error). Server aborting I then also decided to update libdrm package: i libdrm-dev:i386 2.4.52-1 i386 Userspace interface to kernel DRM services -- development files ii libdrm-intel1:i386 2.4.52-1 i386 Userspace interface to intel-specific kernel DRM services -- runtime rc libdrm-nouveau1 2.4.21-1~squeeze3 i386 Userspace interface to nouveau-specific kernel DRM services -- runtime ii libdrm-nouveau2:i386 2.4.52-1 i386 Userspace interface to nouveau-specific kernel DRM services -- runtime ii libdrm-radeon1:i386 2.4.52-1 i386 Userspace interface to radeon-specific kernel DRM services -- runtime ii libdrm2:i386 2.4.52-1 i386 Userspace interface to kernel DRM services -- runtime which did end up flawless on 3.12.0-rc2+, too (but failed to improve the issue on 3.14.0-rc7+). So, for all intents and purposes, drm infrastructure seems unavoidably (neither dri disable nor libdrm upgrade helps) affected. Does anyone know which change caused that issue? (I'm asking because bisect here would be relatively painful). Thanks, Andreas Mohr
On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote: > > which did end up flawless on 3.12.0-rc2+, too > (but failed to improve the issue on 3.14.0-rc7+). > > So, for all intents and purposes, drm infrastructure seems unavoidably > (neither dri disable nor libdrm upgrade helps) affected. > Does anyone know which change caused that issue? > (I'm asking because bisect here would be relatively painful). So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current 3.14 rc only, or did it happen already in the previous release? Linus
On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote: > On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote: > > > > which did end up flawless on 3.12.0-rc2+, too > > (but failed to improve the issue on 3.14.0-rc7+). > > > > So, for all intents and purposes, drm infrastructure seems unavoidably > > (neither dri disable nor libdrm upgrade helps) affected. > > Does anyone know which change caused that issue? > > (I'm asking because bisect here would be relatively painful). > > So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current > 3.14 rc only, or did it happen already in the previous release? Hmm, given that Mikulas in https://lkml.org/lkml/2014/2/26/537 offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!) that that indeed may have been a regression at <= 3.13 proper even (which may pose interesting questions about the level of testing coverage we still enjoy [not!?] in this hardware area). Oh well, seems I'll have to prepare/build 3.13 now... Thanks, Andreas Mohr
On Mon, Mar 24, 2014 at 7:27 AM, Andreas Mohr <andi@lisas.de> wrote: > On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote: >> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote: >> > >> > which did end up flawless on 3.12.0-rc2+, too >> > (but failed to improve the issue on 3.14.0-rc7+). >> > >> > So, for all intents and purposes, drm infrastructure seems unavoidably >> > (neither dri disable nor libdrm upgrade helps) affected. >> > Does anyone know which change caused that issue? >> > (I'm asking because bisect here would be relatively painful). >> >> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current >> 3.14 rc only, or did it happen already in the previous release? > > Hmm, given that Mikulas in > https://lkml.org/lkml/2014/2/26/537 > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!) > that that indeed may have been a regression at <= 3.13 proper even > (which may pose interesting questions about the level of testing coverage > we still enjoy [not!?] in this hardware area). > > Oh well, seems I'll have to prepare/build 3.13 now... It's > 15 year old hardware, so yes I believe we have close to 0 testing coverage on it outside of distros, I'm not even sure I have one anymore, I might be able to test an MGA in one box. Dave.
On Mon, Mar 24, 2014 at 07:45:47AM +1000, Dave Airlie wrote: > On Mon, Mar 24, 2014 at 7:27 AM, Andreas Mohr <andi@lisas.de> wrote: > > On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote: > >> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote: > >> > > >> > which did end up flawless on 3.12.0-rc2+, too > >> > (but failed to improve the issue on 3.14.0-rc7+). > >> > > >> > So, for all intents and purposes, drm infrastructure seems unavoidably > >> > (neither dri disable nor libdrm upgrade helps) affected. > >> > Does anyone know which change caused that issue? > >> > (I'm asking because bisect here would be relatively painful). > >> > >> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current > >> 3.14 rc only, or did it happen already in the previous release? > > > > Hmm, given that Mikulas in > > https://lkml.org/lkml/2014/2/26/537 > > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!) > > that that indeed may have been a regression at <= 3.13 proper even > > (which may pose interesting questions about the level of testing coverage > > we still enjoy [not!?] in this hardware area). > > > > Oh well, seems I'll have to prepare/build 3.13 now... > > It's > 15 year old hardware, so yes I believe we have close to 0 > testing coverage on it outside of distros, > > I'm not even sure I have one anymore, I might be able to test an MGA in one box. I haven't done a full read of all the related code, but this smells like a similar bug I've hit all over the place in the i810 driver (another one of those undead drm drivers of yonders). Ingredients: 1) Xorg creates a drm mapping of the register space. 2) Xorg tells the hw-specific drm which drm mapping has the hw registers, and the driver uses that. Iirc this has been done as some form of OS abstraction. Also note that these mappings aren't refcounted, so the first guy to call drm_rmmap wins. -> All hell breaks loose if Xorg dies and takes all it's mappings with it (in master_destroy, since the Xorg /dev fd is the master) and leaves the driver hanging in the air if there's an interrupt still pending (or anything else fwiw). In my case with i810 it was some subtle thing with error codes somewhere else (iirc, been a while) which made Xorg fall over a bit differently and so crash the kernel. Presuming this is the case I think we need a proper bisect here to figure out the root-cause and re-apply the lost duct-tape. "Properly" fixing the underlying bug for any sane definition of "proper" is impossible - the legacy drm driver model is just too broken for that. And for new-style drivers killing the irq support when they don't expect it is not cool, since those drivers are sane and assume full control over all hw interactions (not like their legacy bethren). -Daniel
On Mon, 24 Mar 2014, Daniel Vetter wrote: > On Mon, Mar 24, 2014 at 07:45:47AM +1000, Dave Airlie wrote: > > On Mon, Mar 24, 2014 at 7:27 AM, Andreas Mohr <andi@lisas.de> wrote: > > > On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote: > > >> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote: > > >> > > > >> > which did end up flawless on 3.12.0-rc2+, too > > >> > (but failed to improve the issue on 3.14.0-rc7+). > > >> > > > >> > So, for all intents and purposes, drm infrastructure seems unavoidably > > >> > (neither dri disable nor libdrm upgrade helps) affected. > > >> > Does anyone know which change caused that issue? > > >> > (I'm asking because bisect here would be relatively painful). > > >> > > >> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current > > >> 3.14 rc only, or did it happen already in the previous release? > > > > > > Hmm, given that Mikulas in > > > https://lkml.org/lkml/2014/2/26/537 > > > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!) > > > that that indeed may have been a regression at <= 3.13 proper even > > > (which may pose interesting questions about the level of testing coverage > > > we still enjoy [not!?] in this hardware area). That patch drops a mutex, so it is not correct. There is mutex resursion - we need to uninstall the irq in drm_master_destroy, because here we are committed to destroy the device. But the routine that uninstalls the irq takes struct_mutex, which is already held in drm_master_destroy. I suppose that the person who maintains drm reworks the patch so that it's correct: - could we use a different mutex to protect the irq in drm_irq.c? Or possibly no mutex at all and use cmpxchg to manipulate the variable dev->irq_enabled? - this seems like the best solution. But I am not sure if the code in drm_irq.c somehow depends on the facts that other parts of the drm subsystem take struct_mutex. - could we pass a new argument to drm_irq_uninstall that tells it not to take the mutex? drm_master_destroy would set this argument to 1. drm_master_destroy is mostly called with struct_mutex held, but there may be places in vmwgfx_drv.c where drm_master_put (which calls drm_master_destroy) may be called without struct_mutex held. Is it true that drm_master_destroy can be called without struct_mutex held? I don't know. I think drm maintainer should sort out the above issues and modify the patch accordingly. > > > Oh well, seems I'll have to prepare/build 3.13 now... > > > > It's > 15 year old hardware, so yes I believe we have close to 0 > > testing coverage on it outside of distros, > > > > I'm not even sure I have one anymore, I might be able to test an MGA in one box. > > I haven't done a full read of all the related code, but this smells like a > similar bug I've hit all over the place in the i810 driver (another one of > those undead drm drivers of yonders). Ingredients: > > 1) Xorg creates a drm mapping of the register space. > 2) Xorg tells the hw-specific drm which drm mapping has the hw registers, > and the driver uses that. Iirc this has been done as some form of OS > abstraction. Also note that these mappings aren't refcounted, so the first > guy to call drm_rmmap wins. > > -> All hell breaks loose if Xorg dies and takes all it's mappings with it > (in master_destroy, since the Xorg /dev fd is the master) and leaves the > driver hanging in the air if there's an interrupt still pending (or > anything else fwiw). For me that crash happened when xorg exited with a fatal error too. Mikulas
On Mon, Mar 24, 2014 at 01:17:12PM -0400, Mikulas Patocka wrote: > > > On Mon, 24 Mar 2014, Daniel Vetter wrote: > > > On Mon, Mar 24, 2014 at 07:45:47AM +1000, Dave Airlie wrote: > > > On Mon, Mar 24, 2014 at 7:27 AM, Andreas Mohr <andi@lisas.de> wrote: > > > > On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote: > > > >> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote: > > > >> > > > > >> > which did end up flawless on 3.12.0-rc2+, too > > > >> > (but failed to improve the issue on 3.14.0-rc7+). > > > >> > > > > >> > So, for all intents and purposes, drm infrastructure seems unavoidably > > > >> > (neither dri disable nor libdrm upgrade helps) affected. > > > >> > Does anyone know which change caused that issue? > > > >> > (I'm asking because bisect here would be relatively painful). > > > >> > > > >> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current > > > >> 3.14 rc only, or did it happen already in the previous release? > > > > > > > > Hmm, given that Mikulas in > > > > https://lkml.org/lkml/2014/2/26/537 > > > > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!) > > > > that that indeed may have been a regression at <= 3.13 proper even > > > > (which may pose interesting questions about the level of testing coverage > > > > we still enjoy [not!?] in this hardware area). > > That patch drops a mutex, so it is not correct. There is mutex resursion - > we need to uninstall the irq in drm_master_destroy, because here we are > committed to destroy the device. But the routine that uninstalls the irq > takes struct_mutex, which is already held in drm_master_destroy. > > I suppose that the person who maintains drm reworks the patch so that it's > correct: > > - could we use a different mutex to protect the irq in drm_irq.c? Or > possibly no mutex at all and use cmpxchg to manipulate the variable > dev->irq_enabled? - this seems like the best solution. But I am not sure > if the code in drm_irq.c somehow depends on the facts that other parts of > the drm subsystem take struct_mutex. > > - could we pass a new argument to drm_irq_uninstall that tells it not to > take the mutex? drm_master_destroy would set this argument to 1. > drm_master_destroy is mostly called with struct_mutex held, but there may > be places in vmwgfx_drv.c where drm_master_put (which calls > drm_master_destroy) may be called without struct_mutex held. > > Is it true that drm_master_destroy can be called without struct_mutex > held? I don't know. > > > I think drm maintainer should sort out the above issues and modify the > patch accordingly. > > > > > Oh well, seems I'll have to prepare/build 3.13 now... > > > > > > It's > 15 year old hardware, so yes I believe we have close to 0 > > > testing coverage on it outside of distros, > > > > > > I'm not even sure I have one anymore, I might be able to test an MGA in one box. > > > > I haven't done a full read of all the related code, but this smells like a > > similar bug I've hit all over the place in the i810 driver (another one of > > those undead drm drivers of yonders). Ingredients: > > > > 1) Xorg creates a drm mapping of the register space. > > 2) Xorg tells the hw-specific drm which drm mapping has the hw registers, > > and the driver uses that. Iirc this has been done as some form of OS > > abstraction. Also note that these mappings aren't refcounted, so the first > > guy to call drm_rmmap wins. > > > > -> All hell breaks loose if Xorg dies and takes all it's mappings with it > > (in master_destroy, since the Xorg /dev fd is the master) and leaves the > > driver hanging in the air if there's an interrupt still pending (or > > anything else fwiw). > > For me that crash happened when xorg exited with a fatal error too. Is this fatal error itself a regression or have you seen that on older kernels, too? Like I've said the entire teardown sequence for legacy drm drivers is terminally busted, so the only hope we have is to reapply this missing duct-tape which made your X crash. But if that itself isn't a regression there's no way to fix the current drm/mga driver without a complete rewrite as a new-style kernel modesetting driver. -Daniel
On Mon, 24 Mar 2014, Daniel Vetter wrote: > On Mon, Mar 24, 2014 at 01:17:12PM -0400, Mikulas Patocka wrote: > > > > > Hmm, given that Mikulas in > > > > > https://lkml.org/lkml/2014/2/26/537 > > > > > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!) > > > > > that that indeed may have been a regression at <= 3.13 proper even > > > > > (which may pose interesting questions about the level of testing coverage > > > > > we still enjoy [not!?] in this hardware area). > > > > That patch drops a mutex, so it is not correct. There is mutex resursion - > > we need to uninstall the irq in drm_master_destroy, because here we are > > committed to destroy the device. But the routine that uninstalls the irq > > takes struct_mutex, which is already held in drm_master_destroy. > > > > I suppose that the person who maintains drm reworks the patch so that it's > > correct: > > > > - could we use a different mutex to protect the irq in drm_irq.c? Or > > possibly no mutex at all and use cmpxchg to manipulate the variable > > dev->irq_enabled? - this seems like the best solution. But I am not sure > > if the code in drm_irq.c somehow depends on the facts that other parts of > > the drm subsystem take struct_mutex. > > > > - could we pass a new argument to drm_irq_uninstall that tells it not to > > take the mutex? drm_master_destroy would set this argument to 1. > > drm_master_destroy is mostly called with struct_mutex held, but there may > > be places in vmwgfx_drv.c where drm_master_put (which calls > > drm_master_destroy) may be called without struct_mutex held. > > > > Is it true that drm_master_destroy can be called without struct_mutex > > held? I don't know. > > > > > > I think drm maintainer should sort out the above issues and modify the > > patch accordingly. > > > > > -> All hell breaks loose if Xorg dies and takes all it's mappings with it > > > (in master_destroy, since the Xorg /dev fd is the master) and leaves the > > > driver hanging in the air if there's an interrupt still pending (or > > > anything else fwiw). > > > > For me that crash happened when xorg exited with a fatal error too. > > Is this fatal error itself a regression or have you seen that on older > kernels, too? In my case that Xorg error was not kernel-related at all. It happened because of unknown symbol because I used mga_dri.so from Debian 6 in Debian 7 (mga_dri.so isn't shipped in Debian 7 anymore). I can still play quake with that old mga_dri.so, although in some other scenario it causes failure because of unknown symbol. I should probably recompile mga_dri on my own. > Like I've said the entire teardown sequence for legacy drm drivers is > terminally busted, so the only hope we have is to reapply this missing > duct-tape which made your X crash. But if that itself isn't a regression > there's no way to fix the current drm/mga driver without a complete > rewrite as a new-style kernel modesetting driver. > -Daniel If someone understands the locking issues I pointed out above, it could be easy to fix. Mikulas
On Mon, Mar 24, 2014 at 9:40 PM, Mikulas Patocka <mpatocka@redhat.com> wrote: >> > > -> All hell breaks loose if Xorg dies and takes all it's mappings with it >> > > (in master_destroy, since the Xorg /dev fd is the master) and leaves the >> > > driver hanging in the air if there's an interrupt still pending (or >> > > anything else fwiw). >> > >> > For me that crash happened when xorg exited with a fatal error too. >> >> Is this fatal error itself a regression or have you seen that on older >> kernels, too? > > In my case that Xorg error was not kernel-related at all. It happened > because of unknown symbol because I used mga_dri.so from Debian 6 in > Debian 7 (mga_dri.so isn't shipped in Debian 7 anymore). I can still play > quake with that old mga_dri.so, although in some other scenario it causes > failure because of unknown symbol. I should probably recompile mga_dri on > my own. > >> Like I've said the entire teardown sequence for legacy drm drivers is >> terminally busted, so the only hope we have is to reapply this missing >> duct-tape which made your X crash. But if that itself isn't a regression >> there's no way to fix the current drm/mga driver without a complete >> rewrite as a new-style kernel modesetting driver. >> -Daniel > > If someone understands the locking issues I pointed out above, it could be > easy to fix. The locking issue isn't your problem, the real issue is that putting a irq_uninstall into core code will break all the new (properly working) drivers. And you can't really fix this in mga itself since the lifetime rules of the register mappings are totally broken. It's a fundamental misdesign of the legacy drm driver architecture and the _only_ way to fix this bug for real is to rewrite this all. Which was done for all the still used drivers like i915, radeon, nouveau, ... -Daniel
Hi, On Mon, Mar 24, 2014 at 10:46:49PM +0100, Daniel Vetter wrote: > On Mon, Mar 24, 2014 at 9:40 PM, Mikulas Patocka <mpatocka@redhat.com> wrote: > > If someone understands the locking issues I pointed out above, it could be > > easy to fix. > > The locking issue isn't your problem, the real issue is that putting a > irq_uninstall into core code will break all the new (properly working) > drivers. And you can't really fix this in mga itself since the > lifetime rules of the register mappings are totally broken. It's a > fundamental misdesign of the legacy drm driver architecture and the > _only_ way to fix this bug for real is to rewrite this all. Which was > done for all the still used drivers like i915, radeon, nouveau, ... That sounds plausible - yet with meatballs (ok, maybe I should omit such quite possibly unjustified qualification) such as this: git show --stat 771fe6b912fca54f03 how is a bunch of marginally-trained hobbyists ever supposed to be implementing a working practical (i.e., "base") driver for *various* currently unsupported (booted would perhaps even be a more fitting word?) hardware? While the result of a wc -l check of the drivers/gpu/drm/r128 dir itself seems quite positive, that still might be not much of help when eyeing the large KMS changes that had to be done elsewhere. I guess we can make use of all the practical advice/links that we can get... (such as hints at good candidates of existing KMSified drivers which don't come with the full bells and whistles package, hints at suitably sized KMS support commits, grandma tutorials, ...). Some semi-short search wasn't overly successful, with links such as http://www.x.org/wiki/ModeSetting/ https://en.wikipedia.org/wiki/KMS_%28Linux_kernel%29#Linux "New, Generic X.Org KMS Driver Work" http://www.phoronix.com/scan.php?page=news_item&px=OTk1OA Or perhaps I should just state outright that I seem to be in need of a working solution for my kernel upgrade pain which I would be deemed to want semi-soonish (the i810, MGA users and some others might be sharing my thoughts). IOW, my r128 driver is somewhat of a "still used driver", thank you very much. Thanks for having managed to survive my posting in an asbestos-lined garment (apologies if it came across in harsh terms :), Andreas Mohr (not necessarily a member of the forced-monopoly hardware upgrade treadmill cult)
On Tue, Mar 25, 2014 at 12:11 AM, Andreas Mohr <andi@lisas.de> wrote: > On Mon, Mar 24, 2014 at 10:46:49PM +0100, Daniel Vetter wrote: >> On Mon, Mar 24, 2014 at 9:40 PM, Mikulas Patocka <mpatocka@redhat.com> wrote: >> > If someone understands the locking issues I pointed out above, it could be >> > easy to fix. >> >> The locking issue isn't your problem, the real issue is that putting a >> irq_uninstall into core code will break all the new (properly working) >> drivers. And you can't really fix this in mga itself since the >> lifetime rules of the register mappings are totally broken. It's a >> fundamental misdesign of the legacy drm driver architecture and the >> _only_ way to fix this bug for real is to rewrite this all. Which was >> done for all the still used drivers like i915, radeon, nouveau, ... > > That sounds plausible - yet with meatballs (ok, maybe I should omit > such quite possibly unjustified qualification) such as this: > > git show --stat 771fe6b912fca54f03 > > how is a bunch of marginally-trained hobbyists ever supposed to be implementing > a working practical (i.e., "base") driver for *various* currently unsupported > (booted would perhaps even be a more fitting word?) hardware? > > While the result of a wc -l check of the drivers/gpu/drm/r128 dir itself > seems quite positive, that still might be not much of help > when eyeing the large KMS changes that had to be done elsewhere. > > I guess we can make use of all the practical advice/links that we can get... > (such as hints at good candidates of existing KMSified drivers > which don't come with the full bells and whistles package, > hints at suitably sized KMS support commits, grandma tutorials, ...). > Some semi-short search wasn't overly successful, with links such as > http://www.x.org/wiki/ModeSetting/ > https://en.wikipedia.org/wiki/KMS_%28Linux_kernel%29#Linux > "New, Generic X.Org KMS Driver Work" http://www.phoronix.com/scan.php?page=news_item&px=OTk1OA > > Or perhaps I should just state outright that I seem to be in need > of a working solution for my kernel upgrade pain > which I would be deemed to want semi-soonish > (the i810, MGA users and some others might be sharing my thoughts). > IOW, my r128 driver is somewhat of a "still used driver", thank you very much. > > Thanks for having managed to survive my posting in an asbestos-lined > garment (apologies if it came across in harsh terms :), > > Andreas Mohr > (not necessarily a member of the forced-monopoly hardware upgrade treadmill cult) Well I have a i810 here personally and since about 3 years haven't gotten around to implement a kms stack for it (just for fun essentially). And that's on a platform which can steal a lot of the kms infrastructure from the i915 driver. For simple devices getting a basic kms driver off the ground (that means no accelaration support in the kernel with GEM, and obviously userspace not ported) seems to be doable in one gsoc term, at least someone managed it. Which means 3 months full-time work. So yeah, writing real gpu drivers is fairly non-trivial task. And all the people who can do it are fully busy getting the new shit to work. Wrt stealing code: It's better to look at the legacy fbdev drivers, since they have modeset support. And meanwhile we have fairly decent drm DocBook in the kernel, the latest updates are in drm-next. -Daniel
On Mon, 24 Mar 2014, Daniel Vetter wrote: > >> Like I've said the entire teardown sequence for legacy drm drivers is > >> terminally busted, so the only hope we have is to reapply this missing > >> duct-tape which made your X crash. But if that itself isn't a regression > >> there's no way to fix the current drm/mga driver without a complete > >> rewrite as a new-style kernel modesetting driver. > >> -Daniel > > > > If someone understands the locking issues I pointed out above, it could be > > easy to fix. > > The locking issue isn't your problem, the real issue is that putting a > irq_uninstall into core code will break all the new (properly working) > drivers. And you can't really fix this in mga itself since the > lifetime rules of the register mappings are totally broken. It's a > fundamental misdesign of the legacy drm driver architecture and the > _only_ way to fix this bug for real is to rewrite this all. Which was > done for all the still used drivers like i915, radeon, nouveau, ... > -Daniel When I tried Radeon AGP card with the KMS driver, it lacked the possibility to set video mode with fbset and the framebuffer console was very slow because it wasn't accelerated. So, Radeon with the new driver is much less useable than Matrox. Did I misconfigure something? Or, is console acceleration and modesetting deliberately unsupported in KMS drivers? Mikulas
On Tue, Mar 25, 2014 at 06:42:13PM -0400, Mikulas Patocka wrote: > > > On Mon, 24 Mar 2014, Daniel Vetter wrote: > > > >> Like I've said the entire teardown sequence for legacy drm drivers is > > >> terminally busted, so the only hope we have is to reapply this missing > > >> duct-tape which made your X crash. But if that itself isn't a regression > > >> there's no way to fix the current drm/mga driver without a complete > > >> rewrite as a new-style kernel modesetting driver. > > >> -Daniel > > > > > > If someone understands the locking issues I pointed out above, it could be > > > easy to fix. > > > > The locking issue isn't your problem, the real issue is that putting a > > irq_uninstall into core code will break all the new (properly working) > > drivers. And you can't really fix this in mga itself since the > > lifetime rules of the register mappings are totally broken. It's a > > fundamental misdesign of the legacy drm driver architecture and the > > _only_ way to fix this bug for real is to rewrite this all. Which was > > done for all the still used drivers like i915, radeon, nouveau, ... > > -Daniel > > When I tried Radeon AGP card with the KMS driver, it lacked the > possibility to set video mode with fbset and the framebuffer console was > very slow because it wasn't accelerated. > > So, Radeon with the new driver is much less useable than Matrox. > > Did I misconfigure something? Or, is console acceleration and modesetting > deliberately unsupported in KMS drivers? You've missed nothing and the performances is abmysal intentionally. It could be fixed (we've had patches for i915 to accelarate the performance) but not worth it - fbdev emulation is just here for backwards compat. If you want fast console on kms drivers you need to look at David Herrmann's kmscon. That uses GL drivers on top of egl+gbm+kms and that will be fast. The long-term plan is to switch to that and have a very minimal shim in the kernel on top of kms as an emergency logging console only (i.e. even less than the current fbdev stuff). -Daniel
Index: linux-3.13.5/drivers/gpu/drm/drm_irq.c =================================================================== --- linux-3.13.5.orig/drivers/gpu/drm/drm_irq.c 2014-01-24 19:36:33.529945530 +0100 +++ linux-3.13.5/drivers/gpu/drm/drm_irq.c 2014-02-26 16:40:41.379534715 +0100 @@ -357,10 +357,8 @@ int drm_irq_uninstall(struct drm_device if (!drm_core_check_feature(dev, DRIVER_HAVE_IRQ)) return -EINVAL; - mutex_lock(&dev->struct_mutex); irq_enabled = dev->irq_enabled; dev->irq_enabled = false; - mutex_unlock(&dev->struct_mutex); /* * Wake up any waiters so they don't hang. Index: linux-3.13.5/drivers/gpu/drm/drm_stub.c =================================================================== --- linux-3.13.5.orig/drivers/gpu/drm/drm_stub.c 2014-02-26 16:31:36.158862071 +0100 +++ linux-3.13.5/drivers/gpu/drm/drm_stub.c 2014-02-26 16:40:52.031317277 +0100 @@ -167,6 +167,8 @@ static void drm_master_destroy(struct kr list_del(&master->head); + drm_irq_uninstall(dev); + if (dev->driver->master_destroy) dev->driver->master_destroy(dev, master);