diff mbox

a crash in mga_driver_irq_uninstall

Message ID alpine.LRH.2.02.1402261551010.9913@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mikulas Patocka Feb. 26, 2014, 9:25 p.m. UTC
Hi

I'm getting a reproducible crash in kernel MGA DRM driver.

The crash happens in the following way:

drm_release is called
drm_release calls drm_master_put(&file_priv->master);
drm_master_put drops a reference and calls drm_master_destroy
drm_master_destroy calls drm_rmmap_locked to unmap the card's mmio space
drm_release continues to execute
drm_release calls drm_lastclose
drm_lastclose calls drm_irq_uninstall
drm_irq_uninstall calls dev->driver->irq_uninstall (mga_driver_irq_uninstall)
mga_driver_irq_uninstall performs MGA_WRITE(MGA_IEN, 0), it crashes 
	because the mmio range was unmapped 

The result are these two crashes - one in mga_driver_irq_uninstall and the 
other in mga_driver_irq_handler:

[44272.019284] BUG: unable to handle kernel paging request at e4831e1c
[44272.021000] IP: [<e481e8d8>] mga_driver_irq_uninstall+0x18/0x28 [mga]
[44272.021000] *pde = 1e7ba067 *pte = 00000000
[44272.021000] Oops: 0002 [#1]
[44272.021000] Modules linked in: mga drm hid_generic usbhid hid ipv6 
cpufreq_stats cpufreq_conservative cpufreq_powersave cpufreq_ondemand 
cpufreq_userspace plip spadfs hpfs nls_cp852 msdos fat matroxfb_base 
matroxfb_g450 matroxfb_accel cfbfillrect cfbimgblt cfbcopyarea 
matroxfb_DAC1064 g450_pll matroxfb_misc floppy dm_crypt snd_usb_audio 
snd_usbmidi_lib snd_hwdep snd_seq_midi snd_seq_midi_event snd_rawmidi 
snd_pcm snd_page_alloc snd_seq snd_seq_device snd_timer snd soundcore 
powernow_k6 pcspkr evdev ehci_pci via686a i2c_viapro e1000 i2c_core 
uhci_hcd ehci_hcd via_agp usbcore usb_common parport_pc agpgart parport 
dm_mod ext4 crc16
jbd2 mbcache sata_promise unix
[44272.021000] CPU: 0 PID: 3140 Comm: Xorg Not tainted 3.13.5 #5
[44272.021000] Hardware name: System Manufacturer Product Name/VA-503A, 
BIOS 4.51 PG 08/02/00
[44272.021000] task: c043ce80 ti: de662000 task.ti: de662000
[44272.021000] EIP: 0060:[<e481e8d8>] EFLAGS: 00213286 CPU: 0
[44272.021000] EIP is at mga_driver_irq_uninstall+0x18/0x28 [mga]
[44272.021000] EAX: de87fc00 EBX: de87fc00 ECX: e4830000 EDX: 00000000
[44272.021000] ESI: 00000001 EDI: 00000001 EBP: 00203202 ESP: de663e58
[44272.021000]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[44272.021000] CR0: 8005003b CR2: e4831e1c CR3: 1e666000 CR4: 00000090
[44272.021000] Stack:
[44272.021000]  e47d17ae 00000000 e47f0064 e47e6ca2 e47f0079 0000000f 
0187fc00 c0096e40
[44272.021000]  de87fc00 00000001 de87fc00 e481a277 00000001 e481ee1f 
e481eaa7 e481ee1d
[44272.021000]  de7a6880 00150012 e47d6b2d 00017915 e47d6b2d df401d00 
c0096e40 de87fc00
[44272.021000] Call Trace:
[44272.021000]  [<e47d17ae>] ? drm_irq_uninstall+0xae/0x180 [drm]
[44272.021000]  [<e481a277>] ? mga_do_cleanup_dma+0x237/0x300 [mga]
[44272.021000]  [<e47d6b2d>] ? drm_ht_remove+0x2d/0x40 [drm]
[44272.021000]  [<e47d6b2d>] ? drm_ht_remove+0x2d/0x40 [drm]
[44272.021000]  [<e47cf03e>] ? drm_lastclose+0x3e/0x180 [drm]
[44272.021000]  [<e47cf4fb>] ? drm_release+0x37b/0x520 [drm]
[44272.021000]  [<c10a61b2>] ? __fput+0x72/0x1c0
[44272.021000]  [<c1039499>] ? task_work_run+0x79/0xa0
[44272.021000]  [<c102722f>] ? do_exit+0x18f/0x740
[44272.021000]  [<c10a5bec>] ? vfs_writev+0x2c/0x60
[44272.021000]  [<c10a5d8a>] ? SyS_writev+0x4a/0xc0
[44272.021000]  [<c10278a6>] ? do_group_exit+0x26/0x60
[44272.021000]  [<c10278f1>] ? SyS_exit_group+0x11/0x20
[44272.021000]  [<c125cbf8>] ? syscall_call+0x7/0xb
[44272.021000] Code: 1e 00 00 b0 00 5b c3 8d b6 00 00 00 00 8d bf 00 00 00 
00 8b 90 d8 00 00 00 85 d2 74 1b 8b 92 dc 00 00 00 8b 4a 10 ba 00 00 00 00 
<89> 91 1c 1e 00 00 c6 80 80 00 00 00 00 c3 00 00 ba 00 f9 81 e4
[44272.021000] EIP: [<e481e8d8>] mga_driver_irq_uninstall+0x18/0x28 [mga] 
SS:ESP 0068:de663e58
[44272.021000] CR2: 00000000e4831e1c
[44272.021000] ---[ end trace 68cd6cc5ef884eb2 ]---
[44272.021000] Fixing recursive fault but reboot is needed!
[44272.651150] BUG: unable to handle kernel paging request at e4831e14
[44272.654217] IP: [<e481e5f4>] mga_driver_irq_handler+0x14/0xe0 [mga]
[44272.654217] *pde = 1e7ba067 *pte = 00000000
[44272.654217] Oops: 0000 [#2]
[44272.654217] Modules linked in: mga drm hid_generic usbhid hid ipv6 
cpufreq_stats cpufreq_conservative cpufreq_powersave cpufreq_ondemand 
cpufreq_userspace plip spadfs hpfs nls_cp852 msdos fat matroxfb_base 
matroxfb_g450 matroxfb_accel cfbfillrect cfbimgblt cfbcopyarea 
matroxfb_DAC1064 g450_pll matroxfb_misc floppy dm_crypt snd_usb_audio 
snd_usbmidi_lib snd_hwdep snd_seq_midi snd_seq_midi_event snd_rawmidi 
snd_pcm snd_page_alloc snd_seq snd_seq_device snd_timer snd soundcore 
powernow_k6 pcspkr evdev ehci_pci via686a i2c_viapro e1000 i2c_core 
uhci_hcd ehci_hcd via_agp usbcore usb_common parport_pc agpgart parport 
dm_mod ext4 crc16
jbd2 mbcache sata_promise unix
[44272.654217] CPU: 0 PID: 0 Comm: swapper Tainted: G      D      3.13.5 
#5
[44272.654217] Hardware name: System Manufacturer Product Name/VA-503A, 
BIOS 4.51 PG 08/02/00
[44272.654217] task: c132f500 ti: df406000 task.ti: c1324000
[44272.654217] EIP: 0060:[<e481e5f4>] EFLAGS: 00210082 CPU: 0
[44272.654217] EIP is at mga_driver_irq_handler+0x14/0xe0 [mga]
[44272.654217] EAX: de87fc00 EBX: df640a00 ECX: e0860803 EDX: e4830000
[44272.654217] ESI: 00000080 EDI: 0000000f EBP: de585340 ESP: df407fb0
[44272.654217]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[44272.654217] CR0: 8005003b CR2: e4831e14 CR3: 1f7ca000 CR4: 00000090
[44272.654217] Stack:
[44272.654217]  0000000f 00000001 00000080 c104aaf6 00000000 00000000 
00000000 00000000
[44272.654217]  00000000 00000000 df405780 df405780 df405780 c104c9a0 
0000000f c104ac19
[44272.654217]  df405780 c104c9ec c1325f68 c10036b2
[44272.654217] Call Trace:
[44272.654217]  [<c104aaf6>] ? handle_irq_event_percpu+0x36/0x140
[44272.654217]  [<c104c9a0>] ? cond_unmask_irq+0x40/0x40
[44272.654217]  [<c104ac19>] ? handle_irq_event+0x19/0x40
[44272.654217]  [<c104c9ec>] ? handle_level_irq+0x4c/0x80
[44272.654217]  <IRQ>
[44272.654217]  [<c100340e>] ? do_IRQ+0x2e/0xa0
[44272.654217]  [<c125d5ec>] ? common_interrupt+0x2c/0x31
[44272.654217]  [<c104a3a8>] ? cpu_startup_entry+0xa8/0x120
[44272.654217]  [<c135f989>] ? start_kernel+0x2d1/0x2d6
[44272.654217]  [<c135f502>] ? repair_env_string+0x4d/0x4d
[44272.654217] Code: 0e 8b 80 9c 00 00 00 c3 8d b4 26 00 00 00 00 b8 00 00 
00 00 c3 66 90 56 53 50 89 d0 8b 9a d8 00 00 00 8b 93 dc 00 00 00 8b 52 10 
<8b> b2 14 1e 00 00 f7 c6 20 00 00 00 0f 85 8a 00 00 00 b8 00 00
[44272.654217] EIP: [<e481e5f4>] mga_driver_irq_handler+0x14/0xe0 [mga] 
SS:ESP 0068:df407fb0
[44272.654217] CR2: 00000000e4831e14
[44272.654217] ---[ end trace 68cd6cc5ef884eb3 ]---
[44272.654217] Kernel panic - not syncing: Fatal exception in interrupt


The crash can be fixed with this patch - it calls drm_irq_uninstall before 
unmapping the registers, so that further calls to drm_irq_uninstall do not 
try to touch the mmio space. Removing the lock is not correct, but the 
lock is already held when calling drm_master_destroy - you know the drm 
locking rules better than me, so you could come up with a better patch for 
this problem.

Mikulas

---
 drivers/gpu/drm/drm_irq.c  |    2 --
 drivers/gpu/drm/drm_stub.c |    2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Andreas Mohr March 22, 2014, 11:43 p.m. UTC | #1
Hi,

now testing 3.14-rc7 here (r128 hardware rather than MGA),
and I seem to still be experiencing the same or very similar crash as you here:

agpgart-intel 0000:00:00.0: AGP 2.0 bridge
agpgart-intel 0000:00:00.0: putting AGP V2 device into 4x mode
pci 0000:01:00.0: putting AGP V2 device into 4x mode
Registering platform device 'r128_cce.0'. Parent at platform
device: 'r128_cce.0': device_add
bus: 'platform': add device r128_cce.0
PM: Adding info for platform:r128_cce.0
__allocate_fw_buf: fw-r128/r128_cce.bin buf=dd9ec800
platform r128_cce.0: firmware: direct-loading firmware r128/r128_cce.bin
fw_set_page_data: fw-r128/r128_cce.bin buf=dd9ec800 data=e07f8000 size=2048
bus: 'platform': remove device r128_cce.0
PM: Removing info for platform:r128_cce.0
fw_name_devm_release: fw_name-r128/r128_cce.bin devm-dd9ccfcc released
__fw_free_buf: fw-r128/r128_cce.bin buf=dd9ec800 data=e07f8000 size=2048
evbug: Event. Dev: input7, Type: 2, Code: 0, Value: 1
evbug: Event. Dev: input7, Type: 2, Code: 1, Value: 1
evbug: Event. Dev: input7, Type: 0, Code: 0, Value: 0
evbug: Event. Dev: input7, Type: 2, Code: 0, Value: 2
evbug: Event. Dev: input7, Type: 0, Code: 0, Value: 0
BUG: unable to handle kernel paging request at e07f0040
IP: [<e293fdb8>] r128_driver_irq_uninstall+0x18/0x1d [r128]
*pde = 1f414067 *pte = 00000000 
Oops: 0002 [#1] 
Modules linked in: lp r128 drm uinput nls_iso8859_1 nls_cp437 vfat fat radeonfb 
cfbfillrect cfbimgblt cfbcopyarea i2c_algo_bit fb_ddc i2c_core fb fbdev ppdev lo
op fuse firewire_sbp2 mcs7830 usbnet usb_storage mii iTCO_wdt iTCO_vendor_suppor
t snd_maestro3 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq
_midi snd_rawmidi snd_seq_oss pcmcia snd_seq_midi_event snd_seq snd_seq_device s
nd_timer microcode firewire_ohci dell_laptop sg dcdbas yenta_socket snd firewire
_core sr_mod pcmcia_rsrc psmouse crc_itu_t cdrom pcmcia_core pcspkr video backli
ght evbug evdev uhci_hcd floppy rtc_cmos ehci_hcd intel_agp intel_gtt usbcore us
b_common lpc_ich mfd_core
CPU: 0 PID: 4674 Comm: Xorg Not tainted 3.14.0-rc7+ #9
Hardware name: Dell Computer Corporation Inspiron 8000                   /Inspir
on 8000            , BIOS A23 01/21/2004
task: ded082f0 ti: da6c4000 task.ti: da6c4000
EIP: 0060:[<e293fdb8>] EFLAGS: 00213246 CPU: 0
EIP is at r128_driver_irq_uninstall+0x18/0x1d [r128]
EAX: 00000000 EBX: dd9eb400 ECX: 00000000 EDX: e07f0000
ESI: 00000001 EDI: dd9ccd40 EBP: da6c5d48 ESP: da6c5d48
 DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
CR0: 80050033 CR2: e07f0040 CR3: 1da53000 CR4: 000007d0
Stack:
 da6c5d78 e284dc79 00000000 e28696c5 e285fddc e28696bd 0000000b 01cc54c0
 00203202 dd9eb400 dd9eb400 00000000 da6c5d90 e293b163 dd9ec8e0 dd9ec8e8
 dd9eb400 dd9eb400 da6c5d98 e293fc9c da6c5dc0 e284c542 00000001 e2869575
Call Trace:
 [<e284dc79>] drm_irq_uninstall+0x119/0x13b [drm]
 [<e293b163>] r128_do_cleanup_cce+0x15/0xb3 [r128]
 [<e293fc9c>] r128_driver_lastclose+0x8/0xa [r128]
 [<e284c542>] drm_lastclose+0x40/0x143 [drm]
 [<e284ca37>] drm_release+0x3f2/0x419 [drm]
 [<c10b4c07>] __fput+0xca/0x185
 [<c10b4ce8>] ____fput+0x8/0xa
 [<c103c213>] task_work_run+0x4f/0x60
 [<c102a4fc>] do_exit+0x27f/0x6bb
 [<c1032bc0>] ? __sigqueue_free+0x2c/0x2f
 [<c102b4df>] do_group_exit+0x2e/0x65
 [<c1034963>] get_signal_to_deliver+0x420/0x45b
 [<c1033788>] ? __send_signal.constprop.34+0x15a/0x234
 [<c10014a2>] do_signal+0x34/0x6d0
 [<c1033fdf>] ? do_send_specific+0x4a/0x74
 [<c1001b69>] do_notify_resume+0x2b/0x52
 [<c12c5a33>] work_notifysig+0x24/0x29
Code: 50 10 b8 01 00 00 00 89 42 44 5d c3 55 31 c0 89 e5 5d c3 55 8b 80 e8 00 00 00 89 e5 85 c0 74 0e 8b 80 94 00 00 00 8b 50 10 31 c0 <89> 42 40 5d c3 55 ba f0 0c 94 e2 89 e5 b8 68 0d 94 e2 e8 55 15
EIP: [<e293fdb8>] r128_driver_irq_uninstall+0x18/0x1d [r128] SS:ESP 0068:da6c5d48
CR2: 00000000e07f0040
---[ end trace 018ccfcd552fb6cf ]---
Fixing recursive fault but reboot is needed!
device: '254:0': device_add








Applying your (probably experimental?) posted patch
(thanks for having done the necessary debugging work for me :)
upon next boot made this dump go away,
but I got greeted with a relatively similar:











[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] No driver support for vblank timestamp query.
[drm] Initialized r128 2.5.0 20030725 for 0000:01:00.0 on minor 0
agpgart-intel 0000:00:00.0: AGP 2.0 bridge
agpgart-intel 0000:00:00.0: putting AGP V2 device into 4x mode
pci 0000:01:00.0: putting AGP V2 device into 4x mode
Registering platform device 'r128_cce.0'. Parent at platform
device: 'r128_cce.0': device_add
bus: 'platform': add device r128_cce.0
PM: Adding info for platform:r128_cce.0
__allocate_fw_buf: fw-r128/r128_cce.bin buf=dda02aa0
platform r128_cce.0: firmware: direct-loading firmware r128/r128_cce.bin
fw_set_page_data: fw-r128/r128_cce.bin buf=dda02aa0 data=e080c000 size=2048
bus: 'platform': remove device r128_cce.0
PM: Removing info for platform:r128_cce.0
fw_name_devm_release: fw_name-r128/r128_cce.bin devm-dda289cc released
__fw_free_buf: fw-r128/r128_cce.bin buf=dda02aa0 data=e080c000 size=2048
device class 'printer': registering
lp: driver loaded but no devices found
BUG: unable to handle kernel NULL pointer dereference at 00000058
IP: [<e2a11cab>] r128_get_vblank_counter+0xd/0x16 [r128]
*pde = 00000000 
Oops: 0000 [#1] 
Modules linked in: lp r128 drm uinput nls_iso8859_1 nls_cp437 vfat fat radeonfb cfbfillrect cfbimgblt cfbcopyarea i2c_algo_bit fb_ddc i2c_core fb fbdev ppdev loop fuse firewire_sbp2 mcs7830 usbnet mii usb_storage iTCO_wdt iTCO_vendor_support snd_maestro3 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_rawmidi microcode snd_seq_oss dell_laptop snd_seq_midi_event dcdbas snd_seq snd_seq_device pcmcia snd_timer sg firewire_ohci sr_mod psmouse pcspkr firewire_core yenta_socket crc_itu_t snd cdrom pcmcia_rsrc pcmcia_core video backlight rtc_cmos floppy uhci_hcd ehci_hcd evbug evdev intel_agp lpc_ich intel_gtt mfd_core usbcore usb_common
CPU: 0 PID: 3842 Comm: Xorg Not tainted 3.14.0-rc7+ #9
Hardware name: Dell Computer Corporation Inspiron 8000                   /Inspiron 8000            , BIOS A23 01/21/2004
task: df796230 ti: dd9ee000 task.ti: dd9ee000
EIP: 0060:[<e2a11cab>] EFLAGS: 00213046 CPU: 0
EIP is at r128_get_vblank_counter+0xd/0x16 [r128]
EAX: 00000000 EBX: decd0c00 ECX: e2a12d68 EDX: 00000000
ESI: 00000001 EDI: dda28400 EBP: dd9efebc ESP: dd9efebc
 DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
CR0: 80050033 CR2: 00000058 CR3: 1da04000 CR4: 000007d0
Stack:
 dd9efeec e29bbbe6 00000000 c0000000 bfc2f150 00000009 dd9efee4 00a0f964
 00203202 dda02b00 decd0c00 00000000 dd9eff08 e29bd7bb 00000000 decd0c40
 dd9a2cc0 decd0c00 00000000 dd9eff14 e29bd893 dd9a2c80 dd9eff54 e29ba9c6
Call Trace:
 [<e29bbbe6>] drm_irq_uninstall+0x86/0x126 [drm]
 [<e29bd7bb>] drm_master_destroy+0x31/0xee [drm]
 [<e29bd893>] drm_master_put+0x1b/0x1d [drm]
 [<e29ba9c6>] drm_release+0x381/0x419 [drm]
 [<c10b4c07>] __fput+0xca/0x185
 [<c10b4ce8>] ____fput+0x8/0xa
 [<c103c213>] task_work_run+0x4f/0x60
 [<c1001b8c>] do_notify_resume+0x4e/0x52
 [<c12c5a33>] work_notifysig+0x24/0x29
Code: 10 8b 80 14 07 00 00 c7 43 48 00 00 00 00 83 c4 14 5b 5d c3 55 89 e5 e8 b2 b4 ff ff 5d c3 55 85 d2 8b 80 e8 00 00 00 89 e5 75 05 <8b> 40 58 eb 02 31 c0 5d c3 55 89 e5 53 8b 8a e8 00 00 00 89 d3
EIP: [<e2a11cab>] r128_get_vblank_counter+0xd/0x16 [r128] SS:ESP 0068:dd9efebc
CR2: 0000000000000058
---[ end trace 0180925feaa449a7 ]---
device: 'vcs9': device_add


and non-responsive graphics mode.


01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Rage Mobility 128 AGP 4X/Mobility M4


ii  libdrm-dev                           2.4.45-3                          i386         Userspace interface to kernel DRM services -- development files
ii  libdrm-intel1:i386                   2.4.45-3                          i386         Userspace interface to intel-specific kernel DRM services -- runtime
rc  libdrm-nouveau1                      2.4.21-1~squeeze3                 i386         Userspace interface to nouveau-specific kernel DRM services -- runtime
ii  libdrm-nouveau2:i386                 2.4.45-3                          i386         Userspace interface to nouveau-specific kernel DRM services -- runtime
ii  libdrm-radeon1:i386                  2.4.45-3                          i386         Userspace interface to radeon-specific kernel DRM services -- runtime
ii  libdrm2:i386                         2.4.45-3                          i386         Userspace interface to kernel DRM services -- runtime


So, "Hmm?" :)

Thanks,

Andreas Mohr
Andreas Mohr March 23, 2014, 12:15 p.m. UTC | #2
Hi,

On Sun, Mar 23, 2014 at 12:43:17AM +0100, Andreas Mohr wrote:
> Hi,
> 
> now testing 3.14-rc7 here (r128 hardware rather than MGA),
> and I seem to still be experiencing the same or very similar crash as you here:

I decided to do some more experimentation:

I added a

Section "Module"
        Disable "dri"
EndSection

which did reliably disable dri according to 3.12.0-rc2+ xdpyinfo / Xorg.0.log

On -rc7 however, this changed the issue from a kernel OOPS into merely a
Xorg.0.log trace and abort (no such issue on working kernel
given my userspace configuration!):

[    68.334] Backtrace:
[    68.335] 0: /usr/bin/X (xorg_backtrace+0x49) [0xb7767769]
[    68.335] 1: /usr/bin/X (0xb75ea000+0x181186) [0xb776b186]
[    68.335] 2: linux-gate.so.1 (__kernel_rt_sigreturn+0x0) [0xffffe40c]
[    68.335] 3: /usr/lib/i386-linux-gnu/libpixman-1.so.0 (0xb7430000+0x727c0) [0xb74a27c0]
[    68.335] 4: /usr/lib/i386-linux-gnu/libpixman-1.so.0 (0xb7430000+0x57abf) [0xb7487abf]
[    68.335] 5: /usr/lib/i386-linux-gnu/libpixman-1.so.0 (pixman_blt+0x7d) [0xb74367ad]
[    68.335] 6: /usr/lib/xorg/modules/libfb.so (fbCopyNtoN+0x2af) [0xb6de066f]
[    68.335] 7: /usr/bin/X (miCopyRegion+0x17c) [0xb7743f6c]
[    68.336] 8: /usr/bin/X (miDoCopy+0x4f0) [0xb77445a0]
[    68.336] 9: /usr/lib/xorg/modules/libfb.so (fbCopyArea+0x7e) [0xb6de089e]
[    68.336] 10: /usr/lib/xorg/modules/libxaa.so (0xb6f04000+0xacf3) [0xb6f0ecf3]
[    68.336] 11: /usr/lib/xorg/modules/libxaa.so (0xb6f04000+0x548d3) [0xb6f588d3]
[    68.336] 12: /usr/bin/X (0xb75ea000+0x10929d) [0xb76f329d]
[    68.336] 13: /usr/bin/X (0xb75ea000+0x15b60f) [0xb774560f]
[    68.336] 14: /usr/bin/X (0xb75ea000+0x16cd21) [0xb7756d21]
[    68.336] 15: /usr/bin/X (0xb75ea000+0x16d50c) [0xb775750c]
[    68.337] 16: /usr/bin/X (0xb75ea000+0xbe0b7) [0xb76a80b7]
[    68.337] 17: /usr/bin/X (miPointerUpdateSprite+0x2a7) [0xb7751687]
[    68.337] 18: /usr/bin/X (0xb75ea000+0x16791a) [0xb775191a]
[    68.337] 19: /usr/bin/X (0xb75ea000+0xcd1c4) [0xb76b71c4]
[    68.337] 20: /usr/bin/X (0xb75ea000+0x101bfe) [0xb76ebbfe]
[    68.337] 21: /usr/bin/X (0xb75ea000+0x4437b) [0xb762e37b]
[    68.337] 22: /usr/bin/X (WindowHasNewCursor+0x3b) [0xb762f6db]
[    68.337] 23: /usr/bin/X (ChangeWindowAttributes+0xb0c) [0xb7655e1c]
[    68.338] 24: /usr/bin/X (0xb75ea000+0x35da7) [0xb761fda7]
[    68.338] 25: /usr/bin/X (0xb75ea000+0x3c375) [0xb7626375]
[    68.338] 26: /usr/bin/X (0xb75ea000+0x29e95) [0xb7613e95]
[    68.338] 27: /lib/i386-linux-gnu/i686/cmov/libc.so.6 (__libc_start_main+0xf5) [0xb71ef8f5]
[    68.338] 28: /usr/bin/X (0xb75ea000+0x2a1e9) [0xb76141e9]
[    68.338] 
[    68.338] Bus error at address 0xb4f1ed4e
[    68.338] 
Fatal server error:
[    68.339] Caught signal 7 (Bus error). Server aborting




I then also decided to update libdrm package:

i  libdrm-dev:i386                      2.4.52-1                          i386         Userspace interface to kernel DRM services -- development files
ii  libdrm-intel1:i386                   2.4.52-1                          i386         Userspace interface to intel-specific kernel DRM services -- runtime
rc  libdrm-nouveau1                      2.4.21-1~squeeze3                 i386         Userspace interface to nouveau-specific kernel DRM services -- runtime
ii  libdrm-nouveau2:i386                 2.4.52-1                          i386         Userspace interface to nouveau-specific kernel DRM services -- runtime
ii  libdrm-radeon1:i386                  2.4.52-1                          i386         Userspace interface to radeon-specific kernel DRM services -- runtime
ii  libdrm2:i386                         2.4.52-1                          i386         Userspace interface to kernel DRM services -- runtime


which did end up flawless on 3.12.0-rc2+, too
(but failed to improve the issue on 3.14.0-rc7+).

So, for all intents and purposes, drm infrastructure seems unavoidably
(neither dri disable nor libdrm upgrade helps) affected.
Does anyone know which change caused that issue?
(I'm asking because bisect here would be relatively painful).

Thanks,

Andreas Mohr
Linus Torvalds March 23, 2014, 4:39 p.m. UTC | #3
On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote:
>
> which did end up flawless on 3.12.0-rc2+, too
> (but failed to improve the issue on 3.14.0-rc7+).
>
> So, for all intents and purposes, drm infrastructure seems unavoidably
> (neither dri disable nor libdrm upgrade helps) affected.
> Does anyone know which change caused that issue?
> (I'm asking because bisect here would be relatively painful).

So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current
3.14 rc only, or did it happen already in the previous release?

               Linus
Andreas Mohr March 23, 2014, 9:27 p.m. UTC | #4
On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote:
> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote:
> >
> > which did end up flawless on 3.12.0-rc2+, too
> > (but failed to improve the issue on 3.14.0-rc7+).
> >
> > So, for all intents and purposes, drm infrastructure seems unavoidably
> > (neither dri disable nor libdrm upgrade helps) affected.
> > Does anyone know which change caused that issue?
> > (I'm asking because bisect here would be relatively painful).
> 
> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current
> 3.14 rc only, or did it happen already in the previous release?

Hmm, given that Mikulas in
https://lkml.org/lkml/2014/2/26/537
offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!)
that that indeed may have been a regression at <= 3.13 proper even
(which may pose interesting questions about the level of testing coverage
we still enjoy [not!?] in this hardware area).

Oh well, seems I'll have to prepare/build 3.13 now...

Thanks,

Andreas Mohr
Dave Airlie March 23, 2014, 9:45 p.m. UTC | #5
On Mon, Mar 24, 2014 at 7:27 AM, Andreas Mohr <andi@lisas.de> wrote:
> On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote:
>> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote:
>> >
>> > which did end up flawless on 3.12.0-rc2+, too
>> > (but failed to improve the issue on 3.14.0-rc7+).
>> >
>> > So, for all intents and purposes, drm infrastructure seems unavoidably
>> > (neither dri disable nor libdrm upgrade helps) affected.
>> > Does anyone know which change caused that issue?
>> > (I'm asking because bisect here would be relatively painful).
>>
>> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current
>> 3.14 rc only, or did it happen already in the previous release?
>
> Hmm, given that Mikulas in
> https://lkml.org/lkml/2014/2/26/537
> offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!)
> that that indeed may have been a regression at <= 3.13 proper even
> (which may pose interesting questions about the level of testing coverage
> we still enjoy [not!?] in this hardware area).
>
> Oh well, seems I'll have to prepare/build 3.13 now...

It's > 15 year old hardware, so yes I believe we have close to 0
testing coverage on it outside of distros,

I'm not even sure I have one anymore, I might be able to test an MGA in one box.

Dave.
Daniel Vetter March 24, 2014, 8:56 a.m. UTC | #6
On Mon, Mar 24, 2014 at 07:45:47AM +1000, Dave Airlie wrote:
> On Mon, Mar 24, 2014 at 7:27 AM, Andreas Mohr <andi@lisas.de> wrote:
> > On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote:
> >> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote:
> >> >
> >> > which did end up flawless on 3.12.0-rc2+, too
> >> > (but failed to improve the issue on 3.14.0-rc7+).
> >> >
> >> > So, for all intents and purposes, drm infrastructure seems unavoidably
> >> > (neither dri disable nor libdrm upgrade helps) affected.
> >> > Does anyone know which change caused that issue?
> >> > (I'm asking because bisect here would be relatively painful).
> >>
> >> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current
> >> 3.14 rc only, or did it happen already in the previous release?
> >
> > Hmm, given that Mikulas in
> > https://lkml.org/lkml/2014/2/26/537
> > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!)
> > that that indeed may have been a regression at <= 3.13 proper even
> > (which may pose interesting questions about the level of testing coverage
> > we still enjoy [not!?] in this hardware area).
> >
> > Oh well, seems I'll have to prepare/build 3.13 now...
> 
> It's > 15 year old hardware, so yes I believe we have close to 0
> testing coverage on it outside of distros,
> 
> I'm not even sure I have one anymore, I might be able to test an MGA in one box.

I haven't done a full read of all the related code, but this smells like a
similar bug I've hit all over the place in the i810 driver (another one of
those undead drm drivers of yonders). Ingredients:

1) Xorg creates a drm mapping of the register space.
2) Xorg tells the hw-specific drm which drm mapping has the hw registers,
and the driver uses that. Iirc this has been done as some form of OS
abstraction. Also note that these mappings aren't refcounted, so the first
guy to call drm_rmmap wins.

-> All hell breaks loose if Xorg dies and takes all it's mappings with it
(in master_destroy, since the Xorg /dev fd is the master) and leaves the
driver hanging in the air if there's an interrupt still pending (or
anything else fwiw).

In my case with i810 it was some subtle thing with error codes somewhere
else (iirc, been a while) which made Xorg fall over a bit differently and
so crash the kernel. Presuming this is the case I think we need a proper
bisect here to figure out the root-cause and re-apply the lost duct-tape.

"Properly" fixing the underlying bug for any sane definition of "proper"
is impossible - the legacy drm driver model is just too broken for that.
And for new-style drivers killing the irq support when they don't expect
it is not cool, since those drivers are sane and assume full control over
all hw interactions (not like their legacy bethren).
-Daniel
Mikulas Patocka March 24, 2014, 5:17 p.m. UTC | #7
On Mon, 24 Mar 2014, Daniel Vetter wrote:

> On Mon, Mar 24, 2014 at 07:45:47AM +1000, Dave Airlie wrote:
> > On Mon, Mar 24, 2014 at 7:27 AM, Andreas Mohr <andi@lisas.de> wrote:
> > > On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote:
> > >> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote:
> > >> >
> > >> > which did end up flawless on 3.12.0-rc2+, too
> > >> > (but failed to improve the issue on 3.14.0-rc7+).
> > >> >
> > >> > So, for all intents and purposes, drm infrastructure seems unavoidably
> > >> > (neither dri disable nor libdrm upgrade helps) affected.
> > >> > Does anyone know which change caused that issue?
> > >> > (I'm asking because bisect here would be relatively painful).
> > >>
> > >> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current
> > >> 3.14 rc only, or did it happen already in the previous release?
> > >
> > > Hmm, given that Mikulas in
> > > https://lkml.org/lkml/2014/2/26/537
> > > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!)
> > > that that indeed may have been a regression at <= 3.13 proper even
> > > (which may pose interesting questions about the level of testing coverage
> > > we still enjoy [not!?] in this hardware area).

That patch drops a mutex, so it is not correct. There is mutex resursion - 
we need to uninstall the irq in drm_master_destroy, because here we are 
committed to destroy the device. But the routine that uninstalls the irq 
takes struct_mutex, which is already held in drm_master_destroy.

I suppose that the person who maintains drm reworks the patch so that it's 
correct:

- could we use a different mutex to protect the irq in drm_irq.c? Or 
possibly no mutex at all and use cmpxchg to manipulate the variable 
dev->irq_enabled? - this seems like the best solution. But I am not sure 
if the code in drm_irq.c somehow depends on the facts that other parts of 
the drm subsystem take struct_mutex.

- could we pass a new argument to drm_irq_uninstall that tells it not to 
take the mutex? drm_master_destroy would set this argument to 1. 
drm_master_destroy is mostly called with struct_mutex held, but there may 
be places in vmwgfx_drv.c where drm_master_put (which calls 
drm_master_destroy) may be called without struct_mutex held.

Is it true that drm_master_destroy can be called without struct_mutex 
held? I don't know.


I think drm maintainer should sort out the above issues and modify the 
patch accordingly.

> > > Oh well, seems I'll have to prepare/build 3.13 now...
> > 
> > It's > 15 year old hardware, so yes I believe we have close to 0
> > testing coverage on it outside of distros,
> > 
> > I'm not even sure I have one anymore, I might be able to test an MGA in one box.
> 
> I haven't done a full read of all the related code, but this smells like a
> similar bug I've hit all over the place in the i810 driver (another one of
> those undead drm drivers of yonders). Ingredients:
> 
> 1) Xorg creates a drm mapping of the register space.
> 2) Xorg tells the hw-specific drm which drm mapping has the hw registers,
> and the driver uses that. Iirc this has been done as some form of OS
> abstraction. Also note that these mappings aren't refcounted, so the first
> guy to call drm_rmmap wins.
> 
> -> All hell breaks loose if Xorg dies and takes all it's mappings with it
> (in master_destroy, since the Xorg /dev fd is the master) and leaves the
> driver hanging in the air if there's an interrupt still pending (or
> anything else fwiw).

For me that crash happened when xorg exited with a fatal error too.

Mikulas
Daniel Vetter March 24, 2014, 8:26 p.m. UTC | #8
On Mon, Mar 24, 2014 at 01:17:12PM -0400, Mikulas Patocka wrote:
> 
> 
> On Mon, 24 Mar 2014, Daniel Vetter wrote:
> 
> > On Mon, Mar 24, 2014 at 07:45:47AM +1000, Dave Airlie wrote:
> > > On Mon, Mar 24, 2014 at 7:27 AM, Andreas Mohr <andi@lisas.de> wrote:
> > > > On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote:
> > > >> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@lisas.de> wrote:
> > > >> >
> > > >> > which did end up flawless on 3.12.0-rc2+, too
> > > >> > (but failed to improve the issue on 3.14.0-rc7+).
> > > >> >
> > > >> > So, for all intents and purposes, drm infrastructure seems unavoidably
> > > >> > (neither dri disable nor libdrm upgrade helps) affected.
> > > >> > Does anyone know which change caused that issue?
> > > >> > (I'm asking because bisect here would be relatively painful).
> > > >>
> > > >> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current
> > > >> 3.14 rc only, or did it happen already in the previous release?
> > > >
> > > > Hmm, given that Mikulas in
> > > > https://lkml.org/lkml/2014/2/26/537
> > > > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!)
> > > > that that indeed may have been a regression at <= 3.13 proper even
> > > > (which may pose interesting questions about the level of testing coverage
> > > > we still enjoy [not!?] in this hardware area).
> 
> That patch drops a mutex, so it is not correct. There is mutex resursion - 
> we need to uninstall the irq in drm_master_destroy, because here we are 
> committed to destroy the device. But the routine that uninstalls the irq 
> takes struct_mutex, which is already held in drm_master_destroy.
> 
> I suppose that the person who maintains drm reworks the patch so that it's 
> correct:
> 
> - could we use a different mutex to protect the irq in drm_irq.c? Or 
> possibly no mutex at all and use cmpxchg to manipulate the variable 
> dev->irq_enabled? - this seems like the best solution. But I am not sure 
> if the code in drm_irq.c somehow depends on the facts that other parts of 
> the drm subsystem take struct_mutex.
> 
> - could we pass a new argument to drm_irq_uninstall that tells it not to 
> take the mutex? drm_master_destroy would set this argument to 1. 
> drm_master_destroy is mostly called with struct_mutex held, but there may 
> be places in vmwgfx_drv.c where drm_master_put (which calls 
> drm_master_destroy) may be called without struct_mutex held.
> 
> Is it true that drm_master_destroy can be called without struct_mutex 
> held? I don't know.
> 
> 
> I think drm maintainer should sort out the above issues and modify the 
> patch accordingly.
> 
> > > > Oh well, seems I'll have to prepare/build 3.13 now...
> > > 
> > > It's > 15 year old hardware, so yes I believe we have close to 0
> > > testing coverage on it outside of distros,
> > > 
> > > I'm not even sure I have one anymore, I might be able to test an MGA in one box.
> > 
> > I haven't done a full read of all the related code, but this smells like a
> > similar bug I've hit all over the place in the i810 driver (another one of
> > those undead drm drivers of yonders). Ingredients:
> > 
> > 1) Xorg creates a drm mapping of the register space.
> > 2) Xorg tells the hw-specific drm which drm mapping has the hw registers,
> > and the driver uses that. Iirc this has been done as some form of OS
> > abstraction. Also note that these mappings aren't refcounted, so the first
> > guy to call drm_rmmap wins.
> > 
> > -> All hell breaks loose if Xorg dies and takes all it's mappings with it
> > (in master_destroy, since the Xorg /dev fd is the master) and leaves the
> > driver hanging in the air if there's an interrupt still pending (or
> > anything else fwiw).
> 
> For me that crash happened when xorg exited with a fatal error too.

Is this fatal error itself a regression or have you seen that on older
kernels, too?

Like I've said the entire teardown sequence for legacy drm drivers is
terminally busted, so the only hope we have is to reapply this missing
duct-tape which made your X crash. But if that itself isn't a regression
there's no way to fix the current drm/mga driver without a complete
rewrite as a new-style kernel modesetting driver.
-Daniel
Mikulas Patocka March 24, 2014, 8:40 p.m. UTC | #9
On Mon, 24 Mar 2014, Daniel Vetter wrote:

> On Mon, Mar 24, 2014 at 01:17:12PM -0400, Mikulas Patocka wrote:

> > > > > Hmm, given that Mikulas in
> > > > > https://lkml.org/lkml/2014/2/26/537
> > > > > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!)
> > > > > that that indeed may have been a regression at <= 3.13 proper even
> > > > > (which may pose interesting questions about the level of testing coverage
> > > > > we still enjoy [not!?] in this hardware area).
> > 
> > That patch drops a mutex, so it is not correct. There is mutex resursion - 
> > we need to uninstall the irq in drm_master_destroy, because here we are 
> > committed to destroy the device. But the routine that uninstalls the irq 
> > takes struct_mutex, which is already held in drm_master_destroy.
> > 
> > I suppose that the person who maintains drm reworks the patch so that it's 
> > correct:
> > 
> > - could we use a different mutex to protect the irq in drm_irq.c? Or 
> > possibly no mutex at all and use cmpxchg to manipulate the variable 
> > dev->irq_enabled? - this seems like the best solution. But I am not sure 
> > if the code in drm_irq.c somehow depends on the facts that other parts of 
> > the drm subsystem take struct_mutex.
> > 
> > - could we pass a new argument to drm_irq_uninstall that tells it not to 
> > take the mutex? drm_master_destroy would set this argument to 1. 
> > drm_master_destroy is mostly called with struct_mutex held, but there may 
> > be places in vmwgfx_drv.c where drm_master_put (which calls 
> > drm_master_destroy) may be called without struct_mutex held.
> > 
> > Is it true that drm_master_destroy can be called without struct_mutex 
> > held? I don't know.
> > 
> > 
> > I think drm maintainer should sort out the above issues and modify the 
> > patch accordingly.
> > 
> > > -> All hell breaks loose if Xorg dies and takes all it's mappings with it
> > > (in master_destroy, since the Xorg /dev fd is the master) and leaves the
> > > driver hanging in the air if there's an interrupt still pending (or
> > > anything else fwiw).
> > 
> > For me that crash happened when xorg exited with a fatal error too.
> 
> Is this fatal error itself a regression or have you seen that on older
> kernels, too?

In my case that Xorg error was not kernel-related at all. It happened 
because of unknown symbol because I used mga_dri.so from Debian 6 in 
Debian 7 (mga_dri.so isn't shipped in Debian 7 anymore). I can still play 
quake with that old mga_dri.so, although in some other scenario it causes 
failure because of unknown symbol. I should probably recompile mga_dri on 
my own.

> Like I've said the entire teardown sequence for legacy drm drivers is
> terminally busted, so the only hope we have is to reapply this missing
> duct-tape which made your X crash. But if that itself isn't a regression
> there's no way to fix the current drm/mga driver without a complete
> rewrite as a new-style kernel modesetting driver.
> -Daniel

If someone understands the locking issues I pointed out above, it could be 
easy to fix.

Mikulas
Daniel Vetter March 24, 2014, 9:46 p.m. UTC | #10
On Mon, Mar 24, 2014 at 9:40 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> > > -> All hell breaks loose if Xorg dies and takes all it's mappings with it
>> > > (in master_destroy, since the Xorg /dev fd is the master) and leaves the
>> > > driver hanging in the air if there's an interrupt still pending (or
>> > > anything else fwiw).
>> >
>> > For me that crash happened when xorg exited with a fatal error too.
>>
>> Is this fatal error itself a regression or have you seen that on older
>> kernels, too?
>
> In my case that Xorg error was not kernel-related at all. It happened
> because of unknown symbol because I used mga_dri.so from Debian 6 in
> Debian 7 (mga_dri.so isn't shipped in Debian 7 anymore). I can still play
> quake with that old mga_dri.so, although in some other scenario it causes
> failure because of unknown symbol. I should probably recompile mga_dri on
> my own.
>
>> Like I've said the entire teardown sequence for legacy drm drivers is
>> terminally busted, so the only hope we have is to reapply this missing
>> duct-tape which made your X crash. But if that itself isn't a regression
>> there's no way to fix the current drm/mga driver without a complete
>> rewrite as a new-style kernel modesetting driver.
>> -Daniel
>
> If someone understands the locking issues I pointed out above, it could be
> easy to fix.

The locking issue isn't your problem, the real issue is that putting a
irq_uninstall into core code will break all the new (properly working)
drivers. And you can't really fix this in mga itself since the
lifetime rules of the register mappings are totally broken. It's a
fundamental misdesign of the legacy drm driver architecture and the
_only_ way to fix this bug for real is to rewrite this all. Which was
done for all the still used drivers like i915, radeon, nouveau, ...
-Daniel
Andreas Mohr March 24, 2014, 11:11 p.m. UTC | #11
Hi,

On Mon, Mar 24, 2014 at 10:46:49PM +0100, Daniel Vetter wrote:
> On Mon, Mar 24, 2014 at 9:40 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > If someone understands the locking issues I pointed out above, it could be
> > easy to fix.
> 
> The locking issue isn't your problem, the real issue is that putting a
> irq_uninstall into core code will break all the new (properly working)
> drivers. And you can't really fix this in mga itself since the
> lifetime rules of the register mappings are totally broken. It's a
> fundamental misdesign of the legacy drm driver architecture and the
> _only_ way to fix this bug for real is to rewrite this all. Which was
> done for all the still used drivers like i915, radeon, nouveau, ...

That sounds plausible - yet with meatballs (ok, maybe I should omit
such quite possibly unjustified qualification) such as this:

git show --stat 771fe6b912fca54f03

how is a bunch of marginally-trained hobbyists ever supposed to be implementing
a working practical (i.e., "base") driver for *various* currently unsupported
(booted would perhaps even be a more fitting word?) hardware?

While the result of a wc -l check of the drivers/gpu/drm/r128 dir itself
seems quite positive, that still might be not much of help
when eyeing the large KMS changes that had to be done elsewhere.

I guess we can make use of all the practical advice/links that we can get...
(such as hints at good candidates of existing KMSified drivers
which don't come with the full bells and whistles package,
hints at suitably sized KMS support commits, grandma tutorials, ...).
Some semi-short search wasn't overly successful, with links such as
http://www.x.org/wiki/ModeSetting/
https://en.wikipedia.org/wiki/KMS_%28Linux_kernel%29#Linux
"New, Generic X.Org KMS Driver Work" http://www.phoronix.com/scan.php?page=news_item&px=OTk1OA

Or perhaps I should just state outright that I seem to be in need
of a working solution for my kernel upgrade pain
which I would be deemed to want semi-soonish
(the i810, MGA users and some others might be sharing my thoughts).
IOW, my r128 driver is somewhat of a "still used driver", thank you very much.

Thanks for having managed to survive my posting in an asbestos-lined
garment (apologies if it came across in harsh terms :),

Andreas Mohr
(not necessarily a member of the forced-monopoly hardware upgrade treadmill cult)
Daniel Vetter March 25, 2014, 9:13 a.m. UTC | #12
On Tue, Mar 25, 2014 at 12:11 AM, Andreas Mohr <andi@lisas.de> wrote:
> On Mon, Mar 24, 2014 at 10:46:49PM +0100, Daniel Vetter wrote:
>> On Mon, Mar 24, 2014 at 9:40 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> > If someone understands the locking issues I pointed out above, it could be
>> > easy to fix.
>>
>> The locking issue isn't your problem, the real issue is that putting a
>> irq_uninstall into core code will break all the new (properly working)
>> drivers. And you can't really fix this in mga itself since the
>> lifetime rules of the register mappings are totally broken. It's a
>> fundamental misdesign of the legacy drm driver architecture and the
>> _only_ way to fix this bug for real is to rewrite this all. Which was
>> done for all the still used drivers like i915, radeon, nouveau, ...
>
> That sounds plausible - yet with meatballs (ok, maybe I should omit
> such quite possibly unjustified qualification) such as this:
>
> git show --stat 771fe6b912fca54f03
>
> how is a bunch of marginally-trained hobbyists ever supposed to be implementing
> a working practical (i.e., "base") driver for *various* currently unsupported
> (booted would perhaps even be a more fitting word?) hardware?
>
> While the result of a wc -l check of the drivers/gpu/drm/r128 dir itself
> seems quite positive, that still might be not much of help
> when eyeing the large KMS changes that had to be done elsewhere.
>
> I guess we can make use of all the practical advice/links that we can get...
> (such as hints at good candidates of existing KMSified drivers
> which don't come with the full bells and whistles package,
> hints at suitably sized KMS support commits, grandma tutorials, ...).
> Some semi-short search wasn't overly successful, with links such as
> http://www.x.org/wiki/ModeSetting/
> https://en.wikipedia.org/wiki/KMS_%28Linux_kernel%29#Linux
> "New, Generic X.Org KMS Driver Work" http://www.phoronix.com/scan.php?page=news_item&px=OTk1OA
>
> Or perhaps I should just state outright that I seem to be in need
> of a working solution for my kernel upgrade pain
> which I would be deemed to want semi-soonish
> (the i810, MGA users and some others might be sharing my thoughts).
> IOW, my r128 driver is somewhat of a "still used driver", thank you very much.
>
> Thanks for having managed to survive my posting in an asbestos-lined
> garment (apologies if it came across in harsh terms :),
>
> Andreas Mohr
> (not necessarily a member of the forced-monopoly hardware upgrade treadmill cult)

Well I have a i810 here personally and since about 3 years haven't
gotten around to implement a kms stack for it (just for fun
essentially). And that's on a platform which can steal a lot of the
kms infrastructure from the i915 driver.

For simple devices getting a basic kms driver off the ground (that
means no accelaration support in the kernel with GEM, and obviously
userspace not ported) seems to be doable in one gsoc term, at least
someone managed it. Which means 3 months full-time work.

So yeah, writing real gpu drivers is fairly non-trivial task. And all
the people who can do it are fully busy getting the new shit to work.

Wrt stealing code: It's better to look at the legacy fbdev drivers,
since they have modeset support. And meanwhile we have fairly decent
drm DocBook in the kernel, the latest updates are in drm-next.
-Daniel
Mikulas Patocka March 25, 2014, 10:42 p.m. UTC | #13
On Mon, 24 Mar 2014, Daniel Vetter wrote:

> >> Like I've said the entire teardown sequence for legacy drm drivers is
> >> terminally busted, so the only hope we have is to reapply this missing
> >> duct-tape which made your X crash. But if that itself isn't a regression
> >> there's no way to fix the current drm/mga driver without a complete
> >> rewrite as a new-style kernel modesetting driver.
> >> -Daniel
> >
> > If someone understands the locking issues I pointed out above, it could be
> > easy to fix.
> 
> The locking issue isn't your problem, the real issue is that putting a
> irq_uninstall into core code will break all the new (properly working)
> drivers. And you can't really fix this in mga itself since the
> lifetime rules of the register mappings are totally broken. It's a
> fundamental misdesign of the legacy drm driver architecture and the
> _only_ way to fix this bug for real is to rewrite this all. Which was
> done for all the still used drivers like i915, radeon, nouveau, ...
> -Daniel

When I tried Radeon AGP card with the KMS driver, it lacked the 
possibility to set video mode with fbset and the framebuffer console was 
very slow because it wasn't accelerated.

So, Radeon with the new driver is much less useable than Matrox.

Did I misconfigure something? Or, is console acceleration and modesetting 
deliberately unsupported in KMS drivers?

Mikulas
Daniel Vetter March 26, 2014, 7:15 a.m. UTC | #14
On Tue, Mar 25, 2014 at 06:42:13PM -0400, Mikulas Patocka wrote:
> 
> 
> On Mon, 24 Mar 2014, Daniel Vetter wrote:
> 
> > >> Like I've said the entire teardown sequence for legacy drm drivers is
> > >> terminally busted, so the only hope we have is to reapply this missing
> > >> duct-tape which made your X crash. But if that itself isn't a regression
> > >> there's no way to fix the current drm/mga driver without a complete
> > >> rewrite as a new-style kernel modesetting driver.
> > >> -Daniel
> > >
> > > If someone understands the locking issues I pointed out above, it could be
> > > easy to fix.
> > 
> > The locking issue isn't your problem, the real issue is that putting a
> > irq_uninstall into core code will break all the new (properly working)
> > drivers. And you can't really fix this in mga itself since the
> > lifetime rules of the register mappings are totally broken. It's a
> > fundamental misdesign of the legacy drm driver architecture and the
> > _only_ way to fix this bug for real is to rewrite this all. Which was
> > done for all the still used drivers like i915, radeon, nouveau, ...
> > -Daniel
> 
> When I tried Radeon AGP card with the KMS driver, it lacked the 
> possibility to set video mode with fbset and the framebuffer console was 
> very slow because it wasn't accelerated.
> 
> So, Radeon with the new driver is much less useable than Matrox.
> 
> Did I misconfigure something? Or, is console acceleration and modesetting 
> deliberately unsupported in KMS drivers?

You've missed nothing and the performances is abmysal intentionally. It
could be fixed (we've had patches for i915 to accelarate the performance)
but not worth it - fbdev emulation is just here for backwards compat.

If you want fast console on kms drivers you need to look at David
Herrmann's kmscon. That uses GL drivers on top of egl+gbm+kms and that
will be fast. The long-term plan is to switch to that and have a very
minimal shim in the kernel on top of kms as an emergency logging console
only (i.e. even less than the current fbdev stuff).
-Daniel
diff mbox

Patch

Index: linux-3.13.5/drivers/gpu/drm/drm_irq.c
===================================================================
--- linux-3.13.5.orig/drivers/gpu/drm/drm_irq.c	2014-01-24 19:36:33.529945530 +0100
+++ linux-3.13.5/drivers/gpu/drm/drm_irq.c	2014-02-26 16:40:41.379534715 +0100
@@ -357,10 +357,8 @@  int drm_irq_uninstall(struct drm_device
 	if (!drm_core_check_feature(dev, DRIVER_HAVE_IRQ))
 		return -EINVAL;
 
-	mutex_lock(&dev->struct_mutex);
 	irq_enabled = dev->irq_enabled;
 	dev->irq_enabled = false;
-	mutex_unlock(&dev->struct_mutex);
 
 	/*
 	 * Wake up any waiters so they don't hang.
Index: linux-3.13.5/drivers/gpu/drm/drm_stub.c
===================================================================
--- linux-3.13.5.orig/drivers/gpu/drm/drm_stub.c	2014-02-26 16:31:36.158862071 +0100
+++ linux-3.13.5/drivers/gpu/drm/drm_stub.c	2014-02-26 16:40:52.031317277 +0100
@@ -167,6 +167,8 @@  static void drm_master_destroy(struct kr
 
 	list_del(&master->head);
 
+	drm_irq_uninstall(dev);
+
 	if (dev->driver->master_destroy)
 		dev->driver->master_destroy(dev, master);