[RFC] fix parisc runtime hangs wrt pa_tlb_lock

John David Anglin wrote:
>> Since many kernel versions I regularly faced reproducibly kernel hangs when
>> compiling some bigger files. The machine suddenly just seemed to hang.
>>
>> With spinlock debugging turned on I found this:
>>
>> BUG: spinlock recursion on CPU#0, tool/7263
>>  lock: 10644000, .magic: dead4ead, .owner: tool/7263, .owner_cpu: 0
>> Backtrace:
>>  [<10113a94>] show_stack+0x18/0x28
>>
>> BUG: spinlock lockup on CPU#0, tool/7263, 10644000
>> Backtrace:
>>  [<10113a94>] show_stack+0x18/0x28
>>
>> BUG: soft lockup - CPU#0 stuck for 61s! [tool:7263]
>> IASQ: 00000000 00000000 IAOQ: 102d55dc 102d557c
>>  IIR: 03c008b3    ISR: 00000000  IOR: 00000000
>>  CPU:        0   CR30: 7d1a4000 CR31: 11111111
>>  ORIG_R28: 00000000
>>  IAOQ[0]: _raw_spin_lock+0x15c/0x1c0
>>  IAOQ[1]: _raw_spin_lock+0xfc/0x1c0
>>  RP(r2): _raw_spin_lock+0x18c/0x1c0
>> Backtrace:
>>  [<102d560c>] _raw_spin_lock+0x18c/0x1c0
>>
>> Kernel panic - not syncing: softlockup: hung tasks
>> Backtrace:
>>  [<10113a94>] show_stack+0x18/0x28
> 
> I have tested the proposed change using 2.6.30-rc6 and a modified version
> of 2.6.22.10.  I tested using SMP kernels running on a rp3440 and UP
> kernels on a c3750.  I also tested changing the macros to just use
> preempt_disable/preempt_enable.
> 
> The patch doesn't cause any new problems as far as I can tell.  

That's good. Thanks for testing !

> However,
> it doesn't fix any of the problems that I currently see on these two
> machines.  In particular, I see the occasional gcc testsuite timeout
> using SMP kernels.  Programs that usually take a few seconds to run
> timeout after three minutes.  These timeouts don't occur with UP kernels.

Ok, it fixes at least a kernel hang I faced regularily with my compilations...

> On the rp3440, the spinlock is definitely needed.  With just
> preempt_disable/preempt_enable, a crash occurs during bootstrap
> at the point unused memory is recovered.  Thus, the tlb purge
> issue referred to in the preceeding comment affects more than
> just N class.
> 
> On the otherhand, it doesn't seem necessary to disable interrupts
> during the purge with UP kernels.  

My kernel crashes happened with UP-kernels on a B2000 (1 CPU).
So, I still have the feeling that it's necessary to disable interrupts 
on UP kernels as well.

The new cleaned-up attached patch below shows, that arch/parisc/kernel/pci-dma.c
is affected. My kernel crashes always happened when the machine was pretty 
much loaded, during memory pressure when the system still wanted to load
something from disk.
In pci-dma.c the TLB flushing code is called from the DMA handlers, so
it does indicate that IRQ would need to be disabled.
One example I could imagine (with a UP-kernel):
- Process A runs some code, gets into kernel and executes clear_user_page() and locks
  the pa_tlb_lock spinlock (inside the critical section).
- Suddenly Process B gets data from disk via DMA handlers in pci-dma.c
- DMA-Interrupt kicks in, the DMA handler tries to get the pa_tlb_lock and fails
  since process A still holds the lock
-> machine deadlocks and kernel starts reporting about "hung tasks after 61seconds".
  (just search the mailing list for such reports..)

So, even a UP-kernel is affected.

> With SMP kernels, it would be nice
> to know if the lockup was caused by an interruption during the tlb
> purge, or a preemption issue.
> 
> I have the sense that disabling interrupts is wrong.  That is any
> given CPU can only generate one PxTLB inter processor broadcast
> at a time.  Disabling interrupts could cause a deadlock if a TLB
> purge was needed while the purge code was executing.
> 
> The other alternative is to allow the processor that holds the lock
> to enter the flush code.  This would fix the deadlock.  Don't know how
> to code this (atomic compare and exchange?).
> 
> The preempt_disable/preempt_enable crash on the rp3440 made me wonder
> if all tlb purge operations are properly protected with the tlb spinlock.
> I think we need to look at flush_tlb_all_local and copy_user_page_asm.
> They don't seem protected.

New patch attached.

Helge

--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Message ID	4A2ADEC9.2090403@gmx.de (mailing list archive)
State	Superseded
Delegated to:	Helge Deller
Headers	show Received: from vger.kernel.org (vger.kernel.org [209.132.176.167]) by demeter.kernel.org (8.14.2/8.14.2) with ESMTP id n56LPcX9010739 for <patchwork-parisc@patchwork.kernel.org>; Sat, 6 Jun 2009 21:25:38 GMT Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751366AbZFFVZe (ORCPT <rfc822;patchwork-parisc@patchwork.kernel.org>); Sat, 6 Jun 2009 17:25:34 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751058AbZFFVZe (ORCPT <rfc822;linux-parisc-outgoing>); Sat, 6 Jun 2009 17:25:34 -0400 Received: from mail.gmx.net ([213.165.64.20]:37569 "HELO mail.gmx.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1751366AbZFFVZd (ORCPT <rfc822;linux-parisc@vger.kernel.org>); Sat, 6 Jun 2009 17:25:33 -0400 Received: (qmail invoked by alias); 06 Jun 2009 21:25:33 -0000 Received: from mnhm-590e0f04.pool.einsundeins.de (EHLO [192.168.178.60]) [89.14.15.4] by mail.gmx.net (mp025) with SMTP; 06 Jun 2009 23:25:33 +0200 X-Authenticated: #1045983 X-Provags-ID: V01U2FsdGVkX18LyRQ70DoL3Ttzz3B+BZJ3Sha93wUkfqghdZu4fJ k2mHSNyuDIt4Jz Message-ID: <4A2ADEC9.2090403@gmx.de> Date: Sat, 06 Jun 2009 23:25:29 +0200 From: Helge Deller <deller@gmx.de> User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: John David Anglin <dave@hiauly1.hia.nrc.ca> CC: linux-parisc@vger.kernel.org Subject: Re: [PATCH, RFC] fix parisc runtime hangs wrt pa_tlb_lock References: <20090528015037.3417E4FEA@hiauly1.hia.nrc.ca> In-Reply-To: <20090528015037.3417E4FEA@hiauly1.hia.nrc.ca> X-Enigmail-Version: 0.95.7 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-FuHaFi: 0.45 Sender: linux-parisc-owner@vger.kernel.org Precedence: bulk List-ID: <linux-parisc.vger.kernel.org> X-Mailing-List: linux-parisc@vger.kernel.org

[RFC] fix parisc runtime hangs wrt pa_tlb_lock

Commit Message

Comments

Patch