From patchwork Tue May 31 11:43:43 2016
From: Arnd Bergmann
To: linux-arm-kernel@lists.infradead.org
Cc: Fabio Estevam, shawnguo@kernel.org, Holger Schurig
Subject: Re: [BUG] CONFIG_UNINLINE_SPIN_UNLOCK important for Cortex-A9
Date: Tue, 31 May 2016 13:43:43 +0200
Message-ID: <6689734.3ffZe38SoY@wuerfel>
In-Reply-To: <871t4io608.fsf@gmail.com>
References: <874m9eoetu.fsf@gmail.com> <4854518.jvbplH0SM0@wuerfel> <871t4io608.fsf@gmail.com>
X-Patchwork-Submitter: Arnd Bergmann
X-Patchwork-Id: 9144453

On Tuesday, May 31, 2016 12:51:03 PM CEST Holger Schurig wrote:
> > Have you tried multi_v7_defconfig?
> > That also does not set
> > CONFIG_UNINLINE_SPIN_UNLOCK, but it is generally assumed to work.
>
> Not yet, will try this (see below).

Ok

> > Just to be sure: If you just enable UNINLINE_SPIN_UNLOCK (e.g. using
> > a 'select' from arch/arm/mach-imx/Kconfig) without selecting PROVE_RCU,
> > you say the bug is gone too?
>
> Currently I select it indirectly. One person that tried to help me
> turned on 29 different kernel debug options and the bug was gone. Then I
> reduced this step by step to find out which of them was needed
> minimally.
>
> Still I don't think that it's an RCU issue. But it could of course be a
> "Heisenbug".

I would also not guess RCU as the first possibility. I was first guessing
that something might be wrong with arch_spin_unlock(), but it's actually
a really trivial function:

static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
	smp_mb();
	lock->tickets.owner++;
	dsb_sev();
}

so there is really not much that can go wrong, and if there is, then
presumably either smp_mb() or dsb_sev() is not as strong a barrier here
as it should be, and that seems very unlikely. The last change to
the function was a while ago, but you could try reverting this commit
if nothing else helps:

commit 20e260b6f4f717c100620122f626a2c06a4cfd72
Author: Will Deacon
Date:   Thu Jan 24 14:47:38 2013 +0100

    ARM: 7632/1: spinlock: avoid exclusive accesses on unlock() path

    When unlocking a spinlock, all we need to do is increment the owner
    field of the lock. Since only one CPU can be performing an unlock()
    operation for a given lock, this doesn't need to be exclusive.

    This patch simplifies arch_spin_unlock to use non-exclusive accesses
    when updating the owner field of the lock.

    Signed-off-by: Will Deacon
    Signed-off-by: Russell King

> > - Specific kernel version, and full list of patches applied (if any).
> >   Have you reproduced this with a plain unpatched linux-4.6 kernel?
>
> I'm not in love with vendor-kernels, so I use kernels from kernel.org.
> When I started to chase the oops happened in 4.4.4, but the behavior was
> also present in 4.5, 4.5.1 ... 4.5.4 and 4.6.
>
> I have some patches applied, e.g. aufs4, one patch in the i.MX6Q FEC
> ethernet driver and stuff to program the eMMC firmware. Nothing that
> plays with rcu, spinlocks etc, e.g. no preempt or Linux-RT patches.
>
> However, a vanilla kernel boots on my device (isn't device tree
> nice?!?!?) and here I had the same behavior. I think that vanilla kernel
> was 4.5.3, but I'm unsure.

Ok, good, that should make it easy to reproduce.

> > - Specific gcc version you used for building the kernel. Does this
> >   happen with both old and new compilers, e.g. 4.7 and 6.1?
>
> I don't have yet a gcc 6.1 armhf cross compiler.

It's fairly straightforward to build gcc binaries for testing kernels
using Segher's "buildall" script from
git://git.infradead.org/users/segher/buildall.git

I use that for testing out lots of gcc versions across architectures
when I run into a bug that is version specific.

> My gcc used to be
> http://releases.linaro.org/14.04/components/toolchain/binaries/gcc-linaro-arm-linux-gnueabihf-4.8-2014.04_linux.tar.xz
> but during the bug chase I switched to Debian's cross-gcc. It says it is
> "arm-linux-gnueabihf-gcc ( 4.9.2-10) 4.9.2".
>
> A kernel compiled by someone else with some "gcc-4.8
> 4.8.4-2ubuntu1~14.04" also had the same oopsing-behavior.

ok. Both of them are a few years old, so it would be interesting to
try a new version, probably not so interesting to try anything older.
If you like, I can also send you a tarball with gcc-6 binaries

> > - If it happens with an unpatched linux-4.6 and all gcc versions
> >   with multi_v7_defconfig, we should try reproducing it on some
> >   other quad-core Cortex-A9 (non-imx) platform with the same kernel
> >   binary.
>
> I will try to reproduce it, as said it can take between 3 to 24 hours.
>
> I'll be using
>
> - vanilla 4.6 without any patch
> - ...
>   and without any device-specific kernel module
> - multi_v7_defconfig
> - Debian 4.9.2 cross-compiler

awesome.

> [ 2.844408] Bluetooth: L2CAP socket layer initialized
> [ 2.849514] Bluetooth: SCO socket layer initialized
> [ 2.895099] usbcore: registered new interface driver btusb
> [ 3.257897] SMSC LAN8710/LAN8720 2188000.ethernet:00: attached PHY driver [SMSC LAN8710/LAN8720] (mii_bus:phy_addr=2188000.ethernet:00, irq=-1)
> [17827.758966] Unable to handle kernel paging request at virtual address 00001014

0x1014 is a rather long offset, but still a plausible NULL pointer. The r10
register contains 0x1000, so this is probably an incorrect value and is
being used as a pointer with offset 0x14 added in. With an objdump of your
net/core/dev.o, we could see which pointer that was, to maybe figure out
which lock is supposed to protect it and where that lock gets released.

> [17827.766279] pgd = ee09c000
> [17827.769003] [00001014] *pgd=3eba3831, *pte=00000000, *ppte=00000000
> [17827.775383] Internal error: Oops: 17 [#1] SMP ARM
> [17827.780108] Modules linked in: usbhid btusb btrtl btbcm btintel bluetooth flexcan smsc95xx usbnet mii ptxc(O)
> [17827.790242] CPU: 1 PID: 372 Comm: stress-ng-socke Tainted: G O 4.5.4 #1
> [17827.797995] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> [17827.804536] task: ed614780 ti: eebba000 task.ti: eebba000
> [17827.809977] PC is at __netif_receive_skb_core+0x328/0xa9c

Unfortunately in the middle of a rather long function, and I don't see
a spin_unlock in this function, in fact it's not even called with
a spinlock held, so it must be something more indirect.
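
To make the address arithmetic from the oops concrete, here is a small
user-space sketch. The struct layout is made up for illustration (I do not
know which kernel structure is actually involved); it only shows how a bogus
base value of 0x1000 plus a member at offset 0x14 gives the faulting
address 0x1014:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, for illustration only: any structure whose
 * accessed member sits at offset 0x14 would produce the same fault.
 */
struct example_obj {
	uint32_t pad[5];	/* 5 * 4 bytes = 0x14 bytes of other members */
	uint32_t field;		/* the member the faulting load would read */
};

/* Check the assumed offset at compile time. */
static_assert(offsetof(struct example_obj, field) == 0x14,
	      "field is expected at offset 0x14");

/*
 * Address that would be accessed for base->field. Computed on plain
 * integers so that no invalid pointer is ever dereferenced.
 */
static uint32_t fault_address(uint32_t base)
{
	return base + (uint32_t)offsetof(struct example_obj, field);
}
```

Calling fault_address(0x1000) gives 0x1014, which is why the objdump is
interesting: it would tell us which member really lives at that offset.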

> [17828.297390] r4:ef7ba1b8 r3:ef7ba148
> [17828.301008] [] (process_backlog) from [] (net_rx_action+0x1fc/0x2f0)
> [17828.309101] r10:eebbbaf0 r9:00000040 r8:c05ea100 r7:0000012c r6:001abec9 r5:c03aaa30
> [17828.317014] r4:ef7ba1b8 r3:eebbbaf0
> [17828.320637] [] (net_rx_action) from [] (__do_softirq+0x134/0x254)
> [17828.328470] r10:c05ea080 r9:40000003 r8:00000101 r7:eebba000 r6:c05ea08c r5:00000003
> [17828.336381] r4:00000000
> [17828.338937] [] (__do_softirq) from [] (do_softirq+0x68/0x70)
> [17828.346334] r10:00000006 r9:ef06a800 r8:0000000e r7:00000000 r6:eeb1c200 r5:ee05ba30
> [17828.354238] r4:60030013
> [17828.356797] [] (do_softirq) from [] (__local_bh_enable_ip+0xbc/0xd0)
> [17828.364890] r4:00000200 r3:00000008
> [17828.368520] [] (__local_bh_enable_ip) from [] (ip_finish_output2+0x1b4/0x3b8)
> [17828.377395] r5:ee05ba30 r4:00000000
> [17828.381005] [] (ip_finish_output2) from [] (ip_finish_output+0x14c/0x20c)
> [17828.389531] r8:ee2d2c00 r7:ee2d2c00 r6:c0606280 r5:0000ffff r4:ee05ba30
> [17828.396321] [] (ip_finish_output) from [] (ip_output+0x124/0x130)
> [17828.404153] r10:00000006 r9:ef06a800 r8:ee2d2c00 r7:00000000 r6:0971f1b9 r5:c0606280
> [17828.412061] r4:ee05ba30

This might be the smoking gun: We are in the middle of do_softirq() as
triggered from a __local_bh_enable_ip. This is often called from
__raw_spin_unlock_bh(), though I don't see that call in ip_finish_output2.

http://lwn.net/Articles/687617/ has some explanation about how this
mechanism works, and what some recent plans to change that are. I don't
immediately see anything that could lead to your bug there, but it's
a start.

	Arnd

diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index b4ca707d0a69..6220e9fdf4c7 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -119,22 +119,8 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)

 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-	unsigned long tmp;
-	u32 slock;
-
 	smp_mb();
-
-	__asm__ __volatile__(
-"	mov	%1, #1\n"
-"1:	ldrex	%0, [%2]\n"
-"	uadd16	%0, %0, %1\n"
-"	strex	%1, %0, [%2]\n"
-"	teq	%1, #0\n"
-"	bne	1b"
-	: "=&r" (slock), "=&r" (tmp)
-	: "r" (&lock->slock)
-	: "cc");
-
+	lock->tickets.owner++;
 	dsb_sev();
 }
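
For anyone following along, the idea behind the diff above can be sketched
in user space with C11 atomics. This is only an illustration and not the
kernel code: struct ticket_lock and the three helpers are made-up names, and
the C11 memory orders merely stand in for smp_mb()/dsb_sev() without
reproducing them exactly. The point is that taking a ticket needs an atomic
read-modify-write, while the release path can get away with a plain
increment because only the current holder ever writes the owner field:

```c
#include <assert.h>
#include <stdatomic.h>

/* User-space sketch of a ticket lock; NOT the kernel's arch_spinlock_t. */
struct ticket_lock {
	atomic_ushort next;	/* next ticket to hand out */
	atomic_ushort owner;	/* ticket that currently holds the lock */
};

static void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next, 0);
	atomic_init(&l->owner, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/*
	 * Several CPUs may race to take a ticket, so this has to be an
	 * atomic read-modify-write.
	 */
	unsigned short me = atomic_fetch_add_explicit(&l->next, 1,
						      memory_order_relaxed);

	/* Spin until our ticket comes up. */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/*
	 * Only the lock holder writes owner, so a separate load and store
	 * is sufficient; no exclusive (ldrex/strex style) retry loop.
	 */
	unsigned short cur = atomic_load_explicit(&l->owner,
						  memory_order_relaxed);

	/* Debug check: releasing a lock nobody holds is a caller bug. */
	assert(cur != atomic_load_explicit(&l->next, memory_order_relaxed));

	atomic_store_explicit(&l->owner, (unsigned short)(cur + 1),
			      memory_order_release);
}
```

If the release store were not ordered after the critical section, a waiter
could see its ticket come up before the holder's writes are visible, which
is the kind of failure that a too-weak smp_mb() or dsb_sev() would cause.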