
bio linked list corruption.

Message ID CA+55aFxeWe2VQaW30qGR0syiZ75jSwFwg3Ac+wS20KDtf5UKNw@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Linus Torvalds Dec. 5, 2016, 5:55 p.m. UTC
On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <vegard.nossum@gmail.com> wrote:
>
> The warning shows that it made it past the list_empty_careful() check
> in finish_wait() but then bugs out on the &wait->task_list
> dereference.
>
> Anything stick out?

I hate that shmem waitqueue garbage. It's really subtle.

I think the problem is that "wake_up_all()" in shmem_fallocate()
doesn't necessarily wake up everything. It wakes up TASK_NORMAL -
which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean
"everything on the list".

I think that what happens is that the waiters somehow move from
TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that
wake_up_all() will ignore them, leave them on the list, and now that
list on stack is no longer empty at the end.

And the way *THAT* can happen is that the task is on some *other*
waitqueue as well, and that other waitqueue wakes it up. That's not
impossible; you can certainly have people on wait-queues that still
take faults.

Or somebody just uses a directed wake_up_process() or something.
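To make that concrete, here is a bare sketch of the pattern in question
(illustrative only -- the names, the i_lock dance and the i_private
handshake are stripped out, this is not the real mm/shmem.c code):

    /* Illustrative sketch only, not the actual shmem code. */
    #include <linux/wait.h>
    #include <linux/sched.h>

    static void waker(void)
    {
            /* the head lives on the waker's stack, like shmem_falloc_waitq */
            DECLARE_WAIT_QUEUE_HEAD_ONSTACK(waitq);

            /* ... make &waitq visible to waiters (shmem uses inode->i_private) ... */

            wake_up_all(&waitq);
            /*
             * wake_up_all() only unlinks entries whose wakeup actually
             * succeeds; after that we return and the stack slot holding
             * 'waitq' gets reused.
             */
    }

    static void waiter(wait_queue_head_t *waitq)
    {
            DEFINE_WAIT(wait);      /* on-stack entry, autoremove_wake_function */

            prepare_to_wait(waitq, &wait, TASK_UNINTERRUPTIBLE);
            /*
             * If some *other* wakeup sets us TASK_RUNNING here, the
             * wake_up_all() above skips us: try_to_wake_up() returns 0,
             * autoremove_wake_function() never does the list_del_init(),
             * and our entry stays linked to the (soon to be dead) head.
             */
            schedule();
            finish_wait(waitq, &wait);  /* takes waitq->lock to unlink -- boom if waitq is gone */
    }

Which is exactly how the list_empty_careful() check in finish_wait() can
pass and the dereference right after it can blow up.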

Since you apparently can recreate this fairly easily, how about trying
this stupid patch?

NOTE! This is entirely untested. I may have screwed this up entirely.
You get the idea, though - just remove the wait queue head from the
list - the list entries stay around, but nothing points to the stack
entry (that we're going to free) any more.

And add the warning to see if this actually ever triggers (and because
I'd like to see the callchain when it does, to see if it's another
waitqueue somewhere or what..)

                  Linus

Comments

Vegard Nossum Dec. 5, 2016, 7:11 p.m. UTC | #1
On 5 December 2016 at 18:55, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <vegard.nossum@gmail.com> wrote:
>>
>> The warning shows that it made it past the list_empty_careful() check
>> in finish_wait() but then bugs out on the &wait->task_list
>> dereference.
>>
>> Anything stick out?
>
> I hate that shmem waitqueue garbage. It's really subtle.
>
> I think the problem is that "wake_up_all()" in shmem_fallocate()
> doesn't necessarily wake up everything. It wakes up TASK_NORMAL -
> which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean
> "everything on the list".
>
> I think that what happens is that the waiters somehow move from
> TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that
> wake_up_all() will ignore them, leave them on the list, and now that
> list on stack is no longer empty at the end.
>
> And the way *THAT* can happen is that the task is on some *other*
> waitqueue as well, and that other waitqueue wakes it up. That's not
> impossible; you can certainly have people on wait-queues that still
> take faults.
>
> Or somebody just uses a directed wake_up_process() or something.
>
> Since you apparently can recreate this fairly easily, how about trying
> this stupid patch?
>
> NOTE! This is entirely untested. I may have screwed this up entirely.
> You get the idea, though - just remove the wait queue head from the
> list - the list entries stay around, but nothing points to the stack
> entry (that we're going to free) any more.
>
> And add the warning to see if this actually ever triggers (and because
> I'd like to see the callchain when it does, to see if it's another
> waitqueue somewhere or what..)

------------[ cut here ]------------
WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0
Kernel panic - not syncing: panic_on_warn set ...

CPU: 22 PID: 14012 Comm: trinity-c73 Not tainted 4.9.0-rc7+ #220
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
ffff8801e32af970 ffffffff81fb08c1 ffffffff83e74b60 ffff8801e32afa48
ffffffff83ed7600 ffffffff847103e0 ffff8801e32afa38 ffffffff81515244
0000000041b58ab3 ffffffff844e21da ffffffff81515061 ffffffff8151591e
Call Trace:
[<ffffffff81fb08c1>] dump_stack+0x83/0xb2
[<ffffffff81515244>] panic+0x1e3/0x3ad
[<ffffffff812708bf>] __warn+0x1bf/0x1e0
[<ffffffff81270aac>] warn_slowpath_null+0x2c/0x40
[<ffffffff8157aef7>] shmem_fallocate+0x9a7/0xac0
[<ffffffff8167c6c0>] vfs_fallocate+0x350/0x620
[<ffffffff815ee5c2>] SyS_madvise+0x432/0x1290
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83c965b4>] entry_SYSCALL64_slow_path+0x25/0x25
------------[ cut here ]------------

Attached a full log.


Vegard
Vegard Nossum Dec. 5, 2016, 8:10 p.m. UTC | #2
On 5 December 2016 at 20:11, Vegard Nossum <vegard.nossum@gmail.com> wrote:
> On 5 December 2016 at 18:55, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <vegard.nossum@gmail.com> wrote:
>> Since you apparently can recreate this fairly easily, how about trying
>> this stupid patch?
>>
>> NOTE! This is entirely untested. I may have screwed this up entirely.
>> You get the idea, though - just remove the wait queue head from the
>> list - the list entries stay around, but nothing points to the stack
>> entry (that we're going to free) any more.
>>
>> And add the warning to see if this actually ever triggers (and because
>> I'd like to see the callchain when it does, to see if it's another
>> waitqueue somewhere or what..)
>
> ------------[ cut here ]------------
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0
> Kernel panic - not syncing: panic_on_warn set ...

So I noticed that panic_on_warn was set just after sending the email and
I've been waiting for it to trigger again.

The warning has triggered twice more without panic_on_warn set and I
haven't seen any crash yet.
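
For reference, the only difference is the panic_on_warn knob itself: the
WARN path panics when it is set and otherwise just prints the splat and
keeps going. Roughly this (condensed, not the literal kernel/panic.c
code):

    #include <linux/kernel.h>
    #include <linux/printk.h>

    static bool panic_on_warn_knob;     /* stands in for the panic_on_warn sysctl */

    static void warn_hit(const char *file, int line)
    {
            pr_warn("WARNING: at %s:%d\n", file, line);

            if (panic_on_warn_knob) {
                    /* cleared first so a WARN in the panic path can't recurse */
                    panic_on_warn_knob = false;
                    panic("panic_on_warn set ...\n");
            }
            /* otherwise execution just continues past the warning */
    }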


Vegard

Patch

diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..a80148b43476 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2665,6 +2665,8 @@  static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		spin_lock(&inode->i_lock);
 		inode->i_private = NULL;
 		wake_up_all(&shmem_falloc_waitq);
+		if (WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list)))
+			list_del(&shmem_falloc_waitq.task_list);
 		spin_unlock(&inode->i_lock);
 		error = 0;
 		goto out;