kernel panics with 4.14.X versions

Message ID 20180419202302.vj2eu43hy77g5mv7@quack2.suse.cz (mailing list archive)
State New, archived

Commit Message

Jan Kara April 19, 2018, 8:23 p.m. UTC
On Wed 18-04-18 10:32:21, Pavlos Parissis wrote:
> On 17/04/2018 02:12 p.m., Jan Kara wrote:
> > On Tue 17-04-18 01:31:24, Pavlos Parissis wrote:
> >> On 16/04/2018 04:40 p.m., Jan Kara wrote:
> > 
> > <snip>
> > 
> >>> How easily can you hit this?
> >>
> >> Very easily, I only need to wait 1-2 days for a crash to occur.
> > 
> > I wouldn't call that very easily but opinions may differ :). Anyway it's
> > good (at least for debugging) that it's reproducible.
> > 
> 
> Unfortunately, I can't reproduce it, so waiting 1-2 days is the only
> option I have.

Good news guys, Robert has just spotted a bug which looks like what I'd
expect can cause your lockups / crashes. I've merged his patch to my tree
and will push it to Linus for -rc3 so eventually it should land in
appropriate stable trees as well. If you are too eager to test it out, it
is attached for you to try.

								Honza

Comments

Pavlos Parissis April 19, 2018, 9:16 p.m. UTC | #1
On 19/04/2018 10:23 p.m., Jan Kara wrote:
> On Wed 18-04-18 10:32:21, Pavlos Parissis wrote:
>> On 17/04/2018 02:12 p.m., Jan Kara wrote:
>>> On Tue 17-04-18 01:31:24, Pavlos Parissis wrote:
>>>> On 16/04/2018 04:40 p.m., Jan Kara wrote:
>>>
>>> <snip>
>>>
>>>>> How easily can you hit this?
>>>>
>>>> Very easily, I only need to wait 1-2 days for a crash to occur.
>>>
>>> I wouldn't call that very easily but opinions may differ :). Anyway it's
>>> good (at least for debugging) that it's reproducible.
>>>
>>
>> Unfortunately, I can't reproduce it, so waiting 1-2 days is the only
>> option I have.
> 
> Good news guys, Robert has just spotted a bug which looks like what I'd
> expect can cause your lockups / crashes. I've merged his patch to my tree
> and will push it to Linus for -rc3 so eventually it should land in
> appropriate stable trees as well. If you are too eager to test it out, it
> is attached for you to try.
> 
> 								Honza
> 

Can I apply it on top of 4.14.32?
I am on vacation for the coming 11 days, so I can only provide feedback when I am back.

Thanks a lot for your support on this issue.

Cheers,
Pavlos
Robert Kolchmeyer April 19, 2018, 9:24 p.m. UTC | #2
On Thu, Apr 19, 2018 at 2:16 PM, Pavlos Parissis
<pavlos.parissis@gmail.com> wrote:
> On 19/04/2018 10:23 p.m., Jan Kara wrote:
>> On Wed 18-04-18 10:32:21, Pavlos Parissis wrote:
>>> On 17/04/2018 02:12 p.m., Jan Kara wrote:
>>>> On Tue 17-04-18 01:31:24, Pavlos Parissis wrote:
>>>>> On 16/04/2018 04:40 p.m., Jan Kara wrote:
>>>>
>>>> <snip>
>>>>
>>>>>> How easily can you hit this?
>>>>>
>>>>> Very easily, I only need to wait 1-2 days for a crash to occur.
>>>>
>>>> I wouldn't call that very easily but opinions may differ :). Anyway it's
>>>> good (at least for debugging) that it's reproducible.
>>>>
>>>
>>> Unfortunately, I can't reproduce it, so waiting 1-2 days is the only
>>> option I have.
>>
>> Good news guys, Robert has just spotted a bug which looks like what I'd
>> expect can cause your lockups / crashes. I've merged his patch to my tree
>> and will push it to Linus for -rc3 so eventually it should land in
>> appropriate stable trees as well. If you are too eager to test it out, it
>> is attached for you to try.
>>
>>                                                               Honza
>>
>
> Can I apply it on top of 4.14.32?
> I am on vacation for the coming 11 days, so I can only provide feedback when I am back.
>
> Thanks a lot for your support on this issue.
>
> Cheers,
> Pavlos
>

I expect that the patch will apply cleanly to 4.14.32.

-Robert
Dexuan Cui April 19, 2018, 9:37 p.m. UTC | #3
> From: Jan Kara
> Sent: Thursday, April 19, 2018 13:23
> Good news guys, Robert has just spotted a bug which looks like what I'd
> expect can cause your lockups / crashes. I've merged his patch to my tree
> and will push it to Linus for -rc3 so eventually it should land in
> appropriate stable trees as well. If you are too eager to test it out, it
> is attached for you to try.
>
> Jan Kara


The patch's changelog says "... this behavior results in a kernel panic."
This sounds like a reference to corrupt memory causing a page fault or
general protection fault.

But what I saw is only a lockup rather than a kernel panic:
watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [java:87260]

So I guess what I saw may be a different, unresolved issue?

-- Dexuan
Jan Kara April 20, 2018, 10:21 a.m. UTC | #4
On Thu 19-04-18 21:37:25, Dexuan Cui wrote:
> > From: Jan Kara
> > Sent: Thursday, April 19, 2018 13:23
> > Good news guys, Robert has just spotted a bug which looks like what I'd
> > expect can cause your lockups / crashes. I've merged his patch to my tree
> > and will push it to Linus for -rc3 so eventually it should land in
> > appropriate stable trees as well. If you are too eager to test it out, it
> > is attached for you to try.
> > 
> > Jan Kara
> 
> The patch's changelog says "... this behavior results in a kernel panic."
> This sounds like a reference to corrupt memory causing a page fault or
> general protection fault.
> 
> But what I saw is only a lockup rather than a kernel panic:
> watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [java:87260]
> 
> So I guess what I saw may be a different, unresolved issue?

Actually I don't think so. The list iteration simply went through a stray
pointer. That can crash but it can also end in an infinite loop, or it can
just randomly corrupt memory. I've seen all these situations with similar
problems. So the fix is definitely worth trying.

								Honza
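
To make the failure mode concrete, here is a minimal userspace sketch of
the aliasing hazard Jan describes: a list head sharing a union with a
free-list pointer, so that a writer pushing the object onto a destroy list
clobbers the list head a reader is about to walk. All names here are
invented for illustration; this is not the kernel code.

#include <stdio.h>

/* Stand-in for a mark on the connector's list. */
struct node {
	struct node *next;
	int val;
};

/* Stand-in for fsnotify_mark_connector before the fix: the live list
 * head and the "next connector to destroy" pointer share storage. */
struct connector {
	union {
		struct node *list;		/* list of marks */
		struct connector *destroy_next;	/* destroy-list link */
	};
};

int main(void)
{
	struct node n2 = { NULL, 2 };
	struct node n1 = { &n2, 1 };
	struct connector destroy_list = { { NULL } };
	struct connector conn = { { &n1 } };

	/* The reader (the fsnotify() analogue) is about to walk
	 * conn.list, but the writer (the fsnotify_put_mark() analogue)
	 * gets in first and links the connector into the destroy list,
	 * overwriting the same bytes: */
	conn.destroy_next = &destroy_list;

	/* The reader now loads a "list head" that is really a pointer
	 * to another connector: a stray pointer that can fault, loop
	 * forever, or read garbage, exactly the spread of symptoms
	 * described above. */
	printf("list now points at %p (a connector, not a node)\n",
	       (void *)conn.list);
	return 0;
}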
Dexuan Cui April 20, 2018, 5:43 p.m. UTC | #5
> From: Jan Kara <jack@suse.cz>
> Sent: Friday, April 20, 2018 03:22
> On Thu 19-04-18 21:37:25, Dexuan Cui wrote:
> > > From: Jan Kara
> > > Sent: Thursday, April 19, 2018 13:23
> > > Good news guys, Robert has just spotted a bug which looks like what I'd
> > > expect can cause your lockups / crashes. I've merged his patch to my tree
> > > and will push it to Linus for -rc3 so eventually it should land in
> > > appropriate stable trees as well. If you are too eager to test it out, it
> > > is attached for you to try.
> > >
> > > Jan Kara
> >
> > The patch's changelog says "... this behavior results in a kernel panic."
> > This sounds like a reference to corrupt memory causing a page fault or
> > general protection fault.
> >
> > But what I saw is only a lockup rather than a kernel panic:
> > watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [java:87260]
> >
> > So I guess what I saw may be a different, unresolved issue?
> 
> Actually I don't think so. The list iteration simply went through a stray
> pointer. That can crash but it can also end in an infinite loop, or it can
> just randomly corrupt memory. I've seen all these situations with similar
> problems. So the fix is definitely worth trying.
> 
> Jan Kara

Thanks for the explanation! It sounds promising!

We haven't been able to reproduce the issue ourselves.
If our customer still has the setup that reproduces the issue, we'll try to
test the patch.

-- Dexuan
Patch

From d90a10e2444ba5a351fa695917258ff4c5709fa5 Mon Sep 17 00:00:00 2001
From: Robert Kolchmeyer <rkolchmeyer@google.com>
Date: Thu, 19 Apr 2018 10:44:33 -0700
Subject: [PATCH] fsnotify: Fix fsnotify_mark_connector race

fsnotify() acquires a reference to a fsnotify_mark_connector through
the SRCU-protected pointer to_tell->i_fsnotify_marks. However, it
appears that no precautions are taken in fsnotify_put_mark() to
ensure that fsnotify() drops its reference to this
fsnotify_mark_connector before assigning a value to its 'destroy_next'
field. This can result in fsnotify_put_mark() assigning a value
to a connector's 'destroy_next' field right before fsnotify() tries to
traverse the linked list referenced by the connector's 'list' field.
Since these two fields are members of the same union, this behavior
results in a kernel panic.

This issue is resolved by moving the connector's 'destroy_next' field
into the object pointer union. This should work since the object pointer
access is protected by both a spinlock and the value of the 'flags'
field, and the 'flags' field is cleared while holding the spinlock in
fsnotify_put_mark() before 'destroy_next' is updated. It shouldn't be
possible for another thread to accidentally read from the object pointer
after the 'destroy_next' field is updated.

The offending behavior here is extremely unlikely; since
fsnotify_put_mark() removes references to a connector (specifically,
it ensures that the connector is unreachable from the inode it was
formerly attached to) before updating its 'destroy_next' field, a
sizeable chunk of code in fsnotify_put_mark() has to execute in the
short window between when fsnotify() acquires the connector reference
and saves the value of its 'list' field. On the HEAD kernel, I've only
been able to reproduce this by inserting a udelay(1) in fsnotify().
However, I've been able to reproduce this issue without inserting a
udelay(1) anywhere on older unmodified release kernels, so I believe
it's worth fixing at HEAD.

References: https://bugzilla.kernel.org/show_bug.cgi?id=199437
Fixes: 08991e83b7286635167bab40927665a90fb00d81
CC: stable@vger.kernel.org
Signed-off-by: Robert Kolchmeyer <rkolchmeyer@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fsnotify_backend.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 9f1edb92c97e..a3d13d874fd1 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -217,12 +217,10 @@  struct fsnotify_mark_connector {
 	union {	/* Object pointer [lock] */
 		struct inode *inode;
 		struct vfsmount *mnt;
-	};
-	union {
-		struct hlist_head list;
 		/* Used listing heads to free after srcu period expires */
 		struct fsnotify_mark_connector *destroy_next;
 	};
+	struct hlist_head list;
 };
 
 /*
-- 
2.13.6
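
For readers without the surrounding source at hand, here are the layouts
before and after the patch, reconstructed from the hunk above. Members
the hunk does not show (such as the lock and 'flags' fields mentioned in
the changelog) are elided.

/* Before the patch: 'list' and 'destroy_next' alias each other, so
 * fsnotify_put_mark() writing 'destroy_next' clobbers the hlist_head
 * that a concurrent SRCU reader in fsnotify() is traversing. */
struct fsnotify_mark_connector {
	/* ... lock, flags, ... */
	union {	/* Object pointer [lock] */
		struct inode *inode;
		struct vfsmount *mnt;
	};
	union {
		struct hlist_head list;
		struct fsnotify_mark_connector *destroy_next;
	};
};

/* After the patch: 'destroy_next' aliases only the object pointer.
 * Per the changelog, the object pointer is only accessed while 'flags'
 * says the connector is alive, and 'flags' is cleared under the
 * spinlock before 'destroy_next' is written, so nothing reads the
 * object pointer afterwards. 'list' keeps its own storage and remains
 * valid for SRCU readers until the grace period expires. */
struct fsnotify_mark_connector {
	/* ... lock, flags, ... */
	union {	/* Object pointer [lock] */
		struct inode *inode;
		struct vfsmount *mnt;
		/* Used listing heads to free after srcu period expires */
		struct fsnotify_mark_connector *destroy_next;
	};
	struct hlist_head list;
};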