diff mbox

slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))

Message ID CA+55aFwS1jfnNzU=8CzRyxC79qKrStwmTXYrBHQYcftAg4LT3Q@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Linus Torvalds Oct. 10, 2016, 3:41 a.m. UTC
On Sun, Oct 9, 2016 at 7:49 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> There is one *correct* way to remove an entry from a singly linked
> list, and it looks like this:
>
>     struct entry **pp, *p;
>
>     pp = &head;
>     while ((p = *pp) != NULL) {
>         if (right_entry(p)) {
>             *pp = p->next;
>             break;
>         }
>         pp = &p->next;
>     }
>
> and that's it. Nothing else.

This COMPLETELY UNTESTED patch tries to fix the nf_hook_entry code to do this.

I repeat: it's ENTIRELY UNTESTED. I just converted the insertion and
deletion to the proper pattern, but I could easily have gotten the
insertion priority test the wrong way around entirely, for example. Or
it could simply have some other completely broken bug in it. It
compiles for me, but that's all I actually checked.

Note that the "correct way" of doing list operations also almost
inevitably is the shortest way by far, since it gets rid of all the
special cases. So the patch looks nice. It gets rid of the magic
"nf_set_hooks_head()" thing too, because once you do list following
right, the head is no different from any other pointer in the list.

So the patch stats look good:

 net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
 1 file changed, 33 insertions(+), 75 deletions(-)

but again, it's entirely *entirely* untested. Please consider this
just a "this is generally how list insert/delete operations should be
done, avoiding special cases for the first entry".

ALSO NOTE! The code assumes that the "nf_hook_mutex" locking only
protects the actual *lists*, and that the address to the list can be
looked up without holding the lock. That's generally how things are
done, and it simplifies error handling (because you can do the "there
is no such list at all" test before you do anything else. But again, I
don't actually know the code, and if there is something that actually
expands the number of lists etc that depends on that mutex, then the
list head lookup may need to be inside the lock too.

           Linus
net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
 1 file changed, 33 insertions(+), 75 deletions(-)

Comments

David Miller Oct. 10, 2016, 3:57 a.m. UTC | #1
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 9 Oct 2016 20:41:17 -0700

> Note that the "correct way" of doing list operations also almost
> inevitably is the shortest way by far, since it gets rid of all the
> special cases. So the patch looks nice. It gets rid of the magic
> "nf_set_hooks_head()" thing too, because once you do list following
> right, the head is no different from any other pointer in the list.

Perhaps we should have some "slist" primitives added to
include/linux/list.h but since the comparison differs for each user I
guess it's hard to abstract in a way that's generic and inlines
properly.

I'll start taking a look at your patch and this stuff as well, thanks
Linus.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 10, 2016, 8:24 a.m. UTC | #2
From: David Miller <davem@davemloft.net>
Date: Sun, 09 Oct 2016 23:57:45 -0400 (EDT)

> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sun, 9 Oct 2016 20:41:17 -0700
> 
>> Note that the "correct way" of doing list operations also almost
>> inevitably is the shortest way by far, since it gets rid of all the
>> special cases. So the patch looks nice. It gets rid of the magic
>> "nf_set_hooks_head()" thing too, because once you do list following
>> right, the head is no different from any other pointer in the list.
> 
> Perhaps we should have some "slist" primitives added to
> include/linux/list.h but since the comparison differs for each user I
> guess it's hard to abstract in a way that's generic and inlines
> properly.
> 
> I'll start taking a look at your patch and this stuff as well, thanks
> Linus.

So I've been reviewing this patch and it looks fine, but I also want
to figure out what is actually causing the OOPS and I can't spot it
yet.

One possible way to see that oops is to free the head entry of the
chain without unlinking it.  The next unregister will dereference a
POISON pointer.

Actually...

The POISON value comes not from a hook entry, but from the array of
pointers in the per-netns datastructure.

This means that the netns is possibly getting freed up before we
unregister the netfilter hooks.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aaron Conole Oct. 10, 2016, 1:49 p.m. UTC | #3
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Oct 9, 2016 at 7:49 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> There is one *correct* way to remove an entry from a singly linked
>> list, and it looks like this:
>>
>>     struct entry **pp, *p;
>>
>>     pp = &head;
>>     while ((p = *pp) != NULL) {
>>         if (right_entry(p)) {
>>             *pp = p->next;
>>             break;
>>         }
>>         pp = &p->next;
>>     }
>>
>> and that's it. Nothing else.

Sorry, I should have done that.

> This COMPLETELY UNTESTED patch tries to fix the nf_hook_entry code to do this.
>
> I repeat: it's ENTIRELY UNTESTED. I just converted the insertion and
> deletion to the proper pattern, but I could easily have gotten the
> insertion priority test the wrong way around entirely, for example. Or
> it could simply have some other completely broken bug in it. It
> compiles for me, but that's all I actually checked.

Okay, I'm looking it over.  Sorry for the mess.

> Note that the "correct way" of doing list operations also almost
> inevitably is the shortest way by far, since it gets rid of all the
> special cases. So the patch looks nice. It gets rid of the magic
> "nf_set_hooks_head()" thing too, because once you do list following
> right, the head is no different from any other pointer in the list.
>
> So the patch stats look good:
>
>  net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
>  1 file changed, 33 insertions(+), 75 deletions(-)
>
> but again, it's entirely *entirely* untested. Please consider this
> just a "this is generally how list insert/delete operations should be
> done, avoiding special cases for the first entry".

I'll review it, and test it.  Can you tell me what steps you took to
reproduce the oops?  I'll enable slab debugging and try to reproduce
without and with this patch (and I'll also look into David's recent
email as well).  Are you simply creating and removing network
namespaces (I did test that, but I should have done a better job)?

> ALSO NOTE! The code assumes that the "nf_hook_mutex" locking only
> protects the actual *lists*, and that the address to the list can be
> looked up without holding the lock. That's generally how things are
> done, and it simplifies error handling (because you can do the "there
> is no such list at all" test before you do anything else. But again, I
> don't actually know the code, and if there is something that actually
> expands the number of lists etc that depends on that mutex, then the
> list head lookup may need to be inside the lock too.

That should be correct, the nf_hook_mutex is only for protecting the
lists.

>            Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Oct. 10, 2016, 4:15 p.m. UTC | #4
On Mon, Oct 10, 2016 at 1:24 AM, David Miller <davem@davemloft.net> wrote:
>
> So I've been reviewing this patch and it looks fine, but I also want
> to figure out what is actually causing the OOPS and I can't spot it
> yet.

Yeah, I'm not actually sure the old linked list implementation is
buggy - it might just be ugly. I tried to follow the old code, and I
couldn't.

So the patch I sent out was a combination of "that's not how you
should do singly linked lists" and "those special cases make me
worry". In particular, the old code really ended up doing odd things
in the "can't find entry" case, because it would exit the loop with a
non-NULL 'entry' just because the next entry was NULL..

> One possible way to see that oops is to free the head entry of the
> chain without unlinking it.  The next unregister will dereference a
> POISON pointer.
>
> Actually...
>
> The POISON value comes not from a hook entry, but from the array of
> pointers in the per-netns datastructure.
>
> This means that the netns is possibly getting freed up before we
> unregister the netfilter hooks.

That is certainly one possible explanation for it, yes. However, I
didn't think that part had changed, had it?

The other thing I find a bit odd in that new single-linked list code is this:

 - nf_hook_slow():
        ...
        RCU_INIT_POINTER(state->hook_entries, entry);

which makes me worried..  It copies the head entry of the list, and
maybe it will then (later) end up being used stale. I don't know. But
it looks a bit iffy.

              Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Oct. 10, 2016, 4:28 p.m. UTC | #5
On Mon, Oct 10, 2016 at 6:49 AM, Aaron Conole <aconole@redhat.com> wrote:
>
> Okay, I'm looking it over.  Sorry for the mess.

So as I already answered to Dave, I'm not actually sure that this was
the buggy code, or that my patch would make any difference at all.

I never got a good reproducer for the bug: I spent much of the weekend
rebooting, because it seems to happen only just after a reboot, as I
log in and start my usual thing.

I initially blamed some off filesystem or block layer issue ("Oh, it
only happens with a cold cache"), partly because the initial
non-poisoned slub oopses happened in filesystem code.

But I now think it's netfilter, and I *think* that what triggers it is
something like the bluetooth subsystem giving up or something. What I
do when I log into a new session tends to be to go to the kernel
subdirectory in one or two terminals, and fire up chrome to read
email. And the problem either happened within half a minute of me
doing that, or it never happens at all.

Which is why I ended up rebooting a *lot*. Just running the kernel
never triggered it.

(It took me some time to figure that out, which is basically why I did
almost no pull requests the whole weekend)

The journal entries for that invalid kernel access is somewhat suggestive:

  Oct 09 13:24:03 i7 dbus-daemon[1030]: [system] Failed to activate
service 'org.bluez': timed out

  Oct 09 13:24:09 i7 audit[1]: SERVICE_STOP pid=1 uid=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-hostnamed comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
res=success'

  Oct 09 13:24:09 i7 kernel: general protection fault: 0000 [#1] SMP

so it happened just as *some* network setup thing was finishing off (I
don't think it was systemd-hostnamed itself that necessarily matters,
but clearly something was finishing up as the netfilter problem
occurred.

> I'll review it, and test it.  Can you tell me what steps you took to
> reproduce the oops?

See above: I can't actually really "reproduce" it. It's probably
highly timing-dependent, and it is not unlikely that it's also very
much about specific setup. I'm running plain Fedora 24, I boot up, log
in, start two or three terminals, fire up chrome, and ...

So far I've seen the problem maybe 5-6 times, but a couple of those
were just silent hangs (I may have rebooted too quickly for things to
hit the disk, or the oops may just have killed the machine too hard).
Two I got the oops inside slub code, and I only have one successful
slub poisoning oops from netfilter.

(Part of the reason I only have one is that once I got that, I stopped
rebooting, and instead started looking at the netfilter code and
started to do some merge window pulls again because I felt that this
is *probably* the core reason, and I cant' afford to not do pulls
during the merge window for _too_ long).

              Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Oct. 10, 2016, 7:05 p.m. UTC | #6
On Mon, Oct 10, 2016 at 9:28 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So as I already answered to Dave, I'm not actually sure that this was
> the buggy code, or that my patch would make any difference at all.

My patch does seem to fix things, and in fact the warning about "hook
not found" now triggers.

So I think the bug really was that the singly-linked list handling
code did not correctly handle the case of not finding the entry, and
then freed (incorrectly) the last one that wasn't actually unlinked.

In fact, I get quite a few warnings (56 total) about 30 seconds after
logging in:

[   54.213170] WARNING: CPU: 1 PID: 111 at net/netfilter/core.c:151
nf_unregister_net_hook+0x8e/0x170
... repeat 54 times ...
[   54.445520] WARNING: CPU: 7 PID: 111 at net/netfilter/core.c:151
nf_unregister_net_hook+0x8e/0x170

and looking in the journal, the first one is (again) immediately
preceded by that systemd-hostnamed service stopping:

  Oct 10 11:45:47 i7 audit[1546]: USER_LOGIN
  ...
  Oct 10 11:46:11 i7 audit[1]: SERVICE_STOP pid=1 uid=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=fprintd comm="systemd" exe="/usr/lib/systemd/systemd"
hostname=? addr=? terminal=? res=success'
  Oct 10 11:46:13 i7 pulseaudio[1697]: [pulseaudio] bluez5-util.c:
GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did
not receive a reply. Possible causes include: the remote application
did not send a reply, the message bus security policy blocked the
reply, the reply timeout expir
  Oct 10 11:46:13 i7 dbus-daemon[1003]: [system] Failed to activate
service 'org.bluez': timed out
  Oct 10 11:46:20 i7 audit[1]: SERVICE_STOP pid=1 uid=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-hostnamed comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
res=success'
  Oct 10 11:46:20 i7 kernel: ------------[ cut here ]------------
  Oct 10 11:46:20 i7 kernel: WARNING: CPU: 1 PID: 111 at
net/netfilter/core.c:151 nf_unregister_net_hook+0x8e/0x170

so I do think it's something to do with some network startup service
thing (perhaps dhcp, perhaps chrome, who knows) as I do my initial
login.

David - I think that also explains what was wrong with the old code.
In the old code, this loop:

        while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {

would exit with "hooks_entry" pointing to the last list entry (because
->next was NULL). Nothing was ever unlinked in the loop itself,
because it never actually found a matching entry, but then after the
loop it would free that last entry because it *thought* that was the
match.

My list rewrite fixes that.

Anyway, I'm assuming it will come to me from the networking tree after
more testing by the maintainers. You can add my

  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

to the patch, though.

David, if you want me to just commit that thing directly, I can
obviously do so, but I do think somebody should look at

 (a) that I actually got the priority list ordering right on the insertion side
 (b) what it is that makes it try to unregister that hook that isn't
on the list in the first place

but on the whole I consider this issue explained and solved. I'll
continue to run with my patch on my machine (just not committed).

              Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aaron Conole Oct. 10, 2016, 7:18 p.m. UTC | #7
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Oct 10, 2016 at 9:28 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So as I already answered to Dave, I'm not actually sure that this was
>> the buggy code, or that my patch would make any difference at all.
>
> My patch does seem to fix things, and in fact the warning about "hook
> not found" now triggers.
>
> So I think the bug really was that the singly-linked list handling
> code did not correctly handle the case of not finding the entry, and
> then freed (incorrectly) the last one that wasn't actually unlinked.
>
> In fact, I get quite a few warnings (56 total) about 30 seconds after
> logging in:
>
> [   54.213170] WARNING: CPU: 1 PID: 111 at net/netfilter/core.c:151
> nf_unregister_net_hook+0x8e/0x170
> ... repeat 54 times ...
> [   54.445520] WARNING: CPU: 7 PID: 111 at net/netfilter/core.c:151
> nf_unregister_net_hook+0x8e/0x170
>
> and looking in the journal, the first one is (again) immediately
> preceded by that systemd-hostnamed service stopping:
>
>   Oct 10 11:45:47 i7 audit[1546]: USER_LOGIN
>   ...
>   Oct 10 11:46:11 i7 audit[1]: SERVICE_STOP pid=1 uid=0
> auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
> msg='unit=fprintd comm="systemd" exe="/usr/lib/systemd/systemd"
> hostname=? addr=? terminal=? res=success'
>   Oct 10 11:46:13 i7 pulseaudio[1697]: [pulseaudio] bluez5-util.c:
> GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did
> not receive a reply. Possible causes include: the remote application
> did not send a reply, the message bus security policy blocked the
> reply, the reply timeout expir
>   Oct 10 11:46:13 i7 dbus-daemon[1003]: [system] Failed to activate
> service 'org.bluez': timed out
>   Oct 10 11:46:20 i7 audit[1]: SERVICE_STOP pid=1 uid=0
> auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
> msg='unit=systemd-hostnamed comm="systemd"
> exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
> res=success'
>   Oct 10 11:46:20 i7 kernel: ------------[ cut here ]------------
>   Oct 10 11:46:20 i7 kernel: WARNING: CPU: 1 PID: 111 at
> net/netfilter/core.c:151 nf_unregister_net_hook+0x8e/0x170
>
> so I do think it's something to do with some network startup service
> thing (perhaps dhcp, perhaps chrome, who knows) as I do my initial
> login.
>
> David - I think that also explains what was wrong with the old code.
> In the old code, this loop:
>
>         while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {
>
> would exit with "hooks_entry" pointing to the last list entry (because
> ->next was NULL). Nothing was ever unlinked in the loop itself,
> because it never actually found a matching entry, but then after the
> loop it would free that last entry because it *thought* that was the
> match.
>
> My list rewrite fixes that.
>
> Anyway, I'm assuming it will come to me from the networking tree after
> more testing by the maintainers. You can add my
>
>   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> to the patch, though.
>
> David, if you want me to just commit that thing directly, I can
> obviously do so, but I do think somebody should look at
>
>  (a) that I actually got the priority list ordering right on the
>  insertion side

It looks correct.

Reviewed-by: Aaron Conole <aconole@bytheb.org>

>  (b) what it is that makes it try to unregister that hook that isn't
> on the list in the first place

This is a still problem, I think.  I wasn't able to reproduce the issue
on a fedora-23 VM.  My fedora 24 bare-metal system does trigger this,
though.  Not sure what changed in userspace/kernel interaction side (not
an excuse, but just an observation).

> but on the whole I consider this issue explained and solved. I'll
> continue to run with my patch on my machine (just not committed).

Okay.  Very sorry for this, again.

>               Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 11, 2016, 12:30 a.m. UTC | #8
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 10 Oct 2016 12:05:17 -0700

> David - I think that also explains what was wrong with the old code.
> In the old code, this loop:
> 
>         while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {
> 
> would exit with "hooks_entry" pointing to the last list entry (because
> ->next was NULL). Nothing was ever unlinked in the loop itself,
> because it never actually found a matching entry, but then after the
> loop it would free that last entry because it *thought* that was the
> match.

It only does this when the ops don't match, but yes it can happen.

Linus can you add some extra info to that:

		WARN(1, "nf_unregister_net_hook: hook not found!\n");

diagnostic, such as the reg->pf and reg->hooknum values?

That might help track down why this is happening in the first
place.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Oct. 11, 2016, 12:54 a.m. UTC | #9
On Mon, Oct 10, 2016 at 5:30 PM, David Miller <davem@davemloft.net> wrote:
>
> Linus can you add some extra info to that:

Sure. I made it a WARN_ON_ONCE(), but then always just printed the
pf/hooknum. It's all over the map:

 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=2
 reg->pf=2 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=2
 reg->pf=10 and reg->hooknum=3
 reg->pf=7 and reg->hooknum=1
 reg->pf=7 and reg->hooknum=2
 reg->pf=7 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=1
 reg->pf=2 and reg->hooknum=1
 reg->pf=10 and reg->hooknum=0
 reg->pf=10 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=0
 reg->pf=10 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=1
 reg->pf=10 and reg->hooknum=1
 reg->pf=7 and reg->hooknum=3
 reg->pf=7 and reg->hooknum=4
 reg->pf=7 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=2
 reg->pf=2 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=2
 reg->pf=10 and reg->hooknum=3
 reg->pf=7 and reg->hooknum=1
 reg->pf=7 and reg->hooknum=2
 reg->pf=7 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=1
 reg->pf=2 and reg->hooknum=1
 reg->pf=10 and reg->hooknum=0
 reg->pf=10 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=0
 reg->pf=10 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=1
 reg->pf=10 and reg->hooknum=1
 reg->pf=7 and reg->hooknum=3
 reg->pf=7 and reg->hooknum=4
 reg->pf=7 and reg->hooknum=0

and putting that through "sort -n" and "uniq -c", I get:

      4  reg->pf=10 and reg->hooknum=0
      4  reg->pf=10 and reg->hooknum=1
      2  reg->pf=10 and reg->hooknum=2
      6  reg->pf=10 and reg->hooknum=3
      6  reg->pf=10 and reg->hooknum=4
      4  reg->pf=2 and reg->hooknum=0
      4  reg->pf=2 and reg->hooknum=1
      2  reg->pf=2 and reg->hooknum=2
      6  reg->pf=2 and reg->hooknum=3
      6  reg->pf=2 and reg->hooknum=4
      2  reg->pf=7 and reg->hooknum=0
      2  reg->pf=7 and reg->hooknum=1
      2  reg->pf=7 and reg->hooknum=2
      4  reg->pf=7 and reg->hooknum=3

which doesn't look much better. But clearly there's a lot of those
"try to unregister stuff that you can't even find".

Maybe it tells you something.

               Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Kubecek Oct. 11, 2016, 1:17 p.m. UTC | #10
On Mon, Oct 10, 2016 at 04:24:01AM -0400, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sun, 09 Oct 2016 23:57:45 -0400 (EDT)
> 
> This means that the netns is possibly getting freed up before we
> unregister the netfilter hooks.

Sounds a bit like the issue discussed here:

  https://marc.info/?l=netfilter-devel&m=146980917627262&w=2

Could it be (partly) the same race condition?

Michal Kubecek
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aaron Conole Oct. 11, 2016, 1:55 p.m. UTC | #11
Michal Kubecek <mkubecek@suse.cz> writes:

> On Mon, Oct 10, 2016 at 04:24:01AM -0400, David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Sun, 09 Oct 2016 23:57:45 -0400 (EDT)
>> 
>> This means that the netns is possibly getting freed up before we
>> unregister the netfilter hooks.
>
> Sounds a bit like the issue discussed here:
>
>   https://marc.info/?l=netfilter-devel&m=146980917627262&w=2
>
> Could it be (partly) the same race condition?

It looks like it's possible.  It appears that there could be a
long-standing race between these.  I'll look into it more carefully, and
discuss with Pablo and Florian when they're situated from netdev
conference.

-Aaron
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c9d90eb64046..814258641fcc 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -65,49 +65,24 @@  static DEFINE_MUTEX(nf_hook_mutex);
 #define nf_entry_dereference(e) \
 	rcu_dereference_protected(e, lockdep_is_held(&nf_hook_mutex))
 
-static struct nf_hook_entry *nf_hook_entry_head(struct net *net,
-						const struct nf_hook_ops *reg)
+static struct nf_hook_entry __rcu **nf_hook_entry_head(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct nf_hook_entry *hook_head = NULL;
-
 	if (reg->pf != NFPROTO_NETDEV)
-		hook_head = nf_entry_dereference(net->nf.hooks[reg->pf]
-						 [reg->hooknum]);
-	else if (reg->hooknum == NF_NETDEV_INGRESS) {
+		return net->nf.hooks[reg->pf]+reg->hooknum;
+
 #ifdef CONFIG_NETFILTER_INGRESS
+	if (reg->hooknum == NF_NETDEV_INGRESS) {
 		if (reg->dev && dev_net(reg->dev) == net)
-			hook_head =
-				nf_entry_dereference(
-					reg->dev->nf_hooks_ingress);
-#endif
+			return &reg->dev->nf_hooks_ingress;
 	}
-	return hook_head;
-}
-
-/* must hold nf_hook_mutex */
-static void nf_set_hooks_head(struct net *net, const struct nf_hook_ops *reg,
-			      struct nf_hook_entry *entry)
-{
-	switch (reg->pf) {
-	case NFPROTO_NETDEV:
-#ifdef CONFIG_NETFILTER_INGRESS
-		/* We already checked in nf_register_net_hook() that this is
-		 * used from ingress.
-		 */
-		rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
 #endif
-		break;
-	default:
-		rcu_assign_pointer(net->nf.hooks[reg->pf][reg->hooknum],
-				   entry);
-		break;
-	}
+	return NULL;
 }
 
 int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct nf_hook_entry *hooks_entry;
-	struct nf_hook_entry *entry;
+	struct nf_hook_entry __rcu **pp;
+	struct nf_hook_entry *entry, *p;
 
 	if (reg->pf == NFPROTO_NETDEV) {
 #ifndef CONFIG_NETFILTER_INGRESS
@@ -119,6 +94,10 @@  int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 			return -EINVAL;
 	}
 
+	pp = nf_hook_entry_head(net, reg);
+	if (!pp)
+		return -EINVAL;
+
 	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
 	if (!entry)
 		return -ENOMEM;
@@ -128,26 +107,15 @@  int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 	entry->next	= NULL;
 
 	mutex_lock(&nf_hook_mutex);
-	hooks_entry = nf_hook_entry_head(net, reg);
-
-	if (hooks_entry && hooks_entry->orig_ops->priority > reg->priority) {
-		/* This is the case where we need to insert at the head */
-		entry->next = hooks_entry;
-		hooks_entry = NULL;
-	}
-
-	while (hooks_entry &&
-		reg->priority >= hooks_entry->orig_ops->priority &&
-		nf_entry_dereference(hooks_entry->next)) {
-		hooks_entry = nf_entry_dereference(hooks_entry->next);
-	}
 
-	if (hooks_entry) {
-		entry->next = nf_entry_dereference(hooks_entry->next);
-		rcu_assign_pointer(hooks_entry->next, entry);
-	} else {
-		nf_set_hooks_head(net, reg, entry);
+	/* Find the spot in the list */
+	while ((p = nf_entry_dereference(*pp)) != NULL) {
+		if (reg->priority < p->orig_ops->priority)
+			break;
+		pp = &p->next;
 	}
+	rcu_assign_pointer(entry->next, p);
+	rcu_assign_pointer(*pp, p);
 
 	mutex_unlock(&nf_hook_mutex);
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -163,33 +131,23 @@  EXPORT_SYMBOL(nf_register_net_hook);
 
 void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct nf_hook_entry *hooks_entry;
+	struct nf_hook_entry __rcu **pp;
+	struct nf_hook_entry *p;
 
-	mutex_lock(&nf_hook_mutex);
-	hooks_entry = nf_hook_entry_head(net, reg);
-	if (hooks_entry && hooks_entry->orig_ops == reg) {
-		nf_set_hooks_head(net, reg,
-				  nf_entry_dereference(hooks_entry->next));
-		goto unlock;
-	}
-	while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {
-		struct nf_hook_entry *next =
-			nf_entry_dereference(hooks_entry->next);
-		struct nf_hook_entry *nnext;
+	pp = nf_hook_entry_head(net, reg);
+	if (WARN_ON_ONCE(!pp))
+		return;
 
-		if (next->orig_ops != reg) {
-			hooks_entry = next;
-			continue;
+	mutex_lock(&nf_hook_mutex);
+	while ((p = nf_entry_dereference(*pp)) != NULL) {
+		if (p->orig_ops == reg) {
+			rcu_assign_pointer(*pp, p->next);
+			break;
 		}
-		nnext = nf_entry_dereference(next->next);
-		rcu_assign_pointer(hooks_entry->next, nnext);
-		hooks_entry = next;
-		break;
+		pp = &p->next;
 	}
-
-unlock:
 	mutex_unlock(&nf_hook_mutex);
-	if (!hooks_entry) {
+	if (!p) {
 		WARN(1, "nf_unregister_net_hook: hook not found!\n");
 		return;
 	}
@@ -201,10 +159,10 @@  void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 	static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
 #endif
 	synchronize_net();
-	nf_queue_nf_hook_drop(net, hooks_entry);
+	nf_queue_nf_hook_drop(net, p);
 	/* other cpu might still process nfqueue verdict that used reg */
 	synchronize_net();
-	kfree(hooks_entry);
+	kfree(p);
 }
 EXPORT_SYMBOL(nf_unregister_net_hook);