diff mbox

3.7.3+: Bad paging request in ip_rcv_finish while running NFS traffic.

Message ID 50FF4BC9.1060206@candelatech.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ben Greear Jan. 23, 2013, 2:32 a.m. UTC
On 01/22/2013 02:18 PM, Ben Greear wrote:
> On 01/22/2013 09:26 AM, Eric Dumazet wrote:
>> On Tue, 2013-01-22 at 09:17 -0800, Eric Dumazet wrote:
>>> On Tue, 2013-01-22 at 09:08 -0800, Ben Greear wrote:
>>>
>>>> Unfortunately, I hit it again this morning after the first restart of
>>>> my application (which bounces all 3000 interfaces).  Memory poisoning
>>>> was disabled.
>>>
>>> Is your NFS traffic using TCP or UDP ?
>>>
>>
>> Oh well, it seems macvlan.c has to skb_drop_dst(skb) before giving skb
>> to netif_rx()
>
> I just saw another crash.  It had run 2 user-space restarts and
> 2 reboots, but on the third reboot, it crashed coming up.  It seemed
> to last longer this time, but that could just be luck as it's never
> been super easy to reproduce this quickly.

I added a patch to set dst->input and dst->output to 0xdeadbeef before
freeing the memory.  (The warn-on below did NOT hit)

@@ -452,6 +452,9 @@ static inline int dst_output(struct sk_buff *skb)
  /* Input packet from network to transport.  */
  static inline int dst_input(struct sk_buff *skb)
  {
+       if (WARN_ON(((unsigned long)(skb_dst(skb))) < 4000)) {
+               printk("Bad skb_dst: %lu\n", skb->_skb_refdst);
+       }
         return skb_dst(skb)->input(skb);
  }


Looks like we do indeed access freed memory, based on this crash I saw on
the next reboot:

[root@lf1011-12060006 ~]# BUG: unable to handle kernel paging request at 00000000deadbeef
IP: [<00000000deadbeef>] 0xdeadbeee
PGD 0
Oops: 0010 [#1] PREEMPT SMP
Modules linked in: macvlan pktgen lockd sunrpc uinput iTCO_wdt iTCO_vendor_support gpio_ich coretemp hwmon kvm_intel kvm microcode pcspkr i2c_i801 lpc_ich 
e1000e i7core_edac ioatdma edac_core igb ptp pps_core dca ipv6 mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core
CPU 8
Pid: 59, comm: ksoftirqd/8 Tainted: G         C O 3.7.3+ #46 Iron Systems Inc. EE2610R/X8ST3
RIP: 0010:[<00000000deadbeef>]  [<00000000deadbeef>] 0xdeadbeee
RSP: 0018:ffff88040d7d7bc0  EFLAGS: 00010286
RAX: ffff8803d97fc900 RBX: ffff8803d4d30d00 RCX: 0000000000000028
RDX: ffffffff81aafcb0 RSI: ffffffff81a2a500 RDI: ffff8803d4d30d00
RBP: ffff88040d7d7be8 R08: ffffffff814a8812 R09: ffff88040d7d7bb0
R10: ffff8803c9dfd8fc R11: ffff88040d7d7c48 R12: ffff8803c9dfd8fc
R13: ffff8803d4d30d00 R14: ffff88040d3f8000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88041fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000deadbeef CR3: 0000000001a0b000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksoftirqd/8 (pid: 59, threadinfo ffff88040d7d6000, task ffff88040d7e1f50)
Stack:
  ffffffff814a8b02 ffff8803d4d30d00 ffffffff814a8812 ffff8803d4d30d00
  ffff88040d3f8000 ffff88040d7d7c18 ffffffff814a8eb5 0000000080000000
  ffffffff81472e61 ffff8803d4d30d00 ffff88040d3f8000 ffff88040d7d7c48
Call Trace:
  [<ffffffff814a8b02>] ? ip_rcv_finish+0x2f0/0x308
  [<ffffffff814a8812>] ? skb_dst+0x5a/0x5a
  [<ffffffff814a8eb5>] NF_HOOK.clone.1+0x4c/0x54
  [<ffffffff81472e61>] ? dev_seq_stop+0xb/0xb
  [<ffffffff814a9142>] ip_rcv+0x237/0x269
  [<ffffffff81473def>] __netif_receive_skb+0x487/0x530
  [<ffffffff81473f91>] process_backlog+0xf9/0x1da
  [<ffffffff8147639a>] net_rx_action+0xad/0x218
  [<ffffffff8108d50a>] __do_softirq+0x9c/0x161
  [<ffffffff8108d5f2>] run_ksoftirqd+0x23/0x42
  [<ffffffff810a7ebe>] smpboot_thread_fn+0x253/0x259
  [<ffffffff810a7c6b>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810a0a6d>] kthread+0xc2/0xca
  [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
  [<ffffffff81537b7c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
Code:  Bad RIP value.
RIP  [<00000000deadbeef>] 0xdeadbeee
  RSP <ffff88040d7d7bc0>
CR2: 00000000deadbeef
---[ end trace eed854e70ff0a575 ]---
Kernel panic - not syncing: Fatal excepti

Thanks,
Ben

Comments

Eric Dumazet Jan. 23, 2013, 6:11 a.m. UTC | #1
On Tue, 2013-01-22 at 18:32 -0800, Ben Greear wrote:

> diff --git a/net/core/dst.c b/net/core/dst.c
> index ee6153e..234b168 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -245,6 +245,7 @@ again:
>                  dst->ops->destroy(dst);
>          if (dst->dev)
>                  dev_put(dst->dev);
> +       dst->input = dst->output = 0xdeadbeef;
>          kmem_cache_free(dst->ops->kmem_cachep, dst);

Great !

You could comment the kmem_cache_free() as well to get better chances to
hit the bug, and maybe start a bisection to find the faulty commit ?



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Jan. 23, 2013, 7:14 a.m. UTC | #2
On 01/22/2013 10:11 PM, Eric Dumazet wrote:
> On Tue, 2013-01-22 at 18:32 -0800, Ben Greear wrote:
>
>> diff --git a/net/core/dst.c b/net/core/dst.c
>> index ee6153e..234b168 100644
>> --- a/net/core/dst.c
>> +++ b/net/core/dst.c
>> @@ -245,6 +245,7 @@ again:
>>                   dst->ops->destroy(dst);
>>           if (dst->dev)
>>                   dev_put(dst->dev);
>> +       dst->input = dst->output = 0xdeadbeef;
>>           kmem_cache_free(dst->ops->kmem_cachep, dst);
>
> Great !
>
> You could comment the kmem_cache_free() as well to get better chances to
> hit the bug, and maybe start a bisection to find the faulty commit ?

I suspect the bug goes back at least as far as 3.3.  And since
I need the NFS patches for this test case, bisecting will be pure hell.

I'll work on some more code instrumentation tomorrow.

One thing that came to mind while I was looking at the code today:

How are the non-ref-counted dst objects used safely?  Any chance
that tearing down the IP protocol on a device (or deleting a device)
could delete a dst that is referenced by an skb (and thus crashes as
I see)?

Thanks,
Ben
Eric Dumazet Jan. 23, 2013, 1:35 p.m. UTC | #3
On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:

> How are the non-ref-counted dst objects used safely?  Any chance
> that tearing down the IP protocol on a device (or deleting a device)
> could delete a dst that is referenced by an skb (and thus crashes as
> I see)?

There is probably a bug. Normally it should be RCU protected.



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 23, 2013, 2:42 p.m. UTC | #4
On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:

> I suspect the bug goes back at least as far as 3.3.  And since
> I need the NFS patches for this test case, bisecting will be pure hell.

Make sure the bug is still present in net-next.


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Jan. 23, 2013, 6:15 p.m. UTC | #5
On 01/23/2013 05:35 AM, Eric Dumazet wrote:
> On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:
>
>> How are the non-ref-counted dst objects used safely?  Any chance
>> that tearing down the IP protocol on a device (or deleting a device)
>> could delete a dst that is referenced by an skb (and thus crashes as
>> I see)?
>
> There is probably a bug. Normally it should be RCU protected.

I'm building net-next, and will do some testing on it later today
if all goes well.

In the meantime, a few questions about RCU.

If the non-ref-counted dst is protected by RCU, does that mean
that from the time we acquire the pointer (ie, set it in the skb),
to the time we clear the pointer from the skb, we must be under
RCU read lock?

If so, that means that ip_rcv_finish must be called under RCU
read lock, etc?

Btw, this seems non related to NFS, so I'm dropping them
from CC after this email unless I hear a request otherwise.

Thanks,
Ben
Eric Dumazet Jan. 23, 2013, 9:43 p.m. UTC | #6
On Wed, 2013-01-23 at 10:15 -0800, Ben Greear wrote:
> On 01/23/2013 05:35 AM, Eric Dumazet wrote:
> > On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:
> >
> >> How are the non-ref-counted dst objects used safely?  Any chance
> >> that tearing down the IP protocol on a device (or deleting a device)
> >> could delete a dst that is referenced by an skb (and thus crashes as
> >> I see)?
> >
> > There is probably a bug. Normally it should be RCU protected.
> 
> I'm building net-next, and will do some testing on it later today
> if all goes well.
> 
> In the meantime, a few questions about RCU.
> 
> If the non-ref-counted dst is protected by RCU, does that mean
> that from the time we acquire the pointer (ie, set it in the skb),
> to the time we clear the pointer from the skb, we must be under
> RCU read lock?
> 
> If so, that means that ip_rcv_finish must be called under RCU
> read lock, etc?

Obviously yes.  Check skb_dst_force() for the rare cases we want to
escape the RCU section.



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Jan. 23, 2013, 9:53 p.m. UTC | #7
On 01/23/2013 06:42 AM, Eric Dumazet wrote:
> On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:
>
>> I suspect the bug goes back at least as far as 3.3.  And since
>> I need the NFS patches for this test case, bisecting will be pure hell.
>
> Make sure the bug is still present in net-next.
>

My test case isn't running well on net-next.  Seems most
of the file-io processes are hung trying to open a file:

[root@lf1011-12060006 lanforge]# cat /proc/10612/stack
[<ffffffff8115fadb>] do_last+0x1db/0xa4e
[<ffffffff8116090b>] path_openat+0xcb/0x363
[<ffffffff81160ca8>] do_filp_open+0x38/0x84
[<ffffffff81153bfc>] do_sys_open+0x6d/0xff
[<ffffffff81153cbb>] sys_open+0x1c/0x1e
[<ffffffff8154f629>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff


I'm going back to 3.7.3 and will try to learn some more
about what it's doing there.

Thanks,
Ben
Ben Greear Jan. 23, 2013, 11:55 p.m. UTC | #8
On 01/22/2013 06:32 PM, Ben Greear wrote:

So, I'm slowly making some progress.  I've verified that the skb
has bogus dst (0xdeadbeef) at the top of the ip_rcv_finish
method.  I'm trying to track it backwards and figure out which
device it belongs to, etc....takes a while to reproduce though.

One thing about this stack trace below...the dev_seq_stop() does
a rcu read-unlock.  Now, I can't figure out exactly how ip_rcv()
can cause dev_seq_stop() to run, but if this stack is legit,
then maybe by the time we enter the ip_rcv_finish() code we are
running without rcu_readlock() held?

If so, that would probably explain the bug.

> Call Trace:
>   [<ffffffff814a8b02>] ? ip_rcv_finish+0x2f0/0x308
>   [<ffffffff814a8812>] ? skb_dst+0x5a/0x5a
>   [<ffffffff814a8eb5>] NF_HOOK.clone.1+0x4c/0x54
>   [<ffffffff81472e61>] ? dev_seq_stop+0xb/0xb
>   [<ffffffff814a9142>] ip_rcv+0x237/0x269
>   [<ffffffff81473def>] __netif_receive_skb+0x487/0x530
>   [<ffffffff81473f91>] process_backlog+0xf9/0x1da
>   [<ffffffff8147639a>] net_rx_action+0xad/0x218
>   [<ffffffff8108d50a>] __do_softirq+0x9c/0x161
>   [<ffffffff8108d5f2>] run_ksoftirqd+0x23/0x42
>   [<ffffffff810a7ebe>] smpboot_thread_fn+0x253/0x259
>   [<ffffffff810a7c6b>] ? test_ti_thread_flag.clone.0+0x11/0x11
>   [<ffffffff810a0a6d>] kthread+0xc2/0xca
>   [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
>   [<ffffffff81537b7c>] ret_from_fork+0x7c/0xb0
>   [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56


## This is from a slightly different kernel image...but probably this part is legit.

0xffffffff814a92b3 is in ip_rcv (/home/greearb/git/linux-3.7.dev.y/net/ipv4/ip_input.c:466).
461		/* Our transport medium may have padded the buffer out. Now we know it
462		 * is IP we can trim to the true length of the frame.
463		 * Note this now means skb->len holds ntohs(iph->tot_len).
464		 */
465		if (pskb_trim_rcsum(skb, len)) {
466			IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
467			goto drop;
468		}
469	
470		/* Remove any debris in the socket control block */
Eric Dumazet Jan. 24, 2013, 12:01 a.m. UTC | #9
On Wed, 2013-01-23 at 15:55 -0800, Ben Greear wrote:
> On 01/22/2013 06:32 PM, Ben Greear wrote:
> 
> So, I'm slowly making some progress.  I've verified that the skb
> has bogus dst (0xdeadbeef) at the top of the ip_rcv_finish
> method.  I'm trying to track it backwards and figure out which
> device it belongs to, etc....takes a while to reproduce though.
> 
> One thing about this stack trace below...the dev_seq_stop() does
> a rcu read-unlock.  Now, I can't figure out exactly how ip_rcv()
> can cause dev_seq_stop() to run, but if this stack is legit,
> then maybe by the time we enter the ip_rcv_finish() code we are
> running without rcu_readlock() held?
> 
> If so, that would probably explain the bug.
> 

The whole thing is run under rcu_read_lock() done in
__netif_receive_skb()

My suspicion was that we called netif_rx() from macvlan leaving a
not refcounted skb dst.

But the patch I sent to you didnt solve the bug, so its something else.

You could trace at which point the dst was released. (where you set
dst->input/output to deadbeef)

> > Call Trace:
> >   [<ffffffff814a8b02>] ? ip_rcv_finish+0x2f0/0x308
> >   [<ffffffff814a8812>] ? skb_dst+0x5a/0x5a
> >   [<ffffffff814a8eb5>] NF_HOOK.clone.1+0x4c/0x54
> >   [<ffffffff81472e61>] ? dev_seq_stop+0xb/0xb
> >   [<ffffffff814a9142>] ip_rcv+0x237/0x269
> >   [<ffffffff81473def>] __netif_receive_skb+0x487/0x530
> >   [<ffffffff81473f91>] process_backlog+0xf9/0x1da
> >   [<ffffffff8147639a>] net_rx_action+0xad/0x218
> >   [<ffffffff8108d50a>] __do_softirq+0x9c/0x161
> >   [<ffffffff8108d5f2>] run_ksoftirqd+0x23/0x42
> >   [<ffffffff810a7ebe>] smpboot_thread_fn+0x253/0x259
> >   [<ffffffff810a7c6b>] ? test_ti_thread_flag.clone.0+0x11/0x11
> >   [<ffffffff810a0a6d>] kthread+0xc2/0xca
> >   [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
> >   [<ffffffff81537b7c>] ret_from_fork+0x7c/0xb0
> >   [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
> 
> 
> ## This is from a slightly different kernel image...but probably this part is legit.
> 
> 0xffffffff814a92b3 is in ip_rcv (/home/greearb/git/linux-3.7.dev.y/net/ipv4/ip_input.c:466).
> 461		/* Our transport medium may have padded the buffer out. Now we know it
> 462		 * is IP we can trim to the true length of the frame.
> 463		 * Note this now means skb->len holds ntohs(iph->tot_len).
> 464		 */
> 465		if (pskb_trim_rcsum(skb, len)) {
> 466			IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
> 467			goto drop;
> 468		}
> 469	
> 470		/* Remove any debris in the socket control block */
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Jan. 24, 2013, 12:13 a.m. UTC | #10
On 01/23/2013 04:01 PM, Eric Dumazet wrote:
> On Wed, 2013-01-23 at 15:55 -0800, Ben Greear wrote:
>> On 01/22/2013 06:32 PM, Ben Greear wrote:
>>
>> So, I'm slowly making some progress.  I've verified that the skb
>> has bogus dst (0xdeadbeef) at the top of the ip_rcv_finish
>> method.  I'm trying to track it backwards and figure out which
>> device it belongs to, etc....takes a while to reproduce though.
>>
>> One thing about this stack trace below...the dev_seq_stop() does
>> a rcu read-unlock.  Now, I can't figure out exactly how ip_rcv()
>> can cause dev_seq_stop() to run, but if this stack is legit,
>> then maybe by the time we enter the ip_rcv_finish() code we are
>> running without rcu_readlock() held?
>>
>> If so, that would probably explain the bug.
>>
>
> The whole thing is run under rcu_read_lock() done in
> __netif_receive_skb()

I was worried that the dev_seq_stop might be called
incorrectly causing an asymetric unlock.  I have no
idea how that might happened, but several crashes
have that dev_seq_stop method listed, so it got me suspicious.

>
> My suspicion was that we called netif_rx() from macvlan leaving a
> not refcounted skb dst.
>
> But the patch I sent to you didnt solve the bug, so its something else.
>
> You could trace at which point the dst was released. (where you set
> dst->input/output to deadbeef)

My current code is in some garbage collector timer code, but I can
work on saving the call-site that first pokes the dst into the
garbage collection list...

Thanks,
Ben
Eric Dumazet Jan. 24, 2013, 12:23 a.m. UTC | #11
On Wed, 2013-01-23 at 16:13 -0800, Ben Greear wrote:
> On 01/23/2013 04:01 PM, Eric Dumazet wrote:
> > On Wed, 2013-01-23 at 15:55 -0800, Ben Greear wrote:
> >> On 01/22/2013 06:32 PM, Ben Greear wrote:
> >>
> >> So, I'm slowly making some progress.  I've verified that the skb
> >> has bogus dst (0xdeadbeef) at the top of the ip_rcv_finish
> >> method.  I'm trying to track it backwards and figure out which
> >> device it belongs to, etc....takes a while to reproduce though.
> >>
> >> One thing about this stack trace below...the dev_seq_stop() does
> >> a rcu read-unlock.  Now, I can't figure out exactly how ip_rcv()
> >> can cause dev_seq_stop() to run, but if this stack is legit,
> >> then maybe by the time we enter the ip_rcv_finish() code we are
> >> running without rcu_readlock() held?
> >>
> >> If so, that would probably explain the bug.
> >>
> >
> > The whole thing is run under rcu_read_lock() done in
> > __netif_receive_skb()
> 
> I was worried that the dev_seq_stop might be called
> incorrectly causing an asymetric unlock.  I have no
> idea how that might happened, but several crashes
> have that dev_seq_stop method listed, so it got me suspicious.

dev_seq_stop() is some word in the kernel stack, result of a prior
system call. Stack is not cleanup.

Each function reserves an amount of stack but not always write on all
reserved space (some automatic variables might be not set)

Note the "? " before the name : linux printed the symbol but this was
not a call site for this particular call graph. Its only an extra
indication, that can be useful sometimes.



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/dst.c b/net/core/dst.c
index ee6153e..234b168 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -245,6 +245,7 @@  again:
                 dst->ops->destroy(dst);
         if (dst->dev)
                 dev_put(dst->dev);
+       dst->input = dst->output = 0xdeadbeef;
         kmem_cache_free(dst->ops->kmem_cachep, dst);

         dst = child;