diff mbox series

[net,1/2] vxlan: do not modify the shared tunnel info when PMTU triggers an ICMP reply

Message ID 20210325153533.770125-2-atenart@kernel.org (mailing list archive)
State Accepted
Commit 30a93d2b7d5a7cbb53ac19c9364a256d1aa6c08a
Delegated to: Netdev Maintainers
Headers show
Series net: do not modify the shared tunnel info when PMTU triggers an ICMP reply | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers success CCed 4 of 4 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 38 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/header_inline success Link

Commit Message

Antoine Tenart March 25, 2021, 3:35 p.m. UTC
When the interface is part of a bridge or an Open vSwitch port and a
packet exceed a PMTU estimate, an ICMP reply is sent to the sender. When
using the external mode (collect metadata) the source and destination
addresses are reversed, so that Open vSwitch can match the packet
against an existing (reverse) flow.

But inverting the source and destination addresses in the shared
ip_tunnel_info will make following packets of the flow to use a wrong
destination address (packets will be tunnelled to itself), if the flow
isn't updated. Which happens with Open vSwitch, until the flow times
out.

Fixes this by uncloning the skb's ip_tunnel_info before inverting its
source and destination addresses, so that the modification will only be
made for the PTMU packet, not the following ones.

Fixes: fc68c99577cc ("vxlan: Support for PMTU discovery on directly bridged links")
Tested-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 drivers/net/vxlan.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

Comments

Vlad Buslov Jan. 20, 2022, 7:38 a.m. UTC | #1
Hello Antoine,

On Thu 25 Mar 2021 at 17:35, Antoine Tenart <atenart@kernel.org> wrote:
> When the interface is part of a bridge or an Open vSwitch port and a
> packet exceed a PMTU estimate, an ICMP reply is sent to the sender. When
> using the external mode (collect metadata) the source and destination
> addresses are reversed, so that Open vSwitch can match the packet
> against an existing (reverse) flow.
>
> But inverting the source and destination addresses in the shared
> ip_tunnel_info will make following packets of the flow to use a wrong
> destination address (packets will be tunnelled to itself), if the flow
> isn't updated. Which happens with Open vSwitch, until the flow times
> out.
>
> Fixes this by uncloning the skb's ip_tunnel_info before inverting its
> source and destination addresses, so that the modification will only be
> made for the PTMU packet, not the following ones.
>
> Fixes: fc68c99577cc ("vxlan: Support for PMTU discovery on directly bridged links")
> Tested-by: Eelco Chaudron <echaudro@redhat.com>
> Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
> Signed-off-by: Antoine Tenart <atenart@kernel.org>
> ---
>  drivers/net/vxlan.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 666dd201c3d5..53dbc67e8a34 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -2725,12 +2725,17 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  			goto tx_error;
>  		} else if (err) {
>  			if (info) {
> +				struct ip_tunnel_info *unclone;
>  				struct in_addr src, dst;
>  
> +				unclone = skb_tunnel_info_unclone(skb);
> +				if (unlikely(!unclone))
> +					goto tx_error;
> +

We have been getting memleaks in one of our tests that point to this
code (test deletes vxlan device while running traffic redirected by OvS
TC at the same time):

unreferenced object 0xffff8882d0114200 (size 256):
  comm "softirq", pid 0, jiffies 4296140292 (age 1435.992s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 3b 85 84 ff ff ff ff  .........;......
    a1 26 b7 83 ff ff ff ff 00 00 00 00 00 00 00 00  .&..............
  backtrace:
    [<0000000097659d47>] metadata_dst_alloc+0x1f/0x470
    [<000000007571c30f>] tun_dst_unclone+0xee/0x360 [vxlan]
    [<00000000d2dcfd00>] vxlan_xmit_one+0x131d/0x2a00 [vxlan]
    [<00000000281572b6>] vxlan_xmit+0x8e6/0x4cd0 [vxlan]
    [<00000000d49d33fe>] dev_hard_start_xmit+0x1ba/0x710
    [<00000000eac444f5>] __dev_queue_xmit+0x17c5/0x25f0
    [<000000005fbd8585>] tcf_mirred_act+0xb1d/0xf70 [act_mirred]
    [<0000000064b6eb2d>] tcf_action_exec+0x10e/0x350
    [<00000000352821e8>] fl_classify+0x4e3/0x610 [cls_flower]
    [<0000000011d3f765>] tcf_classify+0x33d/0x800
    [<000000006c69b225>] __netif_receive_skb_core+0x18d6/0x2ae0
    [<00000000dd256fe3>] __netif_receive_skb_one_core+0xaf/0x180
    [<0000000065d43bd6>] process_backlog+0x2e3/0x710
    [<00000000964357ae>] __napi_poll+0x9f/0x560
    [<0000000059a93cf6>] net_rx_action+0x357/0xa60
    [<00000000766481bc>] __do_softirq+0x282/0x94e

Looking at the code the potential issue seems to be that
tun_dst_unclone() creates new metadata_dst instance with refcount==1,
increments the refcount with dst_hold() to value 2, then returns it.
This seems to imply that caller is expected to release one of the
references (second one if for skb), but none of the callers (including
original dev_fill_metadata_dst()) do that, so I guess I'm
misunderstanding something here.

Any tips or suggestions?

>  				src = remote_ip.sin.sin_addr;
>  				dst = local_ip.sin.sin_addr;
> -				info->key.u.ipv4.src = src.s_addr;
> -				info->key.u.ipv4.dst = dst.s_addr;
> +				unclone->key.u.ipv4.src = src.s_addr;
> +				unclone->key.u.ipv4.dst = dst.s_addr;
>  			}
>  			vxlan_encap_bypass(skb, vxlan, vxlan, vni, false);
>  			dst_release(ndst);
> @@ -2781,12 +2786,17 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  			goto tx_error;
>  		} else if (err) {
>  			if (info) {
> +				struct ip_tunnel_info *unclone;
>  				struct in6_addr src, dst;
>  
> +				unclone = skb_tunnel_info_unclone(skb);
> +				if (unlikely(!unclone))
> +					goto tx_error;
> +
>  				src = remote_ip.sin6.sin6_addr;
>  				dst = local_ip.sin6.sin6_addr;
> -				info->key.u.ipv6.src = src;
> -				info->key.u.ipv6.dst = dst;
> +				unclone->key.u.ipv6.src = src;
> +				unclone->key.u.ipv6.dst = dst;
>  			}
>  
>  			vxlan_encap_bypass(skb, vxlan, vxlan, vni, false);
Antoine Tenart Jan. 20, 2022, 10:27 a.m. UTC | #2
Hello Vlad,

Quoting Vlad Buslov (2022-01-20 08:38:05)
> On Thu 25 Mar 2021 at 17:35, Antoine Tenart <atenart@kernel.org> wrote:
> >
> > diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> > index 666dd201c3d5..53dbc67e8a34 100644
> > --- a/drivers/net/vxlan.c
> > +++ b/drivers/net/vxlan.c
> > @@ -2725,12 +2725,17 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
> >                       goto tx_error;
> >               } else if (err) {
> >                       if (info) {
> > +                             struct ip_tunnel_info *unclone;
> >                               struct in_addr src, dst;
> >  
> > +                             unclone = skb_tunnel_info_unclone(skb);
> > +                             if (unlikely(!unclone))
> > +                                     goto tx_error;
> > +
> 
> We have been getting memleaks in one of our tests that point to this
> code (test deletes vxlan device while running traffic redirected by OvS
> TC at the same time):
> 
> unreferenced object 0xffff8882d0114200 (size 256):
>   comm "softirq", pid 0, jiffies 4296140292 (age 1435.992s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 3b 85 84 ff ff ff ff  .........;......
>     a1 26 b7 83 ff ff ff ff 00 00 00 00 00 00 00 00  .&..............
>   backtrace:
>     [<0000000097659d47>] metadata_dst_alloc+0x1f/0x470
>     [<000000007571c30f>] tun_dst_unclone+0xee/0x360 [vxlan]
>     [<00000000d2dcfd00>] vxlan_xmit_one+0x131d/0x2a00 [vxlan]
>     [<00000000281572b6>] vxlan_xmit+0x8e6/0x4cd0 [vxlan]
>     [<00000000d49d33fe>] dev_hard_start_xmit+0x1ba/0x710
>     [<00000000eac444f5>] __dev_queue_xmit+0x17c5/0x25f0
>     [<000000005fbd8585>] tcf_mirred_act+0xb1d/0xf70 [act_mirred]
>     [<0000000064b6eb2d>] tcf_action_exec+0x10e/0x350
>     [<00000000352821e8>] fl_classify+0x4e3/0x610 [cls_flower]
>     [<0000000011d3f765>] tcf_classify+0x33d/0x800
>     [<000000006c69b225>] __netif_receive_skb_core+0x18d6/0x2ae0
>     [<00000000dd256fe3>] __netif_receive_skb_one_core+0xaf/0x180
>     [<0000000065d43bd6>] process_backlog+0x2e3/0x710
>     [<00000000964357ae>] __napi_poll+0x9f/0x560
>     [<0000000059a93cf6>] net_rx_action+0x357/0xa60
>     [<00000000766481bc>] __do_softirq+0x282/0x94e
> 
> Looking at the code the potential issue seems to be that
> tun_dst_unclone() creates new metadata_dst instance with refcount==1,
> increments the refcount with dst_hold() to value 2, then returns it.
> This seems to imply that caller is expected to release one of the
> references (second one if for skb), but none of the callers (including
> original dev_fill_metadata_dst()) do that, so I guess I'm
> misunderstanding something here.
> 
> Any tips or suggestions?

I'd say there is no need to increase the dst refcount here after calling
metadata_dst_alloc, as the metadata is local to the skb and the dst
refcount was already initialized to 1. This might be an issue with
commit fc4099f17240 ("openvswitch: Fix egress tunnel info."); I CCed
Pravin, he might recall if there was a reason to increase the refcount.

Thanks,
Antoine
Vlad Buslov Jan. 20, 2022, 12:58 p.m. UTC | #3
On Thu 20 Jan 2022 at 12:27, Antoine Tenart <atenart@kernel.org> wrote:
> Hello Vlad,
>
> Quoting Vlad Buslov (2022-01-20 08:38:05)
>> On Thu 25 Mar 2021 at 17:35, Antoine Tenart <atenart@kernel.org> wrote:
>> >
>> > diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>> > index 666dd201c3d5..53dbc67e8a34 100644
>> > --- a/drivers/net/vxlan.c
>> > +++ b/drivers/net/vxlan.c
>> > @@ -2725,12 +2725,17 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>> >                       goto tx_error;
>> >               } else if (err) {
>> >                       if (info) {
>> > +                             struct ip_tunnel_info *unclone;
>> >                               struct in_addr src, dst;
>> >  
>> > +                             unclone = skb_tunnel_info_unclone(skb);
>> > +                             if (unlikely(!unclone))
>> > +                                     goto tx_error;
>> > +
>> 
>> We have been getting memleaks in one of our tests that point to this
>> code (test deletes vxlan device while running traffic redirected by OvS
>> TC at the same time):
>> 
>> unreferenced object 0xffff8882d0114200 (size 256):
>>   comm "softirq", pid 0, jiffies 4296140292 (age 1435.992s)
>>   hex dump (first 32 bytes):
>>     00 00 00 00 00 00 00 00 00 3b 85 84 ff ff ff ff  .........;......
>>     a1 26 b7 83 ff ff ff ff 00 00 00 00 00 00 00 00  .&..............
>>   backtrace:
>>     [<0000000097659d47>] metadata_dst_alloc+0x1f/0x470
>>     [<000000007571c30f>] tun_dst_unclone+0xee/0x360 [vxlan]
>>     [<00000000d2dcfd00>] vxlan_xmit_one+0x131d/0x2a00 [vxlan]
>>     [<00000000281572b6>] vxlan_xmit+0x8e6/0x4cd0 [vxlan]
>>     [<00000000d49d33fe>] dev_hard_start_xmit+0x1ba/0x710
>>     [<00000000eac444f5>] __dev_queue_xmit+0x17c5/0x25f0
>>     [<000000005fbd8585>] tcf_mirred_act+0xb1d/0xf70 [act_mirred]
>>     [<0000000064b6eb2d>] tcf_action_exec+0x10e/0x350
>>     [<00000000352821e8>] fl_classify+0x4e3/0x610 [cls_flower]
>>     [<0000000011d3f765>] tcf_classify+0x33d/0x800
>>     [<000000006c69b225>] __netif_receive_skb_core+0x18d6/0x2ae0
>>     [<00000000dd256fe3>] __netif_receive_skb_one_core+0xaf/0x180
>>     [<0000000065d43bd6>] process_backlog+0x2e3/0x710
>>     [<00000000964357ae>] __napi_poll+0x9f/0x560
>>     [<0000000059a93cf6>] net_rx_action+0x357/0xa60
>>     [<00000000766481bc>] __do_softirq+0x282/0x94e
>> 
>> Looking at the code the potential issue seems to be that
>> tun_dst_unclone() creates new metadata_dst instance with refcount==1,
>> increments the refcount with dst_hold() to value 2, then returns it.
>> This seems to imply that caller is expected to release one of the
>> references (second one if for skb), but none of the callers (including
>> original dev_fill_metadata_dst()) do that, so I guess I'm
>> misunderstanding something here.
>> 
>> Any tips or suggestions?
>
> I'd say there is no need to increase the dst refcount here after calling
> metadata_dst_alloc, as the metadata is local to the skb and the dst
> refcount was already initialized to 1. This might be an issue with
> commit fc4099f17240 ("openvswitch: Fix egress tunnel info."); I CCed
> Pravin, he might recall if there was a reason to increase the refcount.

I tried to remove the dst_hold(), but that caused underflows[0], so I
guess the current reference counting is required at least for some
use-cases.

[0]:

[  118.803011] ------------[ cut here ]------------                                    
[  118.803011] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803019] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803027] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803041] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803046] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803060] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803065] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803078] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803083] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803096] dst_release: dst:000000001fc13e61 refcnt:-2                             
[  118.803920] dst_release underflow                                                   
[  118.803937] WARNING: CPU: 4 PID: 0 at net/core/dst.c:173 dst_release+0x79/0x90                                                                                             
[  118.815961] Modules linked in: act_tunnel_key act_mirred act_skbedit veth vxlan ip6_udp_tunnel udp_tunnel act_gact cls_flower sch_ingress openvswitch nsh nf_conncount mlx5_ib mlx5_core mlxfw pci_hyperv_intf ptp pps_core nfsv3 nfs_acl xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_connt
rack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill overlay rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs sunrpc ib_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core virtio_net 
net_failover failover i2c_i801 i2c_smbus kvm_intel kvm pcspkr irqbypass crc32_pclmul ghash_clmulni_intel sch_fq_codel drm i2c_core ip_tables crc32c_intel serio_raw fuse [last unloaded: mlxfw]
[  118.829464] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.16.0+ #3                                                                                                           
[  118.830567] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014                                                                                                                                                                                                                              
[  118.832541] RIP: 0010:dst_release+0x79/0x90                                         
[  118.833372] Code: 04 e8 db 14 01 00 8b 4c 24 04 85 c0 74 c7 e9 4d 60 22 00 48 c7 c7 35 00 32 82 89 4c 24 04 c6 05 c5 47 e5 00 01 e8 6f c2 1b 00 <0f> 0b 8b 4c 24 04 eb cb 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40                                                                                                                                       
[  118.836566] RSP: 0018:ffffc90000180ee0 EFLAGS: 00010282                             
[  118.837546] RAX: 0000000000000000 RBX: ffff88813a139ab0 RCX: 0000000000000000                                                                                              
[  118.838912] RDX: 0000000000000102 RSI: ffffffff82286273 RDI: 00000000ffffffff                                                                                              
[  118.840278] RBP: 0000000000000004 R08: 0000000000000015 R09: ffffc90000180e78                                                                                              
[  118.841646] R10: ffffffff825c7000 R11: 0000000000000001 R12: ffff88811d3ab480                                                                                              
[  118.843008] R13: 0000000000000170 R14: 0000000000000000 R15: ffff8882f5a2e1b8                                                                                              
[  118.844371] FS:  0000000000000000(0000) GS:ffff8882f5a00000(0000) knlGS:0000000000000000                                                                                   
[  118.845968] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                                                                                                              
[  118.847100] CR2: 00007fa5e5abd7e0 CR3: 0000000175408005 CR4: 0000000000770ee0                                                                                              
[  118.848461] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000                                                                                              
[  118.849834] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400                                                                                              
[  118.851198] PKRU: 55555554                                                          
[  118.851815] Call Trace:                                                             
[  118.852387]  <IRQ>                                                                  
[  118.852894]  dst_cache_destroy+0x33/0x60                                            
[  118.853728]  dst_destroy+0xaa/0xb0                                                  
[  118.854465]  rcu_core+0x2a3/0x960                                                   
[  118.855197]  __do_softirq+0xf0/0x2f9                                                
[  118.855976]  __irq_exit_rcu+0xcc/0x120                                              
[  118.856765]  sysvec_apic_timer_interrupt+0xa2/0xd0                                  
[  118.857746]  </IRQ>                                                                 
[  118.858270]  <TASK>                                                                 
[  118.858775]  asm_sysvec_apic_timer_interrupt+0x12/0x20                              
[  118.859801] RIP: 0010:native_safe_halt+0xb/0x10                                     
[  118.860731] Code: 7e ff ff ff 7f 5b c3 65 48 8b 04 25 c0 cb 01 00 f0 80 48 02 20 48 8b 00 a8 08 75 c3 eb 80 cc eb 07 0f 00 2d 8f f5 4f 00 fb f4 <c3> 0f 1f 40 00 eb 07 0f 00 2d 7f f5 4f 00 f4 c3 cc cc cc cc cc 0f                                                                                                                                       
[  118.864175] RSP: 0018:ffffc9000009fef0 EFLAGS: 00000206                             
[  118.865223] RAX: 0000000000027f6c RBX: 0000000000000004 RCX: 0000000000000000                                                                                              
[  118.866504] RDX: ffff88817c090d50 RSI: ffffffff82286273 RDI: ffffffff8225bf7f                                                                                              
[  118.867774] RBP: ffff8881009d2e80 R08: 0000000000027f6b R09: 0000000000000000                                                                                              
[  118.869051] R10: 00000000fffd3b17 R11: 0000000000000001 R12: 0000000000000000                                                                                              
[  118.870326] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000                                                                                              
[  118.871601]  default_idle+0xa/0x10                                                  
[  118.872300]  default_idle_call+0x33/0xe0                                            
[  118.873081]  do_idle+0x208/0x270                                                    
[  118.873741]  cpu_startup_entry+0x19/0x20                                            
[  118.874562]  secondary_startup_64_no_verify+0xc3/0xcb                               
[  118.875604]  </TASK>                                                                
[  118.876140] ---[ end trace dec5061c76371ce7 ]---
Antoine Tenart Jan. 28, 2022, 5:01 p.m. UTC | #4
Hi Vlad,

Quoting Vlad Buslov (2022-01-20 13:58:18)
> On Thu 20 Jan 2022 at 12:27, Antoine Tenart <atenart@kernel.org> wrote:
> > Quoting Vlad Buslov (2022-01-20 08:38:05)
> >> 
> >> We have been getting memleaks in one of our tests that point to this
> >> code (test deletes vxlan device while running traffic redirected by OvS
> >> TC at the same time):
> >> 
> >> unreferenced object 0xffff8882d0114200 (size 256):
> >>     [<0000000097659d47>] metadata_dst_alloc+0x1f/0x470
> >>     [<000000007571c30f>] tun_dst_unclone+0xee/0x360 [vxlan]
> >>     [<00000000d2dcfd00>] vxlan_xmit_one+0x131d/0x2a00 [vxlan]

[...]

> >> Looking at the code the potential issue seems to be that
> >> tun_dst_unclone() creates new metadata_dst instance with refcount==1,
> >> increments the refcount with dst_hold() to value 2, then returns it.
> >> This seems to imply that caller is expected to release one of the
> >> references (second one if for skb), but none of the callers (including
> >> original dev_fill_metadata_dst()) do that, so I guess I'm
> >> misunderstanding something here.
> >> 
> >> Any tips or suggestions?
> >
> > I'd say there is no need to increase the dst refcount here after calling
> > metadata_dst_alloc, as the metadata is local to the skb and the dst
> > refcount was already initialized to 1. This might be an issue with
> > commit fc4099f17240 ("openvswitch: Fix egress tunnel info."); I CCed
> > Pravin, he might recall if there was a reason to increase the refcount.
> 
> I tried to remove the dst_hold(), but that caused underflows[0], so I
> guess the current reference counting is required at least for some
> use-cases.
> 
> [0]:
> 
> [  118.803011] dst_release: dst:000000001fc13e61 refcnt:-2                             

[...]

I finally had some time to look at this. Does the diff below fix your
issue?

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 14efa0ded75d..90a7a4daea9c 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -110,8 +110,8 @@ static inline struct metadata_dst *tun_rx_dst(int md_size)
 static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
 {
        struct metadata_dst *md_dst = skb_metadata_dst(skb);
-       int md_size;
        struct metadata_dst *new_md;
+       int md_size, ret;
 
        if (!md_dst || md_dst->type != METADATA_IP_TUNNEL)
                return ERR_PTR(-EINVAL);
@@ -123,8 +123,15 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
 
        memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
               sizeof(struct ip_tunnel_info) + md_size);
+#ifdef CONFIG_DST_CACHE
+       ret = dst_cache_init(&new_md->u.tun_info.dst_cache, GFP_ATOMIC);
+       if (ret) {
+               metadata_dst_free(new_md);
+               return ERR_PTR(ret);
+       }
+#endif
+
        skb_dst_drop(skb);
-       dst_hold(&new_md->dst);
        skb_dst_set(skb, &new_md->dst);
        return new_md;
 }

Antoine
Vlad Buslov Jan. 31, 2022, 11:26 a.m. UTC | #5
On Fri 28 Jan 2022 at 19:01, Antoine Tenart <atenart@kernel.org> wrote:
> Hi Vlad,
>
> Quoting Vlad Buslov (2022-01-20 13:58:18)
>> On Thu 20 Jan 2022 at 12:27, Antoine Tenart <atenart@kernel.org> wrote:
>> > Quoting Vlad Buslov (2022-01-20 08:38:05)
>> >> 
>> >> We have been getting memleaks in one of our tests that point to this
>> >> code (test deletes vxlan device while running traffic redirected by OvS
>> >> TC at the same time):
>> >> 
>> >> unreferenced object 0xffff8882d0114200 (size 256):
>> >>     [<0000000097659d47>] metadata_dst_alloc+0x1f/0x470
>> >>     [<000000007571c30f>] tun_dst_unclone+0xee/0x360 [vxlan]
>> >>     [<00000000d2dcfd00>] vxlan_xmit_one+0x131d/0x2a00 [vxlan]
>
> [...]
>
>> >> Looking at the code the potential issue seems to be that
>> >> tun_dst_unclone() creates new metadata_dst instance with refcount==1,
>> >> increments the refcount with dst_hold() to value 2, then returns it.
>> >> This seems to imply that caller is expected to release one of the
>> >> references (second one if for skb), but none of the callers (including
>> >> original dev_fill_metadata_dst()) do that, so I guess I'm
>> >> misunderstanding something here.
>> >> 
>> >> Any tips or suggestions?
>> >
>> > I'd say there is no need to increase the dst refcount here after calling
>> > metadata_dst_alloc, as the metadata is local to the skb and the dst
>> > refcount was already initialized to 1. This might be an issue with
>> > commit fc4099f17240 ("openvswitch: Fix egress tunnel info."); I CCed
>> > Pravin, he might recall if there was a reason to increase the refcount.
>> 
>> I tried to remove the dst_hold(), but that caused underflows[0], so I
>> guess the current reference counting is required at least for some
>> use-cases.
>> 
>> [0]:
>> 
>> [  118.803011] dst_release: dst:000000001fc13e61 refcnt:-2                             
>
> [...]
>
> I finally had some time to look at this. Does the diff below fix your
> issue?

Yes, with the patch applied I'm no longer able to reproduce memory leak.
Thanks for fixing this!

>
> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
> index 14efa0ded75d..90a7a4daea9c 100644
> --- a/include/net/dst_metadata.h
> +++ b/include/net/dst_metadata.h
> @@ -110,8 +110,8 @@ static inline struct metadata_dst *tun_rx_dst(int md_size)
>  static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>  {
>         struct metadata_dst *md_dst = skb_metadata_dst(skb);
> -       int md_size;
>         struct metadata_dst *new_md;
> +       int md_size, ret;
>  
>         if (!md_dst || md_dst->type != METADATA_IP_TUNNEL)
>                 return ERR_PTR(-EINVAL);
> @@ -123,8 +123,15 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>  
>         memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>                sizeof(struct ip_tunnel_info) + md_size);
> +#ifdef CONFIG_DST_CACHE
> +       ret = dst_cache_init(&new_md->u.tun_info.dst_cache, GFP_ATOMIC);
> +       if (ret) {
> +               metadata_dst_free(new_md);
> +               return ERR_PTR(ret);
> +       }
> +#endif
> +
>         skb_dst_drop(skb);
> -       dst_hold(&new_md->dst);
>         skb_dst_set(skb, &new_md->dst);
>         return new_md;
>  }
>
> Antoine
Antoine Tenart Jan. 31, 2022, 1:26 p.m. UTC | #6
Quoting Vlad Buslov (2022-01-31 12:26:47)
> On Fri 28 Jan 2022 at 19:01, Antoine Tenart <atenart@kernel.org> wrote:
> >
> > I finally had some time to look at this. Does the diff below fix your
> > issue?
> 
> Yes, with the patch applied I'm no longer able to reproduce memory leak.
> Thanks for fixing this!

Thanks for testing. I'll send a formal patch, can I add your Tested-by?

Also, do you know how to trigger the following code path in OVS
https://elixir.bootlin.com/linux/latest/source/net/openvswitch/actions.c#L944
? Would be good (not required) to test it, to ensure the fix doesn't
break it.

Thanks,
Antoine

> > diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
> > index 14efa0ded75d..90a7a4daea9c 100644
> > --- a/include/net/dst_metadata.h
> > +++ b/include/net/dst_metadata.h
> > @@ -110,8 +110,8 @@ static inline struct metadata_dst *tun_rx_dst(int md_size)
> >  static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
> >  {
> >         struct metadata_dst *md_dst = skb_metadata_dst(skb);
> > -       int md_size;
> >         struct metadata_dst *new_md;
> > +       int md_size, ret;
> >  
> >         if (!md_dst || md_dst->type != METADATA_IP_TUNNEL)
> >                 return ERR_PTR(-EINVAL);
> > @@ -123,8 +123,15 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
> >  
> >         memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
> >                sizeof(struct ip_tunnel_info) + md_size);
> > +#ifdef CONFIG_DST_CACHE
> > +       ret = dst_cache_init(&new_md->u.tun_info.dst_cache, GFP_ATOMIC);
> > +       if (ret) {
> > +               metadata_dst_free(new_md);
> > +               return ERR_PTR(ret);
> > +       }
> > +#endif
> > +
> >         skb_dst_drop(skb);
> > -       dst_hold(&new_md->dst);
> >         skb_dst_set(skb, &new_md->dst);
> >         return new_md;
> >  }
Stefano Brivio Jan. 31, 2022, 2:04 p.m. UTC | #7
Hi Antoine,

On Mon, 31 Jan 2022 14:26:47 +0100
Antoine Tenart <atenart@kernel.org> wrote:

> Quoting Vlad Buslov (2022-01-31 12:26:47)
> > On Fri 28 Jan 2022 at 19:01, Antoine Tenart <atenart@kernel.org> wrote:  
> > >
> > > I finally had some time to look at this. Does the diff below fix your
> > > issue?  
> > 
> > Yes, with the patch applied I'm no longer able to reproduce memory leak.
> > Thanks for fixing this!  
> 
> Thanks for testing. I'll send a formal patch, can I add your Tested-by?
> 
> Also, do you know how to trigger the following code path in OVS
> https://elixir.bootlin.com/linux/latest/source/net/openvswitch/actions.c#L944

I guess the selftests pmtu_ipv{4,6}_ovs_vxlan{4,6}_exception and
pmtu_ipv{4,6}_ovs_geneve{4,6}_exception from net/pmtu.sh:

	https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/net/pmtu.sh?id=ece1278a9b81bdfc088f087f8372a072b7010956#n81

should trigger that path once or twice per test, but I haven't tried
recently.
Antoine Tenart Jan. 31, 2022, 2:42 p.m. UTC | #8
Hi Stefano,

Quoting Stefano Brivio (2022-01-31 15:04:18)
> On Mon, 31 Jan 2022 14:26:47 +0100
> Antoine Tenart <atenart@kernel.org> wrote:
> > Quoting Vlad Buslov (2022-01-31 12:26:47)
> > > On Fri 28 Jan 2022 at 19:01, Antoine Tenart <atenart@kernel.org> wrote:  
> > > >
> > > > I finally had some time to look at this. Does the diff below fix your
> > > > issue?  
> > > 
> > > Yes, with the patch applied I'm no longer able to reproduce memory leak.
> > > Thanks for fixing this!  
> > 
> > Thanks for testing. I'll send a formal patch, can I add your Tested-by?
> > 
> > Also, do you know how to trigger the following code path in OVS
> > https://elixir.bootlin.com/linux/latest/source/net/openvswitch/actions.c#L944
> 
> I guess the selftests pmtu_ipv{4,6}_ovs_vxlan{4,6}_exception and
> pmtu_ipv{4,6}_ovs_geneve{4,6}_exception from net/pmtu.sh:
> 
>         https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/net/pmtu.sh?id=ece1278a9b81bdfc088f087f8372a072b7010956#n81
> 
> should trigger that path once or twice per test, but I haven't tried
> recently.

Thanks for the suggestion! I did run all 8 ptmu_*_ovs_* tests, they all
passed but didn't trigger a call to dev_fill_metadata_dst in
net/openvswitch/actions.c.

To be sure there wasn't a misunderstanding: I did test the PTMU code
path in Geneve/VXLAN (while one of the endpoint is an OVS port); but the
net/openvswitch/actions.c code path is something different, used to
retrieve tunnel egress info. I don't know when/how this is used by OVS.

Thanks,
Antoine
Vlad Buslov Jan. 31, 2022, 5:55 p.m. UTC | #9
On Mon 31 Jan 2022 at 15:26, Antoine Tenart <atenart@kernel.org> wrote:
> Quoting Vlad Buslov (2022-01-31 12:26:47)
>> On Fri 28 Jan 2022 at 19:01, Antoine Tenart <atenart@kernel.org> wrote:
>> >
>> > I finally had some time to look at this. Does the diff below fix your
>> > issue?
>> 
>> Yes, with the patch applied I'm no longer able to reproduce memory leak.
>> Thanks for fixing this!
>
> Thanks for testing. I'll send a formal patch, can I add your Tested-by?

Sure!

Reported-by: Vlad Buslov <vladbu@nvidia.com>
Tested-by: Vlad Buslov <vladbu@nvidia.com>

>
> Also, do you know how to trigger the following code path in OVS
> https://elixir.bootlin.com/linux/latest/source/net/openvswitch/actions.c#L944
> ? Would be good (not required) to test it, to ensure the fix doesn't
> break it.

Sorry, I don't. We mostly concentrate on testing hardware
offload-specific code paths (e.g. TC).

>
> Thanks,
> Antoine
>
>> > diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>> > index 14efa0ded75d..90a7a4daea9c 100644
>> > --- a/include/net/dst_metadata.h
>> > +++ b/include/net/dst_metadata.h
>> > @@ -110,8 +110,8 @@ static inline struct metadata_dst *tun_rx_dst(int md_size)
>> >  static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>> >  {
>> >         struct metadata_dst *md_dst = skb_metadata_dst(skb);
>> > -       int md_size;
>> >         struct metadata_dst *new_md;
>> > +       int md_size, ret;
>> >  
>> >         if (!md_dst || md_dst->type != METADATA_IP_TUNNEL)
>> >                 return ERR_PTR(-EINVAL);
>> > @@ -123,8 +123,15 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>> >  
>> >         memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>> >                sizeof(struct ip_tunnel_info) + md_size);
>> > +#ifdef CONFIG_DST_CACHE
>> > +       ret = dst_cache_init(&new_md->u.tun_info.dst_cache, GFP_ATOMIC);
>> > +       if (ret) {
>> > +               metadata_dst_free(new_md);
>> > +               return ERR_PTR(ret);
>> > +       }
>> > +#endif
>> > +
>> >         skb_dst_drop(skb);
>> > -       dst_hold(&new_md->dst);
>> >         skb_dst_set(skb, &new_md->dst);
>> >         return new_md;
>> >  }
diff mbox series

Patch

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 666dd201c3d5..53dbc67e8a34 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2725,12 +2725,17 @@  static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 			goto tx_error;
 		} else if (err) {
 			if (info) {
+				struct ip_tunnel_info *unclone;
 				struct in_addr src, dst;
 
+				unclone = skb_tunnel_info_unclone(skb);
+				if (unlikely(!unclone))
+					goto tx_error;
+
 				src = remote_ip.sin.sin_addr;
 				dst = local_ip.sin.sin_addr;
-				info->key.u.ipv4.src = src.s_addr;
-				info->key.u.ipv4.dst = dst.s_addr;
+				unclone->key.u.ipv4.src = src.s_addr;
+				unclone->key.u.ipv4.dst = dst.s_addr;
 			}
 			vxlan_encap_bypass(skb, vxlan, vxlan, vni, false);
 			dst_release(ndst);
@@ -2781,12 +2786,17 @@  static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 			goto tx_error;
 		} else if (err) {
 			if (info) {
+				struct ip_tunnel_info *unclone;
 				struct in6_addr src, dst;
 
+				unclone = skb_tunnel_info_unclone(skb);
+				if (unlikely(!unclone))
+					goto tx_error;
+
 				src = remote_ip.sin6.sin6_addr;
 				dst = local_ip.sin6.sin6_addr;
-				info->key.u.ipv6.src = src;
-				info->key.u.ipv6.dst = dst;
+				unclone->key.u.ipv6.src = src;
+				unclone->key.u.ipv6.dst = dst;
 			}
 
 			vxlan_encap_bypass(skb, vxlan, vxlan, vni, false);