Message ID | 20240216120144.24037-1-fw@strlen.de (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: ip_tunnel: do not adjust device headroom on xmit | expand |
On Fri, 2024-02-16 at 13:01 +0100, Florian Westphal wrote: > syzkaller triggered following kasan splat: > BUG: KASAN: use-after-free in __skb_flow_dissect+0x19d1/0x7a50 net/core/flow_dissector.c:1170 > Read of size 1 at addr ffff88812fb4000e by task syz-executor183/5191 > [..] > kasan_report+0xda/0x110 mm/kasan/report.c:588 > __skb_flow_dissect+0x19d1/0x7a50 net/core/flow_dissector.c:1170 > skb_flow_dissect_flow_keys include/linux/skbuff.h:1514 [inline] > ___skb_get_hash net/core/flow_dissector.c:1791 [inline] > __skb_get_hash+0xc7/0x540 net/core/flow_dissector.c:1856 > skb_get_hash include/linux/skbuff.h:1556 [inline] > ip_tunnel_xmit+0x1855/0x33c0 net/ipv4/ip_tunnel.c:748 > ipip_tunnel_xmit+0x3cc/0x4e0 net/ipv4/ipip.c:308 > __netdev_start_xmit include/linux/netdevice.h:4940 [inline] > netdev_start_xmit include/linux/netdevice.h:4954 [inline] > xmit_one net/core/dev.c:3548 [inline] > dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3564 > __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4349 > dev_queue_xmit include/linux/netdevice.h:3134 [inline] > neigh_connected_output+0x42c/0x5d0 net/core/neighbour.c:1592 > ... > ip_finish_output2+0x833/0x2550 net/ipv4/ip_output.c:235 > ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323 > .. > iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82 > ip_tunnel_xmit+0x1dbc/0x33c0 net/ipv4/ip_tunnel.c:831 > ipgre_xmit+0x4a1/0x980 net/ipv4/ip_gre.c:665 > __netdev_start_xmit include/linux/netdevice.h:4940 [inline] > netdev_start_xmit include/linux/netdevice.h:4954 [inline] > xmit_one net/core/dev.c:3548 [inline] > dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3564 > ... > > The splat occurs because skb->data points past skb->head allocated > area. This is because neigh layer does: > __skb_pull(skb, skb_network_offset(skb)); > > ... but skb_network_offset() returns a negative offset and > __skb_pull() arg is unsigned. IOW, we skb->data gets "adjusted" > by a huge value. > > The negative value is returned because skb->head and skb->data distance is > more than 64k and skb->network_header (u16) has wrapped around. > > The bug is in the ip_tunnel infrastructure, which can cause > dev->needed_headroom to increment ad infinitum. > > The syzkaller reproducer consists of packets getting routed via a gre > tunnel, and route of gre encapsulated packets pointing at another (ipip) > tunnel. The ipip encapsulation finds gre0 as next output device. > > This results in the following pattern: > > 1). First packet is to be sent out via gre0. > Route lookup found an output device, ipip0. > > 2). > ip_tunnel_xmit for gre0 bumps gre0->needed_headroom based on the > future output device, rt.dev->needed_headroom (ipip0). > > 3). > ip output / start_xmit moves skb on to ipip0. which runs the same > code path again (xmit recursion). > > 4). > Routing step for the post-gre0-encap packet finds gre0 as output > device to use for ipip0 encapsulated packet. > > tunl0->needed_headroom is then incremented based on the (already > bumped) gre0 device headroom. > > This repeats for every future packet: > > gre0->needed_headroom gets inflated because previous packets' > ipip0 step incremented rt->dev (gre0) headroom, and ipip0 incremented > because gre0 needed_headroom was increased. > > For each subsequent packet, gre/ipip0->needed_headroom grows until > post-expand-head reallocations result in a skb->head/data distance of > more than 64k. > > Once that happens, skb->network_header (u16) wraps around when > pskb_expand_head tries to make sure that skb_network_offset() is > unchanged after the headroom expansion/reallocation. > > After this skb_network_offset(skb) returns a different (and > negative) result post headroom expansion. > > The next trip to neigh layer (or anything else that would __skb_pull > the network header) makes skb->data point to a memory location outside > skb->head area. > > Remove this optimization. > > Alternative would be to cap the needed_headroom update to a reasonable > upperlimit such as 256 to prevent growth. I fear this will cause visible regression for tunnels in metadata mode: the egress device should not be available at creation time, so the guessed needed_headroom should be low. I recall (very old) performance tests showing head reallocation as quite noticeable while traversing tunnels. I think enforcing the upper limit should be safer and hopefully not too complex. We could use even an higher bound (say 512) as just need to avoid 16bits overflow, right? Thanks! Paolo
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index a4513ffb66cb..1dde07031122 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -632,10 +632,7 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, } headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len; - if (headroom > READ_ONCE(dev->needed_headroom)) - WRITE_ONCE(dev->needed_headroom, headroom); - - if (skb_cow_head(skb, READ_ONCE(dev->needed_headroom))) { + if (skb_cow_head(skb, headroom)) { ip_rt_put(rt); goto tx_dropped; } @@ -818,10 +815,8 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, max_headroom = LL_RESERVED_SPACE(rt->dst.dev) + sizeof(struct iphdr) + rt->dst.header_len + ip_encap_hlen(&tunnel->encap); - if (max_headroom > READ_ONCE(dev->needed_headroom)) - WRITE_ONCE(dev->needed_headroom, max_headroom); - if (skb_cow_head(skb, READ_ONCE(dev->needed_headroom))) { + if (skb_cow_head(skb, max_headroom)) { ip_rt_put(rt); DEV_STATS_INC(dev, tx_dropped); kfree_skb(skb);
syzkaller triggered following kasan splat: BUG: KASAN: use-after-free in __skb_flow_dissect+0x19d1/0x7a50 net/core/flow_dissector.c:1170 Read of size 1 at addr ffff88812fb4000e by task syz-executor183/5191 [..] kasan_report+0xda/0x110 mm/kasan/report.c:588 __skb_flow_dissect+0x19d1/0x7a50 net/core/flow_dissector.c:1170 skb_flow_dissect_flow_keys include/linux/skbuff.h:1514 [inline] ___skb_get_hash net/core/flow_dissector.c:1791 [inline] __skb_get_hash+0xc7/0x540 net/core/flow_dissector.c:1856 skb_get_hash include/linux/skbuff.h:1556 [inline] ip_tunnel_xmit+0x1855/0x33c0 net/ipv4/ip_tunnel.c:748 ipip_tunnel_xmit+0x3cc/0x4e0 net/ipv4/ipip.c:308 __netdev_start_xmit include/linux/netdevice.h:4940 [inline] netdev_start_xmit include/linux/netdevice.h:4954 [inline] xmit_one net/core/dev.c:3548 [inline] dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3564 __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4349 dev_queue_xmit include/linux/netdevice.h:3134 [inline] neigh_connected_output+0x42c/0x5d0 net/core/neighbour.c:1592 ... ip_finish_output2+0x833/0x2550 net/ipv4/ip_output.c:235 ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323 .. iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1dbc/0x33c0 net/ipv4/ip_tunnel.c:831 ipgre_xmit+0x4a1/0x980 net/ipv4/ip_gre.c:665 __netdev_start_xmit include/linux/netdevice.h:4940 [inline] netdev_start_xmit include/linux/netdevice.h:4954 [inline] xmit_one net/core/dev.c:3548 [inline] dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3564 ... The splat occurs because skb->data points past skb->head allocated area. This is because neigh layer does: __skb_pull(skb, skb_network_offset(skb)); ... but skb_network_offset() returns a negative offset and __skb_pull() arg is unsigned. IOW, we skb->data gets "adjusted" by a huge value. The negative value is returned because skb->head and skb->data distance is more than 64k and skb->network_header (u16) has wrapped around. The bug is in the ip_tunnel infrastructure, which can cause dev->needed_headroom to increment ad infinitum. The syzkaller reproducer consists of packets getting routed via a gre tunnel, and route of gre encapsulated packets pointing at another (ipip) tunnel. The ipip encapsulation finds gre0 as next output device. This results in the following pattern: 1). First packet is to be sent out via gre0. Route lookup found an output device, ipip0. 2). ip_tunnel_xmit for gre0 bumps gre0->needed_headroom based on the future output device, rt.dev->needed_headroom (ipip0). 3). ip output / start_xmit moves skb on to ipip0. which runs the same code path again (xmit recursion). 4). Routing step for the post-gre0-encap packet finds gre0 as output device to use for ipip0 encapsulated packet. tunl0->needed_headroom is then incremented based on the (already bumped) gre0 device headroom. This repeats for every future packet: gre0->needed_headroom gets inflated because previous packets' ipip0 step incremented rt->dev (gre0) headroom, and ipip0 incremented because gre0 needed_headroom was increased. For each subsequent packet, gre/ipip0->needed_headroom grows until post-expand-head reallocations result in a skb->head/data distance of more than 64k. Once that happens, skb->network_header (u16) wraps around when pskb_expand_head tries to make sure that skb_network_offset() is unchanged after the headroom expansion/reallocation. After this skb_network_offset(skb) returns a different (and negative) result post headroom expansion. The next trip to neigh layer (or anything else that would __skb_pull the network header) makes skb->data point to a memory location outside skb->head area. Remove this optimization. Alternative would be to cap the needed_headroom update to a reasonable upperlimit such as 256 to prevent growth. Lets try the simpler solution first and see if any performance regressions get reported. Reported-by: syzbot+bfde3bef047a81b8fde6@syzkaller.appspotmail.com Closes: https://groups.google.com/g/syzkaller-bugs/c/fL9G6GtWskY/m/VKk_PR5FBAAJ Fixes: 243aad830e8a ("ip_gre: include route header_len in max_headroom calculation") Signed-off-by: Florian Westphal <fw@strlen.de> --- net/ipv4/ip_tunnel.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-)