diff mbox series

[net,v2] virtio_net: fix wrong buf address calculation when using xdp

Message ID 20220424102121.2686893-1-razor@blackwall.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net,v2] virtio_net: fix wrong buf address calculation when using xdp | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net, async
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers warning 14 maintainers not CCed: songliubraving@fb.com trix@redhat.com andrii@kernel.org pabeni@redhat.com nathan@kernel.org ndesaulniers@google.com kafai@fb.com john.fastabend@gmail.com llvm@lists.linux.dev yhs@fb.com bpf@vger.kernel.org hawk@kernel.org kpsingh@kernel.org ast@kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 20 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Nikolay Aleksandrov April 24, 2022, 10:21 a.m. UTC
We received a report[1] of kernel crashes when Cilium is used in XDP
mode with virtio_net after updating to newer kernels. After
investigating the reason it turned out that when using mergeable bufs
with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
calculates the build_skb address wrong because the offset can become less
than the headroom so it gets the address of the previous page (-X bytes
depending on how lower offset is):
 page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256

This is a pr_err() I added in the beginning of page_to_skb which clearly
shows offset that is less than headroom by adding 4 bytes of metadata
via an xdp prog. The calculations done are:
 receive_mergeable():
 headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
 offset = xdp.data - page_address(xdp_page) -
          vi->hdr_len - metasize;

 page_to_skb():
 p = page_address(page) + offset;
 ...
 buf = p - headroom;

Now buf goes -4 bytes from the page's starting address as can be seen
above which is set as skb->head and skb->data by build_skb later. Depending
on what's done with the skb (when it's freed most often) we get all kinds
of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
the new headroom after the xdp program has run, similar to how offset
and len are recalculated. Headroom is directly related to
data_hard_start, data and data_meta, so we use them to get the new size.
The result is correct (similar pr_err() in page_to_skb, one case of
xdp_page and one case of virtnet buf):
 a) Case with 4 bytes of metadata
 [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
 [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
 b) Case of pushing data +32 bytes
 [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
 [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
 c) Case of pushing data -33 bytes
 [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
 [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223

An example reproducer xdp prog[3] is below.

[1] https://github.com/cilium/cilium/issues/19453

[2] Two of the many traces:
 [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
 [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
 [   41.300891] kernel BUG at include/linux/mm.h:720!
 [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
 [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
 [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
 [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
 [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
 [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
 [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
 [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
 [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
 [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
 [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
 [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
 [   41.321387] Call Trace:
 [   41.321819]  <TASK>
 [   41.322193]  skb_release_data+0x13f/0x1c0
 [   41.322902]  __kfree_skb+0x20/0x30
 [   41.343870]  tcp_recvmsg_locked+0x671/0x880
 [   41.363764]  tcp_recvmsg+0x5e/0x1c0
 [   41.384102]  inet_recvmsg+0x42/0x100
 [   41.406783]  ? sock_recvmsg+0x1d/0x70
 [   41.428201]  sock_read_iter+0x84/0xd0
 [   41.445592]  ? 0xffffffffa3000000
 [   41.462442]  new_sync_read+0x148/0x160
 [   41.479314]  ? 0xffffffffa3000000
 [   41.496937]  vfs_read+0x138/0x190
 [   41.517198]  ksys_read+0x87/0xc0
 [   41.535336]  do_syscall_64+0x3b/0x90
 [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [   41.568050] RIP: 0033:0x48765b
 [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
 [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
 [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
 [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
 [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
 [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
 [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
 [   41.744254]  </TASK>
 [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net

 and

 [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
 [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
 [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
 [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
 [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
 [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
 [   33.532471] page dumped because: nonzero mapcount
 [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
 [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
 [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   33.532484] Call Trace:
 [   33.532496]  <TASK>
 [   33.532500]  dump_stack_lvl+0x45/0x5a
 [   33.532506]  bad_page.cold+0x63/0x94
 [   33.532510]  free_pcp_prepare+0x290/0x420
 [   33.532515]  free_unref_page+0x1b/0x100
 [   33.532518]  skb_release_data+0x13f/0x1c0
 [   33.532524]  kfree_skb_reason+0x3e/0xc0
 [   33.532527]  ip6_mc_input+0x23c/0x2b0
 [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
 [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0

[3] XDP program to reproduce(xdp_pass.c):
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>

 SEC("xdp_pass")
 int xdp_pkt_pass(struct xdp_md *ctx)
 {
          bpf_xdp_adjust_head(ctx, -(int)32);
          return XDP_PASS;
 }

 char _license[] SEC("license") = "GPL";

 compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
 load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass

CC: stable@vger.kernel.org
CC: Jason Wang <jasowang@redhat.com>
CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: virtualization@lists.linux-foundation.org
Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
---
v2: Recalculate headroom based on data, data_hard_start and data_meta

 drivers/net/virtio_net.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Xuan Zhuo April 24, 2022, 10:42 a.m. UTC | #1
On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> We received a report[1] of kernel crashes when Cilium is used in XDP
> mode with virtio_net after updating to newer kernels. After
> investigating the reason it turned out that when using mergeable bufs
> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> calculates the build_skb address wrong because the offset can become less
> than the headroom so it gets the address of the previous page (-X bytes
> depending on how lower offset is):
>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>
> This is a pr_err() I added in the beginning of page_to_skb which clearly
> shows offset that is less than headroom by adding 4 bytes of metadata
> via an xdp prog. The calculations done are:
>  receive_mergeable():
>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>  offset = xdp.data - page_address(xdp_page) -
>           vi->hdr_len - metasize;
>
>  page_to_skb():
>  p = page_address(page) + offset;
>  ...
>  buf = p - headroom;
>
> Now buf goes -4 bytes from the page's starting address as can be seen
> above which is set as skb->head and skb->data by build_skb later. Depending
> on what's done with the skb (when it's freed most often) we get all kinds
> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
> the new headroom after the xdp program has run, similar to how offset
> and len are recalculated. Headroom is directly related to
> data_hard_start, data and data_meta, so we use them to get the new size.
> The result is correct (similar pr_err() in page_to_skb, one case of
> xdp_page and one case of virtnet buf):
>  a) Case with 4 bytes of metadata
>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
>  b) Case of pushing data +32 bytes
>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
>  c) Case of pushing data -33 bytes
>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
>
> An example reproducer xdp prog[3] is below.
>
> [1] https://github.com/cilium/cilium/issues/19453
>
> [2] Two of the many traces:
>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>  [   41.321387] Call Trace:
>  [   41.321819]  <TASK>
>  [   41.322193]  skb_release_data+0x13f/0x1c0
>  [   41.322902]  __kfree_skb+0x20/0x30
>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>  [   41.384102]  inet_recvmsg+0x42/0x100
>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>  [   41.428201]  sock_read_iter+0x84/0xd0
>  [   41.445592]  ? 0xffffffffa3000000
>  [   41.462442]  new_sync_read+0x148/0x160
>  [   41.479314]  ? 0xffffffffa3000000
>  [   41.496937]  vfs_read+0x138/0x190
>  [   41.517198]  ksys_read+0x87/0xc0
>  [   41.535336]  do_syscall_64+0x3b/0x90
>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>  [   41.568050] RIP: 0033:0x48765b
>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>  [   41.744254]  </TASK>
>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>
>  and
>
>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>  [   33.532471] page dumped because: nonzero mapcount
>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   33.532484] Call Trace:
>  [   33.532496]  <TASK>
>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>  [   33.532506]  bad_page.cold+0x63/0x94
>  [   33.532510]  free_pcp_prepare+0x290/0x420
>  [   33.532515]  free_unref_page+0x1b/0x100
>  [   33.532518]  skb_release_data+0x13f/0x1c0
>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>
> [3] XDP program to reproduce(xdp_pass.c):
>  #include <linux/bpf.h>
>  #include <bpf/bpf_helpers.h>
>
>  SEC("xdp_pass")
>  int xdp_pkt_pass(struct xdp_md *ctx)
>  {
>           bpf_xdp_adjust_head(ctx, -(int)32);
>           return XDP_PASS;
>  }
>
>  char _license[] SEC("license") = "GPL";
>
>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>
> CC: stable@vger.kernel.org
> CC: Jason Wang <jasowang@redhat.com>
> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> CC: Daniel Borkmann <daniel@iogearbox.net>
> CC: "Michael S. Tsirkin" <mst@redhat.com>
> CC: virtualization@lists.linux-foundation.org
> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> ---
> v2: Recalculate headroom based on data, data_hard_start and data_meta
>
>  drivers/net/virtio_net.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..a12338de7ef1 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			 * xdp.data_meta were adjusted
>  			 */
>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> +
> +			/* recalculate headroom if xdp.data or xdp.data_meta
> +			 * were adjusted
> +			 */
> +			headroom = xdp.data - xdp.data_hard_start - metasize;


This is incorrect.


		data = page_address(xdp_page) + offset;
		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);

so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len

(page_address(xdp_page) + offset) points to virtio-net hdr.
(page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.

xdp.data_hard_start points to buf + vi->hdr_len

Thanks.


> +
>  			/* We can only create skb based on xdp_page. */
>  			if (unlikely(xdp_page != page)) {
>  				rcu_read_unlock();
> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>  						       len, PAGE_SIZE, false,
>  						       metasize,
> -						       VIRTIO_XDP_HEADROOM);
> +						       headroom);
>  				return head_skb;
>  			}
>  			break;
> --
> 2.35.1
>
Nikolay Aleksandrov April 24, 2022, 10:56 a.m. UTC | #2
On 24/04/2022 13:42, Xuan Zhuo wrote:
> On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> We received a report[1] of kernel crashes when Cilium is used in XDP
>> mode with virtio_net after updating to newer kernels. After
>> investigating the reason it turned out that when using mergeable bufs
>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>> calculates the build_skb address wrong because the offset can become less
>> than the headroom so it gets the address of the previous page (-X bytes
>> depending on how lower offset is):
>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>
>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>> shows offset that is less than headroom by adding 4 bytes of metadata
>> via an xdp prog. The calculations done are:
>>  receive_mergeable():
>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>  offset = xdp.data - page_address(xdp_page) -
>>           vi->hdr_len - metasize;
>>
>>  page_to_skb():
>>  p = page_address(page) + offset;
>>  ...
>>  buf = p - headroom;
>>
>> Now buf goes -4 bytes from the page's starting address as can be seen
>> above which is set as skb->head and skb->data by build_skb later. Depending
>> on what's done with the skb (when it's freed most often) we get all kinds
>> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
>> the new headroom after the xdp program has run, similar to how offset
>> and len are recalculated. Headroom is directly related to
>> data_hard_start, data and data_meta, so we use them to get the new size.
>> The result is correct (similar pr_err() in page_to_skb, one case of
>> xdp_page and one case of virtnet buf):
>>  a) Case with 4 bytes of metadata
>>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
>>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
>>  b) Case of pushing data +32 bytes
>>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
>>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
>>  c) Case of pushing data -33 bytes
>>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
>>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
>>
>> An example reproducer xdp prog[3] is below.
>>
>> [1] https://github.com/cilium/cilium/issues/19453
>>
>> [2] Two of the many traces:
>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>  [   41.321387] Call Trace:
>>  [   41.321819]  <TASK>
>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>  [   41.322902]  __kfree_skb+0x20/0x30
>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>  [   41.445592]  ? 0xffffffffa3000000
>>  [   41.462442]  new_sync_read+0x148/0x160
>>  [   41.479314]  ? 0xffffffffa3000000
>>  [   41.496937]  vfs_read+0x138/0x190
>>  [   41.517198]  ksys_read+0x87/0xc0
>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>  [   41.568050] RIP: 0033:0x48765b
>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>  [   41.744254]  </TASK>
>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>
>>  and
>>
>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>  [   33.532471] page dumped because: nonzero mapcount
>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   33.532484] Call Trace:
>>  [   33.532496]  <TASK>
>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>  [   33.532506]  bad_page.cold+0x63/0x94
>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>  [   33.532515]  free_unref_page+0x1b/0x100
>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>
>> [3] XDP program to reproduce(xdp_pass.c):
>>  #include <linux/bpf.h>
>>  #include <bpf/bpf_helpers.h>
>>
>>  SEC("xdp_pass")
>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>  {
>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>           return XDP_PASS;
>>  }
>>
>>  char _license[] SEC("license") = "GPL";
>>
>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>
>> CC: stable@vger.kernel.org
>> CC: Jason Wang <jasowang@redhat.com>
>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> CC: Daniel Borkmann <daniel@iogearbox.net>
>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>> CC: virtualization@lists.linux-foundation.org
>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>> ---
>> v2: Recalculate headroom based on data, data_hard_start and data_meta
>>
>>  drivers/net/virtio_net.c | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 87838cbe38cf..a12338de7ef1 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  			 * xdp.data_meta were adjusted
>>  			 */
>>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>> +
>> +			/* recalculate headroom if xdp.data or xdp.data_meta
>> +			 * were adjusted
>> +			 */
>> +			headroom = xdp.data - xdp.data_hard_start - metasize;
> 
> 
> This is incorrect.
> 
> 
> 		data = page_address(xdp_page) + offset;
> 		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
> 		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
> 				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);
> 
> so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len
> 
> (page_address(xdp_page) + offset) points to virtio-net hdr.
> (page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.
> 
> xdp.data_hard_start points to buf + vi->hdr_len
> 
> Thanks.
> 

xdp.data points to buf + vi->hdr_len + VIRTIO_XDP_HEADROOM, so we calculate
xdp.data - xdp.data_hard_start, i.e. buf + vi->hdr_len + VIRTIO_XDP_HEADROOM - (buf + vi->hdr_len)

You can see the headrooms from my tests above, they are correct and they match exactly
the values from the headroom calculations that you suggested earlier.

> 
>> +
>>  			/* We can only create skb based on xdp_page. */
>>  			if (unlikely(xdp_page != page)) {
>>  				rcu_read_unlock();
>> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>  						       len, PAGE_SIZE, false,
>>  						       metasize,
>> -						       VIRTIO_XDP_HEADROOM);
>> +						       headroom);
>>  				return head_skb;
>>  			}
>>  			break;
>> --
>> 2.35.1
>>
Xuan Zhuo April 24, 2022, 11:18 a.m. UTC | #3
On Sun, 24 Apr 2022 13:56:17 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 24/04/2022 13:42, Xuan Zhuo wrote:
> > On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >> We received a report[1] of kernel crashes when Cilium is used in XDP
> >> mode with virtio_net after updating to newer kernels. After
> >> investigating the reason it turned out that when using mergeable bufs
> >> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >> calculates the build_skb address wrong because the offset can become less
> >> than the headroom so it gets the address of the previous page (-X bytes
> >> depending on how lower offset is):
> >>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>
> >> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >> shows offset that is less than headroom by adding 4 bytes of metadata
> >> via an xdp prog. The calculations done are:
> >>  receive_mergeable():
> >>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>  offset = xdp.data - page_address(xdp_page) -
> >>           vi->hdr_len - metasize;
> >>
> >>  page_to_skb():
> >>  p = page_address(page) + offset;
> >>  ...
> >>  buf = p - headroom;
> >>
> >> Now buf goes -4 bytes from the page's starting address as can be seen
> >> above which is set as skb->head and skb->data by build_skb later. Depending
> >> on what's done with the skb (when it's freed most often) we get all kinds
> >> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
> >> the new headroom after the xdp program has run, similar to how offset
> >> and len are recalculated. Headroom is directly related to
> >> data_hard_start, data and data_meta, so we use them to get the new size.
> >> The result is correct (similar pr_err() in page_to_skb, one case of
> >> xdp_page and one case of virtnet buf):
> >>  a) Case with 4 bytes of metadata
> >>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
> >>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
> >>  b) Case of pushing data +32 bytes
> >>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
> >>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
> >>  c) Case of pushing data -33 bytes
> >>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
> >>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
> >>
> >> An example reproducer xdp prog[3] is below.
> >>
> >> [1] https://github.com/cilium/cilium/issues/19453
> >>
> >> [2] Two of the many traces:
> >>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>  [   41.321387] Call Trace:
> >>  [   41.321819]  <TASK>
> >>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>  [   41.322902]  __kfree_skb+0x20/0x30
> >>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>  [   41.445592]  ? 0xffffffffa3000000
> >>  [   41.462442]  new_sync_read+0x148/0x160
> >>  [   41.479314]  ? 0xffffffffa3000000
> >>  [   41.496937]  vfs_read+0x138/0x190
> >>  [   41.517198]  ksys_read+0x87/0xc0
> >>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>  [   41.568050] RIP: 0033:0x48765b
> >>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>  [   41.744254]  </TASK>
> >>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>
> >>  and
> >>
> >>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>  [   33.532471] page dumped because: nonzero mapcount
> >>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   33.532484] Call Trace:
> >>  [   33.532496]  <TASK>
> >>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>  [   33.532506]  bad_page.cold+0x63/0x94
> >>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>  [   33.532515]  free_unref_page+0x1b/0x100
> >>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>
> >> [3] XDP program to reproduce(xdp_pass.c):
> >>  #include <linux/bpf.h>
> >>  #include <bpf/bpf_helpers.h>
> >>
> >>  SEC("xdp_pass")
> >>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>  {
> >>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>           return XDP_PASS;
> >>  }
> >>
> >>  char _license[] SEC("license") = "GPL";
> >>
> >>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>
> >> CC: stable@vger.kernel.org
> >> CC: Jason Wang <jasowang@redhat.com>
> >> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >> CC: Daniel Borkmann <daniel@iogearbox.net>
> >> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >> CC: virtualization@lists.linux-foundation.org
> >> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >> ---
> >> v2: Recalculate headroom based on data, data_hard_start and data_meta
> >>
> >>  drivers/net/virtio_net.c | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 87838cbe38cf..a12338de7ef1 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>  			 * xdp.data_meta were adjusted
> >>  			 */
> >>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> >> +
> >> +			/* recalculate headroom if xdp.data or xdp.data_meta
> >> +			 * were adjusted
> >> +			 */
> >> +			headroom = xdp.data - xdp.data_hard_start - metasize;
> >
> >
> > This is incorrect.
> >
> >
> > 		data = page_address(xdp_page) + offset;
> > 		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
> > 		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
> > 				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);
> >
> > so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len
> >
> > (page_address(xdp_page) + offset) points to virtio-net hdr.
> > (page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.
> >
> > xdp.data_hard_start points to buf + vi->hdr_len
> >
> > Thanks.
> >
>
> xdp.data points to buf + vi->hdr_len + VIRTIO_XDP_HEADROOM, so we calculate
> xdp.data - xdp.data_hard_start, i.e. buf + vi->hdr_len + VIRTIO_XDP_HEADROOM - (buf + vi->hdr_len)
>
> You can see the headrooms from my tests above, they are correct and they match exactly
> the values from the headroom calculations that you suggested earlier.


OK. You are right, xdp.data, xdp.data_hard_start have an offset of hdr_len. I
hope this can be explained in the comments, because the headroom we want to get
is virtio_hdr - buf. Although the value here are equal.

In addition, if you are going to post v2, I think you should post a new thread
separately instead of replying in the previous thread.

Thanks.


>
> >
> >> +
> >>  			/* We can only create skb based on xdp_page. */
> >>  			if (unlikely(xdp_page != page)) {
> >>  				rcu_read_unlock();
> >> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >>  						       len, PAGE_SIZE, false,
> >>  						       metasize,
> >> -						       VIRTIO_XDP_HEADROOM);
> >> +						       headroom);
> >>  				return head_skb;
> >>  			}
> >>  			break;
> >> --
> >> 2.35.1
> >>
>
Nikolay Aleksandrov April 24, 2022, 2:53 p.m. UTC | #4
On 24/04/2022 14:18, Xuan Zhuo wrote:
> On Sun, 24 Apr 2022 13:56:17 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 24/04/2022 13:42, Xuan Zhuo wrote:
>>> On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
[snip]
>>>>
>>>> CC: stable@vger.kernel.org
>>>> CC: Jason Wang <jasowang@redhat.com>
>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>>> CC: virtualization@lists.linux-foundation.org
>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>> ---
>>>> v2: Recalculate headroom based on data, data_hard_start and data_meta
>>>>
>>>>  drivers/net/virtio_net.c | 8 +++++++-
>>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..a12338de7ef1 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>  			 * xdp.data_meta were adjusted
>>>>  			 */
>>>>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>>>> +
>>>> +			/* recalculate headroom if xdp.data or xdp.data_meta
>>>> +			 * were adjusted
>>>> +			 */
>>>> +			headroom = xdp.data - xdp.data_hard_start - metasize;
>>>
>>>
>>> This is incorrect.
>>>
>>>
>>> 		data = page_address(xdp_page) + offset;
>>> 		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
>>> 		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
>>> 				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);
>>>
>>> so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len
>>>
>>> (page_address(xdp_page) + offset) points to virtio-net hdr.
>>> (page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.
>>>
>>> xdp.data_hard_start points to buf + vi->hdr_len
>>>
>>> Thanks.
>>>
>>
>> xdp.data points to buf + vi->hdr_len + VIRTIO_XDP_HEADROOM, so we calculate
>> xdp.data - xdp.data_hard_start, i.e. buf + vi->hdr_len + VIRTIO_XDP_HEADROOM - (buf + vi->hdr_len)
>>
>> You can see the headrooms from my tests above, they are correct and they match exactly
>> the values from the headroom calculations that you suggested earlier.
> 
> 
> OK. You are right, xdp.data, xdp.data_hard_start have an offset of hdr_len. I
> hope this can be explained in the comments, because the headroom we want to get
> is virtio_hdr - buf. Although the value here are equal.
> 

I think it's normal for them to be equal because buf + offset points to vnet_hdr start,
that is it doesn't include the vnet_hdr size, so all that is left to subtract to get
to buf is offset - headroom after the prepare (given correct headroom, of course). 
The linearized xdp page buf has the following initial geometry (at prepare):
 offset = VIRTIO_XDP_HEADROOM
 headroom = VIRTIO_XDP_HEADROOM
 data_hard_start = page_address + vnet_hdr
 data = page_address + vnet_hdr + headroom
 data_meta = data

after xdp prog has run:
offset is recalculated as: (page_address + vnet_hdr + adjusted headroom) - page_address -
                            vnet_hdr - metasize = adjusted headroom - metasize

I wrote adjusted headroom because it depends on data and data_meta adjustments done by
the xdp prog, so based on the above we can get back to page_address (or buf if it's a virtnet
buf) by subtracting the following headroom recalculation:
 headroom = data - data_hard_start - metasize =
 (page_address + vnet_hdr + adjusted headroom) - page_address - vnet_hdr - metasize =
 adjusted headroom - metasize

That is because in page_to_skb() p points to page_address + recalculated offset, i.e.
p = page_address + (adjusted headroom - metasize) for the linearized case.
For the other case it's the same just the initial offset is a larger value.

I'll add a comment that page_address + offset always points to vnet_hdr start so
it will be equal to headroom which is data - data_hard_start because we have
data = page_address + vnet_hdr + adjusted headroom and data_hard_start at page_address
+ vnet_hdr. Note that we have adjusted headroom + vnet_hdr bytes available behind data
always, so for offset to point to vnet_hdr it has to be equal to headroom, it just
starts at page_address while the actual headroom starts at page_address + vnet_hdr to
reserve that many bytes.

I'll see how I can make that more concise. :)

> In addition, if you are going to post v2, I think you should post a new thread
> separately instead of replying in the previous thread.
> 

sure, I don't mind either way

> Thanks.
> 
> 

Cheers,
 Nik

>>
>>>
>>>> +
>>>>  			/* We can only create skb based on xdp_page. */
>>>>  			if (unlikely(xdp_page != page)) {
>>>>  				rcu_read_unlock();
>>>> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>  						       len, PAGE_SIZE, false,
>>>>  						       metasize,
>>>> -						       VIRTIO_XDP_HEADROOM);
>>>> +						       headroom);
>>>>  				return head_skb;
>>>>  			}
>>>>  			break;
>>>> --
>>>> 2.35.1
>>>>
>>
diff mbox series

Patch

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..a12338de7ef1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1005,6 +1005,12 @@  static struct sk_buff *receive_mergeable(struct net_device *dev,
 			 * xdp.data_meta were adjusted
 			 */
 			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
+
+			/* recalculate headroom if xdp.data or xdp.data_meta
+			 * were adjusted
+			 */
+			headroom = xdp.data - xdp.data_hard_start - metasize;
+
 			/* We can only create skb based on xdp_page. */
 			if (unlikely(xdp_page != page)) {
 				rcu_read_unlock();
@@ -1012,7 +1018,7 @@  static struct sk_buff *receive_mergeable(struct net_device *dev,
 				head_skb = page_to_skb(vi, rq, xdp_page, offset,
 						       len, PAGE_SIZE, false,
 						       metasize,
-						       VIRTIO_XDP_HEADROOM);
+						       headroom);
 				return head_skb;
 			}
 			break;