Message ID | 792238.1690667367@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | RFC |
Headers | show |
Series | Endless loop in udp with MSG_SPLICE_READ - Re: [syzbot] [fs?] INFO: task hung in pipe_release (4) | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
David Howells wrote: > Hi Jakub, Willem, > > I think I'm going to need your help with this one. > > > > syzbot has bisected this issue to: > > > > > > commit 7ac7c987850c3ec617c778f7bd871804dc1c648d > > > Author: David Howells <dhowells@redhat.com> > > > Date: Mon May 22 12:11:22 2023 +0000 > > > > > > udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES > > > > > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000 > > > start commit: 3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w.. > > > git tree: upstream > > > final oops: https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000 > > > console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=150188feee7071a7 > > > dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000 > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000 > > > > > > Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com > > > Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES") > > > > > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection > > The issue that syzbot is triggering seems to be something to do with the > calculations in the "if (copy <= 0) { ... }" chunk in __ip_append_data() when > MSG_SPLICE_PAGES is in operation. > > What seems to happen is that the test program uses sendmsg() + MSG_MORE to > loads a UDP packet with 1406 bytes of data to the MTU size (1434) and then > splices in 8 extra bytes. > > r3 = socket$inet_udp(0x2, 0x2, 0x0) > setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4) > bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10) > connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10) > sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800) > write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8) > splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0) > > This results in some negative intermediate values turning up in the > calculations - and this results in the remaining length being made longer > from 8 to 14. > > I added some printks (patch attached), resulting in the attached tracelines: > > ==>splice_to_socket() 7099 > udp_sendmsg(8,8) > __ip_append_data(copy=-6,len=8, mtu=1434 skblen=1434 maxfl=1428) > pagedlen 14 = 14 - 0 > copy -6 = 14 - 0 - 6 - 14 > length 8 -= -6 + 0 > __ip_append_data(copy=1414,len=14, mtu=1434 skblen=20 maxfl=1428) > copy=1414 len=14 > skb_splice_from_iter(8,14) > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428) > copy=1406 len=6 > skb_splice_from_iter(0,6) > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428) > copy=1406 len=6 > skb_splice_from_iter(0,6) > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428) > copy=1406 len=6 > skb_splice_from_iter(0,6) > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428) > copy=1406 len=6 > skb_splice_from_iter(0,6) > copy=1406 len=6 > skb_splice_from_iter(0,6) > ... > > 'copy' gets calculated as -6 because the maxfraglen (maxfl=1428) is 8 bytes > less than the amount of data then in the packet (skblen=1434). > > 'copy' gets recalculated part way down as -6 from datalen (14) - transhdrlen > (0) - fraggap (6) - pagedlen (14). > > datalen is 14 because it was length (8) + fraggap (6). > > Inside skb_splice_from_iter(), we eventually end up in an enless loop in which > msg_iter.count is 0 and the length to be copied is 6. It always returns 0 > because there's nothing to copy, and so __ip_append_data() cycles round the > loop endlessly. > > Any suggestion as to how to fix this? Still ingesting. The syzkaller repro runs in threaded mode, so syscalls should not be interpreted in order. There are two seemingly (but evidently not really) independent operations: One involving splicing pipe(&(0x7f0000000100)={<r0=>0xffffffffffffffff, <r1=>0xffffffffffffffff}) r2 = socket$inet_udp(0x2, 0x2, 0x0) write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8) splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0) close(r2) And separately the MSG_MORE transmission that you mentioned r3 = socket$inet_udp(0x2, 0x2, 0x0) setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4) bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10) connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10) sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800) This second program also sets setsockopt SOL_SOCKET/SO_BROADCAST. That is likely not some random noise in the program (but that can be easily checked, by removing it -- assuming the repro triggers quickly). Question is whether the infinite loop happens on the r2, the socket involving MSG_SPLICE_PAGES, or r3, the socket involving SO_BROADCAST. If the second, on the surface it would seem that splicing is entirely unrelated.
Willem de Bruijn wrote: > David Howells wrote: > > Hi Jakub, Willem, > > > > I think I'm going to need your help with this one. > > > > > > syzbot has bisected this issue to: > > > > > > > > commit 7ac7c987850c3ec617c778f7bd871804dc1c648d > > > > Author: David Howells <dhowells@redhat.com> > > > > Date: Mon May 22 12:11:22 2023 +0000 > > > > > > > > udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES > > > > > > > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000 > > > > start commit: 3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w.. > > > > git tree: upstream > > > > final oops: https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000 > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000 > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=150188feee7071a7 > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e > > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000 > > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000 > > > > > > > > Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com > > > > Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES") > > > > > > > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection > > > > The issue that syzbot is triggering seems to be something to do with the > > calculations in the "if (copy <= 0) { ... }" chunk in __ip_append_data() when > > MSG_SPLICE_PAGES is in operation. > > > > What seems to happen is that the test program uses sendmsg() + MSG_MORE to > > loads a UDP packet with 1406 bytes of data to the MTU size (1434) and then > > splices in 8 extra bytes. > > > > r3 = socket$inet_udp(0x2, 0x2, 0x0) > > setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4) > > bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10) > > connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10) > > sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800) > > write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8) > > splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0) > > > > This results in some negative intermediate values turning up in the > > calculations - and this results in the remaining length being made longer > > from 8 to 14. > > > > I added some printks (patch attached), resulting in the attached tracelines: > > > > ==>splice_to_socket() 7099 > > udp_sendmsg(8,8) > > __ip_append_data(copy=-6,len=8, mtu=1434 skblen=1434 maxfl=1428) > > pagedlen 14 = 14 - 0 > > copy -6 = 14 - 0 - 6 - 14 > > length 8 -= -6 + 0 > > __ip_append_data(copy=1414,len=14, mtu=1434 skblen=20 maxfl=1428) > > copy=1414 len=14 > > skb_splice_from_iter(8,14) > > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428) > > copy=1406 len=6 > > skb_splice_from_iter(0,6) > > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428) > > copy=1406 len=6 > > skb_splice_from_iter(0,6) > > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428) > > copy=1406 len=6 > > skb_splice_from_iter(0,6) > > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428) > > copy=1406 len=6 > > skb_splice_from_iter(0,6) > > copy=1406 len=6 > > skb_splice_from_iter(0,6) > > ... > > > > 'copy' gets calculated as -6 because the maxfraglen (maxfl=1428) is 8 bytes > > less than the amount of data then in the packet (skblen=1434). > > > > 'copy' gets recalculated part way down as -6 from datalen (14) - transhdrlen > > (0) - fraggap (6) - pagedlen (14). > > > > datalen is 14 because it was length (8) + fraggap (6). > > > > Inside skb_splice_from_iter(), we eventually end up in an enless loop in which > > msg_iter.count is 0 and the length to be copied is 6. It always returns 0 > > because there's nothing to copy, and so __ip_append_data() cycles round the > > loop endlessly. > > > > Any suggestion as to how to fix this? > > Still ingesting. > > The syzkaller repro runs in threaded mode, so syscalls should not be > interpreted in order. > > There are two seemingly (but evidently not really) independent > operations: > > One involving splicing > > pipe(&(0x7f0000000100)={<r0=>0xffffffffffffffff, <r1=>0xffffffffffffffff}) > r2 = socket$inet_udp(0x2, 0x2, 0x0) > write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8) > splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0) > close(r2) > > And separately the MSG_MORE transmission that you mentioned > > r3 = socket$inet_udp(0x2, 0x2, 0x0) > setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4) > bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10) > connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10) > sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800) > > This second program also sets setsockopt SOL_SOCKET/SO_BROADCAST. That > is likely not some random noise in the program (but that can be easily > checked, by removing it -- assuming the repro triggers quickly). > > Question is whether the infinite loop happens on the r2, the socket > involving MSG_SPLICE_PAGES, or r3, the socket involving SO_BROADCAST. > If the second, on the surface it would seem that splicing is entirely > unrelated. I still don't entirely follow how the splice and sendmmsg end up on the same socket. Also, the sendmmsg in the case on the dashboard is very odd, a vlen of 0x4000000000003bd and flags MSG_MORE | MSG_CONFIRM. But maybe other runs have more sensible flags here. The issue appears to be with appending through splicing to an skb that exceeds the length with fragments, triggering the if (fraggap) branch to copy some trailer from skb_prev to skb, then appending the spliced data. Perhaps not an true fix based on understanding in detail, but one way out may be to disable splicing if !transhdrlen (which ip_append_data clears if appending).
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > The syzkaller repro runs in threaded mode, so syscalls should not be > interpreted in order. I think they are actually ordered. It's kind of weirdly done, though, flipping back and forth between threads and using futexes for synchronisation. David
Hi Willem, Here's a reduced testcase. I doesn't require anything special; the key is that the amount of data placed in the packet by the send() - it's related to the MTU size. It needs to stuff in sufficient data to go over the fragmentation limit (I think). In this case, my interface's MTU is 8192. send() is sticking in 8161 bytes of data and then the output from the aforeposted debugging patch is: ==>splice_to_socket() 6630 udp_sendmsg(8,8) __ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188) pagedlen 9 = 9 - 0 copy -1 = 9 - 0 - 1 - 9 length 8 -= -1 + 0 __ip_append_data(copy=8172,len=9, mtu=8192 skblen=20 maxfl=8188) copy=8172 len=9 skb_splice_from_iter(8,9) __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188) copy=8164 len=1 skb_splice_from_iter(0,1) __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188) copy=8164 len=1 skb_splice_from_iter(0,1) __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188) copy=8164 len=1 skb_splice_from_iter(0,1) __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188) copy=8164 len=1 skb_splice_from_iter(0,1) copy=8164 len=1 skb_splice_from_iter(0,1) It looks like send() pushes 1 byte over the fragmentation limit, then the splice sees -1 crop up, the length to be copied is increased by 1, but insufficient data is available and we go into an endless loop. --- #define _GNU_SOURCE #include <arpa/inet.h> #include <fcntl.h> #include <netinet/in.h> #include <stdarg.h> #include <stdbool.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sys/socket.h> #include <sys/mman.h> #include <sys/uio.h> #define OSERROR(R, S) do { if ((long)(R) == -1L) { perror((S)); exit(1); } } while(0) int main() { struct sockaddr_storage ss; struct sockaddr_in sin; void *buffer; unsigned int tmp; int pfd[2], sfd; int res; OSERROR(pipe(pfd), "pipe"); sfd = socket(AF_INET, SOCK_DGRAM, 0); OSERROR(sfd, "socket/2"); memset(&sin, 0, sizeof(sin)); sin.sin_family = AF_INET; sin.sin_port = htons(0); sin.sin_addr.s_addr = htonl(0xc0a80601); #warning you might want to set the address here - this is 192.168.6.1 OSERROR(connect(sfd, (struct sockaddr *)&sin, sizeof(sin)), "connect"); buffer = mmap(NULL, 1024*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0); OSERROR(buffer, "mmap"); OSERROR(send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE), "send"); #warning you need to adjust the length on the above line to match your MTU OSERROR(write(pfd[1], buffer, 8), "write"); OSERROR(splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0), "splice"); return 0; }
David Howells wrote: > Hi Willem, > > Here's a reduced testcase. I doesn't require anything special; the key is > that the amount of data placed in the packet by the send() - it's related to > the MTU size. It needs to stuff in sufficient data to go over the > fragmentation limit (I think). > > In this case, my interface's MTU is 8192. send() is sticking in 8161 bytes of > data and then the output from the aforeposted debugging patch is: > > ==>splice_to_socket() 6630 > udp_sendmsg(8,8) > __ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188) > pagedlen 9 = 9 - 0 > copy -1 = 9 - 0 - 1 - 9 > length 8 -= -1 + 0 > __ip_append_data(copy=8172,len=9, mtu=8192 skblen=20 maxfl=8188) > copy=8172 len=9 > skb_splice_from_iter(8,9) > __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188) > copy=8164 len=1 > skb_splice_from_iter(0,1) > __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188) > copy=8164 len=1 > skb_splice_from_iter(0,1) > __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188) > copy=8164 len=1 > skb_splice_from_iter(0,1) > __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188) > copy=8164 len=1 > skb_splice_from_iter(0,1) > copy=8164 len=1 > skb_splice_from_iter(0,1) > > It looks like send() pushes 1 byte over the fragmentation limit, then the > splice sees -1 crop up, the length to be copied is increased by 1, but > insufficient data is available and we go into an endless loop. > > --- > #define _GNU_SOURCE > #include <arpa/inet.h> > #include <fcntl.h> > #include <netinet/in.h> > #include <stdarg.h> > #include <stdbool.h> > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <sys/socket.h> > #include <sys/mman.h> > #include <sys/uio.h> > > #define OSERROR(R, S) do { if ((long)(R) == -1L) { perror((S)); exit(1); } } while(0) > > int main() > { > struct sockaddr_storage ss; > struct sockaddr_in sin; > void *buffer; > unsigned int tmp; > int pfd[2], sfd; > int res; > > OSERROR(pipe(pfd), "pipe"); > > sfd = socket(AF_INET, SOCK_DGRAM, 0); > OSERROR(sfd, "socket/2"); > > memset(&sin, 0, sizeof(sin)); > sin.sin_family = AF_INET; > sin.sin_port = htons(0); > sin.sin_addr.s_addr = htonl(0xc0a80601); > #warning you might want to set the address here - this is 192.168.6.1 > OSERROR(connect(sfd, (struct sockaddr *)&sin, sizeof(sin)), "connect"); > > buffer = mmap(NULL, 1024*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0); > OSERROR(buffer, "mmap"); > > OSERROR(send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE), "send"); > #warning you need to adjust the length on the above line to match your MTU > > OSERROR(write(pfd[1], buffer, 8), "write"); > > OSERROR(splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0), "splice"); > return 0; > } That's helpful. Is the MSG_CONFIRM needed to trigger this? Appending to a MSG_MORE datagram that previously fit within MTU, but no longer, triggers the copy from skb_prev to skb in if (fraggap). I did not see how that would cause issues, but maybe something in how that second skb is setup makes none of the cases in the while loop successfully append, yet also not fail and exit. It would be helpful to know which path it takes (I assume skb_splice_from_iter) and what that returns (0?). Is this indeed trivially sidestepped if downgrading from splicing to regular copying with fragmentation? @@ -1042,7 +1042,7 @@ static int __ip_append_data(struct sock *sk, if (inet->hdrincl) return -EPERM; if (rt->dst.dev->features & NETIF_F_SG && - getfrag == ip_generic_getfrag) + getfrag == ip_generic_getfrag && transhdrlen)
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > Is the MSG_CONFIRM needed to trigger this? It's not actually necessary. The syz test was doing it. > Is this indeed trivially sidestepped if downgrading from splicing to > regular copying with fragmentation? That works. The logging looks like: ==>splice_to_socket() 5535 udp_sendmsg(8,8) __ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188) copy 8 = 9 - 0 - 1 - 0 length 8 -= 8 + 0 <==splice_to_socket() = 8 It looks like pagedlen being non-zero might be the issue. David
The more I look at __ip_append_data(), the more I think the maths is wrong. In the bit that allocates a new skbuff: if (copy <= 0) { ... datalen = length + fraggap; if (datalen > mtu - fragheaderlen) datalen = maxfraglen - fragheaderlen; fraglen = datalen + fragheaderlen; pagedlen = 0; ... if ((flags & MSG_MORE) && !(rt->dst.dev->features&NETIF_F_SG)) ... else if (!paged && (fraglen + alloc_extra < SKB_MAX_ALLOC || !(rt->dst.dev->features & NETIF_F_SG))) ... else { alloclen = fragheaderlen + transhdrlen; pagedlen = datalen - transhdrlen; } ... In the MSG_SPLICE_READ but not MSG_MORE case, we go through that else clause. The values used here, a few lines further along: copy = datalen - transhdrlen - fraggap - pagedlen; are constant over the intervening span. This means that, provided the splice isn't going to exceed the MTU on the second fragment, the calculation of 'copy' can then be simplified algebraically thus: copy = (length + fraggap) - transhdrlen - fraggap - pagedlen; copy = length - transhdrlen - pagedlen; copy = length - transhdrlen - (datalen - transhdrlen); copy = length - transhdrlen - datalen + transhdrlen; copy = length - datalen; copy = length - (length + fraggap); copy = length - length - fraggap; copy = -fraggap; I think we might need to recalculate copy after the conditional call to getfrag(). Possibly we should skip that entirely for MSG_SPLICE_READ. The root seems to be that we're subtracting pagedlen from datalen - but probably we shouldn't be doing getfrag() if pagedlen > 0. David
David Howells wrote: > The more I look at __ip_append_data(), the more I think the maths is wrong. > In the bit that allocates a new skbuff: > > if (copy <= 0) { > ... > datalen = length + fraggap; > if (datalen > mtu - fragheaderlen) > datalen = maxfraglen - fragheaderlen; > fraglen = datalen + fragheaderlen; > pagedlen = 0; > ... > if ((flags & MSG_MORE) && > !(rt->dst.dev->features&NETIF_F_SG)) > ... > else if (!paged && > (fraglen + alloc_extra < SKB_MAX_ALLOC || > !(rt->dst.dev->features & NETIF_F_SG))) > ... > else { > alloclen = fragheaderlen + transhdrlen; > pagedlen = datalen - transhdrlen; > } > ... > > In the MSG_SPLICE_READ but not MSG_MORE case, we go through that else clause. > The values used here, a few lines further along: > > copy = datalen - transhdrlen - fraggap - pagedlen; > > are constant over the intervening span. This means that, provided the splice > isn't going to exceed the MTU on the second fragment, the calculation of > 'copy' can then be simplified algebraically thus: > > copy = (length + fraggap) - transhdrlen - fraggap - pagedlen; > > copy = length - transhdrlen - pagedlen; > > copy = length - transhdrlen - (datalen - transhdrlen); > > copy = length - transhdrlen - datalen + transhdrlen; > > copy = length - datalen; > > copy = length - (length + fraggap); > > copy = length - length - fraggap; > > copy = -fraggap; > > I think we might need to recalculate copy after the conditional call to > getfrag(). Possibly we should skip that entirely for MSG_SPLICE_READ. The > root seems to be that we're subtracting pagedlen from datalen - but probably > we shouldn't be doing getfrag() if pagedlen > 0. q
David Howells wrote: > The more I look at __ip_append_data(), the more I think the maths is wrong. > In the bit that allocates a new skbuff: > > if (copy <= 0) { > ... > datalen = length + fraggap; > if (datalen > mtu - fragheaderlen) > datalen = maxfraglen - fragheaderlen; > fraglen = datalen + fragheaderlen; > pagedlen = 0; > ... > if ((flags & MSG_MORE) && > !(rt->dst.dev->features&NETIF_F_SG)) > ... > else if (!paged && > (fraglen + alloc_extra < SKB_MAX_ALLOC || > !(rt->dst.dev->features & NETIF_F_SG))) > ... > else { > alloclen = fragheaderlen + transhdrlen; > pagedlen = datalen - transhdrlen; > } > ... > > In the MSG_SPLICE_READ but not MSG_MORE case, we go through that else clause. > The values used here, a few lines further along: > > copy = datalen - transhdrlen - fraggap - pagedlen; > > are constant over the intervening span. This means that, provided the splice > isn't going to exceed the MTU on the second fragment, the calculation of > 'copy' can then be simplified algebraically thus: > > copy = (length + fraggap) - transhdrlen - fraggap - pagedlen; > > copy = length - transhdrlen - pagedlen; > > copy = length - transhdrlen - (datalen - transhdrlen); > > copy = length - transhdrlen - datalen + transhdrlen; > > copy = length - datalen; > > copy = length - (length + fraggap); > > copy = length - length - fraggap; > > copy = -fraggap; > > I think we might need to recalculate copy after the conditional call to > getfrag(). Possibly we should skip that entirely for MSG_SPLICE_READ. The > root seems to be that we're subtracting pagedlen from datalen - but probably > we shouldn't be doing getfrag() if pagedlen > 0. That getfrag is needed. For non-splice cases, to fill the linear part of an skb. As your example shows, it is skipped if all data is covered by pagedlen? In this edge case, splicing appends pagedlen to an skb that holds only a small linear part for fragheaderlen and fraggap. Splicing has no problem appending to a normal linear skb, right. Say send(fd, buf, 100, MSG_MORE); write(pipe[1], buf, 8); splice(pipe[2], 0, fd, 0, 8, 0); This only happens when a new skb has to be allocated during the splice call. What causes the infinite loop: does skb_splice_from_iter return 0 and therefore the loop neither decrements copy, nor breaks out with error? Apologies for the earlier mail. Accidentally hit send too soon..
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > That getfrag is needed. For non-splice cases, to fill the linear part > of an skb. As your example shows, it is skipped if all data is covered > by pagedlen? Yes, because copy goes negative. To quote from the previously quoted log: ==>splice_to_socket() 6630 udp_sendmsg(8,8) __ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188) pagedlen 9 = 9 - 0 copy -1 = 9 - 0 - 1 - 9 copy is -(the number of excess bytes). length 8 -= -1 + 0 which then gets deducted from the length - but why? I wonder if copy should be cleared if we don't call getfrag(). It looks like it's meant to deduct the amount copied by getfrag(), but that doesn't happen if copy < 0. Also, note that MSG_ZEROCOPY might see the same maths issue here. David
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > What causes the infinite loop: does skb_splice_from_iter return 0 and > therefore the loop neither decrements copy, nor breaks out with error? Yeah. skb_splice_from_iter() starts returning 0 because the iterator is empty, but it's still being asked to copy data. Possibly it should break out of the loop (or give an error) in such a case. David
The attached seems to work. I still think copy isn't correctly calculated in some circumstances - as I showed, several terms in the maths cancel out, including the length of the data. I'm also not entirely sure what 'paged' means in this function. Should it actually be set in the MSG_SPLICE_PAGES context? --- udp: Fix __ip_addend_data() diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 6e70839257f7..54675a4f2c9f 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1157,7 +1157,7 @@ static int __ip_append_data(struct sock *sk, pskb_trim_unique(skb_prev, maxfraglen); } - copy = datalen - transhdrlen - fraggap - pagedlen; + copy = max_t(int, datalen - transhdrlen - fraggap - pagedlen, 0); if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) { err = -EFAULT; kfree_skb(skb);
David Howells wrote: > The attached seems to work. I still think copy isn't correctly calculated in > some circumstances - as I showed, several terms in the maths cancel out, > including the length of the data. That arithmetic makes assumptions that are specific to a particular set of conditions (e.g., that pagedlen is non-zero). Since the arithmetic is so complicated and error prone, I would try to structure a fix that is easy to reason about to only change behavior for the MSG_SPLICE_PAGES case. > I'm also not entirely sure what 'paged' means in this function. Should it > actually be set in the MSG_SPLICE_PAGES context? I introduced it with MSG_ZEROCOPY. It sets up pagedlen to capture the length that is not copied. If the existing code would affect MSG_ZEROCOPY too, I expect syzbot to have reported that previously. > --- > udp: Fix __ip_addend_data() > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > index 6e70839257f7..54675a4f2c9f 100644 > --- a/net/ipv4/ip_output.c > +++ b/net/ipv4/ip_output.c > @@ -1157,7 +1157,7 @@ static int __ip_append_data(struct sock *sk, > pskb_trim_unique(skb_prev, maxfraglen); > } > > - copy = datalen - transhdrlen - fraggap - pagedlen; > + copy = max_t(int, datalen - transhdrlen - fraggap - pagedlen, 0); > if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) { > err = -EFAULT; > kfree_skb(skb); >
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > I'm also not entirely sure what 'paged' means in this function. Should it > > actually be set in the MSG_SPLICE_PAGES context? > > I introduced it with MSG_ZEROCOPY. It sets up pagedlen to capture the > length that is not copied. > > If the existing code would affect MSG_ZEROCOPY too, I expect syzbot > to have reported that previously. Ah... I think it *should* affect MSG_ZEROCOPY also... but... If you look at: } else { err = skb_zerocopy_iter_dgram(skb, from, copy); if (err < 0) goto error; } offset += copy; length -= copy; MSG_ZEROCOPY assumes that if it didn't return an error, then skb_zerocopy_iter_dgram() copied all the data requested - whether or not the iterator had sufficient data to copy. If you look in __zerocopy_sg_from_iter(), it will drop straight out, returning 0 if/when iov_iter_count() is/reaches 0, even if length is still > 0, just as skb_splice_from_iter() does. So there's a potential bug in the handling of MSG_ZEROCOPY - but one that you survive because it subtracts 'copy' from 'length', reducing it to zero, exits the loop and returns without looking at 'length' again. The actual length to be transmitted is in the skbuff. > Since the arithmetic is so complicated and error prone, I would try > to structure a fix that is easy to reason about to only change > behavior for the MSG_SPLICE_PAGES case. Does that mean you want to have a go at that - or did you want me to try? David
David Howells wrote: > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > > > I'm also not entirely sure what 'paged' means in this function. Should it > > > actually be set in the MSG_SPLICE_PAGES context? > > > > I introduced it with MSG_ZEROCOPY. It sets up pagedlen to capture the > > length that is not copied. > > > > If the existing code would affect MSG_ZEROCOPY too, I expect syzbot > > to have reported that previously. > > Ah... I think it *should* affect MSG_ZEROCOPY also... but... If you look at: > > } else { > err = skb_zerocopy_iter_dgram(skb, from, copy); > if (err < 0) > goto error; > } > offset += copy; > length -= copy; > > MSG_ZEROCOPY assumes that if it didn't return an error, then > skb_zerocopy_iter_dgram() copied all the data requested - whether or not the > iterator had sufficient data to copy. > > If you look in __zerocopy_sg_from_iter(), it will drop straight out, returning > 0 if/when iov_iter_count() is/reaches 0, even if length is still > 0, just as > skb_splice_from_iter() does. > > So there's a potential bug in the handling of MSG_ZEROCOPY - but one that you > survive because it subtracts 'copy' from 'length', reducing it to zero, exits > the loop and returns without looking at 'length' again. The actual length to > be transmitted is in the skbuff. > > > Since the arithmetic is so complicated and error prone, I would try > > to structure a fix that is easy to reason about to only change > > behavior for the MSG_SPLICE_PAGES case. > > Does that mean you want to have a go at that - or did you want me to try Please give it a try. I can review. It's just safer if it's trivial to review that the patch only affects the behavior of the recently introduced MSG_SPLICE_PAGES code.
diff --git a/fs/splice.c b/fs/splice.c index 004eb1c4ce31..9ee82b818bd6 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -801,6 +801,8 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, size_t spliced = 0; bool need_wakeup = false; + printk("==>splice_to_socket() %u\n", current->pid); + pipe_lock(pipe); while (len > 0) { @@ -911,6 +913,7 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, pipe_unlock(pipe); if (need_wakeup) wakeup_pipe_writers(pipe); + printk("<==splice_to_socket() = %zd\n", spliced ?: ret); return spliced ?: ret; } #endif diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a298992060e6..c3d60da9e3f7 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -6801,6 +6801,13 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter, ssize_t spliced = 0, ret = 0; unsigned int i; + static int __pcount; + + if (__pcount < 6) { + printk("skb_splice_from_iter(%zu,%zd)\n", iter->count, maxsize); + __pcount++; + } + while (iter->count > 0) { ssize_t space, nr, len; size_t off; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 6e70839257f7..8c84a7d13627 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1066,6 +1066,14 @@ static int __ip_append_data(struct sock *sk, copy = mtu - skb->len; if (copy < length) copy = maxfraglen - skb->len; + if (flags & MSG_SPLICE_PAGES) { + static int __pcount; + if (__pcount < 6) { + printk("__ip_append_data(copy=%d,len=%d, mtu=%d skblen=%d maxfl=%d)\n", + copy, length, mtu, skb->len, maxfraglen); + __pcount++; + } + } if (copy <= 0) { char *data; unsigned int datalen; @@ -1112,6 +1120,10 @@ static int __ip_append_data(struct sock *sk, else { alloclen = fragheaderlen + transhdrlen; pagedlen = datalen - transhdrlen; + if (flags & MSG_SPLICE_PAGES) { + printk("pagedlen %d = %d - %d\n", + pagedlen, datalen, transhdrlen); + } } alloclen += alloc_extra; @@ -1158,6 +1170,9 @@ static int __ip_append_data(struct sock *sk, } copy = datalen - transhdrlen - fraggap - pagedlen; + if (flags & MSG_SPLICE_PAGES) + printk("copy %d = %d - %d - %d - %d\n", + copy, datalen, transhdrlen, fraggap, pagedlen); if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) { err = -EFAULT; kfree_skb(skb); @@ -1165,6 +1180,8 @@ static int __ip_append_data(struct sock *sk, } offset += copy; + if (flags & MSG_SPLICE_PAGES) + printk("length %d -= %d + %d\n", length, copy, transhdrlen); length -= copy + transhdrlen; transhdrlen = 0; exthdrlen = 0; @@ -1192,6 +1209,13 @@ static int __ip_append_data(struct sock *sk, continue; } + if (flags & MSG_SPLICE_PAGES) { + static int __qcount; + if (__qcount < 6) { + printk("copy=%d len=%d\n", copy, length); + __qcount++; + } + } if (copy > length) copy = length; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 42a96b3547c9..bd3f4e62574b 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1081,6 +1081,9 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) if (msg->msg_flags & MSG_OOB) /* Mirror BSD error message compatibility */ return -EOPNOTSUPP; + if (msg->msg_flags & MSG_SPLICE_PAGES) + printk("udp_sendmsg(%zx,%zx)\n", msg->msg_iter.count, len); + getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag; fl4 = &inet->cork.fl.u.ip4;