Message ID | 20240613113504.1079860-1-sagi@grimberg.me (mailing list archive) |
---|---|
State | Accepted |
Commit | 934c29999b57b835d65442da6f741d5e27f3b584 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: micro-optimize skb_datagram_iter | expand |
Hello: This patch was applied to netdev/net-next.git (main) by Jakub Kicinski <kuba@kernel.org>: On Thu, 13 Jun 2024 14:35:04 +0300 you wrote: > We only use the mapping in a single context in a short and contained scope, > so kmap_local_page is sufficient and cheaper. This will also allow > skb_datagram_iter to be called from softirq context. > > Signed-off-by: Sagi Grimberg <sagi@grimberg.me> > --- > net/core/datagram.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Here is the summary with links: - net: micro-optimize skb_datagram_iter https://git.kernel.org/netdev/net-next/c/934c29999b57 You are awesome, thank you!
Hello, kernel test robot noticed "kernel_BUG_at_mm/usercopy.c" on: commit: 18f0423b9eccb781310af6709ceebf654175af14 ("[PATCH] net: micro-optimize skb_datagram_iter") url: https://github.com/intel-lab-lkp/linux/commits/Sagi-Grimberg/net-micro-optimize-skb_datagram_iter/20240613-193620 base: https://git.kernel.org/cgit/linux/kernel/git/davem/net.git b60b1bdc1888f51da7a2a22c48c5f1eb2bd12e97 patch link: https://lore.kernel.org/all/20240613113504.1079860-1-sagi@grimberg.me/ patch subject: [PATCH] net: micro-optimize skb_datagram_iter in testcase: boot compiler: gcc-8 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) +------------------------------------------+------------+------------+ | | b60b1bdc18 | 18f0423b9e | +------------------------------------------+------------+------------+ | boot_successes | 6 | 0 | | boot_failures | 0 | 6 | | kernel_BUG_at_mm/usercopy.c | 0 | 6 | | Oops:invalid_opcode:#[##]PREEMPT_SMP | 0 | 6 | | EIP:usercopy_abort | 0 | 6 | | Kernel_panic-not_syncing:Fatal_exception | 0 | 6 | +------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202406161539.b5ff7b20-oliver.sang@intel.com [ 13.495377][ T189] ------------[ cut here ]------------ [ 13.495862][ T189] kernel BUG at mm/usercopy.c:102! [ 13.496372][ T189] Oops: invalid opcode: 0000 [#1] PREEMPT SMP [ 13.496927][ T189] CPU: 0 PID: 189 Comm: systemctl Not tainted 6.10.0-rc2-00258-g18f0423b9ecc #1 [ 13.497741][ T189] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 13.498663][ T189] EIP: usercopy_abort (mm/usercopy.c:102 (discriminator 12)) [ 13.499424][ T194] usercopy: Kernel memory exposure attempt detected from kmap (offset 0, size 8192)! [ 13.499647][ T189] Code: d6 89 44 24 0c 0f 45 cf 8b 7d 0c 89 74 24 10 89 4c 24 04 c7 04 24 a4 55 8a d6 89 7c 24 20 8b 7d 08 89 7c 24 1c e8 20 3c df ff <0f> 0b b8 80 91 d7 d6 e8 a8 de 68 00 ba 3d 17 86 d6 89 55 f0 89 d6 All code ======== 0: d6 (bad) 1: 89 44 24 0c mov %eax,0xc(%rsp) 5: 0f 45 cf cmovne %edi,%ecx 8: 8b 7d 0c mov 0xc(%rbp),%edi b: 89 74 24 10 mov %esi,0x10(%rsp) f: 89 4c 24 04 mov %ecx,0x4(%rsp) 13: c7 04 24 a4 55 8a d6 movl $0xd68a55a4,(%rsp) 1a: 89 7c 24 20 mov %edi,0x20(%rsp) 1e: 8b 7d 08 mov 0x8(%rbp),%edi 21: 89 7c 24 1c mov %edi,0x1c(%rsp) 25: e8 20 3c df ff call 0xffffffffffdf3c4a 2a:* 0f 0b ud2 <-- trapping instruction 2c: b8 80 91 d7 d6 mov $0xd6d79180,%eax 31: e8 a8 de 68 00 call 0x68dede 36: ba 3d 17 86 d6 mov $0xd686173d,%edx 3b: 89 55 f0 mov %edx,-0x10(%rbp) 3e: 89 d6 mov %edx,%esi Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: b8 80 91 d7 d6 mov $0xd6d79180,%eax 7: e8 a8 de 68 00 call 0x68deb4 c: ba 3d 17 86 d6 mov $0xd686173d,%edx 11: 89 55 f0 mov %edx,-0x10(%rbp) 14: 89 d6 mov %edx,%esi [ 13.500502][ T194] ------------[ cut here ]------------ [ 13.502187][ T189] EAX: 00000052 EBX: d680da68 ECX: e0435480 EDX: 01000232 [ 13.502666][ T194] kernel BUG at mm/usercopy.c:102! [ 13.503236][ T189] ESI: d686173d EDI: 00000000 EBP: ece37c44 ESP: ece37c10 [ 13.504266][ T189] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010286 [ 13.504856][ T189] CR0: 80050033 CR2: 0135eb6c CR3: 2beff000 CR4: 000406d0 [ 13.505464][ T189] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 13.506083][ T189] DR6: fffe0ff0 DR7: 00000400 [ 13.506495][ T189] Call Trace: [ 13.506795][ T189] ? show_regs (arch/x86/kernel/dumpstack.c:479) [ 13.507187][ T189] ? __die_body (arch/x86/kernel/dumpstack.c:421) [ 13.507576][ T189] ? die (arch/x86/kernel/dumpstack.c:449) [ 13.507894][ T189] ? do_trap (arch/x86/kernel/traps.c:114 arch/x86/kernel/traps.c:155) [ 13.508270][ T189] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359 kernel/locking/lockdep.c:4311) [ 13.508783][ T189] ? do_error_trap (arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:176) [ 13.509182][ T189] ? usercopy_abort (mm/usercopy.c:102 (discriminator 12)) [ 13.509588][ T189] ? exc_overflow (arch/x86/kernel/traps.c:252) [ 13.509983][ T189] ? exc_invalid_op (arch/x86/kernel/traps.c:267) [ 13.510396][ T189] ? usercopy_abort (mm/usercopy.c:102 (discriminator 12)) [ 13.510797][ T189] ? handle_exception (arch/x86/entry/entry_32.S:1054) [ 13.511242][ T189] ? exc_overflow (arch/x86/kernel/traps.c:252) [ 13.511646][ T189] ? usercopy_abort (mm/usercopy.c:102 (discriminator 12)) [ 13.512070][ T189] ? exc_overflow (arch/x86/kernel/traps.c:252) [ 13.512434][ T189] ? usercopy_abort (mm/usercopy.c:102 (discriminator 12)) [ 13.512832][ T189] __check_object_size (mm/usercopy.c:180 mm/usercopy.c:251 mm/usercopy.c:213) [ 13.513275][ T189] simple_copy_to_iter (include/linux/uio.h:196 net/core/datagram.c:513) [ 13.513693][ T189] __skb_datagram_iter (net/core/datagram.c:424 (discriminator 1)) [ 13.514138][ T189] skb_copy_datagram_iter (net/core/datagram.c:529) [ 13.514606][ T189] ? skb_free_datagram (net/core/datagram.c:512) [ 13.515028][ T189] ? scm_stat_del (net/unix/af_unix.c:2883) [ 13.515429][ T189] unix_stream_read_actor (net/unix/af_unix.c:2889) [ 13.515884][ T189] unix_stream_read_generic (net/unix/af_unix.c:2805) [ 13.516377][ T189] ? cma_for_each_area (mm/page_ext.c:518) [ 13.516826][ T189] ? unix_stream_splice_read (net/unix/af_unix.c:2907) [ 13.517301][ T189] unix_stream_recvmsg (net/unix/af_unix.c:2923) [ 13.517720][ T189] ? scm_stat_del (net/unix/af_unix.c:2883) [ 13.518108][ T189] ____sys_recvmsg (net/socket.c:1046 net/socket.c:1068 net/socket.c:2804) [ 13.518527][ T189] ? import_iovec (lib/iov_iter.c:1348) [ 13.518930][ T189] ? copy_msghdr_from_user (net/socket.c:2525) [ 13.519396][ T189] ___sys_recvmsg (net/socket.c:2846) [ 13.519811][ T189] ? __fdget (fs/file.c:1160) [ 13.520186][ T189] ? sockfd_lookup_light (net/socket.c:558) [ 13.520643][ T189] __sys_recvmsg (include/linux/file.h:34 net/socket.c:2878) [ 13.521046][ T189] __ia32_sys_socketcall (net/socket.c:3173 net/socket.c:3077 net/socket.c:3077) [ 13.521513][ T189] ia32_sys_call (arch/x86/entry/syscall_32.c:42) [ 13.521923][ T189] __do_fast_syscall_32 (arch/x86/entry/common.c:165 arch/x86/entry/common.c:386) [ 13.522362][ T189] do_fast_syscall_32 (arch/x86/entry/common.c:411) [ 13.522787][ T189] do_SYSENTER_32 (arch/x86/entry/common.c:450) [ 13.523188][ T189] entry_SYSENTER_32 (arch/x86/entry/entry_32.S:836) [ 13.523613][ T189] EIP: 0xb7ee6579 [ 13.523933][ T189] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 All code ======== 0: b8 01 10 06 03 mov $0x3061001,%eax 5: 74 b4 je 0xffffffffffffffbb 7: 01 10 add %edx,(%rax) 9: 07 (bad) a: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi e: 10 08 adc %cl,(%rax) 10: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi ... 20: 00 51 52 add %dl,0x52(%rcx) 23: 55 push %rbp 24:* 89 e5 mov %esp,%ebp <-- trapping instruction 26: 0f 34 sysenter 28: cd 80 int $0x80 2a: 5d pop %rbp 2b: 5a pop %rdx 2c: 59 pop %rcx 2d: c3 ret 2e: 90 nop 2f: 90 nop 30: 90 nop 31: 90 nop 32: 8d 76 00 lea 0x0(%rsi),%esi 35: 58 pop %rax 36: b8 77 00 00 00 mov $0x77,%eax 3b: cd 80 int $0x80 3d: 90 nop 3e: 8d .byte 0x8d 3f: 76 .byte 0x76 Code starting with the faulting instruction =========================================== 0: 5d pop %rbp 1: 5a pop %rdx 2: 59 pop %rcx 3: c3 ret 4: 90 nop 5: 90 nop 6: 90 nop 7: 90 nop 8: 8d 76 00 lea 0x0(%rsi),%esi b: 58 pop %rax c: b8 77 00 00 00 mov $0x77,%eax 11: cd 80 int $0x80 13: 90 nop 14: 8d .byte 0x8d 15: 76 .byte 0x76 [ 13.525624][ T189] EAX: ffffffda EBX: 00000011 ECX: bfdf5450 EDX: 00000000 [ 13.526233][ T189] ESI: b7c46000 EDI: bfdf54ac EBP: bfdf54a8 ESP: bfdf5440 [ 13.526853][ T189] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000286 [ 13.527519][ T189] Modules linked in: i2c_piix4(+) agpgart(+) qemu_fw_cfg button fuse drm drm_panel_orientation_quirks ip_tables [ 13.528566][ T194] Oops: invalid opcode: 0000 [#2] PREEMPT SMP [ 13.528804][ T189] ---[ end trace 0000000000000000 ]--- [ 13.529217][ T194] CPU: 1 PID: 194 Comm: systemctl Tainted: G D 6.10.0-rc2-00258-g18f0423b9ecc #1 [ 13.529725][ T189] EIP: usercopy_abort (mm/usercopy.c:102 (discriminator 12)) [ 13.530536][ T194] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 13.531060][ T189] Code: d6 89 44 24 0c 0f 45 cf 8b 7d 0c 89 74 24 10 89 4c 24 04 c7 04 24 a4 55 8a d6 89 7c 24 20 8b 7d 08 89 7c 24 1c e8 20 3c df ff <0f> 0b b8 80 91 d7 d6 e8 a8 de 68 00 ba 3d 17 86 d6 89 55 f0 89 d6 All code ======== 0: d6 (bad) 1: 89 44 24 0c mov %eax,0xc(%rsp) 5: 0f 45 cf cmovne %edi,%ecx 8: 8b 7d 0c mov 0xc(%rbp),%edi b: 89 74 24 10 mov %esi,0x10(%rsp) f: 89 4c 24 04 mov %ecx,0x4(%rsp) 13: c7 04 24 a4 55 8a d6 movl $0xd68a55a4,(%rsp) 1a: 89 7c 24 20 mov %edi,0x20(%rsp) 1e: 8b 7d 08 mov 0x8(%rbp),%edi 21: 89 7c 24 1c mov %edi,0x1c(%rsp) 25: e8 20 3c df ff call 0xffffffffffdf3c4a 2a:* 0f 0b ud2 <-- trapping instruction 2c: b8 80 91 d7 d6 mov $0xd6d79180,%eax 31: e8 a8 de 68 00 call 0x68dede 36: ba 3d 17 86 d6 mov $0xd686173d,%edx 3b: 89 55 f0 mov %edx,-0x10(%rbp) 3e: 89 d6 mov %edx,%esi Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: b8 80 91 d7 d6 mov $0xd6d79180,%eax 7: e8 a8 de 68 00 call 0x68deb4 c: ba 3d 17 86 d6 mov $0xd686173d,%edx 11: 89 55 f0 mov %edx,-0x10(%rbp) 14: 89 d6 mov %edx,%esi The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20240616/202406161539.b5ff7b20-oliver.sang@intel.com
On 16/06/2024 11:06, kernel test robot wrote: > > Hello, > > kernel test robot noticed "kernel_BUG_at_mm/usercopy.c" on: > > commit: 18f0423b9eccb781310af6709ceebf654175af14 ("[PATCH] net: micro-optimize skb_datagram_iter") > url: https://github.com/intel-lab-lkp/linux/commits/Sagi-Grimberg/net-micro-optimize-skb_datagram_iter/20240613-193620 > base: https://git.kernel.org/cgit/linux/kernel/git/davem/net.git b60b1bdc1888f51da7a2a22c48c5f1eb2bd12e97 > patch link: https://lore.kernel.org/all/20240613113504.1079860-1-sagi@grimberg.me/ > patch subject: [PATCH] net: micro-optimize skb_datagram_iter > > in testcase: boot > > compiler: gcc-8 > test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G > > (please refer to attached dmesg/kmsg for entire log/backtrace) > > > +------------------------------------------+------------+------------+ > | | b60b1bdc18 | 18f0423b9e | > +------------------------------------------+------------+------------+ > | boot_successes | 6 | 0 | > | boot_failures | 0 | 6 | > | kernel_BUG_at_mm/usercopy.c | 0 | 6 | > | Oops:invalid_opcode:#[##]PREEMPT_SMP | 0 | 6 | > | EIP:usercopy_abort | 0 | 6 | > | Kernel_panic-not_syncing:Fatal_exception | 0 | 6 | > +------------------------------------------+------------+------------+ > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <oliver.sang@intel.com> > | Closes: https://lore.kernel.org/oe-lkp/202406161539.b5ff7b20-oliver.sang@intel.com > > > [ 13.495377][ T189] ------------[ cut here ]------------ > [ 13.495862][ T189] kernel BUG at mm/usercopy.c:102! > [ 13.496372][ T189] Oops: invalid opcode: 0000 [#1] PREEMPT SMP > [ 13.496927][ T189] CPU: 0 PID: 189 Comm: systemctl Not tainted 6.10.0-rc2-00258-g18f0423b9ecc #1 > [ 13.497741][ T189] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > [ 13.498663][ T189] EIP: usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > [ 13.499424][ T194] usercopy: Kernel memory exposure attempt detected from kmap (offset 0, size 8192)! Hmm, not sure I understand exactly why changing kmap() to kmap_local_page() expose this, but it looks like mm/usercopy does not like size=8192 when copying for the skb frag. quick git browse directs to: -- commit 4e140f59d285c1ca1e5c81b4c13e27366865bd09 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Mon Jan 10 23:15:27 2022 +0000 mm/usercopy: Check kmap addresses properly If you are copying to an address in the kmap region, you may not copy across a page boundary, no matter what the size of the underlying allocation. You can't kmap() a slab page because slab pages always come from low memory. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220110231530.665970-2-willy@infradead.org -- CCing willy. The documentation suggest that under single-context usage kmap() can be freely converted to kmap_local_page()? But seems that when using kmap() the size is not an issue, still trying to understand why. > [ 13.499647][ T189] Code: d6 89 44 24 0c 0f 45 cf 8b 7d 0c 89 74 24 10 89 4c 24 04 c7 04 24 a4 55 8a d6 89 7c 24 20 8b 7d 08 89 7c 24 1c e8 20 3c df ff <0f> 0b b8 80 91 d7 d6 e8 a8 de 68 00 ba 3d 17 86 d6 89 55 f0 89 d6 > All code > ======== > 0: d6 (bad) > 1: 89 44 24 0c mov %eax,0xc(%rsp) > 5: 0f 45 cf cmovne %edi,%ecx > 8: 8b 7d 0c mov 0xc(%rbp),%edi > b: 89 74 24 10 mov %esi,0x10(%rsp) > f: 89 4c 24 04 mov %ecx,0x4(%rsp) > 13: c7 04 24 a4 55 8a d6 movl $0xd68a55a4,(%rsp) > 1a: 89 7c 24 20 mov %edi,0x20(%rsp) > 1e: 8b 7d 08 mov 0x8(%rbp),%edi > 21: 89 7c 24 1c mov %edi,0x1c(%rsp) > 25: e8 20 3c df ff call 0xffffffffffdf3c4a > 2a:* 0f 0b ud2 <-- trapping instruction > 2c: b8 80 91 d7 d6 mov $0xd6d79180,%eax > 31: e8 a8 de 68 00 call 0x68dede > 36: ba 3d 17 86 d6 mov $0xd686173d,%edx > 3b: 89 55 f0 mov %edx,-0x10(%rbp) > 3e: 89 d6 mov %edx,%esi > > Code starting with the faulting instruction > =========================================== > 0: 0f 0b ud2 > 2: b8 80 91 d7 d6 mov $0xd6d79180,%eax > 7: e8 a8 de 68 00 call 0x68deb4 > c: ba 3d 17 86 d6 mov $0xd686173d,%edx > 11: 89 55 f0 mov %edx,-0x10(%rbp) > 14: 89 d6 mov %edx,%esi > [ 13.500502][ T194] ------------[ cut here ]------------ > [ 13.502187][ T189] EAX: 00000052 EBX: d680da68 ECX: e0435480 EDX: 01000232 > [ 13.502666][ T194] kernel BUG at mm/usercopy.c:102! > [ 13.503236][ T189] ESI: d686173d EDI: 00000000 EBP: ece37c44 ESP: ece37c10 > [ 13.504266][ T189] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010286 > [ 13.504856][ T189] CR0: 80050033 CR2: 0135eb6c CR3: 2beff000 CR4: 000406d0 > [ 13.505464][ T189] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 > [ 13.506083][ T189] DR6: fffe0ff0 DR7: 00000400 > [ 13.506495][ T189] Call Trace: > [ 13.506795][ T189] ? show_regs (arch/x86/kernel/dumpstack.c:479) > [ 13.507187][ T189] ? __die_body (arch/x86/kernel/dumpstack.c:421) > [ 13.507576][ T189] ? die (arch/x86/kernel/dumpstack.c:449) > [ 13.507894][ T189] ? do_trap (arch/x86/kernel/traps.c:114 arch/x86/kernel/traps.c:155) > [ 13.508270][ T189] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359 kernel/locking/lockdep.c:4311) > [ 13.508783][ T189] ? do_error_trap (arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:176) > [ 13.509182][ T189] ? usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > [ 13.509588][ T189] ? exc_overflow (arch/x86/kernel/traps.c:252) > [ 13.509983][ T189] ? exc_invalid_op (arch/x86/kernel/traps.c:267) > [ 13.510396][ T189] ? usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > [ 13.510797][ T189] ? handle_exception (arch/x86/entry/entry_32.S:1054) > [ 13.511242][ T189] ? exc_overflow (arch/x86/kernel/traps.c:252) > [ 13.511646][ T189] ? usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > [ 13.512070][ T189] ? exc_overflow (arch/x86/kernel/traps.c:252) > [ 13.512434][ T189] ? usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > [ 13.512832][ T189] __check_object_size (mm/usercopy.c:180 mm/usercopy.c:251 mm/usercopy.c:213) > [ 13.513275][ T189] simple_copy_to_iter (include/linux/uio.h:196 net/core/datagram.c:513) > [ 13.513693][ T189] __skb_datagram_iter (net/core/datagram.c:424 (discriminator 1)) > [ 13.514138][ T189] skb_copy_datagram_iter (net/core/datagram.c:529) > [ 13.514606][ T189] ? skb_free_datagram (net/core/datagram.c:512) > [ 13.515028][ T189] ? scm_stat_del (net/unix/af_unix.c:2883) > [ 13.515429][ T189] unix_stream_read_actor (net/unix/af_unix.c:2889) > [ 13.515884][ T189] unix_stream_read_generic (net/unix/af_unix.c:2805) > [ 13.516377][ T189] ? cma_for_each_area (mm/page_ext.c:518) > [ 13.516826][ T189] ? unix_stream_splice_read (net/unix/af_unix.c:2907) > [ 13.517301][ T189] unix_stream_recvmsg (net/unix/af_unix.c:2923) > [ 13.517720][ T189] ? scm_stat_del (net/unix/af_unix.c:2883) > [ 13.518108][ T189] ____sys_recvmsg (net/socket.c:1046 net/socket.c:1068 net/socket.c:2804) > [ 13.518527][ T189] ? import_iovec (lib/iov_iter.c:1348) > [ 13.518930][ T189] ? copy_msghdr_from_user (net/socket.c:2525) > [ 13.519396][ T189] ___sys_recvmsg (net/socket.c:2846) > [ 13.519811][ T189] ? __fdget (fs/file.c:1160) > [ 13.520186][ T189] ? sockfd_lookup_light (net/socket.c:558) > [ 13.520643][ T189] __sys_recvmsg (include/linux/file.h:34 net/socket.c:2878) > [ 13.521046][ T189] __ia32_sys_socketcall (net/socket.c:3173 net/socket.c:3077 net/socket.c:3077) > [ 13.521513][ T189] ia32_sys_call (arch/x86/entry/syscall_32.c:42) > [ 13.521923][ T189] __do_fast_syscall_32 (arch/x86/entry/common.c:165 arch/x86/entry/common.c:386) > [ 13.522362][ T189] do_fast_syscall_32 (arch/x86/entry/common.c:411) > [ 13.522787][ T189] do_SYSENTER_32 (arch/x86/entry/common.c:450) > [ 13.523188][ T189] entry_SYSENTER_32 (arch/x86/entry/entry_32.S:836) > [ 13.523613][ T189] EIP: 0xb7ee6579 > [ 13.523933][ T189] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 > All code > ======== > 0: b8 01 10 06 03 mov $0x3061001,%eax > 5: 74 b4 je 0xffffffffffffffbb > 7: 01 10 add %edx,(%rax) > 9: 07 (bad) > a: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi > e: 10 08 adc %cl,(%rax) > 10: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi > ... > 20: 00 51 52 add %dl,0x52(%rcx) > 23: 55 push %rbp > 24:* 89 e5 mov %esp,%ebp <-- trapping instruction > 26: 0f 34 sysenter > 28: cd 80 int $0x80 > 2a: 5d pop %rbp > 2b: 5a pop %rdx > 2c: 59 pop %rcx > 2d: c3 ret > 2e: 90 nop > 2f: 90 nop > 30: 90 nop > 31: 90 nop > 32: 8d 76 00 lea 0x0(%rsi),%esi > 35: 58 pop %rax > 36: b8 77 00 00 00 mov $0x77,%eax > 3b: cd 80 int $0x80 > 3d: 90 nop > 3e: 8d .byte 0x8d > 3f: 76 .byte 0x76 > > Code starting with the faulting instruction > =========================================== > 0: 5d pop %rbp > 1: 5a pop %rdx > 2: 59 pop %rcx > 3: c3 ret > 4: 90 nop > 5: 90 nop > 6: 90 nop > 7: 90 nop > 8: 8d 76 00 lea 0x0(%rsi),%esi > b: 58 pop %rax > c: b8 77 00 00 00 mov $0x77,%eax > 11: cd 80 int $0x80 > 13: 90 nop > 14: 8d .byte 0x8d > 15: 76 .byte 0x76 > [ 13.525624][ T189] EAX: ffffffda EBX: 00000011 ECX: bfdf5450 EDX: 00000000 > [ 13.526233][ T189] ESI: b7c46000 EDI: bfdf54ac EBP: bfdf54a8 ESP: bfdf5440 > [ 13.526853][ T189] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000286 > [ 13.527519][ T189] Modules linked in: i2c_piix4(+) agpgart(+) qemu_fw_cfg button fuse drm drm_panel_orientation_quirks ip_tables > [ 13.528566][ T194] Oops: invalid opcode: 0000 [#2] PREEMPT SMP > [ 13.528804][ T189] ---[ end trace 0000000000000000 ]--- > [ 13.529217][ T194] CPU: 1 PID: 194 Comm: systemctl Tainted: G D 6.10.0-rc2-00258-g18f0423b9ecc #1 > [ 13.529725][ T189] EIP: usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > [ 13.530536][ T194] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > [ 13.531060][ T189] Code: d6 89 44 24 0c 0f 45 cf 8b 7d 0c 89 74 24 10 89 4c 24 04 c7 04 24 a4 55 8a d6 89 7c 24 20 8b 7d 08 89 7c 24 1c e8 20 3c df ff <0f> 0b b8 80 91 d7 d6 e8 a8 de 68 00 ba 3d 17 86 d6 89 55 f0 89 d6 > All code > ======== > 0: d6 (bad) > 1: 89 44 24 0c mov %eax,0xc(%rsp) > 5: 0f 45 cf cmovne %edi,%ecx > 8: 8b 7d 0c mov 0xc(%rbp),%edi > b: 89 74 24 10 mov %esi,0x10(%rsp) > f: 89 4c 24 04 mov %ecx,0x4(%rsp) > 13: c7 04 24 a4 55 8a d6 movl $0xd68a55a4,(%rsp) > 1a: 89 7c 24 20 mov %edi,0x20(%rsp) > 1e: 8b 7d 08 mov 0x8(%rbp),%edi > 21: 89 7c 24 1c mov %edi,0x1c(%rsp) > 25: e8 20 3c df ff call 0xffffffffffdf3c4a > 2a:* 0f 0b ud2 <-- trapping instruction > 2c: b8 80 91 d7 d6 mov $0xd6d79180,%eax > 31: e8 a8 de 68 00 call 0x68dede > 36: ba 3d 17 86 d6 mov $0xd686173d,%edx > 3b: 89 55 f0 mov %edx,-0x10(%rbp) > 3e: 89 d6 mov %edx,%esi > > Code starting with the faulting instruction > =========================================== > 0: 0f 0b ud2 > 2: b8 80 91 d7 d6 mov $0xd6d79180,%eax > 7: e8 a8 de 68 00 call 0x68deb4 > c: ba 3d 17 86 d6 mov $0xd686173d,%edx > 11: 89 55 f0 mov %edx,-0x10(%rbp) > 14: 89 d6 mov %edx,%esi > > > The kernel config and materials to reproduce are available at: > https://download.01.org/0day-ci/archive/20240616/202406161539.b5ff7b20-oliver.sang@intel.com > > >
From: Sagi Grimberg > Sent: 16 June 2024 10:24 ... > > [ 13.498663][ T189] EIP: usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > > [ 13.499424][ T194] usercopy: Kernel memory exposure attempt detected from kmap (offset 0, size > 8192)! > > Hmm, not sure I understand exactly why changing kmap() to > kmap_local_page() expose this, > but it looks like mm/usercopy does not like size=8192 when copying for > the skb frag. Can't a usercopy fault and have to read the page from swap? So the process can sleep and then be rescheduled on a different cpu? So you can't use kmap_local_page() here at all. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sun, Jun 16, 2024 at 09:51:05PM +0000, David Laight wrote: > From: Sagi Grimberg > > Sent: 16 June 2024 10:24 > ... > > > [ 13.498663][ T189] EIP: usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > > > [ 13.499424][ T194] usercopy: Kernel memory exposure attempt detected from kmap (offset 0, size > > 8192)! > > > > Hmm, not sure I understand exactly why changing kmap() to > > kmap_local_page() expose this, > > but it looks like mm/usercopy does not like size=8192 when copying for > > the skb frag. > > Can't a usercopy fault and have to read the page from swap? > So the process can sleep and then be rescheduled on a different cpu? > So you can't use kmap_local_page() here at all. I don't think you understand how kmap_local_page() works.
On Sun, Jun 16, 2024 at 12:24:28PM +0300, Sagi Grimberg wrote: > > [ 13.495377][ T189] ------------[ cut here ]------------ > > [ 13.495862][ T189] kernel BUG at mm/usercopy.c:102! > > [ 13.496372][ T189] Oops: invalid opcode: 0000 [#1] PREEMPT SMP > > [ 13.496927][ T189] CPU: 0 PID: 189 Comm: systemctl Not tainted 6.10.0-rc2-00258-g18f0423b9ecc #1 > > [ 13.497741][ T189] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > > [ 13.498663][ T189] EIP: usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > > [ 13.499424][ T194] usercopy: Kernel memory exposure attempt detected from kmap (offset 0, size 8192)! > > Hmm, not sure I understand exactly why changing kmap() to kmap_local_page() > expose this, > but it looks like mm/usercopy does not like size=8192 when copying for the > skb frag. > > quick git browse directs to: > -- > commit 4e140f59d285c1ca1e5c81b4c13e27366865bd09 > Author: Matthew Wilcox (Oracle) <willy@infradead.org> > Date: Mon Jan 10 23:15:27 2022 +0000 > > mm/usercopy: Check kmap addresses properly > > If you are copying to an address in the kmap region, you may not copy > across a page boundary, no matter what the size of the underlying > allocation. You can't kmap() a slab page because slab pages always > come from low memory. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > Acked-by: Kees Cook <keescook@chromium.org> > Signed-off-by: Kees Cook <keescook@chromium.org> > Link: > https://lore.kernel.org/r/20220110231530.665970-2-willy@infradead.org > -- > > CCing willy. > > The documentation suggest that under single-context usage kmap() can be > freely converted > to kmap_local_page()? But seems that when using kmap() the size is not an > issue, still trying to > understand why. Probably because kmap() returns page_address() for non-highmem pages while kmap_local_page() actually returns a kmap address: if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page)) return page_address(page); return __kmap_local_pfn_prot(page_to_pfn(page), prot); so if skb frags are always lowmem (are they?) this is a false positive. if they can be highmem, then you've uncovered a bug that nobody's noticed because nobody's testing on 32-bit any more.
On 17/06/2024 0:56, Matthew Wilcox wrote: > On Sun, Jun 16, 2024 at 12:24:28PM +0300, Sagi Grimberg wrote: >>> [ 13.495377][ T189] ------------[ cut here ]------------ >>> [ 13.495862][ T189] kernel BUG at mm/usercopy.c:102! >>> [ 13.496372][ T189] Oops: invalid opcode: 0000 [#1] PREEMPT SMP >>> [ 13.496927][ T189] CPU: 0 PID: 189 Comm: systemctl Not tainted 6.10.0-rc2-00258-g18f0423b9ecc #1 >>> [ 13.497741][ T189] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 >>> [ 13.498663][ T189] EIP: usercopy_abort (mm/usercopy.c:102 (discriminator 12)) >>> [ 13.499424][ T194] usercopy: Kernel memory exposure attempt detected from kmap (offset 0, size 8192)! >> Hmm, not sure I understand exactly why changing kmap() to kmap_local_page() >> expose this, >> but it looks like mm/usercopy does not like size=8192 when copying for the >> skb frag. >> >> quick git browse directs to: >> -- >> commit 4e140f59d285c1ca1e5c81b4c13e27366865bd09 >> Author: Matthew Wilcox (Oracle) <willy@infradead.org> >> Date: Mon Jan 10 23:15:27 2022 +0000 >> >> mm/usercopy: Check kmap addresses properly >> >> If you are copying to an address in the kmap region, you may not copy >> across a page boundary, no matter what the size of the underlying >> allocation. You can't kmap() a slab page because slab pages always >> come from low memory. >> >> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> >> Acked-by: Kees Cook <keescook@chromium.org> >> Signed-off-by: Kees Cook <keescook@chromium.org> >> Link: >> https://lore.kernel.org/r/20220110231530.665970-2-willy@infradead.org >> -- >> >> CCing willy. >> >> The documentation suggest that under single-context usage kmap() can be >> freely converted >> to kmap_local_page()? But seems that when using kmap() the size is not an >> issue, still trying to >> understand why. > Probably because kmap() returns page_address() for non-highmem pages > while kmap_local_page() actually returns a kmap address: > > if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page)) > return page_address(page); > return __kmap_local_pfn_prot(page_to_pfn(page), prot); > > so if skb frags are always lowmem (are they?) this is a false positive. AFAIR these buffers are coming from the RX ring, so they should be coming from a page_frag_cache, so I want to say always low memory? > if they can be highmem, then you've uncovered a bug that nobody's > noticed because nobody's testing on 32-bit any more. Not sure, Jakub? Eric?
From: Matthew Wilcox > Sent: 16 June 2024 22:53 > > On Sun, Jun 16, 2024 at 09:51:05PM +0000, David Laight wrote: > > From: Sagi Grimberg > > > Sent: 16 June 2024 10:24 > > ... > > > > [ 13.498663][ T189] EIP: usercopy_abort (mm/usercopy.c:102 (discriminator 12)) > > > > [ 13.499424][ T194] usercopy: Kernel memory exposure attempt detected from kmap (offset 0, > size > > > 8192)! > > > > > > Hmm, not sure I understand exactly why changing kmap() to > > > kmap_local_page() expose this, > > > but it looks like mm/usercopy does not like size=8192 when copying for > > > the skb frag. > > > > Can't a usercopy fault and have to read the page from swap? > > So the process can sleep and then be rescheduled on a different cpu? > > So you can't use kmap_local_page() here at all. > > I don't think you understand how kmap_local_page() works. Quite likely :-) But I thought it was a cheap way of temporarily mapping a physical memory page into the current cpu's page tables without having to do any IPI to tell other cpu about the insert or removal? Which would require that the process not be migrated, which pretty much implies that pre-emption be disabled. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Mon, 17 Jun 2024 09:29:53 +0300 Sagi Grimberg wrote: > > Probably because kmap() returns page_address() for non-highmem pages > > while kmap_local_page() actually returns a kmap address: > > > > if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page)) > > return page_address(page); > > return __kmap_local_pfn_prot(page_to_pfn(page), prot); > > > > so if skb frags are always lowmem (are they?) this is a false positive. > > AFAIR these buffers are coming from the RX ring, so they should be > coming from a page_frag_cache, > so I want to say always low memory? > > > if they can be highmem, then you've uncovered a bug that nobody's > > noticed because nobody's testing on 32-bit any more. > > Not sure, Jakub? Eric? My uneducated guess would be that until recent(ish) sendpage rework from David Howells all high mem pages would have been single pages.
On 17/06/2024 19:58, Jakub Kicinski wrote: > On Mon, 17 Jun 2024 09:29:53 +0300 Sagi Grimberg wrote: >>> Probably because kmap() returns page_address() for non-highmem pages >>> while kmap_local_page() actually returns a kmap address: >>> >>> if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page)) >>> return page_address(page); >>> return __kmap_local_pfn_prot(page_to_pfn(page), prot); >>> >>> so if skb frags are always lowmem (are they?) this is a false positive. >> AFAIR these buffers are coming from the RX ring, so they should be >> coming from a page_frag_cache, >> so I want to say always low memory? >> >>> if they can be highmem, then you've uncovered a bug that nobody's >>> noticed because nobody's testing on 32-bit any more. >> Not sure, Jakub? Eric? > My uneducated guess would be that until recent(ish) sendpage rework > from David Howells all high mem pages would have been single pages. dHowells?
Jakub Kicinski <kuba@kernel.org> wrote: > On Mon, 17 Jun 2024 09:29:53 +0300 Sagi Grimberg wrote: > > > Probably because kmap() returns page_address() for non-highmem pages > > > while kmap_local_page() actually returns a kmap address: > > > > > > if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page)) > > > return page_address(page); > > > return __kmap_local_pfn_prot(page_to_pfn(page), prot); > > > > > > so if skb frags are always lowmem (are they?) this is a false positive. > > > > AFAIR these buffers are coming from the RX ring, so they should be > > coming from a page_frag_cache, > > so I want to say always low memory? > > > > > if they can be highmem, then you've uncovered a bug that nobody's > > > noticed because nobody's testing on 32-bit any more. > > > > Not sure, Jakub? Eric? > > My uneducated guess would be that until recent(ish) sendpage rework > from David Howells all high mem pages would have been single pages. Um. I touched the Tx side, not the Rx side. I also don't know whether all high mem pages would be single pages. I'll have to defer that one to the MM folks. David
On 19/06/2024 15:46, David Howells wrote: > Jakub Kicinski <kuba@kernel.org> wrote: > >> On Mon, 17 Jun 2024 09:29:53 +0300 Sagi Grimberg wrote: >>>> Probably because kmap() returns page_address() for non-highmem pages >>>> while kmap_local_page() actually returns a kmap address: >>>> >>>> if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page)) >>>> return page_address(page); >>>> return __kmap_local_pfn_prot(page_to_pfn(page), prot); >>>> >>>> so if skb frags are always lowmem (are they?) this is a false positive. >>> AFAIR these buffers are coming from the RX ring, so they should be >>> coming from a page_frag_cache, >>> so I want to say always low memory? >>> >>>> if they can be highmem, then you've uncovered a bug that nobody's >>>> noticed because nobody's testing on 32-bit any more. >>> Not sure, Jakub? Eric? >> My uneducated guess would be that until recent(ish) sendpage rework >> from David Howells all high mem pages would have been single pages. > Um. I touched the Tx side, not the Rx side. > > I also don't know whether all high mem pages would be single pages. I'll have > to defer that one to the MM folks. What prevents from gro to expand frags from crossing PAGE_SIZE? btw, at least from the code in skb_gro_receive() it appears that page_address() is called directly, which suggest that these netmem pages are lowmem?
On Wed, Jun 19, 2024 at 3:54 PM Sagi Grimberg <sagi@grimberg.me> wrote: > > > > On 19/06/2024 15:46, David Howells wrote: > > Jakub Kicinski <kuba@kernel.org> wrote: > > > >> On Mon, 17 Jun 2024 09:29:53 +0300 Sagi Grimberg wrote: > >>>> Probably because kmap() returns page_address() for non-highmem pages > >>>> while kmap_local_page() actually returns a kmap address: > >>>> > >>>> if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page)) > >>>> return page_address(page); > >>>> return __kmap_local_pfn_prot(page_to_pfn(page), prot); > >>>> > >>>> so if skb frags are always lowmem (are they?) this is a false positive. > >>> AFAIR these buffers are coming from the RX ring, so they should be > >>> coming from a page_frag_cache, > >>> so I want to say always low memory? > >>> > >>>> if they can be highmem, then you've uncovered a bug that nobody's > >>>> noticed because nobody's testing on 32-bit any more. > >>> Not sure, Jakub? Eric? > >> My uneducated guess would be that until recent(ish) sendpage rework > >> from David Howells all high mem pages would have been single pages. > > Um. I touched the Tx side, not the Rx side. > > > > I also don't know whether all high mem pages would be single pages. I'll have > > to defer that one to the MM folks. > > What prevents from gro to expand frags from crossing PAGE_SIZE? > > btw, at least from the code in skb_gro_receive() it appears that > page_address() is called directly, > which suggest that these netmem pages are lowmem? GRO should only be fed with lowmem pages. But the trace involves af_unix, not GRO ? I guess that with splice games, it is possible to add high order pages to skbs. I think skb_frag_foreach_page() could be used to fix this issue.
On Wed, Jun 19, 2024 at 4:51 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Jun 19, 2024 at 3:54 PM Sagi Grimberg <sagi@grimberg.me> wrote: > > > > > > > > On 19/06/2024 15:46, David Howells wrote: > > > Jakub Kicinski <kuba@kernel.org> wrote: > > > > > >> On Mon, 17 Jun 2024 09:29:53 +0300 Sagi Grimberg wrote: > > >>>> Probably because kmap() returns page_address() for non-highmem pages > > >>>> while kmap_local_page() actually returns a kmap address: > > >>>> > > >>>> if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page)) > > >>>> return page_address(page); > > >>>> return __kmap_local_pfn_prot(page_to_pfn(page), prot); > > >>>> > > >>>> so if skb frags are always lowmem (are they?) this is a false positive. > > >>> AFAIR these buffers are coming from the RX ring, so they should be > > >>> coming from a page_frag_cache, > > >>> so I want to say always low memory? > > >>> > > >>>> if they can be highmem, then you've uncovered a bug that nobody's > > >>>> noticed because nobody's testing on 32-bit any more. > > >>> Not sure, Jakub? Eric? > > >> My uneducated guess would be that until recent(ish) sendpage rework > > >> from David Howells all high mem pages would have been single pages. > > > Um. I touched the Tx side, not the Rx side. > > > > > > I also don't know whether all high mem pages would be single pages. I'll have > > > to defer that one to the MM folks. > > > > What prevents from gro to expand frags from crossing PAGE_SIZE? > > > > btw, at least from the code in skb_gro_receive() it appears that > > page_address() is called directly, > > which suggest that these netmem pages are lowmem? > > GRO should only be fed with lowmem pages. > > But the trace involves af_unix, not GRO ? > > I guess that with splice games, it is possible to add high order pages to skbs. > > I think skb_frag_foreach_page() could be used to fix this issue. For reference, please look at commit c613c209c3f351d47158f728271d0c73b6dd24c6 Author: Willem de Bruijn <willemb@google.com> Date: Mon Jul 31 08:15:47 2017 -0400 net: add skb_frag_foreach_page and use with kmap_atomic
On 19/06/2024 17:51, Eric Dumazet wrote: > On Wed, Jun 19, 2024 at 3:54 PM Sagi Grimberg <sagi@grimberg.me> wrote: >> >> >> On 19/06/2024 15:46, David Howells wrote: >>> Jakub Kicinski <kuba@kernel.org> wrote: >>> >>>> On Mon, 17 Jun 2024 09:29:53 +0300 Sagi Grimberg wrote: >>>>>> Probably because kmap() returns page_address() for non-highmem pages >>>>>> while kmap_local_page() actually returns a kmap address: >>>>>> >>>>>> if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page)) >>>>>> return page_address(page); >>>>>> return __kmap_local_pfn_prot(page_to_pfn(page), prot); >>>>>> >>>>>> so if skb frags are always lowmem (are they?) this is a false positive. >>>>> AFAIR these buffers are coming from the RX ring, so they should be >>>>> coming from a page_frag_cache, >>>>> so I want to say always low memory? >>>>> >>>>>> if they can be highmem, then you've uncovered a bug that nobody's >>>>>> noticed because nobody's testing on 32-bit any more. >>>>> Not sure, Jakub? Eric? >>>> My uneducated guess would be that until recent(ish) sendpage rework >>>> from David Howells all high mem pages would have been single pages. >>> Um. I touched the Tx side, not the Rx side. >>> >>> I also don't know whether all high mem pages would be single pages. I'll have >>> to defer that one to the MM folks. >> What prevents from gro to expand frags from crossing PAGE_SIZE? >> >> btw, at least from the code in skb_gro_receive() it appears that >> page_address() is called directly, >> which suggest that these netmem pages are lowmem? > GRO should only be fed with lowmem pages. > > But the trace involves af_unix, not GRO ? > > I guess that with splice games, it is possible to add high order pages to skbs. > > I think skb_frag_foreach_page() could be used to fix this issue. Thanks Eric, that is indeed the missing piece. I'll give it go.
Eric Dumazet <edumazet@google.com> wrote:
> But the trace involves af_unix, not GRO ?
Ah. splice()/vmsplice() can feed multipage folios into the socket. Possibly
those can be in highmem.
David
diff --git a/net/core/datagram.c b/net/core/datagram.c index a8b625abe242..ac74df248205 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -436,14 +436,14 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset, end = start + skb_frag_size(frag); if ((copy = end - offset) > 0) { struct page *page = skb_frag_page(frag); - u8 *vaddr = kmap(page); + u8 *vaddr = kmap_local_page(page); if (copy > len) copy = len; n = INDIRECT_CALL_1(cb, simple_copy_to_iter, vaddr + skb_frag_off(frag) + offset - start, copy, data, to); - kunmap(page); + kunmap_local(vaddr); offset += n; if (n != copy) goto short_copy;
We only use the mapping in a single context in a short and contained scope, so kmap_local_page is sufficient and cheaper. This will also allow skb_datagram_iter to be called from softirq context. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> --- net/core/datagram.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)