[v13,03/12] splice: Do splice read from a buffered file without using ITER_PIPE

Message ID 20230209102954.528942-4-dhowells@redhat.com (mailing list archive)
State New
Series iov_iter: Improve page extraction (pin or just list)

Commit Message

David Howells Feb. 9, 2023, 10:29 a.m. UTC
Provide a function to do splice read from a buffered file, pulling the
folios out of the pagecache directly by calling filemap_get_pages() to do
any required reading and then pasting the returned folios into the pipe.

A helper function is provided to do the actual folio pasting and will
handle multipage folios by splicing as many of the relevant subpages as
will fit into the pipe.
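
The per-page splitting described above can be sketched in userspace. This is only an illustrative model of the arithmetic, not the kernel helper: the `model_*` names and the fixed `MODEL_PAGE_SIZE` are assumptions made for the example, and the caller is assumed to pass an offset that lies inside the folio.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* One pipe slot's worth of data taken from a single subpage. */
struct model_part {
	size_t page_index;	/* which subpage of the folio */
	size_t offset;		/* byte offset inside that subpage */
	size_t len;		/* bytes taken from that subpage */
};

static size_t model_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Split the range starting at @offset within a folio of @folio_size bytes
 * into per-page parts.  Stops when @size bytes are covered, when the folio
 * ends, or when @free_slots parts have been emitted.  Returns the number of
 * bytes covered and stores the number of parts in @nparts.
 */
static size_t model_split_folio(size_t folio_size, size_t offset, size_t size,
				size_t free_slots, struct model_part *parts,
				size_t *nparts)
{
	size_t spliced = 0, n = 0;
	size_t page_index = offset / MODEL_PAGE_SIZE;

	size = model_min(size, folio_size - offset);
	offset %= MODEL_PAGE_SIZE;

	while (spliced < size && n < free_slots) {
		size_t part = model_min(MODEL_PAGE_SIZE - offset,
					size - spliced);

		parts[n++] = (struct model_part){ page_index++, offset, part };
		spliced += part;
		offset = 0;	/* only the first part starts mid-page */
	}

	*nparts = n;
	return spliced;
}
```

Only the first part can start at a non-zero in-page offset; a short return value means the model ran out of slots, which corresponds to the pipe filling up.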

The ITER_BVEC-based splicing previously added is then only used for
splicing from O_DIRECT files.

The code is loosely based on filemap_read() and might belong in
mm/filemap.c with that as it needs to use filemap_get_pages().

With this, ITER_PIPE is no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/splice.c | 159 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 135 insertions(+), 24 deletions(-)

Comments

Christoph Hellwig Feb. 13, 2023, 8:28 a.m. UTC | #1
> The code is loosely based on filemap_read() and might belong in
> mm/filemap.c with that as it needs to use filemap_get_pages().

Yes, I think it should go into filemap.c.

> +	while (spliced < size &&
> +	       !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
> +		struct pipe_buffer *buf = &pipe->bufs[pipe->head & (pipe->ring_size - 1)];

Can you please factor this calculation, which is also duplicated in patch
one, into a helper?

static inline struct pipe_buffer *pipe_head_buf(struct pipe_inode_info *pipe)
{
	return &pipe->bufs[pipe->head & (pipe->ring_size - 1)];
}
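
For illustration only, here is a userspace model of why the masking in such a helper works: head and tail are free-running counters, and the slot index is recovered by masking with a power-of-two ring size. The `model_*` names and the fixed eight-slot array are assumptions for this sketch, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Userspace model only; the model_* names are hypothetical. */
struct model_buf {
	size_t offset;
	size_t len;
};

struct model_pipe {
	uint32_t head;		/* next slot to fill, free-running */
	uint32_t tail;		/* next slot to consume, free-running */
	uint32_t ring_size;	/* number of slots, must be a power of two */
	uint32_t max_usage;	/* cap on occupied slots */
	struct model_buf bufs[8];
};

/* Occupied slots; unsigned subtraction stays correct across wraparound. */
static uint32_t model_occupancy(uint32_t head, uint32_t tail)
{
	return head - tail;
}

static int model_full(uint32_t head, uint32_t tail, uint32_t limit)
{
	return model_occupancy(head, tail) >= limit;
}

/* Slot that the next produced buffer would occupy. */
static struct model_buf *model_head_buf(struct model_pipe *pipe)
{
	/* The mask trick is only valid for a power-of-two ring size. */
	assert(pipe->ring_size &&
	       (pipe->ring_size & (pipe->ring_size - 1)) == 0);
	return &pipe->bufs[pipe->head & (pipe->ring_size - 1)];
}
```

With `ring_size` 8 and `head` 10, the mask selects slot 2. Having one accessor keeps the masking in a single place instead of repeating it at each call site.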

> +	struct folio_batch fbatch;
> +	size_t total_spliced = 0, used, npages;
> +	loff_t isize, end_offset;
> +	bool writably_mapped;
> +	int i, error = 0;
> +
> +	struct kiocb iocb = {

Why the empty line before this declaration?

> +		.ki_filp	= in,
> +		.ki_pos		= *ppos,
> +	};

Also why doesn't this use init_sync_kiocb?

>  	if (in->f_flags & O_DIRECT)
>  		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
> +	return generic_file_buffered_splice_read(in, ppos, pipe, len, flags);

Btw, can we drop the verbose generic_file_ prefix here?

generic_file_buffered_splice_read really should be filemap_splice_read
and be in filemap.c.  generic_file_direct_splice_read I'd just name
direct_splice_read.

David Howells Feb. 13, 2023, 10:11 a.m. UTC | #2
Christoph Hellwig <hch@infradead.org> wrote:

> Also why doesn't this use init_sync_kiocb?

I'm not sure I want ki_flags.

> >  	if (in->f_flags & O_DIRECT)
> >  		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
> > +	return generic_file_buffered_splice_read(in, ppos, pipe, len, flags);
> 
> Btw, can we drop the verbose generic_file_ prefix here?

Probably.  Note that at some point cifs, for example, running in "unbuffered"
mode might want to call [generic_file_]direct_splice_read() directly.

David

Christoph Hellwig Feb. 13, 2023, 10:18 a.m. UTC | #3
On Mon, Feb 13, 2023 at 10:11:01AM +0000, David Howells wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > Also why doesn't this use init_sync_kiocb?
> 
> I'm not sure I want ki_flags.

Why?

David Howells Feb. 13, 2023, 11:15 a.m. UTC | #4
Christoph Hellwig <hch@infradead.org> wrote:

> > > Also why doesn't this use init_sync_kiocb?
> > 
> > I'm not sure I want ki_flags.
> 
> Why?

I'm not sure I want ki_flags setting from f_iocb_flags I should've said.  I'm
not sure how the IOCB_* flags that I import from there will affect the
operation of the synchronous read splice.  IOCB_NOWAIT, for example, or, for
that matter, IOCB_APPEND.

David

Christoph Hellwig Feb. 13, 2023, 2:44 p.m. UTC | #5
On Mon, Feb 13, 2023 at 11:15:37AM +0000, David Howells wrote:
> I'm not sure I want ki_flags setting from f_iocb_flags I should've said.  I'm
> not sure how the IOCB_* flags that I import from there will affect the
> operation of the synchronous read splice.

The same way as they did in the old ITER_PIPE-based
generic_file_splice_read that uses init_sync_kiocb?  And if there are any
questions about them, we need to do a deep audit.

> IOCB_NOWAIT, for example, or, for

I'd expect a set IOCB_NOWAIT to make the function return -EAGAIN
when it has to block.

> that matter, IOCB_APPEND.

IOCB_APPEND has no effect on reads of any kind.
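
As a rough userspace illustration of the flag behaviour expected above: the sketch below returns -EAGAIN when a no-wait flag is set and the data is not already cached, and never consults the append flag on the read path. `MODEL_NOWAIT`, `MODEL_APPEND` and `model_buffered_read()` are hypothetical names for this example, not the kernel's IOCB_* definitions or read path.

```c
#include <errno.h>

/* Hypothetical flag bits for this sketch; not the kernel's IOCB_* values. */
#define MODEL_NOWAIT	(1u << 0)
#define MODEL_APPEND	(1u << 1)

/*
 * Model of a buffered read: @data_cached says whether the data is already
 * available without blocking, @nbytes is what a successful read returns.
 */
static long model_buffered_read(unsigned int flags, int data_cached,
				long nbytes)
{
	/* A no-wait caller gets -EAGAIN instead of blocking for I/O. */
	if (!data_cached && (flags & MODEL_NOWAIT))
		return -EAGAIN;

	/*
	 * Either the data is cached or the caller may block.
	 * MODEL_APPEND is not consulted anywhere on the read path.
	 */
	return nbytes;
}
```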

Guenter Roeck Feb. 13, 2023, 6:06 p.m. UTC | #6
Hi,

On Thu, Feb 09, 2023 at 10:29:45AM +0000, David Howells wrote:
> Provide a function to do splice read from a buffered file, pulling the
> folios out of the pagecache directly by calling filemap_get_pages() to do
> any required reading and then pasting the returned folios into the pipe.
> 
> A helper function is provided to do the actual folio pasting and will
> handle multipage folios by splicing as many of the relevant subpages as
> will fit into the pipe.
> 
> The ITER_BVEC-based splicing previously added is then only used for
> splicing from O_DIRECT files.
> 
> The code is loosely based on filemap_read() and might belong in
> mm/filemap.c with that as it needs to use filemap_get_pages().
> 
> With this, ITER_PIPE is no longer used.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>

With this patch in the tree, the "collie" and "mps2" qemu emulations
crash for me. Crash logs are attached. I also attached the bisect log
for "collie".

Unfortunately I can not revert the patch to confirm because the revert
results in compile failures.

Guenter
---
bisect log

# bad: [09e41676e35ab06e4bce8870ea3bf1f191c3cb90] Add linux-next specific files for 20230213
# good: [4ec5183ec48656cec489c49f989c508b68b518e3] Linux 6.2-rc7
git bisect start 'HEAD' 'v6.2-rc7'
# good: [8b065aee8dfbecc978324b204fc897168c9adcd0] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
git bisect good 8b065aee8dfbecc978324b204fc897168c9adcd0
# bad: [72655d7bf4966cc46ac85ef74b26eb74e251ae4a] Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
git bisect bad 72655d7bf4966cc46ac85ef74b26eb74e251ae4a
# good: [55461ffd2b7ee0a8fe4a1f98ae6f4a33771e8193] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git
git bisect good 55461ffd2b7ee0a8fe4a1f98ae6f4a33771e8193
# bad: [0f1bf464790dad200077e97d35cd8bb9dd7b8341] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply.git
git bisect bad 0f1bf464790dad200077e97d35cd8bb9dd7b8341
# good: [c72ebd41e0737e1f1d30dc6eb3d167e8d16dcc3a] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git
git bisect good c72ebd41e0737e1f1d30dc6eb3d167e8d16dcc3a
# bad: [501053535caca01f20a9323d3c8dec9ecb7a06b1] Merge branch 'for-6.3/iov-extract' into for-next
git bisect bad 501053535caca01f20a9323d3c8dec9ecb7a06b1
# good: [efde918ac66958c568926120841e7692b1e9bd9d] rxrpc: use bvec_set_page to initialize a bvec
git bisect good efde918ac66958c568926120841e7692b1e9bd9d
# good: [6938b812a638d9f02d3eb4fd07c7aab4fd44076d] Merge branch 'for-6.3/io_uring' into for-next
git bisect good 6938b812a638d9f02d3eb4fd07c7aab4fd44076d
# good: [1972d038a5401781377d3ce2d901bf7763a43589] ublk: pass NULL to blk_mq_alloc_disk() as queuedata
git bisect good 1972d038a5401781377d3ce2d901bf7763a43589
# good: [f37bf75ca73d523ebaa7ceb44c45d8ecd05374fe] block, bfq: cleanup 'bfqg->online'
git bisect good f37bf75ca73d523ebaa7ceb44c45d8ecd05374fe
# bad: [34c5b3634708864d5845cbadad03833c30051e0b] iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
git bisect bad 34c5b3634708864d5845cbadad03833c30051e0b
# bad: [d9722a47571104f7fa1eeb5ec59044d3607c6070] splice: Do splice read from a buffered file without using ITER_PIPE
git bisect bad d9722a47571104f7fa1eeb5ec59044d3607c6070
# good: [cd119d2fa647945d63941d3fd64f4acc9f6eec24] mm: Pass info, not iter, into filemap_get_pages() and unstatic it
git bisect good cd119d2fa647945d63941d3fd64f4acc9f6eec24
# first bad commit: [d9722a47571104f7fa1eeb5ec59044d3607c6070] splice: Do splice read from a buffered file without using ITER_PIPE

---
arm:collie crash

8<--- cut here ---
Unable to handle kernel NULL pointer dereference at virtual address 00000000 when execute
[00000000] *pgd=c14b4831c14b4831, *pte=c14b4000, *ppte=e09b5a14
8<--- cut here ---
Unhandled fault: page domain fault (0x01b) at 0x00000000
[00000000] *pgd=c14b4831, *pte=00000000, *ppte=00000000
Internal error: : 1b [#1] ARM
CPU: 0 PID: 58 Comm: cat Not tainted 6.2.0-rc7-next-20230213 #1
Hardware name: Sharp-Collie
PC is at copy_from_kernel_nofault+0x124/0x23c
LR is at 0xe09b5a84
pc : [<c009d894>]    lr : [<e09b5a84>]    psr: 20000193
sp : e09b5a4c  ip : e09b5a84  fp : e09b5a80
r10: 00000214  r9 : 60000113  r8 : 00000004
r7 : c08b91fc  r6 : e09b5a84  r5 : 00000004  r4 : 00000000
r3 : 00000001  r2 : c14a6ca0  r1 : 00000000  r0 : 00000001
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0000717f  Table: c14d4000  DAC: 00000051
Register r0 information: non-paged memory
Register r1 information: NULL pointer
Register r2 information: slab task_struct start c14a6ca0 pointer offset 0 size 3232
Register r3 information: non-paged memory
Register r4 information: NULL pointer
Register r5 information: non-paged memory
Register r6 information: 2-page vmalloc region starting at 0xe09b4000 allocated at kernel_clone+0x78/0x474
Register r7 information: non-slab/vmalloc memory
Register r8 information: non-paged memory
Register r9 information: non-paged memory
Register r10 information: non-paged memory
Register r11 information: 2-page vmalloc region starting at 0xe09b4000 allocated at kernel_clone+0x78/0x474
Register r12 information: 2-page vmalloc region starting at 0xe09b4000 allocated at kernel_clone+0x78/0x474
Process cat (pid: 58, stack limit = 0xfabdb807)
Stack: (0xe09b5a4c to 0xe09b6000)
5a40:                            e09b5a70 e09b5a5c c005a4e4 c087b82c e09b5c14
5a60: e09b5c14 00000000 c08b91fc 80000005 60000113 e09b5a98 e09b5a84 c000cfa8
5a80: c009d77c e09b5ab0 c087b82c e09b5ac0 e09b5a9c c06b8020 c000cf84 c149a840
5aa0: c07df1e0 e09b5c14 c07df1c8 c087f3e4 c08b91fc e09b5b40 e09b5ac4 c000cc28
5ac0: c06b8010 e09b5af0 e09b5ad4 c005e7a4 c005d0b8 e09b5af8 e09b5ae4 c005d0d4
5ae0: c14b4000 e09b5b08 e09b5af4 c06de218 c005e72c e09b5b10 c087b82c e09b5b40
5b00: e09b5b1c c06dcba8 c06de1f4 c07ded54 c087b82c 00000000 80000005 00000000
5b20: c149a840 c07df1e0 c149a8b8 00000004 00000214 e09b5b58 e09b5b44 c000dcec
5b40: c000cb98 80000005 e09b5c14 e09b5b80 e09b5b5c c000dd78 c000dc84 e09b5c14
5b60: e09b5b6c e09b5c14 80000005 00000000 c149a840 e09b5bc0 e09b5b84 c000df6c
5b80: c000dd2c 00000000 c087ba8c c14a6ca0 00010000 c08bacc0 00000005 e09b5c14
5ba0: c087f688 c000e184 00000000 c14a6ca0 c149f158 e09b5bd8 e09b5bc4 c000e22c
5bc0: c000ddc0 00000005 e09b5c14 e09b5c10 e09b5bdc c000e33c c000e190 c14a6ca0
5be0: c00526a4 c14a6ca0 c088607c 60000013 00000000 20000013 ffffffff e09b5c48
5c00: dfb10900 e09b5cb4 e09b5c14 c0008e10 c000e304 c14b3a20 dfb10900 dfb10900
5c20: 60000093 dfb10900 00000000 c14b3a20 60000013 dfb10900 00000000 c149f158
5c40: e09b5cb4 e09b5cb8 e09b5c60 c0093a7c 00000000 20000013 ffffffff 00000051
5c60: c00a5a88 00000cc0 dfb10900 00000000 00000cc0 c149f128 e09b5cb4 e09b5c88
5c80: c009507c c087b82c e09b5c90 e09b5d94 e09b5d68 00000000 c149f128 dfb10900
5ca0: 00000000 c149f158 e09b5d3c e09b5cb8 c0099714 c0093a20 dfff1d14 c14a7258
5cc0: c00e28bc 00000002 e09b5d1c e09b5cd8 00010000 00000001 c14b3af0 c14b3a20
5ce0: c14a6ca0 00000010 00000001 c14b3a20 c149f128 c14b3af0 00000000 00000000
5d00: 00000000 00000000 00000000 c087b82c 80000113 00000000 00000000 e09b5f18
5d20: 00010000 c14b5600 00000000 00010000 e09b5e04 e09b5d40 c013019c c0099324
5d40: 60000113 00000000 c14a6ca0 c14a6ca0 e09b5d84 c14b3a20 c087ba8c c14a6ca0
5d60: 00000000 00000000 c14b3a20 00000000 00000000 00000000 00000000 00000000
5d80: 00000000 00000000 00000000 00000000 fffffffe ffff0000 000001a9 00000000
5da0: c832b34f c088607c 60000013 00000000 e09b5f18 c0a90ec8 01000000 c14b5600
5dc0: e09b5e04 e09b5dd0 c0055788 c0054dec 00000000 c087b82c c0102d90 c14b5600
5de0: 00000000 e09b5f18 e09b5f18 c14b3a20 00000000 00010000 e09b5e94 e09b5e08
5e00: c0130ec0 c01300d0 00000001 00000000 c0102d90 c14b5600 e09b5e9c e09b5e28
5e20: c06ef2a8 c005582c 00000001 00000000 c0102d90 e09b5e40 c0055f60 c0052e4c
5e40: e09b5e5c e09b5e50 c14b5634 00000001 00000001 00000002 c000f22c c149a894
5e60: 00000000 c087b82c c14d1e00 00010000 c14b3a20 c14b5600 e09b5f18 c0130e48
5e80: 01000000 00000001 e09b5ec4 e09b5e98 c012fee0 c0130e54 00000000 c06ef210
5ea0: c0102d90 00000000 c14b5600 c14b3a20 e09b5f18 00000000 e09b5ef4 e09b5ec8
5ec0: c0131d94 c012fe50 00000000 c0052e4c c14b3a20 00000000 00000000 01000000
5ee0: c14b3a20 c0c2aa20 e09b5f5c e09b5ef8 c00f72d8 c0131d28 00000000 c149a8b8
5f00: 00000002 00000255 c14b5600 00000000 c0c2aa20 c00f8f4c 00000000 00000000
5f20: 00000000 00000000 e09b5f74 c087b82c c000df18 00000000 00000000 00000003
5f40: 000000ef c0008420 c14a6ca0 01000000 e09b5fa4 e09b5f60 c00f8f4c c00f7170
5f60: 7fffffff 00000000 e09b5fac e09b5f78 c000e278 c087b82c befeee88 01000000
5f80: 00000000 01000000 000000ef c0008420 c14a6ca0 00000000 00000000 e09b5fa8
5fa0: c0008260 c00f8e38 01000000 00000000 00000001 00000003 00000000 01000000
5fc0: 01000000 00000000 01000000 000000ef 00000001 00000001 00000000 00000000
5fe0: b6e485d0 befedc74 00019764 b6e485dc 60000010 00000001 00000000 00000000
Backtrace:
 copy_from_kernel_nofault from is_valid_bugaddr+0x30/0x7c
 r9:60000113 r8:80000005 r7:c08b91fc r6:00000000 r5:e09b5c14 r4:e09b5c14
 is_valid_bugaddr from report_bug+0x1c/0x114
 report_bug from die+0x9c/0x398
 r7:c08b91fc r6:c087f3e4 r5:c07df1c8 r4:e09b5c14
 die from die_kernel_fault+0x74/0xa8
 r10:00000214 r9:00000004 r8:c149a8b8 r7:c07df1e0 r6:c149a840 r5:00000000
 r4:80000005
 die_kernel_fault from __do_kernel_fault.part.0+0x58/0x94
 r7:e09b5c14 r4:80000005
 __do_kernel_fault.part.0 from do_page_fault+0x1b8/0x338
 r7:c149a840 r6:00000000 r5:80000005 r4:e09b5c14
 do_page_fault from do_translation_fault+0xa8/0xb0
 r10:c149f158 r9:c14a6ca0 r8:00000000 r7:c000e184 r6:c087f688 r5:e09b5c14
 r4:00000005
 do_translation_fault from do_PrefetchAbort+0x44/0x98
 r5:e09b5c14 r4:00000005
 do_PrefetchAbort from __pabt_svc+0x50/0x80
Exception stack(0xe09b5c14 to 0xe09b5c5c)
5c00:                                              c14b3a20 dfb10900 dfb10900
5c20: 60000093 dfb10900 00000000 c14b3a20 60000013 dfb10900 00000000 c149f158
5c40: e09b5cb4 e09b5cb8 e09b5c60 c0093a7c 00000000 20000013 ffffffff
 r8:dfb10900 r7:e09b5c48 r6:ffffffff r5:20000013 r4:00000000
 filemap_read_folio from filemap_get_pages+0x3fc/0x7a4
 r10:c149f158 r9:00000000 r8:dfb10900 r7:c149f128 r6:00000000 r5:e09b5d68
 r4:e09b5d94
 filemap_get_pages from generic_file_buffered_splice_read.constprop.0+0xd8/0x400
 r10:00010000 r9:00000000 r8:c14b5600 r7:00010000 r6:e09b5f18 r5:00000000
 r4:00000000
 generic_file_buffered_splice_read.constprop.0 from generic_file_splice_read+0x78/0x310
 r10:00010000 r9:00000000 r8:c14b3a20 r7:e09b5f18 r6:e09b5f18 r5:00000000
 r4:c14b5600
 generic_file_splice_read from do_splice_to+0x9c/0xbc
 r10:00000001 r9:01000000 r8:c0130e48 r7:e09b5f18 r6:c14b5600 r5:c14b3a20
 r4:00010000
 do_splice_to from splice_file_to_pipe+0x78/0x80
 r8:00000000 r7:e09b5f18 r6:c14b3a20 r5:c14b5600 r4:00000000
 splice_file_to_pipe from do_sendfile+0x174/0x59c
 r9:c0c2aa20 r8:c14b3a20 r7:01000000 r6:00000000 r5:00000000 r4:c14b3a20
 do_sendfile from sys_sendfile64+0x120/0x148
 r10:01000000 r9:c14a6ca0 r8:c0008420 r7:000000ef r6:00000003 r5:00000000
 r4:00000000
 sys_sendfile64 from ret_fast_syscall+0x0/0x44
Exception stack(0xe09b5fa8 to 0xe09b5ff0)
5fa0:                   01000000 00000000 00000001 00000003 00000000 01000000
5fc0: 01000000 00000000 01000000 000000ef 00000001 00000001 00000000 00000000
5fe0: b6e485d0 befedc74 00019764 b6e485dc
 r10:00000000 r9:c14a6ca0 r8:c0008420 r7:000000ef r6:01000000 r5:00000000
 r4:01000000
Code: e21e1003 1a000011 e3550003 9a00000f (e4943000)
---[ end trace 0000000000000000 ]---

---
arm:mps2

[    4.659693]
[    4.659693] Unhandled exception: IPSR = 00000006 LR = fffffff1
[    4.659888] CPU: 0 PID: 155 Comm: cat Tainted: G                 N 6.2.0-rc7-next-20230213 #1
[    4.660030] Hardware name: Generic DT based system
[    4.660118] PC is at 0x0
[    4.660248] LR is at filemap_read_folio+0x17/0x4e
[    4.660468] pc : [<00000000>]    lr : [<21044c97>]    psr: 0000000b
[    4.660534] sp : 2185bd10  ip : 2185bcd0  fp : 00080001
[    4.660591] r10: 21757b40  r9 : 2175eea4  r8 : 2185bdf4
[    4.660649] r7 : 2175ee88  r6 : 21757b40  r5 : 21fecb40  r4 : 00000000
[    4.660718] r3 : 00000001  r2 : 00000001  r1 : 21fecb40  r0 : 21757b40
[    4.660785] xPSR: 0000000b
[    4.661126]  filemap_read_folio from filemap_get_pages+0x127/0x36e
[    4.661247]  filemap_get_pages from generic_file_buffered_splice_read.constprop.5+0x85/0x244
[    4.661342]  generic_file_buffered_splice_read.constprop.5 from generic_file_splice_read+0x1c3/0x1e2
[    4.661436]  generic_file_splice_read from splice_file_to_pipe+0x2f/0x48
[    4.661509]  splice_file_to_pipe from do_sendfile+0x193/0x1b8
[    4.661573]  do_sendfile from sys_sendfile64+0x63/0x70
[    4.661653]  sys_sendfile64 from ret_fast_syscall+0x1/0x4c
[    4.661740] Exception stack(0x2185bfa8 to 0x2185bff0)
[    4.661854] bfa0:                   000000ef 00000000 00000001 00000003 00000000 01000000
[    4.661944] bfc0: 000000ef 00000000 21b56e48 000000ef 00000001 00000001 00000000 21b51770
[    4.662023] bfe0: 21b4e791 21b56e48 21b174c9 21b0265e

David Howells Feb. 13, 2023, 10:43 p.m. UTC | #7
Guenter Roeck <linux@roeck-us.net> wrote:

> [    4.660118] PC is at 0x0
> [    4.660248] LR is at filemap_read_folio+0x17/0x4e

Do you know what the filesystem is that's being read from?

I think the problem is that there are a few filesystems/drivers that call
generic_file_splice_read() but don't have a ->read_folio().  Now most of these
can be made to call direct_splice_read() instead, leaving just coda, overlayfs
and shmem.

Coda and overlayfs can be made to pass the request down a layer.  I'm about to
look into shmem.
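
The suspected failure mode, an indirect call through an unset ->read_folio, can be modelled in userspace as follows. The `model_*` types and the guard are illustrative assumptions only; they are not the kernel structures and not necessarily the fix that will be adopted.

```c
#include <errno.h>
#include <stddef.h>

struct model_folio {
	int uptodate;
};

/* Hypothetical stand-in for an address-space operations table. */
struct model_aops {
	int (*read_folio)(struct model_folio *folio);
};

/* Example implementation that "reads" the folio successfully. */
static int model_fill(struct model_folio *folio)
{
	folio->uptodate = 1;
	return 0;
}

/*
 * Calling aops->read_folio() unconditionally jumps to address 0 when the
 * pointer is unset, which matches the "PC is at 0x0" symptom in the logs.
 * This wrapper refuses the call instead; -EINVAL is an arbitrary choice
 * made for the sketch.
 */
static int model_read_folio_checked(const struct model_aops *aops,
				    struct model_folio *folio)
{
	if (!aops || !aops->read_folio)
		return -EINVAL;
	return aops->read_folio(folio);
}
```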

David

Guenter Roeck Feb. 13, 2023, 10:51 p.m. UTC | #8
On 2/13/23 14:43, David Howells wrote:
> Guenter Roeck <linux@roeck-us.net> wrote:
> 
>> [    4.660118] PC is at 0x0
>> [    4.660248] LR is at filemap_read_folio+0x17/0x4e
> 
> Do you know what the filesystem is that's being read from?
> 
> I think the problem is that there are a few filesystems/drivers that call
> generic_file_splice_read() but don't have a ->read_folio().  Now most of these
> can be made to call direct_splice_read() instead, leaving just coda, overlayfs
> and shmem.
> 
> Coda and overlayfs can be made to pass the request down a layer.  I'm about to
> look into shmem.
> 

Both are initrd.

Guenter

David Howells Feb. 13, 2023, 11:12 p.m. UTC | #9
Guenter Roeck <linux@roeck-us.net> wrote:

> Both are initrd.

Do you mean rootfs?  And, if so, is that tmpfs-based or ramfs-based?

David

Guenter Roeck Feb. 13, 2023, 11:25 p.m. UTC | #10
On 2/13/23 15:12, David Howells wrote:
> Guenter Roeck <linux@roeck-us.net> wrote:
> 
>> Both are initrd.
> 
> Do you mean rootfs?  And, if so, is that tmpfs-based or ramfs-based?
> 

Both are provided to the kernel using the -initrd qemu option,
which usually means that the address/location is passed to the kernel
through either a register or a data structure. I have not really paid
much attention to what the kernel is doing with that information.
It is in cpio format, so it must be decompressed, but I don't know how
it is actually handled (nor why this doesn't fail on other boots
from initrd).

Guenter

Patch

diff --git a/fs/splice.c b/fs/splice.c
index b4be6fc314a1..963cbf20abc8 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -22,6 +22,7 @@ 
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/splice.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
@@ -375,6 +376,135 @@  static ssize_t generic_file_direct_splice_read(struct file *in, loff_t *ppos,
 	return ret;
 }
 
+/*
+ * Splice subpages from a folio into a pipe.
+ */
+static size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
+				     struct folio *folio,
+				     loff_t fpos, size_t size)
+{
+	struct page *page;
+	size_t spliced = 0, offset = offset_in_folio(folio, fpos);
+
+	page = folio_page(folio, offset / PAGE_SIZE);
+	size = min(size, folio_size(folio) - offset);
+	offset %= PAGE_SIZE;
+
+	while (spliced < size &&
+	       !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+		struct pipe_buffer *buf = &pipe->bufs[pipe->head & (pipe->ring_size - 1)];
+		size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);
+
+		*buf = (struct pipe_buffer) {
+			.ops	= &page_cache_pipe_buf_ops,
+			.page	= page,
+			.offset	= offset,
+			.len	= part,
+		};
+		folio_get(folio);
+		pipe->head++;
+		page++;
+		spliced += part;
+		offset = 0;
+	}
+
+	return spliced;
+}
+
+/*
+ * Splice folios from the pagecache of a buffered (ie. non-O_DIRECT) file into
+ * a pipe.
+ */
+static ssize_t generic_file_buffered_splice_read(struct file *in, loff_t *ppos,
+						 struct pipe_inode_info *pipe,
+						 size_t len,
+						 unsigned int flags)
+{
+	struct folio_batch fbatch;
+	size_t total_spliced = 0, used, npages;
+	loff_t isize, end_offset;
+	bool writably_mapped;
+	int i, error = 0;
+
+	struct kiocb iocb = {
+		.ki_filp	= in,
+		.ki_pos		= *ppos,
+	};
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	folio_batch_init(&fbatch);
+
+	do {
+		cond_resched();
+
+		if (*ppos >= i_size_read(file_inode(in)))
+			break;
+
+		iocb.ki_pos = *ppos;
+		error = filemap_get_pages(&iocb, len, &fbatch, true);
+		if (error < 0)
+			break;
+
+		/*
+		 * i_size must be checked after we know the pages are Uptodate.
+		 *
+		 * Checking i_size after the check allows us to calculate
+		 * the correct value for "nr", which means the zero-filled
+		 * part of the page is not copied back to userspace (unless
+		 * another truncate extends the file - this is desired though).
+		 */
+		isize = i_size_read(file_inode(in));
+		if (unlikely(*ppos >= isize))
+			break;
+		end_offset = min_t(loff_t, isize, *ppos + len);
+
+		/*
+		 * Once we start copying data, we don't want to be touching any
+		 * cachelines that might be contended:
+		 */
+		writably_mapped = mapping_writably_mapped(in->f_mapping);
+
+		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+			struct folio *folio = fbatch.folios[i];
+			size_t n;
+
+			if (folio_pos(folio) >= end_offset)
+				goto out;
+			folio_mark_accessed(folio);
+
+			/*
+			 * If users can be writing to this folio using arbitrary
+			 * virtual addresses, take care of potential aliasing
+			 * before reading the folio on the kernel side.
+			 */
+			if (writably_mapped)
+				flush_dcache_folio(folio);
+
+			n = splice_folio_into_pipe(pipe, folio, *ppos, len);
+			if (!n)
+				goto out;
+			len -= n;
+			total_spliced += n;
+			*ppos += n;
+			in->f_ra.prev_pos = *ppos;
+			if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+				goto out;
+		}
+
+		folio_batch_release(&fbatch);
+	} while (len);
+
+out:
+	folio_batch_release(&fbatch);
+	file_accessed(in);
+
+	return total_spliced ? total_spliced : error;
+}
+
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -392,32 +522,13 @@  ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
-	struct iov_iter to;
-	struct kiocb kiocb;
-	int ret;
-
+	if (unlikely(*ppos >= file_inode(in)->i_sb->s_maxbytes))
+		return 0;
+	if (unlikely(!len))
+		return 0;
 	if (in->f_flags & O_DIRECT)
 		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
-
-	iov_iter_pipe(&to, ITER_DEST, pipe, len);
-	init_sync_kiocb(&kiocb, in);
-	kiocb.ki_pos = *ppos;
-	ret = call_read_iter(in, &kiocb, &to);
-	if (ret > 0) {
-		*ppos = kiocb.ki_pos;
-		file_accessed(in);
-	} else if (ret < 0) {
-		/* free what was emitted */
-		pipe_discard_from(pipe, to.start_head);
-		/*
-		 * callers of ->splice_read() expect -EAGAIN on
-		 * "can't put anything in there", rather than -EFAULT.
-		 */
-		if (ret == -EFAULT)
-			ret = -EAGAIN;
-	}
-
-	return ret;
+	return generic_file_buffered_splice_read(in, ppos, pipe, len, flags);
 }
 EXPORT_SYMBOL(generic_file_splice_read);