Message ID | 20230308165251.2078898-4-dhowells@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | splice, block: Use page pinning and kill ITER_PIPE | expand |
On Wed, Mar 8, 2023 at 8:53 AM David Howells <dhowells@redhat.com> wrote: > > The new filemap_splice_read() has an implicit expectation via > filemap_get_pages() that ->read_folio() exists if ->readahead() doesn't > fully populate the pagecache of the file it is reading from[1], potentially > leading to a jump to NULL if this doesn't exist. shmem, however, (and by > extension, tmpfs, ramfs and rootfs), doesn't have ->read_folio(), This patch is the only one in your series that I went "Ugh, that's really ugly" for. Do we really want to basically duplicate all of filemap_splice_read()? I get the feeling that the zeropage case just isn't so important that we'd need to duplicate filemap_splice_read() just for that, and I think that the code should either (a) just make a silly "read_folio()" for shmfs that just clears the page. Ugly but maybe simple and not horrid? or (b) teach filemap_splice_read() that a NULL 'read_folio' function means "use the zero page" That might not be splice() itself, but maybe in filemap_get_pages() or something. or (c) go even further, and teach read_folio() in general about file holes, and allow *any* filesystem to read zeroes that way in general without creating a folio for it. in a perfect world, if done well I think shmem_file_read_iter() should go away, and it could use generic_file_read_iter too. I dunno. Maybe shm really is *so* special that this is the right way to do things, but I did react quite negatively to this patch. So not a complete NAK, but definitely a "do we _really_ have to do this?" Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > I get the feeling that the zeropage case just isn't so important that we'd > need to duplicate filemap_splice_read() just for that, and I think that the > code should either > > (a) just make a silly "read_folio()" for shmfs that just clears the page. > > Ugly but maybe simple and not horrid? The problem with that is that once a page is in the pagecache attached to a shmem file, we can't get rid of it without deleting or truncating the file... At least, I think that the case. For all other regular filesystems, a page full of zeros can be flushed/discarded by the LRU. shmem also has its own function for allocating folios in its pagecache, the caller of ->read_folio() would probably have to use that. > (b) teach filemap_splice_read() that a NULL 'read_folio' function > means "use the zero page" > > That might not be splice() itself, but maybe in > filemap_get_pages() or something. It would require some special handling in filemap_get_pages() - and/or probably better filemap_splice_read() since, for shmem, it's only relevant to splice. An additional flag could be added to filemap_get_pages() to tell it to stop at a hole in the pagecache rather than invoking readahead. filemap_splice_read() would then need to examine the pagecache to work out how big the hole is and insert the appropriate number of zeropages before calling back into filemap_get_pages() again. Possibly it could use SEEK_DATA. > or > > (c) go even further, and teach read_folio() in general about file > holes, and allow *any* filesystem to read zeroes that way in general > without creating a folio for it. Nice idea, but we'd need a way to store a "negative" marker (as opposed to "unknown") in the pagecache for the filemap code to be able to use it. This sort of thing might become easier if xarray gets switched to a maple tree implementation as that would better allow for caching of a known file hole of arbitrary size with a single entry. But for the moment, the filemap code would have to jump through a filesystem's ->readahead or ->read_folio vectors to work out if there's a hole there or not - but in both cases it must already have allocated the pages it wants to query. David
On Wed, Mar 08, 2023 at 02:39:00PM -0800, Linus Torvalds wrote: > On Wed, Mar 8, 2023 at 8:53 AM David Howells <dhowells@redhat.com> wrote: > > > > The new filemap_splice_read() has an implicit expectation via > > filemap_get_pages() that ->read_folio() exists if ->readahead() doesn't > > fully populate the pagecache of the file it is reading from[1], potentially > > leading to a jump to NULL if this doesn't exist. shmem, however, (and by > > extension, tmpfs, ramfs and rootfs), doesn't have ->read_folio(), > > This patch is the only one in your series that I went "Ugh, that's > really ugly" for. > > Do we really want to basically duplicate all of filemap_splice_read()? > > I get the feeling that the zeropage case just isn't so important that > we'd need to duplicate filemap_splice_read() just for that, and I > think that the code should either > > (a) just make a silly "read_folio()" for shmfs that just clears the page. > > Ugly but maybe simple and not horrid? The problem is that we might have swapped out the shmem folio. So we don't want to clear the page, but ask swap to fill the page. The way that currently works (see shmem_get_folio_gfp()) is to fetch the swap entry from the page cache, allocate a new folio inside the shmem code, then replace the swap entry with the new folio. What I'd like to see is the generic code say "Ah, this is a shmem inode, so it's special and the xa_value entry is swap information, not workingset information, so I'll allocate the folio and restore the folio->private swap information to let the shmem_read_folio function do its job correctly". Either that or we completely overhaul the shmem code to store the location of its swapped data somewhere that's not the page cache. > (b) teach filemap_splice_read() that a NULL 'read_folio' function > means "use the zero page" Same problem as (a). > (c) go even further, and teach read_folio() in general about file > holes, and allow *any* filesystem to read zeroes that way in general > without creating a folio for it. I've had thoughts along those lines in the past. It's pretty major surgery, I think. At the moment, we allocate the pages and add them to the page cache in a locked state before asking the filesystem to populate them. So the fs doesn't even have the file layout (eg the get_block or iomap info) that would tell it where the holes are until the page has already been allocated and inserted. We could of course free the page and replace it with a special 'THIS_IS_A_HOLE' entry. It's just never seemed important enuogh to me to do this surgery. > in a perfect world, if done well I think shmem_file_read_iter() should > go away, and it could use generic_file_read_iter too. > > I dunno. Maybe shm really is *so* special that this is the right way > to do things, but I did react quite negatively to this patch. So not a > complete NAK, but definitely a "do we _really_ have to do this?" I'd really like to see shmem have a read_folio implementation. I don't know how much work it's going to be.
On Tue, Mar 14, 2023 at 9:43 AM Matthew Wilcox <willy@infradead.org> wrote: > > The problem is that we might have swapped out the shmem folio. So we > don't want to clear the page, but ask swap to fill the page. Doesn't shmem_swapin_folio() already basically do all that work? The real oddity with shmem - compared to other filesystems - is that the xarray has a value entry instead of being a real folio. And yes, the current filemap code will then just ignore such entries as "doesn't exist", and so the regular read iterators will all fail on it. But while filemap_get_read_batch() will stop at a value-folio, I feel like filemap_create_folio() should be able to turn a value page into a "real" page. Right now it already allocates said page, but then I think filemap_add_folio() will return -EEXIST when said entry exists as a value. But *if* instead of -EEXIST we could just replace the value with the (already locked) page, and have some sane way to pass that value (which is the swap entry data) to readpage(), I think that should just do it all. Admittedly I really don't know this area very well, so I may be *entirely* out to lunch. But the whole "teach the filemap code to actually react to XA value entries" would be how I'd solve the hole issue too. So I think there are commonalities here. Linus Linus
Hi Linus, Are you okay if we go with my current patch for the moment? David
On Tue, Mar 14, 2023 at 11:26 AM David Howells <dhowells@redhat.com> wrote: > > Are you okay if we go with my current patch for the moment? I guess. But please at least stop doing the get_page(buf->page); on the zero-page (which includes using no-op .get and .put functions in zero_pipe_buf_ops(). Maybe we can do /dev/null some day and actually have a common case for those. Linus
On Tue, Mar 14, 2023 at 12:07 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Maybe we can do /dev/null some day and actually have a common case for those. /dev/zero, I mean. We already do splice to /dev/null (and splicing from /dev/null isn't interesting ;) Linus
On Tue, Mar 14, 2023 at 11:02:40AM -0700, Linus Torvalds wrote: > On Tue, Mar 14, 2023 at 9:43 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > The problem is that we might have swapped out the shmem folio. So we > > don't want to clear the page, but ask swap to fill the page. > > Doesn't shmem_swapin_folio() already basically do all that work? > > The real oddity with shmem - compared to other filesystems - is that > the xarray has a value entry instead of being a real folio. And yes, > the current filemap code will then just ignore such entries as > "doesn't exist", and so the regular read iterators will all fail on > it. > > But while filemap_get_read_batch() will stop at a value-folio, I feel > like filemap_create_folio() should be able to turn a value page into a > "real" page. Right now it already allocates said page, but then I > think filemap_add_folio() will return -EEXIST when said entry exists > as a value. > > But *if* instead of -EEXIST we could just replace the value with the > (already locked) page, and have some sane way to pass that value > (which is the swap entry data) to readpage(), I think that should just > do it all. This was basically what I had in mind: I don't think this will handle a case like: Alloc order-0 folio at index 4 Alloc order-0 folio at index 7 Swap out both folios Alloc order-9 folio at indices 0-511 But I don't see where shmem currently handles that either. Maybe it falls back to order-0 folios instead of the crude BUG_ON I put in. Hugh? diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 82c1262f396f..30f2502314de 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -114,12 +114,6 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop, struct folio *shmem_read_folio_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp); -static inline struct folio *shmem_read_folio(struct address_space *mapping, - pgoff_t index) -{ - return shmem_read_folio_gfp(mapping, index, mapping_gfp_mask(mapping)); -} - static inline struct page *shmem_read_mapping_page( struct address_space *mapping, pgoff_t index) { diff --git a/mm/filemap.c b/mm/filemap.c index 57c1b154fb5a..8e4f95c5b65a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -877,6 +877,8 @@ noinline int __filemap_add_folio(struct address_space *mapping, order, gfp); xas_lock_irq(&xas); xas_for_each_conflict(&xas, entry) { + if (old) + BUG_ON(shmem_mapping(mapping)); old = entry; if (!xa_is_value(entry)) { xas_set_err(&xas, -EEXIST); @@ -885,7 +887,12 @@ noinline int __filemap_add_folio(struct address_space *mapping, } if (old) { - if (shadowp) + if (shmem_mapping(mapping)) { + folio_set_swap_entry(folio, + radix_to_swp_entry(old)); + folio_set_swapcache(folio); + folio_set_swapbacked(folio); + } else if (shadowp) *shadowp = old; /* entry may have been split before we acquired lock */ order = xa_get_order(xas.xa, xas.xa_index); diff --git a/mm/shmem.c b/mm/shmem.c index 8e60826e4246..ea75c7dcf5ec 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2059,6 +2059,18 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop, mapping_gfp_mask(inode->i_mapping), NULL, NULL, NULL); } +static int shmem_read_folio(struct file *file, struct folio *folio) +{ + if (folio_test_swapcache(folio)) { + swap_readpage(&folio->page, true, NULL); + } else { + folio_zero_segment(folio, 0, folio_size(folio)); + folio_mark_uptodate(folio); + folio_unlock(folio); + } + return 0; +} + /* * This is like autoremove_wake_function, but it removes the wait queue * entry unconditionally - even if something else had already woken the @@ -2396,7 +2408,8 @@ static int shmem_fadvise_willneed(struct address_space *mapping, xa_for_each_range(&mapping->i_pages, index, folio, start, end) { if (!xa_is_value(folio)) continue; - folio = shmem_read_folio(mapping, index); + folio = shmem_read_folio_gfp(mapping, index, + mapping_gfp_mask(mapping)); if (!IS_ERR(folio)) folio_put(folio); } @@ -4037,6 +4050,7 @@ static int shmem_error_remove_page(struct address_space *mapping, } const struct address_space_operations shmem_aops = { + .read_folio = shmem_read_folio, .writepage = shmem_writepage, .dirty_folio = noop_dirty_folio, #ifdef CONFIG_TMPFS
Linus Torvalds <torvalds@linux-foundation.org> wrote: > But please at least stop doing the > > get_page(buf->page); > > on the zero-page (which includes using no-op .get and .put functions > in zero_pipe_buf_ops(). I'll make the attached change. It seems to work. David --- diff --git a/mm/shmem.c b/mm/shmem.c index 3cbec1d56112..d9b60ab556fe 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2719,6 +2719,17 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) return retval ? retval : error; } +static bool zero_pipe_buf_get(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + return true; +} + +static void zero_pipe_buf_release(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ +} + static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { @@ -2726,9 +2737,9 @@ static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe, } static const struct pipe_buf_operations zero_pipe_buf_ops = { - .release = generic_pipe_buf_release, + .release = zero_pipe_buf_release, .try_steal = zero_pipe_buf_try_steal, - .get = generic_pipe_buf_get, + .get = zero_pipe_buf_get, }; static size_t splice_zeropage_into_pipe(struct pipe_inode_info *pipe,
diff --git a/mm/shmem.c b/mm/shmem.c index 448f393d8ab2..3cbec1d56112 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2719,6 +2719,128 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) return retval ? retval : error; } +static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + return false; +} + +static const struct pipe_buf_operations zero_pipe_buf_ops = { + .release = generic_pipe_buf_release, + .try_steal = zero_pipe_buf_try_steal, + .get = generic_pipe_buf_get, +}; + +static size_t splice_zeropage_into_pipe(struct pipe_inode_info *pipe, + loff_t fpos, size_t size) +{ + size_t offset = fpos & ~PAGE_MASK; + + size = min_t(size_t, size, PAGE_SIZE - offset); + + if (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) { + struct pipe_buffer *buf = pipe_head_buf(pipe); + + *buf = (struct pipe_buffer) { + .ops = &zero_pipe_buf_ops, + .page = ZERO_PAGE(0), + .offset = offset, + .len = size, + }; + get_page(buf->page); + pipe->head++; + } + + return size; +} + +static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos, + struct pipe_inode_info *pipe, + size_t len, unsigned int flags) +{ + struct inode *inode = file_inode(in); + struct address_space *mapping = inode->i_mapping; + struct folio *folio = NULL; + size_t total_spliced = 0, used, npages, n, part; + loff_t isize; + int error = 0; + + /* Work out how much data we can actually add into the pipe */ + used = pipe_occupancy(pipe->head, pipe->tail); + npages = max_t(ssize_t, pipe->max_usage - used, 0); + len = min_t(size_t, len, npages * PAGE_SIZE); + + do { + if (*ppos >= i_size_read(inode)) + break; + + error = shmem_get_folio(inode, *ppos / PAGE_SIZE, &folio, SGP_READ); + if (error) { + if (error == -EINVAL) + error = 0; + break; + } + if (folio) { + folio_unlock(folio); + + if (folio_test_hwpoison(folio)) { + error = -EIO; + break; + } + } + + /* + * i_size must be checked after we know the pages are Uptodate. + * + * Checking i_size after the check allows us to calculate + * the correct value for "nr", which means the zero-filled + * part of the page is not copied back to userspace (unless + * another truncate extends the file - this is desired though). + */ + isize = i_size_read(inode); + if (unlikely(*ppos >= isize)) + break; + part = min_t(loff_t, isize - *ppos, len); + + if (folio) { + /* + * If users can be writing to this page using arbitrary + * virtual addresses, take care about potential aliasing + * before reading the page on the kernel side. + */ + if (mapping_writably_mapped(mapping)) + flush_dcache_folio(folio); + folio_mark_accessed(folio); + /* + * Ok, we have the page, and it's up-to-date, so we can + * now splice it into the pipe. + */ + n = splice_folio_into_pipe(pipe, folio, *ppos, part); + folio_put(folio); + folio = NULL; + } else { + n = splice_zeropage_into_pipe(pipe, *ppos, len); + } + + if (!n) + break; + len -= n; + total_spliced += n; + *ppos += n; + in->f_ra.prev_pos = *ppos; + if (pipe_full(pipe->head, pipe->tail, pipe->max_usage)) + break; + + cond_resched(); + } while (len); + + if (folio) + folio_put(folio); + + file_accessed(in); + return total_spliced ? total_spliced : error; +} + static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) { struct address_space *mapping = file->f_mapping; @@ -3938,7 +4060,7 @@ static const struct file_operations shmem_file_operations = { .read_iter = shmem_file_read_iter, .write_iter = generic_file_write_iter, .fsync = noop_fsync, - .splice_read = generic_file_splice_read, + .splice_read = shmem_file_splice_read, .splice_write = iter_file_splice_write, .fallocate = shmem_fallocate, #endif
The new filemap_splice_read() has an implicit expectation via filemap_get_pages() that ->read_folio() exists if ->readahead() doesn't fully populate the pagecache of the file it is reading from[1], potentially leading to a jump to NULL if this doesn't exist. shmem, however, (and by extension, tmpfs, ramfs and rootfs), doesn't have ->read_folio(), Work around this by equipping shmem with its own splice-read implementation, based on filemap_splice_read(), but able to paste in zero_page when there's a page missing. Signed-off-by: David Howells <dhowells@redhat.com> cc: Daniel Golle <daniel@makrotopia.org> cc: Guenter Roeck <groeck7@gmail.com> cc: Christoph Hellwig <hch@lst.de> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: John Hubbard <jhubbard@nvidia.com> cc: David Hildenbrand <david@redhat.com> cc: Matthew Wilcox <willy@infradead.org> cc: Hugh Dickins <hughd@google.com> cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/Y+pdHFFTk1TTEBsO@makrotopia.org/ [1] --- mm/shmem.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 123 insertions(+), 1 deletion(-)