Message ID | 20191217143948.26380-2-axboe@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Series | Support for RWF_UNCACHED |
On 12/17/19 3:39 PM, Jens Axboe wrote:
> If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll use private
> pages for the buffered reads. These pages will never be inserted into
> the page cache, and they are simply droped when we have done the copy at
> the end of IO.
>
> If pages in the read range are already in the page cache, then use those
> for just copying the data instead of starting IO on private pages.
>
> A previous solution used the page cache even for non-cached ranges, but
> the cost of doing so was too high. Removing nodes at the end is
> expensive, even with LRU bypass. On top of that, repeatedly
> instantiating new xarray nodes is very costly, as it needs to memset 576
> bytes of data, and freeing said nodes involve an RCU call per node as
> well. All that adds up, making uncached somewhat slower than O_DIRECT.
>
> With the current*solition*, we're basically at O_DIRECT levels of

Maybe it is 'solution' here.

Thanks,
Guoqing
On Tue, Dec 17, 2019 at 07:39:43AM -0700, Jens Axboe wrote:
> +static void buffered_put_page(struct page *page, bool clear_mapping)
> +{
> +	if (clear_mapping)
> +		page->mapping = NULL;
> +	put_page(page);
> +}

I'm not a huge fan of the variable name 'clear_mapping'. It describes
what it does rather than why it does it. So maybe 'drop_immediate'?
Or 'uncached'?

> 	if (!page) {
> 		if (iocb->ki_flags & IOCB_NOWAIT)
> 			goto would_block;
> +		/* UNCACHED implies no read-ahead */
> +		if (iocb->ki_flags & IOCB_UNCACHED) {
> +			did_dio_begin = true;
> +			/* block truncate for UNCACHED reads */
> +			inode_dio_begin(inode);

I think this needs to be:

	if (!did_dio_begin)
		inode_dio_begin(inode);
	did_dio_begin = true;

otherwise inode->i_dio_count is going to be increased once per uncached
page. Do you have a test in your test-suite that does I/O to more than
one page at a time?
On 12/17/19 8:57 AM, Matthew Wilcox wrote:
> On Tue, Dec 17, 2019 at 07:39:43AM -0700, Jens Axboe wrote:
>> +static void buffered_put_page(struct page *page, bool clear_mapping)
>> +{
>> +	if (clear_mapping)
>> +		page->mapping = NULL;
>> +	put_page(page);
>> +}
>
> I'm not a huge fan of the variable name 'clear_mapping'. It describes
> what it does rather than why it does it. So maybe 'drop_immediate'?
> Or 'uncached'?

I do like 'uncached' a lot better, I've made that change.

> I think this needs to be:
>
>	if (!did_dio_begin)
>		inode_dio_begin(inode);
>	did_dio_begin = true;
>
> otherwise inode->i_dio_count is going to be increased once per uncached
> page. Do you have a test in your test-suite that does I/O to more than
> one page at a time?

Good catch! Yes it does, I have fixed that up. Thanks.
On 12/17/19 8:16 AM, Guoqing Jiang wrote:
>
> On 12/17/19 3:39 PM, Jens Axboe wrote:
>> If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll use private
>> pages for the buffered reads. These pages will never be inserted into
>> the page cache, and they are simply droped when we have done the copy at
>> the end of IO.
>>
>> If pages in the read range are already in the page cache, then use those
>> for just copying the data instead of starting IO on private pages.
>>
>> A previous solution used the page cache even for non-cached ranges, but
>> the cost of doing so was too high. Removing nodes at the end is
>> expensive, even with LRU bypass. On top of that, repeatedly
>> instantiating new xarray nodes is very costly, as it needs to memset 576
>> bytes of data, and freeing said nodes involve an RCU call per node as
>> well. All that adds up, making uncached somewhat slower than O_DIRECT.
>>
>> With the current*solition*, we're basically at O_DIRECT levels of
>
> Maybe it is 'solution' here.

Indeed, fixed up, thanks.
On Tue, Dec 17, 2019 at 07:39:43AM -0700, Jens Axboe wrote:
> If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll use private
> pages for the buffered reads. These pages will never be inserted into
> the page cache, and they are simply droped when we have done the copy at
> the end of IO.

Ok, so unlike the uncached write case, this /isn't/ coherent with the
page cache. IOWs, it can race with other reads and page faults, and mmap
can change the data it has faulted into the page cache and write it back
before the RWF_UNCACHED read completes, resulting in RWF_UNCACHED
potentially returning torn data.

And it's not coherent with uncached writes, either, if the filesystem
does not provide its own serialisation between buffered reads and
writes. i.e. simple/legacy filesystems just use the page lock to
serialise buffered reads against buffered writes, while buffered writes
are serialised against each other via the inode_lock() in
generic_file_write_iter().

Further, the use of inode_dio_wait() for truncate serialisation is
optional for filesystems - it's not a generic mechanism. Filesystems
that only support buffered IO only use page locks to provide truncate
serialisation and don't call inode_dio_wait() in their ->setattr
methods. Hence to serialise truncates against uncached reads in such
filesystems, the uncached read needs to be page cache coherent.

As I said previously: I don't think this is a viable approach because
it has page cache coherency issues that are just as bad, if not worse,
than direct IO.

> If pages in the read range are already in the page cache, then use those
> for just copying the data instead of starting IO on private pages.
>
> A previous solution used the page cache even for non-cached ranges, but
> the cost of doing so was too high. Removing nodes at the end is
> expensive, even with LRU bypass.

If you want to bypass the page cache overhead altogether, then use
direct IO. We should not make the same mistakes as O_DIRECT for the
same reasons (performance!). Instead, once we have the page cache
coherent path working, we should then work to optimise it with better
page cache insert/remove primitives to lower the overhead.

> On top of that, repeatedly
> instantiating new xarray nodes is very costly, as it needs to memset 576
> bytes of data, and freeing said nodes involve an RCU call per node as
> well. All that adds up, making uncached somewhat slower than O_DIRECT.

I think some of that has to do with implementation details and the fact
you appear to be focussing on PAGE_SIZE IOs. This means you are
instantiating and removing a single cached page at a time from the
mapping tree, and that means we need to allocate and free an xarray
node per IO.

I would say this is a future Xarray optimisation target, not a reason
for introducing a new incoherent IO API we'll have to support long
after we've fixed the xarray inefficiency....

> @@ -2048,6 +2057,13 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
> 	if (!page) {
> 		if (iocb->ki_flags & IOCB_NOWAIT)
> 			goto would_block;
> +		/* UNCACHED implies no read-ahead */
> +		if (iocb->ki_flags & IOCB_UNCACHED) {
> +			did_dio_begin = true;
> +			/* block truncate for UNCACHED reads */
> +			inode_dio_begin(inode);
> +			goto no_cached_page;
> +		}

Ok, so for every page we don't find in the cache, we go issue IO. We
also call inode_dio_begin() on every page we miss, but only call
inode_dio_end() a single time. So we leak i_dio_count on every IO that
needs to read more than a single page, which means we'll hang in the
next call to inode_dio_wait()...

> no_cached_page:
> @@ -2234,6 +2250,14 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
> 		error = -ENOMEM;
> 		goto out;
> 	}
> +	if (iocb->ki_flags & IOCB_UNCACHED) {
> +		__SetPageLocked(page);
> +		page->mapping = mapping;
> +		page->index = index;
> +		clear_mapping = true;
> +		goto readpage;
> +	}

And we go through the ->readpage path, which means we are building and
issuing a single bio per page we miss. This is highly inefficient, and
relies on bio merging to form large IOs in the block layer. Sure, you
won't notice the impact if all you do is PAGE_SIZE read() calls, but if
you are doing multi-megabyte IOs because you have spinning rust storage
then it will make a big difference.

IOWs, this does not take advantage of either the mpage_readpages or the
iomap_readpages many-pages-to-a-bio read optimisations that the
readahead path gives us, so there's more CPU and IO overhead from this
RWF_UNCACHED path than there is from the normal readahead based
buffered IO path.

Oh, and if we have random pages in the cache, this single-page-at-a-time
approach will break up large sequential IOs into smaller, non-sequential
IOs that will be dispatched separately. It's much more CPU efficient to
do a single large IO that pulls new data into the random page in the
middle of the range than it is to build, issue and complete two separate
IOs. It's also much more friendly to spinning rust to do a single large
IO than two small separated-by-a-few-sectors IOs...

IMO, this looks like an implementation hyper-focussed on brute
performance of PAGE_SIZE IOs, and it compromises on coherency and
efficiency to attain that performance. Quite frankly, if performance is
so critical that you need to compromise the IO path in this way, then
just use direct IO.

Let's get a sane, clean, efficient page cache coherent read IO path in
place for RWF_UNCACHED, then optimise it for performance. If it's done
right, then the page cache/xarray insert/remove overhead should largely
disappear in the noise.

Cheers,

Dave.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349adb52..092ea2a4319b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_UNCACHED		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3418,6 +3419,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
 	if (flags & RWF_APPEND)
 		ki->ki_flags |= IOCB_APPEND;
+	if (flags & RWF_UNCACHED)
+		ki->ki_flags |= IOCB_UNCACHED;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..357ebb0e0c5d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -299,8 +299,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* drop cache after reading or writing data */
+#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000040)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_UNCACHED)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..7ddc4d8386cf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1990,6 +1990,13 @@ static void shrink_readahead_size_eio(struct file *filp,
 	ra->ra_pages /= 4;
 }
 
+static void buffered_put_page(struct page *page, bool clear_mapping)
+{
+	if (clear_mapping)
+		page->mapping = NULL;
+	put_page(page);
+}
+
 /**
  * generic_file_buffered_read - generic file read routine
  * @iocb:	the iocb to read
@@ -2013,6 +2020,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
 	struct file_ra_state *ra = &filp->f_ra;
+	bool did_dio_begin = false;
 	loff_t *ppos = &iocb->ki_pos;
 	pgoff_t index;
 	pgoff_t last_index;
@@ -2032,6 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	offset = *ppos & ~PAGE_MASK;
 
 	for (;;) {
+		bool clear_mapping = false;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -2048,6 +2057,13 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (!page) {
 			if (iocb->ki_flags & IOCB_NOWAIT)
 				goto would_block;
+			/* UNCACHED implies no read-ahead */
+			if (iocb->ki_flags & IOCB_UNCACHED) {
+				did_dio_begin = true;
+				/* block truncate for UNCACHED reads */
+				inode_dio_begin(inode);
+				goto no_cached_page;
+			}
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -2106,7 +2122,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			put_page(page);
+			buffered_put_page(page, clear_mapping);
 			goto out;
 		}
 
@@ -2115,7 +2131,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_MASK) + 1;
 			if (nr <= offset) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				goto out;
 			}
 		}
@@ -2147,7 +2163,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		offset &= ~PAGE_MASK;
 		prev_offset = offset;
 
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		written += ret;
 		if (!iov_iter_count(iter))
 			goto out;
@@ -2189,7 +2205,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		if (unlikely(error)) {
 			if (error == AOP_TRUNCATED_PAGE) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				error = 0;
 				goto find_page;
 			}
@@ -2206,7 +2222,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			 * invalidate_mapping_pages got it
 			 */
 			unlock_page(page);
-			put_page(page);
+			buffered_put_page(page, clear_mapping);
 			goto find_page;
 		}
 		unlock_page(page);
@@ -2221,7 +2237,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		goto out;
 
 no_cached_page:
@@ -2234,6 +2250,14 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			error = -ENOMEM;
 			goto out;
 		}
+		if (iocb->ki_flags & IOCB_UNCACHED) {
+			__SetPageLocked(page);
+			page->mapping = mapping;
+			page->index = index;
+			clear_mapping = true;
+			goto readpage;
+		}
+
 		error = add_to_page_cache_lru(page, mapping, index,
 				mapping_gfp_constraint(mapping, GFP_KERNEL));
 		if (error) {
@@ -2250,6 +2274,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 would_block:
 	error = -EAGAIN;
 out:
+	if (did_dio_begin)
+		inode_dio_end(inode);
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_SHIFT;
 	ra->prev_pos |= prev_offset;
If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll use private
pages for the buffered reads. These pages will never be inserted into
the page cache, and they are simply droped when we have done the copy at
the end of IO.

If pages in the read range are already in the page cache, then use those
for just copying the data instead of starting IO on private pages.

A previous solution used the page cache even for non-cached ranges, but
the cost of doing so was too high. Removing nodes at the end is
expensive, even with LRU bypass. On top of that, repeatedly
instantiating new xarray nodes is very costly, as it needs to memset 576
bytes of data, and freeing said nodes involve an RCU call per node as
well. All that adds up, making uncached somewhat slower than O_DIRECT.

With the current solition, we're basically at O_DIRECT levels of
performance for RWF_UNCACHED IO.

Protect against truncate the same way O_DIRECT does, by calling
inode_dio_begin() to elevate the inode->i_dio_count.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      |  3 +++
 include/uapi/linux/fs.h |  5 ++++-
 mm/filemap.c            | 38 ++++++++++++++++++++++++++++++------
 3 files changed, 39 insertions(+), 7 deletions(-)