Message ID | 20241203153232.92224-13-axboe@kernel.dk (mailing list archive)
---|---
State | New
Series | Uncached buffered IO
On Tue, Dec 03, 2024 at 08:31:47AM -0700, Jens Axboe wrote:
> If RWF_UNCACHED is set for a write, mark new folios being written with
> uncached. This is done by passing in the fact that it's an uncached write
> through the folio pointer. We can only get there when IOCB_UNCACHED was
> allowed, which can only happen if the file system opts in. Opting in means
> they need to check for the LSB in the folio pointer to know if it's an
> uncached write or not. If it is, then FGP_UNCACHED should be used if
> creating new folios is necessary.
>
> Uncached writes will drop any folios they create upon writeback
> completion, but leave folios that may exist in that range alone. Since
> ->write_begin() doesn't currently take any flags, and to avoid needing
> to change the callback kernel wide, use the foliop being passed in to
> ->write_begin() to signal if this is an uncached write or not. File
> systems can then use that to mark newly created folios as uncached.
>
> This provides similar benefits to using RWF_UNCACHED with reads. Testing
> buffered writes on 32 files:
>
> writing bs 65536, uncached 0
>    1s: 196035MB/sec
>    2s: 132308MB/sec
>    3s: 132438MB/sec
>    4s: 116528MB/sec
>    5s: 103898MB/sec
>    6s: 108893MB/sec
>    7s: 99678MB/sec
>    8s: 106545MB/sec
>    9s: 106826MB/sec
>   10s: 101544MB/sec
>   11s: 111044MB/sec
>   12s: 124257MB/sec
>   13s: 116031MB/sec
>   14s: 114540MB/sec
>   15s: 115011MB/sec
>   16s: 115260MB/sec
>   17s: 116068MB/sec
>   18s: 116096MB/sec
>
> where it's quite obvious where the page cache filled, and performance
> dropped to about half of where it started, settling in at around
> 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
> reclaim pages.
>
> Running the same test with uncached buffered writes:
>
> writing bs 65536, uncached 1
>    1s: 198974MB/sec
>    2s: 189618MB/sec
>    3s: 193601MB/sec
>    4s: 188582MB/sec
>    5s: 193487MB/sec
>    6s: 188341MB/sec
>    7s: 194325MB/sec
>    8s: 188114MB/sec
>    9s: 192740MB/sec
>   10s: 189206MB/sec
>   11s: 193442MB/sec
>   12s: 189659MB/sec
>   13s: 191732MB/sec
>   14s: 190701MB/sec
>   15s: 191789MB/sec
>   16s: 191259MB/sec
>   17s: 190613MB/sec
>   18s: 191951MB/sec
>
> and the behavior is fully predictable, performing the same throughout
> even after the page cache would otherwise have fully filled with dirty
> data. It's also about 65% faster, and uses half the CPU of the system
> compared to the normal buffered write.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/fs.h      |  5 +++++
>  include/linux/pagemap.h |  9 +++++++++
>  mm/filemap.c            | 12 +++++++++++-
>  3 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 40383f5cc6a2..32255473f79d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2912,6 +2912,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
>  				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
>  		if (ret)
>  			return ret;
> +	} else if (iocb->ki_flags & IOCB_UNCACHED) {
> +		struct address_space *mapping = iocb->ki_filp->f_mapping;
> +
> +		filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
> +					      iocb->ki_pos + count);
>  	}
>
>  	return count;
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index f2d49dccb7c1..e49587c40157 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -14,6 +14,7 @@
>  #include <linux/gfp.h>
>  #include <linux/bitops.h>
>  #include <linux/hardirq.h> /* for in_interrupt() */
> +#include <linux/writeback.h>
>  #include <linux/hugetlb_inline.h>
>
>  struct folio_batch;
> @@ -70,6 +71,14 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
>  	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
>  }
>
> +/*
> + * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
> + * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
> + * must check for this and pass FGP_UNCACHED for folio creation.
> + */
> +#define foliop_uncached		((struct folio *) 0xfee1c001)
> +#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)

Honestly, I'm not a fan of foliop_uncached or foliop_is_uncached.

The first one because it's a magic value and can you guarantee that
0xfee1c001 will never be a pointer to an actual struct folio, even on
32-bit?

Second, they're both named "foliop" even though the first one doesn't
return a (struct folio **) but the second one takes that as an arg.

I think these two macros are only used for ext4 (or really, !iomap)
support, right? And that's only to avoid messing with ->write_begin?

What if you dropped ext4 support instead? :D

--D

>  /**
>   * filemap_set_wb_err - set a writeback error on an address_space
>   * @mapping: mapping in which to set writeback error
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 826df99e294f..00f3c6c58629 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -4095,7 +4095,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>  	ssize_t written = 0;
>
>  	do {
> -		struct folio *folio;
> +		struct folio *folio = NULL;
>  		size_t offset;		/* Offset into folio */
>  		size_t bytes;		/* Bytes to write to folio */
>  		size_t copied;		/* Bytes copied from user */
> @@ -4123,6 +4123,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>  			break;
>  		}
>
> +		/*
> +		 * If IOCB_UNCACHED is set here, we know the file system
> +		 * supports it. And hence it'll know to check foliop for being
> +		 * set to this magic value. If so, it's an uncached write.
> +		 * Whenever ->write_begin() changes prototypes again, this
> +		 * can go away and just pass iocb or iocb flags.
> +		 */
> +		if (iocb->ki_flags & IOCB_UNCACHED)
> +			folio = foliop_uncached;
> +
>  		status = a_ops->write_begin(file, mapping, pos, bytes,
>  					    &folio, &fsdata);
>  		if (unlikely(status < 0))
> --
> 2.45.2
On 12/6/24 10:17 AM, Darrick J. Wong wrote:
> On Tue, Dec 03, 2024 at 08:31:47AM -0700, Jens Axboe wrote:
>> If RWF_UNCACHED is set for a write, mark new folios being written with
>> uncached. This is done by passing in the fact that it's an uncached write
>> through the folio pointer. We can only get there when IOCB_UNCACHED was
>> allowed, which can only happen if the file system opts in. Opting in means
>> they need to check for the LSB in the folio pointer to know if it's an
>> uncached write or not. If it is, then FGP_UNCACHED should be used if
>> creating new folios is necessary.
>>
>> Uncached writes will drop any folios they create upon writeback
>> completion, but leave folios that may exist in that range alone. Since
>> ->write_begin() doesn't currently take any flags, and to avoid needing
>> to change the callback kernel wide, use the foliop being passed in to
>> ->write_begin() to signal if this is an uncached write or not. File
>> systems can then use that to mark newly created folios as uncached.
>>
>> This provides similar benefits to using RWF_UNCACHED with reads. Testing
>> buffered writes on 32 files:
>>
>> writing bs 65536, uncached 0
>>    1s: 196035MB/sec
>>    2s: 132308MB/sec
>>    3s: 132438MB/sec
>>    4s: 116528MB/sec
>>    5s: 103898MB/sec
>>    6s: 108893MB/sec
>>    7s: 99678MB/sec
>>    8s: 106545MB/sec
>>    9s: 106826MB/sec
>>   10s: 101544MB/sec
>>   11s: 111044MB/sec
>>   12s: 124257MB/sec
>>   13s: 116031MB/sec
>>   14s: 114540MB/sec
>>   15s: 115011MB/sec
>>   16s: 115260MB/sec
>>   17s: 116068MB/sec
>>   18s: 116096MB/sec
>>
>> where it's quite obvious where the page cache filled, and performance
>> dropped to about half of where it started, settling in at around
>> 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
>> reclaim pages.
>>
>> Running the same test with uncached buffered writes:
>>
>> writing bs 65536, uncached 1
>>    1s: 198974MB/sec
>>    2s: 189618MB/sec
>>    3s: 193601MB/sec
>>    4s: 188582MB/sec
>>    5s: 193487MB/sec
>>    6s: 188341MB/sec
>>    7s: 194325MB/sec
>>    8s: 188114MB/sec
>>    9s: 192740MB/sec
>>   10s: 189206MB/sec
>>   11s: 193442MB/sec
>>   12s: 189659MB/sec
>>   13s: 191732MB/sec
>>   14s: 190701MB/sec
>>   15s: 191789MB/sec
>>   16s: 191259MB/sec
>>   17s: 190613MB/sec
>>   18s: 191951MB/sec
>>
>> and the behavior is fully predictable, performing the same throughout
>> even after the page cache would otherwise have fully filled with dirty
>> data. It's also about 65% faster, and uses half the CPU of the system
>> compared to the normal buffered write.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  include/linux/fs.h      |  5 +++++
>>  include/linux/pagemap.h |  9 +++++++++
>>  mm/filemap.c            | 12 +++++++++++-
>>  3 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 40383f5cc6a2..32255473f79d 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2912,6 +2912,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
>>  				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
>>  		if (ret)
>>  			return ret;
>> +	} else if (iocb->ki_flags & IOCB_UNCACHED) {
>> +		struct address_space *mapping = iocb->ki_filp->f_mapping;
>> +
>> +		filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
>> +					      iocb->ki_pos + count);
>>  	}
>>
>>  	return count;
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index f2d49dccb7c1..e49587c40157 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -14,6 +14,7 @@
>>  #include <linux/gfp.h>
>>  #include <linux/bitops.h>
>>  #include <linux/hardirq.h> /* for in_interrupt() */
>> +#include <linux/writeback.h>
>>  #include <linux/hugetlb_inline.h>
>>
>>  struct folio_batch;
>> @@ -70,6 +71,14 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
>>  	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
>>  }
>>
>> +/*
>> + * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
>> + * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
>> + * must check for this and pass FGP_UNCACHED for folio creation.
>> + */
>> +#define foliop_uncached		((struct folio *) 0xfee1c001)
>> +#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)
>
> Honestly, I'm not a fan of foliop_uncached or foliop_is_uncached.

It definitely is what I would elegantly refer to as somewhat of a
hack... But it's not _that_ bad imho.

> The first one because it's a magic value and can you guarantee that
> 0xfee1c001 will never be a pointer to an actual struct folio, even on
> 32-bit?

I don't think that should be possible, since it's deliberately 1 at the
end. A struct like folio (or anything else) should at least be sizeof
aligned, and this one is not.

> Second, they're both named "foliop" even though the first one doesn't
> return a (struct folio **) but the second one takes that as an arg.

I just named them as such since they only deal with the folio ** that
is being passed in. I can certainly rename the second one to
folio_uncached, that would be an improvement I think. Thanks!

> I think these two macros are only used for ext4 (or really, !iomap)
> support, right? And that's only to avoid messing with ->write_begin?

Indeed, ideally we'd change ->write_begin() instead. And that probably
should still be done, I just did not want to deal with that nightmare
in terms of managing the patchset. And honestly I think it'd be OK to
defer that part until ->write_begin() needs to be changed for other
reasons, it's a lot of churn just for this particular thing and dealing
with the magic pointer value (at least to me) is liveable.

> What if you dropped ext4 support instead? :D

Hah, yes obviously that'd be a solution, then I'd need to drop btrfs as
well. And I would kind of prefer not doing that ;-)
On Fri, Dec 06, 2024 at 11:22:55AM -0700, Jens Axboe wrote:
> > Honestly, I'm not a fan of foliop_uncached or foliop_is_uncached.
>
> It definitely is what I would elegantly refer to as somewhat of a
> hack... But it's not _that_ bad imho.

It's pretty horrible actually.

> > I think these two macros are only used for ext4 (or really, !iomap)
> > support, right? And that's only to avoid messing with ->write_begin?
>
> Indeed, ideally we'd change ->write_begin() instead. And that probably
> should still be done, I just did not want to deal with that nightmare
> in terms of managing the patchset. And honestly I think it'd be OK to
> defer that part until ->write_begin() needs to be changed for other
> reasons, it's a lot of churn just for this particular thing and
> dealing with the magic pointer value (at least to me) is liveable.

->write_begin() really should just go away, it is a horrible interface.
Note that in the past it actually had a flags argument, but that got
killed a while ago.

> > What if you dropped ext4 support instead? :D
>
> Hah, yes obviously that'd be a solution, then I'd need to drop btrfs
> as well. And I would kind of prefer not doing that ;-)

Btrfs doesn't need it. In fact the code would be cleaner and do less
work without it, see the patch below. And for ext4 there already is an
iomap conversion patch series on the list that just needs more review,
so skipping it here and growing the uncached support through that
sounds sensible.

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 444bee219b4e..db3a3c7ecf77 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -894,20 +894,17 @@ static gfp_t get_prepare_gfp_flags(struct inode *inode, bool nowait)
  */
 static noinline int prepare_one_folio(struct inode *inode, struct folio **folio_ret,
 				      loff_t pos, size_t write_bytes,
-				      bool force_uptodate, bool nowait)
+				      bool force_uptodate, fgf_t fgp_flags)
 {
 	unsigned long index = pos >> PAGE_SHIFT;
-	gfp_t mask = get_prepare_gfp_flags(inode, nowait);
-	fgf_t fgp_flags = (nowait ? FGP_WRITEBEGIN | FGP_NOWAIT : FGP_WRITEBEGIN);
+	gfp_t mask = get_prepare_gfp_flags(inode, fgp_flags & FGP_NOWAIT);
 	struct folio *folio;
 	int ret = 0;

-	if (foliop_is_uncached(folio_ret))
-		fgp_flags |= FGP_UNCACHED;
 again:
 	folio = __filemap_get_folio(inode->i_mapping, index, fgp_flags, mask);
 	if (IS_ERR(folio)) {
-		if (nowait)
+		if (fgp_flags & FGP_NOWAIT)
 			ret = -EAGAIN;
 		else
 			ret = PTR_ERR(folio);
@@ -925,7 +922,7 @@ static noinline int prepare_one_folio(struct inode *inode, struct folio **folio_
 	if (ret) {
 		/* The folio is already unlocked. */
 		folio_put(folio);
-		if (!nowait && ret == -EAGAIN) {
+		if (!(fgp_flags & FGP_NOWAIT) && ret == -EAGAIN) {
 			ret = 0;
 			goto again;
 		}
@@ -1135,9 +1132,15 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 	const bool nowait = (iocb->ki_flags & IOCB_NOWAIT);
 	unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0);
 	bool only_release_metadata = false;
+	fgf_t fgp_flags = FGP_WRITEBEGIN;

-	if (nowait)
+	if (nowait) {
 		ilock_flags |= BTRFS_ILOCK_TRY;
+		fgp_flags |= FGP_NOWAIT;
+	}
+
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		fgp_flags |= FGP_UNCACHED;

 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
@@ -1232,11 +1235,8 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 			break;
 		}

-		if (iocb->ki_flags & IOCB_UNCACHED)
-			folio = foliop_uncached;
-
 		ret = prepare_one_folio(inode, &folio, pos, write_bytes,
-					force_page_uptodate, false);
+					force_page_uptodate, fgp_flags);
 		if (ret) {
 			btrfs_delalloc_release_extents(BTRFS_I(inode),
 						       reserve_bytes);
On 12/10/24 4:31 AM, Christoph Hellwig wrote:
> On Fri, Dec 06, 2024 at 11:22:55AM -0700, Jens Axboe wrote:
>>> Honestly, I'm not a fan of foliop_uncached or foliop_is_uncached.
>>
>> It definitely is what I would elegantly refer to as somewhat of a
>> hack... But it's not _that_ bad imho.
>
> It's pretty horrible actually.

Tell us how you really feel :-)

>>> I think these two macros are only used for ext4 (or really, !iomap)
>>> support, right? And that's only to avoid messing with ->write_begin?
>>
>> Indeed, ideally we'd change ->write_begin() instead. And that probably
>> should still be done, I just did not want to deal with that nightmare
>> in terms of managing the patchset. And honestly I think it'd be OK to
>> defer that part until ->write_begin() needs to be changed for other
>> reasons, it's a lot of churn just for this particular thing and
>> dealing with the magic pointer value (at least to me) is liveable.
>
> ->write_begin() really should just go away, it is a horrible interface.
> Note that in the past it actually had a flags argument, but that got
> killed a while ago.
>
>>> What if you dropped ext4 support instead? :D
>>
>> Hah, yes obviously that'd be a solution, then I'd need to drop btrfs
>> as well. And I would kind of prefer not doing that ;-)
>
> Btrfs doesn't need it. In fact the code would be cleaner and do less
> work without it, see the patch below. And for ext4 there already is an
> iomap conversion patch series on the list that just needs more review,
> so skipping it here and growing the uncached support through that
> sounds sensible.

I can certainly defer the ext4 series if the below sorts out btrfs, if
that iomap conversion series is making progress. Don't have an issue
slotting behind that. I'll check and test your btrfs tweak, thanks!