Message ID | 20220621062627.2637632-1-hch@lst.de (mailing list archive)
---|---
State | New, archived
Series | [v2] btrfs: don't limit direct reads to a single sector
On 2022/6/21 14:26, Christoph Hellwig wrote:
> Btrfs currently limits direct I/O reads to a single sector, which goes
> back to commit c329861da406 ("Btrfs: don't allocate a separate csums
> array for direct reads") from Josef. That commit changed the direct I/O
> code to ".. use the private part of the io_tree for our csums.", but ten
> years later that isn't how checksums for direct reads work; instead they
> use a csums allocation on a per-btrfs_dio_private basis (which has its
> own performance problem for small I/O, but that will be addressed later).
>
> There is no fundamental limit in btrfs itself on the I/O size except
> for the size of the checksum array, which scales linearly with the
> number of sectors in an I/O. Pick a somewhat arbitrary limit of 256
> sectors, which matches what buffered reads typically see as the upper
> limit, as the limit for direct I/O as well.
>
> This significantly improves direct read performance. For example a fio
> run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
> throughput:
>
> Baseline:
>
>    READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
>
> With this patch:
>
>    READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

The patch itself looks good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

But the extra comment itself can be confusing.

> ---
>
> Changes since v1:
>  - keep a (large) upper limit on the I/O size.
> fs/btrfs/inode.c   | 6 +++++-
> fs/btrfs/volumes.h | 7 +++++++
> 2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 33ba4d22e1430..f6dc6e8c54e3a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7592,8 +7592,12 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>  	const u64 data_alloc_len = length;
>  	bool unlock_extents = false;
>
> +	/*
> +	 * Cap the size of reads to that usually seen in buffered I/O as we need
> +	 * to allocate a contiguous array for the checksums.
> +	 */
>  	if (!write)
> -		len = min_t(u64, len, fs_info->sectorsize);
> +		len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_SECTORS);

It's better just to craft a manual limit, as BTRFS_MAX_SECTORS doesn't
really match the real buffered read size (commented below).

Here I believe we can afford an artificial limit; there is no need to
align with any existing limit.

>
>  	lockstart = start;
>  	lockend = start + len - 1;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index b61508723d5d2..5f2cea9a44860 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -354,6 +354,13 @@ struct btrfs_fs_devices {
>  					- 2 * sizeof(struct btrfs_chunk))	\
>  					/ sizeof(struct btrfs_stripe) + 1)
>
> +/*
> + * Maximum number of sectors for a single bio to limit the size of the
> + * checksum array. This matches the number of bio_vecs per bio and thus the
> + * I/O size for buffered I/O.
> + */
> +#define BTRFS_MAX_SECTORS 256

In fact, one bio vector can point to multiple pages, as its bv_len can
be larger than PAGE_SIZE. And I have observed bios larger than 1 MiB
(4 MiB or 16 MiB, IIRC).

Thus the comment itself is unreliable; just putting SZ_1MiB as the
upper limit would be enough.

Thanks,
Qu

> +
>  /*
>   * Additional info to pass along bio.
>   *
On Tue, Jun 21, 2022 at 02:35:55PM +0800, Qu Wenruo wrote:
> In fact, one bio vector can point to multiple pages, as its bv_len can
> be larger than PAGE_SIZE.

Yes, it can. But it usually doesn't, as that requires contiguous memory.
Without the large folio support now merged for xfs/iomap that is very
unusual and tends to happen only soon after booting. At that point
allocating the larger csums array is also not a problem, as we can find
contiguous memory for that easily as well. For direct I/O, on the other
hand, the destination could be THPs or hugetlb pages even when memory is
badly fragmented.
On 2022/6/21 14:40, Christoph Hellwig wrote:
> On Tue, Jun 21, 2022 at 02:35:55PM +0800, Qu Wenruo wrote:
>> In fact, one bio vector can point to multiple pages, as its bv_len can
>> be larger than PAGE_SIZE.
>
> Yes, it can. But it usually doesn't, as that requires contiguous memory.

That's true.

> Without the large folio support now merged for xfs/iomap that is very
> unusual and tends to happen only soon after booting.

A little off-topic here, but what does the extra XFS/iomap work add
here? IIRC the multi-page bio vector has been there for years.

As long as the pages in the page cache are contiguous, bio_add_page()
will create such a multi-page vector, without any extra support from
the fs.

> At that point allocating the larger csums array is also not a problem,
> as we can find contiguous memory for that easily as well. For direct
> I/O, on the other hand, the destination could be THPs or hugetlb pages
> even when memory is badly fragmented.

My point here is just that there is no need to align with any existing
limit. Picking a good enough value here is enough; otherwise we need to
dig way deeper to see if the MAX_SECTORS based one is really correct or
not.

Thanks,
Qu
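Qu's point above, that a bio's vector count does not bound its byte count, can be seen with a toy userspace model (the names and constants here are illustrative, not the kernel's actual definitions):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_MODEL 4096u	/* illustrative page size */

/*
 * Toy model of a bio: nr_vecs bio_vecs, each carrying bv_len bytes.
 * Because bv_len may span several physically contiguous pages, a
 * 256-vec bio is not limited to 256 pages (1 MiB at 4 KiB pages).
 */
static uint64_t bio_bytes(uint32_t nr_vecs, uint32_t bv_len)
{
	return (uint64_t)nr_vecs * bv_len;
}
```

With single-page vecs a 256-vec bio carries 1 MiB, but with 16-page vecs the same 256 vecs carry 16 MiB, matching the larger-than-1-MiB bios Qu reports observing.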
On Tue, Jun 21, 2022 at 02:55:06PM +0800, Qu Wenruo wrote:
> A little off-topic here, but what does the extra XFS/iomap work add
> here?
>
> IIRC the multi-page bio vector has been there for years.

Yes.

> As long as the pages in the page cache are contiguous, bio_add_page()
> will create such a multi-page vector, without any extra support from
> the fs.

Yes. Every file system supports that case right now, but that only
covers the case where pages are contiguous by luck because the
allocations worked out that way. The iomap THP support is all about
having iomap (and, to a small extent, the file system, which for now is
just xfs) be able to deal with large folios, that is, > PAGE_SIZE
compound allocations.

>> At that point allocating the larger csums array is also not a
>> problem, as we can find contiguous memory for that easily as well.
>> For direct I/O, on the other hand, the destination could be THPs or
>> hugetlb pages even when memory is badly fragmented.
>
> My point here is just that there is no need to align with any existing
> limit.
>
> Picking a good enough value here is enough; otherwise we need to dig
> way deeper to see if the MAX_SECTORS based one is really correct or
> not.

So my plan here was to also enforce the limit for buffered I/O and
provide a mempool for the maximum allocation size instead of just
failing under memory pressure as we do right now. But there are another
series or two I need to finish first to get the buffered I/O code into
a shape where this can be easily added.
On 2022/6/21 14:59, Christoph Hellwig wrote:
> On Tue, Jun 21, 2022 at 02:55:06PM +0800, Qu Wenruo wrote:
>> A little off-topic here, but what does the extra XFS/iomap work add
>> here?
>>
>> IIRC the multi-page bio vector has been there for years.
>
> Yes.
>
>> As long as the pages in the page cache are contiguous, bio_add_page()
>> will create such a multi-page vector, without any extra support from
>> the fs.
>
> Yes. Every file system supports that case right now, but that only
> covers the case where pages are contiguous by luck because the
> allocations worked out that way. The iomap THP support is all about
> having iomap (and, to a small extent, the file system, which for now
> is just xfs) be able to deal with large folios, that is, > PAGE_SIZE
> compound allocations.

That sounds interesting.

>>> At that point allocating the larger csums array is also not a
>>> problem, as we can find contiguous memory for that easily as well.
>>> For direct I/O, on the other hand, the destination could be THPs or
>>> hugetlb pages even when memory is badly fragmented.
>>
>> My point here is just that there is no need to align with any
>> existing limit.
>>
>> Picking a good enough value here is enough; otherwise we need to dig
>> way deeper to see if the MAX_SECTORS based one is really correct or
>> not.
>
> So my plan here was to also enforce the limit for buffered I/O and
> provide a mempool for the maximum allocation size instead of just
> failing under memory pressure as we do right now.

Not sure if a mempool is really needed for this case, but since I have
seen memory allocation failures for the extent io tree (which directly
lead to a BUG_ON()), I guess it will make sense.

> But there are another series or two I need to finish first to get the
> buffered I/O code into a shape where this can be easily added.

More and more pending patches, and I never see an end...

Thanks,
Qu
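The mempool idea discussed above, guaranteeing that an allocation up to a known maximum never fails by keeping a pre-allocated reserve, can be modeled in userspace. This is only a sketch of the concept: the kernel's mempool API works differently in detail, and the size constant is an assumption (256 sectors times a 4-byte crc32c checksum):

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CSUM_BYTES 1024	/* assumed worst case: 256 sectors * 4 bytes */

static unsigned char reserve[MAX_CSUM_BYTES];	/* pre-allocated fallback */
static int reserve_in_use;

/*
 * Never fails for sizes up to MAX_CSUM_BYTES: when the normal
 * allocation fails (modeling memory pressure), hand out the reserve
 * buffer instead of returning NULL.
 */
static void *csum_alloc(size_t n)
{
	void *p;

	if (n > MAX_CSUM_BYTES)
		return NULL;	/* caller must respect the per-I/O cap */
	p = malloc(n);
	if (!p && !reserve_in_use) {
		reserve_in_use = 1;
		p = reserve;
	}
	return p;
}

static void csum_free(void *p)
{
	if (p == (void *)reserve)
		reserve_in_use = 0;
	else
		free(p);
}
```

This is why capping the I/O size matters for the plan: a reserve can only be pre-sized if there is a known maximum allocation.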
On 21.06.22 9:26, Christoph Hellwig wrote:
> Btrfs currently limits direct I/O reads to a single sector, which goes
> back to commit c329861da406 ("Btrfs: don't allocate a separate csums
> array for direct reads") from Josef. That commit changed the direct I/O
> code to ".. use the private part of the io_tree for our csums.", but ten
> years later that isn't how checksums for direct reads work; instead they
> use a csums allocation on a per-btrfs_dio_private basis (which has its
> own performance problem for small I/O, but that will be addressed later).
>
> There is no fundamental limit in btrfs itself on the I/O size except
> for the size of the checksum array, which scales linearly with the
> number of sectors in an I/O. Pick a somewhat arbitrary limit of 256
> sectors, which matches what buffered reads typically see as the upper
> limit, as the limit for direct I/O as well.
>
> This significantly improves direct read performance. For example a fio
> run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
> throughput:
>
> Baseline:
>
>    READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
>
> With this patch:
>
>    READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---

LGTM. However, I think a more appropriate subject would be "increase
direct read size limit to 256 sectors".

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

>
> Changes since v1:
>  - keep a (large) upper limit on the I/O size.
> fs/btrfs/inode.c   | 6 +++++-
> fs/btrfs/volumes.h | 7 +++++++
> 2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 33ba4d22e1430..f6dc6e8c54e3a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7592,8 +7592,12 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>  	const u64 data_alloc_len = length;
>  	bool unlock_extents = false;
>
> +	/*
> +	 * Cap the size of reads to that usually seen in buffered I/O as we need
> +	 * to allocate a contiguous array for the checksums.
> +	 */
>  	if (!write)
> -		len = min_t(u64, len, fs_info->sectorsize);
> +		len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_SECTORS);
>
>  	lockstart = start;
>  	lockend = start + len - 1;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index b61508723d5d2..5f2cea9a44860 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -354,6 +354,13 @@ struct btrfs_fs_devices {
>  					- 2 * sizeof(struct btrfs_chunk))	\
>  					/ sizeof(struct btrfs_stripe) + 1)
>
> +/*
> + * Maximum number of sectors for a single bio to limit the size of the
> + * checksum array. This matches the number of bio_vecs per bio and thus the
> + * I/O size for buffered I/O.
> + */
> +#define BTRFS_MAX_SECTORS 256
> +
>  /*
>   * Additional info to pass along bio.
>   *
On Tue, Jun 21, 2022 at 08:26:27AM +0200, Christoph Hellwig wrote:
> Btrfs currently limits direct I/O reads to a single sector, which goes
> back to commit c329861da406 ("Btrfs: don't allocate a separate csums
> array for direct reads") from Josef. That commit changed the direct I/O
> code to ".. use the private part of the io_tree for our csums.", but ten
> years later that isn't how checksums for direct reads work; instead they
> use a csums allocation on a per-btrfs_dio_private basis (which has its
> own performance problem for small I/O, but that will be addressed later).
>
> There is no fundamental limit in btrfs itself on the I/O size except
> for the size of the checksum array, which scales linearly with the
> number of sectors in an I/O. Pick a somewhat arbitrary limit of 256
> sectors, which matches what buffered reads typically see as the upper
> limit, as the limit for direct I/O as well.
>
> This significantly improves direct read performance. For example a fio
> run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
> throughput:
>
> Baseline:
>
>    READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
>
> With this patch:
>
>    READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---

Added to misc-next, thanks. I've renamed the constant to
BTRFS_MAX_BIO_SECTORS and updated the subject per Nikolay's suggestion.
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 33ba4d22e1430..f6dc6e8c54e3a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7592,8 +7592,12 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	const u64 data_alloc_len = length;
 	bool unlock_extents = false;
 
+	/*
+	 * Cap the size of reads to that usually seen in buffered I/O as we need
+	 * to allocate a contiguous array for the checksums.
+	 */
 	if (!write)
-		len = min_t(u64, len, fs_info->sectorsize);
+		len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_SECTORS);
 
 	lockstart = start;
 	lockend = start + len - 1;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b61508723d5d2..5f2cea9a44860 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -354,6 +354,13 @@ struct btrfs_fs_devices {
 					- 2 * sizeof(struct btrfs_chunk))	\
 					/ sizeof(struct btrfs_stripe) + 1)
 
+/*
+ * Maximum number of sectors for a single bio to limit the size of the
+ * checksum array. This matches the number of bio_vecs per bio and thus the
+ * I/O size for buffered I/O.
+ */
+#define BTRFS_MAX_SECTORS 256
+
 /*
  * Additional info to pass along bio.
  *
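The effect of the one-line change in btrfs_dio_iomap_begin() can be sketched as a userspace model (min_t is modeled with a helper here; only the BTRFS_MAX_SECTORS constant and the capping expression come from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* The constant the patch introduces (renamed to BTRFS_MAX_BIO_SECTORS
 * when merged). */
#define BTRFS_MAX_SECTORS 256

/* Userspace stand-in for the kernel's min_t(u64, a, b). */
static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Model of the new cap: writes pass through unchanged, while reads are
 * now capped at sectorsize * BTRFS_MAX_SECTORS (1 MiB for the common
 * 4 KiB sector size) instead of a single sector as before the patch.
 */
static uint64_t cap_dio_read_len(uint64_t len, uint32_t sectorsize, int write)
{
	if (!write)
		len = min_u64(len, (uint64_t)sectorsize * BTRFS_MAX_SECTORS);
	return len;
}
```

So a 16 MiB direct read is split into 1 MiB iomap iterations rather than 4 KiB ones, which is where the roughly tripled fio throughput comes from.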
Btrfs currently limits direct I/O reads to a single sector, which goes
back to commit c329861da406 ("Btrfs: don't allocate a separate csums
array for direct reads") from Josef. That commit changed the direct I/O
code to ".. use the private part of the io_tree for our csums.", but ten
years later that isn't how checksums for direct reads work; instead they
use a csums allocation on a per-btrfs_dio_private basis (which has its
own performance problem for small I/O, but that will be addressed later).

There is no fundamental limit in btrfs itself on the I/O size except for
the size of the checksum array, which scales linearly with the number of
sectors in an I/O. Pick a somewhat arbitrary limit of 256 sectors, which
matches what buffered reads typically see as the upper limit, as the
limit for direct I/O as well.

This significantly improves direct read performance. For example a fio
run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
throughput:

Baseline:

   READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec

With this patch:

   READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec

Signed-off-by: Christoph Hellwig <hch@lst.de>
---

Changes since v1:
 - keep a (large) upper limit on the I/O size.

 fs/btrfs/inode.c   | 6 +++++-
 fs/btrfs/volumes.h | 7 +++++++
 2 files changed, 12 insertions(+), 1 deletion(-)
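The "checksum array that scales linearly with the number of sectors" can be quantified with a small sketch (assuming btrfs's default crc32c checksum at 4 bytes per sector; the function name is illustrative, not a kernel symbol):

```c
#include <assert.h>
#include <stdint.h>

#define CSUM_SIZE 4	/* assumed: crc32c, btrfs's default, is 4 bytes */

/*
 * Bytes of contiguous checksum array needed for a read of io_bytes:
 * one checksum per sector, rounding partial sectors up. A 1 MiB read
 * at 4 KiB sectors needs 256 * 4 = 1024 bytes, which is why capping
 * the per-I/O sector count keeps this allocation small and reliable.
 */
static uint64_t csum_array_bytes(uint64_t io_bytes, uint32_t sectorsize)
{
	uint64_t nr_sectors = (io_bytes + sectorsize - 1) / sectorsize;
	return nr_sectors * CSUM_SIZE;
}
```

Without any cap, the array size would grow unboundedly with the I/O size, hence the 256-sector limit on reads.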