[v2] btrfs: don't limit direct reads to a single sector

Message ID 20220621062627.2637632-1-hch@lst.de (mailing list archive)
State New, archived
Series [v2] btrfs: don't limit direct reads to a single sector

Commit Message

Christoph Hellwig June 21, 2022, 6:26 a.m. UTC
Btrfs currently limits direct I/O reads to a single sector, which goes
back to commit c329861da406 ("Btrfs: don't allocate a separate csums
array for direct reads") from Josef.  That commit changed the direct I/O
code to ".. use the private part of the io_tree for our csums.", but ten
years later that isn't how checksums for direct reads work; instead they
use a csums allocation on a per-btrfs_dio_private basis (which has its
own performance problems for small I/O, but that will be addressed later).

There is no fundamental limit in btrfs itself on the I/O size except for
the size of the checksum array, which scales linearly with the number of
sectors in an I/O.  Pick a somewhat arbitrary limit of 256 sectors, which
matches the upper limit that buffered reads typically see, and use that
as the limit for direct I/O as well.

This significantly improves direct read performance.  For example a fio
run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
throughput:

Baseline:

READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec

With this patch:

READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec

Signed-off-by: Christoph Hellwig <hch@lst.de>
---

Changes since v1:
  - keep a (large) upper limit on the I/O size.

 fs/btrfs/inode.c   | 6 +++++-
 fs/btrfs/volumes.h | 7 +++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

Comments

Qu Wenruo June 21, 2022, 6:35 a.m. UTC | #1
On 2022/6/21 14:26, Christoph Hellwig wrote:
> Btrfs currently limits direct I/O reads to a single sector, which goes
> back to commit c329861da406 ("Btrfs: don't allocate a separate csums
> array for direct reads") from Josef.  That commit changed the direct I/O
> code to ".. use the private part of the io_tree for our csums.", but ten
> years later that isn't how checksums for direct reads work; instead they
> use a csums allocation on a per-btrfs_dio_private basis (which has its
> own performance problems for small I/O, but that will be addressed later).
>
> There is no fundamental limit in btrfs itself on the I/O size except for
> the size of the checksum array, which scales linearly with the number of
> sectors in an I/O.  Pick a somewhat arbitrary limit of 256 sectors, which
> matches the upper limit that buffered reads typically see, and use that
> as the limit for direct I/O as well.
>
> This significantly improves direct read performance.  For example a fio
> run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
> throughput:
>
> Baseline:
>
> READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
>
> With this patch:
>
> READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

The patch itself looks good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

But the comment added by the patch can be confusing.

> ---
>
> Changes since v1:
>    - keep a (large) upper limit on the I/O size.
>
>   fs/btrfs/inode.c   | 6 +++++-
>   fs/btrfs/volumes.h | 7 +++++++
>   2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 33ba4d22e1430..f6dc6e8c54e3a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7592,8 +7592,12 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>   	const u64 data_alloc_len = length;
>   	bool unlock_extents = false;
>
> +	/*
> +	 * Cap the size of reads to that usually seen in buffered I/O as we need
> +	 * to allocate a contiguous array for the checksums.
> +	 */
>   	if (!write)
> -		len = min_t(u64, len, fs_info->sectorsize);
> +		len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_SECTORS);

It would be better just to pick a manual limit, since BTRFS_MAX_SECTORS
doesn't really match the real buffered read size (commented on below).

Here I believe we can afford an artificial limit; there is no need to
align it with any existing limit.
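
For what it's worth, a quick sketch of what the BTRFS_MAX_SECTORS cap works
out to in bytes (my arithmetic, demo-only names, not from the patch); the
SZ_1MiB suggestion further down matches the 4k-sector case:

	#include <linux/types.h>

	/* Byte cap implied by a 256-sector limit for a given sector size. */
	static inline u64 demo_dio_read_cap(u32 sectorsize)
	{
		/* 4k sectors -> 1 MiB, 64k sectors -> 16 MiB */
		return (u64)sectorsize * 256;
	}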

>
>   	lockstart = start;
>   	lockend = start + len - 1;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index b61508723d5d2..5f2cea9a44860 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -354,6 +354,13 @@ struct btrfs_fs_devices {
>   				- 2 * sizeof(struct btrfs_chunk))	\
>   				/ sizeof(struct btrfs_stripe) + 1)
>
> +/*
> + * Maximum number of sectors for a single bio to limit the size of the
> + * checksum array.  This matches the number of bio_vecs per bio and thus the
> + * I/O size for buffered I/O.
> + */
> +#define BTRFS_MAX_SECTORS	256

In fact, one bio vector can point to multiple pages, as its bv_len can
be larger than PAGE_SIZE.

And I have observed bios larger than 1MiB (4MiB or 16MiB, IIRC).
Thus the comment itself is unreliable; just putting SZ_1MiB as the
upper limit would be enough.
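
For reference, a minimal sketch (hypothetical helper, not from this patch)
contrasting the two bio iteration views, which is where the multi-page
behaviour shows up:

	#include <linux/bio.h>
	#include <linux/printk.h>

	static void demo_count_bio_segments(struct bio *bio)
	{
		struct bio_vec bv;
		struct bvec_iter iter;
		unsigned int bvecs = 0, pages = 0;

		/* Raw bio_vecs: bv.bv_len may well be larger than PAGE_SIZE. */
		bio_for_each_bvec(bv, bio, iter)
			bvecs++;

		/* Single-page view: each segment here is at most PAGE_SIZE. */
		bio_for_each_segment(bv, bio, iter)
			pages++;

		pr_debug("bvecs=%u page-sized segments=%u\n", bvecs, pages);
	}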

Thanks,
Qu


> +
>   /*
>    * Additional info to pass along bio.
>    *
Christoph Hellwig June 21, 2022, 6:40 a.m. UTC | #2
On Tue, Jun 21, 2022 at 02:35:55PM +0800, Qu Wenruo wrote:
> In fact, one bio vector can point to multiple pages, as its bv_len can
> be larger than PAGE_SIZE.

Yes, it can.  But it usually doesn't, as that requires contiguous memory.
Without the large folio support now merged for xfs/iomap that is very
unusual and tends to happen only soon after booting.  At which point
allocating the larger csums array is also not a problem as we can
find contiguous memory for that easily as well.  For direct I/O on the
other hand the destination could be THPs or hugetlbs even when memory
is badly fragmented.
Qu Wenruo June 21, 2022, 6:55 a.m. UTC | #3
On 2022/6/21 14:40, Christoph Hellwig wrote:
> On Tue, Jun 21, 2022 at 02:35:55PM +0800, Qu Wenruo wrote:
>> In fact, one bio vector can point to multiple pages, as its bv_len can
>> be larger than PAGE_SIZE.
> 
> Yes, it can.  But it usually doesn't, as that requires contiguous memory.

That's true.

> Without the large folio support now merged for xfs/iomap that is very
> unusual and tends to happen only soon after booting.

A little off-topic here, but what does the extra XFS/iomap work do here?

IIRC the multi-page bio vector has already been there for years.

As long as the pages in the page cache are contiguous, bio_add_page() will
create such a multi-page vector, without any extra support from the fs.

>  At which point
> allocating the larger csums array is also not a problem as we can
> find contiguous memory for that easily as well.  For direct I/O on the
> other hand the destination could be THPs or hugetlbs even when memory
> is badly fragmented.

My point here is just that there is no need to align with any existing limit.

Just settling on a good enough value here is enough; otherwise we need to
dig way deeper to see if the MAX_SECTORS-based one is really correct or not.

Thanks,
Qu
Christoph Hellwig June 21, 2022, 6:59 a.m. UTC | #4
On Tue, Jun 21, 2022 at 02:55:06PM +0800, Qu Wenruo wrote:
> A little off-topic here, but what does the extra XFS/iomap work do here?
>
> IIRC the multi-page bio vector has already been there for years.

Yes.

> As long as the pages in the page cache are contiguous, bio_add_page() will
> create such a multi-page vector, without any extra support from the fs.

Yes.  Every file system supports that case right now, but that only
covers the case where pages are contiguous by luck because the
allocations worked that way.  The iomap THP support is all about
having iomap (and to a small extent the file system, which for now
is just xfs) be able to deal with large folios, that is, > PAGE_SIZE
compound allocations.

>>  At which point
>> allocating the larger csums array is also not a problem as we can
>> find contiguous memory for that easily as well.  For direct I/O on the
>> other hand the destination could be THPs or hugetlbs even when memory
>> is badly fragmented.
>
> My point here is just that there is no need to align with any existing limit.
>
> Just settling on a good enough value here is enough; otherwise we need to
> dig way deeper to see if the MAX_SECTORS-based one is really correct or not.

So my plan here was to also enforce the limit for buffered I/O and
provide a mempool for the maximum allocation size instead of just
failing on memory pressure right now.  But there is another series
or two I need to finish first to have the buffered I/O code in the
shape where this can be easily added.
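
To illustrate the idea, a rough sketch (demo-only names, not the actual
series) of backing the worst-case csum array with a mempool so the
allocation cannot fail outright under memory pressure:

	#include <linux/mempool.h>
	#include <linux/gfp.h>
	#include <linux/errno.h>
	#include <linux/types.h>

	#define DEMO_MAX_CSUM_BYTES	(256 * 32)	/* 256 sectors, 32-byte csums */

	static mempool_t *demo_csum_pool;

	static int demo_csum_pool_init(void)
	{
		/* Pre-reserve a few worst-case arrays to guarantee forward progress. */
		demo_csum_pool = mempool_create_kmalloc_pool(4, DEMO_MAX_CSUM_BYTES);
		return demo_csum_pool ? 0 : -ENOMEM;
	}

	static u8 *demo_alloc_csums(void)
	{
		/* Falls back to the reserved elements if kmalloc fails. */
		return mempool_alloc(demo_csum_pool, GFP_NOFS);
	}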
Qu Wenruo June 21, 2022, 7:04 a.m. UTC | #5
On 2022/6/21 14:59, Christoph Hellwig wrote:
> On Tue, Jun 21, 2022 at 02:55:06PM +0800, Qu Wenruo wrote:
>> A little off-topic here, but what does the extra XFS/iomap work do here?
>>
>> IIRC the multi-page bio vector has already been there for years.
>
> Yes.
>
>> As long as the pages in the page cache are contiguous, bio_add_page() will
>> create such a multi-page vector, without any extra support from the fs.
>
> Yes.  Every file system supports that case right now, but that only
> covers the case where pages are contiguous by luck because the
> allocations worked that way.  The iomap THP support is all about
> having iomap (and to a small extent the file system, which for now
> is just xfs) be able to deal with large folios, that is, > PAGE_SIZE
> compound allocations.

That sounds interesting.

>
>>>   At which point
>>> allocating the larger csums array is also not a problem as we can
>>> find contiguous memory for that easily as well.  For direct I/O on the
>>> other hand the destination could be THPs or hugetlbs even when memory
>>> is badly fragmented.
>>
>> My point here is just that there is no need to align with any existing limit.
>>
>> Just settling on a good enough value here is enough; otherwise we need to
>> dig way deeper to see if the MAX_SECTORS-based one is really correct or not.
>
> So my plan here was to also enforce the limit for buffered I/O and
> provide a mempool for the maximum allocation size instead of just
> failing on memory pressure right now.

Not sure if a mempool is really needed for this case, but since I have
seen memory allocation failures for the extent io tree (which directly
lead to BUG_ON()), I guess it will make sense.

>  But there is another series
> or two I need to finish first to have the buffered I/O code in the
> shape where this can be easily added.
>
More and more pending patches, and I never see an end...

Thanks,
Qu
Nikolay Borisov June 21, 2022, 8:37 a.m. UTC | #6
On 21.06.22 г. 9:26 ч., Christoph Hellwig wrote:
> Btrfs currently limits direct I/O reads to a single sector, which goes
> back to commit c329861da406 ("Btrfs: don't allocate a separate csums
> array for direct reads") from Josef.  That commit changed the direct I/O
> code to ".. use the private part of the io_tree for our csums.", but ten
> years later that isn't how checksums for direct reads work; instead they
> use a csums allocation on a per-btrfs_dio_private basis (which has its
> own performance problems for small I/O, but that will be addressed later).
> 
> There is no fundamental limit in btrfs itself on the I/O size except for
> the size of the checksum array, which scales linearly with the number of
> sectors in an I/O.  Pick a somewhat arbitrary limit of 256 sectors, which
> matches the upper limit that buffered reads typically see, and use that
> as the limit for direct I/O as well.
> 
> This significantly improves direct read performance.  For example a fio
> run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
> throughput:
> 
> Baseline:
> 
> READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
> 
> With this patch:
> 
> READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---

LGTM. However, I think a more appropriate subject would be "increase 
direct read size limit to 256 sectors"


Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> 
> Changes since v1:
>    - keep a (large) upper limit on the I/O size.
> 
>   fs/btrfs/inode.c   | 6 +++++-
>   fs/btrfs/volumes.h | 7 +++++++
>   2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 33ba4d22e1430..f6dc6e8c54e3a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7592,8 +7592,12 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>   	const u64 data_alloc_len = length;
>   	bool unlock_extents = false;
>   
> +	/*
> +	 * Cap the size of reads to that usually seen in buffered I/O as we need
> +	 * to allocate a contiguous array for the checksums.
> +	 */
>   	if (!write)
> -		len = min_t(u64, len, fs_info->sectorsize);
> +		len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_SECTORS);
>   
>   	lockstart = start;
>   	lockend = start + len - 1;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index b61508723d5d2..5f2cea9a44860 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -354,6 +354,13 @@ struct btrfs_fs_devices {
>   				- 2 * sizeof(struct btrfs_chunk))	\
>   				/ sizeof(struct btrfs_stripe) + 1)
>   
> +/*
> + * Maximum number of sectors for a single bio to limit the size of the
> + * checksum array.  This matches the number of bio_vecs per bio and thus the
> + * I/O size for buffered I/O.
> + */
> +#define BTRFS_MAX_SECTORS	256
> +
>   /*
>    * Additional info to pass along bio.
>    *
David Sterba June 21, 2022, 1:34 p.m. UTC | #7
On Tue, Jun 21, 2022 at 08:26:27AM +0200, Christoph Hellwig wrote:
> Btrfs currently limits direct I/O reads to a single sector, which goes
> back to commit c329861da406 ("Btrfs: don't allocate a separate csums
> array for direct reads") from Josef.  That commit changed the direct I/O
> code to ".. use the private part of the io_tree for our csums.", but ten
> years later that isn't how checksums for direct reads work; instead they
> use a csums allocation on a per-btrfs_dio_private basis (which has its
> own performance problems for small I/O, but that will be addressed later).
> 
> There is no fundamental limit in btrfs itself on the I/O size except for
> the size of the checksum array, which scales linearly with the number of
> sectors in an I/O.  Pick a somewhat arbitrary limit of 256 sectors, which
> matches the upper limit that buffered reads typically see, and use that
> as the limit for direct I/O as well.
> 
> This significantly improves direct read performance.  For example a fio
> run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
> throughput:
> 
> Baseline:
> 
> READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
> 
> With this patch:
> 
> READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---

Added to misc-next, thanks. I've renamed the constant to
BTRFS_MAX_BIO_SECTORS and updated the subject per Nikolay's suggestion.

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 33ba4d22e1430..f6dc6e8c54e3a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7592,8 +7592,12 @@  static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	const u64 data_alloc_len = length;
 	bool unlock_extents = false;
 
+	/*
+	 * Cap the size of reads to that usually seen in buffered I/O as we need
+	 * to allocate a contiguous array for the checksums.
+	 */
 	if (!write)
-		len = min_t(u64, len, fs_info->sectorsize);
+		len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_SECTORS);
 
 	lockstart = start;
 	lockend = start + len - 1;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b61508723d5d2..5f2cea9a44860 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -354,6 +354,13 @@  struct btrfs_fs_devices {
 				- 2 * sizeof(struct btrfs_chunk))	\
 				/ sizeof(struct btrfs_stripe) + 1)
 
+/*
+ * Maximum number of sectors for a single bio to limit the size of the
+ * checksum array.  This matches the number of bio_vecs per bio and thus the
+ * I/O size for buffered I/O.
+ */
+#define BTRFS_MAX_SECTORS	256
+
 /*
  * Additional info to pass along bio.
  *