
Btrfs: implement unlocked buffered write

Message ID 1526442757-7167-1-git-send-email-robbieko@synology.com (mailing list archive)
State New, archived

Commit Message

robbieko May 16, 2018, 3:52 a.m. UTC
From: Robbie Ko <robbieko@synology.com>

This idea comes from direct I/O. With this patch, buffered writes can
run in parallel, improving both throughput and latency. But because we
cannot update i_size without holding i_mutex, the unlocked buffered
write can only be done entirely in front of EOF.

We need not worry about the race between buffered write and truncate,
because truncate has to wait until all in-flight buffered writes end.

And we also need not worry about the race between dio write and punch hole,
because we have the extent lock to protect our operation.
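
As a sketch of the core idea (userspace C, hypothetical helper name, not
btrfs code): the lock may only be dropped when the write cannot grow
i_size, i.e. when it ends at or before the current EOF:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the patch's check: the inode lock
 * may only be dropped when the write cannot extend i_size, because an
 * i_size update is protected by i_mutex. */
static bool can_write_unlocked(uint64_t pos, uint64_t count, uint64_t isize)
{
	/* A write that ends at or before EOF never updates i_size. */
	return pos + count <= isize;
}
```

Writes that straddle or extend EOF keep the lock for the whole call;
everything strictly below EOF relies on page and extent locks instead.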

I ran fio to test the performance of this feature.

== Hardware ==
CPU: Intel® Xeon® D-1531
SSD: Intel S3710 200G
Volume: RAID 5, SSD * 6

== config file ==
[global]
group_reporting
time_based
thread=1
norandommap
ioengine=libaio
bs=4k
iodepth=32
size=16G
runtime=180
numjobs=8
rw=randwrite

[file1]
filename=/mnt/btrfs/nocow/testfile

== result (iops) ==
lock     = 68470
unlocked = 94242

== result (clat) ==
lock
 lat (usec): min=184, max=1209.9K, avg=3738.35, stdev=20869.49
 clat percentiles (usec):
  |  1.00th=[  322],  5.00th=[  330], 10.00th=[  334], 20.00th=[  346],
  | 30.00th=[  370], 40.00th=[  386], 50.00th=[  406], 60.00th=[  446],
  | 70.00th=[  516], 80.00th=[  612], 90.00th=[  828], 95.00th=[10432],
  | 99.00th=[84480], 99.50th=[117248], 99.90th=[226304], 99.95th=[333824],
  | 99.99th=[692224]

unlocked
 lat (usec): min=10, max=218208, avg=2691.44, stdev=5380.82
 clat percentiles (usec):
  |  1.00th=[  302],  5.00th=[  390], 10.00th=[  442], 20.00th=[  502],
  | 30.00th=[  548], 40.00th=[  596], 50.00th=[  652], 60.00th=[  724],
  | 70.00th=[  916], 80.00th=[ 5024], 90.00th=[ 5664], 95.00th=[10048],
  | 99.00th=[29568], 99.50th=[39168], 99.90th=[54016], 99.95th=[59648],
  | 99.99th=[78336]

Signed-off-by: Robbie Ko <robbieko@synology.com>
---
 fs/btrfs/file.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

David Sterba May 22, 2018, 5:11 p.m. UTC | #1
On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
> From: Robbie Ko <robbieko@synology.com>
> 
> This idea is from direct io. By this patch, we can make the buffered
> write parallel, and improve the performance and latency. But because we
> can not update isize without i_mutex, the unlocked buffered write just
> can be done in front of the EOF.
> 
> We needn't worry about the race between buffered write and truncate,
> because the truncate need wait until all the buffered write end.
> 
> And we also needn't worry about the race between dio write and punch hole,
> because we have extent lock to protect our operation.
> 
> I ran fio to test the performance of this feature.
> 
> == Hardware ==
> CPU: Intel® Xeon® D-1531
> SSD: Intel S3710 200G
> Volume : RAID 5 , SSD * 6
> 
> == config file ==
> [global]
> group_reporting
> time_based
> thread=1
> norandommap
> ioengine=libaio
> bs=4k
> iodepth=32
> size=16G
> runtime=180
> numjobs=8
> rw=randwrite
> 
> [file1]
> filename=/mnt/btrfs/nocow/testfile
> 
> == result (iops) ==
> lock     = 68470
> unlocked = 94242

This looks impressive.

> == result (clat) ==
> lock
>  lat (usec): min=184, max=1209.9K, avg=3738.35, stdev=20869.49
>  clat percentiles (usec):
>   |  1.00th=[  322],  5.00th=[  330], 10.00th=[  334], 20.00th=[  346],
>   | 30.00th=[  370], 40.00th=[  386], 50.00th=[  406], 60.00th=[  446],
>   | 70.00th=[  516], 80.00th=[  612], 90.00th=[  828], 95.00th=[10432],
>   | 99.00th=[84480], 99.50th=[117248], 99.90th=[226304], 99.95th=[333824],
>   | 99.99th=[692224]
> 
> unlocked
>  lat (usec): min=10, max=218208, avg=2691.44, stdev=5380.82
>  clat percentiles (usec):
>   |  1.00th=[  302],  5.00th=[  390], 10.00th=[  442], 20.00th=[  502],
>   | 30.00th=[  548], 40.00th=[  596], 50.00th=[  652], 60.00th=[  724],
>   | 70.00th=[  916], 80.00th=[ 5024], 90.00th=[ 5664], 95.00th=[10048],
>   | 99.00th=[29568], 99.50th=[39168], 99.90th=[54016], 99.95th=[59648],
>   | 99.99th=[78336]

The latencies are better, nice.

> Signed-off-by: Robbie Ko <robbieko@synology.com>
> ---
>  fs/btrfs/file.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 41ab907..8eac540 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1600,6 +1600,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>  	int ret = 0;
>  	bool only_release_metadata = false;
>  	bool force_page_uptodate = false;
> +	bool relock = false;
>  
>  	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>  			PAGE_SIZE / (sizeof(struct page *)));
> @@ -1609,6 +1610,18 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>  	if (!pages)
>  		return -ENOMEM;
>  
> +	inode_dio_begin(inode);
> +
> +	/*
> +	 * If the write is beyond the EOF, we need update
> +	 * the isize, but it is protected by i_mutex. So we can
> +	 * not unlock the i_mutex at this case.
> +	 */
> +	if (pos + iov_iter_count(i) <= i_size_read(inode)) {
> +		inode_unlock(inode);
> +		relock = true;
> +	}

And this looks scary :)

The reasons given in the changelog for why this should be safe sound
good to me, but I haven't done a closer review. The patch is OK for
addition to for-next so we'll see if it breaks something. I'd rather be
a bit conservative about merging, so no promises that it's going to land
in 4.18. More reviews are welcome.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Omar Sandoval May 22, 2018, 5:28 p.m. UTC | #2
On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
> From: Robbie Ko <robbieko@synology.com>
> 
> This idea is from direct io. By this patch, we can make the buffered
> write parallel, and improve the performance and latency. But because we
> can not update isize without i_mutex, the unlocked buffered write just
> can be done in front of the EOF.
> 
> We needn't worry about the race between buffered write and truncate,
> because the truncate need wait until all the buffered write end.

I'm not convinced that this race isn't an issue. Consider this:

__btrfs_buffered_write()	btrfs_setsize()
inode_dio_begin()
inode_unlock()
				inode_lock()
				truncate_setsize() /* Updates i_size */
				truncate_pagecache() /* Locks and unlocks pages */
				inode_dio_wait()
prepare_pages() /* Locks pages */
btrfs_dirty_pages() /* Updates i_size without i_rwsem! */

I think moving inode_dio_wait() to before truncate_setsize() in
btrfs_setsize() would close this race, but I haven't thought about it
long enough to determine whether that still works for the original
reason the inode_dio_wait() was added.

> And we also needn't worry about the race between dio write and punch hole,
> because we have extent lock to protect our operation.
> 
> I ran fio to test the performance of this feature.
> 
> == Hardware ==
> CPU: Intel® Xeon® D-1531
> SSD: Intel S3710 200G
> Volume : RAID 5 , SSD * 6
> 
> == config file ==
> [global]
> group_reporting
> time_based
> thread=1
> norandommap
> ioengine=libaio
> bs=4k
> iodepth=32
> size=16G
> runtime=180
> numjobs=8
> rw=randwrite
> 
> [file1]
> filename=/mnt/btrfs/nocow/testfile
> 
> == result (iops) ==
> lock     = 68470
> unlocked = 94242
> 
> == result (clat) ==
> lock
>  lat (usec): min=184, max=1209.9K, avg=3738.35, stdev=20869.49
>  clat percentiles (usec):
>   |  1.00th=[  322],  5.00th=[  330], 10.00th=[  334], 20.00th=[  346],
>   | 30.00th=[  370], 40.00th=[  386], 50.00th=[  406], 60.00th=[  446],
>   | 70.00th=[  516], 80.00th=[  612], 90.00th=[  828], 95.00th=[10432],
>   | 99.00th=[84480], 99.50th=[117248], 99.90th=[226304], 99.95th=[333824],
>   | 99.99th=[692224]
> 
> unlocked
>  lat (usec): min=10, max=218208, avg=2691.44, stdev=5380.82
>  clat percentiles (usec):
>   |  1.00th=[  302],  5.00th=[  390], 10.00th=[  442], 20.00th=[  502],
>   | 30.00th=[  548], 40.00th=[  596], 50.00th=[  652], 60.00th=[  724],
>   | 70.00th=[  916], 80.00th=[ 5024], 90.00th=[ 5664], 95.00th=[10048],
>   | 99.00th=[29568], 99.50th=[39168], 99.90th=[54016], 99.95th=[59648],
>   | 99.99th=[78336]
> 
> Signed-off-by: Robbie Ko <robbieko@synology.com>
> ---
>  fs/btrfs/file.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 41ab907..8eac540 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1600,6 +1600,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>  	int ret = 0;
>  	bool only_release_metadata = false;
>  	bool force_page_uptodate = false;
> +	bool relock = false;
>  
>  	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>  			PAGE_SIZE / (sizeof(struct page *)));
> @@ -1609,6 +1610,18 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>  	if (!pages)
>  		return -ENOMEM;
>  
> +	inode_dio_begin(inode);

This would need a comment as to why we're using inode_dio_begin() for
buffered I/O.

> +	/*
> +	 * If the write is beyond the EOF, we need update
> +	 * the isize, but it is protected by i_mutex. So we can
> +	 * not unlock the i_mutex at this case.
> +	 */
> +	if (pos + iov_iter_count(i) <= i_size_read(inode)) {
> +		inode_unlock(inode);
> +		relock = true;
> +	}
> +
>  	while (iov_iter_count(i) > 0) {
>  		size_t offset = pos & (PAGE_SIZE - 1);
>  		size_t sector_offset;
> @@ -1808,6 +1821,10 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>  		}
>  	}
>  
> +	inode_dio_end(inode);
> +	if (relock)
> +		inode_lock(inode);
> +
>  	extent_changeset_free(data_reserved);
>  	return num_written ? num_written : ret;
>  }
> -- 
> 1.9.1
> 
Christoph Hellwig May 22, 2018, 6:08 p.m. UTC | #3
On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
> From: Robbie Ko <robbieko@synology.com>
> 
> This idea is from direct io. By this patch, we can make the buffered
> write parallel, and improve the performance and latency. But because we
> can not update isize without i_mutex, the unlocked buffered write just
> can be done in front of the EOF.
> 
> We needn't worry about the race between buffered write and truncate,
> because the truncate need wait until all the buffered write end.
> 
> And we also needn't worry about the race between dio write and punch hole,
> because we have extent lock to protect our operation.
> 
> I ran fio to test the performance of this feature.
> 
> == Hardware ==
> CPU: Intel® Xeon® D-1531
> SSD: Intel S3710 200G
> Volume : RAID 5 , SSD * 6
> 
> == config file ==
> [global]
> group_reporting
> time_based
> thread=1
> norandommap
> ioengine=libaio
> bs=4k
> iodepth=32
> size=16G
> runtime=180
> numjobs=8
> rw=randwrite
> 
> [file1]
> filename=/mnt/btrfs/nocow/testfile
> 
> == result (iops) ==
> lock     = 68470
> unlocked = 94242
> 
> == result (clat) ==
> lock
>  lat (usec): min=184, max=1209.9K, avg=3738.35, stdev=20869.49
>  clat percentiles (usec):
>   |  1.00th=[  322],  5.00th=[  330], 10.00th=[  334], 20.00th=[  346],
>   | 30.00th=[  370], 40.00th=[  386], 50.00th=[  406], 60.00th=[  446],
>   | 70.00th=[  516], 80.00th=[  612], 90.00th=[  828], 95.00th=[10432],
>   | 99.00th=[84480], 99.50th=[117248], 99.90th=[226304], 99.95th=[333824],
>   | 99.99th=[692224]
> 
> unlocked
>  lat (usec): min=10, max=218208, avg=2691.44, stdev=5380.82
>  clat percentiles (usec):
>   |  1.00th=[  302],  5.00th=[  390], 10.00th=[  442], 20.00th=[  502],
>   | 30.00th=[  548], 40.00th=[  596], 50.00th=[  652], 60.00th=[  724],
>   | 70.00th=[  916], 80.00th=[ 5024], 90.00th=[ 5664], 95.00th=[10048],
>   | 99.00th=[29568], 99.50th=[39168], 99.90th=[54016], 99.95th=[59648],
>   | 99.99th=[78336]
> 
> Signed-off-by: Robbie Ko <robbieko@synology.com>
> ---
>  fs/btrfs/file.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 41ab907..8eac540 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1600,6 +1600,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>  	int ret = 0;
>  	bool only_release_metadata = false;
>  	bool force_page_uptodate = false;
> +	bool relock = false;
>  
>  	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>  			PAGE_SIZE / (sizeof(struct page *)));
> @@ -1609,6 +1610,18 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>  	if (!pages)
>  		return -ENOMEM;
>  
> +	inode_dio_begin(inode);
> +
> +	/*
> +	 * If the write is beyond the EOF, we need update
> +	 * the isize, but it is protected by i_mutex. So we can
> +	 * not unlock the i_mutex at this case.
> +	 */
> +	if (pos + iov_iter_count(i) <= i_size_read(inode)) {
> +		inode_unlock(inode);

And what protects two writes from interleaving their results now?
Chris Mason May 22, 2018, 6:31 p.m. UTC | #4
On 22 May 2018, at 14:08, Christoph Hellwig wrote:

> On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
>> From: Robbie Ko <robbieko@synology.com>
>>
>> This idea is from direct io. By this patch, we can make the buffered
>> write parallel, and improve the performance and latency. But because 
>> we
>> can not update isize without i_mutex, the unlocked buffered write 
>> just
>> can be done in front of the EOF.
>>
>> We needn't worry about the race between buffered write and truncate,
>> because the truncate need wait until all the buffered write end.
>>
>> And we also needn't worry about the race between dio write and punch 
>> hole,
>> because we have extent lock to protect our operation.
>>
>> I ran fio to test the performance of this feature.
>
> And what protects two writes from interleaving their results now?

page locks...ish, we at least won't have results interleaved in a single 
page.  For btrfs it'll actually be multiple pages since we try to do 
more than one at a time.

I haven't verified all the assumptions around truncate and fallocate and 
friends expecting the dio special locking to be inside i_size.  In 
general this makes me a little uncomfortable.

But we're not avoiding the inode lock completely, we're just dropping it 
for the expensive parts of writing to the file.  A quick guess about 
what the expensive parts are:

1) balance_dirty_pages()
2) btrfs_btree_balance_dirty()
3) metadata reservations/enospc waiting.

Can I bribe you to benchmark how much each of those things is impacting 
the iops/latency benefits?

-chris
Christoph Hellwig May 23, 2018, 6:37 a.m. UTC | #5
On Tue, May 22, 2018 at 02:31:36PM -0400, Chris Mason wrote:
> > And what protects two writes from interleaving their results now?
> 
> page locks...ish, we at least won't have results interleaved in a single
> page.  For btrfs it'll actually be multiple pages since we try to do more
> than one at a time.

I think you are going to break just about every assumption people
have that any single write is going to be atomic vs another write.

E.g. this comes from the POSIX read definition, referenced by the
write definition:

"I/O is intended to be atomic to ordinary files and pipes and FIFOs.
Atomic means that all the bytes from a single operation that started out
together end up together, without interleaving from other I/O
operations. It is a known attribute of terminals that this is not
honored, and terminals are explicitly (and implicitly permanently)
excepted, making the behavior unspecified. The behavior for other device
types is also left unspecified, but the wording is intended to imply
that future standards might choose to specify atomicity (or not).
"

Without taking the inode lock (or some sort of range lock) you can
easily interleave data from two write requests.
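
A toy model (plain C, not btrfs code) of that interleaving: per-page
locking makes each page update atomic, but nothing orders two multi-page
writers, so one legal schedule leaves the file with a mix of both
writes:

```c
#include <assert.h>
#include <string.h>

/* Each call stands for one per-page lock/modify/unlock step. */
static void write_page(char *pages, int page, char data)
{
	pages[page] = data;
}

/* Writers A and B both intend to fill pages 0..1 with their own byte.
 * Without a whole-file (or range) lock, this schedule is possible. */
static void interleaved_schedule(char *pages)
{
	write_page(pages, 0, 'A');	/* A: page 0 */
	write_page(pages, 0, 'B');	/* B: page 0 */
	write_page(pages, 1, 'B');	/* B: page 1 */
	write_page(pages, 1, 'A');	/* A: page 1 */
}
```

The result is "BA": neither writer's data survives intact, which is
exactly the single-write atomicity applications tend to assume.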

> But we're not avoiding the inode lock completely, we're just dropping it for
> the expensive parts of writing to the file.  A quick guess about what the
> expensive parts are:

The way I read the patch it basically 'avoids' the inode lock for almost
the whole write call, just minus some setup.
robbieko May 23, 2018, 7:07 a.m. UTC | #6
On 2018-05-23 01:28, Omar Sandoval wrote:
> On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
>> From: Robbie Ko <robbieko@synology.com>
>> 
>> This idea is from direct io. By this patch, we can make the buffered
>> write parallel, and improve the performance and latency. But because 
>> we
>> can not update isize without i_mutex, the unlocked buffered write just
>> can be done in front of the EOF.
>> 
>> We needn't worry about the race between buffered write and truncate,
>> because the truncate need wait until all the buffered write end.
> 
> I'm not convinced that this race isn't an issue. Consider this:
> 
> __btrfs_buffered_write()	btrfs_setsize()
> inode_dio_begin()
> inode_unlock()
> 				inode_lock()
> 				truncate_setsize() /* Updates i_size */
> 				truncate_pagecache() /* Locks and unlocks pages */
> 				inode_dio_wait()
> prepare_pages() /* Locks pages */
> btrfs_dirty_pages() /* Updates i_size without i_rwsem! */
> 
> I think moving inode_dio_wait() to before truncate_setsize() in
> btrfs_setsize() would close this race, but I haven't thought about it
> long enough to determine whether that still works for the original
> reason the inode_dio_wait() was added.

OK. I will correct this part.

The original use of inode_dio_wait is to avoid a race between DIO read
and truncate.
We can fix it like below:
-       /* we don't support swapfiles, so vmtruncate shouldn't fail */
-       truncate_setsize(inode, newsize);
-
         /* Disable nonlocked read DIO to avoid the end less truncate */
         btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
         inode_dio_wait(inode);
+
+       /* we don't support swapfiles, so vmtruncate shouldn't fail */
+       truncate_setsize(inode, newsize);
+
         btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));

We just make sure i_size is updated before unlocked DIO read is re-enabled.

Thanks.


> 
>> And we also needn't worry about the race between dio write and punch 
>> hole,
>> because we have extent lock to protect our operation.
>> 
>> I ran fio to test the performance of this feature.
>> 
>> == Hardware ==
>> CPU: Intel® Xeon® D-1531
>> SSD: Intel S3710 200G
>> Volume : RAID 5 , SSD * 6
>> 
>> == config file ==
>> [global]
>> group_reporting
>> time_based
>> thread=1
>> norandommap
>> ioengine=libaio
>> bs=4k
>> iodepth=32
>> size=16G
>> runtime=180
>> numjobs=8
>> rw=randwrite
>> 
>> [file1]
>> filename=/mnt/btrfs/nocow/testfile
>> 
>> == result (iops) ==
>> lock     = 68470
>> unlocked = 94242
>> 
>> == result (clat) ==
>> lock
>>  lat (usec): min=184, max=1209.9K, avg=3738.35, stdev=20869.49
>>  clat percentiles (usec):
>>   |  1.00th=[  322],  5.00th=[  330], 10.00th=[  334], 20.00th=[  
>> 346],
>>   | 30.00th=[  370], 40.00th=[  386], 50.00th=[  406], 60.00th=[  
>> 446],
>>   | 70.00th=[  516], 80.00th=[  612], 90.00th=[  828], 
>> 95.00th=[10432],
>>   | 99.00th=[84480], 99.50th=[117248], 99.90th=[226304], 
>> 99.95th=[333824],
>>   | 99.99th=[692224]
>> 
>> unlocked
>>  lat (usec): min=10, max=218208, avg=2691.44, stdev=5380.82
>>  clat percentiles (usec):
>>   |  1.00th=[  302],  5.00th=[  390], 10.00th=[  442], 20.00th=[  
>> 502],
>>   | 30.00th=[  548], 40.00th=[  596], 50.00th=[  652], 60.00th=[  
>> 724],
>>   | 70.00th=[  916], 80.00th=[ 5024], 90.00th=[ 5664], 
>> 95.00th=[10048],
>>   | 99.00th=[29568], 99.50th=[39168], 99.90th=[54016], 
>> 99.95th=[59648],
>>   | 99.99th=[78336]
>> 
>> Signed-off-by: Robbie Ko <robbieko@synology.com>
>> ---
>>  fs/btrfs/file.c | 17 +++++++++++++++++
>>  1 file changed, 17 insertions(+)
>> 
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 41ab907..8eac540 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -1600,6 +1600,7 @@ static noinline ssize_t 
>> __btrfs_buffered_write(struct file *file,
>>  	int ret = 0;
>>  	bool only_release_metadata = false;
>>  	bool force_page_uptodate = false;
>> +	bool relock = false;
>> 
>>  	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>>  			PAGE_SIZE / (sizeof(struct page *)));
>> @@ -1609,6 +1610,18 @@ static noinline ssize_t 
>> __btrfs_buffered_write(struct file *file,
>>  	if (!pages)
>>  		return -ENOMEM;
>> 
>> +	inode_dio_begin(inode);
> 
> This would need a comment as to why we're using inode_dio_begin() for
> buffered I/O.
> 
>> +	/*
>> +	 * If the write is beyond the EOF, we need update
>> +	 * the isize, but it is protected by i_mutex. So we can
>> +	 * not unlock the i_mutex at this case.
>> +	 */
>> +	if (pos + iov_iter_count(i) <= i_size_read(inode)) {
>> +		inode_unlock(inode);
>> +		relock = true;
>> +	}
>> +
>>  	while (iov_iter_count(i) > 0) {
>>  		size_t offset = pos & (PAGE_SIZE - 1);
>>  		size_t sector_offset;
>> @@ -1808,6 +1821,10 @@ static noinline ssize_t 
>> __btrfs_buffered_write(struct file *file,
>>  		}
>>  	}
>> 
>> +	inode_dio_end(inode);
>> +	if (relock)
>> +		inode_lock(inode);
>> +
>>  	extent_changeset_free(data_reserved);
>>  	return num_written ? num_written : ret;
>>  }
>> --
>> 1.9.1
>> 
robbieko May 23, 2018, 7:26 a.m. UTC | #7
On 2018-05-23 02:31, Chris Mason wrote:
> On 22 May 2018, at 14:08, Christoph Hellwig wrote:
> 
>> On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
>>> From: Robbie Ko <robbieko@synology.com>
>>> 
>>> This idea is from direct io. By this patch, we can make the buffered
>>> write parallel, and improve the performance and latency. But because 
>>> we
>>> can not update isize without i_mutex, the unlocked buffered write 
>>> just
>>> can be done in front of the EOF.
>>> 
>>> We needn't worry about the race between buffered write and truncate,
>>> because the truncate need wait until all the buffered write end.
>>> 
>>> And we also needn't worry about the race between dio write and punch 
>>> hole,
>>> because we have extent lock to protect our operation.
>>> 
>>> I ran fio to test the performance of this feature.
>> 
>> And what protects two writes from interleaving their results now?
> 
> page locks...ish, we at least won't have results interleaved in a
> single page.  For btrfs it'll actually be multiple pages since we try
> to do more than one at a time.
> 
> I haven't verified all the assumptions around truncate and fallocate
> and friends expecting the dio special locking to be inside i_size.  In
> general this makes me a little uncomfortable.
> 
> But we're not avoiding the inode lock completely, we're just dropping
> it for the expensive parts of writing to the file.  A quick guess
> about what the expensive parts are:
> 
> 1) balance_dirty_pages()
> 2) btrfs_btree_balance_dirty()
> 3) metadata reservations/enospc waiting.
> 

The expensive parts of buffered_write are:
1. prepare_pages()
     --wait_on_page_writeback()
     Because writeback has already submitted the page and set
PG_writeback,
     we must wait until the page's writeback I/O ends.

2. lock_and_cleanup_extent_if_need()
     --btrfs_start_ordered_extent()
     When a large number of ordered_extents are queued on the
endio_write_workers workqueue,
     the buffered write's ordered_extent is the last one in that queue,
     and because the workqueue is a FIFO it has to wait for all
preceding ordered_extents to be processed first.
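
A toy FIFO model (illustrative only, not btrfs code) of that second
point: completing the most recently queued entry on a FIFO workqueue
means draining every entry queued before it:

```c
#include <assert.h>

/* Process queue entries in FIFO order until `target` completes;
 * returns how many entries had to be processed in total. */
static int fifo_wait_for(const int *queue, int qlen, int target)
{
	int processed = 0;
	for (int i = 0; i < qlen; i++) {
		processed++;
		if (queue[i] == target)
			break;
	}
	return processed;
}
```

Waiting on the last entry of an n-deep queue costs n completions, which
is why a deep endio_write_workers backlog shows up directly in
buffered-write latency.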

Thanks.
Robbie Ko

> Can I bribe you to benchmark how much each of those things is
> impacting the iops/latency benefits?
> 
> -chris
Nikolay Borisov May 23, 2018, 7:58 a.m. UTC | #8
On 23.05.2018 09:37, Christoph Hellwig wrote:
> On Tue, May 22, 2018 at 02:31:36PM -0400, Chris Mason wrote:
>>> And what protects two writes from interleaving their results now?
>>
>> page locks...ish, we at least won't have results interleaved in a single
>> page.  For btrfs it'll actually be multiple pages since we try to do more
>> than one at a time.
> 
> I think you are going to break just about every assumption people
> have that any single write is going to be atomic vs another write.
> 
> E.g. this comes from the posix read definition reference by the
> write definition:
> 
> "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> Atomic means that all the bytes from a single operation that started out
> together end up together, without interleaving from other I/O
> operations. It is a known attribute of terminals that this is not
> honored, and terminals are explicitly (and implicitly permanently)
> excepted, making the behavior unspecified. The behavior for other device
> types is also left unspecified, but the wording is intended to imply
> that future standards might choose to specify atomicity (or not).
> "
> 
> Without taking the inode lock (or some sort of range lock) you can
> easily interleave data from two write requests.
> 
>> But we're not avoiding the inode lock completely, we're just dropping it for
>> the expensive parts of writing to the file.  A quick guess about what the
>> expensive parts are:
> 
> The way I read the patch it basically 'avoids' the inode lock for almost
> the whole write call, just minus some setup.

You are right, this drops the inode_lock for some rather crucial ops:
prepare_pages - this one allocates pages in the page cache mapping for
this inode and locks them

lock_and_cleanup_extent_if_need - this one completes any previous writes
for the range currently being written and locks the range of extents.

Looking at the code I *think* the locking provided by
prepare_pages/lock_and_cleanup_extent_if_need *should* provide the
necessary exclusivity to guarantee atomicity w.r.t. different writes.

So Robbie,

This reliance on btrfs-internal locks to provide exclusivity *must* be
explicitly documented in the commit log.


Chris Mason May 23, 2018, 3:56 p.m. UTC | #9
On 23 May 2018, at 3:26, robbieko wrote:

> On 2018-05-23 02:31, Chris Mason wrote:
>> On 22 May 2018, at 14:08, Christoph Hellwig wrote:
>>
>>> On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
>>>> From: Robbie Ko <robbieko@synology.com>
>>>>
>>>> This idea is from direct io. By this patch, we can make the 
>>>> buffered
>>>> write parallel, and improve the performance and latency. But 
>>>> because we
>>>> can not update isize without i_mutex, the unlocked buffered write 
>>>> just
>>>> can be done in front of the EOF.
>>>>
>>>> We needn't worry about the race between buffered write and 
>>>> truncate,
>>>> because the truncate need wait until all the buffered write end.
>>>>
>>>> And we also needn't worry about the race between dio write and 
>>>> punch hole,
>>>> because we have extent lock to protect our operation.
>>>>
>>>> I ran fio to test the performance of this feature.
>>>
>>> And what protects two writes from interleaving their results now?
>>
>> page locks...ish, we at least won't have results interleaved in a
>> single page.  For btrfs it'll actually be multiple pages since we try
>> to do more than one at a time.
>>
>> I haven't verified all the assumptions around truncate and fallocate
>> and friends expecting the dio special locking to be inside i_size.  
>> In
>> general this makes me a little uncomfortable.
>>
>> But we're not avoiding the inode lock completely, we're just dropping
>> it for the expensive parts of writing to the file.  A quick guess
>> about what the expensive parts are:
>>
>> 1) balance_dirty_pages()
>> 2) btrfs_btree_balance_dirty()
>> 3) metadata reservations/enospc waiting.
>>
>
> The expensive part of buffered_write are:
> 1. prepare_pages()
>     --wait_on_page_writeback()
>     Because writeback submit page to PG_writeback.
>     We must wait until the page writeback IO ends.

Is this based on timing the fio job or something else?  We can trigger a 
stable page prep run before we take the mutex, but stable pages 
shouldn't be part of random IO workloads unless we're doing random IO 
inside a file that mostly fits in RAM.

balance_dirty_pages() is a much more common waiting point, and doing 
that with the inode lock held definitely adds contention.

>
> 2. lock_and_cleanup_extent_if_need
>     --btrfs_start_ordered_extent
>     When a large number of ordered_extent queue is in 
> endio_write_workers workqueue.
>     Buffered_write assumes that ordered_extent is the last one in the 
> endio_write_workers workqueue,
>     and waits for all ordered_extents to be processed before because 
> the workqueue is a FIFO.
>

This isn't completely accurate, but it falls into the stable pages 
bucket as well.  We can push a lighter version of the stable page wait 
before the inode lock when the IO is inside of i_size.  It won't 
completely remove the possibility that someone else dirties those pages 
again, but if the workload is random or really splitting up the file, 
it'll make the work done with the lock held much less.

-chris
Chris Mason May 23, 2018, 6:01 p.m. UTC | #10
On 23 May 2018, at 2:37, Christoph Hellwig wrote:

> On Tue, May 22, 2018 at 02:31:36PM -0400, Chris Mason wrote:
>>> And what protects two writes from interleaving their results now?
>>
>> page locks...ish, we at least won't have results interleaved in a 
>> single
>> page.  For btrfs it'll actually be multiple pages since we try to do 
>> more
>> than one at a time.
>
> I think you are going to break just about every assumption people
> have that any single write is going to be atomic vs another write.
>
> E.g. this comes from the posix read definition reference by the
> write definition:
>
> "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> Atomic means that all the bytes from a single operation that started 
> out
> together end up together, without interleaving from other I/O
> operations. It is a known attribute of terminals that this is not
> honored, and terminals are explicitly (and implicitly permanently)
> excepted, making the behavior unspecified. The behavior for other 
> device
> types is also left unspecified, but the wording is intended to imply
> that future standards might choose to specify atomicity (or not).
> "
>
> Without taking the inode lock (or some sort of range lock) you can
> easily interleave data from two write requests.

"This volume of IEEE Std 1003.1-2001 does not specify behavior of 
concurrent writes to a file from multiple processes. Applications should 
use some form of concurrency control."

I'm always more worried about truncate than standards ;)  But just to be 
clear, I'm not disagreeing with you at all.  Interleaved writes just 
wasn't the first thing that jumped to my mind.

>
>> But we're not avoiding the inode lock completely, we're just dropping 
>> it for
>> the expensive parts of writing to the file.  A quick guess about what 
>> the
>> expensive parts are:
>
> The way I read the patch it basically 'avoids' the inode lock for 
> almost
> the whole write call, just minus some setup.

Yeah, if we can get 90% of the way there by pushing some 
balance_dirty_pages() outside the lock (or whatever other expensive 
setup we're doing), I'd be much happier.

-chris
robbieko May 24, 2018, 8:46 a.m. UTC | #11
On 2018-05-23 23:56, Chris Mason wrote:
> On 23 May 2018, at 3:26, robbieko wrote:
> 
>> On 2018-05-23 02:31, Chris Mason wrote:
>>> On 22 May 2018, at 14:08, Christoph Hellwig wrote:
>>> 
>>>> On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
>>>>> From: Robbie Ko <robbieko@synology.com>
>>>>> 
>>>>> This idea is from direct io. By this patch, we can make the 
>>>>> buffered
>>>>> write parallel, and improve the performance and latency. But 
>>>>> because we
>>>>> can not update isize without i_mutex, the unlocked buffered write 
>>>>> just
>>>>> can be done in front of the EOF.
>>>>> 
>>>>> We needn't worry about the race between buffered write and 
>>>>> truncate,
>>>>> because the truncate need wait until all the buffered write end.
>>>>> 
>>>>> And we also needn't worry about the race between dio write and 
>>>>> punch hole,
>>>>> because we have extent lock to protect our operation.
>>>>> 
>>>>> I ran fio to test the performance of this feature.
>>>> 
>>>> And what protects two writes from interleaving their results now?
>>> 
>>> page locks...ish, we at least won't have results interleaved in a
>>> single page.  For btrfs it'll actually be multiple pages since we try
>>> to do more than one at a time.
>>> 
>>> I haven't verified all the assumptions around truncate and fallocate
>>> and friends expecting the dio special locking to be inside i_size.  
>>> In
>>> general this makes me a little uncomfortable.
>>> 
>>> But we're not avoiding the inode lock completely, we're just dropping
>>> it for the expensive parts of writing to the file.  A quick guess
>>> about what the expensive parts are:
>>> 
>>> 1) balance_dirty_pages()
>>> 2) btrfs_btree_balance_dirty()
>>> 3) metadata reservations/enospc waiting.
>>> 
>> 
>> The expensive parts of buffered_write are:
>> 1. prepare_pages()
>>     --wait_on_page_writeback()
>>     Because writeback has set PG_writeback on the page,
>>     we must wait until the page's writeback IO ends.
> 
> Is this based on timing the fio job or something else?  We can trigger
> a stable page prep run before we take the mutex, but stable pages
> shouldn't be part of random IO workloads unless we're doing random IO
> inside a file that mostly fits in ram.
> 

This problem is easily encountered with VMs and file-based iSCSI LUNs,
most of which produce a random write pattern.
Fio can simulate the problem quickly.

So which is the best way?
1. unlocked buffered write (like direct-io)
2. stable page prep
3. Or is there any way to completely avoid stable pages?

> balance_dirty_pages() is a much more common waiting point, and doing
> that with the inode lock held definitely adds contention.

Yes, I agree.
But in terms of latency, balance_dirty_pages usually adds a relatively
smooth delay, whereas lock_and_cleanup_extent_if_need and prepare_pages
depend on IO completion, so their latency grows by multiples.

Thanks.
Robbie Ko

> 
>> 
>> 2. lock_and_cleanup_extent_if_need
>>     --btrfs_start_ordered_extent
>>     When a large number of ordered_extents are queued in the
>>     endio_write_workers workqueue, the ordered_extent that
>>     buffered_write waits on may be the last one in the queue, so it
>>     effectively waits for all the ordered_extents ahead of it,
>>     because the workqueue is FIFO.
>> 
> 
> This isn't completely accurate, but it falls into the stable pages
> bucket as well.  We can push a lighter version of the stable page wait
> before the inode lock when the IO is inside of i_size.  It won't
> completely remove the possibility that someone else dirties those
> pages again, but if the workload is random or really splitting up the
> file, it'll make the work done with the lock held much less.
> 
> -chris

Chris Mason May 24, 2018, 3:05 p.m. UTC | #12
On 24 May 2018, at 4:46, robbieko wrote:

> On 2018-05-23 23:56, Chris Mason wrote:
>> On 23 May 2018, at 3:26, robbieko wrote:
>>>>
>>>> But we're not avoiding the inode lock completely, we're just 
>>>> dropping
>>>> it for the expensive parts of writing to the file.  A quick guess
>>>> about what the expensive parts are:
>>>>
>>>> 1) balance_dirty_pages()
>>>> 2) btrfs_btree_balance_dirty()
>>>> 3) metadata reservations/enospc waiting.
>>>>
>>>
>>> The expensive parts of buffered_write are:
>>> 1. prepare_pages()
>>>     --wait_on_page_writeback()
>>>     Because writeback has set PG_writeback on the page,
>>>     we must wait until the page's writeback IO ends.
>>
>> Is this based on timing the fio job or something else?  We can 
>> trigger
>> a stable page prep run before we take the mutex, but stable pages
>> shouldn't be part of random IO workloads unless we're doing random IO
>> inside a file that mostly fits in ram.
>>
>
> This problem is easily encountered with VMs and file-based iSCSI
> LUNs, most of which produce a random write pattern.
> Fio can simulate the problem quickly.

With really random IO, to a device bigger than memory, the chances of us 
writing to the same block twice with it still in flight are pretty 
small.  With VMs and other virtual block devices backed by btrfs, it's 
easier to hit stable pages because filesystems try pretty hard to be 
less than completely random.

>
> So which is the best way?
> 1. unlocked buffered write (like direct-io)
> 2. stable page prep
> 3. Or is there any way to completely avoid stable pages?
>
>> balance_dirty_pages() is a much more common waiting point, and doing
>> that with the inode lock held definitely adds contention.
>
> Yes, I agree.
> But in terms of latency, balance_dirty_pages usually adds a
> relatively smooth delay, whereas lock_and_cleanup_extent_if_need and
> prepare_pages depend on IO completion, so their latency grows by
> multiples.

I think the best way is to peel out a stable page write wait before the 
inode lock is taken, and only when the write is inside i_size.  We'll 
need to keep the existing stable page wait as well, but waiting before 
the inode lock is taken should handle the vast majority of IO in flight.

I'd also push the balance_dirty_pages() outside the inode lock, as long 
as the write is relatively small.
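Sketched as kernel-style pseudocode (not compilable; `wait_for_stable_pages_range()` and the 64K small-write threshold are hypothetical illustrations of the idea, not existing interfaces):

```c
/* Pseudocode sketch of the suggested ordering, not a real patch. */
static ssize_t buffered_write_sketch(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	/*
	 * Opportunistic stable-page wait before taking the inode lock.
	 * i_size can only be read racily here, so the existing wait
	 * under the lock must stay as a backstop.
	 */
	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode))
		wait_for_stable_pages_range(inode, iocb->ki_pos,
					    iov_iter_count(from)); /* hypothetical */

	inode_lock(inode);
	ret = __btrfs_buffered_write(iocb->ki_filp, from, iocb->ki_pos);
	inode_unlock(inode);

	/* Throttle small writes after the lock is dropped. */
	if (ret > 0 && ret <= SZ_64K) /* illustrative threshold */
		balance_dirty_pages_ratelimited(inode->i_mapping);

	return ret;
}
```

The point is only the ordering: the stable-page wait and the dirty throttling move outside the lock, while the actual write path stays locked.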

For really getting rid of stable pages, that's a bigger project ;)

-chris

Patch

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 41ab907..8eac540 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1600,6 +1600,7 @@  static noinline ssize_t __btrfs_buffered_write(struct file *file,
 	int ret = 0;
 	bool only_release_metadata = false;
 	bool force_page_uptodate = false;
+	bool relock = false;
 
 	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
 			PAGE_SIZE / (sizeof(struct page *)));
@@ -1609,6 +1610,18 @@  static noinline ssize_t __btrfs_buffered_write(struct file *file,
 	if (!pages)
 		return -ENOMEM;
 
+	inode_dio_begin(inode);
+
+	/*
+	 * If the write is beyond the EOF, we need to update
+	 * i_size, which is protected by i_mutex, so we cannot
+	 * unlock i_mutex in this case.
+	 */
+	if (pos + iov_iter_count(i) <= i_size_read(inode)) {
+		inode_unlock(inode);
+		relock = true;
+	}
+
 	while (iov_iter_count(i) > 0) {
 		size_t offset = pos & (PAGE_SIZE - 1);
 		size_t sector_offset;
@@ -1808,6 +1821,10 @@  static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		}
 	}
 
+	inode_dio_end(inode);
+	if (relock)
+		inode_lock(inode);
+
 	extent_changeset_free(data_reserved);
 	return num_written ? num_written : ret;
 }