[v5,1/1] Allow non-extending parallel direct writes on the same file.

Message ID 20220617071027.6569-2-dharamhans87@gmail.com (mailing list archive)
State New, archived
Series FUSE: Allow non-extending parallel direct writes

Commit Message

Dharmendra Singh June 17, 2022, 7:10 a.m. UTC
Currently, in FUSE, direct writes on the same file are serialized over
the inode lock, i.e. we hold the inode lock for the full duration of
the write request. I could not find a comment in the fuse code or git
history that clearly explains why this exclusive lock is taken for
direct writes.  Possible reasons for acquiring an exclusive lock
include, but are not limited to:
1) Our guess is that some user-space fuse implementations might be
   relying on this lock for serialization.
2) The lock protects against file read/write size races.
3) Ruling out any issues arising from partial write failures.

This patch relaxes the exclusive lock for direct non-extending writes
only. File size extending writes might not need the lock either,
but we are not entirely sure whether relaxing it there risks
introducing a regression. Furthermore, benchmarking with fio does not
show a difference between patch versions that take, on file size
extension, a) an exclusive lock and b) a shared lock.
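
(For illustration only - the exact fio parameters are not recorded
here - a hypothetical invocation for such a test, issuing O_DIRECT
writes into a pre-sized file from many jobs, could look like

  fio --name=pdw --filename=/mnt/fuse/testfile --direct=1 \
      --ioengine=psync --rw=randwrite --bs=1M --size=4G \
      --numjobs=16 --group_reporting

With --rw=randwrite over a file already laid out to its full size the
writes stay non-extending, so the shared-lock path is exercised.)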

A possible example of an issue with i_size-extending writes is the
write error case. Some writes might succeed and others might fail for
file-system-internal reasons - for example ENOSPC. With parallel
file size extending writes it _might_ be difficult to revert the action
of the failing write, especially to restore the right i_size.

With these changes, we allow non-extending parallel direct writes
on the same file with the help of a new flag called
FOPEN_PARALLEL_DIRECT_WRITES. If this flag is set on the file (the flag
is passed from libfuse to the fuse kernel as part of file open/create),
we no longer take the exclusive lock, but instead a shared lock that
allows non-extending writes to run in parallel.
FUSE implementations which rely on this inode lock for serialization
can continue to do so, and serialized direct writes remain the
default.  Implementations that do not serialize writes themselves need
to be updated to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their
file open/create reply.
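
As an illustration - a sketch, not code from any particular file
system - a low-level FUSE server speaking the raw kernel protocol
would opt in by setting the flag in the open_flags field of its
OPEN/CREATE reply; a libfuse-based implementation would instead set
whatever field libfuse exposes for this flag:

  /* Sketch only: fill a FUSE_OPEN/FUSE_CREATE reply so that the kernel
   * may take the shared inode lock for direct writes on this file.
   * The function name and the FOPEN_DIRECT_IO choice are illustrative.
   */
  #include <stdint.h>
  #include <linux/fuse.h>

  static void fill_open_reply(struct fuse_open_out *out, uint64_t fh)
  {
          out->fh = fh;                     /* server-side file handle */
          out->open_flags = FOPEN_DIRECT_IO /* direct I/O on this file */
                          | FOPEN_PARALLEL_DIRECT_WRITES;
  }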

During patch review there were concerns that network file systems (or
multiple VFS mounts of the same file system) might have issues with
parallel writes. We believe this is not the case, as this is just a
local lock, which network file systems could not rely on anyway.
I.e. this lock is only for local consistency.

Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/file.c            | 43 ++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/fuse.h |  2 ++
 2 files changed, 42 insertions(+), 3 deletions(-)

Comments

Miklos Szeredi June 17, 2022, 7:36 a.m. UTC | #1
On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:

> This patch relaxes the exclusive lock for direct non-extending writes
> only. File size extending writes might not need the lock either,
> but we are not entirely sure if there is a risk to introduce any
> kind of regression. Furthermore, benchmarking with fio does not
> show a difference between patch versions that take on file size
> extension a) an exclusive lock and b) a shared lock.

I'm okay with this, but ISTR Bernd noted a real-life scenario where
this is not sufficient.  Maybe that should be mentioned in the patch
header?

Thanks,
Miklos
Bernd Schubert June 17, 2022, 9:25 a.m. UTC | #2
Hi Miklos,

On 6/17/22 09:36, Miklos Szeredi wrote:
> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
> 
>> This patch relaxes the exclusive lock for direct non-extending writes
>> only. File size extending writes might not need the lock either,
>> but we are not entirely sure if there is a risk to introduce any
>> kind of regression. Furthermore, benchmarking with fio does not
>> show a difference between patch versions that take on file size
>> extension a) an exclusive lock and b) a shared lock.
> 
> I'm okay with this, but ISTR Bernd noted a real-life scenario where
> this is not sufficient.  Maybe that should be mentioned in the patch
> header?


The above comment is actually directly from me.

We didn't check whether fio extends the file before the runs, but even
if it did, my current thinking is that where we previously serialized
n threads, we now have an alternation of
	- "parallel n-1 threads running" + 1 waiting thread
	- "blocked n-1 threads" + 1 running

I think we will come back to this anyway if we continue to see slow
IO with MPIIO. Right now we want to get our patches merged first and
then will create an updated module for RHEL8 (+derivatives) customers.
Our benchmark machines are also running plain RHEL8 kernels - without
back-porting the modules first we don't know yet what the actual
impact on things like io500 will be.

Shall we still extend the commit message or are we good to go?



Thanks,
Bernd
Miklos Szeredi June 17, 2022, 12:43 p.m. UTC | #3
On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>
> Hi Miklos,
>
> On 6/17/22 09:36, Miklos Szeredi wrote:
> > On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
> >
> >> This patch relaxes the exclusive lock for direct non-extending writes
> >> only. File size extending writes might not need the lock either,
> >> but we are not entirely sure if there is a risk to introduce any
> >> kind of regression. Furthermore, benchmarking with fio does not
> >> show a difference between patch versions that take on file size
> >> extension a) an exclusive lock and b) a shared lock.
> >
> > I'm okay with this, but ISTR Bernd noted a real-life scenario where
> > this is not sufficient.  Maybe that should be mentioned in the patch
> > header?
>
>
> the above comment is actually directly from me.
>
> We didn't check if fio extends the file before the runs, but even if it
> would, my current thinking is that before we serialized n-threads, now
> we have an alternation of
>         - "parallel n-1 threads running" + 1 waiting thread
>         - "blocked  n-1 threads" + 1 running
>
> I think if we will come back anyway, if we should continue to see slow
> IO with MPIIO. Right now we want to get our patches merged first and
> then will create an updated module for RHEL8 (+derivatives) customers.
> Our benchmark machines are also running plain RHEL8 kernels - without
> back porting the modules first we don' know yet what we will be the
> actual impact to things like io500.
>
> Shall we still extend the commit message or are we good to go?

Well, it would be nice to see the real workload on the backported
patch.   Not just because it would tell us if this makes sense in the
first place, but also to have additional testing.

Thanks,
Miklos
Bernd Schubert June 17, 2022, 1:07 p.m. UTC | #4
On 6/17/22 14:43, Miklos Szeredi wrote:
> On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>
>> Hi Miklos,
>>
>> On 6/17/22 09:36, Miklos Szeredi wrote:
>>> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
>>>
>>>> This patch relaxes the exclusive lock for direct non-extending writes
>>>> only. File size extending writes might not need the lock either,
>>>> but we are not entirely sure if there is a risk to introduce any
>>>> kind of regression. Furthermore, benchmarking with fio does not
>>>> show a difference between patch versions that take on file size
>>>> extension a) an exclusive lock and b) a shared lock.
>>>
>>> I'm okay with this, but ISTR Bernd noted a real-life scenario where
>>> this is not sufficient.  Maybe that should be mentioned in the patch
>>> header?
>>
>>
>> the above comment is actually directly from me.
>>
>> We didn't check if fio extends the file before the runs, but even if it
>> would, my current thinking is that before we serialized n-threads, now
>> we have an alternation of
>>          - "parallel n-1 threads running" + 1 waiting thread
>>          - "blocked  n-1 threads" + 1 running
>>
>> I think if we will come back anyway, if we should continue to see slow
>> IO with MPIIO. Right now we want to get our patches merged first and
>> then will create an updated module for RHEL8 (+derivatives) customers.
>> Our benchmark machines are also running plain RHEL8 kernels - without
>> back porting the modules first we don' know yet what we will be the
>> actual impact to things like io500.
>>
>> Shall we still extend the commit message or are we good to go?
> 
> Well, it would be nice to see the real workload on the backported
> patch.   Not just because it would tell us if this makes sense in the
> first place, but also to have additional testing.

I really don't want to backport before it is merged upstream -
backporting first has several issues (like it never gets merged and we
need to maintain it forever, management believes the work is done and
doesn't plan for more time, etc.).

What we can do is install a recent kernel on one of our systems and
then run single-shared-file IOR over MPIIO against the patched and
unpatched fuse module. I hope I find the time later today or on Monday
at the latest.


Thanks,
Bernd
Vivek Goyal June 18, 2022, 7:07 p.m. UTC | #5
On Fri, Jun 17, 2022 at 12:40:27PM +0530, Dharmendra Singh wrote:
> In general, as of now, in FUSE, direct writes on the same file are
> serialized over inode lock i.e we hold inode lock for the full duration
> of the write request. I could not find in fuse code and git history
> a comment which clearly explains why this exclusive lock is taken
> for direct writes.  Following might be the reasons for acquiring
> an exclusive lock but not be limited to
> 1) Our guess is some USER space fuse implementations might be relying
>    on this lock for seralization.
> 2) The lock protects against file read/write size races.
> 3) Ruling out any issues arising from partial write failures.
> 
> This patch relaxes the exclusive lock for direct non-extending writes
> only. File size extending writes might not need the lock either,
> but we are not entirely sure if there is a risk to introduce any
> kind of regression. Furthermore, benchmarking with fio does not
> show a difference between patch versions that take on file size
> extension a) an exclusive lock and b) a shared lock.
> 
> A possible example of an issue with i_size extending writes are write
> error cases. Some writes might succeed and others might fail for
> file system internal reasons - for example ENOSPACE. With parallel
> file size extending writes it _might_ be difficult to revert the action
> of the failing write, especially to restore the right i_size.
> 
> With these changes, we allow non-extending parallel direct writes
> on the same file with the help of a flag called
> FOPEN_PARALLEL_DIRECT_WRITES. If this flag is set on the file (flag is
> passed from libfuse to fuse kernel as part of file open/create),
> we do not take exclusive lock anymore, but instead use a shared lock
> that allows non-extending writes to run in parallel.
> FUSE implementations which rely on this inode lock for serialisation
> can continue to do so and serialized direct writes are still the
> default.  Implementations that do not do write serialization need to
> be updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in
> their file open/create reply.
> 
> On patch review there were concerns that network file systems (or
> vfs multiple mounts of the same file system) might have issues with
> parallel writes. We believe this is not the case, as this is just a
> local lock, which network file systems could not rely on anyway.
> I.e. this lock is just for local consistency.
> 
> Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/file.c            | 43 ++++++++++++++++++++++++++++++++++++---
>  include/uapi/linux/fuse.h |  2 ++
>  2 files changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 37eebfb90500..b3a5706f301d 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1565,14 +1565,47 @@ static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	return res;
>  }
>  
> +static bool fuse_direct_write_extending_i_size(struct kiocb *iocb,
> +					       struct iov_iter *iter)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	return iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode);
> +}
> +
>  static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct file *file = iocb->ki_filp;
> +	struct fuse_file *ff = file->private_data;
>  	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
>  	ssize_t res;
> +	bool exclusive_lock =
> +		!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
> +		iocb->ki_flags & IOCB_APPEND ||
> +		fuse_direct_write_extending_i_size(iocb, from);
> +
> +	/*
> +	 * Take exclusive lock if
> +	 * - Parallel direct writes are disabled - a user space decision
> +	 * - Parallel direct writes are enabled and i_size is being extended.
> +	 *   This might not be needed at all, but needs further investigation.
> +	 */
> +	if (exclusive_lock)
> +		inode_lock(inode);
> +	else {
> +		inode_lock_shared(inode);
> +
> +		/* A race with truncate might have come up as the decision for
> +		 * the lock type was done without holding the lock, check again.
> +		 */
> +		if (fuse_direct_write_extending_i_size(iocb, from)) {
> +			inode_unlock_shared(inode);
> +			inode_lock(inode);
> +			exclusive_lock = true;
> +		}
> +	}
>  
> -	/* Don't allow parallel writes to the same file */
> -	inode_lock(inode);
>  	res = generic_write_checks(iocb, from);
>  	if (res > 0) {
>  		if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
> @@ -1583,7 +1616,10 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			fuse_write_update_attr(inode, iocb->ki_pos, res);
>  		}
>  	}
> -	inode_unlock(inode);
> +	if (exclusive_lock)
> +		inode_unlock(inode);
> +	else
> +		inode_unlock_shared(inode);
>  
>  	return res;
>  }
> @@ -2925,6 +2961,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  
>  	if (iov_iter_rw(iter) == WRITE) {
>  		fuse_write_update_attr(inode, pos, ret);
> +		/* For extending writes we already hold exclusive lock */
>  		if (ret < 0 && offset + count > i_size)
>  			fuse_do_truncate(file);

I was curious about this truncation when ret < 0. I am assuming this
means that if some write failed, we do a truncation. Looking at the git
history I found the following commit.

commit efb9fa9e911b23c7ea5330215bda778a7c69dba8
Author: Maxim Patlasov <mpatlasov@parallels.com>
Date:   Tue Dec 18 14:05:08 2012 +0400

    fuse: truncate file if async dio failed

    The patch improves error handling in fuse_direct_IO(): if we successfully
    submitted several fuse requests on behalf of synchronous direct write
    extending file and some of them failed, let's try to do our best to clean-up.

    Changed in v2: reuse fuse_do_setattr(). Thanks to Brian for suggestion.


What's interesting here is that it looks like we already have code to
submit multiple file-extending FUSE AIO + DIO requests and then wait
for their completion. So a single write can be broken into multiple
smaller fuse write requests, IIUC.

I see the following.

fuse_direct_IO()
{
        /*
         * We cannot asynchronously extend the size of a file.
         * In such case the aio will behave exactly like sync io.
         */
        if ((offset + count > i_size) && io->write)
                io->blocking = true;
}

This should force the IO to be blocking/synchronous. And it looks like,
if io->async is set, fuse_direct_io() can still submit multiple
file-extending requests and we will wait for their completion.

wait_for_completion(&wait);

And truncate the file if some I/O failed. This probably means undoing
all the writes we did, even the ones that succeeded.

                if (ret < 0 && offset + count > i_size)
                        fuse_do_truncate(file);
        }


Anyway, the point I am trying to make is that for a single file-extending
AIO + DIO write, it looks like the existing code might split it into
multiple file-extending AIO + DIO writes and wait for their completion.
And if any one of the split requests fails, we need to truncate the file
and undo all the WRITEs. And for the truncation we need the exclusive
lock. That leads to the conclusion that we can't hold a shared lock
while extending the file, otherwise the current code would do the
truncation with the shared lock held and things would break somewhere.

Thanks
Vivek
Bernd Schubert Sept. 13, 2022, 8:44 a.m. UTC | #6
On 6/17/22 14:43, Miklos Szeredi wrote:
> On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>
>> Hi Miklos,
>>
>> On 6/17/22 09:36, Miklos Szeredi wrote:
>>> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
>>>
>>>> This patch relaxes the exclusive lock for direct non-extending writes
>>>> only. File size extending writes might not need the lock either,
>>>> but we are not entirely sure if there is a risk to introduce any
>>>> kind of regression. Furthermore, benchmarking with fio does not
>>>> show a difference between patch versions that take on file size
>>>> extension a) an exclusive lock and b) a shared lock.
>>>
>>> I'm okay with this, but ISTR Bernd noted a real-life scenario where
>>> this is not sufficient.  Maybe that should be mentioned in the patch
>>> header?
>>
>>
>> the above comment is actually directly from me.
>>
>> We didn't check if fio extends the file before the runs, but even if it
>> would, my current thinking is that before we serialized n-threads, now
>> we have an alternation of
>>          - "parallel n-1 threads running" + 1 waiting thread
>>          - "blocked  n-1 threads" + 1 running
>>
>> I think if we will come back anyway, if we should continue to see slow
>> IO with MPIIO. Right now we want to get our patches merged first and
>> then will create an updated module for RHEL8 (+derivatives) customers.
>> Our benchmark machines are also running plain RHEL8 kernels - without
>> back porting the modules first we don' know yet what we will be the
>> actual impact to things like io500.
>>
>> Shall we still extend the commit message or are we good to go?
> 
> Well, it would be nice to see the real workload on the backported
> patch.   Not just because it would tell us if this makes sense in the
> first place, but also to have additional testing.


Sorry for the delay, Dharmendra and I got busy with other tasks, and
Horst (in CC) took over the patches and did the MPIIO benchmarks on 5.19.

Results with https://github.com/dchirikov/mpiio.git

              unpatched      patched        patched
             (extending)   (extending)  (non-extending)
--------------------------------------------------------
                MB/s          MB/s           MB/s
2 threads      2275.00       2497.00        5688.00
4 threads      2438.00       2560.00       10240.00
8 threads      2925.00       3792.00       25600.00
16 threads     3792.00      10240.00       20480.00


(Patched non-extending requires manually extending the file to its
final size before the run; mpiio does not support that natively, as far
as I know.)
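
(As an aside, and purely as an assumption about how such a manual
pre-extension could be done - not necessarily the exact step used for
these runs - the target file can be grown to its final size beforehand
with e.g.

  truncate -s 16G /path/to/testfile

so that the subsequent direct writes never extend i_size.)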



Results with IOR (HPC quasi standard benchmark)

ior -w -E -k -o /tmp/test/home/hbi/test/test.1 -a mpiio -s 1280 -b 8m -t 8m


              unpatched       patched
             (extending)    (extending)
-------------------------------------------
                MB/s           MB/s
2 threads      2086.10        2027.76
4 threads      1858.94        2132.73
8 threads      1792.68        4609.05
16 threads     1786.48        8627.96


(IOR does not allow manual file extension without changing its code.)

We can see that patched non-extending gives the best results, as
Dharmendra has already posted before, but results are still
much better with the patches in extending mode. My assumption here is
that instead of serializing N writers, the patched version alternates
between
	- 1 thread extending, N-1 waiting
	- N-1 writing, 1 thread waiting



Thanks,
Bernd
Bernd Schubert Oct. 20, 2022, 2:51 p.m. UTC | #7
Miklos,

is there anything that speaks against getting this patch into 
linux-next? Shall I resend against v6.0? I just tested it - there is no 
merge conflict. The updated patch with benchmark results in the commit 
message is here
https://github.com/aakefbs/linux/tree/parallel-dio-write


Thanks,
Bernd
Miklos Szeredi Oct. 21, 2022, 6:57 a.m. UTC | #8
On Tue, 13 Sept 2022 at 10:44, Bernd Schubert <bschubert@ddn.com> wrote:
>
>
>
> On 6/17/22 14:43, Miklos Szeredi wrote:
> > On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> >>
> >> Hi Miklos,
> >>
> >> On 6/17/22 09:36, Miklos Szeredi wrote:
> >>> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
> >>>
> >>>> This patch relaxes the exclusive lock for direct non-extending writes
> >>>> only. File size extending writes might not need the lock either,
> >>>> but we are not entirely sure if there is a risk to introduce any
> >>>> kind of regression. Furthermore, benchmarking with fio does not
> >>>> show a difference between patch versions that take on file size
> >>>> extension a) an exclusive lock and b) a shared lock.
> >>>
> >>> I'm okay with this, but ISTR Bernd noted a real-life scenario where
> >>> this is not sufficient.  Maybe that should be mentioned in the patch
> >>> header?
> >>
> >>
> >> the above comment is actually directly from me.
> >>
> >> We didn't check if fio extends the file before the runs, but even if it
> >> would, my current thinking is that before we serialized n-threads, now
> >> we have an alternation of
> >>          - "parallel n-1 threads running" + 1 waiting thread
> >>          - "blocked  n-1 threads" + 1 running
> >>
> >> I think if we will come back anyway, if we should continue to see slow
> >> IO with MPIIO. Right now we want to get our patches merged first and
> >> then will create an updated module for RHEL8 (+derivatives) customers.
> >> Our benchmark machines are also running plain RHEL8 kernels - without
> >> back porting the modules first we don' know yet what we will be the
> >> actual impact to things like io500.
> >>
> >> Shall we still extend the commit message or are we good to go?
> >
> > Well, it would be nice to see the real workload on the backported
> > patch.   Not just because it would tell us if this makes sense in the
> > first place, but also to have additional testing.
>
>
> Sorry for the delay, Dharmendra and me got busy with other tasks and
> Horst (in CC) took over the patches and did the MPIIO benchmarks on 5.19.
>
> Results with https://github.com/dchirikov/mpiio.git
>
>                 unpatched    patched      patched
>                 (extending) (extending)  (non-extending)
> ----------------------------------------------------------
>                  MB/s        MB/s            MB/s
> 2 threads     2275.00      2497.00       5688.00
> 4 threads     2438.00      2560.00      10240.00
> 8 threads     2925.00      3792.00      25600.00
> 16 threads    3792.00     10240.00      20480.00
>
>
> (Patched-nonextending is a manual operation on the file to extend the
> size, mpiio does not support that natively, as far as I know.)
>
>
>
> Results with IOR (HPC quasi standard benchmark)
>
> ior -w -E -k -o /tmp/test/home/hbi/test/test.1 -a mpiio -s 1280 -b 8m -t 8m
>
>
>                 unpatched       patched
>                 (extending)     (extending)
> -------------------------------------------
>                    MB/s           MB/s
> 2 threads       2086.10         2027.76
> 4 threads       1858.94         2132.73
> 8 threads       1792.68         4609.05
> 16 threads      1786.48         8627.96
>
>
> (IOR does not allow manual file extension, without changing its code.)
>
> We can see that patched non-extending gives the best results, as
> Dharmendra has already posted before, but results are still
> much better with the patches in extending mode. My assumption is here
> instead serializing N-writers, there is an alternative
> run of
>         - 1 thread extending, N-1 waiting
>         - N-1 writing, 1 thread waiting
> in the patched version.
>

Okay, thanks for the heads up.

I queued the patch up for v6.2

Thanks,
Miklos


>
>
> Thanks,
> Bernd
Bernd Schubert Oct. 21, 2022, 9:16 a.m. UTC | #9
On 10/21/22 08:57, Miklos Szeredi wrote:
> On Tue, 13 Sept 2022 at 10:44, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>>
>>
>> On 6/17/22 14:43, Miklos Szeredi wrote:
>>> On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>>>
>>>> Hi Miklos,
>>>>
>>>> On 6/17/22 09:36, Miklos Szeredi wrote:
>>>>> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
>>>>>
>>>>>> This patch relaxes the exclusive lock for direct non-extending writes
>>>>>> only. File size extending writes might not need the lock either,
>>>>>> but we are not entirely sure if there is a risk to introduce any
>>>>>> kind of regression. Furthermore, benchmarking with fio does not
>>>>>> show a difference between patch versions that take on file size
>>>>>> extension a) an exclusive lock and b) a shared lock.
>>>>>
>>>>> I'm okay with this, but ISTR Bernd noted a real-life scenario where
>>>>> this is not sufficient.  Maybe that should be mentioned in the patch
>>>>> header?
>>>>
>>>>
>>>> the above comment is actually directly from me.
>>>>
>>>> We didn't check if fio extends the file before the runs, but even if it
>>>> would, my current thinking is that before we serialized n-threads, now
>>>> we have an alternation of
>>>>           - "parallel n-1 threads running" + 1 waiting thread
>>>>           - "blocked  n-1 threads" + 1 running
>>>>
>>>> I think if we will come back anyway, if we should continue to see slow
>>>> IO with MPIIO. Right now we want to get our patches merged first and
>>>> then will create an updated module for RHEL8 (+derivatives) customers.
>>>> Our benchmark machines are also running plain RHEL8 kernels - without
>>>> back porting the modules first we don' know yet what we will be the
>>>> actual impact to things like io500.
>>>>
>>>> Shall we still extend the commit message or are we good to go?
>>>
>>> Well, it would be nice to see the real workload on the backported
>>> patch.   Not just because it would tell us if this makes sense in the
>>> first place, but also to have additional testing.
>>
>>
>> Sorry for the delay, Dharmendra and me got busy with other tasks and
>> Horst (in CC) took over the patches and did the MPIIO benchmarks on 5.19.
>>
>> Results with https://github.com/dchirikov/mpiio.git
>>
>>                  unpatched    patched      patched
>>                  (extending) (extending)  (non-extending)
>> ----------------------------------------------------------
>>                   MB/s        MB/s            MB/s
>> 2 threads     2275.00      2497.00       5688.00
>> 4 threads     2438.00      2560.00      10240.00
>> 8 threads     2925.00      3792.00      25600.00
>> 16 threads    3792.00     10240.00      20480.00
>>
>>
>> (Patched-nonextending is a manual operation on the file to extend the
>> size, mpiio does not support that natively, as far as I know.)
>>
>>
>>
>> Results with IOR (HPC quasi standard benchmark)
>>
>> ior -w -E -k -o /tmp/test/home/hbi/test/test.1 -a mpiio -s 1280 -b 8m -t 8m
>>
>>
>>                  unpatched       patched
>>                  (extending)     (extending)
>> -------------------------------------------
>>                     MB/s           MB/s
>> 2 threads       2086.10         2027.76
>> 4 threads       1858.94         2132.73
>> 8 threads       1792.68         4609.05
>> 16 threads      1786.48         8627.96
>>
>>
>> (IOR does not allow manual file extension, without changing its code.)
>>
>> We can see that patched non-extending gives the best results, as
>> Dharmendra has already posted before, but results are still
>> much better with the patches in extending mode. My assumption is here
>> instead serializing N-writers, there is an alternative
>> run of
>>          - 1 thread extending, N-1 waiting
>>          - N-1 writing, 1 thread waiting
>> in the patched version.
>>
> 
> Okay, thanks for the heads up.
> 
> I queued the patch up for v6.2
> 

Thank you!

Patch

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 37eebfb90500..b3a5706f301d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1565,14 +1565,47 @@  static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return res;
 }
 
+static bool fuse_direct_write_extending_i_size(struct kiocb *iocb,
+					       struct iov_iter *iter)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	return iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode);
+}
+
 static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	struct file *file = iocb->ki_filp;
+	struct fuse_file *ff = file->private_data;
 	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
 	ssize_t res;
+	bool exclusive_lock =
+		!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
+		iocb->ki_flags & IOCB_APPEND ||
+		fuse_direct_write_extending_i_size(iocb, from);
+
+	/*
+	 * Take exclusive lock if
+	 * - Parallel direct writes are disabled - a user space decision
+	 * - Parallel direct writes are enabled and i_size is being extended.
+	 *   This might not be needed at all, but needs further investigation.
+	 */
+	if (exclusive_lock)
+		inode_lock(inode);
+	else {
+		inode_lock_shared(inode);
+
+		/* A race with truncate might have come up as the decision for
+		 * the lock type was done without holding the lock, check again.
+		 */
+		if (fuse_direct_write_extending_i_size(iocb, from)) {
+			inode_unlock_shared(inode);
+			inode_lock(inode);
+			exclusive_lock = true;
+		}
+	}
 
-	/* Don't allow parallel writes to the same file */
-	inode_lock(inode);
 	res = generic_write_checks(iocb, from);
 	if (res > 0) {
 		if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
@@ -1583,7 +1616,10 @@  static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			fuse_write_update_attr(inode, iocb->ki_pos, res);
 		}
 	}
-	inode_unlock(inode);
+	if (exclusive_lock)
+		inode_unlock(inode);
+	else
+		inode_unlock_shared(inode);
 
 	return res;
 }
@@ -2925,6 +2961,7 @@  fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 	if (iov_iter_rw(iter) == WRITE) {
 		fuse_write_update_attr(inode, pos, ret);
+		/* For extending writes we already hold exclusive lock */
 		if (ret < 0 && offset + count > i_size)
 			fuse_do_truncate(file);
 	}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index a28dd60078ff..bbb1246cae19 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -301,6 +301,7 @@  struct fuse_file_lock {
  * FOPEN_CACHE_DIR: allow caching this directory
  * FOPEN_STREAM: the file is stream-like (no file position at all)
  * FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
+ * FOPEN_PARALLEL_DIRECT_WRITES: Allow concurrent direct writes on the same inode
  */
 #define FOPEN_DIRECT_IO		(1 << 0)
 #define FOPEN_KEEP_CACHE	(1 << 1)
@@ -308,6 +309,7 @@  struct fuse_file_lock {
 #define FOPEN_CACHE_DIR		(1 << 3)
 #define FOPEN_STREAM		(1 << 4)
 #define FOPEN_NOFLUSH		(1 << 5)
+#define FOPEN_PARALLEL_DIRECT_WRITES	(1 << 6)
 
 /**
  * INIT request/reply flags