[RESEND,V12,0/8] fuse: Add support for passthrough read/write

Message ID 20210125153057.3623715-1-balsini@android.com (mailing list archive)

Message

Alessio Balsini Jan. 25, 2021, 3:30 p.m. UTC
This is the 12th version of the series, rebased on top of v5.11-rc5.
Please find the changelog at the bottom of this cover letter.

Add support for file system passthrough read/write of files when enabled
in userspace through the option FUSE_PASSTHROUGH.

There are file systems based on FUSE that are intended to enforce
special policies or trigger complex decisions at the file operations
level. Android, for example, uses FUSE to enforce fine-grained access
policies that also depend on the file contents.
Sometimes, at open or create time, a file is identified as not requiring
additional checks for subsequent reads/writes, so FUSE would simply act
as a passive bridge between the process accessing the FUSE file system
and the lower file system. Splicing and caching help reduce the FUSE
overhead, but there are still read/write operations forwarded to the
userspace FUSE daemon that could be avoided.

This series was inspired by the original patches from Nikhilesh Reddy,
whose idea and code have been elaborated and improved thanks to
community support.

When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may
decide, while handling the open/create operations, whether the given
file can be accessed in passthrough mode. In that case, all subsequent
read and write operations are forwarded by the kernel directly to the
lower file system through the VFS layer rather than to the FUSE daemon.
All requests other than reads and writes are still handled by the
userspace FUSE daemon.
This improves read and write performance, especially for reads at random
offsets, for which no (readahead) caching mechanism would help.
Benchmarks show performance close to native file system access when
doing massive manipulations on a single opened file, especially for
random reads, random writes and sequential writes. Detailed benchmarking
results are presented below.

The creation of this direct connection (passthrough) between FUSE file
objects and file objects in the lower file system is reminiscent of
passing file descriptors via sockets:
- a process requests the opening of a file handled by FUSE, so the
  kernel forwards the request to the FUSE daemon;
- the FUSE daemon opens the target file in the lower file system,
  getting its file descriptor;
- the FUSE daemon also decides, according to its internal policies,
  whether passthrough can be enabled for that file and, if so, performs
  a FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl on /dev/fuse, passing the file
  descriptor obtained at the previous step and the fuse_req unique
  identifier;
- the kernel translates the file descriptor to the file pointer by
  navigating through the opened files of the "current" process and
  temporarily stores it in the associated open/create fuse_req's
  passthrough_filp;
- when the FUSE daemon is done with the request and it's time for the
  kernel to close it, the kernel checks if the passthrough_filp is
  available and, if so, updates the additional field in the fuse_file
  owned by the process accessing the FUSE file system.
From now on, all the read/write operations performed by that process
will be redirected to the corresponding lower file system file by
creating new VFS requests.
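
The hand-off described in the list above can be modelled in a few lines
of userspace C. This is only a toy sketch, not the kernel code: struct
toy_req and struct toy_file are hypothetical stand-ins for fuse_req and
fuse_file, a plain file descriptor stands in for passthrough_filp, and
all function names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for fuse_req: one pending open/create request. */
struct toy_req {
	unsigned long unique;	/* request identifier known to the daemon */
	int lower_fd;		/* models passthrough_filp, -1 if unset */
};

/* Hypothetical stand-in for fuse_file: the opened FUSE file. */
struct toy_file {
	int lower_fd;		/* lower file used for passthrough, -1 if none */
};

/* Models the daemon's ioctl: attach the lower file to the pending request. */
static int toy_passthrough_open(struct toy_req *req, unsigned long unique,
				int daemon_fd)
{
	int fd;

	if (req->unique != unique)
		return -1;
	fd = dup(daemon_fd);	/* keep our own reference to the lower file */
	if (fd < 0)
		return -1;
	req->lower_fd = fd;
	return 0;
}

/* Models the end of the open/create request: move the reference over. */
static void toy_request_end(struct toy_req *req, struct toy_file *ff)
{
	ff->lower_fd = req->lower_fd;
	req->lower_fd = -1;
}

/*
 * Models a read on the FUSE file. Returns the bytes read from the lower
 * file when passthrough is set up; otherwise returns -1 and sets
 * *via_daemon to signal that the request would have to go to the daemon.
 */
static ssize_t toy_read(const struct toy_file *ff, void *buf, size_t len,
			off_t off, int *via_daemon)
{
	*via_daemon = ff->lower_fd < 0;
	if (*via_daemon)
		return -1;
	return pread(ff->lower_fd, buf, len, off);
}
```

In this model a read issued before toy_request_end() has run still
reports that it would go to the daemon, while any read issued afterwards
is served straight from the lower file.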
Since the read/write operation to the lower file system is executed with
the current process's credentials, it might not have enough privileges
to succeed. For this reason, the process temporarily receives the same
credentials as the FUSE daemon, which are reverted as soon as the
read/write operation completes, as if the request had been performed by
the FUSE daemon itself. This solution was inspired by the way overlayfs
handles read/write operations.
Asynchronous IO is supported as well: separate AIO requests are created
for the lower file system and tracked internally by FUSE, which
intercepts and propagates their completion through an internal
ki_complete callback, similar to the current implementation of
overlayfs.
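
As a rough illustration of that completion chaining, here is a toy
userspace model. It only borrows the shape of struct ovl_aio_req (an
embedded request plus a pointer to the original one); every toy_* name
is hypothetical, and the real kernel code deals with reference counting
and error paths that are ignored here.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct kiocb. */
struct toy_kiocb {
	long pos;					/* file position */
	void (*complete)(struct toy_kiocb *, long);	/* completion hook */
	long result;					/* bytes transferred */
	int done;					/* set once completed */
};

/* Wrapper request, shaped like ovl_aio_req: lower request + original. */
struct toy_aio_req {
	struct toy_kiocb iocb;		/* request sent to the lower file */
	struct toy_kiocb *orig_iocb;	/* request issued by the caller */
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Completion of the lower request: propagate it to the original one. */
static void toy_aio_complete(struct toy_kiocb *iocb, long res)
{
	struct toy_aio_req *aio =
		toy_container_of(iocb, struct toy_aio_req, iocb);
	struct toy_kiocb *orig = aio->orig_iocb;

	orig->pos = iocb->pos;
	orig->complete(orig, res);
}

/* What the caller would install as its own completion handler. */
static void toy_user_complete(struct toy_kiocb *iocb, long res)
{
	iocb->result = res;
	iocb->done = 1;
}

/* Build the lower request from the original one and hook the callback. */
static void toy_aio_submit(struct toy_aio_req *aio, struct toy_kiocb *orig)
{
	aio->orig_iocb = orig;
	aio->iocb.pos = orig->pos;
	aio->iocb.complete = toy_aio_complete;
	aio->iocb.result = 0;
	aio->iocb.done = 0;
}

/* Stands in for the lower file system finishing the I/O later on. */
static void toy_lower_finish(struct toy_kiocb *iocb, long nbytes)
{
	iocb->pos += nbytes;
	iocb->complete(iocb, nbytes);
}
```

The caller only ever sees its own request complete; the wrapper stays an
internal detail of the layer in the middle.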
Finally, memory-mapped FUSE files are also supported in this FUSE
passthrough series: it was noticed that when the same file with FUSE
passthrough enabled is accessed both with standard read/write(-iter)
operations and memory-mapped read/write operations, the file content
might get corrupted due to an inconsistency between the FUSE and lower
file system caches.

The ioctl has been designed taking the fuse2 implementation as a
reference and trying to converge to it. For example, the
fuse_passthrough_out data structure has extra fields that will allow for
further extensions of the feature.


    Performance on RAM block device

What follows has been performed using a custom passthrough_hp FUSE
daemon that enables passthrough for each file that is opened during both
"open" and "create". Benchmarks were run on a workstation with an Intel
Xeon W-2135 and 64 GiB of RAM, with a RAM block device used as storage
target. More specifically, out of the system's 64 GiB of RAM, 40 GiB
were reserved for /dev/ram0, formatted as ext4. For the FUSE and FUSE
passthrough benchmarks, the FUSE file system was mounted on top of the
mounted /dev/ram0 device.
That file system was completely filled and then cleaned up before
running the benchmarks, to ensure that all the /dev/ram0 space was
reserved and not usable as page cache.

The rationale for using a RAM block device is that SSDs may experience
performance fluctuations, especially when accessing data at random
offsets.
Getting rid of the discrete storage device also removes a huge component
of slowness, highlighting the performance difference of the software
parts (and probably the quality of CPU caching and its coherence
mechanisms).

No special tuning has been performed, e.g., all the involved processes
are SCHED_OTHER, ondemand is the frequency governor with no frequency
restrictions, and turbo-boost, as well as p-state, are active. This is
because I noticed that, for such high-level benchmarks, the consistency
of the results was minimally affected by these features.

The source code of the updated libfuse library and passthrough_hp is
shared at the following repository:

  https://github.com/balsini/libfuse/tree/fuse-passthrough-v12-v5.11-rc5

Two different kinds of benchmarks were done for this change: the first
set of tests evaluates the bandwidth improvements when manipulating huge
single files, while the second set verifies that no performance
regressions were introduced when handling many small files.

All the caches were dropped before running every benchmark with:

  echo 3 > /proc/sys/vm/drop_caches

All the benchmarks were run 10 times, with a 1-minute cool down between
runs.

The first benchmarks were done by running FIO (fio-3.24) with:
- bs=4Ki;
- file size: 35Gi;
- ioengine: sync;
- fsync_on_close=1;
- randseed=0.
The target file was chosen large enough to prevent it from being
entirely loaded into the page cache.

Results are presented in the following table:

+-----------+------------+-------------+-------------+
|   MiB/s   |    fuse    | passthrough |   native    |
+-----------+------------+-------------+-------------+
| read      | 471(±1.3%) | 1791(±1.0%) | 1839(±1.8%) |
| write     | 95(±.6%)   | 1068(±.9%)  | 1322(±.8%)  |
| randread  | 25(±1.7%)  | 860(±.8%)   | 1135(±.5%)  |
| randwrite | 76(±3.0%)  | 813(±1.0%)  | 1005(±.7%)  |
+-----------+------------+-------------+-------------+

This table shows that FUSE, except for the sequential reads, is far
behind FUSE passthrough and native in terms of performance. The
extremely good FUSE performance for sequential reads is the result of a
great read-ahead mechanism. I was able to verify that setting
read_ahead_kb to 0 causes a terrible performance drop.
All the results are stable, as shown by the standard deviations.
Moreover, these numbers show the reasonable gap between passthrough and
native, introduced by the extra traversal through the VFS layer.
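
To put the table into ratios, a small helper along the following lines
can be used. It is only a convenience sketch: the numbers are copied
from the table above (averages only, the deviations are ignored), and
the struct and function names are made up for this example.

```c
#include <stdio.h>

/* One row of the bandwidth table, in MiB/s (averages only). */
struct bw_row {
	const char *name;
	double fuse, passthrough, native;
};

static const struct bw_row bw_rows[] = {
	{ "read",       471, 1791, 1839 },
	{ "write",       95, 1068, 1322 },
	{ "randread",    25,  860, 1135 },
	{ "randwrite",   76,  813, 1005 },
};

/* How many times faster passthrough is compared to plain FUSE. */
static double bw_speedup(const struct bw_row *r)
{
	return r->passthrough / r->fuse;
}

/* Fraction of the native bandwidth that passthrough reaches. */
static double bw_native_fraction(const struct bw_row *r)
{
	return r->passthrough / r->native;
}

/* Print one line per workload with both ratios. */
static void bw_print(void)
{
	size_t i;

	for (i = 0; i < sizeof(bw_rows) / sizeof(bw_rows[0]); i++)
		printf("%-9s %5.1fx over fuse, %4.1f%% of native\n",
		       bw_rows[i].name, bw_speedup(&bw_rows[i]),
		       100.0 * bw_native_fraction(&bw_rows[i]));
}
```

With these inputs, random reads show the largest gain over plain FUSE
(about 34x), while sequential reads are the case where passthrough gets
closest to native (about 97%).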

Since this patch has the primary objective of improving bandwidth,
another set of tests has been performed to see how it behaves in a
totally different scenario that involves accessing many small files. For
this purpose, the build time of the Linux kernel has been chosen as an
appropriate, well-known workload. The kernel has been built with as many
processes as the number of logical CPUs (-j $(nproc)), which, besides
being a reasonable parallelization value, is also enough to saturate the
processor thanks to the additional FUSE daemon's threads, making it even
harder to get close to the native file system performance.
The following table shows the total build times in the different
configurations:

+------------------+--------------+-----------+
|                  | AVG duration |  Standard |
|                  |     (sec)    | deviation |
+------------------+--------------+-----------+
| FUSE             |      144.566 |     0.697 |
+------------------+--------------+-----------+
| FUSE passthrough |      133.820 |     0.341 |
+------------------+--------------+-----------+
| Native           |      109.423 |     0.724 |
+------------------+--------------+-----------+
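
The relative overheads implied by these averages can be worked out with
a short helper such as the one below. The durations are copied from the
table above; the function and variable names are only illustrative.

```c
#include <stdio.h>

/* Average build durations in seconds, copied from the table above. */
static const double fuse_secs = 144.566;
static const double passthrough_secs = 133.820;
static const double native_secs = 109.423;

/* Extra time relative to a baseline, as a percentage of the baseline. */
static double overhead_pct(double secs, double baseline_secs)
{
	return (secs - baseline_secs) / baseline_secs * 100.0;
}

/* Print the overheads of both FUSE variants and their difference. */
static void print_build_overheads(void)
{
	printf("FUSE vs native:             %+.1f%%\n",
	       overhead_pct(fuse_secs, native_secs));
	printf("FUSE passthrough vs native: %+.1f%%\n",
	       overhead_pct(passthrough_secs, native_secs));
	printf("FUSE passthrough vs FUSE:   %+.1f%%\n",
	       overhead_pct(passthrough_secs, fuse_secs));
}
```

On these numbers the build is roughly 32% slower than native on plain
FUSE and roughly 22% slower with passthrough, i.e. passthrough shortens
the FUSE build by about 7%.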

Further testing and performance evaluations are welcome.


    Description of the series

Patch 1 generalizes the function which converts iocb flags to rw flags
from overlayfs, so that it can be used in this patch set.

Patch 2 enables the 32-bit compatibility for the /dev/fuse ioctl.

Patch 3 introduces the data structures, function signatures and ioctl
required both for the communication with userspace and for the internal
kernel use.

Patch 4 introduces initialization and release functions for FUSE
passthrough.

Patch 5 enables the synchronous read and write operations for those FUSE
files for which the passthrough functionality is enabled.

Patch 6 extends the read and write operations to also support
asynchronous IO.

Patch 7 allows FUSE passthrough to target files that the requesting
process could not access directly, by temporarily switching to the
credentials of the FUSE daemon that issued the FUSE passthrough ioctl.

Patch 8 extends FUSE passthrough operations to memory-mapped FUSE files.


    Changelog

Changes in v12:
* Revert FILESYSTEM_MAX_STACK_DEPTH checks as they were in v10
  [Requested by Amir Goldstein]
* Introduce passthrough support for memory-mapped FUSE files
  [Requested by yanwu]

Changes in v11:
* Fix the FILESYSTEM_MAX_STACK_DEPTH check to allow other file systems
  to be stacked
* Moved file system stacking depth check at ioctl time
* Update cover letter with correct libfuse repository to test the change
  [Requested by Peng Tao]
* Fix the file reference counter leak introduced in v10
  [Requested by yanwu]

Changes in v10:
* UAPI updated: ioctl now returns an ID that will be used at open/create
  response time to reference the passthrough file
* Synchronous read/write_iter functions do not return silly errors
  (fixed in aio patch)
* FUSE daemon credentials updated at ioctl time instead of mount time
* Updated benchmark results
  [Requested by Miklos Szeredi]

Changes in v9:
* Switched to using VFS instead of direct lower FS file ops
  [Attempt to address a request from Jens Axboe, Jann Horn,
  Amir Goldstein]
* Removal of useless included aio.h header
  [Proposed by Jens Axboe]

Changes in v8:
* aio requests now use kmalloc/kfree, instead of kmem_cache
* Switched to call_{read,write}_iter in AIO
* Revisited attributes copy
* Passthrough can only be enabled via ioctl, fixing the security issue
  spotted by Jann
* Use an extensible fuse_passthrough_out data structure
  [Attempt to address a request from Nikolaus Rath, Amir Goldstein and
  Miklos Szeredi]

Changes in v7:
* Full handling of aio requests as done in overlayfs (update commit
  message).
* s/fget_raw/fget.
* Open fails in case of passthrough errors, emitting warning messages.
  [Proposed by Jann Horn]
* Create new local kiocb, getting rid of the previously proposed ki_filp
  swapping.
  [Proposed by Jann Horn and Jens Axboe]
* Code polishing.

Changes in v6:
* Port to kernel v5.8:
  * fuse_file_{read,write}_iter changed since the v5 of this patch was
    proposed.
* Simplify fuse_simple_request.
* Merge fuse_passthrough.h into fuse_i.h
* Refactor of passthrough.c:
  * Remove BUG_ONs.
  * Simplified error checking and request arguments indexing.
  * Use call_{read,write}_iter utility functions.
  * Remove get_file and fputs during read/write: handle the extra FUSE
    references to the lower file object when the fuse_file is
    created/deleted.
  [Proposed by Jann Horn]

Changes in v5:
* Fix the check when setting the passthrough file.
  [Found when testing by Mike Shal]

Changes in v3 and v4:
* Use the fs_stack_depth to prevent further stacking and a minor fix.
  [Proposed by Jann Horn]

Changes in v2:
* Changed the feature name to passthrough from stacked_io.
  [Proposed by Linus Torvalds]


Alessio Balsini (8):
  fs: Generic function to convert iocb to rw flags
  fuse: 32-bit user space ioctl compat for fuse device
  fuse: Definitions and ioctl for passthrough
  fuse: Passthrough initialization and release
  fuse: Introduce synchronous read and write for passthrough
  fuse: Handle asynchronous read and write in passthrough
  fuse: Use daemon creds in passthrough mode
  fuse: Introduce passthrough for mmap

 fs/fuse/Makefile          |   1 +
 fs/fuse/dev.c             |  41 ++++--
 fs/fuse/dir.c             |   2 +
 fs/fuse/file.c            |  15 +-
 fs/fuse/fuse_i.h          |  33 +++++
 fs/fuse/inode.c           |  22 ++-
 fs/fuse/passthrough.c     | 280 ++++++++++++++++++++++++++++++++++++++
 fs/overlayfs/file.c       |  23 +---
 include/linux/fs.h        |   5 +
 include/uapi/linux/fuse.h |  14 +-
 10 files changed, 401 insertions(+), 35 deletions(-)
 create mode 100644 fs/fuse/passthrough.c

Comments

Alessio Balsini Jan. 25, 2021, 4:46 p.m. UTC | #1
On Mon, Jan 25, 2021 at 03:30:50PM +0000, Alessio Balsini wrote:
> OverlayFS implements its own function to translate iocb flags into rw
> flags, so that they can be passed into another vfs call.
> With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
> Jens created a 1:1 matching between the iocb flags and rw flags,
> simplifying the conversion.
> 
> Reduce the OverlayFS code by making the flag conversion function generic
> and reusable.
> 
> Signed-off-by: Alessio Balsini <balsini@android.com>
> ---
>  fs/overlayfs/file.c | 23 +++++------------------
>  include/linux/fs.h  |  5 +++++
>  2 files changed, 10 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index bd9dd38347ae..56be2ffc5a14 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -15,6 +15,8 @@
>  #include <linux/fs.h>
>  #include "overlayfs.h"
>  
> +#define OVL_IOCB_MASK (IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
> +
>  struct ovl_aio_req {
>  	struct kiocb iocb;
>  	struct kiocb *orig_iocb;
> @@ -236,22 +238,6 @@ static void ovl_file_accessed(struct file *file)
>  	touch_atime(&file->f_path);
>  }
>  
> -static rwf_t ovl_iocb_to_rwf(int ifl)
> -{
> -	rwf_t flags = 0;
> -
> -	if (ifl & IOCB_NOWAIT)
> -		flags |= RWF_NOWAIT;
> -	if (ifl & IOCB_HIPRI)
> -		flags |= RWF_HIPRI;
> -	if (ifl & IOCB_DSYNC)
> -		flags |= RWF_DSYNC;
> -	if (ifl & IOCB_SYNC)
> -		flags |= RWF_SYNC;
> -
> -	return flags;
> -}
> -
>  static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
>  {
>  	struct kiocb *iocb = &aio_req->iocb;
> @@ -299,7 +285,8 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	old_cred = ovl_override_creds(file_inode(file)->i_sb);
>  	if (is_sync_kiocb(iocb)) {
>  		ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
> -				    ovl_iocb_to_rwf(iocb->ki_flags));
> +				    iocb_to_rw_flags(iocb->ki_flags,
> +						     OVL_IOCB_MASK));
>  	} else {
>  		struct ovl_aio_req *aio_req;
>  
> @@ -356,7 +343,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	if (is_sync_kiocb(iocb)) {
>  		file_start_write(real.file);
>  		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> -				     ovl_iocb_to_rwf(ifl));
> +				     iocb_to_rw_flags(ifl, OVL_IOCB_MASK));
>  		file_end_write(real.file);
>  		/* Update size */
>  		ovl_copyattr(ovl_inode_real(inode), inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd47deea7c17..647c35423545 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3275,6 +3275,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>  	return 0;
>  }
>  
> +static inline rwf_t iocb_to_rw_flags(int ifl, int iocb_mask)
> +{
> +	return ifl & iocb_mask;
> +}
> +
>  static inline ino_t parent_ino(struct dentry *dentry)
>  {
>  	ino_t res;
> -- 
> 2.30.0.280.ga3ce27912f-goog
> 

For some reason lkml.org and lore.kernel.org are not showing this change
as part of the thread.
Let's see if replying to the email fixes the indexing.

Regards,
Alessio
Wu Yan March 24, 2021, 7:43 a.m. UTC | #2
Hi, Alessio

This change implies that the IOCB_* and RWF_* flags are properly
aligned, which is not true for kernel versions 5.4/4.19/4.14, as the
patch ("fs: align IOCB_* flags with RWF_* flags") is not back-ported to
these stable kernel branches. The issue was found when applying these
patches to kernel 5.4 (files opened with passthrough enabled can't do
append writes). I think the issue exists in the AOSP common kernel too.
Could you please fix this?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce71bfea207b4d7c21d36f24ec37618ffcea1da8

https://android-review.googlesource.com/c/kernel/common/+/1556243
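
The dependency on the flag alignment can be illustrated with a small
userspace sketch. The DEMO_RWF_* values mirror the RWF_* definitions in
include/uapi/linux/fs.h; the "old" IOCB_* bit positions are an
assumption recalled from a v5.4 tree and should be double-checked
against the actual sources. All names below are made up for the example.

```c
/* RWF_* values as in include/uapi/linux/fs.h, prefixed to avoid clashes. */
enum {
	DEMO_RWF_HIPRI  = 0x01,
	DEMO_RWF_DSYNC  = 0x02,
	DEMO_RWF_SYNC   = 0x04,
	DEMO_RWF_NOWAIT = 0x08,
	DEMO_RWF_APPEND = 0x10,
};

/* Bit values of the four IOCB_* flags handled by the overlayfs helper. */
struct iocb_layout {
	int hipri, dsync, sync, nowait;
};

/* Layout after "fs: align IOCB_* flags with RWF_* flags". */
static const struct iocb_layout aligned_layout = {
	DEMO_RWF_HIPRI, DEMO_RWF_DSYNC, DEMO_RWF_SYNC, DEMO_RWF_NOWAIT,
};

/* Assumed pre-alignment layout (recalled from v5.4, not verified here). */
static const struct iocb_layout old_layout = {
	1 << 3, 1 << 4, 1 << 5, 1 << 7,
};

/* Mask covering the four flags of a given layout. */
static int layout_mask(const struct iocb_layout *l)
{
	return l->hipri | l->dsync | l->sync | l->nowait;
}

/* Flag-by-flag translation, like the removed ovl_iocb_to_rwf(). */
static int explicit_to_rwf(const struct iocb_layout *l, int ifl)
{
	int flags = 0;

	if (ifl & l->nowait)
		flags |= DEMO_RWF_NOWAIT;
	if (ifl & l->hipri)
		flags |= DEMO_RWF_HIPRI;
	if (ifl & l->dsync)
		flags |= DEMO_RWF_DSYNC;
	if (ifl & l->sync)
		flags |= DEMO_RWF_SYNC;
	return flags;
}

/* Mask-only translation, like the new iocb_to_rw_flags(). */
static int masked_to_rwf(int ifl, int iocb_mask)
{
	return ifl & iocb_mask;
}
```

With the aligned layout both helpers agree for every combination of the
four flags. With the assumed old layout they do not: the old DSYNC bit,
for instance, passes through the mask unchanged and ends up carrying the
value of RWF_APPEND instead of RWF_DSYNC.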

Thanks
yanwu
Alessio Balsini March 24, 2021, 2:02 p.m. UTC | #3
On Wed, Mar 24, 2021 at 03:43:12PM +0800, Rokudo Yan wrote:
> Hi, Alessio
> 
> This change implies that the IOCB_* and RWF_* flags are properly
> aligned, which is not true for kernel versions 5.4/4.19/4.14, as the
> patch ("fs: align IOCB_* flags with RWF_* flags") is not back-ported to
> these stable kernel branches. The issue was found when applying these
> patches to kernel 5.4 (files opened with passthrough enabled can't do
> append writes). I think the issue exists in the AOSP common kernel too.
> Could you please fix this?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce71bfea207b4d7c21d36f24ec37618ffcea1da8
> 
> https://android-review.googlesource.com/c/kernel/common/+/1556243
> 
> Thanks
> yanwu

Hi yanwu,

Correct, this change depends on commit ce71bfea207b ("fs: align IOCB_*
flags with RWF_* flags"), and this dependency is satisfied upstream.
Since FUSE passthrough is a new feature and not a bugfix, I'm not
planning to do any backporting to LTS kernels (and GregKH probably
wouldn't accept it).

Android is a different story (and slightly off topic here).
We are looking forward to having FUSE passthrough enabled on Android, as
most of the user data is handled by FUSE. We liked the performance
improvements and non-intrusiveness of the change both for the kernel and
for userspace, so we started supporting this in android12-5.4+ kernel
branches. We are not planning to maintain the feature on older kernels
though (we can't add features to already released kernels), and this is
why FUSE passthrough is not merged there.
To answer your question, in AOSP the officially supported kernels
already have the flags alignment change merged, and an unsupported
backport to older kernels (i.e., 4.14 and 4.19) is already available:

https://android-review.googlesource.com/q/%2522BACKPORT:+fs:+align+IOCB_*+flags+with+RWF_*+flags%2522+-status:abandoned

Thanks,
Alessio
Amir Goldstein Nov. 18, 2021, 6:31 p.m. UTC | #4
On Mon, Jan 25, 2021 at 5:31 PM Alessio Balsini <balsini@android.com> wrote:
>
> This is the 12th version of the series, rebased on top of v5.11-rc5.
> Please find the changelog at the bottom of this cover letter.
>
> Add support for file system passthrough read/write of files when enabled
> in userspace through the option FUSE_PASSTHROUGH.
>
> There are file systems based on FUSE that are intended to enforce
> special policies or trigger complex decisions at the file operations
> level. Android, for example, uses FUSE to enforce fine-grained access
> policies that also depend on the file contents.
> Sometimes, at open or create time, a file is identified as not requiring
> additional checks for subsequent reads/writes, so FUSE would simply act
> as a passive bridge between the process accessing the FUSE file system
> and the lower file system. Splicing and caching help reduce the FUSE
> overhead, but there are still read/write operations forwarded to the
> userspace FUSE daemon that could be avoided.
>
> This series was inspired by the original patches from Nikhilesh Reddy,
> whose idea and code have been elaborated and improved thanks to
> community support.
>
> When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may
> decide, while handling the open/create operations, whether the given
> file can be accessed in passthrough mode. In that case, all subsequent
> read and write operations are forwarded by the kernel directly to the
> lower file system through the VFS layer rather than to the FUSE daemon.
> All requests other than reads and writes are still handled by the
> userspace FUSE daemon.
> This improves read and write performance, especially for reads at random
> offsets, for which no (readahead) caching mechanism would help.
> Benchmarks show performance close to native file system access when
> doing massive manipulations on a single opened file, especially for
> random reads, random writes and sequential writes. Detailed benchmarking
> results are presented below.
>
> The creation of this direct connection (passthrough) between FUSE file
> objects and file objects in the lower file system is reminiscent of
> passing file descriptors via sockets:
> - a process requests the opening of a file handled by FUSE, so the
>   kernel forwards the request to the FUSE daemon;
> - the FUSE daemon opens the target file in the lower file system,
>   getting its file descriptor;
> - the FUSE daemon also decides, according to its internal policies,
>   whether passthrough can be enabled for that file and, if so, performs
>   a FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl on /dev/fuse, passing the file
>   descriptor obtained at the previous step and the fuse_req unique
>   identifier;
> - the kernel translates the file descriptor to the file pointer by
>   navigating through the opened files of the "current" process and
>   temporarily stores it in the associated open/create fuse_req's
>   passthrough_filp;
> - when the FUSE daemon is done with the request and it's time for the
>   kernel to close it, the kernel checks if the passthrough_filp is
>   available and, if so, updates the additional field in the fuse_file
>   owned by the process accessing the FUSE file system.
> From now on, all the read/write operations performed by that process
> will be redirected to the corresponding lower file system file by
> creating new VFS requests.
> Since the read/write operation to the lower file system is executed with
> the current process's credentials, it might not have enough privileges
> to succeed. For this reason, the process temporarily receives the same
> credentials as the FUSE daemon, which are reverted as soon as the
> read/write operation completes, as if the request had been performed by
> the FUSE daemon itself. This solution was inspired by the way overlayfs
> handles read/write operations.
> Asynchronous IO is supported as well: separate AIO requests are created
> for the lower file system and tracked internally by FUSE, which
> intercepts and propagates their completion through an internal
> ki_complete callback, similar to the current implementation of
> overlayfs.
> Finally, memory-mapped FUSE files are also supported in this FUSE
> passthrough series: it was noticed that when the same file with FUSE
> passthrough enabled is accessed both with standard read/write(-iter)
> operations and memory-mapped read/write operations, the file content
> might get corrupted due to an inconsistency between the FUSE and lower
> file system caches.
>
> The ioctl has been designed taking the fuse2 implementation as a
> reference and trying to converge to it. For example, the
> fuse_passthrough_out data structure has extra fields that will allow for
> further extensions of the feature.
>
>
>     Performance on RAM block device
>
> What follows has been performed using a custom passthrough_hp FUSE
> daemon that enables pass-through for each file that is opened during
> both "open" and "create". Benchmarks were run on an Intel Xeon W-2135,
> 64 GiB of RAM workstation, with a RAM block device used as storage
> target. More specifically, out of the system's 64 GiB of RAM, 40 GiB
> were reserved for /dev/ram0, formatted as ext4. For the FUSE and FUSE
> passthrough benchmarks, the FUSE file system was mounted on top of the
> mounted /dev/ram0 device.
> That file system has been completely filled and then cleaned up before
> running the benchmarks, to ensure that all the /dev/ram0 space was
> reserved and not usable as page cache.
>
> The rationale for using a RAM block device is that SSDs may experience
> performance fluctuations, especially when accessing data at random
> offsets.
> Getting rid of the discrete storage device also removes a huge component
> of slowness, highlighting the performance difference of the software
> parts (and probably the goodness of CPU caching and its coherence
> mechanisms).
>
> No special tuning has been performed, e.g., all the involved processes
> are SCHED_OTHER, ondemand is the frequency governor with no frequency
> restrictions, and turbo-boost, as well as p-state, are active. This is
> because I noticed that, for such high-level benchmarks, results
> consistency was minimally affected by these features.
>
> The source code of the updated libfuse library and passthrough_hp is
> shared at the following repository:
>
>   https://github.com/balsini/libfuse/tree/fuse-passthrough-v12-v5.11-rc5
>
> Two different kinds of benchmarks were done for this change: the first
> set of tests evaluates the bandwidth improvements when manipulating huge
> single files, the second set of tests verifies that no performance
> regressions were introduced when handling many small files.
>
> All the caches were dropped before running every benchmark with:
>
>   echo 3 > /proc/sys/vm/drop_caches
>
> All the benchmarks were run 10 times, with a 1 minute cool down between
> runs.
>
> The first benchmarks were done by running FIO (fio-3.24) with:
> - bs=4Ki;
> - file size: 35Gi;
> - ioengine: sync;
> - fsync_on_close=1;
> - randseed=0.
> The target file has been chosen large enough that it cannot be entirely
> loaded into the page cache.
>
> Results are presented in the following table:
>
> +-----------+------------+-------------+-------------+
> |   MiB/s   |    fuse    | passthrough |   native    |
> +-----------+------------+-------------+-------------+
> | read      | 471(±1.3%) | 1791(±1.0%) | 1839(±1.8%) |
> | write     | 95(±.6%)   | 1068(±.9%)  | 1322(±.8%)  |
> | randread  | 25(±1.7%)  | 860(±.8%)   | 1135(±.5%)  |
> | randwrite | 76(±3.0%)  | 813(±1.0%)  | 1005(±.7%)  |
> +-----------+------------+-------------+-------------+
>
> This table shows that FUSE, except for the sequential reads, is far
> behind FUSE passthrough and native in terms of performance. The
> extremely good FUSE performance for sequential reads is the result of a
> great read-ahead mechanism. I was able to verify that setting
> read_ahead_kb to 0 causes a terrible performance drop.
> All the results are stable, as shown by the standard deviations.
> Moreover, these numbers show the reasonable gap between passthrough and
> native, introduced by the extra traversal through the VFS layer.
>
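The ratios implied by the table above can be worked out directly. This is a minimal Python sketch using only the mean MiB/s values from the table; the dictionary layout and the ratios() helper are illustrative names, not part of the benchmark tooling.

```python
# Throughput in MiB/s, copied from the table above (mean values only).
results = {
    "read":      {"fuse": 471, "passthrough": 1791, "native": 1839},
    "write":     {"fuse": 95,  "passthrough": 1068, "native": 1322},
    "randread":  {"fuse": 25,  "passthrough": 860,  "native": 1135},
    "randwrite": {"fuse": 76,  "passthrough": 813,  "native": 1005},
}


def ratios(row):
    """Return (speedup over plain FUSE, fraction of native throughput)."""
    speedup = row["passthrough"] / row["fuse"]
    of_native = row["passthrough"] / row["native"]
    return speedup, of_native


summary = {name: ratios(row) for name, row in results.items()}

for name, (speedup, of_native) in summary.items():
    print(f"{name:<10} {speedup:5.1f}x over fuse, {of_native:6.1%} of native")
```

Random reads gain the most (about 34x over plain FUSE), while sequential reads gain the least (under 4x) because read-ahead already helps plain FUSE there.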
> Since this patch has the primary objective of improving bandwidth,
> another set of tests has been performed to see how this behaves in a
> totally different scenario that involves accessing many small files. For
> this purpose, measuring the build time of the Linux kernel has been
> chosen as an appropriate, well-known workload. The kernel has been
> built with as many processes as the number of logical CPUs (-j
> $(nproc)), which, besides being a reasonable parallelization value, is
> also enough to saturate the processor's utilization thanks to the
> additional FUSE daemon's threads, making it even harder to get closer to
> the native file system performance.
> The following table shows the total build times in the different
> configurations:
>
> +------------------+--------------+-----------+
> |                  | AVG duration |  Standard |
> |                  |     (sec)    | deviation |
> +------------------+--------------+-----------+
> | FUSE             |      144.566 |     0.697 |
> +------------------+--------------+-----------+
> | FUSE passthrough |      133.820 |     0.341 |
> +------------------+--------------+-----------+
> | Native           |      109.423 |     0.724 |
> +------------------+--------------+-----------+
>
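To put the build times above in relative terms, here is a small Python sketch that derives the overhead of each configuration against native. It uses only the average durations from the table; the variable and function names are illustrative.

```python
# Average kernel build duration in seconds, copied from the table above.
build_seconds = {
    "FUSE": 144.566,
    "FUSE passthrough": 133.820,
    "Native": 109.423,
}


def overhead_vs_native(seconds, native):
    """Relative slowdown compared to the native file system (0.0 = equal)."""
    return (seconds - native) / native


native = build_seconds["Native"]
overhead = {
    name: overhead_vs_native(seconds, native)
    for name, seconds in build_seconds.items()
}

# Absolute time recovered by enabling passthrough on top of plain FUSE.
saved = build_seconds["FUSE"] - build_seconds["FUSE passthrough"]

for name, value in overhead.items():
    print(f"{name:<17} {value:6.1%} slower than native")
print(f"passthrough saves {saved:.3f} s per build over plain FUSE")
```

On these numbers, passthrough cuts the overhead relative to native from roughly 32% to roughly 22%, so no regression is visible in the many-small-files case.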
> Further testing and performance evaluations are welcome.
>
>
>     Description of the series
>
> Patch 1 generalizes the function which converts iocb flags to rw flags
> from overlayfs, so that it can be used in this patch set.
>
> Patch 2 enables the 32-bit compatibility for the /dev/fuse ioctl.
>
> Patch 3 introduces the data structures, function signatures and ioctl
> required both for the communication with userspace and for the internal
> kernel use.
>
> Patch 4 introduces initialization and release functions for FUSE
> passthrough.
>
> Patch 5 enables the synchronous read and write operations for those FUSE
> files for which the passthrough functionality is enabled.
>
> Patch 6 extends the read and write operations to also support
> asynchronous IO.
>
> Patch 7 allows FUSE passthrough to target files to which the requesting
> process would not have direct access, by temporarily performing a
> credentials switch to the credentials of the FUSE daemon that issued the
> FUSE passthrough ioctl.
>
> Patch 8 extends FUSE passthrough operations to memory-mapped FUSE files.
>
>
>     Changelog
>
> Changes in v12:
> * Revert FILESYSTEM_MAX_STACK_DEPTH checks as they were in v10
>   [Requested by Amir Goldstein]
> * Introduce passthrough support for memory-mapped FUSE files
>   [Requested by yanwu]
>
> Changes in v11:
> * Fix the FILESYSTEM_MAX_STACK_DEPTH check to allow other file systems
>   to be stacked
> * Moved file system stacking depth check at ioctl time
> * Update cover letter with correct libfuse repository to test the change
>   [Requested by Peng Tao]
> * Fix the file reference counter leak introduced in v10
>   [Requested by yanwu]
>
> Changes in v10:
> * UAPI updated: ioctl now returns an ID that will be used at open/create
>   response time to reference the passthrough file
> * Synchronous read/write_iter functions do not return silly errors
>   (fixed in aio patch)
> * FUSE daemon credentials updated at ioctl time instead of mount time
> * Updated benchmark results
>   [Requested by Miklos Szeredi]
>
> Changes in v9:
> * Switched to using VFS instead of direct lower FS file ops
>   [Attempt to address a request from Jens Axboe, Jann Horn,
>   Amir Goldstein]
> * Removal of useless included aio.h header
>   [Proposed by Jens Axboe]
>
> Changes in v8:
> * aio requests now use kmalloc/kfree, instead of kmem_cache
> * Switched to call_{read,write}_iter in AIO
> * Revisited attributes copy
> * Passthrough can only be enabled via ioctl, fixing the security issue
>   spotted by Jann
> * Use an extensible fuse_passthrough_out data structure
>   [Attempt to address a request from Nikolaus Rath, Amir Goldstein and
>   Miklos Szeredi]
>
> Changes in v7:
> * Full handling of aio requests as done in overlayfs (update commit
>   message).
> * s/fget_raw/fget.
> * Open fails in case of passthrough errors, emitting warning messages.
>   [Proposed by Jann Horn]
> * Create new local kiocb, getting rid of the previously proposed ki_filp
>   swapping.
>   [Proposed by Jann Horn and Jens Axboe]
> * Code polishing.
>
> Changes in v6:
> * Port to kernel v5.8:
>   * fuse_file_{read,write}_iter changed since the v5 of this patch was
>     proposed.
> * Simplify fuse_simple_request.
> * Merge fuse_passthrough.h into fuse_i.h
> * Refactor of passthrough.c:
>   * Remove BUG_ONs.
>   * Simplified error checking and request arguments indexing.
>   * Use call_{read,write}_iter utility functions.
>   * Remove get_file and fputs during read/write: handle the extra FUSE
>     references to the lower file object when the fuse_file is
>     created/deleted.
>   [Proposed by Jann Horn]
>
> Changes in v5:
> * Fix the check when setting the passthrough file.
>   [Found when testing by Mike Shal]
>
> Changes in v3 and v4:
> * Use the fs_stack_depth to prevent further stacking and a minor fix.
>   [Proposed by Jann Horn]
>
> Changes in v2:
> * Changed the feature name to passthrough from stacked_io.
>   [Proposed by Linus Torvalds]
>
>
> Alessio Balsini (8):
>   fs: Generic function to convert iocb to rw flags
>   fuse: 32-bit user space ioctl compat for fuse device
>   fuse: Definitions and ioctl for passthrough
>   fuse: Passthrough initialization and release
>   fuse: Introduce synchronous read and write for passthrough
>   fuse: Handle asynchronous read and write in passthrough
>   fuse: Use daemon creds in passthrough mode
>   fuse: Introduce passthrough for mmap
>
>  fs/fuse/Makefile          |   1 +
>  fs/fuse/dev.c             |  41 ++++--
>  fs/fuse/dir.c             |   2 +
>  fs/fuse/file.c            |  15 +-
>  fs/fuse/fuse_i.h          |  33 +++++
>  fs/fuse/inode.c           |  22 ++-
>  fs/fuse/passthrough.c     | 280 ++++++++++++++++++++++++++++++++++++++
>  fs/overlayfs/file.c       |  23 +---
>  include/linux/fs.h        |   5 +
>  include/uapi/linux/fuse.h |  14 +-
>  10 files changed, 401 insertions(+), 35 deletions(-)
>  create mode 100644 fs/fuse/passthrough.c
>
> --
> 2.30.0.280.ga3ce27912f-goog
>

Hi Alessio,

I have been testing this patch set for a while and recently
nfstest_posix found one issue:
mtime/ctime are not invalidated on passthrough write.

I have tested a fix on this 5.10.y backport branch:
https://github.com/amir73il/linux/commits/linux-5.10.y-fuse-passthrough

Please feel free to review and/or take the fix for your next posting
if you have plans of posting another version...

Thanks,
Amir.