fuse: require /dev/fuse reads to have enough buffer capacity (take 2)

Message ID 20190612112544.GA21465@deco.navytux.spb.ru (mailing list archive)
State New, archived

Commit Message

Kirill Smelkov June 12, 2019, 11:25 a.m. UTC
On Wed, Jun 12, 2019 at 09:44:49AM +0200, Miklos Szeredi wrote:
> On Tue, Jun 11, 2019 at 10:28 PM Kirill Smelkov <kirr@nexedi.com> wrote:
> 
> > Miklos, would 4K -> `sizeof(fuse_in_header) + sizeof(fuse_write_in)` for
> > header room change be accepted?
> 
> Yes, next cycle.   For 4.2 I'll just push the revert.

Thanks Miklos. Please consider queuing the following patch for 5.3.
Sander, could you please confirm that glusterfs is not broken with this
version of the check?

Thanks beforehand,
Kirill

---- 8< ----
From 24a04e8be9bbf6e67de9e1908dcbe95d426d2521 Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Wed, 27 Mar 2019 10:15:15 +0000
Subject: [PATCH] fuse: require /dev/fuse reads to have enough buffer capacity (take 2)

[ This retries commit d4b13963f217 which was reverted in 766741fcaa1f.

  In this version we require only `sizeof(fuse_in_header) + sizeof(fuse_write_in)`
  instead of 4K for FUSE request header room, because, contrary to
  libfuse and kernel client behaviour, GlusterFS actually provides only
  so much room for request header. ]

A FUSE filesystem server queues /dev/fuse sys_read calls to get
filesystem requests to handle. It does not know in advance what that
request will be, as it can be anything the client issues - LOOKUP, READ,
WRITE, ... Many requests are short and retrieve data from the
filesystem. However, WRITE and NOTIFY_REPLY write data into the filesystem.

Before getting into the operation phase, the FUSE filesystem server and
the kernel client negotiate the maximum write size the client will ever
issue. After negotiation, the contract between server and client is
that the filesystem server should queue /dev/fuse sys_read calls with
enough buffer capacity to receive any client request - WRITE in
particular - while the FUSE client should not send WRITE requests with
a payload larger than the negotiated max_write. The FUSE client in the
kernel and libfuse historically reserve 4K for the request header.
However, an existing filesystem server - GlusterFS - was found which
reserves only 80 bytes for header room
(= `sizeof(fuse_in_header) + sizeof(fuse_write_in)`).

https://lore.kernel.org/linux-fsdevel/20190611202738.GA22556@deco.navytux.spb.ru/
https://github.com/gluster/glusterfs/blob/v3.8.15-0-gd174f021a/xlators/mount/fuse/src/fuse-bridge.c#L4894

Since

	`sizeof(fuse_in_header) + sizeof(fuse_write_in)` ==
	`sizeof(fuse_in_header) + sizeof(fuse_read_in)`  ==
	`sizeof(fuse_in_header) + sizeof(fuse_notify_retrieve_in)`

is the absolute minimum any sane filesystem should be using for header
room, the contract is that filesystem server should queue sys_reads with
`sizeof(fuse_in_header) + sizeof(fuse_write_in)` + max_write buffer.
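
As an illustration of the contract above, a server could size its
/dev/fuse read buffer as in the following minimal sketch. The struct
definitions are local mirrors of the fixed-size layouts in <linux/fuse.h>
so that the example is self-contained (real code would include that
header instead), and fuse_min_read_bufsize() is a hypothetical helper
name, not an existing libfuse or kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Local mirrors of fixed-size request structs from <linux/fuse.h>
 * (assumed here only to keep the sketch self-contained).
 */
struct fuse_in_header {
	uint32_t len;
	uint32_t opcode;
	uint64_t unique;
	uint64_t nodeid;
	uint32_t uid;
	uint32_t gid;
	uint32_t pid;
	uint32_t padding;
};

struct fuse_write_in {
	uint64_t fh;
	uint64_t offset;
	uint32_t size;
	uint32_t write_flags;
	uint64_t lock_owner;
	uint32_t flags;
	uint32_t padding;
};

struct fuse_read_in {
	uint64_t fh;
	uint64_t offset;
	uint32_t size;
	uint32_t read_flags;
	uint64_t lock_owner;
	uint32_t flags;
	uint32_t padding;
};

struct fuse_notify_retrieve_in {
	uint64_t dummy1;
	uint64_t offset;
	uint32_t size;
	uint32_t dummy2;
	uint64_t dummy3;
	uint64_t dummy4;
};

/* Same value as the FUSE_MIN_READ_BUFFER constant in <linux/fuse.h>. */
#define FUSE_MIN_READ_BUFFER 8192

/* The three header-room expressions from the text are all 80 bytes. */
_Static_assert(sizeof(struct fuse_in_header) +
	       sizeof(struct fuse_write_in) == 80, "WRITE header room");
_Static_assert(sizeof(struct fuse_in_header) +
	       sizeof(struct fuse_read_in) == 80, "READ header room");
_Static_assert(sizeof(struct fuse_in_header) +
	       sizeof(struct fuse_notify_retrieve_in) == 80,
	       "NOTIFY_RETRIEVE header room");

/* Hypothetical helper: smallest read buffer that satisfies the contract. */
size_t fuse_min_read_bufsize(uint32_t max_write)
{
	size_t need = sizeof(struct fuse_in_header) +
		      sizeof(struct fuse_write_in) + max_write;

	return need > FUSE_MIN_READ_BUFFER ? need : FUSE_MIN_READ_BUFFER;
}
```

With max_write = 131072 this gives 131152 bytes; with a small max_write
such as 4096 the FUSE_MIN_READ_BUFFER floor of 8192 applies.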

If the filesystem server does not follow this contract, what can happen
is that fuse_dev_do_read will see that the request size is > buffer size,
and then it will return EIO to the client that issued the request, but
will not indicate in any way to the filesystem server that there is a
problem. This can be hard to diagnose because for some requests, e.g. for
NOTIFY_REPLY which mimics WRITE, there is no client thread waiting for
request completion and that EIO goes nowhere, while on the filesystem
server side it looks as if the kernel is not replying after a
successful NOTIFY_RETRIEVE request made by the server.

We can make the problem easy to diagnose if we indicate via error return to
the filesystem server when it is violating the contract.  This should not
cause problems in practice: if a filesystem server is using a shorter
buffer, writes to it were already very likely to cause EIO, and a
read-only filesystem should likewise already be following the
FUSE_MIN_READ_BUFFER minimum buffer size.

Please see [1] for the context in which the problem of a stuck filesystem
was hit for real (because the kernel client was incorrectly sending more
than max_write data with NOTIFY_REPLY; see also the previous patch), how
the situation was traced, and for a more involved patch that did not make
it into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fuse/dev.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Sander Eikelenboom June 12, 2019, 12:11 p.m. UTC | #1
On 12/06/2019 13:25, Kirill Smelkov wrote:
> On Wed, Jun 12, 2019 at 09:44:49AM +0200, Miklos Szeredi wrote:
>> On Tue, Jun 11, 2019 at 10:28 PM Kirill Smelkov <kirr@nexedi.com> wrote:
>>
>>> Miklos, would 4K -> `sizeof(fuse_in_header) + sizeof(fuse_write_in)` for
>>> header room change be accepted?
>>
>> Yes, next cycle.   For 4.2 I'll just push the revert.
> 
> Thanks Miklos. Please consider queuing the following patch for 5.3.
> Sander, could you please confirm that glusterfs is not broken with this
> version of the check?
> 
> Thanks beforehand,
> Kirill

Sure will give it a spin this evening and report back.

--
Sander

Sander Eikelenboom June 12, 2019, 1:03 p.m. UTC | #2
On 12/06/2019 13:25, Kirill Smelkov wrote:
> On Wed, Jun 12, 2019 at 09:44:49AM +0200, Miklos Szeredi wrote:
>> On Tue, Jun 11, 2019 at 10:28 PM Kirill Smelkov <kirr@nexedi.com> wrote:
>>
>>> Miklos, would 4K -> `sizeof(fuse_in_header) + sizeof(fuse_write_in)` for
>>> header room change be accepted?
>>
>> Yes, next cycle.   For 4.2 I'll just push the revert.
> 
> Thanks Miklos. Please consider queuing the following patch for 5.3.
> Sander, could you please confirm that glusterfs is not broken with this
> version of the check?
> 
> Thanks beforehand,
> Kirill


Hmm unfortunately it doesn't build, see below.

--
Sander


In file included from ./include/linux/list.h:9:0,
                 from ./include/linux/wait.h:7,
                 from ./include/linux/wait_bit.h:8,
                 from ./include/linux/fs.h:6,
                 from fs/fuse/fuse_i.h:17,
                 from fs/fuse/dev.c:9:
fs/fuse/dev.c: In function ‘fuse_dev_do_read’:
fs/fuse/dev.c:1336:14: error: ‘fuse_in_header’ undeclared (first use in this function)
       sizeof(fuse_in_header) + sizeof(fuse_write_in) + fc->max_write))
              ^
./include/linux/kernel.h:818:40: note: in definition of macro ‘__typecheck’
   (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                                        ^
./include/linux/kernel.h:842:24: note: in expansion of macro ‘__safe_cmp’
  __builtin_choose_expr(__safe_cmp(x, y), \
                        ^~~~~~~~~~
./include/linux/kernel.h:918:27: note: in expansion of macro ‘__careful_cmp’
 #define max_t(type, x, y) __careful_cmp((type)(x), (type)(y), >)
                           ^~~~~~~~~~~~~
fs/fuse/dev.c:1335:15: note: in expansion of macro ‘max_t’
  if (nbytes < max_t(size_t, FUSE_MIN_READ_BUFFER,
               ^~~~~
fs/fuse/dev.c:1336:14: note: each undeclared identifier is reported only once for each function it appears in
       sizeof(fuse_in_header) + sizeof(fuse_write_in) + fc->max_write))
              ^
./include/linux/kernel.h:818:40: note: in definition of macro ‘__typecheck’
   (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                                        ^
./include/linux/kernel.h:842:24: note: in expansion of macro ‘__safe_cmp’
  __builtin_choose_expr(__safe_cmp(x, y), \
                        ^~~~~~~~~~
./include/linux/kernel.h:918:27: note: in expansion of macro ‘__careful_cmp’
 #define max_t(type, x, y) __careful_cmp((type)(x), (type)(y), >)
                           ^~~~~~~~~~~~~
fs/fuse/dev.c:1335:15: note: in expansion of macro ‘max_t’
  if (nbytes < max_t(size_t, FUSE_MIN_READ_BUFFER,
               ^~~~~
fs/fuse/dev.c:1336:39: error: ‘fuse_write_in’ undeclared (first use in this function)
       sizeof(fuse_in_header) + sizeof(fuse_write_in) + fc->max_write))
                                       ^
./include/linux/kernel.h:818:40: note: in definition of macro ‘__typecheck’
   (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                                        ^
./include/linux/kernel.h:842:24: note: in expansion of macro ‘__safe_cmp’
  __builtin_choose_expr(__safe_cmp(x, y), \
                        ^~~~~~~~~~
./include/linux/kernel.h:918:27: note: in expansion of macro ‘__careful_cmp’
 #define max_t(type, x, y) __careful_cmp((type)(x), (type)(y), >)
                           ^~~~~~~~~~~~~
fs/fuse/dev.c:1335:15: note: in expansion of macro ‘max_t’
  if (nbytes < max_t(size_t, FUSE_MIN_READ_BUFFER,
               ^~~~~
./include/linux/kernel.h:842:2: error: first argument to ‘__builtin_choose_expr’ not a constant
  __builtin_choose_expr(__safe_cmp(x, y), \
  ^
./include/linux/kernel.h:918:27: note: in expansion of macro ‘__careful_cmp’
 #define max_t(type, x, y) __careful_cmp((type)(x), (type)(y), >)
                           ^~~~~~~~~~~~~
fs/fuse/dev.c:1335:15: note: in expansion of macro ‘max_t’
  if (nbytes < max_t(size_t, FUSE_MIN_READ_BUFFER,
               ^~~~~
scripts/Makefile.build:278: recipe for target 'fs/fuse/dev.o' failed
make[3]: *** [fs/fuse/dev.o] Error 1
scripts/Makefile.build:489: recipe for target 'fs/fuse' failed
make[2]: *** [fs/fuse] Error 2

Kirill Smelkov June 12, 2019, 2:12 p.m. UTC | #3
On Wed, Jun 12, 2019 at 03:03:49PM +0200, Sander Eikelenboom wrote:
> On 12/06/2019 13:25, Kirill Smelkov wrote:
> > On Wed, Jun 12, 2019 at 09:44:49AM +0200, Miklos Szeredi wrote:
> >> On Tue, Jun 11, 2019 at 10:28 PM Kirill Smelkov <kirr@nexedi.com> wrote:
> >>
> >>> Miklos, would 4K -> `sizeof(fuse_in_header) + sizeof(fuse_write_in)` for
> >>> header room change be accepted?
> >>
> >> Yes, next cycle.   For 4.2 I'll just push the revert.
> > 
> > Thanks Miklos. Please consider queuing the following patch for 5.3.
> > Sander, could you please confirm that glusterfs is not broken with this
> > version of the check?
> > 
> > Thanks beforehand,
> > Kirill
> 
> 
> Hmm unfortunately it doesn't build, see below.
> [...]
> fs/fuse/dev.c:1336:14: error: ‘fuse_in_header’ undeclared (first use in this function)
>        sizeof(fuse_in_header) + sizeof(fuse_write_in) + fc->max_write))

Sorry, my bad, it was missing "struct" before fuse_in_header. I
originally compile-tested the patch with `make -j4`, got distracted by
another topic, and on returning did not see the error because of the long
tail of successful CC lines. I apologize for the inconvenience. Below is a
fixed patch that was both compile-tested and runtime-tested with my FUSE
workloads (non-glusterfs).

Kirill
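
The build failure follows from the C rule that struct tags live in their
own namespace: without a typedef, the type has to be spelled with the
struct keyword. A minimal sketch, using a hypothetical struct req_hdr
(not a FUSE type) and a hypothetical helper req_hdr_size():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-field header, declared with a tag and no typedef. */
struct req_hdr {
	uint32_t len;
	uint32_t opcode;
};

size_t req_hdr_size(void)
{
	/*
	 * sizeof(req_hdr) would not compile here: "req_hdr" alone is
	 * an undeclared identifier, which is the same kind of error
	 * as in the build log above. The tag needs "struct" in front.
	 */
	return sizeof(struct req_hdr);
}
```

Here req_hdr_size() evaluates to 8, the size of two uint32_t fields.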

---- 8< ----
From 98fd29bb6789d5f6c346274b99d47008ad856607 Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Wed, 12 Jun 2019 17:06:18 +0300
Subject: [PATCH v2] fuse: require /dev/fuse reads to have enough buffer capacity (take 2)

[ This retries commit d4b13963f217 which was reverted in 766741fcaa1f.

  In this version we require only `sizeof(fuse_in_header) + sizeof(fuse_write_in)`
  instead of 4K for FUSE request header room, because, contrary to
  libfuse and kernel client behaviour, GlusterFS actually provides only
  so much room for request header. ]

A FUSE filesystem server queues /dev/fuse sys_read calls to get
filesystem requests to handle. It does not know in advance what that
request will be, as it can be anything the client issues - LOOKUP, READ,
WRITE, ... Many requests are short and retrieve data from the
filesystem. However, WRITE and NOTIFY_REPLY write data into the filesystem.

Before getting into the operation phase, the FUSE filesystem server and
the kernel client negotiate the maximum write size the client will ever
issue. After negotiation, the contract between server and client is
that the filesystem server should queue /dev/fuse sys_read calls with
enough buffer capacity to receive any client request - WRITE in
particular - while the FUSE client should not send WRITE requests with
a payload larger than the negotiated max_write. The FUSE client in the
kernel and libfuse historically reserve 4K for the request header.
However, an existing filesystem server - GlusterFS - was found which
reserves only 80 bytes for header room
(= `sizeof(fuse_in_header) + sizeof(fuse_write_in)`).

https://lore.kernel.org/linux-fsdevel/20190611202738.GA22556@deco.navytux.spb.ru/
https://github.com/gluster/glusterfs/blob/v3.8.15-0-gd174f021a/xlators/mount/fuse/src/fuse-bridge.c#L4894

Since

	`sizeof(fuse_in_header) + sizeof(fuse_write_in)` ==
	`sizeof(fuse_in_header) + sizeof(fuse_read_in)`  ==
	`sizeof(fuse_in_header) + sizeof(fuse_notify_retrieve_in)`

is the absolute minimum any sane filesystem should be using for header
room, the contract is that filesystem server should queue sys_reads with
`sizeof(fuse_in_header) + sizeof(fuse_write_in)` + max_write buffer.

If the filesystem server does not follow this contract, what can happen
is that fuse_dev_do_read will see that the request size is > buffer size,
and then it will return EIO to the client that issued the request, but
will not indicate in any way to the filesystem server that there is a
problem. This can be hard to diagnose because for some requests, e.g. for
NOTIFY_REPLY which mimics WRITE, there is no client thread waiting for
request completion and that EIO goes nowhere, while on the filesystem
server side it looks as if the kernel is not replying after a
successful NOTIFY_RETRIEVE request made by the server.

We can make the problem easy to diagnose if we indicate via error return to
the filesystem server when it is violating the contract.  This should not
cause problems in practice: if a filesystem server is using a shorter
buffer, writes to it were already very likely to cause EIO, and a
read-only filesystem should likewise already be following the
FUSE_MIN_READ_BUFFER minimum buffer size.
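
The effect of the check on different server buffer sizes can be modelled
in userspace. This is only a sketch under stated assumptions:
fuse_read_buf_check() and FUSE_HDR_ROOM are hypothetical names, 80 stands
for sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in), and the
function mirrors only the size comparison, not the rest of
fuse_dev_do_read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same value as the FUSE_MIN_READ_BUFFER constant in <linux/fuse.h>. */
#define FUSE_MIN_READ_BUFFER 8192

/* Hypothetical name for the 80 bytes of fixed header room. */
#define FUSE_HDR_ROOM 80

/*
 * Userspace model of the size check: returns 0 if a read buffer of
 * nbytes is acceptable for the negotiated max_write, -EINVAL if not.
 */
int fuse_read_buf_check(size_t nbytes, size_t max_write)
{
	size_t need = FUSE_HDR_ROOM + max_write;

	if (need < FUSE_MIN_READ_BUFFER)
		need = FUSE_MIN_READ_BUFFER;

	return nbytes < need ? -EINVAL : 0;
}
```

For max_write = 131072, a GlusterFS-style buffer of 80 + 131072 bytes and
a libfuse-style buffer of 4096 + 131072 bytes both pass, while a buffer of
exactly max_write bytes is rejected with -EINVAL.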

Please see [1] for the context in which the problem of a stuck filesystem
was hit for real (because the kernel client was incorrectly sending more
than max_write data with NOTIFY_REPLY; see also the previous patch), how
the situation was traced, and for a more involved patch that did not make
it into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
---
 fs/fuse/dev.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ea8237513dfa..b2b2344eadcf 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1317,6 +1317,26 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	unsigned reqsize;
 	unsigned int hash;
 
+	/*
+	 * Require sane minimum read buffer - that has capacity for fixed part
+	 * of any request header + negotiated max_write room for data. If the
+	 * requirement is not satisfied return EINVAL to the filesystem server
+	 * to indicate that it is not following FUSE server/client contract.
+	 * Don't dequeue / abort any request.
+	 *
+	 * Historically libfuse reserves 4K for fixed header room, but e.g.
+	 * GlusterFS reserves only 80 bytes
+	 *
+	 *	= `sizeof(fuse_in_header) + sizeof(fuse_write_in)`
+	 *
+	 * which is the absolute minimum any sane filesystem should be using
+	 * for header room.
+	 */
+	if (nbytes < max_t(size_t, FUSE_MIN_READ_BUFFER,
+			   sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
+				fc->max_write))
+		return -EINVAL;
+
  restart:
 	spin_lock(&fiq->waitq.lock);
 	err = -EAGAIN;

Sander Eikelenboom June 12, 2019, 4:28 p.m. UTC | #4
On 12/06/2019 16:12, Kirill Smelkov wrote:
> On Wed, Jun 12, 2019 at 03:03:49PM +0200, Sander Eikelenboom wrote:
>> On 12/06/2019 13:25, Kirill Smelkov wrote:
>>> On Wed, Jun 12, 2019 at 09:44:49AM +0200, Miklos Szeredi wrote:
>>>> On Tue, Jun 11, 2019 at 10:28 PM Kirill Smelkov <kirr@nexedi.com> wrote:
>>>>
>>>>> Miklos, would 4K -> `sizeof(fuse_in_header) + sizeof(fuse_write_in)` for
>>>>> header room change be accepted?
>>>>
>>>> Yes, next cycle.   For 4.2 I'll just push the revert.
>>>
>>> Thanks Miklos. Please consider queuing the following patch for 5.3.
>>> Sander, could you please confirm that glusterfs is not broken with this
>>> version of the check?
>>>
>>> Thanks beforehand,
>>> Kirill
>>
>>
>> Hmm unfortunately it doesn't build, see below.
>> [...]
>> fs/fuse/dev.c:1336:14: error: ‘fuse_in_header’ undeclared (first use in this function)
>>        sizeof(fuse_in_header) + sizeof(fuse_write_in) + fc->max_write))
> 
> Sorry, my bad, it was missing "struct" before fuse_in_header. I
> originally compile-tested the patch with `make -j4`, got distracted by
> another topic, and on returning did not see the error because of the long
> tail of successful CC lines. I apologize for the inconvenience. Below is a
> fixed patch that was both compile-tested and runtime-tested with my FUSE
> workloads (non-glusterfs).
> 
> Kirill
> 

Just tested and it works for me, thanks !

--
Sander

Kirill Smelkov June 12, 2019, 5:03 p.m. UTC | #5
On Wed, Jun 12, 2019 at 06:28:17PM +0200, Sander Eikelenboom wrote:
> On 12/06/2019 16:12, Kirill Smelkov wrote:
> > On Wed, Jun 12, 2019 at 03:03:49PM +0200, Sander Eikelenboom wrote:
> >> On 12/06/2019 13:25, Kirill Smelkov wrote:
> >>> On Wed, Jun 12, 2019 at 09:44:49AM +0200, Miklos Szeredi wrote:
> >>>> On Tue, Jun 11, 2019 at 10:28 PM Kirill Smelkov <kirr@nexedi.com> wrote:
> >>>>
> >>>>> Miklos, would 4K -> `sizeof(fuse_in_header) + sizeof(fuse_write_in)` for
> >>>>> header room change be accepted?
> >>>>
> >>>> Yes, next cycle.   For 4.2 I'll just push the revert.
> >>>
> >>> Thanks Miklos. Please consider queuing the following patch for 5.3.
> >>> Sander, could you please confirm that glusterfs is not broken with this
> >>> version of the check?
> >>>
> >>> Thanks beforehand,
> >>> Kirill
> >>
> >>
> >> Hmm unfortunately it doesn't build, see below.
> >> [...]
> >> fs/fuse/dev.c:1336:14: error: ‘fuse_in_header’ undeclared (first use in this function)
> >>        sizeof(fuse_in_header) + sizeof(fuse_write_in) + fc->max_write))
> > 
> > Sorry, my bad, it was missing "struct" before fuse_in_header. I
> > originally compile-tested the patch with `make -j4`, got distracted by
> > another topic, and on returning did not see the error because of the long
> > tail of successful CC lines. I apologize for the inconvenience. Below is a
> > fixed patch that was both compile-tested and runtime-tested with my FUSE
> > workloads (non-glusterfs).
> > 
> > Kirill
> > 
> 
> Just tested and it works for me, thanks !

Thanks for feedback. Kirill

Patch

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ea8237513dfa..15531ba560b5 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1317,6 +1317,25 @@  static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	unsigned reqsize;
 	unsigned int hash;
 
+	/*
+	 * Require sane minimum read buffer - that has capacity for fixed part
+	 * of any request header + negotiated max_write room for data. If the
+	 * requirement is not satisfied return EINVAL to the filesystem server
+	 * to indicate that it is not following FUSE server/client contract.
+	 * Don't dequeue / abort any request.
+	 *
+	 * Historically libfuse reserves 4K for fixed header room, but e.g.
+	 * GlusterFS reserves only 80 bytes
+	 *
+	 *	= `sizeof(fuse_in_header) + sizeof(fuse_write_in)`
+	 *
+	 * which is the absolute minimum any sane filesystem should be using
+	 * for header room.
+	 */
+	if (nbytes < max_t(size_t, FUSE_MIN_READ_BUFFER,
+			   sizeof(fuse_in_header) + sizeof(fuse_write_in) + fc->max_write))
+		return -EINVAL;
+
  restart:
 	spin_lock(&fiq->waitq.lock);
 	err = -EAGAIN;