[RESEND,v2] fuse: Don't drop NOTIFY_REPLY if we promised it

From: Kirill Smelkov <kirr@nexedi.com>

A successful NOTIFY_RETRIEVE call by the filesystem carries a promise
from the kernel to send back a NOTIFY_REPLY message. However, if the
filesystem is not reading requests with fuse_conn->max_pages capacity,
fuse_dev_do_read might see that the "request is too large" and decide to
"reply with an error and restart the read". "Reply with an error"
assumes that there is a "requester thread" waiting for request
completion. That is true for most requests, but not for NOTIFY_REPLY:
the NOTIFY_RETRIEVE handler completes with OK status right after it has
successfully queued the NOTIFY_REPLY message, without waiting for
NOTIFY_REPLY completion. As a result, a filesystem can request inode
data with NOTIFY_RETRIEVE, get err=OK for that notification request,
and yet never receive the NOTIFY_REPLY.

Moreover, since there is no "requester thread" to handle the error, the
situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_
/dev/fuse read(s) queued. This is misleading, since the NOTIFY_REPLY
request was removed from the pending queue and abandoned.
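
The failure mode can be illustrated with a small user-space model. This
is only a sketch, not the kernel code: struct sketch_req, the outcome
enum and sketch_old_read are hypothetical names, and the model reduces
the read path to the single "request is too large" decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a queued FUSE request. */
struct sketch_req {
	size_t reqsize;		/* fixed part + data, in bytes */
	bool has_waiter;	/* a requester thread sleeps on completion */
};

enum sketch_outcome {
	SKETCH_DELIVERED,	/* copied into the server's read buffer */
	SKETCH_WAITER_ERROR,	/* dropped, requester thread gets the error */
	SKETCH_LOST,		/* dropped, nobody is left to notice */
};

/* Model of the pre-fix "request is too large" decision for one read. */
static enum sketch_outcome sketch_old_read(const struct sketch_req *req,
					   size_t nbytes)
{
	assert(req != NULL);

	if (req->reqsize <= nbytes)
		return SKETCH_DELIVERED;

	/* "reply with an error and restart the read" */
	return req->has_waiter ? SKETCH_WAITER_ERROR : SKETCH_LOST;
}
```

In this model an oversized NOTIFY_REPLY corresponds to has_waiter being
false: the request is dropped and no thread ever sees the error.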

One way to fix this would be to change the NOTIFY_RETRIEVE handler to
wait until the queued NOTIFY_REPLY is actually read back by the server,
and only then return NOTIFY_RETRIEVE status. However, this is a change
in behaviour and would require filesystems to have at least 2 threads.
In particular, a single-threaded filesystem that was previously using
NOTIFY_RETRIEVE successfully would become stuck after the change. This
way of fixing is thus not acceptable.

However, we can fix it another way: by always returning NOTIFY_REPLY
regardless of its original size, with as much data as the provided read
buffer can fit. This aligns with the way the NOTIFY_RETRIEVE handler
works, which already unconditionally caps the requested retrieve size to
fuse_conn->max_pages. Returning less data than was originally requested
should therefore not hurt NOTIFY_RETRIEVE semantics.
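
The truncation rule can be sketched as follows. This is a user-space
illustration under assumptions, not the patch itself:
sketch_reply_data_size is a hypothetical helper, and the size of the
fixed NOTIFY_REPLY part is passed in as a parameter rather than taken
from the kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return how many data bytes of a NOTIFY_REPLY can be delivered into a
 * read buffer of nbytes, given the size of the fixed part and the
 * amount of data originally queued.
 */
static size_t sketch_reply_data_size(size_t fixed, size_t data,
				     size_t nbytes)
{
	size_t room;

	/* The caller must already have rejected too-small buffers. */
	assert(nbytes >= fixed);

	room = nbytes - fixed;
	return data < room ? data : room;
}
```

The returned value would also have to be written back into the size
field of the reply, so that the server sees the truncated length and not
the originally requested one.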

This fix requires another behaviour change, however: to be sure that the
read buffer always has enough capacity for the fixed NOTIFY_REPLY part
plus at least some (0 or more) data, we have to precheck the buffer
before dequeuing and handling a request. If the buffer is too small, we
return EINVAL to the read in the filesystem, meaning that the queued
read was invalid from the viewpoint of the FUSE protocol. Even though
this is also a behaviour change, it should not cause problems in
practice: 1d3d752b47 (fuse: clean up request size limit checking), which
originally removed such an EINVAL return and reworked fuse_dev_do_read
to loop and retry, also added FUSE_MIN_READ_BUFFER=8K to the
user-visible fuse.h with the comment "The read buffer is required to be
at least 8k ...". Even though FUSE_MIN_READ_BUFFER is not currently
checked anywhere in the kernel, libfuse always initializes a session
with bufsize=32·pages and, since at least 2005, issues a warning if the
user modifies fuse_session->bufsize directly, to make sure that queued
buffers are at least as large as that sane minimum:

	https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fuse_lowlevel.c#L2869
	https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fuse_lowlevel.c#L1947
	(semantic added in https://github.com/libfuse/libfuse/commit/044da2e9e0)

This way we should be safe to add the check for minimum read buffer size.
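
As an illustration, the precheck could look roughly like the user-space
sketch below. The names are hypothetical, the 8K constant mirrors
FUSE_MIN_READ_BUFFER, and the 80-byte fixed part is an assumed value
used only to show that the minimum leaves room for it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_MIN_READ_BUFFER	8192	/* mirrors FUSE_MIN_READ_BUFFER=8K */
#define SKETCH_REPLY_FIXED	80	/* assumed fixed NOTIFY_REPLY part */

static_assert(SKETCH_MIN_READ_BUFFER >= SKETCH_REPLY_FIXED,
	      "minimum buffer must hold the fixed reply part");

/* Reject a queued read before any request is dequeued for it. */
static int sketch_check_read_buffer(size_t nbytes)
{
	if (nbytes < SKETCH_MIN_READ_BUFFER)
		return -EINVAL;

	return 0;
}
```

Doing the check before dequeuing means no request is lost when the
buffer is rejected: it simply stays on the pending queue.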

I've hit this bug for real with my filesystem, which uses
https://github.com/hanwen/go-fuse: there was no NOTIFY_REPLY after a
successful NOTIFY_RETRIEVE and the filesystem was stuck waiting,
because the FUSE protocol (whose definition is scattered through many
places) states that NOTIFY_REPLY is guaranteed to come after a
successful NOTIFY_RETRIEVE (see 2d45ba381a "fuse: add retrieve
request"). After inspecting /sys/fs/fuse/connections/X/waiting and
seeing it was 1, I initially suspected that userspace was not issuing
/dev/fuse reads and that NOTIFY_REPLY was there but stuck in the kernel
pending queue. However, tracing the /dev/fuse exchange in both kernel
and userspace
(see https://lab.nexedi.com/kirr/wendelin.core/blob/13d2d1f8/wcfs/fusetrace)
showed that correctly queued /dev/fuse reads were still pending after
NOTIFY_RETRIEVE returned, and that it was the kernel that was not
replying:

	...

	P2 2.215710 /dev/fuse <- qread      wcfs/11399_4_r:

	        syscall.Syscall+48
	        syscall.Read+73
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+107
	        runtime.goexit+1

	P2 2.215810 /dev/fuse -> read       wcfs/11399_4_r:
	        .56  RELEASE i8 ...             (ret=64)

	P2 2.215859 /dev/fuse <- write      wcfs/11399_5_w:
	        .56 (0) ...

	        syscall.Syscall+48
	        syscall.Write+73
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931
	        github.com/hanwen/go-fuse/fuse.(*Server).write+194
	        github.com/hanwen/go-fuse/fuse.(*Server).handleRequest+179
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+399
	        runtime.goexit+1

	P2 2.215871 /dev/fuse -> write_ack  wcfs/11399_5_w (ret=16)

	P2 2.215876 /dev/fuse <- qread      wcfs/11399_5_r:    <-- NOTE

	        syscall.Syscall+48
	        syscall.Read+73
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+107
	        runtime.goexit+1

	P0 2.221527 /dev/fuse <- qread      wcfs/11401_1_r:    <-- NOTE

	        syscall.Syscall+48
	        syscall.Read+73
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+107
	        runtime.goexit+1

	P1 2.239384 /dev/fuse -> read       wcfs/11398_6_r:	# woken read that was queued before "..."
	        .57  READ i5 ...                (ret=80)

	P0 2.239626 /dev/fuse <- write      wcfs/11397_0_w:
	        NOTIFY_RETRIEVE ...

	        syscall.Syscall+48
	        syscall.Write+73
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931
	        github.com/hanwen/go-fuse/fuse.(*Server).write+194
	        github.com/hanwen/go-fuse/fuse.(*Server).InodeRetrieveCache+764
	        github.com/hanwen/go-fuse/fuse/nodefs.(*FileSystemConnector).FileRetrieveCache+157
	        main.(*BigFile).invalidateBlk+232
	        main.(*Root).zδhandle1.func1+72
	        golang.org/x/sync/errgroup.(*Group).Go.func1+87
	        runtime.goexit+1

	P0 2.239660 /dev/fuse -> write_ack  wcfs/11397_0_w (ret=48)

	# stuck
	# (full trace: https://lab.nexedi.com/kirr/wendelin.core/commit/96416aaabd)

with queued / served read analysis confirming that two reads were indeed queued
and not served:

	grep -w -e '<- qread\>' y.log |awk {'print $6'} |sort >qread.txt
	grep -w -e '-> read\>'  y.log |awk {'print $6'} |sort >read.txt

	# xdiff qread.txt read.txt
	diff --git a/qread.txt b/read.txt
	index 4ab50d7..fdd2be1 100644
	--- a/qread.txt
	+++ b/read.txt
	@@ -53,7 +53,5 @@ wcfs/11399_1_r:
	 wcfs/11399_2_r:
	 wcfs/11399_3_r:
	 wcfs/11399_4_r:
	-wcfs/11399_5_r:
	 wcfs/11400_0_r:
	 wcfs/11401_0_r:
	-wcfs/11401_1_r:

The bug was hit because go-fuse by default uses 64K for the read buffer
size

	https://github.com/hanwen/go-fuse/blob/33711add/fuse/server.go#L142

while the kernel presets fuse_conn->max_pages to 32 pages, i.e. 128K
(= 32·4K), so a full-size NOTIFY_REPLY does not fit into a 64K buffer.
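
The size mismatch can be checked with a few lines of arithmetic. This is
only a sketch: sketch_full_reply_fits is a hypothetical helper, 4K pages
are assumed, and the size of the fixed reply part is left as a
parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumed 4K pages */

static_assert(32 * SKETCH_PAGE_SIZE == 128 * 1024,
	      "32 pages of 4K are 128K");

/* Would a reply carrying max_pages of data fit into bufsize bytes? */
static bool sketch_full_reply_fits(size_t max_pages, size_t fixed,
				   size_t bufsize)
{
	size_t need;

	need = fixed + max_pages * SKETCH_PAGE_SIZE;
	return need <= bufsize;
}
```

With max_pages=32 and a 64K buffer the function returns false for any
fixed part size, which matches the situation hit with go-fuse.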

Go-fuse will likely be fixed both to use the kernel's bufsize and to
correctly handle size > bufsize in InodeRetrieveCache. However, we
should also fix the kernel to always deliver NOTIFY_REPLY once
NOTIFY_RETRIEVE was successful, so that the FUSE protocol guarantee
always holds regardless of whether userspace used the default or another
valid buffer size setting, and so that filesystems can count on not
getting stuck waiting for a kernel that promised a reply.

Hence this patch.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Cc: <stable@vger.kernel.org> # v2.6.36+
---

 The first version of this patch was sent 1 week ago, but got no response:
 https://marc.info/?l=linux-fsdevel&m=155000277921155&w=2

 Changes since v1: don't forget to also update req->misc.retrieve_in.size
 after truncation.

 ( This is my first patch to fs/fuse, so please forgive me if I missed anything. )

 fs/fuse/dev.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 65 insertions(+), 6 deletions(-)

Message ID	20190219094147.32734-1-kirr@nexedi.com (mailing list archive)
State	New, archived
To	Miklos Szeredi <miklos@szeredi.hu>, Miklos Szeredi <mszeredi@redhat.com>
Date	Tue, 19 Feb 2019 09:42:10 +0000