From patchwork Wed Jun 19 20:40:24 2024
X-Patchwork-Submitter: Mike Snitzer
X-Patchwork-Id: 13704521
From: Mike Snitzer
To: linux-nfs@vger.kernel.org
Cc: Jeff Layton, Chuck Lever, Trond Myklebust, NeilBrown, snitzer@hammerspace.com
Subject: [PATCH v6 10/18] nfs/localio: use dedicated workqueues for filesystem read and write
Date: Wed, 19 Jun 2024 16:40:24 -0400
Message-ID: <20240619204032.93740-11-snitzer@kernel.org>
X-Mailer: git-send-email 2.44.0
In-Reply-To: <20240619204032.93740-1-snitzer@kernel.org>
References: <20240619204032.93740-1-snitzer@kernel.org>

From: Trond Myklebust

For localio access, don't call filesystem read() and write() routines
directly. Some filesystem writeback routines can end up consuming a lot
of stack space (particularly xfs). Instead of risking overrunning the
stack due to the extra overhead from the NFS call chain, call these
routines from a workqueue job.

Use of dedicated workqueues improves performance over using the
system_unbound_wq. Localio is motivated by the promise of improved
performance; it makes little sense to yield that back. But further
analysis of the latest stack depth requirements would be useful.

It would be nice to root-cause and fix the latest stack hogs, because
using workqueues at all can cost performance due to the added context
switches.
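The pattern, distilled (an illustrative sketch only -- the demo_* names
below are placeholders, not symbols from this series): the work item and
completion live on the submitting task's stack, the actual
->read_iter()/->write_iter() call runs on a workqueue worker's stack, and
the submitter waits for the work to finish before tearing the work item
down.

#include <linux/workqueue.h>
#include <linux/completion.h>

struct demo_io_args {			/* placeholder, for illustration */
	struct work_struct work;
	struct completion *done;
	/* per-I/O state (kiocb, iov_iter source, ...) would live here */
};

static void demo_call_io(struct work_struct *work)
{
	struct demo_io_args *args =
		container_of(work, struct demo_io_args, work);

	/*
	 * Issue filp->f_op->read_iter()/->write_iter() here, on the
	 * worker's stack instead of on top of the NFS call chain.
	 */

	complete(args->done);		/* wake the submitting task */
}

static int demo_do_io(struct workqueue_struct *wq)
{
	struct demo_io_args args;
	DECLARE_COMPLETION_ONSTACK(done);

	args.done = &done;
	INIT_WORK_ONSTACK(&args.work, demo_call_io);
	queue_work(wq, &args.work);
	/* args is on this stack frame: do not return before the work runs */
	wait_for_completion(&done);
	destroy_work_on_stack(&args.work);
	return 0;
}

The synchronous wait is what makes the on-stack work item safe: the
submitter's stack frame cannot unwind while the worker still references
args.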
Signed-off-by: Trond Myklebust
Signed-off-by: Mike Snitzer
---
 fs/nfs/inode.c    |  57 +++++++++++++++++---------
 fs/nfs/internal.h |   1 +
 fs/nfs/localio.c  | 102 +++++++++++++++++++++++++++++++++++-----------
 3 files changed, 118 insertions(+), 42 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index f9923cbf6058..aac8c5302503 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2394,35 +2394,54 @@ static void nfs_destroy_inodecache(void)
 	kmem_cache_destroy(nfs_inode_cachep);
 }
 
+struct workqueue_struct *nfslocaliod_workqueue;
 struct workqueue_struct *nfsiod_workqueue;
 EXPORT_SYMBOL_GPL(nfsiod_workqueue);
 
 /*
- * start up the nfsiod workqueue
- */
-static int nfsiod_start(void)
-{
-	struct workqueue_struct *wq;
-	dprintk("RPC: creating workqueue nfsiod\n");
-	wq = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
-	if (wq == NULL)
-		return -ENOMEM;
-	nfsiod_workqueue = wq;
-	return 0;
-}
-
-/*
- * Destroy the nfsiod workqueue
+ * Destroy the nfsiod workqueues
  */
 static void nfsiod_stop(void)
 {
 	struct workqueue_struct *wq;
 
 	wq = nfsiod_workqueue;
-	if (wq == NULL)
-		return;
-	nfsiod_workqueue = NULL;
-	destroy_workqueue(wq);
+	if (wq != NULL) {
+		nfsiod_workqueue = NULL;
+		destroy_workqueue(wq);
+	}
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+	wq = nfslocaliod_workqueue;
+	if (wq != NULL) {
+		nfslocaliod_workqueue = NULL;
+		destroy_workqueue(wq);
+	}
+#endif /* CONFIG_NFS_LOCALIO */
+}
+
+/*
+ * Start the nfsiod workqueues
+ */
+static int nfsiod_start(void)
+{
+	dprintk("RPC: creating workqueue nfsiod\n");
+	nfsiod_workqueue = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+	if (nfsiod_workqueue == NULL)
+		return -ENOMEM;
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+	/*
+	 * localio writes need to use a normal (non-memreclaim) workqueue.
+	 * When we start getting low on space, XFS goes and calls flush_work() on
+	 * a non-memreclaim work queue, which causes a priority inversion problem.
+	 */
+	dprintk("RPC: creating workqueue nfslocaliod\n");
+	nfslocaliod_workqueue = alloc_workqueue("nfslocaliod", WQ_UNBOUND, 0);
+	if (unlikely(nfslocaliod_workqueue == NULL)) {
+		nfsiod_stop();
+		return -ENOMEM;
+	}
+#endif /* CONFIG_NFS_LOCALIO */
+	return 0;
 }
 
 unsigned int nfs_net_id;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index d352040e3232..9251a357d097 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -440,6 +440,7 @@ int nfs_check_flags(int);
 
 /* inode.c */
 extern struct workqueue_struct *nfsiod_workqueue;
+extern struct workqueue_struct *nfslocaliod_workqueue;
 extern struct inode *nfs_alloc_inode(struct super_block *sb);
 extern void nfs_free_inode(struct inode *);
 extern int nfs_write_inode(struct inode *, struct writeback_control *);
diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 38d0832442b2..a0577bbd71ad 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -44,6 +44,12 @@ struct nfs_local_fsync_ctx {
 };
 static void nfs_local_fsync_work(struct work_struct *work);
 
+struct nfs_local_io_args {
+	struct nfs_local_kiocb *iocb;
+	struct work_struct work;
+	struct completion *done;
+};
+
 /*
  * We need to translate between nfs status return values and
  * the local errno values which may not be the same.
@@ -345,21 +351,38 @@ nfs_local_read_aio_complete(struct kiocb *kiocb, long ret)
 	nfs_local_pgio_complete(iocb);
 }
 
-static int
-nfs_do_local_read(struct nfs_pgio_header *hdr, struct file *filp,
-		const struct rpc_call_ops *call_ops)
+static void nfs_local_call_read(struct work_struct *work)
 {
-	struct nfs_local_kiocb *iocb;
+	struct nfs_local_io_args *args =
+		container_of(work, struct nfs_local_io_args, work);
+	struct nfs_local_kiocb *iocb = args->iocb;
+	struct file *filp = iocb->kiocb.ki_filp;
 	struct iov_iter iter;
 	ssize_t status;
 
+	nfs_local_iter_init(&iter, iocb, READ);
+
+	status = filp->f_op->read_iter(&iocb->kiocb, &iter);
+	if (status != -EIOCBQUEUED) {
+		nfs_local_read_done(iocb, status);
+		nfs_local_pgio_release(iocb);
+	}
+	complete(args->done);
+}
+
+static int nfs_do_local_read(struct nfs_pgio_header *hdr, struct file *filp,
+		const struct rpc_call_ops *call_ops)
+{
+	struct nfs_local_io_args args;
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct nfs_local_kiocb *iocb;
+
 	dprintk("%s: vfs_read count=%u pos=%llu\n",
 		__func__, hdr->args.count, hdr->args.offset);
 
 	iocb = nfs_local_iocb_alloc(hdr, filp, GFP_KERNEL);
 	if (iocb == NULL)
 		return -ENOMEM;
-	nfs_local_iter_init(&iter, iocb, READ);
 
 	nfs_local_pgio_init(hdr, call_ops);
 	hdr->res.eof = false;
@@ -369,11 +392,18 @@ nfs_do_local_read(struct nfs_pgio_header *hdr, struct file *filp,
 		iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
 	}
 
-	status = filp->f_op->read_iter(&iocb->kiocb, &iter);
-	if (status != -EIOCBQUEUED) {
-		nfs_local_read_done(iocb, status);
-		nfs_local_pgio_release(iocb);
-	}
+	/*
+	 * Don't call filesystem read() routines directly.
+	 * In order to avoid issues with stack overflow,
+	 * call the read routines from a workqueue job.
+	 */
+	args.iocb = iocb;
+	args.done = &done;
+	INIT_WORK_ONSTACK(&args.work, nfs_local_call_read);
+	queue_work(nfslocaliod_workqueue, &args.work);
+	wait_for_completion(&done);
+	destroy_work_on_stack(&args.work);
+
 	return 0;
 }
 
@@ -483,14 +513,35 @@ nfs_local_write_aio_complete(struct kiocb *kiocb, long ret)
 	nfs_local_pgio_complete(iocb);
 }
 
-static int
-nfs_do_local_write(struct nfs_pgio_header *hdr, struct file *filp,
-		const struct rpc_call_ops *call_ops)
+static void nfs_local_call_write(struct work_struct *work)
 {
-	struct nfs_local_kiocb *iocb;
+	struct nfs_local_io_args *args =
+		container_of(work, struct nfs_local_io_args, work);
+	struct nfs_local_kiocb *iocb = args->iocb;
+	struct file *filp = iocb->kiocb.ki_filp;
 	struct iov_iter iter;
 	ssize_t status;
 
+	nfs_local_iter_init(&iter, iocb, WRITE);
+
+	file_start_write(filp);
+	status = filp->f_op->write_iter(&iocb->kiocb, &iter);
+	file_end_write(filp);
+	if (status != -EIOCBQUEUED) {
+		nfs_local_write_done(iocb, status);
+		nfs_get_vfs_attr(filp, iocb->hdr->res.fattr);
+		nfs_local_pgio_release(iocb);
+	}
+	complete(args->done);
+}
+
+static int nfs_do_local_write(struct nfs_pgio_header *hdr, struct file *filp,
+		const struct rpc_call_ops *call_ops)
+{
+	struct nfs_local_io_args args;
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct nfs_local_kiocb *iocb;
+
 	dprintk("%s: vfs_write count=%u pos=%llu %s\n",
 		__func__, hdr->args.count, hdr->args.offset,
 		(hdr->args.stable == NFS_UNSTABLE) ? "unstable" : "stable");
"unstable" : "stable"); @@ -498,7 +549,6 @@ nfs_do_local_write(struct nfs_pgio_header *hdr, struct file *filp, iocb = nfs_local_iocb_alloc(hdr, filp, GFP_NOIO); if (iocb == NULL) return -ENOMEM; - nfs_local_iter_init(&iter, iocb, WRITE); switch (hdr->args.stable) { default: @@ -518,14 +568,20 @@ nfs_do_local_write(struct nfs_pgio_header *hdr, struct file *filp, nfs_set_local_verifier(hdr->inode, hdr->res.verf, hdr->args.stable); - file_start_write(filp); - status = filp->f_op->write_iter(&iocb->kiocb, &iter); - file_end_write(filp); - if (status != -EIOCBQUEUED) { - nfs_local_write_done(iocb, status); - nfs_get_vfs_attr(filp, hdr->res.fattr); - nfs_local_pgio_release(iocb); - } + /* + * Don't call filesystem write() routines directly. + * Some filesystem writeback routines can end up taking up a lot of + * stack (particularly xfs). Instead of risking running over due to + * the extra overhead from the NFS stack, call these write routines + * from a workqueue job. + */ + args.iocb = iocb; + args.done = &done; + INIT_WORK_ONSTACK(&args.work, nfs_local_call_write); + queue_work(nfslocaliod_workqueue, &args.work); + wait_for_completion(&done); + destroy_work_on_stack(&args.work); + return 0; }