From patchwork Tue Jun 30 13:54:08 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Dumazet X-Patchwork-Id: 6696211 Return-Path: X-Original-To: patchwork-linux-fsdevel@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork2.web.kernel.org (Postfix) with ESMTP id 249F5C05AC for ; Tue, 30 Jun 2015 13:54:40 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 1466320457 for ; Tue, 30 Jun 2015 13:54:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id E0678203C4 for ; Tue, 30 Jun 2015 13:54:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751620AbbF3NyR (ORCPT ); Tue, 30 Jun 2015 09:54:17 -0400 Received: from mail-wi0-f181.google.com ([209.85.212.181]:36003 "EHLO mail-wi0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752119AbbF3NyO (ORCPT ); Tue, 30 Jun 2015 09:54:14 -0400 Received: by widjy10 with SMTP id jy10so31131270wid.1; Tue, 30 Jun 2015 06:54:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:subject:from:to:cc:date:in-reply-to:references :content-type:mime-version:content-transfer-encoding; bh=FZzxVqhVPOCtw6aHmH/dQLiDRSaBAX6/bOgW3q7bvWQ=; b=yKbkrGHnmUp3obMKQif8QydJsr/+7ddOlvkUZ+ihYwPVTSgtSj1p+FtdUs51NKH/yv oz+oJirddEMw4P2GOviT4OZMARwO3pP+nG9bX7VYUmftZSVR1H9O+cu2G0QExlW+3ycH maIghvKEYogzvMY/RzBbhBV4m7DT5BhyrNazcajgN5WsDvdyujavURfJqfJQdZCT+fgB JFXqtQDra/sHtp0kyHuiga02+sBI+OwnOrE8RCfwWpczZ6sZ2YsWdpg0DEnaNH5xOc2p ogbm5ZcGwjPBx4AY+0RHfOnr+KHVVt6KdmTLE/aB/HmRYhQOtbM+knLeFss36ze0M97Y wZIg== X-Received: by 10.180.101.233 with SMTP id fj9mr33149016wib.45.1435672452546; Tue, 30 Jun 2015 06:54:12 -0700 (PDT) Received: from [172.28.90.243] ([172.28.90.243]) by mx.google.com with ESMTPSA id bg6sm68941912wjc.13.2015.06.30.06.54.10 (version=TLSv1.2 cipher=AES128-GCM-SHA256 bits=128/128); Tue, 30 Jun 2015 06:54:11 -0700 (PDT) Message-ID: <1435672448.4110.123.camel@edumazet-glaptop2.roam.corp.google.com> Subject: [PATCH v3] fs/file.c: don't acquire files->file_lock in fd_install() From: Eric Dumazet To: Mateusz Guzik , Al Viro Cc: Andrew Morton , "Paul E. McKenney" , Yann Droneaud , Konstantin Khlebnikov , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Date: Tue, 30 Jun 2015 15:54:08 +0200 In-Reply-To: <1430281503.3711.27.camel@edumazet-glaptop2.roam.corp.google.com> References: <1429307216.7346.255.camel@edumazet-glaptop2.roam.corp.google.com> <20150417221646.GA15589@mguzik> <20150417230252.GE889@ZenIV.linux.org.uk> <20150420130633.GA2513@mguzik> <20150420134326.GC2513@mguzik> <20150420151054.GD2513@mguzik> <1429550126.7346.268.camel@edumazet-glaptop2.roam.corp.google.com> <1429562991.7346.290.camel@edumazet-glaptop2.roam.corp.google.com> <1429639543.7346.329.camel@edumazet-glaptop2.roam.corp.google.com> <20150421200624.GA16097@mguzik> <20150421201201.GB16097@mguzik> <1429650413.18561.10.camel@edumazet-glaptop2.roam.corp.google.com> <1429678768.18561.64.camel@edumazet-glaptop2.roam.corp.google.com> <1430281503.3711.27.camel@edumazet-glaptop2.roam.corp.google.com> X-Mailer: Evolution 3.10.4-0ubuntu2 Mime-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Spam-Status: No, score=-7.4 required=5.0 tests=BAYES_00, DKIM_ADSP_CUSTOM_MED, DKIM_SIGNED, FREEMAIL_FROM, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, T_DKIM_INVALID, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Eric Dumazet Mateusz Guzik reported : Currently obtaining a new file descriptor results in locking fdtable twice - once in order to reserve a slot and second time to fill it. Holding the spinlock in __fd_install() is needed in case a resize is done, or to prevent a resize. Mateusz provided an RFC patch and a micro benchmark : http://people.redhat.com/~mguzik/pipebench.c A resize is an unlikely operation in a process lifetime, as table size is at least doubled at every resize. We can use RCU instead of the spinlock. __fd_install() must wait if a resize is in progress. The resize must block new __fd_install() callers from starting, and wait that ongoing install are finished (synchronize_sched()) resize should be attempted by a single thread to not waste resources. rcu_sched variant is used, as __fd_install() and expand_fdtable() run from process context. It gives us a ~30% speedup using pipebench on a dual Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz Signed-off-by: Eric Dumazet Reported-by: Mateusz Guzik Acked-by: Mateusz Guzik Tested-by: Mateusz Guzik --- v3: added a might_sleep() and a section in Documentation/filesystems/porting v2: replaced a READ_ONCE() by more appropriate rcu_dereference_sched() Documentation/filesystems/porting | 4 + fs/file.c | 67 ++++++++++++++++++++-------- include/linux/fdtable.h | 3 + 3 files changed, 55 insertions(+), 19 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 68f1c9106573..f24d1b833957 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -500,3 +500,7 @@ in your dentry operations instead. dentry, it does not get nameidata at all and it gets called only when cookie is non-NULL. Note that link body isn't available anymore, so if you need it, store it as cookie. +-- +[mandatory] + __fd_install() & fd_install() can now sleep. Callers should not + hold a spinlock or other resources that do not allow a schedule. diff --git a/fs/file.c b/fs/file.c index 93c5f89c248b..3d2eb4c542a4 100644 --- a/fs/file.c +++ b/fs/file.c @@ -147,6 +147,13 @@ static int expand_fdtable(struct files_struct *files, int nr) spin_unlock(&files->file_lock); new_fdt = alloc_fdtable(nr); + + /* make sure all __fd_install() have seen resize_in_progress + * or have finished their rcu_read_lock_sched() section. + */ + if (atomic_read(&files->count) > 1) + synchronize_sched(); + spin_lock(&files->file_lock); if (!new_fdt) return -ENOMEM; @@ -158,21 +165,14 @@ static int expand_fdtable(struct files_struct *files, int nr) __free_fdtable(new_fdt); return -EMFILE; } - /* - * Check again since another task may have expanded the fd table while - * we dropped the lock - */ cur_fdt = files_fdtable(files); - if (nr >= cur_fdt->max_fds) { - /* Continue as planned */ - copy_fdtable(new_fdt, cur_fdt); - rcu_assign_pointer(files->fdt, new_fdt); - if (cur_fdt != &files->fdtab) - call_rcu(&cur_fdt->rcu, free_fdtable_rcu); - } else { - /* Somebody else expanded, so undo our attempt */ - __free_fdtable(new_fdt); - } + BUG_ON(nr < cur_fdt->max_fds); + copy_fdtable(new_fdt, cur_fdt); + rcu_assign_pointer(files->fdt, new_fdt); + if (cur_fdt != &files->fdtab) + call_rcu(&cur_fdt->rcu, free_fdtable_rcu); + /* coupled with smp_rmb() in __fd_install() */ + smp_wmb(); return 1; } @@ -185,21 +185,38 @@ static int expand_fdtable(struct files_struct *files, int nr) * The files->file_lock should be held on entry, and will be held on exit. */ static int expand_files(struct files_struct *files, int nr) + __releases(files->file_lock) + __acquires(files->file_lock) { struct fdtable *fdt; + int expanded = 0; +repeat: fdt = files_fdtable(files); /* Do we need to expand? */ if (nr < fdt->max_fds) - return 0; + return expanded; /* Can we expand? */ if (nr >= sysctl_nr_open) return -EMFILE; + if (unlikely(files->resize_in_progress)) { + spin_unlock(&files->file_lock); + expanded = 1; + wait_event(files->resize_wait, !files->resize_in_progress); + spin_lock(&files->file_lock); + goto repeat; + } + /* All good, so we try */ - return expand_fdtable(files, nr); + files->resize_in_progress = true; + expanded = expand_fdtable(files, nr); + files->resize_in_progress = false; + + wake_up_all(&files->resize_wait); + return expanded; } static inline void __set_close_on_exec(int fd, struct fdtable *fdt) @@ -256,6 +273,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) atomic_set(&newf->count, 1); spin_lock_init(&newf->file_lock); + newf->resize_in_progress = false; + init_waitqueue_head(&newf->resize_wait); newf->next_fd = 0; new_fdt = &newf->fdtab; new_fdt->max_fds = NR_OPEN_DEFAULT; @@ -553,11 +572,21 @@ void __fd_install(struct files_struct *files, unsigned int fd, struct file *file) { struct fdtable *fdt; - spin_lock(&files->file_lock); - fdt = files_fdtable(files); + + might_sleep(); + rcu_read_lock_sched(); + + while (unlikely(files->resize_in_progress)) { + rcu_read_unlock_sched(); + wait_event(files->resize_wait, !files->resize_in_progress); + rcu_read_lock_sched(); + } + /* coupled with smp_wmb() in expand_fdtable() */ + smp_rmb(); + fdt = rcu_dereference_sched(files->fdt); BUG_ON(fdt->fd[fd] != NULL); rcu_assign_pointer(fdt->fd[fd], file); - spin_unlock(&files->file_lock); + rcu_read_unlock_sched(); } void fd_install(unsigned int fd, struct file *file) diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index 230f87bdf5ad..fbb88740634a 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -47,6 +47,9 @@ struct files_struct { * read mostly part */ atomic_t count; + bool resize_in_progress; + wait_queue_head_t resize_wait; + struct fdtable __rcu *fdt; struct fdtable fdtab; /*