From patchwork Wed Dec 12 15:27:57 2018
X-Patchwork-Submitter: Josef Bacik
X-Patchwork-Id: 10726543
From: Josef Bacik <josef@toxicpanda.com>
To: kernel-team@fb.com, hannes@cmpxchg.org, linux-kernel@vger.kernel.org,
    tj@kernel.org, david@fromorbit.com, akpm@linux-foundation.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@redhat.com,
    jack@suse.cz
Subject: [PATCH][v6] filemap: drop the mmap_sem for all blocking operations
Date: Wed, 12 Dec 2018 10:27:57 -0500
Message-Id: <20181212152757.10017-1-josef@toxicpanda.com>
In-Reply-To: <20181211173801.29535-4-josef@toxicpanda.com>
References: <20181211173801.29535-4-josef@toxicpanda.com>

Currently we only drop the mmap_sem if there is contention on the page
lock.
The idea is that we issue readahead and then go to lock the page while
it is under IO, so that we do not hold the mmap_sem during the IO.

The problem with this is the assumption that the readahead does
anything. In the case that the box is under extreme memory or IO
pressure we may end up not reading anything at all for readahead, which
means we will end up reading in the page under the mmap_sem. Even if
the readahead does something, it could get throttled because of IO
pressure on the system, or because the process is in a lower-priority
cgroup.

Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does
a lot of web traffic. This large company has load-balancing logic in
its core web server, because some engineer thought this was a brilliant
plan. This load-balancing logic gets statistics from /proc about the
system, which trips over processes' mmap_sems for various reasons. Now
the web server application is in a protected cgroup, but these other
processes may not be, and if they are being throttled while their
mmap_sem is held we'll stall, and cause this nice death spiral.

Instead, rework the filemap fault path to drop the mmap_sem at any
point where we may do IO or block for an extended period of time. This
includes while issuing readahead, locking the page, or needing to call
->readpage because readahead did not occur. Then once we have a fully
uptodate page we can return with VM_FAULT_RETRY and come back again to
find our nicely in-cache page that was read in outside of the mmap_sem.

This patch also adds a new helper for locking the page with the
mmap_sem dropped. This doesn't make sense currently, as generally
speaking if the page is already locked it will have been read in
(unless there was an error) before it was unlocked. However, a
forthcoming patchset will change this with the ability to abort
readahead bios if necessary, making it more likely that we could
contend for a page lock and still have a not-uptodate page. This allows
us to deal with this case by grabbing the lock and issuing the IO
without the mmap_sem held, and then returning VM_FAULT_RETRY to come
back around.

Acked-by: Johannes Weiner
Signed-off-by: Josef Bacik
Reviewed-by: Jan Kara
---
v5->v6:
- added more comments as per Andrew's suggestion.
- fixed the fpin leaks in the two error paths that were pointed out.

 mm/filemap.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 116 insertions(+), 19 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8fc45f24b201..42e03decf20f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2304,28 +2304,91 @@ EXPORT_SYMBOL(generic_file_read_iter);
 #ifdef CONFIG_MMU
 #define MMAP_LOTSAMISS  (100)
 
+static struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
+					     struct file *fpin)
+{
+	int flags = vmf->flags;
+	if (fpin)
+		return fpin;
+
+	/*
+	 * FAULT_FLAG_RETRY_NOWAIT means we don't want to wait on page locks or
+	 * anything, so we only pin the file and drop the mmap_sem if only
+	 * FAULT_FLAG_ALLOW_RETRY is set.
+	 */
+	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
+	    FAULT_FLAG_ALLOW_RETRY) {
+		fpin = get_file(vmf->vma->vm_file);
+		up_read(&vmf->vma->vm_mm->mmap_sem);
+	}
+	return fpin;
+}
 /*
- * Synchronous readahead happens when we don't even find
- * a page in the page cache at all.
+ * lock_page_maybe_drop_mmap - lock the page, possibly dropping the mmap_sem
+ * @vmf - the vm_fault for this fault.
+ * @page - the page to lock.
+ * @fpin - the pointer to the file we may pin (or is already pinned).
+ *
+ * This works similar to lock_page_or_retry in that it can drop the mmap_sem.
+ * It differs in that it actually returns the page locked if it returns 1 and 0
+ * if it couldn't lock the page.  If we did have to drop the mmap_sem then fpin
+ * will point to the pinned file and needs to be fput()'ed at a later point.
  */
-static void do_sync_mmap_readahead(struct vm_fault *vmf)
+static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
+				     struct file **fpin)
+{
+	if (trylock_page(page))
+		return 1;
+
+	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
+		return 0;
+
+	*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
+	if (vmf->flags & FAULT_FLAG_KILLABLE) {
+		if (__lock_page_killable(page)) {
+			/*
+			 * We didn't have the right flags to drop the mmap_sem,
+			 * but all fault_handlers only check for fatal signals
+			 * if we return VM_FAULT_RETRY, so we need to drop the
+			 * mmap_sem here and return 0 if we don't have a fpin.
+			 */
+			if (*fpin == NULL)
+				up_read(&vmf->vma->vm_mm->mmap_sem);
+			return 0;
+		}
+	} else
+		__lock_page(page);
+	return 1;
+}
+
+
+/*
+ * Synchronous readahead happens when we don't even find a page in the page
+ * cache at all.  We don't want to perform IO under the mmap sem, so if we have
+ * to drop the mmap sem we return the file that was pinned in order for us to do
+ * that.  If we didn't pin a file then we return NULL.  The file that is
+ * returned needs to be fput()'ed when we're done with it.
+ */
+static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 {
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
-		return;
+		return fpin;
 	if (!ra->ra_pages)
-		return;
+		return fpin;
 
 	if (vmf->vma->vm_flags & VM_SEQ_READ) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
-		return;
+		return fpin;
 	}
 
 	/* Avoid banging the cache line if not needed */
@@ -2337,37 +2400,44 @@ static void do_sync_mmap_readahead(struct vm_fault *vmf)
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
 	if (ra->mmap_miss > MMAP_LOTSAMISS)
-		return;
+		return fpin;
 
 	/*
 	 * mmap read-around
 	 */
+	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
 	ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
 	ra_submit(ra, mapping, file);
+	return fpin;
 }
 
 /*
  * Asynchronous readahead happens when we find the page and PG_readahead,
- * so we want to possibly extend the readahead further..
+ * so we want to possibly extend the readahead further.  We return the file that
+ * was pinned if we have to drop the mmap_sem in order to do IO.
  */
-static void do_async_mmap_readahead(struct vm_fault *vmf,
-				    struct page *page)
+static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
+					    struct page *page)
 {
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
-		return;
+		return fpin;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
-	if (PageReadahead(page))
+	if (PageReadahead(page)) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
+	}
+	return fpin;
 }
 
 /**
@@ -2397,6 +2467,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 {
 	int error;
 	struct file *file = vmf->vma->vm_file;
+	struct file *fpin = NULL;
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
@@ -2418,10 +2489,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
 		 */
-		do_async_mmap_readahead(vmf, page);
+		fpin = do_async_mmap_readahead(vmf, page);
 	} else if (!page) {
 		/* No page in the page cache at all */
-		do_sync_mmap_readahead(vmf);
+		fpin = do_sync_mmap_readahead(vmf);
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
@@ -2429,14 +2500,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		page = pagecache_get_page(mapping, offset,
 					  FGP_CREAT|FGP_FOR_MMAP,
 					  vmf->gfp_mask);
-		if (!page)
+		if (!page) {
+			if (fpin)
+				goto out_retry;
 			return vmf_error(-ENOMEM);
+		}
 	}
 
-	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
-		put_page(page);
-		return ret | VM_FAULT_RETRY;
-	}
+	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
+		goto out_retry;
 
 	/* Did it get truncated? */
 	if (unlikely(page->mapping != mapping)) {
@@ -2453,6 +2525,16 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (unlikely(!PageUptodate(page)))
 		goto page_not_uptodate;
 
+	/*
+	 * We've made it this far and we had to drop our mmap_sem, now is the
+	 * time to return to the upper layer and have it re-find the vma and
+	 * redo the fault.
+	 */
+	if (fpin) {
+		unlock_page(page);
+		goto out_retry;
+	}
+
 	/*
 	 * Found the page and have a reference on it.
 	 * We must recheck i_size under page lock.
@@ -2475,12 +2557,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 * and we need to check for errors.
 	 */
 	ClearPageError(page);
+	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	error = mapping->a_ops->readpage(file, page);
 	if (!error) {
 		wait_on_page_locked(page);
 		if (!PageUptodate(page))
 			error = -EIO;
 	}
+	if (fpin)
+		goto out_retry;
 	put_page(page);
 
 	if (!error || error == AOP_TRUNCATED_PAGE)
@@ -2489,6 +2574,18 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
 	return VM_FAULT_SIGBUS;
+
+out_retry:
+	/*
+	 * We dropped the mmap_sem, we need to return to the fault handler to
+	 * re-find the vma and come back and find our hopefully still populated
+	 * page.
+	 */
+	if (page)
+		put_page(page);
+	if (fpin)
+		fput(fpin);
+	return ret | VM_FAULT_RETRY;
 }
 EXPORT_SYMBOL(filemap_fault);
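
For reference, the VM_FAULT_RETRY handshake this patch relies on looks
roughly like the following from the arch fault handler's side. This is a
simplified, illustrative sketch, not part of the patch: the function name
fault_retry_sketch is made up, and real handlers such as x86's
do_user_addr_fault() also handle signals, error codes, and VMA access
checks that are omitted here.

/*
 * Illustrative sketch only -- NOT part of this patch.  Loosely modeled
 * on the arch fault handlers that call handle_mm_fault(), to show why
 * dropping the mmap_sem inside filemap_fault() is safe: the caller is
 * prepared to retake the lock and retry on VM_FAULT_RETRY.
 */
static vm_fault_t fault_retry_sketch(struct mm_struct *mm,
				     unsigned long address)
{
	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
	struct vm_area_struct *vma;
	vm_fault_t fault;

retry:
	down_read(&mm->mmap_sem);
	/* The vma must be re-found on every pass; it may have changed. */
	vma = find_vma(mm, address);

	fault = handle_mm_fault(vma, address, flags);
	if (fault & VM_FAULT_RETRY) {
		/*
		 * The fault path (here, filemap_fault()) has already
		 * dropped the mmap_sem for us -- that is exactly what
		 * this patch arranges whenever it may block for IO --
		 * so no up_read() before retrying.  The second attempt
		 * runs without FAULT_FLAG_ALLOW_RETRY and waits with
		 * the mmap_sem held if it must.
		 */
		flags &= ~FAULT_FLAG_ALLOW_RETRY;
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}

	up_read(&mm->mmap_sem);
	return fault;
}

The point of the design is that by the time the retried fault runs, the
page is already in (or on its way into) the page cache, so the second
pass normally finds it uptodate and completes without blocking under the
mmap_sem.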