From patchwork Mon May 22 21:57:49 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ross Zwisler X-Patchwork-Id: 9741459 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 9A0DF60392 for ; Mon, 22 May 2017 21:58:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8EAFE2873E for ; Mon, 22 May 2017 21:58:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 831AF28742; Mon, 22 May 2017 21:58:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 27EE82873E for ; Mon, 22 May 2017 21:58:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S936140AbdEVV6C (ORCPT ); Mon, 22 May 2017 17:58:02 -0400 Received: from mga09.intel.com ([134.134.136.24]:50506 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935106AbdEVV56 (ORCPT ); Mon, 22 May 2017 17:57:58 -0400 Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 22 May 2017 14:57:56 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.38,379,1491289200"; d="scan'208";a="90414341" Received: from theros.lm.intel.com ([10.232.112.77]) by orsmga004.jf.intel.com with ESMTP; 22 May 2017 14:57:55 -0700 From: Ross Zwisler To: Andrew Morton , linux-kernel@vger.kernel.org Cc: Ross Zwisler , "Darrick J. Wong" , Alexander Viro , Christoph Hellwig , Dan Williams , Dave Hansen , Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, "Kirill A . Shutemov" , Pawel Lebioda , Dave Jiang , Xiong Zhou , Eryu Guan , stable@vger.kernel.org Subject: [PATCH v2 2/2] dax: Fix race between colliding PMD & PTE entries Date: Mon, 22 May 2017 15:57:49 -0600 Message-Id: <20170522215749.23516-2-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.9.4 In-Reply-To: <20170522215749.23516-1-ross.zwisler@linux.intel.com> References: <20170522215749.23516-1-ross.zwisler@linux.intel.com> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP We currently have two related PMD vs PTE races in the DAX code. These can both be easily triggered by having two threads reading and writing simultaneously to the same private mapping, with the key being that private mapping reads can be handled with PMDs but private mapping writes are always handled with PTEs so that we can COW. Here is the first race: CPU 0 CPU 1 (private mapping write) __handle_mm_fault() create_huge_pmd() - FALLBACK handle_pte_fault() passes check for pmd_devmap() (private mapping read) __handle_mm_fault() create_huge_pmd() dax_iomap_pmd_fault() inserts PMD dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD installed in our page tables at this spot. Here's the second race: CPU 0 CPU 1 (private mapping read) __handle_mm_fault() passes check for pmd_none() create_huge_pmd() dax_iomap_pmd_fault() inserts PMD (private mapping write) __handle_mm_fault() create_huge_pmd() - FALLBACK (private mapping read) __handle_mm_fault() passes check for pmd_none() create_huge_pmd() handle_pte_fault() dax_iomap_pte_fault() inserts PTE dax_iomap_pmd_fault() inserts PMD, but we already have a PTE at this spot. The core of the issue is that while there is isolation between faults to the same range in the DAX fault handlers via our DAX entry locking, there is no isolation between faults in the code in mm/memory.c. This means for instance that this code in __handle_mm_fault() can run: if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) { ret = create_huge_pmd(&vmf); But by the time we actually get to run the fault handler called by create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE fault has installed a normal PMD here as a parent. This is the cause of the 2nd race. The first race is similar - there is the following check in handle_pte_fault(): } else { /* See comment in pte_alloc_one_map() */ if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd)) return 0; So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we will bail and retry the fault. This is correct, but there is nothing preventing the PMD from being installed after this check but before we actually get to the DAX PTE fault handlers. In my testing these races result in the following types of errors: BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1 BUG: non-zero nr_ptes on freeing mm: 15 Fix this issue by having the DAX fault handlers verify that it is safe to continue their fault after they have taken an entry lock to block other racing faults. Signed-off-by: Ross Zwisler Reported-by: Pawel Lebioda Cc: stable@vger.kernel.org Reviewed-by: Jan Kara --- Changes from v1: - Handle the failure case in dax_iomap_pte_fault() by retrying the fault (Jan). This series has survived my new xfstest (generic/437) and full xfstest regression testing runs. --- fs/dax.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/fs/dax.c b/fs/dax.c index c22eaf1..fc62f36 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1155,6 +1155,17 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, } /* + * It is possible, particularly with mixed reads & writes to private + * mappings, that we have raced with a PMD fault that overlaps with + * the PTE we need to set up. If so just return and the fault will be + * retried. + */ + if (pmd_devmap(*vmf->pmd)) { + vmf_ret = VM_FAULT_NOPAGE; + goto unlock_entry; + } + + /* * Note that we don't bother to use iomap_apply here: DAX required * the file system block size to be equal the page size, which means * that we never have to deal with more than a single extent here. @@ -1398,6 +1409,15 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, goto fallback; /* + * It is possible, particularly with mixed reads & writes to private + * mappings, that we have raced with a PTE fault that overlaps with + * the PMD we need to set up. If so we just fall back to a PTE fault + * ourselves. + */ + if (!pmd_none(*vmf->pmd)) + goto unlock_entry; + + /* * Note that we don't use iomap_apply here. We aren't doing I/O, only * setting up a mapping, so really we're using iomap_begin() as a way * to look up our filesystem block.