From patchwork Thu Jan 3 23:54:51 2019
X-Patchwork-Submitter: Mike Kravetz
X-Patchwork-Id: 10747867
From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton
Cc: Michal Hocko, Hugh Dickins, Naoya Horiguchi, "Aneesh Kumar K . V",
 Andrea Arcangeli, "Kirill A . Shutemov", Davidlohr Bueso,
 Prakash Sangappa, Jan Stancek, Mike Kravetz
Subject: [PATCH 1/2] hugetlbfs: Revert "Use i_mmap_rwsem to fix page fault/truncate race"
Date: Thu, 3 Jan 2019 15:54:51 -0800
Message-Id: <20190103235452.29335-1-mike.kravetz@oracle.com>
X-Mailer: git-send-email 2.17.2

This reverts commit c86aa7bbfd5568ba8a82d3635d8f7b8a8e06fe54.

The reverted commit caused ABBA deadlocks when file migration raced
with file eviction for specific hugetlbfs files.  This was discovered
with a modified version of the LTP move_pages12 test.
The purpose of the reverted patch was to close a long-standing race
between hugetlbfs file truncation and page faults.  After more analysis
of the patch and impacted code, it was determined that i_mmap_rwsem can
not be used for all required synchronization.  Therefore, revert this
patch while working on another approach to the underlying issue.

Reported-by: Jan Stancek
Signed-off-by: Mike Kravetz
---
 fs/hugetlbfs/inode.c | 61 ++++++++++++++++++++++++--------------------
 mm/hugetlb.c         | 21 +++++++--------
 2 files changed, 44 insertions(+), 38 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 63a516096af3..3daf471bbd92 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -410,16 +410,17 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
  * truncation is indicated by end of range being LLONG_MAX
  * In this case, we first scan the range and release found pages.
  * After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts.
+ * maps and global counts.  Page faults can not race with truncation
+ * in this routine.  hugetlb_no_page() prevents page faults in the
+ * truncated range.  It checks i_size before allocation, and again after
+ * with the page table lock for the page held.  The same lock must be
+ * acquired to unmap a page.
  * hole punch is indicated if end is not LLONG_MAX
  * In the hole punch case we scan the range and release found pages.
  * Only when releasing a page is the associated region/reserv map
  * deleted.  The region/reserv map for ranges without associated
- * pages are not modified.
- *
- * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
- * races with page faults.
- *
+ * pages are not modified.  Page faults can race with hole punch.
+ * This is indicated if we find a mapped page.
  * Note: If the passed end of range value is beyond the end of file, but
  * not LLONG_MAX this routine still performs a hole punch operation.
  */
@@ -449,14 +450,32 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 		for (i = 0; i < pagevec_count(&pvec); ++i) {
 			struct page *page = pvec.pages[i];
+			u32 hash;
 
 			index = page->index;
+			hash = hugetlb_fault_mutex_hash(h, current->mm,
+							&pseudo_vma,
+							mapping, index, 0);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
 			/*
-			 * A mapped page is impossible as callers should unmap
-			 * all references before calling.  And, i_mmap_rwsem
-			 * prevents the creation of additional mappings.
+			 * If page is mapped, it was faulted in after being
+			 * unmapped in caller.  Unmap (again) now after taking
+			 * the fault mutex.  The mutex will prevent faults
+			 * until we finish removing the page.
+			 *
+			 * This race can only happen in the hole punch case.
+			 * Getting here in a truncate operation is a bug.
 			 */
-			VM_BUG_ON(page_mapped(page));
+			if (unlikely(page_mapped(page))) {
+				BUG_ON(truncate_op);
+
+				i_mmap_lock_write(mapping);
+				hugetlb_vmdelete_list(&mapping->i_mmap,
+					index * pages_per_huge_page(h),
+					(index + 1) * pages_per_huge_page(h));
+				i_mmap_unlock_write(mapping);
+			}
 
 			lock_page(page);
 			/*
@@ -478,6 +497,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			unlock_page(page);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 		cond_resched();
@@ -489,20 +509,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
-	struct address_space *mapping = inode->i_mapping;
 	struct resv_map *resv_map;
 
-	/*
-	 * The vfs layer guarantees that there are no other users of this
-	 * inode.  Therefore, it would be safe to call remove_inode_hugepages
-	 * without holding i_mmap_rwsem.  We acquire and hold here to be
-	 * consistent with other callers.  Since there will be no contention
-	 * on the semaphore, overhead is negligible.
-	 */
-	i_mmap_lock_write(mapping);
 	remove_inode_hugepages(inode, 0, LLONG_MAX);
-	i_mmap_unlock_write(mapping);
-
 	resv_map = (struct resv_map *)inode->i_mapping->private_data;
 
 	/* root inode doesn't have the resv_map, so we should check it */
 	if (resv_map)
@@ -523,8 +532,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
-	remove_inode_hugepages(inode, offset, LLONG_MAX);
 	i_mmap_unlock_write(mapping);
+	remove_inode_hugepages(inode, offset, LLONG_MAX);
 
 	return 0;
 }
@@ -558,8 +567,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 			hugetlb_vmdelete_list(&mapping->i_mmap,
 						hole_start >> PAGE_SHIFT,
 						hole_end >> PAGE_SHIFT);
-		remove_inode_hugepages(inode, hole_start, hole_end);
 		i_mmap_unlock_write(mapping);
+		remove_inode_hugepages(inode, hole_start, hole_end);
 		inode_unlock(inode);
 	}
 
@@ -642,11 +651,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
 
-		/*
-		 * fault mutex taken here, protects against fault path
-		 * and hole punch.  inode_lock previously taken protects
-		 * against truncation.
-		 */
+		/* mutex taken here, fault path and hole punch */
 		hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma,
 						mapping, index, addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 611b68c43c00..5671ac9d13bb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3756,16 +3756,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	/*
-	 * We can not race with truncation due to holding i_mmap_rwsem.
-	 * Check once here for faults beyond end of file.
+	 * Use page lock to guard against racing truncation
+	 * before we get page_table_lock.
 	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto out;
-
 retry:
 	page = find_lock_page(mapping, idx);
 	if (!page) {
+		size = i_size_read(mapping->host) >> huge_page_shift(h);
+		if (idx >= size)
+			goto out;
+
 		/*
 		 * Check for page in userfault range
 		 */
@@ -3855,6 +3855,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
+	if (idx >= size)
+		goto backout;
 
 	ret = 0;
 	if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3957,10 +3960,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This serves two purposes:
-	 * 1) It prevents huge_pmd_unshare from being called elsewhere
-	 *    and making the ptep no longer valid.
-	 * 2) It synchronizes us with file truncation.
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
 	 *
 	 * ptep could have already be assigned via huge_pte_offset.  That
 	 * is OK, as huge_pte_alloc will return the same value unless