From patchwork Mon Nov 5 21:23:15 2018
X-Patchwork-Submitter: Mike Kravetz
X-Patchwork-Id: 10669205
From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Hugh Dickins, Naoya Horiguchi, Andrea Arcangeli,
    "Kirill A . Shutemov", Davidlohr Bueso, Prakash Sangappa,
    Andrew Morton, Mike Kravetz
Subject: [PATCH] hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
Date: Mon, 5 Nov 2018 13:23:15 -0800
Message-Id: <20181105212315.14125-1-mike.kravetz@oracle.com>
X-Mailer: git-send-email 2.17.2

This bug has been experienced several times by the Oracle DB team.  The
BUG is in the routine remove_inode_hugepages() as follows:

	/*
	 * If page is mapped, it was faulted in after being
	 * unmapped in caller.  Unmap (again) now after taking
	 * the fault mutex.  The mutex will prevent faults
	 * until we finish removing the page.
	 *
	 * This race can only happen in the hole punch case.
	 * Getting here in a truncate operation is a bug.
	 */
	if (unlikely(page_mapped(page))) {
		BUG_ON(truncate_op);

In this case, the elevated map count is not the result of a race.
Rather, it was incorrectly incremented as the result of a bug in the
huge pmd sharing code.  Consider the following:

 - Process A maps a hugetlbfs file of sufficient size and alignment
   (PUD_SIZE) that a pmd page could be shared.
 - Process B maps the same hugetlbfs file with the same size and
   alignment such that a pmd page is shared.
 - Process B then calls mprotect() to change protections for the
   mapping with the shared pmd.  As a result, the pmd is 'unshared'.
 - Process B then calls mprotect() again to change protections for the
   mapping back to their original value.  The pmd remains unshared.
 - Process B then forks and process C is created.  During the fork
   process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
   tables.  Copying page tables for hugetlb mappings is done in the
   routine copy_hugetlb_page_range.

In copy_hugetlb_page_range(), the destination pte is obtained by:

	dst_pte = huge_pte_alloc(dst, addr, sz);

If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table.  In the situation above, process C could share
with either process A or process B.  Since process A is first in the
list, the returned pte is a pointer to a pte in process A's page table.
However, the following check for pmd sharing is in
copy_hugetlb_page_range:

	/* If the pagetables are shared don't copy or take references */
	if (dst_pte == src_pte)
		continue;

Since process C is sharing with process A instead of process B, the
above test fails.  The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
from src_pte to dst_pte and increments the map count of the associated
page.  This is how we end up with an elevated map count.

To solve, check the dst_pte entry for huge_pte_none.
If !none, this implies PMD sharing, so do not copy.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi
---
 mm/hugetlb.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5c390f5a5207..0b391ef6448c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3233,7 +3233,7 @@ static int is_hugetlb_entry_hwpoisoned(pte_t pte)
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *vma)
 {
-	pte_t *src_pte, *dst_pte, entry;
+	pte_t *src_pte, *dst_pte, entry, dst_entry;
 	struct page *ptepage;
 	unsigned long addr;
 	int cow;
@@ -3261,15 +3261,30 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			break;
 		}
 
-		/* If the pagetables are shared don't copy or take references */
-		if (dst_pte == src_pte)
+		/*
+		 * If the pagetables are shared don't copy or take references.
+		 * dst_pte == src_pte is the common case of src/dest sharing.
+		 *
+		 * However, src could have 'unshared' and dst shares with
+		 * another vma.  If dst_pte !none, this implies sharing.
+		 * Check here before taking page table lock, and once again
+		 * after taking the lock below.
+		 */
+		dst_entry = huge_ptep_get(dst_pte);
+		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
 			continue;
 
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
 		src_ptl = huge_pte_lockptr(h, src, src_pte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
-		if (huge_pte_none(entry)) { /* skip none entry */
+		dst_entry = huge_ptep_get(dst_pte);
+		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
+			/*
+			 * Skip if src entry none.  Also, skip in the
+			 * unlikely case dst entry !none as this implies
+			 * sharing with another vma.
+			 */
 			;
 		} else if (unlikely(is_hugetlb_entry_migration(entry) ||
 				    is_hugetlb_entry_hwpoisoned(entry))) {