From patchwork Sat Oct 13 00:24:29 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Andrea Arcangeli
X-Patchwork-Id: 10639763
From: Andrea Arcangeli
To: linux-mm@kvack.org
Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, "Kirill A.
 Shutemov", Andrew Morton
Subject: [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
Date: Fri, 12 Oct 2018 20:24:29 -0400
Message-Id: <20181013002430.698-3-aarcange@redhat.com>
In-Reply-To: <20181013002430.698-1-aarcange@redhat.com>
References: <20181013002430.698-1-aarcange@redhat.com>

change_huge_pmd() after arming the numa/protnone pmd doesn't flush the
TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling
migrate_misplaced_transhuge_page(). By the time
do_huge_pmd_numa_page() runs some CPU could still access the page
through the TLB.

change_huge_pmd() before arming the numa/protnone transhuge pmd calls
mmu_notifier_invalidate_range_start(). So there's no need for the
mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
sequence in migrate_misplaced_transhuge_page() either, because by the
time migrate_misplaced_transhuge_page() runs, the pmd mapping has
already been invalidated in the secondary MMUs. It has to be: if a
secondary MMU could still write to the page, migrate_page_copy() would
lose data.
However an explicit mmu_notifier_invalidate_range() is needed before
migrate_misplaced_transhuge_page() starts copying the data of the
transhuge page, or the sequence below can happen for MMU notifier users
that share the primary MMU pagetables and only implement
->invalidate_range:

CPU0                            CPU1                    GPU sharing linux pagetables using
                                                        only ->invalidate_range
-----------                     ------------            ---------
                                                        GPU secondary MMU writes to the page
                                                        mapped by the transhuge pmd
change_pmd_range()
mmu..._range_start()
->invalidate_range_start() noop
change_huge_pmd()
set_pmd_at(numa/protnone)
pmd_unlock()
                                do_huge_pmd_numa_page()
                                CPU TLB flush globally (1)
                                CPU cannot write to page
                                migrate_misplaced_transhuge_page()
                                                        GPU writes to the page...
                                migrate_page_copy()
                                                        ...GPU stops writing to the page
CPU TLB flush (2)
mmu..._range_end() (3)
->invalidate_range_end() noop
->invalidate_range()
                                                        GPU secondary MMU is invalidated
                                                        and cannot write to the page anymore
                                                        (too late)

Just like we need a CPU TLB flush (1) because the TLB flush (2)
arrives too late, we also need a mmu_notifier_invalidate_range()
before calling migrate_misplaced_transhuge_page(), because the
->invalidate_range() in (3) also arrives too late.

This requirement is the result of the lazy optimization in
change_huge_pmd() that releases the pmd_lock without first flushing
the TLB and without first calling mmu_notifier_invalidate_range().

Even converting the removed mmu_notifier_invalidate_range_only_end()
into a mmu_notifier_invalidate_range_end() would not have been enough
to fix this, because it runs after migrate_page_copy().

After the hugepage data copy is done,
migrate_misplaced_transhuge_page() can proceed and call set_pmd_at
without having to flush the TLB or any secondary MMUs, because the
secondary MMU invalidate, just like the CPU TLB flush, has to happen
before migrate_page_copy() is called or it would be a bug in the
first place (and it was for drivers using ->invalidate_range()).
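The ordering argument above can be illustrated with a small userspace
model. This is only a sketch and not kernel code: struct toy_mapping,
device_write() and migrate_keeps_data() are hypothetical names invented
for the illustration, the "page" is a few bytes, and the racing GPU
write is placed by hand in the middle of the copy. It assumes the
secondary MMU either has a valid mapping of the old page or has been
invalidated, nothing in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_BYTES 8	/* toy stand-in for a transhuge page */

/* Hypothetical model of a secondary MMU mapping (e.g. a GPU). */
struct toy_mapping {
	unsigned char *target;	/* page the device writes to */
	bool valid;		/* false once the mapping is invalidated */
};

/* A device write only lands while the secondary mapping is valid. */
static bool device_write(struct toy_mapping *m, size_t off, unsigned char val)
{
	assert(off < PAGE_BYTES);
	if (!m->valid)
		return false;	/* device would have to fault and retry */
	m->target[off] = val;
	return true;
}

/*
 * Copy old_page to new_page while a device write races with the copy.
 * invalidate_first models an explicit invalidate before the copy;
 * otherwise the only invalidate is the late one, as in step (3).
 * Returns true if new_page matches old_page, i.e. no write was lost.
 */
static bool migrate_keeps_data(bool invalidate_first)
{
	unsigned char old_page[PAGE_BYTES] = {0};
	unsigned char new_page[PAGE_BYTES] = {0};
	struct toy_mapping gpu = { .target = old_page, .valid = true };

	if (invalidate_first)
		gpu.valid = false;	/* invalidate before the copy */

	/* First half of the copy, then a racing write to a copied byte. */
	memcpy(new_page, old_page, PAGE_BYTES / 2);
	device_write(&gpu, 0, 0xAB);
	memcpy(new_page + PAGE_BYTES / 2, old_page + PAGE_BYTES / 2,
	       PAGE_BYTES / 2);

	gpu.valid = false;	/* late invalidate, too late to help */

	return memcmp(old_page, new_page, PAGE_BYTES) == 0;
}
```

In this model migrate_keeps_data(false) reports a lost write, because
the device modified the old page after that byte had been copied, while
migrate_keeps_data(true) does not, because the write is rejected before
the copy starts.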
KVM is unaffected because it doesn't implement ->invalidate_range().

The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
uses the generic migrate_pages which transitions the pte from
numa/protnone to a migration entry in try_to_unmap_one() and flushes
TLBs and all mmu notifiers there before copying the page.

Signed-off-by: Andrea Arcangeli
Acked-by: Mel Gorman
Acked-by: Kirill A. Shutemov
Reviewed-by: Aaron Tomlin
---
 mm/huge_memory.c | 14 +++++++++++++-
 mm/migrate.c     | 19 ++++++-------------
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a5b28547e321..70b5104075ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1562,8 +1562,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	 * We are not sure a pending tlb flush here is for a huge page
 	 * mapping or not. Hence use the tlb range variant
 	 */
-	if (mm_tlb_flush_pending(vma->vm_mm))
+	if (mm_tlb_flush_pending(vma->vm_mm)) {
 		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		/*
+		 * change_huge_pmd() released the pmd lock before
+		 * invalidating the secondary MMUs sharing the primary
+		 * MMU pagetables (with ->invalidate_range()). The
+		 * mmu_notifier_invalidate_range_end() (which
+		 * internally calls ->invalidate_range()) in
+		 * change_pmd_range() will run after us, so we can't
+		 * rely on it here and we need an explicit invalidate.
+		 */
+		mmu_notifier_invalidate_range(vma->vm_mm, haddr,
+					      haddr + HPAGE_PMD_SIZE);
+	}
 
 	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
diff --git a/mm/migrate.c b/mm/migrate.c
index 180e3d0ed16d..c9e9b7db8b6d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2018,8 +2018,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int isolated = 0;
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
-	unsigned long mmun_start = address & HPAGE_PMD_MASK;
-	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+	unsigned long start = address & HPAGE_PMD_MASK;
+	unsigned long end = start + HPAGE_PMD_SIZE;
 
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
@@ -2054,11 +2054,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -2089,8 +2087,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * new page and page_add_new_anon_rmap guarantee the copy is
 	 * visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start, true);
+	flush_cache_range(vma, start, end);
+	page_add_anon_rmap(new_page, vma, start, true);
 	/*
 	 * At this point the pmd is numa/protnone (i.e. non present)
 	 * and the TLB has already been flushed globally. So no TLB
@@ -2103,7 +2101,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * at any given time, MADV_DONTNEED won't wait on the pmd lock
 	 * and it'll skip clearing this pmd.
 	 */
-	set_pmd_at(mm, mmun_start, pmd, entry);
+	set_pmd_at(mm, start, pmd, entry);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	page_ref_unfreeze(page, 2);
@@ -2112,11 +2110,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
 
 	spin_unlock(ptl);
-	/*
-	 * No need to double call mmu_notifier->invalidate_range() callback as
-	 * the above pmdp_huge_clear_flush_notify() did already call it.
-	 */
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -2141,7 +2134,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, mmun_start, pmd, entry);
+		set_pmd_at(mm, start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);