[v2,5/8] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages

Message ID	20200129032417.3085670-6-jhubbard@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=uM/9=3S=vger.kernel.org=linux-fsdevel-owner@kernel.org> TLS: TLSv1.2, DES-CBC3-SHA) id <B5e30fad50002>; Tue, 28 Jan 2020 19:24:05 -0800 From: John Hubbard <jhubbard@nvidia.com> To: Andrew Morton <akpm@linux-foundation.org> CC: Al Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@infradead.org>, Dan Williams <dan.j.williams@intel.com>, Dave Chinner <david@fromorbit.com>, Ira Weiny <ira.weiny@intel.com>, Jan Kara <jack@suse.cz>, Jason Gunthorpe <jgg@ziepe.ca>, Jonathan Corbet <corbet@lwn.net>, =?utf-8?b?SsOpcsO0bWUgR2xpc3Nl?= <jglisse@redhat.com>, "Kirill A . Shutemov" <kirill@shutemov.name>, Michal Hocko <mhocko@suse.com>, Mike Kravetz <mike.kravetz@oracle.com>, Shuah Khan <shuah@kernel.org>, Vlastimil Babka <vbabka@suse.cz>, <linux-doc@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>, <linux-kselftest@vger.kernel.org>, <linux-rdma@vger.kernel.org>, <linux-mm@kvack.org>, LKML <linux-kernel@vger.kernel.org>, John Hubbard <jhubbard@nvidia.com> Subject: [PATCH v2 5/8] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages Date: Tue, 28 Jan 2020 19:24:14 -0800 Message-ID: <20200129032417.3085670-6-jhubbard@nvidia.com> In-Reply-To: <20200129032417.3085670-1-jhubbard@nvidia.com> References: <20200129032417.3085670-1-jhubbard@nvidia.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	mm/gup: track FOLL_PIN pages (follow on from v12) \| expand [v2,0/8] mm/gup: track FOLL_PIN pages (follow on from v12) [v2,1/8] mm: dump_page: print head page's refcount, for compound pages [v2,2/8] mm/gup: split get_user_pages_remote() into two routines [v2,3/8] mm/gup: pass a flags arg to __gup_device_* functions [v2,4/8] mm/gup: track FOLL_PIN pages [v2,5/8] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages [v2,6/8] mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting [v2,7/8] mm/gup_benchmark: support pin_user_pages() and related calls [v2,8/8] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 1d490155ecd7..55a30d260d39 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -52,8 +52,22 @@ Which flags are set by each wrapper For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup flags the caller provides. The caller is required to pass in a non-null struct -pages* array, and the function then pin pages by incrementing each by a special -value. For now, that value is +1, just like get_user_pages*().:: +pages* array, and the function then pins pages by incrementing each by a special +value: GUP_PIN_COUNTING_BIAS. + +For huge pages (and in fact, any compound page of more than 2 pages), the +GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting +is achieved, by using the 3rd struct page in the compound page. A new struct +page field, hpage_pinned_refcount, has been added in order to support this. + +This approach for compound pages avoids the counting upper limit problems that +are discussed below. Those limitations would have been aggravated severely by +huge pages, because each tail page adds a refcount to the head page. And in +fact, testing revealed that, without a separate hpage_pinned_refcount field, +page overflows were seen in some huge page stress tests. + +This also means that huge pages and compound pages (of order > 1) do not suffer +from the false positives problem that is mentioned below.:: Function -------- @@ -99,27 +113,6 @@ pages: This also leads to limitations: there are only 31-10==21 bits available for a counter that increments 10 bits at a time. -TODO: for 1GB and larger huge pages, this is cutting it close. That's because -when pin_user_pages() follows such pages, it increments the head page by "1" -(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for -pin_user_pages()) for each tail page. So if you have a 1GB huge page: - -* There are 256K (18 bits) worth of 4 KB tail pages. -* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, - 10 bits at a time) -* There are 21 - 18 == 3 bits available to count. Except that there aren't, - because you need to allow for a few normal get_page() calls on the head page, - as well. Fortunately, the approach of using addition, rather than "hard" - bitfields, within page->_refcount, allows for sharing these bits gracefully. - But we're still looking at about 8 references. - -This, however, is a missing feature more than anything else, because it's easily -solved by addressing an obvious inefficiency in the original get_user_pages() -approach of retrieving pages: stop treating all the pages as if they were -PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of -this, so some work is required. Once that's in place, this limitation mostly -disappears from view, because there will be ample refcounting range available. - * Callers must specifically request "dma-pinned tracking of pages". In other words, just calling get_user_pages() will not suffice; a new set of functions, pin_user_page() and related, must be used. @@ -222,11 +215,19 @@ Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is because there is a noticeable performance drop in unpin_user_page(), when they are activated. +Other diagnostics +================= + +dump_page() has been enhanced slightly, to handle these new counting fields, and +to better report on compound pages in general. Specifically, for compound pages +with order > 1, the exact (hpage_pinned_refcount) pincount is reported. + References ========== * `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_ * `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_ * `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_ +* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>` John Hubbard, October, 2019 diff --git a/include/linux/mm.h b/include/linux/mm.h index c5d0f4a66788..a8ad5612bbcb 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page) return page[1].compound_order; } +static inline bool hpage_pincount_available(struct page *page) +{ + /* + * Can the page->hpage_pinned_refcount field be used? That field is in + * the 3rd page of the compound page, so the smallest (2-page) compound + * pages cannot support it. + */ + page = compound_head(page); + return PageCompound(page) && compound_order(page) > 1; +} + +static inline int compound_pincount(struct page *page) +{ + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); + page = compound_head(page); + return atomic_read(compound_pincount_ptr(page)); +} + static inline void set_compound_order(struct page *page, unsigned int order) { page[1].compound_order = order; @@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages); * refcounts, and b) all the callers of this routine are expected to be able to * deal gracefully with a false positive. * + * For huge pages, the result will be exactly correct. That's because we have + * more tracking data available: the 3rd struct page in the compound page is + * used to track the pincount (instead using of the GUP_PIN_COUNTING_BIAS + * scheme). + * * For more information, please see Documentation/vm/pin_user_pages.rst. * * @page: pointer to page to be queried. @@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages); */ static inline bool page_dma_pinned(struct page *page) { + if (hpage_pincount_available(page)) + return compound_pincount(page) > 0; + /* * page_ref_count() is signed. If that refcount overflows, then * page_ref_count() returns a negative value, and callers will avoid diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 270aa8fd2800..d165275d8929 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -137,7 +137,7 @@ struct page { }; struct { /* Second tail page of compound page */ unsigned long _compound_pad_1; /* compound_head */ - unsigned long _compound_pad_2; + atomic_t hpage_pinned_refcount; /* For both global and memcg */ struct list_head deferred_list; }; @@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page) return &page[1].compound_mapcount; } +static inline atomic_t *compound_pincount_ptr(struct page *page) +{ + return &page[2].hpage_pinned_refcount; +} + /* * Used for sizing the vmemmap region on some architectures */ diff --git a/mm/debug.c b/mm/debug.c index 4cc6cad8385d..8db7f36c7bc6 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -76,13 +76,23 @@ void __dump_page(struct page *page, const char *reason) mapcount = PageSlab(page) ? 0 : page_mapcount(page); if (PageCompound(page)) - pr_warn("page:%px refcount:%d head refcount:%d " - "mapcount:%d mapping:%px index:%#lx " - "compound_mapcount:%d\n", - page, page_ref_count(page), - page_ref_count(compound_head(page)), mapcount, - page->mapping, page_to_pgoff(page), - compound_mapcount(page)); + if (hpage_pincount_available(page)) + pr_warn("page:%px refcount:%d head refcount:%d " + "mapcount:%d mapping:%px index:%#lx " + "compound_mapcount:%d compound_pincount:%d\n", + page, page_ref_count(page), + page_ref_count(compound_head(page)), mapcount, + page->mapping, page_to_pgoff(page), + compound_mapcount(page), + compound_pincount(page)); + else + pr_warn("page:%px refcount:%d head refcount:%d " + "mapcount:%d mapping:%px index:%#lx " + "compound_mapcount:%d\n", + page, page_ref_count(page), + page_ref_count(compound_head(page)), mapcount, + page->mapping, page_to_pgoff(page), + compound_mapcount(page)); else pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n", page, page_ref_count(page), mapcount, diff --git a/mm/gup.c b/mm/gup.c index 7a96490dcc54..03e7a5cfa6a9 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,22 @@ struct follow_page_context { unsigned int page_mask; }; +static void hpage_pincount_add(struct page *page, int refs) +{ + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); + VM_BUG_ON_PAGE(page != compound_head(page), page); + + atomic_add(refs, compound_pincount_ptr(page)); +} + +static void hpage_pincount_sub(struct page *page, int refs) +{ + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); + VM_BUG_ON_PAGE(page != compound_head(page), page); + + atomic_sub(refs, compound_pincount_ptr(page)); +} + /* * Return the compound head page with ref appropriately incremented, * or NULL if that failed. @@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page, if (flags & FOLL_GET) return try_get_compound_head(page, refs); else if (flags & FOLL_PIN) { - refs *= GUP_PIN_COUNTING_BIAS; - return try_get_compound_head(page, refs); + /* + * When pinning a compound page of order > 1 (which is what + * hpage_pincount_available() checks for), use an exact count to + * track it, via hpage_pincount_add/_sub(). + * + * However, be sure to *also* increment the normal page refcount + * field at least once, so that the page really is pinned. + */ + if (!hpage_pincount_available(page)) + refs *= GUP_PIN_COUNTING_BIAS; + + page = try_get_compound_head(page, refs); + if (!page) + return NULL; + + if (hpage_pincount_available(page)) + hpage_pincount_add(page, refs); + + return page; } WARN_ON_ONCE(1); @@ -104,6 +137,8 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) if (flags & FOLL_GET) return try_get_page(page); else if (flags & FOLL_PIN) { + int refs = 1; + page = compound_head(page); if (WARN_ON_ONCE(flags & FOLL_GET)) @@ -112,7 +147,18 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) if (WARN_ON_ONCE(page_ref_count(page) <= 0)) return false; - page_ref_add(page, GUP_PIN_COUNTING_BIAS); + if (hpage_pincount_available(page)) + hpage_pincount_add(page, 1); + else + refs = GUP_PIN_COUNTING_BIAS; + + /* + * Similar to try_grab_compound_head(): even if using the + * hpage_pincount_add/_sub() routines, be sure to + * *also* increment the normal page refcount field at least + * once, so that the page really is pinned. + */ + page_ref_add(page, refs); } return true; @@ -121,12 +167,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) #ifdef CONFIG_DEV_PAGEMAP_OPS static bool __unpin_devmap_managed_user_page(struct page *page) { - int count; + int count, refs = 1; if (!page_is_devmap_managed(page)) return false; - count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS); + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, 1); + else + refs = GUP_PIN_COUNTING_BIAS; + + count = page_ref_sub_return(page, refs); /* * devmap page refcounts are 1-based, rather than 0-based: if @@ -158,6 +209,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page) */ void unpin_user_page(struct page *page) { + int refs = 1; + page = compound_head(page); /* @@ -169,7 +222,12 @@ void unpin_user_page(struct page *page) if (__unpin_devmap_managed_user_page(page)) return; - if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS)) + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, 1); + else + refs = GUP_PIN_COUNTING_BIAS; + + if (page_ref_sub_and_test(page, refs)) __put_page(page); } EXPORT_SYMBOL(unpin_user_page); @@ -2201,8 +2259,12 @@ static int record_subpages(struct page *page, unsigned long addr, static void put_compound_head(struct page *page, int refs, unsigned int flags) { - if (flags & FOLL_PIN) - refs *= GUP_PIN_COUNTING_BIAS; + if (flags & FOLL_PIN) { + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, refs); + else + refs *= GUP_PIN_COUNTING_BIAS; + } VM_BUG_ON_PAGE(page_ref_count(page) < refs, page); /* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 487e998fd38e..07059d936f7b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page, struct page *p = page + 1; atomic_set(compound_mapcount_ptr(page), 0); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); + for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { clear_compound_head(p); set_page_refcounted(p); @@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order) set_compound_head(p, page); } atomic_set(compound_mapcount_ptr(page), -1); + + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); } /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3c4eb750a199..b2fe61035b7a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -689,6 +689,8 @@ void prep_compound_page(struct page *page, unsigned int order) set_compound_head(p, page); } atomic_set(compound_mapcount_ptr(page), -1); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); } #ifdef CONFIG_DEBUG_PAGEALLOC diff --git a/mm/rmap.c b/mm/rmap.c index b3e381919835..e45b9b991e2f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page, VM_BUG_ON_PAGE(!PageTransHuge(page), page); /* increment count (starts at -1) */ atomic_set(compound_mapcount_ptr(page), 0); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); + __inc_node_page_state(page, NR_ANON_THPS); } else { /* Anon THP always mapped first with PMD */ @@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page, { BUG_ON(address < vma->vm_start || address >= vma->vm_end); atomic_set(compound_mapcount_ptr(page), 0); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); + __page_set_anon_rmap(page, vma, address, 1); } #endif /* CONFIG_HUGETLB_PAGE */

[v2,5/8] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages

Commit Message

Patch