diff mbox series

[v3,1/1] mm: introduce put_user_page*(), placeholder versions

Message ID 20190306235455.26348-2-jhubbard@nvidia.com (mailing list archive)
State New, archived
Headers show
Series mm: introduce put_user_page*(), placeholder versions | expand

Commit Message

john.hubbard@gmail.com March 6, 2019, 11:54 p.m. UTC
From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This is the first step of fixing a problem (also described in [1] and
[2]) with interactions between get_user_pages ("gup") and filesystems.

Problem description: let's start with a bug report. Below, is what happens
sometimes, under memory pressure, when a driver pins some pages via gup,
and then marks those pages dirty, and releases them. Note that the gup
documentation actually recommends that pattern. The problem is that the
filesystem may do a writeback while the pages were gup-pinned, and then the
filesystem believes that the pages are clean. So, when the driver later
marks the pages as dirty, that conflicts with the filesystem's page
tracking and results in a BUG(), like this one that I experienced:

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

Dave Chinner's description of this is very clear:

    "The fundamental issue is that ->page_mkwrite must be called on every
    write access to a clean file backed page, not just the first one.
    How long the GUP reference lasts is irrelevant, if the page is clean
    and you need to dirty it, you must call ->page_mkwrite before it is
    marked writeable and dirtied. Every. Time."

This is just one symptom of the larger design problem: filesystems do not
actually support get_user_pages() being called on their pages, and letting
hardware write directly to those pages--even though that patter has been
going on since about 2005 or so.

The steps are to fix it are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem.

[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>    # docs
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 24 ++++++++++++++
 mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+)

Comments

Christoph Lameter (Ampere) March 8, 2019, 2:58 a.m. UTC | #1
On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:

> Dave Chinner's description of this is very clear:
>
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
>
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that patter has been
> going on since about 2005 or so.

Can we distinguish between real filesystems that actually write to a
backing device and the special filesystems (like hugetlbfs, shm and
friends) that are like anonymous memory and do not require
->page_mkwrite() in the same way as regular filesystems?

The use that I have seen in my section of the world has been restricted to
RDMA and get_user_pages being limited to anonymous memory and those
special filesystems. And if the RDMA memory is of such type then the use
in the past and present is safe.

So a logical other approach would be to simply not allow the use of
long term get_user_page() on real filesystem pages. I hope this patch
supports that?

It is customary after all that a file read or write operation involve one
single file(!) and that what is written either comes from or goes to
memory (anonymous or special memory filesystem).

If you have an mmapped memory segment with a regular device backed file
then you already have one file associated with a memory segment and a
filesystem that does take care of synchronizing the contents of the memory
segment to a backing device.

If you now perform RDMA or device I/O on such a memory segment then you
will have *two* different devices interacting with that memory segment. I
think that ought not to happen and not be supported out of the box. It
will be difficult to handle and the semantics will be hard for users to
understand.

What could happen is that the filesystem could agree on request to allow
third party I/O to go to such a memory segment. But that needs to be well
defined and clearly and explicitly handled by some mechanism in user space
that has well defined semantics for data integrity for the filesystem as
well as the RDMA or device I/O.
John Hubbard March 8, 2019, 3:15 a.m. UTC | #2
On 3/7/19 6:58 PM, Christopher Lameter wrote:
> On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:
> 
>> Dave Chinner's description of this is very clear:
>>
>>     "The fundamental issue is that ->page_mkwrite must be called on every
>>     write access to a clean file backed page, not just the first one.
>>     How long the GUP reference lasts is irrelevant, if the page is clean
>>     and you need to dirty it, you must call ->page_mkwrite before it is
>>     marked writeable and dirtied. Every. Time."
>>
>> This is just one symptom of the larger design problem: filesystems do not
>> actually support get_user_pages() being called on their pages, and letting
>> hardware write directly to those pages--even though that patter has been
>> going on since about 2005 or so.
> 
> Can we distinguish between real filesystems that actually write to a
> backing device and the special filesystems (like hugetlbfs, shm and
> friends) that are like anonymous memory and do not require
> ->page_mkwrite() in the same way as regular filesystems?

Yes. I'll change the wording in the commit message to say "real filesystems
that actually write to a backing device", instead of "filesystems". That
does help, thanks.

> 
> The use that I have seen in my section of the world has been restricted to
> RDMA and get_user_pages being limited to anonymous memory and those
> special filesystems. And if the RDMA memory is of such type then the use
> in the past and present is safe.

Agreed.

> 
> So a logical other approach would be to simply not allow the use of
> long term get_user_page() on real filesystem pages. I hope this patch
> supports that?

This patch neither prevents nor provides that. What this patch does is
provide a prerequisite to clear identification of pages that have had
get_user_pages() called on them.


> 
> It is customary after all that a file read or write operation involve one
> single file(!) and that what is written either comes from or goes to
> memory (anonymous or special memory filesystem).
> 
> If you have an mmapped memory segment with a regular device backed file
> then you already have one file associated with a memory segment and a
> filesystem that does take care of synchronizing the contents of the memory
> segment to a backing device.
> 
> If you now perform RDMA or device I/O on such a memory segment then you
> will have *two* different devices interacting with that memory segment. I
> think that ought not to happen and not be supported out of the box. It
> will be difficult to handle and the semantics will be hard for users to
> understand.
> 
> What could happen is that the filesystem could agree on request to allow
> third party I/O to go to such a memory segment. But that needs to be well
> defined and clearly and explicitly handled by some mechanism in user space
> that has well defined semantics for data integrity for the filesystem as
> well as the RDMA or device I/O.
> 

Those discussions are underway. Dave Chinner and others have been talking
about filesystem leases, for example. The key point here is that we'll still
need, in any of these approaches, to be able to identify the gup-pinned
pages. And there are lots (100+) of call sites to change. So I figure we'd
better get that started.

thanks,
Ira Weiny March 8, 2019, 5:43 p.m. UTC | #3
> Subject: Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder
> versions
> 
> On 3/7/19 6:58 PM, Christopher Lameter wrote:
> > On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:
> >
> >> Dave Chinner's description of this is very clear:
> >>
> >>     "The fundamental issue is that ->page_mkwrite must be called on every
> >>     write access to a clean file backed page, not just the first one.
> >>     How long the GUP reference lasts is irrelevant, if the page is clean
> >>     and you need to dirty it, you must call ->page_mkwrite before it is
> >>     marked writeable and dirtied. Every. Time."
> >>
> >> This is just one symptom of the larger design problem: filesystems do
> >> not actually support get_user_pages() being called on their pages,
> >> and letting hardware write directly to those pages--even though that
> >> patter has been going on since about 2005 or so.
> >
> > Can we distinguish between real filesystems that actually write to a
> > backing device and the special filesystems (like hugetlbfs, shm and
> > friends) that are like anonymous memory and do not require
> > ->page_mkwrite() in the same way as regular filesystems?
> 
> Yes. I'll change the wording in the commit message to say "real filesystems
> that actually write to a backing device", instead of "filesystems". That does
> help, thanks.
> 
> >
> > The use that I have seen in my section of the world has been
> > restricted to RDMA and get_user_pages being limited to anonymous
> > memory and those special filesystems. And if the RDMA memory is of
> > such type then the use in the past and present is safe.
> 
> Agreed.
> 
> >
> > So a logical other approach would be to simply not allow the use of
> > long term get_user_page() on real filesystem pages. I hope this patch
> > supports that?
> 
> This patch neither prevents nor provides that. What this patch does is
> provide a prerequisite to clear identification of pages that have had
> get_user_pages() called on them.
> 
> 
> >
> > It is customary after all that a file read or write operation involve
> > one single file(!) and that what is written either comes from or goes
> > to memory (anonymous or special memory filesystem).
> >
> > If you have an mmapped memory segment with a regular device backed
> > file then you already have one file associated with a memory segment
> > and a filesystem that does take care of synchronizing the contents of
> > the memory segment to a backing device.
> >
> > If you now perform RDMA or device I/O on such a memory segment then
> > you will have *two* different devices interacting with that memory
> > segment. I think that ought not to happen and not be supported out of
> > the box. It will be difficult to handle and the semantics will be hard
> > for users to understand.
> >
> > What could happen is that the filesystem could agree on request to
> > allow third party I/O to go to such a memory segment. But that needs
> > to be well defined and clearly and explicitly handled by some
> > mechanism in user space that has well defined semantics for data
> > integrity for the filesystem as well as the RDMA or device I/O.
> >
> 
> Those discussions are underway. Dave Chinner and others have been talking
> about filesystem leases, for example. The key point here is that we'll still
> need, in any of these approaches, to be able to identify the gup-pinned pages.
> And there are lots (100+) of call sites to change. So I figure we'd better get
> that started.
>

+ 1

I'm exploring patch sets like this.  Having this interface available will, IMO, allow for better review of those patches rather than saying "go over to Johns tree to get the pre-requisite patches".  :-D

Also I think it will be easier for users to get things right by calling [get|put]_user_pages() rather than get_user_pages() followed by put_page().

Ira

> thanks,
> --
> John Hubbard
> NVIDIA
Jerome Glisse March 8, 2019, 5:57 p.m. UTC | #4
On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing a problem (also described in [1] and
> [2]) with interactions between get_user_pages ("gup") and filesystems.
> 
> Problem description: let's start with a bug report. Below, is what happens
> sometimes, under memory pressure, when a driver pins some pages via gup,
> and then marks those pages dirty, and releases them. Note that the gup
> documentation actually recommends that pattern. The problem is that the
> filesystem may do a writeback while the pages were gup-pinned, and then the
> filesystem believes that the pages are clean. So, when the driver later
> marks the pages as dirty, that conflicts with the filesystem's page
> tracking and results in a BUG(), like this one that I experienced:
> 
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
>         ({                                                      \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
> 
> Dave Chinner's description of this is very clear:
> 
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that patter has been
> going on since about 2005 or so.
> 
> The steps are to fix it are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem.
> 
> [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
> [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>    # docs
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Just a small comments below that would help my life :)

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  include/linux/mm.h | 24 ++++++++++++++
>  mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++

Why not putting those functions in gup.c instead of swap.c ?

>  2 files changed, 106 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..809b7397d41e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> +/**
> + * put_user_page() - release a gup-pinned page
> + * @page:            pointer to page to be released
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + *
> + * put_user_page() and put_page() are not interchangeable, despite this early
> + * implementation that makes them look the same. put_user_page() calls must
> + * be perfectly matched up with get_user_page() calls.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif
> diff --git a/mm/swap.c b/mm/swap.c
> index 4d7d37eb3c40..a6b4f693f46d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
>  }
>  EXPORT_SYMBOL(put_pages_list);
>  
> +typedef int (*set_dirty_func)(struct page *page);

set_dirty_func_t would be better as it is the rule for typedef to append
the _t also it make it easier for coccinelle patch.
John Hubbard March 8, 2019, 9:27 p.m. UTC | #5
On 3/8/19 9:57 AM, Jerome Glisse wrote:
[snip] 
> Just a small comments below that would help my life :)
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 

Thanks for the review! 

>> ---
>>  include/linux/mm.h | 24 ++++++++++++++
>>  mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
> 
> Why not putting those functions in gup.c instead of swap.c ?

Yes, gup.c is better for these. And it passes the various cross compiler and
tinyconfig builds locally, so I think I'm not missing any cases. (The swap.c 
location was an artifact of very early approaches, pre-dating the
put_user_pages() name.) 

[snip]

>>  #define SECTION_IN_PAGE_FLAGS
>>  #endif
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 4d7d37eb3c40..a6b4f693f46d 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
>>  }
>>  EXPORT_SYMBOL(put_pages_list);
>>  
>> +typedef int (*set_dirty_func)(struct page *page);
> 
> set_dirty_func_t would be better as it is the rule for typedef to append
> the _t also it make it easier for coccinelle patch.
> 

Done. I'm posting a v4 in a moment, with both of the above, plus
Christopher's "real filesystems" wording change, and your reviewed-by
tag.


thanks,
Ira Weiny March 12, 2019, 3:30 p.m. UTC | #6
On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().

So I've been running with these patches for a while but today while ramping up
my testing I hit the following:

[ 1355.557819] ------------[ cut here ]------------
[ 1355.563436] get_user_pages pin count overflowed
[ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 get_gup_pin_page+0xa5/0xb0
[ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transpo
rt_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscs
i ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbyp
ass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul ledtrig_audio snd_hda_intel
crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel crypto_simd s
nd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si device_dax nd_btt ioatdma nd
_e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support libnvdimm pcspkr lpc_ich mei_m
e mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod mgag
200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core ttm crc32c_intel igb isci ah
ci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas
[ 1355.577429]  firewire_core crc_itu_t i2c_core libata dm_mod [last unloaded: rdmavt]
[ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10
[ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.02.04.0003.1023201411
38 10/23/2014
[ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0
[ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 bb 48 c7 c7 48 0a e9 81 89 4
4 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f 00 66 66 66 66 90 41 57 49
bf 00 00
[ 1355.733244] RSP: 0018:ffffc90005a23b30 EFLAGS: 00010286
[ 1355.739536] RAX: 0000000000000000 RBX: ffffea0014220000 RCX: 0000000000000000
[ 1355.748005] RDX: 0000000000000003 RSI: ffffffff827d94a3 RDI: 0000000000000246
[ 1355.756453] RBP: ffffea0014220000 R08: 0000000000000002 R09: 0000000000022400
[ 1355.764907] R10: 0009ccf0ad0c4203 R11: 0000000000000001 R12: 0000000000010207
[ 1355.773369] R13: ffff8884130b7040 R14: fff0000000000fff R15: 000fffffffe00000
[ 1355.781836] FS:  00007f2680d0d740(0000) GS:ffff88842e840000(0000) knlGS:0000000000000000
[ 1355.791384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1355.798319] CR2: 0000000000589000 CR3: 000000040b05e004 CR4: 00000000000606e0
[ 1355.806809] Call Trace:
[ 1355.810078]  follow_page_pte+0x4f3/0x5c0
[ 1355.814987]  __get_user_pages+0x1eb/0x730
[ 1355.820020]  get_user_pages+0x3e/0x50
[ 1355.824657]  ib_umem_get+0x283/0x500 [ib_uverbs]
[ 1355.830340]  ? _cond_resched+0x15/0x30
[ 1355.835065]  mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib]
[ 1355.841235]  ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs]
[ 1355.847400]  ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs]
[ 1355.853473]  __vfs_write+0x36/0x1b0
[ 1355.857904]  ? selinux_file_permission+0xf0/0x130
[ 1355.863702]  ? security_file_permission+0x2e/0xe0
[ 1355.869503]  vfs_write+0xa5/0x1a0
[ 1355.873751]  ksys_write+0x4f/0xb0
[ 1355.878009]  do_syscall_64+0x5b/0x180
[ 1355.882656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1355.888862] RIP: 0033:0x7f2680ec3ed8
[ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 45 78 0
d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49
89 d4 55
[ 1355.915573] RSP: 002b:00007ffe65d50bc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1355.924621] RAX: ffffffffffffffda RBX: 00007ffe65d50c74 RCX: 00007f2680ec3ed8
[ 1355.933195] RDX: 0000000000000030 RSI: 00007ffe65d50c80 RDI: 0000000000000003
[ 1355.941760] RBP: 0000000000000030 R08: 0000000000000007 R09: 0000000000581260
[ 1355.950326] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000581930
[ 1355.958885] R13: 000000000000000c R14: 0000000000581260 R15: 0000000000000000
[ 1355.967430] ---[ end trace bc771ac6189977a2 ]---


I'm not sure what I did to do this and I'm going to work on a reproducer.  At
the time of the Warning I only had 1 GUP user?!?!?!?!

I'm not using ODP, so I don't think the changes we have discussed there are a
problem.

Ira

> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing a problem (also described in [1] and
> [2]) with interactions between get_user_pages ("gup") and filesystems.
> 
> Problem description: let's start with a bug report. Below, is what happens
> sometimes, under memory pressure, when a driver pins some pages via gup,
> and then marks those pages dirty, and releases them. Note that the gup
> documentation actually recommends that pattern. The problem is that the
> filesystem may do a writeback while the pages were gup-pinned, and then the
> filesystem believes that the pages are clean. So, when the driver later
> marks the pages as dirty, that conflicts with the filesystem's page
> tracking and results in a BUG(), like this one that I experienced:
> 
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
>         ({                                                      \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
> 
> Dave Chinner's description of this is very clear:
> 
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that patter has been
> going on since about 2005 or so.
> 
> The steps are to fix it are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem.
> 
> [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
> [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>    # docs
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 24 ++++++++++++++
>  mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 106 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..809b7397d41e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> +/**
> + * put_user_page() - release a gup-pinned page
> + * @page:            pointer to page to be released
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + *
> + * put_user_page() and put_page() are not interchangeable, despite this early
> + * implementation that makes them look the same. put_user_page() calls must
> + * be perfectly matched up with get_user_page() calls.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif
> diff --git a/mm/swap.c b/mm/swap.c
> index 4d7d37eb3c40..a6b4f693f46d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
>  }
>  EXPORT_SYMBOL(put_pages_list);
>  
> +typedef int (*set_dirty_func)(struct page *page);
> +
> +static void __put_user_pages_dirty(struct page **pages,
> +				   unsigned long npages,
> +				   set_dirty_func sdf)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		struct page *page = compound_head(pages[index]);
> +
> +		if (!PageDirty(page))
> +			sdf(page);
> +
> +		put_user_page(page);
> +	}
> +}
> +
> +/**
> + * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * "gup-pinned page" refers to a page that has had one of the get_user_pages()
> + * variants called on that page.
> + *
> + * For each page in the @pages array, make that page (or its head page, if a
> + * compound page) dirty, if it was previously listed as clean. Then, release
> + * the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * set_page_dirty(), which does not lock the page, is used here.
> + * Therefore, it is the caller's responsibility to ensure that this is
> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + *
> + */
> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty);
> +
> +/**
> + * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * For each page in the @pages array, make that page (or its head page, if a
> + * compound page) dirty, if it was previously listed as clean. Then, release
> + * the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * This is just like put_user_pages_dirty(), except that it invokes
> + * set_page_dirty_lock(), instead of set_page_dirty().
> + *
> + */
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
> +
> +/**
> + * put_user_pages() - release an array of gup-pinned pages.
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * For each page in the @pages array, release the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + */
> +void put_user_pages(struct page **pages, unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++)
> +		put_user_page(pages[index]);
> +}
> +EXPORT_SYMBOL(put_user_pages);
> +
>  /*
>   * get_kernel_pages() - pin kernel pages in memory
>   * @kiov:	An array of struct kvec structures
> -- 
> 2.21.0
>
John Hubbard March 13, 2019, 12:38 a.m. UTC | #7
On 3/12/19 8:30 AM, Ira Weiny wrote:
> On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
> 
> So I've been running with these patches for a while but today while ramping up
> my testing I hit the following:
> 
> [ 1355.557819] ------------[ cut here ]------------
> [ 1355.563436] get_user_pages pin count overflowed

Hi Ira,

Thanks for reporting this. That overflow, at face value, means that we've
used more than the 22 bits worth of gup pin counts, so about 4 million pins
of the same page...

> [ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 get_gup_pin_page+0xa5/0xb0
> [ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transpo
> rt_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscs
> i ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbyp
> ass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul ledtrig_audio snd_hda_intel
> crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel crypto_simd s
> nd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si device_dax nd_btt ioatdma nd
> _e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support libnvdimm pcspkr lpc_ich mei_m
> e mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod mgag
> 200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core ttm crc32c_intel igb isci ah
> ci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas
> [ 1355.577429]  firewire_core crc_itu_t i2c_core libata dm_mod [last unloaded: rdmavt]
> [ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10
> [ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.02.04.0003.1023201411
> 38 10/23/2014
> [ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0
> [ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 bb 48 c7 c7 48 0a e9 81 89 4
> 4 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f 00 66 66 66 66 90 41 57 49
> bf 00 00
> [ 1355.733244] RSP: 0018:ffffc90005a23b30 EFLAGS: 00010286
> [ 1355.739536] RAX: 0000000000000000 RBX: ffffea0014220000 RCX: 0000000000000000
> [ 1355.748005] RDX: 0000000000000003 RSI: ffffffff827d94a3 RDI: 0000000000000246
> [ 1355.756453] RBP: ffffea0014220000 R08: 0000000000000002 R09: 0000000000022400
> [ 1355.764907] R10: 0009ccf0ad0c4203 R11: 0000000000000001 R12: 0000000000010207
> [ 1355.773369] R13: ffff8884130b7040 R14: fff0000000000fff R15: 000fffffffe00000
> [ 1355.781836] FS:  00007f2680d0d740(0000) GS:ffff88842e840000(0000) knlGS:0000000000000000
> [ 1355.791384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1355.798319] CR2: 0000000000589000 CR3: 000000040b05e004 CR4: 00000000000606e0
> [ 1355.806809] Call Trace:
> [ 1355.810078]  follow_page_pte+0x4f3/0x5c0
> [ 1355.814987]  __get_user_pages+0x1eb/0x730
> [ 1355.820020]  get_user_pages+0x3e/0x50
> [ 1355.824657]  ib_umem_get+0x283/0x500 [ib_uverbs]
> [ 1355.830340]  ? _cond_resched+0x15/0x30
> [ 1355.835065]  mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib]
> [ 1355.841235]  ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs]
> [ 1355.847400]  ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs]
> [ 1355.853473]  __vfs_write+0x36/0x1b0
> [ 1355.857904]  ? selinux_file_permission+0xf0/0x130
> [ 1355.863702]  ? security_file_permission+0x2e/0xe0
> [ 1355.869503]  vfs_write+0xa5/0x1a0
> [ 1355.873751]  ksys_write+0x4f/0xb0
> [ 1355.878009]  do_syscall_64+0x5b/0x180
> [ 1355.882656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1355.888862] RIP: 0033:0x7f2680ec3ed8
> [ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 45 78 0
> d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49
> 89 d4 55
> [ 1355.915573] RSP: 002b:00007ffe65d50bc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ 1355.924621] RAX: ffffffffffffffda RBX: 00007ffe65d50c74 RCX: 00007f2680ec3ed8
> [ 1355.933195] RDX: 0000000000000030 RSI: 00007ffe65d50c80 RDI: 0000000000000003
> [ 1355.941760] RBP: 0000000000000030 R08: 0000000000000007 R09: 0000000000581260
> [ 1355.950326] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000581930
> [ 1355.958885] R13: 000000000000000c R14: 0000000000581260 R15: 0000000000000000
> [ 1355.967430] ---[ end trace bc771ac6189977a2 ]---
> 
> 
> I'm not sure what I did to do this and I'm going to work on a reproducer.  At
> the time of the Warning I only had 1 GUP user?!?!?!?!

If there is a get_user_pages() call that lacks a corresponding put_user_pages()
call, then the count could start working its way up, and up. Either that, or a
bug in my patches here, could cause this. The basic counting works correctly
in fio runs on an NVMe driver with Direct IO, when I dump out
`cat /proc/vmstat | grep gup`: the counts match up, but that is a simple test.

One way to force a faster repro is to increase the GUP_PIN_COUNTING_BIAS, so
that the gup pin count runs into the max much sooner.

I'd really love to create a test setup that would generate this failure, so
anything you discover on how to repro (including what hardware is required--I'm
sure I can scrounge up some IB gear in a pinch) is of great interest.

Also, I'm just now starting on the DEBUG_USER_PAGE_REFERENCES idea that Jerome,
Jan, and Dan floated some months ago. It's clearly a prerequisite to converting
the call sites properly--just our relatively small IB driver is showing that.
This feature will provide a different mapping of the struct pages, if get
them via get_user_pages(). That will allow easily asserting that put_user_page()
and put_page() are not swapped, in either direction.


> 
> I'm not using ODP, so I don't think the changes we have discussed there are a
> problem.
> 
> Ira
> 


thanks,
Ira Weiny March 13, 2019, 2:49 p.m. UTC | #8
On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote:
> On 3/12/19 8:30 AM, Ira Weiny wrote:
> > On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
> > > From: John Hubbard <jhubbard@nvidia.com>
> > > 
> > > Introduces put_user_page(), which simply calls put_page().
> > > This provides a way to update all get_user_pages*() callers,
> > > so that they call put_user_page(), instead of put_page().
> > 
> > So I've been running with these patches for a while but today while ramping up
> > my testing I hit the following:
> > 
> > [ 1355.557819] ------------[ cut here ]------------
> > [ 1355.563436] get_user_pages pin count overflowed
> 
> Hi Ira,
> 
> Thanks for reporting this. That overflow, at face value, means that we've
> used more than the 22 bits worth of gup pin counts, so about 4 million pins
> of the same page...

This is my bug in the patches I'm playing with.  Somehow I'm causing more puts
than gets...  I'm not sure how but this is for sure my problem.

Backing off to your patch set the numbers are good.

Sorry for the noise.

With the testing I've done today I feel comfortable adding

Tested-by: Ira Weiny <ira.weiny@intel.com>

For the main GUP and InfiniBand patches.

Ira

> 
> > [ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 get_gup_pin_page+0xa5/0xb0
> > [ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transpo
> > rt_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscs
> > i ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbyp
> > ass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul ledtrig_audio snd_hda_intel
> > crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel crypto_simd s
> > nd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si device_dax nd_btt ioatdma nd
> > _e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support libnvdimm pcspkr lpc_ich mei_m
> > e mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod mgag
> > 200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core ttm crc32c_intel igb isci ah
> > ci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas
> > [ 1355.577429]  firewire_core crc_itu_t i2c_core libata dm_mod [last unloaded: rdmavt]
> > [ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10
> > [ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.02.04.0003.1023201411
> > 38 10/23/2014
> > [ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0
> > [ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 bb 48 c7 c7 48 0a e9 81 89 4
> > 4 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f 00 66 66 66 66 90 41 57 49
> > bf 00 00
> > [ 1355.733244] RSP: 0018:ffffc90005a23b30 EFLAGS: 00010286
> > [ 1355.739536] RAX: 0000000000000000 RBX: ffffea0014220000 RCX: 0000000000000000
> > [ 1355.748005] RDX: 0000000000000003 RSI: ffffffff827d94a3 RDI: 0000000000000246
> > [ 1355.756453] RBP: ffffea0014220000 R08: 0000000000000002 R09: 0000000000022400
> > [ 1355.764907] R10: 0009ccf0ad0c4203 R11: 0000000000000001 R12: 0000000000010207
> > [ 1355.773369] R13: ffff8884130b7040 R14: fff0000000000fff R15: 000fffffffe00000
> > [ 1355.781836] FS:  00007f2680d0d740(0000) GS:ffff88842e840000(0000) knlGS:0000000000000000
> > [ 1355.791384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1355.798319] CR2: 0000000000589000 CR3: 000000040b05e004 CR4: 00000000000606e0
> > [ 1355.806809] Call Trace:
> > [ 1355.810078]  follow_page_pte+0x4f3/0x5c0
> > [ 1355.814987]  __get_user_pages+0x1eb/0x730
> > [ 1355.820020]  get_user_pages+0x3e/0x50
> > [ 1355.824657]  ib_umem_get+0x283/0x500 [ib_uverbs]
> > [ 1355.830340]  ? _cond_resched+0x15/0x30
> > [ 1355.835065]  mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib]
> > [ 1355.841235]  ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs]
> > [ 1355.847400]  ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs]
> > [ 1355.853473]  __vfs_write+0x36/0x1b0
> > [ 1355.857904]  ? selinux_file_permission+0xf0/0x130
> > [ 1355.863702]  ? security_file_permission+0x2e/0xe0
> > [ 1355.869503]  vfs_write+0xa5/0x1a0
> > [ 1355.873751]  ksys_write+0x4f/0xb0
> > [ 1355.878009]  do_syscall_64+0x5b/0x180
> > [ 1355.882656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 1355.888862] RIP: 0033:0x7f2680ec3ed8
> > [ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 45 78 0
> > d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49
> > 89 d4 55
> > [ 1355.915573] RSP: 002b:00007ffe65d50bc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > [ 1355.924621] RAX: ffffffffffffffda RBX: 00007ffe65d50c74 RCX: 00007f2680ec3ed8
> > [ 1355.933195] RDX: 0000000000000030 RSI: 00007ffe65d50c80 RDI: 0000000000000003
> > [ 1355.941760] RBP: 0000000000000030 R08: 0000000000000007 R09: 0000000000581260
> > [ 1355.950326] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000581930
> > [ 1355.958885] R13: 000000000000000c R14: 0000000000581260 R15: 0000000000000000
> > [ 1355.967430] ---[ end trace bc771ac6189977a2 ]---
> > 
> > 
> > I'm not sure what I did to do this and I'm going to work on a reproducer.  At
> > the time of the Warning I only had 1 GUP user?!?!?!?!
> 
> If there is a get_user_pages() call that lacks a corresponding put_user_pages()
> call, then the count could start working its way up, and up. Either that, or a
> bug in my patches here, could cause this. The basic counting works correctly
> in fio runs on an NVMe driver with Direct IO, when I dump out
> `cat /proc/vmstat | grep gup`: the counts match up, but that is a simple test.
> 
> One way to force a faster repro is to increase the GUP_PIN_COUNTING_BIAS, so
> that the gup pin count runs into the max much sooner.
> 
> I'd really love to create a test setup that would generate this failure, so
> anything you discover on how to repro (including what hardware is required--I'm
> sure I can scrounge up some IB gear in a pinch) is of great interest.
> 
> Also, I'm just now starting on the DEBUG_USER_PAGE_REFERENCES idea that Jerome,
> Jan, and Dan floated some months ago. It's clearly a prerequisite to converting
> the call sites properly--just our relatively small IB driver is showing that.
> This feature will provide a different mapping of the struct pages, if get
> them via get_user_pages(). That will allow easily asserting that put_user_page()
> and put_page() are not swapped, in either direction.
> 
> 
> > 
> > I'm not using ODP, so I don't think the changes we have discussed there are a
> > problem.
> > 
> > Ira
> > 
> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
John Hubbard March 14, 2019, 3:19 a.m. UTC | #9
On 3/13/19 7:49 AM, Ira Weiny wrote:
> On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote:
>> On 3/12/19 8:30 AM, Ira Weiny wrote:
>>> On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>
>>>> Introduces put_user_page(), which simply calls put_page().
>>>> This provides a way to update all get_user_pages*() callers,
>>>> so that they call put_user_page(), instead of put_page().
>>>
>>> So I've been running with these patches for a while but today while ramping up
>>> my testing I hit the following:
>>>
>>> [ 1355.557819] ------------[ cut here ]------------
>>> [ 1355.563436] get_user_pages pin count overflowed
>>
>> Hi Ira,
>>
>> Thanks for reporting this. That overflow, at face value, means that we've
>> used more than the 22 bits worth of gup pin counts, so about 4 million pins
>> of the same page...
> 
> This is my bug in the patches I'm playing with.  Somehow I'm causing more puts
> than gets...  I'm not sure how but this is for sure my problem.
> 
> Backing off to your patch set the numbers are good.

Now that's a welcome bit of good news!

> 
> Sorry for the noise.
> 
> With the testing I've done today I feel comfortable adding
> 
> Tested-by: Ira Weiny <ira.weiny@intel.com>
> 
> For the main GUP and InfiniBand patches.
> 
> Ira
> 

OK, I'll add your tested-by tag to patches 1, 2, 4, 5 (the numbering refers
to the "RFC v2: mm: gup/dma tracking" posting [1]) in my repo [2], and they'll 
show up in the next posting. (Patch 3 is already upstream, and patch 6 is
documentation that needs to be rewritten entirely.)

[1] https://lore.kernel.org/r/20190204052135.25784-1-jhubbard@nvidia.com

[2] https://github.com/johnhubbard/linux/tree/gup_dma_core

thanks,
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..809b7397d41e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -993,6 +993,30 @@  static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/**
+ * put_user_page() - release a gup-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ *
+ * put_user_page() and put_page() are not interchangeable, despite this early
+ * implementation that makes them look the same. put_user_page() calls must
+ * be perfectly matched up with get_user_page() calls.
+ */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/swap.c b/mm/swap.c
index 4d7d37eb3c40..a6b4f693f46d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -133,6 +133,88 @@  void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+typedef int (*set_dirty_func)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+				   unsigned long npages,
+				   set_dirty_func sdf)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+
+		if (!PageDirty(page))
+			sdf(page);
+
+		put_user_page(page);
+	}
+}
+
+/**
+ * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * "gup-pinned page" refers to a page that has had one of the get_user_pages()
+ * variants called on that page.
+ *
+ * For each page in the @pages array, make that page (or its head page, if a
+ * compound page) dirty, if it was previously listed as clean. Then, release
+ * the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty);
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/**
+ * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * For each page in the @pages array, make that page (or its head page, if a
+ * compound page) dirty, if it was previously listed as clean. Then, release
+ * the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+}
+EXPORT_SYMBOL(put_user_pages_dirty_lock);
+
+/**
+ * put_user_pages() - release an array of gup-pinned pages.
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * For each page in the @pages array, release the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ */
+void put_user_pages(struct page **pages, unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++)
+		put_user_page(pages[index]);
+}
+EXPORT_SYMBOL(put_user_pages);
+
 /*
  * get_kernel_pages() - pin kernel pages in memory
  * @kiov:	An array of struct kvec structures