diff mbox

[2/2] mm: set PG_dma_pinned on get_user_pages*()

Message ID	20180617012510.20139-3-jhubbard@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of john.hubbard@gmail.com designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; From: john.hubbard@gmail.com To: Matthew Wilcox <willy@infradead.org>, Michal Hocko <mhocko@kernel.org>, Christopher Lameter <cl@linux.com>, Jason Gunthorpe <jgg@ziepe.ca>, Dan Williams <dan.j.williams@intel.com>, Jan Kara <jack@suse.cz> Cc: linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>, linux-rdma <linux-rdma@vger.kernel.org>, John Hubbard <jhubbard@nvidia.com> Subject: [PATCH 2/2] mm: set PG_dma_pinned on get_user_pages*() Date: Sat, 16 Jun 2018 18:25:10 -0700 Message-Id: <20180617012510.20139-3-jhubbard@nvidia.com> In-Reply-To: <20180617012510.20139-1-jhubbard@nvidia.com> References: <20180617012510.20139-1-jhubbard@nvidia.com> Sender: owner-linux-mm@kvack.org Precedence: bulk

Message ID

20180617012510.20139-3-jhubbard@nvidia.com (mailing list archive)

State

New, archived

Headers

Received-SPF: pass (google.com: domain of john.hubbard@gmail.com designates
	209.85.220.65 as permitted sender) client-ip=209.85.220.65; 
From: john.hubbard@gmail.com
To: Matthew Wilcox <willy@infradead.org>, Michal Hocko <mhocko@kernel.org>,
	Christopher Lameter <cl@linux.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	Dan Williams <dan.j.williams@intel.com>, Jan Kara <jack@suse.cz>
Cc: linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	John Hubbard <jhubbard@nvidia.com>
Subject: [PATCH 2/2] mm: set PG_dma_pinned on get_user_pages*()
Date: Sat, 16 Jun 2018 18:25:10 -0700
Message-Id: <20180617012510.20139-3-jhubbard@nvidia.com>
In-Reply-To: <20180617012510.20139-1-jhubbard@nvidia.com>
References: <20180617012510.20139-1-jhubbard@nvidia.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

john.hubbard@gmail.com June 17, 2018, 1:25 a.m. UTC

From: John Hubbard <jhubbard@nvidia.com>

This fixes a few problems that come up when using devices (NICs, GPUs,
for example) that want to have direct access to a chunk of system (CPU)
memory, so that they can DMA to/from that memory. Problems [1] come up
if that memory is backed by persistence storage; for example, an ext4
file system. I've been working on several customer bugs that are hitting
this, and this patchset fixes those bugs.

The bugs happen via:

1) get_user_pages() on some ext4-backed pages
2) device does DMA for a while to/from those pages

    a) Somewhere in here, some of the pages get disconnected from the
       file system, via try_to_unmap() and eventually drop_buffers()

3) device is all done, device driver calls set_page_dirty_lock(), then
   put_page()

And then at some point, we see a this BUG():

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

	({							            \
		BUG_ON(!PagePrivate(page));			\
		((struct buffer_head *)page_private(page));	\
	})

How to fix this:
----------------
Introduce a new page flag: PG_dma_pinned, and set this flag on
all pages that are returned by the get_user_pages*() family of
functions. Leave it set nearly forever: until the page is freed.

Then, check this flag before attempting to unmap pages. This will
cause a very early return from try_to_unmap_one(), and will avoid
doing things such as, notably, removing page buffers via drop_buffers().

This uses a new struct page flag, but only on 64-bit systems.

Obviously, this is heavy-handed, but given the long, broken history of
get_user_pages in combination with file-backed memory, and given the
problems with alternative designs, it's a reasonable fix for now: small,
simple, and easy to revert if and when a more comprehensive design solution
is chosen.

Some alternatives, and why they were not taken:

1. It would be better, if possible, to clear PG_dma_pinned, once all
get_user_pages callers returned the page (via something more specific than
put_page), but that would significantly change the usage for get_user_pages
callers. That's too intrusive for such a widely used and old API, so let's
leave it alone.

Also, such a design would require a new counter that would be associated
with each page. There's no room in struct page, so it would require
separate tracking, which is not acceptable for general page management.

2. There are other more complicated approaches[2], but these depend on
trying to solve very specific call paths that, in the end, are just
downstream effects of the root cause. And so these did not actually fix the
customer bugs that I was working on.

References:

[1] https://lwn.net/Articles/753027/ : "The trouble with get_user_pages()"

[2] https://marc.info/?l=linux-mm&m=<20180521143830.GA25109@bombadil.infradead.org>
   (Matthew Wilcox listed two ideas here)

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/page-flags.h     |  9 +++++++++
 include/trace/events/mmflags.h |  9 ++++++++-
 mm/gup.c                       | 11 +++++++++--
 mm/rmap.c                      |  2 ++
 4 files changed, 28 insertions(+), 3 deletions(-)

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/page-flags.h     |  9 +++++++++
 include/trace/events/mmflags.h |  9 ++++++++-
 mm/gup.c                       | 11 +++++++++--
 mm/page_alloc.c                |  1 +
 mm/rmap.c                      |  2 ++
 5 files changed, 29 insertions(+), 3 deletions(-)

Comments

Dan Williams June 17, 2018, 7:53 p.m. UTC | #1

On Sat, Jun 16, 2018 at 6:25 PM,  <john.hubbard@gmail.com> wrote:
> From: John Hubbard <jhubbard@nvidia.com>
>
> This fixes a few problems that come up when using devices (NICs, GPUs,
> for example) that want to have direct access to a chunk of system (CPU)
> memory, so that they can DMA to/from that memory. Problems [1] come up
> if that memory is backed by persistence storage; for example, an ext4
> file system. I've been working on several customer bugs that are hitting
> this, and this patchset fixes those bugs.
>
> The bugs happen via:
>
> 1) get_user_pages() on some ext4-backed pages
> 2) device does DMA for a while to/from those pages
>
>     a) Somewhere in here, some of the pages get disconnected from the
>        file system, via try_to_unmap() and eventually drop_buffers()
>
> 3) device is all done, device driver calls set_page_dirty_lock(), then
>    put_page()
>
> And then at some point, we see a this BUG():
>
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
>
> ...which is due to the file system asserting that there are still buffer
> heads attached:
>
>         ({                                                                  \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
>
> How to fix this:
> ----------------
> Introduce a new page flag: PG_dma_pinned, and set this flag on
> all pages that are returned by the get_user_pages*() family of
> functions. Leave it set nearly forever: until the page is freed.
>
> Then, check this flag before attempting to unmap pages. This will
> cause a very early return from try_to_unmap_one(), and will avoid
> doing things such as, notably, removing page buffers via drop_buffers().
>
> This uses a new struct page flag, but only on 64-bit systems.
>
> Obviously, this is heavy-handed, but given the long, broken history of
> get_user_pages in combination with file-backed memory, and given the
> problems with alternative designs, it's a reasonable fix for now: small,
> simple, and easy to revert if and when a more comprehensive design solution
> is chosen.
>
> Some alternatives, and why they were not taken:
>
> 1. It would be better, if possible, to clear PG_dma_pinned, once all
> get_user_pages callers returned the page (via something more specific than
> put_page), but that would significantly change the usage for get_user_pages
> callers. That's too intrusive for such a widely used and old API, so let's
> leave it alone.
>
> Also, such a design would require a new counter that would be associated
> with each page. There's no room in struct page, so it would require
> separate tracking, which is not acceptable for general page management.
>
> 2. There are other more complicated approaches[2], but these depend on
> trying to solve very specific call paths that, in the end, are just
> downstream effects of the root cause. And so these did not actually fix the
> customer bugs that I was working on.
>
> References:
>
> [1] https://lwn.net/Articles/753027/ : "The trouble with get_user_pages()"
>
> [2] https://marc.info/?l=linux-mm&m=<20180521143830.GA25109@bombadil.infradead.org>
>    (Matthew Wilcox listed two ideas here)
>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
[..]
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6db729dc4c50..37576f0a4645 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1360,6 +1360,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>                                 flags & TTU_SPLIT_FREEZE, page);
>         }
>
> +       if (PageDmaPinned(page))
> +               return false;
>         /*
>          * We have to assume the worse case ie pmd for invalidation. Note that
>          * the page can not be free in this function as call of try_to_unmap()

We have a similiar problem with DAX and the conclusion we came to is
that it is not acceptable for userspace to arbitrarily block kernel
actions. The conclusion there was: 'wait' if the DMA is transient, and
'revoke' if the DMA is long lived, or otherwise 'block' long-lived DMA
if a revocation mechanism is not available.

Jason Gunthorpe June 17, 2018, 8:04 p.m. UTC | #2

On Sun, Jun 17, 2018 at 12:53:04PM -0700, Dan Williams wrote:
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 6db729dc4c50..37576f0a4645 100644
> > +++ b/mm/rmap.c
> > @@ -1360,6 +1360,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >                                 flags & TTU_SPLIT_FREEZE, page);
> >         }
> >
> > +       if (PageDmaPinned(page))
> > +               return false;
> >         /*
> >          * We have to assume the worse case ie pmd for invalidation. Note that
> >          * the page can not be free in this function as call of try_to_unmap()
> 
> We have a similiar problem with DAX and the conclusion we came to is
> that it is not acceptable for userspace to arbitrarily block kernel
> actions. The conclusion there was: 'wait' if the DMA is transient, and
> 'revoke' if the DMA is long lived, or otherwise 'block' long-lived DMA
> if a revocation mechanism is not available.

This might be the right answer for certain things, but it shouldn't be
the immediate reaction to everthing. There are many user APIs that
block kernel actions and hold kernel resources.

IMHO, there should be an identifiable objection, eg is blocking going
to create a DOS, dead-lock, insecurity, etc?

Jason

Dan Williams June 17, 2018, 8:10 p.m. UTC | #3

On Sun, Jun 17, 2018 at 1:04 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Sun, Jun 17, 2018 at 12:53:04PM -0700, Dan Williams wrote:
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index 6db729dc4c50..37576f0a4645 100644
>> > +++ b/mm/rmap.c
>> > @@ -1360,6 +1360,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                                 flags & TTU_SPLIT_FREEZE, page);
>> >         }
>> >
>> > +       if (PageDmaPinned(page))
>> > +               return false;
>> >         /*
>> >          * We have to assume the worse case ie pmd for invalidation. Note that
>> >          * the page can not be free in this function as call of try_to_unmap()
>>
>> We have a similiar problem with DAX and the conclusion we came to is
>> that it is not acceptable for userspace to arbitrarily block kernel
>> actions. The conclusion there was: 'wait' if the DMA is transient, and
>> 'revoke' if the DMA is long lived, or otherwise 'block' long-lived DMA
>> if a revocation mechanism is not available.
>
> This might be the right answer for certain things, but it shouldn't be
> the immediate reaction to everthing. There are many user APIs that
> block kernel actions and hold kernel resources.
>
> IMHO, there should be an identifiable objection, eg is blocking going
> to create a DOS, dead-lock, insecurity, etc?

I believe kernel behavior regression is a primary concern as now
fallocate() and truncate() can randomly fail where they didn't before.

John Hubbard June 17, 2018, 8:28 p.m. UTC | #4

On 06/17/2018 01:10 PM, Dan Williams wrote:
> On Sun, Jun 17, 2018 at 1:04 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Sun, Jun 17, 2018 at 12:53:04PM -0700, Dan Williams wrote:
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 6db729dc4c50..37576f0a4645 100644
>>>> +++ b/mm/rmap.c
>>>> @@ -1360,6 +1360,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>                                 flags & TTU_SPLIT_FREEZE, page);
>>>>         }
>>>>
>>>> +       if (PageDmaPinned(page))
>>>> +               return false;
>>>>         /*
>>>>          * We have to assume the worse case ie pmd for invalidation. Note that
>>>>          * the page can not be free in this function as call of try_to_unmap()
>>>
>>> We have a similiar problem with DAX and the conclusion we came to is
>>> that it is not acceptable for userspace to arbitrarily block kernel
>>> actions. The conclusion there was: 'wait' if the DMA is transient, and
>>> 'revoke' if the DMA is long lived, or otherwise 'block' long-lived DMA
>>> if a revocation mechanism is not available.
>>
>> This might be the right answer for certain things, but it shouldn't be
>> the immediate reaction to everthing. There are many user APIs that
>> block kernel actions and hold kernel resources.
>>
>> IMHO, there should be an identifiable objection, eg is blocking going
>> to create a DOS, dead-lock, insecurity, etc?
> 
> I believe kernel behavior regression is a primary concern as now
> fallocate() and truncate() can randomly fail where they didn't before.
> 

Yes. However, my thinking was: get_user_pages() can become a way to indicate that 
these pages are going to be treated specially. In particular, the caller
does not really want or need to support certain file operations, while the
page is flagged this way.

If necessary, we could add a new API call. But either way, I think we could
reasonably document that "if you pin these pages (either via get_user_pages,
or some new, similar-looking API call), you can DMA to/from them, and safely
mark them as dirty when you're done, and the right things will happen. 
And in the interim, you can expect that the follow file system API calls
will not behave predictably: fallocate, truncate, ..."

Maybe in the near future, we can remove that last qualification, if we
find a more comprehensive design for this (as opposed to this cheap fix
I'm proposing here).

John Hubbard June 17, 2018, 10:19 p.m. UTC | #5

On 06/17/2018 12:53 PM, Dan Williams wrote:
> [..]
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 6db729dc4c50..37576f0a4645 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1360,6 +1360,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>                                 flags & TTU_SPLIT_FREEZE, page);
>>         }
>>
>> +       if (PageDmaPinned(page))
>> +               return false;
>>         /*
>>          * We have to assume the worse case ie pmd for invalidation. Note that
>>          * the page can not be free in this function as call of try_to_unmap()
> 
> We have a similiar problem with DAX and the conclusion we came to is
> that it is not acceptable for userspace to arbitrarily block kernel
> actions. The conclusion there was: 'wait' if the DMA is transient, and
> 'revoke' if the DMA is long lived, or otherwise 'block' long-lived DMA
> if a revocation mechanism is not available.
> 

Dan, thanks...can you please say a few words (or point to the code) about the "revoke" part of
this design? And whether you think it could be applied here (instead of the unconditional
appproach I have above)?

Christoph Hellwig June 18, 2018, 7:56 a.m. UTC | #6

On Sat, Jun 16, 2018 at 06:25:10PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> This fixes a few problems that come up when using devices (NICs, GPUs,
> for example) that want to have direct access to a chunk of system (CPU)
> memory, so that they can DMA to/from that memory. Problems [1] come up
> if that memory is backed by persistence storage; for example, an ext4
> file system. I've been working on several customer bugs that are hitting
> this, and this patchset fixes those bugs.

What happens if we do get_user_page from two different threads or even
processes on the same page?  As far as I can tell from your patch
the first one finishing the page will clear the bit and then we are
back to no protection.

Note that you can reproduce such a condition trivially using direct
I/O reads or writes.

Christoph Hellwig June 18, 2018, 8:11 a.m. UTC | #7

On Sun, Jun 17, 2018 at 01:10:04PM -0700, Dan Williams wrote:
> I believe kernel behavior regression is a primary concern as now
> fallocate() and truncate() can randomly fail where they didn't before.

Fail or block forever due to actions of other unprivileged users.

That is a complete no-go.

Christoph Hellwig June 18, 2018, 8:12 a.m. UTC | #8

On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
> Yes. However, my thinking was: get_user_pages() can become a way to indicate that 
> these pages are going to be treated specially. In particular, the caller
> does not really want or need to support certain file operations, while the
> page is flagged this way.
> 
> If necessary, we could add a new API call.

That API call is called get_user_pages_longterm.

> But either way, I think we could
> reasonably document that "if you pin these pages (either via get_user_pages,
> or some new, similar-looking API call), you can DMA to/from them, and safely
> mark them as dirty when you're done, and the right things will happen. 
> And in the interim, you can expect that the follow file system API calls
> will not behave predictably: fallocate, truncate, ..."

That is not how get_user_pages(_fast) is used.  We use it all over the
kernel, including for direct I/O.  You'd break a lot of existing use
cases very badly.

John Hubbard June 18, 2018, 5:44 p.m. UTC | #9

Hi Christoph,

Thanks for looking at this...

On 06/18/2018 12:56 AM, Christoph Hellwig wrote:
> On Sat, Jun 16, 2018 at 06:25:10PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> This fixes a few problems that come up when using devices (NICs, GPUs,
>> for example) that want to have direct access to a chunk of system (CPU)
>> memory, so that they can DMA to/from that memory. Problems [1] come up
>> if that memory is backed by persistence storage; for example, an ext4
>> file system. I've been working on several customer bugs that are hitting
>> this, and this patchset fixes those bugs.
> 
> What happens if we do get_user_page from two different threads or even
> processes on the same page?  As far as I can tell from your patch
> the first one finishing the page will clear the bit and then we are
> back to no protection.

The patch does not do that. The flag is only ever cleared when the page is 
freed. That can't happen until each of the two threads above is done and 
calls put_page(). So while there may be other design issues here, the above 
case is not one of them. :)


thanks,

John Hubbard June 18, 2018, 5:50 p.m. UTC | #10

On 06/18/2018 01:12 AM, Christoph Hellwig wrote:
> On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
>> Yes. However, my thinking was: get_user_pages() can become a way to indicate that 
>> these pages are going to be treated specially. In particular, the caller
>> does not really want or need to support certain file operations, while the
>> page is flagged this way.
>>
>> If necessary, we could add a new API call.
> 
> That API call is called get_user_pages_longterm.

OK...I had the impression that this was just semi-temporary API for dax, but
given that it's an exported symbol, I guess it really is here to stay.

Anyway, are you thinking that we could set the new page flag here? Or just pointing
out that the other get_user_pages* variants are the wrong place?

Dan Williams June 18, 2018, 5:56 p.m. UTC | #11

On Mon, Jun 18, 2018 at 10:50 AM, John Hubbard <jhubbard@nvidia.com> wrote:
> On 06/18/2018 01:12 AM, Christoph Hellwig wrote:
>> On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
>>> Yes. However, my thinking was: get_user_pages() can become a way to indicate that
>>> these pages are going to be treated specially. In particular, the caller
>>> does not really want or need to support certain file operations, while the
>>> page is flagged this way.
>>>
>>> If necessary, we could add a new API call.
>>
>> That API call is called get_user_pages_longterm.
>
> OK...I had the impression that this was just semi-temporary API for dax, but
> given that it's an exported symbol, I guess it really is here to stay.

The plan is to go back and provide api changes that bypass
get_user_page_longterm() for RDMA. However, for VFIO and others, it's
not clear what we could do. In the VFIO case the guest would need to
be prepared handle the revocation.

John Hubbard June 18, 2018, 6:14 p.m. UTC | #12

On 06/18/2018 10:56 AM, Dan Williams wrote:
> On Mon, Jun 18, 2018 at 10:50 AM, John Hubbard <jhubbard@nvidia.com> wrote:
>> On 06/18/2018 01:12 AM, Christoph Hellwig wrote:
>>> On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
>>>> Yes. However, my thinking was: get_user_pages() can become a way to indicate that
>>>> these pages are going to be treated specially. In particular, the caller
>>>> does not really want or need to support certain file operations, while the
>>>> page is flagged this way.
>>>>
>>>> If necessary, we could add a new API call.
>>>
>>> That API call is called get_user_pages_longterm.
>>
>> OK...I had the impression that this was just semi-temporary API for dax, but
>> given that it's an exported symbol, I guess it really is here to stay.
> 
> The plan is to go back and provide api changes that bypass
> get_user_page_longterm() for RDMA. However, for VFIO and others, it's
> not clear what we could do. In the VFIO case the guest would need to
> be prepared handle the revocation.

OK, let's see if I understand that plan correctly:

1. Change RDMA users (this could be done entirely in the various device drivers'
code, unless I'm overlooking something) to use mmu notifiers, and to do their
DMA to/from non-pinned pages.

2. Return early from get_user_pages_longterm, if the memory is...marked for
RDMA? (How? Same sort of page flag that I'm floating here, or something else?)
That would avoid the problem with pinned pages getting their buffer heads
removed--by disallowing the pinning. Makes sense.

Also, is there anything I can help with here, so that things can happen sooner?

Dan Williams June 18, 2018, 7:21 p.m. UTC | #13

On Mon, Jun 18, 2018 at 11:14 AM, John Hubbard <jhubbard@nvidia.com> wrote:
> On 06/18/2018 10:56 AM, Dan Williams wrote:
>> On Mon, Jun 18, 2018 at 10:50 AM, John Hubbard <jhubbard@nvidia.com> wrote:
>>> On 06/18/2018 01:12 AM, Christoph Hellwig wrote:
>>>> On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
>>>>> Yes. However, my thinking was: get_user_pages() can become a way to indicate that
>>>>> these pages are going to be treated specially. In particular, the caller
>>>>> does not really want or need to support certain file operations, while the
>>>>> page is flagged this way.
>>>>>
>>>>> If necessary, we could add a new API call.
>>>>
>>>> That API call is called get_user_pages_longterm.
>>>
>>> OK...I had the impression that this was just semi-temporary API for dax, but
>>> given that it's an exported symbol, I guess it really is here to stay.
>>
>> The plan is to go back and provide api changes that bypass
>> get_user_page_longterm() for RDMA. However, for VFIO and others, it's
>> not clear what we could do. In the VFIO case the guest would need to
>> be prepared handle the revocation.
>
> OK, let's see if I understand that plan correctly:
>
> 1. Change RDMA users (this could be done entirely in the various device drivers'
> code, unless I'm overlooking something) to use mmu notifiers, and to do their
> DMA to/from non-pinned pages.

The problem with this approach is surprising the RDMA drivers with
notifications of teardowns. It's the RDMA userspace applications that
need the notification, and it likely needs to be explicit opt-in, at
least for the non-ODP drivers.

> 2. Return early from get_user_pages_longterm, if the memory is...marked for
> RDMA? (How? Same sort of page flag that I'm floating here, or something else?)
> That would avoid the problem with pinned pages getting their buffer heads
> removed--by disallowing the pinning. Makes sense.

Well, right now the RDMA workaround is DAX specific and it seems we
need to generalize it for the page-cache case. One thought is to have
try_to_unmap() take it's own reference and wait for the page reference
count to drop to one so that the truncate path knows the page is
dma-idle and disconnected from the page cache, but I have not looked
at the details.

> Also, is there anything I can help with here, so that things can happen sooner?

I do think we should explore a page flag for pages that are "long
term" pinned. Michal asked for something along these lines at LSF / MM
so that the core-mm can give up on pages that the kernel has lost
lifetime control. Michal, did I capture your ask correctly?

Jason Gunthorpe June 18, 2018, 7:31 p.m. UTC | #14

On Mon, Jun 18, 2018 at 12:21:46PM -0700, Dan Williams wrote:
> On Mon, Jun 18, 2018 at 11:14 AM, John Hubbard <jhubbard@nvidia.com> wrote:
> > On 06/18/2018 10:56 AM, Dan Williams wrote:
> >> On Mon, Jun 18, 2018 at 10:50 AM, John Hubbard <jhubbard@nvidia.com> wrote:
> >>> On 06/18/2018 01:12 AM, Christoph Hellwig wrote:
> >>>> On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
> >>>>> Yes. However, my thinking was: get_user_pages() can become a way to indicate that
> >>>>> these pages are going to be treated specially. In particular, the caller
> >>>>> does not really want or need to support certain file operations, while the
> >>>>> page is flagged this way.
> >>>>>
> >>>>> If necessary, we could add a new API call.
> >>>>
> >>>> That API call is called get_user_pages_longterm.
> >>>
> >>> OK...I had the impression that this was just semi-temporary API for dax, but
> >>> given that it's an exported symbol, I guess it really is here to stay.
> >>
> >> The plan is to go back and provide api changes that bypass
> >> get_user_page_longterm() for RDMA. However, for VFIO and others, it's
> >> not clear what we could do. In the VFIO case the guest would need to
> >> be prepared handle the revocation.
> >
> > OK, let's see if I understand that plan correctly:
> >
> > 1. Change RDMA users (this could be done entirely in the various device drivers'
> > code, unless I'm overlooking something) to use mmu notifiers, and to do their
> > DMA to/from non-pinned pages.
> 
> The problem with this approach is surprising the RDMA drivers with
> notifications of teardowns. It's the RDMA userspace applications that
> need the notification, and it likely needs to be explicit opt-in, at
> least for the non-ODP drivers.

Well, more than that, we have no real plan on how to accomplish this,
or any idea if it can even really work.. Most userspace give up
control of the memory lifetime to the remote side of the connection
and have no way to recover it other than a full teardown.

Given that John is trying to fix a kernel oops, I don't think we
should tie progress on it to the RDMA notification idea.

.. and given that John is trying to fix a kernel oops, maybe the
weird/bad/ugly behavior of ftruncte is a better bug to have than for
unprivileged users to be able to oops the kernel???

Jason

Dan Williams June 18, 2018, 8:04 p.m. UTC | #15

On Mon, Jun 18, 2018 at 12:31 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Jun 18, 2018 at 12:21:46PM -0700, Dan Williams wrote:
>> On Mon, Jun 18, 2018 at 11:14 AM, John Hubbard <jhubbard@nvidia.com> wrote:
>> > On 06/18/2018 10:56 AM, Dan Williams wrote:
>> >> On Mon, Jun 18, 2018 at 10:50 AM, John Hubbard <jhubbard@nvidia.com> wrote:
>> >>> On 06/18/2018 01:12 AM, Christoph Hellwig wrote:
>> >>>> On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
>> >>>>> Yes. However, my thinking was: get_user_pages() can become a way to indicate that
>> >>>>> these pages are going to be treated specially. In particular, the caller
>> >>>>> does not really want or need to support certain file operations, while the
>> >>>>> page is flagged this way.
>> >>>>>
>> >>>>> If necessary, we could add a new API call.
>> >>>>
>> >>>> That API call is called get_user_pages_longterm.
>> >>>
>> >>> OK...I had the impression that this was just semi-temporary API for dax, but
>> >>> given that it's an exported symbol, I guess it really is here to stay.
>> >>
>> >> The plan is to go back and provide api changes that bypass
>> >> get_user_page_longterm() for RDMA. However, for VFIO and others, it's
>> >> not clear what we could do. In the VFIO case the guest would need to
>> >> be prepared handle the revocation.
>> >
>> > OK, let's see if I understand that plan correctly:
>> >
>> > 1. Change RDMA users (this could be done entirely in the various device drivers'
>> > code, unless I'm overlooking something) to use mmu notifiers, and to do their
>> > DMA to/from non-pinned pages.
>>
>> The problem with this approach is surprising the RDMA drivers with
>> notifications of teardowns. It's the RDMA userspace applications that
>> need the notification, and it likely needs to be explicit opt-in, at
>> least for the non-ODP drivers.
>
> Well, more than that, we have no real plan on how to accomplish this,
> or any idea if it can even really work.. Most userspace give up
> control of the memory lifetime to the remote side of the connection
> and have no way to recover it other than a full teardown.
>
> Given that John is trying to fix a kernel oops, I don't think we
> should tie progress on it to the RDMA notification idea.
>
> .. and given that John is trying to fix a kernel oops, maybe the
> weird/bad/ugly behavior of ftruncte is a better bug to have than for
> unprivileged users to be able to oops the kernel???

Trading one bug for another is not a fix. We did not fix the
DAX-dma-vs-trruncate bug by breaking truncate() guarantees.

John Hubbard June 18, 2018, 9:36 p.m. UTC | #16

On 06/18/2018 12:21 PM, Dan Williams wrote:
> On Mon, Jun 18, 2018 at 11:14 AM, John Hubbard <jhubbard@nvidia.com> wrote:
>> On 06/18/2018 10:56 AM, Dan Williams wrote:
>>> On Mon, Jun 18, 2018 at 10:50 AM, John Hubbard <jhubbard@nvidia.com> wrote:
>>>> On 06/18/2018 01:12 AM, Christoph Hellwig wrote:
>>>>> On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
>>>>>> Yes. However, my thinking was: get_user_pages() can become a way to indicate that
>>>>>> these pages are going to be treated specially. In particular, the caller
>>>>>> does not really want or need to support certain file operations, while the
>>>>>> page is flagged this way.
>>>>>>
>>>>>> If necessary, we could add a new API call.
>>>>>
>>>>> That API call is called get_user_pages_longterm.
>>>>
>>>> OK...I had the impression that this was just semi-temporary API for dax, but
>>>> given that it's an exported symbol, I guess it really is here to stay.
>>>
>>> The plan is to go back and provide api changes that bypass
>>> get_user_page_longterm() for RDMA. However, for VFIO and others, it's
>>> not clear what we could do. In the VFIO case the guest would need to
>>> be prepared handle the revocation.
>>
>> OK, let's see if I understand that plan correctly:
>>
>> 1. Change RDMA users (this could be done entirely in the various device drivers'
>> code, unless I'm overlooking something) to use mmu notifiers, and to do their
>> DMA to/from non-pinned pages.
> 
> The problem with this approach is surprising the RDMA drivers with
> notifications of teardowns. It's the RDMA userspace applications that
> need the notification, and it likely needs to be explicit opt-in, at
> least for the non-ODP drivers.
> 
>> 2. Return early from get_user_pages_longterm, if the memory is...marked for
>> RDMA? (How? Same sort of page flag that I'm floating here, or something else?)
>> That would avoid the problem with pinned pages getting their buffer heads
>> removed--by disallowing the pinning. Makes sense.
> 
> Well, right now the RDMA workaround is DAX specific and it seems we
> need to generalize it for the page-cache case. One thought is to have
> try_to_unmap() take it's own reference and wait for the page reference
> count to drop to one so that the truncate path knows the page is
> dma-idle and disconnected from the page cache, but I have not looked
> at the details.
> 
>> Also, is there anything I can help with here, so that things can happen sooner?
> 
> I do think we should explore a page flag for pages that are "long
> term" pinned. Michal asked for something along these lines at LSF / MM
> so that the core-mm can give up on pages that the kernel has lost
> lifetime control. Michal, did I capture your ask correctly?


OK, that "refcount == 1" approach sounds promising:

   -- still use a page flag, but narrow the scope to get_user_pages_longterm() pages
   -- just wait in try_to_unmap, instead of giving up

I'll look into it, while waiting for Michal's thoughts on this.

Leon Romanovsky June 19, 2018, 6:15 a.m. UTC | #17

On Mon, Jun 18, 2018 at 10:11:13AM +0200, Christoph Hellwig wrote:
> On Sun, Jun 17, 2018 at 01:10:04PM -0700, Dan Williams wrote:
> > I believe kernel behavior regression is a primary concern as now
> > fallocate() and truncate() can randomly fail where they didn't before.
>
> Fail or block forever due to actions of other unprivileged users.
>
> That is a complete no-go.

They will fail "properly" with some error, which is better than current
situation where those unprivileged users can trigger oops.

Thanks

Jan Kara June 19, 2018, 8:29 a.m. UTC | #18

On Mon 18-06-18 14:36:44, John Hubbard wrote:
> On 06/18/2018 12:21 PM, Dan Williams wrote:
> > On Mon, Jun 18, 2018 at 11:14 AM, John Hubbard <jhubbard@nvidia.com> wrote:
> >> On 06/18/2018 10:56 AM, Dan Williams wrote:
> >>> On Mon, Jun 18, 2018 at 10:50 AM, John Hubbard <jhubbard@nvidia.com> wrote:
> >>>> On 06/18/2018 01:12 AM, Christoph Hellwig wrote:
> >>>>> On Sun, Jun 17, 2018 at 01:28:18PM -0700, John Hubbard wrote:
> >>>>>> Yes. However, my thinking was: get_user_pages() can become a way to indicate that
> >>>>>> these pages are going to be treated specially. In particular, the caller
> >>>>>> does not really want or need to support certain file operations, while the
> >>>>>> page is flagged this way.
> >>>>>>
> >>>>>> If necessary, we could add a new API call.
> >>>>>
> >>>>> That API call is called get_user_pages_longterm.
> >>>>
> >>>> OK...I had the impression that this was just semi-temporary API for dax, but
> >>>> given that it's an exported symbol, I guess it really is here to stay.
> >>>
> >>> The plan is to go back and provide api changes that bypass
> >>> get_user_page_longterm() for RDMA. However, for VFIO and others, it's
> >>> not clear what we could do. In the VFIO case the guest would need to
> >>> be prepared handle the revocation.
> >>
> >> OK, let's see if I understand that plan correctly:
> >>
> >> 1. Change RDMA users (this could be done entirely in the various device drivers'
> >> code, unless I'm overlooking something) to use mmu notifiers, and to do their
> >> DMA to/from non-pinned pages.
> > 
> > The problem with this approach is surprising the RDMA drivers with
> > notifications of teardowns. It's the RDMA userspace applications that
> > need the notification, and it likely needs to be explicit opt-in, at
> > least for the non-ODP drivers.
> > 
> >> 2. Return early from get_user_pages_longterm, if the memory is...marked for
> >> RDMA? (How? Same sort of page flag that I'm floating here, or something else?)
> >> That would avoid the problem with pinned pages getting their buffer heads
> >> removed--by disallowing the pinning. Makes sense.
> > 
> > Well, right now the RDMA workaround is DAX specific and it seems we
> > need to generalize it for the page-cache case. One thought is to have
> > try_to_unmap() take it's own reference and wait for the page reference
> > count to drop to one so that the truncate path knows the page is
> > dma-idle and disconnected from the page cache, but I have not looked
> > at the details.
> > 
> >> Also, is there anything I can help with here, so that things can happen sooner?
> > 
> > I do think we should explore a page flag for pages that are "long
> > term" pinned. Michal asked for something along these lines at LSF / MM
> > so that the core-mm can give up on pages that the kernel has lost
> > lifetime control. Michal, did I capture your ask correctly?
> 
> 
> OK, that "refcount == 1" approach sounds promising:
> 
>    -- still use a page flag, but narrow the scope to get_user_pages_longterm() pages
>    -- just wait in try_to_unmap, instead of giving up

But this would fix only the RDMA use case, isn't it? Direct IO (and other
get_user_pages_fast() users) would be still problematic.

And for record, the problem with page cache pages is not only that
try_to_unmap() may unmap them. It is also that page_mkclean() can
write-protect them. And once PTEs are write-protected filesystems may end
up doing bad things if DMA then modifies the page contents (DIF/DIX
failures, data corruption, oopses). As such I don't think that solutions
based on page reference count have a big chance of dealing with the
problem.

And your page flag approach would also need to take page_mkclean() into
account. And there the issue is that until the flag is cleared (i.e., we
are sure there are no writers using references from GUP) you cannot
writeback the page safely which does not work well with your idea of
clearing the flag only once the page is evicted from page cache (hint, page
cache page cannot get evicted until it is written back).

So as sad as it is, I don't see an easy solution here.

								Honza

Matthew Wilcox June 19, 2018, 9:02 a.m. UTC | #19

On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
> And for record, the problem with page cache pages is not only that
> try_to_unmap() may unmap them. It is also that page_mkclean() can
> write-protect them. And once PTEs are write-protected filesystems may end
> up doing bad things if DMA then modifies the page contents (DIF/DIX
> failures, data corruption, oopses). As such I don't think that solutions
> based on page reference count have a big chance of dealing with the
> problem.
> 
> And your page flag approach would also need to take page_mkclean() into
> account. And there the issue is that until the flag is cleared (i.e., we
> are sure there are no writers using references from GUP) you cannot
> writeback the page safely which does not work well with your idea of
> clearing the flag only once the page is evicted from page cache (hint, page
> cache page cannot get evicted until it is written back).
> 
> So as sad as it is, I don't see an easy solution here.

Pages which are "got" don't need to be on the LRU list.  They'll be
marked dirty when they're put, so we can use page->lru for fun things
like a "got" refcount.  If we use bit 1 of page->lru for PageGot, we've
got 30/62 bits in the first word and a full 64 bits in the second word.

Jan Kara June 19, 2018, 10:41 a.m. UTC | #20

On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
> > And for record, the problem with page cache pages is not only that
> > try_to_unmap() may unmap them. It is also that page_mkclean() can
> > write-protect them. And once PTEs are write-protected filesystems may end
> > up doing bad things if DMA then modifies the page contents (DIF/DIX
> > failures, data corruption, oopses). As such I don't think that solutions
> > based on page reference count have a big chance of dealing with the
> > problem.
> > 
> > And your page flag approach would also need to take page_mkclean() into
> > account. And there the issue is that until the flag is cleared (i.e., we
> > are sure there are no writers using references from GUP) you cannot
> > writeback the page safely which does not work well with your idea of
> > clearing the flag only once the page is evicted from page cache (hint, page
> > cache page cannot get evicted until it is written back).
> > 
> > So as sad as it is, I don't see an easy solution here.
> 
> Pages which are "got" don't need to be on the LRU list.  They'll be
> marked dirty when they're put, so we can use page->lru for fun things
> like a "got" refcount.  If we use bit 1 of page->lru for PageGot, we've
> got 30/62 bits in the first word and a full 64 bits in the second word.

Interesting idea! It would destroy the aging information for the page but
for pages accessed through GUP references that is very much vague concept
anyway. It might be a bit tricky as pulling a page out of LRU requires page
lock but I don't think that's a huge problem. And page cache pages not on
LRU exist even currently when they are under reclaim so hopefully there
won't be too many places in MM that would need fixing up for such pages.

I'm also still pondering the idea of inserting a "virtual" VMA into vma
interval tree in the inode - as the GUP references are IMHO closest to an
mlocked mapping - and that would achieve all the functionality we need as
well. I just didn't have time to experiment with it.

And then there's the aspect that both these approaches are a bit too
heavyweight for some get_user_pages_fast() users (e.g. direct IO) - Al Viro
had an idea to use page lock for that path but e.g. fs/direct-io.c would have
problems due to lock ordering constraints (filesystem ->get_block would
suddently get called with the page lock held). But we can probably leave
performance optimizations for phase two.

								Honza

John Hubbard June 19, 2018, 6:11 p.m. UTC | #21

On 06/19/2018 03:41 AM, Jan Kara wrote:
> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
>>> And for record, the problem with page cache pages is not only that
>>> try_to_unmap() may unmap them. It is also that page_mkclean() can
>>> write-protect them. And once PTEs are write-protected filesystems may end
>>> up doing bad things if DMA then modifies the page contents (DIF/DIX
>>> failures, data corruption, oopses). As such I don't think that solutions
>>> based on page reference count have a big chance of dealing with the
>>> problem.
>>>
>>> And your page flag approach would also need to take page_mkclean() into
>>> account. And there the issue is that until the flag is cleared (i.e., we
>>> are sure there are no writers using references from GUP) you cannot
>>> writeback the page safely which does not work well with your idea of
>>> clearing the flag only once the page is evicted from page cache (hint, page
>>> cache page cannot get evicted until it is written back).
>>>
>>> So as sad as it is, I don't see an easy solution here.
>>
>> Pages which are "got" don't need to be on the LRU list.  They'll be
>> marked dirty when they're put, so we can use page->lru for fun things
>> like a "got" refcount.  If we use bit 1 of page->lru for PageGot, we've
>> got 30/62 bits in the first word and a full 64 bits in the second word.
> 
> Interesting idea! It would destroy the aging information for the page but
> for pages accessed through GUP references that is very much vague concept
> anyway. It might be a bit tricky as pulling a page out of LRU requires page
> lock but I don't think that's a huge problem. And page cache pages not on
> LRU exist even currently when they are under reclaim so hopefully there
> won't be too many places in MM that would need fixing up for such pages.

This sound promising, I'll try it out!

> 
> I'm also still pondering the idea of inserting a "virtual" VMA into vma
> interval tree in the inode - as the GUP references are IMHO closest to an
> mlocked mapping - and that would achieve all the functionality we need as
> well. I just didn't have time to experiment with it.

How would this work? Would it have the same virtual address range? And how
does it avoid the problems we've been discussing? Sorry to be a bit slow
here. :)

> 
> And then there's the aspect that both these approaches are a bit too
> heavyweight for some get_user_pages_fast() users (e.g. direct IO) - Al Viro
> had an idea to use page lock for that path but e.g. fs/direct-io.c would have
> problems due to lock ordering constraints (filesystem ->get_block would
> suddently get called with the page lock held). But we can probably leave
> performance optimizations for phase two.

 
So I assume that phase one would be to apply this approach only to
get_user_pages_longterm. (Please let me know if that's wrong.)


thanks,

Dan Williams June 20, 2018, 1:24 a.m. UTC | #22

On Tue, Jun 19, 2018 at 11:11 AM, John Hubbard <jhubbard@nvidia.com> wrote:
> On 06/19/2018 03:41 AM, Jan Kara wrote:
>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
[..]
>> And then there's the aspect that both these approaches are a bit too
>> heavyweight for some get_user_pages_fast() users (e.g. direct IO) - Al Viro
>> had an idea to use page lock for that path but e.g. fs/direct-io.c would have
>> problems due to lock ordering constraints (filesystem ->get_block would
>> suddently get called with the page lock held). But we can probably leave
>> performance optimizations for phase two.
>
>
> So I assume that phase one would be to apply this approach only to
> get_user_pages_longterm. (Please let me know if that's wrong.)

I think that's wrong, because get_user_pages_longterm() is only a
filesystem-dax avoidance mechanism, it's not trying to address all the
problems that Jan is talking about. I don't see any viable half-step
solutions.

John Hubbard June 20, 2018, 1:34 a.m. UTC | #23

On 06/19/2018 06:24 PM, Dan Williams wrote:
> On Tue, Jun 19, 2018 at 11:11 AM, John Hubbard <jhubbard@nvidia.com> wrote:
>> On 06/19/2018 03:41 AM, Jan Kara wrote:
>>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
>>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
> [..]
>>> And then there's the aspect that both these approaches are a bit too
>>> heavyweight for some get_user_pages_fast() users (e.g. direct IO) - Al Viro
>>> had an idea to use page lock for that path but e.g. fs/direct-io.c would have
>>> problems due to lock ordering constraints (filesystem ->get_block would
>>> suddently get called with the page lock held). But we can probably leave
>>> performance optimizations for phase two.
>>
>>
>> So I assume that phase one would be to apply this approach only to
>> get_user_pages_longterm. (Please let me know if that's wrong.)
> 
> I think that's wrong, because get_user_pages_longterm() is only a
> filesystem-dax avoidance mechanism, it's not trying to address all the
> problems that Jan is talking about. I don't see any viable half-step
> solutions.
> 

OK, but in that case, I'm slightly confused by Jan's comment above, about leaving 
performance optimizations until phase two. Because that *is* a half-step approach: 
phase one, phase two.  

Are you disagreeing with Jan, or are you suggesting "fix get_user_pages first, and
leave get_user_pages_fast alone for now?" 

Or something else?

Dan Williams June 20, 2018, 1:57 a.m. UTC | #24

On Tue, Jun 19, 2018 at 6:34 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> On 06/19/2018 06:24 PM, Dan Williams wrote:
>> On Tue, Jun 19, 2018 at 11:11 AM, John Hubbard <jhubbard@nvidia.com> wrote:
>>> On 06/19/2018 03:41 AM, Jan Kara wrote:
>>>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
>>>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
>> [..]
>>>> And then there's the aspect that both these approaches are a bit too
>>>> heavyweight for some get_user_pages_fast() users (e.g. direct IO) - Al Viro
>>>> had an idea to use page lock for that path but e.g. fs/direct-io.c would have
>>>> problems due to lock ordering constraints (filesystem ->get_block would
>>>> suddently get called with the page lock held). But we can probably leave
>>>> performance optimizations for phase two.
>>>
>>>
>>> So I assume that phase one would be to apply this approach only to
>>> get_user_pages_longterm. (Please let me know if that's wrong.)
>>
>> I think that's wrong, because get_user_pages_longterm() is only a
>> filesystem-dax avoidance mechanism, it's not trying to address all the
>> problems that Jan is talking about. I don't see any viable half-step
>> solutions.
>>
>
> OK, but in that case, I'm slightly confused by Jan's comment above, about leaving
> performance optimizations until phase two. Because that *is* a half-step approach:
> phase one, phase two.

No, sorry, I might be confusing things. The half step is leaving
truncate broken, or my strawman that only addressed unmap.

> Are you disagreeing with Jan, or are you suggesting "fix get_user_pages first, and
> leave get_user_pages_fast alone for now?"

I'm agreeing with Jan, we need to fix page_mkclean() and
try_to_unmap() without regressing truncate behavior.

John Hubbard June 20, 2018, 2:03 a.m. UTC | #25

On 06/19/2018 06:57 PM, Dan Williams wrote:
> On Tue, Jun 19, 2018 at 6:34 PM, John Hubbard <jhubbard@nvidia.com> wrote:
>> On 06/19/2018 06:24 PM, Dan Williams wrote:
>>> On Tue, Jun 19, 2018 at 11:11 AM, John Hubbard <jhubbard@nvidia.com> wrote:
>>>> On 06/19/2018 03:41 AM, Jan Kara wrote:
>>>>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
>>>>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
>>> [..]
>>>>> And then there's the aspect that both these approaches are a bit too
>>>>> heavyweight for some get_user_pages_fast() users (e.g. direct IO) - Al Viro
>>>>> had an idea to use page lock for that path but e.g. fs/direct-io.c would have
>>>>> problems due to lock ordering constraints (filesystem ->get_block would
>>>>> suddently get called with the page lock held). But we can probably leave
>>>>> performance optimizations for phase two.
>>>>
>>>>
>>>> So I assume that phase one would be to apply this approach only to
>>>> get_user_pages_longterm. (Please let me know if that's wrong.)
>>>
>>> I think that's wrong, because get_user_pages_longterm() is only a
>>> filesystem-dax avoidance mechanism, it's not trying to address all the
>>> problems that Jan is talking about. I don't see any viable half-step
>>> solutions.
>>>
>>
>> OK, but in that case, I'm slightly confused by Jan's comment above, about leaving
>> performance optimizations until phase two. Because that *is* a half-step approach:
>> phase one, phase two.
> 
> No, sorry, I might be confusing things. The half step is leaving
> truncate broken, or my strawman that only addressed unmap.
> 
>> Are you disagreeing with Jan, or are you suggesting "fix get_user_pages first, and
>> leave get_user_pages_fast alone for now?"
> 
> I'm agreeing with Jan, we need to fix page_mkclean() and
> try_to_unmap() without regressing truncate behavior.
> 

OK, perfect, thanks for clarifying.  It all sounds consistent now. :)

Jan Kara June 20, 2018, 12:08 p.m. UTC | #26

On Tue 19-06-18 11:11:48, John Hubbard wrote:
> On 06/19/2018 03:41 AM, Jan Kara wrote:
> > On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
> >> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
> >>> And for record, the problem with page cache pages is not only that
> >>> try_to_unmap() may unmap them. It is also that page_mkclean() can
> >>> write-protect them. And once PTEs are write-protected filesystems may end
> >>> up doing bad things if DMA then modifies the page contents (DIF/DIX
> >>> failures, data corruption, oopses). As such I don't think that solutions
> >>> based on page reference count have a big chance of dealing with the
> >>> problem.
> >>>
> >>> And your page flag approach would also need to take page_mkclean() into
> >>> account. And there the issue is that until the flag is cleared (i.e., we
> >>> are sure there are no writers using references from GUP) you cannot
> >>> writeback the page safely which does not work well with your idea of
> >>> clearing the flag only once the page is evicted from page cache (hint, page
> >>> cache page cannot get evicted until it is written back).
> >>>
> >>> So as sad as it is, I don't see an easy solution here.
> >>
> >> Pages which are "got" don't need to be on the LRU list.  They'll be
> >> marked dirty when they're put, so we can use page->lru for fun things
> >> like a "got" refcount.  If we use bit 1 of page->lru for PageGot, we've
> >> got 30/62 bits in the first word and a full 64 bits in the second word.
> > 
> > Interesting idea! It would destroy the aging information for the page but
> > for pages accessed through GUP references that is very much vague concept
> > anyway. It might be a bit tricky as pulling a page out of LRU requires page
> > lock but I don't think that's a huge problem. And page cache pages not on
> > LRU exist even currently when they are under reclaim so hopefully there
> > won't be too many places in MM that would need fixing up for such pages.
> 
> This sound promising, I'll try it out!
> 
> > 
> > I'm also still pondering the idea of inserting a "virtual" VMA into vma
> > interval tree in the inode - as the GUP references are IMHO closest to an
> > mlocked mapping - and that would achieve all the functionality we need as
> > well. I just didn't have time to experiment with it.
> 
> How would this work? Would it have the same virtual address range? And how
> does it avoid the problems we've been discussing? Sorry to be a bit slow
> here. :)

The range covered by the virtual mapping would be the one sent to
get_user_pages() to get page references. And then we would need to teach
page_mkclean() to check for these virtual VMAs and block / skip / report
(different situations would need different behavior) such page. But this
second part is the same regardless how we identify a page that is pinned by
get_user_pages().

> > And then there's the aspect that both these approaches are a bit too
> > heavyweight for some get_user_pages_fast() users (e.g. direct IO) - Al Viro
> > had an idea to use page lock for that path but e.g. fs/direct-io.c would have
> > problems due to lock ordering constraints (filesystem ->get_block would
> > suddently get called with the page lock held). But we can probably leave
> > performance optimizations for phase two.
> 
>  
> So I assume that phase one would be to apply this approach only to
> get_user_pages_longterm. (Please let me know if that's wrong.)

No, I meant phase 1 would be to apply this to all get_user_pages() flavors.
Then phase 2 is to try to find a way to make get_user_pages_fast() fast
again. And then in parallel to that, we also need to find a way for
get_user_pages_longterm() to signal to the user pinned pages must be
released soon. Because after phase 1 pinned pages will block page
writeback and such system won't oops but will become unusable
sooner rather than later. And again this problem needs to be solved
regardless of a mechanism of identifying pinned pages.

								Honza

John Hubbard June 20, 2018, 10:55 p.m. UTC | #27

On 06/20/2018 05:08 AM, Jan Kara wrote:
> On Tue 19-06-18 11:11:48, John Hubbard wrote:
>> On 06/19/2018 03:41 AM, Jan Kara wrote:
>>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
>>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
[...]
>>> I'm also still pondering the idea of inserting a "virtual" VMA into vma
>>> interval tree in the inode - as the GUP references are IMHO closest to an
>>> mlocked mapping - and that would achieve all the functionality we need as
>>> well. I just didn't have time to experiment with it.
>>
>> How would this work? Would it have the same virtual address range? And how
>> does it avoid the problems we've been discussing? Sorry to be a bit slow
>> here. :)
> 
> The range covered by the virtual mapping would be the one sent to
> get_user_pages() to get page references. And then we would need to teach
> page_mkclean() to check for these virtual VMAs and block / skip / report
> (different situations would need different behavior) such page. But this
> second part is the same regardless how we identify a page that is pinned by
> get_user_pages().


OK. That neatly avoids the need a new page flag, I think. But of course it is 
somewhat more extensive to implement. Sounds like something to keep in mind,
in case it has better tradeoffs than the direction I'm heading so far.

 
>>> And then there's the aspect that both these approaches are a bit too
>>> heavyweight for some get_user_pages_fast() users (e.g. direct IO) - Al Viro
>>> had an idea to use page lock for that path but e.g. fs/direct-io.c would have
>>> problems due to lock ordering constraints (filesystem ->get_block would
>>> suddently get called with the page lock held). But we can probably leave
>>> performance optimizations for phase two.
>>
>>  
>> So I assume that phase one would be to apply this approach only to
>> get_user_pages_longterm. (Please let me know if that's wrong.)
> 
> No, I meant phase 1 would be to apply this to all get_user_pages() flavors.
> Then phase 2 is to try to find a way to make get_user_pages_fast() fast
> again. And then in parallel to that, we also need to find a way for
> get_user_pages_longterm() to signal to the user pinned pages must be
> released soon. Because after phase 1 pinned pages will block page
> writeback and such system won't oops but will become unusable
> sooner rather than later. And again this problem needs to be solved
> regardless of a mechanism of identifying pinned pages.
> 

OK, thanks, that does help. I had the priorities of these get_user_pages*()
changes all scrambled, but between your and Dan's explanation, I finally 
understand the preferred ordering of this work.

Jan Kara June 21, 2018, 4:30 p.m. UTC | #28

On Wed 20-06-18 15:55:41, John Hubbard wrote:
> On 06/20/2018 05:08 AM, Jan Kara wrote:
> > On Tue 19-06-18 11:11:48, John Hubbard wrote:
> >> On 06/19/2018 03:41 AM, Jan Kara wrote:
> >>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
> >>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
> [...]
> >>> I'm also still pondering the idea of inserting a "virtual" VMA into vma
> >>> interval tree in the inode - as the GUP references are IMHO closest to an
> >>> mlocked mapping - and that would achieve all the functionality we need as
> >>> well. I just didn't have time to experiment with it.
> >>
> >> How would this work? Would it have the same virtual address range? And how
> >> does it avoid the problems we've been discussing? Sorry to be a bit slow
> >> here. :)
> > 
> > The range covered by the virtual mapping would be the one sent to
> > get_user_pages() to get page references. And then we would need to teach
> > page_mkclean() to check for these virtual VMAs and block / skip / report
> > (different situations would need different behavior) such page. But this
> > second part is the same regardless how we identify a page that is pinned by
> > get_user_pages().
> 
> 
> OK. That neatly avoids the need a new page flag, I think. But of course it is 
> somewhat more extensive to implement. Sounds like something to keep in mind,
> in case it has better tradeoffs than the direction I'm heading so far.

Yes, the changes needed are somewhat more intrusive. I'm looking into this
approach now to see how the result will look like...

								Honza

Jan Kara June 25, 2018, 3:21 p.m. UTC | #29

On Thu 21-06-18 18:30:36, Jan Kara wrote:
> On Wed 20-06-18 15:55:41, John Hubbard wrote:
> > On 06/20/2018 05:08 AM, Jan Kara wrote:
> > > On Tue 19-06-18 11:11:48, John Hubbard wrote:
> > >> On 06/19/2018 03:41 AM, Jan Kara wrote:
> > >>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
> > >>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
> > [...]
> > >>> I'm also still pondering the idea of inserting a "virtual" VMA into vma
> > >>> interval tree in the inode - as the GUP references are IMHO closest to an
> > >>> mlocked mapping - and that would achieve all the functionality we need as
> > >>> well. I just didn't have time to experiment with it.
> > >>
> > >> How would this work? Would it have the same virtual address range? And how
> > >> does it avoid the problems we've been discussing? Sorry to be a bit slow
> > >> here. :)
> > > 
> > > The range covered by the virtual mapping would be the one sent to
> > > get_user_pages() to get page references. And then we would need to teach
> > > page_mkclean() to check for these virtual VMAs and block / skip / report
> > > (different situations would need different behavior) such page. But this
> > > second part is the same regardless how we identify a page that is pinned by
> > > get_user_pages().
> > 
> > 
> > OK. That neatly avoids the need a new page flag, I think. But of course it is 
> > somewhat more extensive to implement. Sounds like something to keep in mind,
> > in case it has better tradeoffs than the direction I'm heading so far.
> 
> Yes, the changes needed are somewhat more intrusive. I'm looking into this
> approach now to see how the result will look like...

I've spent some time on this. There are two obstacles with my approach of
putting special entry into inode's VMA tree:

1) If I want to place this special entry in inode's VMA tree, I either need
to allocate full VMA, somehow initiate it so that it's clear it's a special
"pinned" range, not a VMA => uses unnecessarily too much memory, it is
ugly. Another solution I was hoping for was that I would factor out some
common bits of vm_area_struct (pgoff, rb_node, ..) into a structure common
for VMA and the locked range => doable but causes a lot of churn as VMAs
are accessed (and modified!) at hundreds of places in the kernel. Some
accessor functions would help to reduce the churn a bit but then stuff like
vma_set_pgoff(vma, pgoff) isn't exactly beautiful either.

2) Some users of GUP (e.g. direct IO) get a block of pages and then put
references to these pages at different times and in random order -
basically when IO for given page is completed, reference is dropped and one
GUP call can acquire page references for pages which end up in multiple
different bios (we don't know in advance). This makes is difficult to
implement counterpart to GUP to 'unpin' a range of pages - we'd either have
to support partial unpins (and splitting of pinned ranges and all such fun)
or just have to track internally in how many pages are still pinned in the
originally pinned range and release the pin once all individual pages are
unpinned but then it's difficult to e.g. get to this internal structure
from IO completion callback where we only have the bio.

So I think the Matthew's idea of removing pinned pages from LRU is
definitely worth trying to see how complex that would end up being. Did you
get to looking into it? If not, I can probably find some time to try that
out.

								Honza

John Hubbard June 25, 2018, 7:03 p.m. UTC | #30

On 06/25/2018 08:21 AM, Jan Kara wrote:
> On Thu 21-06-18 18:30:36, Jan Kara wrote:
>> On Wed 20-06-18 15:55:41, John Hubbard wrote:
>>> On 06/20/2018 05:08 AM, Jan Kara wrote:
>>>> On Tue 19-06-18 11:11:48, John Hubbard wrote:
>>>>> On 06/19/2018 03:41 AM, Jan Kara wrote:
>>>>>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
>>>>>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
>>> [...]
> I've spent some time on this. There are two obstacles with my approach of
> putting special entry into inode's VMA tree:
> 
> 1) If I want to place this special entry in inode's VMA tree, I either need
> to allocate full VMA, somehow initiate it so that it's clear it's a special
> "pinned" range, not a VMA => uses unnecessarily too much memory, it is
> ugly. Another solution I was hoping for was that I would factor out some
> common bits of vm_area_struct (pgoff, rb_node, ..) into a structure common
> for VMA and the locked range => doable but causes a lot of churn as VMAs
> are accessed (and modified!) at hundreds of places in the kernel. Some
> accessor functions would help to reduce the churn a bit but then stuff like
> vma_set_pgoff(vma, pgoff) isn't exactly beautiful either.
> 
> 2) Some users of GUP (e.g. direct IO) get a block of pages and then put
> references to these pages at different times and in random order -
> basically when IO for given page is completed, reference is dropped and one
> GUP call can acquire page references for pages which end up in multiple
> different bios (we don't know in advance). This makes is difficult to
> implement counterpart to GUP to 'unpin' a range of pages - we'd either have
> to support partial unpins (and splitting of pinned ranges and all such fun)
> or just have to track internally in how many pages are still pinned in the
> originally pinned range and release the pin once all individual pages are
> unpinned but then it's difficult to e.g. get to this internal structure
> from IO completion callback where we only have the bio.
>
> So I think the Matthew's idea of removing pinned pages from LRU is
> definitely worth trying to see how complex that would end up being. Did you
> get to looking into it? If not, I can probably find some time to try that
> out.
> 

OK. Even if we remove the pages from the LRU, we still have to insert a "put_gup_page"
or similarly named call. But it could be a simple replacement for put_page, with
that approach, so that does make it much much easier.
 
I was (and still am) planning on tackling this today, so let me see how far I
get before yelling for help. :)

thanks,

John Hubbard June 26, 2018, 6:31 a.m. UTC | #31

On 06/25/2018 08:21 AM, Jan Kara wrote:
> On Thu 21-06-18 18:30:36, Jan Kara wrote:
>> On Wed 20-06-18 15:55:41, John Hubbard wrote:
>>> On 06/20/2018 05:08 AM, Jan Kara wrote:
>>>> On Tue 19-06-18 11:11:48, John Hubbard wrote:
>>>>> On 06/19/2018 03:41 AM, Jan Kara wrote:
>>>>>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
>>>>>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
>>> [...]
> I've spent some time on this. There are two obstacles with my approach of
> putting special entry into inode's VMA tree:
> 
> 1) If I want to place this special entry in inode's VMA tree, I either need
> to allocate full VMA, somehow initiate it so that it's clear it's a special
> "pinned" range, not a VMA => uses unnecessarily too much memory, it is
> ugly. Another solution I was hoping for was that I would factor out some
> common bits of vm_area_struct (pgoff, rb_node, ..) into a structure common
> for VMA and the locked range => doable but causes a lot of churn as VMAs
> are accessed (and modified!) at hundreds of places in the kernel. Some
> accessor functions would help to reduce the churn a bit but then stuff like
> vma_set_pgoff(vma, pgoff) isn't exactly beautiful either.
> 
> 2) Some users of GUP (e.g. direct IO) get a block of pages and then put
> references to these pages at different times and in random order -
> basically when IO for given page is completed, reference is dropped and one
> GUP call can acquire page references for pages which end up in multiple
> different bios (we don't know in advance). This makes is difficult to
> implement counterpart to GUP to 'unpin' a range of pages - we'd either have
> to support partial unpins (and splitting of pinned ranges and all such fun)
> or just have to track internally in how many pages are still pinned in the
> originally pinned range and release the pin once all individual pages are
> unpinned but then it's difficult to e.g. get to this internal structure
> from IO completion callback where we only have the bio.
> 
> So I think the Matthew's idea of removing pinned pages from LRU is
> definitely worth trying to see how complex that would end up being. Did you
> get to looking into it? If not, I can probably find some time to try that
> out.
> 
 
OK, so I looked into this some more.

As you implied in an earlier response, removing a page from LRU is probably the
easy part. It's *keeping* it off the LRU that worries me. I looked at SetPageLRU()
uses, there were only 5 call sites, and of those, I think only one might be difficult:

    __pagevec_lru_add()

It seems like the way to avoid __pagevec_lru_add() calls on these pages is to
first call lru_add_drain_all, then remove the pages from LRU (presumably via
isolate_page_lru). I think that should do it. But I'm a little concerned that 
maybe I'm overlooking something.

Here are the 5 search hits and my analysis. This may have mistakes in it, as
I'm pretty new to this area, which is why I'm spelling it out:

1. mm/memcontrol.c:2082: SetPageLRU(page); 

    This is in unlock_page_lru(). Caller: commit_charge(), and it's conditional on 
    lrucare, so we can just skip it if the new page flag is set.

2. mm/swap.c:831: SetPageLRU(page_tail);
    This is in lru_add_page_tail(), which is only called by __split_huge_page_tail, and
    there, we can also just skip the call for these pages.

3. mm/swap.c:866:  SetPageLRU(page);
    This is in __pagevec_lru_add_fn (sole caller: __pagevec_lru_add), and is
    discussed above.

4. mm/vmscan.c:1680: SetPageLRU(page);	
    This is in putback_inactive_pages(), which I think won't get called unless
    the page is already on an LRU.

5. mm/vmscan.c:1873: SetPageLRU(page);	//  (N/A)
    This is in move_active_pages_to_lru(), which I also think won't get called unless 
    the page is already on an LRU.


thanks,

Jan Kara June 26, 2018, 7:52 a.m. UTC | #32

On Mon 25-06-18 12:03:37, John Hubbard wrote:
> On 06/25/2018 08:21 AM, Jan Kara wrote:
> > On Thu 21-06-18 18:30:36, Jan Kara wrote:
> >> On Wed 20-06-18 15:55:41, John Hubbard wrote:
> >>> On 06/20/2018 05:08 AM, Jan Kara wrote:
> >>>> On Tue 19-06-18 11:11:48, John Hubbard wrote:
> >>>>> On 06/19/2018 03:41 AM, Jan Kara wrote:
> >>>>>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
> >>>>>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
> >>> [...]
> > I've spent some time on this. There are two obstacles with my approach of
> > putting special entry into inode's VMA tree:
> > 
> > 1) If I want to place this special entry in inode's VMA tree, I either need
> > to allocate full VMA, somehow initiate it so that it's clear it's a special
> > "pinned" range, not a VMA => uses unnecessarily too much memory, it is
> > ugly. Another solution I was hoping for was that I would factor out some
> > common bits of vm_area_struct (pgoff, rb_node, ..) into a structure common
> > for VMA and the locked range => doable but causes a lot of churn as VMAs
> > are accessed (and modified!) at hundreds of places in the kernel. Some
> > accessor functions would help to reduce the churn a bit but then stuff like
> > vma_set_pgoff(vma, pgoff) isn't exactly beautiful either.
> > 
> > 2) Some users of GUP (e.g. direct IO) get a block of pages and then put
> > references to these pages at different times and in random order -
> > basically when IO for given page is completed, reference is dropped and one
> > GUP call can acquire page references for pages which end up in multiple
> > different bios (we don't know in advance). This makes is difficult to
> > implement counterpart to GUP to 'unpin' a range of pages - we'd either have
> > to support partial unpins (and splitting of pinned ranges and all such fun)
> > or just have to track internally in how many pages are still pinned in the
> > originally pinned range and release the pin once all individual pages are
> > unpinned but then it's difficult to e.g. get to this internal structure
> > from IO completion callback where we only have the bio.
> >
> > So I think the Matthew's idea of removing pinned pages from LRU is
> > definitely worth trying to see how complex that would end up being. Did you
> > get to looking into it? If not, I can probably find some time to try that
> > out.
> > 
> 
> OK. Even if we remove the pages from the LRU, we still have to insert a
> "put_gup_page" or similarly named call. But it could be a simple
> replacement for put_page, with that approach, so that does make it much
> much easier.

Yes, that's exactly what I thought about as well.

> I was (and still am) planning on tackling this today, so let me see how
> far I get before yelling for help. :)

OK, good.

								Honza

Jan Kara June 26, 2018, 11:48 a.m. UTC | #33

On Mon 25-06-18 23:31:06, John Hubbard wrote:
> On 06/25/2018 08:21 AM, Jan Kara wrote:
> > On Thu 21-06-18 18:30:36, Jan Kara wrote:
> > So I think the Matthew's idea of removing pinned pages from LRU is
> > definitely worth trying to see how complex that would end up being. Did you
> > get to looking into it? If not, I can probably find some time to try that
> > out.
> > 
>  
> OK, so I looked into this some more.
> 
> As you implied in an earlier response, removing a page from LRU is
> probably the easy part. It's *keeping* it off the LRU that worries me. I
> looked at SetPageLRU() uses, there were only 5 call sites, and of those,
> I think only one might be difficult:
> 
>     __pagevec_lru_add()
> 
> It seems like the way to avoid __pagevec_lru_add() calls on these pages
> is to first call lru_add_drain_all, then remove the pages from LRU
> (presumably via isolate_page_lru). I think that should do it. But I'm a
> little concerned that maybe I'm overlooking something.
> 
> Here are the 5 search hits and my analysis. This may have mistakes in it,
> as I'm pretty new to this area, which is why I'm spelling it out:
> 
> 1. mm/memcontrol.c:2082: SetPageLRU(page); 
> 
>     This is in unlock_page_lru(). Caller: commit_charge(), and it's conditional on 
>     lrucare, so we can just skip it if the new page flag is set.

This is only used to move a page from one LRU list to another one -
lock_page_lru() removes page from current LRU and unlock_page_lru() places
it on the target one. And that all happens under lru_lock so we should not
see our pinned pages in this code if we protect ourselves with the lru_lock
as well. But probably worth adding VM_BUG_ON()...

> 2. mm/swap.c:831: SetPageLRU(page_tail);
>     This is in lru_add_page_tail(), which is only called by __split_huge_page_tail, and
>     there, we can also just skip the call for these pages.

This is of no concern as this gets called only when splitting huge page.
Extra page reference obtained by GUP prevents huge page splitting so this
code path cannot be executed for pinned pages. Maybe VM_BUG_ON() for
checking page really is not pinned.

> 3. mm/swap.c:866:  SetPageLRU(page);
>     This is in __pagevec_lru_add_fn (sole caller: __pagevec_lru_add), and is
>     discussed above.

Agreed that here we'll need to update __pagevec_lru_add_fn() to detect
pinned pages and avoid putting them to LRU.

> 4. mm/vmscan.c:1680: SetPageLRU(page);	
>     This is in putback_inactive_pages(), which I think won't get called unless
>     the page is already on an LRU.
>
> 5. mm/vmscan.c:1873: SetPageLRU(page);	//  (N/A)
>     This is in move_active_pages_to_lru(), which I also think won't get called unless 
>     the page is already on an LRU.

These two are correct. We just have to be careful about the cases where
page pinning races with reclaim handling these pages. But when we
transition the page to the 'pinned' state under lru_lock and remove it from
any list it is on, we should be safe against all the races with reclaim. We
just have to be careful to get all the accounting right when moving page to
the pinned state.

								Honza

Michal Hocko June 26, 2018, 1:47 p.m. UTC | #34

On Mon 18-06-18 12:21:46, Dan Williams wrote:
[...]
> I do think we should explore a page flag for pages that are "long
> term" pinned. Michal asked for something along these lines at LSF / MM
> so that the core-mm can give up on pages that the kernel has lost
> lifetime control. Michal, did I capture your ask correctly?

I am sorry to be late. I didn't ask for a page flag exactly. I've asked
for a way to query for the pin to be temporal or permanent. How that is
achieved is another question. Maybe we have some more spare room after
recent struct page reorganization but I dunno, to be honest. Maybe we
can have an _count offset for these longterm pins. It is not like we are
using the whole ref count space, right?

Another thing I was asking for is to actually account those longterm
pinned pages and apply some control over those. They are basically mlock
like and so their usage should better not be unbound.

Jan Kara June 26, 2018, 4:48 p.m. UTC | #35

On Tue 26-06-18 15:47:57, Michal Hocko wrote:
> On Mon 18-06-18 12:21:46, Dan Williams wrote:
> [...]
> > I do think we should explore a page flag for pages that are "long
> > term" pinned. Michal asked for something along these lines at LSF / MM
> > so that the core-mm can give up on pages that the kernel has lost
> > lifetime control. Michal, did I capture your ask correctly?
> 
> I am sorry to be late. I didn't ask for a page flag exactly. I've asked
> for a way to query for the pin to be temporal or permanent. How that is
> achieved is another question. Maybe we have some more spare room after
> recent struct page reorganization but I dunno, to be honest. Maybe we
> can have an _count offset for these longterm pins. It is not like we are
> using the whole ref count space, right?

Matthew had an interesting idea to pull pinned pages completely out from
any LRU and reuse that space in struct page for pinned refcounts. From some
initial investigation (read on elsewhere in this thread) it looks doable. I
was considering offsetting in refcount as well but on 32-bit architectures
there's not that many bits that I'd be really comfortable with that
solution...

> Another thing I was asking for is to actually account those longterm
> pinned pages and apply some control over those. They are basically mlock
> like and so their usage should better not be unbound.

Agreed here but I'd prefer to keep this discussion separate from 'how to
identify pinned pages'.

								Honza

Michal Hocko June 27, 2018, 11:32 a.m. UTC | #36

On Tue 26-06-18 18:48:25, Jan Kara wrote:
> On Tue 26-06-18 15:47:57, Michal Hocko wrote:
> > On Mon 18-06-18 12:21:46, Dan Williams wrote:
> > [...]
> > > I do think we should explore a page flag for pages that are "long
> > > term" pinned. Michal asked for something along these lines at LSF / MM
> > > so that the core-mm can give up on pages that the kernel has lost
> > > lifetime control. Michal, did I capture your ask correctly?
> > 
> > I am sorry to be late. I didn't ask for a page flag exactly. I've asked
> > for a way to query for the pin to be temporal or permanent. How that is
> > achieved is another question. Maybe we have some more spare room after
> > recent struct page reorganization but I dunno, to be honest. Maybe we
> > can have an _count offset for these longterm pins. It is not like we are
> > using the whole ref count space, right?
> 
> Matthew had an interesting idea to pull pinned pages completely out from
> any LRU and reuse that space in struct page for pinned refcounts. From some
> initial investigation (read on elsewhere in this thread) it looks doable. I
> was considering offsetting in refcount as well but on 32-bit architectures
> there's not that many bits that I'd be really comfortable with that
> solution...

I am really slow at following up this discussion. The problem I would
see with off-lru pages is that this can quickly turn into a weird
reclaim behavior. Especially when we are talking about a lot of memory.
It is true that such pages wouldn't be reclaimable directly but could
poke them in some way if we see too many of them while scanning LRU.

Not that this is a fundamental block stopper but this is the first thing
that popped out when thinking about such a solution. Maybe it is a good
start though.

Appart from that, do we really care about 32b here? Big DIO, IB users
seem to be 64b only AFAIU.

Jan Kara June 27, 2018, 11:53 a.m. UTC | #37

On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> On Tue 26-06-18 18:48:25, Jan Kara wrote:
> > On Tue 26-06-18 15:47:57, Michal Hocko wrote:
> > > On Mon 18-06-18 12:21:46, Dan Williams wrote:
> > > [...]
> > > > I do think we should explore a page flag for pages that are "long
> > > > term" pinned. Michal asked for something along these lines at LSF / MM
> > > > so that the core-mm can give up on pages that the kernel has lost
> > > > lifetime control. Michal, did I capture your ask correctly?
> > > 
> > > I am sorry to be late. I didn't ask for a page flag exactly. I've asked
> > > for a way to query for the pin to be temporal or permanent. How that is
> > > achieved is another question. Maybe we have some more spare room after
> > > recent struct page reorganization but I dunno, to be honest. Maybe we
> > > can have an _count offset for these longterm pins. It is not like we are
> > > using the whole ref count space, right?
> > 
> > Matthew had an interesting idea to pull pinned pages completely out from
> > any LRU and reuse that space in struct page for pinned refcounts. From some
> > initial investigation (read on elsewhere in this thread) it looks doable. I
> > was considering offsetting in refcount as well but on 32-bit architectures
> > there's not that many bits that I'd be really comfortable with that
> > solution...
> 
> I am really slow at following up this discussion. The problem I would
> see with off-lru pages is that this can quickly turn into a weird
> reclaim behavior. Especially when we are talking about a lot of memory.
> It is true that such pages wouldn't be reclaimable directly but could
> poke them in some way if we see too many of them while scanning LRU.
> 
> Not that this is a fundamental block stopper but this is the first thing
> that popped out when thinking about such a solution. Maybe it is a good
> start though.
> 
> Appart from that, do we really care about 32b here? Big DIO, IB users
> seem to be 64b only AFAIU.

IMO it is a bad habit to leave unpriviledged-user-triggerable oops in the
kernel even for uncommon platforms...

								Honza

Michal Hocko June 27, 2018, 11:59 a.m. UTC | #38

On Wed 27-06-18 13:53:49, Jan Kara wrote:
> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
[...]
> > Appart from that, do we really care about 32b here? Big DIO, IB users
> > seem to be 64b only AFAIU.
> 
> IMO it is a bad habit to leave unpriviledged-user-triggerable oops in the
> kernel even for uncommon platforms...

Absolutely agreed! I didn't mean to keep the blow up for 32b. I just
wanted to say that we can stay with a simple solution for 32b. I thought
the g-u-p-longterm has plugged the most obvious breakage already. But
maybe I just misunderstood.

Jan Kara June 27, 2018, 12:42 p.m. UTC | #39

On Wed 27-06-18 13:59:27, Michal Hocko wrote:
> On Wed 27-06-18 13:53:49, Jan Kara wrote:
> > On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> [...]
> > > Appart from that, do we really care about 32b here? Big DIO, IB users
> > > seem to be 64b only AFAIU.
> > 
> > IMO it is a bad habit to leave unpriviledged-user-triggerable oops in the
> > kernel even for uncommon platforms...
> 
> Absolutely agreed! I didn't mean to keep the blow up for 32b. I just
> wanted to say that we can stay with a simple solution for 32b. I thought
> the g-u-p-longterm has plugged the most obvious breakage already. But
> maybe I just misunderstood.

Most yes, but if you try hard enough, you can still trigger the oops e.g.
with appropriately set up direct IO when racing with writeback / reclaim.

								Honza

Jason Gunthorpe June 27, 2018, 2:57 p.m. UTC | #40

On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
> > On Wed 27-06-18 13:53:49, Jan Kara wrote:
> > > On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> > [...]
> > > > Appart from that, do we really care about 32b here? Big DIO, IB users
> > > > seem to be 64b only AFAIU.
> > > 
> > > IMO it is a bad habit to leave unpriviledged-user-triggerable oops in the
> > > kernel even for uncommon platforms...
> > 
> > Absolutely agreed! I didn't mean to keep the blow up for 32b. I just
> > wanted to say that we can stay with a simple solution for 32b. I thought
> > the g-u-p-longterm has plugged the most obvious breakage already. But
> > maybe I just misunderstood.
> 
> Most yes, but if you try hard enough, you can still trigger the oops e.g.
> with appropriately set up direct IO when racing with writeback / reclaim.

gup longterm is only different from normal gup if you have DAX and few
people do, which really means it doesn't help at all.. AFAIK??

Jason

Jan Kara June 27, 2018, 5:02 p.m. UTC | #41

On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
> > On Wed 27-06-18 13:59:27, Michal Hocko wrote:
> > > On Wed 27-06-18 13:53:49, Jan Kara wrote:
> > > > On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> > > [...]
> > > > > Appart from that, do we really care about 32b here? Big DIO, IB users
> > > > > seem to be 64b only AFAIU.
> > > > 
> > > > IMO it is a bad habit to leave unpriviledged-user-triggerable oops in the
> > > > kernel even for uncommon platforms...
> > > 
> > > Absolutely agreed! I didn't mean to keep the blow up for 32b. I just
> > > wanted to say that we can stay with a simple solution for 32b. I thought
> > > the g-u-p-longterm has plugged the most obvious breakage already. But
> > > maybe I just misunderstood.
> > 
> > Most yes, but if you try hard enough, you can still trigger the oops e.g.
> > with appropriately set up direct IO when racing with writeback / reclaim.
> 
> gup longterm is only different from normal gup if you have DAX and few
> people do, which really means it doesn't help at all.. AFAIK??

Right, what I wrote works only for DAX. For non-DAX situation g-u-p
longterm does not currently help at all. Sorry for confusion.

								Honza

John Hubbard June 28, 2018, 2:42 a.m. UTC | #42

On 06/27/2018 10:02 AM, Jan Kara wrote:
> On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
>> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
>>> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
>>>> On Wed 27-06-18 13:53:49, Jan Kara wrote:
>>>>> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
>>>> [...]
>>>>>> Appart from that, do we really care about 32b here? Big DIO, IB users
>>>>>> seem to be 64b only AFAIU.
>>>>>
>>>>> IMO it is a bad habit to leave unpriviledged-user-triggerable oops in the
>>>>> kernel even for uncommon platforms...
>>>>
>>>> Absolutely agreed! I didn't mean to keep the blow up for 32b. I just
>>>> wanted to say that we can stay with a simple solution for 32b. I thought
>>>> the g-u-p-longterm has plugged the most obvious breakage already. But
>>>> maybe I just misunderstood.
>>>
>>> Most yes, but if you try hard enough, you can still trigger the oops e.g.
>>> with appropriately set up direct IO when racing with writeback / reclaim.
>>
>> gup longterm is only different from normal gup if you have DAX and few
>> people do, which really means it doesn't help at all.. AFAIK??
> 
> Right, what I wrote works only for DAX. For non-DAX situation g-u-p
> longterm does not currently help at all. Sorry for confusion.
> 

OK, I've got an early version of this up and running, reusing the page->lru
fields. I'll clean it up and do some heavier testing, and post as a PATCH v2.

One question though: I'm still vague on the best actions to take in the following
functions:

    page_mkclean_one
    try_to_unmap_one

At the moment, they are both just doing an evil little early-out:

	if (PageDmaPinned(page))
		return false;

...but we talked about maybe waiting for the condition to clear, instead? Thoughts?

And if so, does it sound reasonable to refactor wait_on_page_bit_common(),
so that it learns how to wait for a bit that, while inside struct page, is
not within page->flags?


thanks,

Jan Kara June 28, 2018, 9:17 a.m. UTC | #43

On Wed 27-06-18 19:42:01, John Hubbard wrote:
> On 06/27/2018 10:02 AM, Jan Kara wrote:
> > On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
> >> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
> >>> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
> >>>> On Wed 27-06-18 13:53:49, Jan Kara wrote:
> >>>>> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> >>>> [...]
> >>>>>> Appart from that, do we really care about 32b here? Big DIO, IB users
> >>>>>> seem to be 64b only AFAIU.
> >>>>>
> >>>>> IMO it is a bad habit to leave unpriviledged-user-triggerable oops in the
> >>>>> kernel even for uncommon platforms...
> >>>>
> >>>> Absolutely agreed! I didn't mean to keep the blow up for 32b. I just
> >>>> wanted to say that we can stay with a simple solution for 32b. I thought
> >>>> the g-u-p-longterm has plugged the most obvious breakage already. But
> >>>> maybe I just misunderstood.
> >>>
> >>> Most yes, but if you try hard enough, you can still trigger the oops e.g.
> >>> with appropriately set up direct IO when racing with writeback / reclaim.
> >>
> >> gup longterm is only different from normal gup if you have DAX and few
> >> people do, which really means it doesn't help at all.. AFAIK??
> > 
> > Right, what I wrote works only for DAX. For non-DAX situation g-u-p
> > longterm does not currently help at all. Sorry for confusion.
> > 
> 
> OK, I've got an early version of this up and running, reusing the page->lru
> fields. I'll clean it up and do some heavier testing, and post as a PATCH v2.

Cool.

> One question though: I'm still vague on the best actions to take in the
> following functions:
> 
>     page_mkclean_one
>     try_to_unmap_one
> 
> At the moment, they are both just doing an evil little early-out:
> 
> 	if (PageDmaPinned(page))
> 		return false;
> 
> ...but we talked about maybe waiting for the condition to clear, instead?
> Thoughts?

What needs to happen in page_mkclean() depends on the caller. Most of the
callers really need to be sure the page is write-protected once
page_mkclean() returns. Those are:

  pagecache_isize_extended()
  fb_deferred_io_work()
  clear_page_dirty_for_io() if called for data-integrity writeback - which
    is currently known only in its caller (e.g. write_cache_pages()) where
    it can be determined as wbc->sync_mode == WB_SYNC_ALL. Getting this
    information into page_mkclean() will require some plumbing and
    clear_page_dirty_for_io() has some 50 callers but it's doable.

clear_page_dirty_for_io() for cleaning writeback (wbc->sync_mode !=
WB_SYNC_ALL) can just skip pinned pages and we probably need to do that as
otherwise memory cleaning would get stuck on pinned pages until RDMA
drivers release its pins.

> And if so, does it sound reasonable to refactor wait_on_page_bit_common(),
> so that it learns how to wait for a bit that, while inside struct page, is
> not within page->flags?

wait_on_page_bit_common() and associated wait queue handling is a fast path
so we should not make it slower for such special case as waiting for DMA
pin. OTOH we could probably refactor most of the infrastructure to take
pointer to word with flags instead of pointer to page. I'm not sure how the
result will look like but it's probably worth a try.

								Honza

Leon Romanovsky July 2, 2018, 5:52 a.m. UTC | #44

On Thu, Jun 28, 2018 at 11:17:43AM +0200, Jan Kara wrote:
> On Wed 27-06-18 19:42:01, John Hubbard wrote:
> > On 06/27/2018 10:02 AM, Jan Kara wrote:
> > > On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
> > >> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
> > >>> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
> > >>>> On Wed 27-06-18 13:53:49, Jan Kara wrote:
> > >>>>> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> > >>>> [...]
> > >>>>>> Appart from that, do we really care about 32b here? Big DIO, IB users
> > >>>>>> seem to be 64b only AFAIU.
> > >>>>>
> > >>>>> IMO it is a bad habit to leave unpriviledged-user-triggerable oops in the
> > >>>>> kernel even for uncommon platforms...
> > >>>>
> > >>>> Absolutely agreed! I didn't mean to keep the blow up for 32b. I just
> > >>>> wanted to say that we can stay with a simple solution for 32b. I thought
> > >>>> the g-u-p-longterm has plugged the most obvious breakage already. But
> > >>>> maybe I just misunderstood.
> > >>>
> > >>> Most yes, but if you try hard enough, you can still trigger the oops e.g.
> > >>> with appropriately set up direct IO when racing with writeback / reclaim.
> > >>
> > >> gup longterm is only different from normal gup if you have DAX and few
> > >> people do, which really means it doesn't help at all.. AFAIK??
> > >
> > > Right, what I wrote works only for DAX. For non-DAX situation g-u-p
> > > longterm does not currently help at all. Sorry for confusion.
> > >
> >
> > OK, I've got an early version of this up and running, reusing the page->lru
> > fields. I'll clean it up and do some heavier testing, and post as a PATCH v2.
>
> Cool.
>
> > One question though: I'm still vague on the best actions to take in the
> > following functions:
> >
> >     page_mkclean_one
> >     try_to_unmap_one
> >
> > At the moment, they are both just doing an evil little early-out:
> >
> > 	if (PageDmaPinned(page))
> > 		return false;
> >
> > ...but we talked about maybe waiting for the condition to clear, instead?
> > Thoughts?
>
> What needs to happen in page_mkclean() depends on the caller. Most of the
> callers really need to be sure the page is write-protected once
> page_mkclean() returns. Those are:
>
>   pagecache_isize_extended()
>   fb_deferred_io_work()
>   clear_page_dirty_for_io() if called for data-integrity writeback - which
>     is currently known only in its caller (e.g. write_cache_pages()) where
>     it can be determined as wbc->sync_mode == WB_SYNC_ALL. Getting this
>     information into page_mkclean() will require some plumbing and
>     clear_page_dirty_for_io() has some 50 callers but it's doable.
>
> clear_page_dirty_for_io() for cleaning writeback (wbc->sync_mode !=
> WB_SYNC_ALL) can just skip pinned pages and we probably need to do that as
> otherwise memory cleaning would get stuck on pinned pages until RDMA
> drivers release its pins.

Sorry for naive question, but won't it create too much dirty pages
so writeback will be called "non-stop" to rebalance watermarks without
ability to progress?

Thanks

John Hubbard July 2, 2018, 6:10 a.m. UTC | #45

On 07/01/2018 10:52 PM, Leon Romanovsky wrote:
> On Thu, Jun 28, 2018 at 11:17:43AM +0200, Jan Kara wrote:
>> On Wed 27-06-18 19:42:01, John Hubbard wrote:
>>> On 06/27/2018 10:02 AM, Jan Kara wrote:
>>>> On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
>>>>> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
>>>>>> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
>>>>>>> On Wed 27-06-18 13:53:49, Jan Kara wrote:
>>>>>>>> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
>>>>>>> [...]
>>> One question though: I'm still vague on the best actions to take in the
>>> following functions:
>>>
>>>     page_mkclean_one
>>>     try_to_unmap_one
>>>
>>> At the moment, they are both just doing an evil little early-out:
>>>
>>> 	if (PageDmaPinned(page))
>>> 		return false;
>>>
>>> ...but we talked about maybe waiting for the condition to clear, instead?
>>> Thoughts?
>>
>> What needs to happen in page_mkclean() depends on the caller. Most of the
>> callers really need to be sure the page is write-protected once
>> page_mkclean() returns. Those are:
>>
>>   pagecache_isize_extended()
>>   fb_deferred_io_work()
>>   clear_page_dirty_for_io() if called for data-integrity writeback - which
>>     is currently known only in its caller (e.g. write_cache_pages()) where
>>     it can be determined as wbc->sync_mode == WB_SYNC_ALL. Getting this
>>     information into page_mkclean() will require some plumbing and
>>     clear_page_dirty_for_io() has some 50 callers but it's doable.
>>
>> clear_page_dirty_for_io() for cleaning writeback (wbc->sync_mode !=
>> WB_SYNC_ALL) can just skip pinned pages and we probably need to do that as
>> otherwise memory cleaning would get stuck on pinned pages until RDMA
>> drivers release its pins.
> 
> Sorry for naive question, but won't it create too much dirty pages
> so writeback will be called "non-stop" to rebalance watermarks without
> ability to progress?
> 

That is an interesting point. 

Holding off page writeback of this region does seem like it could cause
problems under memory pressure. Maybe adjusting the watermarks so that we
tell the writeback  system, "all is well, just ignore this region until
we're done with it" might help? Any ideas here are welcome...

Longer term, maybe some additional work could allow the kernel to be able
to writeback the gup-pinned pages (while DMA is happening--snapshots), but
that seems like a pretty big overhaul.

thanks,

Leon Romanovsky July 2, 2018, 6:34 a.m. UTC | #46

On Sun, Jul 01, 2018 at 11:10:04PM -0700, John Hubbard wrote:
> On 07/01/2018 10:52 PM, Leon Romanovsky wrote:
> > On Thu, Jun 28, 2018 at 11:17:43AM +0200, Jan Kara wrote:
> >> On Wed 27-06-18 19:42:01, John Hubbard wrote:
> >>> On 06/27/2018 10:02 AM, Jan Kara wrote:
> >>>> On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
> >>>>> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
> >>>>>> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
> >>>>>>> On Wed 27-06-18 13:53:49, Jan Kara wrote:
> >>>>>>>> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> >>>>>>> [...]
> >>> One question though: I'm still vague on the best actions to take in the
> >>> following functions:
> >>>
> >>>     page_mkclean_one
> >>>     try_to_unmap_one
> >>>
> >>> At the moment, they are both just doing an evil little early-out:
> >>>
> >>> 	if (PageDmaPinned(page))
> >>> 		return false;
> >>>
> >>> ...but we talked about maybe waiting for the condition to clear, instead?
> >>> Thoughts?
> >>
> >> What needs to happen in page_mkclean() depends on the caller. Most of the
> >> callers really need to be sure the page is write-protected once
> >> page_mkclean() returns. Those are:
> >>
> >>   pagecache_isize_extended()
> >>   fb_deferred_io_work()
> >>   clear_page_dirty_for_io() if called for data-integrity writeback - which
> >>     is currently known only in its caller (e.g. write_cache_pages()) where
> >>     it can be determined as wbc->sync_mode == WB_SYNC_ALL. Getting this
> >>     information into page_mkclean() will require some plumbing and
> >>     clear_page_dirty_for_io() has some 50 callers but it's doable.
> >>
> >> clear_page_dirty_for_io() for cleaning writeback (wbc->sync_mode !=
> >> WB_SYNC_ALL) can just skip pinned pages and we probably need to do that as
> >> otherwise memory cleaning would get stuck on pinned pages until RDMA
> >> drivers release its pins.
> >
> > Sorry for naive question, but won't it create too much dirty pages
> > so writeback will be called "non-stop" to rebalance watermarks without
> > ability to progress?
> >
>
> That is an interesting point.
>
> Holding off page writeback of this region does seem like it could cause
> problems under memory pressure. Maybe adjusting the watermarks so that we
> tell the writeback  system, "all is well, just ignore this region until
> we're done with it" might help? Any ideas here are welcome...

AFAIR, it is per-zone, so the solution to count dirty-but-untouchable
number of pages to take them into account for accounting can work, but
it seems like an overkill. Can we create special ZONE for such gup
pages, or this is impossible too?

>
> Longer term, maybe some additional work could allow the kernel to be able
> to writeback the gup-pinned pages (while DMA is happening--snapshots), but
> that seems like a pretty big overhaul.
>
> thanks,
> --
> John Hubbard
> NVIDIA

John Hubbard July 2, 2018, 6:41 a.m. UTC | #47

On 07/01/2018 11:34 PM, Leon Romanovsky wrote:
> On Sun, Jul 01, 2018 at 11:10:04PM -0700, John Hubbard wrote:
>> On 07/01/2018 10:52 PM, Leon Romanovsky wrote:
>>> On Thu, Jun 28, 2018 at 11:17:43AM +0200, Jan Kara wrote:
>>>> On Wed 27-06-18 19:42:01, John Hubbard wrote:
>>>>> On 06/27/2018 10:02 AM, Jan Kara wrote:
>>>>>> On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
>>>>>>> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
>>>>>>>> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
>>>>>>>>> On Wed 27-06-18 13:53:49, Jan Kara wrote:
>>>>>>>>>> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
>>>>>>>>> [...]
>>>>> One question though: I'm still vague on the best actions to take in the
>>>>> following functions:
>>>>>
>>>>>     page_mkclean_one
>>>>>     try_to_unmap_one
>>>>>
>>>>> At the moment, they are both just doing an evil little early-out:
>>>>>
>>>>> 	if (PageDmaPinned(page))
>>>>> 		return false;
>>>>>
>>>>> ...but we talked about maybe waiting for the condition to clear, instead?
>>>>> Thoughts?
>>>>
>>>> What needs to happen in page_mkclean() depends on the caller. Most of the
>>>> callers really need to be sure the page is write-protected once
>>>> page_mkclean() returns. Those are:
>>>>
>>>>   pagecache_isize_extended()
>>>>   fb_deferred_io_work()
>>>>   clear_page_dirty_for_io() if called for data-integrity writeback - which
>>>>     is currently known only in its caller (e.g. write_cache_pages()) where
>>>>     it can be determined as wbc->sync_mode == WB_SYNC_ALL. Getting this
>>>>     information into page_mkclean() will require some plumbing and
>>>>     clear_page_dirty_for_io() has some 50 callers but it's doable.
>>>>
>>>> clear_page_dirty_for_io() for cleaning writeback (wbc->sync_mode !=
>>>> WB_SYNC_ALL) can just skip pinned pages and we probably need to do that as
>>>> otherwise memory cleaning would get stuck on pinned pages until RDMA
>>>> drivers release its pins.
>>>
>>> Sorry for naive question, but won't it create too much dirty pages
>>> so writeback will be called "non-stop" to rebalance watermarks without
>>> ability to progress?
>>>
>>
>> That is an interesting point.
>>
>> Holding off page writeback of this region does seem like it could cause
>> problems under memory pressure. Maybe adjusting the watermarks so that we
>> tell the writeback  system, "all is well, just ignore this region until
>> we're done with it" might help? Any ideas here are welcome...
> 
> AFAIR, it is per-zone, so the solution to count dirty-but-untouchable
> number of pages to take them into account for accounting can work, but
> it seems like an overkill. Can we create special ZONE for such gup
> pages, or this is impossible too?
> 

Let's see what Michal and others prefer. The zone idea intrigues me. 

thanks,

Jan Kara July 2, 2018, 6:58 a.m. UTC | #48

On Mon 02-07-18 08:52:51, Leon Romanovsky wrote:
> On Thu, Jun 28, 2018 at 11:17:43AM +0200, Jan Kara wrote:
> > On Wed 27-06-18 19:42:01, John Hubbard wrote:
> > > On 06/27/2018 10:02 AM, Jan Kara wrote:
> > > > On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
> > > >> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
> > > >>> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
> > > >>>> On Wed 27-06-18 13:53:49, Jan Kara wrote:
> > > >>>>> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> > > >>>> [...]
> > > >>>>>> Appart from that, do we really care about 32b here? Big DIO, IB users
> > > >>>>>> seem to be 64b only AFAIU.
> > > >>>>>
> > > >>>>> IMO it is a bad habit to leave unpriviledged-user-triggerable oops in the
> > > >>>>> kernel even for uncommon platforms...
> > > >>>>
> > > >>>> Absolutely agreed! I didn't mean to keep the blow up for 32b. I just
> > > >>>> wanted to say that we can stay with a simple solution for 32b. I thought
> > > >>>> the g-u-p-longterm has plugged the most obvious breakage already. But
> > > >>>> maybe I just misunderstood.
> > > >>>
> > > >>> Most yes, but if you try hard enough, you can still trigger the oops e.g.
> > > >>> with appropriately set up direct IO when racing with writeback / reclaim.
> > > >>
> > > >> gup longterm is only different from normal gup if you have DAX and few
> > > >> people do, which really means it doesn't help at all.. AFAIK??
> > > >
> > > > Right, what I wrote works only for DAX. For non-DAX situation g-u-p
> > > > longterm does not currently help at all. Sorry for confusion.
> > > >
> > >
> > > OK, I've got an early version of this up and running, reusing the page->lru
> > > fields. I'll clean it up and do some heavier testing, and post as a PATCH v2.
> >
> > Cool.
> >
> > > One question though: I'm still vague on the best actions to take in the
> > > following functions:
> > >
> > >     page_mkclean_one
> > >     try_to_unmap_one
> > >
> > > At the moment, they are both just doing an evil little early-out:
> > >
> > > 	if (PageDmaPinned(page))
> > > 		return false;
> > >
> > > ...but we talked about maybe waiting for the condition to clear, instead?
> > > Thoughts?
> >
> > What needs to happen in page_mkclean() depends on the caller. Most of the
> > callers really need to be sure the page is write-protected once
> > page_mkclean() returns. Those are:
> >
> >   pagecache_isize_extended()
> >   fb_deferred_io_work()
> >   clear_page_dirty_for_io() if called for data-integrity writeback - which
> >     is currently known only in its caller (e.g. write_cache_pages()) where
> >     it can be determined as wbc->sync_mode == WB_SYNC_ALL. Getting this
> >     information into page_mkclean() will require some plumbing and
> >     clear_page_dirty_for_io() has some 50 callers but it's doable.
> >
> > clear_page_dirty_for_io() for cleaning writeback (wbc->sync_mode !=
> > WB_SYNC_ALL) can just skip pinned pages and we probably need to do that as
> > otherwise memory cleaning would get stuck on pinned pages until RDMA
> > drivers release its pins.
> 
> Sorry for naive question, but won't it create too much dirty pages
> so writeback will be called "non-stop" to rebalance watermarks without
> ability to progress?

If the amount of pinned pages is more than allowed dirty limit then yes.
However dirty limit is there exactly to prevent too many
difficult-to-get-rid-of pages in page cache. So if your amount of pinned
pages crosses the dirty limit you have just violated this mm constraint and
you either need to modify your workload not to pin so many pages or you
need to verify so many dirty pages are indeed safe and increase the dirty
limit.

Realistically, I think we need to come up with a hard limit (similar to
mlock or account them as mlock) on these pinned pages because they are even
worse than plain dirty pages. However I'd strongly prefer to keep that
discussion separate to this discussion about method how to avoid memory
corruption / oopses because that is another can of worms with a big
bikeshedding potential.

								Honza

Jan Kara July 2, 2018, 7:02 a.m. UTC | #49

On Sun 01-07-18 23:10:04, John Hubbard wrote:
> On 07/01/2018 10:52 PM, Leon Romanovsky wrote:
> > On Thu, Jun 28, 2018 at 11:17:43AM +0200, Jan Kara wrote:
> >> On Wed 27-06-18 19:42:01, John Hubbard wrote:
> >>> On 06/27/2018 10:02 AM, Jan Kara wrote:
> >>>> On Wed 27-06-18 08:57:18, Jason Gunthorpe wrote:
> >>>>> On Wed, Jun 27, 2018 at 02:42:55PM +0200, Jan Kara wrote:
> >>>>>> On Wed 27-06-18 13:59:27, Michal Hocko wrote:
> >>>>>>> On Wed 27-06-18 13:53:49, Jan Kara wrote:
> >>>>>>>> On Wed 27-06-18 13:32:21, Michal Hocko wrote:
> >>>>>>> [...]
> >>> One question though: I'm still vague on the best actions to take in the
> >>> following functions:
> >>>
> >>>     page_mkclean_one
> >>>     try_to_unmap_one
> >>>
> >>> At the moment, they are both just doing an evil little early-out:
> >>>
> >>> 	if (PageDmaPinned(page))
> >>> 		return false;
> >>>
> >>> ...but we talked about maybe waiting for the condition to clear, instead?
> >>> Thoughts?
> >>
> >> What needs to happen in page_mkclean() depends on the caller. Most of the
> >> callers really need to be sure the page is write-protected once
> >> page_mkclean() returns. Those are:
> >>
> >>   pagecache_isize_extended()
> >>   fb_deferred_io_work()
> >>   clear_page_dirty_for_io() if called for data-integrity writeback - which
> >>     is currently known only in its caller (e.g. write_cache_pages()) where
> >>     it can be determined as wbc->sync_mode == WB_SYNC_ALL. Getting this
> >>     information into page_mkclean() will require some plumbing and
> >>     clear_page_dirty_for_io() has some 50 callers but it's doable.
> >>
> >> clear_page_dirty_for_io() for cleaning writeback (wbc->sync_mode !=
> >> WB_SYNC_ALL) can just skip pinned pages and we probably need to do that as
> >> otherwise memory cleaning would get stuck on pinned pages until RDMA
> >> drivers release its pins.
> > 
> > Sorry for naive question, but won't it create too much dirty pages
> > so writeback will be called "non-stop" to rebalance watermarks without
> > ability to progress?
> > 
> 
> That is an interesting point. 
> 
> Holding off page writeback of this region does seem like it could cause
> problems under memory pressure. Maybe adjusting the watermarks so that we
> tell the writeback  system, "all is well, just ignore this region until
> we're done with it" might help? Any ideas here are welcome...
> 
> Longer term, maybe some additional work could allow the kernel to be able
> to writeback the gup-pinned pages (while DMA is happening--snapshots), but
> that seems like a pretty big overhaul.

We could use bounce pages to safely writeback pinned pages. However I don't
think it would buy us anything. From MM point of view these pages are
impossible-to-get-rid-of (page refcount is increased) and pernamently-dirty
when GUP was for write (we don't know when dirty data arrives there). So
let's not just fool MM by pretending we can make them clean. That's going
to lead to just more problems down the road.

								Honza

Michal Hocko July 2, 2018, 10:36 a.m. UTC | #50

On Sun 01-07-18 23:41:21, John Hubbard wrote:
> On 07/01/2018 11:34 PM, Leon Romanovsky wrote:
> > On Sun, Jul 01, 2018 at 11:10:04PM -0700, John Hubbard wrote:
[...]
> >>> Sorry for naive question, but won't it create too much dirty pages
> >>> so writeback will be called "non-stop" to rebalance watermarks without
> >>> ability to progress?
> >>>
> >>
> >> That is an interesting point.
> >>
> >> Holding off page writeback of this region does seem like it could cause
> >> problems under memory pressure. Maybe adjusting the watermarks so that we
> >> tell the writeback  system, "all is well, just ignore this region until
> >> we're done with it" might help? Any ideas here are welcome...
> > 
> > AFAIR, it is per-zone, so the solution to count dirty-but-untouchable
> > number of pages to take them into account for accounting can work, but
> > it seems like an overkill. Can we create special ZONE for such gup
> > pages, or this is impossible too?
> > 
> 
> Let's see what Michal and others prefer. The zone idea intrigues me. 

No new zones please. Pinned pages are essentially mlocked pages, except
they are worse because they cannot be even migrated. What we really
needs is a) limit their usage and b) have a way to find out that pins
are not ephemeral and a special action needs to be taken. What is that
special action is yet to be decided but please do not add even more
complexity on top.

Michal Hocko July 2, 2018, 2:48 p.m. UTC | #51

On Mon 02-07-18 09:02:27, Jan Kara wrote:
> On Sun 01-07-18 23:10:04, John Hubbard wrote:
[...]
> > That is an interesting point. 
> > 
> > Holding off page writeback of this region does seem like it could cause
> > problems under memory pressure. Maybe adjusting the watermarks so that we
> > tell the writeback  system, "all is well, just ignore this region until
> > we're done with it" might help? Any ideas here are welcome...
> > 
> > Longer term, maybe some additional work could allow the kernel to be able
> > to writeback the gup-pinned pages (while DMA is happening--snapshots), but
> > that seems like a pretty big overhaul.
> 
> We could use bounce pages to safely writeback pinned pages. However I don't
> think it would buy us anything. From MM point of view these pages are
> impossible-to-get-rid-of (page refcount is increased) and pernamently-dirty
> when GUP was for write (we don't know when dirty data arrives there). So
> let's not just fool MM by pretending we can make them clean. That's going
> to lead to just more problems down the road.

Absolutely agreed! We really need to have means to identify those pages
first. Only then we can make an educated guess what to do about them.
Adding kludges here and there is a wrong way about dealing with this
whole problem. So try to focus on a) a reliable way to detect a longterm
pin and b) provide an API that would tell the page to be released by its
current owner (ideally in two modes, async to kick the process in the
background and continue with something else and sync if there is no
other way than waiting for the pin.

diff mbox

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 901943e4754b..ad65a2af069a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -100,6 +100,9 @@  enum pageflags {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
 	PG_young,
 	PG_idle,
+#endif
+#if defined(CONFIG_64BIT)
+	PG_dma_pinned,
 #endif
 	__NR_PAGEFLAGS,
 
@@ -381,6 +384,12 @@  TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#if defined(CONFIG_64BIT)
+PAGEFLAG(DmaPinned, dma_pinned, PF_ANY)
+#else
+PAGEFLAG_FALSE(DmaPinned)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..f62fd150b0d4 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@ 
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#if defined(CONFIG_64BIT)
+#define IF_HAVE_PG_DMA_PINNED(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_DMA_PINNED(flag,string)
+#endif
+
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
 	{1UL << PG_waiters,		"waiters"	},		\
@@ -104,7 +110,8 @@  IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
 IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
-IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
+IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
+IF_HAVE_PG_DMA_PINNED(PG_dma_pinned,	"dma_pinned"	)
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/gup.c b/mm/gup.c
index 73f0b3316fa7..fd6c77f33c16 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -659,7 +659,7 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
-	long i = 0;
+	long i = 0, j;
 	int err = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
@@ -764,6 +764,10 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	} while (nr_pages);
 
 out:
+	if (pages)
+		for (j = 0; j < i; j++)
+			SetPageDmaPinned(pages[j]);
+
 	return i ? i : err;
 }
 
@@ -1843,7 +1847,7 @@  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
 	unsigned long addr, len, end;
-	int nr = 0, ret = 0;
+	int nr = 0, ret = 0, i;
 
 	start &= PAGE_MASK;
 	addr = start;
@@ -1864,6 +1868,9 @@  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		ret = nr;
 	}
 
+	for (i = 0; i < nr; i++)
+		SetPageDmaPinned(pages[i]);
+
 	if (nr < nr_pages) {
 		/* Try to get the remaining pages with get_user_pages */
 		start += nr << PAGE_SHIFT;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..a96a7b20037c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1898,6 +1898,7 @@  inline void post_alloc_hook(struct page *page, unsigned int order,
 {
 	set_page_private(page, 0);
 	set_page_refcounted(page);
+	ClearPageDmaPinned(page);
 
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
diff --git a/mm/rmap.c b/mm/rmap.c
index 6db729dc4c50..37576f0a4645 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1360,6 +1360,8 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				flags & TTU_SPLIT_FREEZE, page);
 	}
 
+	if (PageDmaPinned(page))
+		return false;
 	/*
 	 * We have to assume the worse case ie pmd for invalidation. Note that
 	 * the page can not be free in this function as call of try_to_unmap()