diff mbox series

[RFC] mm: gup: add helper page_try_gup_pin(page)

Message ID 20191103112113.8256-1-hdanton@sina.com (mailing list archive)
State New, archived
Headers show
Series [RFC] mm: gup: add helper page_try_gup_pin(page) | expand

Commit Message

Hillf Danton Nov. 3, 2019, 11:21 a.m. UTC
A helper is added for mitigating the gup issue described at
https://lwn.net/Articles/784574/. It is unsafe to write out
a dirty page that is already gup pinned for DMA.

In the current writeback context, dirty pages are written out with
no detecting whether they have been gup pinned; Nor mark to keep
gupers off. In the gup context, file pages can be pinned with other
gupers and writeback ignored.

The factor, that no room, supposedly even one bit, in the current
page struct can be used for tracking gupers, makes the issue harder
to tackle.

The approach here is, because it makes no sense to allow a file page
to have multiple gupers at the same time, looking to make gupers
mutually exclusive, and then guper's singulairty helps to tell if a
guper is existing by staring at the change in page count.

The result of that sigularity is not yet 100% correct but something
of "best effort" as the effect of random get_page() is perhaps also
folded in it.
It is assumed the best effort is feasible/acceptable in practice
without the the cost of growing the page struct size by one bit,
were it true that something similar has been applied to the page
migrate and reclaim contexts for a while.

With the helper in place, we skip writing out a dirty page if a
guper is detected; On gupping, we give up pinning a file page due
to writeback or losing the race to become a guper.

The end result is, no gup-pinned page will be put under writeback.

It is based on next-20191031.

Signed-off-by: Hillf Danton <hdanton@sina.com>
---

--

Comments

John Hubbard Nov. 3, 2019, 8:20 p.m. UTC | #1
On 11/3/19 3:21 AM, Hillf Danton wrote:
> 
> A helper is added for mitigating the gup issue described at
> https://lwn.net/Articles/784574/. It is unsafe to write out
> a dirty page that is already gup pinned for DMA.
> 
> In the current writeback context, dirty pages are written out with
> no detecting whether they have been gup pinned; Nor mark to keep
> gupers off. In the gup context, file pages can be pinned with other
> gupers and writeback ignored.
> 
> The factor, that no room, supposedly even one bit, in the current
> page struct can be used for tracking gupers, makes the issue harder
> to tackle.

Well, as long as we're counting bits, I've taken 21 bits (!) to track
"gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
see my recently posted patchset for tracking dma-pinned pages:

https://lore.kernel.org/r/20191030224930.3990755-1-jhubbard@nvidia.com

Once that is merged, you will have this available:

   static inline bool page_dma_pinned(struct page *page);

...which will reliably track dma-pinned pages.

After that, we still need to convert a some more call sites (block/bio 
in particular) to the new pin_user_pages()/put_user_page() system, in 
order for filesystems to take advantage of it, but this approach has 
the advantage of actually tracking such pages, rather than faking it by 
hoping that there is only one gup caller at a time.


> 
> The approach here is, because it makes no sense to allow a file page
> to have multiple gupers at the same time, looking to make gupers

ohhh...no, that's definitely not a claim you can make.


> mutually exclusive, and then guper's singulairty helps to tell if a
> guper is existing by staring at the change in page count.
> 
> The result of that sigularity is not yet 100% correct but something
> of "best effort" as the effect of random get_page() is perhaps also
> folded in it.
> It is assumed the best effort is feasible/acceptable in practice
> without the the cost of growing the page struct size by one bit,
> were it true that something similar has been applied to the page
> migrate and reclaim contexts for a while.
> 
> With the helper in place, we skip writing out a dirty page if a
> guper is detected; On gupping, we give up pinning a file page due
> to writeback or losing the race to become a guper.
> 
> The end result is, no gup-pinned page will be put under writeback.

I think you must have missed the many contentious debates about the
tension between gup-pinned pages, and writeback. File systems can't
just ignore writeback in all cases. This patch leads to either
system hangs or filesystem corruption, in the presence of long-lasting
gup pins.

Really, this won't work. sorry.


thanks,

John Hubbard
NVIDIA
Hillf Danton Nov. 4, 2019, 4:34 a.m. UTC | #2
On Sun, 3 Nov 2019 12:20:10 -0800 John Hubbard wrote:
> On 11/3/19 3:21 AM, Hillf Danton wrote:
> > 
> > A helper is added for mitigating the gup issue described at
> > https://lwn.net/Articles/784574/. It is unsafe to write out
> > a dirty page that is already gup pinned for DMA.
> > 
> > In the current writeback context, dirty pages are written out with
> > no detecting whether they have been gup pinned; Nor mark to keep
> > gupers off. In the gup context, file pages can be pinned with other
> > gupers and writeback ignored.
> > 
> > The factor, that no room, supposedly even one bit, in the current
> > page struct can be used for tracking gupers, makes the issue harder
> > to tackle.
> 
> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please

Would you please specify the reasoning of tracking multiple gupers
for a dirty page? Do you mean that it is all fine for guper-A to add
changes to guper-B's data without warning and vice versa?

> see my recently posted patchset for tracking dma-pinned pages:
> 
> https://lore.kernel.org/r/20191030224930.3990755-1-jhubbard@nvidia.com
> 
> Once that is merged, you will have this available:
> 
>    static inline bool page_dma_pinned(struct page *page);
> 
> ...which will reliably track dma-pinned pages.
> 
> After that, we still need to convert a some more call sites (block/bio 
> in particular) to the new pin_user_pages()/put_user_page() system, in 
> order for filesystems to take advantage of it, but this approach has 
> the advantage of actually tracking such pages, rather than faking it by 
> hoping that there is only one gup caller at a time.
> 
> 
> > 
> > The approach here is, because it makes no sense to allow a file page
> > to have multiple gupers at the same time, looking to make gupers
> 
> ohhh...no, that's definitely not a claim you can make.
> 
> 
> > mutually exclusive, and then guper's singulairty helps to tell if a
> > guper is existing by staring at the change in page count.
> > 
> > The result of that sigularity is not yet 100% correct but something
> > of "best effort" as the effect of random get_page() is perhaps also
> > folded in it.
> > It is assumed the best effort is feasible/acceptable in practice
> > without the the cost of growing the page struct size by one bit,
> > were it true that something similar has been applied to the page
> > migrate and reclaim contexts for a while.
> > 
> > With the helper in place, we skip writing out a dirty page if a
> > guper is detected; On gupping, we give up pinning a file page due
> > to writeback or losing the race to become a guper.
> > 
> > The end result is, no gup-pinned page will be put under writeback.
> 
> I think you must have missed the many contentious debates about the
> tension between gup-pinned pages, and writeback. File systems can't
> just ignore writeback in all cases. This patch leads to either
> system hangs or filesystem corruption, in the presence of long-lasting
> gup pins.

The current risk of data corruption due to writeback with long-lived
gup references all ignored is zeroed out by detecting gup-pinned dirty
pages and skipping them; that may lead to problems you mention above.

Though I doubt anything helpful about it can be expected from fs in near
future, we have options for instance that gupers periodically release
their references and re-pin pages after data sync the same way as the
current flusher does.

> Really, this won't work. sorry.
> 
> 
> thanks,
> 
> John Hubbard
> NVIDIA
John Hubbard Nov. 4, 2019, 6:09 a.m. UTC | #3
On 11/3/19 8:34 PM, Hillf Danton wrote:
...
>>
>> Well, as long as we're counting bits, I've taken 21 bits (!) to track
>> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
> 
> Would you please specify the reasoning of tracking multiple gupers
> for a dirty page? Do you mean that it is all fine for guper-A to add
> changes to guper-B's data without warning and vice versa?

It's generally OK to call get_user_pages() on a page more than once.
And even though we are seeing some work to reduce the number of places
in the kernel that call get_user_pages(), there are still lots of call sites.
That means lots of combinations and situations that could result in more
than one gup call per page.

Furthermore, there is no mechanism, convention, documentation, nor anything
at all that attempts to enforce "for each page, get_user_pages() may only
be called once."


...
>>
>> I think you must have missed the many contentious debates about the
>> tension between gup-pinned pages, and writeback. File systems can't
>> just ignore writeback in all cases. This patch leads to either
>> system hangs or filesystem corruption, in the presence of long-lasting
>> gup pins.
> 
> The current risk of data corruption due to writeback with long-lived
> gup references all ignored is zeroed out by detecting gup-pinned dirty
> pages and skipping them; that may lead to problems you mention above.
> 

Here, I believe you're pointing out that the current situation in the
kernel is already broken, with respect to fs interactions (especially
writeback) with gup. Yes, you are correct, there is a problem.

> Though I doubt anything helpful about it can be expected from fs in near

Actually, fs and mm folks are working together to solve this.

> future, we have options for instance that gupers periodically release
> their references and re-pin pages after data sync the same way as the
> current flusher does.
> 

That's one idea. I don't see it as viable, given the behavior of, say,
a compute process running OpenCL jobs on a GPU that is connected via
a network or Infiniband card--the idea of "pause" really looks more like
"tear down the complicated multi-driver connection, writeback, then set it
all up again", I suspect. (And if we could easily interrupt the job, we'd
probably really be running with a page-fault-capable GPU plus and IB card
that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)

Anyway, this is not amenable to quick fixes, because the problem is
a couple of missing design pieces. Which we're working on putting in.
But meanwhile, smaller changes such as this one are just going to move
the problems to different places, rather than solving them. So it's best
not to do that.

thanks,
Jan Kara Nov. 4, 2019, 8:13 a.m. UTC | #4
On Sun 03-11-19 22:09:03, John Hubbard wrote:
> On 11/3/19 8:34 PM, Hillf Danton wrote:
> > future, we have options for instance that gupers periodically release
> > their references and re-pin pages after data sync the same way as the
> > current flusher does.
> > 
> 
> That's one idea. I don't see it as viable, given the behavior of, say,
> a compute process running OpenCL jobs on a GPU that is connected via
> a network or Infiniband card--the idea of "pause" really looks more like
> "tear down the complicated multi-driver connection, writeback, then set it
> all up again", I suspect. (And if we could easily interrupt the job, we'd
> probably really be running with a page-fault-capable GPU plus and IB card
> that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)
> 
> Anyway, this is not amenable to quick fixes, because the problem is
> a couple of missing design pieces. Which we're working on putting in.
> But meanwhile, smaller changes such as this one are just going to move
> the problems to different places, rather than solving them. So it's best
> not to do that.

Yeah, fully agreed here. Quick half baked fixes will make the current messy
situation even worse...

								Honza
Hillf Danton Nov. 4, 2019, 10:20 a.m. UTC | #5
On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> On 11/3/19 8:34 PM, Hillf Danton wrote:
> ...
> >>
> >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> >> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
> > 
> > Would you please specify the reasoning of tracking multiple gupers
> > for a dirty page? Do you mean that it is all fine for guper-A to add
> > changes to guper-B's data without warning and vice versa?
> 
> It's generally OK to call get_user_pages() on a page more than once.

Does this explain that it's generally OK to gup pin a page under
writeback and then start DMA to it behind the flusher's back without
warning? 

> And even though we are seeing some work to reduce the number of places
> in the kernel that call get_user_pages(), there are still lots of call sites.
> That means lots of combinations and situations that could result in more
> than one gup call per page.
> 
> Furthermore, there is no mechanism, convention, documentation, nor anything
> at all that attempts to enforce "for each page, get_user_pages() may only
> be called once."

What sense is this making wrt the data corruption resulting specifically
from multiple gup references?

> 
> ...
> >>
> >> I think you must have missed the many contentious debates about the
> >> tension between gup-pinned pages, and writeback. File systems can't
> >> just ignore writeback in all cases. This patch leads to either
> >> system hangs or filesystem corruption, in the presence of long-lasting
> >> gup pins.
> > 
> > The current risk of data corruption due to writeback with long-lived
> > gup references all ignored is zeroed out by detecting gup-pinned dirty
> > pages and skipping them; that may lead to problems you mention above.
> > 
> 
> Here, I believe you're pointing out that the current situation in the
> kernel is already broken, with respect to fs interactions (especially
> writeback) with gup. Yes, you are correct, there is a problem.
> 
> > Though I doubt anything helpful about it can be expected from fs in near
> 
> Actually, fs and mm folks are working together to solve this.
> 
> > future, we have options for instance that gupers periodically release
> > their references and re-pin pages after data sync the same way as the
> > current flusher does.
> > 
> 
> That's one idea. I don't see it as viable, given the behavior of, say,
> a compute process running OpenCL jobs on a GPU that is connected via
> a network or Infiniband card--the idea of "pause" really looks more like
> "tear down the complicated multi-driver connection, writeback, then set it
> all up again", I suspect. (And if we could easily interrupt the job, we'd
> probably really be running with a page-fault-capable GPU plus and IB card
> that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)

Well is it OK to shorten the behavior above to "data corruption in
writeback is tolerable in practice because of the expensive cost of
data sync"?

What is the point of writeback? Why can the writeback of long-lived
gup-pinned pages not be skipped while data sync can be entirely
ignored?

> Anyway, this is not amenable to quick fixes, because the problem is
> a couple of missing design pieces. Which we're working on putting in.
> But meanwhile, smaller changes such as this one are just going to move
> the problems to different places, rather than solving them. So it's best
> not to do that.
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
Jerome Glisse Nov. 4, 2019, 7:03 p.m. UTC | #6
On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
> 
> On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> > On 11/3/19 8:34 PM, Hillf Danton wrote:
> > ...
> > >>
> > >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> > >> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
> > > 
> > > Would you please specify the reasoning of tracking multiple gupers
> > > for a dirty page? Do you mean that it is all fine for guper-A to add
> > > changes to guper-B's data without warning and vice versa?
> > 
> > It's generally OK to call get_user_pages() on a page more than once.
> 
> Does this explain that it's generally OK to gup pin a page under
> writeback and then start DMA to it behind the flusher's back without
> warning? 

It can happens today, is it ok ... well no but we live in an imperfect
world. GUP have been abuse by few device driver over the years and those
never checked what it meant to use it so now we are left with existing
device driver that we can not break that do wrong thing.

I personaly think that we should use bounce page for writeback so that
writeback can still happens if a page is GUPed. John's patchset is the
first step to be able to identify GUPed page and maybe special case them.

> 
> > And even though we are seeing some work to reduce the number of places
> > in the kernel that call get_user_pages(), there are still lots of call sites.
> > That means lots of combinations and situations that could result in more
> > than one gup call per page.
> > 
> > Furthermore, there is no mechanism, convention, documentation, nor anything
> > at all that attempts to enforce "for each page, get_user_pages() may only
> > be called once."
> 
> What sense is this making wrt the data corruption resulting specifically
> from multiple gup references?

Multiple GUP references do not imply corruption. Only one or more devices
writing to the page while writeback is happening is a cause of corruption.
Multiple device writting in the same page concurrently is like multiple
CPU thread doing the same. Either the application/device drivers are doing
this rightfully on purpose or the application has a bug. Either way it is
not our problem (note here i am talking about userspace portion of the
device driver).


> > >> I think you must have missed the many contentious debates about the
> > >> tension between gup-pinned pages, and writeback. File systems can't
> > >> just ignore writeback in all cases. This patch leads to either
> > >> system hangs or filesystem corruption, in the presence of long-lasting
> > >> gup pins.
> > > 
> > > The current risk of data corruption due to writeback with long-lived
> > > gup references all ignored is zeroed out by detecting gup-pinned dirty
> > > pages and skipping them; that may lead to problems you mention above.
> > > 
> > 
> > Here, I believe you're pointing out that the current situation in the
> > kernel is already broken, with respect to fs interactions (especially
> > writeback) with gup. Yes, you are correct, there is a problem.
> > 
> > > Though I doubt anything helpful about it can be expected from fs in near
> > 
> > Actually, fs and mm folks are working together to solve this.
> > 
> > > future, we have options for instance that gupers periodically release
> > > their references and re-pin pages after data sync the same way as the
> > > current flusher does.
> > > 
> > 
> > That's one idea. I don't see it as viable, given the behavior of, say,
> > a compute process running OpenCL jobs on a GPU that is connected via
> > a network or Infiniband card--the idea of "pause" really looks more like
> > "tear down the complicated multi-driver connection, writeback, then set it
> > all up again", I suspect. (And if we could easily interrupt the job, we'd
> > probably really be running with a page-fault-capable GPU plus and IB card
> > that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)
> 
> Well is it OK to shorten the behavior above to "data corruption in
> writeback is tolerable in practice because of the expensive cost of
> data sync"?
> 
> What is the point of writeback? Why can the writeback of long-lived
> gup-pinned pages not be skipped while data sync can be entirely
> ignored?

I do not think we want that (skip writeback on GUPed page). I think what
we should do is use a bounce page ie take a snapshot of the page and
starts writeback with the snapshot. We need a snapshot because fs code
expect stable page content for things like encryption or hashing or other
crazy fs features :)

Cheers,
Jérôme
Hillf Danton Nov. 5, 2019, 4:27 a.m. UTC | #7
On Mon, 4 Nov 2019 14:03:55 -0500 Jerome Glisse wrote:
> 
> On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
> >
> > On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> > > On 11/3/19 8:34 PM, Hillf Danton wrote:
> > > ...
> > > >>
> > > >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> > > >> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
> > > >
> > > > Would you please specify the reasoning of tracking multiple gupers
> > > > for a dirty page? Do you mean that it is all fine for guper-A to add
> > > > changes to guper-B's data without warning and vice versa?
> > >
> > > It's generally OK to call get_user_pages() on a page more than once.
> >
> > Does this explain that it's generally OK to gup pin a page under
> > writeback and then start DMA to it behind the flusher's back without
> > warning?
> 
> It can happens today, is it ok ... well no but we live in an imperfect
> world. GUP have been abuse by few device driver over the years and those
> never checked what it meant to use it so now we are left with existing
> device driver that we can not break that do wrong thing.

See your point :)

> I personaly think that we should use bounce page for writeback so that
> writeback can still happens if a page is GUPed.

Gup can be prevented from falling foul of writeback IMHO if the page
under writeback, gup pinned or not, remains stable until it is no
longer dirty.

For that stability, either we can check PageWriteback on gup pinning
for instance as the RFC does or drivers can set a gup-pinned page
dirty only after DMA and start no more DMA until it is clean again.

As long as that stability is ensured writeback will no longer need to
take care of gup pin, long-lived or transient.

It seems unlike a request too strict to meet in practice wrt data
corruption, and bounce page for writeback sounds promising. Does it
need to do a memory copy?

> John's patchset is the
> first step to be able to identify GUPed page and maybe special case them.
> >
> > > And even though we are seeing some work to reduce the number of places
> > > in the kernel that call get_user_pages(), there are still lots of call sites.
> > > That means lots of combinations and situations that could result in more
> > > than one gup call per page.
> > >
> > > Furthermore, there is no mechanism, convention, documentation, nor anything
> > > at all that attempts to enforce "for each page, get_user_pages() may only
> > > be called once."
> >
> > What sense is this making wrt the data corruption resulting specifically
> > from multiple gup references?
> 
> Multiple GUP references do not imply corruption. Only one or more devices
> writing to the page while writeback is happening is a cause of corruption.
> Multiple device writting in the same page concurrently is like multiple
> CPU thread doing the same. Either the application/device drivers are doing
> this rightfully on purpose or the application has a bug. Either way it is
> not our problem (note here i am talking about userspace portion of the
> device driver).
> 
> 
> > > >> I think you must have missed the many contentious debates about the
> > > >> tension between gup-pinned pages, and writeback. File systems can't
> > > >> just ignore writeback in all cases. This patch leads to either
> > > >> system hangs or filesystem corruption, in the presence of long-lasting
> > > >> gup pins.
> > > >
> > > > The current risk of data corruption due to writeback with long-lived
> > > > gup references all ignored is zeroed out by detecting gup-pinned dirty
> > > > pages and skipping them; that may lead to problems you mention above.
> > > >
> > >
> > > Here, I believe you're pointing out that the current situation in the
> > > kernel is already broken, with respect to fs interactions (especially
> > > writeback) with gup. Yes, you are correct, there is a problem.
> > >
> > > > Though I doubt anything helpful about it can be expected from fs in near
> > >
> > > Actually, fs and mm folks are working together to solve this.
> > >
> > > > future, we have options for instance that gupers periodically release
> > > > their references and re-pin pages after data sync the same way as the
> > > > current flusher does.
> > > >
> > >
> > > That's one idea. I don't see it as viable, given the behavior of, say,
> > > a compute process running OpenCL jobs on a GPU that is connected via
> > > a network or Infiniband card--the idea of "pause" really looks more like
> > > "tear down the complicated multi-driver connection, writeback, then set it
> > > all up again", I suspect. (And if we could easily interrupt the job, we'd
> > > probably really be running with a page-fault-capable GPU plus and IB card
> > > that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)
> >
> > Well is it OK to shorten the behavior above to "data corruption in
> > writeback is tolerable in practice because of the expensive cost of
> > data sync"?
> >
> > What is the point of writeback? Why can the writeback of long-lived
> > gup-pinned pages not be skipped while data sync can be entirely
> > ignored?
> 
> I do not think we want that (skip writeback on GUPed page). I think what
> we should do is use a bounce page ie take a snapshot of the page and
> starts writeback with the snapshot. We need a snapshot because fs code
> expect stable page content for things like encryption or hashing or other
> crazy fs features :)
David Hildenbrand Nov. 5, 2019, 8:56 a.m. UTC | #8
On 04.11.19 20:03, Jerome Glisse wrote:
> On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
>>
>> On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
>>> On 11/3/19 8:34 PM, Hillf Danton wrote:
>>> ...
>>>>>
>>>>> Well, as long as we're counting bits, I've taken 21 bits (!) to track
>>>>> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
>>>>
>>>> Would you please specify the reasoning of tracking multiple gupers
>>>> for a dirty page? Do you mean that it is all fine for guper-A to add
>>>> changes to guper-B's data without warning and vice versa?
>>>
>>> It's generally OK to call get_user_pages() on a page more than once.
>>
>> Does this explain that it's generally OK to gup pin a page under
>> writeback and then start DMA to it behind the flusher's back without
>> warning?
> 
> It can happens today, is it ok ... well no but we live in an imperfect
> world. GUP have been abuse by few device driver over the years and those
> never checked what it meant to use it so now we are left with existing
> device driver that we can not break that do wrong thing.
> 
> I personaly think that we should use bounce page for writeback so that
> writeback can still happens if a page is GUPed. John's patchset is the
> first step to be able to identify GUPed page and maybe special case them.
> 
>>
>>> And even though we are seeing some work to reduce the number of places
>>> in the kernel that call get_user_pages(), there are still lots of call sites.
>>> That means lots of combinations and situations that could result in more
>>> than one gup call per page.
>>>
>>> Furthermore, there is no mechanism, convention, documentation, nor anything
>>> at all that attempts to enforce "for each page, get_user_pages() may only
>>> be called once."
>>
>> What sense is this making wrt the data corruption resulting specifically
>> from multiple gup references?
> 
> Multiple GUP references do not imply corruption. Only one or more devices
> writing to the page while writeback is happening is a cause of corruption.
> Multiple device writting in the same page concurrently is like multiple
> CPU thread doing the same. Either the application/device drivers are doing
> this rightfully on purpose or the application has a bug. Either way it is
> not our problem (note here i am talking about userspace portion of the
> device driver).
> 

If I'm not completely off, we can have multiple GUP references easily by 
using KVM+VFIO.
Jerome Glisse Nov. 5, 2019, 3:54 p.m. UTC | #9
On Tue, Nov 05, 2019 at 12:27:55PM +0800, Hillf Danton wrote:
> 
> On Mon, 4 Nov 2019 14:03:55 -0500 Jerome Glisse wrote:
> > 
> > On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
> > >
> > > On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> > > > On 11/3/19 8:34 PM, Hillf Danton wrote:
> > > > ...
> > > > >>
> > > > >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> > > > >> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
> > > > >
> > > > > Would you please specify the reasoning of tracking multiple gupers
> > > > > for a dirty page? Do you mean that it is all fine for guper-A to add
> > > > > changes to guper-B's data without warning and vice versa?
> > > >
> > > > It's generally OK to call get_user_pages() on a page more than once.
> > >
> > > Does this explain that it's generally OK to gup pin a page under
> > > writeback and then start DMA to it behind the flusher's back without
> > > warning?
> > 
> > It can happens today, is it ok ... well no but we live in an imperfect
> > world. GUP have been abuse by few device driver over the years and those
> > never checked what it meant to use it so now we are left with existing
> > device driver that we can not break that do wrong thing.
> 
> See your point :)
> 
> > I personaly think that we should use bounce page for writeback so that
> > writeback can still happens if a page is GUPed.
> 
> Gup can be prevented from falling foul of writeback IMHO if the page
> under writeback, gup pinned or not, remains stable until it is no
> longer dirty.
> 
> For that stability, either we can check PageWriteback on gup pinning
> for instance as the RFC does or drivers can set a gup-pinned page
> dirty only after DMA and start no more DMA until it is clean again.
> 
> As long as that stability is ensured writeback will no longer need to
> take care of gup pin, long-lived or transient.
> 
> It seems unlike a request too strict to meet in practice wrt data
> corruption, and bounce page for writeback sounds promising. Does it
> need to do a memory copy?

Once driver has GUP it does not check and re-check the struct page
so there is no synchronization whatsoever after GUP happened. In
fact for some driver you can not synchronize anything once the device
has been program. Many devices are not just simple DMA engine you
can start and stop at will (network, GPUs, ...).

So once a page is GUP there is just noway to garanty its stability
hence the best thing we can do is snapshot it to a bounce page.

Cheers,
Jérôme
Hillf Danton Nov. 6, 2019, 9:22 a.m. UTC | #10
On Tue, 5 Nov 2019 10:54:15 -0500 Jerome Glisse wrote:
> 
> On Tue, Nov 05, 2019 at 12:27:55PM +0800, Hillf Danton wrote:
> >
> > On Mon, 4 Nov 2019 14:03:55 -0500 Jerome Glisse wrote:
> > >
> > > On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
> > > >
> > > > On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> > > > > On 11/3/19 8:34 PM, Hillf Danton wrote:
> > > > > ...
> > > > > >>
> > > > > >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> > > > > >> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
> > > > > >
> > > > > > Would you please specify the reasoning of tracking multiple gupers
> > > > > > for a dirty page? Do you mean that it is all fine for guper-A to add
> > > > > > changes to guper-B's data without warning and vice versa?
> > > > >
> > > > > It's generally OK to call get_user_pages() on a page more than once.
> > > >
> > > > Does this explain that it's generally OK to gup pin a page under
> > > > writeback and then start DMA to it behind the flusher's back without
> > > > warning?
> > >
> > > It can happens today, is it ok ... well no but we live in an imperfect
> > > world. GUP have been abuse by few device driver over the years and those
> > > never checked what it meant to use it so now we are left with existing
> > > device driver that we can not break that do wrong thing.
> >
> > See your point :)
> >
> > > I personaly think that we should use bounce page for writeback so that
> > > writeback can still happens if a page is GUPed.
> >
> > Gup can be prevented from falling foul of writeback IMHO if the page
> > under writeback, gup pinned or not, remains stable until it is no
> > longer dirty.
> >
> > For that stability, either we can check PageWriteback on gup pinning
> > for instance as the RFC does or drivers can set a gup-pinned page
> > dirty only after DMA and start no more DMA until it is clean again.
> >
> > As long as that stability is ensured writeback will no longer need to
> > take care of gup pin, long-lived or transient.
> >
> > It seems unlike a request too strict to meet in practice wrt data
> > corruption, and bounce page for writeback sounds promising. Does it
> > need to do a memory copy?
> 
> Once driver has GUP it does not check and re-check the struct page
> so there is no synchronization whatsoever after GUP happened. In
> fact for some driver you can not synchronize anything once the device
> has been program. Many devices are not just simple DMA engine you
> can start and stop at will (network, GPUs, ...).

Because "there is no synchronization whatsoever after GUP happened,"
we need to take another close look at the reasoning for tracking
multiple gupers if the chance of their mutual data corruptions exists
in the wild. (If any sync mechanism sits between them to avoid data
corruption, then it seems single pin is enough.)

> So once a page is GUP there is just noway to garanty its stability
> hence the best thing we can do is snapshot it to a bounce page.

It becomes clearer OTOH that we are more likely than not moving in
the incorrect direction, in cases like how to detect gupers and what
to do for writeback if page is gup pinned, without a clear picture
of the bounce page in the first place. Any plan to post a patch just
for idea show?

Hillf
Jerome Glisse Nov. 6, 2019, 3:46 p.m. UTC | #11
On Wed, Nov 06, 2019 at 05:22:40PM +0800, Hillf Danton wrote:
> 
> On Tue, 5 Nov 2019 10:54:15 -0500 Jerome Glisse wrote:
> > 
> > On Tue, Nov 05, 2019 at 12:27:55PM +0800, Hillf Danton wrote:
> > >
> > > On Mon, 4 Nov 2019 14:03:55 -0500 Jerome Glisse wrote:
> > > >
> > > > On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
> > > > >
> > > > > On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> > > > > > On 11/3/19 8:34 PM, Hillf Danton wrote:
> > > > > > ...
> > > > > > >>
> > > > > > >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> > > > > > >> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
> > > > > > >
> > > > > > > Would you please specify the reasoning of tracking multiple gupers
> > > > > > > for a dirty page? Do you mean that it is all fine for guper-A to add
> > > > > > > changes to guper-B's data without warning and vice versa?
> > > > > >
> > > > > > It's generally OK to call get_user_pages() on a page more than once.
> > > > >
> > > > > Does this explain that it's generally OK to gup pin a page under
> > > > > writeback and then start DMA to it behind the flusher's back without
> > > > > warning?
> > > >
> > > > It can happens today, is it ok ... well no but we live in an imperfect
> > > > world. GUP have been abuse by few device driver over the years and those
> > > > never checked what it meant to use it so now we are left with existing
> > > > device driver that we can not break that do wrong thing.
> > >
> > > See your point :)
> > >
> > > > I personaly think that we should use bounce page for writeback so that
> > > > writeback can still happens if a page is GUPed.
> > >
> > > Gup can be prevented from falling foul of writeback IMHO if the page
> > > under writeback, gup pinned or not, remains stable until it is no
> > > longer dirty.
> > >
> > > For that stability, either we can check PageWriteback on gup pinning
> > > for instance as the RFC does or drivers can set a gup-pinned page
> > > dirty only after DMA and start no more DMA until it is clean again.
> > >
> > > As long as that stability is ensured writeback will no longer need to
> > > take care of gup pin, long-lived or transient.
> > >
> > > It seems unlike a request too strict to meet in practice wrt data
> > > corruption, and bounce page for writeback sounds promising. Does it
> > > need to do a memory copy?
> > 
> > Once driver has GUP it does not check and re-check the struct page
> > so there is no synchronization whatsoever after GUP happened. In
> > fact for some driver you can not synchronize anything once the device
> > has been program. Many devices are not just simple DMA engine you
> > can start and stop at will (network, GPUs, ...).
> 
> Because "there is no synchronization whatsoever after GUP happened,"
> we need to take another close look at the reasoning for tracking
> multiple gupers if the chance of their mutual data corruptions exists
> in the wild. (If any sync mechanism sits between them to avoid data
> corruption, then it seems single pin is enough.)

It does exist in the wild but the userspace application would be either
doing something stupid or something terribly clever. For instance you
can have 2 network interface writing to the same GUPed page but that is
because the application made the same request over two NICs and both
endup writting the samething.

You can also have 2 GUPer each writting to different part of the page
and never stepping on each others.

The point really is that from kernel point of view there is just no
way to know if the application is doing something wrong or if it just
perfectly fine. This is exactly the same thing as CPU threads, you do
not ask the kernel to ascertain wether what application threads are
doing is wrong or right.

So we have to live with the fact that we can have multiple GUPers and
that it is not our problems if that happens and we can do nothing
about it.

Note that we are removing GUP from some of those driver, ones where
the device can abide to mmu notifier. But that is just something
orthogonal to all this.


> > So once a page is GUP there is just noway to garanty its stability
> > hence the best thing we can do is snapshot it to a bounce page.
> 
> It becomes clearer OTOH that we are more likely than not moving in
> the incorrect direction, in cases like how to detect gupers and what
> to do for writeback if page is gup pinned, without a clear picture
> of the bounce page in the first place. Any plan to post a patch just
> for idea show?

The agreement so far is that we need to be able to identify GUPed
pages and this is what John's patchset does. Once we have that piece
than we can discuss what to do in respect of write-back. Which is
still something where there is no agreement as far as i remember the
outcome of the last discussion we had. I expect this will a topic
at next LSF/MM or maybe something we can flush out before.

In any case my opinion is bounce page is the best thing we can do,
from application and FS point of view it mimics the characteristics
of regular write-back just as if the write protection window of the
write-backed page was infinitly short.

Cheers,
Jérôme
Hillf Danton Nov. 7, 2019, 9:50 a.m. UTC | #12
On Wed, 6 Nov 2019 10:46:29 -0500 Jerome Glisse wrote:
> 
> On Wed, Nov 06, 2019 at 05:22:40PM +0800, Hillf Danton wrote:
[...]
> > >
> > > Once driver has GUP it does not check and re-check the struct page
> > > so there is no synchronization whatsoever after GUP happened. In
> > > fact for some driver you can not synchronize anything once the device
> > > has been program. Many devices are not just simple DMA engine you
> > > can start and stop at will (network, GPUs, ...).
> >
> > Because "there is no synchronization whatsoever after GUP happened,"
> > we need to take another close look at the reasoning for tracking
> > multiple gupers if the chance of their mutual data corruptions exists
> > in the wild. (If any sync mechanism sits between them to avoid data
> > corruption, then it seems single pin is enough.)
> 
> It does exist in the wild but the userspace application would be either
> doing something stupid or something terribly clever. For instance you
> can have 2 network interface writing to the same GUPed page but that is
> because the application made the same request over two NICs and both
> endup writting the samething.
> 
> You can also have 2 GUPer each writting to different part of the page
> and never stepping on each others.
> 
> The point really is that from kernel point of view there is just no
> way to know if the application is doing something wrong or if it just
> perfectly fine. This is exactly the same thing as CPU threads, you do
> not ask the kernel to ascertain wether what application threads are
> doing is wrong or right.
> 
> So we have to live with the fact that we can have multiple GUPers and
> that it is not our problems if that happens and we can do nothing
> about it.

Ok. Multiple gupers are a must-have, and perhaps their mutal data
corruptions as well.

> Note that we are removing GUP from some of those driver, ones where
> the device can abide to mmu notifier. But that is just something
> orthogonal to all this.
> 
> 
> > > So once a page is GUP there is just noway to garanty its stability
> > > hence the best thing we can do is snapshot it to a bounce page.
> >
> > It becomes clearer OTOH that we are more likely than not moving in
> > the incorrect direction, in cases like how to detect gupers and what
> > to do for writeback if page is gup pinned, without a clear picture
> > of the bounce page in the first place. Any plan to post a patch just
> > for idea show?
> 
> The agreement so far is that we need to be able to identify GUPed
> pages and this is what John's patchset does. Once we have that piece

Oh they are there, and barely trot away in case of long-lived pin.

> than we can discuss what to do in respect of write-back. Which is

Nobody seems to care what to do in the absence of gup pin.

> still something where there is no agreement as far as i remember the
> outcome of the last discussion we had. I expect this will a topic
> at next LSF/MM or maybe something we can flush out before.

These are the restraints we know

A, multiple gup pins
B, mutual data corruptions
C, no break of existing use cases
D, zero copy
E, feel free to add

then what is preventing an agreement like bounce page?

Because page migrate and reclaim have been working for a while with
gup pin taken into account, detecting it has no priority in any form
over the agreement on how to make a witeback page stable.

What seems more important, restriction B above makes C hard to meet
in any feasible approach trying to keep a writeback page stable, and
zero-copy makes it harder AFAICS.

> In any case my opinion is bounce page is the best thing we can do,
> from application and FS point of view it mimics the characteristics
> of regular write-back just as if the write protection window of the
> write-backed page was infinitly short.

A 100-line patch tells more than a 200-line explanation can and helps
to shorten the discussion prior to reaching an agreement.

Hillf
Jerome Glisse Nov. 7, 2019, 2:57 p.m. UTC | #13
On Thu, Nov 07, 2019 at 05:50:17PM +0800, Hillf Danton wrote:
> 
> On Wed, 6 Nov 2019 10:46:29 -0500 Jerome Glisse wrote:
> > 
> > On Wed, Nov 06, 2019 at 05:22:40PM +0800, Hillf Danton wrote:
> [...]
> > > >
> > > > Once driver has GUP it does not check and re-check the struct page
> > > > so there is no synchronization whatsoever after GUP happened. In
> > > > fact for some driver you can not synchronize anything once the device
> > > > has been program. Many devices are not just simple DMA engine you
> > > > can start and stop at will (network, GPUs, ...).
> > >
> > > Because "there is no synchronization whatsoever after GUP happened,"
> > > we need to take another close look at the reasoning for tracking
> > > multiple gupers if the chance of their mutual data corruptions exists
> > > in the wild. (If any sync mechanism sits between them to avoid data
> > > corruption, then it seems single pin is enough.)
> > 
> > It does exist in the wild but the userspace application would be either
> > doing something stupid or something terribly clever. For instance you
> > can have 2 network interface writing to the same GUPed page but that is
> > because the application made the same request over two NICs and both
> > endup writting the samething.
> > 
> > You can also have 2 GUPer each writting to different part of the page
> > and never stepping on each others.
> > 
> > The point really is that from kernel point of view there is just no
> > way to know if the application is doing something wrong or if it just
> > perfectly fine. This is exactly the same thing as CPU threads, you do
> > not ask the kernel to ascertain wether what application threads are
> > doing is wrong or right.
> > 
> > So we have to live with the fact that we can have multiple GUPers and
> > that it is not our problems if that happens and we can do nothing
> > about it.
> 
> Ok. Multiple gupers are a must-have, and perhaps their mutal data
> corruptions as well.
> 
> > Note that we are removing GUP from some of those driver, ones where
> > the device can abide to mmu notifier. But that is just something
> > orthogonal to all this.
> > 
> > 
> > > > So once a page is GUP there is just noway to garanty its stability
> > > > hence the best thing we can do is snapshot it to a bounce page.
> > >
> > > It becomes clearer OTOH that we are more likely than not moving in
> > > the incorrect direction, in cases like how to detect gupers and what
> > > to do for writeback if page is gup pinned, without a clear picture
> > > of the bounce page in the first place. Any plan to post a patch just
> > > for idea show?
> > 
> > The agreement so far is that we need to be able to identify GUPed
> > pages and this is what John's patchset does. Once we have that piece
> 
> Oh they are there, and barely trot away in case of long-lived pin.
> 
> > than we can discuss what to do in respect of write-back. Which is
> 
> Nobody seems to care what to do in the absence of gup pin.

I am not sure i follow ? Today we can not differentiate between GUP
and regular get_page(), if you use some combination of specific fs
and hardware you might get some BUG_ON() throws at you depending on
how lucky/unlucky you are. We can not solve this without being able
to differentiate between GUP and regular get_page(). Hence why John's
patchset is the first step in the right direction.

If there is no GUP on a page then regular writeback happens as it has
for years now so in absence of GUP i do not see any issue.


> > still something where there is no agreement as far as i remember the
> > outcome of the last discussion we had. I expect this will a topic
> > at next LSF/MM or maybe something we can flush out before.
> 
> These are the restraints we know
> 
> A, multiple gup pins
> B, mutual data corruptions
> C, no break of existing use cases
> D, zero copy

? What you mean by zero copy ?

> E, feel free to add
> 
> then what is preventing an agreement like bounce page?

There is 2 sides (AFAIR):
    - do not write back GUPed page and wait until GUP goes away to
      write them. But GUP can last as long as the uptime and we can
      loose data on power failure.
    - use a bounce page so that there is a chance we have some data
      on power failure

> 
> Because page migrate and reclaim have been working for a while with
> gup pin taken into account, detecting it has no priority in any form
> over the agreement on how to make a witeback page stable.

migrate just ignore GUPed page and thus there is no issue with migrate.
writeback is a special case here because some filesystem need a stable
page content and also we need to inhibit some fs specific things that
trigger BUG_ON() in set_page_dirty*()

> What seems more important, restriction B above makes C hard to meet
> in any feasible approach trying to keep a writeback page stable, and
> zero-copy makes it harder AFAICS.

writeback can use bounce page, zero copy ie not having to use bounce
page, is not an issue in fact in some cases we already use bounce page
(at the block device level).

> 
> > In any case my opinion is bounce page is the best thing we can do,
> > from application and FS point of view it mimics the characteristics
> > of regular write-back just as if the write protection window of the
> > write-backed page was infinitly short.
> 
> A 100-line patch tells more than a 200-line explanation can and helps
> to shorten the discussion prior to reaching an agreement.

It is not that trivial, you need to make sure every layer from fs down
to block device driver properly behave in front of bounce page. We have
such mechanism for bio but it is a the bio level but maybe it can be
dumped one level.

Cheers,
Jérôme
Hillf Danton Nov. 8, 2019, 9:38 a.m. UTC | #14
On Thu, 7 Nov 2019 09:57:48 -0500 Jerome Glisse wrote:
> 
> I am not sure i follow ? Today we can not differentiate between GUP
> and regular get_page(), if you use some combination of specific fs
> and hardware you might get some BUG_ON() throws at you depending on
> how lucky/unlucky you are. We can not solve this without being able
> to differentiate between GUP and regular get_page(). Hence why John's
> patchset is the first step in the right direction.
> 
What is the second one? And when? By who?

> If there is no GUP on a page then regular writeback happens as it has
> for years now so in absence of GUP i do not see any issue.
> 
> 
> > > still something where there is no agreement as far as i remember the
> > > outcome of the last discussion we had. I expect this will a topic
> > > at next LSF/MM or maybe something we can flush out before.
> >
> > These are the restraints we know
> >
> > A, multiple gup pins
> > B, mutual data corruptions
> > C, no break of existing use cases
> > D, zero copy
> 
> ? What you mean by zero copy ?
> 
Snippet that can be found at https://lwn.net/Articles/784574/

"get_user_pages() is a way to map user-space memory into the kernel's
address space; it will ensure that all of the requested pages have
been faulted into RAM (and locked there) and provide a kernel mapping
that, in turn, can be used for direct access by the kernel or (more
often) to set up zero-copy I/O operations.

> > E, feel free to add
> >
> > then what is preventing an agreement like bounce page?
> 
> There is 2 sides (AFAIR):
>     - do not write back GUPed page and wait until GUP goes away to
>       write them. But GUP can last as long as the uptime and we can
>       loose data on power failure.
>     - use a bounce page so that there is a chance we have some data
>       on power failure
> 
> >
> > Because page migrate and reclaim have been working for a while with
> > gup pin taken into account, detecting it has no priority in any form
> > over the agreement on how to make a witeback page stable.
> 
> migrate just ignore GUPed page and thus there is no issue with migrate.
> writeback is a special case here because some filesystem need a stable
> page content and also we need to inhibit some fs specific things that
> trigger BUG_ON() in set_page_dirty*()
> 
Which drivers so far have been snared by the BUG_ON()? Is there any
chance to fix them one after another? Otherwise what is making them
special (long-lived pin)?

After setting page dirty, is there any pending DMA transfer to the
dirty page? If yes, what is the point to do writeback for corrupted
data? If no, what is preventing the gup pin from being released?

> > What seems more important, restriction B above makes C hard to meet
> > in any feasible approach trying to keep a writeback page stable, and
> > zero-copy makes it harder AFAICS.
> 
> writeback can use bounce page, zero copy ie not having to use bounce
> page, is not an issue in fact in some cases we already use bounce page
> (at the block device level).
> 
> >
> > > In any case my opinion is bounce page is the best thing we can do,
> > > from application and FS point of view it mimics the characteristics
> > > of regular write-back just as if the write protection window of the
> > > write-backed page was infinitly short.
> >
> > A 100-line patch tells more than a 200-line explanation can and helps
> > to shorten the discussion prior to reaching an agreement.
> 
> It is not that trivial, you need to make sure every layer from fs down
> to block device driver properly behave in front of bounce page. We have
> such mechanism for bio but it is a the bio level but maybe it can be
> dumped one level.
Jerome Glisse Nov. 8, 2019, 1:59 p.m. UTC | #15
On Fri, Nov 08, 2019 at 05:38:37PM +0800, Hillf Danton wrote:
> 
> On Thu, 7 Nov 2019 09:57:48 -0500 Jerome Glisse wrote:
> > 
> > I am not sure i follow ? Today we can not differentiate between GUP
> > and regular get_page(), if you use some combination of specific fs
> > and hardware you might get some BUG_ON() throws at you depending on
> > how lucky/unlucky you are. We can not solve this without being able
> > to differentiate between GUP and regular get_page(). Hence why John's
> > patchset is the first step in the right direction.
> > 
> What is the second one? And when? By who?

Fix current BUG_ON() by not releasing buffer after write-back. Decide what
to do for write back and then a page is pin. It will happens once we are
done with pin changes and the infrastructure is in place. It will be done
by John or others.


> > If there is no GUP on a page then regular writeback happens as it has
> > for years now so in absence of GUP i do not see any issue.
> > 
> > 
> > > > still something where there is no agreement as far as i remember the
> > > > outcome of the last discussion we had. I expect this will a topic
> > > > at next LSF/MM or maybe something we can flush out before.
> > >
> > > These are the restraints we know
> > >
> > > A, multiple gup pins
> > > B, mutual data corruptions
> > > C, no break of existing use cases
> > > D, zero copy
> > 
> > ? What you mean by zero copy ?
> > 
> Snippet that can be found at https://lwn.net/Articles/784574/
> 
> "get_user_pages() is a way to map user-space memory into the kernel's
> address space; it will ensure that all of the requested pages have
> been faulted into RAM (and locked there) and provide a kernel mapping
> that, in turn, can be used for direct access by the kernel or (more
> often) to set up zero-copy I/O operations.
> 
> > > E, feel free to add
> > >
> > > then what is preventing an agreement like bounce page?
> > 
> > There is 2 sides (AFAIR):
> >     - do not write back GUPed page and wait until GUP goes away to
> >       write them. But GUP can last as long as the uptime and we can
> >       loose data on power failure.
> >     - use a bounce page so that there is a chance we have some data
> >       on power failure
> > 
> > >
> > > Because page migrate and reclaim have been working for a while with
> > > gup pin taken into account, detecting it has no priority in any form
> > > over the agreement on how to make a witeback page stable.
> > 
> > migrate just ignore GUPed page and thus there is no issue with migrate.
> > writeback is a special case here because some filesystem need a stable
> > page content and also we need to inhibit some fs specific things that
> > trigger BUG_ON() in set_page_dirty*()
> > 
> Which drivers so far have been snared by the BUG_ON()? Is there any
> chance to fix them one after another? Otherwise what is making them
> special (long-lived pin)?

It is a race thing so it does not necesarily happens, the longer the pin
the higher risk. It can only happens with some fs (i forgot which ones but
you can go read the previous threads). It is easy to fix all we need to
do is not release some fs structure of pinned pages after write-back.


> After setting page dirty, is there any pending DMA transfer to the
> dirty page? If yes, what is the point to do writeback for corrupted
> data? If no, what is preventing the gup pin from being released?

When user of GUP calls set_page_dirty() they _must_ be done with using the
page and it is the case today AFAICT, so no pending DMA. The GUP pin is
release after set_page_dirty() by all current users.

Note that current users all do that once they are done and they can hold
the pages for an _indifinite_ amount of time ie forever. They do dirty
pages in their teardown code path.

Hence the question that we need ti answer is what to do for dirty pages
while they are GUPed. Note that a page can be set dirty while GUPed
because CPU access can still happens and thus the regular dirtyness
tracking mechanism do operate on such page.

So page can go through:
    - GUPed by someone
    - write by CPU, mark as dirty
    - regular write-back kicks in
    - page is mark clean and fs might release data structure

    ... any amount of time ... what to do here if more CPU writes ?

    - GUPed user done and put_page but before call set_page_dirty()
      this might BUG_ON() inside fs code for some fs if the page was
      left clean on the CPU since last writeback

I would strongly advise to read previous thread this was discussed at
length.

Cheers,
Jérôme
diff mbox series

Patch

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1055,6 +1055,29 @@  static inline void put_page(struct page
 		__put_page(page);
 }
 
+/*
+ * @page must be pagecache page
+ */
+static inline bool page_try_gup_pin(struct page *page)
+{
+	int count;
+
+	page = compound_head(page);
+	count = page_ref_count(page);
+	smp_mb__after_atomic();
+
+	if (!count || count > page_mapcount(page) + 1 +
+				page_has_private(page))
+		return false;
+
+	if (page_ref_inc_return(page) == count + 1) {
+		smp_mb__after_atomic();
+		return true;
+	}
+	put_page(page);
+	return false;
+}
+
 /**
  * put_user_page() - release a gup-pinned page
  * @page:            pointer to page to be released
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -253,7 +253,11 @@  retry:
 	}
 
 	if (flags & FOLL_GET) {
-		if (unlikely(!try_get_page(page))) {
+		if (page_is_file_cache(page)) {
+			if (PageWriteback(page) || !page_try_gup_pin(page))
+				goto pin_fail;
+		} else if (unlikely(!try_get_page(page))) {
+pin_fail:
 			page = ERR_PTR(-ENOMEM);
 			goto out;
 		}
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2202,6 +2202,9 @@  int write_cache_pages(struct address_spa
 
 			done_index = page->index;
 
+			if (!page_try_gup_pin(page))
+				continue;
+
 			lock_page(page);
 
 			/*
@@ -2215,6 +2218,7 @@  int write_cache_pages(struct address_spa
 			if (unlikely(page->mapping != mapping)) {
 continue_unlock:
 				unlock_page(page);
+				put_page(page);
 				continue;
 			}
 
@@ -2236,6 +2240,11 @@  continue_unlock:
 
 			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
 			error = (*writepage)(page, wbc, data);
+			/*
+			 * protection of gup pin is no longer needed after
+			 * putting page under writeback
+			 */
+			put_page(page);
 			if (unlikely(error)) {
 				/*
 				 * Handle errors according to the type of