
[FYI] tux3: Core changes

Message ID 55545C2F.8040207@phunq.net (mailing list archive)
State New, archived

Commit Message

Daniel Phillips May 14, 2015, 8:26 a.m. UTC
Hi Rik,

Our linux-tux3 tree currently carries this 652 line diff
against core, to make Tux3 work. This is mainly by Hirofumi, except
the fs-writeback.c hook, which is by me. The main part you may be
interested in is rmap.c, which addresses the issues raised at the
2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]

   LSFMM: Page forking
   http://lwn.net/Articles/548091/

This is just an FYI. An upcoming Tux3 report will be a tour of the page
forking design and implementation. For now, this is just to give a
general sense of what we have done. We heard there are concerns about
how ptrace will work. I really am not familiar with the issue; could
you please explain what you were thinking of there?

Enjoy,

Daniel

[1] Which happened to be a 15 minute bus ride away from me at the time.

diffstat tux3.core.patch
 fs/Makefile               |    1 
 fs/fs-writeback.c         |  100 +++++++++++++++++++++++++--------
 include/linux/fs.h        |    6 +
 include/linux/mm.h        |    5 +
 include/linux/pagemap.h   |    2 
 include/linux/rmap.h      |   14 ++++
 include/linux/writeback.h |   23 +++++++
 mm/filemap.c              |   82 +++++++++++++++++++++++++++
 mm/rmap.c                 |  139 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/truncate.c             |   98 ++++++++++++++++++++------------
 10 files changed, 411 insertions(+), 59 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Rik van Riel May 14, 2015, 12:59 p.m. UTC | #1
On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> Hi Rik,
> 
> Our linux-tux3 tree currently carries this 652 line diff
> against core, to make Tux3 work. This is mainly by Hirofumi, except
> the fs-writeback.c hook, which is by me. The main part you may be
> interested in is rmap.c, which addresses the issues raised at the
> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
> 
>    LSFMM: Page forking
>    http://lwn.net/Articles/548091/
> 
> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> forking design and implementation. For now, this is just to give a
> general sense of what we have done. We heard there are concerns about
> how ptrace will work. I really am not familiar with the issue, could
> you please explain what you were thinking of there?

The issue is that things like ptrace, AIO, infiniband
RDMA, and other direct memory access subsystems can take
a reference to page A, which Tux3 clones into a new page B
when the process writes it.

However, while the process now points at page B, ptrace,
AIO, infiniband, etc will still be pointing at page A.

This causes the process and the other subsystem to each
look at a different page, instead of at shared state,
causing ptrace to do nothing, AIO and RDMA data to be
invisible (or corrupted), etc...
Daniel Phillips May 15, 2015, 12:06 a.m. UTC | #2
Hi Rik,

Added Mel, Andrea and Peterz to CC as interested parties. There are
probably others, please just jump in.

On 05/14/2015 05:59 AM, Rik van Riel wrote:
> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>> Hi Rik,
>>
>> Our linux-tux3 tree currently carries this 652 line diff
>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>> the fs-writeback.c hook, which is by me. The main part you may be
>> interested in is rmap.c, which addresses the issues raised at the
>> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
>>
>>    LSFMM: Page forking
>>    http://lwn.net/Articles/548091/
>>
>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>> forking design and implementation. For now, this is just to give a
>> general sense of what we have done. We heard there are concerns about
>> how ptrace will work. I really am not familiar with the issue, could
>> you please explain what you were thinking of there?
> 
> The issue is that things like ptrace, AIO, infiniband
> RDMA, and other direct memory access subsystems can take
> a reference to page A, which Tux3 clones into a new page B
> when the process writes it.
> 
> However, while the process now points at page B, ptrace,
> AIO, infiniband, etc will still be pointing at page A.
> 
> This causes the process and the other subsystem to each
> look at a different page, instead of at shared state,
> causing ptrace to do nothing, AIO and RDMA data to be
> invisible (or corrupted), etc...

Is this a bit like page migration?

Regards,

Daniel
Rik van Riel May 15, 2015, 3:06 a.m. UTC | #3
On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> Hi Rik,
> 
> Added Mel, Andrea and Peterz to CC as interested parties. There are
> probably others, please just jump in.
> 
> On 05/14/2015 05:59 AM, Rik van Riel wrote:
>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>> Hi Rik,
>>>
>>> Our linux-tux3 tree currently carries this 652 line diff
>>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>>> the fs-writeback.c hook, which is by me. The main part you may be
>>> interested in is rmap.c, which addresses the issues raised at the
>>> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
>>>
>>>    LSFMM: Page forking
>>>    http://lwn.net/Articles/548091/
>>>
>>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>>> forking design and implementation. For now, this is just to give a
>>> general sense of what we have done. We heard there are concerns about
>>> how ptrace will work. I really am not familiar with the issue, could
>>> you please explain what you were thinking of there?
>>
>> The issue is that things like ptrace, AIO, infiniband
>> RDMA, and other direct memory access subsystems can take
>> a reference to page A, which Tux3 clones into a new page B
>> when the process writes it.
>>
>> However, while the process now points at page B, ptrace,
>> AIO, infiniband, etc will still be pointing at page A.
>>
>> This causes the process and the other subsystem to each
>> look at a different page, instead of at shared state,
>> causing ptrace to do nothing, AIO and RDMA data to be
>> invisible (or corrupted), etc...
> 
> Is this a bit like page migration?

Yes. Page migration will fail if there is an "extra"
reference to the page that is not accounted for by
the migration code.

Only pages that have no extra refcount can be migrated.

Similarly, your COW code needs to fail if there is an
extra reference count pinning the page. As long as
the page has a user that you cannot migrate, you cannot
move any of the other users over. They may rely on data
written by the hidden-to-you user, and the hidden-to-you
user may write to the page when you think it is a read
only stable snapshot.
Mel Gorman May 15, 2015, 8:05 a.m. UTC | #4
On Thu, May 14, 2015 at 05:06:39PM -0700, Daniel Phillips wrote:
> Hi Rik,
> 
> Added Mel, Andrea and Peterz to CC as interested parties. There are
> probably others, please just jump in.
> 
> On 05/14/2015 05:59 AM, Rik van Riel wrote:
> > On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> >> Hi Rik,
> >>
> >> Our linux-tux3 tree currently carries this 652 line diff
> >> against core, to make Tux3 work. This is mainly by Hirofumi, except
> >> the fs-writeback.c hook, which is by me. The main part you may be
> >> interested in is rmap.c, which addresses the issues raised at the
> >> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
> >>
> >>    LSFMM: Page forking
> >>    http://lwn.net/Articles/548091/
> >>
> >> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> >> forking design and implementation. For now, this is just to give a
> >> general sense of what we have done. We heard there are concerns about
> >> how ptrace will work. I really am not familiar with the issue, could
> >> you please explain what you were thinking of there?
> > 
> > The issue is that things like ptrace, AIO, infiniband
> > RDMA, and other direct memory access subsystems can take
> > a reference to page A, which Tux3 clones into a new page B
> > when the process writes it.
> > 
> > However, while the process now points at page B, ptrace,
> > AIO, infiniband, etc will still be pointing at page A.
> > 
> > This causes the process and the other subsystem to each
> > look at a different page, instead of at shared state,
> > causing ptrace to do nothing, AIO and RDMA data to be
> > invisible (or corrupted), etc...
> 
> Is this a bit like page migration?
> 

No, it's not.
Mel Gorman May 15, 2015, 8:09 a.m. UTC | #5
On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> > Hi Rik,
> > 
> > Added Mel, Andrea and Peterz to CC as interested parties. There are
> > probably others, please just jump in.
> > 
> > On 05/14/2015 05:59 AM, Rik van Riel wrote:
> >> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> >>> Hi Rik,
> >>>
> >>> Our linux-tux3 tree currently carries this 652 line diff
> >>> against core, to make Tux3 work. This is mainly by Hirofumi, except
> >>> the fs-writeback.c hook, which is by me. The main part you may be
> >>> interested in is rmap.c, which addresses the issues raised at the
> >>> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
> >>>
> >>>    LSFMM: Page forking
> >>>    http://lwn.net/Articles/548091/
> >>>
> >>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> >>> forking design and implementation. For now, this is just to give a
> >>> general sense of what we have done. We heard there are concerns about
> >>> how ptrace will work. I really am not familiar with the issue, could
> >>> you please explain what you were thinking of there?
> >>
> >> The issue is that things like ptrace, AIO, infiniband
> >> RDMA, and other direct memory access subsystems can take
> >> a reference to page A, which Tux3 clones into a new page B
> >> when the process writes it.
> >>
> >> However, while the process now points at page B, ptrace,
> >> AIO, infiniband, etc will still be pointing at page A.
> >>
> >> This causes the process and the other subsystem to each
> >> look at a different page, instead of at shared state,
> >> causing ptrace to do nothing, AIO and RDMA data to be
> >> invisible (or corrupted), etc...
> > 
> > Is this a bit like page migration?
> 
> Yes. Page migration will fail if there is an "extra"
> reference to the page that is not accounted for by
> the migration code.
> 

When I said it's not like page migration, I was referring to the fact
that a COW on a pinned page for RDMA is a different problem to page
migration. The COW of a pinned page can lead to lost writes or
corruption, depending on the ordering of events. Page migration fails
when there are unexpected references, to avoid this class of issue;
that is fine for page migration, but it may be a critical failure in a
filesystem, depending on exactly why the copy is required.
Daniel Phillips May 15, 2015, 9:38 a.m. UTC | #6
On 05/14/2015 08:06 PM, Rik van Riel wrote:
> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>>
>>> This causes the process and the other subsystem to each
>>> look at a different page, instead of at shared state,
>>> causing ptrace to do nothing, AIO and RDMA data to be
>>> invisible (or corrupted), etc...
>>
>> Is this a bit like page migration?
> 
> Yes. Page migration will fail if there is an "extra"
> reference to the page that is not accounted for by
> the migration code.
> 
> Only pages that have no extra refcount can be migrated.
> 
> Similarly, your cow code needs to fail if there is an
> extra reference count pinning the page. As long as
> the page has a user that you cannot migrate, you cannot
> move any of the other users over. They may rely on data
> written by the hidden-to-you user, and the hidden-to-you
> user may write to the page when you think it is a read
> only stable snapshot.

Please bear with me as I study these cases one by one.

First one is ptrace. Only for executable files, right?
Maybe we don't need to fork pages in executable files.

Uprobes... If somebody puts a breakpoint in a page and
we fork it, the replacement page has a copy of the
breakpoint, and all the code on the page. Did anything
break?

Note: we have the option of being cowardly and just not
doing page forking for mmapped files, or certain kinds
of mmapped files, etc. But first we should give it the
old college try, to see if absolute perfection is
possible and practical.

Regards,

Daniel
Daniel Phillips May 15, 2015, 9:54 a.m. UTC | #7
On 05/15/2015 01:09 AM, Mel Gorman wrote:
> On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
>> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>>> The issue is that things like ptrace, AIO, infiniband
>>>> RDMA, and other direct memory access subsystems can take
>>>> a reference to page A, which Tux3 clones into a new page B
>>>> when the process writes it.
>>>>
>>>> However, while the process now points at page B, ptrace,
>>>> AIO, infiniband, etc will still be pointing at page A.
>>>>
>>>> This causes the process and the other subsystem to each
>>>> look at a different page, instead of at shared state,
>>>> causing ptrace to do nothing, AIO and RDMA data to be
>>>> invisible (or corrupted), etc...
>>>
>>> Is this a bit like page migration?
>>
>> Yes. Page migration will fail if there is an "extra"
>> reference to the page that is not accounted for by
>> the migration code.
> 
> When I said it's not like page migration, I was referring to the fact
> that a COW on a pinned page for RDMA is a different problem to page
> migration. The COW of a pinned page can lead to lost writes or
> corruption depending on the ordering of events.

I see the lost writes case, but not the corruption case. Do you
mean corruption by changing a page already in writeout? If so,
don't all filesystems have that problem?

If RDMA to a mmapped file races with write(2) to the same file,
maybe it is reasonable and expected to lose some data.

> Page migration fails
> when there are unexpected problems to avoid this class of issue which is
> fine for page migration but may be a critical failure in a filesystem
> depending on exactly why the copy is required.

Regards,

Daniel
Mel Gorman May 15, 2015, 11 a.m. UTC | #8
On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
> 
> 
> On 05/15/2015 01:09 AM, Mel Gorman wrote:
> > On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> >> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>>> The issue is that things like ptrace, AIO, infiniband
> >>>> RDMA, and other direct memory access subsystems can take
> >>>> a reference to page A, which Tux3 clones into a new page B
> >>>> when the process writes it.
> >>>>
> >>>> However, while the process now points at page B, ptrace,
> >>>> AIO, infiniband, etc will still be pointing at page A.
> >>>>
> >>>> This causes the process and the other subsystem to each
> >>>> look at a different page, instead of at shared state,
> >>>> causing ptrace to do nothing, AIO and RDMA data to be
> >>>> invisible (or corrupted), etc...
> >>>
> >>> Is this a bit like page migration?
> >>
> >> Yes. Page migration will fail if there is an "extra"
> >> reference to the page that is not accounted for by
> >> the migration code.
> > 
> > When I said it's not like page migration, I was referring to the fact
> > that a COW on a pinned page for RDMA is a different problem to page
> > migration. The COW of a pinned page can lead to lost writes or
> > corruption depending on the ordering of events.
> 
> I see the lost writes case, but not the corruption case,

Data corruption can occur depending on the ordering of events and the
application's expectations. If a process starts IO, RDMA pins the page
for read, and forks are combined with writes from another thread, then
when the IO completes the reads may not be visible. The application may
take improper action at that point.

Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
class of problem.

You can choose to not define this as data corruption because the kernel
is not directly involved and that's your call.

> Do you
> mean corruption by changing a page already in writeout? If so,
> don't all filesystems have that problem?
> 

No, the problem is different. Backing devices requiring stable pages will
block the write until the IO is complete. For those that do not require
stable pages it's ok to allow the write as long as the page is dirtied so
that it'll be written out again and no data is lost.

> If RDMA to a mmapped file races with write(2) to the same file,
> maybe it is reasonable and expected to lose some data.
> 

In the RDMA case, there is at least application awareness to work around
the problems. Normally it's ok to have both mapped and write() access
to data although userspace might need a lock to co-ordinate updates and
event ordering.
David Lang May 16, 2015, 10:38 p.m. UTC | #9
On Fri, 15 May 2015, Mel Gorman wrote:

> On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
>>
>>
>> On 05/15/2015 01:09 AM, Mel Gorman wrote:
>>> On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
>>>> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>>>>> The issue is that things like ptrace, AIO, infiniband
>>>>>> RDMA, and other direct memory access subsystems can take
>>>>>> a reference to page A, which Tux3 clones into a new page B
>>>>>> when the process writes it.
>>>>>>
>>>>>> However, while the process now points at page B, ptrace,
>>>>>> AIO, infiniband, etc will still be pointing at page A.
>>>>>>
>>>>>> This causes the process and the other subsystem to each
>>>>>> look at a different page, instead of at shared state,
>>>>>> causing ptrace to do nothing, AIO and RDMA data to be
>>>>>> invisible (or corrupted), etc...
>>>>>
>>>>> Is this a bit like page migration?
>>>>
>>>> Yes. Page migration will fail if there is an "extra"
>>>> reference to the page that is not accounted for by
>>>> the migration code.
>>>
>>> When I said it's not like page migration, I was referring to the fact
>>> that a COW on a pinned page for RDMA is a different problem to page
>>> migration. The COW of a pinned page can lead to lost writes or
>>> corruption depending on the ordering of events.
>>
>> I see the lost writes case, but not the corruption case,
>
> Data corruption can occur depending on the ordering of events and the
> application's expectations. If a process starts IO, RDMA pins the page
> for read and forks are combined with writes from another thread then when
> the IO completes the reads may not be visible. The application may take
> improper action at that point.

If tux3 forks the page and writes the copy while the original page is 
being modified by other things, some of the changes won't be in the 
version written (and this could catch partial writes with 'interesting' 
results if the forking happens at the wrong time).

But if the original page gets re-marked as needing to be written out when it's 
changed by one of the other things that are accessing it, there shouldn't be any 
long-term corruption.

As far as short-term corruption goes, any time you have a page mmapped it could 
get written out at any time, with only some of the application changes applied 
to it, so this sort of corruption could happen anyway, couldn't it?

> Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
> class of problem.
>
> You can choose to not define this as data corruption because the kernel
> is not directly involved and that's your call.
>
>> Do you
>> mean corruption by changing a page already in writeout? If so,
>> don't all filesystems have that problem?
>>
>
> No, the problem is different. Backing devices requiring stable pages will
> block the write until the IO is complete. For those that do not require
> stable pages it's ok to allow the write as long as the page is dirtied so
> that it'll be written out again and no data is lost.

So if tux3 is prevented from forking the page in cases where the write 
would be blocked, and the page will get forked again for follow-up writes 
if it's modified again otherwise, won't this be the same thing?

David Lang

>> If RDMA to a mmapped file races with write(2) to the same file,
>> maybe it is reasonable and expected to lose some data.
>>
>
> In the RDMA case, there is at least application awareness to work around
> the problems. Normally it's ok to have both mapped and write() access
> to data although userspace might need a lock to co-ordinate updates and
> event ordering.
>
>
Boaz Harrosh May 17, 2015, 1:26 p.m. UTC | #10
On 05/14/2015 03:59 PM, Rik van Riel wrote:
> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>> Hi Rik,
<>
> 
> The issue is that things like ptrace, AIO, infiniband
> RDMA, and other direct memory access subsystems can take
> a reference to page A, which Tux3 clones into a new page B
> when the process writes it.
> 
> However, while the process now points at page B, ptrace,
> AIO, infiniband, etc will still be pointing at page A.
> 

All these problems can also happen with truncate+new-extending-write

It is the responsibility of the application to take file/range locks
to prevent these page-pinned problems.

> This causes the process and the other subsystem to each
> look at a different page, instead of at shared state,
> causing ptrace to do nothing, AIO and RDMA data to be
> invisible (or corrupted), etc...
> 

Again, these problems already exist. Consider each in-place write as
being a truncate (punch hole) + new write; is that not the same?

Cheers
Boaz

Rik van Riel May 18, 2015, 2:20 a.m. UTC | #11
On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>> Hi Rik,
> <>
>>
>> The issue is that things like ptrace, AIO, infiniband
>> RDMA, and other direct memory access subsystems can take
>> a reference to page A, which Tux3 clones into a new page B
>> when the process writes it.
>>
>> However, while the process now points at page B, ptrace,
>> AIO, infiniband, etc will still be pointing at page A.
>>
> 
> All these problems can also happen with truncate+new-extending-write
> 
> It is the responsibility of the application to take file/range locks
> to prevent these page-pinned problems.

It is unreasonable to expect a process that is being ptraced
(potentially without its knowledge) to take special measures
to protect the ptraced memory from disappearing.

It is impossible for the debugger to take those special measures
for anonymous memory, or unlinked inodes.

I don't think your requirement is workable or reasonable.
Boaz Harrosh May 18, 2015, 7:58 a.m. UTC | #12
On 05/18/2015 05:20 AM, Rik van Riel wrote:
> On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
>> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>>> Hi Rik,
>> <>
>>>
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>>
>>
>> All these problems can also happen with truncate+new-extending-write
>>
>> It is the responsibility of the application to take file/range locks
>> to prevent these page-pinned problems.
> 
> It is unreasonable to expect a process that is being ptraced
> (potentially without its knowledge) to take special measures
> to protect the ptraced memory from disappearing.

If the memory disappears, that's a bug. No, the memory is still there;
it is just not reflecting the latest content of the fs-file.

> 
> It is impossible for the debugger to take those special measures
> for anonymous memory, or unlinked inodes.
> 

Why? One line of added code after the open and before the mmap: do an flock.

> I don't think your requirement is workable or reasonable.
> 

Therefore it is unreasonable to write/modify a ptraced process's
file.

Again, what I'm saying is that COWing a page on write has the same effect
as truncate+write. They are both allowed, and both might give you the same
"stale" effect. So the precedent is there. We are not introducing a new
anomaly, just a new instance of it. I guess the question is what
applications/procedures are going to break. It will need lots of testing
and real-life installations to answer that, I guess.

Thanks
Boaz

Mel Gorman May 18, 2015, 12:57 p.m. UTC | #13
On Sat, May 16, 2015 at 03:38:04PM -0700, David Lang wrote:
> On Fri, 15 May 2015, Mel Gorman wrote:
> 
> >On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
> >>
> >>
> >>On 05/15/2015 01:09 AM, Mel Gorman wrote:
> >>>On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> >>>>On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>>>>>The issue is that things like ptrace, AIO, infiniband
> >>>>>>RDMA, and other direct memory access subsystems can take
> >>>>>>a reference to page A, which Tux3 clones into a new page B
> >>>>>>when the process writes it.
> >>>>>>
> >>>>>>However, while the process now points at page B, ptrace,
> >>>>>>AIO, infiniband, etc will still be pointing at page A.
> >>>>>>
> >>>>>>This causes the process and the other subsystem to each
> >>>>>>look at a different page, instead of at shared state,
> >>>>>>causing ptrace to do nothing, AIO and RDMA data to be
> >>>>>>invisible (or corrupted), etc...
> >>>>>
> >>>>>Is this a bit like page migration?
> >>>>
> >>>>Yes. Page migration will fail if there is an "extra"
> >>>>reference to the page that is not accounted for by
> >>>>the migration code.
> >>>
> >>>When I said it's not like page migration, I was referring to the fact
> >>>that a COW on a pinned page for RDMA is a different problem to page
> >>>migration. The COW of a pinned page can lead to lost writes or
> >>>corruption depending on the ordering of events.
> >>
> >>I see the lost writes case, but not the corruption case,
> >
> >Data corruption can occur depending on the ordering of events and the
> >application's expectations. If a process starts IO, RDMA pins the page
> >for read and forks are combined with writes from another thread then when
> >the IO completes the reads may not be visible. The application may take
> >improper action at that point.
> 
> if tux3 forks the page and writes the copy while the original page
> is being modified by other things, this means that some of the
> changes won't be in the version written (and this could catch
> partial writes with 'interesting' results if the forking happens at
> the wrong time)
> 

Potentially yes. There is likely to be some elevated memory usage but I
imagine that can be controlled.

> But if the original page gets re-marked as needing to be written out
> when it's changed by one of the other things that are accessing it,
> there shouldn't be any long-term corruption.
> 
> As far as short-term corruption goes, any time you have a page
> mmapped it could get written out at any time, with only some of the
> application changes applied to it, so this sort of corruption could
> happen anyway couldn't it?
> 

That becomes the responsibility of the application. It's up to it to sync
appropriately when it knows updates are complete.

> >Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
> >class of problem.
> >
> >You can choose to not define this as data corruption because the kernel
> >is not directly involved and that's your call.
> >
> >>Do you
> >>mean corruption by changing a page already in writeout? If so,
> >>don't all filesystems have that problem?
> >>
> >
> >No, the problem is different. Backing devices requiring stable pages will
> >block the write until the IO is complete. For those that do not require
> >stable pages it's ok to allow the write as long as the page is dirtied so
> >that it'll be written out again and no data is lost.
> 
> so if tux3 is prevented from forking the page in cases where the
> write would be blocked, and will get forked again for follow-up
> writes if it's modified again otherwise, won't this be the same
> thing?
> 

Functionally and from a correctness point of view, it *might* be
equivalent. It depends on the implementation and the page life cycle,
particularly the details of how the writeback and dirty state are coordinated
between the user-visible pages and the page being written back. I've read
none of the code or background so I cannot answer whether it's really
equivalent or not. Just be aware that it's not the same problem as page
migration and that it's not the same as how writeback and dirty state is
handled today.
Daniel Phillips May 19, 2015, 4:46 a.m. UTC | #14
On 05/17/2015 07:20 PM, Rik van Riel wrote:
> On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
>> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>
>> All these problems can also happen with truncate+new-extending-write
>>
>> It is the responsibility of the application to take file/range locks
>> to prevent these page-pinned problems.
> 
> It is unreasonable to expect a process that is being ptraced
> (potentially without its knowledge) to take special measures
> to protect the ptraced memory from disappearing.
> 
> It is impossible for the debugger to take those special measures
> for anonymous memory, or unlinked inodes.
> 
> I don't think your requirement is workable or reasonable.

Hi Rik,

You are quite right to poke at this aggressively. Whether or not
there is an issue needing fixing, we want to know the details. We
really need to do a deep dive in ptrace and know exactly what it
does, and whether Tux3 creates any new kind of hole. I really know
very little about ptrace at the moment; I have only heard that it
is a horrible hack we inherited from some place far away and a time
long ago.

A little guidance from you would help. Somewhere ptrace must modify
the executable page. Unlike uprobes, which makes sense to me, I could
not find on a quick inspection where ptrace actually does that.
Perhaps you could provide a pointer?

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 19, 2015, 2 p.m. UTC | #15
On Thu 14-05-15 01:26:23, Daniel Phillips wrote:
> Hi Rik,
> 
> Our linux-tux3 tree currently currently carries this 652 line diff
> against core, to make Tux3 work. This is mainly by Hirofumi, except
> the fs-writeback.c hook, which is by me. The main part you may be
> interested in is rmap.c, which addresses the issues raised at the
> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]
> 
>    LSFMM: Page forking
>    http://lwn.net/Articles/548091/
> 
> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> forking design and implementation. For now, this is just to give a
> general sense of what we have done. We heard there are concerns about
> how ptrace will work. I really am not familiar with the issue, could
> you please explain what you were thinking of there?
  So here are a few things I find problematic about page forking (besides
the cases with elevated page_count already discussed in this thread - there
I believe that anything more complex than "wait for the IO instead of
forking when a page has an elevated use count" isn't going to work. There
are too many users depending on too subtle details of the behavior...). Some
of them are actually mentioned in the above LWN article:

When you create a copy of a page and replace it in the radix tree, nobody
in mm subsystem is aware that oldpage may be under writeback. That causes
interesting issues:
* truncate_inode_pages() can finish before all IO for the file is finished.
  So far, filesystems rely on the fact that once truncate_inode_pages()
  finishes, all running IO against the file is completed and no new IO can
  be submitted.
* Writeback can come and try to write newpage while oldpage is still under
  IO. Then you'll have two IOs against one block, which has undefined
  results.
* filemap_fdatawait(), called from fsync(), has the additional problem that
  it is not aware of oldpage and thus may return before the IO has finished.

I understand that Tux3 may avoid these issues due to some other mechanisms
it internally has but if page forking should get into mm subsystem, the
above must work.
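The fsync() point in particular can be shown with a small user-space model (purely hypothetical code with invented names, not the kernel implementation): a wait that walks the mapping can only see pages still reachable through it, so once a forked newpage replaces oldpage, the in-flight IO on oldpage becomes invisible to the wait.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page: only the state that matters for this race. */
struct mpage {
	int under_writeback;
};

#define MAPPING_PAGES 4

/* The mapping holds the user-visible pages, as the radix tree does. */
struct mapping {
	struct mpage *pages[MAPPING_PAGES];
};

/* Model of filemap_fdatawait(): it can only wait on pages it can still
 * find through the mapping. Returns how many in-flight pages it saw. */
int model_fdatawait(struct mapping *m)
{
	int waited = 0;

	for (int i = 0; i < MAPPING_PAGES; i++)
		if (m->pages[i] && m->pages[i]->under_writeback)
			waited++;	/* real code would block here */
	return waited;
}
```

After the replacement step, an fsync() built on this wait would return with oldpage's IO still running, which is exactly the hazard described above.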

								Honza

> diffstat tux3.core.patch
>  fs/Makefile               |    1 
>  fs/fs-writeback.c         |  100 +++++++++++++++++++++++++--------
>  include/linux/fs.h        |    6 +
>  include/linux/mm.h        |    5 +
>  include/linux/pagemap.h   |    2 
>  include/linux/rmap.h      |   14 ++++
>  include/linux/writeback.h |   23 +++++++
>  mm/filemap.c              |   82 +++++++++++++++++++++++++++
>  mm/rmap.c                 |  139 ++++++++++++++++++++++++++++++++++++++++++++++
>  mm/truncate.c             |   98 ++++++++++++++++++++------------
>  10 files changed, 411 insertions(+), 59 deletions(-)
> 
> diff --git a/fs/Makefile b/fs/Makefile
> index 91fcfa3..44d7192 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS)		+= ext4/
>  obj-$(CONFIG_JBD)		+= jbd/
>  obj-$(CONFIG_JBD2)		+= jbd2/
>  obj-$(CONFIG_TUX3)		+= tux3/
> -obj-$(CONFIG_TUX3_MMAP) 	+= tux3/
>  obj-$(CONFIG_CRAMFS)		+= cramfs/
>  obj-$(CONFIG_SQUASHFS)		+= squashfs/
>  obj-y				+= ramfs/
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 2d609a5..fcd1c61 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -34,25 +34,6 @@
>   */
>  #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
>  
> -/*
> - * Passed into wb_writeback(), essentially a subset of writeback_control
> - */
> -struct wb_writeback_work {
> -	long nr_pages;
> -	struct super_block *sb;
> -	unsigned long *older_than_this;
> -	enum writeback_sync_modes sync_mode;
> -	unsigned int tagged_writepages:1;
> -	unsigned int for_kupdate:1;
> -	unsigned int range_cyclic:1;
> -	unsigned int for_background:1;
> -	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
> -	enum wb_reason reason;		/* why was writeback initiated? */
> -
> -	struct list_head list;		/* pending work list */
> -	struct completion *done;	/* set if the caller waits */
> -};
> -
>  /**
>   * writeback_in_progress - determine whether there is writeback in progress
>   * @bdi: the device's backing_dev_info structure.
> @@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode)
>  }
>  
>  /*
> + * Remove inode from writeback list if clean.
> + */
> +void inode_writeback_done(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> +	spin_lock(&bdi->wb.list_lock);
> +	spin_lock(&inode->i_lock);
> +	if (!(inode->i_state & I_DIRTY))
> +		list_del_init(&inode->i_wb_list);
> +	spin_unlock(&inode->i_lock);
> +	spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_done);
> +
> +/*
> + * Add inode to writeback dirty list with current time.
> + */
> +void inode_writeback_touch(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> +	spin_lock(&bdi->wb.list_lock);
> +	inode->dirtied_when = jiffies;
> +	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> +	spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_touch);
> +
> +/*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
>   *
> @@ -610,9 +621,9 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
>   *
>   * Return the number of pages and/or inodes written.
>   */
> -static long writeback_sb_inodes(struct super_block *sb,
> -				struct bdi_writeback *wb,
> -				struct wb_writeback_work *work)
> +static long generic_writeback_sb_inodes(struct super_block *sb,
> +					struct bdi_writeback *wb,
> +					struct wb_writeback_work *work)
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= work->sync_mode,
> @@ -727,6 +738,22 @@ static long writeback_sb_inodes(struct super_block *sb,
>  	return wrote;
>  }
>  
> +static long writeback_sb_inodes(struct super_block *sb,
> +				struct bdi_writeback *wb,
> +				struct wb_writeback_work *work)
> +{
> +	if (sb->s_op->writeback) {
> +		long ret;
> +
> +		spin_unlock(&wb->list_lock);
> +		ret = sb->s_op->writeback(sb, wb, work);
> +		spin_lock(&wb->list_lock);
> +		return ret;
> +	}
> +
> +	return generic_writeback_sb_inodes(sb, wb, work);
> +}
> +
>  static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  				  struct wb_writeback_work *work)
>  {
> @@ -1293,6 +1320,35 @@ static void wait_sb_inodes(struct super_block *sb)
>  }
>  
>  /**
> + * writeback_queue_work_sb -	schedule writeback work from given super_block
> + * @sb: the superblock
> + * @work: work item to queue
> + *
> + * Schedule writeback work on this super_block. This usually used to
> + * interact with sb->s_op->writeback callback. The caller must
> + * guarantee to @work is not freed while bdi flusher is using (for
> + * example, be safe against umount).
> + */
> +void writeback_queue_work_sb(struct super_block *sb,
> +			     struct wb_writeback_work *work)
> +{
> +	if (sb->s_bdi == &noop_backing_dev_info)
> +		return;
> +
> +	/* Allow only following fields to use. */
> +	*work = (struct wb_writeback_work){
> +		.sb			= sb,
> +		.sync_mode		= work->sync_mode,
> +		.tagged_writepages	= work->tagged_writepages,
> +		.done			= work->done,
> +		.nr_pages		= work->nr_pages,
> +		.reason			= work->reason,
> +	};
> +	bdi_queue_work(sb->s_bdi, work);
> +}
> +EXPORT_SYMBOL(writeback_queue_work_sb);
> +
> +/**
>   * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
>   * @sb: the superblock
>   * @nr: the number of pages to write
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 42efe13..29833d2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -356,6 +356,8 @@ struct address_space_operations {
>  
>  	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
>  	sector_t (*bmap)(struct address_space *, sector_t);
> +	void (*truncatepage)(struct address_space *, struct page *,
> +			     unsigned int, unsigned int, int);
>  	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
>  	int (*releasepage) (struct page *, gfp_t);
>  	void (*freepage)(struct page *);
> @@ -1590,6 +1592,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
>  extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
>  		unsigned long, loff_t *);
>  
> +struct bdi_writeback;
> +struct wb_writeback_work;
>  struct super_operations {
>     	struct inode *(*alloc_inode)(struct super_block *sb);
>  	void (*destroy_inode)(struct inode *);
> @@ -1599,6 +1603,8 @@ struct super_operations {
>  	int (*drop_inode) (struct inode *);
>  	void (*evict_inode) (struct inode *);
>  	void (*put_super) (struct super_block *);
> +	long (*writeback)(struct super_block *super, struct bdi_writeback *wb,
> +			  struct wb_writeback_work *work);
>  	int (*sync_fs)(struct super_block *sb, int wait);
>  	int (*freeze_super) (struct super_block *);
>  	int (*freeze_fs) (struct super_block *);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dd5ea30..075f59f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1909,6 +1909,11 @@ vm_unmapped_area(struct vm_unmapped_area_info *info)
>  }
>  
>  /* truncate.c */
> +void generic_truncate_partial_page(struct address_space *mapping,
> +				   struct page *page, unsigned int start,
> +				   unsigned int len);
> +void generic_truncate_full_page(struct address_space *mapping,
> +				struct page *page, int wait);
>  extern void truncate_inode_pages(struct address_space *, loff_t);
>  extern void truncate_inode_pages_range(struct address_space *,
>  				       loff_t lstart, loff_t lend);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 4b3736f..13b70160 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -653,6 +653,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  extern void delete_from_page_cache(struct page *page);
>  extern void __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
> +int cow_replace_page_cache(struct page *oldpage, struct page *newpage);
> +void cow_delete_from_page_cache(struct page *page);
>  
>  /*
>   * Like add_to_page_cache_locked, but used to add newly allocated pages:
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d9d7e7e..9b67360 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -228,6 +228,20 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
>  int page_mkclean(struct page *);
>  
>  /*
> + * Make clone page for page forking.
> + *
> + * Note: only clones page state so other state such as buffer_heads
> + * must be cloned by caller.
> + */
> +struct page *cow_clone_page(struct page *oldpage);
> +
> +/*
> + * Changes the PTES of shared mappings except the PTE in orig_vma.
> + */
> +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
> +		  struct page *newpage);
> +
> +/*
>   * called in munlock()/munmap() path to check for other vmas holding
>   * the page mlocked.
>   */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 0004833..0784b9d 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -59,6 +59,25 @@ enum wb_reason {
>  };
>  
>  /*
> + * Passed into wb_writeback(), essentially a subset of writeback_control
> + */
> +struct wb_writeback_work {
> +	long nr_pages;
> +	struct super_block *sb;
> +	unsigned long *older_than_this;
> +	enum writeback_sync_modes sync_mode;
> +	unsigned int tagged_writepages:1;
> +	unsigned int for_kupdate:1;
> +	unsigned int range_cyclic:1;
> +	unsigned int for_background:1;
> +	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
> +	enum wb_reason reason;		/* why was writeback initiated? */
> +
> +	struct list_head list;		/* pending work list */
> +	struct completion *done;	/* set if the caller waits */
> +};
> +
> +/*
>   * A control structure which tells the writeback code what to do.  These are
>   * always on the stack, and hence need no locking.  They are always initialised
>   * in a manner such that unspecified fields are set to zero.
> @@ -90,6 +109,10 @@ struct writeback_control {
>   * fs/fs-writeback.c
>   */	
>  struct bdi_writeback;
> +void inode_writeback_done(struct inode *inode);
> +void inode_writeback_touch(struct inode *inode);
> +void writeback_queue_work_sb(struct super_block *sb,
> +			     struct wb_writeback_work *work);
>  void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
>  void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
>  							enum wb_reason reason);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 673e458..8c641d0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -639,6 +639,88 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  }
>  EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
>  
> +/*
> + * Atomically replace oldpage with newpage.
> + *
> + * Similar to migrate_pages(), but the oldpage is for writeout.
> + */
> +int cow_replace_page_cache(struct page *oldpage, struct page *newpage)
> +{
> +	struct address_space *mapping = oldpage->mapping;
> +	void **pslot;
> +
> +	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
> +	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> +
> +	/* Get refcount for radix-tree */
> +	page_cache_get(newpage);
> +
> +	/* Replace page in radix tree. */
> +	spin_lock_irq(&mapping->tree_lock);
> +	/* PAGECACHE_TAG_DIRTY represents the view of frontend. Clear it. */
> +	if (PageDirty(oldpage))
> +		radix_tree_tag_clear(&mapping->page_tree, page_index(oldpage),
> +				     PAGECACHE_TAG_DIRTY);
> +	/* The refcount to newpage is used for radix tree. */
> +	pslot = radix_tree_lookup_slot(&mapping->page_tree, oldpage->index);
> +	radix_tree_replace_slot(pslot, newpage);
> +	__inc_zone_page_state(newpage, NR_FILE_PAGES);
> +	__dec_zone_page_state(oldpage, NR_FILE_PAGES);
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	/* mem_cgroup codes must not be called under tree_lock */
> +	mem_cgroup_migrate(oldpage, newpage, true);
> +
> +	/* Release refcount for radix-tree */
> +	page_cache_release(oldpage);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(cow_replace_page_cache);
> +
> +/*
> + * Delete page from radix-tree, leaving page->mapping unchanged.
> + *
> + * Similar to delete_from_page_cache(), but the deleted page is for writeout.
> + */
> +void cow_delete_from_page_cache(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	/* Delete page from radix tree. */
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * if we're uptodate, flush out into the cleancache, otherwise
> +	 * invalidate any existing cleancache entries.  We can't leave
> +	 * stale data around in the cleancache once our page is gone
> +	 */
> +	if (PageUptodate(page) && PageMappedToDisk(page))
> +		cleancache_put_page(page);
> +	else
> +		cleancache_invalidate_page(mapping, page);
> +
> +	page_cache_tree_delete(mapping, page, NULL);
> +#if 0 /* FIXME: backend is assuming page->mapping is available */
> +	page->mapping = NULL;
> +#endif
> +	/* Leave page->index set: truncation lookup relies upon it */
> +
> +	__dec_zone_page_state(page, NR_FILE_PAGES);
> +	BUG_ON(page_mapped(page));
> +
> +	/*
> +	 * The following dirty accounting is done by writeback
> +	 * path. So, we don't need to do here.
> +	 *
> +	 * dec_zone_page_state(page, NR_FILE_DIRTY);
> +	 * dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> +	 */
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	page_cache_release(page);
> +}
> +EXPORT_SYMBOL_GPL(cow_delete_from_page_cache);
> +
>  #ifdef CONFIG_NUMA
>  struct page *__page_cache_alloc(gfp_t gfp)
>  {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 71cd5bd..9125246 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -923,6 +923,145 @@ int page_mkclean(struct page *page)
>  }
>  EXPORT_SYMBOL_GPL(page_mkclean);
>  
> +/*
> + * Make clone page for page forking. (Based on migrate_page_copy())
> + *
> + * Note: only clones page state so other state such as buffer_heads
> + * must be cloned by caller.
> + */
> +struct page *cow_clone_page(struct page *oldpage)
> +{
> +	struct address_space *mapping = oldpage->mapping;
> +	gfp_t gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
> +	struct page *newpage = __page_cache_alloc(gfp_mask);
> +	int cpupid;
> +
> +	newpage->mapping = oldpage->mapping;
> +	newpage->index = oldpage->index;
> +	copy_highpage(newpage, oldpage);
> +
> +	/* FIXME: right? */
> +	BUG_ON(PageSwapCache(oldpage));
> +	BUG_ON(PageSwapBacked(oldpage));
> +	BUG_ON(PageHuge(oldpage));
> +	if (PageError(oldpage))
> +		SetPageError(newpage);
> +	if (PageReferenced(oldpage))
> +		SetPageReferenced(newpage);
> +	if (PageUptodate(oldpage))
> +		SetPageUptodate(newpage);
> +	if (PageActive(oldpage))
> +		SetPageActive(newpage);
> +	if (PageMappedToDisk(oldpage))
> +		SetPageMappedToDisk(newpage);
> +
> +	/*
> +	 * Copy NUMA information to the new page, to prevent over-eager
> +	 * future migrations of this same page.
> +	 */
> +	cpupid = page_cpupid_xchg_last(oldpage, -1);
> +	page_cpupid_xchg_last(newpage, cpupid);
> +
> +	mlock_migrate_page(newpage, oldpage);
> +	ksm_migrate_page(newpage, oldpage);
> +
> +	/* Lock newpage before visible via radix tree */
> +	BUG_ON(PageLocked(newpage));
> +	__set_page_locked(newpage);
> +
> +	return newpage;
> +}
> +EXPORT_SYMBOL_GPL(cow_clone_page);
> +
> +static int page_cow_one(struct page *oldpage, struct page *newpage,
> +			struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pte_t oldptval, ptval, *pte;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	pte = page_check_address(oldpage, mm, address, &ptl, 1);
> +	if (!pte)
> +		goto out;
> +
> +	flush_cache_page(vma, address, pte_pfn(*pte));
> +	oldptval = ptep_clear_flush(vma, address, pte);
> +
> +	/* Take refcount for PTE */
> +	page_cache_get(newpage);
> +
> +	/*
> +	 * vm_page_prot doesn't have writable bit, so page fault will
> +	 * be occurred immediately after returned from this page fault
> +	 * again. And second time of page fault will be resolved with
> +	 * forked page was set here.
> +	 */
> +	ptval = mk_pte(newpage, vma->vm_page_prot);
> +#if 0
> +	/* FIXME: we should check following too? Otherwise, we would
> +	 * get additional read-only => write fault at least */
> +	if (pte_write)
> +		ptval = pte_mkwrite(ptval);
> +	if (pte_dirty(oldptval))
> +		ptval = pte_mkdirty(ptval);
> +	if (pte_young(oldptval))
> +		ptval = pte_mkyoung(ptval);
> +#endif
> +	set_pte_at(mm, address, pte, ptval);
> +
> +	/* Update rmap accounting */
> +	BUG_ON(!PageMlocked(oldpage));	/* Caller should migrate mlock flag */
> +	page_remove_rmap(oldpage);
> +	page_add_file_rmap(newpage);
> +
> +	/* no need to invalidate: a not-present page won't be cached */
> +	update_mmu_cache(vma, address, pte);
> +
> +	pte_unmap_unlock(pte, ptl);
> +
> +	mmu_notifier_invalidate_page(mm, address);
> +
> +	/* Release refcount for PTE */
> +	page_cache_release(oldpage);
> +out:
> +	return ret;
> +}
> +
> +/* Change old page in PTEs to new page exclude orig_vma */
> +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
> +		  struct page *newpage)
> +{
> +	struct address_space *mapping = page_mapping(oldpage);
> +	pgoff_t pgoff = oldpage->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +	struct vm_area_struct *vma;
> +	int ret = 0;
> +
> +	BUG_ON(!PageLocked(oldpage));
> +	BUG_ON(!PageLocked(newpage));
> +	BUG_ON(PageAnon(oldpage));
> +	BUG_ON(mapping == NULL);
> +
> +	i_mmap_lock_read(mapping);
> +	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> +		/*
> +		 * The orig_vma's PTE is handled by caller.
> +		 * (e.g. ->page_mkwrite)
> +		 */
> +		if (vma == orig_vma)
> +			continue;
> +
> +		if (vma->vm_flags & VM_SHARED) {
> +			unsigned long address = vma_address(oldpage, vma);
> +			ret += page_cow_one(oldpage, newpage, vma, address);
> +		}
> +	}
> +	i_mmap_unlock_read(mapping);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(page_cow_file);
> +
>  /**
>   * page_move_anon_rmap - move a page to our anon_vma
>   * @page:	the page to move to our anon_vma
> diff --git a/mm/truncate.c b/mm/truncate.c
> index f1e4d60..e5b4673 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -216,6 +216,56 @@ int invalidate_inode_page(struct page *page)
>  	return invalidate_complete_page(mapping, page);
>  }
>  
> +void generic_truncate_partial_page(struct address_space *mapping,
> +				   struct page *page, unsigned int start,
> +				   unsigned int len)
> +{
> +	wait_on_page_writeback(page);
> +	zero_user_segment(page, start, start + len);
> +	if (page_has_private(page))
> +		do_invalidatepage(page, start, len);
> +}
> +EXPORT_SYMBOL(generic_truncate_partial_page);
> +
> +static void truncate_partial_page(struct address_space *mapping, pgoff_t index,
> +				  unsigned int start, unsigned int len)
> +{
> +	struct page *page = find_lock_page(mapping, index);
> +	if (!page)
> +		return;
> +
> +	if (!mapping->a_ops->truncatepage)
> +		generic_truncate_partial_page(mapping, page, start, len);
> +	else
> +		mapping->a_ops->truncatepage(mapping, page, start, len, 1);
> +
> +	cleancache_invalidate_page(mapping, page);
> +	unlock_page(page);
> +	page_cache_release(page);
> +}
> +
> +void generic_truncate_full_page(struct address_space *mapping,
> +				struct page *page, int wait)
> +{
> +	if (wait)
> +		wait_on_page_writeback(page);
> +	else if (PageWriteback(page))
> +		return;
> +
> +	truncate_inode_page(mapping, page);
> +}
> +EXPORT_SYMBOL(generic_truncate_full_page);
> +
> +static void truncate_full_page(struct address_space *mapping, struct page *page,
> +			       int wait)
> +{
> +	if (!mapping->a_ops->truncatepage)
> +		generic_truncate_full_page(mapping, page, wait);
> +	else
> +		mapping->a_ops->truncatepage(mapping, page, 0, PAGE_CACHE_SIZE,
> +					     wait);
> +}
> +
>  /**
>   * truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
>   * @mapping: mapping to truncate
> @@ -298,11 +348,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			if (!trylock_page(page))
>  				continue;
>  			WARN_ON(page->index != index);
> -			if (PageWriteback(page)) {
> -				unlock_page(page);
> -				continue;
> -			}
> -			truncate_inode_page(mapping, page);
> +			truncate_full_page(mapping, page, 0);
>  			unlock_page(page);
>  		}
>  		pagevec_remove_exceptionals(&pvec);
> @@ -312,37 +358,18 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	}
>  
>  	if (partial_start) {
> -		struct page *page = find_lock_page(mapping, start - 1);
> -		if (page) {
> -			unsigned int top = PAGE_CACHE_SIZE;
> -			if (start > end) {
> -				/* Truncation within a single page */
> -				top = partial_end;
> -				partial_end = 0;
> -			}
> -			wait_on_page_writeback(page);
> -			zero_user_segment(page, partial_start, top);
> -			cleancache_invalidate_page(mapping, page);
> -			if (page_has_private(page))
> -				do_invalidatepage(page, partial_start,
> -						  top - partial_start);
> -			unlock_page(page);
> -			page_cache_release(page);
> -		}
> -	}
> -	if (partial_end) {
> -		struct page *page = find_lock_page(mapping, end);
> -		if (page) {
> -			wait_on_page_writeback(page);
> -			zero_user_segment(page, 0, partial_end);
> -			cleancache_invalidate_page(mapping, page);
> -			if (page_has_private(page))
> -				do_invalidatepage(page, 0,
> -						  partial_end);
> -			unlock_page(page);
> -			page_cache_release(page);
> +		unsigned int top = PAGE_CACHE_SIZE;
> +		if (start > end) {
> +			/* Truncation within a single page */
> +			top = partial_end;
> +			partial_end = 0;
>  		}
> +		truncate_partial_page(mapping, start - 1, partial_start,
> +				      top - partial_start);
>  	}
> +	if (partial_end)
> +		truncate_partial_page(mapping, end, 0, partial_end);
> +
>  	/*
>  	 * If the truncation happened within a single page no pages
>  	 * will be released, just zeroed, so we can bail out now.
> @@ -386,8 +413,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  
>  			lock_page(page);
>  			WARN_ON(page->index != index);
> -			wait_on_page_writeback(page);
> -			truncate_inode_page(mapping, page);
> +			truncate_full_page(mapping, page, 1);
>  			unlock_page(page);
>  		}
>  		pagevec_remove_exceptionals(&pvec);
> 
Daniel Phillips May 19, 2015, 7:18 p.m. UTC | #16
Hi Jan,

On 05/19/2015 07:00 AM, Jan Kara wrote:
> On Thu 14-05-15 01:26:23, Daniel Phillips wrote:
>> Hi Rik,
>>
>> Our linux-tux3 tree currently currently carries this 652 line diff
>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>> the fs-writeback.c hook, which is by me. The main part you may be
>> interested in is rmap.c, which addresses the issues raised at the
>> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]
>>
>>    LSFMM: Page forking
>>    http://lwn.net/Articles/548091/
>>
>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>> forking design and implementation. For now, this is just to give a
>> general sense of what we have done. We heard there are concerns about
>> how ptrace will work. I really am not familiar with the issue, could
>> you please explain what you were thinking of there?
>   So here are a few things I find problematic about page forking (besides
> the cases with elevated page_count already discussed in this thread - there
> I believe that anything more complex than "wait for the IO instead of
> forking when a page has an elevated use count" isn't going to work. There
> are too many users depending on too subtle details of the behavior...). Some
> of them are actually mentioned in the above LWN article:
> 
> When you create a copy of a page and replace it in the radix tree, nobody
> in mm subsystem is aware that oldpage may be under writeback. That causes
> interesting issues:
> * truncate_inode_pages() can finish before all IO for the file is finished.
>   So far, filesystems rely on the fact that once truncate_inode_pages()
>   finishes, all running IO against the file is completed and no new IO can
>   be submitted.

We do not use truncate_inode_pages, because of issues like that. We use
some truncate helpers, which were already available in some cases and
otherwise were implemented in Tux3 to make everything work properly. The
details
are Hirofumi's stomping grounds. I am pretty sure that his solution is
good and tight, or Tux3 would not pass its torture tests.

> * Writeback can come and try to write newpage while oldpage is still under
>   IO. Then you'll have two IOs against one block, which has undefined
>   results.

Those writebacks only come from Tux3 (or indirectly from fs-writeback,
through our writeback), so we are able to ensure that a dirty block is
written only once. (If redirtied, the block will fork, so two dirty
blocks are written in two successive deltas.)
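A rough sketch of that rule, as I understand it from the description above (my own toy model with assumed names, not Tux3 code): each delta writes a block dirtied in it exactly once, and a redirty lands in the next delta.

```c
#include <assert.h>

#define MAX_WRITES 8

/* Toy block: remembers which delta dirtied it (0 = clean). */
struct tblock {
	int dirty_in_delta;
};

struct tfs {
	int current_delta;
	int nr_writes;
	int write_delta[MAX_WRITES];	/* delta each write went out in */
};

/* Dirty a block in the current delta. Redirtying a block whose previous
 * state belongs to an in-flight delta models the fork: the old version
 * is still owned by that delta, the new dirty state by the current one. */
void dirty_block(struct tfs *fs, struct tblock *b)
{
	b->dirty_in_delta = fs->current_delta;
}

/* Commit: write each block dirtied in this delta exactly once, then open
 * the next delta. */
void commit_delta(struct tfs *fs, struct tblock *blocks, int n)
{
	for (int i = 0; i < n; i++) {
		if (blocks[i].dirty_in_delta == fs->current_delta) {
			fs->write_delta[fs->nr_writes++] = fs->current_delta;
			blocks[i].dirty_in_delta = 0;
		}
	}
	fs->current_delta++;
}
```

So a block redirtied across a delta boundary is written twice, but in two successive, ordered deltas, never concurrently.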

> * filemap_fdatawait(), called from fsync(), has the additional problem that
>   it is not aware of oldpage and thus may return before the IO has finished.

We do not use filemap_fdatawait; instead, we wait on completion of our
own writeback, which is under our control.

> I understand that Tux3 may avoid these issues due to some other mechanisms
> it internally has but if page forking should get into mm subsystem, the
> above must work.

It does work, and it demonstrates by example that not a lot of code is
needed to make it work, but the changes are not trivial. Tux3's delta
writeback model will not suit everyone, so you can't just lift our code
and add it to Ext4. Using it in Ext4 would require a per-inode writeback
model, which looks practical to me but is far from a weekend project.
Maybe something
to consider for Ext5.

It is the job of new designs like Tux3 to chase after that final drop
of performance, not our trusty Ext4 workhorse. Though stranger things
have happened - as I recall, Ext4 had O(n) directory operations at one
time. Fixing that was not easy, but we did it because we had to. Fixing
Ext4's write performance is not urgent by comparison, and the barrier
is high, you would want jbd3 for one thing.

I think the meta-question you are asking is, where is the second user
for this new CoW functionality? With a possible implication that if
there is no second user then Tux3 cannot be merged. Is that the
question?

Regards,

Daniel
David Lang May 19, 2015, 8:33 p.m. UTC | #17
On Tue, 19 May 2015, Daniel Phillips wrote:

>> I understand that Tux3 may avoid these issues due to some other mechanisms
>> it internally has but if page forking should get into mm subsystem, the
>> above must work.
>
> It does work, and by example, it does not need a lot of code to make
> it work, but the changes are not trivial. Tux3's delta writeback model
> will not suit everyone, so you can't just lift our code and add it to
> Ext4. Using it in Ext4 would require a per-inode writeback model, which
> looks practical to me but far from a weekend project. Maybe something
> to consider for Ext5.
>
> It is the job of new designs like Tux3 to chase after that final drop
> of performance, not our trusty Ext4 workhorse. Though stranger things
> have happened - as I recall, Ext4 had O(n) directory operations at one
> time. Fixing that was not easy, but we did it because we had to. Fixing
> Ext4's write performance is not urgent by comparison, and the barrier
> is high, you would want jbd3 for one thing.
>
> I think the meta-question you are asking is, where is the second user
> for this new CoW functionality? With a possible implication that if
> there is no second user then Tux3 cannot be merged. Is that the
> question?

I don't think they are asking for a second user. What they are saying is that 
for this functionality to be accepted in the mm subsystem, these problem cases 
need to work reliably, not just work for Tux3 because of your implementation.

So for things that you don't use, you need to make it an error if they get used 
on a page that's been forked (or not be an error and 'do the right thing').

For cases where it doesn't matter because Tux3 controls the writeback, and it's 
undefined in general what happens if writeback is triggered twice on the same 
page, you will need to figure out how to either prevent the second writeback 
from triggering if there's one in progress, or define how the two writebacks are 
going to happen so that you can't end up with them re-ordered by some other 
filesystem.

I think that that's what's meant by the top statement that I left in the quote. 
Even if your implementation details make it safe, these cases need to be safe even 
without your implementation details to be acceptable in the core kernel.

David Lang
Jan Kara May 20, 2015, 2:44 p.m. UTC | #18
On Tue 19-05-15 13:33:31, David Lang wrote:
> On Tue, 19 May 2015, Daniel Phillips wrote:
> 
> >>I understand that Tux3 may avoid these issues due to some other mechanisms
> >>it internally has but if page forking should get into mm subsystem, the
> >>above must work.
> >
> >It does work, and by example, it does not need a lot of code to make
> >it work, but the changes are not trivial. Tux3's delta writeback model
> >will not suit everyone, so you can't just lift our code and add it to
> >Ext4. Using it in Ext4 would require a per-inode writeback model, which
> >looks practical to me but far from a weekend project. Maybe something
> >to consider for Ext5.
> >
> >It is the job of new designs like Tux3 to chase after that final drop
> >of performance, not our trusty Ext4 workhorse. Though stranger things
> >have happened - as I recall, Ext4 had O(n) directory operations at one
> >time. Fixing that was not easy, but we did it because we had to. Fixing
> >Ext4's write performance is not urgent by comparison, and the barrier
> >is high, you would want jbd3 for one thing.
> >
> >I think the meta-question you are asking is, where is the second user
> >for this new CoW functionality? With a possible implication that if
> there is no second user then Tux3 cannot be merged. Is that the
> >question?
> 
> I don't think they are asking for a second user. What they are
> saying is that for this functionality to be accepted in the mm
> subsystem, these problem cases need to work reliably, not just work
> for Tux3 because of your implementation.
> 
> So for things that you don't use, you need to make it an error if
> they get used on a page that's been forked (or not be an error and
> 'do the right thing')
> 
> For cases where it doesn't matter because Tux3 controls the
> writeback, and it's undefined in general what happens if writeback
> is triggered twice on the same page, you will need to figure out how
> to either prevent the second writeback from triggering if there's
> one in process, or define how the two writebacks are going to happen
> so that you can't end up with them re-ordered by some other
> filesystem.
> 
> I think that that's what's meant by the top statement that I left in
> the quote. Even if your implementation details make it safe, these
> need to be safe even without your implementation details to be
> acceptable in the core kernel.
  Yeah, that's what I meant. If you create a function which manipulates
page cache, you better make it work with other functions manipulating page
cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
developer. Sure you can document all the conditions under which the
function is safe to use but a function that has several paragraphs in front
of it explaining when it is safe to use isn't a very good API...

								Honza
Daniel Phillips May 20, 2015, 4:22 p.m. UTC | #19
On 05/20/2015 07:44 AM, Jan Kara wrote:
> On Tue 19-05-15 13:33:31, David Lang wrote:
>> On Tue, 19 May 2015, Daniel Phillips wrote:
>>
>>>> I understand that Tux3 may avoid these issues due to some other mechanisms
>>>> it internally has but if page forking should get into mm subsystem, the
>>>> above must work.
>>>
>>> It does work, and by example, it does not need a lot of code to make
>>> it work, but the changes are not trivial. Tux3's delta writeback model
>>> will not suit everyone, so you can't just lift our code and add it to
>>> Ext4. Using it in Ext4 would require a per-inode writeback model, which
>>> looks practical to me but far from a weekend project. Maybe something
>>> to consider for Ext5.
>>>
>>> It is the job of new designs like Tux3 to chase after that final drop
>>> of performance, not our trusty Ext4 workhorse. Though stranger things
>>> have happened - as I recall, Ext4 had O(n) directory operations at one
>>> time. Fixing that was not easy, but we did it because we had to. Fixing
>>> Ext4's write performance is not urgent by comparison, and the barrier
>>> is high, you would want jbd3 for one thing.
>>>
>>> I think the meta-question you are asking is, where is the second user
>>> for this new CoW functionality? With a possible implication that if
>>> there is no second user then Tux3 cannot be merged. Is that the
>>> question?
>>
>> I don't think they are asking for a second user. What they are
>> saying is that for this functionality to be accepted in the mm
>> subsystem, these problem cases need to work reliably, not just work
>> for Tux3 because of your implementation.
>>
>> So for things that you don't use, you need to make it an error if
>> they get used on a page that's been forked (or not be an error and
>> 'do the right thing')
>>
>> For cases where it doesn't matter because Tux3 controls the
>> writeback, and it's undefined in general what happens if writeback
>> is triggered twice on the same page, you will need to figure out how
>> to either prevent the second writeback from triggering if there's
>> one in process, or define how the two writebacks are going to happen
>> so that you can't end up with them re-ordered by some other
>> filesystem.
>>
>> I think that that's what's meant by the top statement that I left in
>> the quote. Even if your implementation details make it safe, these
>> need to be safe even without your implementation details to be
>> acceptable in the core kernel.
>   Yeah, that's what I meant. If you create a function which manipulates
> page cache, you better make it work with other functions manipulating page
> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
> developer. Sure you can document all the conditions under which the
> function is safe to use but a function that has several paragraphs in front
> of it explaining when it is safe to use isn't a very good API...

Violent agreement, of course. To put it in concrete terms, each of
the page fork support functions must be examined and determined
sane. They are:

 * cow_replace_page_cache
 * cow_delete_from_page_cache
 * cow_clone_page
 * page_cow_one
 * page_cow_file

Would it be useful to drill down into those, starting from the top
of the list?

Regards,

Daniel
David Lang May 20, 2015, 6:01 p.m. UTC | #20
On Wed, 20 May 2015, Daniel Phillips wrote:

> On 05/20/2015 07:44 AM, Jan Kara wrote:
>>   Yeah, that's what I meant. If you create a function which manipulates
>> page cache, you better make it work with other functions manipulating page
>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>> developer. Sure you can document all the conditions under which the
>> function is safe to use but a function that has several paragraphs in front
>> of it explaining when it is safe to use isn't a very good API...
>
> Violent agreement, of course. To put it in concrete terms, each of
> the page fork support functions must be examined and determined
> sane. They are:
>
> * cow_replace_page_cache
> * cow_delete_from_page_cache
> * cow_clone_page
> * page_cow_one
> * page_cow_file
>
> Would it be useful to drill down into those, starting from the top
> of the list?

It's a little more than determining that these five functions are sane; it's making
sure that if someone mixes the use of these functions with other existing
functions, the result is sane.

but it's probably a good starting point to look at each of these five functions 
in detail and consider how they work and could interact badly with other things 
touching the page cache.

David Lang
Rik van Riel May 20, 2015, 7:53 p.m. UTC | #21
On 05/20/2015 12:22 PM, Daniel Phillips wrote:
> On 05/20/2015 07:44 AM, Jan Kara wrote:
>> On Tue 19-05-15 13:33:31, David Lang wrote:

>>   Yeah, that's what I meant. If you create a function which manipulates
>> page cache, you better make it work with other functions manipulating page
>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>> developer. Sure you can document all the conditions under which the
>> function is safe to use but a function that has several paragraphs in front
>> of it explaining when it is safe to use isn't a very good API...
> 
> Violent agreement, of course. To put it in concrete terms, each of
> the page fork support functions must be examined and determined
> sane. They are:
> 
>  * cow_replace_page_cache
>  * cow_delete_from_page_cache
>  * cow_clone_page
>  * page_cow_one
>  * page_cow_file
> 
> Would it be useful to drill down into those, starting from the top
> of the list?

How do these interact with other page cache functions, like
find_get_page() ?

How does tux3 prevent a user of find_get_page() from reading from
or writing into the pre-COW page, instead of the current page?
Daniel Phillips May 20, 2015, 10:51 p.m. UTC | #22
On 05/20/2015 12:53 PM, Rik van Riel wrote:
> On 05/20/2015 12:22 PM, Daniel Phillips wrote:
>> On 05/20/2015 07:44 AM, Jan Kara wrote:
>>> On Tue 19-05-15 13:33:31, David Lang wrote:
> 
>>>   Yeah, that's what I meant. If you create a function which manipulates
>>> page cache, you better make it work with other functions manipulating page
>>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>>> developer. Sure you can document all the conditions under which the
>>> function is safe to use but a function that has several paragraphs in front
>>> of it explaining when it is safe to use isn't a very good API...
>>
>> Violent agreement, of course. To put it in concrete terms, each of
>> the page fork support functions must be examined and determined
>> sane. They are:
>>
>>  * cow_replace_page_cache
>>  * cow_delete_from_page_cache
>>  * cow_clone_page
>>  * page_cow_one
>>  * page_cow_file
>>
>> Would it be useful to drill down into those, starting from the top
>> of the list?
> 
> How do these interact with other page cache functions, like
> find_get_page() ?

Nicely:

   https://github.com/OGAWAHirofumi/linux-tux3/blob/hirofumi/fs/tux3/filemap_mmap.c#L182

> How does tux3 prevent a user of find_get_page() from reading from
> or writing into the pre-COW page, instead of the current page?

Careful control of the dirty bits (we have two of them, one each
for front and back). That is what pagefork_for_blockdirty is about.

Regards,

Daniel

Daniel Phillips May 21, 2015, 3:24 a.m. UTC | #23
On 05/20/2015 03:51 PM, Daniel Phillips wrote:
> On 05/20/2015 12:53 PM, Rik van Riel wrote:
>> How does tux3 prevent a user of find_get_page() from reading from
>> or writing into the pre-COW page, instead of the current page?
> 
> Careful control of the dirty bits (we have two of them, one each
> for front and back). That is what pagefork_for_blockdirty is about.

Ah, and of course it does not matter if a reader is on the
pre-cow page. It would be reading the earlier copy, which might
no longer be the current copy, but it raced with the write so
nobody should be surprised. That is a race even without page fork.

Regards,

Daniel
David Lang May 21, 2015, 3:51 a.m. UTC | #24
On Wed, 20 May 2015, Daniel Phillips wrote:

> On 05/20/2015 03:51 PM, Daniel Phillips wrote:
>> On 05/20/2015 12:53 PM, Rik van Riel wrote:
>>> How does tux3 prevent a user of find_get_page() from reading from
>>> or writing into the pre-COW page, instead of the current page?
>>
>> Careful control of the dirty bits (we have two of them, one each
>> for front and back). That is what pagefork_for_blockdirty is about.
>
> Ah, and of course it does not matter if a reader is on the
> pre-cow page. It would be reading the earlier copy, which might
> no longer be the current copy, but it raced with the write so
> nobody should be surprised. That is a race even without page fork.

how do you prevent it from continuing to interact with the old version of the 
page and never see updates or have its changes reflected on the current page?

David Lang
Daniel Phillips May 21, 2015, 7:53 p.m. UTC | #25
On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
> how do you prevent it from continuing to interact with the old 
> version of the page and never see updates or have its changes 
> reflected on the current page?

Why would it do that, and what would be surprising about it? Did
you have a specific case in mind?

Regards,

Daniel
Rik van Riel May 26, 2015, 4:25 a.m. UTC | #26
On 05/21/2015 03:53 PM, Daniel Phillips wrote:
> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
>> how do you prevent it from continuing to interact with the old version
>> of the page and never see updates or have its changes reflected on
>> the current page?
> 
> Why would it do that, and what would be surprising about it? Did
> you have a specific case in mind?

After a get_page(), page_cache_get(), or other equivalent
function, a piece of code has the expectation that it can
continue using that page until after it has released the
reference count.

This can be an arbitrarily long period of time.
Daniel Phillips May 26, 2015, 4:30 a.m. UTC | #27
On Monday, May 25, 2015 9:25:44 PM PDT, Rik van Riel wrote:
> On 05/21/2015 03:53 PM, Daniel Phillips wrote:
>> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
>>> how do you prevent it from continuing to interact with the old version
>>> of the page and never see updates or have its changes reflected on
>>> the current page?
>> 
>> Why would it do that, and what would be surprising about it? Did
>> you have a specific case in mind?
>
> After a get_page(), page_cache_get(), or other equivalent
> function, a piece of code has the expectation that it can
> continue using that page until after it has released the
> reference count.
>
> This can be an arbitrarily long period of time.

It is perfectly welcome to keep using that page as long as it
wants, Tux3 does not care. When it lets go of the last reference
(and Tux3 has finished with it) then the page is freeable. Did
you have a more specific example where this would be an issue?
Are you talking about kernel or userspace code?

Regards,

Daniel

David Lang May 26, 2015, 6:04 a.m. UTC | #28
On Mon, 25 May 2015, Daniel Phillips wrote:

> On Monday, May 25, 2015 9:25:44 PM PDT, Rik van Riel wrote:
>> On 05/21/2015 03:53 PM, Daniel Phillips wrote:
>>> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
>>>> how do you prevent it from continuing to interact with the old version
>>>> of the page and never see updates or have its changes reflected on
>>>> the current page?
>>> 
>>> Why would it do that, and what would be surprising about it? Did
>>> you have a specific case in mind?
>> 
>> After a get_page(), page_cache_get(), or other equivalent
>> function, a piece of code has the expectation that it can
>> continue using that page until after it has released the
>> reference count.
>> 
>> This can be an arbitrarily long period of time.
>
> It is perfectly welcome to keep using that page as long as it
> wants, Tux3 does not care. When it lets go of the last reference
> (and Tux3 has finished with it) then the page is freeable. Did
> you have a more specific example where this would be an issue?
> Are you talking about kernel or userspace code?

if the page gets modified again, will that cause any issues? what if the page 
gets modified before the copy gets written out, so that there are two dirty 
copies of the page in the process of being written?

David Lang
Daniel Phillips May 26, 2015, 6:11 a.m. UTC | #29
On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
> if the page gets modified again, will that cause any issues? 
> what if the page gets modified before the copy gets written out, 
> so that there are two dirty copies of the page in the process of 
> being written?
>
> David Lang

How is the page going to get modified again? A forked page isn't
mapped by a pte, so userspace can't modify it by mmap. The forked
page is not in the page cache, so userspace can't modify it by
posix file ops. So the writer would have to be in kernel. Tux3
knows what it is doing, so it won't modify the page. What kernel
code besides Tux3 will modify the page?

Regards,

Daniel

David Lang May 26, 2015, 6:13 a.m. UTC | #30
On Mon, 25 May 2015, Daniel Phillips wrote:

> On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
>> if the page gets modified again, will that cause any issues? what if the 
>> page gets modified before the copy gets written out, so that there are two 
>> dirty copies of the page in the process of being written?
>> 
>> David Lang
>
> How is the page going to get modified again? A forked page isn't
> mapped by a pte, so userspace can't modify it by mmap. The forked
> page is not in the page cache, so userspace can't modify it by
> posix file ops. So the writer would have to be in kernel. Tux3
> knows what it is doing, so it won't modify the page. What kernel
> code besides Tux3 will modify the page?

I'm assuming that Rik is talking about whatever has the reference to the page 
via one of the methods that he talked about.

David Lang
Jan Kara May 26, 2015, 7:09 a.m. UTC | #31
On Mon 25-05-15 23:11:11, Daniel Phillips wrote:
> On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
> >if the page gets modified again, will that cause any issues? what
> >if the page gets modified before the copy gets written out, so
> >that there are two dirty copies of the page in the process of
> >being written?
> >
> >David Lang
> 
> How is the page going to get modified again? A forked page isn't
> mapped by a pte, so userspace can't modify it by mmap. The forked
> page is not in the page cache, so userspace can't modify it by
> posix file ops. So the writer would have to be in kernel. Tux3
> knows what it is doing, so it won't modify the page. What kernel
> code besides Tux3 will modify the page?
  E.g. video drivers (or infiniband or direct IO for that matter) which
have buffers in user memory (may be mmapped file), grab references to pages
and hand out PFNs of those pages to the hardware to store data in them...
If you fork a page after the driver has handed PFNs to the hardware, you've
just lost all the writes hardware will do.

								Honza
Daniel Phillips May 26, 2015, 8:08 a.m. UTC | #32
On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
>   E.g. video drivers (or infiniband or direct IO for that matter) which
> have buffers in user memory (may be mmapped file), grab references to pages
> and hand out PFNs of those pages to the hardware to store data in them...
> If you fork a page after the driver has handed PFNs to the hardware, you've
> just lost all the writes hardware will do.

Hi Jan,

The page forked because somebody wrote to it with write(2) or mmap write at
the same time as a video driver (or infiniband or direct IO) was doing io to
it. Isn't the application trying hard to lose data in that case? It would
not need page fork to lose data that way.

Regards,

Daniel
Daniel Phillips May 26, 2015, 8:09 a.m. UTC | #33
On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:
> I'm assuming that Rik is talking about whatever has the 
> reference to the page via one of the methods that he talked 
> about.

This would be a good moment to provide specifics.

Regards,

Daniel
Jan Kara May 26, 2015, 9 a.m. UTC | #34
On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> >  E.g. video drivers (or infiniband or direct IO for that matter) which
> >have buffers in user memory (may be mmapped file), grab references to pages
> >and hand out PFNs of those pages to the hardware to store data in them...
> >If you fork a page after the driver has handed PFNs to the hardware, you've
> >just lost all the writes hardware will do.
> 
> Hi Jan,
> 
> The page forked because somebody wrote to it with write(2) or mmap write at
> the same time as a video driver (or infiniband or direct IO) was
> doing io to
> it. Isn't the application trying hard to lose data in that case? It
> would not need page fork to lose data that way.

So I can think of two valid uses:

1) You setup IO to part of a page and modify from userspace a different
   part of a page.

2) At least for video drivers there is one ioctl() which creates object
   with buffers in memory and another ioctl() to actually ship it to hardware
   (may be called repeatedly). So in theory app could validly dirty the pages
   before it ships them to hardware. If this happens repeatedly and interacts
   badly with background writeback, you will end up with a forked page in a
   buffer and from that point on things are broken.

So my opinion is: Don't fork the page if page_count is elevated. You can
just wait for the IO if you need stable pages in that case. It's slow but
it's safe and it should be pretty rare. Is there any problem with that?

								Honza
Pavel Machek May 26, 2015, 10:13 a.m. UTC | #35
On Tue 2015-05-26 01:09:59, Daniel Phillips wrote:
> On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:
> >I'm assuming that Rik is talking about whatever has the reference to the
> >page via one of the methods that he talked about.
> 
> This would be a good moment to provide specifics.

Hmm. This seems like a good moment for you to audit the whole kernel, to
make sure it does not do stuff you don't expect it to.

You are changing core semantics, stuff that was allowed before is not
allowed now, so it looks like you should do the auditing...

You may want to start with video4linux, as Jan pointed out.
									Pavel
Sergey Senozhatsky May 26, 2015, 10:22 a.m. UTC | #36
On (05/26/15 01:08), Daniel Phillips wrote:
> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> >  E.g. video drivers (or infiniband or direct IO for that matter) which
> >have buffers in user memory (may be mmapped file), grab references to pages
> >and hand out PFNs of those pages to the hardware to store data in them...
> >If you fork a page after the driver has handed PFNs to the hardware, you've
> >just lost all the writes hardware will do.
> 
> Hi Jan,
> 
> The page forked because somebody wrote to it with write(2) or mmap write at
> the same time as a video driver (or infiniband or direct IO) was doing io to
> it. Isn't the application trying hard to lose data in that case? It would
> not need page fork to lose data that way.
> 

Hello,

is it possible to page-fork-bomb the system by some 'malicious' app?

	-ss
Jan Kara May 26, 2015, 12:33 p.m. UTC | #37
On Tue 26-05-15 19:22:39, Sergey Senozhatsky wrote:
> On (05/26/15 01:08), Daniel Phillips wrote:
> > On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> > >  E.g. video drivers (or infiniband or direct IO for that matter) which
> > >have buffers in user memory (may be mmapped file), grab references to pages
> > >and hand out PFNs of those pages to the hardware to store data in them...
> > >If you fork a page after the driver has handed PFNs to the hardware, you've
> > >just lost all the writes hardware will do.
> > 
> > Hi Jan,
> > 
> > The page forked because somebody wrote to it with write(2) or mmap write at
> > the same time as a video driver (or infiniband or direct IO) was doing io to
> > it. Isn't the application trying hard to lose data in that case? It would
> > not need page fork to lose data that way.
> > 
> 
> Hello,
> 
> is it possible to page-fork-bomb the system by some 'malicious' app?
  Well, you can have only two copies of each page - the one under writeout
and the one in page cache. Furthermore you are limited by dirty throttling
so I don't think this would allow any out-of-the-ordinary DoS vector...

								Honza
Daniel Phillips May 26, 2015, 7:18 p.m. UTC | #38
Hi Sergey,

On 05/26/2015 03:22 AM, Sergey Senozhatsky wrote:
> 
> Hello,
> 
> is it possible to page-fork-bomb the system by some 'malicious' app?

Not in any new way. A page fork can happen either in the front end,
where it has to wait for memory like any other normal memory user,
or in the backend, where Tux3 may have privileged access to low
memory reserves and therefore must place bounds on its memory use
like any other user of low memory reserves.

This is not specific to page fork. We must place such bounds for
any memory that the backend uses. Fortunately, the backend does not
allocate memory extravagantly, for fork or anything else, so when
this does get to the top of our to-do list it should not be too
hard to deal with. We plan to attack that after merge, as we have
never observed a problem in practice. Rather, Tux3 already seems
to survive low memory situations pretty well compared to some other
filesystems.


Regards,

Daniel
Daniel Phillips May 26, 2015, 8:22 p.m. UTC | #39
On 05/26/2015 02:00 AM, Jan Kara wrote:
> On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
>> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
>>>  E.g. video drivers (or infiniband or direct IO for that matter) which
>>> have buffers in user memory (may be mmapped file), grab references to pages
>>> and hand out PFNs of those pages to the hardware to store data in them...
>>> If you fork a page after the driver has handed PFNs to the hardware, you've
>>> just lost all the writes hardware will do.
>>
>> Hi Jan,
>>
>> The page forked because somebody wrote to it with write(2) or mmap write at
>> the same time as a video driver (or infiniband or direct IO) was
>> doing io to
>> it. Isn't the application trying hard to lose data in that case? It
>> would not need page fork to lose data that way.
> 
> So I can think of two valid uses:
> 
> 1) You setup IO to part of a page and modify from userspace a different
>    part of a page.

Suppose the use case is reading textures from video memory into a mmapped
file, and at the same time, the application is allowed to update the
textures in the file via mmap or write(2). Fork happens at mkwrite time.
If the page is already dirty, we do not fork it. The video API must have
made the page writable and dirty, so I do not see an issue.

> 2) At least for video drivers there is one ioctl() which creates object
>    with buffers in memory and another ioctl() to actually ship it to hardware
>    (may be called repeatedly). So in theory app could validly dirty the pages
>    before it ships them to hardware. If this happens repeatedly and interacts
>    badly with background writeback, you will end up with a forked page in a
>    buffer and from that point on things are broken.

Writeback does not fork pages. An app may dirty a page that is in the process
of being shipped to hardware (it must be a distinct part of the page, or it is
a race) and the data being sent to hardware will not be disturbed. If there
is an issue here, I do not see it.

> So my opinion is: Don't fork the page if page_count is elevated. You can
> just wait for the IO if you need stable pages in that case. It's slow but
> it's safe and it should be pretty rare. Is there any problem with that?

That would be our fallback if anybody discovers a specific case where page
fork breaks something, which so far has not been demonstrated.

With a known fallback, it is hard to see why we should delay merging over
that. Perfection has never been a requirement for merging filesystems. On
the contrary, imperfection is a reason for merging, so that the many
eyeballs effect may prove its value.

Regards,

Daniel
Rik van Riel May 26, 2015, 9:36 p.m. UTC | #40
On 05/26/2015 04:22 PM, Daniel Phillips wrote:
> On 05/26/2015 02:00 AM, Jan Kara wrote:

>> So my opinion is: Don't fork the page if page_count is elevated. You can
>> just wait for the IO if you need stable pages in that case. It's slow but
>> it's safe and it should be pretty rare. Is there any problem with that?
> 
> That would be our fallback if anybody discovers a specific case where page
> fork breaks something, which so far has not been demonstrated.
> 
> With a known fallback, it is hard to see why we should delay merging over
> that. Perfection has never been a requirement for merging filesystems. On

However, avoiding data corruption by erring on the side of safety is
a pretty basic requirement.

> the contrary, imperfection is a reason for merging, so that the many
> eyeballs effect may prove its value.

If you skip the page fork when there is an elevated page count, tux3
should be safe (at least from that aspect). Only do the COW when there
is no "strange" use of the page going on.
Daniel Phillips May 26, 2015, 9:49 p.m. UTC | #41
On 05/26/2015 02:36 PM, Rik van Riel wrote:
> On 05/26/2015 04:22 PM, Daniel Phillips wrote:
>> On 05/26/2015 02:00 AM, Jan Kara wrote:
>>> So my opinion is: Don't fork the page if page_count is elevated. You can
>>> just wait for the IO if you need stable pages in that case. It's slow but
>>> it's safe and it should be pretty rare. Is there any problem with that?
>>
>> That would be our fallback if anybody discovers a specific case where page
>> fork breaks something, which so far has not been demonstrated.
>>
>> With a known fallback, it is hard to see why we should delay merging over
>> that. Perfection has never been a requirement for merging filesystems. On
> 
> However, avoiding data corruption by erring on the side of safety is
> a pretty basic requirement.

Erring on the side of safety is still an error. As a community we have
never been fond of adding code or overhead to fix theoretical bugs. I
do not see why we should relax that principle now.

We can fix actual bugs, but theoretical bugs are only shapeless specters
passing in the night. We should not become frozen in fear of them.

>> the contrary, imperfection is a reason for merging, so that the many
>> eyeballs effect may prove its value.
> 
> If you skip the page fork when there is an elevated page count, tux3
> should be safe (at least from that aspect). Only do the COW when there
> is no "strange" use of the page going on.

Then you break the I in ACID. There must be a compelling reason to do
that.

Regards,

Daniel


Pavel Machek May 27, 2015, 7:41 a.m. UTC | #42
On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
> On 05/14/2015 08:06 PM, Rik van Riel wrote:
> > On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>> The issue is that things like ptrace, AIO, infiniband
> >>> RDMA, and other direct memory access subsystems can take
> >>> a reference to page A, which Tux3 clones into a new page B
> >>> when the process writes it.
> >>>
> >>> However, while the process now points at page B, ptrace,
> >>> AIO, infiniband, etc will still be pointing at page A.
> >>>
> >>> This causes the process and the other subsystem to each
> >>> look at a different page, instead of at shared state,
> >>> causing ptrace to do nothing, AIO and RDMA data to be
> >>> invisible (or corrupted), etc...
> >>
> >> Is this a bit like page migration?
> > 
> > Yes. Page migration will fail if there is an "extra"
> > reference to the page that is not accounted for by
> > the migration code.
> > 
> > Only pages that have no extra refcount can be migrated.
> > 
> > Similarly, your cow code needs to fail if there is an
> > extra reference count pinning the page. As long as
> > the page has a user that you cannot migrate, you cannot
> > move any of the other users over. They may rely on data
> > written by the hidden-to-you user, and the hidden-to-you
> > user may write to the page when you think it is a read
> > only stable snapshot.
> 
> Please bear with me as I study these cases one by one.
> 
> First one is ptrace. Only for executable files, right?
> Maybe we don't need to fork pages in executable files,

Umm. Why do you think it is only an issue for executable files?

I'm free to mmap() any file, and then execute from it.

/lib/ld-linux.so /path/to/binary

is a known way to exec programs that do not have the x bit set.
									Pavel
Jan Kara May 27, 2015, 8:41 a.m. UTC | #43
On Tue 26-05-15 13:22:38, Daniel Phillips wrote:
> On 05/26/2015 02:00 AM, Jan Kara wrote:
> > On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
> >> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> >>>  E.g. video drivers (or infiniband or direct IO for that matter) which
> >>> have buffers in user memory (may be mmapped file), grab references to pages
> >>> and hand out PFNs of those pages to the hardware to store data in them...
> >>> If you fork a page after the driver has handed PFNs to the hardware, you've
> >>> just lost all the writes hardware will do.
> >>
> >> Hi Jan,
> >>
> >> The page forked because somebody wrote to it with write(2) or mmap write at
> >> the same time as a video driver (or infiniband or direct IO) was
> >> doing io to
> >> it. Isn't the application trying hard to lose data in that case? It
> >> would not need page fork to lose data that way.
> > 
> > So I can think of two valid uses:
> > 
> > 1) You setup IO to part of a page and modify from userspace a different
> >    part of a page.
> 
> Suppose the use case is reading textures from video memory into a mmapped
> file, and at the same time, the application is allowed to update the
> textures in the file via mmap or write(2). Fork happens at mkwrite time.
> If the page is already dirty, we do not fork it. The video API must have
> made the page writable and dirty, so I do not see an issue.

So there are a few things to have in mind:
1) There is nothing like a "writeable" page. Page is always writeable (at
least on x86 architecture). When a page is mapped into some virtual address
space (or more of them), this *mapping* can be either writeable or read-only.
mkwrite changes the mapping from read-only to writeable but kernel /
hardware is free to write to the page regardless of the mapping.

2) When kernel / hardware writes to the page, it first modifies the page
and then marks it dirty.
 
So what can happen in this scenario is:

1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
   page is dirtied, kernel notes a PFN of the page somewhere internally.

2) Writeback comes and starts writeback for the page.

3) Kernel ships the PFN to the hardware.

4) Userspace comes and wants to write to the page (different part than the
   HW is instructed to use). page_mkwrite is called, page is forked.
   Userspace writes to the forked page.

5) HW stores its data in the original page.

Userspace never sees the data from the HW! Data is corrupted where,
without page forking, everything would work just fine.
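The lost write in this sequence can be reproduced with a small userspace model (illustrative only; the names and the single-pointer "mapping" are stand-ins, not kernel structures):

```c
#include <assert.h>
#include <string.h>

#define PAGESZ 8

struct sim_page { char data[PAGESZ]; };

/* The "mapping": which physical page userspace currently sees. */
struct sim_mapping { struct sim_page *current; };

/* page_mkwrite() forking: copy the data, repoint userspace at the copy. */
static void fork_page(struct sim_mapping *m, struct sim_page *copy)
{
	memcpy(copy->data, m->current->data, PAGESZ);
	m->current = copy;
}

/* Returns 1 if userspace observes the hardware's write, 0 if it is lost. */
static int run_race(int fork_between)
{
	struct sim_page orig = { {0} }, copy;
	struct sim_mapping m = { &orig };

	struct sim_page *hw_pfn = m.current;	/* steps 1-3: HW notes the PFN */
	if (fork_between)
		fork_page(&m, &copy);		/* step 4: mkwrite forks the page */
	hw_pfn->data[0] = 'H';			/* step 5: HW stores into the original */
	return m.current->data[0] == 'H';
}
```

Without the fork, the hardware's write is visible to userspace; with the fork between steps 3 and 5, it lands in a page userspace no longer sees.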

Another possible scenario:

1) Userspace app tells kernel to setup a HW buffer in a page.

2) Userspace app fills page with data -> page_mkwrite is called, page is
   dirtied.

3) Userspace app tells kernel to ship buffer to video HW.

4) Writeback comes and starts writeback for the page

5) Video HW is done with the page. Userspace app fills new set of data into
   the page -> page_mkwrite is called, page is forked.

6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
   old data from the original page.

Again a data corruption issue where previously things were working fine.

> > 2) At least for video drivers there is one ioctl() which creates object
> >    with buffers in memory and another ioctl() to actually ship it to hardware
> >    (may be called repeatedly). So in theory app could validly dirty the pages
> >    before it ships them to hardware. If this happens repeatedly and interacts
> >    badly with background writeback, you will end up with a forked page in a
> >    buffer and from that point on things are broken.
> 
> Writeback does not fork pages. An app may dirty a page that is in process
> of being shipped to hardware (must be a distinct part of the page, or it is
> a race) and the data being sent to hardware will not be disturbed. If there
> is an issue here, I do not see it.
> 
> > So my opinion is: Don't fork the page if page_count is elevated. You can
> > just wait for the IO if you need stable pages in that case. It's slow but
> > it's safe and it should be pretty rare. Is there any problem with that?
> 
> That would be our fallback if anybody discovers a specific case where page
> fork breaks something, which so far has not been demonstrated.
> 
> With a known fallback, it is hard to see why we should delay merging over
> that. Perfection has never been a requirement for merging filesystems. On
> the contrary, imperfection is a reason for merging, so that the many
> eyeballs effect may prove its value.

Sorry, but you've got several people telling you they are concerned about
the safety of your approach. That is the many eyeballs effect. And data
corruption issues aren't problems you can just wave away with "let's wait
whether it really happens".

								Honza
Daniel Phillips May 27, 2015, 6:09 p.m. UTC | #44
On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
> On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
>> On 05/14/2015 08:06 PM, Rik van Riel wrote: ...
>
> Umm. Why do you think it is only an issue for executable files?

I meant: files with code in them, that will be executed. Please excuse
me for colliding with the chmod sense. I will say "code files" to avoid
ambiguity.

> I'm free to mmap() any file, and then execute from it.
>
> /lib/ld-linux.so /path/to/binary
>
> is a known way to exec programs that do not have the x bit set.

So... why would I write to a code file at the same time as stepping
through it with ptrace? Should I expect ptrace to work perfectly if
I do that? What would "work perfectly" mean, if the code is changing
at the same time as being traced?

Regards,

Daniel
Pavel Machek May 27, 2015, 9:37 p.m. UTC | #45
On Wed 2015-05-27 11:09:25, Daniel Phillips wrote:
> On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
> >On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
> >>On 05/14/2015 08:06 PM, Rik van Riel wrote: ...
> >
> >Umm. Why do you think it is only an issue for executable files?
> 
> I meant: files with code in them, that will be executed. Please excuse
> me for colliding with the chmod sense. I will say "code files" to avoid
> ambiguity.
> 
> >I'm free to mmap() any file, and then execute from it.
> >
> >/lib/ld-linux.so /path/to/binary
> >
> >is a known way to exec programs that do not have the x bit set.
> 
> So... why would I write to a code file at the same time as stepping
> through it with ptrace? Should I expect ptrace to work perfectly if
> I do that? What would "work perfectly" mean, if the code is changing
> at the same time as being traced?

Do you have any imagination at all?

Reasons I should expect ptrace to work perfectly if I'm writing to
file:

1) it used to work before

2) it used to work before

3) it used to work before and regressions are not allowed

4) some kind of just in time compiler

5) some kind of malware, playing tricks so that you have trouble
analyzing it

and of course,

6) it used to work before.

									Pavel
Daniel Phillips May 27, 2015, 10:33 p.m. UTC | #46
On 05/27/2015 02:37 PM, Pavel Machek wrote:
> On Wed 2015-05-27 11:09:25, Daniel Phillips wrote:
>> On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
>>> On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
>>>> On 05/14/2015 08:06 PM, Rik van Riel wrote: ...
>>>
>>> Umm. Why do you think it is only an issue for executable files?
>>
>> I meant: files with code in them, that will be executed. Please excuse
>> me for colliding with the chmod sense. I will say "code files" to avoid
>> ambiguity.
>>
>>> I'm free to mmap() any file, and then execute from it.
>>>
>>> /lib/ld-linux.so /path/to/binary
>>>
>>> is a known way to exec programs that do not have the x bit set.
>>
>> So... why would I write to a code file at the same time as stepping
>> through it with ptrace? Should I expect ptrace to work perfectly if
>> I do that? What would "work perfectly" mean, if the code is changing
>> at the same time as being traced?
> 
> Do you have any imagination at all?

[Non-collegial rhetoric alert, it would be helpful to avoid that.]

> Reasons I should expect ptrace to work perfectly if I'm writing to
> file:
> 
> 1) it used to work before
> 
> 2) it used to work before
> 
> 3) it used to work before and regressions are not allowed

Are you sure that ptrace will work perfectly on a file that you are
writing to at the same time as tracing? If so, it has magic that I
do not understand. Could you please explain.

> 4) some kind of just in time compiler

A JIT that can tolerate being written to by a task it knows nothing
about, at the same time as it is generating code in the file? I do
not know of any such JIT.

> 5) some kind of malware, playing tricks so that you have trouble
> analyzing it

By writing to a code file? Then it already has write access to the
code file, so it has already gotten inside your security perimeter
without needing help from page fork. That said, we should be alert
for any new holes that page fork might open. But if there are any,
they should be actual holes, not theoretical ones.

> and of course,
> 
> 6) it used to work before.

I look forward to your explanation of how.

Regards,

Daniel

OGAWA Hirofumi June 21, 2015, 3:36 p.m. UTC | #47
Jan Kara <jack@suse.cz> writes:

Hi,

> So there are a few things to have in mind:
> 1) There is nothing like a "writeable" page. Page is always writeable (at
> least on x86 architecture). When a page is mapped into some virtual address
> space (or more of them), this *mapping* can be either writeable or read-only.
> mkwrite changes the mapping from read-only to writeable but kernel /
> hardware is free to write to the page regardless of the mapping.
>
> 2) When kernel / hardware writes to the page, it first modifies the page
> and then marks it dirty.
>  
> So what can happen in this scenario is:
>
> 1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
>    page is dirtied, kernel notes a PFN of the page somewhere internally.
>
> 2) Writeback comes and starts writeback for the page.
>
> 3) Kernel ships the PFN to the hardware.
>
> 4) Userspace comes and wants to write to the page (different part than the
>    HW is instructed to use). page_mkwrite is called, page is forked.
>    Userspace writes to the forked page.
>
> 5) HW stores its data in the original page.
>
> Userspace never sees the data from the HW! Data is corrupted where,
> without page forking, everything would work just fine.

I'm not sure I'm understanding your pseudocode logic correctly though.
This logic doesn't seem to be a page-forking-specific issue.  And
this pseudocode logic seems to be missing the locking and revalidation
of the page.

If you can show more details, it would be helpful to see more, and
discuss the issue of page forking, or we can think about how to handle
the corner cases.

Well, before that, why are more details needed?

For example, replace the page fork at (4) with "truncate", "punch
hole", or "invalidate page".

Those operations remove the old page from the radix tree, so the
userspace write creates a new page, and HW still references the
old page.  (I.e. the situation should be the same as with page forking,
in my understanding of this pseudocode logic.)

IOW, this pseudocode logic seems to be broken even without page forking
if there is no lock and revalidate.  Usually, we prevent unpleasant I/O
by lock_page or PG_writeback, and an obsoleted page is revalidated under
lock_page.
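The lock-and-revalidate pattern referred to here can be sketched like this (userspace model; the single-slot "radix tree" and all names are invented for the illustration):

```c
#include <assert.h>
#include <stddef.h>

struct vmapping;

struct vpage {
	struct vmapping *mapping;	/* NULL or stale once detached */
	long index;
	int locked;
};

/* Stand-in for the radix tree: one slot at one index. */
struct vmapping { struct vpage *slot; };

static int page_still_valid(struct vpage *p, struct vmapping *m, long idx)
{
	return p->mapping == m && p->index == idx && m->slot == p;
}

/* lock_page(), then recheck that the page is still what the mapping
 * holds at that index; if it was truncated, invalidated, or forked
 * away in the meantime, unlock and tell the caller to redo the lookup. */
static struct vpage *lock_and_revalidate(struct vpage *p,
					 struct vmapping *m, long idx)
{
	p->locked = 1;
	if (!page_still_valid(p, m, idx)) {
		p->locked = 0;
		return NULL;	/* obsolete page: retry the lookup */
	}
	return p;		/* current page, held under lock */
}
```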

For page forking, we may also be able to prevent a similar situation by
locking, flags, and revalidation. But those details might differ from
the current code, because the page states are different.

> Another possible scenario:
>
> 1) Userspace app tells kernel to setup a HW buffer in a page.
>
> 2) Userspace app fills page with data -> page_mkwrite is called, page is
>    dirtied.
>
> 3) Userspace app tells kernel to ship buffer to video HW.
>
> 4) Writeback comes and starts writeback for the page
>
> 5) Video HW is done with the page. Userspace app fills new set of data into
>    the page -> page_mkwrite is called, page is forked.
>
> 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
>    old data from the original page.
>
> Again a data corruption issue where previously things were working fine.

This logic seems to be the same as above. Replace the page fork at (5).
With no revalidation of the page, (6) will use the old page.

Thanks.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Jan Kara June 23, 2015, 4:12 p.m. UTC | #48
On Mon 22-06-15 00:36:00, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> > So there are a few things to have in mind:
> > 1) There is nothing like a "writeable" page. Page is always writeable (at
> > least on x86 architecture). When a page is mapped into some virtual address
> > space (or more of them), this *mapping* can be either writeable or read-only.
> > mkwrite changes the mapping from read-only to writeable but kernel /
> > hardware is free to write to the page regardless of the mapping.
> >
> > 2) When kernel / hardware writes to the page, it first modifies the page
> > and then marks it dirty.
> >  
> > So what can happen in this scenario is:
> >
> > 1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
> >    page is dirtied, kernel notes a PFN of the page somewhere internally.
> >
> > 2) Writeback comes and starts writeback for the page.
> >
> > 3) Kernel ships the PFN to the hardware.
> >
> > 4) Userspace comes and wants to write to the page (different part than the
> >    HW is instructed to use). page_mkwrite is called, page is forked.
> >    Userspace writes to the forked page.
> >
> > 5) HW stores its data in the original page.
> >
> > Userspace never sees the data from the HW! Data is corrupted where,
> > without page forking, everything would work just fine.
> 
> I'm not sure I'm understanding your pseudocode logic correctly though.
> This logic doesn't seem to be a page forking specific issue.  And
> this pseudocode logic seems to be missing the locking and revalidate of
> page.
> 
> If you can show more details, it would be helpful to see more, and
> discuss the issue of page forking, or we can think about how to handle
> the corner cases.
> 
> Well, before that, why are more details needed?
> 
> For example, replace the page fork at (4) with "truncate", "punch
> hole", or "invalidate page".
> 
> Those operations remove the old page from radix tree, so the
> userspace's write creates the new page, and HW still references the
> old page.  (I.e. situation should be same with page forking, in my
> understand of this pseudocode logic.)

Yes, if userspace truncates the file, the situation we end up with is
basically the same. However for truncate to happen some malicious process
has to come and truncate the file - a failure scenario that is acceptable
for most use cases since it doesn't happen unless someone is actively
trying to screw you. With page forking it is enough for the flusher thread
to start writeback for that page to trigger the problem - an event that is
basically bound to happen without any other userspace application
interfering.

> IOW, this pseudocode logic seems to be broken without page forking if
> no lock and revalidate.  Usually, we prevent unpleasant I/O by
> lock_page or PG_writeback, and an obsoleted page is revalidated under
> lock_page.

Well, good luck with converting all the get_user_pages() users in the kernel to
use lock_page() or PG_writeback checks to avoid issues with page forking. I
don't think that's really feasible.
 
> For page forking, we may also be able to prevent similar situation by
> locking, flags, and revalidate. But those details might be different
> with current code, because page states are different.

Sorry, I don't understand what you mean in this paragraph. Can you
explain it a bit more?
 
> > Another possible scenario:
> >
> > 1) Userspace app tells kernel to setup a HW buffer in a page.
> >
> > 2) Userspace app fills page with data -> page_mkwrite is called, page is
> >    dirtied.
> >
> > 3) Userspace app tells kernel to ship buffer to video HW.
> >
> > 4) Writeback comes and starts writeback for the page
> >
> > 5) Video HW is done with the page. Userspace app fills new set of data into
> >    the page -> page_mkwrite is called, page is forked.
> >
> > 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
> >    old data from the original page.
> >
> > Again a data corruption issue where previously things were working fine.
> 
> This logic seems to be the same as above. Replace the page fork at (5).
> With no revalidation of the page, (6) will use the old page.

  Yes, the same arguments as above apply...

								Honza
OGAWA Hirofumi July 5, 2015, 12:54 p.m. UTC | #49
Jan Kara <jack@suse.cz> writes:

>> I'm not sure I'm understanding your pseudocode logic correctly though.
>> This logic doesn't seem to be a page forking specific issue.  And
>> this pseudocode logic seems to be missing the locking and revalidate of
>> page.
>> 
>> If you can show more details, it would be helpful to see more, and
>> discuss the issue of page forking, or we can think about how to handle
>> the corner cases.
>> 
>> Well, before that, why are more details needed?
>> 
>> For example, replace the page fork at (4) with "truncate", "punch
>> hole", or "invalidate page".
>> 
>> Those operations remove the old page from radix tree, so the
>> userspace's write creates the new page, and HW still references the
>> old page.  (I.e. situation should be same with page forking, in my
>> understand of this pseudocode logic.)
>
> Yes, if userspace truncates the file, the situation we end up with is
> basically the same. However for truncate to happen some malicious process
> has to come and truncate the file - a failure scenario that is acceptable
> for most use cases since it doesn't happen unless someone is actively
> trying to screw you. With page forking it is enough for the flusher thread
> to start writeback for that page to trigger the problem - an event that is
> basically bound to happen without any other userspace application
> interfering.

Where does the conclusion that this is acceptable come from? That
pseudocode logic doesn't say anything about usage at all. And even if we
assume it is acceptable, as far as I can see, /proc/sys/vm/drop_caches is
enough to trigger it, for example, or a page on a non-existent block (a
sparse file, i.e. a missing disk space check in your logic). And if there
is really no lock/check at all, there would be other races.

>> IOW, this pseudocode logic seems to be broken without page forking if
>> no lock and revalidate.  Usually, we prevent unpleasant I/O by
>> lock_page or PG_writeback, and an obsoleted page is revalidated under
>> lock_page.
>
> Well, good luck with converting all the get_user_pages() users in the kernel to
> use lock_page() or PG_writeback checks to avoid issues with page forking. I
> don't think that's really feasible.

What does converting all the get_user_pages() users mean? Well, you may
be more or less right; I also think there is an issue in/around
get_user_pages() that we have to tackle.


IMO, if there is code that actually follows that pseudocode logic, that
code is the breakage. And "it is an acceptable limitation, so give up on
fixing it" is not, I think, the right way to go. If there is really code
broken like your logic, I think we should fix it.

Could you point out which code is using your logic? Since that seems to
be so racy, I still can't believe there really is code that racy.

>> For page forking, we may also be able to prevent similar situation by
>> locking, flags, and revalidate. But those details might be different
>> with current code, because page states are different.
>
> Sorry, I don't understand what you mean in this paragraph. Can you
> explain it a bit more?

This just means a forked page (the old page) and a truncated page have
different sets of flags and state, so we may have to adjust revalidation.

Thanks.
Jan Kara July 9, 2015, 4:05 p.m. UTC | #50
On Sun 05-07-15 21:54:45, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> >> I'm not sure I'm understanding your pseudocode logic correctly though.
> >> This logic doesn't seem to be a page forking specific issue.  And
> >> this pseudocode logic seems to be missing the locking and revalidate of
> >> page.
> >> 
> >> If you can show more details, it would be helpful to see more, and
> >> discuss the issue of page forking, or we can think about how to handle
> >> the corner cases.
> >> 
> >> Well, before that, why are more details needed?
> >> 
> >> For example, replace the page fork at (4) with "truncate", "punch
> >> hole", or "invalidate page".
> >> 
> >> Those operations remove the old page from radix tree, so the
> >> userspace's write creates the new page, and HW still references the
> >> old page.  (I.e. situation should be same with page forking, in my
> >> understand of this pseudocode logic.)
> >
> > Yes, if userspace truncates the file, the situation we end up with is
> > basically the same. However for truncate to happen some malicious process
> > has to come and truncate the file - a failure scenario that is acceptable
> > for most use cases since it doesn't happen unless someone is actively
> trying to screw you. With page forking it is enough for the flusher thread
> to start writeback for that page to trigger the problem - an event that is
> > basically bound to happen without any other userspace application
> > interfering.
> 
> Where does the conclusion that this is acceptable come from? That
> pseudocode logic doesn't say anything about usage at all. And even if we
> assume it is acceptable, as far as I can see, /proc/sys/vm/drop_caches is
> enough to trigger it, for example, or a page on a non-existent block (a
> sparse file, i.e. a missing disk space check in your logic). And if there
> is really no lock/check at all, there would be other races.

So drop_caches won't cause any issues because it avoids mmapped pages.
Also page reclaim or page migration don't cause any issues because
they avoid pages with an increased refcount (and the increased refcount would
stop drop_caches from reclaiming the page as well if it were not for the mmapped
check before). Generally, elevated page refcount currently guarantees page
isn't migrated, reclaimed, or otherwise detached from the mapping (except
for truncate where the combination of mapping-index becomes invalid) and
your page forking would change that assumption - which IMHO has a big
potential for some breakage somewhere. And frankly I fail to see why you
and Daniel care so much about this corner case, because from a performance POV
it's IMHO a non-issue, and you bother with page forking because of
performance, don't you?
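The "elevated refcount pins the page" rule described above is essentially the expected-reference check that migration-style code performs before detaching a page. A simplified sketch (the counting rules here are illustrative, not the real kernel ones):

```c
#include <assert.h>

/* Toy page with simplified reference bookkeeping. */
struct mig_page {
	int refcount;	/* all references */
	int mapcount;	/* page-table mappings */
	int in_cache;	/* 1 if present in the page cache */
};

static int expected_refs(const struct mig_page *p)
{
	return p->mapcount + (p->in_cache ? 1 : 0);
}

/* Detach (migrate/reclaim) only when every reference is accounted
 * for; any extra pin (get_user_pages, RDMA, ...) makes this fail. */
static int can_detach(const struct mig_page *p)
{
	return p->refcount == expected_refs(p);
}
```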

> >> IOW, this pseudocode logic seems to be broken without page forking if
> >> no lock and revalidate.  Usually, we prevent unpleasant I/O by
> >> lock_page or PG_writeback, and an obsoleted page is revalidated under
> >> lock_page.
> >
> > Well, good luck with converting all the get_user_pages() users in the kernel to
> > use lock_page() or PG_writeback checks to avoid issues with page forking. I
> > don't think that's really feasible.
> 
> What does converting all the get_user_pages() users mean? Well, you may
> be more or less right; I also think there is an issue in/around
> get_user_pages() that we have to tackle.
> 
> 
> IMO, if there is code that actually follows that pseudocode logic, that
> code is the breakage. And "it is an acceptable limitation, so give up on
> fixing it" is not, I think, the right way to go. If there is really code
> broken like your logic, I think we should fix it.
> 
> Could you point out which code is using your logic? Since that seems to
> be so racy, I still can't believe there really is code that racy.

So you can have a look, for example, at
drivers/media/v4l2-core/videobuf2-dma-contig.c, which implements setting
up a video device buffer at a virtual address specified by the user. Now I
don't know whether there really is any userspace video program that sets
up the video buffer in an mmapped file. I would agree with you that it
would be a strange thing to do, but I've seen enough strange userspace
code that I would not be too surprised.

Another example of a similar kind is at
drivers/infiniband/core/umem.c, where we again set up a buffer for
infiniband cards at a user-specified virtual address. And there are more
drivers in the kernel like that.

								Honza
OGAWA Hirofumi July 31, 2015, 4:44 a.m. UTC | #51
Jan Kara <jack@suse.cz> writes:

>> > Yes, if userspace truncates the file, the situation we end up with is
>> > basically the same. However for truncate to happen some malicious process
>> > has to come and truncate the file - a failure scenario that is acceptable
>> > for most use cases since it doesn't happen unless someone is actively
>> > trying to screw you. With page forking it is enough for the flusher thread
>> > to start writeback for that page to trigger the problem - an event that is
>> > basically bound to happen without any other userspace application
>> > interfering.
>> 
>> Acceptable conclusion is where came from? That pseudocode logic doesn't
>> say about usage at all. And even if assume it is acceptable, as far as I
>> can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a
>> page on non-exists block (sparse file. i.e. missing disk space check in
>> your logic). And if really no any lock/check, there would be another
>> races.
>
> So drop_caches won't cause any issues because it avoids mmaped pages.
> Also page reclaim or page migration don't cause any issues because
> they avoid pages with increased refcount (and increased refcount would stop
> drop_caches from reclaiming the page as well if it was not for the mmaped
> check before). Generally, elevated page refcount currently guarantees page
> isn't migrated, reclaimed, or otherwise detached from the mapping (except
> for truncate where the combination of mapping-index becomes invalid) and
> your page forking would change that assumption - which IMHO has a big
> potential for some breakage somewhere.

Lifetime and visibility to userspace are different topics.  The issue
here is visibility. Of course, the two are related more or less, but
refcount doesn't prevent dropping a page from the radix tree at all.

Well, anyway, your claim seems to assume that the userspace app works
around the issues. And it sounds like it still doesn't work around the
ENOSPC issue (validation at page fault/GUP time) even assuming userspace
behaves perfectly. Calling that a kernel assumption is strange.

If your claim were that there is strange logic widely used already, and
that of course we can't simply break it because of compatibility, I would
be able to agree. But your claim sounds like that logic is sane and
well-designed behavior. So I disagree.

> And frankly I fail to see why you and Daniel care so much about this
> corner case because from performance POV it's IMHO a non-issue and you
> bother with page forking because of performance, don't you?

Penalizing the corner case path, instead of the normal path, is what we
should try first. Penalizing the normal path to allow the corner case
path is basically insane.

Making the normal path faster and more reliable is what we are trying
to do.

> So you can have a look for example at
> drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up
> of a video device buffer at virtual address specified by user. Now I don't
> know whether there really is any userspace video program that sets up the
> video buffer in mmaped file. I would agree with you that it would be a
> strange thing to do but I've seen enough strange userspace code that I
> would not be too surprised.
>
> Another example of similar kind is at
> drivers/infiniband/core/umem.c where we again set up buffer for infiniband
> cards at users specified virtual address. And there are more drivers in
> kernel like that.

Unfortunately, I haven't looked at those yet. I guess those will be
helpful for seeing the details.

Thanks.
Daniel Phillips July 31, 2015, 5:27 p.m. UTC | #52
On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
> Returning ENOSPC when you have free space you can't yet prove is safer than
> not returning it and risking a data loss when you get hit by a write/commit
> storm. :)

Remember when delayed allocation was scary and unproven, because proving
that ENOSPC will always be returned when needed is extremely difficult?
But the performance advantage was compelling, so we just worked at it
until it worked. There were times when it didn't work properly, but the
code was in the tree so it got fixed.

It's like that now with page forking - a new technique with compelling
advantages, and some challenges. In the past, we (the Linux community)
would rise to the challenge and err on the side of pushing optimizations
in early. That was our mojo, and that is how Linux became the dominant
operating system it is today. Do we, the Linux community, still have that
mojo?

Regards,

Daniel
David Lang July 31, 2015, 6:29 p.m. UTC | #53
On Fri, 31 Jul 2015, Daniel Phillips wrote:

> Subject: Re: [FYI] tux3: Core changes
> 
> On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
>> Returning ENOSPC when you have free space you can't yet prove is safer than
>> not returning it and risking a data loss when you get hit by a write/commit
>> storm. :)
>
> Remember when delayed allocation was scary and unproven, because proving
> that ENOSPC will always be returned when needed is extremely difficult?
> But the performance advantage was compelling, so we just worked at it
> until it worked. There were times when it didn't work properly, but the
> code was in the tree so it got fixed.
>
> It's like that now with page forking - a new technique with compelling
> advantages, and some challenges. In the past, we (the Linux community)
> would rise to the challenge and err on the side of pushing optimizations
> in early. That was our mojo, and that is how Linux became the dominant
> operating system it is today. Do we, the Linux community, still have that
> mojo?

We, the Linux community, have less tolerance for losing people's data and 
preventing them from operating than we used to when it was all tinkerers' 
personal data and secondary systems.

So rather than pushing optimizations out to everyone and seeing what breaks, we 
now do more testing and checking for failures before pushing things out.

This means that when something new is introduced, we default to the safe, 
slightly slower way initially (there will be enough other bugs to deal with in 
any case), and then as we gain experience from the tinkerers enabling the 
performance optimizations, we make those optimizations reliable and only then 
push them out to all users.

If you define this as "losing our mojo", then yes we have. But most people see 
the pace of development as still being high, just with more testing and 
polishing before it gets out to users.

David Lang
Daniel Phillips July 31, 2015, 6:43 p.m. UTC | #54
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
> If you define this as "losing our mojo", then yes we have.

A pity. There remains so much to do that simply will not get
done in the absence of mojo.

Regards,

Daniel

Daniel Phillips July 31, 2015, 10:12 p.m. UTC | #55
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
> We, the Linux Community have less tolerance for losing people's data and preventing them from operating than we used to when it was all tinkerer's personal data and secondary systems.
> 
> So rather than pushing optimizations out to everyone and seeing what breaks, we now do more testing and checking for failures before pushing things out.

By the way, I am curious about whose data you think will get lost
as a result of pushing out Tux3 with a possible theoretical bug
in a wildly improbable scenario that has not actually been
described with sufficient specificity to falsify, let alone
demonstrated.

Regards,

Daniel
David Lang July 31, 2015, 10:27 p.m. UTC | #56
On Fri, 31 Jul 2015, Daniel Phillips wrote:

> On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
>> We, the Linux Community have less tolerance for losing people's data and 
>> preventing them from operating than we used to when it was all tinkerer's 
>> personal data and secondary systems.
>> 
>> So rather than pushing optimizations out to everyone and seeing what 
>> breaks, we now do more testing and checking for failures before pushing 
>> things out.
>
> By the way, I am curious about whose data you think will get lost
> as a result of pushing out Tux3 with a possible theoretical bug
> in a wildly improbable scenario that has not actually been
> described with sufficient specificity to falsify, let alone
> demonstrated.

you weren't asking about any particular feature of Tux, you were asking if we 
were still willing to push out stuff that breaks for users and fix it later.

Especially for filesystems that can lose the data of whoever is using it, the 
answer seems to be a clear no.

there may be bugs in what's pushed out that we don't know about. But we don't 
push out potential data corruption bugs that we do know about (or think we do)

so if you think this should be pushed out with this known corner case that's not 
handled properly, you have to convince people that it's _so_ improbable that 
they shouldn't care about it.

David Lang
Daniel Phillips Aug. 1, 2015, midnight UTC | #57
On Friday, July 31, 2015 3:27:12 PM PDT, David Lang wrote:
> On Fri, 31 Jul 2015, Daniel Phillips wrote:
>
>> On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: ...
>
> you weren't asking about any particular feature of Tux, you 
> were asking if we were still willing to push out stuff that 
> breaks for users and fix it later.

I think you left a key word out of my ask: "theoretical".

> Especially for filesystems that can lose the data of whoever 
> is using it, the answer seems to be a clear no.
>
> there may be bugs in what's pushed out that we don't know 
> about. But we don't push out potential data corruption bugs that 
> we do know about (or think we do)
>
> so if you think this should be pushed out with this known 
> corner case that's not handled properly, you have to convince 
> people that it's _so_ improbable that they shouldn't care about 
> it.

There should also be an onus on the person posing the worry
to prove their case beyond a reasonable doubt, which has not been
done in the case we are discussing here. Note: that is a technical
assessment to which a technical response is appropriate.

I do think that we should put a cap on this fencing and make
a real effort to get Tux3 into mainline. We should at least
set a ground rule that a problem should be proved real before it
becomes a reason to derail a project in the way that our project
has been derailed. Otherwise, it's hard to see what interest is
served.

OK, let's get back to the program. I accept your assertion that
we should convince people that the issue is improbable. To do
that, I need a specific issue to address. So far, no such issue
has been provided with specificity. Do you see why this is
frustrating?

Please, community. Give us specific issues to address, or give us
some way out of this eternal limbo. Or better, let's go back to the
old way of doing things in Linux, which is what got us where we
are today. Not this.

Note: Hirofumi's email is clear, logical and speaks to the
question. This branch of the thread is largely pointless, though
it essentially says the same thing in non-technical terms. Perhaps
your next response should be to Hirofumi, and perhaps it should be
technical.

Regards,

Daniel
Daniel Phillips Aug. 1, 2015, 12:16 a.m. UTC | #58
On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:
> Note: Hirofumi's email is clear, logical and speaks to the
> question. This branch of the thread is largely pointless, though
> it essentially says the same thing in non-technical terms. Perhaps
> your next response should be to Hirofumi, and perhaps it should be
> technical.

Now, let me try to lead the way by being specific. RDMA was raised
as a potential failure case for Tux3 page forking. But the RDMA api
does not let you use memory mmaped by Tux3 as a source or destination
of IO. Instead, it sets up its own pages and hands them out to the
RDMA app from a pool. So no issue. One down, right?

Regards,

Daniel
Jan Kara Aug. 3, 2015, 1:07 p.m. UTC | #59
On Fri 31-07-15 17:16:45, Daniel Phillips wrote:
> On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:
> >Note: Hirofumi's email is clear, logical and speaks to the
> >question. This branch of the thread is largely pointless, though
> >it essentially says the same thing in non-technical terms. Perhaps
> >your next response should be to Hirofumi, and perhaps it should be
> >technical.
> 
> Now, let me try to lead the way, but being specific. RDMA was raised
> as a potential failure case for Tux3 page forking. But the RDMA api
> does not let you use memory mmaped by Tux3 as a source or destination
> of IO. Instead, it sets up its own pages and hands them out to the
> RDMA app from a pool. So no issue. One down, right?

Can you please tell me how you arrived at that conclusion? As far as I can
see from the code in drivers/infiniband/, there is nothing there
preventing userspace from passing in mmapped memory...

								Honza
Jan Kara Aug. 3, 2015, 1:42 p.m. UTC | #60
On Fri 31-07-15 13:44:44, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> >> > Yes, if userspace truncates the file, the situation we end up with is
> >> > basically the same. However for truncate to happen some malicious process
> >> > has to come and truncate the file - a failure scenario that is acceptable
> >> > for most use cases since it doesn't happen unless someone is actively
> >> > trying to screw you. With page forking it is enough for flusher thread
> >> > to start writeback for that page to trigger the problem - event that is
> >> > basically bound to happen without any other userspace application
> >> > interfering.
> >> 
> >> Where does the conclusion that this is acceptable come from? That
> >> pseudocode logic says nothing about usage at all. And even if we assume
> >> it is acceptable, as far as I can see, /proc/sys/vm/drop_caches for
> >> example is enough to trigger it, as is a page on a non-existent block (a
> >> sparse file, i.e. a missing disk space check in your logic). And if there
> >> really is no lock/check, there would be other races.
> >
> > So drop_caches won't cause any issues because it avoids mmaped pages.
> > Also page reclaim or page migration don't cause any issues because
> > they avoid pages with increased refcount (and increased refcount would stop
> > drop_caches from reclaiming the page as well if it was not for the mmaped
> > check before). Generally, elevated page refcount currently guarantees page
> > isn't migrated, reclaimed, or otherwise detached from the mapping (except
> > for truncate where the combination of mapping-index becomes invalid) and
> > your page forking would change that assumption - which IMHO has a big
> > potential for some breakage somewhere.
> 
> Lifetime and visibility from userspace are different topics. The issue here
> is visibility. Of course, the two are related more or less, but
> refcount doesn't stop a page from being dropped from the radix tree at all.

Well, the refcount prevents dropping a page from a radix tree in some cases -
memory pressure and page migration, to name the most prominent ones. It
doesn't prevent a page from being dropped because of truncate, that is
correct. In general, the rule we currently obey is that the kernel doesn't
detach a page with an increased refcount from a radix tree unless there is a
syscall asking the kernel to do that.

> Well, anyway, your claim seems to assume the userspace app
> works around the issues. And it sounds like that still doesn't work around
> the ENOSPC issue (validation at page fault/GUP time) even assuming
> userspace behaves perfectly. Calling it a kernel assumption is strange.

Realistically, I don't think userspace apps work around anything. They just
do what happens to work. Nobody deletes files while an application works on
them and expects the application to gracefully handle that. So everyone
is happy. I'm not sure which ENOSPC issue you are speaking about, BTW. Can
you please elaborate?

> If your claim were that there is strange logic widely used already, and
> that of course we can't simply break it because of compatibility, I would
> be able to agree. But your claim sounds like that logic is sane and
> well-designed behavior. So I disagree.

To me the rule: "Do not detach a page from a radix tree if it has an elevated
refcount unless explicitly requested by a syscall" looks like a sane one.
Yes.

> > And frankly I fail to see why you and Daniel care so much about this
> > corner case because from performance POV it's IMHO a non-issue and you
> > bother with page forking because of performance, don't you?
> 
> Trying to penalize the corner case path, instead of normal path, should
> try at first. Penalizing normal path to allow corner case path is insane
> basically.
>
> Make normal path faster and more reliable is what we are trying.

An elevated refcount on a page is in my opinion a corner case path. That's why
I think that penalizing that case by waiting for IO instead of forking is an
acceptable cost for the improved compatibility & maintainability of the
code.
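The wait-instead-of-fork policy described above can be sketched as a tiny decision model. This is Python purely for illustration, not kernel code; the constants and function names are invented, and "refcount > 2" follows the elevated-refcount definition used later in this thread (one reference for the pagecache, one for the code inspecting the page):

```python
# Illustrative sketch (not kernel code): when a page's refcount is
# elevated beyond the expected holders, writeback waits for the extra
# reference to drop rather than forking the page.

PAGECACHE_REF = 1   # the radix tree's reference
CALLER_REF = 1      # the code currently inspecting the page

def expected_refs():
    return PAGECACHE_REF + CALLER_REF

def writeback_action(refcount):
    """Pick the writeback strategy for one page.

    refcount: current reference count as seen by writeback.
    Returns "fork" for the common case, "wait" when someone else
    (e.g. a GUP pin) also holds the page.
    """
    if refcount > expected_refs():
        return "wait"   # stable contents, but unpredictable latency
    return "fork"       # fast path: page is forked, writer never blocks

print(writeback_action(2))  # → fork
print(writeback_action(3))  # → wait
```

The trade-off Hirofumi objects to below is visible in the two branches: the "wait" branch keeps page contents stable during IO, but its latency depends on whoever holds the extra reference.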

								Honza
OGAWA Hirofumi Aug. 9, 2015, 1:42 p.m. UTC | #61
Jan Kara <jack@suse.cz> writes:

> I'm not sure about which ENOSPC issue you are speaking BTW. Can you
> please elaborate?

1. GUP simulates a page fault and prepares to modify
2. writeback clears dirty and makes the PTE read-only
3. snapshot/reflink makes the block COW
4. the driver that called GUP modifies the page, dirtying it without
   simulating a page fault
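The four steps above can be modeled as a toy state machine to show where the ENOSPC check is lost. This is Python purely for illustration; all names are invented and none of this is kernel code:

```python
# Hypothetical model (not kernel code): one page's state through the
# four steps above, showing where the ENOSPC check is skipped.

class PageState:
    def __init__(self):
        self.pte_writable = False    # stores hit the page without faulting
        self.dirty = False
        self.block_cow = False       # next write needs a fresh allocation
        self.enospc_checked = False  # ->page_mkwrite() ran after COW was set

def gup_fault(p):
    """Step 1: GUP simulates a page fault and prepares to modify."""
    p.pte_writable = True
    p.dirty = True
    p.enospc_checked = True          # space reserved for the current block

def writeback(p):
    """Step 2: writeback cleans the page and write-protects the PTE."""
    p.dirty = False
    p.pte_writable = False

def snapshot(p):
    """Step 3: snapshot/reflink marks the block copy-on-write."""
    p.block_cow = True
    p.enospc_checked = False         # a new allocation is now needed

def driver_write(p):
    """Step 4: the driver writes via its GUP pin; no fault is taken,
    so ->page_mkwrite() never runs for the COW block."""
    p.dirty = True

def lost_enospc_check():
    p = PageState()
    for step in (gup_fault, writeback, snapshot, driver_write):
        step(p)
    # Dirty data now targets a COW block with no space reservation;
    # allocation happens at writeback time, where ENOSPC is too late.
    return p.dirty and p.block_cow and not p.enospc_checked

print(lost_enospc_check())  # → True
```

The point of the sequence is that step 4 dirties the page without re-running the fault path, so whatever space check the filesystem does in ->page_mkwrite() never sees the COW block created in step 3.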

>> If you claim, there is strange logic widely used already, and of course,
>> we can't simply break it because of compatibility. I would be able to
>> agree. But your claim sounds like that logic is sane and well designed
>> behavior. So I disagree.
>
> To me the rule: "Do not detach a page from a radix tree if it has an elevated
> > refcount unless explicitly requested by a syscall" looks like a sane one.
> Yes.
>
>> > And frankly I fail to see why you and Daniel care so much about this
>> > corner case because from performance POV it's IMHO a non-issue and you
>> > bother with page forking because of performance, don't you?
>> 
>> Trying to penalize the corner case path, instead of normal path, should
>> try at first. Penalizing normal path to allow corner case path is insane
>> basically.
>>
>> Make normal path faster and more reliable is what we are trying.
>
> Elevated refcount of a page is in my opinion a corner case path. That's why
> I think that penalizing that case by waiting for IO instead of forking is
> acceptable cost for the improved compatibility & maintainability of the
> code.

What is an "elevated refcount"? What is the difference from a normal
refcount? Are you saying "refcount >= specified threshold + waitq/wakeup" or
some such? If so, it is not a path. It is a state. IOW, some workloads may
not hit it much, but some workloads may hit it a lot, on the normal path.

So it sounds like yet another "stable page", i.e. unpredictable
performance. (BTW, recalling "stable pages", I noticed that "stable pages"
would not provide stable page data for that logic either.)

Well, assuming "elevated refcount == threshold + waitq/wakeup",
IMO it is not attractive. Rather, it is the design choice of last
resort if there are no others.

Thanks.
Jan Kara Aug. 10, 2015, 12:45 p.m. UTC | #62
On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > I'm not sure about which ENOSPC issue you are speaking BTW. Can you
> > please elaborate?
> 
> 1. GUP simulate page fault, and prepare to modify
> 2. writeback clear dirty, and make PTE read-only
> 3. snapshot/reflink make block cow

I assume by point 3. you mean that snapshot / reflink happens now and thus
the page / block is marked as COW. Am I right?

> 4. driver called GUP modifies page, and dirty page without simulate page fault

OK, but this doesn't hit ENOSPC because as you correctly write in point 4.,
the page gets modified without triggering another page fault so COW for the
modified page isn't triggered. Modified page contents will be in both the
original and the reflinked file, won't it?

And I agree that the fact that a snapshotted file's original contents can
still get modified is a bug. One which is difficult to fix.

> >> If you claim, there is strange logic widely used already, and of course,
> >> we can't simply break it because of compatibility. I would be able to
> >> agree. But your claim sounds like that logic is sane and well designed
> >> behavior. So I disagree.
> >
> > To me the rule: "Do not detach a page from a radix tree if it has an elevated
> > refcount unless explicitly requested by a syscall" looks like a sane one.
> > Yes.
> >
> >> > And frankly I fail to see why you and Daniel care so much about this
> >> > corner case because from performance POV it's IMHO a non-issue and you
> >> > bother with page forking because of performance, don't you?
> >> 
> >> Trying to penalize the corner case path, instead of normal path, should
> >> try at first. Penalizing normal path to allow corner case path is insane
> >> basically.
> >>
> >> Make normal path faster and more reliable is what we are trying.
> >
> > Elevated refcount of a page is in my opinion a corner case path. That's why
> > I think that penalizing that case by waiting for IO instead of forking is
> > acceptable cost for the improved compatibility & maintainability of the
> > code.
> 
> What is "elevated refcount"? What is difference with normal refcount?
> Are you saying "refcount >= specified threshold + waitq/wakeup" or
> such? If so, it is not the path.  It is the state. IOW, some group may
> not hit much, but some group may hit much, on normal path.

Yes, by "elevated refcount" I meant refcount > 2 (one for pagecache, one for
your code inspecting the page).

> So it sounds like yet another "stable page". I.e. unpredictable
> performance. (BTW, by recall of "stable page", noticed "stable page"
> would not provide stabled page data for that logic too.)
> 
> Well, assuming "elevated refcount == threshold + waitq/wakeup", so
> IMO, it is not attractive.  Rather the last option if there is no
> others as design choice.

I agree the performance will be less predictable and that is not good. But
changing what is visible in the file when writeback races with GUP is a
worse problem to me.

Maybe GUP could mark the pages it got a reference for, so that we could
trigger the slow behavior only for them (Peter Zijlstra proposed in [1] an
infrastructure so that pages pinned by get_user_pages() would be properly
accounted; then we could use PG_mlocked plus an elevated refcount as a more
reliable indication of pages that need special handling).
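The accounting idea here can be sketched as follows. This is an illustrative Python model, not kernel code: the class and function names are invented, and the boolean flag stands in for the PG_mlocked bit mentioned above:

```python
# Illustrative sketch (not kernel code): get_user_pages() marks the
# pages it pins, so writeback can tell a deliberate pin apart from a
# transient refcount bump.

class Page:
    def __init__(self):
        self.refcount = 1      # pagecache reference
        self.mlocked = False   # stand-in for PG_mlocked in the proposal

def gup_pin(page):
    """Model of GUP taking a pin: bump refcount and set the marker."""
    page.refcount += 1
    page.mlocked = True

def gup_unpin(page):
    """Model of the pin being released."""
    page.refcount -= 1
    page.mlocked = False

def needs_special_handling(page):
    """Writeback-side test: the flag plus an elevated refcount together
    are a more reliable indication than the refcount alone, which can
    also be bumped transiently by readers."""
    return page.mlocked and page.refcount > 1

p = Page()
print(needs_special_handling(p))  # → False
gup_pin(p)
print(needs_special_handling(p))  # → True
```

The design point is that a raw refcount cannot distinguish a long-lived driver pin from a short-lived reference, while an explicit marker set by GUP can.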

								Honza

[1] http://thread.gmane.org/gmane.linux.kernel.mm/117679
OGAWA Hirofumi Aug. 16, 2015, 7:42 p.m. UTC | #63
Jan Kara <jack@suse.cz> writes:

> On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote:
>> Jan Kara <jack@suse.cz> writes:
>> 
>> > I'm not sure about which ENOSPC issue you are speaking BTW. Can you
>> > please elaborate?
>> 
>> 1. GUP simulate page fault, and prepare to modify
>> 2. writeback clear dirty, and make PTE read-only
>> 3. snapshot/reflink make block cow
>
> I assume by point 3. you mean that snapshot / reflink happens now and thus
> the page / block is marked as COW. Am I right?

Right.

>> 4. driver called GUP modifies page, and dirty page without simulate page fault
>
> OK, but this doesn't hit ENOSPC because as you correctly write in point 4.,
> the page gets modified without triggering another page fault so COW for the
> modified page isn't triggered. Modified page contents will be in both the
> original and the reflinked file, won't it?

And the above result can be ENOSPC too, depending on the implementation and
race conditions. Also, if the FS converts zeroed blocks to holes like
hammerfs does, ENOSPC simply happens: another process uses up all the space,
but then there is no ->page_mkwrite() callback to check for ENOSPC.

> And I agree that the fact that snapshotted file's original contents can
> still get modified is a bug. A one which is difficult to fix.

Yes, that is why I think this logic is an issue, even before page forking.

>> So it sounds like yet another "stable page". I.e. unpredictable
>> performance. (BTW, by recall of "stable page", noticed "stable page"
>> would not provide stabled page data for that logic too.)
>> 
>> Well, assuming "elevated refcount == threshold + waitq/wakeup", so
>> IMO, it is not attractive.  Rather the last option if there is no
>> others as design choice.
>
> I agree the performance will be less predictable and that is not good. But
> changing what is visible in the file when writeback races with GUP is a
> worse problem to me.
>
> Maybe if GUP marked pages it got ref for so that we could trigger the slow
> behavior only for them (Peter Zijlstra proposed in [1] an infrastructure so
> that pages pinned by get_user_pages() would be properly accounted and then
> we could use PG_mlocked and elevated refcount as a more reliable indication
> of pages that need special handling).

I haven't read Peter's patchset fully yet, but it looks good, and it is
similar to the strategy I currently have in mind. I'm also thinking of
adding callbacks for the FS at the start and end of GUP's pin window. (Just
as an example, the callbacks could be used by the FS to stop writeback if it
wants.)
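The callback idea can be sketched like this. Everything here is hypothetical: Python for illustration only, with invented names, modeling the one example use given above (the FS holding off writeback while a pin window is open):

```python
# Hypothetical sketch of the idea above: a mapping gains callbacks
# bracketing a GUP pin window, so the filesystem can e.g. hold off
# writeback while a driver may dirty pages behind its back.
# All names here are invented for illustration; this is not a real API.

class AddressSpace:
    def __init__(self):
        self.pins = 0
        self.writeback_allowed = True

    def gup_pin_start(self):
        """Called when get_user_pages() opens a pin window on this
        mapping. This example FS policy: stop writeback while pinned."""
        self.pins += 1
        self.writeback_allowed = False

    def gup_pin_end(self):
        """Called when a pin in the window is released; writeback
        resumes once the last pin is gone."""
        self.pins -= 1
        if self.pins == 0:
            self.writeback_allowed = True

def can_writeback(mapping):
    return mapping.writeback_allowed

m = AddressSpace()
m.gup_pin_start()
print(can_writeback(m))  # → False
m.gup_pin_end()
print(can_writeback(m))  # → True
```

A pin counter rather than a plain flag is used so that overlapping pin windows from several drivers nest correctly.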

Thanks.
Rik van Riel Aug. 18, 2015, 4:39 p.m. UTC | #64
On 07/31/2015 01:27 PM, Daniel Phillips wrote:
> On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
>> Returning ENOSPC when you have free space you can't yet prove is safer
>> than
>> not returning it and risking a data loss when you get hit by a
>> write/commit
>> storm. :)
>
> Remember when delayed allocation was scary and unproven, because proving
> that ENOSPC will always be returned when needed is extremely difficult?
> But the performance advantage was compelling, so we just worked at it
> until it worked. There were times when it didn't work properly, but the
> code was in the tree so it got fixed.
>
> It's like that now with page forking - a new technique with compelling
> advantages, and some challenges. In the past, we (the Linux community)
> would rise to the challenge and err on the side of pushing optimizations
> in early. That was our mojo, and that is how Linux became the dominant
> operating system it is today. Do we, the Linux community, still have that
> mojo?

Do you have the mojo to come up with a proposal on how
to make things work, in a way that ensures data consistency
for Linux users?

Yes, we know page forking is not compatible with the way
Linux currently uses refcounts.

The question is, does anyone have an idea on how we could
fix that?

Not necessarily an implementation yet, just an idea might
be enough to move forward at this stage.

However, if nobody wants to work on even an idea, page
forking may simply not be a safe thing to do.

diff mbox

Patch

diff --git a/fs/Makefile b/fs/Makefile
index 91fcfa3..44d7192 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -70,7 +70,6 @@  obj-$(CONFIG_EXT4_FS)		+= ext4/
 obj-$(CONFIG_JBD)		+= jbd/
 obj-$(CONFIG_JBD2)		+= jbd2/
 obj-$(CONFIG_TUX3)		+= tux3/
-obj-$(CONFIG_TUX3_MMAP) 	+= tux3/
 obj-$(CONFIG_CRAMFS)		+= cramfs/
 obj-$(CONFIG_SQUASHFS)		+= squashfs/
 obj-y				+= ramfs/
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..fcd1c61 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@ 
  */
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-	long nr_pages;
-	struct super_block *sb;
-	unsigned long *older_than_this;
-	enum writeback_sync_modes sync_mode;
-	unsigned int tagged_writepages:1;
-	unsigned int for_kupdate:1;
-	unsigned int range_cyclic:1;
-	unsigned int for_background:1;
-	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
-	enum wb_reason reason;		/* why was writeback initiated? */
-
-	struct list_head list;		/* pending work list */
-	struct completion *done;	/* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,36 @@  void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_DIRTY))
+		list_del_init(&inode->i_wb_list);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	inode->dirtied_when = jiffies;
+	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -610,9 +621,9 @@  static long writeback_chunk_size(struct backing_dev_info *bdi,
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-				struct bdi_writeback *wb,
-				struct wb_writeback_work *work)
+static long generic_writeback_sb_inodes(struct super_block *sb,
+					struct bdi_writeback *wb,
+					struct wb_writeback_work *work)
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
@@ -727,6 +738,22 @@  static long writeback_sb_inodes(struct super_block *sb,
 	return wrote;
 }
 
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
+{
+	if (sb->s_op->writeback) {
+		long ret;
+
+		spin_unlock(&wb->list_lock);
+		ret = sb->s_op->writeback(sb, wb, work);
+		spin_lock(&wb->list_lock);
+		return ret;
+	}
+
+	return generic_writeback_sb_inodes(sb, wb, work);
+}
+
 static long __writeback_inodes_wb(struct bdi_writeback *wb,
 				  struct wb_writeback_work *work)
 {
@@ -1293,6 +1320,35 @@  static void wait_sb_inodes(struct super_block *sb)
 }
 
 /**
+ * writeback_queue_work_sb -	schedule writeback work from given super_block
+ * @sb: the superblock
+ * @work: work item to queue
+ *
+ * Schedule writeback work on this super_block. This is usually used to
+ * interact with the sb->s_op->writeback callback. The caller must
+ * guarantee that @work is not freed while the bdi flusher is using it
+ * (for example, it must be safe against umount).
+ */
+void writeback_queue_work_sb(struct super_block *sb,
+			     struct wb_writeback_work *work)
+{
+	if (sb->s_bdi == &noop_backing_dev_info)
+		return;
+
+	/* Allow only following fields to use. */
+	*work = (struct wb_writeback_work){
+		.sb			= sb,
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.done			= work->done,
+		.nr_pages		= work->nr_pages,
+		.reason			= work->reason,
+	};
+	bdi_queue_work(sb->s_bdi, work);
+}
+EXPORT_SYMBOL(writeback_queue_work_sb);
+
+/**
  * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
  * @sb: the superblock
  * @nr: the number of pages to write
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42efe13..29833d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -356,6 +356,8 @@  struct address_space_operations {
 
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
+	void (*truncatepage)(struct address_space *, struct page *,
+			     unsigned int, unsigned int, int);
 	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
 	int (*releasepage) (struct page *, gfp_t);
 	void (*freepage)(struct page *);
@@ -1590,6 +1592,8 @@  extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 
+struct bdi_writeback;
+struct wb_writeback_work;
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
 	void (*destroy_inode)(struct inode *);
@@ -1599,6 +1603,8 @@  struct super_operations {
 	int (*drop_inode) (struct inode *);
 	void (*evict_inode) (struct inode *);
 	void (*put_super) (struct super_block *);
+	long (*writeback)(struct super_block *super, struct bdi_writeback *wb,
+			  struct wb_writeback_work *work);
 	int (*sync_fs)(struct super_block *sb, int wait);
 	int (*freeze_super) (struct super_block *);
 	int (*freeze_fs) (struct super_block *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd5ea30..075f59f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1909,6 +1909,11 @@  vm_unmapped_area(struct vm_unmapped_area_info *info)
 }
 
 /* truncate.c */
+void generic_truncate_partial_page(struct address_space *mapping,
+				   struct page *page, unsigned int start,
+				   unsigned int len);
+void generic_truncate_full_page(struct address_space *mapping,
+				struct page *page, int wait);
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b3736f..13b70160 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -653,6 +653,8 @@  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 extern void delete_from_page_cache(struct page *page);
 extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
+int cow_replace_page_cache(struct page *oldpage, struct page *newpage);
+void cow_delete_from_page_cache(struct page *page);
 
 /*
  * Like add_to_page_cache_locked, but used to add newly allocated pages:
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d9d7e7e..9b67360 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -228,6 +228,20 @@  unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 int page_mkclean(struct page *);
 
 /*
+ * Make clone page for page forking.
+ *
+ * Note: only clones page state so other state such as buffer_heads
+ * must be cloned by caller.
+ */
+struct page *cow_clone_page(struct page *oldpage);
+
+/*
+ * Changes the PTES of shared mappings except the PTE in orig_vma.
+ */
+int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
+		  struct page *newpage);
+
+/*
  * called in munlock()/munmap() path to check for other vmas holding
  * the page mlocked.
  */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0004833..0784b9d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -59,6 +59,25 @@  enum wb_reason {
 };
 
 /*
+ * Passed into wb_writeback(), essentially a subset of writeback_control
+ */
+struct wb_writeback_work {
+	long nr_pages;
+	struct super_block *sb;
+	unsigned long *older_than_this;
+	enum writeback_sync_modes sync_mode;
+	unsigned int tagged_writepages:1;
+	unsigned int for_kupdate:1;
+	unsigned int range_cyclic:1;
+	unsigned int for_background:1;
+	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
+	enum wb_reason reason;		/* why was writeback initiated? */
+
+	struct list_head list;		/* pending work list */
+	struct completion *done;	/* set if the caller waits */
+};
+
+/*
  * A control structure which tells the writeback code what to do.  These are
  * always on the stack, and hence need no locking.  They are always initialised
  * in a manner such that unspecified fields are set to zero.
@@ -90,6 +109,10 @@  struct writeback_control {
  * fs/fs-writeback.c
  */	
 struct bdi_writeback;
+void inode_writeback_done(struct inode *inode);
+void inode_writeback_touch(struct inode *inode);
+void writeback_queue_work_sb(struct super_block *sb,
+			     struct wb_writeback_work *work);
 void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
 							enum wb_reason reason);
diff --git a/mm/filemap.c b/mm/filemap.c
index 673e458..8c641d0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -639,6 +639,88 @@  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
+/*
+ * Atomically replace oldpage with newpage.
+ *
+ * Similar to migrate_pages(), but the oldpage is for writeout.
+ */
+int cow_replace_page_cache(struct page *oldpage, struct page *newpage)
+{
+	struct address_space *mapping = oldpage->mapping;
+	void **pslot;
+
+	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+
+	/* Get refcount for radix-tree */
+	page_cache_get(newpage);
+
+	/* Replace page in radix tree. */
+	spin_lock_irq(&mapping->tree_lock);
+	/* PAGECACHE_TAG_DIRTY represents the frontend's view. Clear it. */
+	if (PageDirty(oldpage))
+		radix_tree_tag_clear(&mapping->page_tree, page_index(oldpage),
+				     PAGECACHE_TAG_DIRTY);
+	/* The refcount on newpage is now held by the radix tree. */
+	pslot = radix_tree_lookup_slot(&mapping->page_tree, oldpage->index);
+	radix_tree_replace_slot(pslot, newpage);
+	__inc_zone_page_state(newpage, NR_FILE_PAGES);
+	__dec_zone_page_state(oldpage, NR_FILE_PAGES);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/* mem_cgroup code must not be called under tree_lock */
+	mem_cgroup_migrate(oldpage, newpage, true);
+
+	/* Release refcount for radix-tree */
+	page_cache_release(oldpage);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cow_replace_page_cache);
+
+/*
+ * Delete page from radix-tree, leaving page->mapping unchanged.
+ *
+ * Similar to delete_from_page_cache(), but the deleted page is for writeout.
+ */
+void cow_delete_from_page_cache(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+
+	/* Delete page from radix tree. */
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * if we're uptodate, flush out into the cleancache, otherwise
+	 * invalidate any existing cleancache entries.  We can't leave
+	 * stale data around in the cleancache once our page is gone
+	 */
+	if (PageUptodate(page) && PageMappedToDisk(page))
+		cleancache_put_page(page);
+	else
+		cleancache_invalidate_page(mapping, page);
+
+	page_cache_tree_delete(mapping, page, NULL);
+#if 0 /* FIXME: backend is assuming page->mapping is available */
+	page->mapping = NULL;
+#endif
+	/* Leave page->index set: truncation lookup relies upon it */
+
+	__dec_zone_page_state(page, NR_FILE_PAGES);
+	BUG_ON(page_mapped(page));
+
+	/*
+	 * The following dirty accounting is done by the writeback
+	 * path, so we don't need to do it here:
+	 *
+	 * dec_zone_page_state(page, NR_FILE_DIRTY);
+	 * dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+	 */
+	spin_unlock_irq(&mapping->tree_lock);
+
+	page_cache_release(page);
+}
+EXPORT_SYMBOL_GPL(cow_delete_from_page_cache);
+
 #ifdef CONFIG_NUMA
 struct page *__page_cache_alloc(gfp_t gfp)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index 71cd5bd..9125246 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -923,6 +923,145 @@  int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+/*
+ * Make a clone page for page forking. (Based on migrate_page_copy())
+ *
+ * Note: this clones only the page state, so other state such as
+ * buffer_heads must be cloned by the caller.
+ */
+struct page *cow_clone_page(struct page *oldpage)
+{
+	struct address_space *mapping = oldpage->mapping;
+	gfp_t gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
+	struct page *newpage = __page_cache_alloc(gfp_mask);
+	int cpupid;
+
+	newpage->mapping = oldpage->mapping;
+	newpage->index = oldpage->index;
+	copy_highpage(newpage, oldpage);
+
+	/* FIXME: right? */
+	BUG_ON(PageSwapCache(oldpage));
+	BUG_ON(PageSwapBacked(oldpage));
+	BUG_ON(PageHuge(oldpage));
+	if (PageError(oldpage))
+		SetPageError(newpage);
+	if (PageReferenced(oldpage))
+		SetPageReferenced(newpage);
+	if (PageUptodate(oldpage))
+		SetPageUptodate(newpage);
+	if (PageActive(oldpage))
+		SetPageActive(newpage);
+	if (PageMappedToDisk(oldpage))
+		SetPageMappedToDisk(newpage);
+
+	/*
+	 * Copy NUMA information to the new page, to prevent over-eager
+	 * future migrations of this same page.
+	 */
+	cpupid = page_cpupid_xchg_last(oldpage, -1);
+	page_cpupid_xchg_last(newpage, cpupid);
+
+	mlock_migrate_page(newpage, oldpage);
+	ksm_migrate_page(newpage, oldpage);
+
+	/* Lock newpage before visible via radix tree */
+	BUG_ON(PageLocked(newpage));
+	__set_page_locked(newpage);
+
+	return newpage;
+}
+EXPORT_SYMBOL_GPL(cow_clone_page);
+
+static int page_cow_one(struct page *oldpage, struct page *newpage,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t oldptval, ptval, *pte;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	pte = page_check_address(oldpage, mm, address, &ptl, 1);
+	if (!pte)
+		goto out;
+
+	flush_cache_page(vma, address, pte_pfn(*pte));
+	oldptval = ptep_clear_flush(vma, address, pte);
+
+	/* Take refcount for PTE */
+	page_cache_get(newpage);
+
+	/*
+	 * vm_page_prot doesn't have the writable bit, so another page
+	 * fault will occur immediately after this one returns. That
+	 * second fault will then be resolved against the forked page
+	 * installed here.
+	 */
+	ptval = mk_pte(newpage, vma->vm_page_prot);
+#if 0
+	/* FIXME: should we propagate the following bits too? Otherwise
+	 * we take at least one extra read-only => write fault. */
+	if (pte_write(oldptval))
+		ptval = pte_mkwrite(ptval);
+	if (pte_dirty(oldptval))
+		ptval = pte_mkdirty(ptval);
+	if (pte_young(oldptval))
+		ptval = pte_mkyoung(ptval);
+#endif
+	set_pte_at(mm, address, pte, ptval);
+
+	/* Update rmap accounting */
+	BUG_ON(PageMlocked(oldpage));	/* Caller should have migrated mlock flag */
+	page_remove_rmap(oldpage);
+	page_add_file_rmap(newpage);
+
+	/* no need to invalidate: a not-present page won't be cached */
+	update_mmu_cache(vma, address, pte);
+
+	pte_unmap_unlock(pte, ptl);
+
+	mmu_notifier_invalidate_page(mm, address);
+
+	/* Release refcount for PTE */
+	page_cache_release(oldpage);
+out:
+	return ret;
+}
+
+/* Replace oldpage with newpage in all PTEs, excluding orig_vma */
+int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
+		  struct page *newpage)
+{
+	struct address_space *mapping = page_mapping(oldpage);
+	pgoff_t pgoff = oldpage->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	int ret = 0;
+
+	BUG_ON(!PageLocked(oldpage));
+	BUG_ON(!PageLocked(newpage));
+	BUG_ON(PageAnon(oldpage));
+	BUG_ON(mapping == NULL);
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		/*
+		 * The orig_vma's PTE is handled by the caller
+		 * (e.g. ->page_mkwrite).
+		 */
+		if (vma == orig_vma)
+			continue;
+
+		if (vma->vm_flags & VM_SHARED) {
+			unsigned long address = vma_address(oldpage, vma);
+			ret += page_cow_one(oldpage, newpage, vma, address);
+		}
+	}
+	i_mmap_unlock_read(mapping);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(page_cow_file);
+
 /**
  * page_move_anon_rmap - move a page to our anon_vma
  * @page:	the page to move to our anon_vma
diff --git a/mm/truncate.c b/mm/truncate.c
index f1e4d60..e5b4673 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -216,6 +216,56 @@  int invalidate_inode_page(struct page *page)
 	return invalidate_complete_page(mapping, page);
 }
 
+void generic_truncate_partial_page(struct address_space *mapping,
+				   struct page *page, unsigned int start,
+				   unsigned int len)
+{
+	wait_on_page_writeback(page);
+	zero_user_segment(page, start, start + len);
+	if (page_has_private(page))
+		do_invalidatepage(page, start, len);
+}
+EXPORT_SYMBOL(generic_truncate_partial_page);
+
+static void truncate_partial_page(struct address_space *mapping, pgoff_t index,
+				  unsigned int start, unsigned int len)
+{
+	struct page *page = find_lock_page(mapping, index);
+	if (!page)
+		return;
+
+	if (!mapping->a_ops->truncatepage)
+		generic_truncate_partial_page(mapping, page, start, len);
+	else
+		mapping->a_ops->truncatepage(mapping, page, start, len, 1);
+
+	cleancache_invalidate_page(mapping, page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+
+void generic_truncate_full_page(struct address_space *mapping,
+				struct page *page, int wait)
+{
+	if (wait)
+		wait_on_page_writeback(page);
+	else if (PageWriteback(page))
+		return;
+
+	truncate_inode_page(mapping, page);
+}
+EXPORT_SYMBOL(generic_truncate_full_page);
+
+static void truncate_full_page(struct address_space *mapping, struct page *page,
+			       int wait)
+{
+	if (!mapping->a_ops->truncatepage)
+		generic_truncate_full_page(mapping, page, wait);
+	else
+		mapping->a_ops->truncatepage(mapping, page, 0, PAGE_CACHE_SIZE,
+					     wait);
+}
+
 /**
  * truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
@@ -298,11 +348,7 @@  void truncate_inode_pages_range(struct address_space *mapping,
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
-			if (PageWriteback(page)) {
-				unlock_page(page);
-				continue;
-			}
-			truncate_inode_page(mapping, page);
+			truncate_full_page(mapping, page, 0);
 			unlock_page(page);
 		}
 		pagevec_remove_exceptionals(&pvec);
@@ -312,37 +358,18 @@  void truncate_inode_pages_range(struct address_space *mapping,
 	}
 
 	if (partial_start) {
-		struct page *page = find_lock_page(mapping, start - 1);
-		if (page) {
-			unsigned int top = PAGE_CACHE_SIZE;
-			if (start > end) {
-				/* Truncation within a single page */
-				top = partial_end;
-				partial_end = 0;
-			}
-			wait_on_page_writeback(page);
-			zero_user_segment(page, partial_start, top);
-			cleancache_invalidate_page(mapping, page);
-			if (page_has_private(page))
-				do_invalidatepage(page, partial_start,
-						  top - partial_start);
-			unlock_page(page);
-			page_cache_release(page);
-		}
-	}
-	if (partial_end) {
-		struct page *page = find_lock_page(mapping, end);
-		if (page) {
-			wait_on_page_writeback(page);
-			zero_user_segment(page, 0, partial_end);
-			cleancache_invalidate_page(mapping, page);
-			if (page_has_private(page))
-				do_invalidatepage(page, 0,
-						  partial_end);
-			unlock_page(page);
-			page_cache_release(page);
+		unsigned int top = PAGE_CACHE_SIZE;
+		if (start > end) {
+			/* Truncation within a single page */
+			top = partial_end;
+			partial_end = 0;
 		}
+		truncate_partial_page(mapping, start - 1, partial_start,
+				      top - partial_start);
 	}
+	if (partial_end)
+		truncate_partial_page(mapping, end, 0, partial_end);
+
 	/*
 	 * If the truncation happened within a single page no pages
 	 * will be released, just zeroed, so we can bail out now.
@@ -386,8 +413,7 @@  void truncate_inode_pages_range(struct address_space *mapping,
 
 			lock_page(page);
 			WARN_ON(page->index != index);
-			wait_on_page_writeback(page);
-			truncate_inode_page(mapping, page);
+			truncate_full_page(mapping, page, 1);
 			unlock_page(page);
 		}
 		pagevec_remove_exceptionals(&pvec);