| Message ID | 55545C2F.8040207@phunq.net (mailing list archive) |
|---|---|
| State | New, archived |
On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> Hi Rik,
>
> Our linux-tux3 tree currently carries this 652 line diff against
> core, to make Tux3 work. This is mainly by Hirofumi, except the
> fs-writeback.c hook, which is by me. The main part you may be
> interested in is rmap.c, which addresses the issues raised at the
> 2013 Linux Storage Filesystem and MM Summit in San Francisco.[1]
>
> LSFMM: Page forking
> http://lwn.net/Articles/548091/
>
> This is just an FYI. An upcoming Tux3 report will be a tour of the
> page forking design and implementation. For now, this is just to give
> a general sense of what we have done. We heard there are concerns
> about how ptrace will work. I really am not familiar with the issue;
> could you please explain what you were thinking of there?

The issue is that things like ptrace, AIO, infiniband RDMA, and other
direct memory access subsystems can take a reference to page A, which
Tux3 clones into a new page B when the process writes it.

However, while the process now points at page B, ptrace, AIO,
infiniband, etc. will still be pointing at page A.

This causes the process and the other subsystem to each look at a
different page, instead of at shared state, causing ptrace to do
nothing, AIO and RDMA data to be invisible (or corrupted), etc.
Hi Rik, Added Mel, Andrea and Peterz to CC as interested parties. There are probably others, please just jump in. On 05/14/2015 05:59 AM, Rik van Riel wrote: > On 05/14/2015 04:26 AM, Daniel Phillips wrote: >> Hi Rik, >> >> Our linux-tux3 tree currently currently carries this 652 line diff >> against core, to make Tux3 work. This is mainly by Hirofumi, except >> the fs-writeback.c hook, which is by me. The main part you may be >> interested in is rmap.c, which addresses the issues raised at the >> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1] >> >> LSFMM: Page forking >> http://lwn.net/Articles/548091/ >> >> This is just a FYI. An upcoming Tux3 report will be a tour of the page >> forking design and implementation. For now, this is just to give a >> general sense of what we have done. We heard there are concerns about >> how ptrace will work. I really am not familiar with the issue, could >> you please explain what you were thinking of there? > > The issue is that things like ptrace, AIO, infiniband > RDMA, and other direct memory access subsystems can take > a reference to page A, which Tux3 clones into a new page B > when the process writes it. > > However, while the process now points at page B, ptrace, > AIO, infiniband, etc will still be pointing at page A. > > This causes the process and the other subsystem to each > look at a different page, instead of at shared state, > causing ptrace to do nothing, AIO and RDMA data to be > invisible (or corrupted), etc... Is this a bit like page migration? Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/14/2015 08:06 PM, Daniel Phillips wrote: > Hi Rik, > > Added Mel, Andrea and Peterz to CC as interested parties. There are > probably others, please just jump in. > > On 05/14/2015 05:59 AM, Rik van Riel wrote: >> On 05/14/2015 04:26 AM, Daniel Phillips wrote: >>> Hi Rik, >>> >>> Our linux-tux3 tree currently currently carries this 652 line diff >>> against core, to make Tux3 work. This is mainly by Hirofumi, except >>> the fs-writeback.c hook, which is by me. The main part you may be >>> interested in is rmap.c, which addresses the issues raised at the >>> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1] >>> >>> LSFMM: Page forking >>> http://lwn.net/Articles/548091/ >>> >>> This is just a FYI. An upcoming Tux3 report will be a tour of the page >>> forking design and implementation. For now, this is just to give a >>> general sense of what we have done. We heard there are concerns about >>> how ptrace will work. I really am not familiar with the issue, could >>> you please explain what you were thinking of there? >> >> The issue is that things like ptrace, AIO, infiniband >> RDMA, and other direct memory access subsystems can take >> a reference to page A, which Tux3 clones into a new page B >> when the process writes it. >> >> However, while the process now points at page B, ptrace, >> AIO, infiniband, etc will still be pointing at page A. >> >> This causes the process and the other subsystem to each >> look at a different page, instead of at shared state, >> causing ptrace to do nothing, AIO and RDMA data to be >> invisible (or corrupted), etc... > > Is this a bit like page migration? Yes. Page migration will fail if there is an "extra" reference to the page that is not accounted for by the migration code. Only pages that have no extra refcount can be migrated. Similarly, your cow code needs to fail if there is an extra reference count pinning the page. 
As long as the page has a user that you cannot migrate, you cannot move any of the other users over. They may rely on data written by the hidden-to-you user, and the hidden-to-you user may write to the page when you think it is a read only stable snapshot.
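Rik's migration analogy can be sketched in miniature. The following is a toy user-space model, not the Tux3 or mm code; `toy_page`, `can_fork_page`, and the accounting are all hypothetical, but they mirror the migration rule he describes: fork only when every reference is one you can find and repoint.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a cached page. Hypothetical names; the kernel's
 * struct page tracks this via its refcount and mapcount. */
struct toy_page {
	int refcount;	/* all references held on the page */
	int mapcount;	/* references from page tables, which rmap can find */
};

/* Migration-style check: one reference belongs to the page cache
 * itself, and mapcount more belong to page tables that rmap can
 * repoint. Any reference beyond that is an unaccounted pin (RDMA,
 * AIO, ...) and the page must not be forked. */
static bool can_fork_page(const struct toy_page *page)
{
	int expected = 1 + page->mapcount;	/* page cache + mappers */

	return page->refcount == expected;
}
```

Usage: with `refcount == 2` and `mapcount == 1` the page is fully accounted and may be forked; once an extra pin appears, `can_fork_page()` returns false and the fork must be skipped or deferred.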
On Thu, May 14, 2015 at 05:06:39PM -0700, Daniel Phillips wrote: > Hi Rik, > > Added Mel, Andrea and Peterz to CC as interested parties. There are > probably others, please just jump in. > > On 05/14/2015 05:59 AM, Rik van Riel wrote: > > On 05/14/2015 04:26 AM, Daniel Phillips wrote: > >> Hi Rik, > >> > >> Our linux-tux3 tree currently currently carries this 652 line diff > >> against core, to make Tux3 work. This is mainly by Hirofumi, except > >> the fs-writeback.c hook, which is by me. The main part you may be > >> interested in is rmap.c, which addresses the issues raised at the > >> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1] > >> > >> LSFMM: Page forking > >> http://lwn.net/Articles/548091/ > >> > >> This is just a FYI. An upcoming Tux3 report will be a tour of the page > >> forking design and implementation. For now, this is just to give a > >> general sense of what we have done. We heard there are concerns about > >> how ptrace will work. I really am not familiar with the issue, could > >> you please explain what you were thinking of there? > > > > The issue is that things like ptrace, AIO, infiniband > > RDMA, and other direct memory access subsystems can take > > a reference to page A, which Tux3 clones into a new page B > > when the process writes it. > > > > However, while the process now points at page B, ptrace, > > AIO, infiniband, etc will still be pointing at page A. > > > > This causes the process and the other subsystem to each > > look at a different page, instead of at shared state, > > causing ptrace to do nothing, AIO and RDMA data to be > > invisible (or corrupted), etc... > > Is this a bit like page migration? > No, it's not.
On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote: > On 05/14/2015 08:06 PM, Daniel Phillips wrote: > > Hi Rik, > > > > Added Mel, Andrea and Peterz to CC as interested parties. There are > > probably others, please just jump in. > > > > On 05/14/2015 05:59 AM, Rik van Riel wrote: > >> On 05/14/2015 04:26 AM, Daniel Phillips wrote: > >>> Hi Rik, > >>> > >>> Our linux-tux3 tree currently currently carries this 652 line diff > >>> against core, to make Tux3 work. This is mainly by Hirofumi, except > >>> the fs-writeback.c hook, which is by me. The main part you may be > >>> interested in is rmap.c, which addresses the issues raised at the > >>> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1] > >>> > >>> LSFMM: Page forking > >>> http://lwn.net/Articles/548091/ > >>> > >>> This is just a FYI. An upcoming Tux3 report will be a tour of the page > >>> forking design and implementation. For now, this is just to give a > >>> general sense of what we have done. We heard there are concerns about > >>> how ptrace will work. I really am not familiar with the issue, could > >>> you please explain what you were thinking of there? > >> > >> The issue is that things like ptrace, AIO, infiniband > >> RDMA, and other direct memory access subsystems can take > >> a reference to page A, which Tux3 clones into a new page B > >> when the process writes it. > >> > >> However, while the process now points at page B, ptrace, > >> AIO, infiniband, etc will still be pointing at page A. > >> > >> This causes the process and the other subsystem to each > >> look at a different page, instead of at shared state, > >> causing ptrace to do nothing, AIO and RDMA data to be > >> invisible (or corrupted), etc... > > > > Is this a bit like page migration? > > Yes. Page migration will fail if there is an "extra" > reference to the page that is not accounted for by > the migration code. 
> When I said it's not like page migration, I was referring to the fact that a COW on a pinned page for RDMA is a different problem to page migration. The COW of a pinned page can lead to lost writes or corruption depending on the ordering of events. Page migration fails when there are unexpected problems to avoid this class of issue which is fine for page migration but may be a critical failure in a filesystem depending on exactly why the copy is required.
On 05/14/2015 08:06 PM, Rik van Riel wrote:
> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>>
>>> This causes the process and the other subsystem to each
>>> look at a different page, instead of at shared state,
>>> causing ptrace to do nothing, AIO and RDMA data to be
>>> invisible (or corrupted), etc...
>>
>> Is this a bit like page migration?
>
> Yes. Page migration will fail if there is an "extra"
> reference to the page that is not accounted for by
> the migration code.
>
> Only pages that have no extra refcount can be migrated.
>
> Similarly, your cow code needs to fail if there is an
> extra reference count pinning the page. As long as
> the page has a user that you cannot migrate, you cannot
> move any of the other users over. They may rely on data
> written by the hidden-to-you user, and the hidden-to-you
> user may write to the page when you think it is a read
> only stable snapshot.

Please bear with me as I study these cases one by one. The first one is
ptrace, which is only for executable files, right? Maybe we don't need
to fork pages in executable files.

Uprobes: if somebody puts a breakpoint in a page and we fork it, the
replacement page has a copy of the breakpoint, and all the code on the
page. Did anything break?

Note: we have the option of being cowardly and just not doing page
forking for mmapped files, or certain kinds of mmapped files, etc. But
first we should give it the old college try, to see if absolute
perfection is possible and practical.
Regards,

Daniel
On 05/15/2015 01:09 AM, Mel Gorman wrote:
> On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
>> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>>> The issue is that things like ptrace, AIO, infiniband
>>>> RDMA, and other direct memory access subsystems can take
>>>> a reference to page A, which Tux3 clones into a new page B
>>>> when the process writes it.
>>>>
>>>> However, while the process now points at page B, ptrace,
>>>> AIO, infiniband, etc will still be pointing at page A.
>>>>
>>>> This causes the process and the other subsystem to each
>>>> look at a different page, instead of at shared state,
>>>> causing ptrace to do nothing, AIO and RDMA data to be
>>>> invisible (or corrupted), etc...
>>>
>>> Is this a bit like page migration?
>>
>> Yes. Page migration will fail if there is an "extra"
>> reference to the page that is not accounted for by
>> the migration code.
>
> When I said it's not like page migration, I was referring to the fact
> that a COW on a pinned page for RDMA is a different problem to page
> migration. The COW of a pinned page can lead to lost writes or
> corruption depending on the ordering of events.

I see the lost writes case, but not the corruption case. Do you mean
corruption by changing a page already in writeout? If so, don't all
filesystems have that problem?

If RDMA to a mmapped file races with write(2) to the same file, maybe
it is reasonable and expected to lose some data.

> Page migration fails
> when there are unexpected problems to avoid this class of issue which is
> fine for page migration but may be a critical failure in a filesystem
> depending on exactly why the copy is required.

Regards,

Daniel
On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
>
> On 05/15/2015 01:09 AM, Mel Gorman wrote:
> > On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> >> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>>> The issue is that things like ptrace, AIO, infiniband
> >>>> RDMA, and other direct memory access subsystems can take
> >>>> a reference to page A, which Tux3 clones into a new page B
> >>>> when the process writes it.
> >>>>
> >>>> However, while the process now points at page B, ptrace,
> >>>> AIO, infiniband, etc will still be pointing at page A.
> >>>>
> >>>> This causes the process and the other subsystem to each
> >>>> look at a different page, instead of at shared state,
> >>>> causing ptrace to do nothing, AIO and RDMA data to be
> >>>> invisible (or corrupted), etc...
> >>>
> >>> Is this a bit like page migration?
> >>
> >> Yes. Page migration will fail if there is an "extra"
> >> reference to the page that is not accounted for by
> >> the migration code.
> >
> > When I said it's not like page migration, I was referring to the fact
> > that a COW on a pinned page for RDMA is a different problem to page
> > migration. The COW of a pinned page can lead to lost writes or
> > corruption depending on the ordering of events.
>
> I see the lost writes case, but not the corruption case,

Data corruption can occur depending on the ordering of events and the
application's expectations. If a process starts IO, RDMA pins the page
for read, and forks are combined with writes from another thread, then
when the IO completes the reads may not be visible. The application may
take improper action at that point.

Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
class of problem.

You can choose to not define this as data corruption because the kernel
is not directly involved, and that's your call.

> Do you
> mean corruption by changing a page already in writeout? If so,
> don't all filesystems have that problem?
> No, the problem is different. Backing devices requiring stable pages will block the write until the IO is complete. For those that do not require stable pages it's ok to allow the write as long as the page is dirtied so that it'll be written out again and no data is lost. > If RDMA to a mmapped file races with write(2) to the same file, > maybe it is reasonable and expected to lose some data. > In the RDMA case, there is at least application awareness to work around the problems. Normally it's ok to have both mapped and write() access to data although userspace might need a lock to co-ordinate updates and event ordering.
On Fri, 15 May 2015, Mel Gorman wrote: > On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote: >> >> >> On 05/15/2015 01:09 AM, Mel Gorman wrote: >>> On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote: >>>> On 05/14/2015 08:06 PM, Daniel Phillips wrote: >>>>>> The issue is that things like ptrace, AIO, infiniband >>>>>> RDMA, and other direct memory access subsystems can take >>>>>> a reference to page A, which Tux3 clones into a new page B >>>>>> when the process writes it. >>>>>> >>>>>> However, while the process now points at page B, ptrace, >>>>>> AIO, infiniband, etc will still be pointing at page A. >>>>>> >>>>>> This causes the process and the other subsystem to each >>>>>> look at a different page, instead of at shared state, >>>>>> causing ptrace to do nothing, AIO and RDMA data to be >>>>>> invisible (or corrupted), etc... >>>>> >>>>> Is this a bit like page migration? >>>> >>>> Yes. Page migration will fail if there is an "extra" >>>> reference to the page that is not accounted for by >>>> the migration code. >>> >>> When I said it's not like page migration, I was referring to the fact >>> that a COW on a pinned page for RDMA is a different problem to page >>> migration. The COW of a pinned page can lead to lost writes or >>> corruption depending on the ordering of events. >> >> I see the lost writes case, but not the corruption case, > > Data corruption can occur depending on the ordering of events and the > applications expectations. If a process starts IO, RDMA pins the page > for read and forks are combined with writes from another thread then when > the IO completes the reads may not be visible. The application may take > improper action at that point. 
if tux3 forks the page and writes the copy while the original page is being modified by other things, this means that some of the changes won't be in the version written (and this could catch partial writes with 'interesting' results if the forking happens at the wrong time) But if the original page gets re-marked as needing to be written out when it's changed by one of the other things that are accessing it, there shouldn't be any long-term corruption. As far as short-term corruption goes, any time you have a page mmapped it could get written out at any time, with only some of the application changes applied to it, so this sort of corruption could happen anyway couldn't it? > Users of RDMA are typically expected to use MADV_DONTFORK to avoid this > class of problem. > > You can choose to not define this as data corruption because thge kernel > is not directly involved and that's your call. > >> Do you >> mean corruption by changing a page already in writeout? If so, >> don't all filesystems have that problem? >> > > No, the problem is different. Backing devices requiring stable pages will > block the write until the IO is complete. For those that do not require > stable pages it's ok to allow the write as long as the page is dirtied so > that it'll be written out again and no data is lost. so if tux3 is prevented from forking the page in cases where the write would be blocked, and will get forked again for follow-up writes if it's modified again otherwise, won't this be the same thing? David Lang >> If RDMA to a mmapped file races with write(2) to the same file, >> maybe it is reasonable and expected to lose some data. >> > > In the RDMA case, there is at least application awareness to work around > the problems. Normally it's ok to have both mapped and write() access > to data although userspace might need a lock to co-ordinate updates and > event ordering. 
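David's "re-marked as needing to be written out" argument can be modeled with a toy snapshot-and-redirty sequence. All names here are hypothetical and this is a user-space sketch of the idea, not the Tux3 implementation: the in-flight writeback sees the forked snapshot, and a write through a remaining reference re-dirties the original so a follow-up writeback picks the change up, avoiding long-term data loss.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy page: contents plus a dirty flag (hypothetical model). */
struct toy_page {
	char data[16];
	bool dirty;
};

/* Fork the page for writeback: the snapshot goes to disk, and the
 * original is considered clean until someone writes it again. */
static struct toy_page fork_for_writeback(struct toy_page *orig)
{
	struct toy_page snap = *orig;

	orig->dirty = false;
	return snap;
}

/* A write through a leftover reference must re-dirty the original,
 * otherwise the change would never reach disk. */
static void write_via_pin(struct toy_page *orig, const char *s)
{
	strncpy(orig->data, s, sizeof(orig->data) - 1);
	orig->dirty = true;
}
```

The in-flight IO still writes the old contents, which is the short-term window David describes; the re-dirty is what guarantees a second writeback catches up.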
On 05/14/2015 03:59 PM, Rik van Riel wrote:
> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>> Hi Rik,
<>
>
> The issue is that things like ptrace, AIO, infiniband
> RDMA, and other direct memory access subsystems can take
> a reference to page A, which Tux3 clones into a new page B
> when the process writes it.
>
> However, while the process now points at page B, ptrace,
> AIO, infiniband, etc will still be pointing at page A.
>

All these problems can also happen with truncate+new-extending-write.

It is the responsibility of the application to take file/range locks
to prevent these page-pinned problems.

> This causes the process and the other subsystem to each
> look at a different page, instead of at shared state,
> causing ptrace to do nothing, AIO and RDMA data to be
> invisible (or corrupted), etc...
>

Again, these problems already exist. Consider each in-place write as a
truncate (punch hole) + new write; is that not the same?

Cheers
Boaz
On 05/17/2015 09:26 AM, Boaz Harrosh wrote: > On 05/14/2015 03:59 PM, Rik van Riel wrote: >> On 05/14/2015 04:26 AM, Daniel Phillips wrote: >>> Hi Rik, > <> >> >> The issue is that things like ptrace, AIO, infiniband >> RDMA, and other direct memory access subsystems can take >> a reference to page A, which Tux3 clones into a new page B >> when the process writes it. >> >> However, while the process now points at page B, ptrace, >> AIO, infiniband, etc will still be pointing at page A. >> > > All these problems can also happen with truncate+new-extending-write > > It is the responsibility of the application to take file/range locks > to prevent these page-pinned problems. It is unreasonable to expect a process that is being ptraced (potentially without its knowledge) to take special measures to protect the ptraced memory from disappearing. It is impossible for the debugger to take those special measures for anonymous memory, or unlinked inodes. I don't think your requirement is workable or reasonable.
On 05/18/2015 05:20 AM, Rik van Riel wrote:
> On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
>> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>>> Hi Rik,
>> <>
>>>
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>>
>>
>> All these problems can also happen with truncate+new-extending-write
>>
>> It is the responsibility of the application to take file/range locks
>> to prevent these page-pinned problems.
>
> It is unreasonable to expect a process that is being ptraced
> (potentially without its knowledge) to take special measures
> to protect the ptraced memory from disappearing.

If the memory disappears, that's a bug. No, the memory is just there;
it is just not reflecting the latest content of the fs-file.

>
> It is impossible for the debugger to take those special measures
> for anonymous memory, or unlinked inodes.
>

Why? One line of added code after the open and before the mmap: do an
flock.

> I don't think your requirement is workable or reasonable.
>

Therefore it is unreasonable to write/modify a ptraced process's file.

Again, what I'm saying is that COWing a page on write has the same
effect as truncate+write. They are both allowed and both might give you
the same "stale" effect. So the precedent is there. We are not
introducing a new anomaly, just introducing a new instance of it.

I guess the question is what applications/procedures are going to
break. Need lots of testing and real life installations to answer that,
I guess.

Thanks
Boaz
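Boaz's "one line of added code after the open and before the mmap" can be spelled out. This is a user-space sketch with a hypothetical helper name and abbreviated error handling; note that flock(2) is advisory, so it only protects against writers that also take the lock.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a file read-only while holding a shared advisory lock taken
 * between open(2) and mmap(2). Cooperating writers take LOCK_EX and
 * are excluded while the mapping (and fd) is held. */
static int map_with_lock(const char *path, void **addr, size_t len)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_SH) < 0) {	/* the one added line */
		close(fd);
		return -1;
	}
	*addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (*addr == MAP_FAILED) {
		close(fd);	/* also drops the lock */
		return -1;
	}
	return fd;	/* caller keeps fd open to hold the lock */
}
```

The caller closes the returned fd (releasing the lock) only after it is done with the mapping, which is exactly the window Rik argues a debugger cannot arrange for anonymous memory or unlinked inodes.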
On Sat, May 16, 2015 at 03:38:04PM -0700, David Lang wrote: > On Fri, 15 May 2015, Mel Gorman wrote: > > >On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote: > >> > >> > >>On 05/15/2015 01:09 AM, Mel Gorman wrote: > >>>On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote: > >>>>On 05/14/2015 08:06 PM, Daniel Phillips wrote: > >>>>>>The issue is that things like ptrace, AIO, infiniband > >>>>>>RDMA, and other direct memory access subsystems can take > >>>>>>a reference to page A, which Tux3 clones into a new page B > >>>>>>when the process writes it. > >>>>>> > >>>>>>However, while the process now points at page B, ptrace, > >>>>>>AIO, infiniband, etc will still be pointing at page A. > >>>>>> > >>>>>>This causes the process and the other subsystem to each > >>>>>>look at a different page, instead of at shared state, > >>>>>>causing ptrace to do nothing, AIO and RDMA data to be > >>>>>>invisible (or corrupted), etc... > >>>>> > >>>>>Is this a bit like page migration? > >>>> > >>>>Yes. Page migration will fail if there is an "extra" > >>>>reference to the page that is not accounted for by > >>>>the migration code. > >>> > >>>When I said it's not like page migration, I was referring to the fact > >>>that a COW on a pinned page for RDMA is a different problem to page > >>>migration. The COW of a pinned page can lead to lost writes or > >>>corruption depending on the ordering of events. > >> > >>I see the lost writes case, but not the corruption case, > > > >Data corruption can occur depending on the ordering of events and the > >applications expectations. If a process starts IO, RDMA pins the page > >for read and forks are combined with writes from another thread then when > >the IO completes the reads may not be visible. The application may take > >improper action at that point. 
> > if tux3 forks the page and writes the copy while the original page > is being modified by other things, this means that some of the > changes won't be in the version written (and this could catch > partial writes with 'interesting' results if the forking happens at > the wrong time) > Potentially yes. There is likely to be some elevated memory usage but I imagine that can be controlled. > But if the original page gets re-marked as needing to be written out > when it's changed by one of the other things that are accessing it, > there shouldn't be any long-term corruption. > > As far as short-term corruption goes, any time you have a page > mmapped it could get written out at any time, with only some of the > application changes applied to it, so this sort of corruption could > happen anyway couldn't it? > That becomes the responsibility of the application. It's up to it to sync appropriately when it knows updates are complete. > >Users of RDMA are typically expected to use MADV_DONTFORK to avoid this > >class of problem. > > > >You can choose to not define this as data corruption because thge kernel > >is not directly involved and that's your call. > > > >>Do you > >>mean corruption by changing a page already in writeout? If so, > >>don't all filesystems have that problem? > >> > > > >No, the problem is different. Backing devices requiring stable pages will > >block the write until the IO is complete. For those that do not require > >stable pages it's ok to allow the write as long as the page is dirtied so > >that it'll be written out again and no data is lost. > > so if tux3 is prevented from forking the page in cases where the > write would be blocked, and will get forked again for follow-up > writes if it's modified again otherwise, won't this be the same > thing? > Functionally and from a correctness point of view, it *might* be equivalent. 
It depends on the implementation and the page life cycle, particularly the details of how the writeback and dirty state are coordinated between the user-visible pages and the page being written back. I've read none of the code or background so I cannot answer whether it's really equivalent or not. Just be aware that it's not the same problem as page migration and that it's not the same as how writeback and dirty state is handled today.
On 05/17/2015 07:20 PM, Rik van Riel wrote:
> On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
>> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>
>> All these problems can also happen with truncate+new-extending-write
>>
>> It is the responsibility of the application to take file/range locks
>> to prevent these page-pinned problems.
>
> It is unreasonable to expect a process that is being ptraced
> (potentially without its knowledge) to take special measures
> to protect the ptraced memory from disappearing.
>
> It is impossible for the debugger to take those special measures
> for anonymous memory, or unlinked inodes.
>
> I don't think your requirement is workable or reasonable.

Hi Rik,

You are quite right to poke at this aggressively. Whether or not there
is an issue needing fixing, we want to know the details. We really need
to do a deep dive into ptrace and know exactly what it does, and
whether Tux3 creates any new kind of hole. I really know very little
about ptrace at the moment; I have only heard that it is a horrible
hack we inherited from some place far away and a time long ago. A
little guidance from you would help.

Somewhere, ptrace must modify the executable page. Unlike uprobes,
which makes sense to me, I did not find where ptrace actually does that
on a quick inspection. Perhaps you could provide a pointer?

Regards,

Daniel
On Thu 14-05-15 01:26:23, Daniel Phillips wrote: > Hi Rik, > > Our linux-tux3 tree currently currently carries this 652 line diff > against core, to make Tux3 work. This is mainly by Hirofumi, except > the fs-writeback.c hook, which is by me. The main part you may be > interested in is rmap.c, which addresses the issues raised at the > 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1] > > LSFMM: Page forking > http://lwn.net/Articles/548091/ > > This is just a FYI. An upcoming Tux3 report will be a tour of the page > forking design and implementation. For now, this is just to give a > general sense of what we have done. We heard there are concerns about > how ptrace will work. I really am not familiar with the issue, could > you please explain what you were thinking of there? So here are a few things I find problematic about page forking (besides the cases with elevated page_count already discussed in this thread - there I believe that anything more complex than "wait for the IO instead of forking when page has elevated use count" isn't going to work. There are too many users depending on too subtle details of the behavior...). Some of them are actually mentioned in the above LWN article: When you create a copy of a page and replace it in the radix tree, nobody in mm subsystem is aware that oldpage may be under writeback. That causes interesting issues: * truncate_inode_pages() can finish before all IO for the file is finished. So far filesystems rely on the fact that once truncate_inode_pages() finishes all running IO against the file is completed and new cannot be submitted. * Writeback can come and try to write newpage while oldpage is still under IO. Then you'll have two IOs against one block which has undefined results. * filemap_fdatawait() called from fsync() has additional problem that it is not aware of oldpage and thus may return although IO hasn't finished yet. 
I understand that Tux3 may avoid these issues due to some other mechanisms it internally has but if page forking should get into mm subsystem, the above must work. Honza > diffstat tux3.core.patch > fs/Makefile | 1 > fs/fs-writeback.c | 100 +++++++++++++++++++++++++-------- > include/linux/fs.h | 6 + > include/linux/mm.h | 5 + > include/linux/pagemap.h | 2 > include/linux/rmap.h | 14 ++++ > include/linux/writeback.h | 23 +++++++ > mm/filemap.c | 82 +++++++++++++++++++++++++++ > mm/rmap.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++ > mm/truncate.c | 98 ++++++++++++++++++++------------ > 10 files changed, 411 insertions(+), 59 deletions(-) > > diff --git a/fs/Makefile b/fs/Makefile > index 91fcfa3..44d7192 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS) += ext4/ > obj-$(CONFIG_JBD) += jbd/ > obj-$(CONFIG_JBD2) += jbd2/ > obj-$(CONFIG_TUX3) += tux3/ > -obj-$(CONFIG_TUX3_MMAP) += tux3/ > obj-$(CONFIG_CRAMFS) += cramfs/ > obj-$(CONFIG_SQUASHFS) += squashfs/ > obj-y += ramfs/ > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 2d609a5..fcd1c61 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -34,25 +34,6 @@ > */ > #define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_CACHE_SHIFT - 10)) > > -/* > - * Passed into wb_writeback(), essentially a subset of writeback_control > - */ > -struct wb_writeback_work { > - long nr_pages; > - struct super_block *sb; > - unsigned long *older_than_this; > - enum writeback_sync_modes sync_mode; > - unsigned int tagged_writepages:1; > - unsigned int for_kupdate:1; > - unsigned int range_cyclic:1; > - unsigned int for_background:1; > - unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ > - enum wb_reason reason; /* why was writeback initiated? 
*/ > - > - struct list_head list; /* pending work list */ > - struct completion *done; /* set if the caller waits */ > -}; > - > /** > * writeback_in_progress - determine whether there is writeback in progress > * @bdi: the device's backing_dev_info structure. > @@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode) > } > > /* > + * Remove inode from writeback list if clean. > + */ > +void inode_writeback_done(struct inode *inode) > +{ > + struct backing_dev_info *bdi = inode_to_bdi(inode); > + > + spin_lock(&bdi->wb.list_lock); > + spin_lock(&inode->i_lock); > + if (!(inode->i_state & I_DIRTY)) > + list_del_init(&inode->i_wb_list); > + spin_unlock(&inode->i_lock); > + spin_unlock(&bdi->wb.list_lock); > +} > +EXPORT_SYMBOL_GPL(inode_writeback_done); > + > +/* > + * Add inode to writeback dirty list with current time. > + */ > +void inode_writeback_touch(struct inode *inode) > +{ > + struct backing_dev_info *bdi = inode_to_bdi(inode); > + > + spin_lock(&bdi->wb.list_lock); > + inode->dirtied_when = jiffies; > + list_move(&inode->i_wb_list, &bdi->wb.b_dirty); > + spin_unlock(&bdi->wb.list_lock); > +} > +EXPORT_SYMBOL_GPL(inode_writeback_touch); > + > +/* > * Redirty an inode: set its when-it-was dirtied timestamp and move it to the > * furthest end of its superblock's dirty-inode list. > * > @@ -610,9 +621,9 @@ static long writeback_chunk_size(struct backing_dev_info *bdi, > * > * Return the number of pages and/or inodes written. 
> */ > -static long writeback_sb_inodes(struct super_block *sb, > - struct bdi_writeback *wb, > - struct wb_writeback_work *work) > +static long generic_writeback_sb_inodes(struct super_block *sb, > + struct bdi_writeback *wb, > + struct wb_writeback_work *work) > { > struct writeback_control wbc = { > .sync_mode = work->sync_mode, > @@ -727,6 +738,22 @@ static long writeback_sb_inodes(struct super_block *sb, > return wrote; > } > > +static long writeback_sb_inodes(struct super_block *sb, > + struct bdi_writeback *wb, > + struct wb_writeback_work *work) > +{ > + if (sb->s_op->writeback) { > + long ret; > + > + spin_unlock(&wb->list_lock); > + ret = sb->s_op->writeback(sb, wb, work); > + spin_lock(&wb->list_lock); > + return ret; > + } > + > + return generic_writeback_sb_inodes(sb, wb, work); > +} > + > static long __writeback_inodes_wb(struct bdi_writeback *wb, > struct wb_writeback_work *work) > { > @@ -1293,6 +1320,35 @@ static void wait_sb_inodes(struct super_block *sb) > } > > /** > + * writeback_queue_work_sb - schedule writeback work from given super_block > + * @sb: the superblock > + * @work: work item to queue > + * > + * Schedule writeback work on this super_block. This usually used to > + * interact with sb->s_op->writeback callback. The caller must > + * guarantee to @work is not freed while bdi flusher is using (for > + * example, be safe against umount). > + */ > +void writeback_queue_work_sb(struct super_block *sb, > + struct wb_writeback_work *work) > +{ > + if (sb->s_bdi == &noop_backing_dev_info) > + return; > + > + /* Allow only following fields to use. 
*/ > + *work = (struct wb_writeback_work){ > + .sb = sb, > + .sync_mode = work->sync_mode, > + .tagged_writepages = work->tagged_writepages, > + .done = work->done, > + .nr_pages = work->nr_pages, > + .reason = work->reason, > + }; > + bdi_queue_work(sb->s_bdi, work); > +} > +EXPORT_SYMBOL(writeback_queue_work_sb); > + > +/** > * writeback_inodes_sb_nr - writeback dirty inodes from given super_block > * @sb: the superblock > * @nr: the number of pages to write > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 42efe13..29833d2 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -356,6 +356,8 @@ struct address_space_operations { > > /* Unfortunately this kludge is needed for FIBMAP. Don't use it */ > sector_t (*bmap)(struct address_space *, sector_t); > + void (*truncatepage)(struct address_space *, struct page *, > + unsigned int, unsigned int, int); > void (*invalidatepage) (struct page *, unsigned int, unsigned int); > int (*releasepage) (struct page *, gfp_t); > void (*freepage)(struct page *); > @@ -1590,6 +1592,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *, > extern ssize_t vfs_writev(struct file *, const struct iovec __user *, > unsigned long, loff_t *); > > +struct bdi_writeback; > +struct wb_writeback_work; > struct super_operations { > struct inode *(*alloc_inode)(struct super_block *sb); > void (*destroy_inode)(struct inode *); > @@ -1599,6 +1603,8 @@ struct super_operations { > int (*drop_inode) (struct inode *); > void (*evict_inode) (struct inode *); > void (*put_super) (struct super_block *); > + long (*writeback)(struct super_block *super, struct bdi_writeback *wb, > + struct wb_writeback_work *work); > int (*sync_fs)(struct super_block *sb, int wait); > int (*freeze_super) (struct super_block *); > int (*freeze_fs) (struct super_block *); > diff --git a/include/linux/mm.h b/include/linux/mm.h > index dd5ea30..075f59f 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1909,6 
+1909,11 @@ vm_unmapped_area(struct vm_unmapped_area_info *info) > } > > /* truncate.c */ > +void generic_truncate_partial_page(struct address_space *mapping, > + struct page *page, unsigned int start, > + unsigned int len); > +void generic_truncate_full_page(struct address_space *mapping, > + struct page *page, int wait); > extern void truncate_inode_pages(struct address_space *, loff_t); > extern void truncate_inode_pages_range(struct address_space *, > loff_t lstart, loff_t lend); > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index 4b3736f..13b70160 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -653,6 +653,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, > extern void delete_from_page_cache(struct page *page); > extern void __delete_from_page_cache(struct page *page, void *shadow); > int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); > +int cow_replace_page_cache(struct page *oldpage, struct page *newpage); > +void cow_delete_from_page_cache(struct page *page); > > /* > * Like add_to_page_cache_locked, but used to add newly allocated pages: > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index d9d7e7e..9b67360 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -228,6 +228,20 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); > int page_mkclean(struct page *); > > /* > + * Make clone page for page forking. > + * > + * Note: only clones page state so other state such as buffer_heads > + * must be cloned by caller. > + */ > +struct page *cow_clone_page(struct page *oldpage); > + > +/* > + * Changes the PTES of shared mappings except the PTE in orig_vma. > + */ > +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage, > + struct page *newpage); > + > +/* > * called in munlock()/munmap() path to check for other vmas holding > * the page mlocked. 
> */ > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > index 0004833..0784b9d 100644 > --- a/include/linux/writeback.h > +++ b/include/linux/writeback.h > @@ -59,6 +59,25 @@ enum wb_reason { > }; > > /* > + * Passed into wb_writeback(), essentially a subset of writeback_control > + */ > +struct wb_writeback_work { > + long nr_pages; > + struct super_block *sb; > + unsigned long *older_than_this; > + enum writeback_sync_modes sync_mode; > + unsigned int tagged_writepages:1; > + unsigned int for_kupdate:1; > + unsigned int range_cyclic:1; > + unsigned int for_background:1; > + unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ > + enum wb_reason reason; /* why was writeback initiated? */ > + > + struct list_head list; /* pending work list */ > + struct completion *done; /* set if the caller waits */ > +}; > + > +/* > * A control structure which tells the writeback code what to do. These are > * always on the stack, and hence need no locking. They are always initialised > * in a manner such that unspecified fields are set to zero. > @@ -90,6 +109,10 @@ struct writeback_control { > * fs/fs-writeback.c > */ > struct bdi_writeback; > +void inode_writeback_done(struct inode *inode); > +void inode_writeback_touch(struct inode *inode); > +void writeback_queue_work_sb(struct super_block *sb, > + struct wb_writeback_work *work); > void writeback_inodes_sb(struct super_block *, enum wb_reason reason); > void writeback_inodes_sb_nr(struct super_block *, unsigned long nr, > enum wb_reason reason); > diff --git a/mm/filemap.c b/mm/filemap.c > index 673e458..8c641d0 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -639,6 +639,88 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, > } > EXPORT_SYMBOL_GPL(add_to_page_cache_lru); > > +/* > + * Atomically replace oldpage with newpage. > + * > + * Similar to migrate_pages(), but the oldpage is for writeout. 
> + */ > +int cow_replace_page_cache(struct page *oldpage, struct page *newpage) > +{ > + struct address_space *mapping = oldpage->mapping; > + void **pslot; > + > + VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage); > + VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); > + > + /* Get refcount for radix-tree */ > + page_cache_get(newpage); > + > + /* Replace page in radix tree. */ > + spin_lock_irq(&mapping->tree_lock); > + /* PAGECACHE_TAG_DIRTY represents the view of frontend. Clear it. */ > + if (PageDirty(oldpage)) > + radix_tree_tag_clear(&mapping->page_tree, page_index(oldpage), > + PAGECACHE_TAG_DIRTY); > + /* The refcount to newpage is used for radix tree. */ > + pslot = radix_tree_lookup_slot(&mapping->page_tree, oldpage->index); > + radix_tree_replace_slot(pslot, newpage); > + __inc_zone_page_state(newpage, NR_FILE_PAGES); > + __dec_zone_page_state(oldpage, NR_FILE_PAGES); > + spin_unlock_irq(&mapping->tree_lock); > + > + /* mem_cgroup codes must not be called under tree_lock */ > + mem_cgroup_migrate(oldpage, newpage, true); > + > + /* Release refcount for radix-tree */ > + page_cache_release(oldpage); > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(cow_replace_page_cache); > + > +/* > + * Delete page from radix-tree, leaving page->mapping unchanged. > + * > + * Similar to delete_from_page_cache(), but the deleted page is for writeout. > + */ > +void cow_delete_from_page_cache(struct page *page) > +{ > + struct address_space *mapping = page->mapping; > + > + /* Delete page from radix tree. */ > + spin_lock_irq(&mapping->tree_lock); > + /* > + * if we're uptodate, flush out into the cleancache, otherwise > + * invalidate any existing cleancache entries. 
We can't leave > + * stale data around in the cleancache once our page is gone > + */ > + if (PageUptodate(page) && PageMappedToDisk(page)) > + cleancache_put_page(page); > + else > + cleancache_invalidate_page(mapping, page); > + > + page_cache_tree_delete(mapping, page, NULL); > +#if 0 /* FIXME: backend is assuming page->mapping is available */ > + page->mapping = NULL; > +#endif > + /* Leave page->index set: truncation lookup relies upon it */ > + > + __dec_zone_page_state(page, NR_FILE_PAGES); > + BUG_ON(page_mapped(page)); > + > + /* > + * The following dirty accounting is done by writeback > + * path. So, we don't need to do here. > + * > + * dec_zone_page_state(page, NR_FILE_DIRTY); > + * dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); > + */ > + spin_unlock_irq(&mapping->tree_lock); > + > + page_cache_release(page); > +} > +EXPORT_SYMBOL_GPL(cow_delete_from_page_cache); > + > #ifdef CONFIG_NUMA > struct page *__page_cache_alloc(gfp_t gfp) > { > diff --git a/mm/rmap.c b/mm/rmap.c > index 71cd5bd..9125246 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -923,6 +923,145 @@ int page_mkclean(struct page *page) > } > EXPORT_SYMBOL_GPL(page_mkclean); > > +/* > + * Make clone page for page forking. (Based on migrate_page_copy()) > + * > + * Note: only clones page state so other state such as buffer_heads > + * must be cloned by caller. > + */ > +struct page *cow_clone_page(struct page *oldpage) > +{ > + struct address_space *mapping = oldpage->mapping; > + gfp_t gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS; > + struct page *newpage = __page_cache_alloc(gfp_mask); > + int cpupid; > + > + newpage->mapping = oldpage->mapping; > + newpage->index = oldpage->index; > + copy_highpage(newpage, oldpage); > + > + /* FIXME: right? 
*/ > + BUG_ON(PageSwapCache(oldpage)); > + BUG_ON(PageSwapBacked(oldpage)); > + BUG_ON(PageHuge(oldpage)); > + if (PageError(oldpage)) > + SetPageError(newpage); > + if (PageReferenced(oldpage)) > + SetPageReferenced(newpage); > + if (PageUptodate(oldpage)) > + SetPageUptodate(newpage); > + if (PageActive(oldpage)) > + SetPageActive(newpage); > + if (PageMappedToDisk(oldpage)) > + SetPageMappedToDisk(newpage); > + > + /* > + * Copy NUMA information to the new page, to prevent over-eager > + * future migrations of this same page. > + */ > + cpupid = page_cpupid_xchg_last(oldpage, -1); > + page_cpupid_xchg_last(newpage, cpupid); > + > + mlock_migrate_page(newpage, oldpage); > + ksm_migrate_page(newpage, oldpage); > + > + /* Lock newpage before visible via radix tree */ > + BUG_ON(PageLocked(newpage)); > + __set_page_locked(newpage); > + > + return newpage; > +} > +EXPORT_SYMBOL_GPL(cow_clone_page); > + > +static int page_cow_one(struct page *oldpage, struct page *newpage, > + struct vm_area_struct *vma, unsigned long address) > +{ > + struct mm_struct *mm = vma->vm_mm; > + pte_t oldptval, ptval, *pte; > + spinlock_t *ptl; > + int ret = 0; > + > + pte = page_check_address(oldpage, mm, address, &ptl, 1); > + if (!pte) > + goto out; > + > + flush_cache_page(vma, address, pte_pfn(*pte)); > + oldptval = ptep_clear_flush(vma, address, pte); > + > + /* Take refcount for PTE */ > + page_cache_get(newpage); > + > + /* > + * vm_page_prot doesn't have writable bit, so page fault will > + * be occurred immediately after returned from this page fault > + * again. And second time of page fault will be resolved with > + * forked page was set here. > + */ > + ptval = mk_pte(newpage, vma->vm_page_prot); > +#if 0 > + /* FIXME: we should check following too? 
Otherwise, we would > + * get additional read-only => write fault at least */ > + if (pte_write) > + ptval = pte_mkwrite(ptval); > + if (pte_dirty(oldptval)) > + ptval = pte_mkdirty(ptval); > + if (pte_young(oldptval)) > + ptval = pte_mkyoung(ptval); > +#endif > + set_pte_at(mm, address, pte, ptval); > + > + /* Update rmap accounting */ > + BUG_ON(!PageMlocked(oldpage)); /* Caller should migrate mlock flag */ > + page_remove_rmap(oldpage); > + page_add_file_rmap(newpage); > + > + /* no need to invalidate: a not-present page won't be cached */ > + update_mmu_cache(vma, address, pte); > + > + pte_unmap_unlock(pte, ptl); > + > + mmu_notifier_invalidate_page(mm, address); > + > + /* Release refcount for PTE */ > + page_cache_release(oldpage); > +out: > + return ret; > +} > + > +/* Change old page in PTEs to new page exclude orig_vma */ > +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage, > + struct page *newpage) > +{ > + struct address_space *mapping = page_mapping(oldpage); > + pgoff_t pgoff = oldpage->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); > + struct vm_area_struct *vma; > + int ret = 0; > + > + BUG_ON(!PageLocked(oldpage)); > + BUG_ON(!PageLocked(newpage)); > + BUG_ON(PageAnon(oldpage)); > + BUG_ON(mapping == NULL); > + > + i_mmap_lock_read(mapping); > + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > + /* > + * The orig_vma's PTE is handled by caller. > + * (e.g. 
->page_mkwrite) > + */ > + if (vma == orig_vma) > + continue; > + > + if (vma->vm_flags & VM_SHARED) { > + unsigned long address = vma_address(oldpage, vma); > + ret += page_cow_one(oldpage, newpage, vma, address); > + } > + } > + i_mmap_unlock_read(mapping); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(page_cow_file); > + > /** > * page_move_anon_rmap - move a page to our anon_vma > * @page: the page to move to our anon_vma > diff --git a/mm/truncate.c b/mm/truncate.c > index f1e4d60..e5b4673 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -216,6 +216,56 @@ int invalidate_inode_page(struct page *page) > return invalidate_complete_page(mapping, page); > } > > +void generic_truncate_partial_page(struct address_space *mapping, > + struct page *page, unsigned int start, > + unsigned int len) > +{ > + wait_on_page_writeback(page); > + zero_user_segment(page, start, start + len); > + if (page_has_private(page)) > + do_invalidatepage(page, start, len); > +} > +EXPORT_SYMBOL(generic_truncate_partial_page); > + > +static void truncate_partial_page(struct address_space *mapping, pgoff_t index, > + unsigned int start, unsigned int len) > +{ > + struct page *page = find_lock_page(mapping, index); > + if (!page) > + return; > + > + if (!mapping->a_ops->truncatepage) > + generic_truncate_partial_page(mapping, page, start, len); > + else > + mapping->a_ops->truncatepage(mapping, page, start, len, 1); > + > + cleancache_invalidate_page(mapping, page); > + unlock_page(page); > + page_cache_release(page); > +} > + > +void generic_truncate_full_page(struct address_space *mapping, > + struct page *page, int wait) > +{ > + if (wait) > + wait_on_page_writeback(page); > + else if (PageWriteback(page)) > + return; > + > + truncate_inode_page(mapping, page); > +} > +EXPORT_SYMBOL(generic_truncate_full_page); > + > +static void truncate_full_page(struct address_space *mapping, struct page *page, > + int wait) > +{ > + if (!mapping->a_ops->truncatepage) > + 
generic_truncate_full_page(mapping, page, wait); > + else > + mapping->a_ops->truncatepage(mapping, page, 0, PAGE_CACHE_SIZE, > + wait); > +} > + > /** > * truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets > * @mapping: mapping to truncate > @@ -298,11 +348,7 @@ void truncate_inode_pages_range(struct address_space *mapping, > if (!trylock_page(page)) > continue; > WARN_ON(page->index != index); > - if (PageWriteback(page)) { > - unlock_page(page); > - continue; > - } > - truncate_inode_page(mapping, page); > + truncate_full_page(mapping, page, 0); > unlock_page(page); > } > pagevec_remove_exceptionals(&pvec); > @@ -312,37 +358,18 @@ void truncate_inode_pages_range(struct address_space *mapping, > } > > if (partial_start) { > - struct page *page = find_lock_page(mapping, start - 1); > - if (page) { > - unsigned int top = PAGE_CACHE_SIZE; > - if (start > end) { > - /* Truncation within a single page */ > - top = partial_end; > - partial_end = 0; > - } > - wait_on_page_writeback(page); > - zero_user_segment(page, partial_start, top); > - cleancache_invalidate_page(mapping, page); > - if (page_has_private(page)) > - do_invalidatepage(page, partial_start, > - top - partial_start); > - unlock_page(page); > - page_cache_release(page); > - } > - } > - if (partial_end) { > - struct page *page = find_lock_page(mapping, end); > - if (page) { > - wait_on_page_writeback(page); > - zero_user_segment(page, 0, partial_end); > - cleancache_invalidate_page(mapping, page); > - if (page_has_private(page)) > - do_invalidatepage(page, 0, > - partial_end); > - unlock_page(page); > - page_cache_release(page); > + unsigned int top = PAGE_CACHE_SIZE; > + if (start > end) { > + /* Truncation within a single page */ > + top = partial_end; > + partial_end = 0; > } > + truncate_partial_page(mapping, start - 1, partial_start, > + top - partial_start); > } > + if (partial_end) > + truncate_partial_page(mapping, end, 0, partial_end); > + > /* > * If the 
truncation happened within a single page no pages > * will be released, just zeroed, so we can bail out now. > @@ -386,8 +413,7 @@ void truncate_inode_pages_range(struct address_space *mapping, > > lock_page(page); > WARN_ON(page->index != index); > - wait_on_page_writeback(page); > - truncate_inode_page(mapping, page); > + truncate_full_page(mapping, page, 1); > unlock_page(page); > } > pagevec_remove_exceptionals(&pvec); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
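The first race Jan describes can be seen in a toy userspace model. This is illustrative Python, not kernel code; every class and method name here is invented for the sketch, and the dict stands in for the radix tree. After a fork, the tree only holds newpage, so a truncate that waits on the pages it can find never sees the IO still in flight against oldpage:

```python
# Toy model of Jan's truncate concern: once the fork replaces oldpage in
# the tree, a naive truncate that walks the tree "finishes" while IO
# against the now-invisible old page is still outstanding.

class Page:
    def __init__(self, index, data):
        self.index = index
        self.data = data
        self.under_writeback = False

class ToyPageCache:
    def __init__(self):
        self.tree = {}                 # index -> Page (stand-in for the radix tree)

    def start_writeback(self, index):
        page = self.tree[index]
        page.under_writeback = True    # IO submitted against this page
        return page                    # the block layer holds this reference

    def fork(self, index):
        """Clone the page and replace it in the tree (page forking)."""
        old = self.tree[index]
        new = Page(index, old.data)
        self.tree[index] = new         # oldpage is now invisible to the tree
        return new

    def truncate_all(self):
        """Naive truncate: wait only on writeback of pages *in the tree*."""
        pending = [p for p in self.tree.values() if p.under_writeback]
        self.tree.clear()
        return pending                 # the pages we actually waited on

cache = ToyPageCache()
cache.tree[0] = Page(0, b"v1")
old = cache.start_writeback(0)         # IO in flight against old page
new = cache.fork(0)                    # process now writes to the clone
new.data = b"v2"
waited_on = cache.truncate_all()
# Truncate never saw the old page, so it "finished" with IO still pending:
assert old.under_writeback and old not in waited_on
```

This is exactly why a filesystem using page forking either needs its own truncate path that knows about forked pages, or the generic path must be taught about them.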
Hi Jan, On 05/19/2015 07:00 AM, Jan Kara wrote: > On Thu 14-05-15 01:26:23, Daniel Phillips wrote: >> Hi Rik, >> >> Our linux-tux3 tree currently carries this 652 line diff >> against core, to make Tux3 work. This is mainly by Hirofumi, except >> the fs-writeback.c hook, which is by me. The main part you may be >> interested in is rmap.c, which addresses the issues raised at the >> 2013 Linux Storage Filesystem and MM Summit in San Francisco.[1] >> >> LSFMM: Page forking >> http://lwn.net/Articles/548091/ >> >> This is just an FYI. An upcoming Tux3 report will be a tour of the page >> forking design and implementation. For now, this is just to give a >> general sense of what we have done. We heard there are concerns about >> how ptrace will work. I really am not familiar with the issue, could >> you please explain what you were thinking of there? > So here are a few things I find problematic about page forking (besides > the cases with elevated page_count already discussed in this thread - there > I believe that anything more complex than "wait for the IO instead of > forking when page has elevated use count" isn't going to work. There are > too many users depending on too subtle details of the behavior...). Some > of them are actually mentioned in the above LWN article: > > When you create a copy of a page and replace it in the radix tree, nobody > in mm subsystem is aware that oldpage may be under writeback. That causes > interesting issues: > * truncate_inode_pages() can finish before all IO for the file is finished. > So far filesystems rely on the fact that once truncate_inode_pages() > finishes all running IO against the file is completed and new cannot be > submitted. We do not use truncate_inode_pages because of issues like that. We use some truncate helpers, which were available in some cases, or else had to be implemented in Tux3 to make everything work properly. The details are Hirofumi's stomping grounds. 
I am pretty sure that his solution is good and tight, or Tux3 would not pass its torture tests. > * Writeback can come and try to write newpage while oldpage is still under > IO. Then you'll have two IOs against one block which has undefined > results. Those writebacks only come from Tux3 (or indirectly from fs-writeback, through our writeback) so we are able to ensure that a dirty block is only written once. (If redirtied, the block will fork, so two dirty blocks are written in two successive deltas.) > * filemap_fdatawait() called from fsync() has additional problem that it is > not aware of oldpage and thus may return although IO hasn't finished yet. We do not use filemap_fdatawait; instead, we wait on completion of our own writeback, which is under our control. > I understand that Tux3 may avoid these issues due to some other mechanisms > it internally has but if page forking should get into mm subsystem, the > above must work. It does work, and by example, it does not need a lot of code to make it work, but the changes are not trivial. Tux3's delta writeback model will not suit everyone, so you can't just lift our code and add it to Ext4. Using it in Ext4 would require a per-inode writeback model, which looks practical to me but far from a weekend project. Maybe something to consider for Ext5. It is the job of new designs like Tux3 to chase after that final drop of performance, not our trusty Ext4 workhorse. Though stranger things have happened - as I recall, Ext4 had O(n) directory operations at one time. Fixing that was not easy, but we did it because we had to. Fixing Ext4's write performance is not urgent by comparison, and the barrier is high; you would want jbd3 for one thing. I think the meta-question you are asking is, where is the second user for this new CoW functionality? With a possible implication that if there is no second user then Tux3 cannot be merged. Is that the question? 
Regards, Daniel
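The delta behavior Daniel describes ("if redirtied, the block will fork so two dirty blocks are written in two successive deltas") can be sketched in a toy model. This is illustrative Python under stated assumptions, not the Tux3 implementation; the names and the per-delta staging dict are invented for the sketch:

```python
# Toy sketch of delta writeback with forking: a block redirtied after its
# delta is committed lands in the *next* delta, so the copy already under
# IO is never modified, and each version is written exactly once.

class ToyDeltaWriter:
    def __init__(self):
        self.delta = 0        # current frontend delta number
        self.staged = {}      # delta -> {addr: data} dirty blocks
        self.log = []         # (delta, addr, data) tuples actually written

    def write(self, addr, data):
        # Dirty data always lands in the current frontend delta; an older
        # staged version of the same block is left untouched (the "fork").
        self.staged.setdefault(self.delta, {})[addr] = data

    def commit(self):
        # Freeze the current delta and write it out; new writes from here
        # on fork into delta+1 rather than touching in-flight copies.
        committing = self.delta
        self.delta += 1
        for addr, data in self.staged.pop(committing, {}).items():
            self.log.append((committing, addr, data))

w = ToyDeltaWriter()
w.write(7, b"v1")
w.commit()            # "IO" begins on v1; frontend moves to delta 1
w.write(7, b"v2")     # redirty: forks -- the v1 copy is never touched
w.commit()
# Each version of block 7 written exactly once, in successive deltas:
assert w.log == [(0, 7, b"v1"), (1, 7, b"v2")]
```

The point of the sketch is only the invariant: one write per version, and no write ever modifies data that is already being written.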
On Tue, 19 May 2015, Daniel Phillips wrote: >> I understand that Tux3 may avoid these issues due to some other mechanisms >> it internally has but if page forking should get into mm subsystem, the >> above must work. > > It does work, and by example, it does not need a lot of code to make > it work, but the changes are not trivial. Tux3's delta writeback model > will not suit everyone, so you can't just lift our code and add it to > Ext4. Using it in Ext4 would require a per-inode writeback model, which > looks practical to me but far from a weekend project. Maybe something > to consider for Ext5. > > It is the job of new designs like Tux3 to chase after that final drop > of performance, not our trusty Ext4 workhorse. Though stranger things > have happened - as I recall, Ext4 had O(n) directory operations at one > time. Fixing that was not easy, but we did it because we had to. Fixing > Ext4's write performance is not urgent by comparison, and the barrier > is high, you would want jbd3 for one thing. > > I think the meta-question you are asking is, where is the second user > for this new CoW functionality? With a possible implication that if > there is no second user then Tux3 cannot be merged. Is that is the > question? I don't think they are asking for a second user. What they are saying is that for this functionality to be accepted in the mm subsystem, these problem cases need to work reliably, not just work for Tux3 because of your implementation. 
So for things that you don't use, you need to make it an error if they get used on a page that's been forked (or not be an error and 'do the right thing'). For cases where it doesn't matter because Tux3 controls the writeback, and it's undefined in general what happens if writeback is triggered twice on the same page, you will need to figure out how to either prevent the second writeback from triggering if there's one in progress, or define how the two writebacks are going to happen so that you can't end up with them re-ordered by some other filesystem. I think that that's what's meant by the top statement that I left in the quote. Even if your implementation details make it safe, these need to be safe even without your implementation details to be acceptable in the core kernel. David Lang
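The guard David asks for — preventing a second writeback submission while the first is in flight — can be sketched as a toy model. This is illustrative Python, not kernel code; the backend class and its deferral policy are invented here purely to make the invariant concrete:

```python
# Toy guard against double writeback: a second submission against a block
# whose previous copy is still under IO is deferred, not raced, and is
# resubmitted only when the first IO completes.

class ToyBackend:
    def __init__(self):
        self.in_flight = set()        # block addresses with IO outstanding
        self.deferred = []            # (addr, data) waiting behind in-flight IO

    def submit(self, addr, data):
        if addr in self.in_flight:
            # Previous copy still being written: defer instead of issuing
            # two IOs against one block (undefined results, per Jan).
            self.deferred.append((addr, data))
            return False
        self.in_flight.add(addr)
        return True

    def complete(self, addr):
        self.in_flight.discard(addr)
        # Resubmit anything that was queued behind this IO, in order.
        still_waiting = []
        for a, d in self.deferred:
            if a == addr:
                self.submit(a, d)
            else:
                still_waiting.append((a, d))
        self.deferred = still_waiting

b = ToyBackend()
assert b.submit(5, b"v1") is True
assert b.submit(5, b"v2") is False    # deferred, not racing the first IO
b.complete(5)
assert 5 in b.in_flight               # deferred copy went out only afterwards
```

Whether a real filesystem defers, waits, or merges the second write is a design choice; the sketch only shows the ordering property David is asking to have defined.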
On Tue 19-05-15 13:33:31, David Lang wrote: > On Tue, 19 May 2015, Daniel Phillips wrote: > > >>I understand that Tux3 may avoid these issues due to some other mechanisms > >>it internally has but if page forking should get into mm subsystem, the > >>above must work. > > > >It does work, and by example, it does not need a lot of code to make > >it work, but the changes are not trivial. Tux3's delta writeback model > >will not suit everyone, so you can't just lift our code and add it to > >Ext4. Using it in Ext4 would require a per-inode writeback model, which > >looks practical to me but far from a weekend project. Maybe something > >to consider for Ext5. > > > >It is the job of new designs like Tux3 to chase after that final drop > >of performance, not our trusty Ext4 workhorse. Though stranger things > >have happened - as I recall, Ext4 had O(n) directory operations at one > >time. Fixing that was not easy, but we did it because we had to. Fixing > >Ext4's write performance is not urgent by comparison, and the barrier > >is high, you would want jbd3 for one thing. > > > >I think the meta-question you are asking is, where is the second user > >for this new CoW functionality? With a possible implication that if > >there is no second user then Tux3 cannot be merged. Is that is the > >question? > > I don't think they are asking for a second user. What they are > saying is that for this functionality to be accepted in the mm > subsystem, these problem cases need to work reliably, not just work > for Tux3 because of your implementation. 
> > So for things that you don't use, you need to make it an error if > they get used on a page that's been forked (or not be an error and > 'do the right thing') > > For cases where it doesn't matter because Tux3 controls the > writeback, and it's undefined in general what happens if writeback > is triggered twice on the same page, you will need to figure out how > to either prevent the second writeback from triggering if there's > one in process, or define how the two writebacks are going to happen > so that you can't end up with them re-ordered by some other > filesystem. > > I think that that's what's meant by the top statement that I left in > the quote. Even if your implementation details make it safe, these > need to be safe even without your implementation details to be > acceptable in the core kernel. Yeah, that's what I meant. If you create a function which manipulates page cache, you better make it work with other functions manipulating page cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting developer. Sure you can document all the conditions under which the function is safe to use, but a function that has several paragraphs in front of it explaining when it is safe to use isn't a very good API... Honza
On 05/20/2015 07:44 AM, Jan Kara wrote: > On Tue 19-05-15 13:33:31, David Lang wrote: >> On Tue, 19 May 2015, Daniel Phillips wrote: >> >>>> I understand that Tux3 may avoid these issues due to some other mechanisms >>>> it internally has but if page forking should get into mm subsystem, the >>>> above must work. >>> >>> It does work, and by example, it does not need a lot of code to make >>> it work, but the changes are not trivial. Tux3's delta writeback model >>> will not suit everyone, so you can't just lift our code and add it to >>> Ext4. Using it in Ext4 would require a per-inode writeback model, which >>> looks practical to me but far from a weekend project. Maybe something >>> to consider for Ext5. >>> >>> It is the job of new designs like Tux3 to chase after that final drop >>> of performance, not our trusty Ext4 workhorse. Though stranger things >>> have happened - as I recall, Ext4 had O(n) directory operations at one >>> time. Fixing that was not easy, but we did it because we had to. Fixing >>> Ext4's write performance is not urgent by comparison, and the barrier >>> is high, you would want jbd3 for one thing. >>> >>> I think the meta-question you are asking is, where is the second user >>> for this new CoW functionality? With a possible implication that if >>> there is no second user then Tux3 cannot be merged. Is that is the >>> question? >> >> I don't think they are asking for a second user. What they are >> saying is that for this functionality to be accepted in the mm >> subsystem, these problem cases need to work reliably, not just work >> for Tux3 because of your implementation. 
>> >> So for things that you don't use, you need to make it an error if >> they get used on a page that's been forked (or not be an error and >> 'do the right thing') >> >> For cases where it doesn't matter because Tux3 controls the >> writeback, and it's undefined in general what happens if writeback >> is triggered twice on the same page, you will need to figure out how >> to either prevent the second writeback from triggering if there's >> one in process, or define how the two writebacks are going to happen >> so that you can't end up with them re-ordered by some other >> filesystem. >> >> I think that that's what's meant by the top statement that I left in >> the quote. Even if your implementation details make it safe, these >> need to be safe even without your implementation details to be >> acceptable in the core kernel. > Yeah, that's what I meant. If you create a function which manipulates > page cache, you better make it work with other functions manipulating page > cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting > developer. Sure you can document all the conditions under which the > function is safe to use but a function that has several paragraphs in front > of it explaning when it is safe to use isn't very good API... Violent agreement, of course. To put it in concrete terms, each of the page fork support functions must be examined and determined sane. They are: * cow_replace_page_cache * cow_delete_from_page_cache * cow_clone_page * page_cow_one * page_cow_file Would it be useful to drill down into those, starting from the top of the list? Regards, Daniel
On Wed, 20 May 2015, Daniel Phillips wrote: > On 05/20/2015 07:44 AM, Jan Kara wrote: >> Yeah, that's what I meant. If you create a function which manipulates >> page cache, you better make it work with other functions manipulating page >> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting >> developer. Sure you can document all the conditions under which the >> function is safe to use but a function that has several paragraphs in front >> of it explaining when it is safe to use isn't very good API... > > Violent agreement, of course. To put it in concrete terms, each of > the page fork support functions must be examined and determined > sane. They are: > > * cow_replace_page_cache > * cow_delete_from_page_cache > * cow_clone_page > * page_cow_one > * page_cow_file > > Would it be useful to drill down into those, starting from the top > of the list? It's a little more than determining that these 5 functions are sane, it's making sure that if someone mixes the use of these functions with other existing functions that the result is sane. but it's probably a good starting point to look at each of these five functions in detail and consider how they work and could interact badly with other things touching the page cache. David Lang
On 05/20/2015 12:22 PM, Daniel Phillips wrote: > On 05/20/2015 07:44 AM, Jan Kara wrote: >> On Tue 19-05-15 13:33:31, David Lang wrote: >> Yeah, that's what I meant. If you create a function which manipulates >> page cache, you better make it work with other functions manipulating page >> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting >> developer. Sure you can document all the conditions under which the >> function is safe to use but a function that has several paragraphs in front >> of it explaining when it is safe to use isn't very good API... > > Violent agreement, of course. To put it in concrete terms, each of > the page fork support functions must be examined and determined > sane. They are: > > * cow_replace_page_cache > * cow_delete_from_page_cache > * cow_clone_page > * page_cow_one > * page_cow_file > > Would it be useful to drill down into those, starting from the top > of the list? How do these interact with other page cache functions, like find_get_page() ? How does tux3 prevent a user of find_get_page() from reading from or writing into the pre-COW page, instead of the current page?
On 05/20/2015 12:53 PM, Rik van Riel wrote: > On 05/20/2015 12:22 PM, Daniel Phillips wrote: >> On 05/20/2015 07:44 AM, Jan Kara wrote: >>> On Tue 19-05-15 13:33:31, David Lang wrote: > >>> Yeah, that's what I meant. If you create a function which manipulates >>> page cache, you better make it work with other functions manipulating page >>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting >>> developer. Sure you can document all the conditions under which the >>> function is safe to use but a function that has several paragraphs in front >>> of it explaining when it is safe to use isn't very good API... >> >> Violent agreement, of course. To put it in concrete terms, each of >> the page fork support functions must be examined and determined >> sane. They are: >> >> * cow_replace_page_cache >> * cow_delete_from_page_cache >> * cow_clone_page >> * page_cow_one >> * page_cow_file >> >> Would it be useful to drill down into those, starting from the top >> of the list? > > How do these interact with other page cache functions, like > find_get_page() ? Nicely: https://github.com/OGAWAHirofumi/linux-tux3/blob/hirofumi/fs/tux3/filemap_mmap.c#L182 > How does tux3 prevent a user of find_get_page() from reading from > or writing into the pre-COW page, instead of the current page? Careful control of the dirty bits (we have two of them, one each for front and back). That is what pagefork_for_blockdirty is about. Regards, Daniel
On 05/20/2015 03:51 PM, Daniel Phillips wrote: > On 05/20/2015 12:53 PM, Rik van Riel wrote: >> How does tux3 prevent a user of find_get_page() from reading from >> or writing into the pre-COW page, instead of the current page? > > Careful control of the dirty bits (we have two of them, one each > for front and back). That is what pagefork_for_blockdirty is about. Ah, and of course it does not matter if a reader is on the pre-cow page. It would be reading the earlier copy, which might no longer be the current copy, but it raced with the write so nobody should be surprised. That is a race even without page fork. Regards, Daniel
On Wed, 20 May 2015, Daniel Phillips wrote: > On 05/20/2015 03:51 PM, Daniel Phillips wrote: >> On 05/20/2015 12:53 PM, Rik van Riel wrote: >>> How does tux3 prevent a user of find_get_page() from reading from >>> or writing into the pre-COW page, instead of the current page? >> >> Careful control of the dirty bits (we have two of them, one each >> for front and back). That is what pagefork_for_blockdirty is about. > > Ah, and of course it does not matter if a reader is on the > pre-cow page. It would be reading the earlier copy, which might > no longer be the current copy, but it raced with the write so > nobody should be surprised. That is a race even without page fork. how do you prevent it from continuing to interact with the old version of the page and never see updates or have its changes reflected on the current page? David Lang
On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote: > how do you prevent it from continuing to interact with the old > version of the page and never see updates or have its changes > reflected on the current page? Why would it do that, and what would be surprising about it? Did you have a specific case in mind? Regards, Daniel
On 05/21/2015 03:53 PM, Daniel Phillips wrote: > On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote: >> how do you prevent it from continuing to interact with the old version >> of the page and never see updates or have it's changes reflected on >> the current page? > > Why would it do that, and what would be surprising about it? Did > you have a specific case in mind? After a get_page(), page_cache_get(), or other equivalent function, a piece of code has the expectation that it can continue using that page until after it has released the reference count. This can be an arbitrarily long period of time.
On Monday, May 25, 2015 9:25:44 PM PDT, Rik van Riel wrote: > On 05/21/2015 03:53 PM, Daniel Phillips wrote: >> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote: >>> how do you prevent it from continuing to interact with the old version >>> of the page and never see updates or have its changes reflected on >>> the current page? >> >> Why would it do that, and what would be surprising about it? Did >> you have a specific case in mind? > > After a get_page(), page_cache_get(), or other equivalent > function, a piece of code has the expectation that it can > continue using that page until after it has released the > reference count. > > This can be an arbitrarily long period of time. It is perfectly welcome to keep using that page as long as it wants, Tux3 does not care. When it lets go of the last reference (and Tux3 has finished with it) then the page is freeable. Did you have a more specific example where this would be an issue? Are you talking about kernel or userspace code? Regards, Daniel
On Mon, 25 May 2015, Daniel Phillips wrote: > On Monday, May 25, 2015 9:25:44 PM PDT, Rik van Riel wrote: >> On 05/21/2015 03:53 PM, Daniel Phillips wrote: >>> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote: >>>> how do you prevent it from continuing to interact with the old version >>>> of the page and never see updates or have its changes reflected on >>>> the current page? >>> >>> Why would it do that, and what would be surprising about it? Did >>> you have a specific case in mind? >> >> After a get_page(), page_cache_get(), or other equivalent >> function, a piece of code has the expectation that it can >> continue using that page until after it has released the >> reference count. >> >> This can be an arbitrarily long period of time. > > It is perfectly welcome to keep using that page as long as it > wants, Tux3 does not care. When it lets go of the last reference > (and Tux3 has finished with it) then the page is freeable. Did > you have a more specific example where this would be an issue? > Are you talking about kernel or userspace code? if the page gets modified again, will that cause any issues? what if the page gets modified before the copy gets written out, so that there are two dirty copies of the page in the process of being written? David Lang
On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote: > if the page gets modified again, will that cause any issues? > what if the page gets modified before the copy gets written out, > so that there are two dirty copies of the page in the process of > being written? > > David Lang How is the page going to get modified again? A forked page isn't mapped by a pte, so userspace can't modify it by mmap. The forked page is not in the page cache, so userspace can't modify it by posix file ops. So the writer would have to be in kernel. Tux3 knows what it is doing, so it won't modify the page. What kernel code besides Tux3 will modify the page? Regards, Daniel
On Mon, 25 May 2015, Daniel Phillips wrote: > On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote: >> if the page gets modified again, will that cause any issues? what if the >> page gets modified before the copy gets written out, so that there are two >> dirty copies of the page in the process of being written? >> >> David Lang > > How is the page going to get modified again? A forked page isn't > mapped by a pte, so userspace can't modify it by mmap. The forked > page is not in the page cache, so userspace can't modify it by > posix file ops. So the writer would have to be in kernel. Tux3 > knows what it is doing, so it won't modify the page. What kernel > code besides Tux3 will modify the page? I'm assuming that Rik is talking about whatever has the reference to the page via one of the methods that he talked about. David Lang
On Mon 25-05-15 23:11:11, Daniel Phillips wrote: > On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote: > >if the page gets modified again, will that cause any issues? what > >if the page gets modified before the copy gets written out, so > >that there are two dirty copies of the page in the process of > >being written? > > > >David Lang > > How is the page going to get modified again? A forked page isn't > mapped by a pte, so userspace can't modify it by mmap. The forked > page is not in the page cache, so userspace can't modify it by > posix file ops. So the writer would have to be in kernel. Tux3 > knows what it is doing, so it won't modify the page. What kernel > code besides Tux3 will modify the page? E.g. video drivers (or infiniband or direct IO for that matter) which have buffers in user memory (may be mmapped file), grab references to pages and hand out PFNs of those pages to the hardware to store data in them... If you fork a page after the driver has handed PFNs to the hardware, you've just lost all the writes hardware will do. Honza
On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote: > E.g. video drivers (or infiniband or direct IO for that matter) which > have buffers in user memory (may be mmapped file), grab references to pages > and hand out PFNs of those pages to the hardware to store data in them... > If you fork a page after the driver has handed PFNs to the hardware, you've > just lost all the writes hardware will do. Hi Jan, The page forked because somebody wrote to it with write(2) or mmap write at the same time as a video driver (or infiniband or direct IO) was doing io to it. Isn't the application trying hard to lose data in that case? It would not need page fork to lose data that way. Regards, Daniel
On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote: > I'm assuming that Rik is talking about whatever has the > reference to the page via one of the methods that he talked > about. This would be a good moment to provide specifics. Regards, Daniel
On Tue 26-05-15 01:08:56, Daniel Phillips wrote: > On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote: > > E.g. video drivers (or infiniband or direct IO for that matter) which > >have buffers in user memory (may be mmapped file), grab references to pages > >and hand out PFNs of those pages to the hardware to store data in them... > >If you fork a page after the driver has handed PFNs to the hardware, you've > >just lost all the writes hardware will do. > > Hi Jan, > > The page forked because somebody wrote to it with write(2) or mmap write at > the same time as a video driver (or infiniband or direct IO) was > doing io to > it. Isn't the application trying hard to lose data in that case? It > would not need page fork to lose data that way. So I can think of two valid uses: 1) You setup IO to part of a page and modify from userspace a different part of a page. 2) At least for video drivers there is one ioctl() which creates object with buffers in memory and another ioctl() to actually ship it to hardware (may be called repeatedly). So in theory app could validly dirty the pages before it ships them to hardware. If this happens repeatedly and interacts badly with background writeback, you will end up with a forked page in a buffer and from that point on things are broken. So my opinion is: Don't fork the page if page_count is elevated. You can just wait for the IO if you need stable pages in that case. It's slow but it's safe and it should be pretty rare. Is there any problem with that? Honza
On Tue 2015-05-26 01:09:59, Daniel Phillips wrote: > On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote: > >I'm assuming that Rik is talking about whatever has the reference to the > >page via one of the methods that he talked about. > > This would be a good moment to provide specifics. Hmm. This seems like a good moment for you to audit whole kernel, to make sure it does not do stuff you don't expect it to. You are changing core semantics, stuff that was allowed before is not allowed now, so it looks like you should do the auditing... You may want to start with video4linux, as Jan pointed out. Pavel
On (05/26/15 01:08), Daniel Phillips wrote: > On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote: > > E.g. video drivers (or infiniband or direct IO for that matter) which > >have buffers in user memory (may be mmapped file), grab references to pages > >and hand out PFNs of those pages to the hardware to store data in them... > >If you fork a page after the driver has handed PFNs to the hardware, you've > >just lost all the writes hardware will do. > > Hi Jan, > > The page forked because somebody wrote to it with write(2) or mmap write at > the same time as a video driver (or infiniband or direct IO) was doing io to > it. Isn't the application trying hard to lose data in that case? It would > not need page fork to lose data that way. > Hello, is it possible to page-fork-bomb the system by some 'malicious' app? -ss
On Tue 26-05-15 19:22:39, Sergey Senozhatsky wrote: > On (05/26/15 01:08), Daniel Phillips wrote: > > On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote: > > > E.g. video drivers (or infiniband or direct IO for that matter) which > > >have buffers in user memory (may be mmapped file), grab references to pages > > >and hand out PFNs of those pages to the hardware to store data in them... > > >If you fork a page after the driver has handed PFNs to the hardware, you've > > >just lost all the writes hardware will do. > > > > Hi Jan, > > > > The page forked because somebody wrote to it with write(2) or mmap write at > > the same time as a video driver (or infiniband or direct IO) was doing io to > > it. Isn't the application trying hard to lose data in that case? It would > > not need page fork to lose data that way. > > > > Hello, > > is it possible to page-fork-bomb the system by some 'malicious' app? Well, you can have only two copies of each page - the one under writeout and the one in page cache. Furthermore you are limited by dirty throttling so I don't think this would allow any out-of-ordinary DOS vector... Honza
Hi Sergey, On 05/26/2015 03:22 AM, Sergey Senozhatsky wrote: > > Hello, > > is it possible to page-fork-bomb the system by some 'malicious' app? Not in any new way. A page fork can happen either in the front end, where it has to wait for memory like any other normal memory user, or in the backend, where Tux3 may have privileged access to low memory reserves and therefore must place bounds on its memory use like any other user of low memory reserves. This is not specific to page fork. We must place such bounds for any memory that the backend uses. Fortunately, the backend does not allocate memory extravagantly, for fork or anything else, so when this does get to the top of our to-do list it should not be too hard to deal with. We plan to attack that after merge, as we have never observed a problem in practice. Rather, Tux3 already seems to survive low memory situations pretty well compared to some other filesystems. Regards, Daniel
On 05/26/2015 02:00 AM, Jan Kara wrote: > On Tue 26-05-15 01:08:56, Daniel Phillips wrote: >> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote: >>> E.g. video drivers (or infiniband or direct IO for that matter) which >>> have buffers in user memory (may be mmapped file), grab references to pages >>> and hand out PFNs of those pages to the hardware to store data in them... >>> If you fork a page after the driver has handed PFNs to the hardware, you've >>> just lost all the writes hardware will do. >> >> Hi Jan, >> >> The page forked because somebody wrote to it with write(2) or mmap write at >> the same time as a video driver (or infiniband or direct IO) was >> doing io to >> it. Isn't the application trying hard to lose data in that case? It >> would not need page fork to lose data that way. > > So I can think of two valid uses: > > 1) You setup IO to part of a page and modify from userspace a different > part of a page. Suppose the use case is reading textures from video memory into a mmapped file, and at the same time, the application is allowed to update the textures in the file via mmap or write(2). Fork happens at mkwrite time. If the page is already dirty, we do not fork it. The video API must have made the page writable and dirty, so I do not see an issue. > 2) At least for video drivers there is one ioctl() which creates object > with buffers in memory and another ioctl() to actually ship it to hardware > (may be called repeatedly). So in theory app could validly dirty the pages > before it ships them to hardware. If this happens repeatedly and interacts > badly with background writeback, you will end up with a forked page in a > buffer and from that point on things are broken. Writeback does not fork pages. An app may dirty a page that is in process of being shipped to hardware (must be a distinct part of the page, or it is a race) and the data being sent to hardware will not be disturbed. If there is an issue here, I do not see it. 
> So my opinion is: Don't fork the page if page_count is elevated. You can > just wait for the IO if you need stable pages in that case. It's slow but > it's safe and it should be pretty rare. Is there any problem with that? That would be our fallback if anybody discovers a specific case where page fork breaks something, which so far has not been demonstrated. With a known fallback, it is hard to see why we should delay merging over that. Perfection has never been a requirement for merging filesystems. On the contrary, imperfection is a reason for merging, so that the many eyeballs effect may prove its value. Regards, Daniel
On 05/26/2015 04:22 PM, Daniel Phillips wrote: > On 05/26/2015 02:00 AM, Jan Kara wrote: >> So my opinion is: Don't fork the page if page_count is elevated. You can >> just wait for the IO if you need stable pages in that case. It's slow but >> it's safe and it should be pretty rare. Is there any problem with that? > > That would be our fallback if anybody discovers a specific case where page > fork breaks something, which so far has not been demonstrated. > > With a known fallback, it is hard to see why we should delay merging over > that. Perfection has never been a requirement for merging filesystems. On However, avoiding data corruption by erring on the side of safety is a pretty basic requirement. > the contrary, imperfection is a reason for merging, so that the many > eyeballs effect may prove its value. If you skip the page fork when there is an elevated page count, tux3 should be safe (at least from that aspect). Only do the COW when there is no "strange" use of the page going on.
On 05/26/2015 02:36 PM, Rik van Riel wrote: > On 05/26/2015 04:22 PM, Daniel Phillips wrote: >> On 05/26/2015 02:00 AM, Jan Kara wrote: >>> So my opinion is: Don't fork the page if page_count is elevated. You can >>> just wait for the IO if you need stable pages in that case. It's slow but >>> it's safe and it should be pretty rare. Is there any problem with that? >> >> That would be our fallback if anybody discovers a specific case where page >> fork breaks something, which so far has not been demonstrated. >> >> With a known fallback, it is hard to see why we should delay merging over >> that. Perfection has never been a requirement for merging filesystems. On > > However, avoiding data corruption by erring on the side of safety is > a pretty basic requirement. Erring on the side of safety is still an error. As a community we have never been fond of adding code or overhead to fix theoretical bugs. I do not see why we should relax that principle now. We can fix actual bugs, but theoretical bugs are only shapeless specters passing in the night. We should not become frozen in fear of them. >> the contrary, imperfection is a reason for merging, so that the many >> eyeballs effect may prove its value. > > If you skip the page fork when there is an elevated page count, tux3 > should be safe (at least from that aspect). Only do the COW when there > is no "strange" use of the page going on. Then you break the I in ACID. There must be a compelling reason to do that. Regards, Daniel
On Fri 2015-05-15 02:38:33, Daniel Phillips wrote: > On 05/14/2015 08:06 PM, Rik van Riel wrote: > > On 05/14/2015 08:06 PM, Daniel Phillips wrote: > >>> The issue is that things like ptrace, AIO, infiniband > >>> RDMA, and other direct memory access subsystems can take > >>> a reference to page A, which Tux3 clones into a new page B > >>> when the process writes it. > >>> > >>> However, while the process now points at page B, ptrace, > >>> AIO, infiniband, etc will still be pointing at page A. > >>> > >>> This causes the process and the other subsystem to each > >>> look at a different page, instead of at shared state, > >>> causing ptrace to do nothing, AIO and RDMA data to be > >>> invisible (or corrupted), etc... > >> > >> Is this a bit like page migration? > > > > Yes. Page migration will fail if there is an "extra" > > reference to the page that is not accounted for by > > the migration code. > > > > Only pages that have no extra refcount can be migrated. > > > > Similarly, your cow code needs to fail if there is an > > extra reference count pinning the page. As long as > > the page has a user that you cannot migrate, you cannot > > move any of the other users over. They may rely on data > > written by the hidden-to-you user, and the hidden-to-you > > user may write to the page when you think it is a read > > only stable snapshot. > > Please bear with me as I study these cases one by one. > > First one is ptrace. Only for executable files, right? > Maybe we don't need to fork pages in executable files, Umm. Why do you think it is only issue for executable files? I'm free to mmap() any file, and then execute from it. /lib/ld-linux.so /path/to/binary is known way to exec programs that do not have x bit set. Pavel
On Tue 26-05-15 13:22:38, Daniel Phillips wrote: > On 05/26/2015 02:00 AM, Jan Kara wrote: > > On Tue 26-05-15 01:08:56, Daniel Phillips wrote: > >> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote: > >>> E.g. video drivers (or infiniband or direct IO for that matter) which > >>> have buffers in user memory (may be mmapped file), grab references to pages > >>> and hand out PFNs of those pages to the hardware to store data in them... > >>> If you fork a page after the driver has handed PFNs to the hardware, you've > >>> just lost all the writes hardware will do. > >> > >> Hi Jan, > >> > >> The page forked because somebody wrote to it with write(2) or mmap write at > >> the same time as a video driver (or infiniband or direct IO) was > >> doing io to > >> it. Isn't the application trying hard to lose data in that case? It > >> would not need page fork to lose data that way. > > > > So I can think of two valid uses: > > > > 1) You setup IO to part of a page and modify from userspace a different > > part of a page. > > Suppose the use case is reading textures from video memory into a mmapped > file, and at the same time, the application is allowed to update the > textures in the file via mmap or write(2). Fork happens at mkwrite time. > If the page is already dirty, we do not fork it. The video API must have > made the page writable and dirty, so I do not see an issue. So there are a few things to have in mind: 1) There is nothing like a "writeable" page. Page is always writeable (at least on x86 architecture). When a page is mapped into some virtual address space (or more of them), this *mapping* can be either writeable or read-only. mkwrite changes the mapping from read-only to writeable but kernel / hardware is free to write to the page regardless of the mapping. 2) When kernel / hardware writes to the page, it first modifies the page and then marks it dirty. So what can happen in this scenario is: 1) You hand kernel a part of a page as a buffer. 
page_mkwrite() happens, page is dirtied, kernel notes a PFN of the page somewhere internally. 2) Writeback comes and starts writeback for the page. 3) Kernel ships the PFN to the hardware. 4) Userspace comes and wants to write to the page (different part than the HW is instructed to use). page_mkwrite is called, page is forked. Userspace writes to the forked page. 5) HW stores its data in the original page. Userspace never sees data from the HW! Data corrupted where without page forking everything would work just fine. Another possible scenario: 1) Userspace app tells kernel to setup a HW buffer in a page. 2) Userspace app fills page with data -> page_mkwrite is called, page is dirtied. 3) Userspace app tells kernel to ship buffer to video HW. 4) Writeback comes and starts writeback for the page 5) Video HW is done with the page. Userspace app fills new set of data into the page -> page_mkwrite is called, page is forked. 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the old data from the original page. Again a data corruption issue where previously things were working fine. > > 2) At least for video drivers there is one ioctl() which creates object > > with buffers in memory and another ioctl() to actually ship it to hardware > > (may be called repeatedly). So in theory app could validly dirty the pages > > before it ships them to hardware. If this happens repeatedly and interacts > > badly with background writeback, you will end up with a forked page in a > > buffer and from that point on things are broken. > > Writeback does not fork pages. An app may dirty a page that is in process > of being shipped to hardware (must be a distinct part of the page, or it is > a race) and the data being sent to hardware will not be disturbed. If there > is an issue here, I do not see it. > > > So my opinion is: Don't fork the page if page_count is elevated. You can > > just wait for the IO if you need stable pages in that case. 
It's slow but > > it's safe and it should be pretty rare. Is there any problem with that? > > That would be our fallback if anybody discovers a specific case where page > fork breaks something, which so far has not been demonstrated. > > With a known fallback, it is hard to see why we should delay merging over > that. Perfection has never been a requirement for merging filesystems. On > the contrary, imperfection is a reason for merging, so that the many > eyeballs effect may prove its value. Sorry, but you've got several people telling you they are concerned about the safety of your approach. That is the many eyeballs effect. And data corruption issues aren't problems you can just wave away with "let's wait whether it really happens". Honza
On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote: > On Fri 2015-05-15 02:38:33, Daniel Phillips wrote: >> On 05/14/2015 08:06 PM, Rik van Riel wrote: ... > > Umm. Why do you think it is only issue for executable files? I meant: files with code in them, that will be executed. Please excuse me for colliding with the chmod sense. I will say "code files" to avoid ambiguity. > I'm free to mmap() any file, and then execute from it. > > /lib/ld-linux.so /path/to/binary > > is known way to exec programs that do not have x bit set. So... why would I write to a code file at the same time as stepping through it with ptrace? Should I expect ptrace to work perfectly if I do that? What would "work perfectly" mean, if the code is changing at the same time as being traced? Regards, Daniel
On Wed 2015-05-27 11:09:25, Daniel Phillips wrote: > On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote: > >On Fri 2015-05-15 02:38:33, Daniel Phillips wrote: > >>On 05/14/2015 08:06 PM, Rik van Riel wrote: ... > > > >Umm. Why do you think it is only issue for executable files? > > I meant: files with code in them, that will be executed. Please excuse > me for colliding with the chmod sense. I will say "code files" to avoid > ambiguity. > > >I'm free to mmap() any file, and then execute from it. > > > >/lib/ld-linux.so /path/to/binary > > > >is known way to exec programs that do not have x bit set. > > So... why would I write to a code file at the same time as stepping > through it with ptrace? Should I expect ptrace to work perfectly if > I do that? What would "work perfectly" mean, if the code is changing > at the same time as being traced? Do you have any imagination at all? Reasons I should expect ptrace to work perfectly if I'm writing to file: 1) it used to work before 2) it used to work before 3) it used to work before and regressions are not allowed 4) some kind of just in time compiler 5) some kind of malware, playing tricks so that you have trouble analyzing it and of course, 6) it used to work before. Pavel
On 05/27/2015 02:37 PM, Pavel Machek wrote: > On Wed 2015-05-27 11:09:25, Daniel Phillips wrote: >> On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote: >>> On Fri 2015-05-15 02:38:33, Daniel Phillips wrote: >>>> On 05/14/2015 08:06 PM, Rik van Riel wrote: ... >>> >>> Umm. Why do you think it is only issue for executable files? >> >> I meant: files with code in them, that will be executed. Please excuse >> me for colliding with the chmod sense. I will say "code files" to avoid >> ambiguity. >> >>> I'm free to mmap() any file, and then execute from it. >>> >>> /lib/ld-linux.so /path/to/binary >>> >>> is known way to exec programs that do not have x bit set. >> >> So... why would I write to a code file at the same time as stepping >> through it with ptrace? Should I expect ptrace to work perfectly if >> I do that? What would "work perfectly" mean, if the code is changing >> at the same time as being traced? > > Do you have any imagination at all? [Non-collegial rhetoric alert, it would be helpful to avoid that.] > Reasons I should expect ptrace to work perfectly if I'm writing to > file: > > 1) it used to work before > > 2) it used to work before > > 3) it used to work before and regressions are not allowed Are you sure that ptrace will work perfectly on a file that you are writing to at the same time as tracing? If so, it has magic that I do not understand. Could you please explain. > 4) some kind of just in time compiler A JIT that can tolerate being written to by a task it knows nothing about, at the same time as it is generating code in the file? I do not know of any such JIT. > 5) some kind of malware, playing tricks so that you have trouble > analyzing it By writing to a code file? Then it already has write access to the code file, so it has already gotten inside your security perimeter without needing help from page fork. That said, we should be alert for any new holes that page fork might open. 
But if there are any, they should be actual holes, not theoretical ones. > and of course, > > 6) it used to work before. I look forward to your explanation of how. Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jan Kara <jack@suse.cz> writes:

Hi,

> So there are a few things to have in mind:
> 1) There is nothing like a "writeable" page. Page is always writeable (at
> least on x86 architecture). When a page is mapped into some virtual address
> space (or more of them), this *mapping* can be either writeable or read-only.
> mkwrite changes the mapping from read-only to writeable but kernel /
> hardware is free to write to the page regardless of the mapping.
>
> 2) When kernel / hardware writes to the page, it first modifies the page
> and then marks it dirty.
>
> So what can happen in this scenario is:
>
> 1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
> page is dirtied, kernel notes a PFN of the page somewhere internally.
>
> 2) Writeback comes and starts writeback for the page.
>
> 3) Kernel ships the PFN to the hardware.
>
> 4) Userspace comes and wants to write to the page (different part than the
> HW is instructed to use). page_mkwrite is called, page is forked.
> Userspace writes to the forked page.
>
> 5) HW stores its data in the original page.
>
> Userspace never sees data from the HW! Data corrupted where without page
> forking everything would work just fine.

I'm not sure I'm understanding your pseudocode logic correctly, though.
This logic doesn't seem to be a page forking specific issue. And this
pseudocode logic seems to be missing the locking and revalidation of the
page.

If you can show more details, it would be helpful to see more, and discuss
the issue of page forking, or we can think about how to handle the corner
cases.

Well, before that, why do we need more details?

For example, replace the page fork at (4) with "truncate", "punch hole", or
"invalidate page".

Those operations remove the old page from the radix tree, so the
userspace's write creates a new page, and HW still references the old page.
(I.e. the situation should be the same with page forking, in my
understanding of this pseudocode logic.)
IOW, this pseudocode logic seems to be broken without page forking if there
is no lock and revalidation. Usually, we prevent unpleasant I/O by
lock_page or PG_writeback, and an obsoleted page is revalidated under
lock_page.

For page forking, we may also be able to prevent a similar situation by
locking, flags, and revalidation. But those details might differ from the
current code, because the page states are different.

> Another possible scenario:
>
> 1) Userspace app tells kernel to setup a HW buffer in a page.
>
> 2) Userspace app fills page with data -> page_mkwrite is called, page is
> dirtied.
>
> 3) Userspace app tells kernel to ship buffer to video HW.
>
> 4) Writeback comes and starts writeback for the page
>
> 5) Video HW is done with the page. Userspace app fills new set of data into
> the page -> page_mkwrite is called, page is forked.
>
> 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
> old data from the original page.
>
> Again a data corruption issue where previously things were working fine.

This logic seems to be the same as above. Replace the page fork at (5).
With no revalidation of the page, (6) will use the old page.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
On Mon 22-06-15 00:36:00, OGAWA Hirofumi wrote: > Jan Kara <jack@suse.cz> writes: > > So there are a few things to have in mind: > > 1) There is nothing like a "writeable" page. Page is always writeable (at > > least on x86 architecture). When a page is mapped into some virtual address > > space (or more of them), this *mapping* can be either writeable or read-only. > > mkwrite changes the mapping from read-only to writeable but kernel / > > hardware is free to write to the page regardless of the mapping. > > > > 2) When kernel / hardware writes to the page, it first modifies the page > > and then marks it dirty. > > > > So what can happen in this scenario is: > > > > 1) You hand kernel a part of a page as a buffer. page_mkwrite() happens, > > page is dirtied, kernel notes a PFN of the page somewhere internally. > > > > 2) Writeback comes and starts writeback for the page. > > > > 3) Kernel ships the PFN to the hardware. > > > > 4) Userspace comes and wants to write to the page (different part than the > > HW is instructed to use). page_mkwrite is called, page is forked. > > Userspace writes to the forked page. > > > > 5) HW stores its data in the original page. > > > > Userspace never sees data from the HW! Data corrupted where without page > > forking everything would work just fine. > > I'm not sure I'm understanding your pseudocode logic correctly though. > This logic doesn't seems to be a page forking specific issue. And > this pseudocode logic seems to be missing the locking and revalidate of > page. > > If you can show more details, it would be helpful to see more, and > discuss the issue of page forking, or we can think about how to handle > the corner cases. > > Well, before that, why need more details? > > For example, replace the page fork at (4) with "truncate", "punch > hole", or "invalidate page". > > Those operations remove the old page from radix tree, so the > userspace's write creates the new page, and HW still refererences the > old page. (I.e. 
situation should be same with page forking, in my
> understand of this pseudocode logic.)

Yes, if userspace truncates the file, the situation we end up with is
basically the same. However, for truncate to happen, some malicious process
has to come and truncate the file - a failure scenario that is acceptable
for most use cases since it doesn't happen unless someone is actively
trying to screw you. With page forking it is enough for the flusher thread
to start writeback for that page to trigger the problem - an event that is
basically bound to happen without any other userspace application
interfering.

> IOW, this pseudocode logic seems to be broken without page forking if
> no lock and revalidate. Usually, we prevent unpleasant I/O by
> lock_page or PG_writeback, and an obsolated page is revalidated under
> lock_page.

Well, good luck with converting all the get_user_pages() users in the
kernel to use lock_page() or PG_writeback checks to avoid issues with page
forking. I don't think that's really feasible.

> For page forking, we may also be able to prevent similar situation by
> locking, flags, and revalidate. But those details might be different
> with current code, because page states are different.

Sorry, I don't understand what you mean in this paragraph. Can you explain
it a bit more?

> > Another possible scenario:
> >
> > 1) Userspace app tells kernel to setup a HW buffer in a page.
> >
> > 2) Userspace app fills page with data -> page_mkwrite is called, page is
> > dirtied.
> >
> > 3) Userspace app tells kernel to ship buffer to video HW.
> >
> > 4) Writeback comes and starts writeback for the page
> >
> > 5) Video HW is done with the page. Userspace app fills new set of data into
> > the page -> page_mkwrite is called, page is forked.
> >
> > 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
> > old data from the original page.
> >
> > Again a data corruption issue where previously things were working fine.
> > This logic seems to be same as above. Replace the page fork at (5). > With no revalidate of page, (6) will use the old page. Yes, the same arguments as above apply... Honza
Jan Kara <jack@suse.cz> writes:

>> I'm not sure I'm understanding your pseudocode logic correctly though.
>> This logic doesn't seems to be a page forking specific issue. And
>> this pseudocode logic seems to be missing the locking and revalidate of
>> page.
>>
>> If you can show more details, it would be helpful to see more, and
>> discuss the issue of page forking, or we can think about how to handle
>> the corner cases.
>>
>> Well, before that, why need more details?
>>
>> For example, replace the page fork at (4) with "truncate", "punch
>> hole", or "invalidate page".
>>
>> Those operations remove the old page from radix tree, so the
>> userspace's write creates the new page, and HW still refererences the
>> old page. (I.e. situation should be same with page forking, in my
>> understand of this pseudocode logic.)
>
> Yes, if userspace truncates the file, the situation we end up with is
> basically the same. However for truncate to happen some malicious process
> has to come and truncate the file - a failure scenario that is acceptable
> for most use cases since it doesn't happen unless someone is actively
> trying to screw you. With page forking it is enough for flusher thread
> to start writeback for that page to trigger the problem - event that is
> basically bound to happen without any other userspace application
> interfering.

Where does the "acceptable" conclusion come from? That pseudocode logic
doesn't say anything about usage at all. And even if we assume it is
acceptable, as far as I can see, /proc/sys/vm/drop_caches, for example, is
enough to trigger it, as is a page on a non-existent block (a sparse file,
i.e. the missing disk space check in your logic). And if there is really no
lock/check at all, there would be other races.

>> IOW, this pseudocode logic seems to be broken without page forking if
>> no lock and revalidate. Usually, we prevent unpleasant I/O by
>> lock_page or PG_writeback, and an obsolated page is revalidated under
>> lock_page.
>
> Well, good luck with converting all the get_user_pages() users in kernel to
> use lock_page() or PG_writeback checks to avoid issues with page forking. I
> don't think that's really feasible.

What does converting all the get_user_pages() users mean? Well, that may be
more or less right; I also think there is an issue in/around
get_user_pages() that we have to tackle.

IMO, if there is code that actually follows that pseudocode logic, that
code is the breakage. And "it is an acceptable limitation, give up on
fixing it" is not, I think, the right way to go. If there is really code
broken like your logic, I think we should fix it.

Could you point out which code is using your logic? Since it seems so racy,
I can't yet believe such racy code actually exists.

>> For page forking, we may also be able to prevent similar situation by
>> locking, flags, and revalidate. But those details might be different
>> with current code, because page states are different.
>
> Sorry, I don't understand what do you mean in this paragraph. Can you
> explain it a bit more?

This just means that a forked page (the old page) and a truncated page have
different sets of flags and state, so we may have to adjust revalidation.

Thanks.
On Sun 05-07-15 21:54:45, OGAWA Hirofumi wrote: > Jan Kara <jack@suse.cz> writes: > > >> I'm not sure I'm understanding your pseudocode logic correctly though. > >> This logic doesn't seems to be a page forking specific issue. And > >> this pseudocode logic seems to be missing the locking and revalidate of > >> page. > >> > >> If you can show more details, it would be helpful to see more, and > >> discuss the issue of page forking, or we can think about how to handle > >> the corner cases. > >> > >> Well, before that, why need more details? > >> > >> For example, replace the page fork at (4) with "truncate", "punch > >> hole", or "invalidate page". > >> > >> Those operations remove the old page from radix tree, so the > >> userspace's write creates the new page, and HW still refererences the > >> old page. (I.e. situation should be same with page forking, in my > >> understand of this pseudocode logic.) > > > > Yes, if userspace truncates the file, the situation we end up with is > > basically the same. However for truncate to happen some malicious process > > has to come and truncate the file - a failure scenario that is acceptable > > for most use cases since it doesn't happen unless someone is actively > > trying to screw you. With page forking it is enough for flusher thread > > to start writeback for that page to trigger the problem - event that is > > basically bound to happen without any other userspace application > > interfering. > > Acceptable conclusion is where came from? That pseudocode logic doesn't > say about usage at all. And even if assume it is acceptable, as far as I > can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a > page on non-exists block (sparse file. i.e. missing disk space check in > your logic). And if really no any lock/check, there would be another > races. So drop_caches won't cause any issues because it avoids mmaped pages. 
Also page reclaim or page migration don't cause any issues because they avoid pages with increased refcount (and increased refcount would stop drop_caches from reclaiming the page as well if it was not for the mmaped check before). Generally, elevated page refcount currently guarantees page isn't migrated, reclaimed, or otherwise detached from the mapping (except for truncate where the combination of mapping-index becomes invalid) and your page forking would change that assumption - which IMHO has a big potential for some breakage somewhere. And frankly I fail to see why you and Daniel care so much about this corner case because from performance POV it's IMHO a non-issue and you bother with page forking because of performance, don't you? > >> IOW, this pseudocode logic seems to be broken without page forking if > >> no lock and revalidate. Usually, we prevent unpleasant I/O by > >> lock_page or PG_writeback, and an obsolated page is revalidated under > >> lock_page. > > > > Well, good luck with converting all the get_user_pages() users in kernel to > > use lock_page() or PG_writeback checks to avoid issues with page forking. I > > don't think that's really feasible. > > What does all get_user_pages() conversion mean? Well, maybe right more > or less, I also think there is the issue in/around get_user_pages() that > we have to tackle. > > > IMO, if there is a code that pseudocode logic actually, it is the > breakage. And "it is acceptable and limitation, and give up to fix", I > don't think it is the right way to go. If there is really code broken > like your logic, I think we should fix. > > Could you point which code is using your logic? Since that seems to be > so racy, I can't believe yet there are that racy codes actually. So you can have a look for example at drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up of a video device buffer at virtual address specified by user. 
Now I don't know whether there really is any userspace video program that
sets up the video buffer in an mmapped file. I would agree with you that it
would be a strange thing to do, but I've seen enough strange userspace code
that I would not be too surprised.

Another example of a similar kind is at drivers/infiniband/core/umem.c,
where we again set up a buffer for infiniband cards at a user-specified
virtual address. And there are more drivers in the kernel like that.

Honza
Jan Kara <jack@suse.cz> writes:

>> > Yes, if userspace truncates the file, the situation we end up with is
>> > basically the same. However for truncate to happen some malicious process
>> > has to come and truncate the file - a failure scenario that is acceptable
>> > for most use cases since it doesn't happen unless someone is actively
>> > trying to screw you. With page forking it is enough for flusher thread
>> > to start writeback for that page to trigger the problem - event that is
>> > basically bound to happen without any other userspace application
>> > interfering.
>>
>> Acceptable conclusion is where came from? That pseudocode logic doesn't
>> say about usage at all. And even if assume it is acceptable, as far as I
>> can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a
>> page on non-exists block (sparse file. i.e. missing disk space check in
>> your logic). And if really no any lock/check, there would be another
>> races.
>
> So drop_caches won't cause any issues because it avoids mmaped pages.
> Also page reclaim or page migration don't cause any issues because
> they avoid pages with increased refcount (and increased refcount would stop
> drop_caches from reclaiming the page as well if it was not for the mmaped
> check before). Generally, elevated page refcount currently guarantees page
> isn't migrated, reclaimed, or otherwise detached from the mapping (except
> for truncate where the combination of mapping-index becomes invalid) and
> your page forking would change that assumption - which IMHO has a big
> potential for some breakage somewhere.

Lifetime and visibility from userspace are different topics. The issue here
is visibility. Of course, the two are related more or less, but refcount
doesn't prevent dropping the page from the radix tree at all.

Well, anyway, your claim seems to assume that the userspace app works
around the issues.
And it sounds like it still doesn't work around the ENOSPC issue
(validation at page fault/GUP time), even assuming userspace behaves
perfectly. Calling it a kernel assumption is strange.

If you claimed there is strange logic widely used already, and that of
course we can't simply break it for compatibility reasons, I would be able
to agree. But your claim sounds like that logic is sane and well-designed
behavior. So I disagree.

> And frankly I fail to see why you and Daniel care so much about this
> corner case because from performance POV it's IMHO a non-issue and you
> bother with page forking because of performance, don't you?

We should first try to penalize the corner-case path instead of the normal
path. Penalizing the normal path to allow the corner-case path is basically
insane.

Making the normal path faster and more reliable is what we are trying to
do.

> So you can have a look for example at
> drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up
> of a video device buffer at virtual address specified by user. Now I don't
> know whether there really is any userspace video program that sets up the
> video buffer in mmaped file. I would agree with you that it would be a
> strange thing to do but I've seen enough strange userspace code that I
> would not be too surprised.
>
> Another example of similar kind is at
> drivers/infiniband/core/umem.c where we again set up buffer for infiniband
> cards at users specified virtual address. And there are more drivers in
> kernel like that.

I haven't looked at those yet, unfortunately. I guess they would be helpful
for seeing the details.

Thanks.
On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote: > Returning ENOSPC when you have free space you can't yet prove is safer than > not returning it and risking a data loss when you get hit by a write/commit > storm. :) Remember when delayed allocation was scary and unproven, because proving that ENOSPC will always be returned when needed is extremely difficult? But the performance advantage was compelling, so we just worked at it until it worked. There were times when it didn't work properly, but the code was in the tree so it got fixed. It's like that now with page forking - a new technique with compelling advantages, and some challenges. In the past, we (the Linux community) would rise to the challenge and err on the side of pushing optimizations in early. That was our mojo, and that is how Linux became the dominant operating system it is today. Do we, the Linux community, still have that mojo? Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 31 Jul 2015, Daniel Phillips wrote:

> Subject: Re: [FYI] tux3: Core changes
>
> On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
>> Returning ENOSPC when you have free space you can't yet prove is safer than
>> not returning it and risking a data loss when you get hit by a write/commit
>> storm. :)
>
> Remember when delayed allocation was scary and unproven, because proving
> that ENOSPC will always be returned when needed is extremely difficult?
> But the performance advantage was compelling, so we just worked at it
> until it worked. There were times when it didn't work properly, but the
> code was in the tree so it got fixed.
>
> It's like that now with page forking - a new technique with compelling
> advantages, and some challenges. In the past, we (the Linux community)
> would rise to the challenge and err on the side of pushing optimizations
> in early. That was our mojo, and that is how Linux became the dominant
> operating system it is today. Do we, the Linux community, still have that
> mojo?

We, the Linux community, have less tolerance for losing people's data and
preventing them from operating than we used to when it was all tinkerers'
personal data and secondary systems.

So rather than pushing optimizations out to everyone and seeing what
breaks, we now do more testing and checking for failures before pushing
things out. This means that when something new is introduced, we default to
the safe, slightly slower way initially (there will be enough other bugs to
deal with in any case), and then, as we gain experience from the tinkerers
enabling the performance optimizations, we make those optimizations
reliable and only then push them out to all users.

If you define this as "losing our mojo", then yes we have. But most people
see the pace of development as still being high, just with more testing and
polishing before it gets out to users.
David Lang
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
> If you define this as "losing our mojo", then yes we have.
A pity. There remains so much to do that simply will not get
done in the absence of mojo.
Regards,
Daniel
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: > We, the Linux Community have less tolerance for losing people's data and preventing them from operating than we used to when it was all tinkerer's personal data and secondary systems. > > So rather than pushing optimizations out to everyone and seeing what breaks, we now do more testing and checking for failures before pushing things out. By the way, I am curious about whose data you think will get lost as a result of pushing out Tux3 with a possible theoretical bug in a wildly improbable scenario that has not actually been described with sufficient specificity to falsify, let alone demonstrated. Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 31 Jul 2015, Daniel Phillips wrote:

> On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
>> We, the Linux Community have less tolerance for losing people's data and
>> preventing them from operating than we used to when it was all tinkerer's
>> personal data and secondary systems.
>>
>> So rather than pushing optimizations out to everyone and seeing what
>> breaks, we now do more testing and checking for failures before pushing
>> things out.
>
> By the way, I am curious about whose data you think will get lost
> as a result of pushing out Tux3 with a possible theoretical bug
> in a wildly improbable scenario that has not actually been
> described with sufficient specificity to falsify, let alone
> demonstrated.

You weren't asking about any particular feature of Tux3, you were asking if
we were still willing to push out stuff that breaks for users and fix it
later.

Especially for filesystems, which can lose the data of whoever is using
them, the answer seems to be a clear no.

There may be bugs in what's pushed out that we don't know about. But we
don't push out potential data corruption bugs that we do know about (or
think we do). So if you think this should be pushed out with this known
corner case that's not handled properly, you have to convince people that
it's _so_ improbable that they shouldn't care about it.

David Lang
On Friday, July 31, 2015 3:27:12 PM PDT, David Lang wrote:
> On Fri, 31 Jul 2015, Daniel Phillips wrote:
>
>> On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: ...
>
> you weren't asking about any particular feature of Tux, you
> were asking if we were still willing to push out stuff that
> breaks for users and fix it later.

I think you left a key word out of my ask: "theoretical".

> Especially for filesystems that can loose the data of whoever
> is using it, the answer seems to be a clear no.
>
> there may be bugs in what's pushed out that we don't know
> about. But we don't push out potential data corruption bugs that
> we do know about (or think we do)
>
> so if you think this should be pushed out with this known
> corner case that's not handled properly, you have to convince
> people that it's _so_ improbable that they shouldn't care about
> it.

There should also be an onus on the person posing the worry to prove their
case beyond a reasonable doubt, which has not been done in the case we are
discussing here. Note: that is a technical assessment to which a technical
response is appropriate.

I do think that we should put a cap on this fencing and make a real effort
to get Tux3 into mainline. We should at least set a ground rule that a
problem should be proved real before it becomes a reason to derail a
project in the way that our project has been derailed. Otherwise, it's hard
to see what interest is served.

OK, let's get back to the program. I accept your assertion that we should
convince people that the issue is improbable. To do that, I need a specific
issue to address. So far, no such issue has been provided with specificity.
Do you see why this is frustrating? Please, community: give us specific
issues to address, or give us some way out of this eternal limbo. Or
better, let's go back to the old way of doing things in Linux, which is
what got us where we are today. Not this.

Note: Hirofumi's email is clear, logical and speaks to the question.
This branch of the thread is largely pointless, though it essentially says the same thing in non-technical terms. Perhaps your next response should be to Hirofumi, and perhaps it should be technical. Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:
> Note: Hirofumi's email is clear, logical and speaks to the
> question. This branch of the thread is largely pointless, though
> it essentially says the same thing in non-technical terms. Perhaps
> your next response should be to Hirofumi, and perhaps it should be
> technical.

Now, let me try to lead the way by being specific. RDMA was raised as a
potential failure case for Tux3 page forking. But the RDMA API does not let
you use memory mmapped by Tux3 as a source or destination of IO. Instead,
it sets up its own pages and hands them out to the RDMA app from a pool. So
no issue. One down, right?

Regards,

Daniel
On Fri 31-07-15 17:16:45, Daniel Phillips wrote: > On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote: > >Note: Hirofumi's email is clear, logical and speaks to the > >question. This branch of the thread is largely pointless, though > >it essentially says the same thing in non-technical terms. Perhaps > >your next response should be to Hirofumi, and perhaps it should be > >technical. > > Now, let me try to lead the way, but being specific. RDMA was raised > as a potential failure case for Tux3 page forking. But the RDMA api > does not let you use memory mmaped by Tux3 as a source or destination > of IO. Instead, it sets up its own pages and hands them out to the > RDMA app from a pool. So no issue. One down, right? Can you please tell me how you arrived to that conclusion? As far as I'm looking at the code in drivers/infiniband/ I don't see anything there preventing userspace from passing in mmapped memory... Honza
On Fri 31-07-15 13:44:44, OGAWA Hirofumi wrote: > Jan Kara <jack@suse.cz> writes: > > >> > Yes, if userspace truncates the file, the situation we end up with is > >> > basically the same. However for truncate to happen some malicious process > >> > has to come and truncate the file - a failure scenario that is acceptable > >> > for most use cases since it doesn't happen unless someone is actively > >> > trying to screw you. With page forking it is enough for the flusher thread > >> > to start writeback for that page to trigger the problem - an event that is > >> > basically bound to happen without any other userspace application > >> > interfering. > >> > >> Where does the conclusion that this is acceptable come from? That pseudocode logic doesn't > >> say anything about usage at all. And even if we assume it is acceptable, as far as I > >> can see, for example /proc/sys/vm/drop_caches is enough to trigger it, or a > >> page on a non-existent block (sparse file, i.e. a missing disk space check in > >> your logic). And if there is really no lock/check, there would be other races. > > > > So drop_caches won't cause any issues because it avoids mmapped pages. > > Also page reclaim or page migration don't cause any issues because > > they avoid pages with increased refcount (and an increased refcount would stop > > drop_caches from reclaiming the page as well if it were not for the mmapped > > check before). Generally, an elevated page refcount currently guarantees the page > > isn't migrated, reclaimed, or otherwise detached from the mapping (except > > for truncate, where the mapping-index combination becomes invalid) and > > your page forking would change that assumption - which IMHO has a big > > potential for some breakage somewhere. > > Lifetime and visibility to the user are different topics. The issue here > is visibility. Of course, the two are more or less related, but > refcount doesn't prevent dropping a page from the radix-tree at all. 
Well, refcount prevents dropping a page from a radix-tree in some cases - memory pressure and page migration, to name the most prominent ones. It doesn't prevent a page from being dropped because of truncate, that is correct. In general, the rule we currently obey is that the kernel doesn't detach a page with increased refcount from a radix tree unless there is a syscall asking the kernel to do that. > Well, anyway, your claim seems to be assuming that the userspace app > works around the issues. And it sounds like it still doesn't work around the > ENOSPC issue (validate at page fault/GUP) even if we assume userspace > behaves perfectly. Calling it a kernel assumption is strange. Realistically, I don't think userspace apps work around anything. They just do what happens to work. Nobody deletes files while an application works on them and expects the application to gracefully handle that. So everyone is happy. I'm not sure about which ENOSPC issue you are speaking, BTW. Can you please elaborate? > If you claimed that there is strange logic widely used already, and that of course > we can't simply break it because of compatibility, I would be able to > agree. But your claim sounds like that logic is sane and well designed > behavior. So I disagree. To me the rule: "Do not detach a page from a radix tree if it has an elevated refcount unless explicitly requested by a syscall" looks like a sane one. Yes. > > And frankly I fail to see why you and Daniel care so much about this > > corner case, because from a performance POV it's IMHO a non-issue, and you > > bother with page forking because of performance, don't you? > > Penalizing the corner case path, instead of the normal path, is what we should > try first. Penalizing the normal path to allow a corner case path is basically > insane. > > Making the normal path faster and more reliable is what we are trying to do. An elevated refcount of a page is in my opinion a corner case path. 
That's why I think that penalizing that case by waiting for IO instead of forking is an acceptable cost for the improved compatibility & maintainability of the code. Honza
Jan Kara <jack@suse.cz> writes: > I'm not sure about which ENOSPC issue you are speaking, BTW. Can you > please elaborate? 1. GUP simulates a page fault, and prepares to modify 2. writeback clears dirty, and makes the PTE read-only 3. snapshot/reflink makes the block COW 4. the driver that called GUP modifies the page, and dirties the page without simulating a page fault >> If you claimed that there is strange logic widely used already, and that of course >> we can't simply break it because of compatibility, I would be able to >> agree. But your claim sounds like that logic is sane and well designed >> behavior. So I disagree. > > To me the rule: "Do not detach a page from a radix tree if it has an elevated > refcount unless explicitly requested by a syscall" looks like a sane one. > Yes. > >> > And frankly I fail to see why you and Daniel care so much about this >> > corner case, because from a performance POV it's IMHO a non-issue, and you >> > bother with page forking because of performance, don't you? >> >> Penalizing the corner case path, instead of the normal path, is what we should >> try first. Penalizing the normal path to allow a corner case path is basically >> insane. >> >> Making the normal path faster and more reliable is what we are trying to do. > > An elevated refcount of a page is in my opinion a corner case path. That's why > I think that penalizing that case by waiting for IO instead of forking is > an acceptable cost for the improved compatibility & maintainability of the > code. What is an "elevated refcount"? What is the difference from a normal refcount? Are you saying "refcount >= specified threshold + waitq/wakeup" or some such? If so, it is not a path. It is a state. IOW, some groups may not hit it much, but some groups may hit it a lot, on the normal path. So it sounds like yet another "stable page". I.e. unpredictable performance. (BTW, recalling "stable page", I noticed "stable page" would not provide stable page data for that logic either.) Well, assuming "elevated refcount == threshold + waitq/wakeup", IMO it is not attractive. 
Rather, it is the last option if there are no others as a design choice. Thanks.
On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote: > Jan Kara <jack@suse.cz> writes: > > > I'm not sure about which ENOSPC issue you are speaking, BTW. Can you > > please elaborate? > > 1. GUP simulates a page fault, and prepares to modify > 2. writeback clears dirty, and makes the PTE read-only > 3. snapshot/reflink makes the block COW I assume by point 3. you mean that the snapshot / reflink happens now and thus the page / block is marked as COW. Am I right? > 4. the driver that called GUP modifies the page, and dirties the page without simulating a page fault OK, but this doesn't hit ENOSPC because, as you correctly write in point 4., the page gets modified without triggering another page fault, so COW for the modified page isn't triggered. The modified page contents will be in both the original and the reflinked file, won't they? And I agree that the fact that the snapshotted file's original contents can still get modified is a bug. One which is difficult to fix. > >> If you claimed that there is strange logic widely used already, and that of course > >> we can't simply break it because of compatibility, I would be able to > >> agree. But your claim sounds like that logic is sane and well designed > >> behavior. So I disagree. > > > > To me the rule: "Do not detach a page from a radix tree if it has an elevated > > refcount unless explicitly requested by a syscall" looks like a sane one. > > Yes. > > > >> > And frankly I fail to see why you and Daniel care so much about this > >> > corner case, because from a performance POV it's IMHO a non-issue, and you > >> > bother with page forking because of performance, don't you? > >> > >> Penalizing the corner case path, instead of the normal path, is what we should > >> try first. Penalizing the normal path to allow a corner case path is basically > >> insane. > >> > >> Making the normal path faster and more reliable is what we are trying to do. > > > > An elevated refcount of a page is in my opinion a corner case path. 
> > That's why I think that penalizing that case by waiting for IO instead of forking is > > an acceptable cost for the improved compatibility & maintainability of the > > code. > > What is an "elevated refcount"? What is the difference from a normal refcount? > Are you saying "refcount >= specified threshold + waitq/wakeup" or > some such? If so, it is not a path. It is a state. IOW, some groups may > not hit it much, but some groups may hit it a lot, on the normal path. Yes, by "elevated refcount" I meant refcount > 2 (one for pagecache, one for your code inspecting the page). > So it sounds like yet another "stable page". I.e. unpredictable > performance. (BTW, recalling "stable page", I noticed "stable page" > would not provide stable page data for that logic either.) > > Well, assuming "elevated refcount == threshold + waitq/wakeup", > IMO it is not attractive. Rather, it is the last option if there are no > others as a design choice. I agree the performance will be less predictable and that is not good. But changing what is visible in the file when writeback races with GUP is a worse problem to me. Maybe GUP could mark the pages it got a ref for, so that we could trigger the slow behavior only for them (Peter Zijlstra proposed in [1] an infrastructure so that pages pinned by get_user_pages() would be properly accounted, and then we could use PG_mlocked and elevated refcount as a more reliable indication of pages that need special handling). Honza [1] http://thread.gmane.org/gmane.linux.kernel.mm/117679
Jan Kara <jack@suse.cz> writes: > On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote: >> Jan Kara <jack@suse.cz> writes: >> >> > I'm not sure about which ENOSPC issue you are speaking, BTW. Can you >> > please elaborate? >> >> 1. GUP simulates a page fault, and prepares to modify >> 2. writeback clears dirty, and makes the PTE read-only >> 3. snapshot/reflink makes the block COW > > I assume by point 3. you mean that the snapshot / reflink happens now and thus > the page / block is marked as COW. Am I right? Right. >> 4. the driver that called GUP modifies the page, and dirties the page without simulating a page fault > > OK, but this doesn't hit ENOSPC because, as you correctly write in point 4., > the page gets modified without triggering another page fault, so COW for the > modified page isn't triggered. The modified page contents will be in both the > original and the reflinked file, won't they? And the above result can be ENOSPC too, depending on the implementation and race conditions. Also, if the FS converts zeroed blocks to holes like hammerfs does, ENOSPC simply happens. I.e., another process uses all the space, but then there is no ->page_mkwrite() callback to check for ENOSPC. > And I agree that the fact that the snapshotted file's original contents can > still get modified is a bug. One which is difficult to fix. Yes, that is why I think this logic is an issue, even before page forking. >> So it sounds like yet another "stable page". I.e. unpredictable >> performance. (BTW, recalling "stable page", I noticed "stable page" >> would not provide stable page data for that logic either.) >> >> Well, assuming "elevated refcount == threshold + waitq/wakeup", >> IMO it is not attractive. Rather, it is the last option if there are no >> others as a design choice. > > I agree the performance will be less predictable and that is not good. But > changing what is visible in the file when writeback races with GUP is a > worse problem to me. 
> > Maybe GUP could mark the pages it got a ref for, so that we could trigger the slow > behavior only for them (Peter Zijlstra proposed in [1] an infrastructure so > that pages pinned by get_user_pages() would be properly accounted, and then > we could use PG_mlocked and elevated refcount as a more reliable indication > of pages that need special handling). I haven't read Peter's patchset fully yet, but it looks good, and it is maybe similar to the strategy currently in my mind. Also, I'm thinking of adding callbacks for the FS at the start and end of GUP's pin window (for just an example, the callback could be used by the FS to stop writeback if the FS wants). Thanks.
On 07/31/2015 01:27 PM, Daniel Phillips wrote: > On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote: >> Returning ENOSPC when you have free space you can't yet prove is safer >> than >> not returning it and risking a data loss when you get hit by a >> write/commit >> storm. :) > > Remember when delayed allocation was scary and unproven, because proving > that ENOSPC will always be returned when needed is extremely difficult? > But the performance advantage was compelling, so we just worked at it > until it worked. There were times when it didn't work properly, but the > code was in the tree so it got fixed. > > It's like that now with page forking - a new technique with compelling > advantages, and some challenges. In the past, we (the Linux community) > would rise to the challenge and err on the side of pushing optimizations > in early. That was our mojo, and that is how Linux became the dominant > operating system it is today. Do we, the Linux community, still have that > mojo? Do you have the mojo to come up with a proposal on how to make things work, in a way that ensures data consistency for Linux users? Yes, we know page forking is not compatible with the way Linux currently uses refcounts. The question is, does anyone have an idea on how we could fix that? Not necessarily an implementation yet, just an idea might be enough to move forward at this stage. However, if nobody wants to work on even an idea, page forking may simply not be a safe thing to do.
diff --git a/fs/Makefile b/fs/Makefile index 91fcfa3..44d7192 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS) += ext4/ obj-$(CONFIG_JBD) += jbd/ obj-$(CONFIG_JBD2) += jbd2/ obj-$(CONFIG_TUX3) += tux3/ -obj-$(CONFIG_TUX3_MMAP) += tux3/ obj-$(CONFIG_CRAMFS) += cramfs/ obj-$(CONFIG_SQUASHFS) += squashfs/ obj-y += ramfs/ diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 2d609a5..fcd1c61 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -34,25 +34,6 @@ */ #define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_CACHE_SHIFT - 10)) -/* - * Passed into wb_writeback(), essentially a subset of writeback_control - */ -struct wb_writeback_work { - long nr_pages; - struct super_block *sb; - unsigned long *older_than_this; - enum writeback_sync_modes sync_mode; - unsigned int tagged_writepages:1; - unsigned int for_kupdate:1; - unsigned int range_cyclic:1; - unsigned int for_background:1; - unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ - enum wb_reason reason; /* why was writeback initiated? */ - - struct list_head list; /* pending work list */ - struct completion *done; /* set if the caller waits */ -}; - /** * writeback_in_progress - determine whether there is writeback in progress * @bdi: the device's backing_dev_info structure. @@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode) } /* + * Remove inode from writeback list if clean. + */ +void inode_writeback_done(struct inode *inode) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + + spin_lock(&bdi->wb.list_lock); + spin_lock(&inode->i_lock); + if (!(inode->i_state & I_DIRTY)) + list_del_init(&inode->i_wb_list); + spin_unlock(&inode->i_lock); + spin_unlock(&bdi->wb.list_lock); +} +EXPORT_SYMBOL_GPL(inode_writeback_done); + +/* + * Add inode to writeback dirty list with current time. 
+ */ +void inode_writeback_touch(struct inode *inode) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + + spin_lock(&bdi->wb.list_lock); + inode->dirtied_when = jiffies; + list_move(&inode->i_wb_list, &bdi->wb.b_dirty); + spin_unlock(&bdi->wb.list_lock); +} +EXPORT_SYMBOL_GPL(inode_writeback_touch); + +/* * Redirty an inode: set its when-it-was dirtied timestamp and move it to the * furthest end of its superblock's dirty-inode list. * @@ -610,9 +621,9 @@ static long writeback_chunk_size(struct backing_dev_info *bdi, * * Return the number of pages and/or inodes written. */ -static long writeback_sb_inodes(struct super_block *sb, - struct bdi_writeback *wb, - struct wb_writeback_work *work) +static long generic_writeback_sb_inodes(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) { struct writeback_control wbc = { .sync_mode = work->sync_mode, @@ -727,6 +738,22 @@ static long writeback_sb_inodes(struct super_block *sb, return wrote; } +static long writeback_sb_inodes(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) +{ + if (sb->s_op->writeback) { + long ret; + + spin_unlock(&wb->list_lock); + ret = sb->s_op->writeback(sb, wb, work); + spin_lock(&wb->list_lock); + return ret; + } + + return generic_writeback_sb_inodes(sb, wb, work); +} + static long __writeback_inodes_wb(struct bdi_writeback *wb, struct wb_writeback_work *work) { @@ -1293,6 +1320,35 @@ static void wait_sb_inodes(struct super_block *sb) } /** + * writeback_queue_work_sb - schedule writeback work from given super_block + * @sb: the superblock + * @work: work item to queue + * + * Schedule writeback work on this super_block. This usually used to + * interact with sb->s_op->writeback callback. The caller must + * guarantee to @work is not freed while bdi flusher is using (for + * example, be safe against umount). 
+ */ +void writeback_queue_work_sb(struct super_block *sb, + struct wb_writeback_work *work) +{ + if (sb->s_bdi == &noop_backing_dev_info) + return; + + /* Allow only following fields to use. */ + *work = (struct wb_writeback_work){ + .sb = sb, + .sync_mode = work->sync_mode, + .tagged_writepages = work->tagged_writepages, + .done = work->done, + .nr_pages = work->nr_pages, + .reason = work->reason, + }; + bdi_queue_work(sb->s_bdi, work); +} +EXPORT_SYMBOL(writeback_queue_work_sb); + +/** * writeback_inodes_sb_nr - writeback dirty inodes from given super_block * @sb: the superblock * @nr: the number of pages to write diff --git a/include/linux/fs.h b/include/linux/fs.h index 42efe13..29833d2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -356,6 +356,8 @@ struct address_space_operations { /* Unfortunately this kludge is needed for FIBMAP. Don't use it */ sector_t (*bmap)(struct address_space *, sector_t); + void (*truncatepage)(struct address_space *, struct page *, + unsigned int, unsigned int, int); void (*invalidatepage) (struct page *, unsigned int, unsigned int); int (*releasepage) (struct page *, gfp_t); void (*freepage)(struct page *); @@ -1590,6 +1592,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *, extern ssize_t vfs_writev(struct file *, const struct iovec __user *, unsigned long, loff_t *); +struct bdi_writeback; +struct wb_writeback_work; struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); @@ -1599,6 +1603,8 @@ struct super_operations { int (*drop_inode) (struct inode *); void (*evict_inode) (struct inode *); void (*put_super) (struct super_block *); + long (*writeback)(struct super_block *super, struct bdi_writeback *wb, + struct wb_writeback_work *work); int (*sync_fs)(struct super_block *sb, int wait); int (*freeze_super) (struct super_block *); int (*freeze_fs) (struct super_block *); diff --git a/include/linux/mm.h b/include/linux/mm.h index 
dd5ea30..075f59f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1909,6 +1909,11 @@ vm_unmapped_area(struct vm_unmapped_area_info *info) } /* truncate.c */ +void generic_truncate_partial_page(struct address_space *mapping, + struct page *page, unsigned int start, + unsigned int len); +void generic_truncate_full_page(struct address_space *mapping, + struct page *page, int wait); extern void truncate_inode_pages(struct address_space *, loff_t); extern void truncate_inode_pages_range(struct address_space *, loff_t lstart, loff_t lend); diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 4b3736f..13b70160 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -653,6 +653,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, extern void delete_from_page_cache(struct page *page); extern void __delete_from_page_cache(struct page *page, void *shadow); int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); +int cow_replace_page_cache(struct page *oldpage, struct page *newpage); +void cow_delete_from_page_cache(struct page *page); /* * Like add_to_page_cache_locked, but used to add newly allocated pages: diff --git a/include/linux/rmap.h b/include/linux/rmap.h index d9d7e7e..9b67360 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -228,6 +228,20 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); int page_mkclean(struct page *); /* + * Make clone page for page forking. + * + * Note: only clones page state so other state such as buffer_heads + * must be cloned by caller. + */ +struct page *cow_clone_page(struct page *oldpage); + +/* + * Changes the PTES of shared mappings except the PTE in orig_vma. + */ +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage, + struct page *newpage); + +/* * called in munlock()/munmap() path to check for other vmas holding * the page mlocked. 
*/ diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 0004833..0784b9d 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -59,6 +59,25 @@ enum wb_reason { }; /* + * Passed into wb_writeback(), essentially a subset of writeback_control + */ +struct wb_writeback_work { + long nr_pages; + struct super_block *sb; + unsigned long *older_than_this; + enum writeback_sync_modes sync_mode; + unsigned int tagged_writepages:1; + unsigned int for_kupdate:1; + unsigned int range_cyclic:1; + unsigned int for_background:1; + unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ + enum wb_reason reason; /* why was writeback initiated? */ + + struct list_head list; /* pending work list */ + struct completion *done; /* set if the caller waits */ +}; + +/* * A control structure which tells the writeback code what to do. These are * always on the stack, and hence need no locking. They are always initialised * in a manner such that unspecified fields are set to zero. @@ -90,6 +109,10 @@ struct writeback_control { * fs/fs-writeback.c */ struct bdi_writeback; +void inode_writeback_done(struct inode *inode); +void inode_writeback_touch(struct inode *inode); +void writeback_queue_work_sb(struct super_block *sb, + struct wb_writeback_work *work); void writeback_inodes_sb(struct super_block *, enum wb_reason reason); void writeback_inodes_sb_nr(struct super_block *, unsigned long nr, enum wb_reason reason); diff --git a/mm/filemap.c b/mm/filemap.c index 673e458..8c641d0 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -639,6 +639,88 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, } EXPORT_SYMBOL_GPL(add_to_page_cache_lru); +/* + * Atomically replace oldpage with newpage. + * + * Similar to migrate_pages(), but the oldpage is for writeout. 
+ */ +int cow_replace_page_cache(struct page *oldpage, struct page *newpage) +{ + struct address_space *mapping = oldpage->mapping; + void **pslot; + + VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage); + VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); + + /* Get refcount for radix-tree */ + page_cache_get(newpage); + + /* Replace page in radix tree. */ + spin_lock_irq(&mapping->tree_lock); + /* PAGECACHE_TAG_DIRTY represents the view of frontend. Clear it. */ + if (PageDirty(oldpage)) + radix_tree_tag_clear(&mapping->page_tree, page_index(oldpage), + PAGECACHE_TAG_DIRTY); + /* The refcount to newpage is used for radix tree. */ + pslot = radix_tree_lookup_slot(&mapping->page_tree, oldpage->index); + radix_tree_replace_slot(pslot, newpage); + __inc_zone_page_state(newpage, NR_FILE_PAGES); + __dec_zone_page_state(oldpage, NR_FILE_PAGES); + spin_unlock_irq(&mapping->tree_lock); + + /* mem_cgroup codes must not be called under tree_lock */ + mem_cgroup_migrate(oldpage, newpage, true); + + /* Release refcount for radix-tree */ + page_cache_release(oldpage); + + return 0; +} +EXPORT_SYMBOL_GPL(cow_replace_page_cache); + +/* + * Delete page from radix-tree, leaving page->mapping unchanged. + * + * Similar to delete_from_page_cache(), but the deleted page is for writeout. + */ +void cow_delete_from_page_cache(struct page *page) +{ + struct address_space *mapping = page->mapping; + + /* Delete page from radix tree. */ + spin_lock_irq(&mapping->tree_lock); + /* + * if we're uptodate, flush out into the cleancache, otherwise + * invalidate any existing cleancache entries. 
We can't leave + * stale data around in the cleancache once our page is gone + */ + if (PageUptodate(page) && PageMappedToDisk(page)) + cleancache_put_page(page); + else + cleancache_invalidate_page(mapping, page); + + page_cache_tree_delete(mapping, page, NULL); +#if 0 /* FIXME: backend is assuming page->mapping is available */ + page->mapping = NULL; +#endif + /* Leave page->index set: truncation lookup relies upon it */ + + __dec_zone_page_state(page, NR_FILE_PAGES); + BUG_ON(page_mapped(page)); + + /* + * The following dirty accounting is done by writeback + * path. So, we don't need to do here. + * + * dec_zone_page_state(page, NR_FILE_DIRTY); + * dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + */ + spin_unlock_irq(&mapping->tree_lock); + + page_cache_release(page); +} +EXPORT_SYMBOL_GPL(cow_delete_from_page_cache); + #ifdef CONFIG_NUMA struct page *__page_cache_alloc(gfp_t gfp) { diff --git a/mm/rmap.c b/mm/rmap.c index 71cd5bd..9125246 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -923,6 +923,145 @@ int page_mkclean(struct page *page) } EXPORT_SYMBOL_GPL(page_mkclean); +/* + * Make clone page for page forking. (Based on migrate_page_copy()) + * + * Note: only clones page state so other state such as buffer_heads + * must be cloned by caller. + */ +struct page *cow_clone_page(struct page *oldpage) +{ + struct address_space *mapping = oldpage->mapping; + gfp_t gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS; + struct page *newpage = __page_cache_alloc(gfp_mask); + int cpupid; + + newpage->mapping = oldpage->mapping; + newpage->index = oldpage->index; + copy_highpage(newpage, oldpage); + + /* FIXME: right? 
*/ + BUG_ON(PageSwapCache(oldpage)); + BUG_ON(PageSwapBacked(oldpage)); + BUG_ON(PageHuge(oldpage)); + if (PageError(oldpage)) + SetPageError(newpage); + if (PageReferenced(oldpage)) + SetPageReferenced(newpage); + if (PageUptodate(oldpage)) + SetPageUptodate(newpage); + if (PageActive(oldpage)) + SetPageActive(newpage); + if (PageMappedToDisk(oldpage)) + SetPageMappedToDisk(newpage); + + /* + * Copy NUMA information to the new page, to prevent over-eager + * future migrations of this same page. + */ + cpupid = page_cpupid_xchg_last(oldpage, -1); + page_cpupid_xchg_last(newpage, cpupid); + + mlock_migrate_page(newpage, oldpage); + ksm_migrate_page(newpage, oldpage); + + /* Lock newpage before visible via radix tree */ + BUG_ON(PageLocked(newpage)); + __set_page_locked(newpage); + + return newpage; +} +EXPORT_SYMBOL_GPL(cow_clone_page); + +static int page_cow_one(struct page *oldpage, struct page *newpage, + struct vm_area_struct *vma, unsigned long address) +{ + struct mm_struct *mm = vma->vm_mm; + pte_t oldptval, ptval, *pte; + spinlock_t *ptl; + int ret = 0; + + pte = page_check_address(oldpage, mm, address, &ptl, 1); + if (!pte) + goto out; + + flush_cache_page(vma, address, pte_pfn(*pte)); + oldptval = ptep_clear_flush(vma, address, pte); + + /* Take refcount for PTE */ + page_cache_get(newpage); + + /* + * vm_page_prot doesn't have writable bit, so page fault will + * be occurred immediately after returned from this page fault + * again. And second time of page fault will be resolved with + * forked page was set here. + */ + ptval = mk_pte(newpage, vma->vm_page_prot); +#if 0 + /* FIXME: we should check following too? 
Otherwise, we would + * get additional read-only => write fault at least */ + if (pte_write) + ptval = pte_mkwrite(ptval); + if (pte_dirty(oldptval)) + ptval = pte_mkdirty(ptval); + if (pte_young(oldptval)) + ptval = pte_mkyoung(ptval); +#endif + set_pte_at(mm, address, pte, ptval); + + /* Update rmap accounting */ + BUG_ON(!PageMlocked(oldpage)); /* Caller should migrate mlock flag */ + page_remove_rmap(oldpage); + page_add_file_rmap(newpage); + + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, address, pte); + + pte_unmap_unlock(pte, ptl); + + mmu_notifier_invalidate_page(mm, address); + + /* Release refcount for PTE */ + page_cache_release(oldpage); +out: + return ret; +} + +/* Change old page in PTEs to new page exclude orig_vma */ +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage, + struct page *newpage) +{ + struct address_space *mapping = page_mapping(oldpage); + pgoff_t pgoff = oldpage->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); + struct vm_area_struct *vma; + int ret = 0; + + BUG_ON(!PageLocked(oldpage)); + BUG_ON(!PageLocked(newpage)); + BUG_ON(PageAnon(oldpage)); + BUG_ON(mapping == NULL); + + i_mmap_lock_read(mapping); + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { + /* + * The orig_vma's PTE is handled by caller. + * (e.g. 
->page_mkwrite) + */ + if (vma == orig_vma) + continue; + + if (vma->vm_flags & VM_SHARED) { + unsigned long address = vma_address(oldpage, vma); + ret += page_cow_one(oldpage, newpage, vma, address); + } + } + i_mmap_unlock_read(mapping); + + return ret; +} +EXPORT_SYMBOL_GPL(page_cow_file); + /** * page_move_anon_rmap - move a page to our anon_vma * @page: the page to move to our anon_vma diff --git a/mm/truncate.c b/mm/truncate.c index f1e4d60..e5b4673 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -216,6 +216,56 @@ int invalidate_inode_page(struct page *page) return invalidate_complete_page(mapping, page); } +void generic_truncate_partial_page(struct address_space *mapping, + struct page *page, unsigned int start, + unsigned int len) +{ + wait_on_page_writeback(page); + zero_user_segment(page, start, start + len); + if (page_has_private(page)) + do_invalidatepage(page, start, len); +} +EXPORT_SYMBOL(generic_truncate_partial_page); + +static void truncate_partial_page(struct address_space *mapping, pgoff_t index, + unsigned int start, unsigned int len) +{ + struct page *page = find_lock_page(mapping, index); + if (!page) + return; + + if (!mapping->a_ops->truncatepage) + generic_truncate_partial_page(mapping, page, start, len); + else + mapping->a_ops->truncatepage(mapping, page, start, len, 1); + + cleancache_invalidate_page(mapping, page); + unlock_page(page); + page_cache_release(page); +} + +void generic_truncate_full_page(struct address_space *mapping, + struct page *page, int wait) +{ + if (wait) + wait_on_page_writeback(page); + else if (PageWriteback(page)) + return; + + truncate_inode_page(mapping, page); +} +EXPORT_SYMBOL(generic_truncate_full_page); + +static void truncate_full_page(struct address_space *mapping, struct page *page, + int wait) +{ + if (!mapping->a_ops->truncatepage) + generic_truncate_full_page(mapping, page, wait); + else + mapping->a_ops->truncatepage(mapping, page, 0, PAGE_CACHE_SIZE, + wait); +} + /** * 
truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets * @mapping: mapping to truncate @@ -298,11 +348,7 @@ void truncate_inode_pages_range(struct address_space *mapping, if (!trylock_page(page)) continue; WARN_ON(page->index != index); - if (PageWriteback(page)) { - unlock_page(page); - continue; - } - truncate_inode_page(mapping, page); + truncate_full_page(mapping, page, 0); unlock_page(page); } pagevec_remove_exceptionals(&pvec); @@ -312,37 +358,18 @@ void truncate_inode_pages_range(struct address_space *mapping, } if (partial_start) { - struct page *page = find_lock_page(mapping, start - 1); - if (page) { - unsigned int top = PAGE_CACHE_SIZE; - if (start > end) { - /* Truncation within a single page */ - top = partial_end; - partial_end = 0; - } - wait_on_page_writeback(page); - zero_user_segment(page, partial_start, top); - cleancache_invalidate_page(mapping, page); - if (page_has_private(page)) - do_invalidatepage(page, partial_start, - top - partial_start); - unlock_page(page); - page_cache_release(page); - } - } - if (partial_end) { - struct page *page = find_lock_page(mapping, end); - if (page) { - wait_on_page_writeback(page); - zero_user_segment(page, 0, partial_end); - cleancache_invalidate_page(mapping, page); - if (page_has_private(page)) - do_invalidatepage(page, 0, - partial_end); - unlock_page(page); - page_cache_release(page); + unsigned int top = PAGE_CACHE_SIZE; + if (start > end) { + /* Truncation within a single page */ + top = partial_end; + partial_end = 0; } + truncate_partial_page(mapping, start - 1, partial_start, + top - partial_start); } + if (partial_end) + truncate_partial_page(mapping, end, 0, partial_end); + /* * If the truncation happened within a single page no pages * will be released, just zeroed, so we can bail out now. 
@@ -386,8 +413,7 @@ void truncate_inode_pages_range(struct address_space *mapping, lock_page(page); WARN_ON(page->index != index); - wait_on_page_writeback(page); - truncate_inode_page(mapping, page); + truncate_full_page(mapping, page, 1); unlock_page(page); } pagevec_remove_exceptionals(&pvec);