[1/5] mm: Do not reclaim private data from pinned page

Message ID 20230209123206.3548-1-jack@suse.cz (mailing list archive)
State New, archived
Series Writeback handling of pinned pages

Commit Message

Jan Kara Feb. 9, 2023, 12:31 p.m. UTC
If the page is pinned, there's no point in trying to reclaim it.
Furthermore, if the page is from the page cache, we don't want to reclaim
fs-private data from the page because the pinning process may be writing
to the page at any time, and reclaiming fs-private info on a dirty page
can upset the filesystem (see link below).

Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/vmscan.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Matthew Wilcox Feb. 9, 2023, 4:17 p.m. UTC | #1
On Thu, Feb 09, 2023 at 01:31:53PM +0100, Jan Kara wrote:
> If the page is pinned, there's no point in trying to reclaim it.
> Furthermore if the page is from the page cache we don't want to reclaim
> fs-private data from the page because the pinning process may be writing
> to the page at any time and reclaiming fs private info on a dirty page
> can upset the filesystem (see link below).
> 
> Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz

OK, but now I'm confused.  I've been told before that the reason we
can't take pinned pages off the LRU list is that they need to be written
back periodically for ... reasons.  But now the pages are going to be
skipped if they're found on the LRU list, so why is this better than
taking them off the LRU list?
Jan Kara Feb. 10, 2023, 11:29 a.m. UTC | #2
On Thu 09-02-23 16:17:47, Matthew Wilcox wrote:
> On Thu, Feb 09, 2023 at 01:31:53PM +0100, Jan Kara wrote:
> > If the page is pinned, there's no point in trying to reclaim it.
> > Furthermore if the page is from the page cache we don't want to reclaim
> > fs-private data from the page because the pinning process may be writing
> > to the page at any time and reclaiming fs private info on a dirty page
> > can upset the filesystem (see link below).
> > 
> > Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
> 
> OK, but now I'm confused.  I've been told before that the reason we
> can't take pinned pages off the LRU list is that they need to be written
> back periodically for ... reasons.  But now the pages are going to be
> skipped if they're found on the LRU list, so why is this better than
> taking them off the LRU list?

You are mixing things together a bit :). Yes, we do need to write back
pinned pages from time to time - for data integrity purposes like fsync(2).
This has nothing to do with taking the pinned page out of LRU. It would be
actually nice to be able to take pinned pages out of the LRU and
functionally that would make sense but as I've mentioned in my reply to you
[1], the problem here is the performance. I've now dug out the discussion
from 2018 where John actually tried to take pinned pages out of the LRU [2]
and the result was 20% IOPS degradation on his NVMe drive because of the
cost of taking the LRU lock. I'm not even speaking of how costly that would
get on any heavily parallel direct IO workload on some high-iops device...

								Honza

[1] https://lore.kernel.org/all/20230124102931.g7e33syuhfo7s36h@quack3
[2] https://lore.kernel.org/all/f5ad7210-05e0-3dc4-02df-01ce5346e198@nvidia.com
David Hildenbrand Feb. 13, 2023, 9:01 a.m. UTC | #3
On 09.02.23 13:31, Jan Kara wrote:
> If the page is pinned, there's no point in trying to reclaim it.
> Furthermore if the page is from the page cache we don't want to reclaim
> fs-private data from the page because the pinning process may be writing
> to the page at any time and reclaiming fs private info on a dirty page
> can upset the filesystem (see link below).
> 
> Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>   mm/vmscan.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bf3eedf0209c..ab3911a8b116 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1901,6 +1901,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   			}
>   		}
>   
> +		/*
> +		 * Folio is unmapped now so it cannot be newly pinned anymore.
> +		 * No point in trying to reclaim folio if it is pinned.
> +		 * Furthermore we don't want to reclaim underlying fs metadata
> +		 * if the folio is pinned and thus potentially modified by the
> +		 * pinning process as that may upset the filesystem.
> +		 */
> +		if (folio_maybe_dma_pinned(folio))
> +			goto activate_locked;
> +
>   		mapping = folio_mapping(folio);
>   		if (folio_test_dirty(folio)) {
>   			/*

At this point, we made sure that the folio is completely unmapped. 
However, we specify "TTU_BATCH_FLUSH", so rmap code might defer a TLB 
flush and consequently defer an IPI sync.

I remember that this check here is fine regarding GUP-fast: even if 
concurrent GUP-fast pins the page after our check here, it should 
observe the changed PTE and unpin it again.


Checking after unmapping makes sense: we reduce the likelihood of false
positives when a file-backed page is mapped many times (>= 1024). OTOH, 
we might unmap pinned pages because we cannot really detect it early.

For anon pages, we have an early (racy) check, which turned out "ok" in 
practice, because we don't frequently have that many anon pages that are 
shared by that many processes. I assume we don't want something similar 
for pagecache pages, because having a single page mapped by many 
processes can happen easily and would prevent reclaim.

I once had a patch lying around that documented for the existing 
folio_maybe_dma_pinned() for anon pages exactly that (racy+false 
positives with many mappings).


Long story short, I assume this change is fine.
Christoph Hellwig Feb. 13, 2023, 9:55 a.m. UTC | #4
On Fri, Feb 10, 2023 at 12:29:54PM +0100, Jan Kara wrote:
> functionally that would make sense but as I've mentioned in my reply to you
> [1], the problem here is the performance. I've now dug out the discussion
> from 2018 where John actually tried to take pinned pages out of the LRU [2]
> and the result was 20% IOPS degradation on his NVMe drive because of the
> cost of taking the LRU lock. I'm not even speaking of how costly that would
> get on any heavily parallel direct IO workload on some high-iops device...

I think we need to distinguish between short- and long-term pins.
For short-term pins like direct I/O it doesn't make sense to take them
off the LRU, or to do any other special action.  Writeback will simply
have to wait for the short-term pin.

Long-term pins absolutely would make sense to be taken off the LRU list.
Jan Kara Feb. 14, 2023, 1 p.m. UTC | #5
On Mon 13-02-23 10:01:35, David Hildenbrand wrote:
> On 09.02.23 13:31, Jan Kara wrote:
> > If the page is pinned, there's no point in trying to reclaim it.
> > Furthermore if the page is from the page cache we don't want to reclaim
> > fs-private data from the page because the pinning process may be writing
> > to the page at any time and reclaiming fs private info on a dirty page
> > can upset the filesystem (see link below).
> > 
> > Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >   mm/vmscan.c | 10 ++++++++++
> >   1 file changed, 10 insertions(+)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bf3eedf0209c..ab3911a8b116 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1901,6 +1901,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >   			}
> >   		}
> > +		/*
> > +		 * Folio is unmapped now so it cannot be newly pinned anymore.
> > +		 * No point in trying to reclaim folio if it is pinned.
> > +		 * Furthermore we don't want to reclaim underlying fs metadata
> > +		 * if the folio is pinned and thus potentially modified by the
> > +		 * pinning process as that may upset the filesystem.
> > +		 */
> > +		if (folio_maybe_dma_pinned(folio))
> > +			goto activate_locked;
> > +
> >   		mapping = folio_mapping(folio);
> >   		if (folio_test_dirty(folio)) {
> >   			/*
> 
> At this point, we made sure that the folio is completely unmapped. However,
> we specify "TTU_BATCH_FLUSH", so rmap code might defer a TLB flush and
> consequently defer an IPI sync.
> 
> I remember that this check here is fine regarding GUP-fast: even if
> concurrent GUP-fast pins the page after our check here, it should observe
> the changed PTE and unpin it again.
>  
> Checking after unmapping makes sense: we reduce the likelihood of false
> positives when a file-backed page is mapped many times (>= 1024). OTOH, we
> might unmap pinned pages because we cannot really detect it early.
> 
> For anon pages, we have an early (racy) check, which turned out "ok" in
> practice, because we don't frequently have that many anon pages that are
> shared by that many processes. I assume we don't want something similar for
> pagecache pages, because having a single page mapped by many processes can
> happen easily and would prevent reclaim.

Yeah, I think pagecache pages shared by many processes are more likely.
Furthermore I think pinned pagecache pages are rather rare so unmapping
them before checking seems fine to me. Obviously we can reconsider if
reality proves me wrong ;).

> I once had a patch lying around that documented for the existing
> folio_maybe_dma_pinned() for anon pages exactly that (racy+false positives
> with many mappings).
> 
> Long story short, I assume this change is fine.

Thanks for the thorough verification :)

								Honza
Jan Kara Feb. 14, 2023, 1:06 p.m. UTC | #6
On Mon 13-02-23 01:55:04, Christoph Hellwig wrote:
> On Fri, Feb 10, 2023 at 12:29:54PM +0100, Jan Kara wrote:
> > functionally that would make sense but as I've mentioned in my reply to you
> > [1], the problem here is the performance. I've now dug out the discussion
> > from 2018 where John actually tried to take pinned pages out of the LRU [2]
> > and the result was 20% IOPS degradation on his NVMe drive because of the
> > cost of taking the LRU lock. I'm not even speaking of how costly that would
> > get on any heavily parallel direct IO workload on some high-iops device...
> 
> I think we need to distinguish between short- and long-term pins.
> For short-term pins like direct I/O it doesn't make sense to take them
> off the LRU, or to do any other special action.  Writeback will simply
> have to wait for the short-term pin.
> 
> Long-term pins absolutely would make sense to be taken off the LRU list.

Yeah, I agree distinguishing these two would be nice as we could treat them
differently then. The trouble is a bit with always-crowded struct page. But
now it occurred to me that if we are going to take these long-term pinned
pages out from the LRU, we could overload the space for LRU pointers with
the counter (which is what I think John originally did). So yes, possibly
we could track separately long-term and short-term pins. John, what do you
think? Maybe time to revive your patches from 2018 in a bit different form?
;)

								Honza
John Hubbard Feb. 14, 2023, 9:40 p.m. UTC | #7
On 2/14/23 05:06, Jan Kara wrote:
> On Mon 13-02-23 01:55:04, Christoph Hellwig wrote:
>> I think we need to distinguish between short- and long-term pins.
>> For short-term pins like direct I/O it doesn't make sense to take them
>> off the LRU, or to do any other special action.  Writeback will simply
>> have to wait for the short-term pin.
>>
>> Long-term pins absolutely would make sense to be taken off the LRU list.
> 
> Yeah, I agree distinguishing these two would be nice as we could treat them
> differently then. The trouble is a bit with always-crowded struct page. But
> now it occurred to me that if we are going to take these long-term pinned
> pages out from the LRU, we could overload the space for LRU pointers with
> the counter (which is what I think John originally did). So yes, possibly
> we could track separately long-term and short-term pins. John, what do you
> think? Maybe time to revive your patches from 2018 in a bit different form?
> ;)
> 

Oh wow, I really love this idea. We kept running into problems because
long- and short-term pins were mixed up together (except during
creation), and this, at long last, separates them. Very nice. I'd almost
forgotten about the 2018 page.lru adventures, too. ha :)

One objection might be that pinning is now going to be taking a lot of
space in struct page / folio, but I think it's warranted, based on the
long-standing, difficult problems that it would solve.

We could even leave most of these patches, and David Howells' patches,
intact, by using an approach similar to the mm_users and mm_count
technique: maintain a long-term pin count in one of the folio->lru
fields, and any non-zero count there creates a single count in
folio->_pincount.

I could put together something to do that.


thanks,
Jan Kara Feb. 16, 2023, 11:56 a.m. UTC | #8
On Tue 14-02-23 13:40:17, John Hubbard wrote:
> On 2/14/23 05:06, Jan Kara wrote:
> > On Mon 13-02-23 01:55:04, Christoph Hellwig wrote:
> >> I think we need to distinguish between short- and long-term pins.
> >> For short-term pins like direct I/O it doesn't make sense to take them
> >> off the LRU, or to do any other special action.  Writeback will simply
> >> have to wait for the short-term pin.
> >>
> >> Long-term pins absolutely would make sense to be taken off the LRU list.
> > 
> > Yeah, I agree distinguishing these two would be nice as we could treat them
> > differently then. The trouble is a bit with always-crowded struct page. But
> > now it occurred to me that if we are going to take these long-term pinned
> > pages out from the LRU, we could overload the space for LRU pointers with
> > the counter (which is what I think John originally did). So yes, possibly
> > we could track separately long-term and short-term pins. John, what do you
> > think? Maybe time to revive your patches from 2018 in a bit different form?
> > ;)
> > 
> 
> Oh wow, I really love this idea. We kept running into problems because
> long- and short-term pins were mixed up together (except during
> creation), and this, at long last, separates them. Very nice. I'd almost
> forgotten about the 2018 page.lru adventures, too. ha :)
> 
> One objection might be that pinning is now going to be taking a lot of
> space in struct page / folio, but I think it's warranted, based on the
> long-standing, difficult problems that it would solve.

Well, it doesn't need to consume more space in the struct page than it
already does currently AFAICS. We could just mark the folio as unevictable
and make sure folio_evictable() returns false for such pages. Then we
should be safe to use space of lru pointers for whatever we need.

> We could even leave most of these patches, and David Howells' patches,
> intact, by using an approach similar to the mm_users and mm_count
> technique: maintain a long-term pin count in one of the folio->lru
> fields, and any non-zero count there creates a single count in
> folio->_pincount.

Oh, you mean that the first longterm pin would take one short-term pin?
Yes, that should be possible but I'm not sure that would be a huge win. I
can imagine users can care about distinguishing these states:

1) unpinned
2) has any pin
3) has only short term pins

Now distinguishing between 1 and 2+3 would still be done by
folio_maybe_dma_pinned(). Your change will allow us to not look at lru
pointers in folio_maybe_dma_pinned() so that's some simplification and
perhaps a performance optimization (potentially it can save us the need to
pull in another cacheline, but mostly _refcount and lru will be in the same
cacheline anyway) so maybe it's worth it in the end.

								Honza

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf3eedf0209c..ab3911a8b116 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1901,6 +1901,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			}
 		}
 
+		/*
+		 * Folio is unmapped now so it cannot be newly pinned anymore.
+		 * No point in trying to reclaim folio if it is pinned.
+		 * Furthermore we don't want to reclaim underlying fs metadata
+		 * if the folio is pinned and thus potentially modified by the
+		 * pinning process as that may upset the filesystem.
+		 */
+		if (folio_maybe_dma_pinned(folio))
+			goto activate_locked;
+
 		mapping = folio_mapping(folio);
 		if (folio_test_dirty(folio)) {
 			/*