[0/3,v2] xfs: Fix races between readahead and hole punching

Message ID	20190829131034.10563-1-jack@suse.cz (mailing list archive)
Headers	show Return-Path: <SRS0=XaYT=WZ=vger.kernel.org=linux-fsdevel-owner@kernel.org> From: Jan Kara <jack@suse.cz> To: <linux-xfs@vger.kernel.org> Cc: <linux-mm@kvack.org>, Amir Goldstein <amir73il@gmail.com>, "Darrick J. Wong" <darrick.wong@oracle.com>, Boaz Harrosh <boaz@plexistor.com>, <linux-fsdevel@vger.kernel.org>, Jan Kara <jack@suse.cz> Subject: [PATCH 0/3 v2] xfs: Fix races between readahead and hole punching Date: Thu, 29 Aug 2019 15:10:31 +0200 Message-Id: <20190829131034.10563-1-jack@suse.cz> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	xfs: Fix races between readahead and hole punching \| expand [0/3,v2] xfs: Fix races between readahead and hole punching [1/3] mm: Handle MADV_WILLNEED through vfs_fadvise() [2/3] fs: Export generic_fadvise() [3/3] xfs: Fix stale data exposure when readahead races with hole punch

Jan Kara Aug. 29, 2019, 1:10 p.m. UTC

Hello,

this is a patch series that addresses a possible race between readahead and
hole punching Amir has discovered [1]. The first patch makes madvise(2) to
handle readahead requests through fadvise infrastructure, the third patch
then adds necessary locking to XFS to protect against the race. Note that
other filesystems need similar protections but e.g. in case of ext4 it isn't
so simple without seriously regressing mixed rw workload performance so
I'm pushing just xfs fix at this moment which is simple.

Changes since v1 (posted at [2]):
* Added reviewed-by tags
* Fixed indentation in xfs_file_fadvise()
* Improved comment and readibility of xfs_file_fadvise()

								Honza

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20190711140012.1671-1-jack@suse.cz/

Amir Goldstein Jan. 17, 2020, 10:50 a.m. UTC | #1

On Thu, Aug 29, 2019 at 4:10 PM Jan Kara <jack@suse.cz> wrote:
>
> Hello,
>
> this is a patch series that addresses a possible race between readahead and
> hole punching Amir has discovered [1]. The first patch makes madvise(2) to
> handle readahead requests through fadvise infrastructure, the third patch
> then adds necessary locking to XFS to protect against the race. Note that
> other filesystems need similar protections but e.g. in case of ext4 it isn't
> so simple without seriously regressing mixed rw workload performance so
> I'm pushing just xfs fix at this moment which is simple.
>

Jan,

Could you give a quick status update about the state of this issue for
ext4 and other fs. I remember some solutions were discussed.
Perhaps this could be a good topic for a cross track session in LSF/MM?
Aren't the challenges posed by this race also relevant for RWF_UNCACHED?

Thanks,
Amir.

Amir Goldstein Jan. 19, 2020, 8:35 a.m. UTC | #2

On Fri, Jan 17, 2020 at 12:50 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Aug 29, 2019 at 4:10 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hello,
> >
> > this is a patch series that addresses a possible race between readahead and
> > hole punching Amir has discovered [1]. The first patch makes madvise(2) to
> > handle readahead requests through fadvise infrastructure, the third patch
> > then adds necessary locking to XFS to protect against the race. Note that
> > other filesystems need similar protections but e.g. in case of ext4 it isn't
> > so simple without seriously regressing mixed rw workload performance so
> > I'm pushing just xfs fix at this moment which is simple.
> >
>
> Jan,
>
> Could you give a quick status update about the state of this issue for
> ext4 and other fs. I remember some solutions were discussed.
> Perhaps this could be a good topic for a cross track session in LSF/MM?
> Aren't the challenges posed by this race also relevant for RWF_UNCACHED?
>

Maybe a silly question:

Can someone please explain to me why we even bother truncating pages on
punch hole?
Wouldn't it solve the race if instead we zeroed those pages and marked them
readonly?

The comment above trunacte_pagecache_range() says:
 * This function should typically be called before the filesystem
 * releases resources associated with the freed range (eg. deallocates
 * blocks). This way, pagecache will always stay logically coherent
 * with on-disk format, and the filesystem would not have to deal with
 * situations such as writepage being called for a page that has already
 * had its underlying blocks deallocated.

So in order to prevent writepage from being called on a punched hole,
we need to make sure that page write fault will be called, which is the same
state as if an exiting hole has been read into page cache but not written yet.
Right? Wrong?

What am I missing here?

Thanks,
Amir.

Jan Kara Jan. 20, 2020, 11:47 a.m. UTC | #3

On Sun 19-01-20 10:35:08, Amir Goldstein wrote:
> On Fri, Jan 17, 2020 at 12:50 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Thu, Aug 29, 2019 at 4:10 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Hello,
> > >
> > > this is a patch series that addresses a possible race between readahead and
> > > hole punching Amir has discovered [1]. The first patch makes madvise(2) to
> > > handle readahead requests through fadvise infrastructure, the third patch
> > > then adds necessary locking to XFS to protect against the race. Note that
> > > other filesystems need similar protections but e.g. in case of ext4 it isn't
> > > so simple without seriously regressing mixed rw workload performance so
> > > I'm pushing just xfs fix at this moment which is simple.
> > >
> >
> > Jan,
> >
> > Could you give a quick status update about the state of this issue for
> > ext4 and other fs. I remember some solutions were discussed.
> > Perhaps this could be a good topic for a cross track session in LSF/MM?
> > Aren't the challenges posed by this race also relevant for RWF_UNCACHED?
> >
> 
> Maybe a silly question:
> 
> Can someone please explain to me why we even bother truncating pages on
> punch hole?
> Wouldn't it solve the race if instead we zeroed those pages and marked them
> readonly?

Not if we also didn't keep them locked. Page reclaim can reclaim clean
unlocked pages any time it wants... Plus the CPU overhead of zeroing
potentially large ranges of pages would be significant.

> The comment above trunacte_pagecache_range() says:
>  * This function should typically be called before the filesystem
>  * releases resources associated with the freed range (eg. deallocates
>  * blocks). This way, pagecache will always stay logically coherent
>  * with on-disk format, and the filesystem would not have to deal with
>  * situations such as writepage being called for a page that has already
>  * had its underlying blocks deallocated.
> 
> So in order to prevent writepage from being called on a punched hole,
> we need to make sure that page write fault will be called, which is the same
> state as if an exiting hole has been read into page cache but not written yet.
> Right? Wrong?

Also the writeback in the comment you mention above is just an example. As
the race you've found shows, there is a problem with reading as well.

								Honza

Jan Kara Jan. 20, 2020, 12:03 p.m. UTC | #4

Hi Amir!

On Fri 17-01-20 12:50:58, Amir Goldstein wrote:
> On Thu, Aug 29, 2019 at 4:10 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hello,
> >
> > this is a patch series that addresses a possible race between readahead and
> > hole punching Amir has discovered [1]. The first patch makes madvise(2) to
> > handle readahead requests through fadvise infrastructure, the third patch
> > then adds necessary locking to XFS to protect against the race. Note that
> > other filesystems need similar protections but e.g. in case of ext4 it isn't
> > so simple without seriously regressing mixed rw workload performance so
> > I'm pushing just xfs fix at this moment which is simple.
> >
> 
> Could you give a quick status update about the state of this issue for
> ext4 and other fs. I remember some solutions were discussed.

Shortly: I didn't get to this. I'm sorry :-|. I'll bump up a priority but I
can't promise anything at the moment.

> Perhaps this could be a good topic for a cross track session in LSF/MM?

Maybe although this is one of the cases where it's easy to chat about
possible solutions but somewhat tedious to write one so I'm not sure how
productive that would be. BTW my discussion with Kent [1] is in fact very
related to this problem (the interval lock he has is to stop exactly races
like this).

> Aren't the challenges posed by this race also relevant for RWF_UNCACHED?

Do you have anything particular in mind? I don't see how RWF_UNCACHED would
make this any better or worse than DIO / readahead...

								Honza

[1] https://lore.kernel.org/linux-fsdevel/20191216193852.GA8664@kmo-pixel/

Amir Goldstein Jan. 20, 2020, 1:54 p.m. UTC | #5

On Mon, Jan 20, 2020 at 2:03 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir!
>
> On Fri 17-01-20 12:50:58, Amir Goldstein wrote:
> > On Thu, Aug 29, 2019 at 4:10 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Hello,
> > >
> > > this is a patch series that addresses a possible race between readahead and
> > > hole punching Amir has discovered [1]. The first patch makes madvise(2) to
> > > handle readahead requests through fadvise infrastructure, the third patch
> > > then adds necessary locking to XFS to protect against the race. Note that
> > > other filesystems need similar protections but e.g. in case of ext4 it isn't
> > > so simple without seriously regressing mixed rw workload performance so
> > > I'm pushing just xfs fix at this moment which is simple.
> > >
> >
> > Could you give a quick status update about the state of this issue for
> > ext4 and other fs. I remember some solutions were discussed.
>
> Shortly: I didn't get to this. I'm sorry :-|. I'll bump up a priority but I
> can't promise anything at the moment.
>
> > Perhaps this could be a good topic for a cross track session in LSF/MM?
>
> Maybe although this is one of the cases where it's easy to chat about
> possible solutions but somewhat tedious to write one so I'm not sure how
> productive that would be. BTW my discussion with Kent [1] is in fact very
> related to this problem (the interval lock he has is to stop exactly races
> like this).
>

Well, I was mostly interested to know if there is an agreement on the way to
solve the problem. If we need to discuss it to reach consensus than it might
be a good topic for LSF/MM. If you already know what needs to be done,
there is no need for a discussion.

> > Aren't the challenges posed by this race also relevant for RWF_UNCACHED?
>
> Do you have anything particular in mind? I don't see how RWF_UNCACHED would
> make this any better or worse than DIO / readahead...
>

Not better nor worse. I meant that RFW_UNCACHED is another case that
would suffer the same races.

Thanks,
Amir.

Jan Kara Jan. 20, 2020, 4:58 p.m. UTC | #6

On Mon 20-01-20 15:54:28, Amir Goldstein wrote:
> On Mon, Jan 20, 2020 at 2:03 PM Jan Kara <jack@suse.cz> wrote:
> > On Fri 17-01-20 12:50:58, Amir Goldstein wrote:
> > > On Thu, Aug 29, 2019 at 4:10 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > Hello,
> > > >
> > > > this is a patch series that addresses a possible race between readahead and
> > > > hole punching Amir has discovered [1]. The first patch makes madvise(2) to
> > > > handle readahead requests through fadvise infrastructure, the third patch
> > > > then adds necessary locking to XFS to protect against the race. Note that
> > > > other filesystems need similar protections but e.g. in case of ext4 it isn't
> > > > so simple without seriously regressing mixed rw workload performance so
> > > > I'm pushing just xfs fix at this moment which is simple.
> > > >
> > >
> > > Could you give a quick status update about the state of this issue for
> > > ext4 and other fs. I remember some solutions were discussed.
> >
> > Shortly: I didn't get to this. I'm sorry :-|. I'll bump up a priority but I
> > can't promise anything at the moment.
> >
> > > Perhaps this could be a good topic for a cross track session in LSF/MM?
> >
> > Maybe although this is one of the cases where it's easy to chat about
> > possible solutions but somewhat tedious to write one so I'm not sure how
> > productive that would be. BTW my discussion with Kent [1] is in fact very
> > related to this problem (the interval lock he has is to stop exactly races
> > like this).
> >
> 
> Well, I was mostly interested to know if there is an agreement on the way to
> solve the problem. If we need to discuss it to reach consensus than it might
> be a good topic for LSF/MM. If you already know what needs to be done,
> there is no need for a discussion.

So I have an idea how it could be solved: Change calling convention for
->readpage() so that it gets called without page locked and take
i_mmap_sem there (and in ->readpages()) to protect from the race. But
I wanted to present it in the form of patches as the devil here is in the
details and it may prove to be too ugly to be bearable.

If I won't get to writing the patches, you're right it may be sensible to
present the idea to people at LSF/MM what they think about it.

> > > Aren't the challenges posed by this race also relevant for RWF_UNCACHED?
> >
> > Do you have anything particular in mind? I don't see how RWF_UNCACHED would
> > make this any better or worse than DIO / readahead...
> >
> 
> Not better nor worse. I meant that RFW_UNCACHED is another case that
> would suffer the same races.

Yes, that's right.

								Honza

[0/3,v2] xfs: Fix races between readahead and hole punching

Message

Comments