[0/7,RFC,v3] fs: Hole punch vs page cache filling races

Message ID	20210413105205.3093-1-jack@suse.cz (mailing list archive)
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Jan Kara <jack@suse.cz> To: <linux-fsdevel@vger.kernel.org> Cc: <linux-ext4@vger.kernel.org>, <linux-xfs@vger.kernel.org>, Ted Tso <tytso@mit.edu>, Christoph Hellwig <hch@infradead.org>, Amir Goldstein <amir73il@gmail.com>, Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz> Subject: [PATCH 0/7 RFC v3] fs: Hole punch vs page cache filling races Date: Tue, 13 Apr 2021 13:28:44 +0200 Message-Id: <20210413105205.3093-1-jack@suse.cz> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	fs: Hole punch vs page cache filling races \| expand [0/7,RFC,v3] fs: Hole punch vs page cache filling races [1/7] mm: Fix comments mentioning i_mutex [2/7] mm: Protect operations adding pages to page cache with i_mapping_lock [3/7] ext4: Convert to use inode->i_mapping_sem [4/7] ext2: Convert to using i_mapping_sem [5/7] xfs: Convert to use i_mapping_sem [6/7] zonefs: Convert to using i_mapping_sem [7/7] fs: Remove i_mapping_sem protection from .page_mkwrite handlers

Jan Kara April 13, 2021, 11:28 a.m. UTC

Hello,

here is another version of my patches to address races between hole punching
and page cache filling functions for ext4 and other filesystems. Since the last
posting I've added more documentation and comments regarding the lock ordering
of the new lock (i_mapping_sem) and how is the new lock supposed to be used.
I've also added conversions of ext2, xfs, zonefs so that there is better idea
how the conversion looks for most filesystems. Obviously, there are much more
filesystems to convert but I don't want to do that work unless we have a
concensus this is indeed the right approach. Also as a result of spelling out
locking rules more precisely, I have realized there's no need to use
i_mapping_sem in .page_mkwrite handlers so I've added a bonus patch removing
those - not sure we want to actually do that together with the rest of the
series (maybe we can do this cleanup later when the rest of the conversion has
settled down).

Also when writing the documentation I came across one question: Do we mandate
i_mapping_sem for truncate + hole punch for all filesystems or just for
filesystems that support hole punching (or other complex fallocate operations)?
I wrote the documentation so that we require every filesystem to use
i_mapping_sem. This makes locking rules simpler, we can also add asserts when
all filesystems are converted. The downside is that simple filesystems now pay
the overhead of the locking unnecessary for them. The overhead is small
(uncontended rwsem acquisition for truncate) so I don't think we care and the
simplicity is worth it but I wanted to spell this out.

What do people think about this?

Changes since v2:
* Added documentation and comments regarding lock ordering and how the lock is
supposed to be used
* Added conversions of ext2, xfs, zonefs
* Added patch removing i_mapping_sem protection from .page_mkwrite handlers

Changes since v1:
* Moved to using inode->i_mapping_sem instead of aops handler to acquire
appropriate lock

---
Motivation:

Amir has reported [1] a that ext4 has a potential issues when reads can race
with hole punching possibly exposing stale data from freed blocks or even
corrupting filesystem when stale mapping data gets used for writeout. The
problem is that during hole punching, new page cache pages can get instantiated
and block mapping from the looked up in a punched range after
truncate_inode_pages() has run but before the filesystem removes blocks from
the file. In principle any filesystem implementing hole punching thus needs to
implement a mechanism to block instantiating page cache pages during hole
punching to avoid this race. This is further complicated by the fact that there
are multiple places that can instantiate pages in page cache. We can have
regular read(2) or page fault doing this but fadvise(2) or madvise(2) can also
result in reading in page cache pages through force_page_cache_readahead().

There are couple of ways how to fix this. First way (currently implemented by
XFS) is to protect read(2) and *advise(2) calls with i_rwsem so that they are
serialized with hole punching. This is easy to do but as a result all reads
would then be serialized with writes and thus mixed read-write workloads suffer
heavily on ext4. Thus this series introduces inode->i_mapping_sem and uses it
when creating new pages in the page cache and looking up their corresponding
block mapping. We also replace EXT4_I(inode)->i_mmap_sem with this new rwsem
which provides necessary serialization with hole punching for ext4.

Honza

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/

Previous versions:
Link: https://lore.kernel.org/linux-fsdevel/20210208163918.7871-1-jack@suse.cz/

Christoph Hellwig April 13, 2021, 1:09 p.m. UTC | #1

> Also when writing the documentation I came across one question: Do we mandate
> i_mapping_sem for truncate + hole punch for all filesystems or just for
> filesystems that support hole punching (or other complex fallocate operations)?
> I wrote the documentation so that we require every filesystem to use
> i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> all filesystems are converted. The downside is that simple filesystems now pay
> the overhead of the locking unnecessary for them. The overhead is small
> (uncontended rwsem acquisition for truncate) so I don't think we care and the
> simplicity is worth it but I wanted to spell this out.

I think all makes for much better to understand and document rules,
so I'd shoot for that eventually.

Btw, what about locking for DAX faults?  XFS seems to take
the mmap sem for those as well currently.

Jan Kara April 13, 2021, 2:17 p.m. UTC | #2

On Tue 13-04-21 14:09:50, Christoph Hellwig wrote:
> > Also when writing the documentation I came across one question: Do we mandate
> > i_mapping_sem for truncate + hole punch for all filesystems or just for
> > filesystems that support hole punching (or other complex fallocate operations)?
> > I wrote the documentation so that we require every filesystem to use
> > i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> > all filesystems are converted. The downside is that simple filesystems now pay
> > the overhead of the locking unnecessary for them. The overhead is small
> > (uncontended rwsem acquisition for truncate) so I don't think we care and the
> > simplicity is worth it but I wanted to spell this out.
> 
> I think all makes for much better to understand and document rules,
> so I'd shoot for that eventually.

OK.

> Btw, what about locking for DAX faults?  XFS seems to take
> the mmap sem for those as well currently.

Yes, I've mechanically converted all those uses to i_mapping_sem for XFS,
ext4, and ext2 as well. Longer term we may be able to move some locking
into generic DAX code now that the lock is in struct inode. But I want to
leave that for later since DAX locking is different enough that it needs
some careful thinking and justification...

								Honza

Matthew Wilcox April 19, 2021, 3:20 p.m. UTC | #3

On Tue, Apr 13, 2021 at 01:28:44PM +0200, Jan Kara wrote:
> Also when writing the documentation I came across one question: Do we mandate
> i_mapping_sem for truncate + hole punch for all filesystems or just for
> filesystems that support hole punching (or other complex fallocate operations)?
> I wrote the documentation so that we require every filesystem to use
> i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> all filesystems are converted. The downside is that simple filesystems now pay
> the overhead of the locking unnecessary for them. The overhead is small
> (uncontended rwsem acquisition for truncate) so I don't think we care and the
> simplicity is worth it but I wanted to spell this out.

What do we actually get in return for supporting these complex fallocate
operations?  Someone added them for a reason, but does that reason
actually benefit me?  Other than running xfstests, how many times has
holepunch been called on your laptop in the last week?  I don't want to
incur even one extra instruction per I/O operation to support something
that happens twice a week; that's a bad tradeoff.

Can we implement holepunch as a NOP?  Or return -ENOTTY?  Those both
seem like better solutions than adding an extra rwsem to every inode.
Failing that, is there a bigger hammer we can use on the holepunch side
(eg preventing all concurrent accesses while the holepunch is happening)
to reduce the overhead on the read side?

Jan Kara April 19, 2021, 4:25 p.m. UTC | #4

On Mon 19-04-21 16:20:08, Matthew Wilcox wrote:
> On Tue, Apr 13, 2021 at 01:28:44PM +0200, Jan Kara wrote:
> > Also when writing the documentation I came across one question: Do we mandate
> > i_mapping_sem for truncate + hole punch for all filesystems or just for
> > filesystems that support hole punching (or other complex fallocate operations)?
> > I wrote the documentation so that we require every filesystem to use
> > i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> > all filesystems are converted. The downside is that simple filesystems now pay
> > the overhead of the locking unnecessary for them. The overhead is small
> > (uncontended rwsem acquisition for truncate) so I don't think we care and the
> > simplicity is worth it but I wanted to spell this out.
> 
> What do we actually get in return for supporting these complex fallocate
> operations?  Someone added them for a reason, but does that reason
> actually benefit me?  Other than running xfstests, how many times has
> holepunch been called on your laptop in the last week?  I don't want to
> incur even one extra instruction per I/O operation to support something
> that happens twice a week; that's a bad tradeoff.

I agree hole punch is relatively rare compared to normal operations but
when it is used, it is used rather frequently - e.g. by VMs to manage their
filesystem images. So if we regress holepunch either by not freeing blocks
or by slowing it down significantly, I'm pretty sure some people will
complain. That being said I fully understand your reluctance to add lock to
the read path but note that it is taken only when we need to fill data from
the storage and it should be taken once per readahead request so I actually
doubt the extra acquisition will be visible in the profiles. But I can
profile it to be sure.

> Can we implement holepunch as a NOP?  Or return -ENOTTY?  Those both
> seem like better solutions than adding an extra rwsem to every inode.

We already have that rwsem there today for most major filesystems. This
work just lifts it from fs-private inode area into the VFS inode. So in
terms of memory usage we are not loosing that much.

> Failing that, is there a bigger hammer we can use on the holepunch side
> (eg preventing all concurrent accesses while the holepunch is happening)
> to reduce the overhead on the read side?

I'm open to other solutions but frankly this was the best I could come up
with. Holepunch already uses a pretty much big hammer approach - take all
the locks there are on the inode in exclusive mode, block DIO, unmap
everything and then do its dirty deeds... I don't think we want hole punch
to block anything on fs-wide basis (that's a DoS recipe) and besides that
I don't see how the hammer could be bigger ;).

								Honza

Dave Chinner April 20, 2021, 10:12 p.m. UTC | #5

On Mon, Apr 19, 2021 at 04:20:08PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 13, 2021 at 01:28:44PM +0200, Jan Kara wrote:
> > Also when writing the documentation I came across one question: Do we mandate
> > i_mapping_sem for truncate + hole punch for all filesystems or just for
> > filesystems that support hole punching (or other complex fallocate operations)?
> > I wrote the documentation so that we require every filesystem to use
> > i_mapping_sem. This makes locking rules simpler, we can also add asserts when
> > all filesystems are converted. The downside is that simple filesystems now pay
> > the overhead of the locking unnecessary for them. The overhead is small
> > (uncontended rwsem acquisition for truncate) so I don't think we care and the
> > simplicity is worth it but I wanted to spell this out.
> 
> What do we actually get in return for supporting these complex fallocate
> operations?  Someone added them for a reason, but does that reason
> actually benefit me?  Other than running xfstests, how many times has
> holepunch been called on your laptop in the last week?

Quite a lot, actually.

> I don't want to
> incur even one extra instruction per I/O operation to support something
> that happens twice a week; that's a bad tradeoff.

Hole punching gets into all sorts of interesting places. For
example, did you know that issuing fstrim (discards) or "write
zeroes" on a file-backed loopback device will issue hole punches to
the underlying file? nvmet does the same. Userspace iscsi server
implementations (e.g. TGT) do the same thing and have for a long
time. NFSv4 servers issue hole punching based on client side
requests, too.

Then there's Kubernetes management tools. Samba. Qemu. Libvirt.
Mysql. Network-Manager. Gluster. Chromium. RocksDB. Swift. Systemd.
The list of core system infrastructure we have that uses hole
punching is quite large...

So, really, hole punching is something that happens a lot and in
many unexpected places. You can argue that your laptop doesn't use
it, but that really doesn't matter in the bigger scheme of things.
Hole punching is something applications expect to work and not
corrupt data....

> Can we implement holepunch as a NOP?  Or return -ENOTTY?  Those both
> seem like better solutions than adding an extra rwsem to every inode.

We've already added this extra i_rwsem to ext4 and XFS - it's a sunk
cost for almost every production machine out there in the wild. It
needs to be made generic so we can optimise the implementation and
not have to implement a unicorn in every filesystem to work around
the fact the page cache and page faults have no internal
serialisation mechanism against filesystem operations that directly
manipulate and invalidate large ranges of the backing storage the
page cache sits over.

> Failing that, is there a bigger hammer we can use on the holepunch side
> (eg preventing all concurrent accesses while the holepunch is happening)
> to reduce the overhead on the read side?

That's what we currently do and what Jan is trying to refine....

Cheers,

Dave.

[0/7,RFC,v3] fs: Hole punch vs page cache filling races

Message

Comments