[RFC,0/5] btrfs: fix hole corruption issue with !NO_HOLES

Message ID: 20191230213118.7532-1-josef@toxicpanda.com (mailing list archive)

Headers:
Return-Path: <SRS0=x8Sa=2U=vger.kernel.org=linux-btrfs-owner@kernel.org>
From: Josef Bacik <josef@toxicpanda.com>
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [RFC][PATCH 0/5] btrfs: fix hole corruption issue with !NO_HOLES
Date: Mon, 30 Dec 2019 16:31:13 -0500
Message-Id: <20191230213118.7532-1-josef@toxicpanda.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-btrfs-owner@vger.kernel.org
Precedence: bulk

Series: btrfs: fix hole corruption issue with !NO_HOLES
[RFC,0/5] btrfs: fix hole corruption issue with !NO_HOLES
[1/5] btrfs: use btrfs_ordered_update_i_size in clone_finish_inode_update
[2/5] btrfs: introduce the inode->file_extent_tree
[3/5] btrfs: use the file extent tree infrastructure
[4/5] btrfs: replace all uses of btrfs_ordered_update_i_size
[5/5] btrfs: delete the ordered isize update code

Message

Josef Bacik Dec. 30, 2019, 9:31 p.m. UTC

We've historically had this problem where you could flush a targeted section of
an inode and end up with a hole between extents without a hole extent item.
This of course makes fsck complain because this is not ok for a file system that
doesn't have NO_HOLES set.  Because this is a well-understood problem, I and
others have been ignoring fsck failures during certain xfstests (generic/475 for
example) because they would regularly trigger this edge case.

However, this isn't great behavior to have.  We should really be taking all fsck
failures seriously, and we could potentially ignore legitimate fsck errors
because we expect them to be this particular failure.

In order to fix this we need to keep track of where we have valid extent items,
and only update i_size to encompass that area.  This unfortunately means we need
a new per-inode extent_io_tree to keep track of the valid ranges.  This is
relatively straightforward in practice, and helpers have been added to manage
this so that in the case of a NO_HOLES file system we just simply skip this work
altogether.
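
To make the idea above concrete, here is a minimal userspace sketch in Python. It is a toy model, not the kernel code: the class name `FileExtentTracker` and its methods are hypothetical, a plain sorted list stands in for the per-inode extent_io_tree, and it assumes that "only update i_size to encompass that area" means the on-disk i_size may only cover the contiguous run of valid ranges starting at offset 0.

```python
MiB = 1 << 20


class FileExtentTracker:
    """Toy model: tracks byte ranges [start, end) that have extent items."""

    def __init__(self, no_holes=False):
        self.no_holes = no_holes
        self.ranges = []  # sorted, non-overlapping, merged (start, end) pairs

    def mark_valid(self, start, length):
        """Record that [start, start + length) now has a file extent item."""
        if self.no_holes or length <= 0:
            return  # with NO_HOLES there is nothing to track
        new_start, new_end = start, start + length
        merged = []
        for s, e in self.ranges:
            if e < new_start or s > new_end:
                merged.append((s, e))  # disjoint and not adjacent: keep as is
            else:
                # overlapping or touching: fold into the new range
                new_start = min(new_start, s)
                new_end = max(new_end, e)
        merged.append((new_start, new_end))
        merged.sort()
        self.ranges = merged

    def safe_disk_i_size(self, i_size):
        """Largest i_size that leaves no untracked gap below it."""
        if self.no_holes:
            return i_size
        if not self.ranges or self.ranges[0][0] != 0:
            return 0
        return min(i_size, self.ranges[0][1])


tracker = FileExtentTracker()
tracker.mark_valid(50 * MiB, 1 * MiB)   # only the middle 1 MiB has an extent item
middle_only = tracker.safe_disk_i_size(100 * MiB)
tracker.mark_valid(0, 50 * MiB)         # the first 50 MiB catches up
after_prefix = tracker.safe_disk_i_size(100 * MiB)
```

In this model the on-disk size stays at 0 while only the middle range is valid, and advances to the end of the contiguous prefix once the earlier range is recorded. A tracker created with `no_holes=True` skips the bookkeeping and returns the in-memory i_size unchanged.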

I've been hammering on this for a week now and I'm pretty sure it's ok, but I'd
really like Filipe to take a look and I still have some longer running tests
going on the series.  All of our boxes internally are btrfs and the box I was
testing on ended up with a weird RPM db corruption that was likely from an
earlier, broken version of the patch.  However I cannot be 100% sure that was
the case, so I'm giving it a few more days of testing before I'm satisfied
there's not some weird thing that RPM does that xfstests doesn't cover.

This has gone through several iterations of xfstests already, including many
loops of generic/475 for validation to make sure it was no longer failing.  So
far so good, but for something like this wider testing will definitely be
necessary.  Thanks,

Josef

Comments

Qu Wenruo Dec. 31, 2019, 12:25 p.m. UTC | #1

On 2019/12/31 5:31 AM, Josef Bacik wrote:
> We've historically had this problem where you could flush a targeted section of
> an inode and end up with a hole between extents without a hole extent item.
> This of course makes fsck complain because this is not ok for a file system that
> doesn't have NO_HOLES set.  Because this is a well understood problem I and
> others have been ignoring fsck failures during certain xfstests (generic/475 for
> example) because they would regularly trigger this edge case.
> 
> However, this isn't great behavior to have.  We should really be taking all fsck
> failures seriously, and we could potentially ignore legitimate fsck errors
> because we expect them to be this particular failure.
> 
> In order to fix this we need to keep track of where we have valid extent items,
> and only update i_size to encompass that area.  This unfortunately means we need
> a new per-inode extent_io_tree to keep track of the valid ranges.  This is
> relatively straightforward in practice, and helpers have been added to manage
> this so that in the case of a NO_HOLES file system we just simply skip this work
> altogether.

I'm not an expert on this problem, but AFAIK this is caused by mixing
buffered and direct IO, right?

Since that deadly mix is not recommended anyway, can we make things
simpler by just blocking any buffered IO while the same inode is undergoing
direct IO?

Thanks,
Qu

> 
> I've been hammering on this for a week now and I'm pretty sure it's ok, but I'd
> really like Filipe to take a look and I still have some longer running tests
> going on the series.  All of our boxes internally are btrfs and the box I was
> testing on ended up with a weird RPM db corruption that was likely from an
> earlier, broken version of the patch.  However I cannot be 100% sure that was
> the case, so I'm giving it a few more days of testing before I'm satisfied
> there's not some weird thing that RPM does that xfstests doesn't cover.
> 
> This has gone through several iterations of xfstests already, including many
> loops of generic/475 for validation to make sure it was no longer failing.  So
> far so good, but for something like this wider testing will definitely be
> necessary.  Thanks,
> 
> Josef
>

Josef Bacik Jan. 2, 2020, 4:10 p.m. UTC | #2

On 12/31/19 7:25 AM, Qu Wenruo wrote:
> 
> 
> On 2019/12/31 5:31 AM, Josef Bacik wrote:
>> We've historically had this problem where you could flush a targeted section of
>> an inode and end up with a hole between extents without a hole extent item.
>> This of course makes fsck complain because this is not ok for a file system that
>> doesn't have NO_HOLES set.  Because this is a well understood problem I and
>> others have been ignoring fsck failures during certain xfstests (generic/475 for
>> example) because they would regularly trigger this edge case.
>>
>> However, this isn't great behavior to have.  We should really be taking all fsck
>> failures seriously, and we could potentially ignore legitimate fsck errors
>> because we expect them to be this particular failure.
>>
>> In order to fix this we need to keep track of where we have valid extent items,
>> and only update i_size to encompass that area.  This unfortunately means we need
>> a new per-inode extent_io_tree to keep track of the valid ranges.  This is
>> relatively straightforward in practice, and helpers have been added to manage
>> this so that in the case of a NO_HOLES file system we just simply skip this work
>> altogether.
> 
> I'm not an expert on this problem, but AFAIK this is caused by mixing
> buffered and direct IO, right?
> 
> Since that deadly mix is not recommended anyway, can we make things
> simpler by just blocking any buffered IO while the same inode is undergoing
> direct IO?
>

This can happen if you write 100MB and then call sync_file_range on 1MB in the
middle of the file; it's not restricted to O_DIRECT.  Thanks,

Josef
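
The scenario in this reply can be illustrated with a small Python model of the check fsck performs on a !NO_HOLES file system. This is a sketch under stated assumptions, not a reproducer: the function name `find_unrecorded_holes` is hypothetical, extent items are modeled as plain `(start, end)` byte ranges, and the "naive" case assumes the on-disk i_size gets bumped to the end of the flushed range.

```python
MiB = 1 << 20


def find_unrecorded_holes(extent_items, disk_i_size):
    """Return the [start, end) gaps below disk_i_size covered by no item."""
    gaps = []
    cursor = 0
    for start, end in sorted(extent_items):
        if start >= disk_i_size:
            break  # items past i_size do not matter for this check
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < disk_i_size:
        gaps.append((cursor, disk_i_size))
    return gaps


# 100 MiB is dirty in the page cache, but only [50 MiB, 51 MiB) was flushed,
# so that is the only range with an extent item on disk.
flushed = [(50 * MiB, 51 * MiB)]

# Assumption: i_size on disk is updated to the end of the flushed range.
naive_gaps = find_unrecorded_holes(flushed, 51 * MiB)

# If the on-disk i_size is instead held back to the valid prefix (0 here),
# there is no range below it that lacks an item.
held_back_gaps = find_unrecorded_holes(flushed, 0)
```

Under these assumptions the naive update leaves the first 50 MiB below i_size with neither a data extent nor a hole extent item, which is the kind of gap the cover letter describes, and no buffered/direct IO mix is involved.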