diff mbox series

[1/2] ocfs2: Fix data corruption on truncate

Message ID 20211025151332.11301-1-jack@suse.cz (mailing list archive)
State New, archived
Headers show
Series ocfs2: Truncate data corruption fix | expand

Commit Message

Jan Kara Oct. 25, 2021, 3:13 p.m. UTC
ocfs2_truncate_file() used to unmap and invalidate page cache pages before
zeroing the partial tail cluster and setting i_size. Thus some pages could
be left (and likely were left if the cluster zeroing happened) in the
page cache beyond i_size after truncate finished, letting a user possibly
see stale data once the file was extended again. Also the tail cluster
zeroing was not guaranteed to finish before truncate finished, causing
possible stale data exposure. The problem became particularly easy to hit
after commit 6dbf7bb55598 ("fs: Don't invalidate page buffers in
block_write_full_page()") stopped invalidating pages beyond i_size from
the page writeback path.

Fix these problems by unmapping and invalidating pages in the page cache
only after i_size is reduced and the tail cluster is zeroed out.

CC: stable@vger.kernel.org
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/file.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Joseph Qi Oct. 28, 2021, 7:09 a.m. UTC | #1
Hi Jan,

On 10/25/21 11:13 PM, Jan Kara wrote:
> ocfs2_truncate_file() did unmap invalidate page cache pages before
> zeroing partial tail cluster and setting i_size. Thus some pages could
> be left (and likely have left if the cluster zeroing happened) in the
> page cache beyond i_size after truncate finished letting user possibly
> see stale data once the file was extended again. Also the tail cluster

I don't quite understand the case.
truncate_inode_pages() will truncate pages from new_i_size to i_size,
and the following ocfs2_orphan_for_truncate() will zero the range and then
update i_size for the inode as well as the dinode.
So once truncate has finished, how does the stale data exposure happen? Or
do you mean a race between the above two steps?

Thanks,
Joseph 

> zeroing was not guaranteed to finish before truncate finished causing
> possible stale data exposure. The problem started to be particularly
> easy to hit after commit 6dbf7bb55598 "fs: Don't invalidate page buffers
> in block_write_full_page()" stopped invalidation of pages beyond i_size
> from page writeback path.
> 
> Fix these problems by unmapping and invalidating pages in the page cache
> after the i_size is reduced and tail cluster is zeroed out.
Joseph Qi Oct. 28, 2021, 7:44 a.m. UTC | #2
On 10/28/21 3:09 PM, Joseph Qi wrote:
> Hi Jan,
> 
> On 10/25/21 11:13 PM, Jan Kara wrote:
>> ocfs2_truncate_file() did unmap invalidate page cache pages before
>> zeroing partial tail cluster and setting i_size. Thus some pages could
>> be left (and likely have left if the cluster zeroing happened) in the
>> page cache beyond i_size after truncate finished letting user possibly
>> see stale data once the file was extended again. Also the tail cluster
> 
> I don't quite understand the case. 
> truncate_inode_pages() will truncate pages from new_i_size to i_size,
> and the following ocfs2_orphan_for_truncate() will zero range and then
> update i_size for inode as well as dinode.
> So once truncate finished, how stale data exposing happens? Or do you
> mean a race case between the above two steps?
> 
Or do you mean ocfs2_zero_range_for_truncate() will grab and zero EOF
pages? Though it depends on block_write_full_page() to write them out, the
pages are zeroed now. Still a little confused...
Jan Kara Nov. 1, 2021, 11:31 a.m. UTC | #3
On Thu 28-10-21 15:09:08, Joseph Qi wrote:
> Hi Jan,
> 
> On 10/25/21 11:13 PM, Jan Kara wrote:
> > ocfs2_truncate_file() did unmap invalidate page cache pages before
> > zeroing partial tail cluster and setting i_size. Thus some pages could
> > be left (and likely have left if the cluster zeroing happened) in the
> > page cache beyond i_size after truncate finished letting user possibly
> > see stale data once the file was extended again. Also the tail cluster
> 
> I don't quite understand the case. 
> truncate_inode_pages() will truncate pages from new_i_size to i_size,
> and the following ocfs2_orphan_for_truncate() will zero range and then
> update i_size for inode as well as dinode.
> So once truncate finished, how stale data exposing happens? Or do you
> mean a race case between the above two steps?

Sorry, I was not quite accurate in the above paragraph. There are several
ways in which stale pages in the page cache can cause problems.

1) Because i_size is reduced only after truncating the page cache, a page
fault can happen after the page cache is truncated and pages are zeroed but
before i_size is reduced. This allows a user to arbitrarily modify pages
that are used for writing zeroes into the cluster tail, and after the file
is extended again this data will become visible.

2) The tail cluster zeroing in ocfs2_orphan_for_truncate() can actually try
to write zeroed pages above i_size (e.g. if we have 4k blocksize, 64k
clustersize, and do truncate(f, 4k) on a 4k file). This will cause exactly
the same problems as already described in commit 5314454ea3f ("ocfs2: fix
data corruption after conversion from inline format").

Hope it is clearer now.

									Honza
Joseph Qi Nov. 2, 2021, 2:36 a.m. UTC | #4
On 11/1/21 7:31 PM, Jan Kara wrote:
> On Thu 28-10-21 15:09:08, Joseph Qi wrote:
>> Hi Jan,
>>
>> On 10/25/21 11:13 PM, Jan Kara wrote:
>>> ocfs2_truncate_file() did unmap invalidate page cache pages before
>>> zeroing partial tail cluster and setting i_size. Thus some pages could
>>> be left (and likely have left if the cluster zeroing happened) in the
>>> page cache beyond i_size after truncate finished letting user possibly
>>> see stale data once the file was extended again. Also the tail cluster
>>
>> I don't quite understand the case. 
>> truncate_inode_pages() will truncate pages from new_i_size to i_size,
>> and the following ocfs2_orphan_for_truncate() will zero range and then
>> update i_size for inode as well as dinode.
>> So once truncate finished, how stale data exposing happens? Or do you
>> mean a race case between the above two steps?
> 
> Sorry, I was not quite accurate in the above paragraph. There are several
> ways how stale pages in the pagecache can cause problems.
> 
> 1) Because i_size is reduced after truncating page cache, page fault can
> happen after truncating page cache and zeroing pages but before reducing i_size.
> This will in allow user to arbitrarily modify pages that are used for
> writing zeroes into the cluster tail and after file extension these data
> will become visible.
> 
> 2) The tail cluster zeroing in ocfs2_orphan_for_truncate() can actually try
> to write zeroed pages above i_size (e.g. if we have 4k blocksize, 64k
> clustersize, and do truncate(f, 4k) on a 4k file). This will cause exactly
> same problems as already described in commit 5314454ea3f "ocfs2: fix data
> corruption after conversion from inline format".
> 
> Hope it is clearer now.
> 
So the core reason is that ocfs2_zero_range_for_truncate() grabs pages and
then zeroes them, right?
I think an alternative way would be to use zeroout instead of zeroing
pages, which won't grab the pages again.
Anyway, I'm also fine with your way since it is simple.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Jan Kara Nov. 2, 2021, 9:55 a.m. UTC | #5
On Tue 02-11-21 10:36:42, Joseph Qi wrote:
> On 11/1/21 7:31 PM, Jan Kara wrote:
> > On Thu 28-10-21 15:09:08, Joseph Qi wrote:
> >> Hi Jan,
> >>
> >> On 10/25/21 11:13 PM, Jan Kara wrote:
> >>> ocfs2_truncate_file() did unmap invalidate page cache pages before
> >>> zeroing partial tail cluster and setting i_size. Thus some pages could
> >>> be left (and likely have left if the cluster zeroing happened) in the
> >>> page cache beyond i_size after truncate finished letting user possibly
> >>> see stale data once the file was extended again. Also the tail cluster
> >>
> >> I don't quite understand the case. 
> >> truncate_inode_pages() will truncate pages from new_i_size to i_size,
> >> and the following ocfs2_orphan_for_truncate() will zero range and then
> >> update i_size for inode as well as dinode.
> >> So once truncate finished, how stale data exposing happens? Or do you
> >> mean a race case between the above two steps?
> > 
> > Sorry, I was not quite accurate in the above paragraph. There are several
> > ways how stale pages in the pagecache can cause problems.
> > 
> > 1) Because i_size is reduced after truncating page cache, page fault can
> > happen after truncating page cache and zeroing pages but before reducing i_size.
> > This will in allow user to arbitrarily modify pages that are used for
> > writing zeroes into the cluster tail and after file extension these data
> > will become visible.
> > 
> > 2) The tail cluster zeroing in ocfs2_orphan_for_truncate() can actually try
> > to write zeroed pages above i_size (e.g. if we have 4k blocksize, 64k
> > clustersize, and do truncate(f, 4k) on a 4k file). This will cause exactly
> > same problems as already described in commit 5314454ea3f "ocfs2: fix data
> > corruption after conversion from inline format".
> > 
> > Hope it is clearer now.
> > 
> So the core reason is ocfs2_zero_range_for_truncate() grabs pages and
> then zero, right?

Well, that is the part that makes things easy to reproduce.

> I think an alternative way is using zeroout instead of zero pages, which
> won't grab pages again.

That would certainly reduce the likelihood of problems, but it is always
problematic to first truncate the page cache and only then update i_size.
For OCFS2, racing page faults can recreate pages in the page cache before
i_size is reduced and thus cause "interesting" problems.

> Anyway, I'm also fine with your way since it is simple.
> 
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

Thanks!

									Honza
Patch

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 54d7843c0211..fc5f780fa235 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -476,10 +476,11 @@  int ocfs2_truncate_file(struct inode *inode,
 	 * greater than page size, so we have to truncate them
 	 * anyway.
 	 */
-	unmap_mapping_range(inode->i_mapping, new_i_size + PAGE_SIZE - 1, 0, 1);
-	truncate_inode_pages(inode->i_mapping, new_i_size);
 
 	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+		unmap_mapping_range(inode->i_mapping,
+				    new_i_size + PAGE_SIZE - 1, 0, 1);
+		truncate_inode_pages(inode->i_mapping, new_i_size);
 		status = ocfs2_truncate_inline(inode, di_bh, new_i_size,
 					       i_size_read(inode), 1);
 		if (status)
@@ -498,6 +499,9 @@  int ocfs2_truncate_file(struct inode *inode,
 		goto bail_unlock_sem;
 	}
 
+	unmap_mapping_range(inode->i_mapping, new_i_size + PAGE_SIZE - 1, 0, 1);
+	truncate_inode_pages(inode->i_mapping, new_i_size);
+
 	status = ocfs2_commit_truncate(osb, inode, di_bh);
 	if (status < 0) {
 		mlog_errno(status);