
[2/2] ext4: Fix stale data exposure when read races with hole punch

Message ID 20190603132155.20600-3-jack@suse.cz (mailing list archive)
State New, archived
Series fs: Hole punch vs page cache filling races

Commit Message

Jan Kara June 3, 2019, 1:21 p.m. UTC
Hole punching currently evicts pages from the page cache and then goes
on to remove blocks from the inode. This happens with both i_mmap_sem
and i_rwsem held exclusively, which provides appropriate serialization
against racing page faults. However, there is currently nothing that
prevents an ordinary read(2) from racing with the hole punch and
instantiating a page cache page after hole punching has evicted the
page cache but before it has removed blocks from the inode. This page
cache page will then map a soon-to-be-freed block, which can lead to
returning stale data to userspace or even filesystem corruption.

Fix the problem by protecting reads as well as readahead requests with
i_mmap_sem.

CC: stable@vger.kernel.org
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

Comments

Amir Goldstein June 3, 2019, 4:33 p.m. UTC | #1
On Mon, Jun 3, 2019 at 4:22 PM Jan Kara <jack@suse.cz> wrote:
>
> Hole punching currently evicts pages from the page cache and then goes
> on to remove blocks from the inode. This happens with both i_mmap_sem
> and i_rwsem held exclusively, which provides appropriate serialization
> against racing page faults. However, there is currently nothing that
> prevents an ordinary read(2) from racing with the hole punch and
> instantiating a page cache page after hole punching has evicted the
> page cache but before it has removed blocks from the inode. This page
> cache page will then map a soon-to-be-freed block, which can lead to
> returning stale data to userspace or even filesystem corruption.
>
> Fix the problem by protecting reads as well as readahead requests with
> i_mmap_sem.
>

So ->write_iter() does not take i_mmap_sem, right?
And therefore a mixed randrw workload is not expected to regress heavily
because of this change?

Did you test the performance diff?
Here [1] I posted results of an fio test that did 5x worse on xfs vs.
ext4, but I've seen much worse cases.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxhu=Qtme9RJ7uZXYXt0UE+=xD+OC4gQ9EYkDC1ap8Hizg@mail.gmail.com/
Jan Kara June 4, 2019, 7:57 a.m. UTC | #2
On Mon 03-06-19 19:33:50, Amir Goldstein wrote:
> On Mon, Jun 3, 2019 at 4:22 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hole punching currently evicts pages from the page cache and then goes
> > on to remove blocks from the inode. This happens with both i_mmap_sem
> > and i_rwsem held exclusively, which provides appropriate serialization
> > against racing page faults. However, there is currently nothing that
> > prevents an ordinary read(2) from racing with the hole punch and
> > instantiating a page cache page after hole punching has evicted the
> > page cache but before it has removed blocks from the inode. This page
> > cache page will then map a soon-to-be-freed block, which can lead to
> > returning stale data to userspace or even filesystem corruption.
> >
> > Fix the problem by protecting reads as well as readahead requests with
> > i_mmap_sem.
> >
> 
> So ->write_iter() does not take i_mmap_sem, right?
> And therefore a mixed randrw workload is not expected to regress heavily
> because of this change?

Yes. i_mmap_sem is taken in exclusive mode only for truncate, punch hole,
and similar operations that remove blocks from a file. So reads will now be
more serialized against such operations, but not against writes. There may
still be some visible regression because, although readers won't block one
another or writers, they'll still contend on updating the cacheline with
i_mmap_sem, and that's going to be visible for cache-hot readers running
on multiple NUMA nodes.

> Did you test the performance diff?

No, not really. But I'll queue up some tests to see the difference.

> Here [1] I posted results of an fio test that did 5x worse on xfs vs.
> ext4, but I've seen much worse cases.

								Honza
Dave Chinner June 5, 2019, 1:25 a.m. UTC | #3
On Mon, Jun 03, 2019 at 03:21:55PM +0200, Jan Kara wrote:
> Hole punching currently evicts pages from the page cache and then goes
> on to remove blocks from the inode. This happens with both i_mmap_sem
> and i_rwsem held exclusively, which provides appropriate serialization
> against racing page faults. However, there is currently nothing that
> prevents an ordinary read(2) from racing with the hole punch and
> instantiating a page cache page after hole punching has evicted the
> page cache but before it has removed blocks from the inode. This page
> cache page will then map a soon-to-be-freed block, which can lead to
> returning stale data to userspace or even filesystem corruption.
> 
> Fix the problem by protecting reads as well as readahead requests with
> i_mmap_sem.
> 
> CC: stable@vger.kernel.org
> Reported-by: Amir Goldstein <amir73il@gmail.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/file.c | 35 +++++++++++++++++++++++++++++++----
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2c5baa5e8291..a21fa9f8fb5d 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -34,6 +34,17 @@
>  #include "xattr.h"
>  #include "acl.h"
>  
> +static ssize_t ext4_file_buffered_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	ssize_t ret;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	down_read(&EXT4_I(inode)->i_mmap_sem);
> +	ret = generic_file_read_iter(iocb, to);
> +	up_read(&EXT4_I(inode)->i_mmap_sem);
> +	return ret;

Isn't i_mmap_sem taken in the page fault path? What makes it safe
to take it here both outside and inside the mmap_sem at the same time?
I mean, the whole reason for i_mmap_sem existing is that the inode
i_rwsem can't be taken both outside and inside the mmap_sem at the
same time, so what makes the i_mmap_sem different?

Cheers,

Dave.
Jan Kara June 5, 2019, 9:27 a.m. UTC | #4
On Wed 05-06-19 11:25:51, Dave Chinner wrote:
> On Mon, Jun 03, 2019 at 03:21:55PM +0200, Jan Kara wrote:
> > Hole punching currently evicts pages from the page cache and then goes
> > on to remove blocks from the inode. This happens with both i_mmap_sem
> > and i_rwsem held exclusively, which provides appropriate serialization
> > against racing page faults. However, there is currently nothing that
> > prevents an ordinary read(2) from racing with the hole punch and
> > instantiating a page cache page after hole punching has evicted the
> > page cache but before it has removed blocks from the inode. This page
> > cache page will then map a soon-to-be-freed block, which can lead to
> > returning stale data to userspace or even filesystem corruption.
> > 
> > Fix the problem by protecting reads as well as readahead requests with
> > i_mmap_sem.
> > 
> > CC: stable@vger.kernel.org
> > Reported-by: Amir Goldstein <amir73il@gmail.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/file.c | 35 +++++++++++++++++++++++++++++++----
> >  1 file changed, 31 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 2c5baa5e8291..a21fa9f8fb5d 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -34,6 +34,17 @@
> >  #include "xattr.h"
> >  #include "acl.h"
> >  
> > +static ssize_t ext4_file_buffered_read(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +	ssize_t ret;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +	down_read(&EXT4_I(inode)->i_mmap_sem);
> > +	ret = generic_file_read_iter(iocb, to);
> > +	up_read(&EXT4_I(inode)->i_mmap_sem);
> > +	return ret;
> 
> Isn't i_mmap_sem taken in the page fault path? What makes it safe
> to take it here both outside and inside the mmap_sem at the same time?
> I mean, the whole reason for i_mmap_sem existing is that the inode
> i_rwsem can't be taken both outside and inside the mmap_sem at the
> same time, so what makes the i_mmap_sem different?

Drat, you're right that the read path may take a page fault, which will
cause lock inversion with mmap_sem. My xfstests run apparently just didn't
trigger this, as I didn't get any lockdep splat. Thanks for catching this!

								Honza

Patch

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2c5baa5e8291..a21fa9f8fb5d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -34,6 +34,17 @@ 
 #include "xattr.h"
 #include "acl.h"
 
+static ssize_t ext4_file_buffered_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	ret = generic_file_read_iter(iocb, to);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	return ret;
+}
+
 #ifdef CONFIG_FS_DAX
 static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
@@ -52,7 +63,7 @@  static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (!IS_DAX(inode)) {
 		inode_unlock_shared(inode);
 		/* Fallback to buffered IO in case we cannot support DAX */
-		return generic_file_read_iter(iocb, to);
+		return ext4_file_buffered_read(iocb, to);
 	}
 	ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
 	inode_unlock_shared(inode);
@@ -64,17 +75,32 @@  static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
-	if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
 		return -EIO;
 
 	if (!iov_iter_count(to))
 		return 0; /* skip atime */
 
 #ifdef CONFIG_FS_DAX
-	if (IS_DAX(file_inode(iocb->ki_filp)))
+	if (IS_DAX(inode))
 		return ext4_dax_read_iter(iocb, to);
 #endif
-	return generic_file_read_iter(iocb, to);
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return generic_file_read_iter(iocb, to);
+	return ext4_file_buffered_read(iocb, to);
+}
+
+static int ext4_readahead(struct file *filp, loff_t start, loff_t end)
+{
+	struct inode *inode = file_inode(filp);
+	int ret;
+
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	ret = generic_readahead(filp, start, end);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	return ret;
 }
 
 /*
@@ -518,6 +544,7 @@  const struct file_operations ext4_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ext4_fallocate,
+	.readahead	= ext4_readahead,
 };
 
 const struct inode_operations ext4_file_inode_operations = {