
[v3] fs: Fix page cache inconsistency when mixing buffered and AIO DIO

Message ID 1500380368-31661-1-git-send-email-lczerner@redhat.com (mailing list archive)
State New, archived

Commit Message

Lukas Czerner July 18, 2017, 12:19 p.m. UTC
Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This persists
until the affected pages are flushed away. Despite the fact that mixing
buffered and direct IO is ill-advised, it does pose a threat to data
integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context whenever there are mapped pages to be found in the
inode. Then, before completion, have dio_complete() invalidate the
pages in question. This ensures that after completion the pages in the
written area are either unmapped, or populated with up-to-date data.
Also do the same for the iomap case, which uses iomap_dio_complete()
instead.

This has the side effect of deferring completion to a process context
for every AIO DIO that happens on an inode that has pages mapped.
However, since the consensus is that this is an ill-advised practice,
the performance implications should not be a problem.

This was based on a proposal from Jeff Moyer, thanks!

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
---
v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
v3: Do not invalidate in case of error. Add some comments

 fs/direct-io.c | 37 ++++++++++++++++++++++++++++++++-----
 fs/iomap.c     |  8 ++++++++
 2 files changed, 40 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig July 18, 2017, 1:44 p.m. UTC | #1
> +	if ((ret > 0) &&

No need for the braces here.

> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq)
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);

So now we initialize the workqueue on the first aio write.  Maybe we
should just always initialize it?  Especially given that the cost of
a workqueue is rather cheap.  I also don't really understand why
we even need the workqueue per-superblock instead of global.

> index 1732228..2f8dbf9 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,16 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	loff_t offset = iocb->ki_pos;

If you introduce this variable please also use it later in the function
instead of iocb->ki_pos.  OR remove the variable, which would be fine
with me as well.

> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	if ((!dio->error) &&

no need for the inner braces.
Jan Kara July 18, 2017, 2:17 p.m. UTC | #2
On Tue 18-07-17 06:44:20, Christoph Hellwig wrote:
> > +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> > +		retval = 0;
> > +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> > +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> > +			retval = dio_set_defer_completion(dio);
> > +		else if (!dio->inode->i_sb->s_dio_done_wq)
> > +			/*
> > +			 * In case of AIO write racing with buffered read we
> > +			 * need to defer completion. We can't decide this now,
> > +			 * however the workqueue needs to be initialized here.
> > +			 */
> > +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> 
> So now we initialize the workqueue on the first aio write.  Maybe we
> should just always initialize it?  Especially given that the cost of
> a workqueue is rather cheap.  I also don't really understand why
> we even need the workqueue per-superblock instead of global.

So the workqueue is WQ_MEM_RECLAIM which means there will be always
"rescue" worker running. Not that it would make workqueue too expensive but
it is not zero cost either. So saving the cost for filesystems that don't
support AIO DIO makes sense to me.
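(For context, the per-superblock workqueue is created lazily with
WQ_MEM_RECLAIM; the helper in fs/direct-io.c looked roughly like this at
the time — quoted from memory, details may differ:)

```c
/* fs/direct-io.c (approximate): lazily create the per-sb completion wq */
int sb_init_dio_done_wq(struct super_block *sb)
{
	struct workqueue_struct *old;
	struct workqueue_struct *wq = alloc_workqueue("dio/%s",
						      WQ_MEM_RECLAIM, 0,
						      sb->s_id);
	if (!wq)
		return -ENOMEM;
	/* more than one DIO may race to create the workqueue */
	old = cmpxchg(&sb->s_dio_done_wq, NULL, wq);
	if (old)
		destroy_workqueue(wq);	/* someone else won the race */
	return 0;
}
```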

Regarding creating global workqueue - it would create IO completion
dependencies between filesystems which could have unwanted side effects and
possibly create deadlocks. The default parallelism of workqueues would
mostly hide this but I'm not sure there won't be some corner case e.g. when
memory is tight...

								Honza
Lukas Czerner July 19, 2017, 8:42 a.m. UTC | #3
On Tue, Jul 18, 2017 at 06:44:20AM -0700, Christoph Hellwig wrote:
> ->inode->i_sb);
> 
> So now we initialize the workqueue on the first aio write.  Maybe we
> should just always initialize it?  Especially given that the cost of
> a workqueue is rather cheap.  I also don't really understand why
> we even need the workqueue per-superblock instead of global.

As Jan mentioned, and I agree, it's worth initializing it only when
actually needed.

> 
> > index 1732228..2f8dbf9 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -713,8 +713,16 @@ struct iomap_dio {
> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >  {
> >  	struct kiocb *iocb = dio->iocb;
> > +	loff_t offset = iocb->ki_pos;
> 
> If you introduce this variable please also use it later in the function
> instead of iocb->ki_pos.  OR remove the variable, which would be fine
> with me as well.

Right, I did not use it later in the function because it would be
confusing (we're changing iocb->ki_pos). So I'll just remove the
variable.

> 
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> >  	ssize_t ret;
> >  
> > +	if ((!dio->error) &&
> 
> no need for the inner braces.

ok


Thanks!
-Lukas

Patch

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 08cf278..efd3246 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -258,6 +258,12 @@  static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 	if (ret == 0)
 		ret = transferred;
 
+	if ((ret > 0) &&
+	    (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))
+		invalidate_inode_pages2_range(dio->inode->i_mapping,
+					offset >> PAGE_SHIFT,
+					(offset + ret - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		int err;
 
@@ -304,6 +310,7 @@  static void dio_bio_end_aio(struct bio *bio)
 	struct dio *dio = bio->bi_private;
 	unsigned long remaining;
 	unsigned long flags;
+	bool defer_completion = false;
 
 	/* cleanup the bio */
 	dio_bio_complete(dio, bio);
@@ -315,7 +322,19 @@  static void dio_bio_end_aio(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		if (dio->result && dio->defer_completion) {
+		/*
+		 * Defer completion when defer_completion is set or
+		 * when the inode has pages mapped and this is AIO write.
+		 * We need to invalidate those pages because there is a
+		 * chance they contain stale data in the case buffered IO
+		 * went in between AIO submission and completion into the
+		 * same region.
+		 */
+		if (dio->result)
+			defer_completion = dio->defer_completion ||
+					   (dio->op == REQ_OP_WRITE &&
+					    dio->inode->i_mapping->nrpages);
+		if (defer_completion) {
 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 			queue_work(dio->inode->i_sb->s_dio_done_wq,
 				   &dio->complete_work);
@@ -1210,10 +1229,18 @@  do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
-	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
-	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
-		retval = dio_set_defer_completion(dio);
+	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+		retval = 0;
+		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
+		    IS_SYNC(iocb->ki_filp->f_mapping->host))
+			retval = dio_set_defer_completion(dio);
+		else if (!dio->inode->i_sb->s_dio_done_wq)
+			/*
+			 * In case of AIO write racing with buffered read we
+			 * need to defer completion. We can't decide this now,
+			 * however the workqueue needs to be initialized here.
+			 */
+			retval = sb_init_dio_done_wq(dio->inode->i_sb);
 		if (retval) {
 			/*
 			 * We grab i_mutex only for reads so we don't have
diff --git a/fs/iomap.c b/fs/iomap.c
index 1732228..2f8dbf9 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -713,8 +713,16 @@  struct iomap_dio {
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
+	loff_t offset = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if ((!dio->error) &&
+	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
+		invalidate_inode_pages2_range(inode->i_mapping,
+				offset >> PAGE_SHIFT,
+				(offset + dio->size - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		ret = dio->end_io(iocb,
 				dio->error ? dio->error : dio->size,