
direct-io: only inc/dec inode->i_dio_count for file systems

Message ID 20150415182256.GA30339@kernel.dk (mailing list archive)
State New, archived

Commit Message

Jens Axboe April 15, 2015, 6:22 p.m. UTC
Hi,

This is a reposting of a patch that was originally in the blk-mq series.
It has a huge upside for shared access to a multiqueue device doing
O_DIRECT; it's basically the scaling bottleneck that ends up killing
performance. A quick test here reveals that we spend 30% of all system
time just incrementing and decrementing inode->i_dio_count. For block
devices this isn't useful at all, as we don't need protection against
truncate. For that test case, performance increases about 3.6x (!!) by
getting rid of this inc/dec per IO.

I've cleaned it up a bit since last time, integrating the checks in
inode_dio_done() and adding an inode_dio_begin() so that callers don't
need to know about this.

We've been running a variant of this patch in the FB kernel for a while.
I'd like to finally get this upstream.

Signed-off-by: Jens Axboe <axboe@fb.com>

---

 fs/block_dev.c     |    2 +-
 fs/btrfs/inode.c   |    6 +++---
 fs/dax.c           |    4 ++--
 fs/direct-io.c     |    5 +++--
 fs/ext4/indirect.c |    6 +++---
 fs/ext4/inode.c    |    4 ++--
 fs/inode.c         |   21 +++++++++++++++++++--
 fs/nfs/direct.c    |   10 +++++-----
 include/linux/fs.h |    6 +++++-
 9 files changed, 43 insertions(+), 21 deletions(-)

Comments

Andrew Morton April 15, 2015, 6:56 p.m. UTC | #1
On Wed, 15 Apr 2015 12:22:56 -0600 Jens Axboe <axboe@fb.com> wrote:

> Hi,
> 
> This is a reposting of a patch that was originally in the blk-mq series.
> It has a huge upside for shared access to a multiqueue device doing
> O_DIRECT; it's basically the scaling bottleneck that ends up killing
> performance. A quick test here reveals that we spend 30% of all system
> time just incrementing and decrementing inode->i_dio_count. For block
> devices this isn't useful at all, as we don't need protection against
> truncate. For that test case, performance increases about 3.6x (!!) by
> getting rid of this inc/dec per IO.
> 
> I've cleaned it up a bit since last time, integrating the checks in
> inode_dio_done() and adding an inode_dio_begin() so that callers don't
> need to know about this.
> 
> We've been running a variant of this patch in the FB kernel for a while.
> I'd like to finally get this upstream.

30% overhead for one atomic_inc+atomic_dec+wake_up_bit() per IO?  That
seems very high!  Is there something else going on?



Is there similar impact to direct-io-to-file?  It would be nice to fix
that up also.  Many filesystems do something along the lines of

	atomic_inc(i_dio_count);
	wibble()
	atomic_dec(i_dio_count);
	__blockdev_direct_IO(...);

and with your patch I think we could change them to

	atomic_inc(i_dio_count);
	wibble()
	__blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);
	atomic_dec(i_dio_count);

which would halve the atomic op load.
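
Spelled out a little more concretely, that second pattern could look
something like the sketch below for a filesystem's ->direct_IO (a sketch
only, using the helpers from the patch; foo_get_block and the "wibble"
step are placeholders, and note Dave Chinner's reply further down about
AIO needing the count held until completion):

	static ssize_t foo_direct_IO(int rw, struct kiocb *iocb,
				     struct iov_iter *iter, loff_t offset)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		ssize_t ret;

		inode_dio_begin(inode, 0);	/* caller pins i_dio_count */

		/* wibble(): per-fs setup that needs truncate held off */

		/* tell the dio core not to take its own reference */
		ret = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
					   iter, offset, foo_get_block,
					   NULL, NULL, DIO_IGNORE_TRUNCATE);

		inode_dio_done(inode, 0);	/* drop the caller's reference */
		return ret;
	}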

But that's piling hack on top of hack.  Can we change the
do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
increment i_dio_count"?  ie: exclusion against truncate is wholly the
caller's responsibility.  That way, this awkward sharing of
responsibility between caller and callee gets cleaned up and
DIO_IGNORE_TRUNCATE goes away.



inode_dio_begin() would be a good place to assert that i_mutex is held,
btw.



This whole i_dio_count thing is pretty nasty, really.  If you stand
back and squint, it's basically an rwsem.  I wonder if we can use an
rwsem...



What's the reason for DIO_IGNORE_TRUNCATE rather than boring old
!S_ISBLK?
Jens Axboe April 15, 2015, 7:26 p.m. UTC | #2
On 04/15/2015 12:56 PM, Andrew Morton wrote:
> On Wed, 15 Apr 2015 12:22:56 -0600 Jens Axboe <axboe@fb.com> wrote:
>
>> Hi,
>>
>> This is a reposting of a patch that was originally in the blk-mq series.
>> It has a huge upside for shared access to a multiqueue device doing
>> O_DIRECT; it's basically the scaling bottleneck that ends up killing
>> performance. A quick test here reveals that we spend 30% of all system
>> time just incrementing and decrementing inode->i_dio_count. For block
>> devices this isn't useful at all, as we don't need protection against
>> truncate. For that test case, performance increases about 3.6x (!!) by
>> getting rid of this inc/dec per IO.
>>
>> I've cleaned it up a bit since last time, integrating the checks in
>> inode_dio_done() and adding an inode_dio_begin() so that callers don't
>> need to know about this.
>>
>> We've been running a variant of this patch in the FB kernel for a while.
>> I'd like to finally get this upstream.
>
> 30% overhead for one atomic_inc+atomic_dec+wake_up_bit() per IO?  That
> seems very high!  Is there something else going on?

Nope, that is it. But for this simple test case, that is essentially the 
only shared state that exists and is dirtied between all CPUs that are 
running an application that does O_DIRECT to the device. The test case 
ran at ~9.6M IOPS after the patch, and at ~2.5M IOPS before. On a simple 
2 socket box, no less, nothing fancy there.

And it's _just_ the atomic inc/dec. I ran with the exact patch posted, 
and that still does (needlessly, I think) the wake_up_bit() for 
inode_dio_done().
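
A minimal tweak to the patched helper would skip the wakeup entirely for
that case - a sketch only, not what the posted patch does:

	void inode_dio_done(struct inode *inode, unsigned int flags)
	{
		if (flags & DIO_IGNORE_TRUNCATE)
			return;		/* no reference was taken, nobody to wake */
		if (atomic_dec_and_test(&inode->i_dio_count))
			wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
	}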

> Is there similar impact to direct-io-to-file?  It would be nice to fix
> that up also.  Many filesystems do something along the lines of
>
> 	atomic_inc(i_dio_count);
> 	wibble()
> 	atomic_dec(i_dio_count);
> 	__blockdev_direct_IO(...);
>
> and with your patch I think we could change them to
>
> 	atomic_inc(i_dio_count);
> 	wibble()
> 	__blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);
> 	atomic_dec(i_dio_count);
>
> which would halve the atomic op load.

I haven't checked pure file, but without extending, I suspect that we 
should see similar benefits there. In any case, it'd make sense to do; 
having twice the atomic inc/dec is just a bad idea in general, if we can 
get rid of it.

A quick grep doesn't show this use case, or I'm just blind. Where do you 
see that?

> But that's piling hack on top of hack.  Can we change the

I'd more view it as reducing the hack, the real hack is the way that we 
manually do atomic_inc() on i_dio_count, and then call a magic 
inode_dio_done() that decrements it again. It's not very pretty, I'm 
just reducing the scope of the hack :-)

> do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
> increment i_dio_count"?  ie: exclusion against truncate is wholly the
> caller's responsibility.  That way, this awkward sharing of
> responsibility between caller and callee gets cleaned up and
> DIO_IGNORE_TRUNCATE goes away.

That would clean it up, at the expense of a lot more churn. I'd be 
fine with doing it that way.

> inode_dio_begin() would be a good place to assert that i_mutex is held,
> btw.

How would you check it? If the i_dio_count is bumped, then you'd not 
need to hold i_mutex.

> This whole i_dio_count thing is pretty nasty, really.  If you stand
> back and squint, it's basically an rwsem.  I wonder if we can use an
> rwsem...

We could, but that's a bit orthogonal to the patch, since we'd still want 
the same avoidance for blkdevs.

> What's the reason for DIO_IGNORE_TRUNCATE rather than boring old
> !S_ISBLK?

Didn't want to tie it to block devices, but rather just keep the "need 
lock or not" as a separate flag in case there were more use cases.
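
For comparison, keying off the inode type instead would look roughly like
the sketch below - it covers the bdev case, but leaves no room for other
callers that can rule out truncate by some other means:

	/* hypothetical alternative: no flag, just special-case block devices */
	void inode_dio_begin(struct inode *inode)
	{
		if (!S_ISBLK(inode->i_mode))
			atomic_inc(&inode->i_dio_count);
	}
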
Andrew Morton April 15, 2015, 7:46 p.m. UTC | #3
On Wed, 15 Apr 2015 13:26:51 -0600 Jens Axboe <axboe@fb.com> wrote:

> > Is there similar impact to direct-io-to-file?  It would be nice to fix
> > that up also.  Many filesystems do something along the lines of
> >
> > 	atomic_inc(i_dio_count);
> > 	wibble()
> > 	atomic_dec(i_dio_count);
> > 	__blockdev_direct_IO(...);
> >
> > and with your patch I think we could change them to
> >
> > 	atomic_inc(i_dio_count);
> > 	wibble()
> > 	__blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);
> > 	atomic_dec(i_dio_count);
> >
> > which would halve the atomic op load.
> 
> I haven't checked pure file, but without extending, I suspect that we 
> should see similar benefits there. In any case, it'd make sense to do; 
> having twice the atomic inc/dec is just a bad idea in general, if we can 
> get rid of it.
> 
> A quick grep doesn't show this use case, or I'm just blind. Where do you 
> see that?

btrfs_direct_IO() holds i_dio_count across its call to
__blockdev_direct_IO() for writes.  That makes the i_dio_count
manipulation in do_blockdev_direct_IO() unneeded?  ext4 is similar.

Reducing from 4 ops to 2 probably won't make as much difference as
reducing from 2 to 0 - most of the cost comes from initially grabbing that
cacheline from a different CPU.

> > But that's piling hack on top of hack.  Can we change the
> 
> I'd more view it as reducing the hack, the real hack is the way that we 
> manually do atomic_inc() on i_dio_count, and then call a magic 
> inode_dio_done() that decrements it again. It's not very pretty, I'm 
> just reducing the scope of the hack :-)

A magic flag which says "you don't need to do this in that case because
I know this inode is special".  direct-io already has too much of this :(

> > do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
> > increment i_dio_count"?  ie: exclusion against truncate is wholly the
> > caller's responsibility.  That way, this awkward sharing of
> > responsibility between caller and callee gets cleaned up and
> > DIO_IGNORE_TRUNCATE goes away.
> 
> That would clean it up, at the expense of a lot more churn. I'd be 
> fine with doing it that way.

OK, could be done later I suppose.

> > inode_dio_begin() would be a good place to assert that i_mutex is held,
> > btw.
> 
> How would you check it? If the i_dio_count is bumped, then you'd not 
> need to hold i_mutex.

	if (atomic_add_return() == 1)
		assert()

I guess.  It was just a thought.  Having wandered around the code, I'm
not 100% confident that everyone is holding i_mutex - it's not all
obviously correct.

otoh, the caller doesn't *have* to choose i_mutex for the external
exclusion, and perhaps some callers have used something else.
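
Something along the lines of the sketch below, using the helper from the
patch - a best-effort check only, for the reasons above:

	void inode_dio_begin(struct inode *inode, unsigned int flags)
	{
		if (flags & DIO_IGNORE_TRUNCATE)
			return;
		/* first reference should be taken under external exclusion */
		if (atomic_add_return(1, &inode->i_dio_count) == 1)
			WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));
	}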

Jens Axboe April 15, 2015, 8:08 p.m. UTC | #4
On 04/15/2015 01:46 PM, Andrew Morton wrote:
> On Wed, 15 Apr 2015 13:26:51 -0600 Jens Axboe <axboe@fb.com> wrote:
>
>>> Is there similar impact to direct-io-to-file?  It would be nice to fix
>>> that up also.  Many filesystems do something along the lines of
>>>
>>> 	atomic_inc(i_dio_count);
>>> 	wibble()
>>> 	atomic_dec(i_dio_count);
>>> 	__blockdev_direct_IO(...);
>>>
>>> and with your patch I think we could change them to
>>>
>>> 	atomic_inc(i_dio_count);
>>> 	wibble()
>>> 	__blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);
>>> 	atomic_dec(i_dio_count);
>>>
>>> which would halve the atomic op load.
>>
>> I haven't checked pure file, but without extending, I suspect that we
>> should see similar benefits there. In any case, it'd make sense to do;
>> having twice the atomic inc/dec is just a bad idea in general, if we can
>> get rid of it.
>>
>> A quick grep doesn't show this use case, or I'm just blind. Where do you
>> see that?
>
> btrfs_direct_IO() holds i_dio_count across its call to
> __blockdev_direct_IO() for writes.  That makes the i_dio_count
> manipulation in do_blockdev_direct_IO() unneeded?  ext4 is similar.

You are right, I even modified those... Yes, that looks like it's 
unnecessary. Chris?

> Reducing from 4 ops to 2 probably won't make as much difference as
> reducing from 2 to 0 - most of the cost comes from initially grabbing that
> cacheline from a different CPU.

It won't be a 50% reduction, but it all really depends a lot on timing. 
If you have enough CPUs banging on it, it could be close to a 50% reduction.


>>> But that's piling hack on top of hack.  Can we change the
>>
>> I'd more view it as reducing the hack, the real hack is the way that we
>> manually do atomic_inc() on i_dio_count, and then call a magic
>> inode_dio_done() that decrements it again. It's not very pretty, I'm
>> just reducing the scope of the hack :-)
>
> A magic flag which says "you don't need to do this in that case because
> I know this inode is special".  direct-io already has too much of this :(

Well, outside of rewriting the dio code, that's what we get. The flags 
already exist, and I don't disagree with you that the situation is 
generally a mess.

>>> do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
>>> increment i_dio_count"?  ie: exclusion against truncate is wholly the
>>> caller's responsibility.  That way, this awkward sharing of
>>> responsibility between caller and callee gets cleaned up and
>>> DIO_IGNORE_TRUNCATE goes away.
>>
>> That would clean it up, at the expense of a lot more churn. I'd be
>> fine with doing it that way.
>
> OK, could be done later I suppose.

It could easily be layered on top; it doesn't have to be part of the 
initial patch. Once the flag is there, callers can do what they need and 
add the flag.

>>> inode_dio_begin() would be a good place to assert that i_mutex is held,
>>> btw.
>>
>> How would you check it? If the i_dio_count is bumped, then you'd not
>> need to hold i_mutex.
>
> 	if (atomic_add_return() == 1)
> 		assert()

That's not a 100% catch, but probably good enough.

> I guess.  It was just a thought.  Having wandered around the code, I'm
> not 100% confident that everyone is holding i_mutex - it's not all
> obviously correct.

Personally I'd just prefer not having to dive deeper into this for the 
benefit of this patch.

> otoh, the caller doesn't *have* to choose i_mutex for the external
> exclusion, and perhaps some callers have used something else.
>
Dave Chinner April 15, 2015, 10:25 p.m. UTC | #5
On Wed, Apr 15, 2015 at 11:56:53AM -0700, Andrew Morton wrote:
> On Wed, 15 Apr 2015 12:22:56 -0600 Jens Axboe <axboe@fb.com> wrote:
> 
> > Hi,
> > 
> > This is a reposting of a patch that was originally in the blk-mq series.
> > It has a huge upside for shared access to a multiqueue device doing
> > O_DIRECT; it's basically the scaling bottleneck that ends up killing
> > performance. A quick test here reveals that we spend 30% of all system
> > time just incrementing and decrementing inode->i_dio_count. For block
> > devices this isn't useful at all, as we don't need protection against
> > truncate. For that test case, performance increases about 3.6x (!!) by
> > getting rid of this inc/dec per IO.
> > 
> > I've cleaned it up a bit since last time, integrating the checks in
> > inode_dio_done() and adding an inode_dio_begin() so that callers don't
> > need to know about this.
> > 
> > We've been running a variant of this patch in the FB kernel for a while.
> > I'd like to finally get this upstream.
....
> Is there similar impact to direct-io-to-file?  It would be nice to fix
> that up also.  Many filesystems do something along the lines of
> 
> 	atomic_inc(i_dio_count);
> 	wibble()
> 	atomic_dec(i_dio_count);
> 	__blockdev_direct_IO(...);
> 
> and with your patch I think we could change them to
> 
> 	atomic_inc(i_dio_count);
> 	wibble()
> 	__blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);
> 	atomic_dec(i_dio_count);

Can't do it quite that way.

AIO requires the i_dio_count to be held until IO completion for all
outstanding IOs.  i.e. the increment needs to be in the submission
path, the decrement needs to be in the dio_complete() path,
otherwise we have AIO DIO in progress without a reference count we
can wait on in truncate.

Yes, we might be able to pull it up to the filesystem level now that
dio_complete() is only ever called once per __blockdev_direct_IO()
call, so that may be a solution we can use via filesystem ->end_io
callbacks provided to __blockdev_direct_IO.
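
Roughly, the reference has to follow the I/O rather than the submitter -
a sketch with hypothetical bar_* names, ignoring the sync-completion and
error paths:

	static void bar_dio_end_io(struct kiocb *iocb, loff_t offset,
				   ssize_t bytes, void *private)
	{
		/* dropped only once dio_complete() says the DIO is done */
		inode_dio_done(file_inode(iocb->ki_filp), 0);
	}

	static ssize_t bar_direct_IO(int rw, struct kiocb *iocb,
				     struct iov_iter *iter, loff_t offset)
	{
		struct inode *inode = file_inode(iocb->ki_filp);

		/* taken in the submission path, before any bio is issued */
		inode_dio_begin(inode, 0);
		return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
					    iter, offset, bar_get_block,
					    bar_dio_end_io, NULL,
					    DIO_IGNORE_TRUNCATE);
	}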

> which would halve the atomic op load.

XFS doesn't touch i_dio_count, so it would make no difference to it
at all, which is important, given the DIO rates I can drive through
a single file on XFS - it becomes rwsem cacheline bound on the
shared IO lock at about 2 million IOPS (random 4k read) to a single
file.

FWIW, keep in mind that this i_dio_count originally came from XFS in
the first place, and was pushed into the DIO layer to solve all the
DIO vs extent manipulation problems other filesystems were having...

> But that's piling hack on top of hack.  Can we change the
> do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
> increment i_dio_count"?  ie: exclusion against truncate is wholly the
> caller's responsibility.  That way, this awkward sharing of
> responsibility between caller and callee gets cleaned up and
> DIO_IGNORE_TRUNCATE goes away.
> 
> inode_dio_begin() would be a good place to assert that i_mutex is held,
> btw.

Can't do that, either, as filesystems like XFS don't hold the
i_mutex during direct IO submission.

> This whole i_dio_count thing is pretty nasty, really.  If you stand
> back and squint, it's basically an rwsem.  I wonder if we can use an
> rwsem...

That doesn't avoid the atomic operations that limit performance.

Cheers,

Dave.

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266be67d3..9b290121301a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -155,7 +155,7 @@  blkdev_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
 
 	return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iter,
 				    offset, blkdev_get_block,
-				    NULL, NULL, 0);
+				    NULL, NULL, DIO_IGNORE_TRUNCATE);
 }
 
 int __sync_blockdev(struct block_device *bdev, int wait)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d2e732d7af52..6c66719179ba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8129,7 +8129,7 @@  static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
 	if (check_direct_IO(BTRFS_I(inode)->root, rw, iocb, iter, offset))
 		return 0;
 
-	atomic_inc(&inode->i_dio_count);
+	inode_dio_begin(inode, flags);
 	smp_mb__after_atomic();
 
 	/*
@@ -8169,7 +8169,7 @@  static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
 		current->journal_info = &outstanding_extents;
 	} else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
 				     &BTRFS_I(inode)->runtime_flags)) {
-		inode_dio_done(inode);
+		inode_dio_done(inode, flags);
 		flags = DIO_LOCKING | DIO_SKIP_HOLES;
 		wakeup = false;
 	}
@@ -8188,7 +8188,7 @@  static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
 	}
 out:
 	if (wakeup)
-		inode_dio_done(inode);
+		inode_dio_done(inode, flags);
 	if (relock)
 		mutex_lock(&inode->i_mutex);
 
diff --git a/fs/dax.c b/fs/dax.c
index ed1619ec6537..0f55ee4ed2fe 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -210,7 +210,7 @@  ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	}
 
 	/* Protects against truncate */
-	atomic_inc(&inode->i_dio_count);
+	inode_dio_begin(inode, flags);
 
 	retval = dax_io(rw, inode, iter, pos, end, get_block, &bh);
 
@@ -220,7 +220,7 @@  ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	if ((retval > 0) && end_io)
 		end_io(iocb, pos, retval, bh.b_private);
 
-	inode_dio_done(inode);
+	inode_dio_done(inode, flags);
  out:
 	return retval;
 }
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b2e297..5f59709a8b80 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -254,7 +254,8 @@  static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
 	if (dio->end_io && dio->result)
 		dio->end_io(dio->iocb, offset, transferred, dio->private);
 
-	inode_dio_done(dio->inode);
+	inode_dio_done(dio->inode, dio->flags);
+
 	if (is_async) {
 		if (dio->rw & WRITE) {
 			int err;
@@ -1199,7 +1200,7 @@  do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	/*
 	 * Will be decremented at I/O completion time.
 	 */
-	atomic_inc(&inode->i_dio_count);
+	inode_dio_begin(inode, dio->flags);
 
 	retval = 0;
 	sdio.blkbits = blkbits;
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 45fe924f82bc..853a8674b137 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -682,11 +682,11 @@  retry:
 		 * via ext4_inode_block_unlocked_dio(). Check inode's state
 		 * while holding extra i_dio_count ref.
 		 */
-		atomic_inc(&inode->i_dio_count);
+		inode_dio_begin(inode, 0);
 		smp_mb();
 		if (unlikely(ext4_test_inode_state(inode,
 						    EXT4_STATE_DIOREAD_LOCK))) {
-			inode_dio_done(inode);
+			inode_dio_done(inode, 0);
 			goto locked;
 		}
 		if (IS_DAX(inode))
@@ -696,7 +696,7 @@  retry:
 			ret = __blockdev_direct_IO(rw, iocb, inode,
 					inode->i_sb->s_bdev, iter, offset,
 					ext4_get_block, NULL, NULL, 0);
-		inode_dio_done(inode);
+		inode_dio_done(inode, 0);
 	} else {
 locked:
 		if (IS_DAX(inode))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a212b86f..5310e31a4327 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2978,7 +2978,7 @@  static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 	 * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
 	 */
 	if (rw == WRITE)
-		atomic_inc(&inode->i_dio_count);
+		inode_dio_begin(inode, dio_flags);
 
 	/* If we do a overwrite dio, i_mutex locking can be released */
 	overwrite = *((int *)iocb->private);
@@ -3080,7 +3080,7 @@  static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 
 retake_lock:
 	if (rw == WRITE)
-		inode_dio_done(inode);
+		inode_dio_done(inode, dio_flags);
 	/* take i_mutex locking again if we do a ovewrite dio */
 	if (overwrite) {
 		up_read(&EXT4_I(inode)->i_data_sem);
diff --git a/fs/inode.c b/fs/inode.c
index f00b16f45507..7a544bea1566 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1946,15 +1946,32 @@  void inode_dio_wait(struct inode *inode)
 EXPORT_SYMBOL(inode_dio_wait);
 
 /*
+ * inode_dio_begin - signal start of a direct I/O request
+ * @inode: inode the direct I/O happens on
+ * @flags: DIO_* flags
+ *
+ * This is called at the start of a direct I/O request, and is used to
+ * account in-flight direct I/O so that inode_dio_wait() can wait for it.
+ */
+void inode_dio_begin(struct inode *inode, unsigned int flags)
+{
+	if (!(flags & DIO_IGNORE_TRUNCATE))
+		atomic_inc(&inode->i_dio_count);
+}
+EXPORT_SYMBOL(inode_dio_begin);
+
+/*
  * inode_dio_done - signal finish of a direct I/O requests
  * @inode: inode the direct I/O happens on
+ * @flags: DIO_* flags
  *
  * This is called once we've finished processing a direct I/O request,
  * and is used to wake up callers waiting for direct I/O to be quiesced.
  */
-void inode_dio_done(struct inode *inode)
+void inode_dio_done(struct inode *inode, unsigned int flags)
 {
-	if (atomic_dec_and_test(&inode->i_dio_count))
+	if ((flags & DIO_IGNORE_TRUNCATE) ||
+	    atomic_dec_and_test(&inode->i_dio_count))
 		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
 }
 EXPORT_SYMBOL(inode_dio_done);
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index e907c8cf732e..a5d764431f11 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -387,7 +387,7 @@  static void nfs_direct_complete(struct nfs_direct_req *dreq, bool write)
 	if (write)
 		nfs_zap_mapping(inode, inode->i_mapping);
 
-	inode_dio_done(inode);
+	inode_dio_done(inode, 0);
 
 	if (dreq->iocb) {
 		long res = (long) dreq->error;
@@ -487,7 +487,7 @@  static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			     &nfs_direct_read_completion_ops);
 	get_dreq(dreq);
 	desc.pg_dreq = dreq;
-	atomic_inc(&inode->i_dio_count);
+	inode_dio_begin(inode, 0);
 
 	while (iov_iter_count(iter)) {
 		struct page **pagevec;
@@ -539,7 +539,7 @@  static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 	 * generic layer handle the completion.
 	 */
 	if (requested_bytes == 0) {
-		inode_dio_done(inode);
+		inode_dio_done(inode, 0);
 		nfs_direct_req_release(dreq);
 		return result < 0 ? result : -EIO;
 	}
@@ -873,7 +873,7 @@  static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			      &nfs_direct_write_completion_ops);
 	desc.pg_dreq = dreq;
 	get_dreq(dreq);
-	atomic_inc(&inode->i_dio_count);
+	inode_dio_begin(inode, 0);
 
 	NFS_I(inode)->write_io += iov_iter_count(iter);
 	while (iov_iter_count(iter)) {
@@ -929,7 +929,7 @@  static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 	 * generic layer handle the completion.
 	 */
 	if (requested_bytes == 0) {
-		inode_dio_done(inode);
+		inode_dio_done(inode, 0);
 		nfs_direct_req_release(dreq);
 		return result < 0 ? result : -EIO;
 	}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 52cc4492cb3a..43aab9f588fa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2613,6 +2613,9 @@  enum {
 
 	/* filesystem can handle aio writes beyond i_size */
 	DIO_ASYNC_EXTEND = 0x04,
+
+	/* inode/fs/bdev does not need truncate protection */
+	DIO_IGNORE_TRUNCATE = 0x08,
 };
 
 void dio_end_io(struct bio *bio, int error);
@@ -2633,7 +2636,8 @@  static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
 #endif
 
 void inode_dio_wait(struct inode *inode);
-void inode_dio_done(struct inode *inode);
+void inode_dio_begin(struct inode *inode, unsigned int flags);
+void inode_dio_done(struct inode *inode, unsigned int flags);
 
 extern void inode_set_flags(struct inode *inode, unsigned int flags,
 			    unsigned int mask);