[0/5] ext4: direct IO via iomap infrastructure

Message ID	cover.1565609891.git.mbobrowski@mbobrowski.org (mailing list archive)
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Mon, 12 Aug 2019 22:52:19 +1000 From: Matthew Bobrowski <mbobrowski@mbobrowski.org> To: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, jack@suse.cz, tytso@mit.edu, riteshh@linux.ibm.com Subject: [PATCH 0/5] ext4: direct IO via iomap infrastructure Message-ID: <cover.1565609891.git.mbobrowski@mbobrowski.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	ext4: direct IO via iomap infrastructure \| expand [0/5] ext4: direct IO via iomap infrastructure [1/5] ext4: introduce direct IO read code path using iomap infrastructure [2/5] ext4: move inode extension/truncate code out from ext4_iomap_end() [3/5] iomap: modify ->end_io() calling convention [4/5] ext4: introduce direct IO write code path using iomap infrastructure [5/5] ext4: clean up redundant buffer_head direct IO code

Matthew Bobrowski Aug. 12, 2019, 12:52 p.m. UTC

This patch series converts the ext4 direct IO code paths to make use of the
iomap infrastructure and removes the old buffer_head direct-io based
implementation. The result is that ext4 is converted to the newer framework
and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO.

These changes have been tested using xfstests in both DAX and non-DAX modes
using various configurations i.e. 4k, dioread_nolock, dax.

Matthew Bobrowski (5):
  ext4: introduce direct IO read code path using iomap infrastructure
  ext4: move inode extension/truncate code out from ext4_iomap_end()
  iomap: modify ->end_io() calling convention
  ext4: introduce direct IO write code path using iomap infrastructure
  ext4: clean up redundant buffer_head direct IO code

 fs/ext4/ext4.h        |   3 -
 fs/ext4/extents.c     |   8 +-
 fs/ext4/file.c        | 329 +++++++++++++++++++++++++++-------
 fs/ext4/inode.c       | 488 +++++---------------------------------------------
 fs/iomap/direct-io.c  |   9 +-
 fs/xfs/xfs_file.c     |  17 +-
 include/linux/iomap.h |   4 +-
 7 files changed, 322 insertions(+), 536 deletions(-)

Ritesh Harjani Aug. 12, 2019, 5:31 p.m. UTC | #1

Hello Matthew,

On 8/12/19 6:22 PM, Matthew Bobrowski wrote:

> This patch series converts the ext4 direct IO code paths to make use of the
> iomap infrastructure and removes the old buffer_head direct-io based
> implementation. The result is that ext4 is converted to the newer framework
> and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO.
>
> These changes have been tested using xfstests in both DAX and non-DAX modes
> using various configurations i.e. 4k, dioread_nolock, dax.

I had some minor review comments posted on Patch-4.
But the rest of the patch series looks good to me.
I will also do some basic testing of xfstests which I did for my patches 
and will revert back.

One query, could you please help answering below for my understanding :-

I was under the assumption that we need to maintain 
ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or 
atomic_read(&EXT4_I(inode)->i_unwritten))
in case of non-AIO directIO or AIO directIO case as well (when we may 
allocate unwritten extents),
to protect with some kind of race with other parts of code(maybe 
truncate/bufferedIO/fallocate not sure?) which may call for 
ext4_can_extents_be_merged()
to check if extents can be merged or not.

Is it not the case?
Now that directIO code has no way of specifying that this inode has 
unwritten extent, will it not race with any other path, where this info 
was necessary (like
in above func ext4_can_extents_be_merged())?

Thanks
Ritesh

>
> Matthew Bobrowski (5):
>    ext4: introduce direct IO read code path using iomap infrastructure
>    ext4: move inode extension/truncate code out from ext4_iomap_end()
>    iomap: modify ->end_io() calling convention
>    ext4: introduce direct IO write code path using iomap infrastructure
>    ext4: clean up redundant buffer_head direct IO code
>
>   fs/ext4/ext4.h        |   3 -
>   fs/ext4/extents.c     |   8 +-
>   fs/ext4/file.c        | 329 +++++++++++++++++++++++++++-------
>   fs/ext4/inode.c       | 488 +++++---------------------------------------------
>   fs/iomap/direct-io.c  |   9 +-
>   fs/xfs/xfs_file.c     |  17 +-
>   include/linux/iomap.h |   4 +-
>   7 files changed, 322 insertions(+), 536 deletions(-)
>

Matthew Bobrowski Aug. 13, 2019, 11:10 a.m. UTC | #2

On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote:
> > This patch series converts the ext4 direct IO code paths to make use of the
> > iomap infrastructure and removes the old buffer_head direct-io based
> > implementation. The result is that ext4 is converted to the newer framework
> > and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO.
> > 
> > These changes have been tested using xfstests in both DAX and non-DAX modes
> > using various configurations i.e. 4k, dioread_nolock, dax.
> 
> I had some minor review comments posted on Patch-4.
> But the rest of the patch series looks good to me.

Thanks for the review, much appreciated! Also, apologies about any
delayed response to your queries, I predominantly do all this work in
my personal time.

> I will also do some basic testing of xfstests which I did for my patches and
> will revert back.

Sounds good!

> One query, could you please help answering below for my understanding :-
> 
> I was under the assumption that we need to maintain
> ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or
> atomic_read(&EXT4_I(inode)->i_unwritten))
> in case of non-AIO directIO or AIO directIO case as well (when we may
> allocate unwritten extents),
> to protect with some kind of race with other parts of code(maybe
> truncate/bufferedIO/fallocate not sure?) which may call for
> ext4_can_extents_be_merged()
> to check if extents can be merged or not.
> 
> Is it not the case?
> Now that directIO code has no way of specifying that this inode has
> unwritten extent, will it not race with any other path, where this info was
> necessary (like
> in above func ext4_can_extents_be_merged())?

Ah yes, I was under the same assumption when reviewing the code
initially and one of my first solutions was to also use this dynamic
'state' flag in the ->end_io() handler. But, I fell flat on my face as
that deemed to be problematic... This is because there can be multiple
direct IOs to unwritten extents against the same inode, so you cannot
possibly get away with tracking them using this single inode flag. So,
hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use
IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks
whether _this_ particular IO has an underlying unwritten extent.

Ritesh Harjani Aug. 13, 2019, 12:27 p.m. UTC | #3

On 8/13/19 4:40 PM, Matthew Bobrowski wrote:
> On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote:
>>> This patch series converts the ext4 direct IO code paths to make use of the
>>> iomap infrastructure and removes the old buffer_head direct-io based
>>> implementation. The result is that ext4 is converted to the newer framework
>>> and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO.
>>>
>>> These changes have been tested using xfstests in both DAX and non-DAX modes
>>> using various configurations i.e. 4k, dioread_nolock, dax.
>> I had some minor review comments posted on Patch-4.
>> But the rest of the patch series looks good to me.
> Thanks for the review, much appreciated! Also, apologies about any
> delayed response to your queries, I predominantly do all this work in
> my personal time.
Np at all.

>
>> I will also do some basic testing of xfstests which I did for my patches and
>> will revert back.
> Sounds good!

I did not find any failure new failures in xfstests with 4K block size.
Neither in basic fio DIO/AIO testing. So my basic testing looks good
(these are mostly the tests which I was using for my debug/dev too)


>
>> One query, could you please help answering below for my understanding :-
>>
>> I was under the assumption that we need to maintain
>> ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or
>> atomic_read(&EXT4_I(inode)->i_unwritten))
>> in case of non-AIO directIO or AIO directIO case as well (when we may
>> allocate unwritten extents),
>> to protect with some kind of race with other parts of code(maybe
>> truncate/bufferedIO/fallocate not sure?) which may call for
>> ext4_can_extents_be_merged()
>> to check if extents can be merged or not.
>>
>> Is it not the case?
>> Now that directIO code has no way of specifying that this inode has
>> unwritten extent, will it not race with any other path, where this info was
>> necessary (like
>> in above func ext4_can_extents_be_merged())?
> Ah yes, I was under the same assumption when reviewing the code
> initially and one of my first solutions was to also use this dynamic
> 'state' flag in the ->end_io() handler. But, I fell flat on my face as
> that deemed to be problematic... This is because there can be multiple
> direct IOs to unwritten extents against the same inode, so you cannot
> possibly get away with tracking them using this single inode flag. So,
> hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use
> IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks
> whether _this_ particular IO has an underlying unwritten extent.

Thanks for taking time to explain this.

Yes, I do understand that part - i.e. while preparing block mapping in 
->iomap_begin
we get to know(from ext4_map_blocks) whether this is an unwritten extent 
and we add the flag
IOMAP_DIO_UNWRITTEN to iomap. This is needed so that we can convert 
unwritten extents in ->end_io
before we update the inode size and mark the inode dirty - to avoid any 
race with other AIO DIO or bufferedIO.

But what I meant was this (I may be wrong here since I haven't really 
looked into it),
but for my understanding I would like to discuss this -

So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining 
on whether a newextent can be merged with ex1 in function
ext4_extents_can_be_merged. But now since we have removed that flag we 
have no way of knowing that whether
this inode has any unwritten extents or not from any DIO path.
What I meant is isn't this removal of setting/unsetting of 
flag(EXT4_STATE_DIO_UNWRITTEN)
changing the behavior of this func - ext4_extents_can_be_merged?

Also - could you please explain why this check returns 0 in the first 
place (line 1762 - 1766 below)?

1733 int
1734 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent 
*ex1,
1735                                 struct ext4_extent *ex2)
<...>

1762         if (ext4_ext_is_unwritten(ex1) &&
1763             (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
1764              atomic_read(&EXT4_I(inode)->i_unwritten) ||
1765              (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
1766                 return 0;
<...>

Regards
Ritesh

Matthew Bobrowski Aug. 14, 2019, 9:48 a.m. UTC | #4

On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> On 8/13/19 4:40 PM, Matthew Bobrowski wrote:
> > On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote:
> > > I was under the assumption that we need to maintain
> > > ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or
> > > atomic_read(&EXT4_I(inode)->i_unwritten))
> > > in case of non-AIO directIO or AIO directIO case as well (when we may
> > > allocate unwritten extents),
> > > to protect with some kind of race with other parts of code(maybe
> > > truncate/bufferedIO/fallocate not sure?) which may call for
> > > ext4_can_extents_be_merged()
> > > to check if extents can be merged or not.
> > > 
> > > Is it not the case?
> > > Now that directIO code has no way of specifying that this inode has
> > > unwritten extent, will it not race with any other path, where this info was
> > > necessary (like
> > > in above func ext4_can_extents_be_merged())?
> > Ah yes, I was under the same assumption when reviewing the code
> > initially and one of my first solutions was to also use this dynamic
> > 'state' flag in the ->end_io() handler. But, I fell flat on my face as
> > that deemed to be problematic... This is because there can be multiple
> > direct IOs to unwritten extents against the same inode, so you cannot
> > possibly get away with tracking them using this single inode flag. So,
> > hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use
> > IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks
> > whether _this_ particular IO has an underlying unwritten extent.
> 
> Thanks for taking time to explain this.
> 
> But what I meant was this (I may be wrong here since I haven't really looked
> into it),
> but for my understanding I would like to discuss this -
> 
> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> whether a newextent can be merged with ex1 in function
> ext4_extents_can_be_merged. But now since we have removed that flag we have
> no way of knowing that whether
> this inode has any unwritten extents or not from any DIO path.
> What I meant is isn't this removal of setting/unsetting of
> flag(EXT4_STATE_DIO_UNWRITTEN)
> changing the behavior of this func - ext4_extents_can_be_merged?

Ah yes, I see. I believe that what you're saying is correct and I
think we will need to account for this case. But, I haven't thought
about how to do this just yet.

> Also - could you please explain why this check returns 0 in the first place
> (line 1762 - 1766 below)?

I cannot explain why, because I myself don't know exactly why in this
particular case the extents cannot be merged. Perhaps `git blame` is
our friend and we can direct that question accordingly, or someone
else on this mailing list knows the answer. :-)

> 1733 int
> 1734 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent
> *ex1,
> 1735                                 struct ext4_extent *ex2)
> <...>
> 
> 1762         if (ext4_ext_is_unwritten(ex1) &&
> 1763             (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
> 1764              atomic_read(&EXT4_I(inode)->i_unwritten) ||
> 1765              (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
> 1766                 return 0;

--M

Ritesh Harjani Aug. 14, 2019, 11:58 a.m. UTC | #5

On 8/14/19 3:18 PM, Matthew Bobrowski wrote:
> On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
>> On 8/13/19 4:40 PM, Matthew Bobrowski wrote:
>>> On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote:
>>>> I was under the assumption that we need to maintain
>>>> ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or
>>>> atomic_read(&EXT4_I(inode)->i_unwritten))
>>>> in case of non-AIO directIO or AIO directIO case as well (when we may
>>>> allocate unwritten extents),
>>>> to protect with some kind of race with other parts of code(maybe
>>>> truncate/bufferedIO/fallocate not sure?) which may call for
>>>> ext4_can_extents_be_merged()
>>>> to check if extents can be merged or not.
>>>>
>>>> Is it not the case?
>>>> Now that directIO code has no way of specifying that this inode has
>>>> unwritten extent, will it not race with any other path, where this info was
>>>> necessary (like
>>>> in above func ext4_can_extents_be_merged())?
>>> Ah yes, I was under the same assumption when reviewing the code
>>> initially and one of my first solutions was to also use this dynamic
>>> 'state' flag in the ->end_io() handler. But, I fell flat on my face as
>>> that deemed to be problematic... This is because there can be multiple
>>> direct IOs to unwritten extents against the same inode, so you cannot
>>> possibly get away with tracking them using this single inode flag. So,
>>> hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use
>>> IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks
>>> whether _this_ particular IO has an underlying unwritten extent.
>> Thanks for taking time to explain this.
>>
>> But what I meant was this (I may be wrong here since I haven't really looked
>> into it),
>> but for my understanding I would like to discuss this -
>>
>> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
>> whether a newextent can be merged with ex1 in function
>> ext4_extents_can_be_merged. But now since we have removed that flag we have
>> no way of knowing that whether
>> this inode has any unwritten extents or not from any DIO path.
>> What I meant is isn't this removal of setting/unsetting of
>> flag(EXT4_STATE_DIO_UNWRITTEN)
>> changing the behavior of this func - ext4_extents_can_be_merged?
> Ah yes, I see. I believe that what you're saying is correct and I
> think we will need to account for this case. But, I haven't thought
> about how to do this just yet.

That's not a problem, we can surely discuss the possible approaches.


>> Also - could you please explain why this check returns 0 in the first place
>> (line 1762 - 1766 below)?
>>
>> 1733 int
>> 1734 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent
>> *ex1,
>> 1735                                 struct ext4_extent *ex2)
>> <...>
>>
>> 1762         if (ext4_ext_is_unwritten(ex1) &&
>> 1763             (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
>> 1764              atomic_read(&EXT4_I(inode)->i_unwritten) ||
>> 1765              (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
>> 1766                 return 0;
> I cannot explain why, because I myself don't know exactly why in this
> particular case the extents cannot be merged. Perhaps `git blame` is
> our friend and we can direct that question accordingly, or someone
> else on this mailing list knows the answer. :-)

git blame didn't help much. But I think I may know what the above 
condition means.
So I think if there is an ongoing IO to an unwritten extent, we may 
sometimes first split that extent
to match the exact range of the IO, then write data to it and then 
convert that *exact range* to written extent.

So this means while there is an ongoing IO to any inode which has 
unwritten extents, we should not allow
any other extent to merge with extents of this inode.

Now one race which I could think of it is, when we are doing AIO DIO and 
in parallel doing fallocate to the end of the file.
But since we wait for inode_dio_wait in fallocate, so that should not 
race with any DIO paths.

I think, here we can wait to get answers from others to understand, if 
there is any race with any of the DIO paths on removing this state 
flag(EXT4_STATE_DIO_UNWRITTEN) & not
setting "i_unwritten". Whether it can race with any other path and if 
this state flag is necessary(in DIO cases also) for the correct 
functionality?


Regards
Ritesh

Matthew Bobrowski Aug. 21, 2019, 1:14 p.m. UTC | #6

On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> But what I meant was this (I may be wrong here since I haven't really looked
> into it), but for my understanding I would like to discuss this -
> 
> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> whether a newextent can be merged with ex1 in function
> ext4_extents_can_be_merged. But now since we have removed that flag we have
> no way of knowing that whether this inode has any unwritten extents or not
> from any DIO path.
> 
> What I meant is isn't this removal of setting/unsetting of
> flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
> ext4_extents_can_be_merged?

OK, I'm stuck and looking for either clarity, revalidation of my
thought process, or any input on how to solve this problem for that
matter.

In the current ext4 direct IO implementation, the dynamic state flag
EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
for ext4_io_end->flag, and the value of i_unwritten is
incremented/decremented for asynchronous direct IO writes. All
mechanisms by which are used to track and determine whether the inode,
or an IO in flight against a particular inode have any pending
unwritten extents that need to be converted after the IO has
completed. In addition to this, we have ext4_can_extents_be_merged()
performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
i_unwritten and using them to determine whether it can or cannot merge
a requested extent into an existing extent.

This is all fine for the current direct IO implementation. However,
while switching the direct IO code paths over to make use of the iomap
infrastructure, I believe that we can no longer simply track whether
an inode has unwritten extents needing to be converted by simply
setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
inode. The reason being is that there can be multiple direct IO
operations to unwritten extents running against the inode and we don't
particularly distinguish synchronous from asynchronous writes within
ext4_iomap_begin() as there's really no need. So, the only way to
accurately determine whether extent conversion is deemed necessary for
an IO operation whether it'd be synchronous or asynchronous is by
checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
callback. I'm certain that this portion of the logic is correct, but
we're still left with some issues when it comes to the checks that I
previously mentioned in ext4_can_extents_be_merged(), which is the
part I need some input on.

I was doing some thinking and I don't believe that making use of the
EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
only for reasons that I've briefly mentioned above, but also because
of the fact that it'll probably lead to a lot of inaccurate judgements
while taking particular code paths and some really ugly code that
creeps close to the definition of insanity. Rather, what if we solve
this problem by continuing to just use i_unwritten to keep track of
all the direct IOs to unwritten against running against an inode?
Within ext4_iomap_begin() post successful creation of unwritten
extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
subsequently within the ->end_io() callback whether we take the
success or error path we'd call
atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
still rely on this value to be used in the check within
ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
has any...

--M

Matthew Bobrowski Aug. 22, 2019, noon UTC | #7

On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote:
> On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> > But what I meant was this (I may be wrong here since I haven't
> > really looked into it), but for my understanding I would like to
> > discuss this -
> > 
> > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> > whether a newextent can be merged with ex1 in function
> > ext4_extents_can_be_merged. But now since we have removed that flag we have
> > no way of knowing that whether this inode has any unwritten extents or not
> > from any DIO path.
> > 
> > What I meant is isn't this removal of setting/unsetting of
> > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
> > ext4_extents_can_be_merged?
> 
> OK, I'm stuck and looking for either clarity, revalidation of my
> thought process, or any input on how to solve this problem for that
> matter.
> 
> In the current ext4 direct IO implementation, the dynamic state flag
> EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
> writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
> for ext4_io_end->flag, and the value of i_unwritten is
> incremented/decremented for asynchronous direct IO writes. All
> mechanisms by which are used to track and determine whether the inode,
> or an IO in flight against a particular inode have any pending
> unwritten extents that need to be converted after the IO has
> completed. In addition to this, we have ext4_can_extents_be_merged()
> performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
> i_unwritten and using them to determine whether it can or cannot merge
> a requested extent into an existing extent.
> 
> This is all fine for the current direct IO implementation. However,
> while switching the direct IO code paths over to make use of the iomap
> infrastructure, I believe that we can no longer simply track whether
> an inode has unwritten extents needing to be converted by simply
> setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
> inode. The reason being is that there can be multiple direct IO
> operations to unwritten extents running against the inode and we don't
> particularly distinguish synchronous from asynchronous writes within
> ext4_iomap_begin() as there's really no need. So, the only way to
> accurately determine whether extent conversion is deemed necessary for
> an IO operation whether it'd be synchronous or asynchronous is by
> checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
> callback. I'm certain that this portion of the logic is correct, but
> we're still left with some issues when it comes to the checks that I
> previously mentioned in ext4_can_extents_be_merged(), which is the
> part I need some input on.
> 
> I was doing some thinking and I don't believe that making use of the
> EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
> only for reasons that I've briefly mentioned above, but also because
> of the fact that it'll probably lead to a lot of inaccurate judgements
> while taking particular code paths and some really ugly code that
> creeps close to the definition of insanity. Rather, what if we solve
> this problem by continuing to just use i_unwritten to keep track of
> all the direct IOs to unwritten against running against an inode?
> Within ext4_iomap_begin() post successful creation of unwritten
> extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
> subsequently within the ->end_io() callback whether we take the
> success or error path we'd call
> atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
> still rely on this value to be used in the check within
> ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
> has any...

Actually, no...

I've done some more thinking and what I suggested above around the use
of i_unwritten will also not work properly. Using iomap
infrastructure, there is the possibility of calling into the
->iomap_begin() more than once for a single direct IO operation. This
means that by the time we even get to decrementing i_unwritten in the
->end_io() callback after converting the unwritten extents we're
already running the possibility of i_unwritten becoming unbalanced
really quickly and staying that way. This also means that the
statement checking i_unwritten in ext4_can_extents_be_merged() will be
affected and potentially result in it being evaluated incorrectly. I
was thinking that we could just decrement i_unwritten in
->iomap_end(), but that seems to me like it would be racy and also
lead to incorrect results. At this point I'm out of ideas on how to
solve this, so any other ideas would be appreciated!

--M

Ritesh Harjani Aug. 22, 2019, 2:11 p.m. UTC | #8

On 8/22/19 5:30 PM, Matthew Bobrowski wrote:
> On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote:
>> On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
>>> But what I meant was this (I may be wrong here since I haven't
>>> really looked into it), but for my understanding I would like to
>>> discuss this -
>>>
>>> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
>>> whether a newextent can be merged with ex1 in function
>>> ext4_extents_can_be_merged. But now since we have removed that flag we have
>>> no way of knowing that whether this inode has any unwritten extents or not
>>> from any DIO path.
>>>
>>> What I meant is isn't this removal of setting/unsetting of
>>> flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
>>> ext4_extents_can_be_merged?
>>
>> OK, I'm stuck and looking for either clarity, revalidation of my
>> thought process, or any input on how to solve this problem for that
>> matter.
>>
>> In the current ext4 direct IO implementation, the dynamic state flag
>> EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
>> writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
>> for ext4_io_end->flag, and the value of i_unwritten is
>> incremented/decremented for asynchronous direct IO writes. All
>> mechanisms by which are used to track and determine whether the inode,
>> or an IO in flight against a particular inode have any pending
>> unwritten extents that need to be converted after the IO has
>> completed. In addition to this, we have ext4_can_extents_be_merged()
>> performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
>> i_unwritten and using them to determine whether it can or cannot merge
>> a requested extent into an existing extent.
>>
>> This is all fine for the current direct IO implementation. However,
>> while switching the direct IO code paths over to make use of the iomap
>> infrastructure, I believe that we can no longer simply track whether
>> an inode has unwritten extents needing to be converted by simply
>> setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
>> inode. The reason being is that there can be multiple direct IO
>> operations to unwritten extents running against the inode and we don't
>> particularly distinguish synchronous from asynchronous writes within
>> ext4_iomap_begin() as there's really no need. So, the only way to
>> accurately determine whether extent conversion is deemed necessary for
>> an IO operation whether it'd be synchronous or asynchronous is by
>> checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
>> callback. I'm certain that this portion of the logic is correct, but
>> we're still left with some issues when it comes to the checks that I
>> previously mentioned in ext4_can_extents_be_merged(), which is the
>> part I need some input on.
>>
>> I was doing some thinking and I don't believe that making use of the
>> EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
>> only for reasons that I've briefly mentioned above, but also because
>> of the fact that it'll probably lead to a lot of inaccurate judgements
>> while taking particular code paths and some really ugly code that
>> creeps close to the definition of insanity. Rather, what if we solve
>> this problem by continuing to just use i_unwritten to keep track of
>> all the direct IOs to unwritten against running against an inode?
>> Within ext4_iomap_begin() post successful creation of unwritten
>> extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
>> subsequently within the ->end_io() callback whether we take the
>> success or error path we'd call
>> atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
>> still rely on this value to be used in the check within
>> ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
>> has any...
> 
> Actually, no...
> 
> I've done some more thinking and what I suggested above around the use
> of i_unwritten will also not work properly. Using iomap
> infrastructure, there is the possibility of calling into the
> ->iomap_begin() more than once for a single direct IO operation. This
> means that by the time we even get to decrementing i_unwritten in the
> ->end_io() callback after converting the unwritten extents we're
> already running the possibility of i_unwritten becoming unbalanced
> really quickly and staying that way. This also means that the
> statement checking i_unwritten in ext4_can_extents_be_merged() will be
> affected and potentially result in it being evaluated incorrectly. I
> was thinking that we could just decrement i_unwritten in
> ->iomap_end(), but that seems to me like it would be racy and also
> lead to incorrect results. At this point I'm out of ideas on how to
> solve this, so any other ideas would be appreciated!

I will let others also comment, if someone has any other better approach.

1. One approach is to add the infrastructure in iomap with 
iomap_dio->private which is filesystem specific pointer. This can be
updated by filesystem in ->iomap_begin call into iomap->private.
And in case of iomap_dio_rw, this iomap->private will be copied to 
iomap_dio->private if not already set.

But I think this will eventually become hacky in the sense when you will 
have to determine on whether the dio->private is already set or not when 
iomap_apply will be called second time. It will become a problem with 
AIO DIO in ext4 since we use i_unwritten which tells us whether there is 
any unwritten extent but it does not tell whether this unwritten extent 
is due to a DIRECT AIO DIO in progress or a buffered one.

So we can ignore this approach - unless you or someone else have some 
good design ideas to build on top of above.


2. Second approach which I was thinking is to track only those extents 
which are marked unwritten and are under IO. This can be done in 
ext4_map_blocks. This way we will not have to track a particular inode 
has any unwritten extents or not, but it will be extent based.
Something similar was also done a while ago. Do you think this approach 
will work in our case?

So with this extents will be scanned in extent status tree to see if any 
among those are under IO and are unwritten in func 
ext4_can_extents_be_merged.

https://patchwork.ozlabs.org/patch/1013837/


-Ritesh

Matthew Bobrowski Aug. 24, 2019, 3:18 a.m. UTC | #9

On Thu, Aug 22, 2019 at 07:41:25PM +0530, Ritesh Harjani wrote:
> 
> 
> On 8/22/19 5:30 PM, Matthew Bobrowski wrote:
> > On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote:
> > > On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> > > > But what I meant was this (I may be wrong here since I haven't
> > > > really looked into it), but for my understanding I would like to
> > > > discuss this -
> > > > 
> > > > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> > > > whether a newextent can be merged with ex1 in function
> > > > ext4_extents_can_be_merged. But now since we have removed that flag we have
> > > > no way of knowing that whether this inode has any unwritten extents or not
> > > > from any DIO path.
> > > > 
> > > > What I meant is isn't this removal of setting/unsetting of
> > > > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
> > > > ext4_extents_can_be_merged?
> > > 
> > > OK, I'm stuck and looking for either clarity, revalidation of my
> > > thought process, or any input on how to solve this problem for that
> > > matter.
> > > 
> > > In the current ext4 direct IO implementation, the dynamic state flag
> > > EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
> > > writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
> > > for ext4_io_end->flag, and the value of i_unwritten is
> > > incremented/decremented for asynchronous direct IO writes. All
> > > mechanisms by which are used to track and determine whether the inode,
> > > or an IO in flight against a particular inode have any pending
> > > unwritten extents that need to be converted after the IO has
> > > completed. In addition to this, we have ext4_can_extents_be_merged()
> > > performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
> > > i_unwritten and using them to determine whether it can or cannot merge
> > > a requested extent into an existing extent.
> > > 
> > > This is all fine for the current direct IO implementation. However,
> > > while switching the direct IO code paths over to make use of the iomap
> > > infrastructure, I believe that we can no longer simply track whether
> > > an inode has unwritten extents needing to be converted by simply
> > > setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
> > > inode. The reason being is that there can be multiple direct IO
> > > operations to unwritten extents running against the inode and we don't
> > > particularly distinguish synchronous from asynchronous writes within
> > > ext4_iomap_begin() as there's really no need. So, the only way to
> > > accurately determine whether extent conversion is deemed necessary for
> > > an IO operation whether it'd be synchronous or asynchronous is by
> > > checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
> > > callback. I'm certain that this portion of the logic is correct, but
> > > we're still left with some issues when it comes to the checks that I
> > > previously mentioned in ext4_can_extents_be_merged(), which is the
> > > part I need some input on.
> > > 
> > > I was doing some thinking and I don't believe that making use of the
> > > EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
> > > only for reasons that I've briefly mentioned above, but also because
> > > of the fact that it'll probably lead to a lot of inaccurate judgements
> > > while taking particular code paths and some really ugly code that
> > > creeps close to the definition of insanity. Rather, what if we solve
> > > this problem by continuing to just use i_unwritten to keep track of
> > > all the direct IOs to unwritten against running against an inode?
> > > Within ext4_iomap_begin() post successful creation of unwritten
> > > extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
> > > subsequently within the ->end_io() callback whether we take the
> > > success or error path we'd call
> > > atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
> > > still rely on this value to be used in the check within
> > > ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
> > > has any...
> > 
> > Actually, no...
> > 
> > I've done some more thinking and what I suggested above around the use
> > of i_unwritten will also not work properly. Using iomap
> > infrastructure, there is the possibility of calling into the
> > ->iomap_begin() more than once for a single direct IO operation. This
> > means that by the time we even get to decrementing i_unwritten in the
> > ->end_io() callback after converting the unwritten extents we're
> > already running the possibility of i_unwritten becoming unbalanced
> > really quickly and staying that way. This also means that the
> > statement checking i_unwritten in ext4_can_extents_be_merged() will be
> > affected and potentially result in it being evaluated incorrectly. I
> > was thinking that we could just decrement i_unwritten in
> > ->iomap_end(), but that seems to me like it would be racy and also
> > lead to incorrect results. At this point I'm out of ideas on how to
> > solve this, so any other ideas would be appreciated!
> 
> I will let others also comment, if someone has any other better approach.
> 
> 1. One approach is to add the infrastructure in iomap with
> iomap_dio->private which is filesystem specific pointer. This can be
> updated by filesystem in ->iomap_begin call into iomap->private.
> And in case of iomap_dio_rw, this iomap->private will be copied to
> iomap_dio->private if not already set.
> 
> But I think this will eventually become hacky in the sense when you will
> have to determine on whether the dio->private is already set or not when
> iomap_apply will be called second time. It will become a problem with AIO
> DIO in ext4 since we use i_unwritten which tells us whether there is any
> unwritten extent but it does not tell whether this unwritten extent is due
> to a DIRECT AIO DIO in progress or a buffered one.
> 
> So we can ignore this approach - unless you or someone else have some good
> design ideas to build on top of above.

I'm not sure whether _this_ is the solution or not, so let's maybe
wait for others to comment. One thing that I and probably the iomap
maintainers would like to avoid is adding any special case code to
iomap infrastructure, if possible. I mean, from what you suggest it
seems to be rather generic and not overly intrusive, although I know
for a fact that iomap infrastructure exists because stuff like
buffer_heads and the old direct IO code ended up accommodating so many
different edge cases making it almost unmodifiable and unmaintainable.

> 2. Second approach which I was thinking is to track only those extents which
> are marked unwritten and are under IO. This can be done in ext4_map_blocks.
> This way we will not have to track a particular inode has any unwritten
> extents or not, but it will be extent based.
> Something similar was also done a while ago. Do you think this approach will
> work in our case?
> 
> So with this extents will be scanned in extent status tree to see if any
> among those are under IO and are unwritten in func
> ext4_can_extents_be_merged.
> 
> https://patchwork.ozlabs.org/patch/1013837/

Maybe this would be a better approach and I think that it'd
work. Based upon what I read within in that thread there weren't
really any objections to the idea, although I can't see that it made
it upstream, so I may be missing something?

--M

Darrick J. Wong Aug. 24, 2019, 3:55 a.m. UTC | #10

On Sat, Aug 24, 2019 at 01:18:31PM +1000, Matthew Bobrowski wrote:
> On Thu, Aug 22, 2019 at 07:41:25PM +0530, Ritesh Harjani wrote:
> > 
> > 
> > On 8/22/19 5:30 PM, Matthew Bobrowski wrote:
> > > On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote:
> > > > On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> > > > > But what I meant was this (I may be wrong here since I haven't
> > > > > really looked into it), but for my understanding I would like to
> > > > > discuss this -
> > > > > 
> > > > > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> > > > > whether a newextent can be merged with ex1 in function
> > > > > ext4_extents_can_be_merged. But now since we have removed that flag we have
> > > > > no way of knowing that whether this inode has any unwritten extents or not
> > > > > from any DIO path.
> > > > > 
> > > > > What I meant is isn't this removal of setting/unsetting of
> > > > > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
> > > > > ext4_extents_can_be_merged?
> > > > 
> > > > OK, I'm stuck and looking for either clarity, revalidation of my
> > > > thought process, or any input on how to solve this problem for that
> > > > matter.
> > > > 
> > > > In the current ext4 direct IO implementation, the dynamic state flag
> > > > EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
> > > > writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
> > > > for ext4_io_end->flag, and the value of i_unwritten is
> > > > incremented/decremented for asynchronous direct IO writes. All
> > > > mechanisms by which are used to track and determine whether the inode,
> > > > or an IO in flight against a particular inode have any pending
> > > > unwritten extents that need to be converted after the IO has
> > > > completed. In addition to this, we have ext4_can_extents_be_merged()
> > > > performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
> > > > i_unwritten and using them to determine whether it can or cannot merge
> > > > a requested extent into an existing extent.
> > > > 
> > > > This is all fine for the current direct IO implementation. However,
> > > > while switching the direct IO code paths over to make use of the iomap
> > > > infrastructure, I believe that we can no longer simply track whether
> > > > an inode has unwritten extents needing to be converted by simply
> > > > setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
> > > > inode. The reason being is that there can be multiple direct IO
> > > > operations to unwritten extents running against the inode and we don't
> > > > particularly distinguish synchronous from asynchronous writes within
> > > > ext4_iomap_begin() as there's really no need. So, the only way to
> > > > accurately determine whether extent conversion is deemed necessary for
> > > > an IO operation whether it'd be synchronous or asynchronous is by
> > > > checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
> > > > callback. I'm certain that this portion of the logic is correct, but
> > > > we're still left with some issues when it comes to the checks that I
> > > > previously mentioned in ext4_can_extents_be_merged(), which is the
> > > > part I need some input on.
> > > > 
> > > > I was doing some thinking and I don't believe that making use of the
> > > > EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
> > > > only for reasons that I've briefly mentioned above, but also because
> > > > of the fact that it'll probably lead to a lot of inaccurate judgements
> > > > while taking particular code paths and some really ugly code that
> > > > creeps close to the definition of insanity. Rather, what if we solve
> > > > this problem by continuing to just use i_unwritten to keep track of
> > > > all the direct IOs to unwritten against running against an inode?
> > > > Within ext4_iomap_begin() post successful creation of unwritten
> > > > extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
> > > > subsequently within the ->end_io() callback whether we take the
> > > > success or error path we'd call
> > > > atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
> > > > still rely on this value to be used in the check within
> > > > ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
> > > > has any...
> > > 
> > > Actually, no...
> > > 
> > > I've done some more thinking and what I suggested above around the use
> > > of i_unwritten will also not work properly. Using iomap
> > > infrastructure, there is the possibility of calling into the
> > > ->iomap_begin() more than once for a single direct IO operation. This
> > > means that by the time we even get to decrementing i_unwritten in the
> > > ->end_io() callback after converting the unwritten extents we're
> > > already running the possibility of i_unwritten becoming unbalanced
> > > really quickly and staying that way. This also means that the
> > > statement checking i_unwritten in ext4_can_extents_be_merged() will be
> > > affected and potentially result in it being evaluated incorrectly. I
> > > was thinking that we could just decrement i_unwritten in
> > > ->iomap_end(), but that seems to me like it would be racy and also
> > > lead to incorrect results. At this point I'm out of ideas on how to
> > > solve this, so any other ideas would be appreciated!
> > 
> > I will let others also comment, if someone has any other better approach.
> > 
> > 1. One approach is to add the infrastructure in iomap with
> > iomap_dio->private which is filesystem specific pointer. This can be
> > updated by filesystem in ->iomap_begin call into iomap->private.
> > And in case of iomap_dio_rw, this iomap->private will be copied to
> > iomap_dio->private if not already set.
> > 
> > But I think this will eventually become hacky in the sense when you will
> > have to determine on whether the dio->private is already set or not when
> > iomap_apply will be called second time. It will become a problem with AIO
> > DIO in ext4 since we use i_unwritten which tells us whether there is any
> > unwritten extent but it does not tell whether this unwritten extent is due
> > to a DIRECT AIO DIO in progress or a buffered one.
> > 
> > So we can ignore this approach - unless you or someone else have some good
> > design ideas to build on top of above.
> 
> I'm not sure whether _this_ is the solution or not, so let's maybe
> wait for others to comment. One thing that I and probably the iomap
> maintainers would like to avoid is adding any special case code to
> iomap infrastructure, if possible. I mean, from what you suggest it
> seems to be rather generic and not overly intrusive, although I know
> for a fact that iomap infrastructure exists because stuff like
> buffer_heads and the old direct IO code ended up accommodating so many
> different edge cases making it almost unmodifiable and unmaintainable.

I'm probably misunderstanding the ext4 extent cache horribly, but I keep
wondering why any of this is necessary -- why can't ext4 track the
unwritten status in the extent records directly?  And why is there all
this strange "can merge" logic?  If you need to convert blocks X to Y
to written state because a write to those blocks completed, isn't that
just manipulation of a bunch of incore records?  And can't you just seek
back and forth in the extent cache to look for adjacent records to merge
with? <confuseD>

(I'd really prefer not to go adding private fields all over the
place...)

--D

> > 2. Second approach which I was thinking is to track only those extents which
> > are marked unwritten and are under IO. This can be done in ext4_map_blocks.
> > This way we will not have to track a particular inode has any unwritten
> > extents or not, but it will be extent based.
> > Something similar was also done a while ago. Do you think this approach will
> > work in our case?
> > 
> > So with this extents will be scanned in extent status tree to see if any
> > among those are under IO and are unwritten in func
> > ext4_can_extents_be_merged.
> > 
> > https://patchwork.ozlabs.org/patch/1013837/
> 
> Maybe this would be a better approach and I think that it'd
> work. Based upon what I read within in that thread there weren't
> really any objections to the idea, although I can't see that it made
> it upstream, so I may be missing something?
> 
> --M

Christoph Hellwig Aug. 24, 2019, 11:04 p.m. UTC | #11

On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote:
> I'm probably misunderstanding the ext4 extent cache horribly, but I keep
> wondering why any of this is necessary -- why can't ext4 track the
> unwritten status in the extent records directly?  And why is there all
> this strange "can merge" logic?  If you need to convert blocks X to Y
> to written state because a write to those blocks completed, isn't that
> just manipulation of a bunch of incore records?  And can't you just seek
> back and forth in the extent cache to look for adjacent records to merge
> with? <confuseD>

Same here.  I'm not an ext4 expert, but here is what we do in XFS, which
hopefully works in some form for ext4 a well:

 - when starting a direct I/O we allocate any needed blocks and do so
   as unwritten extent.  The extent tree code will merge them in
   whatever way that seems suitable
 - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we
   call a function that walks the whole range covered by the ioend,
   and convert any unwritten extent to a normal written extent.  Any
   splitting and merging will be done as needed by the low-level
   extent tree code
 - this also means we don't need the xfs_ioen structure (which ext4)
   copied from for direct I/O at all (we used to have it initially,
   though including the time when ext4 copied this code).
 - we don't need the equivalent to the ext4_unwritten_wait call in
   ext4_file_write_iter because we serialize any non-aligned I/O
   instead of trying to optimize for weird corner cases


> (I'd really prefer not to go adding private fields all over the
> place...)

Agreed.

Matthew Bobrowski Aug. 27, 2019, 9:52 a.m. UTC | #12

On Sat, Aug 24, 2019 at 04:04:27PM -0700, Christoph Hellwig wrote:
> On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote:
> > I'm probably misunderstanding the ext4 extent cache horribly, but I keep
> > wondering why any of this is necessary -- why can't ext4 track the
> > unwritten status in the extent records directly?  And why is there all
> > this strange "can merge" logic?  If you need to convert blocks X to Y
> > to written state because a write to those blocks completed, isn't that
> > just manipulation of a bunch of incore records?  And can't you just seek
> > back and forth in the extent cache to look for adjacent records to merge
> > with? <confuseD>
> 
> Same here.  I'm not an ext4 expert, but here is what we do in XFS, which
> hopefully works in some form for ext4 a well:
> 
>  - when starting a direct I/O we allocate any needed blocks and do so
>    as unwritten extent.  The extent tree code will merge them in
>    whatever way that seems suitable
>  - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we
>    call a function that walks the whole range covered by the ioend,
>    and convert any unwritten extent to a normal written extent.  Any
>    splitting and merging will be done as needed by the low-level
>    extent tree code
>  - this also means we don't need the xfs_ioen structure (which ext4)
>    copied from for direct I/O at all (we used to have it initially,
>    though including the time when ext4 copied this code).
>  - we don't need the equivalent to the ext4_unwritten_wait call in
>    ext4_file_write_iter because we serialize any non-aligned I/O
>    instead of trying to optimize for weird corner cases

Yeah, so what you've detailed above is essentially the approach I've
taken in my patch series...

What is not clear to me at this point though is whether it is still
necessary to explicitly track unwritten extents via in-core inode
attributes i.e. ->i_unwritten and ->i_state_flags under the new direct
IO code path implementation, which makes use of the iomap
infrastructure. Or, whether we can get away with simply not using
these in-core inode attributes and rely just on checks against the
extent record directly, as breifly mentioned by Darrick. I would think
that this type of check would be enough, however the checks around
whether the inode is currently undergoing direct IO were implemented
at some point, so there must be a reason for having them
(a9b8241594add)?

--M

Matthew Bobrowski Aug. 28, 2019, 12:05 p.m. UTC | #13

On Tue, Aug 27, 2019 at 07:52:23PM +1000, Matthew Bobrowski wrote:
> On Sat, Aug 24, 2019 at 04:04:27PM -0700, Christoph Hellwig wrote:
> > On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote:
> > > I'm probably misunderstanding the ext4 extent cache horribly, but I keep
> > > wondering why any of this is necessary -- why can't ext4 track the
> > > unwritten status in the extent records directly?  And why is there all
> > > this strange "can merge" logic?  If you need to convert blocks X to Y
> > > to written state because a write to those blocks completed, isn't that
> > > just manipulation of a bunch of incore records?  And can't you just seek
> > > back and forth in the extent cache to look for adjacent records to merge
> > > with? <confuseD>
> > 
> > Same here.  I'm not an ext4 expert, but here is what we do in XFS, which
> > hopefully works in some form for ext4 a well:
> > 
> >  - when starting a direct I/O we allocate any needed blocks and do so
> >    as unwritten extent.  The extent tree code will merge them in
> >    whatever way that seems suitable
> >  - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we
> >    call a function that walks the whole range covered by the ioend,
> >    and convert any unwritten extent to a normal written extent.  Any
> >    splitting and merging will be done as needed by the low-level
> >    extent tree code
> >  - this also means we don't need the xfs_ioen structure (which ext4)
> >    copied from for direct I/O at all (we used to have it initially,
> >    though including the time when ext4 copied this code).
> >  - we don't need the equivalent to the ext4_unwritten_wait call in
> >    ext4_file_write_iter because we serialize any non-aligned I/O
> >    instead of trying to optimize for weird corner cases
> 
> Yeah, so what you've detailed above is essentially the approach I've
> taken in my patch series...
> 
> What is not clear to me at this point though is whether it is still
> necessary to explicitly track unwritten extents via in-core inode
> attributes i.e. ->i_unwritten and ->i_state_flags under the new direct
> IO code path implementation, which makes use of the iomap
> infrastructure. Or, whether we can get away with simply not using
> these in-core inode attributes and rely just on checks against the
> extent record directly, as breifly mentioned by Darrick. I would think
> that this type of check would be enough, however the checks around
> whether the inode is currently undergoing direct IO were implemented
> at some point, so there must be a reason for having them
> (a9b8241594add)?

Maybe it's a silly question, although I'm wanting to clarify my
understanding around why it is that when we either try prepend or
append to an existing extent, we don't permit merging of extents if
the inode is undergoing direct IO?

--M

Theodore Ts'o Aug. 28, 2019, 2:27 p.m. UTC | #14

On Wed, Aug 28, 2019 at 10:05:11PM +1000, Matthew Bobrowski wrote:
> > What is not clear to me at this point though is whether it is still
> > necessary to explicitly track unwritten extents via in-core inode
> > attributes i.e. ->i_unwritten and ->i_state_flags under the new direct
> > IO code path implementation, which makes use of the iomap
> > infrastructure. Or, whether we can get away with simply not using
> > these in-core inode attributes and rely just on checks against the
> > extent record directly, as breifly mentioned by Darrick. I would think
> > that this type of check would be enough, however the checks around
> > whether the inode is currently undergoing direct IO were implemented
> > at some point, so there must be a reason for having them
> > (a9b8241594add)?

The original reason why we created the DIO_STATE_UNWRITTEN flag was a
fast path, where the common case is writing blocks to an existing
location in a file where the blocks are already allocated, and marked
as written.  So consulting the on-disk extent tree to determine
whether unwritten extents need to be converted and/or split is
certainly doable.  However, it's expensive for the common case.  So
having a hint whether we need to schedule a workqueue to possibly
convert an unwritten region is helpful.  If we can just free the bio
and exit the I/O completion handler without having to take shared
locks to examine the on-disk extent tree, so much the better.

> Maybe it's a silly question, although I'm wanting to clarify my
> understanding around why it is that when we either try prepend or
> append to an existing extent, we don't permit merging of extents if

If I recall correctly, the reason for this check was mainly the
concern that we would end up merging an extent that we would then have
to split later on (when the direct I/O completed).

To be honest, i'm not 100% sure what would happen if we removed that
restriction; it might be that things would work just fine (just slower
in some workloads), or whether there is some hidden dependency that
would explode.  I suspect we'd have to try the experiment to be sure.

      		  	       	    - Ted

Jan Kara Aug. 28, 2019, 6:02 p.m. UTC | #15

On Wed 28-08-19 10:27:29, Theodore Y. Ts'o wrote:
> On Wed, Aug 28, 2019 at 10:05:11PM +1000, Matthew Bobrowski wrote:
> > > What is not clear to me at this point though is whether it is still
> > > necessary to explicitly track unwritten extents via in-core inode
> > > attributes i.e. ->i_unwritten and ->i_state_flags under the new direct
> > > IO code path implementation, which makes use of the iomap
> > > infrastructure. Or, whether we can get away with simply not using
> > > these in-core inode attributes and rely just on checks against the
> > > extent record directly, as breifly mentioned by Darrick. I would think
> > > that this type of check would be enough, however the checks around
> > > whether the inode is currently undergoing direct IO were implemented
> > > at some point, so there must be a reason for having them
> > > (a9b8241594add)?
> 
> The original reason why we created the DIO_STATE_UNWRITTEN flag was a
> fast path, where the common case is writing blocks to an existing
> location in a file where the blocks are already allocated, and marked
> as written.  So consulting the on-disk extent tree to determine
> whether unwritten extents need to be converted and/or split is
> certainly doable.  However, it's expensive for the common case.  So
> having a hint whether we need to schedule a workqueue to possibly
> convert an unwritten region is helpful.  If we can just free the bio
> and exit the I/O completion handler without having to take shared
> locks to examine the on-disk extent tree, so much the better.

Yes, but for determining whether extent conversion on IO completion is
needed we now use IOMAP_DIO_UNWRITTEN flag iomap infrastructure provides to
us. So we don't have to track this internally in ext4 anymore.

> > Maybe it's a silly question, although I'm wanting to clarify my
> > understanding around why it is that when we either try prepend or
> > append to an existing extent, we don't permit merging of extents if
> 
> If I recall correctly, the reason for this check was mainly the
> concern that we would end up merging an extent that we would then have
> to split later on (when the direct I/O completed).
> 
> To be honest, i'm not 100% sure what would happen if we removed that
> restriction; it might be that things would work just fine (just slower
> in some workloads), or whether there is some hidden dependency that
> would explode.  I suspect we'd have to try the experiment to be sure.

As far as I remember the concern was that extent split may need block
allocation and we may not have enough free blocks to do it. These days we
have some blocks reserved in the filesystem to accomodate unexpected extent
splits so this shouldn't happen anymore so the only real concern is the
wasted performance due to unnecessary extent merge & split. Kind of a
stress test for this would be to fire of lots of sequential AIO DIO
requests against a hole in a file.

								Honza

Christoph Hellwig Aug. 29, 2019, 6:36 a.m. UTC | #16

On Wed, Aug 28, 2019 at 08:02:15PM +0200, Jan Kara wrote:
> > The original reason why we created the DIO_STATE_UNWRITTEN flag was a
> > fast path, where the common case is writing blocks to an existing
> > location in a file where the blocks are already allocated, and marked
> > as written.  So consulting the on-disk extent tree to determine
> > whether unwritten extents need to be converted and/or split is
> > certainly doable.  However, it's expensive for the common case.  So
> > having a hint whether we need to schedule a workqueue to possibly
> > convert an unwritten region is helpful.  If we can just free the bio
> > and exit the I/O completion handler without having to take shared
> > locks to examine the on-disk extent tree, so much the better.
> 
> Yes, but for determining whether extent conversion on IO completion is
> needed we now use IOMAP_DIO_UNWRITTEN flag iomap infrastructure provides to
> us. So we don't have to track this internally in ext4 anymore.

Exactly.  As mentioned before the ioend to track unwritten thing was
in XFS by the time ext4 copied the ioend approach. but we actually got
rid of that long before the iomap conversion.  Maybe to make everything
easier to understand and bisect you might want to get rid of the ioend
for direct I/O in ext4 as a prep path as well.

The relevant commit is: 273dda76f757 ("xfs: don't use ioends for direct
write completions")

> > To be honest, i'm not 100% sure what would happen if we removed that
> > restriction; it might be that things would work just fine (just slower
> > in some workloads), or whether there is some hidden dependency that
> > would explode.  I suspect we'd have to try the experiment to be sure.
> 
> As far as I remember the concern was that extent split may need block
> allocation and we may not have enough free blocks to do it. These days we
> have some blocks reserved in the filesystem to accomodate unexpected extent
> splits so this shouldn't happen anymore so the only real concern is the
> wasted performance due to unnecessary extent merge & split. Kind of a
> stress test for this would be to fire of lots of sequential AIO DIO
> requests against a hole in a file.

Well, you can always add a don't merge flag to the actual allocation.
You might still get a merge for pathological case (fallocate adjacent
to a dio write just submitted), but if the merging is such a performance
over head here is easy ways to avoid it for the common case.

Matthew Bobrowski Aug. 29, 2019, 11:20 a.m. UTC | #17

Awesome, and thank you *all* for your very valueable input.

On Wed, Aug 28, 2019 at 11:36:08PM -0700, Christoph Hellwig wrote:
> On Wed, Aug 28, 2019 at 08:02:15PM +0200, Jan Kara wrote:
> > > The original reason why we created the DIO_STATE_UNWRITTEN flag was a
> > > fast path, where the common case is writing blocks to an existing
> > > location in a file where the blocks are already allocated, and marked
> > > as written.  So consulting the on-disk extent tree to determine
> > > whether unwritten extents need to be converted and/or split is
> > > certainly doable.  However, it's expensive for the common case.  So
> > > having a hint whether we need to schedule a workqueue to possibly
> > > convert an unwritten region is helpful.  If we can just free the bio
> > > and exit the I/O completion handler without having to take shared
> > > locks to examine the on-disk extent tree, so much the better.
> > 
> > Yes, but for determining whether extent conversion on IO completion is
> > needed we now use IOMAP_DIO_UNWRITTEN flag iomap infrastructure provides to
> > us. So we don't have to track this internally in ext4 anymore.
> 
> Exactly.  As mentioned before the ioend to track unwritten thing was
> in XFS by the time ext4 copied the ioend approach. but we actually got
> rid of that long before the iomap conversion.  Maybe to make everything
> easier to understand and bisect you might want to get rid of the ioend
> for direct I/O in ext4 as a prep path as well.
> 
> The relevant commit is: 273dda76f757 ("xfs: don't use ioends for direct
> write completions")

Uh ha! So, we conclude that there's no need to muck around with hairy
ioend's, or the need to denote whether there's unwritten extents held
against the inode using tricky state flag for that matter.

> > > To be honest, i'm not 100% sure what would happen if we removed that
> > > restriction; it might be that things would work just fine (just slower
> > > in some workloads), or whether there is some hidden dependency that
> > > would explode.  I suspect we'd have to try the experiment to be sure.
> > 
> > As far as I remember the concern was that extent split may need block
> > allocation and we may not have enough free blocks to do it. These days we
> > have some blocks reserved in the filesystem to accomodate unexpected extent
> > splits so this shouldn't happen anymore so the only real concern is the
> > wasted performance due to unnecessary extent merge & split. Kind of a
> > stress test for this would be to fire of lots of sequential AIO DIO
> > requests against a hole in a file.
> 
> Well, you can always add a don't merge flag to the actual allocation.
> You might still get a merge for pathological case (fallocate adjacent
> to a dio write just submitted), but if the merging is such a performance
> over head here is easy ways to avoid it for the common case.

After I've posted through the next version of this patch series, I
will attempt to perform some stress testing to see what the
performance hit could potentially be.

Christoph Hellwig Aug. 29, 2019, 2:41 p.m. UTC | #18

On Thu, Aug 29, 2019 at 09:20:50PM +1000, Matthew Bobrowski wrote:
> Uh ha! So, we conclude that there's no need to muck around with hairy
> ioend's, or the need to denote whether there's unwritten extents held
> against the inode using tricky state flag for that matter.

So in XFS we never had the i_unwritten counter, that is something
purely ext4 specific, and I can't really help with that unfortunately,
but maybe the people who implemented it might be able to help on that
one.

[0/5] ext4: direct IO via iomap infrastructure

Message

Comments