Message ID | cover.1565609891.git.mbobrowski@mbobrowski.org (mailing list archive) |
---|---|
Headers | show |
Series | ext4: direct IO via iomap infrastructure | expand |
Hello Matthew, On 8/12/19 6:22 PM, Matthew Bobrowski wrote: > This patch series converts the ext4 direct IO code paths to make use of the > iomap infrastructure and removes the old buffer_head direct-io based > implementation. The result is that ext4 is converted to the newer framework > and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO. > > These changes have been tested using xfstests in both DAX and non-DAX modes > using various configurations i.e. 4k, dioread_nolock, dax. I had some minor review comments posted on Patch-4. But the rest of the patch series looks good to me. I will also do some basic testing of xfstests which I did for my patches and will revert back. One query, could you please help answering below for my understanding :- I was under the assumption that we need to maintain ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or atomic_read(&EXT4_I(inode)->i_unwritten)) in case of non-AIO directIO or AIO directIO case as well (when we may allocate unwritten extents), to protect with some kind of race with other parts of code(maybe truncate/bufferedIO/fallocate not sure?) which may call for ext4_can_extents_be_merged() to check if extents can be merged or not. Is it not the case? Now that directIO code has no way of specifying that this inode has unwritten extent, will it not race with any other path, where this info was necessary (like in above func ext4_can_extents_be_merged())? Thanks Ritesh > > Matthew Bobrowski (5): > ext4: introduce direct IO read code path using iomap infrastructure > ext4: move inode extension/truncate code out from ext4_iomap_end() > iomap: modify ->end_io() calling convention > ext4: introduce direct IO write code path using iomap infrastructure > ext4: clean up redundant buffer_head direct IO code > > fs/ext4/ext4.h | 3 - > fs/ext4/extents.c | 8 +- > fs/ext4/file.c | 329 +++++++++++++++++++++++++++------- > fs/ext4/inode.c | 488 +++++--------------------------------------------- > fs/iomap/direct-io.c | 9 +- > fs/xfs/xfs_file.c | 17 +- > include/linux/iomap.h | 4 +- > 7 files changed, 322 insertions(+), 536 deletions(-) >
On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote: > > This patch series converts the ext4 direct IO code paths to make use of the > > iomap infrastructure and removes the old buffer_head direct-io based > > implementation. The result is that ext4 is converted to the newer framework > > and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO. > > > > These changes have been tested using xfstests in both DAX and non-DAX modes > > using various configurations i.e. 4k, dioread_nolock, dax. > > I had some minor review comments posted on Patch-4. > But the rest of the patch series looks good to me. Thanks for the review, much appreciated! Also, apologies about any delayed response to your queries, I predominantly do all this work in my personal time. > I will also do some basic testing of xfstests which I did for my patches and > will revert back. Sounds good! > One query, could you please help answering below for my understanding :- > > I was under the assumption that we need to maintain > ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or > atomic_read(&EXT4_I(inode)->i_unwritten)) > in case of non-AIO directIO or AIO directIO case as well (when we may > allocate unwritten extents), > to protect with some kind of race with other parts of code(maybe > truncate/bufferedIO/fallocate not sure?) which may call for > ext4_can_extents_be_merged() > to check if extents can be merged or not. > > Is it not the case? > Now that directIO code has no way of specifying that this inode has > unwritten extent, will it not race with any other path, where this info was > necessary (like > in above func ext4_can_extents_be_merged())? Ah yes, I was under the same assumption when reviewing the code initially and one of my first solutions was to also use this dynamic 'state' flag in the ->end_io() handler. But, I fell flat on my face as that deemed to be problematic... This is because there can be multiple direct IOs to unwritten extents against the same inode, so you cannot possibly get away with tracking them using this single inode flag. So, hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks whether _this_ particular IO has an underlying unwritten extent.
On 8/13/19 4:40 PM, Matthew Bobrowski wrote: > On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote: >>> This patch series converts the ext4 direct IO code paths to make use of the >>> iomap infrastructure and removes the old buffer_head direct-io based >>> implementation. The result is that ext4 is converted to the newer framework >>> and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO. >>> >>> These changes have been tested using xfstests in both DAX and non-DAX modes >>> using various configurations i.e. 4k, dioread_nolock, dax. >> I had some minor review comments posted on Patch-4. >> But the rest of the patch series looks good to me. > Thanks for the review, much appreciated! Also, apologies about any > delayed response to your queries, I predominantly do all this work in > my personal time. Np at all. > >> I will also do some basic testing of xfstests which I did for my patches and >> will revert back. > Sounds good! I did not find any failure new failures in xfstests with 4K block size. Neither in basic fio DIO/AIO testing. So my basic testing looks good (these are mostly the tests which I was using for my debug/dev too) > >> One query, could you please help answering below for my understanding :- >> >> I was under the assumption that we need to maintain >> ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or >> atomic_read(&EXT4_I(inode)->i_unwritten)) >> in case of non-AIO directIO or AIO directIO case as well (when we may >> allocate unwritten extents), >> to protect with some kind of race with other parts of code(maybe >> truncate/bufferedIO/fallocate not sure?) which may call for >> ext4_can_extents_be_merged() >> to check if extents can be merged or not. >> >> Is it not the case? >> Now that directIO code has no way of specifying that this inode has >> unwritten extent, will it not race with any other path, where this info was >> necessary (like >> in above func ext4_can_extents_be_merged())? > Ah yes, I was under the same assumption when reviewing the code > initially and one of my first solutions was to also use this dynamic > 'state' flag in the ->end_io() handler. But, I fell flat on my face as > that deemed to be problematic... This is because there can be multiple > direct IOs to unwritten extents against the same inode, so you cannot > possibly get away with tracking them using this single inode flag. So, > hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use > IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks > whether _this_ particular IO has an underlying unwritten extent. Thanks for taking time to explain this. Yes, I do understand that part - i.e. while preparing block mapping in ->iomap_begin we get to know(from ext4_map_blocks) whether this is an unwritten extent and we add the flag IOMAP_DIO_UNWRITTEN to iomap. This is needed so that we can convert unwritten extents in ->end_io before we update the inode size and mark the inode dirty - to avoid any race with other AIO DIO or bufferedIO. But what I meant was this (I may be wrong here since I haven't really looked into it), but for my understanding I would like to discuss this - So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on whether a newextent can be merged with ex1 in function ext4_extents_can_be_merged. But now since we have removed that flag we have no way of knowing that whether this inode has any unwritten extents or not from any DIO path. What I meant is isn't this removal of setting/unsetting of flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func - ext4_extents_can_be_merged? Also - could you please explain why this check returns 0 in the first place (line 1762 - 1766 below)? 1733 int 1734 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, 1735 struct ext4_extent *ex2) <...> 1762 if (ext4_ext_is_unwritten(ex1) && 1763 (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) || 1764 atomic_read(&EXT4_I(inode)->i_unwritten) || 1765 (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN))) 1766 return 0; <...> Regards Ritesh
On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote: > On 8/13/19 4:40 PM, Matthew Bobrowski wrote: > > On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote: > > > I was under the assumption that we need to maintain > > > ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or > > > atomic_read(&EXT4_I(inode)->i_unwritten)) > > > in case of non-AIO directIO or AIO directIO case as well (when we may > > > allocate unwritten extents), > > > to protect with some kind of race with other parts of code(maybe > > > truncate/bufferedIO/fallocate not sure?) which may call for > > > ext4_can_extents_be_merged() > > > to check if extents can be merged or not. > > > > > > Is it not the case? > > > Now that directIO code has no way of specifying that this inode has > > > unwritten extent, will it not race with any other path, where this info was > > > necessary (like > > > in above func ext4_can_extents_be_merged())? > > Ah yes, I was under the same assumption when reviewing the code > > initially and one of my first solutions was to also use this dynamic > > 'state' flag in the ->end_io() handler. But, I fell flat on my face as > > that deemed to be problematic... This is because there can be multiple > > direct IOs to unwritten extents against the same inode, so you cannot > > possibly get away with tracking them using this single inode flag. So, > > hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use > > IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks > > whether _this_ particular IO has an underlying unwritten extent. > > Thanks for taking time to explain this. > > But what I meant was this (I may be wrong here since I haven't really looked > into it), > but for my understanding I would like to discuss this - > > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on > whether a newextent can be merged with ex1 in function > ext4_extents_can_be_merged. But now since we have removed that flag we have > no way of knowing that whether > this inode has any unwritten extents or not from any DIO path. > What I meant is isn't this removal of setting/unsetting of > flag(EXT4_STATE_DIO_UNWRITTEN) > changing the behavior of this func - ext4_extents_can_be_merged? Ah yes, I see. I believe that what you're saying is correct and I think we will need to account for this case. But, I haven't thought about how to do this just yet. > Also - could you please explain why this check returns 0 in the first place > (line 1762 - 1766 below)? I cannot explain why, because I myself don't know exactly why in this particular case the extents cannot be merged. Perhaps `git blame` is our friend and we can direct that question accordingly, or someone else on this mailing list knows the answer. :-) > 1733 int > 1734 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent > *ex1, > 1735 struct ext4_extent *ex2) > <...> > > 1762 if (ext4_ext_is_unwritten(ex1) && > 1763 (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) || > 1764 atomic_read(&EXT4_I(inode)->i_unwritten) || > 1765 (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN))) > 1766 return 0; --M
On 8/14/19 3:18 PM, Matthew Bobrowski wrote: > On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote: >> On 8/13/19 4:40 PM, Matthew Bobrowski wrote: >>> On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote: >>>> I was under the assumption that we need to maintain >>>> ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or >>>> atomic_read(&EXT4_I(inode)->i_unwritten)) >>>> in case of non-AIO directIO or AIO directIO case as well (when we may >>>> allocate unwritten extents), >>>> to protect with some kind of race with other parts of code(maybe >>>> truncate/bufferedIO/fallocate not sure?) which may call for >>>> ext4_can_extents_be_merged() >>>> to check if extents can be merged or not. >>>> >>>> Is it not the case? >>>> Now that directIO code has no way of specifying that this inode has >>>> unwritten extent, will it not race with any other path, where this info was >>>> necessary (like >>>> in above func ext4_can_extents_be_merged())? >>> Ah yes, I was under the same assumption when reviewing the code >>> initially and one of my first solutions was to also use this dynamic >>> 'state' flag in the ->end_io() handler. But, I fell flat on my face as >>> that deemed to be problematic... This is because there can be multiple >>> direct IOs to unwritten extents against the same inode, so you cannot >>> possibly get away with tracking them using this single inode flag. So, >>> hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use >>> IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks >>> whether _this_ particular IO has an underlying unwritten extent. >> Thanks for taking time to explain this. >> >> But what I meant was this (I may be wrong here since I haven't really looked >> into it), >> but for my understanding I would like to discuss this - >> >> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on >> whether a newextent can be merged with ex1 in function >> ext4_extents_can_be_merged. But now since we have removed that flag we have >> no way of knowing that whether >> this inode has any unwritten extents or not from any DIO path. >> What I meant is isn't this removal of setting/unsetting of >> flag(EXT4_STATE_DIO_UNWRITTEN) >> changing the behavior of this func - ext4_extents_can_be_merged? > Ah yes, I see. I believe that what you're saying is correct and I > think we will need to account for this case. But, I haven't thought > about how to do this just yet. That's not a problem, we can surely discuss the possible approaches. >> Also - could you please explain why this check returns 0 in the first place >> (line 1762 - 1766 below)? >> >> 1733 int >> 1734 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent >> *ex1, >> 1735 struct ext4_extent *ex2) >> <...> >> >> 1762 if (ext4_ext_is_unwritten(ex1) && >> 1763 (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) || >> 1764 atomic_read(&EXT4_I(inode)->i_unwritten) || >> 1765 (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN))) >> 1766 return 0; > I cannot explain why, because I myself don't know exactly why in this > particular case the extents cannot be merged. Perhaps `git blame` is > our friend and we can direct that question accordingly, or someone > else on this mailing list knows the answer. :-) git blame didn't help much. But I think I may know what the above condition means. So I think if there is an ongoing IO to an unwritten extent, we may sometimes first split that extent to match the exact range of the IO, then write data to it and then convert that *exact range* to written extent. So this means while there is an ongoing IO to any inode which has unwritten extents, we should not allow any other extent to merge with extents of this inode. Now one race which I could think of it is, when we are doing AIO DIO and in parallel doing fallocate to the end of the file. But since we wait for inode_dio_wait in fallocate, so that should not race with any DIO paths. I think, here we can wait to get answers from others to understand, if there is any race with any of the DIO paths on removing this state flag(EXT4_STATE_DIO_UNWRITTEN) & not setting "i_unwritten". Whether it can race with any other path and if this state flag is necessary(in DIO cases also) for the correct functionality? Regards Ritesh
On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote: > But what I meant was this (I may be wrong here since I haven't really looked > into it), but for my understanding I would like to discuss this - > > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on > whether a newextent can be merged with ex1 in function > ext4_extents_can_be_merged. But now since we have removed that flag we have > no way of knowing that whether this inode has any unwritten extents or not > from any DIO path. > > What I meant is isn't this removal of setting/unsetting of > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func - > ext4_extents_can_be_merged? OK, I'm stuck and looking for either clarity, revalidation of my thought process, or any input on how to solve this problem for that matter. In the current ext4 direct IO implementation, the dynamic state flag EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset for ext4_io_end->flag, and the value of i_unwritten is incremented/decremented for asynchronous direct IO writes. All mechanisms by which are used to track and determine whether the inode, or an IO in flight against a particular inode have any pending unwritten extents that need to be converted after the IO has completed. In addition to this, we have ext4_can_extents_be_merged() performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and i_unwritten and using them to determine whether it can or cannot merge a requested extent into an existing extent. This is all fine for the current direct IO implementation. However, while switching the direct IO code paths over to make use of the iomap infrastructure, I believe that we can no longer simply track whether an inode has unwritten extents needing to be converted by simply setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the inode. The reason being is that there can be multiple direct IO operations to unwritten extents running against the inode and we don't particularly distinguish synchronous from asynchronous writes within ext4_iomap_begin() as there's really no need. So, the only way to accurately determine whether extent conversion is deemed necessary for an IO operation whether it'd be synchronous or asynchronous is by checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io() callback. I'm certain that this portion of the logic is correct, but we're still left with some issues when it comes to the checks that I previously mentioned in ext4_can_extents_be_merged(), which is the part I need some input on. I was doing some thinking and I don't believe that making use of the EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not only for reasons that I've briefly mentioned above, but also because of the fact that it'll probably lead to a lot of inaccurate judgements while taking particular code paths and some really ugly code that creeps close to the definition of insanity. Rather, what if we solve this problem by continuing to just use i_unwritten to keep track of all the direct IOs to unwritten against running against an inode? Within ext4_iomap_begin() post successful creation of unwritten extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and subsequently within the ->end_io() callback whether we take the success or error path we'd call atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can still rely on this value to be used in the check within ext4_can_extents_be_merged(). Open for alternate suggestions if anyone has any... --M
On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote: > On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote: > > But what I meant was this (I may be wrong here since I haven't > > really looked into it), but for my understanding I would like to > > discuss this - > > > > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on > > whether a newextent can be merged with ex1 in function > > ext4_extents_can_be_merged. But now since we have removed that flag we have > > no way of knowing that whether this inode has any unwritten extents or not > > from any DIO path. > > > > What I meant is isn't this removal of setting/unsetting of > > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func - > > ext4_extents_can_be_merged? > > OK, I'm stuck and looking for either clarity, revalidation of my > thought process, or any input on how to solve this problem for that > matter. > > In the current ext4 direct IO implementation, the dynamic state flag > EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO > writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset > for ext4_io_end->flag, and the value of i_unwritten is > incremented/decremented for asynchronous direct IO writes. All > mechanisms by which are used to track and determine whether the inode, > or an IO in flight against a particular inode have any pending > unwritten extents that need to be converted after the IO has > completed. In addition to this, we have ext4_can_extents_be_merged() > performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and > i_unwritten and using them to determine whether it can or cannot merge > a requested extent into an existing extent. > > This is all fine for the current direct IO implementation. However, > while switching the direct IO code paths over to make use of the iomap > infrastructure, I believe that we can no longer simply track whether > an inode has unwritten extents needing to be converted by simply > setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the > inode. The reason being is that there can be multiple direct IO > operations to unwritten extents running against the inode and we don't > particularly distinguish synchronous from asynchronous writes within > ext4_iomap_begin() as there's really no need. So, the only way to > accurately determine whether extent conversion is deemed necessary for > an IO operation whether it'd be synchronous or asynchronous is by > checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io() > callback. I'm certain that this portion of the logic is correct, but > we're still left with some issues when it comes to the checks that I > previously mentioned in ext4_can_extents_be_merged(), which is the > part I need some input on. > > I was doing some thinking and I don't believe that making use of the > EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not > only for reasons that I've briefly mentioned above, but also because > of the fact that it'll probably lead to a lot of inaccurate judgements > while taking particular code paths and some really ugly code that > creeps close to the definition of insanity. Rather, what if we solve > this problem by continuing to just use i_unwritten to keep track of > all the direct IOs to unwritten against running against an inode? > Within ext4_iomap_begin() post successful creation of unwritten > extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and > subsequently within the ->end_io() callback whether we take the > success or error path we'd call > atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can > still rely on this value to be used in the check within > ext4_can_extents_be_merged(). Open for alternate suggestions if anyone > has any... Actually, no... I've done some more thinking and what I suggested above around the use of i_unwritten will also not work properly. Using iomap infrastructure, there is the possibility of calling into the ->iomap_begin() more than once for a single direct IO operation. This means that by the time we even get to decrementing i_unwritten in the ->end_io() callback after converting the unwritten extents we're already running the possibility of i_unwritten becoming unbalanced really quickly and staying that way. This also means that the statement checking i_unwritten in ext4_can_extents_be_merged() will be affected and potentially result in it being evaluated incorrectly. I was thinking that we could just decrement i_unwritten in ->iomap_end(), but that seems to me like it would be racy and also lead to incorrect results. At this point I'm out of ideas on how to solve this, so any other ideas would be appreciated! --M
On 8/22/19 5:30 PM, Matthew Bobrowski wrote: > On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote: >> On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote: >>> But what I meant was this (I may be wrong here since I haven't >>> really looked into it), but for my understanding I would like to >>> discuss this - >>> >>> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on >>> whether a newextent can be merged with ex1 in function >>> ext4_extents_can_be_merged. But now since we have removed that flag we have >>> no way of knowing that whether this inode has any unwritten extents or not >>> from any DIO path. >>> >>> What I meant is isn't this removal of setting/unsetting of >>> flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func - >>> ext4_extents_can_be_merged? >> >> OK, I'm stuck and looking for either clarity, revalidation of my >> thought process, or any input on how to solve this problem for that >> matter. >> >> In the current ext4 direct IO implementation, the dynamic state flag >> EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO >> writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset >> for ext4_io_end->flag, and the value of i_unwritten is >> incremented/decremented for asynchronous direct IO writes. All >> mechanisms by which are used to track and determine whether the inode, >> or an IO in flight against a particular inode have any pending >> unwritten extents that need to be converted after the IO has >> completed. In addition to this, we have ext4_can_extents_be_merged() >> performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and >> i_unwritten and using them to determine whether it can or cannot merge >> a requested extent into an existing extent. >> >> This is all fine for the current direct IO implementation. However, >> while switching the direct IO code paths over to make use of the iomap >> infrastructure, I believe that we can no longer simply track whether >> an inode has unwritten extents needing to be converted by simply >> setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the >> inode. The reason being is that there can be multiple direct IO >> operations to unwritten extents running against the inode and we don't >> particularly distinguish synchronous from asynchronous writes within >> ext4_iomap_begin() as there's really no need. So, the only way to >> accurately determine whether extent conversion is deemed necessary for >> an IO operation whether it'd be synchronous or asynchronous is by >> checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io() >> callback. I'm certain that this portion of the logic is correct, but >> we're still left with some issues when it comes to the checks that I >> previously mentioned in ext4_can_extents_be_merged(), which is the >> part I need some input on. >> >> I was doing some thinking and I don't believe that making use of the >> EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not >> only for reasons that I've briefly mentioned above, but also because >> of the fact that it'll probably lead to a lot of inaccurate judgements >> while taking particular code paths and some really ugly code that >> creeps close to the definition of insanity. Rather, what if we solve >> this problem by continuing to just use i_unwritten to keep track of >> all the direct IOs to unwritten against running against an inode? >> Within ext4_iomap_begin() post successful creation of unwritten >> extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and >> subsequently within the ->end_io() callback whether we take the >> success or error path we'd call >> atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can >> still rely on this value to be used in the check within >> ext4_can_extents_be_merged(). Open for alternate suggestions if anyone >> has any... > > Actually, no... > > I've done some more thinking and what I suggested above around the use > of i_unwritten will also not work properly. Using iomap > infrastructure, there is the possibility of calling into the > ->iomap_begin() more than once for a single direct IO operation. This > means that by the time we even get to decrementing i_unwritten in the > ->end_io() callback after converting the unwritten extents we're > already running the possibility of i_unwritten becoming unbalanced > really quickly and staying that way. This also means that the > statement checking i_unwritten in ext4_can_extents_be_merged() will be > affected and potentially result in it being evaluated incorrectly. I > was thinking that we could just decrement i_unwritten in > ->iomap_end(), but that seems to me like it would be racy and also > lead to incorrect results. At this point I'm out of ideas on how to > solve this, so any other ideas would be appreciated! I will let others also comment, if someone has any other better approach. 1. One approach is to add the infrastructure in iomap with iomap_dio->private which is filesystem specific pointer. This can be updated by filesystem in ->iomap_begin call into iomap->private. And in case of iomap_dio_rw, this iomap->private will be copied to iomap_dio->private if not already set. But I think this will eventually become hacky in the sense when you will have to determine on whether the dio->private is already set or not when iomap_apply will be called second time. It will become a problem with AIO DIO in ext4 since we use i_unwritten which tells us whether there is any unwritten extent but it does not tell whether this unwritten extent is due to a DIRECT AIO DIO in progress or a buffered one. So we can ignore this approach - unless you or someone else have some good design ideas to build on top of above. 2. Second approach which I was thinking is to track only those extents which are marked unwritten and are under IO. This can be done in ext4_map_blocks. This way we will not have to track a particular inode has any unwritten extents or not, but it will be extent based. Something similar was also done a while ago. Do you think this approach will work in our case? So with this extents will be scanned in extent status tree to see if any among those are under IO and are unwritten in func ext4_can_extents_be_merged. https://patchwork.ozlabs.org/patch/1013837/ -Ritesh
On Thu, Aug 22, 2019 at 07:41:25PM +0530, Ritesh Harjani wrote: > > > On 8/22/19 5:30 PM, Matthew Bobrowski wrote: > > On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote: > > > On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote: > > > > But what I meant was this (I may be wrong here since I haven't > > > > really looked into it), but for my understanding I would like to > > > > discuss this - > > > > > > > > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on > > > > whether a newextent can be merged with ex1 in function > > > > ext4_extents_can_be_merged. But now since we have removed that flag we have > > > > no way of knowing that whether this inode has any unwritten extents or not > > > > from any DIO path. > > > > > > > > What I meant is isn't this removal of setting/unsetting of > > > > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func - > > > > ext4_extents_can_be_merged? > > > > > > OK, I'm stuck and looking for either clarity, revalidation of my > > > thought process, or any input on how to solve this problem for that > > > matter. > > > > > > In the current ext4 direct IO implementation, the dynamic state flag > > > EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO > > > writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset > > > for ext4_io_end->flag, and the value of i_unwritten is > > > incremented/decremented for asynchronous direct IO writes. All > > > mechanisms by which are used to track and determine whether the inode, > > > or an IO in flight against a particular inode have any pending > > > unwritten extents that need to be converted after the IO has > > > completed. In addition to this, we have ext4_can_extents_be_merged() > > > performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and > > > i_unwritten and using them to determine whether it can or cannot merge > > > a requested extent into an existing extent. > > > > > > This is all fine for the current direct IO implementation. However, > > > while switching the direct IO code paths over to make use of the iomap > > > infrastructure, I believe that we can no longer simply track whether > > > an inode has unwritten extents needing to be converted by simply > > > setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the > > > inode. The reason being is that there can be multiple direct IO > > > operations to unwritten extents running against the inode and we don't > > > particularly distinguish synchronous from asynchronous writes within > > > ext4_iomap_begin() as there's really no need. So, the only way to > > > accurately determine whether extent conversion is deemed necessary for > > > an IO operation whether it'd be synchronous or asynchronous is by > > > checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io() > > > callback. I'm certain that this portion of the logic is correct, but > > > we're still left with some issues when it comes to the checks that I > > > previously mentioned in ext4_can_extents_be_merged(), which is the > > > part I need some input on. > > > > > > I was doing some thinking and I don't believe that making use of the > > > EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not > > > only for reasons that I've briefly mentioned above, but also because > > > of the fact that it'll probably lead to a lot of inaccurate judgements > > > while taking particular code paths and some really ugly code that > > > creeps close to the definition of insanity. Rather, what if we solve > > > this problem by continuing to just use i_unwritten to keep track of > > > all the direct IOs to unwritten against running against an inode? > > > Within ext4_iomap_begin() post successful creation of unwritten > > > extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and > > > subsequently within the ->end_io() callback whether we take the > > > success or error path we'd call > > > atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can > > > still rely on this value to be used in the check within > > > ext4_can_extents_be_merged(). Open for alternate suggestions if anyone > > > has any... > > > > Actually, no... > > > > I've done some more thinking and what I suggested above around the use > > of i_unwritten will also not work properly. Using iomap > > infrastructure, there is the possibility of calling into the > > ->iomap_begin() more than once for a single direct IO operation. This > > means that by the time we even get to decrementing i_unwritten in the > > ->end_io() callback after converting the unwritten extents we're > > already running the possibility of i_unwritten becoming unbalanced > > really quickly and staying that way. This also means that the > > statement checking i_unwritten in ext4_can_extents_be_merged() will be > > affected and potentially result in it being evaluated incorrectly. I > > was thinking that we could just decrement i_unwritten in > > ->iomap_end(), but that seems to me like it would be racy and also > > lead to incorrect results. At this point I'm out of ideas on how to > > solve this, so any other ideas would be appreciated! > > I will let others also comment, if someone has any other better approach. > > 1. One approach is to add the infrastructure in iomap with > iomap_dio->private which is filesystem specific pointer. This can be > updated by filesystem in ->iomap_begin call into iomap->private. > And in case of iomap_dio_rw, this iomap->private will be copied to > iomap_dio->private if not already set. > > But I think this will eventually become hacky in the sense when you will > have to determine on whether the dio->private is already set or not when > iomap_apply will be called second time. It will become a problem with AIO > DIO in ext4 since we use i_unwritten which tells us whether there is any > unwritten extent but it does not tell whether this unwritten extent is due > to a DIRECT AIO DIO in progress or a buffered one. > > So we can ignore this approach - unless you or someone else have some good > design ideas to build on top of above. I'm not sure whether _this_ is the solution or not, so let's maybe wait for others to comment. One thing that I and probably the iomap maintainers would like to avoid is adding any special case code to iomap infrastructure, if possible. I mean, from what you suggest it seems to be rather generic and not overly intrusive, although I know for a fact that iomap infrastructure exists because stuff like buffer_heads and the old direct IO code ended up accommodating so many different edge cases making it almost unmodifiable and unmaintainable. > 2. Second approach which I was thinking is to track only those extents which > are marked unwritten and are under IO. This can be done in ext4_map_blocks. > This way we will not have to track a particular inode has any unwritten > extents or not, but it will be extent based. > Something similar was also done a while ago. Do you think this approach will > work in our case? > > So with this extents will be scanned in extent status tree to see if any > among those are under IO and are unwritten in func > ext4_can_extents_be_merged. > > https://patchwork.ozlabs.org/patch/1013837/ Maybe this would be a better approach and I think that it'd work. Based upon what I read within in that thread there weren't really any objections to the idea, although I can't see that it made it upstream, so I may be missing something? --M
On Sat, Aug 24, 2019 at 01:18:31PM +1000, Matthew Bobrowski wrote: > On Thu, Aug 22, 2019 at 07:41:25PM +0530, Ritesh Harjani wrote: > > > > > > On 8/22/19 5:30 PM, Matthew Bobrowski wrote: > > > On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote: > > > > On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote: > > > > > But what I meant was this (I may be wrong here since I haven't > > > > > really looked into it), but for my understanding I would like to > > > > > discuss this - > > > > > > > > > > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on > > > > > whether a newextent can be merged with ex1 in function > > > > > ext4_extents_can_be_merged. But now since we have removed that flag we have > > > > > no way of knowing that whether this inode has any unwritten extents or not > > > > > from any DIO path. > > > > > > > > > > What I meant is isn't this removal of setting/unsetting of > > > > > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func - > > > > > ext4_extents_can_be_merged? > > > > > > > > OK, I'm stuck and looking for either clarity, revalidation of my > > > > thought process, or any input on how to solve this problem for that > > > > matter. > > > > > > > > In the current ext4 direct IO implementation, the dynamic state flag > > > > EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO > > > > writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset > > > > for ext4_io_end->flag, and the value of i_unwritten is > > > > incremented/decremented for asynchronous direct IO writes. All > > > > mechanisms by which are used to track and determine whether the inode, > > > > or an IO in flight against a particular inode have any pending > > > > unwritten extents that need to be converted after the IO has > > > > completed. In addition to this, we have ext4_can_extents_be_merged() > > > > performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and > > > > i_unwritten and using them to determine whether it can or cannot merge > > > > a requested extent into an existing extent. > > > > > > > > This is all fine for the current direct IO implementation. However, > > > > while switching the direct IO code paths over to make use of the iomap > > > > infrastructure, I believe that we can no longer simply track whether > > > > an inode has unwritten extents needing to be converted by simply > > > > setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the > > > > inode. The reason being is that there can be multiple direct IO > > > > operations to unwritten extents running against the inode and we don't > > > > particularly distinguish synchronous from asynchronous writes within > > > > ext4_iomap_begin() as there's really no need. So, the only way to > > > > accurately determine whether extent conversion is deemed necessary for > > > > an IO operation whether it'd be synchronous or asynchronous is by > > > > checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io() > > > > callback. I'm certain that this portion of the logic is correct, but > > > > we're still left with some issues when it comes to the checks that I > > > > previously mentioned in ext4_can_extents_be_merged(), which is the > > > > part I need some input on. > > > > > > > > I was doing some thinking and I don't believe that making use of the > > > > EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not > > > > only for reasons that I've briefly mentioned above, but also because > > > > of the fact that it'll probably lead to a lot of inaccurate judgements > > > > while taking particular code paths and some really ugly code that > > > > creeps close to the definition of insanity. Rather, what if we solve > > > > this problem by continuing to just use i_unwritten to keep track of > > > > all the direct IOs to unwritten against running against an inode? > > > > Within ext4_iomap_begin() post successful creation of unwritten > > > > extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and > > > > subsequently within the ->end_io() callback whether we take the > > > > success or error path we'd call > > > > atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can > > > > still rely on this value to be used in the check within > > > > ext4_can_extents_be_merged(). Open for alternate suggestions if anyone > > > > has any... > > > > > > Actually, no... > > > > > > I've done some more thinking and what I suggested above around the use > > > of i_unwritten will also not work properly. Using iomap > > > infrastructure, there is the possibility of calling into the > > > ->iomap_begin() more than once for a single direct IO operation. This > > > means that by the time we even get to decrementing i_unwritten in the > > > ->end_io() callback after converting the unwritten extents we're > > > already running the possibility of i_unwritten becoming unbalanced > > > really quickly and staying that way. This also means that the > > > statement checking i_unwritten in ext4_can_extents_be_merged() will be > > > affected and potentially result in it being evaluated incorrectly. I > > > was thinking that we could just decrement i_unwritten in > > > ->iomap_end(), but that seems to me like it would be racy and also > > > lead to incorrect results. At this point I'm out of ideas on how to > > > solve this, so any other ideas would be appreciated! > > > > I will let others also comment, if someone has any other better approach. > > > > 1. One approach is to add the infrastructure in iomap with > > iomap_dio->private which is filesystem specific pointer. This can be > > updated by filesystem in ->iomap_begin call into iomap->private. > > And in case of iomap_dio_rw, this iomap->private will be copied to > > iomap_dio->private if not already set. > > > > But I think this will eventually become hacky in the sense when you will > > have to determine on whether the dio->private is already set or not when > > iomap_apply will be called second time. It will become a problem with AIO > > DIO in ext4 since we use i_unwritten which tells us whether there is any > > unwritten extent but it does not tell whether this unwritten extent is due > > to a DIRECT AIO DIO in progress or a buffered one. > > > > So we can ignore this approach - unless you or someone else have some good > > design ideas to build on top of above. > > I'm not sure whether _this_ is the solution or not, so let's maybe > wait for others to comment. One thing that I and probably the iomap > maintainers would like to avoid is adding any special case code to > iomap infrastructure, if possible. I mean, from what you suggest it > seems to be rather generic and not overly intrusive, although I know > for a fact that iomap infrastructure exists because stuff like > buffer_heads and the old direct IO code ended up accommodating so many > different edge cases making it almost unmodifiable and unmaintainable. I'm probably misunderstanding the ext4 extent cache horribly, but I keep wondering why any of this is necessary -- why can't ext4 track the unwritten status in the extent records directly? And why is there all this strange "can merge" logic? If you need to convert blocks X to Y to written state because a write to those blocks completed, isn't that just manipulation of a bunch of incore records? And can't you just seek back and forth in the extent cache to look for adjacent records to merge with? <confuseD> (I'd really prefer not to go adding private fields all over the place...) --D > > 2. Second approach which I was thinking is to track only those extents which > > are marked unwritten and are under IO. This can be done in ext4_map_blocks. > > This way we will not have to track a particular inode has any unwritten > > extents or not, but it will be extent based. > > Something similar was also done a while ago. Do you think this approach will > > work in our case? > > > > So with this extents will be scanned in extent status tree to see if any > > among those are under IO and are unwritten in func > > ext4_can_extents_be_merged. > > > > https://patchwork.ozlabs.org/patch/1013837/ > > Maybe this would be a better approach and I think that it'd > work. Based upon what I read within in that thread there weren't > really any objections to the idea, although I can't see that it made > it upstream, so I may be missing something? > > --M
On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote: > I'm probably misunderstanding the ext4 extent cache horribly, but I keep > wondering why any of this is necessary -- why can't ext4 track the > unwritten status in the extent records directly? And why is there all > this strange "can merge" logic? If you need to convert blocks X to Y > to written state because a write to those blocks completed, isn't that > just manipulation of a bunch of incore records? And can't you just seek > back and forth in the extent cache to look for adjacent records to merge > with? <confuseD> Same here. I'm not an ext4 expert, but here is what we do in XFS, which hopefully works in some form for ext4 a well: - when starting a direct I/O we allocate any needed blocks and do so as unwritten extent. The extent tree code will merge them in whatever way that seems suitable - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we call a function that walks the whole range covered by the ioend, and convert any unwritten extent to a normal written extent. Any splitting and merging will be done as needed by the low-level extent tree code - this also means we don't need the xfs_ioen structure (which ext4) copied from for direct I/O at all (we used to have it initially, though including the time when ext4 copied this code). - we don't need the equivalent to the ext4_unwritten_wait call in ext4_file_write_iter because we serialize any non-aligned I/O instead of trying to optimize for weird corner cases > (I'd really prefer not to go adding private fields all over the > place...) Agreed.
On Sat, Aug 24, 2019 at 04:04:27PM -0700, Christoph Hellwig wrote: > On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote: > > I'm probably misunderstanding the ext4 extent cache horribly, but I keep > > wondering why any of this is necessary -- why can't ext4 track the > > unwritten status in the extent records directly? And why is there all > > this strange "can merge" logic? If you need to convert blocks X to Y > > to written state because a write to those blocks completed, isn't that > > just manipulation of a bunch of incore records? And can't you just seek > > back and forth in the extent cache to look for adjacent records to merge > > with? <confuseD> > > Same here. I'm not an ext4 expert, but here is what we do in XFS, which > hopefully works in some form for ext4 a well: > > - when starting a direct I/O we allocate any needed blocks and do so > as unwritten extent. The extent tree code will merge them in > whatever way that seems suitable > - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we > call a function that walks the whole range covered by the ioend, > and convert any unwritten extent to a normal written extent. Any > splitting and merging will be done as needed by the low-level > extent tree code > - this also means we don't need the xfs_ioen structure (which ext4) > copied from for direct I/O at all (we used to have it initially, > though including the time when ext4 copied this code). > - we don't need the equivalent to the ext4_unwritten_wait call in > ext4_file_write_iter because we serialize any non-aligned I/O > instead of trying to optimize for weird corner cases Yeah, so what you've detailed above is essentially the approach I've taken in my patch series... What is not clear to me at this point though is whether it is still necessary to explicitly track unwritten extents via in-core inode attributes i.e. ->i_unwritten and ->i_state_flags under the new direct IO code path implementation, which makes use of the iomap infrastructure. Or, whether we can get away with simply not using these in-core inode attributes and rely just on checks against the extent record directly, as breifly mentioned by Darrick. I would think that this type of check would be enough, however the checks around whether the inode is currently undergoing direct IO were implemented at some point, so there must be a reason for having them (a9b8241594add)? --M
On Tue, Aug 27, 2019 at 07:52:23PM +1000, Matthew Bobrowski wrote: > On Sat, Aug 24, 2019 at 04:04:27PM -0700, Christoph Hellwig wrote: > > On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote: > > > I'm probably misunderstanding the ext4 extent cache horribly, but I keep > > > wondering why any of this is necessary -- why can't ext4 track the > > > unwritten status in the extent records directly? And why is there all > > > this strange "can merge" logic? If you need to convert blocks X to Y > > > to written state because a write to those blocks completed, isn't that > > > just manipulation of a bunch of incore records? And can't you just seek > > > back and forth in the extent cache to look for adjacent records to merge > > > with? <confuseD> > > > > Same here. I'm not an ext4 expert, but here is what we do in XFS, which > > hopefully works in some form for ext4 a well: > > > > - when starting a direct I/O we allocate any needed blocks and do so > > as unwritten extent. The extent tree code will merge them in > > whatever way that seems suitable > > - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we > > call a function that walks the whole range covered by the ioend, > > and convert any unwritten extent to a normal written extent. Any > > splitting and merging will be done as needed by the low-level > > extent tree code > > - this also means we don't need the xfs_ioen structure (which ext4) > > copied from for direct I/O at all (we used to have it initially, > > though including the time when ext4 copied this code). > > - we don't need the equivalent to the ext4_unwritten_wait call in > > ext4_file_write_iter because we serialize any non-aligned I/O > > instead of trying to optimize for weird corner cases > > Yeah, so what you've detailed above is essentially the approach I've > taken in my patch series... > > What is not clear to me at this point though is whether it is still > necessary to explicitly track unwritten extents via in-core inode > attributes i.e. ->i_unwritten and ->i_state_flags under the new direct > IO code path implementation, which makes use of the iomap > infrastructure. Or, whether we can get away with simply not using > these in-core inode attributes and rely just on checks against the > extent record directly, as breifly mentioned by Darrick. I would think > that this type of check would be enough, however the checks around > whether the inode is currently undergoing direct IO were implemented > at some point, so there must be a reason for having them > (a9b8241594add)? Maybe it's a silly question, although I'm wanting to clarify my understanding around why it is that when we either try prepend or append to an existing extent, we don't permit merging of extents if the inode is undergoing direct IO? --M
On Wed, Aug 28, 2019 at 10:05:11PM +1000, Matthew Bobrowski wrote: > > What is not clear to me at this point though is whether it is still > > necessary to explicitly track unwritten extents via in-core inode > > attributes i.e. ->i_unwritten and ->i_state_flags under the new direct > > IO code path implementation, which makes use of the iomap > > infrastructure. Or, whether we can get away with simply not using > > these in-core inode attributes and rely just on checks against the > > extent record directly, as breifly mentioned by Darrick. I would think > > that this type of check would be enough, however the checks around > > whether the inode is currently undergoing direct IO were implemented > > at some point, so there must be a reason for having them > > (a9b8241594add)? The original reason why we created the DIO_STATE_UNWRITTEN flag was a fast path, where the common case is writing blocks to an existing location in a file where the blocks are already allocated, and marked as written. So consulting the on-disk extent tree to determine whether unwritten extents need to be converted and/or split is certainly doable. However, it's expensive for the common case. So having a hint whether we need to schedule a workqueue to possibly convert an unwritten region is helpful. If we can just free the bio and exit the I/O completion handler without having to take shared locks to examine the on-disk extent tree, so much the better. > Maybe it's a silly question, although I'm wanting to clarify my > understanding around why it is that when we either try prepend or > append to an existing extent, we don't permit merging of extents if If I recall correctly, the reason for this check was mainly the concern that we would end up merging an extent that we would then have to split later on (when the direct I/O completed). To be honest, i'm not 100% sure what would happen if we removed that restriction; it might be that things would work just fine (just slower in some workloads), or whether there is some hidden dependency that would explode. I suspect we'd have to try the experiment to be sure. - Ted
On Wed 28-08-19 10:27:29, Theodore Y. Ts'o wrote: > On Wed, Aug 28, 2019 at 10:05:11PM +1000, Matthew Bobrowski wrote: > > > What is not clear to me at this point though is whether it is still > > > necessary to explicitly track unwritten extents via in-core inode > > > attributes i.e. ->i_unwritten and ->i_state_flags under the new direct > > > IO code path implementation, which makes use of the iomap > > > infrastructure. Or, whether we can get away with simply not using > > > these in-core inode attributes and rely just on checks against the > > > extent record directly, as breifly mentioned by Darrick. I would think > > > that this type of check would be enough, however the checks around > > > whether the inode is currently undergoing direct IO were implemented > > > at some point, so there must be a reason for having them > > > (a9b8241594add)? > > The original reason why we created the DIO_STATE_UNWRITTEN flag was a > fast path, where the common case is writing blocks to an existing > location in a file where the blocks are already allocated, and marked > as written. So consulting the on-disk extent tree to determine > whether unwritten extents need to be converted and/or split is > certainly doable. However, it's expensive for the common case. So > having a hint whether we need to schedule a workqueue to possibly > convert an unwritten region is helpful. If we can just free the bio > and exit the I/O completion handler without having to take shared > locks to examine the on-disk extent tree, so much the better. Yes, but for determining whether extent conversion on IO completion is needed we now use IOMAP_DIO_UNWRITTEN flag iomap infrastructure provides to us. So we don't have to track this internally in ext4 anymore. > > Maybe it's a silly question, although I'm wanting to clarify my > > understanding around why it is that when we either try prepend or > > append to an existing extent, we don't permit merging of extents if > > If I recall correctly, the reason for this check was mainly the > concern that we would end up merging an extent that we would then have > to split later on (when the direct I/O completed). > > To be honest, i'm not 100% sure what would happen if we removed that > restriction; it might be that things would work just fine (just slower > in some workloads), or whether there is some hidden dependency that > would explode. I suspect we'd have to try the experiment to be sure. As far as I remember the concern was that extent split may need block allocation and we may not have enough free blocks to do it. These days we have some blocks reserved in the filesystem to accomodate unexpected extent splits so this shouldn't happen anymore so the only real concern is the wasted performance due to unnecessary extent merge & split. Kind of a stress test for this would be to fire of lots of sequential AIO DIO requests against a hole in a file. Honza
On Wed, Aug 28, 2019 at 08:02:15PM +0200, Jan Kara wrote: > > The original reason why we created the DIO_STATE_UNWRITTEN flag was a > > fast path, where the common case is writing blocks to an existing > > location in a file where the blocks are already allocated, and marked > > as written. So consulting the on-disk extent tree to determine > > whether unwritten extents need to be converted and/or split is > > certainly doable. However, it's expensive for the common case. So > > having a hint whether we need to schedule a workqueue to possibly > > convert an unwritten region is helpful. If we can just free the bio > > and exit the I/O completion handler without having to take shared > > locks to examine the on-disk extent tree, so much the better. > > Yes, but for determining whether extent conversion on IO completion is > needed we now use IOMAP_DIO_UNWRITTEN flag iomap infrastructure provides to > us. So we don't have to track this internally in ext4 anymore. Exactly. As mentioned before the ioend to track unwritten thing was in XFS by the time ext4 copied the ioend approach. but we actually got rid of that long before the iomap conversion. Maybe to make everything easier to understand and bisect you might want to get rid of the ioend for direct I/O in ext4 as a prep path as well. The relevant commit is: 273dda76f757 ("xfs: don't use ioends for direct write completions") > > To be honest, i'm not 100% sure what would happen if we removed that > > restriction; it might be that things would work just fine (just slower > > in some workloads), or whether there is some hidden dependency that > > would explode. I suspect we'd have to try the experiment to be sure. > > As far as I remember the concern was that extent split may need block > allocation and we may not have enough free blocks to do it. These days we > have some blocks reserved in the filesystem to accomodate unexpected extent > splits so this shouldn't happen anymore so the only real concern is the > wasted performance due to unnecessary extent merge & split. Kind of a > stress test for this would be to fire of lots of sequential AIO DIO > requests against a hole in a file. Well, you can always add a don't merge flag to the actual allocation. You might still get a merge for pathological case (fallocate adjacent to a dio write just submitted), but if the merging is such a performance over head here is easy ways to avoid it for the common case.
Awesome, and thank you *all* for your very valueable input. On Wed, Aug 28, 2019 at 11:36:08PM -0700, Christoph Hellwig wrote: > On Wed, Aug 28, 2019 at 08:02:15PM +0200, Jan Kara wrote: > > > The original reason why we created the DIO_STATE_UNWRITTEN flag was a > > > fast path, where the common case is writing blocks to an existing > > > location in a file where the blocks are already allocated, and marked > > > as written. So consulting the on-disk extent tree to determine > > > whether unwritten extents need to be converted and/or split is > > > certainly doable. However, it's expensive for the common case. So > > > having a hint whether we need to schedule a workqueue to possibly > > > convert an unwritten region is helpful. If we can just free the bio > > > and exit the I/O completion handler without having to take shared > > > locks to examine the on-disk extent tree, so much the better. > > > > Yes, but for determining whether extent conversion on IO completion is > > needed we now use IOMAP_DIO_UNWRITTEN flag iomap infrastructure provides to > > us. So we don't have to track this internally in ext4 anymore. > > Exactly. As mentioned before the ioend to track unwritten thing was > in XFS by the time ext4 copied the ioend approach. but we actually got > rid of that long before the iomap conversion. Maybe to make everything > easier to understand and bisect you might want to get rid of the ioend > for direct I/O in ext4 as a prep path as well. > > The relevant commit is: 273dda76f757 ("xfs: don't use ioends for direct > write completions") Uh ha! So, we conclude that there's no need to muck around with hairy ioend's, or the need to denote whether there's unwritten extents held against the inode using tricky state flag for that matter. > > > To be honest, i'm not 100% sure what would happen if we removed that > > > restriction; it might be that things would work just fine (just slower > > > in some workloads), or whether there is some hidden dependency that > > > would explode. I suspect we'd have to try the experiment to be sure. > > > > As far as I remember the concern was that extent split may need block > > allocation and we may not have enough free blocks to do it. These days we > > have some blocks reserved in the filesystem to accomodate unexpected extent > > splits so this shouldn't happen anymore so the only real concern is the > > wasted performance due to unnecessary extent merge & split. Kind of a > > stress test for this would be to fire of lots of sequential AIO DIO > > requests against a hole in a file. > > Well, you can always add a don't merge flag to the actual allocation. > You might still get a merge for pathological case (fallocate adjacent > to a dio write just submitted), but if the merging is such a performance > over head here is easy ways to avoid it for the common case. After I've posted through the next version of this patch series, I will attempt to perform some stress testing to see what the performance hit could potentially be.
On Thu, Aug 29, 2019 at 09:20:50PM +1000, Matthew Bobrowski wrote: > Uh ha! So, we conclude that there's no need to muck around with hairy > ioend's, or the need to denote whether there's unwritten extents held > against the inode using tricky state flag for that matter. So in XFS we never had the i_unwritten counter, that is something purely ext4 specific, and I can't really help with that unfortunately, but maybe the people who implemented it might be able to help on that one.