Message ID | 20210405145903.629152-2-bfoster@redhat.com (mailing list archive) |
---|---|
State | Superseded, archived |
Series | xfs: ioend batching log reservation deadlock |
> @@ -182,12 +155,10 @@ xfs_end_ioend(
>  		error = xfs_reflink_end_cow(ip, offset, size);
>  	else if (ioend->io_type == IOMAP_UNWRITTEN)
>  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> -	else
> -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);

I first thought we'd now call xfs_setfilesize for unwritten extents as
well, but as those have already updated di_size we are fine here.

As a future enhancement it would be useful to let xfs_reflink_end_cow
update the file size similar to what we do for the unwritten case.

>  done:
> -	if (ioend->io_private)
> -		error = xfs_setfilesize_ioend(ioend, error);
> +	if (!error && xfs_ioend_is_append(ioend))
> +		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
>  	iomap_finish_ioends(ioend, error);

The done label can move after the call to xfs_setfilesize now.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
On Wed, Apr 07, 2021 at 07:33:21AM +0100, Christoph Hellwig wrote:
> > @@ -182,12 +155,10 @@ xfs_end_ioend(
> >  		error = xfs_reflink_end_cow(ip, offset, size);
> >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > -	else
> > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> 
> I first thought we'd now call xfs_setfilesize for unwritten extents
> as well, but as those have already updated di_size we are fine here.
> 
> As a future enhancement it would be useful to let xfs_reflink_end_cow
> update the file size similar to what we do for the unwritten case.
> 

Agreed. I noticed that when first passing through the code but didn't
want to get into further functional changes.

> > done:
> > -	if (ioend->io_private)
> > -		error = xfs_setfilesize_ioend(ioend, error);
> > +	if (!error && xfs_ioend_is_append(ioend))
> > +		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
> >  	iomap_finish_ioends(ioend, error);
> 
> The done label can move after the call to xfs_setfilesize now.
> 

Fixed.

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 

Thanks.

Brian
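For readers following the label discussion above, the tail of xfs_end_ioend()
with this patch applied and the "done" label moved below the size update would
read roughly as follows. This is only an illustrative sketch assembled from the
hunks quoted in this thread; the leading error check and the
xfs_reflink_cancel_cow_range() call are recalled from the surrounding function
of that era and may differ in detail from what was ultimately merged:

	/* I/O errors bypass conversion and the on-disk size update entirely. */
	error = blk_status_to_errno(ioend->io_bio->bi_status);
	if (unlikely(error)) {
		if (ioend->io_flags & IOMAP_F_SHARED)
			xfs_reflink_cancel_cow_range(ip, offset, size, true);
		goto done;
	}

	if (ioend->io_flags & IOMAP_F_SHARED)
		error = xfs_reflink_end_cow(ip, offset, size);
	else if (ioend->io_type == IOMAP_UNWRITTEN)
		error = xfs_iomap_write_unwritten(ip, offset, size, false);

	/* Update the on-disk size only after a successful conversion/remap. */
	if (!error && xfs_ioend_is_append(ioend))
		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
done:
	iomap_finish_ioends(ioend, error);
	memalloc_nofs_restore(nofs_flag);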
On Mon, Apr 05, 2021 at 10:59:00AM -0400, Brian Foster wrote:
> Per-inode ioend completion batching has a log reservation deadlock
> vector between preallocated append transactions and transactions
> that are acquired at completion time for other purposes (i.e.,
> unwritten extent conversion or COW fork remaps). For example, if the
> ioend completion workqueue task executes on a batch of ioends that
> are sorted such that an append ioend sits at the tail, it's possible
> for the outstanding append transaction reservation to block
> allocation of transactions required to process preceding ioends in
> the list.
> 
> Append ioend completion is historically the common path for on-disk
> inode size updates. While file extending writes may have completed
> sometime earlier, the on-disk inode size is only updated after
> successful writeback completion. These transactions are preallocated
> serially from writeback context to mitigate concurrency and
> associated log reservation pressure across completions processed by
> multi-threaded workqueue tasks.
> 
> However, now that delalloc blocks unconditionally map to unwritten
> extents at physical block allocation time, size updates via append
> ioends are relatively rare. This means that inode size updates most
> commonly occur as part of the preexisting completion time
> transaction to convert unwritten extents. As a result, there is no
> longer a strong need to preallocate size update transactions.
> 
> Remove the preallocation of inode size update transactions to avoid
> the ioend completion processing log reservation deadlock. Instead,
> continue to send all potential size extending ioends to workqueue
> context for completion and allocate the transaction from that
> context. This ensures that no outstanding log reservation is owned
> by the ioend completion worker task when it begins to process
> ioends.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 45 +++------------------------------------------
>  1 file changed, 3 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 1cc7c36d98e9..c1951975bd6a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -39,33 +39,6 @@ static inline bool xfs_ioend_is_append(struct iomap_ioend *ioend)
>  		XFS_I(ioend->io_inode)->i_d.di_size;
>  }
>  
> -STATIC int
> -xfs_setfilesize_trans_alloc(
> -	struct iomap_ioend *ioend)
> -{
> -	struct xfs_mount *mp = XFS_I(ioend->io_inode)->i_mount;
> -	struct xfs_trans *tp;
> -	int error;
> -
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
> -	if (error)
> -		return error;
> -
> -	ioend->io_private = tp;
> -
> -	/*
> -	 * We may pass freeze protection with a transaction. So tell lockdep
> -	 * we released it.
> -	 */
> -	__sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
> -	/*
> -	 * We hand off the transaction to the completion thread now, so
> -	 * clear the flag here.
> -	 */
> -	xfs_trans_clear_context(tp);
> -	return 0;
> -}
> -
>  /*
>   * Update on-disk file size now that data has been written to disk.
>   */
> @@ -182,12 +155,10 @@ xfs_end_ioend(
>  		error = xfs_reflink_end_cow(ip, offset, size);

Seems reasonable to me. xfs_reflink_end_cow_extent should probably
learn how to extend the ondisk EOF as patch 6/4.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  	else if (ioend->io_type == IOMAP_UNWRITTEN)
>  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> -	else
> -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
>  
>  done:
> -	if (ioend->io_private)
> -		error = xfs_setfilesize_ioend(ioend, error);
> +	if (!error && xfs_ioend_is_append(ioend))
> +		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
>  	iomap_finish_ioends(ioend, error);
>  	memalloc_nofs_restore(nofs_flag);
>  }
> @@ -237,7 +208,7 @@ xfs_end_io(
>  
>  static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
>  {
> -	return ioend->io_private ||
> +	return xfs_ioend_is_append(ioend) ||
>  		ioend->io_type == IOMAP_UNWRITTEN ||
>  		(ioend->io_flags & IOMAP_F_SHARED);
>  }
> @@ -250,8 +221,6 @@ xfs_end_bio(
>  	struct xfs_inode *ip = XFS_I(ioend->io_inode);
>  	unsigned long flags;
>  
> -	ASSERT(xfs_ioend_needs_workqueue(ioend));
> -
>  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
>  	if (list_empty(&ip->i_ioend_list))
>  		WARN_ON_ONCE(!queue_work(ip->i_mount->m_unwritten_workqueue,
> @@ -501,14 +470,6 @@ xfs_prepare_ioend(
>  				ioend->io_offset, ioend->io_size);
>  	}
>  
> -	/* Reserve log space if we might write beyond the on-disk inode size. */
> -	if (!status &&
> -	    ((ioend->io_flags & IOMAP_F_SHARED) ||
> -	     ioend->io_type != IOMAP_UNWRITTEN) &&
> -	    xfs_ioend_is_append(ioend) &&
> -	    !ioend->io_private)
> -		status = xfs_setfilesize_trans_alloc(ioend);
> -
>  	memalloc_nofs_restore(nofs_flag);
>  
>  	if (xfs_ioend_needs_workqueue(ioend))
> -- 
> 2.26.3
> 
On Wed, Apr 07, 2021 at 08:31:52AM -0700, Darrick J. Wong wrote:
> On Mon, Apr 05, 2021 at 10:59:00AM -0400, Brian Foster wrote:
> > Per-inode ioend completion batching has a log reservation deadlock
> > vector between preallocated append transactions and transactions
> > that are acquired at completion time for other purposes (i.e.,
> > unwritten extent conversion or COW fork remaps). For example, if the
> > ioend completion workqueue task executes on a batch of ioends that
> > are sorted such that an append ioend sits at the tail, it's possible
> > for the outstanding append transaction reservation to block
> > allocation of transactions required to process preceding ioends in
> > the list.
> > 
> > Append ioend completion is historically the common path for on-disk
> > inode size updates. While file extending writes may have completed
> > sometime earlier, the on-disk inode size is only updated after
> > successful writeback completion. These transactions are preallocated
> > serially from writeback context to mitigate concurrency and
> > associated log reservation pressure across completions processed by
> > multi-threaded workqueue tasks.
> > 
> > However, now that delalloc blocks unconditionally map to unwritten
> > extents at physical block allocation time, size updates via append
> > ioends are relatively rare. This means that inode size updates most
> > commonly occur as part of the preexisting completion time
> > transaction to convert unwritten extents. As a result, there is no
> > longer a strong need to preallocate size update transactions.
> > 
> > Remove the preallocation of inode size update transactions to avoid
> > the ioend completion processing log reservation deadlock. Instead,
> > continue to send all potential size extending ioends to workqueue
> > context for completion and allocate the transaction from that
> > context. This ensures that no outstanding log reservation is owned
> > by the ioend completion worker task when it begins to process
> > ioends.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 45 +++------------------------------------------
> >  1 file changed, 3 insertions(+), 42 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 1cc7c36d98e9..c1951975bd6a 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -39,33 +39,6 @@ static inline bool xfs_ioend_is_append(struct iomap_ioend *ioend)
> >  		XFS_I(ioend->io_inode)->i_d.di_size;
> >  }
> >  
> > -STATIC int
> > -xfs_setfilesize_trans_alloc(
> > -	struct iomap_ioend *ioend)
> > -{
> > -	struct xfs_mount *mp = XFS_I(ioend->io_inode)->i_mount;
> > -	struct xfs_trans *tp;
> > -	int error;
> > -
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
> > -	if (error)
> > -		return error;
> > -
> > -	ioend->io_private = tp;
> > -
> > -	/*
> > -	 * We may pass freeze protection with a transaction. So tell lockdep
> > -	 * we released it.
> > -	 */
> > -	__sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
> > -	/*
> > -	 * We hand off the transaction to the completion thread now, so
> > -	 * clear the flag here.
> > -	 */
> > -	xfs_trans_clear_context(tp);
> > -	return 0;
> > -}
> > -
> >  /*
> >   * Update on-disk file size now that data has been written to disk.
> >   */
> > @@ -182,12 +155,10 @@ xfs_end_ioend(
> >  		error = xfs_reflink_end_cow(ip, offset, size);
> 
> Seems reasonable to me. xfs_reflink_end_cow_extent should probably
> learn how to extend the ondisk EOF as patch 6/4.
> 

After looking into this since multiple people asked for it, I'm not as
convinced it's the right approach as I was on first thought. The reason
being that the reflink completion code actually remaps in reverse order
(I suspect due to depending on the same behavior of the bmap unmap code,
which is used to pull extents from the COW fork..).

That means that while inter-ioend completion takes some care (even if
not necessarily guaranteed) to order on-disk size updates against
completion within the associated i_size, a size update along with the
COW fork remap transaction could result in the opposite in some cases
(i.e. if the cow fork is fragmented relative to the data fork). I
suspect this is a rare situation and not a critical problem, but at the
same time I find it a bit odd to implement intra-ioend ordering
inconsistent with the broader inter-ioend logic.

I could probably be convinced otherwise, but I think this warrants more
thought and is not necessarily a straightforward cleanup. I'm also not
sure how common disk size updates via COW fork I/O completion really are
in the first place, so I'm not sure it's worth the oddness to save an
extra transaction. In any event, I'll probably leave this off from this
series for now because it doesn't have much to do with fixing the
deadlock problem and can revisit it if anybody has thoughts on any of
the above..

Brian

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
> >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > -	else
> > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> >  
> >  done:
> > -	if (ioend->io_private)
> > -		error = xfs_setfilesize_ioend(ioend, error);
> > +	if (!error && xfs_ioend_is_append(ioend))
> > +		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
> >  	iomap_finish_ioends(ioend, error);
> >  	memalloc_nofs_restore(nofs_flag);
> >  }
> > @@ -237,7 +208,7 @@ xfs_end_io(
> >  
> >  static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> >  {
> > -	return ioend->io_private ||
> > +	return xfs_ioend_is_append(ioend) ||
> > 		ioend->io_type == IOMAP_UNWRITTEN ||
> > 		(ioend->io_flags & IOMAP_F_SHARED);
> >  }
> > @@ -250,8 +221,6 @@ xfs_end_bio(
> >  	struct xfs_inode *ip = XFS_I(ioend->io_inode);
> >  	unsigned long flags;
> >  
> > -	ASSERT(xfs_ioend_needs_workqueue(ioend));
> > -
> >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> >  	if (list_empty(&ip->i_ioend_list))
> >  		WARN_ON_ONCE(!queue_work(ip->i_mount->m_unwritten_workqueue,
> > @@ -501,14 +470,6 @@ xfs_prepare_ioend(
> >  				ioend->io_offset, ioend->io_size);
> >  	}
> >  
> > -	/* Reserve log space if we might write beyond the on-disk inode size. */
> > -	if (!status &&
> > -	    ((ioend->io_flags & IOMAP_F_SHARED) ||
> > -	     ioend->io_type != IOMAP_UNWRITTEN) &&
> > -	    xfs_ioend_is_append(ioend) &&
> > -	    !ioend->io_private)
> > -		status = xfs_setfilesize_trans_alloc(ioend);
> > -
> >  	memalloc_nofs_restore(nofs_flag);
> >  
> >  	if (xfs_ioend_needs_workqueue(ioend))
> > -- 
> > 2.26.3
> > 
> 
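To make the reverse-order behavior described above concrete: the
completion-time remap loop walks the byte range from its end back toward its
start, so the highest-offset COW extent is remapped into the data fork first.
A minimal sketch of that loop shape, paraphrased from the xfs_reflink_end_cow()
of this era (setup, locking and error handling omitted; details may differ):

	xfs_fileoff_t	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
	xfs_fileoff_t	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
	int		error = 0;

	/*
	 * Walk backwards until we're out of the I/O range.  Each call remaps
	 * one COW fork extent from the top of the remaining range and moves
	 * end_fsb downward past it, allocating one transaction per extent.
	 */
	while (end_fsb > offset_fsb && !error)
		error = xfs_reflink_end_cow_extent(ip, offset_fsb, &end_fsb);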
On Fri, Apr 09, 2021 at 09:47:24AM -0400, Brian Foster wrote:
> On Wed, Apr 07, 2021 at 08:31:52AM -0700, Darrick J. Wong wrote:
> > On Mon, Apr 05, 2021 at 10:59:00AM -0400, Brian Foster wrote:
> > > Per-inode ioend completion batching has a log reservation deadlock
> > > vector between preallocated append transactions and transactions
> > > that are acquired at completion time for other purposes (i.e.,
> > > unwritten extent conversion or COW fork remaps). For example, if the
> > > ioend completion workqueue task executes on a batch of ioends that
> > > are sorted such that an append ioend sits at the tail, it's possible
> > > for the outstanding append transaction reservation to block
> > > allocation of transactions required to process preceding ioends in
> > > the list.
> > > 
> > > Append ioend completion is historically the common path for on-disk
> > > inode size updates. While file extending writes may have completed
> > > sometime earlier, the on-disk inode size is only updated after
> > > successful writeback completion. These transactions are preallocated
> > > serially from writeback context to mitigate concurrency and
> > > associated log reservation pressure across completions processed by
> > > multi-threaded workqueue tasks.
> > > 
> > > However, now that delalloc blocks unconditionally map to unwritten
> > > extents at physical block allocation time, size updates via append
> > > ioends are relatively rare. This means that inode size updates most
> > > commonly occur as part of the preexisting completion time
> > > transaction to convert unwritten extents. As a result, there is no
> > > longer a strong need to preallocate size update transactions.
> > > 
> > > Remove the preallocation of inode size update transactions to avoid
> > > the ioend completion processing log reservation deadlock. Instead,
> > > continue to send all potential size extending ioends to workqueue
> > > context for completion and allocate the transaction from that
> > > context. This ensures that no outstanding log reservation is owned
> > > by the ioend completion worker task when it begins to process
> > > ioends.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_aops.c | 45 +++------------------------------------------
> > >  1 file changed, 3 insertions(+), 42 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index 1cc7c36d98e9..c1951975bd6a 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -39,33 +39,6 @@ static inline bool xfs_ioend_is_append(struct iomap_ioend *ioend)
> > >  		XFS_I(ioend->io_inode)->i_d.di_size;
> > >  }
> > >  
> > > -STATIC int
> > > -xfs_setfilesize_trans_alloc(
> > > -	struct iomap_ioend *ioend)
> > > -{
> > > -	struct xfs_mount *mp = XFS_I(ioend->io_inode)->i_mount;
> > > -	struct xfs_trans *tp;
> > > -	int error;
> > > -
> > > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
> > > -	if (error)
> > > -		return error;
> > > -
> > > -	ioend->io_private = tp;
> > > -
> > > -	/*
> > > -	 * We may pass freeze protection with a transaction. So tell lockdep
> > > -	 * we released it.
> > > -	 */
> > > -	__sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
> > > -	/*
> > > -	 * We hand off the transaction to the completion thread now, so
> > > -	 * clear the flag here.
> > > -	 */
> > > -	xfs_trans_clear_context(tp);
> > > -	return 0;
> > > -}
> > > -
> > >  /*
> > >   * Update on-disk file size now that data has been written to disk.
> > >   */
> > > @@ -182,12 +155,10 @@ xfs_end_ioend(
> > >  		error = xfs_reflink_end_cow(ip, offset, size);
> > 
> > Seems reasonable to me. xfs_reflink_end_cow_extent should probably
> > learn how to extend the ondisk EOF as patch 6/4.
> > 
> 
> After looking into this since multiple people asked for it, I'm not as
> convinced it's the right approach as I was on first thought. The reason
> being that the reflink completion code actually remaps in reverse order
> (I suspect due to depending on the same behavior of the bmap unmap code,
> which is used to pull extents from the COW fork..).

Yes. There's no algorithmic reason why _end_cow has to call bunmapi
directly (and thereby remap in reverse order) -- it could log a bunmap
intent item for the data fork before logging the refcount btree intent
and the bmap remap intent. With such a rework in place, we would
effectively be doing the remap in order of increasing file offset,
instead of forwards-backwards like we do now.

If you want to chase that, you might also want to look at this patch[1]
that eliminates the need to requeue bunmap operations.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=reflink-speedups&id=323b0acb563ea50bfff79801083f5ba8544c7985

> That means that while inter-ioend completion takes some care (even if
> not necessarily guaranteed) to order on-disk size updates against
> completion within the associated i_size, a size update along with the
> COW fork remap transaction could result in the opposite in some cases
> (i.e. if the cow fork is fragmented relative to the data fork). I
> suspect this is a rare situation and not a critical problem, but at the
> same time I find it a bit odd to implement intra-ioend ordering
> inconsistent with the broader inter-ioend logic.
> 
> I could probably be convinced otherwise, but I think this warrants more
> thought and is not necessarily a straightforward cleanup. I'm also not
> sure how common disk size updates via COW fork I/O completion really are
> in the first place, so I'm not sure it's worth the oddness to save an
> extra transaction. In any event, I'll probably leave this off from this
> series for now because it doesn't have much to do with fixing the
> deadlock problem and can revisit it if anybody has thoughts on any of
> the above..

The only case I can think of is an extending write to a reflinked file
if (say) you snapshot /var on a live system.

--D

> 
> Brian
> 
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > --D
> > 
> > >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> > >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > > -	else
> > > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> > >  
> > >  done:
> > > -	if (ioend->io_private)
> > > -		error = xfs_setfilesize_ioend(ioend, error);
> > > +	if (!error && xfs_ioend_is_append(ioend))
> > > +		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
> > >  	iomap_finish_ioends(ioend, error);
> > >  	memalloc_nofs_restore(nofs_flag);
> > >  }
> > > @@ -237,7 +208,7 @@ xfs_end_io(
> > >  
> > >  static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> > >  {
> > > -	return ioend->io_private ||
> > > +	return xfs_ioend_is_append(ioend) ||
> > > 		ioend->io_type == IOMAP_UNWRITTEN ||
> > > 		(ioend->io_flags & IOMAP_F_SHARED);
> > >  }
> > > @@ -250,8 +221,6 @@ xfs_end_bio(
> > >  	struct xfs_inode *ip = XFS_I(ioend->io_inode);
> > >  	unsigned long flags;
> > >  
> > > -	ASSERT(xfs_ioend_needs_workqueue(ioend));
> > > -
> > >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> > >  	if (list_empty(&ip->i_ioend_list))
> > >  		WARN_ON_ONCE(!queue_work(ip->i_mount->m_unwritten_workqueue,
> > > @@ -501,14 +470,6 @@ xfs_prepare_ioend(
> > >  				ioend->io_offset, ioend->io_size);
> > >  	}
> > >  
> > > -	/* Reserve log space if we might write beyond the on-disk inode size. */
> > > -	if (!status &&
> > > -	    ((ioend->io_flags & IOMAP_F_SHARED) ||
> > > -	     ioend->io_type != IOMAP_UNWRITTEN) &&
> > > -	    xfs_ioend_is_append(ioend) &&
> > > -	    !ioend->io_private)
> > > -		status = xfs_setfilesize_trans_alloc(ioend);
> > > -
> > >  	memalloc_nofs_restore(nofs_flag);
> > >  
> > >  	if (xfs_ioend_needs_workqueue(ioend))
> > > -- 
> > > 2.26.3
> > > 
> > 
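Since the series moves the on-disk size update into the completion-time
transaction path, it may help to recall what xfs_setfilesize() itself does. A
simplified sketch, paraphrased from memory of the function around this time
(the real code splits the work across a helper, and details may differ):

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
	if (error)
		return error;

	xfs_ilock(ip, XFS_ILOCK_EXCL);
	isize = xfs_new_eof(ip, offset + size);
	if (!isize) {
		/* The on-disk size already covers this range; nothing to do. */
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		xfs_trans_cancel(tp);
		return 0;
	}

	/* Extend the on-disk size and commit; the ILOCK is dropped at commit. */
	ip->i_d.di_size = isize;
	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
	return xfs_trans_commit(tp);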
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1cc7c36d98e9..c1951975bd6a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -39,33 +39,6 @@ static inline bool xfs_ioend_is_append(struct iomap_ioend *ioend)
 		XFS_I(ioend->io_inode)->i_d.di_size;
 }
 
-STATIC int
-xfs_setfilesize_trans_alloc(
-	struct iomap_ioend *ioend)
-{
-	struct xfs_mount *mp = XFS_I(ioend->io_inode)->i_mount;
-	struct xfs_trans *tp;
-	int error;
-
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
-	if (error)
-		return error;
-
-	ioend->io_private = tp;
-
-	/*
-	 * We may pass freeze protection with a transaction. So tell lockdep
-	 * we released it.
-	 */
-	__sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
-	/*
-	 * We hand off the transaction to the completion thread now, so
-	 * clear the flag here.
-	 */
-	xfs_trans_clear_context(tp);
-	return 0;
-}
-
 /*
  * Update on-disk file size now that data has been written to disk.
  */
@@ -182,12 +155,10 @@ xfs_end_ioend(
 		error = xfs_reflink_end_cow(ip, offset, size);
 	else if (ioend->io_type == IOMAP_UNWRITTEN)
 		error = xfs_iomap_write_unwritten(ip, offset, size, false);
-	else
-		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
 
 done:
-	if (ioend->io_private)
-		error = xfs_setfilesize_ioend(ioend, error);
+	if (!error && xfs_ioend_is_append(ioend))
+		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
 	iomap_finish_ioends(ioend, error);
 	memalloc_nofs_restore(nofs_flag);
 }
@@ -237,7 +208,7 @@ xfs_end_io(
 
 static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
 {
-	return ioend->io_private ||
+	return xfs_ioend_is_append(ioend) ||
 		ioend->io_type == IOMAP_UNWRITTEN ||
 		(ioend->io_flags & IOMAP_F_SHARED);
 }
@@ -250,8 +221,6 @@ xfs_end_bio(
 	struct xfs_inode *ip = XFS_I(ioend->io_inode);
 	unsigned long flags;
 
-	ASSERT(xfs_ioend_needs_workqueue(ioend));
-
 	spin_lock_irqsave(&ip->i_ioend_lock, flags);
 	if (list_empty(&ip->i_ioend_list))
 		WARN_ON_ONCE(!queue_work(ip->i_mount->m_unwritten_workqueue,
@@ -501,14 +470,6 @@ xfs_prepare_ioend(
 				ioend->io_offset, ioend->io_size);
 	}
 
-	/* Reserve log space if we might write beyond the on-disk inode size. */
-	if (!status &&
-	    ((ioend->io_flags & IOMAP_F_SHARED) ||
-	     ioend->io_type != IOMAP_UNWRITTEN) &&
-	    xfs_ioend_is_append(ioend) &&
-	    !ioend->io_private)
-		status = xfs_setfilesize_trans_alloc(ioend);
-
 	memalloc_nofs_restore(nofs_flag);
 
 	if (xfs_ioend_needs_workqueue(ioend))
Per-inode ioend completion batching has a log reservation deadlock
vector between preallocated append transactions and transactions
that are acquired at completion time for other purposes (i.e.,
unwritten extent conversion or COW fork remaps). For example, if the
ioend completion workqueue task executes on a batch of ioends that
are sorted such that an append ioend sits at the tail, it's possible
for the outstanding append transaction reservation to block
allocation of transactions required to process preceding ioends in
the list.

Append ioend completion is historically the common path for on-disk
inode size updates. While file extending writes may have completed
sometime earlier, the on-disk inode size is only updated after
successful writeback completion. These transactions are preallocated
serially from writeback context to mitigate concurrency and
associated log reservation pressure across completions processed by
multi-threaded workqueue tasks.

However, now that delalloc blocks unconditionally map to unwritten
extents at physical block allocation time, size updates via append
ioends are relatively rare. This means that inode size updates most
commonly occur as part of the preexisting completion time
transaction to convert unwritten extents. As a result, there is no
longer a strong need to preallocate size update transactions.

Remove the preallocation of inode size update transactions to avoid
the ioend completion processing log reservation deadlock. Instead,
continue to send all potential size extending ioends to workqueue
context for completion and allocate the transaction from that
context. This ensures that no outstanding log reservation is owned
by the ioend completion worker task when it begins to process
ioends.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c | 45 +++------------------------------------------
 1 file changed, 3 insertions(+), 42 deletions(-)
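For context on the batching described in the first paragraph, the per-inode
completion worker that processes these ioend lists has roughly the following
shape (a simplified sketch; the real xfs_end_io() of this era also merges
adjacent ioends before completing them). The list is sorted by file offset, so
a size-extending ioend tends to land at the tail of the batch, which is how a
preallocated append transaction reservation could sit held while the earlier
entries try to allocate their own transactions:

static void
xfs_end_io(
	struct work_struct	*work)
{
	struct xfs_inode	*ip =
		container_of(work, struct xfs_inode, i_ioend_work);
	struct iomap_ioend	*ioend;
	struct list_head	tmp;
	unsigned long		flags;

	/* Pull the inode's pending ioend list off in one shot. */
	spin_lock_irqsave(&ip->i_ioend_lock, flags);
	list_replace_init(&ip->i_ioend_list, &tmp);
	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);

	/* Process the batch in ascending file offset order. */
	iomap_sort_ioends(&tmp);
	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
			io_list))) {
		list_del_init(&ioend->io_list);
		xfs_end_ioend(ioend);
	}
}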