Message ID | 1432598142-4434-1-git-send-email-fdmanana@kernel.org (mailing list archive) |
---|---|
State | Accepted |
On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Zygo Blaxell and other users have reported occasional hangs while an
> inode is being evicted, leading to traces like the following:
>
> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
> [ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2
> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
> [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
> [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
> [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
> [ 5281.981396] Call Trace:
> [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
> [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
> [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
> [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
> [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
> [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
> [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
> [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
> [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
> [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
> [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
> [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
> [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
> [ 5282.001299] 1 lock held by rm/20488:
> [ 5282.002066]  #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
>
> This happens when we have readahead, which calls readpages(), happening
> right before the inode eviction handler is invoked. So the reason is
> essentially:
>
> 1) readpages() is called while a reference on the inode is held, so
>    eviction cannot be triggered before readpages() returns. It also
>    locks one or more ranges in the inode's io_tree (which is done at
>    extent_io.c:__do_contiguous_readpages());
>
> 2) readpages() submits several read bios, all with an end io callback
>    that runs extent_io.c:end_bio_extent_readpage() and that is executed
>    by another task when a bio finishes, corresponding to a work queue
>    (fs_info->end_io_workers) worker kthread. This callback unlocks
>    the ranges in the inode's io_tree that were previously locked in
>    step 1;
>
> 3) readpages() returns, the reference on the inode is dropped;
>
> 4) One or more of the read bios previously submitted are still not
>    complete (their end io callback was not yet invoked or has not
>    yet finished execution);
>
> 5) Inode eviction is triggered (through an unlink call for example).
>    The inode reference count was not incremented before submitting
>    the read bios, therefore this is possible;
>
> 6) The eviction handler starts executing and enters the loop that
>    iterates over all extent states in the inode's io_tree;

One question here:

btrfs_evict_inode()->
  evict_inode_truncate_pages()->
    truncate_inode_pages_final()

truncate_inode_pages_final() finds all pages in the mapping and does a
lock_page() on each page; if there is a readpage being run by another
task, it should be blocked on lock_page() until end_bio_extent_readpage()
does an unlock_page().

How does the eviction handler bypass this and enter the 'extent_state'
loop? Am I missing something?

Thanks,

-liubo

> 7) The loop picks one extent state record and uses its ->start and
>    ->end fields, after releasing the inode's io_tree spinlock, to
>    call lock_extent_bits() and clear_extent_bit(). The call to lock
>    the range [state->start, state->end] blocks because the whole
>    range, or a part of it, was locked by the previous call to
>    readpages() and the corresponding end io callback, which unlocks
>    the range, was not yet executed;
>
> 8) The end io callback for the read bio is executed and unlocks the
>    range [state->start, state->end] (or a superset of that range).
>    And at clear_extent_bit() the extent_state record state is used
>    as a second argument to split_state(), which sets state->start to
>    a larger value;
>
> 9) The task executing the eviction handler is woken up by the task
>    executing the bio's end io callback (through clear_state_bit) and
>    the eviction handler locks the range
>    [old value for state->start, state->end]. Shortly after, when
>    calling clear_extent_bit(), it unlocks the range
>    [new value for state->start, state->end], so it ends up unlocking
>    only part of the range that it locked, leaving an extent state
>    record in the io_tree that represents the unlocked subrange;
>
> 10) The eviction handler loop, in its next iteration, gets the
>     extent_state record for the subrange that it did not unlock in the
>     previous step and then tries to lock it, resulting in a hang.
>
> So fix this by not using the ->start and ->end fields of an existing
> extent_state record. This is a simple solution, and an alternative
> could be to bump the inode's reference count before submitting each
> read bio and having it dropped in the bio's end io callback. But that
> would be a more invasive/complex change and would not protect against
> other possible places that are not holding a reference on the inode
> as well. Something to consider in the future.
>
> Many thanks to Zygo Blaxell for reporting, in the mailing list, the
> issue, a set of scripts to trigger it and testing this fix.
>
> Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/inode.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0020b56..ef05800 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode)
>  	}
>  	write_unlock(&map_tree->lock);
>
> +	/*
> +	 * Keep looping until we have no more ranges in the io tree.
> +	 * We can have ongoing bios started by readpages (called from readahead)
> +	 * that didn't get their end io callbacks called yet or they are still
> +	 * in progress (extent_io.c:end_bio_extent_readpage()). This means some
> +	 * ranges can still be locked and eviction started, because before
> +	 * submitting those bios, which are executed by a separate task (work
> +	 * queue kthread), inode references (inode->i_count) were not taken
> +	 * (which would be dropped in the end io callback of each bio).
> +	 * Therefore here we effectively end up waiting for those bios and
> +	 * anyone else holding locked ranges without having bumped the inode's
> +	 * reference count - if we don't do it, when they access the inode's
> +	 * io_tree to unlock a range it may be too late, leading to a
> +	 * use-after-free issue.
> +	 */
>  	spin_lock(&io_tree->lock);
>  	while (!RB_EMPTY_ROOT(&io_tree->state)) {
>  		struct extent_state *state;
>  		struct extent_state *cached_state = NULL;
> +		u64 start;
> +		u64 end;
>
>  		node = rb_first(&io_tree->state);
>  		state = rb_entry(node, struct extent_state, rb_node);
> -		atomic_inc(&state->refs);
> +		start = state->start;
> +		end = state->end;
>  		spin_unlock(&io_tree->lock);
>
> -		lock_extent_bits(io_tree, state->start, state->end,
> -				 0, &cached_state);
> -		clear_extent_bit(io_tree, state->start, state->end,
> +		lock_extent_bits(io_tree, start, end, 0, &cached_state);
> +		clear_extent_bit(io_tree, start, end,
>  				 EXTENT_LOCKED | EXTENT_DIRTY |
>  				 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
>  				 EXTENT_DEFRAG, 1, 1,
>  				 &cached_state, GFP_NOFS);
> -		free_extent_state(state);
>
>  		cond_resched();
>  		spin_lock(&io_tree->lock);
> --
> 2.1.3
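The core of the race described in the patch is a classic pattern: the eviction loop keeps a pointer to a shared extent_state record, drops the io_tree spinlock, and then reads state->start at two different moments, while the end io path can shrink the record through split_state() in between. The following is a minimal userspace sketch of that pattern, not btrfs code; the struct state, tree_lock, and endio_task() names are invented for illustration. It shows why the fix snapshots ->start and ->end while the lock is still held:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for struct extent_state; only the fields that matter here. */
struct state {
	uint64_t start;
	uint64_t end;
};

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER; /* io_tree->lock */
static struct state rec = { .start = 0, .end = 4095 };

/* Models what split_state() does from the end io path: the record is
 * shrunk so that it describes only the tail of the original range. */
static void *endio_task(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&tree_lock);
	rec.start = 2048;
	pthread_mutex_unlock(&tree_lock);
	return NULL;
}

int main(void)
{
	pthread_t t;
	struct state *state;
	uint64_t locked_start, cleared_start, start, end;

	/* Buggy pattern: keep a pointer to the record, drop the lock, and
	 * read its fields at two different moments. */
	pthread_mutex_lock(&tree_lock);
	state = &rec;			/* like atomic_inc(&state->refs) */
	pthread_mutex_unlock(&tree_lock);

	locked_start = state->start;	/* lock_extent_bits() sees 0 */
	pthread_create(&t, NULL, endio_task, NULL);
	pthread_join(t, NULL);		/* the split happens in between */
	cleared_start = state->start;	/* clear_extent_bit() sees 2048 */

	printf("buggy: locked [%llu, %llu] but cleared [%llu, %llu]\n",
	       (unsigned long long)locked_start, (unsigned long long)state->end,
	       (unsigned long long)cleared_start, (unsigned long long)state->end);
	/* [0, 2047] stays locked forever -> the next loop iteration hangs. */

	/* Fixed pattern: snapshot the range while still holding the lock
	 * and use only the snapshot afterwards. */
	pthread_mutex_lock(&tree_lock);
	start = rec.start;
	end = rec.end;
	pthread_mutex_unlock(&tree_lock);
	printf("fixed: locks and clears the same range [%llu, %llu]\n",
	       (unsigned long long)start, (unsigned long long)end);
	return 0;
}

In the real patch the snapshot is exactly the u64 start/end locals the diff adds, which is also what makes the atomic_inc(&state->refs)/free_extent_state() pair unnecessary.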
On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@kernel.org wrote:
>> [...]
>>
>> 6) The eviction handler starts executing and enters the loop that
>>    iterates over all extent states in the inode's io_tree;
>
> One question here:
>
> btrfs_evict_inode()->
>   evict_inode_truncate_pages()->
>     truncate_inode_pages_final()
>
> truncate_inode_pages_final() finds all pages in the mapping and does a
> lock_page() on each page; if there is a readpage being run by another
> task, it should be blocked on lock_page() until end_bio_extent_readpage()
> does an unlock_page().
>
> How does the eviction handler bypass this and enter the 'extent_state'
> loop? Am I missing something?

Yes, you are missing something.
end_bio_extent_readpage() unlocks the pages, and only after that does it
unlock the ranges in the inode's io tree.

thanks
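Filipe's answer is purely about ordering inside end_bio_extent_readpage(): everything that truncate_inode_pages_final() waits on through lock_page() is released first, and the io_tree range is released only afterwards. The sketch below is a simplified model with invented stand-in types, making no claim to match the real function's structure; it only makes the window visible:

#include <stdio.h>

/* Invented stand-ins for the two kinds of locks involved. */
struct page_model { int locked; };
struct range_model { int locked; };

/* Simplified model of end_bio_extent_readpage(): the order of the two
 * unlocks is the whole answer to the question above. */
static void end_bio_extent_readpage_model(struct page_model *page,
					  struct range_model *range)
{
	page->locked = 0;	/* 1) unlock_page(): truncate_inode_pages_final()
				 *    can now grab the page and let evict() proceed */
	/* <-- window: eviction may already be iterating the io_tree states
	 *     while the range below is still locked */
	range->locked = 0;	/* 2) io_tree range released last */
}

int main(void)
{
	struct page_model page = { .locked = 1 };
	struct range_model range = { .locked = 1 };

	end_bio_extent_readpage_model(&page, &range);
	printf("page locked=%d, range locked=%d\n", page.locked, range.locked);
	return 0;
}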
On Wed, Jun 10, 2015 at 11:44:17AM +0100, Filipe Manana wrote:
> [...]
> Yes, you are missing something.
> end_bio_extent_readpage() unlocks the pages, and only after that does it
> unlock the ranges in the inode's io tree.

Ah, that's right. Another thing: truncate_inode_pages_final() gets every
page's lock, which means that all the end io callbacks must have been
processed, but the comment below in the code says some bios may still
not have had their end io called.

Thanks,

-liubo
On Wed, Jun 10, 2015 at 12:05 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> [...]
> Ah, that's right. Another thing: truncate_inode_pages_final() gets every
> page's lock, which means that all the end io callbacks must have been
> processed, but the comment below in the code says some bios may still
> not have had their end io called.

truncate_inode_pages_final() can lock the pages after
end_bio_extent_readpage() unlocks them and before
end_bio_extent_readpage() unlocks the ranges in the io tree. I know
there's some mangling around that makes the bio's original end io
function be run by a work queue (fs_info->endio_workers) task, plus one
or two more levels of indirection. But by end io I really meant
end_bio_extent_readpage(), as I mention in the comment between
parentheses, and not any of the indirections added by
btrfs_submit_bio_hook().
On Wed, Jun 10, 2015 at 12:12:11PM +0100, Filipe Manana wrote:
> [...]
> truncate_inode_pages_final() can lock the pages after
> end_bio_extent_readpage() unlocks them and before
> end_bio_extent_readpage() unlocks the ranges in the io tree.

Excerpted from the comment below:

>>> + * We can have ongoing bios started by readpages (called from readahead)
>>> + * that didn't get their end io callbacks called yet or they are still
>>> + * in progress (extent_io.c:end_bio_extent_readpage()).

I mean that by the time truncate_inode_pages_final() ends, all the
end_bio_extent_readpage() calls have already been made, since the pages
need to be unlocked for evict() to go on.

Thanks,

-liubo
On Wed, Jun 10, 2015 at 12:27 PM, Liu Bo <bo.li.liu@oracle.com> wrote: > On Wed, Jun 10, 2015 at 12:12:11PM +0100, Filipe Manana wrote: >> On Wed, Jun 10, 2015 at 12:05 PM, Liu Bo <bo.li.liu@oracle.com> wrote: >> > On Wed, Jun 10, 2015 at 11:44:17AM +0100, Filipe Manana wrote: >> >> On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote: >> >> > On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@kernel.org wrote: >> >> >> From: Filipe Manana <fdmanana@suse.com> >> >> >> >> >> >> Zygo Blaxell and other users have reported occasional hangs while an >> >> >> inode is being evicted, leading to traces like the following: >> >> >> >> >> >> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds. >> >> >> [ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2 >> >> >> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> >> >> [ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 0x00000000 >> >> >> [ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001 >> >> >> [ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78 >> >> >> [ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123 >> >> >> [ 5281.981396] Call Trace: >> >> >> [ 5281.982066] [<ffffffff8143107e>] schedule+0x74/0x83 >> >> >> [ 5281.983341] [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs] >> >> >> [ 5281.985127] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 >> >> >> [ 5281.986715] [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs] >> >> >> [ 5281.988680] [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs] >> >> >> [ 5281.990200] [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs] >> >> >> [ 5281.991781] [<ffffffff8116964d>] evict+0xa0/0x148 >> >> >> [ 5281.992735] [<ffffffff8116a43d>] iput+0x18f/0x1e5 >> >> >> [ 5281.993796] [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa >> >> >> [ 5281.994806] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58 >> >> >> [ 5281.996120] [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab >> >> >> [ 5281.997562] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f >> >> >> [ 5281.998815] [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b >> >> >> [ 5281.999920] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 >> >> >> [ 5282.001299] 1 lock held by rm/20488: >> >> >> [ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b >> >> >> >> >> >> This happens when we have readahead, which calls readpages(), happening >> >> >> right before the inode eviction handler is invoked. So the reason is >> >> >> essentially: >> >> >> >> >> >> 1) readpages() is called while a reference on the inode is held, so >> >> >> eviction can not be triggered before readpages() returns. It also >> >> >> locks one or more ranges in the inode's io_tree (which is done at >> >> >> extent_io.c:__do_contiguous_readpages()); >> >> >> >> >> >> 2) readpages() submits several read bios, all with an end io callback >> >> >> that runs extent_io.c:end_bio_extent_readpage() and that is executed >> >> >> by other task when a bio finishes, corresponding to a work queue >> >> >> (fs_info->end_io_workers) worker kthread. 
This callback unlocks
>> >> >> the ranges in the inode's io_tree that were previously locked in
>> >> >> step 1;
>> >> >>
>> >> >> [...]
>> >> >>
>> >> >> 6) The eviction handler starts executing and enters the loop that
>> >> >> iterates over all extent states in the inode's io_tree;
>> >> >
>> >> > One question here:
>> >> > btrfs_evict_inode()->
>> >> >   evict_inode_truncate_pages()->
>> >> >     truncate_inode_pages_final()
>> >> >
>> >> > truncate_inode_pages_final() finds all pages in the mapping and does
>> >> > lock_page() on each page; if there is a readpage running in another
>> >> > task, it should be blocked on lock_page() until
>> >> > end_bio_extent_readpage() does an unlock_page().
>> >> >
>> >> > How does the eviction handler bypass this and enter the
>> >> > 'extent_state' loop?
>> >> >
>> >> > Am I missing something?
>> >>
>> >> Yes, you are missing something.
>> >> end_bio_extent_readpage() unlocks the pages, and only after that does
>> >> it unlock the ranges in the inode's io tree.
>> >
>> > ah, that's right. Another thing is that
>> > truncate_inode_pages_final() gets every page's lock, which means that
>> > all the endio callbacks must have been processed, but the comment in
>> > the code below says some bios may still not have had their endio
>> > called.
>>
>> truncate_inode_pages_final() can lock the pages after
>> end_bio_extent_readpage() unlocks them and before
>> end_bio_extent_readpage() unlocks the ranges in the io tree. I know
>> there's some mangling around that makes the bio's original endio
>> function run in a work queue (fs_info->endio_workers) task plus
>> one or two more levels of indirection. But by endio I really meant
>> end_bio_extent_readpage(), as I mention in the comment between
>> parentheses, and not any of the indirections added by
>> btrfs_submit_bio_hook().
>
> Excerpted from the below comment,
>
>> >> >> + * We can have ongoing bios started by readpages (called from readahead)
>> >> >> + * that didn't get their end io callbacks called yet or they are still
>> >> >> + * in progress ((extent_io.c:end_bio_extent_readpage()).
>
> I mean that by the time truncate_inode_pages_final() ends, all the
> end_bio_extent_readpage() have already been called, since the pages need
> to be unlocked to let evict() go on.

When you say "have already been called" I understood completely called
and not still in progress.
Yes, by that time there's only the possibility of being in progress.
I'll send a patch to fix the comment since Chris already picked this.

> Thanks,
>
> -liubo

>> >> >> 7) The loop picks one extent state record and uses its ->start and
>> >> >> ->end fields, after releasing the inode's io_tree spinlock, to
>> >> >> call lock_extent_bits() and clear_extent_bit(). The call to lock
>> >> >> the range [state->start, state->end] blocks because the whole
>> >> >> range, or a part of it, was locked by the previous call to
>> >> >> readpages(), and the corresponding end io callback, which unlocks
>> >> >> the range, was not yet executed;
>> >> >>
>> >> >> 8) The end io callback for the read bio is executed and unlocks the
>> >> >> range [state->start, state->end] (or a superset of that range).
>> >> >> And at clear_extent_bit() the extent_state record state is used
>> >> >> as a second argument to split_state(), which sets state->start to
>> >> >> a larger value;
>> >> >>
>> >> >> 9) The task executing the eviction handler is woken up by the task
>> >> >> executing the bio's end io callback (through clear_state_bit) and
>> >> >> the eviction handler locks the range
>> >> >> [old value for state->start, state->end]. Shortly after, when
>> >> >> calling clear_extent_bit(), it unlocks the range
>> >> >> [new value for state->start, state->end], so it ends up unlocking
>> >> >> only part of the range that it locked, leaving an extent state
>> >> >> record in the io_tree that represents the unlocked subrange;
>> >> >>
>> >> >> 10) The eviction handler loop, in its next iteration, gets the
>> >> >> extent_state record for the subrange that it did not unlock in the
>> >> >> previous step and then tries to lock it, resulting in a hang.
>> >> >>
>> >> >> So fix this by not using the ->start and ->end fields of an existing
>> >> >> extent_state record. This is a simple solution, and an alternative
>> >> >> could be to bump the inode's reference count before submitting each
>> >> >> read bio and have it dropped in the bio's end io callback. But that
>> >> >> would be a more invasive/complex change and would not protect against
>> >> >> other possible places that are not holding a reference on the inode
>> >> >> as well. Something to consider in the future.
>> >> >>
>> >> >> Many thanks to Zygo Blaxell for reporting the issue in the mailing
>> >> >> list, providing a set of scripts to trigger it and testing this fix.
>> >> >>
>> >> >> Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
>> >> >> Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
>> >> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>

[quoted diff trimmed; the full patch appears at the bottom of this page]
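To make the window concrete: end_bio_extent_readpage() unlocks the pages
first and the io_tree ranges only afterwards, so eviction can get past all
the page locks while one or more ranges are still locked. Below is a
minimal userspace sketch of that ordering; every name in it is an
illustrative stand-in modelled on the kernel function it mimics, not the
real API (compile with: cc sketch.c -pthread).

/*
 * Userspace model of the ordering described above (illustrative names,
 * not the real kernel API): the endio thread unlocks the "page" first
 * and the io_tree "range" only afterwards.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t page_unlocked;     /* models unlock_page() in the endio   */
static sem_t range_unlocked;    /* models unlock_extent() in the endio */

static void *end_bio_model(void *arg)
{
        (void)arg;
        sem_post(&page_unlocked);       /* 1) pages are unlocked first... */
        usleep(100 * 1000);             /* widen the race window          */
        sem_post(&range_unlocked);      /* 2) ...the io_tree range later  */
        return NULL;
}

int main(void)
{
        pthread_t endio;

        sem_init(&page_unlocked, 0, 0);
        sem_init(&range_unlocked, 0, 0);
        pthread_create(&endio, NULL, end_bio_model, NULL);

        /* truncate_inode_pages_final(): gets past every page lock... */
        sem_wait(&page_unlocked);
        printf("evict: all pages locked, entering extent_state loop\n");

        /* ...while the io_tree range may still be locked at this point. */
        if (sem_trywait(&range_unlocked) != 0) {
                printf("evict: io_tree range still locked, waiting\n");
                sem_wait(&range_unlocked);  /* where the hang showed up */
        }
        printf("evict: range unlocked, eviction proceeds\n");
        pthread_join(endio, NULL);
        return 0;
}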
On Wed, Jun 10, 2015 at 12:51:15PM +0100, Filipe Manana wrote:
> On Wed, Jun 10, 2015 at 12:27 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Wed, Jun 10, 2015 at 12:12:11PM +0100, Filipe Manana wrote:
> >> On Wed, Jun 10, 2015 at 12:05 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> >> > On Wed, Jun 10, 2015 at 11:44:17AM +0100, Filipe Manana wrote:
> >> >> On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> >> >> > On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@kernel.org wrote:
>
> [...]
>
> > I mean that by the time truncate_inode_pages_final() ends, all the
> > end_bio_extent_readpage() have already been called, since the pages need
> > to be unlocked to let evict() go on.
>
> When you say "have already been called" I understood completely called
> and not still in progress.
> Yes, by that time there's only the possibility of being in progress.
> I'll send a patch to fix the comment since Chris already picked this.

Okay.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
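For what it's worth, the comment fix Filipe mentions only needs to reword
the first sentences; one possible clarified wording, a sketch of the idea
only and not the actual follow-up patch, might read:

        /*
         * Keep looping until we have no more ranges in the io tree.
         * We can have ongoing bios started by readpages whose
         * end_bio_extent_readpage() has not yet run to completion: it
         * may already have unlocked the pages (which is what lets
         * eviction get this far) while the corresponding ranges in the
         * inode's io_tree are unlocked only afterwards, by that same
         * callback.
         */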
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0020b56..ef05800 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode)
 	}
 	write_unlock(&map_tree->lock);
 
+	/*
+	 * Keep looping until we have no more ranges in the io tree.
+	 * We can have ongoing bios started by readpages (called from readahead)
+	 * that didn't get their end io callbacks called yet or they are still
+	 * in progress ((extent_io.c:end_bio_extent_readpage()). This means some
+	 * ranges can still be locked and eviction started because before
+	 * submitting those bios, which are executed by a separate task (work
+	 * queue kthread), inode references (inode->i_count) were not taken
+	 * (which would be dropped in the end io callback of each bio).
+	 * Therefore here we effectively end up waiting for those bios and
+	 * anyone else holding locked ranges without having bumped the inode's
+	 * reference count - if we don't do it, when they access the inode's
+	 * io_tree to unlock a range it may be too late, leading to an
+	 * use-after-free issue.
+	 */
 	spin_lock(&io_tree->lock);
 	while (!RB_EMPTY_ROOT(&io_tree->state)) {
 		struct extent_state *state;
 		struct extent_state *cached_state = NULL;
+		u64 start;
+		u64 end;
 
 		node = rb_first(&io_tree->state);
 		state = rb_entry(node, struct extent_state, rb_node);
-		atomic_inc(&state->refs);
+		start = state->start;
+		end = state->end;
 		spin_unlock(&io_tree->lock);
 
-		lock_extent_bits(io_tree, state->start, state->end,
-				 0, &cached_state);
-		clear_extent_bit(io_tree, state->start, state->end,
+		lock_extent_bits(io_tree, start, end, 0, &cached_state);
+		clear_extent_bit(io_tree, start, end,
 				 EXTENT_LOCKED | EXTENT_DIRTY |
 				 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
 				 EXTENT_DEFRAG, 1, 1,
 				 &cached_state, GFP_NOFS);
-		free_extent_state(state);
 
 		cond_resched();
 		spin_lock(&io_tree->lock);
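The pattern behind the fix deserves a note: copy the fields you need while
still holding the spinlock, instead of keeping a pointer to a record that a
concurrent clear_extent_bit()/split_state() may shrink (this is also why
the atomic_inc(&state->refs)/free_extent_state() pair can go away). Below
is a minimal userspace analogue of that pattern, using hypothetical names
and a single record for brevity.

/*
 * Userspace analogue of the fix (hypothetical names, not kernel code):
 * snapshot a shared record's fields under the lock rather than
 * dereferencing the record again after the lock is dropped.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct extent_state_model {
        uint64_t start;
        uint64_t end;
};

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static struct extent_state_model state = { .start = 0, .end = 4095 };

/* What a concurrent endio's split_state() can do to the record. */
static void split_state_model(uint64_t split)
{
        pthread_mutex_lock(&tree_lock);
        state.start = split;    /* record now covers [split, end] only */
        pthread_mutex_unlock(&tree_lock);
}

int main(void)
{
        uint64_t start, end;

        pthread_mutex_lock(&tree_lock);
        start = state.start;    /* snapshot, as the patch does */
        end = state.end;
        pthread_mutex_unlock(&tree_lock);

        /* A concurrent endio splits the record while eviction is blocked
         * in lock_extent_bits(). */
        split_state_model(2048);

        /* The snapshot still names the full range [0, 4095]. Re-reading
         * state.start here would yield 2048 and leave [0, 2047] locked
         * forever: the leftover subrange of step 9, i.e. the reported
         * hang. */
        printf("unlocking [%llu, %llu]\n",
               (unsigned long long)start, (unsigned long long)end);
        return 0;
}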