| Message ID | 0bea516a54b26e4e1c42e6fe47548cb48cc4172b.1565112813.git.osandov@fb.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Btrfs: fix workqueue deadlock on dependent filesystems |
On 6.08.19 at 20:34, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
>
> 1. The top (loop device) filesystem queues an async_cow work item from
>    cow_file_range_async(). We'll call this work X.
> 2. Worker thread A starts work X (normal_work_helper()).
> 3. Worker thread A executes the ordered work for the top filesystem
>    (run_ordered_work()).
> 4. Worker thread A finishes the ordered work for work X and frees X
>    (work->ordered_free()).
> 5. Worker thread A executes another ordered work and gets blocked on I/O
>    to the bottom filesystem (still in run_ordered_work()).
> 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
>    work item which happens to be the recently-freed X.
> 7. The workqueue code sees that X is already being executed by worker
>    thread A, so it schedules X to be executed _after_ worker thread A
>    finishes (see the find_worker_executing_work() call in
>    process_one_work()).

Isn't the bigger problem that a single run_ordered_work() could
potentially run the ordered work for more than one normal work? E.g.,
what if btrfs' code were reworked such that run_ordered_work() executed
ordered_func for just one work item (the one which called the function
in the first place)? Wouldn't that also resolve the issue? Correct me
if I'm wrong, but it seems silly to have one work item outlive
ordered_free, which is what currently happens, right?

> Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> the bottom filesystem is waiting for the top filesystem to finish, so we
> deadlock.
>
> This happens because we are breaking the workqueue assumption that a
> work item cannot be recycled while it still depends on other work. Fix
> it by waiting to free the work item until we are done with all of the
> related ordered work.
>
> P.S.:
>
> One might ask why the workqueue code doesn't try to detect a recycled
> work item. It actually does try by checking whether the work item has
> the same work function (find_worker_executing_work()), but in our case
> the function is the same. This is the only key that the workqueue code
> has available to compare, short of adding an additional, layer-violating
> "custom key". Considering that we're the only ones that have ever hit
> this, we should just play by the rules.
>
> Unfortunately, we haven't been able to create a minimal reproducer other
> than our full container setup using a compress-force=zstd filesystem on
> top of another compress-force=zstd filesystem.
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> [...]
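Regarding step 7 above: the non-reentrancy check lives in kernel/workqueue.c. The sketch below is a condensed rendering of find_worker_executing_work() (locking, hash-table setup, and unrelated worker fields elided), included to show why the work pointer plus the work function is the only identity the workqueue layer has:

/*
 * Condensed sketch of find_worker_executing_work() from
 * kernel/workqueue.c (locking and unrelated fields elided). Each busy
 * worker records the work pointer and work function it is running; a
 * newly queued item matching both is assumed to be the same logical
 * work and is queued behind the running instance rather than run
 * concurrently. A freed-and-recycled item with the same address and
 * function is indistinguishable, which is the trap this patch avoids.
 */
struct worker {
	struct work_struct	*current_work;	/* work being executed */
	work_func_t		current_func;	/* its work function */
	struct hlist_node	hentry;		/* entry in pool->busy_hash */
};

static struct worker *find_worker_executing_work(struct worker_pool *pool,
						 struct work_struct *work)
{
	struct worker *worker;

	hash_for_each_possible(pool->busy_hash, worker, hentry,
			       (unsigned long)work)
		if (worker->current_work == work &&
		    worker->current_func == work->func)
			return worker;

	return NULL;
}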
On Wed, Aug 07, 2019 at 10:17:26AM +0300, Nikolay Borisov wrote:
> On 6.08.19 at 20:34, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> >
> > [...]
>
> Isn't the bigger problem that a single run_ordered_work could
> potentially run the ordered work for more than one normal work? E.g.,
> what if btrfs' code is reworked such that run_ordered_work executes
> ordered_func for just one work item (the one which called the function
> in the first place)? Wouldn't that also resolve the issue? Correct me
> if I'm wrong, but it seems silly to have one work item outlive
> ordered_free, which is what currently happens, right?

We can't always run the ordered work for the normal work because then it
wouldn't be ordered :) If work item N completes before item N-1, then we
can't run the ordered work for N yet. Then, when N-1 completes, we need
to do the ordered work for N-1 _and_ N, which is how we get in this
situation.
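To make this concrete, here is a minimal user-space model (hypothetical code, not from btrfs; the names done[], next_ordered, ordered_fn(), and item_done() are invented for illustration) of why the item that finishes last can trigger a whole batch of ordered completions in one pass:

/*
 * Minimal user-space model (hypothetical, not btrfs code) of draining
 * an ordered list: ordered_fn() for item N may only run after items
 * 0..N-1 have run theirs, so whichever item finishes last triggers a
 * batch of ordered completions in a single pass.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_ITEMS 3

static bool done[NR_ITEMS];	/* analogue of WORK_DONE_BIT */
static int next_ordered;	/* first item whose ordered_fn() hasn't run */

static void ordered_fn(int i)
{
	printf("ordered work for item %d\n", i);
}

/* Called after item i's normal work finishes. */
static void item_done(int i)
{
	done[i] = true;
	/* Drain in order, stopping at the first item still in flight. */
	while (next_ordered < NR_ITEMS && done[next_ordered])
		ordered_fn(next_ordered++);
}

int main(void)
{
	item_done(2);	/* item 0 still in flight: prints nothing */
	item_done(1);	/* likewise */
	item_done(0);	/* releases ordered work for items 0, 1, and 2 */
	return 0;
}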
On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
>
> [...]
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good to me, thanks.
Another variant of the problem Liu fixed back in 2014 (commit
9e0af23764344f7f1b68e4eefbe7dc865018b63d).
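For context, that 2014 commit worked around the same workqueue-identity problem by giving each btrfs workqueue type its own work function, so find_worker_executing_work() would not match items from different queues. A sketch from memory of that pre-5.4 code (exact details may differ):

/*
 * Sketch, from memory, of the per-type helpers added by commit
 * 9e0af23764344f7f1b68e4eefbe7dc865018b63d (exact details may differ):
 * each btrfs workqueue type gets a distinct work function, so
 * find_worker_executing_work() sees different current_func values for,
 * e.g., delalloc work vs. endio work and cannot confuse a recycled
 * item from one queue with a running item from another.
 */
#define BTRFS_WORK_HELPER(name)						\
void btrfs_##name(struct work_struct *arg)				\
{									\
	struct btrfs_work *work = container_of(arg, struct btrfs_work,	\
					       normal_work);		\
	normal_work_helper(work);					\
}

BTRFS_WORK_HELPER(worker_helper);
BTRFS_WORK_HELPER(delalloc_helper);
BTRFS_WORK_HELPER(endio_helper);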
On Mon, Aug 12, 2019 at 12:38:55PM +0100, Filipe Manana wrote:
> On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > From: Omar Sandoval <osandov@fb.com>
> >
> > [...]
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Looks good to me, thanks.
> Another variant of the problem Liu fixed back in 2014 (commit
> 9e0af23764344f7f1b68e4eefbe7dc865018b63d).

Good point. I think we can actually get rid of those unique helpers with
this fix. I'll send some followup cleanups.
On Mon, Aug 12, 2019 at 7:48 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Mon, Aug 12, 2019 at 12:38:55PM +0100, Filipe Manana wrote:
> > [...]
> >
> > Looks good to me, thanks.
> > Another variant of the problem Liu fixed back in 2014 (commit
> > 9e0af23764344f7f1b68e4eefbe7dc865018b63d).
>
> Good point. I think we can actually get rid of those unique helpers with
> this fix. I'll send some followup cleanups.

Great! Thanks.
On Tue, Aug 06, 2019 at 10:34:52AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
>
> [...]
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Added to misc-next, thanks.
diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 122cb97c7909..b2bfde560331 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -250,16 +250,17 @@ static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
 	}
 }
 
-static void run_ordered_work(struct __btrfs_workqueue *wq)
+static void run_ordered_work(struct btrfs_work *self)
 {
+	struct __btrfs_workqueue *wq = self->wq;
 	struct list_head *list = &wq->ordered_list;
 	struct btrfs_work *work;
 	spinlock_t *lock = &wq->list_lock;
 	unsigned long flags;
+	void *wtag;
+	bool free_self = false;
 
 	while (1) {
-		void *wtag;
-
 		spin_lock_irqsave(lock, flags);
 		if (list_empty(list))
 			break;
@@ -285,16 +286,47 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
 		list_del(&work->ordered_list);
 		spin_unlock_irqrestore(lock, flags);
 
-		/*
-		 * We don't want to call the ordered free functions with the
-		 * lock held though. Save the work as tag for the trace event,
-		 * because the callback could free the structure.
-		 */
-		wtag = work;
-		work->ordered_free(work);
-		trace_btrfs_all_work_done(wq->fs_info, wtag);
+		if (work == self) {
+			/*
+			 * This is the work item that the worker is currently
+			 * executing.
+			 *
+			 * The kernel workqueue code guarantees non-reentrancy
+			 * of work items. I.e., if a work item with the same
+			 * address and work function is queued twice, the second
+			 * execution is blocked until the first one finishes. A
+			 * work item may be freed and recycled with the same
+			 * work function; the workqueue code assumes that the
+			 * original work item cannot depend on the recycled work
+			 * item in that case (see find_worker_executing_work()).
+			 *
+			 * Note that the work of one Btrfs filesystem may depend
+			 * on the work of another Btrfs filesystem via, e.g., a
+			 * loop device. Therefore, we must not allow the current
+			 * work item to be recycled until we are really done,
+			 * otherwise we break the above assumption and can
+			 * deadlock.
+			 */
+			free_self = true;
+		} else {
+			/*
+			 * We don't want to call the ordered free functions with
+			 * the lock held though. Save the work as tag for the
+			 * trace event, because the callback could free the
+			 * structure.
+			 */
+			wtag = work;
+			work->ordered_free(work);
+			trace_btrfs_all_work_done(wq->fs_info, wtag);
+		}
 	}
 	spin_unlock_irqrestore(lock, flags);
 
+	if (free_self) {
+		wtag = self;
+		self->ordered_free(self);
+		trace_btrfs_all_work_done(wq->fs_info, wtag);
+	}
 }
 
 static void normal_work_helper(struct btrfs_work *work)
@@ -322,7 +354,7 @@ static void normal_work_helper(struct btrfs_work *work)
 	work->func(work);
 	if (need_order) {
 		set_bit(WORK_DONE_BIT, &work->flags);
-		run_ordered_work(wq);
+		run_ordered_work(work);
 	}
 	if (!need_order)
 		trace_btrfs_all_work_done(wq->fs_info, wtag);
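For readers skimming the diff, the shape of the fix reduces to the following user-space model (hypothetical code; locking, tracing, and the thresholding logic are elided, and run_ordered_work_model() is an invented name): the drain loop still runs every completed item's ordered_func in order, but the free of the item the worker itself is executing is deferred until the drain finishes, so its memory cannot be recycled while the worker still depends on other ordered work.

/*
 * User-space model of the fix (hypothetical; locking, tracing, and
 * thresholding elided). The drain loop runs ordered_func for every
 * completed item, in order, but the free of the item the worker itself
 * is executing is deferred until after the loop.
 */
#include <stdbool.h>
#include <stddef.h>

struct work {
	struct work *next;		/* ordered list linkage */
	bool done;			/* normal work has finished */
	void (*ordered_func)(struct work *);
	void (*ordered_free)(struct work *);
};

static void run_ordered_work_model(struct work **list, struct work *self)
{
	bool free_self = false;
	struct work *work;

	while ((work = *list) != NULL && work->done) {
		*list = work->next;
		work->ordered_func(work);
		if (work == self)
			free_self = true;	/* still executing: defer */
		else
			work->ordered_free(work);
	}
	/*
	 * Only now may "self" be freed and recycled: this worker no
	 * longer touches any ordered work that could depend on it.
	 */
	if (free_self)
		self->ordered_free(self);
}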