diff mbox

Btrfs: fix deadlock between direct IO reads and buffered writes

Message ID 1455817370-1832-1-git-send-email-fdmanana@kernel.org (mailing list archive)
State Accepted
Headers show

Commit Message

Filipe Manana Feb. 18, 2016, 5:42 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

While running a test with a mix of buffered IO and direct IO against
the same files I hit a deadlock reported by the following trace:

[11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
[11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
[11642.149771]     0 15282      2 0x00000000
[11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
[11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
[11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
[11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
[11642.161403] Call Trace:
[11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
[11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
[11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
[11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
[11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.226295] 2 locks held by kworker/u32:3/15282:
[11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
[11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
[11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
[11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
[11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
[11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
[11642.243715] Call Trace:
[11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
[11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
[11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
[11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
[11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
[11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
[11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.282604] 3 locks held by kworker/u32:8/15289:
[11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
[11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
[11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
[11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
[11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
[11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
[11642.298433] Call Trace:
[11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
[11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
[11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
[11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
[11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
[11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
[11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
[11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
[11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.310403] 3 locks held by fdm-stress/26848:
[11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
[11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
[11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
[11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
[11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
[11642.325096] Call Trace:
[11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
[11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
[11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
[11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
[11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
[11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
[11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
[11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
[11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
[11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
[11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
[11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
[11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
[11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
[11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
[11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
[11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
[11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
[11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
[11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.350222] 4 locks held by fdm-stress/26849:
[11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
[11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
[11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
[11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
[11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
[11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
[11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
[11642.373430] Call Trace:
[11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
[11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
[11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
[11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
[11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
[11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
[11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
[11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
[11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
[11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
[11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
[11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
[11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
[11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
[11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
[11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
[11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.416572] 1 lock held by fdm-stress/26850:
[11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
[11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
[11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
[11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
[11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
[11642.426591] Call Trace:
[11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
[11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
[11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
[11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
[11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
[11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
[11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
[11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
[11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
[11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.441533] 3 locks held by fdm-stress/26851:
[11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
[11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
[11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]

This happened because under specific timings the path for direct IO reads
can deadlock with concurrent buffered writes. The diagram below shows how
this happens for an example file that has the following layout:

     [  extent A  ]  [  extent B  ]  [ ....
     0K              4K              8K

     CPU 1                                               CPU 2                             CPU 3

DIO read against range
 [0K, 8K[ starts

btrfs_direct_IO()
  --> calls btrfs_get_blocks_direct()
      which finds the extent map for the
      extent A and leaves the range
      [0K, 4K[ locked in the inode's
      io tree

                                                   buffered write against
                                                   range [4K, 8K[ starts

                                                   __btrfs_buffered_write()
                                                     --> dirties page at 4K

                                                                                     a user space
                                                                                     task calls sync
                                                                                     for e.g or
                                                                                     writepages() is
                                                                                     invoked by mm

                                                                                     writepages()
                                                                                       run_delalloc_range()
                                                                                         cow_file_range()
                                                                                           --> ordered extent X
                                                                                               for the buffered
                                                                                               write is created
                                                                                               and
                                                                                               writeback starts

  --> calls btrfs_get_blocks_direct()
      again, without submitting first
      a bio for reading extent A, and
      finds the extent map for extent B

  --> calls lock_extent_direct()

      --> locks range [4K, 8K[
      --> finds ordered extent X
          covering range [4K, 8K[
      --> unlocks range [4K, 8K[

                                                  buffered write against
                                                  range [0K, 8K[ starts

                                                  __btrfs_buffered_write()
                                                    prepare_pages()
                                                      --> locks pages with
                                                          offsets 0 and 4K
                                                    lock_and_cleanup_extent_if_need()
                                                      --> blocks attempting to
                                                          lock range [0K, 8K[ in
                                                          the inode's io tree,
                                                          because the range [0, 4K[
                                                          is already locked by the
                                                          direct IO task at CPU 1

      --> calls
          btrfs_start_ordered_extent(oe X)

          btrfs_start_ordered_extent(oe X)

            --> At this point writeback for ordered
                extent X has not finished yet

            filemap_fdatawrite_range()
              btrfs_writepages()
                extent_writepages()
                  extent_write_cache_pages()
                    --> finds page with offset 0
                        with the writeback tag
                        (and not dirty)
                    --> tries to lock it
                         --> deadlock, task at CPU 1
                             has the page locked and
                             is blocked on the io range
                             [0, 4K[ that was locked
                             earlier by this task

So fix this by falling back to a buffered read in the direct IO read path
when an ordered extent for a buffered write is found.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

Comments

Liu Bo Feb. 18, 2016, 7:53 p.m. UTC | #1
On Thu, Feb 18, 2016 at 05:42:50PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> While running a test with a mix of buffered IO and direct IO against
> the same files I hit a deadlock reported by the following trace:
> 
> [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
> [11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
> [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
> [11642.149771]     0 15282      2 0x00000000
> [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
> [11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
> [11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
> [11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
> [11642.161403] Call Trace:
> [11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
> [11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
> [11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
> [11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
> [11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
> [11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
> [11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
> [11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
> [11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
> [11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
> [11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
> [11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
> [11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
> [11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
> [11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
> [11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
> [11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
> [11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
> [11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
> [11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
> [11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
> [11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
> [11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
> [11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
> [11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
> [11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
> [11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
> [11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
> [11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
> [11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
> [11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
> [11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
> [11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
> [11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
> [11642.226295] 2 locks held by kworker/u32:3/15282:
> [11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
> [11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
> [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
> [11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
> [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
> [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
> [11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
> [11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
> [11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
> [11642.243715] Call Trace:
> [11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
> [11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
> [11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
> [11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
> [11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
> [11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
> [11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
> [11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
> [11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
> [11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
> [11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
> [11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
> [11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
> [11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
> [11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
> [11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
> [11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
> [11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
> [11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
> [11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
> [11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
> [11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
> [11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
> [11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
> [11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
> [11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
> [11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
> [11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
> [11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
> [11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
> [11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
> [11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
> [11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
> [11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
> [11642.282604] 3 locks held by kworker/u32:8/15289:
> [11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
> [11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
> [11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
> [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
> [11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
> [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
> [11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
> [11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
> [11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
> [11642.298433] Call Trace:
> [11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
> [11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
> [11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
> [11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
> [11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
> [11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
> [11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
> [11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
> [11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
> [11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
> [11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
> [11642.310403] 3 locks held by fdm-stress/26848:
> [11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
> [11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
> [11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
> [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
> [11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
> [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
> [11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
> [11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
> [11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
> [11642.325096] Call Trace:
> [11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
> [11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
> [11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
> [11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
> [11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
> [11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
> [11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
> [11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
> [11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
> [11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
> [11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
> [11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
> [11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
> [11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
> [11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
> [11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
> [11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
> [11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
> [11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
> [11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
> [11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
> [11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
> [11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
> [11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
> [11642.350222] 4 locks held by fdm-stress/26849:
> [11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
> [11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
> [11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
> [11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
> [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
> [11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
> [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
> [11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
> [11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
> [11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
> [11642.373430] Call Trace:
> [11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
> [11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
> [11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
> [11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
> [11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
> [11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
> [11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
> [11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
> [11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
> [11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
> [11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
> [11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
> [11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
> [11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
> [11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
> [11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
> [11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
> [11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
> [11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
> [11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
> [11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
> [11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
> [11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
> [11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
> [11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
> [11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
> [11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
> [11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
> [11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
> [11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
> [11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
> [11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
> [11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
> [11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
> [11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
> [11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
> [11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
> [11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
> [11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
> [11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
> [11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
> [11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
> [11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
> [11642.416572] 1 lock held by fdm-stress/26850:
> [11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
> [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
> [11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
> [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
> [11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
> [11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
> [11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
> [11642.426591] Call Trace:
> [11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
> [11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
> [11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
> [11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
> [11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
> [11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
> [11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
> [11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
> [11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
> [11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
> [11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
> [11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
> [11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
> [11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
> [11642.441533] 3 locks held by fdm-stress/26851:
> [11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
> [11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
> [11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
> 
> This happened because under specific timings the path for direct IO reads
> can deadlock with concurrent buffered writes. The diagram below shows how
> this happens for an example file that has the following layout:
> 
>      [  extent A  ]  [  extent B  ]  [ ....
>      0K              4K              8K
> 
>      CPU 1                                               CPU 2                             CPU 3
> 
> DIO read against range
>  [0K, 8K[ starts
> 
> btrfs_direct_IO()
>   --> calls btrfs_get_blocks_direct()
>       which finds the extent map for the
>       extent A and leaves the range
>       [0K, 4K[ locked in the inode's
>       io tree
> 
>                                                    buffered write against
>                                                    range [4K, 8K[ starts
> 
>                                                    __btrfs_buffered_write()
>                                                      --> dirties page at 4K
> 
>                                                                                      a user space
>                                                                                      task calls sync
>                                                                                      for e.g or
>                                                                                      writepages() is
>                                                                                      invoked by mm
> 
>                                                                                      writepages()
>                                                                                        run_delalloc_range()
>                                                                                          cow_file_range()
>                                                                                            --> ordered extent X
>                                                                                                for the buffered
>                                                                                                write is created
>                                                                                                and
>                                                                                                writeback starts
> 
>   --> calls btrfs_get_blocks_direct()
>       again, without submitting first
>       a bio for reading extent A, and
>       finds the extent map for extent B
> 
>   --> calls lock_extent_direct()
> 
>       --> locks range [4K, 8K[
>       --> finds ordered extent X
>           covering range [4K, 8K[
>       --> unlocks range [4K, 8K[
> 
>                                                   buffered write against
>                                                   range [0K, 8K[ starts
> 
>                                                   __btrfs_buffered_write()
>                                                     prepare_pages()
>                                                       --> locks pages with
>                                                           offsets 0 and 4K
>                                                     lock_and_cleanup_extent_if_need()
>                                                       --> blocks attempting to
>                                                           lock range [0K, 8K[ in
>                                                           the inode's io tree,
>                                                           because the range [0, 4K[
>                                                           is already locked by the
>                                                           direct IO task at CPU 1
> 
>       --> calls
>           btrfs_start_ordered_extent(oe X)
> 
>           btrfs_start_ordered_extent(oe X)
> 
>             --> At this point writeback for ordered
>                 extent X has not finished yet
> 
>             filemap_fdatawrite_range()
>               btrfs_writepages()
>                 extent_writepages()
>                   extent_write_cache_pages()
>                     --> finds page with offset 0
>                         with the writeback tag
>                         (and not dirty)
>                     --> tries to lock it
>                          --> deadlock, task at CPU 1
                                                ^^^^ CPU 2
>                              has the page locked and
>                              is blocked on the io range
>                              [0, 4K[ that was locked
>                              earlier by this task

Looks like the task at _CPU 2_ has the page locked and is stuck in lock_and_cleanup_extent_if_need().

> 
> So fix this by falling back to a buffered read in the direct IO read path
> when an ordered extent for a buffered write is found.

My question is, after we return ENOTBLK for [4k, 8k], will [0, 4k] dio
read submit the bio or fall back to buffered read?

Thanks,

-liubo
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/inode.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a4d3c54a..64191a9 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7419,7 +7419,26 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
>  				     cached_state, GFP_NOFS);
>  
>  		if (ordered) {
> -			btrfs_start_ordered_extent(inode, ordered, 1);
> +			/*
> +			 * If we are doing a DIO read and the ordered extent we
> +			 * found is for a buffered write, we can not wait for it
> +			 * to complete and retry, because if we do so we can
> +			 * deadlock with concurrent buffered writes on page
> +			 * locks. This happens only if our DIO read covers more
> +			 * than one extent map, if at this point has already
> +			 * created an ordered extent for a previous extent map
> +			 * and locked its range in the inode's io tree, and a
> +			 * concurrent write against that previous extent map's
> +			 * range and this range started (we unlock the ranges
> +			 * in the io tree only when the bios complete and
> +			 * buffered writes always lock pages before attempting
> +			 * to lock range in the io tree).
> +			 */
> +			if (writing ||
> +			    test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
> +				btrfs_start_ordered_extent(inode, ordered, 1);
> +			else
> +				ret = -ENOTBLK;
>  			btrfs_put_ordered_extent(ordered);
>  		} else {
>  			/*
> @@ -7436,9 +7455,11 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
>  			 * that page.
>  			 */
>  			ret = -ENOTBLK;
> -			break;
>  		}
>  
> +		if (ret)
> +			break;
> +
>  		cond_resched();
>  	}
>  
> -- 
> 2.7.0.rc3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Feb. 18, 2016, 8:05 p.m. UTC | #2
On Thu, Feb 18, 2016 at 7:53 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Thu, Feb 18, 2016 at 05:42:50PM +0000, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> While running a test with a mix of buffered IO and direct IO against
>> the same files I hit a deadlock reported by the following trace:
>>
>> [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
>> [11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
>> [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
>> [11642.149771]     0 15282      2 0x00000000
>> [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
>> [11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
>> [11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
>> [11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
>> [11642.161403] Call Trace:
>> [11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
>> [11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
>> [11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
>> [11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
>> [11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
>> [11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
>> [11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
>> [11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
>> [11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
>> [11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
>> [11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
>> [11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
>> [11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
>> [11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
>> [11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
>> [11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
>> [11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
>> [11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
>> [11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
>> [11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
>> [11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
>> [11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
>> [11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
>> [11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
>> [11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
>> [11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
>> [11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>> [11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
>> [11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
>> [11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
>> [11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
>> [11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
>> [11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
>> [11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
>> [11642.226295] 2 locks held by kworker/u32:3/15282:
>> [11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
>> [11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
>> [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
>> [11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
>> [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
>> [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
>> [11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
>> [11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
>> [11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
>> [11642.243715] Call Trace:
>> [11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
>> [11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
>> [11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
>> [11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
>> [11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
>> [11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
>> [11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
>> [11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
>> [11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
>> [11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
>> [11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
>> [11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
>> [11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
>> [11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
>> [11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
>> [11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
>> [11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
>> [11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
>> [11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
>> [11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
>> [11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
>> [11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
>> [11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
>> [11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
>> [11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
>> [11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
>> [11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
>> [11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
>> [11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
>> [11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
>> [11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
>> [11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
>> [11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
>> [11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
>> [11642.282604] 3 locks held by kworker/u32:8/15289:
>> [11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
>> [11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
>> [11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
>> [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
>> [11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
>> [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
>> [11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
>> [11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
>> [11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
>> [11642.298433] Call Trace:
>> [11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
>> [11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
>> [11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
>> [11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
>> [11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
>> [11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
>> [11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
>> [11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
>> [11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
>> [11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
>> [11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
>> [11642.310403] 3 locks held by fdm-stress/26848:
>> [11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
>> [11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
>> [11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
>> [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
>> [11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
>> [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
>> [11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
>> [11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
>> [11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
>> [11642.325096] Call Trace:
>> [11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
>> [11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
>> [11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
>> [11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
>> [11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
>> [11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
>> [11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
>> [11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
>> [11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
>> [11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
>> [11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
>> [11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
>> [11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
>> [11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
>> [11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
>> [11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
>> [11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
>> [11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
>> [11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
>> [11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
>> [11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
>> [11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
>> [11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
>> [11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
>> [11642.350222] 4 locks held by fdm-stress/26849:
>> [11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
>> [11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
>> [11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
>> [11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
>> [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
>> [11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
>> [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
>> [11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
>> [11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
>> [11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
>> [11642.373430] Call Trace:
>> [11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
>> [11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
>> [11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
>> [11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
>> [11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
>> [11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
>> [11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
>> [11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
>> [11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
>> [11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
>> [11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
>> [11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
>> [11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
>> [11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
>> [11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
>> [11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
>> [11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
>> [11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
>> [11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
>> [11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
>> [11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
>> [11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
>> [11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
>> [11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
>> [11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
>> [11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
>> [11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
>> [11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
>> [11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
>> [11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
>> [11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
>> [11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
>> [11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
>> [11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
>> [11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
>> [11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
>> [11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
>> [11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
>> [11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
>> [11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
>> [11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
>> [11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
>> [11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
>> [11642.416572] 1 lock held by fdm-stress/26850:
>> [11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
>> [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
>> [11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
>> [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
>> [11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
>> [11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
>> [11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
>> [11642.426591] Call Trace:
>> [11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
>> [11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
>> [11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
>> [11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
>> [11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
>> [11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
>> [11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
>> [11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
>> [11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
>> [11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
>> [11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
>> [11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
>> [11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
>> [11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
>> [11642.441533] 3 locks held by fdm-stress/26851:
>> [11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
>> [11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
>> [11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
>>
>> This happened because under specific timings the path for direct IO reads
>> can deadlock with concurrent buffered writes. The diagram below shows how
>> this happens for an example file that has the following layout:
>>
>>      [  extent A  ]  [  extent B  ]  [ ....
>>      0K              4K              8K
>>
>>      CPU 1                                               CPU 2                             CPU 3
>>
>> DIO read against range
>>  [0K, 8K[ starts
>>
>> btrfs_direct_IO()
>>   --> calls btrfs_get_blocks_direct()
>>       which finds the extent map for the
>>       extent A and leaves the range
>>       [0K, 4K[ locked in the inode's
>>       io tree
>>
>>                                                    buffered write against
>>                                                    range [4K, 8K[ starts
>>
>>                                                    __btrfs_buffered_write()
>>                                                      --> dirties page at 4K
>>
>>                                                                                      a user space
>>                                                                                      task calls sync
>>                                                                                      for e.g or
>>                                                                                      writepages() is
>>                                                                                      invoked by mm
>>
>>                                                                                      writepages()
>>                                                                                        run_delalloc_range()
>>                                                                                          cow_file_range()
>>                                                                                            --> ordered extent X
>>                                                                                                for the buffered
>>                                                                                                write is created
>>                                                                                                and
>>                                                                                                writeback starts
>>
>>   --> calls btrfs_get_blocks_direct()
>>       again, without submitting first
>>       a bio for reading extent A, and
>>       finds the extent map for extent B
>>
>>   --> calls lock_extent_direct()
>>
>>       --> locks range [4K, 8K[
>>       --> finds ordered extent X
>>           covering range [4K, 8K[
>>       --> unlocks range [4K, 8K[
>>
>>                                                   buffered write against
>>                                                   range [0K, 8K[ starts
>>
>>                                                   __btrfs_buffered_write()
>>                                                     prepare_pages()
>>                                                       --> locks pages with
>>                                                           offsets 0 and 4K
>>                                                     lock_and_cleanup_extent_if_need()
>>                                                       --> blocks attempting to
>>                                                           lock range [0K, 8K[ in
>>                                                           the inode's io tree,
>>                                                           because the range [0, 4K[
>>                                                           is already locked by the
>>                                                           direct IO task at CPU 1
>>
>>       --> calls
>>           btrfs_start_ordered_extent(oe X)
>>
>>           btrfs_start_ordered_extent(oe X)
>>
>>             --> At this point writeback for ordered
>>                 extent X has not finished yet
>>
>>             filemap_fdatawrite_range()
>>               btrfs_writepages()
>>                 extent_writepages()
>>                   extent_write_cache_pages()
>>                     --> finds page with offset 0
>>                         with the writeback tag
>>                         (and not dirty)
>>                     --> tries to lock it
>>                          --> deadlock, task at CPU 1
>                                                 ^^^^ CPU 2
>>                              has the page locked and
>>                              is blocked on the io range
>>                              [0, 4K[ that was locked
>>                              earlier by this task
>
> Looks like the task at _CPU 2_ has the page locked and is stuck in lock_and_cleanup_extent_if_need().

Yes, that's what is said above in the CPU 2 column, isn't it?

>
>>
>> So fix this by falling back to a buffered read in the direct IO read path
>> when an ordered extent for a buffered write is found.
>
> My question is, after we return ENOTBLK for [4k, 8k], will [0, 4k] dio
> read submit the bio or fall back to buffered read?

Yes, it will call the submit bio callback for all the previous calls
to btrfs_get_block_direct that succeeded.

thanks.

>
> Thanks,
>
> -liubo
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/inode.c | 25 +++++++++++++++++++++++--
>>  1 file changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index a4d3c54a..64191a9 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -7419,7 +7419,26 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
>>                                    cached_state, GFP_NOFS);
>>
>>               if (ordered) {
>> -                     btrfs_start_ordered_extent(inode, ordered, 1);
>> +                     /*
>> +                      * If we are doing a DIO read and the ordered extent we
>> +                      * found is for a buffered write, we can not wait for it
>> +                      * to complete and retry, because if we do so we can
>> +                      * deadlock with concurrent buffered writes on page
>> +                      * locks. This happens only if our DIO read covers more
>> +                      * than one extent map, if at this point has already
>> +                      * created an ordered extent for a previous extent map
>> +                      * and locked its range in the inode's io tree, and a
>> +                      * concurrent write against that previous extent map's
>> +                      * range and this range started (we unlock the ranges
>> +                      * in the io tree only when the bios complete and
>> +                      * buffered writes always lock pages before attempting
>> +                      * to lock range in the io tree).
>> +                      */
>> +                     if (writing ||
>> +                         test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
>> +                             btrfs_start_ordered_extent(inode, ordered, 1);
>> +                     else
>> +                             ret = -ENOTBLK;
>>                       btrfs_put_ordered_extent(ordered);
>>               } else {
>>                       /*
>> @@ -7436,9 +7455,11 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
>>                        * that page.
>>                        */
>>                       ret = -ENOTBLK;
>> -                     break;
>>               }
>>
>> +             if (ret)
>> +                     break;
>> +
>>               cond_resched();
>>       }
>>
>> --
>> 2.7.0.rc3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Feb. 20, 2016, 1:02 a.m. UTC | #3
On Thu, Feb 18, 2016 at 08:05:33PM +0000, Filipe Manana wrote:
> On Thu, Feb 18, 2016 at 7:53 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Thu, Feb 18, 2016 at 05:42:50PM +0000, fdmanana@kernel.org wrote:
> >> From: Filipe Manana <fdmanana@suse.com>
[long stack]
> >> This happened because under specific timings the path for direct IO reads
> >> can deadlock with concurrent buffered writes. The diagram below shows how
> >> this happens for an example file that has the following layout:
> >>
> >>      [  extent A  ]  [  extent B  ]  [ ....
> >>      0K              4K              8K
> >>
> >>      CPU 1                                               CPU 2                             CPU 3
> >>
> >> DIO read against range
> >>  [0K, 8K[ starts
> >>
> >> btrfs_direct_IO()
> >>   --> calls btrfs_get_blocks_direct()
> >>       which finds the extent map for the
> >>       extent A and leaves the range
> >>       [0K, 4K[ locked in the inode's
> >>       io tree
> >>
> >>                                                    buffered write against
> >>                                                    range [4K, 8K[ starts
> >>
> >>                                                    __btrfs_buffered_write()
> >>                                                      --> dirties page at 4K
> >>
> >>                                                                                      a user space
> >>                                                                                      task calls sync
> >>                                                                                      for e.g or
> >>                                                                                      writepages() is
> >>                                                                                      invoked by mm
> >>
> >>                                                                                      writepages()
> >>                                                                                        run_delalloc_range()
> >>                                                                                          cow_file_range()
> >>                                                                                            --> ordered extent X
> >>                                                                                                for the buffered
> >>                                                                                                write is created
> >>                                                                                                and
> >>                                                                                                writeback starts
> >>
> >>   --> calls btrfs_get_blocks_direct()
> >>       again, without submitting first
> >>       a bio for reading extent A, and
> >>       finds the extent map for extent B
> >>
> >>   --> calls lock_extent_direct()
> >>
> >>       --> locks range [4K, 8K[
> >>       --> finds ordered extent X
> >>           covering range [4K, 8K[
> >>       --> unlocks range [4K, 8K[
> >>
> >>                                                   buffered write against
> >>                                                   range [0K, 8K[ starts
> >>
> >>                                                   __btrfs_buffered_write()
> >>                                                     prepare_pages()
> >>                                                       --> locks pages with
> >>                                                           offsets 0 and 4K
> >>                                                     lock_and_cleanup_extent_if_need()
> >>                                                       --> blocks attempting to
> >>                                                           lock range [0K, 8K[ in
> >>                                                           the inode's io tree,
> >>                                                           because the range [0, 4K[
> >>                                                           is already locked by the
> >>                                                           direct IO task at CPU 1
> >>
> >>       --> calls
> >>           btrfs_start_ordered_extent(oe X)
> >>
> >>           btrfs_start_ordered_extent(oe X)
> >>
> >>             --> At this point writeback for ordered
> >>                 extent X has not finished yet
> >>
> >>             filemap_fdatawrite_range()
> >>               btrfs_writepages()
> >>                 extent_writepages()
> >>                   extent_write_cache_pages()
> >>                     --> finds page with offset 0
> >>                         with the writeback tag
> >>                         (and not dirty)
> >>                     --> tries to lock it
> >>                          --> deadlock, task at CPU 1
> >                                                 ^^^^ CPU 2
> >>                              has the page locked and
> >>                              is blocked on the io range
> >>                              [0, 4K[ that was locked
> >>                              earlier by this task
> >
> > Looks like the task at _CPU 2_ has the page locked and is stuck in lock_and_cleanup_extent_if_need().
> 
> Yes, that's what is said above in the CPU 2 column, isn't it?

Yes, I was just trying to point a typo out.

> 
> >
> >>
> >> So fix this by falling back to a buffered read in the direct IO read path
> >> when an ordered extent for a buffered write is found.
> >
> > My question is, after we return ENOTBLK for [4k, 8k], will [0, 4k] dio
> > read submit the bio or fall back to buffered read?
> 
> Yes, it will call the submit bio callback for all the previous calls
> to btrfs_get_block_direct that succeeded.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Btw, after revisiting the code, I think that it is not necessary to limit
 'len' to sectorsize (which ends up with the above deadlock), the reason
 we did that is we can use extent_io_tree to store checksum, but now
we're in fact using csum array in btrfs_io_bio.  After that limit is
removed, xfstests cases in aio groups doesn't report errors, but I'm not
sure if it would survive your stress tests.

(cc Josef as Josef made that change.)

Thanks,

-liubo

> 
> thanks.
> 
> >
> > Thanks,
> >
> > -liubo
> >>
> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >> ---
> >>  fs/btrfs/inode.c | 25 +++++++++++++++++++++++--
> >>  1 file changed, 23 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >> index a4d3c54a..64191a9 100644
> >> --- a/fs/btrfs/inode.c
> >> +++ b/fs/btrfs/inode.c
> >> @@ -7419,7 +7419,26 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
> >>                                    cached_state, GFP_NOFS);
> >>
> >>               if (ordered) {
> >> -                     btrfs_start_ordered_extent(inode, ordered, 1);
> >> +                     /*
> >> +                      * If we are doing a DIO read and the ordered extent we
> >> +                      * found is for a buffered write, we can not wait for it
> >> +                      * to complete and retry, because if we do so we can
> >> +                      * deadlock with concurrent buffered writes on page
> >> +                      * locks. This happens only if our DIO read covers more
> >> +                      * than one extent map, if at this point has already
> >> +                      * created an ordered extent for a previous extent map
> >> +                      * and locked its range in the inode's io tree, and a
> >> +                      * concurrent write against that previous extent map's
> >> +                      * range and this range started (we unlock the ranges
> >> +                      * in the io tree only when the bios complete and
> >> +                      * buffered writes always lock pages before attempting
> >> +                      * to lock range in the io tree).
> >> +                      */
> >> +                     if (writing ||
> >> +                         test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
> >> +                             btrfs_start_ordered_extent(inode, ordered, 1);
> >> +                     else
> >> +                             ret = -ENOTBLK;
> >>                       btrfs_put_ordered_extent(ordered);
> >>               } else {
> >>                       /*
> >> @@ -7436,9 +7455,11 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
> >>                        * that page.
> >>                        */
> >>                       ret = -ENOTBLK;
> >> -                     break;
> >>               }
> >>
> >> +             if (ret)
> >> +                     break;
> >> +
> >>               cond_resched();
> >>       }
> >>
> >> --
> >> 2.7.0.rc3
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a4d3c54a..64191a9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7419,7 +7419,26 @@  static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 				     cached_state, GFP_NOFS);
 
 		if (ordered) {
-			btrfs_start_ordered_extent(inode, ordered, 1);
+			/*
+			 * If we are doing a DIO read and the ordered extent we
+			 * found is for a buffered write, we can not wait for it
+			 * to complete and retry, because if we do so we can
+			 * deadlock with concurrent buffered writes on page
+			 * locks. This happens only if our DIO read covers more
+			 * than one extent map, if at this point has already
+			 * created an ordered extent for a previous extent map
+			 * and locked its range in the inode's io tree, and a
+			 * concurrent write against that previous extent map's
+			 * range and this range started (we unlock the ranges
+			 * in the io tree only when the bios complete and
+			 * buffered writes always lock pages before attempting
+			 * to lock range in the io tree).
+			 */
+			if (writing ||
+			    test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
+				btrfs_start_ordered_extent(inode, ordered, 1);
+			else
+				ret = -ENOTBLK;
 			btrfs_put_ordered_extent(ordered);
 		} else {
 			/*
@@ -7436,9 +7455,11 @@  static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 			 * that page.
 			 */
 			ret = -ENOTBLK;
-			break;
 		}
 
+		if (ret)
+			break;
+
 		cond_resched();
 	}