diff mbox

[2/2] dm: stay in blk_queue_bypass until queue becomes initialized

Message ID 50890937.7010809@ce.jp.nec.com (mailing list archive)
State Deferred, archived
Headers show

Commit Message

Junichi Nomura Oct. 25, 2012, 9:41 a.m. UTC
[PATCH] dm: stay in blk_queue_bypass until queue becomes initialized

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"),
add_disk() eventually calls blk_queue_bypass_end().
This change invokes the following warning when multipath is used.

  BUG: scheduling while atomic: multipath/2460/0x00000002
  1 lock held by multipath/2460:
   #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
  Modules linked in: ...
  Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
  Call Trace:
   [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
   [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
   [<ffffffff814291e6>] schedule+0x64/0x66
   [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
   [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
   [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
   [<ffffffff814289e3>] wait_for_common+0x9d/0xee
   [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
   [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
   [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
   [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
   [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
   [<ffffffff8106fb18>] ? complete+0x21/0x53
   [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
   [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
   [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
   [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
   [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
   [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
   [<ffffffff811d8c41>] elevator_init+0xe1/0x115
   [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
   [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
   [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
   [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
   [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
   [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
   [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
   [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
   [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
   [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
   [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
   [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b

The warning means during queue initialization blk_queue_bypass_start()
calls sleeping function (synchronize_rcu) while dm holds md->type_lock.

dm device initialization basically includes the following 3 steps:
  1. create ioctl, allocates queue and call add_disk()
  2. table load ioctl, determines device type and initialize queue
     if request-based
  3. resume ioctl, device becomes functional

So it is better to have dm's queue stay in bypass mode until
the initialization completes in table load ioctl.

The effect of additional blk_queue_bypass_start():

  3.7-rc2 (plain)
    # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
                                    dmsetup remove a; done

    real  0m15.434s
    user  0m0.423s
    sys   0m7.052s

  3.7-rc2 (with this patch)
    # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
                                    dmsetup remove a; done
    real  0m19.766s
    user  0m0.442s
    sys   0m6.861s

If this additional cost is not negligible, we need a variant of add_disk()
that does not end bypassing.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alasdair G Kergon <agk@redhat.com>

---
 drivers/md/dm.c        |    4 ++++
 1 file changed, 4 insertions(+)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Vivek Goyal Oct. 26, 2012, 8:21 p.m. UTC | #1
On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
> 
> With 749fefe677 ("block: lift the initial queue bypass mode on
> blk_register_queue() instead of blk_init_allocated_queue()"),
> add_disk() eventually calls blk_queue_bypass_end().
> This change invokes the following warning when multipath is used.
> 
>   BUG: scheduling while atomic: multipath/2460/0x00000002
>   1 lock held by multipath/2460:
>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
>   Modules linked in: ...
>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
>   Call Trace:
>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
>    [<ffffffff814291e6>] schedule+0x64/0x66
>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
>    [<ffffffff8106fb18>] ? complete+0x21/0x53
>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
> 
> The warning means during queue initialization blk_queue_bypass_start()
> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.

md->type_lock is a mutex, isn't it? I thought we are allowed to block
and schedule out under mutex?

add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
So we already have code which can block/wait under md->type_lock. I am
not sure why should we get this warning under a mutex.

Thanks
Vivek

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Junichi Nomura Oct. 29, 2012, 10:15 a.m. UTC | #2
On 10/27/12 05:21, Vivek Goyal wrote:
> On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
>> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
>>
>> With 749fefe677 ("block: lift the initial queue bypass mode on
>> blk_register_queue() instead of blk_init_allocated_queue()"),
>> add_disk() eventually calls blk_queue_bypass_end().
>> This change invokes the following warning when multipath is used.
>>
>>   BUG: scheduling while atomic: multipath/2460/0x00000002
>>   1 lock held by multipath/2460:
>>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
>>   Modules linked in: ...
>>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
>>   Call Trace:
>>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
>>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
>>    [<ffffffff814291e6>] schedule+0x64/0x66
>>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
>>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
>>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
>>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
>>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
>>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
>>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
>>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
>>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
>>    [<ffffffff8106fb18>] ? complete+0x21/0x53
>>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
>>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
>>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
>>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
>>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
>>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
>>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
>>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
>>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
>>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
>>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
>>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
>>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
>>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
>>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
>>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
>>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
>>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
>>
>> The warning means during queue initialization blk_queue_bypass_start()
>> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.
> 
> md->type_lock is a mutex, isn't it? I thought we are allowed to block
> and schedule out under mutex?

Hm, you are right. It's a mutex.
The warning occurs only if I turned on CONFIG_PREEMPT=y.

> add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> So we already have code which can block/wait under md->type_lock. I am
> not sure why should we get this warning under a mutex.

add_disk() is called without md->type_lock.

Call flow is like this:

dm_create
  alloc_dev
    blk_alloc_queue
    alloc_disk
    add_disk
      blk_queue_bypass_end [with 3.7-rc2]

table_load
  dm_lock_md_type [takes md->type_lock]
  dm_setup_md_queue
    blk_init_allocated_queue [when DM_TYPE_REQUEST_BASED]
      elevator_init
        blkcg_activate_policy
          blk_queue_bypass_start <-- THIS triggers the warning
          blk_queue_bypass_end
      blk_queue_bypass_end [with 3.6]
  dm_unlock_md_type

blk_queue_bypass_start() in blkcg_activate_policy was nested call,
that did nothing, with 3.6.
With 3.7-rc2, it becomes the initial call and does
actual draining stuff.
Vivek Goyal Oct. 29, 2012, 4:38 p.m. UTC | #3
On Mon, Oct 29, 2012 at 07:15:08PM +0900, Jun'ichi Nomura wrote:
> On 10/27/12 05:21, Vivek Goyal wrote:
> > On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
> >> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
> >>
> >> With 749fefe677 ("block: lift the initial queue bypass mode on
> >> blk_register_queue() instead of blk_init_allocated_queue()"),
> >> add_disk() eventually calls blk_queue_bypass_end().
> >> This change invokes the following warning when multipath is used.
> >>
> >>   BUG: scheduling while atomic: multipath/2460/0x00000002
> >>   1 lock held by multipath/2460:
> >>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
> >>   Modules linked in: ...
> >>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
> >>   Call Trace:
> >>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
> >>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
> >>    [<ffffffff814291e6>] schedule+0x64/0x66
> >>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
> >>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
> >>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
> >>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
> >>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
> >>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
> >>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
> >>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
> >>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
> >>    [<ffffffff8106fb18>] ? complete+0x21/0x53
> >>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
> >>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
> >>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
> >>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
> >>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
> >>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
> >>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
> >>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
> >>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
> >>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
> >>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
> >>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
> >>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
> >>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
> >>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
> >>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
> >>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
> >>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
> >>
> >> The warning means during queue initialization blk_queue_bypass_start()
> >> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.
> > 
> > md->type_lock is a mutex, isn't it? I thought we are allowed to block
> > and schedule out under mutex?
> 
> Hm, you are right. It's a mutex.
> The warning occurs only if I turned on CONFIG_PREEMPT=y.

Ok, so the question is what's wrong with calling synchronize_rcu() inside
a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney  and
peterz.

> 
> > add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> > So we already have code which can block/wait under md->type_lock. I am
> > not sure why should we get this warning under a mutex.
> 
> add_disk() is called without md->type_lock.
> 
> Call flow is like this:
> 
> dm_create
>   alloc_dev
>     blk_alloc_queue
>     alloc_disk
>     add_disk
>       blk_queue_bypass_end [with 3.7-rc2]
> 
> table_load
>   dm_lock_md_type [takes md->type_lock]
>   dm_setup_md_queue
>     blk_init_allocated_queue [when DM_TYPE_REQUEST_BASED]
>       elevator_init
>         blkcg_activate_policy
>           blk_queue_bypass_start <-- THIS triggers the warning
>           blk_queue_bypass_end
>       blk_queue_bypass_end [with 3.6]
>   dm_unlock_md_type
> 
> blk_queue_bypass_start() in blkcg_activate_policy was nested call,
> that did nothing, with 3.6.
> With 3.7-rc2, it becomes the initial call and does
> actual draining stuff.

Ok. Once we know what's wrong, we should be able to figure out the 
right solution. Artificially putting queue one level deep in bypass
to avoid calling synchronize_rcu() sounds bad.

Thanks
Vivek

> 
> -- 
> Jun'ichi Nomura, NEC Corporation

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Peter Zijlstra Oct. 29, 2012, 4:45 p.m. UTC | #4
On Mon, 2012-10-29 at 12:38 -0400, Vivek Goyal wrote:
> Ok, so the question is what's wrong with calling synchronize_rcu() inside
> a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney  and
> peterz.

int blkcg_activate_policy(struct request_queue *q,

{

	...

        preloaded = !radix_tree_preload(GFP_KERNEL);

        blk_queue_bypass_start(q);




where:

int radix_tree_preload(gfp_t gfp_mask)
{
        struct radix_tree_preload *rtp;
        struct radix_tree_node *node;
        int ret = -ENOMEM;
        
        preempt_disable();


Seems obvious why it explodes..

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Paul E. McKenney Oct. 29, 2012, 4:55 p.m. UTC | #5
On Mon, Oct 29, 2012 at 12:38:45PM -0400, Vivek Goyal wrote:
> On Mon, Oct 29, 2012 at 07:15:08PM +0900, Jun'ichi Nomura wrote:
> > On 10/27/12 05:21, Vivek Goyal wrote:
> > > On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
> > >> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
> > >>
> > >> With 749fefe677 ("block: lift the initial queue bypass mode on
> > >> blk_register_queue() instead of blk_init_allocated_queue()"),
> > >> add_disk() eventually calls blk_queue_bypass_end().
> > >> This change invokes the following warning when multipath is used.
> > >>
> > >>   BUG: scheduling while atomic: multipath/2460/0x00000002

Looks like preemption has been disabled, nested two deep, based
on the /0x00000002, which dumps out preempt_count().

So is someone invoking this with preemption disabled?

> > >>   1 lock held by multipath/2460:
> > >>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
> > >>   Modules linked in: ...
> > >>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
> > >>   Call Trace:
> > >>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
> > >>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
> > >>    [<ffffffff814291e6>] schedule+0x64/0x66
> > >>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
> > >>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
> > >>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
> > >>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
> > >>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
> > >>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
> > >>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
> > >>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
> > >>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
> > >>    [<ffffffff8106fb18>] ? complete+0x21/0x53
> > >>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
> > >>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
> > >>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
> > >>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
> > >>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
> > >>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
> > >>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
> > >>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
> > >>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
> > >>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
> > >>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
> > >>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
> > >>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
> > >>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
> > >>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
> > >>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
> > >>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
> > >>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > >>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
> > >>
> > >> The warning means during queue initialization blk_queue_bypass_start()
> > >> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.
> > > 
> > > md->type_lock is a mutex, isn't it? I thought we are allowed to block
> > > and schedule out under mutex?
> > 
> > Hm, you are right. It's a mutex.
> > The warning occurs only if I turned on CONFIG_PREEMPT=y.
> 
> Ok, so the question is what's wrong with calling synchronize_rcu() inside
> a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney  and
> peterz.

Nothing is wrong with calling synchronize_rcu() inside a mutex, this has
been used and does work.  Last I tried it, anyway.  ;-)

> > > add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> > > So we already have code which can block/wait under md->type_lock. I am
> > > not sure why should we get this warning under a mutex.
> > 
> > add_disk() is called without md->type_lock.
> > 
> > Call flow is like this:
> > 
> > dm_create
> >   alloc_dev
> >     blk_alloc_queue
> >     alloc_disk
> >     add_disk
> >       blk_queue_bypass_end [with 3.7-rc2]
> > 
> > table_load
> >   dm_lock_md_type [takes md->type_lock]
> >   dm_setup_md_queue
> >     blk_init_allocated_queue [when DM_TYPE_REQUEST_BASED]
> >       elevator_init
> >         blkcg_activate_policy
> >           blk_queue_bypass_start <-- THIS triggers the warning
> >           blk_queue_bypass_end
> >       blk_queue_bypass_end [with 3.6]
> >   dm_unlock_md_type
> > 
> > blk_queue_bypass_start() in blkcg_activate_policy was nested call,
> > that did nothing, with 3.6.
> > With 3.7-rc2, it becomes the initial call and does
> > actual draining stuff.
> 
> Ok. Once we know what's wrong, we should be able to figure out the 
> right solution. Artificially putting queue one level deep in bypass
> to avoid calling synchronize_rcu() sounds bad.

Also should be unnecessary.  I suggest finding out where preemption is
disabled. Scattering checks for preempt_count()!=0 might help locate it.

							Thanx, Paul

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Vivek Goyal Oct. 29, 2012, 5:13 p.m. UTC | #6
On Mon, Oct 29, 2012 at 05:45:15PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-10-29 at 12:38 -0400, Vivek Goyal wrote:
> > Ok, so the question is what's wrong with calling synchronize_rcu() inside
> > a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney  and
> > peterz.
> 
> int blkcg_activate_policy(struct request_queue *q,
> 
> {
> 
> 	...
> 
>         preloaded = !radix_tree_preload(GFP_KERNEL);
> 
>         blk_queue_bypass_start(q);
> 
> 
> 
> 
> where:
> 
> int radix_tree_preload(gfp_t gfp_mask)
> {
>         struct radix_tree_preload *rtp;
>         struct radix_tree_node *node;
>         int ret = -ENOMEM;
>         
>         preempt_disable();
> 
> 
> Seems obvious why it explodes..

Oh right. Thanks Peter. So just calling blk_queue_bypass_start() before
radix_tree_preload() should fix it. Junichi, can you please give it
a try.

Thanks
Vivek

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox

Patch

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 02db918..ad02761 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1869,6 +1869,8 @@  static struct mapped_device *alloc_dev(int minor)
 	md->disk->private_data = md;
 	sprintf(md->disk->disk_name, "dm-%d", minor);
 	add_disk(md->disk);
+	/* Until md type is determined, put the queue in bypass mode */
+	blk_queue_bypass_start(md->queue);
 	format_dev_t(md->name, MKDEV(_major, minor));
 
 	md->wq = alloc_workqueue("kdmflush",
@@ -2172,6 +2174,7 @@  static int dm_init_request_based_queue(struct mapped_device *md)
 		return 1;
 
 	/* Fully initialize the queue */
+	WARN_ON(!blk_queue_bypass(md->queue));
 	q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
 	if (!q)
 		return 0;
@@ -2198,6 +2201,7 @@  int dm_setup_md_queue(struct mapped_device *md)
 		return -EINVAL;
 	}
 
+	blk_queue_bypass_end(md->queue);
 	return 0;
 }