diff mbox

[1/2] rbd: don't queue watch delayed work when we are removing device

Message ID 5B11116C.3080004@easystack.cn (mailing list archive)
State New, archived
Headers show

Commit Message

Dongsheng Yang June 1, 2018, 9:27 a.m. UTC
On 05/31/2018 10:42 PM, Ilya Dryomov wrote:
> On Thu, May 31, 2018 at 10:34 AM, Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>>
>> On 05/31/2018 04:13 PM, Ilya Dryomov wrote:
>>> On Thu, May 31, 2018 at 9:18 AM, Dongsheng Yang
>>> <dongsheng.yang@easystack.cn> wrote:
>>>> On 05/31/2018 03:11 AM, Ilya Dryomov wrote:
>>>>> On Tue, May 29, 2018 at 5:22 AM, Dongsheng Yang
>>>>> <dongsheng.yang@easystack.cn> wrote:
>>>>>> We will cancel all watch delayed work in
>>>>>> cancel_delayed_work_sync(&rbd_dev->watch_dwork);
>>>>>> If we queue delayed work after this, there will be a use-after-free
>>>>>> problem:
>>>>>>
>>>>>> [  549.932085] BUG: unable to handle kernel NULL pointer dereference at
>>>>>> 0000000000000000
>>>>>> [  549.934134] PGD 0 P4D 0
>>>>>> [  549.935145] Oops: 0000 [#1] SMP PTI
>>>>>> [  549.936283] Modules linked in: rbd(OE) libceph(OE) tcp_diag udp_diag
>>>>>> inet_diag unix_diag af_packet_diag netlink_diag dns_resolver
>>>>>> ebtable_filter
>>>>>> ebtables ip6table_filter ip6_tables iptable_filter sg cfg80211 rfkill
>>>>>> snd_hda_codec_generic ext4 snd_hda_intel snd_hda_codec crct10dif_pclmul
>>>>>> crc32_pclmul ghash_clmulni_intel snd_hda_core pcbc snd_hwdep snd_seq
>>>>>> mbcache
>>>>>> aesni_intel snd_seq_device jbd2 crypto_simd nfsd cryptd glue_helper
>>>>>> snd_pcm
>>>>>> snd_timer auth_rpcgss pcspkr snd virtio_balloon nfs_acl soundcore
>>>>>> i2c_piix4
>>>>>> lockd grace sunrpc ip_tables xfs libcrc32c virtio_console virtio_blk
>>>>>> ata_generic pata_acpi 8139too qxl drm_kms_helper syscopyarea
>>>>>> sysfillrect
>>>>>> sysimgblt fb_sys_fops ttm drm ata_piix libata crc32c_intel virtio_pci
>>>>>> 8139cp
>>>>>> virtio_ring i2c_core mii virtio floppy serio_raw dm_mirror
>>>>>> dm_region_hash
>>>>>> [  549.951835]  dm_log dm_mod dax [last unloaded: libceph]
>>>>>> [  549.953490] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G           OE
>>>>>> 4.17.0-rc6+ #13
>>>>>> [  549.955502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>>>>>> [  549.957246] RIP: 0010:__queue_work+0x6a/0x3b0
>>>>>> [  549.958744] RSP: 0018:ffff9427df1c3e90 EFLAGS: 00010086
>>>>>> [  549.960374] RAX: ffff9427deca8400 RBX: 0000000000000000 RCX:
>>>>>> 0000000000000000
>>>>>> [  549.962297] RDX: ffff9427deca8400 RSI: ffff9427df1c3e50 RDI:
>>>>>> 0000000000000000
>>>>>> [  549.964216] RBP: ffff942783e39e00 R08: ffff9427deca8400 R09:
>>>>>> ffff9427df1c3f00
>>>>>> [  549.966136] R10: 0000000000000004 R11: 0000000000000005 R12:
>>>>>> ffff9427cfb85970
>>>>>> [  549.968070] R13: 0000000000002000 R14: 000000000001eca0 R15:
>>>>>> 0000000000000007
>>>>>> [  549.969999] FS:  0000000000000000(0000) GS:ffff9427df1c0000(0000)
>>>>>> knlGS:0000000000000000
>>>>>> [  549.972069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [  549.973775] CR2: 0000000000000000 CR3: 00000004c900a005 CR4:
>>>>>> 00000000000206e0
>>>>>> [  549.975695] Call Trace:
>>>>>> [  549.976900]  <IRQ>
>>>>>> [  549.978033]  ? __queue_work+0x3b0/0x3b0
>>>>>> [  549.979442]  call_timer_fn+0x2d/0x130
>>>>>> [  549.980824]  run_timer_softirq+0x16e/0x430
>>>>>> [  549.982263]  ? tick_sched_timer+0x37/0x70
>>>>>> [  549.983691]  __do_softirq+0xd2/0x280
>>>>>> [  549.985035]  irq_exit+0xd5/0xe0
>>>>>> [  549.986316]  smp_apic_timer_interrupt+0x6c/0x130
>>>>>> [  549.987835]  apic_timer_interrupt+0xf/0x20
>>>>>>
>>>>>> This patch forbid to queue watch_dwork when we are removing device.
>>>>>>
>>>>>> Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
>>>>>> ---
>>>>>>     drivers/block/rbd.c | 10 +++++++---
>>>>>>     1 file changed, 7 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>> index 2b4e90d..d1d8f46 100644
>>>>>> --- a/drivers/block/rbd.c
>>>>>> +++ b/drivers/block/rbd.c
>>>>>> @@ -3475,9 +3475,13 @@ static void rbd_reregister_watch(struct
>>>>>> work_struct *work)
>>>>>>                            set_bit(RBD_DEV_FLAG_BLACKLISTED,
>>>>>> &rbd_dev->flags);
>>>>>>                            wake_requests(rbd_dev, true);
>>>>>>                    } else {
>>>>>> -                       queue_delayed_work(rbd_dev->task_wq,
>>>>>> -                                          &rbd_dev->watch_dwork,
>>>>>> -                                          RBD_RETRY_DELAY);
>>>>>> +                       spin_lock_irq(&rbd_dev->lock);
>>>>>> +                       if (!test_bit(RBD_DEV_FLAG_REMOVING,
>>>>>> &rbd_dev->flags)) {
>>>>>> +                               queue_delayed_work(rbd_dev->task_wq,
>>>>>> +
>>>>>> &rbd_dev->watch_dwork,
>>>>>> +                                                  RBD_RETRY_DELAY);
>>>>>> +                       }
>>>>>> +                       spin_unlock_irq(&rbd_dev->lock);
>>>>>>                    }
>>>>>>                    mutex_unlock(&rbd_dev->watch_mutex);
>>>>>>                    return;
>>>>> Hi Dongsheng,
>>>>>
>>>>> What made you think it is rbd (or ceph) related?  Do you know what gets
>>>>> dereferenced?  Is it reproducible?
>>>>
>>>> Hi Ilya,
>>>>       There is a simple reproduce script:
>>>>
>>>> ./vstart.sh  -k -l --bluestore
>>>> rbd map -o osd_request_timeout=10 test1
>>>> time dd if=/dev/zero of=/dev/rbd0 bs=64K count=1000 oflag=direct &
>>>> sleep 1
>>>> ps -ef|grep -E "ceph-mon|ceph-osd"|gawk '{print "kill -9 "$2}'|bash
>>>> rbd unmap -o force /dev/rbd0
>>>>
>>>> But maybe you need to run this script in a loop, because that's not 100%
>>>> happen.
>>> Ah, I see.  I think a more correct fix would be to move ->watch_dwork
>>> cancellation:
>>>
>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>> index 2b4e90d06822..23ae0df7a978 100644
>>> --- a/drivers/block/rbd.c
>>> +++ b/drivers/block/rbd.c
>>> @@ -3400,7 +3400,6 @@ static void cancel_tasks_sync(struct rbd_device
>>> *rbd_dev)
>>>    {
>>>           dout("%s rbd_dev %p\n", __func__, rbd_dev);
>>>
>>> -       cancel_delayed_work_sync(&rbd_dev->watch_dwork);
>>>           cancel_work_sync(&rbd_dev->acquired_lock_work);
>>>           cancel_work_sync(&rbd_dev->released_lock_work);
>>>           cancel_delayed_work_sync(&rbd_dev->lock_dwork);
>>> @@ -3418,6 +3417,7 @@ static void rbd_unregister_watch(struct
>>> rbd_device *rbd_dev)
>>>           rbd_dev->watch_state = RBD_WATCH_STATE_UNREGISTERED;
>>>           mutex_unlock(&rbd_dev->watch_mutex);
>>>
>>> +       cancel_delayed_work_sync(&rbd_dev->watch_dwork);
>>>           ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>>>    }
>>>
>>> This way rbd_reregister_watch() should either bail out early because
>>> the watch is UNREGISTERED at that point or just get cancelled.
>>
>> Sounds good. what's your suggestion about the [2/2] about lock_dwork?
> I don't think it's needed.  In order for rbd_acquire_lock() to possibly
> requeue itself, it must be queued after it is cancelled.
>
> Looking at the places where ->lock_dwork is queued:
>
> - rbd_wait_state_locked(), i.e. new I/O request -- shoudn't happen at
>    that point
>
> - rbd_reacquire_lock(), from rbd_reregister_watch() if LOCKED -- again
>    shouldn't happen at that point because it is preceded by rbd_unlock()
>    in rbd_dev_image_unlock()

Hi Ilya,
     That's true. So get back to the fix for watch_dwork, maybe we
can move the all cancel_tasks_sync() after unregister, to put
the all taks canceling togather. Something like that:

         if (rbd_dev->watch_state == RBD_WATCH_STATE_REGISTERED)
@@ -3427,6 +3426,7 @@ static void rbd_unregister_watch(struct rbd_device 
*rbd_dev)
         rbd_dev->watch_state = RBD_WATCH_STATE_UNREGISTERED;
         mutex_unlock(&rbd_dev->watch_mutex);

+       cancel_tasks_sync(rbd_dev);
ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
  }

Thanx
Yang
>
> If you are seeing a similar crash on ->lock_dwork, the bug is somewhere
> else.
>
> Thanks,
>
>                  Ilya
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index c450016..d1a1e78 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3419,7 +3419,6 @@  static void cancel_tasks_sync(struct rbd_device 
*rbd_dev)
  static void rbd_unregister_watch(struct rbd_device *rbd_dev)
  {
         WARN_ON(waitqueue_active(&rbd_dev->lock_waitq));
-       cancel_tasks_sync(rbd_dev);

         mutex_lock(&rbd_dev->watch_mutex);