
[2/3] sbitmap: amortize cost of clearing bits

Message ID 20181130011234.32674-3-axboe@kernel.dk (mailing list archive)
State New, archived
Series [1/3] sbitmap: ensure that sbitmap maps are properly aligned

Commit Message

Jens Axboe Nov. 30, 2018, 1:12 a.m. UTC
sbitmap maintains a set of words that we use to set and clear bits, with
each bit representing a tag for blk-mq. Even though we spread the bits
out and maintain a hint cache, any given bit that is allocated will end
up being cleared in the exact same spot.

This introduces batched clearing of bits. Instead of clearing a given
bit, the corresponding bit is set in a cleared/free mask. If we fail to
allocate a bit from a given word, we check the free mask and batch move
those cleared bits at that time. This trades 64 atomic bitops for
two cmpxchg() calls.

In a threaded poll test case, this change removes half the overhead of
getting and clearing tags. In another poll test case with a single
thread, performance is unchanged.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/sbitmap.h | 26 +++++++++++++++---
 lib/sbitmap.c           | 60 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 78 insertions(+), 8 deletions(-)
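
The core of the scheme may be easier to see in isolation. The following is a
minimal userspace sketch of the idea described above (illustrative only; the
real implementation is in the patch at the bottom of this page). The names
used here (struct word, get_bit, batch_clear) are invented for the example,
and C11 atomics stand in for the kernel's cmpxchg():

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* One bitmap word plus its deferred-clear mask (illustrative layout). */
struct word {
        _Atomic unsigned long word;     /* 1 = bit is allocated */
        _Atomic unsigned long cleared;  /* bits freed but not yet folded back */
};

/* Freeing a bit only flags it in ->cleared; ->word is left untouched. */
static void deferred_clear_bit(struct word *w, unsigned int bit)
{
        atomic_fetch_or(&w->cleared, 1UL << bit);
}

/* Fold ->cleared back into ->word: the "two cmpxchg()" from the changelog. */
static bool batch_clear(struct word *w)
{
        unsigned long mask, val;

        /* First cmpxchg loop: atomically grab and zero the cleared mask. */
        mask = atomic_load(&w->cleared);
        do {
                if (!mask)
                        return false;
        } while (!atomic_compare_exchange_weak(&w->cleared, &mask, 0));

        /* Second cmpxchg loop: drop those bits from the allocation word. */
        val = atomic_load(&w->word);
        while (!atomic_compare_exchange_weak(&w->word, &val, val & ~mask))
                ;
        return true;
}

/* Allocate a free bit; if the word looks full, try reclaiming deferred frees. */
static int get_bit(struct word *w, unsigned int depth)
{
        do {
                unsigned long val = atomic_load(&w->word);
                unsigned int bit;

                for (bit = 0; bit < depth; bit++) {
                        if (val & (1UL << bit))
                                continue;
                        if (!(atomic_fetch_or(&w->word, 1UL << bit) & (1UL << bit)))
                                return bit;     /* we won this bit */
                }
        } while (batch_clear(w));       /* word looked full: fold frees, retry */

        return -1;
}

int main(void)
{
        struct word w = { 0, 0 };
        int a = get_bit(&w, 2);
        int b = get_bit(&w, 2);

        deferred_clear_bit(&w, a);      /* free 'a' without touching ->word */
        printf("allocated %d and %d; after a deferred free, got %d\n",
               a, b, get_bit(&w, 2));
        return 0;
}

The real code additionally wakes any tag waiters after folding the bits back,
to cover the race where a waiter checked ->cleared just before the bits were
moved; see the comment in sbitmap_deferred_clear() in the patch below.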

Comments

Guenter Roeck Dec. 9, 2018, 5:51 a.m. UTC | #1
Hi,

On Thu, Nov 29, 2018 at 06:12:33PM -0700, Jens Axboe wrote:
> sbitmap maintains a set of words that we use to set and clear bits, with
> each bit representing a tag for blk-mq. Even though we spread the bits
> out and maintain a hint cache, any given bit that is allocated will end
> up being cleared in the exact same spot.
> 
> This introduces batched clearing of bits. Instead of clearing a given
> bit, the corresponding bit is set in a cleared/free mask. If we fail to
> allocate a bit from a given word, we check the free mask and batch move
> those cleared bits at that time. This trades 64 atomic bitops for
> two cmpxchg() calls.
> 
> In a threaded poll test case, this change removes half the overhead of
> getting and clearing tags. In another poll test case with a single
> thread, performance is unchanged.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

This patch results in irq lock inversion warnings when trying to boot
from USB drives. This was observed with qemu boots of aarch64, x86, and
x86_64 images.

Log output as well as bisect results are attached below. I ran the bisect
twice, once with aarch64 and once with x86_64, with the same results.

Unfortunately, reverting the patch results in a compile error, so I was
unable to verify whether reverting it fixes the problem.

Sample qemu command line (x86_64):

qemu-system-x86_64 \
	-kernel arch/x86/boot/bzImage -M q35 -cpu SandyBridge \
	-no-reboot -snapshot -smp 4 -m 1G \
	-usb -device usb-storage,drive=d0 \
	-drive file=rootfs.ext2,if=none,id=d0,format=raw \
	--append 'root=/dev/sda rw rootwait panic=-1 console=ttyS0 console=tty' \
	-nographic

Guenter

---
...
[   19.035262] Waiting for root device /dev/sda...
[   19.139135] scsi 6:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
[   19.161632] sd 6:0:0:0: Attached scsi generic sg1 type 0
[   19.171661] sd 6:0:0:0: Power-on or device reset occurred
[   19.186664] sd 6:0:0:0: [sda] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB)
[   19.197035] sd 6:0:0:0: [sda] Write Protect is off
[   19.209327] sd 6:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   19.330020] sd 6:0:0:0: [sda] Attached SCSI disk
[   19.363210] EXT4-fs (sda): mounting ext2 file system using the ext4 subsystem
[   19.392150] 
[   19.394410] ========================================================
[   19.397051] WARNING: possible irq lock inversion dependency detected
[   19.399744] 4.20.0-rc5-next-20181207 #1 Not tainted
[   19.402294] --------------------------------------------------------
[   19.404981] swapper/3/0 just changed the state of lock:
[   19.407185] (____ptrval____) (&sbq->ws[i].wait){..-.}, at: __wake_up_common_lock+0x5e/0xb0
[   19.409500] but this lock took another, SOFTIRQ-unsafe lock in the past:
[   19.411650]  (&(&sb->map[i].swap_lock)->rlock){+.+.}
[   19.411663] 
[   19.411663] 
[   19.411663] and interrupts could create inverse lock ordering between them.
[   19.411663] 
[   19.421461] 
[   19.421461] other info that might help us debug this:
[   19.425187] Chain exists of:
[   19.425187]   &sbq->ws[i].wait --> &(&hctx->dispatch_wait_lock)->rlock --> &(&sb->map[i].swap_lock)->rlock
[   19.425187] 
[   19.432974]  Possible interrupt unsafe locking scenario:
[   19.432974] 
[   19.436954]        CPU0                    CPU1
[   19.439002]        ----                    ----
[   19.440897]   lock(&(&sb->map[i].swap_lock)->rlock);
[   19.442828]                                local_irq_disable();
[   19.444767]                                lock(&sbq->ws[i].wait);
[   19.446767]                                lock(&(&hctx->dispatch_wait_lock)->rlock);
[   19.448716]   <Interrupt>
[   19.450574]     lock(&sbq->ws[i].wait);
[   19.452330] 
[   19.452330]  *** DEADLOCK ***
[   19.452330] 
[   19.456951] no locks held by swapper/3/0.
[   19.458648] 
[   19.458648] the shortest dependencies between 2nd lock and 1st lock:
[   19.461928]   -> (&(&sb->map[i].swap_lock)->rlock){+.+.} ops: 38 {
[   19.463648]      HARDIRQ-ON-W at:
[   19.465233]                         _raw_spin_lock+0x27/0x40
[   19.466837]                         sbitmap_get+0x8b/0x160
[   19.468490]                         __sbitmap_queue_get+0x24/0x90
[   19.469988]                         blk_mq_get_tag+0x28a/0x2c0
[   19.471455]                         blk_mq_get_driver_tag+0x7a/0x100
[   19.473051]                         blk_mq_dispatch_rq_list+0x95/0x550
[   19.474653]                         blk_mq_do_dispatch_sched+0x71/0x110
[   19.476137]                         blk_mq_sched_dispatch_requests+0x114/0x160
[   19.477712]                         __blk_mq_run_hw_queue+0x67/0xf0
[   19.479282]                         __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.480830]                         blk_mq_run_hw_queue+0x83/0xe0
[   19.482302]                         blk_mq_sched_insert_request+0x14a/0x1c0
[   19.483868]                         blk_execute_rq+0x64/0x90
[   19.485396]                         __scsi_execute+0x104/0x250
[   19.486779]                         sd_revalidate_disk+0xe7/0x1cf0
[   19.488184]                         sd_probe_async+0xae/0x194
[   19.489598]                         async_run_entry_fn+0x34/0x100
[   19.491025]                         process_one_work+0x237/0x5d0
[   19.493611]                         worker_thread+0x37/0x380
[   19.496588]                         kthread+0x118/0x130
[   19.498871]                         ret_from_fork+0x3a/0x50
[   19.501009]      SOFTIRQ-ON-W at:
[   19.502991]                         _raw_spin_lock+0x27/0x40
[   19.505302]                         sbitmap_get+0x8b/0x160
[   19.507375]                         __sbitmap_queue_get+0x24/0x90
[   19.509440]                         blk_mq_get_tag+0x28a/0x2c0
[   19.510859]                         blk_mq_get_driver_tag+0x7a/0x100
[   19.512334]                         blk_mq_dispatch_rq_list+0x95/0x550
[   19.513960]                         blk_mq_do_dispatch_sched+0x71/0x110
[   19.515388]                         blk_mq_sched_dispatch_requests+0x114/0x160
[   19.516909]                         __blk_mq_run_hw_queue+0x67/0xf0
[   19.518433]                         __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.519831]                         blk_mq_run_hw_queue+0x83/0xe0
[   19.521232]                         blk_mq_sched_insert_request+0x14a/0x1c0
[   19.522638]                         blk_execute_rq+0x64/0x90
[   19.524015]                         __scsi_execute+0x104/0x250
[   19.525393]                         sd_revalidate_disk+0xe7/0x1cf0
[   19.526795]                         sd_probe_async+0xae/0x194
[   19.528213]                         async_run_entry_fn+0x34/0x100
[   19.529810]                         process_one_work+0x237/0x5d0
[   19.531178]                         worker_thread+0x37/0x380
[   19.532548]                         kthread+0x118/0x130
[   19.533892]                         ret_from_fork+0x3a/0x50
[   19.535232]      INITIAL USE at:
[   19.536504]                        _raw_spin_lock+0x27/0x40
[   19.537939]                        sbitmap_get+0x8b/0x160
[   19.539411]                        __sbitmap_queue_get+0x24/0x90
[   19.540782]                        blk_mq_get_tag+0x28a/0x2c0
[   19.542134]                        blk_mq_get_driver_tag+0x7a/0x100
[   19.543495]                        blk_mq_dispatch_rq_list+0x95/0x550
[   19.544925]                        blk_mq_do_dispatch_sched+0x71/0x110
[   19.546312]                        blk_mq_sched_dispatch_requests+0x114/0x160
[   19.547712]                        __blk_mq_run_hw_queue+0x67/0xf0
[   19.549097]                        __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.550692]                        blk_mq_run_hw_queue+0x83/0xe0
[   19.552086]                        blk_mq_sched_insert_request+0x14a/0x1c0
[   19.553494]                        blk_execute_rq+0x64/0x90
[   19.554862]                        __scsi_execute+0x104/0x250
[   19.556233]                        sd_revalidate_disk+0xe7/0x1cf0
[   19.557621]                        sd_probe_async+0xae/0x194
[   19.558993]                        async_run_entry_fn+0x34/0x100
[   19.560383]                        process_one_work+0x237/0x5d0
[   19.561751]                        worker_thread+0x37/0x380
[   19.563267]                        kthread+0x118/0x130
[   19.564715]                        ret_from_fork+0x3a/0x50
[   19.566050]    }
[   19.567296]    ... key      at: [<ffffffffa2f27468>] __key.23877+0x0/0x8
[   19.568691]    ... acquired at:
[   19.569986]    sbitmap_get+0x8b/0x160
[   19.571279]    __sbitmap_queue_get+0x24/0x90
[   19.572619]    blk_mq_get_tag+0x28a/0x2c0
[   19.573957]    blk_mq_get_driver_tag+0xda/0x100
[   19.575308]    blk_mq_dispatch_rq_list+0x2f6/0x550
[   19.576643]    blk_mq_do_dispatch_sched+0x71/0x110
[   19.577968]    blk_mq_sched_dispatch_requests+0x114/0x160
[   19.579328]    __blk_mq_run_hw_queue+0x67/0xf0
[   19.580671]    __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.582047]    blk_mq_run_hw_queue+0x83/0xe0
[   19.583517]    blk_mq_sched_insert_request+0x14a/0x1c0
[   19.584932]    blk_execute_rq+0x64/0x90
[   19.586256]    __scsi_execute+0x104/0x250
[   19.587617]    scsi_probe_and_add_lun+0x1b5/0xc00
[   19.589020]    __scsi_scan_target+0x1f9/0x530
[   19.590367]    scsi_scan_channel.part.10+0x51/0x70
[   19.591742]    scsi_scan_host_selected+0xc9/0x140
[   19.593200]    scsi_scan_host+0x18f/0x1d0
[   19.594621]    usb_stor_scan_dwork+0x1a/0x80
[   19.595943]    process_one_work+0x237/0x5d0
[   19.597271]    worker_thread+0x37/0x380
[   19.598566]    kthread+0x118/0x130
[   19.599845]    ret_from_fork+0x3a/0x50
[   19.601128] 
[   19.602384]  -> (&(&hctx->dispatch_wait_lock)->rlock){....} ops: 1 {
[   19.603964]     INITIAL USE at:
[   19.605271]                      _raw_spin_lock+0x27/0x40
[   19.606670]                      blk_mq_dispatch_rq_list+0x270/0x550
[   19.608277]                      blk_mq_do_dispatch_sched+0x71/0x110
[   19.609808]                      blk_mq_sched_dispatch_requests+0x114/0x160
[   19.611239]                      __blk_mq_run_hw_queue+0x67/0xf0
[   19.612650]                      __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.614080]                      blk_mq_run_hw_queue+0x83/0xe0
[   19.615497]                      blk_mq_sched_insert_request+0x14a/0x1c0
[   19.616949]                      blk_execute_rq+0x64/0x90
[   19.618374]                      __scsi_execute+0x104/0x250
[   19.619787]                      scsi_probe_and_add_lun+0x1b5/0xc00
[   19.621226]                      __scsi_scan_target+0x1f9/0x530
[   19.622666]                      scsi_scan_channel.part.10+0x51/0x70
[   19.624117]                      scsi_scan_host_selected+0xc9/0x140
[   19.625557]                      scsi_scan_host+0x18f/0x1d0
[   19.626992]                      usb_stor_scan_dwork+0x1a/0x80
[   19.628441]                      process_one_work+0x237/0x5d0
[   19.629832]                      worker_thread+0x37/0x380
[   19.631235]                      kthread+0x118/0x130
[   19.632724]                      ret_from_fork+0x3a/0x50
[   19.634252]   }
[   19.635499]   ... key      at: [<ffffffffa2f256a0>] __key.49100+0x0/0x8
[   19.636925]   ... acquired at:
[   19.638255]    blk_mq_dispatch_rq_list+0x270/0x550
[   19.639655]    blk_mq_do_dispatch_sched+0x71/0x110
[   19.641076]    blk_mq_sched_dispatch_requests+0x114/0x160
[   19.642517]    __blk_mq_run_hw_queue+0x67/0xf0
[   19.643961]    __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.645424]    blk_mq_run_hw_queue+0x83/0xe0
[   19.646831]    blk_mq_sched_insert_request+0x14a/0x1c0
[   19.648265]    blk_execute_rq+0x64/0x90
[   19.649634]    __scsi_execute+0x104/0x250
[   19.650983]    scsi_probe_and_add_lun+0x1b5/0xc00
[   19.652360]    __scsi_scan_target+0x1f9/0x530
[   19.653808]    scsi_scan_channel.part.10+0x51/0x70
[   19.655364]    scsi_scan_host_selected+0xc9/0x140
[   19.656751]    scsi_scan_host+0x18f/0x1d0
[   19.658086]    usb_stor_scan_dwork+0x1a/0x80
[   19.659416]    process_one_work+0x237/0x5d0
[   19.660750]    worker_thread+0x37/0x380
[   19.662057]    kthread+0x118/0x130
[   19.663353]    ret_from_fork+0x3a/0x50
[   19.664649] 
[   19.665860] -> (&sbq->ws[i].wait){..-.} ops: 3 {
[   19.667186]    IN-SOFTIRQ-W at:
[   19.668474]                     _raw_spin_lock_irqsave+0x30/0x40
[   19.669878]                     __wake_up_common_lock+0x5e/0xb0
[   19.671278]                     __sbq_wake_up+0x8d/0xa0
[   19.672650]                     sbitmap_queue_clear+0x3b/0x60
[   19.674036]                     __blk_mq_free_request+0x64/0x90
[   19.675438]                     scsi_end_request+0x109/0x350
[   19.676911]                     scsi_io_completion+0x72/0x600
[   19.678477]                     blk_done_softirq+0x9a/0xd0
[   19.679847]                     __do_softirq+0xc8/0x455
[   19.681212]                     irq_exit+0xcc/0xe0
[   19.682543]                     call_function_single_interrupt+0xf/0x20
[   19.683941]                     default_idle+0x19/0x160
[   19.685311]                     do_idle+0x1d3/0x250
[   19.686643]                     cpu_startup_entry+0x14/0x20
[   19.688000]                     start_secondary+0x18f/0x1c0
[   19.689443]                     secondary_startup_64+0xa4/0xb0
[   19.690983]    INITIAL USE at:
[   19.693258]                    _raw_spin_lock_irq+0x2d/0x40
[   19.696185]                    blk_mq_dispatch_rq_list+0x25c/0x550
[   19.699217]                    blk_mq_do_dispatch_sched+0x71/0x110
[   19.702076]                    blk_mq_sched_dispatch_requests+0x114/0x160
[   19.704737]                    __blk_mq_run_hw_queue+0x67/0xf0
[   19.707182]                    __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.709412]                    blk_mq_run_hw_queue+0x83/0xe0
[   19.710880]                    blk_mq_sched_insert_request+0x14a/0x1c0
[   19.712307]                    blk_execute_rq+0x64/0x90
[   19.713697]                    __scsi_execute+0x104/0x250
[   19.715083]                    scsi_probe_and_add_lun+0x1b5/0xc00
[   19.716574]                    __scsi_scan_target+0x1f9/0x530
[   19.718206]                    scsi_scan_channel.part.10+0x51/0x70
[   19.719641]                    scsi_scan_host_selected+0xc9/0x140
[   19.721184]                    scsi_scan_host+0x18f/0x1d0
[   19.722697]                    usb_stor_scan_dwork+0x1a/0x80
[   19.724077]                    process_one_work+0x237/0x5d0
[   19.725460]                    worker_thread+0x37/0x380
[   19.726820]                    kthread+0x118/0x130
[   19.728167]                    ret_from_fork+0x3a/0x50
[   19.729508]  }
[   19.730752]  ... key      at: [<ffffffffa2f27460>] __key.24146+0x0/0x8
[   19.732169]  ... acquired at:
[   19.733509]    __lock_acquire+0x8c6/0x1880
[   19.734865]    lock_acquire+0xaa/0x180
[   19.736210]    _raw_spin_lock_irqsave+0x30/0x40
[   19.737564]    __wake_up_common_lock+0x5e/0xb0
[   19.738927]    __sbq_wake_up+0x8d/0xa0
[   19.740276]    sbitmap_queue_clear+0x3b/0x60
[   19.741635]    __blk_mq_free_request+0x64/0x90
[   19.743008]    scsi_end_request+0x109/0x350
[   19.744361]    scsi_io_completion+0x72/0x600
[   19.745690]    blk_done_softirq+0x9a/0xd0
[   19.747017]    __do_softirq+0xc8/0x455
[   19.748342]    irq_exit+0xcc/0xe0
[   19.749637]    call_function_single_interrupt+0xf/0x20
[   19.750979]    default_idle+0x19/0x160
[   19.752318]    do_idle+0x1d3/0x250
[   19.753758]    cpu_startup_entry+0x14/0x20
[   19.755129]    start_secondary+0x18f/0x1c0
[   19.756470]    secondary_startup_64+0xa4/0xb0
[   19.757954] 
[   19.759225] 
[   19.759225] stack backtrace:
[   19.761899] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.20.0-rc5-next-20181207 #1
[   19.763359] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[   19.765024] Call Trace:
[   19.766589]  <IRQ>
[   19.768074]  dump_stack+0x67/0x90
[   19.769580]  check_usage_forwards.cold.54+0x20/0x29
[   19.771137]  mark_lock+0x2ca/0x620
[   19.772705]  ? check_usage_backwards+0x110/0x110
[   19.774300]  __lock_acquire+0x8c6/0x1880
[   19.775676]  ? select_task_rq_fair+0x1ca/0x1110
[   19.777062]  ? __lock_acquire+0x38a/0x1880
[   19.778424]  ? __lock_acquire+0x38a/0x1880
[   19.779751]  lock_acquire+0xaa/0x180
[   19.781070]  ? __wake_up_common_lock+0x5e/0xb0
[   19.782422]  _raw_spin_lock_irqsave+0x30/0x40
[   19.783766]  ? __wake_up_common_lock+0x5e/0xb0
[   19.785129]  __wake_up_common_lock+0x5e/0xb0
[   19.786483]  __sbq_wake_up+0x8d/0xa0
[   19.787809]  sbitmap_queue_clear+0x3b/0x60
[   19.789143]  __blk_mq_free_request+0x64/0x90
[   19.790631]  scsi_end_request+0x109/0x350
[   19.792150]  scsi_io_completion+0x72/0x600
[   19.793671]  blk_done_softirq+0x9a/0xd0
[   19.795121]  __do_softirq+0xc8/0x455
[   19.796441]  irq_exit+0xcc/0xe0
[   19.797730]  call_function_single_interrupt+0xf/0x20
[   19.799153]  </IRQ>
[   19.800515] RIP: 0010:default_idle+0x19/0x160
[   19.801932] Code: ff 48 89 ea 48 89 df 31 f6 5b 5d e9 31 73 91 ff 90 41 55 41 54 55 53 65 8b 2d a3 dd e4 5e 66 66 66 66 90 e8 b9 1f 59 ff fb f4 <65> 8b 2d 90 dd e4 5e 66 66 66 66 90 5b 5d 41 5c 41 5d c3 65 8b 05
[   19.806839] RSP: 0000:ffffb6f74020bec0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff04
[   19.808531] RAX: ffff97563f2bc080 RBX: 0000000000000003 RCX: 0000000000000000
[   19.810312] RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff97563f2bc080
[   19.811925] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000000
[   19.813555] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   19.815169] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   19.817002]  do_idle+0x1d3/0x250
[   19.818604]  cpu_startup_entry+0x14/0x20
[   19.820125]  start_secondary+0x18f/0x1c0
[   19.821635]  secondary_startup_64+0xa4/0xb0
[   19.856824] EXT4-fs (sda): mounted filesystem without journal. Opts: (null)
[   19.860450] VFS: Mounted root (ext2 filesystem) on device 8:0.
[   19.870456] devtmpfs: mounted
[   19.975778] Freeing unused kernel image memory: 1348K
[   19.987597] Write protecting the kernel read-only data: 22528k
[   19.999208] Freeing unused kernel image memory: 2008K
[   20.007277] Freeing unused kernel image memory: 1824K
[   20.010030] rodata_test: all tests were successful
[   20.012948] Run /sbin/init as init process
[   20.640528] EXT4-fs (sda): re-mounted. Opts: block_validity,barrier,user_xattr,acl
[   21.570798] random: dd: uninitialized urandom read (512 bytes read)
[   21.621870] dd (1296) used greatest stack depth: 12728 bytes left

---
# bad: [74c4a24df7cac1f9213a811d79558ecde23be9a2] Add linux-next specific files for 20181207
# good: [2595646791c319cadfdbf271563aac97d0843dc7] Linux 4.20-rc5
git bisect start 'HEAD' 'v4.20-rc5'
# good: [8d285caa38663cc22f7166f7cfb9ceb9d06d3c7e] Merge remote-tracking branch 'ipsec-next/master'
git bisect good 8d285caa38663cc22f7166f7cfb9ceb9d06d3c7e
# bad: [a8147fd1777cecc0a0cc57fcd742a2b06660bd12] Merge remote-tracking branch 'backlight/for-backlight-next'
git bisect bad a8147fd1777cecc0a0cc57fcd742a2b06660bd12
# good: [467e8a516dcf922d1ea343cebb0e751f81f0dca3] Merge tag 'drm-intel-next-2018-12-04' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
git bisect good 467e8a516dcf922d1ea343cebb0e751f81f0dca3
# good: [25f68bda094fd6a808d1284abc23b194b0668a47] Merge remote-tracking branch 'sound/for-next'
git bisect good 25f68bda094fd6a808d1284abc23b194b0668a47
# bad: [81b6aa39f339c7ea9fe318580c168e55c5a7f593] Merge branch 'for-4.21/block' into for-next
git bisect bad 81b6aa39f339c7ea9fe318580c168e55c5a7f593
# good: [b6676f653f13f83582985bc713525a48d735b2a3] block: remove a few unused exports
git bisect good b6676f653f13f83582985bc713525a48d735b2a3
# good: [80ff2040ac3dd6d9d68d59159b0a6dffd8adc5ff] ataflop: implement mq_ops->commit_rqs() hook
git bisect good 80ff2040ac3dd6d9d68d59159b0a6dffd8adc5ff
# bad: [79eb7650f7f33f4e63a2380f00fc2fac651da7c1] Merge branch 'for-4.21/block' into for-next
git bisect bad 79eb7650f7f33f4e63a2380f00fc2fac651da7c1
# bad: [6322307809649cba6f545640563f95d686ecf404] nvme-pci: cleanup SQ allocation a bit
git bisect bad 6322307809649cba6f545640563f95d686ecf404
# bad: [2149da0748fc236b9916c53e26b3b0c9ab20a5dd] block: add cmd_flags to print_req_error
git bisect bad 2149da0748fc236b9916c53e26b3b0c9ab20a5dd
# good: [27fae429acee1e9418059e7fa545438075af5256] sbitmap: don't loop for find_next_zero_bit() for !round_robin
git bisect good 27fae429acee1e9418059e7fa545438075af5256
# bad: [ea86ea2cdced20057da4d2c32965c1219c238197] sbitmap: ammortize cost of clearing bits
git bisect bad ea86ea2cdced20057da4d2c32965c1219c238197
# good: [531724abc3bfb556c1dd68086cf9cb51f76464e3] block: avoid extra bio reference for async O_DIRECT
git bisect good 531724abc3bfb556c1dd68086cf9cb51f76464e3
# first bad commit: [ea86ea2cdced20057da4d2c32965c1219c238197] sbitmap: ammortize cost of clearing bits
Jens Axboe Dec. 9, 2018, 6:19 a.m. UTC | #2
On Dec 8, 2018, at 10:51 PM, Guenter Roeck <linux@roeck-us.net> wrote:
> 
> Hi,
> 
>> On Thu, Nov 29, 2018 at 06:12:33PM -0700, Jens Axboe wrote:
>> sbitmap maintains a set of words that we use to set and clear bits, with
>> each bit representing a tag for blk-mq. Even though we spread the bits
>> out and maintain a hint cache, any given bit that is allocated will end
>> up being cleared in the exact same spot.
>> 
>> This introduces batched clearing of bits. Instead of clearing a given
>> bit, the corresponding bit is set in a cleared/free mask. If we fail to
>> allocate a bit from a given word, we check the free mask and batch move
>> those cleared bits at that time. This trades 64 atomic bitops for
>> two cmpxchg() calls.
>> 
>> In a threaded poll test case, this change removes half the overhead of
>> getting and clearing tags. In another poll test case with a single
>> thread, performance is unchanged.
>> 
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> This patch results in irq lock inversion warnings when trying to boot
> from usb drives. This was observed with qemu boots of aarch64, x86, and
> x86_64 images.

This one is a false positive; it was already reported last week. We just need to add some lockdep annotation for it.
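
The inversion being reported is that the per-word swap_lock seen in the trace
(the linux-next version of this patch takes it in sbitmap_get() with hardirqs
and softirqs enabled) can end up nested under the sbitmap_queue waitqueue lock
via dispatch_wait_lock, while that waitqueue lock is also taken from softirq
completion context. Jens's note refers to a lockdep-only annotation; the
sketch below instead shows the generic irq-safe locking pattern that avoids
this class of inversion, purely as an illustration (hypothetical helper, not
the change that was merged):

#include <linux/spinlock.h>

/*
 * Hypothetical illustration only, not code from the patch: a lock that is
 * always taken with spin_lock_irqsave() is never held with interrupts
 * enabled on that CPU, so it cannot take part in the hardirq/softirq
 * inversion described in the report above.
 */
static void fold_cleared_bits_irqsafe(spinlock_t *lock)
{
        unsigned long flags;

        spin_lock_irqsave(lock, flags);
        /* ... move bits from ->cleared into ->word here ... */
        spin_unlock_irqrestore(lock, flags);
}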

Patch

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 5cb1755d32da..13eb8973bd10 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -30,14 +30,19 @@  struct seq_file;
  */
 struct sbitmap_word {
 	/**
-	 * @word: The bitmap word itself.
+	 * @depth: Number of bits being used in @word/@cleared
 	 */
-	unsigned long word;
+	unsigned long depth;
 
 	/**
-	 * @depth: Number of bits being used in @word.
+	 * @word: word holding free bits
 	 */
-	unsigned long depth;
+	unsigned long word ____cacheline_aligned_in_smp;
+
+	/**
+	 * @cleared: word holding cleared bits
+	 */
+	unsigned long cleared ____cacheline_aligned_in_smp;
 } ____cacheline_aligned_in_smp;
 
 /**
@@ -315,6 +320,19 @@  static inline void sbitmap_clear_bit(struct sbitmap *sb, unsigned int bitnr)
 	clear_bit(SB_NR_TO_BIT(sb, bitnr), __sbitmap_word(sb, bitnr));
 }
 
+/*
+ * This one is special, since it doesn't actually clear the bit, rather it
+ * sets the corresponding bit in the ->cleared mask instead. Paired with
+ * the caller doing sbitmap_batch_clear() if a given index is full, which
+ * will clear the previously freed entries in the corresponding ->word.
+ */
+static inline void sbitmap_deferred_clear_bit(struct sbitmap *sb, unsigned int bitnr)
+{
+	unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)].cleared;
+
+	set_bit(SB_NR_TO_BIT(sb, bitnr), addr);
+}
+
 static inline void sbitmap_clear_bit_unlock(struct sbitmap *sb,
 					    unsigned int bitnr)
 {
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 21e776e3128d..04db31f4dfda 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -114,6 +114,58 @@  static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
 	return nr;
 }
 
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+	unsigned long mask, val;
+
+	if (!sb->map[index].cleared)
+		return false;
+
+	/*
+	 * First get a stable cleared mask, setting the old mask to 0.
+	 */
+	do {
+		mask = sb->map[index].cleared;
+	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+	/*
+	 * Now clear the masked bits in our free word
+	 */
+	do {
+		val = sb->map[index].word;
+	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+	/*
+	 * If someone found ->cleared == 0 before we wrote ->word, then
+	 * they could have failed when they should not have. Check for
+	 * waiters.
+	 */
+	smp_mb__after_atomic();
+	sbitmap_queue_wake_up(container_of(sb, struct sbitmap_queue, sb));
+	return true;
+}
+
+static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
+				     unsigned int alloc_hint, bool round_robin)
+{
+	int nr;
+
+	do {
+		nr = __sbitmap_get_word(&sb->map[index].word,
+					sb->map[index].depth, alloc_hint,
+					!round_robin);
+		if (nr != -1)
+			break;
+		if (!sbitmap_deferred_clear(sb, index))
+			break;
+	} while (1);
+
+	return nr;
+}
+
 int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 {
 	unsigned int i, index;
@@ -132,9 +184,8 @@  int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 		alloc_hint = 0;
 
 	for (i = 0; i < sb->map_nr; i++) {
-		nr = __sbitmap_get_word(&sb->map[index].word,
-					sb->map[index].depth, alloc_hint,
-					!round_robin);
+		nr = sbitmap_find_bit_in_index(sb, index, alloc_hint,
+						round_robin);
 		if (nr != -1) {
 			nr += index << sb->shift;
 			break;
@@ -517,7 +568,8 @@  EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
 void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
 			 unsigned int cpu)
 {
-	sbitmap_clear_bit_unlock(&sbq->sb, nr);
+	sbitmap_deferred_clear_bit(&sbq->sb, nr);
+
 	/*
 	 * Pairs with the memory barrier in set_current_state() to ensure the
 	 * proper ordering of clear_bit_unlock()/waitqueue_active() in the waker