
md bitmap writes random memory over disks' bitmap sectors

Message ID 6083cbff-0896-46af-bd76-1e0f173538b7@redhat.com (mailing list archive)
State Superseded
Series md bitmap writes random memory over disks' bitmap sectors

Checks

Context                     Check  Description
mdraidci/vmtest-md-6_14-PR  fail   merge-conflict

Commit Message

Nigel Croxon Feb. 25, 2025, 3:32 p.m. UTC
On an aarch64 system with an md device using a bitmap, the system would 
crash when using nvme-tcp, logging:


usercopy: Kernel memory exposure attempt detected from SLUB object 'kmalloc-512' (offset 440, size 24576)!

crash> bt
PID: 28       TASK: ffff1c9882038700  CPU: 2    COMMAND: "kworker/2:0H"
  #0 [ffff800009d4f4d0] machine_kexec at ffffdf7f6bb20b94
  #1 [ffff800009d4f530] __crash_kexec at ffffdf7f6bc597a4
  #2 [ffff800009d4f6c0] crash_kexec at ffffdf7f6bc598c0
  #3 [ffff800009d4f6f0] die at ffffdf7f6bb0ec60
  #4 [ffff800009d4f7a0] bug_handler at ffffdf7f6bb0f0dc
  #5 [ffff800009d4f7c0] bug_handler at ffffdf7f6bb0f13c
  #6 [ffff800009d4f7e0] call_break_hook at ffffdf7f6bb06838
  #7 [ffff800009d4f800] brk_handler at ffffdf7f6bb06d54
  #8 [ffff800009d4f820] do_debug_exception at ffffdf7f6bb23690
  #9 [ffff800009d4f870] el1_dbg at ffffdf7f6c5b60b0
#10 [ffff800009d4f8a0] el1h_64_sync_handler at ffffdf7f6c5b6328
#11 [ffff800009d4f9e0] el1h_64_sync at ffffdf7f6bb0233c
#12 [ffff800009d4fa00] usercopy_abort at ffffdf7f6be2a158
#13 [ffff800009d4fa40] __check_heap_object at ffffdf7f6bdf8b84
#14 [ffff800009d4fa70] __check_object_size at ffffdf7f6be2a034
#15 [ffff800009d4fab0] simple_copy_to_iter at ffffdf7f6c3a93ec
#16 [ffff800009d4fae0] __skb_datagram_iter at ffffdf7f6c3a9168
#17 [ffff800009d4fb60] skb_copy_datagram_iter at ffffdf7f6c3a9334
#18 [ffff800009d4fba0] tcp_recvmsg at ffffdf7f6c484b90
#19 [ffff800009d4fc90] inet_recvmsg at ffffdf7f6c4bb1a8
#20 [ffff800009d4fce0] sock_recvmsg at ffffdf7f6c393270
#21 [ffff800009d4fd10] nvmet_tcp_io_work at ffffdf7f16412248 [nvmet_tcp]
#22 [ffff800009d4fdc0] process_one_work at ffffdf7f6bb9a1c4
#23 [ffff800009d4fe10] worker_thread at ffffdf7f6bb9a584
#24 [ffff800009d4fe70] kthread at ffffdf7f6bba3fbc

The system tried to access the memory for page ffffffe726027980, which 
was a kmalloc-512 slab, and the area accessed was not contained within a 
single, valid slab object.

Tracking the origin of this page, the sk_buff fragment pointed to the 
page before it, but with an offset so large that the copy reached into 
the problem page ffffffe726027980, even with aarch64 using 64kB pages.

crash> struct sk_buff.head,end ffff1c9885e4e8c0 -x
   head = 0xffff1c99635a9000 
"\250\363\204\225\230\034\377\377\b\220Zc\231\034\377\377\b\220Zc\231\034\377\377\030\254\362\217\230\034\377\377",
   end = 0x2c0,

crash> struct skb_shared_info.nr_frags,frags 0xffff1c99635a92c0
   nr_frags = 1 '\001',
   frags = {{
       page = {
         p = 0xffffffe726027940
       },
       page_offset = 73656,
       size = 24576
     }, {
       page = {
         p = 0xffffffe72603f600
...

Tracking the page usage back, md had sent the leading page in a bio 
whose bio_vec size was equal to two pages.

crash> struct bio ffff1c9895707100
struct bio {
   bi_next = 0x0,
   bi_disk = 0xffff1c9899052000,
   bi_opf = 133121,
   bi_flags = 3072,
   bi_ioprio = 0,
   bi_write_hint = 0,
   bi_status = 0 '\000',
   bi_partno = 0 '\000',
   bi_phys_segments = 0,
   bi_seg_front_size = 0,
   bi_seg_back_size = 0,
   bi_iter = {
     bi_sector = 16,
     bi_size = 131072,
     bi_idx = 0,
     bi_bvec_done = 0,
     rh_bi_reserved = 0
   },
   __bi_remaining = {
     counter = 1
   },
   bi_end_io = 0xffffdf7f6c2e8710 <super_written>,
   bi_private = 0xffff1c98915cf600,
   bi_blkg = 0xffff1c989e235800,
   bi_issue = {
     value = 576469020341855351
   },
   {
     bi_integrity = 0x0
   },
   bi_vcnt = 1,
   bi_max_vecs = 4,
   __bi_cnt = {
     counter = 1
   },
   bi_io_vec = 0xffff1c98957071a0,
   bi_pool = 0xffff1c9898474650,
   {
     bi_iocost_cost = 0,
     rh_kabi_hidden_241 = {
       rh_reserved1 = 0
     },
     {<No data fields>}
   },
   rh_reserved2 = 0,
   rh_reserved3 = 0,
   bi_inline_vecs = 0xffff1c98957071a0
}

crash> bio_vec 0xffff1c98957071a0
struct bio_vec {
   bv_page = 0xffffffe726027940,
   bv_len = 131072,
   bv_offset = 0
}

This bio was writing superblock data for an md device, and the bio_vec 
used the device's io_opt value as its length instead of a valid page 
size:

crash> p ((struct request_queue *)0xffff1c9896ed34f8)->limits.io_opt
$1 = 131072

Mapping this bio back to the originating md device and its bitmap configuration:

crash> struct md_rdev.mddev 0xffff1c98915cf600
   mddev = 0xffff1c9898474000,

crash> struct mddev.bitmap 0xffff1c9898474000
   bitmap = 0xffff1c9969902400,

We find an md bitmap configured with a single page, with only part of 
the page (13064 bytes) used:

crash> struct bitmap.storage 0xffff1c9969902400
   storage = {
     file = 0x0,
     sb_page = 0xffffffe726027940,
     filemap = 0xffff1c9895bd6b00,
     filemap_attr = 0xffff1c9895bd3300,
     file_pages = 1,
     bytes = 13064
   },
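
To make the overrun explicit, here are the sizes involved, pulled from the 
dumps above (the C below is only an illustrative sketch of the arithmetic, 
not code from the affected tree):

        /* Sizes from the crash data above -- sketch for arithmetic only. */
        unsigned int bitmap_bytes = 13064;   /* bitmap->storage.bytes  */
        unsigned int page_size    = 65536;   /* aarch64 PAGE_SIZE      */
        unsigned int io_opt       = 131072;  /* queue limits.io_opt    */

        /* io_opt covers two 64kB pages, so a bio_vec with bv_len == io_opt
         * built from the single sb_page runs a full page past the end of
         * sb_page into unrelated kmalloc memory. */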

This looks to be a RHEL8 variation of bugs fixed in RHEL9 with backports 
like RHEL-75639. However, the patch used to fix this issue in RHEL9 
appears flawed: it retains part of the same problem and only works 
sanely when the bitmap is contained in a single page.

         sector_t ps = pg_index * PAGE_SIZE / SECTOR_SIZE;
         unsigned int size = PAGE_SIZE;
@@ -269,11 +271,9 @@ static int __write_sb_page(struct md_rdev *rdev, struct bitmap *bitmap,
                 if (size == 0)
                         /* bitmap runs in to data */
                         return -EINVAL;
-       } else {
-               /* DATA METADATA BITMAP - no problems */
         }

-       md_super_write(mddev, rdev, sboff + ps, (int) size, page);
+       md_super_write(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), page);
         return 0;

This patch will still attempt to send writes greater than a page using 
only a single page pointer for multi-page bitmaps. When the bitmap 
cannot fit in a single page, it uses an array of pages (the filemap), 
and those pages are allocated separately and are not guaranteed to be 
contiguous. So this patch keeps writes in a multi-page bitmap from 
trashing data beyond the bitmap, but it can still create writes that 
corrupt other parts of the bitmap with random memory.

The io_opt-sizing logic in this function is fundamentally flawed, as 
__write_sb_page() should never send a write larger than a page at a 
time. If it wanted to send multi-page I/Os, it would need a new 
interface that can build a multi-page bio, rather than md_super_write().
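
Purely to illustrate that single-page constraint, here is a minimal sketch 
of a per-page clamp at the md_super_write() call site (a sketch under 
assumptions, not the patch proposed in this series; the names follow the 
hunk quoted above, and min3() is assumed from include/linux/minmax.h):

        /* Never let a write built from a single page pointer run past
         * that page, on top of the bitmap_limit clamp shown above. */
        unsigned int len = min3(size, bitmap_limit, (unsigned int)PAGE_SIZE);

        md_super_write(mddev, rdev, sboff + ps, (int)len, page);
        return 0;

Anything larger than a page would instead need an interface that adds each 
filemap page to the bio separately.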

Comments

Yu Kuai Feb. 28, 2025, 2:06 a.m. UTC | #1
Hi,

在 2025/02/25 23:32, Nigel Croxon 写道:
> -       md_super_write(mddev, rdev, sboff + ps, (int) size, page);
> +       md_super_write(mddev, rdev, sboff + ps, (int)min(size, 
> bitmap_limit), page);
>          return 0;
> 
> This patch still will attempt to send writes greater than a page using 
> only a single page pointer for multi-page bitmaps. The bitmap uses an 
> array (the filemap) of pages when the bitmap cannot fit in a single 
> page. These pages are allocated separately and not guaranteed to be 
> contiguous. So this patch will keep writes in a multi-page bitmap from 
> trashing data beyond the bitmap, but can create writes which corrupt 
> other parts of the bitmap with random memory.

Is this problem introduced by:

8745faa95611 ("md: Use optimal I/O size for last bitmap page")

> 
> The opt using logic in this function is fundamentally flawed as 
> __write_sb_page should never send a write bigger than a page at a time. 
> It would need to use a new interface which can build multi-page bio and 
> not md_super_write() if it wanted to send multi-page I/Os.

I agree. And I don't understand that patch yet, it said:

If the bitmap space has enough room, size the I/O for the last bitmap
page write to the optimal I/O size for the storage device.

Does this mean, for example, that if the bitmap space is 128k while
there is only one page, 124k of it is not used? In that case, if the
device opt io size is 128k, this patch will choose to issue a 128k IO
instead of just a 4k IO? And how can this improve performance ...

Thanks,
Kuai
Xiao Ni Feb. 28, 2025, 7:46 a.m. UTC | #2
On Fri, Feb 28, 2025 at 10:07 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2025/02/25 23:32, Nigel Croxon 写道:
> > -       md_super_write(mddev, rdev, sboff + ps, (int) size, page);
> > +       md_super_write(mddev, rdev, sboff + ps, (int)min(size,
> > bitmap_limit), page);
> >          return 0;
> >
> > This patch still will attempt to send writes greater than a page using
> > only a single page pointer for multi-page bitmaps. The bitmap uses an
> > array (the filemap) of pages when the bitmap cannot fit in a single
> > page. These pages are allocated separately and not guaranteed to be
> > contiguous. So this patch will keep writes in a multi-page bitmap from
> > trashing data beyond the bitmap, but can create writes which corrupt
> > other parts of the bitmap with random memory.
>
> Is this problem introduced by:
>
> 8745faa95611 ("md: Use optimal I/O size for last bitmap page")

I think so.

>
> >
> > The opt using logic in this function is fundamentally flawed as
> > __write_sb_page should never send a write bigger than a page at a time.
> > It would need to use a new interface which can build multi-page bio and
> > not md_super_write() if it wanted to send multi-page I/Os.
>
> I agree. And I don't understand that patch yet, it said:
>
> If the bitmap space has enough room, size the I/O for the last bitmap
> page write to the optimal I/O size for the storage device.
>
> Does this mean, for example, if bitmap space is 128k, while there is
> only one page, means 124k is not used. In this case, if device opt io
> size is 128k, this patch will choose to issue 128k IO instead of just
> 4k IO? And how can this improve performance ...

It looks like it does what you mentioned above. From the commit message
of 8745faa95611, the io size (3584 bytes, 7 sectors) is smaller than 4K.
Without patch 8745faa95611, the io size is rounded up to
bdev_logical_block_size(bdev). If we round the io size up to PAGE_SIZE
instead, can that fix the performance problem? Because the bitmap space
is 4K/64K/128K, if changing the round-up value to PAGE_SIZE doesn't
affect performance, it'll be easy to fix this problem.
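
A sketch of the change being suggested here (assumed variable names from
the upstream __write_sb_page(), not a tested patch):

        /* Round the last bitmap page write up to a whole page rather
         * than sizing it by the device's optimal I/O size. */
        size = roundup(last_page_size, PAGE_SIZE);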

Best Regards
Xiao

> Thanks,
> Kuai
>
>
Xiao Ni March 3, 2025, 1:40 p.m. UTC | #3
On Fri, Feb 28, 2025 at 10:07 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2025/02/25 23:32, Nigel Croxon 写道:
> > -       md_super_write(mddev, rdev, sboff + ps, (int) size, page);
> > +       md_super_write(mddev, rdev, sboff + ps, (int)min(size,
> > bitmap_limit), page);
> >          return 0;
> >
> > This patch still will attempt to send writes greater than a page using
> > only a single page pointer for multi-page bitmaps. The bitmap uses an
> > array (the filemap) of pages when the bitmap cannot fit in a single
> > page. These pages are allocated separately and not guaranteed to be
> > contiguous. So this patch will keep writes in a multi-page bitmap from
> > trashing data beyond the bitmap, but can create writes which corrupt
> > other parts of the bitmap with random memory.
>
> Is this problem introduced by:
>
> 8745faa95611 ("md: Use optimal I/O size for last bitmap page")
>
> >
> > The opt using logic in this function is fundamentally flawed as
> > __write_sb_page should never send a write bigger than a page at a time.
> > It would need to use a new interface which can build multi-page bio and
> > not md_super_write() if it wanted to send multi-page I/Os.
>
> I agree. And I don't understand that patch yet, it said:
>
> If the bitmap space has enough room, size the I/O for the last bitmap
> page write to the optimal I/O size for the storage device.
>
> Does this mean, for example, if bitmap space is 128k, while there is
> only one page, means 124k is not used. In this case, if device opt io
> size is 128k, this patch will choose to issue 128k IO instead of just
> 4k IO? And how can this improve performance ...

This problem should already be fixed by patch
commit ab99a87542f194f28e2364a42afbf9fb48b1c724
Author: Ofir Gal <ofir.gal@volumez.com>
Date:   Fri Jun 7 10:27:44 2024 +0300

    md/md-bitmap: fix writing non bitmap pages

Regards
Xiao
>
> Thanks,
> Kuai
>
>
Su Yue March 3, 2025, 3:10 p.m. UTC | #4
On Fri 28 Feb 2025 at 15:46, Xiao Ni <xni@redhat.com> wrote:

> On Fri, Feb 28, 2025 at 10:07 AM Yu Kuai 
> <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2025/02/25 23:32, Nigel Croxon 写道:
>> > -       md_super_write(mddev, rdev, sboff + ps, (int) size, 
>> > page);
>> > +       md_super_write(mddev, rdev, sboff + ps, 
>> > (int)min(size,
>> > bitmap_limit), page);
>> >          return 0;
>> >
>> > This patch still will attempt to send writes greater than a 
>> > page using
>> > only a single page pointer for multi-page bitmaps. The bitmap 
>> > uses an
>> > array (the filemap) of pages when the bitmap cannot fit in a 
>> > single
>> > page. These pages are allocated separately and not guaranteed 
>> > to be
>> > contiguous. So this patch will keep writes in a multi-page 
>> > bitmap from
>> > trashing data beyond the bitmap, but can create writes which 
>> > corrupt
>> > other parts of the bitmap with random memory.
>>
>> Is this problem introduced by:
>>
>> 8745faa95611 ("md: Use optimal I/O size for last bitmap page")
>
> I think so.
>
>>
>> >
>> > The opt using logic in this function is fundamentally flawed 
>> > as
>> > __write_sb_page should never send a write bigger than a page 
>> > at a time.
>> > It would need to use a new interface which can build 
>> > multi-page bio and
>> > not md_super_write() if it wanted to send multi-page I/Os.
>>
>> I agree. And I don't understand that patch yet, it said:
>>
>> If the bitmap space has enough room, size the I/O for the last 
>> bitmap
>> page write to the optimal I/O size for the storage device.
>>
>> Does this mean, for example, if bitmap space is 128k, while 
>> there is
>> only one page, means 124k is not used. In this case, if device 
>> opt io
>> size is 128k, this patch will choose to issue 128k IO instead 
>> of just
>> 4k IO? And how can this improve performance ...
>
> It looks like it does as you mentioned above. Through the commit
> 8745faa95611 message, the io size(3584 bytes, 7 sectors) is 
> smaller
> than 4K. Without the patch 8745faa95611,  the io size is round 
> up with
> bdev_logical_block_size(bdev). If we round up io size with 
> PAGE_SIZE,
> can it fix the performance problem? Because bitmap space is
> 4K/64K/128K, if it doesn't affect performance only changing the 
> round
> up value to PAGE_SIZE, it'll easy to fix this problem.
>

I'm afraid that the perf will drop if we round the io size up to 4K for 
devices whose optimal I/O size is smaller than 4K. IMO a better version 
of md_super_write() is necessary to handle bitmap pages, as Nigel said.

--
Su

> Best Regards
> Xiao
>
>> Thanks,
>> Kuai
>>
>>
Xiao Ni March 4, 2025, 1:06 a.m. UTC | #5
On Mon, Mar 3, 2025 at 11:16 PM Su Yue <l@damenly.org> wrote:
>
> On Fri 28 Feb 2025 at 15:46, Xiao Ni <xni@redhat.com> wrote:
>
> > On Fri, Feb 28, 2025 at 10:07 AM Yu Kuai
> > <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> 在 2025/02/25 23:32, Nigel Croxon 写道:
> >> > -       md_super_write(mddev, rdev, sboff + ps, (int) size,
> >> > page);
> >> > +       md_super_write(mddev, rdev, sboff + ps,
> >> > (int)min(size,
> >> > bitmap_limit), page);
> >> >          return 0;
> >> >
> >> > This patch still will attempt to send writes greater than a
> >> > page using
> >> > only a single page pointer for multi-page bitmaps. The bitmap
> >> > uses an
> >> > array (the filemap) of pages when the bitmap cannot fit in a
> >> > single
> >> > page. These pages are allocated separately and not guaranteed
> >> > to be
> >> > contiguous. So this patch will keep writes in a multi-page
> >> > bitmap from
> >> > trashing data beyond the bitmap, but can create writes which
> >> > corrupt
> >> > other parts of the bitmap with random memory.
> >>
> >> Is this problem introduced by:
> >>
> >> 8745faa95611 ("md: Use optimal I/O size for last bitmap page")
> >
> > I think so.
> >
> >>
> >> >
> >> > The opt using logic in this function is fundamentally flawed
> >> > as
> >> > __write_sb_page should never send a write bigger than a page
> >> > at a time.
> >> > It would need to use a new interface which can build
> >> > multi-page bio and
> >> > not md_super_write() if it wanted to send multi-page I/Os.
> >>
> >> I agree. And I don't understand that patch yet, it said:
> >>
> >> If the bitmap space has enough room, size the I/O for the last
> >> bitmap
> >> page write to the optimal I/O size for the storage device.
> >>
> >> Does this mean, for example, if bitmap space is 128k, while
> >> there is
> >> only one page, means 124k is not used. In this case, if device
> >> opt io
> >> size is 128k, this patch will choose to issue 128k IO instead
> >> of just
> >> 4k IO? And how can this improve performance ...
> >
> > It looks like it does as you mentioned above. Through the commit
> > 8745faa95611 message, the io size(3584 bytes, 7 sectors) is
> > smaller
> > than 4K. Without the patch 8745faa95611,  the io size is round
> > up with
> > bdev_logical_block_size(bdev). If we round up io size with
> > PAGE_SIZE,
> > can it fix the performance problem? Because bitmap space is
> > 4K/64K/128K, if it doesn't affect performance only changing the
> > round
> > up value to PAGE_SIZE, it'll easy to fix this problem.
> >
>
> I'm afraid that the perf will drop if we round the io size up to 4K
> for devices whose optimal I/O size is smaller than 4K. IMO a better
> version of md_super_write() is necessary to handle bitmap pages, as
> Nigel said.
>
> --
> Su

Hi Su

You're right. Thanks for pointing out this.

Best Regards
Xiao
>
> > Best Regards
> > Xiao
> >
> >> Thanks,
> >> Kuai
> >>
> >>
>

Patch

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 0a2d37eb38ef..08232d8dc815 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -227,6 +227,8 @@ static int __write_sb_page(struct md_rdev *rdev, struct bitmap *bitmap,
         struct block_device *bdev;
         struct mddev *mddev = bitmap->mddev;
         struct bitmap_storage *store = &bitmap->storage;
+       unsigned int bitmap_limit = (bitmap->storage.file_pages - pg_index) <<
+               PAGE_SHIFT;
         loff_t sboff, offset = mddev->bitmap_info.offset;