
[RFC,v2,00/14] md: introduce a new lockless bitmap

Message ID 20250328060853.4124527-1-yukuai1@huaweicloud.com

Message

Yu Kuai March 28, 2025, 6:08 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

#### Background

Redundant data is used to enhance data fault tolerance, and the way
redundant data is stored varies depending on the RAID level. It is
important to maintain the consistency of this redundant data.

The bitmap records which data blocks have been synchronized and which
need to be resynchronized or recovered. Each bit in the bitmap represents
a segment of data in the array. When a bit is set, it indicates that the
redundant copies of that data segment may be inconsistent. After a power
failure or after re-adding a disk, data synchronization can be performed
based on the bitmap. Without a bitmap, a full-disk synchronization is
required.

#### Key Concept

##### State Machine

Each bit is one byte and can be in one of 6 different states, see
llbitmap_state. There are 8 different actions, see llbitmap_action, that
can change the state (a code sketch of the transitions follows the notes
below):

llbitmap state machine: transitions between states

|           | Startwrite | Startsync | Endsync | Abortsync | Reload   | Daemon | Discard   | Stale     |
| --------- | ---------- | --------- | ------- | -------  | -------- | ------ | --------- | --------- |
| Unwritten | Dirty      | x         | x       | x        | x        | x      | x         | x         |
| Clean     | Dirty      | x         | x       | x        | x        | x      | Unwritten | NeedSync  |
| Dirty     | x          | x         | x       | x        | NeedSync | Clean  | Unwritten | NeedSync  |
| NeedSync  | x          | Syncing   | x       | x        | x        | x      | Unwritten | x         |
| Syncing   | x          | Syncing   | Dirty   | NeedSync | NeedSync | x      | Unwritten | NeedSync  |

Special notes:
- Unwritten is a special state meaning the user has never written data,
  so there is no need to resync/recover the data. This is safe if the
  user creates a filesystem on the array: the filesystem will make sure
  reads of unwritten blocks return zeroes.
- After resync is done, the state changes from Syncing to Dirty first, in
  case a Startwrite happens before the state becomes Clean.
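
To make the table concrete, here is a minimal sketch of how the transitions
could be encoded as a lookup array, covering the 5 states shown in the
table. The enum member names are illustrative and are not necessarily the
identifiers used in md-llbitmap.c:

```c
/*
 * Illustrative sketch only: the names below are made up for this example.
 * The array mirrors the transition table above; -1 marks an action that
 * is not valid for (does not change) that state.
 */
enum llbitmap_state {
	BitUnwritten = 0,
	BitClean,
	BitDirty,
	BitNeedSync,
	BitSyncing,
	nr_llbitmap_states,
};

enum llbitmap_action {
	ActionStartwrite = 0,
	ActionStartsync,
	ActionEndsync,
	ActionAbortsync,
	ActionReload,
	ActionDaemon,
	ActionDiscard,
	ActionStale,
	nr_llbitmap_actions,
};

static const int llbitmap_transitions[nr_llbitmap_states][nr_llbitmap_actions] = {
	[BitUnwritten] = { BitDirty, -1, -1, -1, -1, -1, -1, -1 },
	[BitClean]     = { BitDirty, -1, -1, -1, -1, -1, BitUnwritten, BitNeedSync },
	[BitDirty]     = { -1, -1, -1, -1, BitNeedSync, BitClean, BitUnwritten, BitNeedSync },
	[BitNeedSync]  = { -1, BitSyncing, -1, -1, -1, -1, BitUnwritten, -1 },
	[BitSyncing]   = { -1, BitSyncing, BitDirty, BitNeedSync, BitNeedSync, -1, BitUnwritten, BitNeedSync },
};

/* Apply an action to a state; returns the new state (unchanged on -1). */
static inline enum llbitmap_state
llbitmap_transition(enum llbitmap_state state, enum llbitmap_action action)
{
	int next = llbitmap_transitions[state][action];

	return next < 0 ? state : next;
}
```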

##### Bitmap IO

A hidden disk, named mdxxx_bitmap, is created for the bitmap; see
llbitmap_add_disk() for details. A file is created as well to manage
bitmap IO for this disk; see llbitmap_open_disk() for details. Bitmap
reads and writes are converted to buffered IO on this file.
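
A rough sketch of how such a hidden disk could be registered with the block
layer follows. Only the block-layer APIs are real, and a current
blk_alloc_disk(lim, node) signature is assumed; the helper names, the
trivial submit_bio body, and the error handling are illustrative and not
taken from the patch:

```c
#include <linux/blkdev.h>
#include <linux/module.h>
#include "md.h"

static void llbitmap_submit_bio_sketch(struct bio *bio)
{
	/*
	 * In the real driver this is where buffered IO issued against the
	 * hidden disk would be remapped onto the array's member disks;
	 * the sketch just completes the bio.
	 */
	bio_endio(bio);
}

static const struct block_device_operations llbitmap_fops_sketch = {
	.owner		= THIS_MODULE,
	.submit_bio	= llbitmap_submit_bio_sketch,
};

static struct gendisk *llbitmap_add_disk_sketch(struct mddev *mddev)
{
	/* NULL queue_limits: accept the block layer defaults */
	struct gendisk *disk = blk_alloc_disk(NULL, NUMA_NO_NODE);

	if (IS_ERR(disk))
		return disk;

	disk->fops = &llbitmap_fops_sketch;
	disk->flags |= GENHD_FL_HIDDEN;	/* no /dev node, not scanned */
	sprintf(disk->disk_name, "%s_bitmap", mdname(mddev));

	if (add_disk(disk)) {
		put_disk(disk);
		return ERR_PTR(-ENOMEM);
	}
	return disk;
}
```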

The IO fast path sets bits to Dirty, and those dirty bits are cleared by
the daemon after the IO is done. llbitmap_barrier is used to synchronize
between the IO path and the daemon.
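
As discussed in the review thread below, the bitmap pages are accessed
through the bdev_file's page cache using read_mapping_page() and written
back with filemap_write_and_wait_range(). A minimal sketch of that access
pattern, with illustrative helper names:

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/* Read one bitmap page through the page cache of the hidden disk. */
static struct page *llbitmap_read_page_sketch(struct file *bdev_file,
					      pgoff_t index)
{
	/* returns the cached page, reading it from disk on a miss */
	return read_mapping_page(bdev_file->f_mapping, index, bdev_file);
}

/* Daemon side: write the dirty bitmap pages in [start, end] back. */
static int llbitmap_flush_sketch(struct file *bdev_file, loff_t start,
				 loff_t end)
{
	return filemap_write_and_wait_range(bdev_file->f_mapping, start, end);
}
```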

Test result: to be added

Notes:
1) Users must apply the following mdadm patch, and then llbitmap can be
enabled with --bitmap=lockless:
https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/
2) This set is built on top of my other set:
https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/

Yu Kuai (14):
  block: factor out a helper bdev_file_alloc()
  md/md-bitmap: pass discard information to bitmap_{start, end}write
  md/md-bitmap: remove parameter slot from bitmap_create()
  md: add a new sysfs api bitmap_version
  md: delay registeration of bitmap_ops until creating bitmap
  md/md-llbitmap: implement bit state machine
  md/md-llbitmap: implement hidden disk to manage bitmap IO
  md/md-llbitmap: implement APIs for page level dirty bits
    synchronization
  md/md-llbitmap: implement APIs to manage bitmap lifetime
  md/md-llbitmap: implement APIs to dirty bits and clear bits
  md/md-llbitmap: implement APIs for sync_thread
  md/md-llbitmap: implement all bitmap operations
  md/md-llbitmap: implement sysfs APIs
  md/md-llbitmap: add Kconfig

 block/bdev.c             |   21 +-
 drivers/md/Kconfig       |   12 +
 drivers/md/Makefile      |    1 +
 drivers/md/md-bitmap.c   |   10 +-
 drivers/md/md-bitmap.h   |   21 +-
 drivers/md/md-llbitmap.c | 1410 ++++++++++++++++++++++++++++++++++++++
 drivers/md/md.c          |  180 ++++-
 drivers/md/md.h          |    3 +
 include/linux/blkdev.h   |    1 +
 9 files changed, 1614 insertions(+), 45 deletions(-)
 create mode 100644 drivers/md/md-llbitmap.c

Comments

Christoph Hellwig March 28, 2025, 11:06 a.m. UTC | #1
On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
> A hidden disk, named mdxxx_bitmap, is created for bitmap, see details in
> llbitmap_add_disk(). And a file is created as well to manage bitmap IO for
> this disk, see details in llbitmap_open_disk(). Read/write bitmap is
> converted to buffer IO to this file.
> 
> IO fast path will set bits to dirty, and those dirty bits will be cleared
> by daemon after IO is done. llbitmap_barrier is used to syncronize between
> IO path and daemon;

Why do you need a separate gendisk?  I'll try to find some time to read
the code to understand what it does, but it would also be really useful
to explain the need for such an unusual concept here.
Yu Kuai March 29, 2025, 1:11 a.m. UTC | #2
Hi,

On 2025/03/28 19:06, Christoph Hellwig wrote:
> On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
>> A hidden disk, named mdxxx_bitmap, is created for bitmap, see details in
>> llbitmap_add_disk(). And a file is created as well to manage bitmap IO for
>> this disk, see details in llbitmap_open_disk(). Read/write bitmap is
>> converted to buffer IO to this file.
>>
>> IO fast path will set bits to dirty, and those dirty bits will be cleared
>> by daemon after IO is done. llbitmap_barrier is used to syncronize between
>> IO path and daemon;
> 
> Why do you need a separate gendisk?  I'll try to find some time to read
> the code to understand what it does, but it would also be really useful
> to explain the need for such an unusual concept here.

The purpose here is to hide the low-level bitmap IO implementation behind
the disk->submit_bio() API, so that bitmap IO can be converted to buffered
IO on the bdev_file. This is the easiest way that I can think of to reuse
the page cache, with its natural ability for dirty page writeback. I did
think about creating a new anon file and implementing new file_operations,
but that would be much more complicated.

Meanwhile, the bitmap file for the old bitmap will be removed sooner or
later, and this bdev_file implementation will be compatible with the
bitmap file as well.

Thanks,
Kuai

Christoph Hellwig April 4, 2025, 9:27 a.m. UTC | #3
On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
> 1) user must apply the following mdadm patch, and then llbitmap can be
> enabled by --bitmap=lockless
> https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/
> 2) this set is cooked on the top of my other set:
> https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/

I tried to create a tree to review the entire thing but failed.  Can you
please also provide a working git branch?
Yu Kuai April 7, 2025, 1:09 a.m. UTC | #4
Hi,

On 2025/04/04 17:27, Christoph Hellwig wrote:
> On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
>> 1) user must apply the following mdadm patch, and then llbitmap can be
>> enabled by --bitmap=lockless
>> https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/
>> 2) this set is cooked on the top of my other set:
>> https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/
> 
> I tried to create a tree to review the entire thing but failed.  Can you
> please also provide a working git branch?

Of course, here is the branch:

https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=md-6.15

Thanks,
Kuai

Christoph Hellwig April 9, 2025, 8:32 a.m. UTC | #5
On Sat, Mar 29, 2025 at 09:11:13AM +0800, Yu Kuai wrote:
> The purpose here is to hide the low level bitmap IO implementation to
> the API disk->submit_bio(), and the bitmap IO can be converted to buffer
> IO to the bdev_file. This is the easiest way that I can think of to
> resue the pagecache, with natural ability for dirty page writeback. I do
> think about creating a new anon file and implement a new
> file_operations, this will be much more complicated.

I've started looking at this a bit now, sorry for the delay.

As far as I can see you use the bitmap file just so that you have your
own struct address_space and thus page cache instance and then call
read_mapping_page and filemap_write_and_wait_range on it right?

For that you'd be much better off just creating your own trivial
file_system_type with an inode fully controlled by your driver
that has a trivial set of address_space ops instead of oddly
mixing with the block layer.
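
For illustration, here is a minimal sketch of the kind of trivial pseudo
filesystem being suggested: one internal mount whose anonymous inode gives
the driver its own address_space. The names and the magic number are
invented for this example and are not from any posted patch:

```c
#include <linux/fs.h>
#include <linux/pseudo_fs.h>
#include <linux/mount.h>
#include <linux/err.h>

#define LLBITMAP_MAGIC	0x6c6c626d	/* arbitrary value for the sketch */

static int llbitmap_init_fs_context(struct fs_context *fc)
{
	return init_pseudo(fc, LLBITMAP_MAGIC) ? 0 : -ENOMEM;
}

static struct file_system_type llbitmap_fs_type = {
	.name			= "md-llbitmap",
	.init_fs_context	= llbitmap_init_fs_context,
	.kill_sb		= kill_anon_super,
};

/* Called once at init: mount the pseudo fs and allocate the inode. */
static struct inode *llbitmap_get_inode(struct vfsmount **mnt)
{
	struct inode *inode;

	*mnt = kern_mount(&llbitmap_fs_type);
	if (IS_ERR(*mnt))
		return ERR_CAST(*mnt);

	inode = alloc_anon_inode((*mnt)->mnt_sb);
	/* inode->i_mapping->a_ops would then point at the driver's aops */
	return inode;
}
```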

Note that either way I'm not sure using the page cache here is an
all that good idea, as we're at the bottom of the I/O stack and
thus memory allocations can very easily deadlock.

What speaks against using your own folios explicitly allocated at
probe time and then just doing manual submit_bio on that?  That's
probably not much more code but a lot more robust.
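
A small sketch of this alternative, assuming pre-allocated order-0 folios;
only the block-layer and mm APIs are real, the helper names are
illustrative:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate one order-0 folio up front, so the IO path never allocates. */
static struct folio *llbitmap_alloc_folio_sketch(void)
{
	return folio_alloc(GFP_KERNEL, 0);
}

/* Write one pre-allocated folio to the given sector and wait for it. */
static int llbitmap_write_folio_sketch(struct block_device *bdev,
				       struct folio *folio, sector_t sector)
{
	struct bio *bio;
	int ret;

	/* GFP_NOIO: we sit below the filesystem, avoid recursing into IO */
	bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_SYNC, GFP_NOIO);
	bio->bi_iter.bi_sector = sector;
	if (!bio_add_folio(bio, folio, folio_size(folio), 0)) {
		bio_put(bio);
		return -EIO;
	}

	ret = submit_bio_wait(bio);
	bio_put(bio);
	return ret;
}
```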

Also a high level note: the bitmap_operations aren't a very nice
interface.  A lot of methods are empty and should just be called
conditionally.  Or even better you'd do away with the expensive
indirect calls and just directly call either the old or new
bitmap code.

> Meanwhile, bitmap file for the old bitmap will be removed sooner or
> later, and this bdev_file implementation will compatible with bitmap
> file as well.

Which would also mean that at that point the operations vector would
be pointless, so we might as well not add it to start with.
Yu Kuai April 9, 2025, 9:27 a.m. UTC | #6
Hi,

On 2025/04/09 16:32, Christoph Hellwig wrote:
> On Sat, Mar 29, 2025 at 09:11:13AM +0800, Yu Kuai wrote:
>> The purpose here is to hide the low level bitmap IO implementation to
>> the API disk->submit_bio(), and the bitmap IO can be converted to buffer
>> IO to the bdev_file. This is the easiest way that I can think of to
>> resue the pagecache, with natural ability for dirty page writeback. I do
>> think about creating a new anon file and implement a new
>> file_operations, this will be much more complicated.
> 
> I've started looking at this a bit now, sorry for the delay.
> 
> As far as I can see you use the bitmap file just so that you have your
> own struct address_space and thus page cache instance and then call
> read_mapping_page and filemap_write_and_wait_range on it right?
Yes.

> 
> For that you'd be much better of just creating your own trivial
> file_system_type with an inode fully controlled by your driver
> that has a trivial set of address_space ops instead of oddly
> mixing with the block layer.

Yes, this is exactly what I meant by implementing new file_operations
(and address_space ops). I wanted to do this the easy way and just reuse
the raw block device ops; that way I only need to implement the
submit_bio op for the new hidden disk.

I can try a new fs type if we really think this solution is too hacky;
however, it will be a lot more lines of code. :(

> 
> Note that either way I'm not sure using the page cache here is an
> all that good idea, as we're at the bottom of the I/O stack and
> thus memory allocations can very easily deadlock.

Yes, for the bitmap pages, this set takes the easy way and just reads and
pins all related pages while loading the bitmap, for two reasons:

1) We don't need to allocate and read pages from the IO path (in the
first RFC version, I was using a worker to do that).
2) In the first RFC version, I looked up and got the page in the IO path;
it turns out the page reference is an *atomic*, and the overhead is not
acceptable.

The only action from the IO path is that, if a bitmap page is dirty,
filemap_write_and_wait_range() is called from an async worker, the same
as the old bitmap, to flush dirty bitmap pages.
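
For illustration, a sketch of this load-time pre-population and pinning;
the helper name is made up, only read_mapping_page() is a real API:

```c
#include <linux/pagemap.h>

static int llbitmap_cache_pages_sketch(struct file *bdev_file,
				       struct page **pages, pgoff_t nr_pages)
{
	pgoff_t i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page = read_mapping_page(bdev_file->f_mapping,
						      i, bdev_file);

		if (IS_ERR(page))
			return PTR_ERR(page);	/* unwinding omitted */

		/*
		 * read_mapping_page() returns the page with a reference
		 * held; keeping it pins the page until put_page() at
		 * teardown time.
		 */
		pages[i] = page;
	}
	return 0;
}
```
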
> 
> What speaks against using your own folios explicitly allocated at
> probe time and then just doing manual submit_bio on that?  That's
> probably not much more code but a lot more robust.

I'm not quite sure I understand you correctly. Do you mean: don't use the
page cache for bitmap IO, manually create BIOs like the old bitmap does,
and invent a new synchronization scheme instead of the old bitmap's
global spin_lock?

Thanks,
Kuai

> 
> Also a high level note: the bitmap_operations aren't a very nice
> interface.  A lot of methods are empty and should just be called
> conditionally.  Or even better you'd do away with the expensive
> indirect calls and just directly call either the old or new
> bitmap code.
> 
>> Meanwhile, bitmap file for the old bitmap will be removed sooner or
>> later, and this bdev_file implementation will compatible with bitmap
>> file as well.
> 
> Which would also mean that at that point the operations vector would
> be pointless, so we might as well not add it to start with.
Christoph Hellwig April 9, 2025, 9:40 a.m. UTC | #7
On Wed, Apr 09, 2025 at 05:27:11PM +0800, Yu Kuai wrote:
>> For that you'd be much better of just creating your own trivial
>> file_system_type with an inode fully controlled by your driver
>> that has a trivial set of address_space ops instead of oddly
>> mixing with the block layer.
>
> Yes, this is exactly what I said implement a new file_operations(and
> address_space ops), I wanted do this the easy way, just reuse the raw
> block device ops, this way I just need to implement the submit_bio ops
> for new hidden disk.
>
> I can try with new fs type if we really think this solution is too
> hacky, however, the code line will be much more. :(

I don't think it should be much more.  It'll also remove the rather
unexpected indirection through submit_bio.  Just make sure you use
iomap for your operations, and implement the submit_io hook.  That
will also be more efficient than the buffer_head based block ops
for writes.

>>
>> Note that either way I'm not sure using the page cache here is an
>> all that good idea, as we're at the bottom of the I/O stack and
>> thus memory allocations can very easily deadlock.
>
> Yes, for the page from bitmap, this set do the easy way just read and
> ping all realted pages while loading the bitmap. For two reasons:
>
> 1) We don't need to allocate and read pages from IO path;(In the first
> RFC version, I'm using a worker to do that).

You still depend on the worker, which will still deadlock.

>> What speaks against using your own folios explicitly allocated at
>> probe time and then just doing manual submit_bio on that?  That's
>> probably not much more code but a lot more robust.
>
> I'm not quite sure if I understand you correctly. Do you means don't use
> pagecache for bitmap IO, and manually create BIOs like the old bitmap,
> meanwhile invent a new solution for synchronism instead of the global
> spin_lock from old bitmap?

Yes.  Alternatively you need to pre-populate the page cache and keep
extra page references.
Yu Kuai April 11, 2025, 1:36 a.m. UTC | #8
Hi,

On 2025/04/09 17:40, Christoph Hellwig wrote:
> On Wed, Apr 09, 2025 at 05:27:11PM +0800, Yu Kuai wrote:
>>> For that you'd be much better of just creating your own trivial
>>> file_system_type with an inode fully controlled by your driver
>>> that has a trivial set of address_space ops instead of oddly
>>> mixing with the block layer.
>>
>> Yes, this is exactly what I said implement a new file_operations(and
>> address_space ops), I wanted do this the easy way, just reuse the raw
>> block device ops, this way I just need to implement the submit_bio ops
>> for new hidden disk.
>>
>> I can try with new fs type if we really think this solution is too
>> hacky, however, the code line will be much more. :(
> 
> I don't think it should be much more.  It'll also remove the rather
> unexpected indirection through submit_bio.  Just make sure you use
> iomap for your operations, and implement the submit_io hook.  That
> will also be more efficient than the buffer_head based block ops
> for writes.
> 
>>>
>>> Note that either way I'm not sure using the page cache here is an
>>> all that good idea, as we're at the bottom of the I/O stack and
>>> thus memory allocations can very easily deadlock.
>>
>> Yes, for the page from bitmap, this set do the easy way just read and
>> ping all realted pages while loading the bitmap. For two reasons:
>>
>> 1) We don't need to allocate and read pages from IO path;(In the first
>> RFC version, I'm using a worker to do that).
> 
> You still depend on the worker, which will still deadlock.
> 
>>> What speaks against using your own folios explicitly allocated at
>>> probe time and then just doing manual submit_bio on that?  That's
>>> probably not much more code but a lot more robust.
>>
>> I'm not quite sure if I understand you correctly. Do you means don't use
>> pagecache for bitmap IO, and manually create BIOs like the old bitmap,
>> meanwhile invent a new solution for synchronism instead of the global
>> spin_lock from old bitmap?
> 
> Yes.  Alternatively you need to pre-populate the page cache and keep
> extra page references.

OK, I'll think about self-managed pages and the IO path. Meanwhile,
please let me know if you have questions about other parts.

Thanks,
Kuai
