Message ID | 20250328060853.4124527-1-yukuai1@huaweicloud.com (mailing list archive)
---|---
Series | md: introduce a new lockless bitmap
On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
> A hidden disk, named mdxxx_bitmap, is created for the bitmap, see details
> in llbitmap_add_disk(). And a file is created as well to manage bitmap IO
> for this disk, see details in llbitmap_open_disk(). Reading/writing the
> bitmap is converted to buffer IO to this file.
>
> The IO fast path will set bits to dirty, and those dirty bits will be
> cleared by the daemon after IO is done. llbitmap_barrier is used to
> synchronize between the IO path and the daemon.

Why do you need a separate gendisk? I'll try to find some time to read
the code to understand what it does, but it would also be really useful
to explain the need for such an unusual concept here.
Hi,

On 2025/03/28 19:06, Christoph Hellwig wrote:
> On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
>> A hidden disk, named mdxxx_bitmap, is created for the bitmap, see
>> details in llbitmap_add_disk(). And a file is created as well to manage
>> bitmap IO for this disk, see details in llbitmap_open_disk().
>> Reading/writing the bitmap is converted to buffer IO to this file.
>>
>> The IO fast path will set bits to dirty, and those dirty bits will be
>> cleared by the daemon after IO is done. llbitmap_barrier is used to
>> synchronize between the IO path and the daemon.
>
> Why do you need a separate gendisk? I'll try to find some time to read
> the code to understand what it does, but it would also be really useful
> to explain the need for such an unusual concept here.

The purpose here is to hide the low level bitmap IO implementation behind
the API disk->submit_bio(), so that bitmap IO can be converted to buffer
IO to the bdev_file. This is the easiest way that I can think of to reuse
the pagecache, with its natural ability for dirty page writeback. I did
think about creating a new anon file and implementing a new
file_operations, but this would be much more complicated.

Meanwhile, the bitmap file for the old bitmap will be removed sooner or
later, and this bdev_file implementation will be compatible with bitmap
file as well.

Thanks,
Kuai
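[Editor's sketch] The cover letter's handoff — the IO fast path only sets bits to dirty and the daemon clears what it flushed — can be modeled in a few lines of userspace C. This is a toy using a single atomic word, not the kernel code (the real series tracks dirty state per page and uses llbitmap_barrier, which this sketch does not model):

```c
#include <stdatomic.h>

/* One atomic word of dirty bits; the real bitmap is per-page, this is a toy. */
static atomic_ulong dirty_bits;

/* IO fast path: mark a block dirty with a single atomic OR, no locking. */
static void fastpath_set_dirty(unsigned int bit)
{
    atomic_fetch_or(&dirty_bits, 1UL << bit);
}

/* Daemon: snapshot the dirty bits, write them back, then clear exactly the
 * bits that were observed, so bits dirtied concurrently are not lost. */
static unsigned long daemon_flush(void)
{
    unsigned long snap = atomic_load(&dirty_bits);
    /* ... the actual writeback (e.g. filemap_write_and_wait_range()
     * in the series) would happen here ... */
    atomic_fetch_and(&dirty_bits, ~snap);
    return snap;
}
```

The key property illustrated: clearing with `&= ~snap` rather than storing 0 means a bit set by a racing writer between the snapshot and the clear survives for the next daemon pass.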
On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
> 1) users must apply the following mdadm patch, then llbitmap can be
>    enabled by --bitmap=lockless
>    https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/
> 2) this set is cooked on top of my other set:
>    https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/

I tried to create a tree to review the entire thing but failed. Can you
please also provide a working git branch?
Hi,

On 2025/04/04 17:27, Christoph Hellwig wrote:
> On Fri, Mar 28, 2025 at 02:08:39PM +0800, Yu Kuai wrote:
>> 1) users must apply the following mdadm patch, then llbitmap can be
>>    enabled by --bitmap=lockless
>>    https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/
>> 2) this set is cooked on top of my other set:
>>    https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/
>
> I tried to create a tree to review the entire thing but failed. Can you
> please also provide a working git branch?

Of course, here is the branch:

https://git.kernel.org/pub/scm/linux/kernel/git/yukuai/linux.git/log/?h=md-6.15

Thanks,
Kuai
On Sat, Mar 29, 2025 at 09:11:13AM +0800, Yu Kuai wrote:
> The purpose here is to hide the low level bitmap IO implementation behind
> the API disk->submit_bio(), so that bitmap IO can be converted to buffer
> IO to the bdev_file. This is the easiest way that I can think of to reuse
> the pagecache, with its natural ability for dirty page writeback. I did
> think about creating a new anon file and implementing a new
> file_operations, but this would be much more complicated.

I've started looking at this a bit now, sorry for the delay.

As far as I can see you use the bitmap file just so that you have your
own struct address_space and thus page cache instance, and then call
read_mapping_page and filemap_write_and_wait_range on it, right?

For that you'd be much better off just creating your own trivial
file_system_type with an inode fully controlled by your driver that has
a trivial set of address_space ops, instead of oddly mixing with the
block layer.

Note that either way I'm not sure using the page cache here is all that
good an idea, as we're at the bottom of the I/O stack and thus memory
allocations can very easily deadlock.

What speaks against using your own folios, explicitly allocated at probe
time, and then just doing manual submit_bio on them? That's probably not
much more code but a lot more robust.

Also a high level note: the bitmap_operations aren't a very nice
interface. A lot of methods are empty and should just be called
conditionally. Or even better, you'd do away with the expensive indirect
calls and just directly call either the old or new bitmap code.

> Meanwhile, the bitmap file for the old bitmap will be removed sooner or
> later, and this bdev_file implementation will be compatible with bitmap
> file as well.

Which would also mean that at that point the operations vector would be
pointless, so we might as well not add it to start with.
Hi,

On 2025/04/09 16:32, Christoph Hellwig wrote:
> On Sat, Mar 29, 2025 at 09:11:13AM +0800, Yu Kuai wrote:
>> The purpose here is to hide the low level bitmap IO implementation
>> behind the API disk->submit_bio(), so that bitmap IO can be converted
>> to buffer IO to the bdev_file. This is the easiest way that I can
>> think of to reuse the pagecache, with its natural ability for dirty
>> page writeback. I did think about creating a new anon file and
>> implementing a new file_operations, but this would be much more
>> complicated.
>
> I've started looking at this a bit now, sorry for the delay.
>
> As far as I can see you use the bitmap file just so that you have your
> own struct address_space and thus page cache instance, and then call
> read_mapping_page and filemap_write_and_wait_range on it, right?

Yes.

> For that you'd be much better off just creating your own trivial
> file_system_type with an inode fully controlled by your driver that has
> a trivial set of address_space ops, instead of oddly mixing with the
> block layer.

Yes, this is exactly what I meant by implementing a new file_operations
(and address_space ops). I wanted to do this the easy way and just reuse
the raw block device ops, so that I only need to implement the
submit_bio op for the new hidden disk.

I can try a new fs type if we really think this solution is too hacky,
however, the line count will be much higher. :(

> Note that either way I'm not sure using the page cache here is all that
> good an idea, as we're at the bottom of the I/O stack and thus memory
> allocations can very easily deadlock.

Yes. For the bitmap pages, this set does it the easy way and just reads
and pins all related pages while loading the bitmap, for two reasons:

1) We don't need to allocate and read pages from the IO path (in the
   first RFC version, I was using a worker to do that).
2) In the first RFC version, I found and got the page in the IO path; it
   turns out the page reference is an *atomic*, and the overhead was not
   acceptable.

And the only action from the IO path is that if a bitmap page is dirty,
filemap_write_and_wait_range() is called from an async worker, the same
as the old bitmap, to flush dirty bitmap pages.

> What speaks against using your own folios, explicitly allocated at
> probe time, and then just doing manual submit_bio on them? That's
> probably not much more code but a lot more robust.

I'm not quite sure I understand you correctly. Do you mean don't use the
pagecache for bitmap IO, manually create BIOs like the old bitmap, and
meanwhile invent a new solution for synchronization instead of the
global spin_lock of the old bitmap?

Thanks,
Kuai

> Also a high level note: the bitmap_operations aren't a very nice
> interface. A lot of methods are empty and should just be called
> conditionally. Or even better, you'd do away with the expensive
> indirect calls and just directly call either the old or new bitmap
> code.
>
>> Meanwhile, the bitmap file for the old bitmap will be removed sooner
>> or later, and this bdev_file implementation will be compatible with
>> bitmap file as well.
>
> Which would also mean that at that point the operations vector would be
> pointless, so we might as well not add it to start with.
On Wed, Apr 09, 2025 at 05:27:11PM +0800, Yu Kuai wrote:
>> For that you'd be much better off just creating your own trivial
>> file_system_type with an inode fully controlled by your driver that
>> has a trivial set of address_space ops, instead of oddly mixing with
>> the block layer.
>
> Yes, this is exactly what I meant by implementing a new file_operations
> (and address_space ops). I wanted to do this the easy way and just
> reuse the raw block device ops, so that I only need to implement the
> submit_bio op for the new hidden disk.
>
> I can try a new fs type if we really think this solution is too hacky,
> however, the line count will be much higher. :(

I don't think it should be much more. It'll also remove the rather
unexpected indirection through submit_bio. Just make sure you use iomap
for your operations and implement the submit_io hook. That will also be
more efficient than the buffer_head based block ops for writes.

>> Note that either way I'm not sure using the page cache here is all
>> that good an idea, as we're at the bottom of the I/O stack and thus
>> memory allocations can very easily deadlock.
>
> Yes. For the bitmap pages, this set does it the easy way and just reads
> and pins all related pages while loading the bitmap, for two reasons:
>
> 1) We don't need to allocate and read pages from the IO path (in the
>    first RFC version, I was using a worker to do that).

You still depend on the worker, which will still deadlock.

>> What speaks against using your own folios, explicitly allocated at
>> probe time, and then just doing manual submit_bio on them? That's
>> probably not much more code but a lot more robust.
>
> I'm not quite sure I understand you correctly. Do you mean don't use
> the pagecache for bitmap IO, manually create BIOs like the old bitmap,
> and meanwhile invent a new solution for synchronization instead of the
> global spin_lock of the old bitmap?

Yes. Alternatively you need to pre-populate the page cache and keep
extra page references.
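[Editor's sketch] The allocation pattern being suggested here — buffers allocated once at probe time so the IO path itself never allocates memory and so cannot deadlock under memory pressure — can be illustrated with a small userspace model. All names and sizes below are invented for illustration; the real code would use folios and submit_bio:

```c
#include <stdlib.h>

#define BITMAP_PAGES  4
#define TOY_PAGE_SIZE 4096

/* All bitmap pages are allocated up front, at "probe" time. */
static unsigned char *bitmap_pages[BITMAP_PAGES];

static int bitmap_probe(void)
{
    for (int i = 0; i < BITMAP_PAGES; i++) {
        bitmap_pages[i] = calloc(1, TOY_PAGE_SIZE);
        if (!bitmap_pages[i])
            return -1; /* fail at probe time, never on the IO path */
    }
    return 0;
}

/* The IO path only touches preallocated memory: no allocation, no
 * page-cache lookup, no per-access page refcounting. */
static void bitmap_set_byte(unsigned int byte_nr, unsigned char val)
{
    bitmap_pages[byte_nr / TOY_PAGE_SIZE][byte_nr % TOY_PAGE_SIZE] = val;
}

static unsigned char bitmap_get_byte(unsigned int byte_nr)
{
    return bitmap_pages[byte_nr / TOY_PAGE_SIZE][byte_nr % TOY_PAGE_SIZE];
}
```

The design point: any failure mode involving allocation is confined to probe, where returning an error is safe; once the array is live, the bitmap hot path is plain memory access.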
Hi,

On 2025/04/09 17:40, Christoph Hellwig wrote:
> On Wed, Apr 09, 2025 at 05:27:11PM +0800, Yu Kuai wrote:
>>> For that you'd be much better off just creating your own trivial
>>> file_system_type with an inode fully controlled by your driver that
>>> has a trivial set of address_space ops, instead of oddly mixing with
>>> the block layer.
>>
>> Yes, this is exactly what I meant by implementing a new
>> file_operations (and address_space ops). I wanted to do this the easy
>> way and just reuse the raw block device ops, so that I only need to
>> implement the submit_bio op for the new hidden disk.
>>
>> I can try a new fs type if we really think this solution is too hacky,
>> however, the line count will be much higher. :(
>
> I don't think it should be much more. It'll also remove the rather
> unexpected indirection through submit_bio. Just make sure you use iomap
> for your operations and implement the submit_io hook. That will also be
> more efficient than the buffer_head based block ops for writes.
>
>>> Note that either way I'm not sure using the page cache here is all
>>> that good an idea, as we're at the bottom of the I/O stack and thus
>>> memory allocations can very easily deadlock.
>>
>> Yes. For the bitmap pages, this set does it the easy way and just
>> reads and pins all related pages while loading the bitmap, for two
>> reasons:
>>
>> 1) We don't need to allocate and read pages from the IO path (in the
>>    first RFC version, I was using a worker to do that).
>
> You still depend on the worker, which will still deadlock.
>
>>> What speaks against using your own folios, explicitly allocated at
>>> probe time, and then just doing manual submit_bio on them? That's
>>> probably not much more code but a lot more robust.
>>
>> I'm not quite sure I understand you correctly. Do you mean don't use
>> the pagecache for bitmap IO, manually create BIOs like the old bitmap,
>> and meanwhile invent a new solution for synchronization instead of the
>> global spin_lock of the old bitmap?
>
> Yes. Alternatively you need to pre-populate the page cache and keep
> extra page references.

OK, I'll think about self-managed pages and the IO path. Meanwhile,
please let me know if you have questions about other parts.

Thanks,
Kuai
From: Yu Kuai <yukuai3@huawei.com>

#### Background

Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data varies depending on the RAID level. It is
important to maintain the consistency of redundant data.

The bitmap is used to record which data blocks have been synchronized and
which ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap
after a power failure or after re-adding a disk. Without a bitmap, a full
disk synchronization is required.

#### Key Concepts

##### State Machine

Each bit occupies one byte and can hold 6 different states, see
llbitmap_state. There are a total of 8 different actions, see
llbitmap_action, that can change the state.

llbitmap state machine: transitions between states

|           | Startwrite | Startsync | Endsync | Abortsync | Reload   | Daemon | Discard   | Stale    |
| --------- | ---------- | --------- | ------- | --------- | -------- | ------ | --------- | -------- |
| Unwritten | Dirty      | x         | x       | x         | x        | x      | x         | x        |
| Clean     | Dirty      | x         | x       | x         | x        | x      | Unwritten | NeedSync |
| Dirty     | x          | x         | x       | x         | NeedSync | Clean  | Unwritten | NeedSync |
| NeedSync  | x          | Syncing   | x       | x         | x        | x      | Unwritten | x        |
| Syncing   | x          | Syncing   | Dirty   | NeedSync  | NeedSync | x      | Unwritten | NeedSync |

Special notes:

- Unwritten is a special state meaning the user never wrote data, hence
  there is no need to resync/recover that data. This is safe if the user
  creates a filesystem on the array: the filesystem will make sure the
  user gets zero data for unwritten blocks.
- After resync is done, the state changes from Syncing to Dirty first, in
  case a Startwrite happens before the state reaches Clean.

##### Bitmap IO

A hidden disk, named mdxxx_bitmap, is created for the bitmap, see details
in llbitmap_add_disk(). And a file is created as well to manage bitmap IO
for this disk, see details in llbitmap_open_disk(). Reading/writing the
bitmap is converted to buffer IO to this file.

The IO fast path will set bits to dirty, and those dirty bits will be
cleared by the daemon after IO is done. llbitmap_barrier is used to
synchronize between the IO path and the daemon.

Test result: to be added

Note:

1) users must apply the following mdadm patch, then llbitmap can be
   enabled by --bitmap=lockless
   https://lore.kernel.org/all/20250327134853.1069356-1-yukuai1@huaweicloud.com/
2) this set is cooked on top of my other set:
   https://lore.kernel.org/all/20250219083456.941760-1-yukuai1@huaweicloud.com/

Yu Kuai (14):
  block: factor out a helper bdev_file_alloc()
  md/md-bitmap: pass discard information to bitmap_{start, end}write
  md/md-bitmap: remove parameter slot from bitmap_create()
  md: add a new sysfs api bitmap_version
  md: delay registeration of bitmap_ops until creating bitmap
  md/md-llbitmap: implement bit state machine
  md/md-llbitmap: implement hidden disk to manage bitmap IO
  md/md-llbitmap: implement APIs for page level dirty bits synchronization
  md/md-llbitmap: implement APIs to mange bitmap lifetime
  md/md-llbitmap: implement APIs to dirty bits and clear bits
  md/md-llbitmap: implement APIs for sync_thread
  md/md-llbitmap: implement all bitmap operations
  md/md-llbitmap: implement sysfs APIs
  md/md-llbitmap: add Kconfig

 block/bdev.c             |   21 +-
 drivers/md/Kconfig       |   12 +
 drivers/md/Makefile      |    1 +
 drivers/md/md-bitmap.c   |   10 +-
 drivers/md/md-bitmap.h   |   21 +-
 drivers/md/md-llbitmap.c | 1410 ++++++++++++++++++++++++++++++++++++++
 drivers/md/md.c          |  180 ++++-
 drivers/md/md.h          |    3 +
 include/linux/blkdev.h   |    1 +
 9 files changed, 1614 insertions(+), 45 deletions(-)
 create mode 100644 drivers/md/md-llbitmap.c
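[Editor's sketch] The transition table in the cover letter can be transcribed into a small userspace C model. The enum names below are invented for this sketch (the actual identifiers in md-llbitmap.c may differ), and treating an "x" entry as "state unchanged" is an assumption of the model, not a claim about the kernel code:

```c
/* States and actions as named in the cover letter's transition table.
 * Names are illustrative, not the kernel's actual identifiers. */
enum toy_state {
    StUnwritten, StClean, StDirty, StNeedSync, StSyncing, StCount,
};

enum toy_action {
    AcStartwrite, AcStartsync, AcEndsync, AcAbortsync,
    AcReload, AcDaemon, AcDiscard, AcStale, AcCount,
};

#define X (-1) /* "x" in the table: assumed to leave the state unchanged */

/* Row-by-row transcription of the cover letter's table. */
static const int state_machine[StCount][AcCount] = {
    /*              Startwrite  Startsync  Endsync  Abortsync   Reload      Daemon   Discard      Stale */
    [StUnwritten] = {StDirty,   X,         X,       X,          X,          X,       X,           X},
    [StClean]     = {StDirty,   X,         X,       X,          X,          X,       StUnwritten, StNeedSync},
    [StDirty]     = {X,         X,         X,       X,          StNeedSync, StClean, StUnwritten, StNeedSync},
    [StNeedSync]  = {X,         StSyncing, X,       X,          X,          X,       StUnwritten, X},
    [StSyncing]   = {X,         StSyncing, StDirty, StNeedSync, StNeedSync, X,       StUnwritten, StNeedSync},
};

/* Apply an action to a bit's state. */
static enum toy_state toy_transition(enum toy_state s, enum toy_action a)
{
    int next = state_machine[s][a];
    return next == X ? s : (enum toy_state)next;
}
```

For example, per the second special note above, a successful Endsync moves Syncing to Dirty rather than straight to Clean, so a Startwrite racing with the end of resync is not lost; a later Daemon pass then moves Dirty to Clean.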