Message ID | 4a368301.KBrCZcokAm87BF/S%heinzm@redhat.com (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Jonathan Brassow |
Headers | show |
On Mon, Jun 15, 2009 at 07:21:05PM +0200, heinzm@redhat.com wrote: > Hi, > > this patch series introduces the raid45 target. > Please include upstream. NACK, no need for another raid target, especially if it happens to be as broken as the raid1 one. Please help on making the md raid4/5 code more useful for your purposes (whatever that may be). -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, 2009-06-16 at 10:09 -0400, Christoph Hellwig wrote: > On Mon, Jun 15, 2009 at 07:21:05PM +0200, heinzm@redhat.com wrote: > > Hi, > > > > this patch series introduces the raid45 target. > > Please include upstream. > > NACK, no need for another raid target, especially if it happens to be > as broken as the raid1 one. You didn't review it, it seems? What are your particular pointers to any raid1 code issues? Please point to particular code of concern. Mind you, we've got clustered mirroring with it, which isn't supported in MD. > Please help on making the md raid4/5 code > more useful for your purposes (whatever that may be). Supporting various ATARAID raid5 mappings via dmraid long before md got external metadata support added. Heinz > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, Jun 16, 2009 at 7:51 AM, Heinz Mauelshagen<heinzm@redhat.com> wrote: > On Tue, 2009-06-16 at 10:09 -0400, Christoph Hellwig wrote: >> On Mon, Jun 15, 2009 at 07:21:05PM +0200, heinzm@redhat.com wrote: >> > Hi, >> > >> > this patch series introduces the raid45 target. >> > Please include upstream. >> >> NACK, no need for another raid target, especially if it happens to be >> as broken as the raid1 one. > > You didn't review it, it seems? Will you be allowing time for review? This seems a bit late for 2.6.31, as your "Please include upstream" implies. >> Please help on making the md raid4/5 code >> more useful for your purposes (whatever that may be). > > Supporting various ATARAID raid5 mappings via dmraid long before md got > external metadata support added. This is an opportunity to investigate what technical hurdles remain to allow dmraid to use md infrastructure [1] (or as Neil proposed [2] a new unified virtual block device infrastructure) for activating external metadata raid5 arrays. The libata/ide situation shows that there is precedent for duplicate functionality in the kernel, but my personal opinion is that we should exhaust the code reuse discussion before heading down that path. Regards, Dan [1] http://marc.info/?l=linux-raid&m=123300614013042&w=2 [2] http://marc.info/?l=linux-fsdevel&m=124450400028660&w=2 -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, 2009-06-16 at 10:55 -0700, Dan Williams wrote: > On Tue, Jun 16, 2009 at 7:51 AM, Heinz Mauelshagen<heinzm@redhat.com> wrote: > > On Tue, 2009-06-16 at 10:09 -0400, Christoph Hellwig wrote: > >> On Mon, Jun 15, 2009 at 07:21:05PM +0200, heinzm@redhat.com wrote: > >> > Hi, > >> > > >> > this patch series introduces the raid45 target. > >> > Please include upstream. > >> > >> NACK, no need for another raid target, especially if it happens to be > >> as broken as the raid1 one. > > > > You didn't review it, it seems? > > Will you be allowing time for review? This seems a bit late for > 2.6.31, as your "Please include upstream" implies. Your help is appreciated, thanks. As you surely know, your colleagues (in particular in Poland) are familiar with these patches already, which will help a quick review. > > >> Please help on making the md raid4/5 code > >> more useful for your purposes (whatever that may be). > > > > Supporting various ATARAID raid5 mappings via dmraid long before md got > > external metadata support added. > > This is an opportunity to investigate what technical hurdles remain to > allow dmraid to use md infrastructure [1] (or as Neil proposed [2] a > new unified virtual block device infrastructure) for activating > external metadata raid5 arrays. The libata/ide situation shows that > there is precedent for duplicate functionality in the kernel, but my > personal opinion is that we should exhaust the code reuse discussion before > heading down that path. 
dmraid supports 11 different metadata formats for various ATARAID vendor-specific solutions plus DDF1:

asr     : Adaptec HostRAID ASR (0,1,10)
ddf1    : SNIA DDF1 (0,1,4,5,linear)
hpt37x  : Highpoint HPT37X (S,0,1,10,01)
hpt45x  : Highpoint HPT45X (S,0,1,10)
isw     : Intel Software RAID (0,1,5,01)
jmicron : JMicron ATARAID (S,0,1)
lsi     : LSI Logic MegaRAID (0,1,10)
nvidia  : NVidia RAID (S,0,1,10,5)
pdc     : Promise FastTrack (S,0,1,10)
sil     : Silicon Image(tm) Medley(tm) (0,1,10)
via     : VIA Software RAID (S,0,1,10)

9 of those still need porting/testing/... to/with MD external metadata support after the Intel IMSM MD port has settled, hence Red Hat is going to continue to support dmraid. That being said: once the future work on a unified virtual block device infrastructure is production ready, we're open to using that. Regards, Heinz > > Regards, > Dan > > [1] http://marc.info/?l=linux-raid&m=123300614013042&w=2 > [2] http://marc.info/?l=linux-fsdevel&m=124450400028660&w=2 -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, Jun 16, 2009 at 12:11 PM, Heinz Mauelshagen<heinzm@redhat.com> wrote: > dmraid supports 11 different metadata formats for various ATARAID vendor > specific solutions plus DDF1: > > asr     : Adaptec HostRAID ASR (0,1,10) > ddf1    : SNIA DDF1 (0,1,4,5,linear) > hpt37x  : Highpoint HPT37X (S,0,1,10,01) > hpt45x  : Highpoint HPT45X (S,0,1,10) > isw     : Intel Software RAID (0,1,5,01) > jmicron : JMicron ATARAID (S,0,1) > lsi     : LSI Logic MegaRAID (0,1,10) > nvidia  : NVidia RAID (S,0,1,10,5) > pdc     : Promise FastTrack (S,0,1,10) > sil     : Silicon Image(tm) Medley(tm) (0,1,10) > via     : VIA Software RAID (S,0,1,10) > > 9 of those still need porting/testing/... to/with MD external metadata > support after the Intel IMSM MD port has settled, hence Red Hat is going > to continue to support dmraid. What I am personally interested in is a way for the dmraid userspace to reuse time-proven md kernel infrastructure where it makes sense. Yes, there are benefits to porting the metadata formats to mdadm, but the point is that there is no immediate need, or even eventual need, to do this work. dmraid already knows how to convert all these formats into a common dmsetup table format, and md already has knobs for userspace to create raid arrays from the same information. Can we not meet in the middle to allow at least dm-raid5 arrays to use md-raid5? Put another way, do we need ~5000 new lines of kernel code to support all the functionality that dmraid requires? What happens when one of these formats wants to add raid6, more code duplication? Can we incrementally extend md-raid to cover the unmet needs of dmraid, what extensions are required? > That being said: once the future work on a unified virtual block device > infrastructure is production ready, we're open to using that. That is the end game, but there are more cooperation opportunities along the way. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tuesday June 16, heinzm@redhat.com wrote: > > That being said: once the future work on a unified virtual block device > infrastructure is production ready, we're open to using that. > I was kind-a hoping that you (and others) would be involved in developing this unified infrastructure, rather than just waiting for it. I think a great first step would be to allow md/raid5 to be used directly as a dm target, thus turning dm-raid5 into a shim layer over md/raid5. The process of doing this would very likely highlight a lot of the issues we would need to address in creating a unified framework. I will try to find time to review your dm-raid5 code with a view to understanding how it plugs in to dm, and then how the md/raid5 engine can be used by dm-raid5. Part of this will be disconnecting the md/raid5 code from any specific knowledge of a gendisk and a request_queue as I suppose a dm-target doesn't own any of these. Also I would probably want the mddev to not have to be on the "all_mddevs" list, as we would not want a 'dm' raid5 to appear in /proc/mdstat. NeilBrown -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Jun 16, 2009, at 5:46 PM, Neil Brown wrote: > On Tuesday June 16, heinzm@redhat.com wrote: >> >> That being said: once the future work on a unified virtual block >> device >> infrastructure is production ready, we're open to use that. >> > > I was kind-a hoping that you (and others) would be involved in > developing this unified infrastructure, rather than just waiting for > it. I think this is what everyone thinks, which is why there is no integration effort. brassow -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Eliminate the 3rd argument to that function. You can use 'dm_rh_bio_to_region'. brassow On Jun 15, 2009, at 12:21 PM, heinzm@redhat.com wrote: > > +void dm_rh_delay_by_region(struct dm_region_hash *rh, > + struct bio *bio, region_t region) > +{ > + struct dm_region *reg; > + > + /* FIXME: locking. */ > + read_lock(&rh->hash_lock); > + reg = __rh_find(rh, region); > + bio_list_add(®->delayed_bios, bio); > + read_unlock(&rh->hash_lock); > +} > +EXPORT_SYMBOL_GPL(dm_rh_delay_by_region); > + > void dm_rh_stop_recovery(struct dm_region_hash *rh) > { > int i; -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, 2009-06-18 at 11:39 -0500, Jonathan Brassow wrote: > Eliminate the 3rd argument to that function. You can use > 'dm_rh_bio_to_region'. No, I can't, because I'm keeping track of regions per single disk as in the mirroring code rather than dividing the whole set's capacity into regions. This is because a disk is divided into 2^N sized chunks and regions have to be 2^M >= 2^N sized. See caller side in dm-raid45.c, function do_io(). Heinz > > brassow > > On Jun 15, 2009, at 12:21 PM, heinzm@redhat.com wrote: > > > > > +void dm_rh_delay_by_region(struct dm_region_hash *rh, > > + struct bio *bio, region_t region) > > +{ > > + struct dm_region *reg; > > + > > + /* FIXME: locking. */ > > + read_lock(&rh->hash_lock); > > + reg = __rh_find(rh, region); > > + bio_list_add(&reg->delayed_bios, bio); > > + read_unlock(&rh->hash_lock); > > +} > > +EXPORT_SYMBOL_GPL(dm_rh_delay_by_region); > > + > > void dm_rh_stop_recovery(struct dm_region_hash *rh) > > { > > int i; > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
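Heinz's constraint above (per-disk 2^N-sector chunks tracked by 2^M >= 2^N-sector regions) can be sketched with a couple of helpers. The names are hypothetical and only illustrate the arithmetic; they are not the actual dm-region-hash API:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;
typedef uint64_t region_t;

/* With a region size of 2^region_shift sectors, a per-device sector
 * maps to its region with a single shift. */
static region_t sector_to_region(sector_t dev_sector, unsigned region_shift)
{
    return dev_sector >> region_shift;
}

/* Check that a proposed region size (in sectors) is valid for a given
 * chunk size: both must be powers of two, and the region must be at
 * least as large as the chunk, so a chunk never straddles regions. */
static int region_size_valid(sector_t region_sectors, sector_t chunk_sectors)
{
    if (region_sectors < chunk_sectors)
        return 0;
    if (region_sectors & (region_sectors - 1))
        return 0;
    if (chunk_sectors & (chunk_sectors - 1))
        return 0;
    return 1;
}
```

The power-of-two requirement is what makes the per-disk region lookup a shift rather than a division, which matters on the io path.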
On Wednesday June 17, neilb@suse.de wrote: > > I will try to find time to review your dm-raid5 code with a view to > understanding how it plugs in to dm, and then how the md/raid5 engine > can be used by dm-raid5. I've had a bit of a look through the dm-raid5 patches. Some observations: - You have your own 'xor' code against which you do a run-time test of the 'xor_block' code which md/raid5 uses - then choose the fastest. This really should not be necessary. If you have xor code that runs faster than anything in xor_block, it really would be best to submit it for inclusion in the common xor code base. - You have introduced "dm-message" which is a framework for sending messages to dm targets. It is used for adjusting the cache size, changing the bandwidth used by resync, and doing things with statistics. Linux already has a framework for sending messages to kernel objects. It is called "sysfs". It would be much nicer if you could use that. - I don't know what Christoph Hellwig was referring to when he suggested that dm-raid1 was broken, but there is one issue with it that concerns me and which is (as far as I can tell) broken in dm-raid45 as well. That issue is summarised in the thread http://www.linux-archive.org/device-mapper-development/282987-dm-raid1-data-corruption.html The problem is that when a drive fails you allow writes to continue without waiting for the metadata to be updated. I.e. metadata updates are asynchronous. The problem scenario for raid5 is: - a drive fails while a write is in process - the drive is marked as faulty and user space is notified - that write, and possibly several others, complete and that status is potentially returned to user-space (e.g. for an fsync) - user space acts on that status (e.g. 
sends an acknowledgement over the network) - the host crashes before the metadata update completes - on reboot, the drive, which just had a transient failure, works fine, and the metadata reports that all the drives are good, but that the array was active so a parity-resync is needed. - All parity blocks are recalculated. However that write we mentioned above is only recorded in the parity block. So that data is lost. Now if you get a crash while the array is degraded, data loss is a real possibility. However silent data loss should not be tolerated. For this reason, metadata updates reporting failed devices really need to be synchronous with respect to writes. - Your algorithm for limiting bandwidth used by resync seems to be based on a % concept (which is rounded to the nearest reciprocal, so 100%, 50%, 33%, 25%, 20%, 16% etc). If there is outstanding IO, the amount of outstanding resync IO must not exceed the given percentage of regular IO. While it is hard to get resync "just right", I think this will have some particularly poor characteristics. Heavy regular IO load, or a non-existent regular load should be handled OK, but I suspect that a light IO load (e.g. one or two threads of synchronous requests) would cause the resync to go much slower than you would want. Also it does not take account of IO to other partitions on a device. It is common to have a number of different arrays on the same device set. You really want the resync to back off based on IO to all of the device, not just to the array. All of these are issues that are already addressed in md/raid5. About the only thing that I saw which does not have a working analogue in md/raid5 is cluster locking. However that is "not yet implemented" in dm-raid5 so that isn't a real difference. Everything else could be made available by md to dm with only a moderate amount of effort. 
Of the different interface routines you require:

+ .ctr = raid_ctr,

This would be quite straightforward, though it might be worth discussing the details of the args passed in. "recovery_bandwidth" (as I say above) I don't think is well chosen, and I don't quite see the point of io_size (maybe I'm missing something obvious)

+ .dtr = raid_dtr,

This is trivial. raid5.stop

+ .map = raid_map,

This is very much like the md/raid5 make_request.

+ .presuspend = raid_presuspend,

I call this 'quiesce'

+ .postsuspend = raid_postsuspend,

I'm not sure exactly how this fits in. Something related to the log...

+ .resume = raid_resume,

->quiesce requesting 'resume' rather than 'suspend'

+ .status = raid_status,

very straightforward

+ .message = raid_message,

As I said, I would rather this didn't exist. But it could be emulated (except for the statistics) if needed - md can already adjust resync speed and cache size. What is missing is agreement on how to handle synchronous metadata updates. md waits for the drive failure to be acknowledged. It is automatically marked as 'Blocked' on an error, and user-space must clear that flag. I suspect, using your model, the approach would be a 'dm-message' acknowledging that a given device is faulty. I would, of course, rather this be done via sysfs. I will try to look at factoring out the md specific part of md/raid5 over the next few days (maybe next week) and see how that looks. NeilBrown -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
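The resync-throttle behaviour Neil describes earlier in this message (a percentage rounded to the nearest reciprocal, compared against outstanding regular IO) can be modelled with a small sketch. The function names are invented and this is not the dm-raid45 code; it only illustrates why a light synchronous load would starve the resync:

```c
#include <assert.h>

/* A requested percentage b is effectively stored as a divider
 * n = 100 / b, so only 100%, 50%, 33%, 25%, 20%, ... are
 * representable - the "nearest reciprocal" rounding. */
static unsigned bandwidth_to_divider(unsigned percent)
{
    if (percent == 0 || percent > 100)
        percent = 100;          /* clamp nonsense input */
    return 100 / percent;
}

/* The throttle rule: while regular IO is outstanding, allow at most
 * regular_io / divider outstanding resync IOs. */
static int resync_allowed(unsigned regular_io, unsigned resync_io,
                          unsigned divider)
{
    if (regular_io == 0)
        return 1;               /* idle array: resync at full speed */
    return resync_io < regular_io / divider;
}
```

With a divider of 2 (50%) and a single synchronous writer keeping one or two IOs in flight, integer division leaves room for almost no concurrent resync IO, which is exactly the poor characteristic Neil predicts for light loads.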
On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote: > On Wednesday June 17, neilb@suse.de wrote: > > > > I will try to find time to review your dm-raid5 code with a view to > > understanding how it plugs in to dm, and then how the md/raid5 engine > > can be used by dm-raid5. Hi Neil. > > I've had a bit of a look through the dm-raid5 patches. Thanks. > > Some observations: > > - You have your own 'xor' code against which you do a run-time test of > the 'xor_block' code which md/raid5 uses - then choose the fastest. > This really should not be necessary. If you have xor code that runs > faster than anything in xor_block, it really would be best to submit > it for inclusion in the common xor code base. This is in because it actually shows better performance regularly by utilizing cache lines etc. more efficiently (tested on Intel, AMD and Sparc). If xor_block had always performed best, I'd have dropped that optimization already. > > - You have introduced "dm-message" which is a framework for sending > messages to dm targets. dm-message is a parser for messages sent to a device-mapper target's message interface. It aims at allowing common message parsing functions to be shared between targets. The message interface in general allows for runtime state changes of targets like those you found below. See "dmsetup message ..." > It is used for adjusting the cache size, > changing the bandwidth used by resync, and doing things with > statistics. Linux already has a framework for sending messages to > kernel objects. It is called "sysfs". It would be much nicer if > you could use that. Looks to me like it's heavier to program (I counted ~100 LOC of sysfs defines and calls in md.c) for 3 files used in md. That needs some more discussion. > > - I don't know what Christoph Hellwig was referring to when he > suggested that dm-raid1 was broken, but there is one issue with it > that concerns me and which is (as far as I can tell) broken in > dm-raid45 as well. 
> That issue is summarised in the thread > http://www.linux-archive.org/device-mapper-development/282987-dm-raid1-data-corruption.html Yes, major point being that dm aims to mostly do uspace metadata updates and as little kspace metadata access as possible. > > The problem is that when a drive fails you allow writes to continue > without waiting for the metadata to be updated. I.e. metadata > updates are asynchronous. > The problem scenario for raid5 is: > - a drive fails while a write is in process > - the drive is marked as faulty and user space is notified > - that write, and possibly several others, complete and that > status is potentially returned to user-space (e.g. for an fsync) > - user space acts on that status (e.g. sends an acknowledgement over > the network) > - the host crashes before the metadata update completes > - on reboot, the drive, which just had a transient failure, works > fine, and the metadata reports that all the drives are good, but > that the array was active so a parity-resync is needed. > - All parity blocks are recalculated. However that write we > mentioned above is only recorded in the parity block. So that > data is lost. > Yes, like in the dm raid1 scenario, blocking of writes would have to occur until metadata has been updated in order to detect the failing drive persistently, hence not updating parity (which contains data mandatory to preserve) but recalculating chunks on the transiently failing drive utilizing intact parity. > > Now if you get a crash while the array is degraded, data loss is a > real possibility. However silent data loss should not be tolerated. > > For this reason, metadata updates reporting failed devices really > need to be synchronous with respect to writes. > Agreed. > - Your algorithm for limiting bandwidth used by resync seems to be > based on a % concept (which is rounded to the nearest reciprocal, > so 100%, 50%, 33%, 25%, 20%, 16% etc). 
If there is outstanding > IO, the amount of outstanding resync IO must not exceed the given > percentage of regular IO. While it is hard to get resync "just > right", I think this will have some particularly poor > characteristics. My ANALYZEME pointed you there ? ;-) That's why it's in. With that, I admitted that I'm not satisfied with the characteristics yet. > Heavy regular IO load, or a non-existent regular load should be > handled OK, but I suspect that a light IO load (e.g. one or two > threads of synchronous requests) would cause the resync to go much > slower than you would want. I'll look into it further, taking your thoughts into account. > > Also it does not take account of IO to other partitions on a device. > It is common to have a number of different arrays on the same device > set. You really want the resync to back off based on IO to all of > the device, not just to the array. Does MD RAID > 0 cope with that in multi-partition RAID configs ? E.g.: disk1:part1+disk2:part1 -> md0 and disk1:part2+disk2:part2 -> md1 with global iostats for disk[12] ? > > > All of these are issues that are already addressed in md/raid5. > About the only thing that I saw which does not have a working analogue > in md/raid5 is cluster locking. However that is "not yet implemented" > in dm-raid5 so that isn't a real difference. Yes, it's a start for future cluster RAID5 usage. > > Everything else could be made available by md to dm with only a > moderate amount of effort. > Of the different interface routines you require: > + .ctr = raid_ctr, > > This would be quite straight forward, though it might be worth > discussing the details of the args passed in. > "recovery_bandwidth" (as I say above) I don't think is well chosen, I was aware it needs further tweaking (your points above taken). You mean the wording here ? > and I don't quite see the point of io_size (maybe I'm missing > something obvious) Performance related. 
I studied read ahead/chunk size/io size pairs and found examples of better performance with PAGE_SIZE < io_size <= chunk_size, io_size being 2^N. Do you have similar experiences to share with MD RAID ? > > + .dtr = raid_dtr, > > This is trivial. raid5.stop > > + .map = raid_map, > > This is very much like the md/raid5 make_request. > > + .presuspend = raid_presuspend, > > I call this 'quiesce' > > + .postsuspend = raid_postsuspend, > > I'm not sure exactly how this fits in. Something related to the > log... We use these distinct presuspend/postsuspend methods in order to allow for targets to distinguish flushing any io in flight in presuspend and release/quiesce/... any resource (e.g. quiesce the dirty log) utilized in the first step in postsuspend. If MD RAID doesn't need that, stick with presuspend. > > + .resume = raid_resume, > > ->quiesce requesting 'resume' rather than 'suspend' > > + .status = raid_status, > > very straight forward > > + .message = raid_message, > > As I said, I would rather this didn't exist. I think it's better to offer options to be utilized for different purposes. Tradeoff is always (some) overlap. We don't have one filesystem for all use cases ;-) > But it could be > emulated (except for the statistics) if needed - md can already > adjust resync speed and cache size. > > > What is missing is agreement on how to handle synchronous metadata > updates. md waits for the drive failure to be acknowledged. It is > automatically marked as 'Blocked' on an error, and user-space must > clear that flag. Yes, Mikulas already put thought into that. Let's allow him (and others) to join in. > > I suspect, using your model, the approach would be a 'dm-message' > acknowledging that a given device is faulty. I would, of course, > rather this be done via sysfs. > > I will try to look at factoring out the md specific part of md/raid5 > over the next few days (maybe next week) and see how that looks. 
That'll be interesting: to see how it fits into the device-mapper framework. I take it you'll port code to create an "md-raid456" target? Thanks a lot, Heinz > > NeilBrown -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Jun 19, 2009 at 3:33 AM, Heinz Mauelshagen<heinzm@redhat.com> wrote: > On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote: >> - You have your own 'xor' code against which you do a run-time test of >> the 'xor_block' code which md/raid5 uses - then choose the fastest. >> This really should not be necessary. If you have xor code that runs >> faster than anything in xor_block, it really would be best to submit >> it for inclusion in the common xor code base. > > This is in because it actually shows better performance regularly by > utilizing cache lines etc. more efficiently (tested on Intel, AMD and > Sparc). > > If xor_block had always performed best, I'd have dropped that > optimization already. Data please. Especially given the claim that it improves upon the existing per-architecture hand-tuned assembly routines. If it is faster, great! Please send patches to improve the existing code. >> >> - You have introduced "dm-message" which is a framework for sending >> messages to dm targets. > > dm-message is a parser for messages sent to a device-mapper target's > message interface. It aims at allowing common message parsing functions > to be shared between targets. > > The message interface in general allows for runtime state changes of > targets like those you found below. > > See "dmsetup message ..." > >> It is used for adjusting the cache size, >> changing the bandwidth used by resync, and doing things with >> statistics. Linux already has a framework for sending messages to >> kernel objects. It is called "sysfs". It would be much nicer if >> you could use that. > > Looks to me like it's heavier to program (I counted ~100 LOC of sysfs defines > and calls in md.c) for 3 files used in md. > > That needs some more discussion. Ironically the same argument can be applied to this module. 
~5000 LOC to re-implement raid5 seems much "heavier to program" than to write some dmraid ABI compatibility wrappers around the existing md-raid5 code. [..] >> The problem is that when a drive fails you allow writes to continue >> without waiting for the metadata to be updated. I.e. metadata >> updates are asynchronous. >> The problem scenario for raid5 is: >> - a drive fails while a write is in process >> - the drive is marked as faulty and user space is notified >> - that write, and possibly several others, complete and that >> status is potentially returned to user-space (e.g. for an fsync) >> - user space acts on that status (e.g. sends an acknowledgement over >> the network) >> - the host crashes before the metadata update completes >> - on reboot, the drive, which just had a transient failure, works >> fine, and the metadata reports that all the drives are good, but >> that the array was active so a parity-resync is needed. >> - All parity blocks are recalculated. However that write we >> mentioned above is only recorded in the parity block. So that >> data is lost. > > Yes, like in the dm raid1 scenario, blocking of writes would have to > occur until metadata has been updated in order to detect the failing > drive persistently, hence not updating parity (which contains data > mandatory to preserve) but recalculating chunks on the transiently > failing drive utilizing intact parity. Another problem with dm along these lines is the mechanism for notifying userspace, I am assuming this is still via uevents? The problem is that each notification requires a memory allocation which might fail under load, especially if the blocked array is in the dirty page write out path. The userspace notifications and failed device handling in md are designed to not require memory allocations. [..] >> Also it does not take account of IO to other partitions on a device. 
>> It is common to have a number of different arrays on the same device >> set. You really want the resync to back off based on IO to all of >> the device, not just to the array. > > Does MD RAID > 0 cope with that in multi-partition RAID configs ? > E.g.: > disk1:part1+disk2:part1 -> md0 and disk1:part2+disk2:part2 -> md1 with > global iostats for disk[12] ? By default it will hold off resync on other arrays when it notices that multiple arrays share disks. A sysfs knob allows parallel resync. [..] >> As I said, I would rather this didn't exist. > > I think it's better to offer options to be utilized for different > purposes. Tradeoff is always (some) overlap. > > We don't have one filesystem for all use cases ;-) That argument does not fly in this case. md is already suitable for the dmraid use case (i.e. kernel raid/userspace metadata). >> But it could be >> emulated (except for the statistics) if needed - md can already >> adjust resync speed and cache size. >> >> >> What is missing is agreement on how to handle synchronous metadata >> updates. md waits for the drive failure to be acknowledged. It is >> automatically marked as 'Blocked' on an error, and user-space must >> clear that flag. > > Yes, Mikulas already put thought into that. > Let's allow him (and others) to join in. > > >> >> I suspect, using your model, the approach would be a 'dm-message' >> acknowledging that a given device is faulty. I would, of course, >> rather this be done via sysfs. >> >> I will try to look at factoring out the md specific part of md/raid5 >> over the next few days (maybe next week) and see how that looks. > > That'll be interesting to see how it fits into the device-mapper > framework > Yes, kernel ABI compatibility is the better [1] way to go. I will also revive the ideas/patches I had along these lines. 
Regards, Dan [1]: ...than dm2md in userspace -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
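The run-time calibration being debated here (dm-raid45 benchmarks its own xor against xor_block and keeps the faster one) can be sketched in plain user-space C. Everything below is illustrative: portable stand-in kernels instead of the real hand-tuned routines, and a hot-cache timing loop of the kind Neil questions elsewhere in the thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Two interchangeable xor kernels, standing in for the contending
 * implementations; real candidates would be per-architecture
 * (SSE, cache-bypassing) routines. */
static void xor_bytewise(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

static void xor_wordwise(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    /* memcpy word loads avoid alignment assumptions */
    for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
        uint64_t d, s;
        memcpy(&d, dst + i, sizeof d);
        memcpy(&s, src + i, sizeof s);
        d ^= s;
        memcpy(dst + i, &d, sizeof d);
    }
    for (; i < n; i++)
        dst[i] ^= src[i];
}

typedef void (*xor_fn)(uint8_t *, const uint8_t *, size_t);

/* Run-time selection in the spirit of both dm-raid45's calibration and
 * crypto/xor.c: time each candidate over a (hot) buffer and report
 * which was faster. Returns 0 if 'a' won, 1 if 'b' won. */
static int pick_faster(xor_fn a, xor_fn b, uint8_t *buf, const uint8_t *src,
                       size_t n, int rounds)
{
    clock_t ta, tb;
    int r;

    ta = clock();
    for (r = 0; r < rounds; r++)
        a(buf, src, n);
    ta = clock() - ta;

    tb = clock();
    for (r = 0; r < rounds; r++)
        b(buf, src, n);
    tb = clock() - tb;

    return tb < ta;
}
```

Dan's "Data please" still applies: a hot-cache winner in a loop like this says little about the cold-cache stripe writes the target actually performs, which is Neil's later point about the xor_speed test.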
On Friday June 19, heinzm@redhat.com wrote: > On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote: > > On Wednesday June 17, neilb@suse.de wrote: > > > > > > I will try to find time to review your dm-raid5 code with a view to > > > understanding how it plugs in to dm, and then how the md/raid5 engine > > > can be used by dm-raid5. > > Hi Neil. > > > > > I've had a bit of a look through the dm-raid5 patches. > > Thanks. > > > > > Some observations: > > > > - You have your own 'xor' code against which you do a run-time test of > > the 'xor_block' code which md/raid5 uses - then choose the fastest. > > This really should not be necessary. If you have xor code that runs > > faster than anything in xor_block, it really would be best to submit > > it for inclusion in the common xor code base. > > This is in because it actually shows better performance regularly by > utilizing cache lines etc. more efficiently (tested on Intel, AMD and > Sparc). > > If xor_block had always performed best, I'd have dropped that > optimization already. I'm no expert on this, I must admit - I just use the xor code that others have provided. However it is my understanding that some of the xor code options were chosen deliberately because they bypassed the cache. As the xor operation operates on data that very probably was not in the cache, and only touches it once, and touches quite a lot of data, there is little benefit in having it in the cache, and a possible real cost as it pushes other data out of the cache. That said, the "write" process involves copying data from a buffer into the cache, and then using that data as a source for xor. In that case the cache might be beneficial, I'm not sure. I've occasionally wondered if it might be good to have an xor routine that does "copy and xor". This would use more registers than the current code so we could possibly process fewer blocks at a time, but we might be processing them faster. 
In that case, if we could bypass the cache to read the data, but use the cache to store the result of the xor, that would be ideal. However I have no idea if that is possible. The mechanism you use, which is much the same as the one used by xor_block, calculates speed for the hot-cache case. It is not clear that this is often a relevant case - mostly xor is calculated when the cache is cold. So it isn't clear if it is generally applicable. That is why calibrate_xor_blocks sometimes skips the xor_speed test (in crypto/xor.c). However if your code performs better than the cache-agnostic code in xor_block, then we should replace that code with your code, rather than have two places that try to choose the optimal code. And I would be very interested to see a discussion about how relevant the cache bypassing is in real life... It looks like it was Zach Brown who introduced this about 10 years ago, presumably in consultation with Ingo Molnar. > > > > > > - You have introduced "dm-message" which is a framework for sending > > messages to dm targets. > > dm-message is a parser for messages sent to a device-mapper target's > message interface. It aims at allowing common message parsing functions > to be shared between targets. > > The message interface in general allows for runtime state changes of > targets like those you found below. > > See "dmsetup message ..." > > > It is used for adjusting the cache size, > > changing the bandwidth used by resync, and doing things with > > statistics. Linux already has a framework for sending messages to > > kernel objects. It is called "sysfs". It would be much nicer if > > you could use that. > > Looks to me like it's heavier to program (I counted ~100 LOC of sysfs defines > and calls in md.c) for 3 files used in md. > > That needs some more discussion. In dm-raid45 you have 170 lines from "Message interface" to "END message interface", which is 4 files I think. Am I counting the same things? 
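[The "copy and xor" routine Neil speculates about above could look something like the sketch below. The name and signature are purely illustrative - nothing like this exists in crypto/xor.c - and a real kernel version would use wide, possibly cache-bypassing loads rather than a byte loop:]

```c
#include <stddef.h>

/* Hypothetical "copy and xor" primitive: read each source byte once,
 * keep a copy of it, and xor it into the destination in the same pass.
 * The single read is the point - the source is probably cache-cold,
 * so touching it only once avoids polluting the cache twice. */
static void copy_and_xor(unsigned char *dst, unsigned char *copy,
			 const unsigned char *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned char c = src[i];	/* single read of the source */
		copy[i] = c;			/* the "copy" half */
		dst[i] ^= c;			/* the "xor" half */
	}
}
```

A wider version (64-bit or SIMD loads, several blocks per pass) would need more registers, which is exactly the trade-off Neil mentions.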
In any case, it is certainly possible that sysfs usage could be improved. We recently added strict_blocks_to_sectors() which extracted a common parsing task. Once upon a time, we open coded things with simple_strtoull and repeated the error checking every time. There is always room for improvement. But I think the uniformity across the kernel is the most important issue here, not the lines of code. There is a bit of tension here: your dm-message approach is similar in style to the way targets are created. The sysfs approach is similar in style to many other parts of the kernel. So should your new mechanism be consistent with dm or consistent with linux - it cannot easily be both. This is really a question that the dm developers as a group need to consider and answer. I would like to recommend that transitioning towards more uniformity with the rest of linux is a good idea, and here is excellent opportunity to start. But it isn't my decision. > > - Your algorithm for limiting bandwidth used by resync seems to be > > based on a % concept (which is rounded to the nearest reciprocal, > > so 100%, 50%, 33%, 25%, 20%, 16% etc). If there is outstanding > > IO, the amount of outstanding resync IO must not exceed the given > > percentage of regular IO. While it is hard to get resync "just > > right", I think this will have some particularly poor > > characteristics. > > My ANALYZEME pointed you there ? ;-) > That's why it's in. I admitted with that that I'm not satisfied with the > characteristics yet. Fair enough. My concern then is that the mechanism for guiding the speed - which is not yet finalised - is being encoded in the command for creating the array. That would make it difficult to change at a later date. I would suggest not specifying anything when creating the array and insisting that a default be chosen. Then you can experiment with mechanism using different dm-message or sysfs-files with some sort of warning that these are not part of the api yet. 
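[The "common parsing task" helpers like strict_blocks_to_sectors() factor out can be sketched in user space as below. The function name here is made up for illustration; actual kernel code would use kstrtoull()/strict_strtoull() rather than raw strtoull():]

```c
#include <errno.h>
#include <stdlib.h>

/* User-space sketch of strict numeric parsing for a sysfs-style store
 * method: the whole string must be a number - no trailing junk, no
 * sign, no overflow.  This is the error checking that used to be
 * open coded around simple_strtoull in every attribute handler.
 * Returns 0 on success, -EINVAL or -ERANGE on failure. */
static int strict_parse_ull(const char *s, unsigned long long *val)
{
	char *end;
	unsigned long long v;

	if (*s == '\0' || *s == '-')
		return -EINVAL;
	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	if (*end != '\0')
		return -EINVAL;
	*val = v;
	return 0;
}
```

Centralising this once is what keeps each individual sysfs attribute short, which bears on the LOC comparison above.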
> > > > Also it does not take account of IO to other partitions on a device. > > It is common to have a number of different arrays on the same device > > set. You really want to resync to back off based on IO to all of > > the device, not just to the array. > > Does MD RAID > 0 cope with that in multi-partition RAID configs ? > E.g.: > disk1:part1+disk2:part1 -> md0 and disk1:part2+disk2:part2 -> md1 with > global iostats for disk[12] ? Yes. 'gendisk' has 'sync_io' specifically for this purpose. When md submits IO for the purpose of resync (or recovery) it adds a count of the sectors to this counter. That can be compared with the regular io statistics for the genhd to see if there is an non-sync IO happening. If there is the genhd is assumed to be busy and we backoff the resync. This is not a perfect solution as it only goes one level down. If we had raid1 over raid0 over partitions, then we might not notice resync IO from elsewhere properly. But I don't think I've ever had a complaint about that. > > > > Everything else could be made available by md to dm with only a > > moderate amount of effort. > > Of the different interface routines you require: > > + .ctr = raid_ctr, > > > > This would be quite straight forward, though it might be worth > > discussing the details of the args passed in. > > "recovery_bandwidth" (as I say above) I don't think is well chosen, > > I was aware it needs further tweaking (your points above taken). > > You mean the wording here ? The thing that I don't think was well chosen was the choice of a percentage to guide the speed of resync. > > > and I don't quite see the point of io_size (maybe I'm missing > > something obvious) > > Performance related. I studied read ahead/chunk size/io size pairs and > found examples of better performance with io_size > PAGE_SIZE <= > chunk_size. io_size being 2^N. > > Do you have similar experiences to share with MD RAID ? 
My position has always been that it is up to lower layers to combine pages together. md/raid5 does all IO in pages. If it is more efficient for some device to do multiple pages at a time, it should combine the pages. The normal plugging mechanisms combined with the elevator should be enough for this. We have put more focus into gathering write-pages into whole strips so that no pre-reading is needed - we can just calculate parity and write. This has a significant effect on throughput. (A 'strip' is like a 'stripe' but it is one page wide, not one chunk wide). > > + .presuspend = raid_presuspend, > > > > I call this 'quiesce' > > > > + .postsuspend = raid_postsuspend, > > > > I'm not sure exactly how this fits in. Something related to the > > log... > > We use these distinct presuspend/postsuspend methods in order to allow > for targets to distinguish flushing any io in flight in presuspend and > release/quiesce/... any resource (e.g. quiesce the dirty log) utilized > in the first step in postsuspend. > > If MD RAID doesn't need that, stick with presuspend. OK, thanks. > > + .message = raid_message, > > > > As I said, I would rather this didn't exist. > > I think it's better to offer options to be utilized for different > purposes. Tradeoff is always (some) overlap. > > We don't have one filesystem for all use cases ;-) > No... I find myself wearing several hats here. As a developer, I think it is great that you are writing your own raid5 implementation. Doing stuff like that is lots of fun and a valuable learning experience. As I read through the code there were a number of times when I thought "that looks rather neat". Quite possibly there are ideas in dm-raid5 that I could "steal" to make md/raid5 better. dm-message certainly has its merits too (as I said, it fits the dm model). As a member of the linux kernel community, I want to see linux grow with a sense of unity. Where possible, similar tasks should be addressed in similar ways. 
This avoids duplicating mistakes, eases maintenance, and makes it easier for newcomers to learn. Establishing patterns and following them is a good thing. Hence my desire to encourage the use of sysfs and a single xor implementation. As an employee of a company that sells support, I really don't want a profusion of different implementations of something that I am seen as an expert in, as I know I'll end up having to support all of them. So for very selfish reasons, I'd prefer fewer versions of NFS, fewer file-systems, and fewer raid5 implementations :-) Well... fewer that we contract to support anyway. > > > > I suspect, using your model, the approach would be a 'dm-message' > > acknowledging that a given device is faulty. I would, of course, > > rather this be done via sysfs. > > > > I will try to look at factoring out the md specific part of md/raid5 > > over the next few days (maybe next week) and see how that looks. > > That'll be interesting to see how it fits into the device-mapper > framework > > I take it you'll port code to create an "md-raid456" target ? That's the idea. I'll let you know when I have something credible. Thanks, NeilBrown -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
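[The 'sync_io' accounting Neil describes earlier in this message can be modelled in a few lines. This is a simplified user-space sketch of the idea behind is_mddev_idle() in drivers/md/md.c; the real code keeps per-device event snapshots and tolerates a small amount of drift rather than comparing against zero:]

```c
/* Rough model of md's resync backoff: resync IO issued by md itself
 * is counted into sync_sectors (gendisk->sync_io in the kernel).
 * Subtracting it from the disk's total IO leaves the "foreign" IO.
 * If that grows between checks, someone else is using the disk and
 * resync should back off. */
struct disk_model {
	long total_sectors;	/* all IO issued to the gendisk */
	long sync_sectors;	/* IO md itself issued for resync */
	long last_events;	/* snapshot of (total - sync) */
};

/* Returns 1 if the disk looks idle (resync may run at full speed). */
static int disk_is_idle(struct disk_model *d)
{
	long events = d->total_sectors - d->sync_sectors;
	int idle = (events - d->last_events) <= 0;

	d->last_events = events;
	return idle;
}
```

Because the counters live on the gendisk, this catches IO to other partitions or other arrays on the same disk - the property Neil points out dm-raid45's per-array percentage cannot provide.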
On Sun, 2009-06-21 at 22:06 +1000, Neil Brown wrote: > On Friday June 19, heinzm@redhat.com wrote: > > On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote: > > > On Wednesday June 17, neilb@suse.de wrote: > > > > > > > > I will try to find time to review your dm-raid5 code with a view to > > > > understanding how it plugs in to dm, and then how the md/raid5 engine > > > > can be used by dm-raid5. > > > > Hi Neil. > > > > > > > > I've had a bit of a look through the dm-raid5 patches. > > > > Thanks. > > > > > > > > Some observations: > > > > > > - You have your own 'xor' code against which you do a run-time test of > > > the 'xor_block' code which md/raid5 uses - then choose the fasted. > > > This really should not be necessary. If you have xor code that runs > > > faster than anything in xor_block, it really would be best to submit > > > it for inclusion in the common xor code base. > > > > This is in because it actually shows better performance regularly by > > utilizing cache lines etc. more efficiently (tested on Intel, AMD and > > Sparc). > > > > If xor_block would always have been performed best, I'd dropped that > > optimization already. > > I'm no expert on this I must admit - I just use the xor code that > others have provided. > However it is my understanding that some of the xor code options were > chosen deliberately because they bypassed the cache. As the xor > operation operates on data that very probably was not in the cache, > and only touches it once, and touches quite a lot of data, there is > little benefit having in the cache, and a possible real cost as it > pushes other data out of the cache. > > That said, the "write" process involves copying data from a buffer > into the cache, and then using that data as a source for xor. In that > case the cache might be beneficial, I'm not sure. > I've occasionally wondered if it might be good to have an xor routine > that does "copy and xor". 
This would use more registers than the > current code so we could possibly process fewer blocks at a time, but > we might be processing them faster. In that case, if we could bypass > the cache to read the data, but use the cache to store the result of > the xor, that would be ideal. However I have no idea if that is > possible. > > The mechanism you use, which is much the same as the one used by > xor_block, calculates speed for the hot-cache case. It is not clear > that this is often a relevant cache - mostly xor is calculated when > the cache is cold. So it isn't clear if it is generally applicable. > That is why calibrate_xor_blocks sometimes skips the xor_speed test > (in crypto/xor.c). I can actually perform cold cache runs likely with a little code change which should allow us to empirically find out, if my optimization is worth it. Not sure though, if I'll be able to do this before heading out to LinuxTag for the rest of this week. > > However if your code performs better than the cache-agnostic code in > xor_block, then we should replace that code with your code, rather > than have two places that try to choose the optimal code. Sure, we'll see after some more investigation. For the time being, it can stay in the raid45 target until we either come to the conclusion that it should get moved underneath crypto's generic API or get dropped altogether. > > And I would be very interested to see a discussion about how relevant > the cache bypassing is in real life... It looks like it was Zach > Brown who introduced this about 10 years ago, presumably in > consultation with Ingo Molnar. > > > > > > > > > - You have introduced "dm-message" which is a frame work for sending > > > messages to dm targets. > > > > dm-message is a parser for messages sent to a device-mapper targets > > message interface. It aims at allowing common message parsing functions > > to be shared between targets. 
> > > > The message interface in general allows for runtime state changes of > > targets like those you found below. > > > > See "dmsetup message ..." > > > > > It is used for adjusting the cache size, > > > changing the bandwidth used by resync, and doing things with > > > statistics. Linux already has a framework for sending messages to > > > kernel objects. It is called "sysfs". It would be much nicer if > > > you could use that. > > > > Looks to me like heavier to program (I counted ~100 LOC of sysfs defines > > and calls in md.c) for 3 files used in md. > > > > That needs some more discussion. > > In dm-raid45 you have 170 lines from "Message interface" to "END > message interface", which is 4 files I think. Am I counting the same > things? Yes. > > In any case, it is certainly possible that sysfs usage could be > improved. We recently added strict_blocks_to_sectors() which extracted > a common parsing task. Once upon a time, we open coded things with > simple_strtoull and repeated the error checking every time. > There is always room for improvement. > > But I think the uniformity across the kernel is the most important > issue here, not the lines of code. Well, it should be balanced between this possibly exclusive extremes. Nobody wants to add huge amounts of codes to maintain just to allow for uniformity without enhancing shared code upfront. > > There is a bit of tension here: your dm-message approach is similar > in style to the way targets are created. The sysfs approach is > similar in style to many other parts of the kernel. So should your > new mechanism be consistent with dm or consistent with linux - it > cannot easily be both. > > This is really a question that the dm developers as a group need to > consider and answer. I would like to recommend that transitioning > towards more uniformity with the rest of linux is a good idea, and > here is excellent opportunity to start. But it isn't my decision. 
We're meeting in Berlin at LinuxTag so this is going to be an item there... > > > - Your algorithm for limiting bandwidth used by resync seems to be > > > based on a % concept (which is rounded to the nearest reciprocal, > > > so 100%, 50%, 33%, 25%, 20%, 16% etc). If there is outstanding > > > IO, the amount of outstanding resync IO must not exceed the given > > > percentage of regular IO. While it is hard to get resync "just > > > right", I think this will have some particularly poor > > > characteristics. > > > > My ANALYZEME pointed you there ? ;-) > > That's why it's in. I admitted with that that I'm not satisfied with the > > characteristics yet. > > Fair enough. > My concern then is that the mechanism for guiding the speed - which is > not yet finalised - is being encoded in the command for creating the > array. That would make it difficult to change at a later date. > I would suggest not specifying anything when creating the array and > insisting that a default be chosen. It is not mandatory to specify it already, and a default will be set. I'm not aware of any application utilizing it yet. > Then you can experiment with mechanism using different dm-message or > sysfs-files with some sort of warning that these are not part of the > api yet. I got that freedom already with the tiny flaw of an optional parameter being supported. If this is perceived to be bad, I can surely drop it. > > > > > > > Also it does not take account of IO to other partitions on a device. > > > It is common to have a number of different arrays on the same device > > > set. You really want to resync to back off based on IO to all of > > > the device, not just to the array. > > > > Does MD RAID > 0 cope with that in multi-partition RAID configs ? > > E.g.: > > disk1:part1+disk2:part1 -> md0 and disk1:part2+disk2:part2 -> md1 with > > global iostats for disk[12] ? > > Yes. 'gendisk' has 'sync_io' specifically for this purpose. 
> When md submits IO for the purpose of resync (or recovery) it adds a > count of the sectors to this counter. That can be compared with the > regular io statistics for the genhd to see if there is an non-sync IO > happening. If there is the genhd is assumed to be busy and we backoff > the resync. > > This is not a perfect solution as it only goes one level down. > If we had raid1 over raid0 over partitions, then we might not notice > resync IO from elsewhere properly. But I don't think I've ever had a > complaint about that. Thanks for the explanation. > > > > > > > Everything else could be made available by md to dm with only a > > > moderate amount of effort. > > > Of the different interface routines you require: > > > + .ctr = raid_ctr, > > > > > > This would be quite straight forward, though it might be worth > > > discussing the details of the args passed in. > > > "recovery_bandwidth" (as I say above) I don't think is well chosen, > > > > I was aware it needs further tweaking (your points above taken). > > > > You mean the wording here ? > > The thing that I don't think was well chosen was the choice of a > percentage to guide the speed of resync. I actually like percent of total bandwidth, because it allows for maximum throughput when there's no application IO while limiting to the respective and adjustable bandwidth percentage during application IO. If the set is moved to other HW with different bandwidth, this still holds up, whereas absolute throughput changes the ratio. Of course, we could support both absolute and relative. > > > > > > and I don't quite see the point of io_size (maybe I'm missing > > > something obvious) > > > > Performance related. I studied read ahead/chunk size/io size pairs and > > found examples of better performance with io_size > PAGE_SIZE <= > > chunk_size. io_size being 2^N. > > > > Do you have similar experiences to share with MD RAID ? > > My position has always been that it is up to lower layers to combine > pages together. 
md/raid5 does all IO in pages. If it is more > efficient for some device to do multiple pages as a time, it should > combine the pages. I can share that standpoint but had differing tests showing otherwise. > > The normal plugging mechanisms combined with the elevator should be > enough for this. Will make a series of tests when I get to this in order to finally decide how to go about it. > > We have put more focus in to gathering write-pages into whole strips > so that no pre-reading is needed - we can just calculate parity and > write. This has a significant effect on throughput. Yes, the raid45 target does the same. It's a huge win to avoid the read before write penalty altogether, when a stripe gets completely written over. > (A 'strip' is like a 'stripe' but it is one page wide, not one chunk > wide). > > > > + .presuspend = raid_presuspend, > > > > > > I call this 'quiesce' > > > > > > + .postsuspend = raid_postsuspend, > > > > > > I'm not sure exactly how this fits in. Something related to the > > > log... > > > > We use these distinct presuspend/postsuspend methods in order to allow > > for targets to distinguish flushing any io in flight in presuspend and > > release/quiesce/... any resource (e.g. quiesce the dirty log) utilized > > in the first step in postsuspend. > > > > If MD RAID doesn't need that, stick with presuspend. > > OK, thanks. > > > > + .message = raid_message, > > > > > > As I said, I would rather this didn't exist. > > > > I think its better to offer options to be utilized for different > > purposes. Tradeoff is always (some) overlap. > > > > We don't have one filesystem for all use cases ;-) > > > > No... I find myself wearing several hats here. > > As a developer, I think it is great that you are writing your own > raid5 implementation. Doing stuff like that is lots of fun and a > valuable learning experience. As I read though the code there were a > number of time when I thought "that looks rather neat". 
Thanks :-) > Quite > possibly there are ideas in dm-raid5 that I could "steal" to make > md/raid5 better. I wouldn't bet you won't do that ;-) > dm-message certainly has its merits too (as I said, it fits the dm model). > > As a member of the linux kernel community, I want to see linux grow > with a sense of unity. Where possible, similar tasks should be > addressed in similar ways. This avoids duplicating mistakes, eases > maintenance, and makes it easier for newcomers to learn. Establishing > patterns and following them is a good thing. Hence my desire to > encourage the use of sysfs and a single xor implementation. Understandable/sharable standpoint. Takes us back to the dm team discussion mentioned above. > > As an employee of a company that sells support, I really don't want a > profusion of different implementations of something that I am seen as > an expert in, as I know I'll end up having to support all of them. > So for very selfish reasons, I'd prefer fewer versions of NFS, fewer > file-systems, and fewer raid5 implementations :-) > Well... fewer that we contract to support anyway. > We might end up with one... But we can take advantage of impulses from different implementations to build better solutions, which won't happen if there's just one. > > > > > > I suspect, using your model, the approach would be a 'dm-message' > > > acknowledging that a given device is faulty. I would, of course, > > > rather this be done via sysfs. > > > > > > I will try to look at factoring out the md specific part of md/raid5 > > > over the next few days (maybe next week) and see how that looks. > > > > That'll be interesting to see how it fits into the device-mapper > > framework > > > > I take it you'll port code to create an "md-raid456" target ? > > That's the idea. I'll let you know when I have something credible. OK, we were thinking alike. Regards, Heinz > > Thanks, > NeilBrown -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
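[For reference, the reciprocal rounding Neil objected to falls out of storing the requested percentage as an integer divider. A minimal sketch, assuming a 1-100 input range; the function is illustrative only and is not literal dm-raid45 code:]

```c
/* If a requested resync percentage p is stored internally as the
 * integer divider 100/p, the only achievable settings are the
 * reciprocals of integers: 100%, 50%, 33%, 25%, 20%, 16%, ... */
static int effective_percent(int requested)
{
	int divider;

	if (requested < 1)
		requested = 1;
	if (requested > 100)
		requested = 100;
	divider = 100 / requested;	/* truncates: e.g. 40% -> 2 */
	return 100 / divider;		/* 2 -> 50%: snapped to a reciprocal */
}
```

So a user asking for 40% actually gets 50%, and everything between 34% and 50% collapses to the same setting - one of the "particularly poor characteristics" under discussion.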
========================
1/5: dm raid45 target: export region hash functions and add a needed one
2/5: dm raid45 target: introduce memory cache
3/5: dm raid45 target: introduce message parser
4/5: dm raid45 target: export dm_disk from dm.c
5/5: dm raid45 target: introduce the raid45 target
6/5: dm raid45 target: add raid45 modules to Makefile and adjust Kconfig

 include/linux/dm-region-hash.h |    4 ++++
 drivers/md/dm-region-hash.c    |   21 +++++++++++++++++--
 drivers/md/dm-memcache.h       |   68 ++++++++++++++++++++++
 drivers/md/dm-memcache.c       |  301 ++++++++++++++++++++++
 drivers/md/dm-message.h        |   91 ++++++++++++++++++++++++
 drivers/md/dm-message.c        |  183 ++++++++++++++++++++++++
 drivers/md/dm.c                |    1 +
 drivers/md/dm-raid45.h         |   28 ++++++++++++++++++++++++++
 drivers/md/dm-raid45.c         | 4580 ++++++++++++++++++++++++++
 drivers/md/Makefile            |    3 +++
 drivers/md/Kconfig             |    9 +++++++++
 11 files changed, 5286 insertions(+), 3 deletions(-)

 .../{dm-region-hash.h.orig => dm-region-hash.h}    |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/dm-region-hash.h.orig b/include/linux/dm-region-hash.h
index a9e652a..bfd21cb 100644
--- a/include/linux/dm-region-hash.h.orig
+++ b/include/linux/dm-region-hash.h
@@ -49,6 +49,7 @@ struct dm_dirty_log *dm_rh_dirty_log(struct dm_region_hash *rh);
  */
 region_t dm_rh_bio_to_region(struct dm_region_hash *rh, struct bio *bio);
 sector_t dm_rh_region_to_sector(struct dm_region_hash *rh, region_t region);
+region_t dm_rh_sector_to_region(struct dm_region_hash *rh, sector_t sector);
 void *dm_rh_region_context(struct dm_region *reg);
 
 /*
@@ -72,11 +73,14 @@ void dm_rh_update_states(struct dm_region_hash *rh, int errors_handled);
 int dm_rh_flush(struct dm_region_hash *rh);
 
 /* Inc/dec pending count on regions. */
+void dm_rh_inc(struct dm_region_hash *rh, region_t region);
 void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios);
 void dm_rh_dec(struct dm_region_hash *rh, region_t region);
 
 /* Delay bios on regions. */
 void dm_rh_delay(struct dm_region_hash *rh, struct bio *bio);
+void dm_rh_delay_by_region(struct dm_region_hash *rh, struct bio *bio,
+			   region_t region);
 
 void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio,
		       unsigned done, int error);

 .../md/{dm-region-hash.c.orig => dm-region-hash.c} |   21 +++++++++++++++++--
 1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-region-hash.c.orig b/drivers/md/dm-region-hash.c
index 7b899be..47b088b 100644
--- a/drivers/md/dm-region-hash.c.orig
+++ b/drivers/md/dm-region-hash.c
@@ -107,10 +107,11 @@ struct dm_region {
 /*
  * Conversion fns
  */
-static region_t dm_rh_sector_to_region(struct dm_region_hash *rh, sector_t sector)
+region_t dm_rh_sector_to_region(struct dm_region_hash *rh, sector_t sector)
 {
	return sector >> rh->region_shift;
 }
+EXPORT_SYMBOL_GPL(dm_rh_sector_to_region);
 
 sector_t dm_rh_region_to_sector(struct dm_region_hash *rh, region_t region)
 {
@@ -488,7 +489,7 @@ void dm_rh_update_states(struct dm_region_hash *rh, int errors_handled)
 }
 EXPORT_SYMBOL_GPL(dm_rh_update_states);
 
-static void rh_inc(struct dm_region_hash *rh, region_t region)
+void dm_rh_inc(struct dm_region_hash *rh, region_t region)
 {
	struct dm_region *reg;
 
@@ -510,13 +511,14 @@ static void rh_inc(struct dm_region_hash *rh, region_t region)
	read_unlock(&rh->hash_lock);
 }
+EXPORT_SYMBOL_GPL(dm_rh_inc);
 
 void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios)
 {
	struct bio *bio;
 
	for (bio = bios->head; bio; bio = bio->bi_next)
-		rh_inc(rh, dm_rh_bio_to_region(rh, bio));
+		dm_rh_inc(rh, dm_rh_bio_to_region(rh, bio));
 }
 EXPORT_SYMBOL_GPL(dm_rh_inc_pending);
 
@@ -677,6 +679,19 @@ void dm_rh_delay(struct dm_region_hash *rh, struct bio *bio)
 }
 EXPORT_SYMBOL_GPL(dm_rh_delay);
 
+void dm_rh_delay_by_region(struct dm_region_hash *rh,
+			   struct bio *bio, region_t region)
+{
+	struct dm_region *reg;
+
+	/* FIXME: locking. */
+	read_lock(&rh->hash_lock);
+	reg = __rh_find(rh, region);
+	bio_list_add(&reg->delayed_bios, bio);
+	read_unlock(&rh->hash_lock);
+}
+EXPORT_SYMBOL_GPL(dm_rh_delay_by_region);
+
 void dm_rh_stop_recovery(struct dm_region_hash *rh)
 {
	int i;