
[1/6] dm raid45 target: export region hash functions and add a needed one

Message ID 4a368301.KBrCZcokAm87BF/S%heinzm@redhat.com (mailing list archive)
State Superseded, archived
Delegated to: Jonathan Brassow

Commit Message

Heinz Mauelshagen June 15, 2009, 5:21 p.m. UTC
Hi,

this patch series introduces the raid45 target.
Please include upstream.

The raid45 target supports RAID4 mappings with a dedicated and selectable
parity device and RAID5 mappings with rotating parity (left/right (a)symmetric).
Initial and partial resynchronization functionality utilizes the dm
dirty log module.

The stripe cache, which the target uses to update or reconstruct
parity, is based on memory objects managed in a separate, sharable
dm-memcache module, which allows allocation of such objects with
N chunks of page lists of M pages each.
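
For illustration only, the object layout amounts to roughly this
(hypothetical types, not the actual dm-memcache interface):

/* Hypothetical sketch: one cache object holds N chunks, each chunk
 * being a page list of M pages, as described above. */
struct memcache_chunk {
	struct page_list *pages;	/* list of M pages */
};

struct memcache_object {
	unsigned chunks;		/* N */
	unsigned pages_per_chunk;	/* M */
	struct memcache_chunk chunk[];	/* N chunk descriptors follow */
};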

Message interface parsing is performed in the separate, sharable
dm-message module.

Heinz


Summary of the patch-set


Comments

Christoph Hellwig June 16, 2009, 2:09 p.m. UTC | #1
On Mon, Jun 15, 2009 at 07:21:05PM +0200, heinzm@redhat.com wrote:
> Hi,
> 
> this patch series introduces the raid45 target.
> Please include upstream.

NACK, no need for another raid target, especially if it happens to be
as broken as the raid1 one.  Please help on making the md raid4/5 code
more useful for your purposes (whatever that may be).

Heinz Mauelshagen June 16, 2009, 2:51 p.m. UTC | #2
On Tue, 2009-06-16 at 10:09 -0400, Christoph Hellwig wrote:
> On Mon, Jun 15, 2009 at 07:21:05PM +0200, heinzm@redhat.com wrote:
> > Hi,
> > 
> > this patch series introduces the raid45 target.
> > Please include upstream.
> 
> NACK, no need for another raid target, especially if it happens to be
> as broken as the raid1 one.

You didn't review it, it seems?

What are your specific pointers to any raid1 code issues?
Please point to the particular code of concern.

Mind you, we've got clustered mirroring with it, which isn't
supported in MD.

>   Please help on making the md raid4/5 code
> more useful for your purposes (whatever that may be).

Supporting various ATARAID raid5 mappings via dmraid long before md got
external metadata support added.

Heinz

Dan Williams June 16, 2009, 5:55 p.m. UTC | #3
On Tue, Jun 16, 2009 at 7:51 AM, Heinz Mauelshagen<heinzm@redhat.com> wrote:
> On Tue, 2009-06-16 at 10:09 -0400, Christoph Hellwig wrote:
>> On Mon, Jun 15, 2009 at 07:21:05PM +0200, heinzm@redhat.com wrote:
>> > Hi,
>> >
>> > this patch series introduces the raid45 target.
>> > Please include upstream.
>>
>> NACK, no need for another raid target, especially if it happens to be
>> as broken as the raid1 one.
>
> You didn't review it it seems ?

Will you be allowing time for review?  This seems a bit late for
2.6.31, as your "Please include upstream" implies.

>>   Please help on making the md raid4/5 code
>> more useful for your purposes (whatever that may be).
>
> Supporting various ATARAID raid5 mappings via dmraid long before md got
> external metadata suport added.

This is an opportunity to investigate what technical hurdles remain to
allow dmraid to use md infrastructure [1] (or as Neil proposed [2] a
new unified virtual block device infrastructure) for activating
external metadata raid5 arrays.  The libata/ide situation shows that
there is precedent for duplicate functionality in the kernel, but my
personal opinion is that we should exhaust the code reuse discussion
before heading down that path.

Regards,
Dan

[1] http://marc.info/?l=linux-raid&m=123300614013042&w=2
[2] http://marc.info/?l=linux-fsdevel&m=124450400028660&w=2

Heinz Mauelshagen June 16, 2009, 7:11 p.m. UTC | #4
On Tue, 2009-06-16 at 10:55 -0700, Dan Williams wrote:
> On Tue, Jun 16, 2009 at 7:51 AM, Heinz Mauelshagen<heinzm@redhat.com> wrote:
> > On Tue, 2009-06-16 at 10:09 -0400, Christoph Hellwig wrote:
> >> On Mon, Jun 15, 2009 at 07:21:05PM +0200, heinzm@redhat.com wrote:
> >> > Hi,
> >> >
> >> > this patch series introduces the raid45 target.
> >> > Please include upstream.
> >>
> >> NACK, no need for another raid target, especially if it happens to be
> >> as broken as the raid1 one.
> >
> > You didn't review it it seems ?
> 
> Will you be allowing time for review?  This seems a bit late for
> 2.6.31, as your "Please include upstream" implies.

Your help is appreciated, thanks.

As you surely know, your colleagues (in particular in Poland) are
already familiar with these patches, which should help speed up review.

> 
> >>   Please help on making the md raid4/5 code
> >> more useful for your purposes (whatever that may be).
> >
> > Supporting various ATARAID raid5 mappings via dmraid long before md got
> > external metadata suport added.
> 
> This is an opportunity to investigate what technical hurdles remain to
> allow dmraid to use md infrastructure [1] (or as Neil proposed [2] a
> new unified virtual block device infrastructure) for activating
> external metadata raid5 arrays.  The libata/ide situation shows that
> there is precedence for duplicate functionality in the kernel, but my
> personal opinion is that we exhaust the code reuse discussion before
> heading down that path.

dmraid supports 11 different metadata formats, covering various
vendor-specific ATARAID solutions plus DDF1:

asr     : Adaptec HostRAID ASR (0,1,10)
ddf1    : SNIA DDF1 (0,1,4,5,linear)
hpt37x  : Highpoint HPT37X (S,0,1,10,01)
hpt45x  : Highpoint HPT45X (S,0,1,10)
isw     : Intel Software RAID (0,1,5,01)
jmicron : JMicron ATARAID (S,0,1)
lsi     : LSI Logic MegaRAID (0,1,10)
nvidia  : NVidia RAID (S,0,1,10,5)
pdc     : Promise FastTrack (S,0,1,10)
sil     : Silicon Image(tm) Medley(tm) (0,1,10)
via     : VIA Software RAID (S,0,1,10)

9 of those still need porting/testing/... to/with MD external metadata
support after the Intel IMSM MD port has settled, hence Red Hat is going
to continue to support dmraid.

That being said: once the future work on a unified virtual block device
infrastructure is production-ready, we're open to using it.

Regards,
Heinz

> 
> Regards,
> Dan
> 
> [1] http://marc.info/?l=linux-raid&m=123300614013042&w=2
> [2] http://marc.info/?l=linux-fsdevel&m=124450400028660&w=2

Dan Williams June 16, 2009, 7:48 p.m. UTC | #5
On Tue, Jun 16, 2009 at 12:11 PM, Heinz Mauelshagen<heinzm@redhat.com> wrote:
> dmraid supports 11 different metadata formats for various ATARAID vendor
> specific solutions plus DDF1:
>
> asr     : Adaptec HostRAID ASR (0,1,10)
> ddf1    : SNIA DDF1 (0,1,4,5,linear)
> hpt37x  : Highpoint HPT37X (S,0,1,10,01)
> hpt45x  : Highpoint HPT45X (S,0,1,10)
> isw     : Intel Software RAID (0,1,5,01)
> jmicron : JMicron ATARAID (S,0,1)
> lsi     : LSI Logic MegaRAID (0,1,10)
> nvidia  : NVidia RAID (S,0,1,10,5)
> pdc     : Promise FastTrack (S,0,1,10)
> sil     : Silicon Image(tm) Medley(tm) (0,1,10)
> via     : VIA Software RAID (S,0,1,10)
>
> 9 of those still need porting/testing/... to/with MD external metadata
> support after the Intel IMSM MD port has settled, hence Red Hat is going
> to continue to support dmraid.

What I am personally interested in is a way for the dmraid userspace
to reuse time-proven md kernel infrastructure where it makes sense.
Yes, there are benefits to porting the metadata formats to mdadm, but
the point is that there is no immediate need, or even eventual need,
to do this work.  dmraid already knows how to convert all these
formats into a common dmsetup table format, and md already has knobs
for userspace to create raid arrays from the same information.

Can we not meet in the middle to allow at least dm-raid5 arrays to use md-raid5?

Put another way, do we need ~5000 new lines of kernel code to support
all the functionality that dmraid requires?  What happens when one of
these formats wants to add raid6 - more code duplication?  Can we
incrementally extend md-raid to cover the unmet needs of dmraid, and
what extensions are required?

> That being said: once the future work on a unified virtual block device
> infrastructure is production ready, we're open to use that.

That is the end game, but there are more cooperation opportunities
along the way.

NeilBrown June 16, 2009, 10:46 p.m. UTC | #6
On Tuesday June 16, heinzm@redhat.com wrote:
> 
> That being said: once the future work on a unified virtual block device
> infrastructure is production ready, we're open to use that.
> 

I was kind-a hoping that you (and others) would be involved in
developing this unified infrastructure, rather than just waiting for
it.

I think a great first step would be to allow md/raid5 to be used
directly as a dm target, thus turning dm-raid5 into a shim layer over
md/raid5.  The process of doing this would very likely highlight a
lot of the issues we would need to address in creating a unified
framework.

I will try to find time to review your dm-raid5 code with a view to
understanding how it plugs in to dm, and then how the md/raid5 engine
can be used by dm-raid5.

Part of this will be disconnecting the md/raid5 code from any specific
knowledge of a gendisk and a request_queue, as I suppose a dm target
doesn't own any of these.  Also I would probably want the mddev not to
have to be on the "all_mddevs" list, as we would not want a 'dm' raid5
to appear in /proc/mdstat.

NeilBrown

Jonathan Brassow June 18, 2009, 4:08 p.m. UTC | #7
On Jun 16, 2009, at 5:46 PM, Neil Brown wrote:

> On Tuesday June 16, heinzm@redhat.com wrote:
>>
>> That being said: once the future work on a unified virtual block  
>> device
>> infrastructure is production ready, we're open to use that.
>>
>
> I was kind-a hoping that you (and others) would be involved in
> developing this unified infrastructure, rather than just waiting for
> it.

I think this is what everyone thinks, which is why there is no  
integration effort.

  brassow

Jonathan Brassow June 18, 2009, 4:39 p.m. UTC | #8
Eliminate the 3rd argument to that function.  You can use  
'dm_rh_bio_to_region'.

  brassow

On Jun 15, 2009, at 12:21 PM, heinzm@redhat.com wrote:

>
> +void dm_rh_delay_by_region(struct dm_region_hash *rh,
> +			   struct bio *bio, region_t region)
> +{
> +	struct dm_region *reg;
> +
> +	/* FIXME: locking. */
> +	read_lock(&rh->hash_lock);
> +	reg = __rh_find(rh, region);
> +	bio_list_add(&reg->delayed_bios, bio);
> +	read_unlock(&rh->hash_lock);
> +}
> +EXPORT_SYMBOL_GPL(dm_rh_delay_by_region);
> +
> void dm_rh_stop_recovery(struct dm_region_hash *rh)
> {
> 	int i;

Heinz Mauelshagen June 18, 2009, 8:01 p.m. UTC | #9
On Thu, 2009-06-18 at 11:39 -0500, Jonathan Brassow wrote:
> Eliminate the 3rd argument to that function.  You can use  
> 'dm_rh_bio_to_region'.

No, I can't, because I'm keeping track of regions per single disk (as in
the mirroring code) rather than dividing the whole set's capacity into
regions. This is because a disk is divided into 2^N-sized chunks and
regions have to be 2^M-sized with 2^M >= 2^N.

See caller side in dm-raid45.c, function do_io().
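
To illustrate (a simplified sketch, not the actual do_io() code;
dev_sector stands for the offset on the individual disk):

/*
 * Delay a bio on the region of the *single device* it maps to, rather
 * than on a region of the whole set; both exported helpers are added
 * in patch 1/6.
 */
static void delay_bio_on_dev_region(struct dm_region_hash *rh,
				    struct bio *bio, sector_t dev_sector)
{
	region_t region = dm_rh_sector_to_region(rh, dev_sector);

	dm_rh_delay_by_region(rh, bio, region);
}

dm_rh_bio_to_region() would instead derive the region from the bio's
sector on the whole mapped set, which is why it doesn't fit here.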

Heinz

> 
>   brassow
> 
> On Jun 15, 2009, at 12:21 PM, heinzm@redhat.com wrote:
> 
> >
> > +void dm_rh_delay_by_region(struct dm_region_hash *rh,
> > +			   struct bio *bio, region_t region)
> > +{
> > +	struct dm_region *reg;
> > +
> > +	/* FIXME: locking. */
> > +	read_lock(&rh->hash_lock);
> > +	reg = __rh_find(rh, region);
> > +	bio_list_add(&reg->delayed_bios, bio);
> > +	read_unlock(&rh->hash_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(dm_rh_delay_by_region);
> > +
> > void dm_rh_stop_recovery(struct dm_region_hash *rh)
> > {
> > 	int i;
> 
NeilBrown June 19, 2009, 1:43 a.m. UTC | #10
On Wednesday June 17, neilb@suse.de wrote:
> 
> I will try to find time to review your dm-raid5 code with a view to
> understanding how it plugs in to dm, and then how the md/raid5 engine
> can be used by dm-raid5.

I've had a bit of a look through the dm-raid5 patches.

Some observations:

- You have your own 'xor' code against which you do a run-time test of
  the 'xor_block' code which md/raid5 uses - then choose the fastest.
  This really should not be necessary.  If you have xor code that runs
  faster than anything in xor_block, it really would be best to submit
  it for inclusion in the common xor code base.

- You have introduced "dm-message", which is a framework for sending
  messages to dm targets.  It is used for adjusting the cache size,
  changing the bandwidth used by resync, and doing things with
  statistics.  Linux already has a framework for sending messages to
  kernel objects.  It is called "sysfs".  It would be much nicer if
  you could use that.

- I don't know what Christoph Hellwig was referring to when he
  suggested that dm-raid1 was broken, but there is one issue with it
  that concerns me and which is (as far as I can tell) broken in
  dm-raid45 as well.
  That issue is summarised in the thread
   http://www.linux-archive.org/device-mapper-development/282987-dm-raid1-data-corruption.html

  The problem is that when a drive fails you allow writes to continue
  without waiting for the metadata to be updated.  I.e. metadata
  updates are asynchronous. 
  The problem scenario for raid5 is:
    - a drive fails while a write is in process
    - the drive is marked as faulty and user space is notified
    - that write, and possibly several others, complete and that 
      status is potentially returned to user-space (e.g. for an fsync)
    - user space acts on that status (e.g. sends an acknowledgement over
      the network)
    - the host crashes before the metadata update completes

    - on reboot, the drive, which just had a transient failure, works
      fine, and the metadata reports that all the drives are good, but
      that the array was active so a parity-resync is needed.
    - All parity blocks are recalculated.  However that write we
      mentioned above is only recorded in the parity block.  So that
      data is lost.

   Now if you get a crash while the array is degraded data loss is a
   real possibility.  However silent data loss should not be tolerated.

   For this reason, metadata updates reporting failed devices really
   need to be synchronous with respect to writes.
   
- Your algorithm for limiting bandwidth used by resync seems to be
  based on a % concept (which is rounded to the nearest reciprocal,
  so 100%, 50%, 33%, 25%, 20%, 16% etc).  If there is outstanding
  IO, the amount of outstanding resync IO must not exceed the given
  percentage of regular IO.   While it is hard to get resync "just
  right", I think this will have some particularly poor
  characteristics.
  Heavy regular IO load, or a non-existent regular load should be
  handled OK, but I suspect that a light IO load (e.g. one or two
  threads of synchronous requests) would cause the resync to go much
  slower than you would want.

  Also it does not take account of IO to other partitions on a device.
  It is common to have a number of different arrays on the same device
  set.  You really want to resync to back off based on IO to all of
  the device, not just to the array.


All of these are issues that are already addressed in md/raid5.
About the only thing that I saw which does not have a working analogue
in md/raid5 is cluster locking.  However that is "not yet implemented"
in dm-raid5 so that isn't a real difference.

Everything else could be made available by md to dm with only a
moderate amount of effort.
Of the different interface routines you require:
+	.ctr = raid_ctr,

  This would be quite straight forward, though it might be worth
  discussing the details of the args passed in.
  "recovery_bandwidth" (as I say above) I don't think is well chosen,
  and I don't quite see the point of io_size (maybe I'm missing
  something obvious)

+	.dtr = raid_dtr,

  This is trivial. raid5.stop

+	.map = raid_map,

  This is very much like the md/raid5 make_request.

+	.presuspend = raid_presuspend,

  I call this 'quiesce'

+	.postsuspend = raid_postsuspend,

  I'm not sure exactly how this fits in.  Something related to the 
  log...

+	.resume = raid_resume,

  ->quiesce requesting 'resume' rather than 'suspend'

+	.status = raid_status,

  very straight forward

+	.message = raid_message,

  As I said, I would rather this didn't exist.  But it could be
  emulated (except for the statistics) if needed - md can already
  adjust resync speed and cache size.
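
 For reference, all of these hooks hang off dm's target_type; a minimal
 registration sketch (version number made up, raid_* being the hooks
 listed above) would look like:

static struct target_type raid_target = {
	.name	     = "raid45",
	.version     = {1, 0, 0},	/* made-up version */
	.module	     = THIS_MODULE,
	.ctr	     = raid_ctr,
	.dtr	     = raid_dtr,
	.map	     = raid_map,
	.presuspend  = raid_presuspend,
	.postsuspend = raid_postsuspend,
	.resume	     = raid_resume,
	.status	     = raid_status,
	.message     = raid_message,
};

static int __init dm_raid45_init(void)
{
	return dm_register_target(&raid_target);
}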


 What is missing is agreement on how to handle synchronous metadata
 updates.  md waits for the drive failure to be acknowledged.  It is
 automatically marked as 'Blocked' on an error, and user-space must
 clear that flag.

 I suspect, using your model, the approach would be a 'dm-message'
 acknowledging that a given device is faulty.  I would, of course,
 rather this be done via sysfs.

 I will try to look at factoring out the md specific part of md/raid5
 over the next few days (maybe next week) and see how that looks.

NeilBrown

Heinz Mauelshagen June 19, 2009, 10:33 a.m. UTC | #11
On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote:
> On Wednesday June 17, neilb@suse.de wrote:
> > 
> > I will try to find time to review your dm-raid5 code with a view to
> > understanding how it plugs in to dm, and then how the md/raid5 engine
> > can be used by dm-raid5.

Hi Neil.

> 
> I've had a bit of a look through the dm-raid5 patches.

Thanks.

> 
> Some observations:
> 
> - You have your own 'xor' code against which you do a run-time test of
>   the 'xor_block' code which md/raid5 uses - then choose the fasted.
>   This really should not be necessary.  If you have xor code that runs
>   faster than anything in xor_block, it really would be best to submit
>   it for inclusion in the common xor code base.

This is in because it regularly shows better performance by utilizing
cache lines etc. more efficiently (tested on Intel, AMD and Sparc).

If xor_block had always performed best, I'd have dropped that
optimization already.

> 
> - You have introduced "dm-message" which is a frame work for sending
>   messages to dm targets.

dm-message is a parser for messages sent to a device-mapper target's
message interface. It aims to allow common message-parsing functions
to be shared between targets.

The message interface in general allows for runtime state changes of
targets like those you found below.

See "dmsetup message ..."

> It is used for adjusting the cache size,
>   changing the bandwidth used by resync, and doing things with
>   statistics.  Linux already has a framework for sending messages to
>   kernel objects.  It is called "sysfs".  It would be much nicer if
>   you could use that.

Looks to me like it's heavier to program (I counted ~100 LOC of sysfs
defines and calls in md.c) for the 3 files used in md.

That needs some more discussion.

> 
> - I don't know what Christoph Hellwig was referring to when he
>   suggested that dm-raid1 was broken, but there is one issue with it
>   that concerns me and which is (as far as I can tell) broken in
>   dm-raid45 as well.
>   That issue is summarised in the thread
>    http://www.linux-archive.org/device-mapper-development/282987-dm-raid1-data-corruption.html

Yes, the major point being that dm aims to do metadata updates mostly in
userspace and keep kernel-space metadata access to a minimum.

> 
>   The problem is that when a drive fails you allow writes to continue
>   without waiting for the metadata to be updated.  I.e. metadata
>   updates are asynchronous. 
>   The problem scenario for raid5 is:
>     - a drive fails while a write is in process
>     - the drive is marked as faulty and user space is notified
>     - that write, and possibly several others, complete and that 
>       status is potentially returned to user-space (e.g. for an fsync)
>     - user space acts on that status (e.g. sends an acknowledgement over
>       the network)
>     - the host crashes before the metadata update completes
>     - on reboot, the drive, which just had a transient failure, works
>       fine, and the metadata reports that all the drives are good, but
>       that the array was active so a parity-resync is needed.
>     - All parity blocks are recalculated.  However that write we
>       mentioned above is only recorded in the parity block.  So that
>       data is lost.
> 

Yes, as in the dm raid1 scenario, writes would have to be blocked until
the metadata has been updated, so that the failing drive is recorded
persistently; the resync would then not update parity (which contains
data that must be preserved) but recalculate the chunks on the
transiently failing drive utilizing the intact parity.
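
In pseudo-kernel terms, the required ordering amounts to something like
this (a pure sketch with hypothetical names; a real implementation would
hold back the write completions in a worker rather than sleep in the
error path):

enum raid_dev_flags { DEV_FAILED, DEV_BLOCKED };

struct raid_dev {
	unsigned long flags;
	struct completion meta_ack;	/* completed on acknowledgement */
};

static void raid_dev_fail(struct raid_dev *dev)
{
	set_bit(DEV_FAILED, &dev->flags);
	set_bit(DEV_BLOCKED, &dev->flags);	/* hold back write end_io */
	notify_userspace(dev);			/* hypothetical: dm event */
	wait_for_completion(&dev->meta_ack);	/* metadata update on disk */
	clear_bit(DEV_BLOCKED, &dev->flags);	/* completions may run now */
}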

> 
>    Now if you get a crash while the array is degraded data loss is a
>    real possibility.  However silent data loss should not be tolerated.
> 
>    For this reason, metadata updates reporting failed devices really
>    need to by synchronous with respect to writes.
> 

Agreed.
   
> - Your algorithm for limiting bandwidth used by resync seems to be
>   based on a % concept (which is rounded to the nearest reciprocal,
>   so 100%, 50%, 33%, 25%, 20%, 16% etc).  If there is outstanding
>   IO, the amount of outstanding resync IO must not exceed the given
>   percentage of regular IO.   While it is hard to get resync "just
>   right", I think this will have some particularly poor
>   characteristics.

My ANALYZEME pointed you there? ;-)
That's why it's in. With it I admitted that I'm not satisfied with the
characteristics yet.

>   Heavy regular IO load, or a non-existent regular load should be
>   handled OK, but I suspect that a light IO load (e.g. one or two
>   threads of synchronous requests) would cause the resync to go much
>   slower than you would want.

I'll look into it further, taking your thoughts into account.
> 
>   Also it does not take account of IO to other partitions on a device.
>   It is common to have a number of different arrays on the same device
>   set.  You really want to resync to back off based on IO to all of
>   the device, not just to the array.

Does MD RAID > 0 cope with that in multi-partition RAID configs?
E.g.:
disk1:part1+disk2:part1 -> md0 and disk1:part2+disk2:part2 -> md1 with
global iostats for disk[12]?

> 
> 
> All of these are issues that are already addressed in md/raid5.
> About the only thing that I saw which does not have a working analogue
> in md/raid5 is cluster locking.  However that is "not yet implemented"
> in dm-raid5 so that isn't a real difference.

Yes, it's a start for future cluster RAID5 usage.

> 
> Everything else could be made available by md to dm with only a
> moderate amount of effort.
> Of the different interface routines you require:
> +	.ctr = raid_ctr,
> 
>   This would be quite straight forward, though it might be worth
>   discussing the details of the args passed in.
>   "recovery_bandwidth" (as I say above) I don't think is well chosen,

I was aware it needs further tweaking (your points above taken).

You mean the wording here?

>   and I don't quite see the point of io_size (maybe I'm missing
>   something obvious)

It's performance related. I studied read-ahead/chunk-size/io-size pairs
and found examples of better performance with PAGE_SIZE < io_size <=
chunk_size, io_size being 2^N.

Do you have similar experiences to share with MD RAID?

> 
> +	.dtr = raid_dtr,
> 
>   This is trivial. raid5.stop
> 
> +	.map = raid_map,
> 
>   This is very much like the md/raid5 make_request.
> 
> +	.presuspend = raid_presuspend,
> 
>   I call this 'quiesce'
> 
> +	.postsuspend = raid_postsuspend,
> 
>   I'm not sure exactly how this fits in.  Something related to the 
>   log...

We use these distinct presuspend/postsuspend methods to allow targets
to distinguish between flushing any io in flight (presuspend) and
releasing/quiescing any resource utilized in that first step, e.g.
quiescing the dirty log (postsuspend).

If MD RAID doesn't need that, stick with presuspend.

> 
> +	.resume = raid_resume,
> 
>   ->quiesce requesting 'resume' rather than 'suspend'
> 
> +	.status = raid_status,
> 
>   very straight forward
> 
> +	.message = raid_message,
> 
>   As I said, I would rather this didn't exist.

I think it's better to offer options to be utilized for different
purposes. The tradeoff is always (some) overlap.

We don't have one filesystem for all use cases ;-)

> But it could be
>   emulated (except for the statistics) if needed - md can already
>   adjust resync speed and cache size.
> 
> 
>  What is missing is agreement on how to handle synchronous metadata
>  updates.  md waits for the drive failure to be acknowledged.  It is
>  automatically marked as 'Blocked' on an error, and user-space must
>  clear that flag.

Yes, Mikulas already put thought into that.
Let's allow him (and others) to join in.


> 
>  I suspect, using your model, the approach would be a 'dm-message'
>  acknowledging that a given device is faulty.  I would, of course,
>  rather this be done via sysfs.
> 
>  I will try to look at factoring out the md specific part of md/raid5
>  over the next few days (maybe next week) and see how that looks.

It'll be interesting to see how it fits into the device-mapper
framework.

I take it you'll port code to create an "md-raid456" target?

Thanks a lot,
Heinz

> 
> NeilBrown

Dan Williams June 21, 2009, 12:32 a.m. UTC | #12
On Fri, Jun 19, 2009 at 3:33 AM, Heinz Mauelshagen<heinzm@redhat.com> wrote:
> On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote:
>> - You have your own 'xor' code against which you do a run-time test of
>>   the 'xor_block' code which md/raid5 uses - then choose the fasted.
>>   This really should not be necessary.  If you have xor code that runs
>>   faster than anything in xor_block, it really would be best to submit
>>   it for inclusion in the common xor code base.
>
> This is in because it actually shows better performance regularly by
> utilizing cache lines etc. more efficiently (tested on Intel, AMD and
> Sparc).
>
> If xor_block would always have been performed best, I'd dropped that
> optimization already.

Data please.  Especially given the claim that it improves upon the
existing per-architecture hand-tuned assembly routines.  If it is
faster, great!  Please send patches to improve the existing code.

>>
>> - You have introduced "dm-message" which is a frame work for sending
>>   messages to dm targets.
>
> dm-message is a parser for messages sent to a device-mapper targets
> message interface. It aims at allowing common message parsing functions
> to be shared between targets.
>
> The message interface in general allows for runtime state changes of
> targets like those you found below.
>
> See "dmsetup message ..."
>
>> It is used for adjusting the cache size,
>>   changing the bandwidth used by resync, and doing things with
>>   statistics.  Linux already has a framework for sending messages to
>>   kernel objects.  It is called "sysfs".  It would be much nicer if
>>   you could use that.
>
> Looks to me like heavier to program (I counted ~100 LOC of sysfs defines
> and calls in md.c) for 3 files used in md.
>
> That needs some more discussion.

Ironically, the same argument can be applied to this module.  ~5000 LOC
to re-implement raid5 seems much "heavier to program" than to write
some dmraid ABI compatibility wrappers around the existing md-raid5
code.

[..]
>>   The problem is that when a drive fails you allow writes to continue
>>   without waiting for the metadata to be updated.  I.e. metadata
>>   updates are asynchronous.
>>   The problem scenario for raid5 is:
>>     - a drive fails while a write is in process
>>     - the drive is marked as faulty and user space is notified
>>     - that write, and possibly several others, complete and that
>>       status is potentially returned to user-space (e.g. for an fsync)
>>     - user space acts on that status (e.g. sends an acknowledgement over
>>       the network)
>>     - the host crashes before the metadata update completes
>>     - on reboot, the drive, which just had a transient failure, works
>>       fine, and the metadata reports that all the drives are good, but
>>       that the array was active so a parity-resync is needed.
>>     - All parity blocks are recalculated.  However that write we
>>       mentioned above is only recorded in the parity block.  So that
>>       data is lost.
>>
>
> Yes, like in the dm raid1 scenario, blocking of writes would have to
> occur until metadata has been updated in order to detect the failing
> drive persistently, hence not updating parity (which contains data
> mandatory to preserve) but recalculating chunks on the transiently
> failing drive utilizing intact parity.

Another problem with dm along these lines is the mechanism for
notifying userspace; I am assuming this is still via uevents?  The
problem is that each notification requires a memory allocation, which
might fail under load, especially if the blocked array is in the
dirty-page writeout path.  The userspace notifications and failed-device
handling in md are designed to not require memory allocations.

[..]
>>   Also it does not take account of IO to other partitions on a device.
>>   It is common to have a number of different arrays on the same device
>>   set.  You really want to resync to back off based on IO to all of
>>   the device, not just to the array.
>
> Does MD RAID > 0 cope with that in multi-partition RAID configs ?
> E.g.:
> disk1:part1+disk2:part1 -> md0 and disk1:part2+disk2:part2 -> md1 with
> global iostats for disk[12] ?

By default it will hold off resync on other arrays when it notices
that multiple arrays share disks.  A sysfs knob allows parallel
resync.

[..]
>>   As I said, I would rather this didn't exist.
>
> I think its better to offer options to be utilized for different
> purposes. Tradeoff is always (some) overlap.
>
> We don't have one filesystem for all use cases ;-)

That argument does not fly in this case.  md is already suitable for
the dmraid use case (i.e. kernel raid/userspace metadata).

>> But it could be
>>   emulated (except for the statistics) if needed - md can already
>>   adjust resync speed and cache size.
>>
>>
>>  What is missing is agreement on how to handle synchronous metadata
>>  updates.  md waits for the drive failure to be acknowledged.  It is
>>  automatically marked as 'Blocked' on an error, and user-space must
>>  clear that flag.
>
> Yes, Mikulas already put thought into that.
> Let's allow him /and others) to join in.
>
>
>>
>>  I suspect, using your model, the approach would be a 'dm-message'
>>  acknowledging that a given device is faulty.  I would, of course,
>>  rather this be done via sysfs.
>>
>>  I will try to look at factoring out the md specific part of md/raid5
>>  over the next few days (maybe next week) and see how that looks.
>
> That'll be interesting to see how it fits into the device-mapper
> framework
>

Yes, kernel ABI compatibility is the better [1] way to go.  I will
also revive the ideas/patches I had along these lines.

Regards,
Dan

[1]: ...than dm2md in userspace

NeilBrown June 21, 2009, 12:06 p.m. UTC | #13
On Friday June 19, heinzm@redhat.com wrote:
> On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote:
> > On Wednesday June 17, neilb@suse.de wrote:
> > > 
> > > I will try to find time to review your dm-raid5 code with a view to
> > > understanding how it plugs in to dm, and then how the md/raid5 engine
> > > can be used by dm-raid5.
> 
> Hi Neil.
> 
> > 
> > I've had a bit of a look through the dm-raid5 patches.
> 
> Thanks.
> 
> > 
> > Some observations:
> > 
> > - You have your own 'xor' code against which you do a run-time test of
> >   the 'xor_block' code which md/raid5 uses - then choose the fasted.
> >   This really should not be necessary.  If you have xor code that runs
> >   faster than anything in xor_block, it really would be best to submit
> >   it for inclusion in the common xor code base.
> 
> This is in because it actually shows better performance regularly by
> utilizing cache lines etc. more efficiently (tested on Intel, AMD and
> Sparc).
> 
> If xor_block would always have been performed best, I'd dropped that
> optimization already.

I'm no expert on this I must admit - I just use the xor code that
others have provided.
However it is my understanding that some of the xor code options were
chosen deliberately because they bypassed the cache.  As the xor
operation operates on data that very probably was not in the cache,
and only touches it once, and touches quite a lot of data, there is
little benefit in having it in the cache, and a possible real cost as it
pushes other data out of the cache.

That said, the "write" process involves copying data from a buffer
into the cache, and then using that data as a source for xor.  In that
case the cache might be beneficial, I'm not sure.
I've occasionally wondered if it might be good to have an xor routine
that does "copy and xor".   This would use more registers than the
current code so we could possibly process fewer blocks at a time, but
we might be processing them faster.  In that case, if we could bypass
the cache to read the data, but use the cache to store the result of
the xor, that would be ideal.  However I have no idea if that is
possible.
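
For the record, a plain-C version of such a routine is trivial; the
open question is whether the reads could bypass the cache while the
result stays in it:

/* Naive copy-and-xor: copy *src into the (cached) stripe buffer while
 * xor-ing it into the parity accumulator in the same pass. */
static void xor_and_copy(unsigned long *parity, unsigned long *cache_buf,
			 const unsigned long *src, size_t words)
{
	size_t i;

	for (i = 0; i < words; i++) {
		unsigned long v = src[i];

		cache_buf[i] = v;	/* the "copy" half */
		parity[i] ^= v;		/* the "xor" half */
	}
}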

The mechanism you use, which is much the same as the one used by
xor_block, calculates speed for the hot-cache case.  It is not clear
that this is often the relevant case - mostly xor is calculated when
the cache is cold.  So it isn't clear if it is generally applicable.
That is why calibrate_xor_blocks sometimes skips the xor_speed test
(in crypto/xor.c).

However if your code performs better than the cache-agnostic code in
xor_block, then we should replace that code with your code, rather
than have two places that try to choose the optimal code.

And I would be very interested to see a discussion about how relevant
the cache bypassing is in real life...  It looks like it was Zach
Brown who introduced this about 10 years ago, presumably in
consultation with Ingo Molnar.

> 
> > 
> > - You have introduced "dm-message" which is a frame work for sending
> >   messages to dm targets.
> 
> dm-message is a parser for messages sent to a device-mapper targets
> message interface. It aims at allowing common message parsing functions
> to be shared between targets.
> 
> The message interface in general allows for runtime state changes of
> targets like those you found below.
> 
> See "dmsetup message ..."
> 
> > It is used for adjusting the cache size,
> >   changing the bandwidth used by resync, and doing things with
> >   statistics.  Linux already has a framework for sending messages to
> >   kernel objects.  It is called "sysfs".  It would be much nicer if
> >   you could use that.
> 
> Looks to me like heavier to program (I counted ~100 LOC of sysfs defines
> and calls in md.c) for 3 files used in md.
> 
> That needs some more discussion.

In dm-raid45 you have 170 lines from "Message interface" to "END
message interface", which is 4 files I think.  Am I counting the same
things?

In any case, it is certainly possible that sysfs usage could be
improved.  We recently added strict_blocks_to_sectors() which extracted
a common parsing task.  Once upon a time, we open coded things with
simple_strtoull and repeated the error checking every time.
There is always room for improvement.

But I think the uniformity across the kernel is the most important
issue here, not the lines of code.

There is a bit of tension here:  your dm-message approach is similar
in style to the way targets are created.  The sysfs approach is
similar in style to many other parts of the kernel.  So should your
new mechanism be consistent with dm or consistent with linux - it
cannot easily be both.

This is really a question that the dm developers as a group need to
consider and answer.  I would like to recommend that transitioning
towards more uniformity with the rest of linux is a good idea, and
here is excellent opportunity to start.  But it isn't my decision.

> > - Your algorithm for limiting bandwidth used by resync seems to be
> >   based on a % concept (which is rounded to the nearest reciprocal,
> >   so 100%, 50%, 33%, 25%, 20%, 16% etc).  If there is outstanding
> >   IO, the amount of outstanding resync IO must not exceed the given
> >   percentage of regular IO.   While it is hard to get resync "just
> >   right", I think this will have some particularly poor
> >   characteristics.
> 
> My ANALYZEME pointed you there ? ;-)
> That's why it's in. I admitted with that that I'm not satisfied with the
> characteristics yet.

Fair enough.
My concern then is that the mechanism for guiding the speed - which is
not yet finalised - is being encoded in the command for creating the
array.  That would make it difficult to change at a later date.
I would suggest not specifying anything when creating the array and
insisting that a default be chosen.
Then you can experiment with mechanism using different dm-message or
sysfs-files with some sort of warning that these are not part of the
api yet.

> > 
> >   Also it does not take account of IO to other partitions on a device.
> >   It is common to have a number of different arrays on the same device
> >   set.  You really want to resync to back off based on IO to all of
> >   the device, not just to the array.
> 
> Does MD RAID > 0 cope with that in multi-partition RAID configs ?
> E.g.:
> disk1:part1+disk2:part1 -> md0 and disk1:part2+disk2:part2 -> md1 with
> global iostats for disk[12] ?

Yes.  'gendisk' has 'sync_io' specifically for this purpose.
When md submits IO for the purpose of resync (or recovery) it adds a
count of the sectors to this counter.  That can be compared with the
regular io statistics for the genhd to see if there is any non-sync IO
happening.  If there is, the genhd is assumed to be busy and we back off
the resync.
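
Roughly, in the spirit of md's is_mddev_idle() (simplified; md also
allows some slack for requests still in flight):

/* Regular activity on the whole disk is the total sectors transferred
 * minus the sectors we submitted for resync ourselves. */
static int disk_seems_idle(struct gendisk *disk, long *last_events)
{
	long curr = (long)part_stat_read(&disk->part0, sectors[0]) +
		    (long)part_stat_read(&disk->part0, sectors[1]) -
		    atomic_read(&disk->sync_io);
	int idle = curr - *last_events <= 0;

	*last_events = curr;
	return idle;
}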

This is not a perfect solution as it only goes one level down.
If we had raid1 over raid0 over partitions, then we might not notice
resync IO from elsewhere properly.  But I don't think I've ever had a
complaint about that.

> > 
> > Everything else could be made available by md to dm with only a
> > moderate amount of effort.
> > Of the different interface routines you require:
> > +	.ctr = raid_ctr,
> > 
> >   This would be quite straight forward, though it might be worth
> >   discussing the details of the args passed in.
> >   "recovery_bandwidth" (as I say above) I don't think is well chosen,
> 
> I was aware it needs further tweaking (your points above taken).
> 
> You mean the wording here ?

The thing that I don't think was well chosen was the choice of a
percentage to guide the speed of resync.

> 
> >   and I don't quite see the point of io_size (maybe I'm missing
> >   something obvious)
> 
> Performance related. I studied read ahead/chunk size/io size pairs and
> found examples of better performance with io_size > PAGE_SIZE <=
> chunk_size. io_size being 2^N.
> 
> Do you have similar experiences to share with MD RAID ?

My position has always been that it is up to lower layers to combine
pages together.  md/raid5 does all IO in pages.  If it is more
efficient for some device to do multiple pages at a time, it should
combine the pages.

The normal plugging mechanisms combined with the elevator should be
enough for this.

We have put more focus into gathering write-pages into whole strips
so that no pre-reading is needed - we can just calculate parity and
write.  This has a significant effect on throughput.
(A 'strip' is like a 'stripe' but it is one page wide, not one chunk
wide).

> > +	.presuspend = raid_presuspend,
> > 
> >   I call this 'quiesce'
> > 
> > +	.postsuspend = raid_postsuspend,
> > 
> >   I'm not sure exactly how this fits in.  Something related to the 
> >   log...
> 
> We use these distinct presuspend/postsuspend methods in order to allow
> for targets to distinguish flushing any io in flight in presuspend and
> release/quiesce/... any resource (e.g. quiesce the dirty log) utilized
> in the first step in postsuspend.
> 
> If MD RAID doesn't need that, stick with presuspend.

OK, thanks.

> > +	.message = raid_message,
> > 
> >   As I said, I would rather this didn't exist.
> 
> I think its better to offer options to be utilized for different
> purposes. Tradeoff is always (some) overlap.
> 
> We don't have one filesystem for all use cases ;-)
> 

No...  I find myself wearing several hats here.

As a developer, I think it is great that you are writing your own
raid5 implementation.  Doing stuff like that is lots of fun and a
valuable learning experience.  As I read through the code there were a
number of times when I thought "that looks rather neat".  Quite
possibly there are ideas in dm-raid5 that I could "steal" to make
md/raid5 better.
dm-message certainly has its merits too (as I said, it fits the dm model).

As a member of the linux kernel community, I want to see linux grow
with a sense of unity.  Where possible, similar tasks should be
addressed in similar ways.  This avoids duplicating mistakes, eases
maintenance, and makes it easier for newcomers to learn.  Establishing
Patterns and following them is a good thing.  Hence my desire to
encourage the use of sysfs and a single xor implementation.

As an employee of a company that sells support, I really don't want a
profusion of different implementations of something that I am seen as
an expert in, as I know I'll end up having to support all of them.
So for very selfish reasons, I'd prefer fewer versions of NFS, fewer
file-systems, and fewer raid5 implementations :-)
Well... fewer that we contract to support anyway.

> > 
> >  I suspect, using your model, the approach would be a 'dm-message'
> >  acknowledging that a given device is faulty.  I would, of course,
> >  rather this be done via sysfs.
> > 
> >  I will try to look at factoring out the md specific part of md/raid5
> >  over the next few days (maybe next week) and see how that looks.
> 
> That'll be interesting to see how it fits into the device-mapper
> framework
> 
> I take it You'll port code to create an "md-raid456" target ?

That's the idea.  I'll let you know when I have something credible.

Thanks,
NeilBrown

Heinz Mauelshagen June 22, 2009, 7:10 p.m. UTC | #14
On Sun, 2009-06-21 at 22:06 +1000, Neil Brown wrote:
> On Friday June 19, heinzm@redhat.com wrote:
> > On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote:
> > > On Wednesday June 17, neilb@suse.de wrote:
> > > > 
> > > > I will try to find time to review your dm-raid5 code with a view to
> > > > understanding how it plugs in to dm, and then how the md/raid5 engine
> > > > can be used by dm-raid5.
> > 
> > Hi Neil.
> > 
> > > 
> > > I've had a bit of a look through the dm-raid5 patches.
> > 
> > Thanks.
> > 
> > > 
> > > Some observations:
> > > 
> > > - You have your own 'xor' code against which you do a run-time test of
> > >   the 'xor_block' code which md/raid5 uses - then choose the fasted.
> > >   This really should not be necessary.  If you have xor code that runs
> > >   faster than anything in xor_block, it really would be best to submit
> > >   it for inclusion in the common xor code base.
> > 
> > This is in because it actually shows better performance regularly by
> > utilizing cache lines etc. more efficiently (tested on Intel, AMD and
> > Sparc).
> > 
> > If xor_block would always have been performed best, I'd dropped that
> > optimization already.
> 
> I'm no expert on this I must admit - I just use the xor code that
> others have provided.
> However it is my understanding that some of the xor code options were
> chosen deliberately because they bypassed the cache.  As the xor
> operation operates on data that very probably was not in the cache,
> and only touches it once, and touches quite a lot of data, there is
> little benefit having in the cache, and a possible real cost as it
> pushes other data out of the cache.
> 
> That said, the "write" process involves copying data from a buffer
> into the cache, and then using that data as a source for xor.  In that
> case the cache might be beneficial, I'm not sure.
> I've occasionally wondered if it might be good to have an xor routine
> that does "copy and xor".   This would use more registers than the
> current code so we could possibly process fewer blocks at a time, but
> we might be processing them faster.  In that case, if we could bypass
> the cache to read the data, but use the cache to store the result of
> the xor, that would be ideal.  However I have no idea if that is
> possible.
> 
> The mechanism you use, which is much the same as the one used by
> xor_block, calculates speed for the hot-cache case.  It is not clear
> that this is often a relevant cache - mostly xor is calculated when
> the cache is cold.  So it isn't clear if it is generally applicable.
> That is why calibrate_xor_blocks sometimes skips the xor_speed test
> (in crypto/xor.c).

I can actually perform cold-cache runs with a little code change, which
should allow us to find out empirically whether my optimization is
worth it.

Not sure, though, if I'll be able to do this before heading out to
LinuxTag for the rest of this week.

> 
> However if your code performs better than the cache-agnostic code in
> xor_block, then we should replace that code with your code, rather
> than have two places that try to choose the optimal code.

Sure, we'll see after some more investigation.

For the time being, it can stay in the raid45 target until we either
come to the conclusion that it should get moved underneath crypto's
generic API or get dropped altogether.

> 
> And I would be very interested to see a discussion about how relevant
> the cache bypassing is in real life...  It looks like it was Zach
> Brown who introduced this about 10 years ago, presumably in
> consultation with Ingo Molnar.
> 
> > 
> > > 
> > > - You have introduced "dm-message" which is a frame work for sending
> > >   messages to dm targets.
> > 
> > dm-message is a parser for messages sent to a device-mapper targets
> > message interface. It aims at allowing common message parsing functions
> > to be shared between targets.
> > 
> > The message interface in general allows for runtime state changes of
> > targets like those you found below.
> > 
> > See "dmsetup message ..."
> > 
> > > It is used for adjusting the cache size,
> > >   changing the bandwidth used by resync, and doing things with
> > >   statistics.  Linux already has a framework for sending messages to
> > >   kernel objects.  It is called "sysfs".  It would be much nicer if
> > >   you could use that.
> > 
> > Looks to me like heavier to program (I counted ~100 LOC of sysfs defines
> > and calls in md.c) for 3 files used in md.
> > 
> > That needs some more discussion.
> 
> In dm-raid45 you have 170 lines from "Message interface" to "END
> message interface", which is 4 files I think.  Am I counting the same
> things?

Yes.

> 
> In any case, it is certainly possible that sysfs usage could be
> improved.  We recently added strict_blocks_to_sectors() which extracted
> a common parsing task.  Once upon a time, we open coded things with
> simple_strtoull and repeated the error checking every time.
> There is always room for improvement.
> 
> But I think the uniformity across the kernel is the most important
> issue here, not the lines of code.

Well, it should be balanced between these possibly exclusive extremes.

Nobody wants to add huge amounts of code to maintain just to allow for
uniformity without enhancing shared code upfront.

> 
> There is a bit of tension here:  your dm-message approach is similar
> in style to the way targets are created.  The sysfs approach is
> similar in style to many other parts of the kernel.  So should your
> new mechanism be consistent with dm or consistent with linux - it
> cannot easily be both.
> 
> This is really a question that the dm developers as a group need to
> consider and answer.  I would like to recommend that transitioning
> towards more uniformity with the rest of linux is a good idea, and
> here is excellent opportunity to start.  But it isn't my decision.

We're meeting in Berlin at LinuxTag so this is going to be an item
there...

> 
> > > - Your algorithm for limiting bandwidth used by resync seems to be
> > >   based on a % concept (which is rounded to the nearest reciprocal,
> > >   so 100%, 50%, 33%, 25%, 20%, 16% etc).  If there is outstanding
> > >   IO, the amount of outstanding resync IO must not exceed the given
> > >   percentage of regular IO.   While it is hard to get resync "just
> > >   right", I think this will have some particularly poor
> > >   characteristics.
> > 
> > My ANALYZEME pointed you there ? ;-)
> > That's why it's in. I admitted with that that I'm not satisfied with the
> > characteristics yet.
> 
> Fair enough.
> My concern then is that the mechanism for guiding the speed - which is
> not yet finalised - is being encoded in the command for creating the
> array.  That would make it difficult to change at a later date.
> I would suggest not specifying anything when creating the array and
> insisting that a default be chosen.

It is already not mandatory to specify it, and a default will be set.
I'm not aware of any application utilizing it yet.

> Then you can experiment with mechanism using different dm-message or
> sysfs-files with some sort of warning that these are not part of the
> api yet.

I got that freedom already with the tiny flaw of an optional parameter
being supported. If this is perceived to be bad, I can surely drop it.

> 
> > > 
> > >   Also it does not take account of IO to other partitions on a device.
> > >   It is common to have a number of different arrays on the same device
> > >   set.  You really want to resync to back off based on IO to all of
> > >   the device, not just to the array.
> > 
> > Does MD RAID > 0 cope with that in multi-partition RAID configs ?
> > E.g.:
> > disk1:part1+disk2:part1 -> md0 and disk1:part2+disk2:part2 -> md1 with
> > global iostats for disk[12] ?
> 
> Yes.  'gendisk' has 'sync_io' specifically for this purpose.
> When md submits IO for the purpose of resync (or recovery) it adds a
> count of the sectors to this counter.  That can be compared with the
> regular io statistics for the genhd to see if there is an non-sync IO
> happening.  If there is the genhd is assumed to be busy and we backoff
> the resync.
> 
> This is not a perfect solution as it only goes one level down.
> If we had raid1 over raid0 over partitions, then we might not notice
> resync IO from elsewhere properly.  But I don't think I've ever had a
> complaint about that.

Thanks for the explanation.

> 
> > > 
> > > Everything else could be made available by md to dm with only a
> > > moderate amount of effort.
> > > Of the different interface routines you require:
> > > +	.ctr = raid_ctr,
> > > 
> > >   This would be quite straight forward, though it might be worth
> > >   discussing the details of the args passed in.
> > >   "recovery_bandwidth" (as I say above) I don't think is well chosen,
> > 
> > I was aware it needs further tweaking (your points above taken).
> > 
> > You mean the wording here ?
> 
> The thing that I don't think was well chosen was the choice of a
> percentage to guide the speed of resync.

I actually like percent of total bandwidth, because it allows for
maximum throughput when there's no application IO, while limiting
resync to the respective, adjustable bandwidth percentage during
application IO.

If the set is moved to other HW with different bandwidth, this still
holds up, whereas absolute throughput changes the ratio.

Of course, we could support both absolute and relative.
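
For the relative variant, the check boils down to something like the
following (an illustration of the scheme only, not the actual dm-raid45
code; the percentage ends up stored as a reciprocal, hence the 100%,
50%, 33%, ... steps Neil mentioned):

struct resync_throttle {
	unsigned ratio;	/* 1 -> 100%, 2 -> 50%, 3 -> 33%, ... */
};

static void throttle_set_percent(struct resync_throttle *t, unsigned percent)
{
	t->ratio = percent ? 100 / percent : 100;
}

static int resync_may_proceed(struct resync_throttle *t,
			      unsigned io_pending, unsigned resync_pending)
{
	/* With no application IO pending, resync runs at full speed. */
	if (!io_pending)
		return 1;

	/* Otherwise keep outstanding resync IO at or below 1/ratio of it. */
	return resync_pending * t->ratio <= io_pending;
}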

> 
> > 
> > >   and I don't quite see the point of io_size (maybe I'm missing
> > >   something obvious)
> > 
> > Performance related. I studied read ahead/chunk size/io size pairs and
> > found examples of better performance with io_size > PAGE_SIZE <=
> > chunk_size. io_size being 2^N.
> > 
> > Do you have similar experiences to share with MD RAID ?
> 
> My position has always been that it is up to lower layers to combine
> pages together.  md/raid5 does all IO in pages.  If it is more
> efficient for some device to do multiple pages as a time, it should
> combine the pages.

I can share that standpoint, but I have had tests showing otherwise.

> 
> The normal plugging mechanisms combined with the elevator should be
> enough for this.

I'll run a series of tests when I get to this in order to finally
decide how to go about it.

> 
> We have put more focus in to gathering write-pages into whole strips
> so that no pre-reading is needed - we can just calculate parity and
> write.  This has a significant effect on throughput.

Yes, the raid45 target does the same.
It's a huge win to avoid the read-before-write penalty altogether when
a stripe gets completely written over.

> (A 'strip' is like a 'stripe' but it is one page wide, not one chunk
> wide).
> 
> > > +	.presuspend = raid_presuspend,
> > > 
> > >   I call this 'quiesce'
> > > 
> > > +	.postsuspend = raid_postsuspend,
> > > 
> > >   I'm not sure exactly how this fits in.  Something related to the 
> > >   log...
> > 
> > We use these distinct presuspend/postsuspend methods in order to allow
> > for targets to distinguish flushing any io in flight in presuspend and
> > release/quiesce/... any resource (e.g. quiesce the dirty log) utilized
> > in the first step in postsuspend.
> > 
> > If MD RAID doesn't need that, stick with presuspend.
> 
> OK, thanks.
> 
> > > +	.message = raid_message,
> > > 
> > >   As I said, I would rather this didn't exist.
> > 
> > I think its better to offer options to be utilized for different
> > purposes. Tradeoff is always (some) overlap.
> > 
> > We don't have one filesystem for all use cases ;-)
> > 
> 
> No...  I find myself wearing several hats here.
> 
> As a developer, I think it is great that you are writing your own
> raid5 implementation.  Doing stuff like that is lots of fun and a
> valuable learning experience.  As I read though the code there were a
> number of time when I thought "that looks rather neat".

Thanks :-)

> Quite
> possibly there are ideas in dm-raid5 that I could "steal" to make
> md/raid5 better.

I wouldn't bet you won't do that ;-)

> dm-message certainly has its merits too (as I said, it fits the dm model).
> 
> As a member of the linux kernel community, I want to see linux grow
> with a sense of unity.  Where possible, similar tasks should be
> addressed in similar ways.  This avoid duplicating mistakes, eases
> maintenance, and makes it easier for newcomers to learn.  Establishing
> Patterns and following them is a good thing.  Hence my desire to
> encourage the use of sysfs and a single xor implementation.

Understandable/sharable standpoint. Takes us back to the dm team
discussion mentioned above.

> 
> As an employee of a company that sells support, I really don't want a
> profusion of different implementations of something that I am seen as
> an expert in, as I know I'll end up having to support all of them.
> So for very selfish reasons, I'd prefer fewer versions of NFS, fewer
> file-systems, and fewer raid5 implementations :-)
> Well... fewer that we contract to support anyway.
> 

We might end up with one...

But we can take advantage of impulses from different implementations to
arrive at better solutions, which won't happen if there's just one.

> > > 
> > >  I suspect, using your model, the approach would be a 'dm-message'
> > >  acknowledging that a given device is faulty.  I would, of course,
> > >  rather this be done via sysfs.
> > > 
> > >  I will try to look at factoring out the md specific part of md/raid5
> > >  over the next few days (maybe next week) and see how that looks.
> > 
> > That'll be interesting to see how it fits into the device-mapper
> > framework
> > 
> > I take it You'll port code to create an "md-raid456" target ?
> 
> That's the idea.  I'll let you know when I have something credible.

OK, we were thinking alike.

Regards,
Heinz

> 
> Thanks,
> NeilBrown


Patch

========================
  1/6: dm raid45 target: export region hash functions and add a needed one
  2/6: dm raid45 target: introduce memory cache
  3/6: dm raid45 target: introduce message parser
  4/6: dm raid45 target: export dm_disk from dm.c
  5/6: dm raid45 target: introduce the raid45 target
  6/6: dm raid45 target: add raid45 modules to Makefile and adjust Kconfig

 include/linux/dm-region-hash.h  |     4 ++++
 drivers/md/dm-region-hash.c     |    21 +++++++++++++++++--
 drivers/md/dm-memcache.h        |    68 ++++++++++++++++++++++
 drivers/md/dm-memcache.c        |   301 ++++++++++++++++++++++
 drivers/md/dm-message.h         |    91 ++++++++++++++++++++++++
 drivers/md/dm-message.c         |   183 ++++++++++++++++++++++++
 drivers/md/dm.c                 |     1 +
 drivers/md/dm-raid45.h          |    28 ++++++++++++++++++++++++++
 drivers/md/dm-raid45.c          |  4580 ++++++++++++++++++++++++++
 drivers/md/Makefile             |     3 +++
 drivers/md/Kconfig              |     9 +++++++++

 11 files changed, 5286 insertions(+), 3 deletions(-)


 .../{dm-region-hash.h.orig => dm-region-hash.h}    |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/dm-region-hash.h.orig b/include/linux/dm-region-hash.h
index a9e652a..bfd21cb 100644
--- a/include/linux/dm-region-hash.h.orig
+++ b/include/linux/dm-region-hash.h
@@ -49,6 +49,7 @@  struct dm_dirty_log *dm_rh_dirty_log(struct dm_region_hash *rh);
  */
 region_t dm_rh_bio_to_region(struct dm_region_hash *rh, struct bio *bio);
 sector_t dm_rh_region_to_sector(struct dm_region_hash *rh, region_t region);
+region_t dm_rh_sector_to_region(struct dm_region_hash *rh, sector_t sector);
 void *dm_rh_region_context(struct dm_region *reg);
 
 /*
@@ -72,11 +73,14 @@  void dm_rh_update_states(struct dm_region_hash *rh, int errors_handled);
 int dm_rh_flush(struct dm_region_hash *rh);
 
 /* Inc/dec pending count on regions. */
+void dm_rh_inc(struct dm_region_hash *rh, region_t region);
 void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios);
 void dm_rh_dec(struct dm_region_hash *rh, region_t region);
 
 /* Delay bios on regions. */
 void dm_rh_delay(struct dm_region_hash *rh, struct bio *bio);
+void dm_rh_delay_by_region(struct dm_region_hash *rh, struct bio *bio,
+			   region_t region);
 
 void dm_rh_mark_nosync(struct dm_region_hash *rh,
 		       struct bio *bio, unsigned done, int error);
 .../md/{dm-region-hash.c.orig => dm-region-hash.c} |   21 +++++++++++++++++--
 1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-region-hash.c.orig b/drivers/md/dm-region-hash.c
index 7b899be..47b088b 100644
--- a/drivers/md/dm-region-hash.c.orig
+++ b/drivers/md/dm-region-hash.c
@@ -107,10 +107,11 @@  struct dm_region {
 /*
  * Conversion fns
  */
-static region_t dm_rh_sector_to_region(struct dm_region_hash *rh, sector_t sector)
+region_t dm_rh_sector_to_region(struct dm_region_hash *rh, sector_t sector)
 {
 	return sector >> rh->region_shift;
 }
+EXPORT_SYMBOL_GPL(dm_rh_sector_to_region);
 
 sector_t dm_rh_region_to_sector(struct dm_region_hash *rh, region_t region)
 {
@@ -488,7 +489,7 @@  void dm_rh_update_states(struct dm_region_hash *rh, int errors_handled)
 }
 EXPORT_SYMBOL_GPL(dm_rh_update_states);
 
-static void rh_inc(struct dm_region_hash *rh, region_t region)
+void dm_rh_inc(struct dm_region_hash *rh, region_t region)
 {
 	struct dm_region *reg;
 
@@ -510,13 +511,14 @@  static void rh_inc(struct dm_region_hash *rh, region_t region)
 
 	read_unlock(&rh->hash_lock);
 }
+EXPORT_SYMBOL_GPL(dm_rh_inc);
 
 void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios)
 {
 	struct bio *bio;
 
 	for (bio = bios->head; bio; bio = bio->bi_next)
-		rh_inc(rh, dm_rh_bio_to_region(rh, bio));
+		dm_rh_inc(rh, dm_rh_bio_to_region(rh, bio));
 }
 EXPORT_SYMBOL_GPL(dm_rh_inc_pending);
 
@@ -677,6 +679,19 @@  void dm_rh_delay(struct dm_region_hash *rh, struct bio *bio)
 }
 EXPORT_SYMBOL_GPL(dm_rh_delay);
 
+void dm_rh_delay_by_region(struct dm_region_hash *rh,
+			   struct bio *bio, region_t region)
+{
+	struct dm_region *reg;
+
+	/* FIXME: locking. */
+	read_lock(&rh->hash_lock);
+	reg = __rh_find(rh, region);
+	bio_list_add(&reg->delayed_bios, bio);
+	read_unlock(&rh->hash_lock);
+}
+EXPORT_SYMBOL_GPL(dm_rh_delay_by_region);
+
 void dm_rh_stop_recovery(struct dm_region_hash *rh)
 {
 	int i;