
[1/6] dm raid45 target: export region hash functions and add a needed one

Message ID 1246539130.20207.12.camel@o (mailing list archive)
State Superseded, archived
Delegated to: Jonathan Brassow

Commit Message

Heinz Mauelshagen July 2, 2009, 12:52 p.m. UTC
On Mon, 2009-06-22 at 21:10 +0200, Heinz Mauelshagen wrote:
> On Sun, 2009-06-21 at 22:06 +1000, Neil Brown wrote:
> > On Friday June 19, heinzm@redhat.com wrote:
> > > On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote:
> > > > On Wednesday June 17, neilb@suse.de wrote:
> > > > > 
> > > > > I will try to find time to review your dm-raid5 code with a view to
> > > > > understanding how it plugs in to dm, and then how the md/raid5 engine
> > > > > can be used by dm-raid5.
> > > 
> > > Hi Neil.
> > > 
> > > > 
> > > > I've had a bit of a look through the dm-raid5 patches.
> > > 
> > > Thanks.
> > > 
> > > > 
> > > > Some observations:
> > > > 
> > > > - You have your own 'xor' code against which you do a run-time test of
> > > >   the 'xor_block' code which md/raid5 uses - then choose the fastest.
> > > >   This really should not be necessary.  If you have xor code that runs
> > > >   faster than anything in xor_block, it really would be best to submit
> > > >   it for inclusion in the common xor code base.
> > > 
> > > This is in because it actually shows better performance regularly by
> > > utilizing cache lines etc. more efficiently (tested on Intel, AMD and
> > > Sparc).
> > > 
> > > If xor_block had always performed best, I'd have dropped that
> > > optimization already.
<SNIP>

Dan, Neil,

As mentioned before I left for LinuxTag last week, here is an initial
take on dm-raid45 warm/cold CPU cache xor speed optimization metrics.

This should give us the basis for deciding whether to keep or drop the
dm-raid45-internal xor optimization magic, or to move (part of) it into
the crypto subsystem.

Heinz


Howto:
------
To dm-raid45.c I added a loop in xor_optimize() that walks the list of
recovery stripes, to allow overcommitting the cache, plus some variables
to display the absolute minimum and maximum xor runs performed and the
number of xor runs achieved per cycle, both for xor_blocks() and for the
dm-raid45 built-in xor optimization.

To make the results more deterministic, I run xor_speed() for <= 5 ticks.

See the attached diff against the dm-devel dm-raid45 patch (submitted Jun 15th).
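
In essence, the measurement core now looks like this (condensed from the
attached diff; the walk over the recovery stripes and the min/max
bookkeeping are omitted):

	/* Take the best count of xor runs over 5 tick-long periods. */
	static unsigned xor_speed(struct stripe *stripe)
	{
		int ticks = 5;
		unsigned r = 0;

		while (ticks--) {
			unsigned xors = 0;
			unsigned long j = jiffies;

			/* Count complete xor runs until the jiffy flips. */
			while (j == jiffies) {
				mb();
				common_xor(stripe, stripe->io.size, 0, 0);
				mb();
				xors++;
				mb();
			}

			if (xors > r)
				r = xors;
		}

		return r;
	}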

Tests were performed on the following 2 systems:

   hostname: a4
   2.6.31-rc1@250HZ timer frequency
   Core i7 920@3.4GHz, 8 MB 3rd Level Cache
   6GB RAM

   hostname: t4
   2.6.31-rc1@250HZ timer frequency
   2-CPU Opteron 280@2.4GHz, 2*1 MB 2nd Level Cache
   2GB RAM

with the xor optimization being the only load on the systems.


I performed test runs on each of these with the following mapping tables,
128 iterations each, representing a small array case with 3 drives per
set and running the xor optimization on a single core:

Intel:
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 512 10 nosync 1  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0
...
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 512 10 nosync 13  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0

Opteron:
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 256 10 nosync 1  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0
...
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 256 10 nosync 13  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0


Because no actual IO is being performed, I just mapped to error targets
(table used: "0 2199023255552 error"; I know it's large, but that doesn't matter).

The number following the 2nd nosync parameter is the number of recovery
stripes, with an io size of 512 sectors = 256 kilobytes per chunk or
256 sectors = 128 kilobytes per chunk respectively,
i.e. a working set of 768/384 kilobytes per recovery stripe.
These values are meant to ensure that results differ in the per mille
range (i.e. more than 100 cycles per test run) where appropriate.


The systems run out of cache at
~ >= 8 stripes on the Intel ((8192 - 2048 code) / (512 / 2) / 3)
and
~ >= 0 stripes (i.e. immediately) on the Opteron system
((1024 - 768 code) / (256 / 2) / 3),
assuming some cache utilization for code and other data.
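
Spelled out, with a 3-chunk working set per stripe:

	Intel  : (8192 - 2048) KB / (256 KB/chunk * 3 chunks) = 6144 / 768 = 8 stripes
	Opteron: (1024 -  768) KB / (128 KB/chunk * 3 chunks) =  256 / 384 ~ 0.7 stripes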

See the raw kernel log extracts created by these test runs attached in a
tarball, together with the script used to extract the metrics.


Intel results with 128 iterations each:
---------------------------------------

1 stripe  : NB:10 111/80 HM:118 111/82
2 stripes : NB:25 113/87 HM:103 112/91
3 stripes : NB:24 115/93 HM:104 114/93
4 stripes : NB:48 114/93 HM:80 114/93
5 stripes : NB:38 113/94 HM:90 114/94
6 stripes : NB:25 116/94 HM:103 114/94
7 stripes : NB:25 115/95 HM:103 115/95
8 stripes : NB:62 117/96 HM:66 116/95 <<<--- cold cache starts here
9 stripes : NB:66 117/96 HM:62 116/95
10 stripes: NB:73 117/96 HM:55 114/95
11 stripes: NB:63 114/96 HM:65 112/95
12 stripes: NB:51 111/96 HM:77 110/95
13 stripes: NB:65 109/96 HM:63 112/95

NB: number of xor_blocks() parity calculations winning per 128 iterations
HM: number of dm-raid45 xor() parity calculations equal to/winning
    xor_blocks() per 128 iterations
NN/MM: count of maximum/minimum calculations achieved per iteration in <= 5 ticks.

Opteron results with 128 iterations each:
-----------------------------------------
1 stripe  : NB:0 30/20 HM:128 64/53
2 stripes : NB:0 31/21 HM:128 68/55
3 stripes : NB:0 31/22 HM:128 68/57
4 stripes : NB:0 32/22 HM:128 70/61
5 stripes : NB:0 32/22 HM:128 70/63
6 stripes : NB:0 35/22 HM:128 70/64
7 stripes : NB:0 32/23 HM:128 69/63
8 stripes : NB:0 44/23 HM:128 76/65
9 stripes : NB:0 43/23 HM:128 73/65
10 stripes: NB:0 35/23 HM:128 72/64
11 stripes: NB:0 35/24 HM:128 72/64
12 stripes: NB:0 33/24 HM:128 72/65
13 stripes: NB:0 33/23 HM:128 71/64


Test analysis:
--------------
I must have done something wrong ;-)

On the Opteron, dm-raid45 xor() outperforms xor_blocks() by far.
No significant warm cache effect is visible.

On the Intel, dm-raid45 xor() performs slightly better with a warm cache,
while xor_blocks() performs slightly better with a cold cache, which may
be the result of the lack of prefetching in dm-raid45 xor().
xor_blocks() achieves a slightly better maximum in 8 of 13 test runs vs.
xor() in 2 test runs; in 3 runs they achieve the same maximum.

The results are not deterministic:
min/max vary by more than 200% on the Opteron
and by up to 46% on the Intel.


Questions/Recommendations:
--------------------------
Please review the code changes and the data analysis.

Please review the test cases and argue whether they are valid,
or recommend different ones.

Can we make this more deterministic (e.g. by using prefetching in
dm-raid45 xor())?
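
For instance, something along these lines might help (an untested sketch
only, using the kernel's <linux/prefetch.h> hints; the xor_2_prefetch()
name and the 8-longs-per-cacheline assumption are illustrative, not taken
from the patch):

	#include <linux/prefetch.h>

	/*
	 * Untested sketch: xor one source into the destination while
	 * prefetching the next cache line of each (64-byte lines assumed,
	 * i.e. 8 unsigned longs on x86_64).
	 */
	static void xor_2_prefetch(unsigned long *d, unsigned long *s,
				   unsigned n_longs)
	{
		unsigned i;

		for (i = 0; i < n_longs; i += 8) {
			prefetchw(d + i + 8);	/* next destination line */
			prefetch(s + i + 8);	/* next source line */
			d[i + 0] ^= s[i + 0]; d[i + 1] ^= s[i + 1];
			d[i + 2] ^= s[i + 2]; d[i + 3] ^= s[i + 3];
			d[i + 4] ^= s[i + 4]; d[i + 5] ^= s[i + 5];
			d[i + 6] ^= s[i + 6]; d[i + 7] ^= s[i + 7];
		}
	}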

Regards,
Heinz
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

NeilBrown July 6, 2009, 3:21 a.m. UTC | #1
On Thursday July 2, heinzm@redhat.com wrote:
> 
> Dan, Neil,
> 
> As mentioned before I left for LinuxTag last week, here is an initial
> take on dm-raid45 warm/cold CPU cache xor speed optimization metrics.
> 
> This should give us the basis for deciding whether to keep or drop the
> dm-raid45-internal xor optimization magic, or to move (part of) it into
> the crypto subsystem.

Thanks for doing this.
> 
> 
> Intel results with 128 iterations each:
> ---------------------------------------
> 
> 1 stripe  : NB:10 111/80 HM:118 111/82
> 2 stripes : NB:25 113/87 HM:103 112/91
> 3 stripes : NB:24 115/93 HM:104 114/93
> 4 stripes : NB:48 114/93 HM:80 114/93
> 5 stripes : NB:38 113/94 HM:90 114/94
> 6 stripes : NB:25 116/94 HM:103 114/94
> 7 stripes : NB:25 115/95 HM:103 115/95
> 8 stripes : NB:62 117/96 HM:66 116/95 <<<--- cold cache starts here
> 9 stripes : NB:66 117/96 HM:62 116/95
> 10 stripes: NB:73 117/96 HM:55 114/95
> 11 stripes: NB:63 114/96 HM:65 112/95
> 12 stripes: NB:51 111/96 HM:77 110/95
> 13 stripes: NB:65 109/96 HM:63 112/95

These results seem to suggest that the two different routines provide
very similar results on this hardware, particularly when the cache is cold.
The high degree of variability might be because you have dropped this:

> -	/* Wait for next tick. */
> -	for (j = jiffies; j == jiffies; )
> -		;

??
Without that, it could be running the test over anything from 4 to 5
jiffies.
I note that do_xor_speed in crypto/xor.c doesn't synchronise at the
start either.  I think that is a bug.
The variability seems to generally be close to 20%, which is consistent
with the difference between 4 and 5.
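
Concretely, something like this (a sketch merging the removed
synchronisation back into the new per-tick loop):

	/* Wait for the next tick so the timed loop below always spans
	 * one full tick period rather than a random fraction of one. */
	for (j = jiffies; j == jiffies; )
		;

	for (j = jiffies; j == jiffies; ) {
		mb();
		common_xor(stripe, stripe->io.size, 0, 0);
		mb();
		xors++;
		mb();
	}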

Could you put that loop back in and re-test?

> 
> Opteron results with 128 iterations each:
> -----------------------------------------
> 1 stripe  : NB:0 30/20 HM:128 64/53
> 2 stripes : NB:0 31/21 HM:128 68/55
> 3 stripes : NB:0 31/22 HM:128 68/57
> 4 stripes : NB:0 32/22 HM:128 70/61
> 5 stripes : NB:0 32/22 HM:128 70/63
> 6 stripes : NB:0 35/22 HM:128 70/64
> 7 stripes : NB:0 32/23 HM:128 69/63
> 8 stripes : NB:0 44/23 HM:128 76/65
> 9 stripes : NB:0 43/23 HM:128 73/65
> 10 stripes: NB:0 35/23 HM:128 72/64
> 11 stripes: NB:0 35/24 HM:128 72/64
> 12 stripes: NB:0 33/24 HM:128 72/65
> 13 stripes: NB:0 33/23 HM:128 71/64

Here your code seems to be 2-3 times faster!
Can you check which function xor_block is using?
If it is :
  xor: automatically using best checksumming function: ....
then it might be worth disabling that test in calibrate_xor_blocks and
see if it picks one that ends up being faster.
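
With the 2.6.31 layout of calibrate_xor_blocks() in crypto/xor.c, that
would be roughly this (a sketch; it assumes XOR_SELECT_TEMPLATE is what
short-circuits the benchmark on this arch):

	fastest = NULL;
	#ifdef XOR_SELECT_TEMPLATE
		/* fastest = XOR_SELECT_TEMPLATE(fastest); */	/* disabled for testing */
	#endif

	/* With fastest left NULL, the code falls through to
	 * "xor: measuring software checksum speed" and benchmarks
	 * all templates via XOR_TRY_TEMPLATES. */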

There is still the fact that by using the cache for data that will be
accessed once, we are potentially slowing down the rest of the system.
i.e. the reason to avoid the cache is not just because it won't
benefit the xor much, but because it will hurt other users.
I don't know how to measure that effect :-(
But if avoiding the cache makes xor 1/3 the speed of using the cache
even though it is cold, then it would be hard to justify not using the
cache I think.

> 
> Questions/Recommendations:
> --------------------------
> Review the code changes and the data analysis please.

It seems to mostly make sense
 - the 'wait for next tick' should stay
 - it would be interesting to see what the final choice of 'chunks'
   was (i.e. how many to xor together at a time).
  

Thanks!

NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Doug Ledford July 7, 2009, 6:38 p.m. UTC | #2
On Jul 5, 2009, at 11:21 PM, Neil Brown wrote:
> Here your code seems to be 2-3 times faster!
> Can you check which function xor_block is using?
> If it is :
>  xor: automatically using best checksumming function: ....
> then it might be worth disabling that test in calibrate_xor_blocks and
> see if it picks one that ends up being faster.
>
> There is still the fact that by using the cache for data that will be
> accessed once, we are potentially slowing down the rest of the system.
> i.e. the reason to avoid the cache is not just because it won't
> benefit the xor much, but because it will hurt other users.
> I don't know how to measure that effect :-(
> But if avoiding the cache makes xor 1/3 the speed of using the cache
> even though it is cold, then it would be hard to justify not using the
> cache I think.

So, Heinz and I are actually both looking at xor speed issues, but  
from two different perspectives.  While he's comparing some of the  
dm-raid45 xor stuff to the xor_blocks routine in crypto/, I'm  
specifically looking at that "automatically using best checksumming  
function" routine.  For the last 9 or so years, we've automatically  
opted for the SSE + non-temporal store routine specifically because  
it's not supposed to pollute cache.  However, after even just a  
cursory reading of the current Intel architecture optimization guide,  
it's obvious that our SSE routine is getting rather aged, and I think  
the routine is in serious need of an overhaul.  This is something I'm  
currently looking into.  But, that raises the question of how to  
decide whether or not to use it, either in its current form or any new  
form it might take.  As you point out, the tradeoff between cache  
polluting and non-cache polluting is hard to quantify.

We made a significant error when we originally wrote the SSE routines,  
and Heinz just duplicated it.  Specifically, we tested performance on  
a quiescent system.  For the SSE routines, I think this is a *major*  
error.  The prefetch instructions need to be timed such that the  
prefetch happens at roughly the right point in time to compensate for  
the memory latency in getting the data to L1/L2 cache prior to use by  
the CPU.  Unfortunately, memory latency in a system that is quiescent  
is drastically different than latency in a system with several CPUs  
actively competing for RAM resources on top of 100MB/s+ of DMA  
traffic, etc.  When we optimized the routines in a quiescent state, I  
think we got our prefetches too close to when the data was needed by  
the CPU under real world use conditions and that's impacting the  
operation of the routines today (or maybe we did get it right, but  
changes in CPU speed relative to memory latency have caused the best  
prefetch point to change over time; either way, the current SSE xor  
routine appears to be seriously underperforming in my benchmark tests).

Likewise, Heinz's tests were comparing cold cache to hot cache and  
trying to find a break over point where we switch from one to the  
other.  But that question necessarily depends on other factors in the  
system including what other cores on the same die are doing as that  
impacts the same cache.

So if the error was to not test and optimize these routines under  
load, then the right course of action would be to do the opposite.   
And that leads me to believe that the best way to quantify the  
difference between cache polluting and non-cache polluting should  
likewise not be done on a quiescent system with a micro benchmark.   
Instead, we need a holistic performance test to get the truly best xor  
algorithm.  In my current setup, the disks are so much faster than the  
single threaded xor thread that the bottleneck is the xor speed.  So,  
what does it matter if the xor routine doesn't pollute cache if the  
raid is so slow that programs are stuck in I/O wait all the time as  
the raid5 thread runs non-stop?  Likewise, who cares what the top  
speed of a cache polluting xor routine is if in the process it evicts  
so many cache pages belonging to the processes doing real work on the  
system that now cache reload becomes the bottleneck.  The ultimate  
goal of either approach is overall *system* speed, not micro benchmark  
speed.  I would suggest a specific, system wide workload test that  
involves a filesystem on a device that uses the particular raid level  
and parity routine you want to test, and then you need to run that  
system workload and get a total time required to perform that specific  
work set, CPU time versus idle+I/O wait time in completing that work  
set, etc.  Repeat the test for the various algorithms you wish to  
test, then analyze the results and go from there.  I don't think  
you're going to get a valid run time test for this, instead we would  
likely need to create a few heuristic rules that, combined with  
specific CPU properties, cause us to choose the right routine for the  
machine.

--

Doug Ledford <dledford@redhat.com>

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Heinz Mauelshagen July 8, 2009, 6:56 p.m. UTC | #3
On Mon, 2009-07-06 at 13:21 +1000, Neil Brown wrote:
> On Thursday July 2, heinzm@redhat.com wrote:
> > 
> > Dan, Neil,

Hi,

back after > 4 days of Internet outage caused by lightning :-(

I'll respond to Neil's comments here in order to have a comparable
microbenchmark based on his recommended change
(and one bug I fixed; see below).

> > 
> > As mentioned before I left for LinuxTag last week, here is an initial
> > take on dm-raid45 warm/cold CPU cache xor speed optimization metrics.
> > 
> > This should give us the basis for deciding whether to keep or drop the
> > dm-raid45-internal xor optimization magic, or to move (part of) it into
> > the crypto subsystem.
> 
> Thanks for doing this.

You're welcome.

> > 
> > 
> > Intel results with 128 iterations each:
> > ---------------------------------------
> > 
> > 1 stripe  : NB:10 111/80 HM:118 111/82
> > 2 stripes : NB:25 113/87 HM:103 112/91
> > 3 stripes : NB:24 115/93 HM:104 114/93
> > 4 stripes : NB:48 114/93 HM:80 114/93
> > 5 stripes : NB:38 113/94 HM:90 114/94
> > 6 stripes : NB:25 116/94 HM:103 114/94
> > 7 stripes : NB:25 115/95 HM:103 115/95
> > 8 stripes : NB:62 117/96 HM:66 116/95 <<<--- cold cache starts here
> > 9 stripes : NB:66 117/96 HM:62 116/95
> > 10 stripes: NB:73 117/96 HM:55 114/95
> > 11 stripes: NB:63 114/96 HM:65 112/95
> > 12 stripes: NB:51 111/96 HM:77 110/95
> > 13 stripes: NB:65 109/96 HM:63 112/95
> 
> These results seem to suggest that the two different routines provide
> very similar results on this hardware, particularly when the cache is cold.
> The high degree of variability might be because you have dropped this:
> 
> > -	/* Wait for next tick. */
> > -	for (j = jiffies; j == jiffies; )
> > -		;
> ??
> Without that, it could be running the test over anything from 4 to 5
> jiffies.
> I note that do_xor_speed in crypto/xor.c doesn't synchronise at the
> start either.  I think that is a bug.
> The variability seem to generally be close to 20%, which is consistent
> with the difference between 4 and 5.
> 
> Could you put that loop back in and re-test?
> 

I reintroduced the loop and reran the tests.

In addition to that, I fixed a flaw which led to
dm-raid45.c:xor_optimize() running xor_speed() with chunks > raid
devices, which makes no sense and led to longer test runs and
erroneous chunk values (e.g. 7 when only 3 raid devices are configured).
Hence we could end up with an algorithm claiming it was selected
for more chunks than there are raid devices.
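
The fix amounts to clamping the chunk count (a sketch; the actual change
is not part of the attached diff, though rs->xor.chunks and
rs->set.raid_devs are the fields involved):

	/* Never benchmark xor() with more chunks than the set has devices. */
	if (rs->xor.chunks > rs->set.raid_devs)
		rs->xor.chunks = rs->set.raid_devs;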

Here's the new results:

Intel Core i7:
--------------
1 stripe  : NB:54 114/94 HM:74 113/93
2 stripes : NB:57 116/94 HM:71 115/94
3 stripes : NB:64 115/94 HM:64 114/94
4 stripes : NB:51 112/94 HM:77 114/94
5 stripes : NB:77 115/94 HM:51 114/94
6 stripes : NB:25 111/89 HM:103 105/90
7 stripes : NB:13 105/91 HM:115 111/90
8 stripes : NB:27 108/92 HM:101 111/93
9 stripes : NB:29 113/92 HM:99 114/93
10 stripes: NB:41 110/92 HM:87 112/93
11 stripes: NB:34 105/92 HM:94 107/93
12 stripes: NB:51 114/93 HM:77 114/93
13 stripes: NB:54 115/94 HM:74 114/93
14 stripes: NB:64 115/94 HM:64 114/93


AMD Opteron:
------------
1 stripe  : NB:0 25/17 HM:128 48/38
2 stripes : NB:0 24/18 HM:128 46/36
3 stripes : NB:0 25/18 HM:128 47/37
4 stripes : NB:0 27/19 HM:128 48/41
5 stripes : NB:0 30/18 HM:128 49/40
6 stripes : NB:0 27/19 HM:128 49/40
7 stripes : NB:0 29/18 HM:128 49/39
8 stripes : NB:0 26/19 HM:128 49/40
9 stripes : NB:0 28/19 HM:128 51/41
10 stripes: NB:0 28/18 HM:128 50/41
11 stripes: NB:0 31/19 HM:128 49/40
12 stripes: NB:0 28/19 HM:128 50/40
13 stripes: NB:0 26/19 HM:128 50/40
14 stripes: NB:0 27/20 HM:128 49/40


Still too much variability...



> > 
> > Opteron results with 128 iterations each:
> > -----------------------------------------
> > 1 stripe  : NB:0 30/20 HM:128 64/53
> > 2 stripes : NB:0 31/21 HM:128 68/55
> > 3 stripes : NB:0 31/22 HM:128 68/57
> > 4 stripes : NB:0 32/22 HM:128 70/61
> > 5 stripes : NB:0 32/22 HM:128 70/63
> > 6 stripes : NB:0 35/22 HM:128 70/64
> > 7 stripes : NB:0 32/23 HM:128 69/63
> > 8 stripes : NB:0 44/23 HM:128 76/65
> > 9 stripes : NB:0 43/23 HM:128 73/65
> > 10 stripes: NB:0 35/23 HM:128 72/64
> > 11 stripes: NB:0 35/24 HM:128 72/64
> > 12 stripes: NB:0 33/24 HM:128 72/65
> > 13 stripes: NB:0 33/23 HM:128 71/64
> 
> Here your code seems to be 2-3 times faster!
> Can you check which function xor_block is using?
> If it is :
>   xor: automatically using best checksumming function: ....
> then it might be worth disabling that test in calibrate_xor_blocks and
> see if it picks one that ends up being faster.

xor_block picks the same generic_sse function on both archs, whether
selected automatically or measured, with obvious variability:

[37414.875236] xor: automatically using best checksumming function: generic_sse
[37414.893930]    generic_sse: 12619.000 MB/sec
[37414.893932] xor: using function: generic_sse (12619.000 MB/sec)
[37445.679501] xor: measuring software checksum speed
[37445.696829]    generic_sse: 15375.000 MB/sec
[37445.696830] xor: using function: generic_sse (15375.000 MB/sec)


I'll get to Doug's recommendation to run loaded benchmarks tomorrow...

Heinz

> 
> There is still the fact that by using the cache for data that will be
> accessed once, we are potentially slowing down the rest of the system.
> i.e. the reason to avoid the cache is not just because it won't
> benefit the xor much, but because it will hurt other users.
> I don't know how to measure that effect :-(
> But if avoiding the cache makes xor 1/3 the speed of using the cache
> even though it is cold, then it would be hard to justify not using the
> cache I think.
> 
> > 
> > Questions/Recommendations:
> > --------------------------
> > Review the code changes and the data analysis please.
> 
> It seems to mostly make sense
>  - the 'wait for next tick' should stay
>  - it would be interesting to see what the final choice of 'chunks'
>    was (i.e. how many to xor together at a time).
>   
> 
> Thanks!
> 
> NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Heinz Mauelshagen July 10, 2009, 3:23 p.m. UTC | #4
On Tue, 2009-07-07 at 14:38 -0400, Doug Ledford wrote:
> On Jul 5, 2009, at 11:21 PM, Neil Brown wrote:
> > Here your code seems to be 2-3 times faster!
> > Can you check which function xor_block is using?
> > If it is :
> >  xor: automatically using best checksumming function: ....
> > then it might be worth disabling that test in calibrate_xor_blocks and
> > see if it picks one that ends up being faster.
> >
> > There is still the fact that by using the cache for data that will be
> > accessed once, we are potentially slowing down the rest of the system.
> > i.e. the reason to avoid the cache is not just because it won't
> > benefit the xor much, but because it will hurt other users.
> > I don't know how to measure that effect :-(
> > But if avoiding the cache makes xor 1/3 the speed of using the cache
> > even though it is cold, then it would be hard to justify not using the
> > cache I think.
> 
> So, Heinz and I are actually both looking at xor speed issues, but  
> from two different perspectives.  While he's comparing some of the  
> dmraid45 xor stuff to the xor_blocks routine in crypto/, I'm  
<SNIP>
> So if the error was to not test and optimize these routines under  
> load, then the right course of action would be to do the opposite.   
> And that leads me to believe that the best way to quantify the  
> difference between cache polluting and non-cache polluting should  
> likewise not be done on a quiescent system with a micro benchmark.   
> Instead, we need a holistic performance test to get the truly best xor  
> algorithm.  In my current setup, the disks are so much faster than the  
> single threaded xor thread that the bottleneck is the xor speed.  So,  
> what does it matter if the xor routine doesn't pollute cache if the  
> raid is so slow that programs are stuck in I/O wait all the time as  
> the raid5 thread runs non-stop?  Likewise, who cares what the top  
> speed of a cache polluting xor routine is if in the process it evicts  
> so many cache pages belonging to the processes doing real work on the  
> system that now cache reload becomes the bottleneck.  The ultimate  
> goal of either approach is overall *system* speed, not micro benchmark  
> speed.  I would suggest a specific, system wide workload test that  
> involves a filesystem on a device that uses the particular raid level  
> and parity routine you want to test, and then you need to run that  
> system workload and get a total time required to perform that specific  
> work set, CPU time versus idle+I/O wait time in completing that work  
> set, etc.  Repeat the test for the various algorithms you wish to  
> test, then analyze the results and go from there.  I don't think  
> you're going to get a valid run time test for this, instead we would  
> likely need to create a few heuristic rules that, combined with  
> specific CPU properties, cause us to choose the right routine for the  
> machine.

Doug,

I extended dm-raid45's message interface to support changing the xor
algorithm and # of chunks, allowing the algorithm in use to be changed
at runtime.
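
The handler boils down to something like this (a hypothetical sketch, not
the actual code; struct raid_set, xor_funcs[] and the rs->xor fields are
as in the dm-raid45 source, the rest is illustrative):

	/* Handle "dmsetup message <dev> 0 xor <algorithm> <#chunks>". */
	static int raid_xor_message(struct dm_target *ti,
				    unsigned argc, char **argv)
	{
		struct raid_set *rs = ti->private;
		unsigned chunks, i;

		if (argc != 3 || strcmp(argv[0], "xor"))
			return -EINVAL;

		if (sscanf(argv[2], "%u", &chunks) != 1 ||
		    chunks < 2 || chunks > rs->set.raid_devs)
			return -EINVAL;

		/* Look the algorithm up by name and switch over. */
		for (i = 0; i < ARRAY_SIZE(xor_funcs); i++) {
			if (!strcmp(argv[1], xor_funcs[i].name)) {
				rs->xor.f = xor_funcs + i;
				rs->xor.chunks = chunks;
				return 0;
			}
		}

		return -EINVAL;
	}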

I used this to perform a bunch of write-intensive mkfs tests on the
Intel Core i7 system as an initial write load test case. The tests have
been run on 8 disks faked onto one SSD using LVM (~200 MB/s sustained
write throughput):

for a in xor_blocks
do
	for c in $(seq 2 6)
	do
		echo -e "$a $c\n---------------"
		dmsetup message r5 0 xor $a $c
		for i in $(seq 6); do
			time mkfs -t ext3 /dev/mapper/r5
		done
	done
done > xor_blocks.out 2>&1

for a in xor_8 xor_16 xor_32 xor_64
do
	for c in $(seq 2 8)
	do
		echo -e "$a $c\n---------------"
		dmsetup message r5 0 xor $a $c
		for i in $(seq 6)
		do
			time mkfs -t ext3 /dev/mapper/r5
		done
	done
done > xor_8-64.out 2>&1

Mapping table for r5:
0 146800640 raid45 core 2 8192 nosync  raid5_la 7 64 128 8 -1 10 nosync 1  8 -1 \
/dev/tst/raiddev_1 0 /dev/tst/raiddev_2 0 /dev/tst/raiddev_3 0 /dev/tst/raiddev_4 0 \
/dev/tst/raiddev_5 0 /dev/tst/raiddev_6 0 /dev/tst/raiddev_7 0 /dev/tst/raiddev_8 0

I attached filtered output files xor_blocks_1.txt and xor_8-64_1.txt,
which contain the time information for all the above algorithm/#chunks
settings.


Real time minima:

# egrep '^real' xor_blocks_1.txt|sort|head -1
real    0m14.508s
# egrep '^real' xor_8-64_1.txt|sort|head -1
real    0m14.430s


System time minima:

[root@a4 dm-tests]# egrep '^sys' xor_blocks_1.txt|sort|head -1
sys     0m0.460s
# egrep '^sys' xor_8-64_1.txt|sort|head -1
sys     0m0.444s

User time is negligible.


This mkfs test case indicates better performance for certain dm-raid45
xor() settings vs. xor_blocks(). I can get to dbench etc. after my
vacation in week 31.


Heinz


> 
> --
> 
> Doug Ledford <dledford@redhat.com>
> 
> GPG KeyID: CFBFF194
> http://people.redhat.com/dledford
> 
> InfiniBand Specific RPMS
> http://people.redhat.com/dledford/Infiniband
> 
> 
> 
>
xor_blocks 2
---------------
real	0m14.513s
user	0m0.000s
sys	0m0.568s
real	0m14.721s
user	0m0.012s
sys	0m0.476s
real	0m14.792s
user	0m0.016s
sys	0m0.568s
real	0m15.037s
user	0m0.008s
sys	0m0.512s
real	0m14.514s
user	0m0.016s
sys	0m0.564s
real	0m14.508s
user	0m0.024s
sys	0m0.512s
xor_blocks 3
---------------
real	0m14.786s
user	0m0.008s
sys	0m0.504s
real	0m14.538s
user	0m0.004s
sys	0m0.504s
real	0m14.738s
user	0m0.012s
sys	0m0.516s
real	0m14.704s
user	0m0.016s
sys	0m0.520s
real	0m14.767s
user	0m0.016s
sys	0m0.500s
real	0m14.510s
user	0m0.020s
sys	0m0.556s
xor_blocks 4
---------------
real	0m14.643s
user	0m0.004s
sys	0m0.536s
real	0m14.647s
user	0m0.032s
sys	0m0.512s
real	0m14.748s
user	0m0.020s
sys	0m0.552s
real	0m14.825s
user	0m0.024s
sys	0m0.520s
real	0m14.829s
user	0m0.008s
sys	0m0.512s
real	0m14.515s
user	0m0.004s
sys	0m0.536s
xor_blocks 5
---------------
real	0m14.764s
user	0m0.008s
sys	0m0.524s
real	0m14.593s
user	0m0.012s
sys	0m0.540s
real	0m14.783s
user	0m0.012s
sys	0m0.504s
real	0m14.632s
user	0m0.008s
sys	0m0.512s
real	0m14.806s
user	0m0.008s
sys	0m0.488s
real	0m14.780s
user	0m0.012s
sys	0m0.528s
xor_blocks 6
---------------
real	0m14.813s
user	0m0.012s
sys	0m0.512s
real	0m14.725s
user	0m0.008s
sys	0m0.524s
real	0m14.518s
user	0m0.016s
sys	0m0.460s
real	0m14.784s
user	0m0.028s
sys	0m0.548s
real	0m14.994s
user	0m0.012s
sys	0m0.516s
real	0m14.803s
user	0m0.012s
sys	0m0.512s
xor_8 2
---------------
real	0m14.518s
user	0m0.024s
sys	0m0.504s
real	0m14.611s
user	0m0.016s
sys	0m0.508s
real	0m14.838s
user	0m0.020s
sys	0m0.500s
real	0m14.837s
user	0m0.008s
sys	0m0.512s
real	0m14.652s
user	0m0.024s
sys	0m0.460s
real	0m14.954s
user	0m0.016s
sys	0m0.556s
xor_8 3
---------------
real	0m14.866s
user	0m0.004s
sys	0m0.560s
real	0m14.736s
user	0m0.008s
sys	0m0.560s
real	0m14.643s
user	0m0.012s
sys	0m0.444s
real	0m14.817s
user	0m0.012s
sys	0m0.556s
real	0m14.644s
user	0m0.008s
sys	0m0.496s
real	0m14.747s
user	0m0.008s
sys	0m0.568s
xor_8 4
---------------
real	0m14.504s
user	0m0.000s
sys	0m0.568s
real	0m14.889s
user	0m0.012s
sys	0m0.516s
real	0m14.813s
user	0m0.020s
sys	0m0.500s
real	0m14.781s
user	0m0.020s
sys	0m0.496s
real	0m14.657s
user	0m0.012s
sys	0m0.500s
real	0m14.810s
user	0m0.020s
sys	0m0.488s
xor_8 5
---------------
real	0m14.805s
user	0m0.016s
sys	0m0.524s
real	0m14.956s
user	0m0.024s
sys	0m0.520s
real	0m14.619s
user	0m0.012s
sys	0m0.468s
real	0m14.902s
user	0m0.008s
sys	0m0.484s
real	0m14.800s
user	0m0.008s
sys	0m0.512s
real	0m14.866s
user	0m0.008s
sys	0m0.516s
xor_8 6
---------------
real	0m14.834s
user	0m0.032s
sys	0m0.476s
real	0m14.661s
user	0m0.008s
sys	0m0.560s
real	0m14.809s
user	0m0.016s
sys	0m0.528s
real	0m14.828s
user	0m0.016s
sys	0m0.568s
real	0m14.801s
user	0m0.008s
sys	0m0.516s
real	0m14.811s
user	0m0.012s
sys	0m0.524s
xor_8 7
---------------
real	0m14.889s
user	0m0.020s
sys	0m0.520s
real	0m14.525s
user	0m0.012s
sys	0m0.548s
real	0m14.767s
user	0m0.008s
sys	0m0.560s
real	0m14.803s
user	0m0.012s
sys	0m0.584s
real	0m14.641s
user	0m0.016s
sys	0m0.608s
real	0m14.810s
user	0m0.016s
sys	0m0.500s
xor_8 8
---------------
real	0m14.719s
user	0m0.016s
sys	0m0.540s
real	0m14.825s
user	0m0.016s
sys	0m0.572s
real	0m14.842s
user	0m0.008s
sys	0m0.552s
real	0m14.811s
user	0m0.016s
sys	0m0.508s
real	0m14.518s
user	0m0.012s
sys	0m0.544s
real	0m14.768s
user	0m0.024s
sys	0m0.500s
xor_16 2
---------------
real	0m14.839s
user	0m0.008s
sys	0m0.576s
real	0m14.517s
user	0m0.020s
sys	0m0.528s
real	0m14.810s
user	0m0.008s
sys	0m0.532s
real	0m14.888s
user	0m0.028s
sys	0m0.520s
real	0m14.811s
user	0m0.012s
sys	0m0.544s
real	0m14.794s
user	0m0.012s
sys	0m0.472s
xor_16 3
---------------
real	0m14.766s
user	0m0.008s
sys	0m0.512s
real	0m14.809s
user	0m0.020s
sys	0m0.488s
real	0m14.582s
user	0m0.008s
sys	0m0.500s
real	0m14.767s
user	0m0.008s
sys	0m0.552s
real	0m14.899s
user	0m0.008s
sys	0m0.528s
real	0m14.812s
user	0m0.004s
sys	0m0.524s
xor_16 4
---------------
real	0m14.827s
user	0m0.004s
sys	0m0.528s
real	0m14.769s
user	0m0.008s
sys	0m0.588s
real	0m14.541s
user	0m0.012s
sys	0m0.572s
real	0m14.788s
user	0m0.016s
sys	0m0.592s
real	0m15.482s
user	0m0.004s
sys	0m0.568s
real	0m14.780s
user	0m0.020s
sys	0m0.524s
xor_16 5
---------------
real	0m14.686s
user	0m0.024s
sys	0m0.500s
real	0m14.782s
user	0m0.012s
sys	0m0.468s
real	0m14.802s
user	0m0.008s
sys	0m0.456s
real	0m14.896s
user	0m0.008s
sys	0m0.548s
real	0m14.821s
user	0m0.004s
sys	0m0.532s
real	0m14.806s
user	0m0.028s
sys	0m0.492s
xor_16 6
---------------
real	0m14.735s
user	0m0.004s
sys	0m0.576s
real	0m14.926s
user	0m0.024s
sys	0m0.564s
real	0m14.912s
user	0m0.016s
sys	0m0.528s
real	0m14.830s
user	0m0.016s
sys	0m0.492s
real	0m14.751s
user	0m0.020s
sys	0m0.524s
real	0m14.492s
user	0m0.012s
sys	0m0.500s
xor_16 7
---------------
real	0m14.821s
user	0m0.016s
sys	0m0.444s
real	0m14.714s
user	0m0.012s
sys	0m0.476s
real	0m14.956s
user	0m0.008s
sys	0m0.544s
real	0m14.755s
user	0m0.012s
sys	0m0.552s
real	0m14.605s
user	0m0.004s
sys	0m0.488s
real	0m14.750s
user	0m0.012s
sys	0m0.564s
xor_16 8
---------------
real	0m14.702s
user	0m0.012s
sys	0m0.460s
real	0m14.797s
user	0m0.012s
sys	0m0.472s
real	0m14.629s
user	0m0.016s
sys	0m0.572s
real	0m14.841s
user	0m0.012s
sys	0m0.488s
real	0m14.768s
user	0m0.020s
sys	0m0.472s
real	0m14.483s
user	0m0.008s
sys	0m0.532s
xor_32 2
---------------
real	0m19.783s
user	0m0.004s
sys	0m0.528s
real	0m14.670s
user	0m0.012s
sys	0m0.448s
real	0m14.913s
user	0m0.020s
sys	0m0.496s
real	0m14.816s
user	0m0.012s
sys	0m0.524s
real	0m14.874s
user	0m0.016s
sys	0m0.560s
real	0m14.815s
user	0m0.004s
sys	0m0.572s
xor_32 3
---------------
real	0m14.751s
user	0m0.016s
sys	0m0.512s
real	0m14.605s
user	0m0.008s
sys	0m0.508s
real	0m14.699s
user	0m0.004s
sys	0m0.576s
real	0m14.674s
user	0m0.004s
sys	0m0.512s
real	0m14.872s
user	0m0.012s
sys	0m0.540s
real	0m14.801s
user	0m0.024s
sys	0m0.504s
xor_32 4
---------------
real	0m14.780s
user	0m0.028s
sys	0m0.504s
real	0m14.802s
user	0m0.008s
sys	0m0.500s
real	0m14.624s
user	0m0.008s
sys	0m0.516s
real	0m14.779s
user	0m0.028s
sys	0m0.536s
real	0m14.953s
user	0m0.012s
sys	0m0.544s
real	0m14.571s
user	0m0.016s
sys	0m0.500s
xor_32 5
---------------
real	0m14.843s
user	0m0.008s
sys	0m0.544s
real	0m14.822s
user	0m0.016s
sys	0m0.540s
real	0m14.583s
user	0m0.016s
sys	0m0.520s
real	0m15.138s
user	0m0.008s
sys	0m0.508s
real	0m14.718s
user	0m0.012s
sys	0m0.548s
real	0m14.547s
user	0m0.012s
sys	0m0.552s
xor_32 6
---------------
real	0m14.744s
user	0m0.012s
sys	0m0.488s
real	0m14.856s
user	0m0.016s
sys	0m0.532s
real	0m14.717s
user	0m0.024s
sys	0m0.552s
real	0m14.777s
user	0m0.008s
sys	0m0.564s
real	0m14.761s
user	0m0.016s
sys	0m0.496s
real	0m14.706s
user	0m0.012s
sys	0m0.560s
xor_32 7
---------------
real	0m14.790s
user	0m0.004s
sys	0m0.568s
real	0m14.797s
user	0m0.016s
sys	0m0.488s
real	0m14.708s
user	0m0.012s
sys	0m0.512s
real	0m14.838s
user	0m0.016s
sys	0m0.512s
real	0m14.748s
user	0m0.008s
sys	0m0.476s
real	0m14.507s
user	0m0.008s
sys	0m0.512s
xor_32 8
---------------
real	0m15.055s
user	0m0.004s
sys	0m0.468s
real	0m14.839s
user	0m0.016s
sys	0m0.564s
real	0m14.551s
user	0m0.020s
sys	0m0.468s
real	0m14.789s
user	0m0.020s
sys	0m0.488s
real	0m14.495s
user	0m0.004s
sys	0m0.556s
real	0m14.852s
user	0m0.032s
sys	0m0.552s
xor_64 2
---------------
real	0m14.749s
user	0m0.028s
sys	0m0.472s
real	0m14.576s
user	0m0.016s
sys	0m0.544s
real	0m14.880s
user	0m0.004s
sys	0m0.496s
real	0m14.789s
user	0m0.016s
sys	0m0.588s
real	0m14.504s
user	0m0.020s
sys	0m0.568s
real	0m14.847s
user	0m0.016s
sys	0m0.548s
xor_64 3
---------------
real	0m14.812s
user	0m0.012s
sys	0m0.492s
real	0m23.521s
user	0m0.012s
sys	0m0.552s
real	0m14.580s
user	0m0.004s
sys	0m0.552s
real	0m14.711s
user	0m0.028s
sys	0m0.524s
real	0m14.817s
user	0m0.016s
sys	0m0.544s
real	0m14.773s
user	0m0.008s
sys	0m0.468s
xor_64 4
---------------
real	0m14.722s
user	0m0.008s
sys	0m0.516s
real	0m14.881s
user	0m0.008s
sys	0m0.520s
real	0m14.821s
user	0m0.012s
sys	0m0.520s
real	0m15.190s
user	0m0.020s
sys	0m0.456s
real	0m14.780s
user	0m0.016s
sys	0m0.448s
real	0m14.762s
user	0m0.004s
sys	0m0.564s
xor_64 5
---------------
real	0m14.688s
user	0m0.016s
sys	0m0.488s
real	0m14.559s
user	0m0.004s
sys	0m0.528s
real	0m14.829s
user	0m0.020s
sys	0m0.520s
real	0m14.818s
user	0m0.016s
sys	0m0.500s
real	0m14.812s
user	0m0.008s
sys	0m0.500s
real	0m14.804s
user	0m0.004s
sys	0m0.480s
xor_64 6
---------------
real	0m14.742s
user	0m0.024s
sys	0m0.476s
real	0m14.882s
user	0m0.020s
sys	0m0.528s
real	0m14.589s
user	0m0.012s
sys	0m0.512s
real	0m14.832s
user	0m0.004s
sys	0m0.504s
real	0m14.638s
user	0m0.012s
sys	0m0.444s
real	0m14.767s
user	0m0.008s
sys	0m0.536s
xor_64 7
---------------
real	0m14.790s
user	0m0.012s
sys	0m0.560s
real	0m14.749s
user	0m0.016s
sys	0m0.476s
real	0m14.430s
user	0m0.016s
sys	0m0.540s
real	0m14.694s
user	0m0.012s
sys	0m0.556s
real	0m14.567s
user	0m0.016s
sys	0m0.488s
real	0m14.753s
user	0m0.016s
sys	0m0.536s
xor_64 8
---------------
real	0m14.816s
user	0m0.008s
sys	0m0.544s
real	0m14.704s
user	0m0.020s
sys	0m0.516s
real	0m14.613s
user	0m0.012s
sys	0m0.548s
real	0m14.900s
user	0m0.008s
sys	0m0.532s
real	0m14.586s
user	0m0.012s
sys	0m0.464s
real	0m14.692s
user	0m0.016s
sys	0m0.520s
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Doug Ledford July 11, 2009, 12:44 p.m. UTC | #5
On Jul 10, 2009, at 11:23 AM, Heinz Mauelshagen wrote:
> Dough,
>
> I extended dm-raid45's message interface to support changing the xor
> algorithm and # of chunks, allowing for changes of the algorithm being
> used at runtime.

Very useful indeed.  I may send you some routines to be tested at some  
point in the future if you don't mind ;-)

> I used this to perform a bunch of write-intensive mkfs tests on the
> Intel Core i7 system as an initial write load test case. The tests have
> been run on 8 disks faked onto one SSD using LVM (~200 MB/s sustained
> write throughput):

That's a little slower than I think you need for a good test.  I'm not  
even sure I'm satisfied that my current SATA array is sufficient and I  
can get at least 500MB/s of write throughput to the disks using a  
raid0, possibly more if I can get a better eSATA port.

> for a in xor_blocks
> do
> 	for c in $(seq 2 6)
> 	do
> 		echo -e "$a $c\n---------------"
> 		dmsetup message r5 0 xor $a $c
> 		for i in $(seq 6); do
> 			time mkfs -t ext3 /dev/mapper/r5
> 		done
> 	done
> done > xor_blocks.out 2>&1
>
> for a in xor_8 xor_16 xor_32 xor_64
> do
> 	for c in $(seq 2 8)
> 	do
> 		echo -e "$a $c\n---------------"
> 		dmsetup message r5 0 xor $a $c
> 		for i in $(seq 6)
> 		do
> 			time mkfs -t ext3 /dev/mapper/r5
> 		done
> 	done
> done > xor_8-64.out 2>&1
>
> Mapping table for r5:
> 0 146800640 raid45 core 2 8192 nosync  raid5_la 7 64 128 8 -1 10 nosync 1  8 -1 \
> /dev/tst/raiddev_1 0 /dev/tst/raiddev_2 0 /dev/tst/raiddev_3 0 /dev/tst/raiddev_4 0 \
> /dev/tst/raiddev_5 0 /dev/tst/raiddev_6 0 /dev/tst/raiddev_7 0 /dev/tst/raiddev_8 0
>
> I attached filtered output files xor_blocks_1.txt and xor_8-64_1.txt,
> which contain the time information for all the above algorithm/#chunks
> settings.
>
>
> Real time minima:
>
> # egrep '^real' xor_blocks_1.txt|sort|head -1
> real    0m14.508s
> # egrep '^real' xor_8-64_1.txt|sort|head -1
> real    0m14.430s
>
>
> System time minima:
>
> [root@a4 dm-tests]# egrep '^sys' xor_blocks_1.txt|sort|head -1
> sys     0m0.460s
> # egrep '^sys' xor_8-64_1.txt|sort|head -1
> sys     0m0.444s
>
> User time is negligible.
>
>
> This mkfs test case indicates better performance for certain dm-raid45
> xor() settings vs. xor_blocks(). I can get to dbench etc. after my
> vacation in week 31.

Thanks.  This isn't too far off from what I would expect.  I would say  
that real world loads fall all along a spectrum from "create lots of  
writes, but do little to no work" to "does lots of work, and only  
sporadically writes".  It's the latter end of this spectrum that is  
most likely to be helped by the cache avoiding routines, while the  
former is not.  So, one of the tests I had in mind was to use  
something like timing a complete kernel build, or doing a database  
load/report cycle, or some other things like that.  Things that do  
actual work in the foreground while the raid is kept busy in the  
background.  Of course, testing all the various points along the  
spectrum is needed, so this test gets us the first.


--

Doug Ledford <dledford@redhat.com>

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Dan Williams July 12, 2009, 2:56 a.m. UTC | #6
On Sat, Jul 11, 2009 at 5:44 AM, Doug Ledford<dledford@redhat.com> wrote:
> Thanks.  This isn't too far off from what I would expect.  I would say that
> real world loads fall all along a spectrum from "create lots of writes, but
> do little to no work" to "does lots of work, and only sporadically writes".
>  It's the latter end of this spectrum that is most likely to be helped by the
> cache avoiding routines, while the former is not.  So, one of the tests I
> had in mind was to use something like timing a complete kernel build, or
> doing a database load/report cycle, or some other things like that.  Things
> that do actual work in the foreground while the raid is kept busy in the
> background.  Of course, testing all the various points along the spectrum is
> needed, so this test gets us the first.

This reminds me of the testing I did when quantifying the benefit of
hardware accelerated raid5.  I played with kernel builds while
resyncing to show the cache and cpu-utilization savings, but never got
that to settle on a solid number.  I took a look at Con Kolivas'
"contest" benchmark [1], but ultimately just published plain iozone
data [2].  The interesting bit is that cpu limited random writes saw
more throughput improvement than streaming writes because the i/o
processing did not need to compete with management of the stripe
cache.

--
Dan

[1]: http://users.on.net/~ckolivas/contest/
[2]: http://sourceforge.net/projects/xscaleiop/files/MD RAID
Acceleration/iop-iozone-graphs-20061010.tar.bz2/download

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Patch

 {drivers => /usr/src/linux/drivers}/md/dm-raid45.c |  139 ++++++++++++++------
 1 files changed, 98 insertions(+), 41 deletions(-)

diff --git a/drivers/md/dm-raid45.c b/drivers/md/dm-raid45.c
index 0c33fea..6ace975 100644
--- a/drivers/md/dm-raid45.c
+++ b/linux/drivers/md/dm-raid45.c
@@ -8,7 +8,28 @@ 
  *
  * Linux 2.6 Device Mapper RAID4 and RAID5 target.
  *
- * Supports:
+ * Tested-by: Intel; Marcin.Labun@intel.com, krzysztof.wojcik@intel.com
+ *
+ *
+ * Supports the following ATARAID vendor solutions (and SNIA DDF):
+ *
+ * 	Adaptec HostRAID ASR
+ * 	SNIA DDF1
+ * 	Hiphpoint 37x
+ * 	Hiphpoint 45x
+ *	Intel IMSM
+ *	Jmicron ATARAID
+ *	LSI Logic MegaRAID
+ *	NVidia RAID
+ *	Promise FastTrack
+ *	Silicon Image Medley
+ *	VIA Software RAID
+ *
+ * via the dmraid application.
+ *
+ *
+ * Features:
+ *
  *	o RAID4 with dedicated and selectable parity device
  *	o RAID5 with rotating parity (left+right, symmetric+asymmetric)
  *	o recovery of out of sync device for initial
@@ -37,7 +58,7 @@ 
  * ANALYZEME: recovery bandwidth
  */ 
 
-static const char *version = "v0.2594p";
+static const char *version = "v0.2596l";
 
 #include "dm.h"
 #include "dm-memcache.h"
@@ -101,9 +122,6 @@  static const char *version = "v0.2594p";
 /* Check value in range. */
 #define	range_ok(i, min, max)	(i >= min && i <= max)
 
-/* Check argument is power of 2. */
-#define POWER_OF_2(a) (!(a & (a - 1)))
-
 /* Structure access macros. */
 /* Derive raid_set from stripe_cache pointer. */
 #define	RS(x)	container_of(x, struct raid_set, sc)
@@ -1848,10 +1866,10 @@  struct xor_func {
 	xor_function_t f;
 	const char *name;
 } static xor_funcs[] = {
-	{ xor_8,   "xor_8"  },
-	{ xor_16,  "xor_16" },
-	{ xor_32,  "xor_32" },
 	{ xor_64,  "xor_64" },
+	{ xor_32,  "xor_32" },
+	{ xor_16,  "xor_16" },
+	{ xor_8,   "xor_8"  },
 	{ xor_blocks_wrapper, "xor_blocks" },
 };
 
@@ -3114,10 +3132,10 @@  static void _do_endios(struct raid_set *rs, struct stripe *stripe,
 		SetStripeReconstructed(stripe);
 
 		/* FIXME: reschedule to be written in case of read. */
-		// if (!RSDead && RSDegraded(rs) !StripeRBW(stripe)) {
-		// 	chunk_set(CHUNK(stripe, stripe->idx.recover), DIRTY);
-		// 	stripe_chunks_rw(stripe);
-		// }
+		/* if (!RSDead && RSDegraded(rs) !StripeRBW(stripe)) {
+			chunk_set(CHUNK(stripe, stripe->idx.recover), DIRTY);
+			stripe_chunks_rw(stripe);
+		} */
 
 		stripe->idx.recover = -1;
 	}
@@ -3257,7 +3275,7 @@  static void do_ios(struct raid_set *rs, struct bio_list *ios)
 		/* Check for recovering regions. */
 		sector = _sector(rs, bio);
 		r = region_state(rs, sector, DM_RH_RECOVERING);
-		if (unlikely(r && bio_data_dir(bio) == WRITE)) {
+		if (unlikely(r)) {
 			delay++;
 			/* Wait writing to recovering regions. */
 			dm_rh_delay_by_region(rh, bio,
@@ -3409,64 +3427,104 @@  static unsigned mbpers(struct raid_set *rs, unsigned speed)
 /*
  * Discover fastest xor algorithm and # of chunks combination.
  */
-/* Calculate speed for algorithm and # of chunks. */
+/* Calculate speed of particular algorithm and # of chunks. */
 static unsigned xor_speed(struct stripe *stripe)
 {
+	int ticks = 5;
 	unsigned r = 0;
-	unsigned long j;
 
-	/* Wait for next tick. */
-	for (j = jiffies; j == jiffies; )
-		;
+	/* Do xors for a few ticks. */
+	while (ticks--) {
+		unsigned xors = 0;
+		unsigned long j = jiffies;
+
+		while (j == jiffies) {
+			mb();
+			common_xor(stripe, stripe->io.size, 0, 0);
+			mb();
+			xors++;
+			mb();
+		}
 
-	/* Do xors for a full tick. */
-	for (j = jiffies; j == jiffies; ) {
-		mb();
-		common_xor(stripe, stripe->io.size, 0, 0);
-		mb();
-		r++;
+		if (xors > r)
+			r = xors;
 	}
 
 	return r;
 }
 
+/* Define for xor multi recovery stripe optimization runs. */
+#define DMRAID45_XOR_TEST
+
 /* Optimize xor algorithm for this RAID set. */
 static unsigned xor_optimize(struct raid_set *rs)
 {
-	unsigned chunks_max = 2, p = rs->set.raid_devs, speed_max = 0;
+	unsigned chunks_max = 2, p, speed_max = 0;
 	struct xor_func *f = ARRAY_END(xor_funcs), *f_max = NULL;
 	struct stripe *stripe;
+unsigned speed_hm = 0, speed_min = ~0, speed_xor_blocks = 0;
 
 	BUG_ON(list_empty(&rs->recover.stripes));
+#ifndef DMRAID45_XOR_TEST
 	stripe = list_first_entry(&rs->recover.stripes, struct stripe,
 				  lists[LIST_RECOVER]);
 
 	/* Must set uptodate so that xor() will belabour chunks. */
-	while (p--)
+	for (p = rs->set.raid_devs; p-- ;)
 		SetChunkUptodate(CHUNK(stripe, p));
+#endif
 
 	/* Try all xor functions. */
 	while (f-- > xor_funcs) {
 		unsigned speed;
 
-		/* Set actual xor function for common_xor(). */
-		rs->xor.f = f;
-		rs->xor.chunks = (f->f == xor_blocks_wrapper ?
-				  (MAX_XOR_BLOCKS + 1) : XOR_CHUNKS_MAX) + 1;
-
-		while (rs->xor.chunks-- > 2) {
-			speed = xor_speed(stripe);
-			if (speed > speed_max) {
-				speed_max = speed;
-				chunks_max = rs->xor.chunks;
-				f_max = f;
+#ifdef DMRAID45_XOR_TEST
+		list_for_each_entry(stripe, &rs->recover.stripes,
+				    lists[LIST_RECOVER]) {
+				for (p = rs->set.raid_devs; p-- ;)
+					SetChunkUptodate(CHUNK(stripe, p));
+#endif
+	
+			/* Set actual xor function for common_xor(). */
+			rs->xor.f = f;
+			rs->xor.chunks = (f->f == xor_blocks_wrapper ?
+					  (MAX_XOR_BLOCKS + 1) :
+					  XOR_CHUNKS_MAX) + 1;
+	
+			while (rs->xor.chunks-- > 2) {
+				speed = xor_speed(stripe);
+
+#ifdef DMRAID45_XOR_TEST
+				if (f->f == xor_blocks_wrapper) {
+					if (speed > speed_xor_blocks)
+						speed_xor_blocks = speed;
+				} else if (speed > speed_hm)
+					speed_hm = speed;
+	
+				if (speed < speed_min)
+					speed_min = speed;
+#endif
+
+				if (speed > speed_max) {
+					speed_max = speed;
+					chunks_max = rs->xor.chunks;
+					f_max = f;
+				}
 			}
+#ifdef DMRAID45_XOR_TEST
 		}
+#endif
 	}
 
-	/* Memorize optimum parameters. */
+	/* Memorize optimal parameters. */
 	rs->xor.f = f_max;
 	rs->xor.chunks = chunks_max;
+#ifdef DMRAID45_XOR_TEST
+	DMINFO("%s stripes=%u min=%u xor_blocks=%u hm=%u max=%u",
+	       speed_max == speed_hm ? "HM" : "NB",
+	       rs->recover.recovery_stripes, speed_min,
+	       speed_xor_blocks, speed_hm, speed_max);
+#endif
 	return speed_max;
 }
 
@@ -3786,7 +3844,7 @@  static int get_raid_variable_parms(struct dm_target *ti, char **argv,
 		  "Invalid recovery switch; must be \"sync\" or \"nosync\"" },
 		{ 0,
 		  "Invalid number of recovery stripes;"
-		  "must be -1, > 0 and <= 16384",
+		  "must be -1, > 0 and <= 64",
 		  RECOVERY_STRIPES_MIN, RECOVERY_STRIPES_MAX,
 		  &vp->recovery_stripes_parm, &vp->recovery_stripes, NULL },
 	}, *varp;
@@ -3831,7 +3889,7 @@  static int get_raid_variable_parms(struct dm_target *ti, char **argv,
 
 		if (sscanf(*(argv++), "%d", &value) != 1 ||
 		    (value != -1 &&
-		     ((varp->action && !POWER_OF_2(value)) ||
+		     ((varp->action && !is_power_of_2(value)) ||
 		      !range_ok(value, varp->min, varp->max))))
 			TI_ERR(varp->errmsg);