Message ID: 1246539130.20207.12.camel@o (mailing list archive)
State: Superseded, archived
Delegated to: Jonathan Brassow
On Thursday July 2, heinzm@redhat.com wrote:
>
> Dan, Neil,
>
> like mentioned before I left to LinuxTag last week, here comes an initial
> take on dm-raid45 warm/cold CPU cache xor speed optimization metrics.
>
> This shall give us the base to decide to keep or drop the dm-raid45
> internal xor optimization magic or move (part of) it into the crypto
> subsystem.

Thanks for doing this.

> Intel results with 128 iterations each:
> ---------------------------------------
>
> 1 stripe  : NB:10 111/80 HM:118 111/82
> 2 stripes : NB:25 113/87 HM:103 112/91
> 3 stripes : NB:24 115/93 HM:104 114/93
> 4 stripes : NB:48 114/93 HM:80  114/93
> 5 stripes : NB:38 113/94 HM:90  114/94
> 6 stripes : NB:25 116/94 HM:103 114/94
> 7 stripes : NB:25 115/95 HM:103 115/95
> 8 stripes : NB:62 117/96 HM:66  116/95  <<<--- cold cache starts here
> 9 stripes : NB:66 117/96 HM:62  116/95
> 10 stripes: NB:73 117/96 HM:55  114/95
> 11 stripes: NB:63 114/96 HM:65  112/95
> 12 stripes: NB:51 111/96 HM:77  110/95
> 13 stripes: NB:65 109/96 HM:63  112/95

These results seem to suggest that the two different routines provide very similar results on this hardware, particularly when the cache is cold.
The high degree of variability might be because you have dropped this:

> -	/* Wait for next tick. */
> -	for (j = jiffies; j == jiffies; )
> -		;

??
Without that, it could be running the test over anything from 4 to 5 jiffies.
I note that do_xor_speed in crypto/xor.c doesn't synchronise at the start either.  I think that is a bug.
The variability seems to generally be close to 20%, which is consistent with the difference between 4 and 5.

Could you put that loop back in and re-test?

> Opteron results with 128 iterations each:
> -----------------------------------------
>
> 1 stripe  : NB:0 30/20 HM:128 64/53
> 2 stripes : NB:0 31/21 HM:128 68/55
> 3 stripes : NB:0 31/22 HM:128 68/57
> 4 stripes : NB:0 32/22 HM:128 70/61
> 5 stripes : NB:0 32/22 HM:128 70/63
> 6 stripes : NB:0 35/22 HM:128 70/64
> 7 stripes : NB:0 32/23 HM:128 69/63
> 8 stripes : NB:0 44/23 HM:128 76/65
> 9 stripes : NB:0 43/23 HM:128 73/65
> 10 stripes: NB:0 35/23 HM:128 72/64
> 11 stripes: NB:0 35/24 HM:128 72/64
> 12 stripes: NB:0 33/24 HM:128 72/65
> 13 stripes: NB:0 33/23 HM:128 71/64

Here your code seems to be 2-3 times faster!
Can you check which function xor_block is using?  If it is:

  xor: automatically using best checksumming function: ....

then it might be worth disabling that test in calibrate_xor_blocks and see if it picks one that ends up being faster.

There is still the fact that by using the cache for data that will be accessed once, we are potentially slowing down the rest of the system.  i.e. the reason to avoid the cache is not just because it won't benefit the xor much, but because it will hurt other users.  I don't know how to measure that effect :-(
But if avoiding the cache makes xor 1/3 the speed of using the cache even though it is cold, then it would be hard to justify not using the cache, I think.

> Questions/Recommendations:
> --------------------------
> Review the code changes and the data analysis please.

It seems to mostly make sense
 - the 'wait for next tick' should stay
 - it would be interesting to see what the final choice of 'chunks'
   was (i.e. how many to xor together at a time).

Thanks!

NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Jul 5, 2009, at 11:21 PM, Neil Brown wrote:
> Here your code seems to be 2-3 times faster!
> Can you check which function xor_block is using?
> If it is :
> xor: automatically using best checksumming function: ....
> then it might be worth disabling that test in calibrate_xor_blocks and
> see if it picks one that ends up being faster.
>
> There is still the fact that by using the cache for data that will be
> accessed once, we are potentially slowing down the rest of the system.
> i.e. the reason to avoid the cache is not just because it won't
> benefit the xor much, but because it will hurt other users.
> I don't know how to measure that effect :-(
> But if avoiding the cache makes xor 1/3 the speed of using the cache
> even though it is cold, then it would be hard to justify not using the
> cache I think.

So, Heinz and I are actually both looking at xor speed issues, but from two different perspectives. While he's comparing some of the dm-raid45 xor stuff to the xor_blocks routine in crypto/, I'm specifically looking at that "automatically using best checksumming function" routine. For the last 9 or so years, we've automatically opted for the SSE + non-temporal store routine specifically because it's not supposed to pollute cache. However, after even just a cursory reading of the current Intel architecture optimization guide, it's obvious that our SSE routine is getting rather aged, and I think the routine is in serious need of an overhaul. This is something I'm currently looking into.

But that raises the question of how to decide whether or not to use it, either in its current form or any new form it might take. As you point out, the tradeoff between cache polluting and non-cache polluting is hard to quantify. We made a significant error when we originally wrote the SSE routines, and Heinz just duplicated it. Specifically, we tested performance on a quiescent system. For the SSE routines, I think this is a *major* error.

The prefetch instructions need to be timed such that the prefetch happens at roughly the right point in time to compensate for the memory latency in getting the data to L1/L2 cache prior to use by the CPU. Unfortunately, memory latency in a system that is quiescent is drastically different than latency in a system with several CPUs actively competing for RAM resources on top of 100MB/s+ of DMA traffic, etc. When we optimized the routines in a quiescent state, I think we got our prefetches too close to when the data was needed by the CPU under real-world use conditions, and that's impacting the operation of the routines today. (Or maybe we did get it right, but changes in CPU speed relative to memory latency have caused the best prefetch point to change over time; either way, the current SSE xor routine appears to be seriously underperforming in my benchmark tests.)

Likewise, Heinz's tests were comparing cold cache to hot cache and trying to find a break-over point where we switch from one to the other. But that question necessarily depends on other factors in the system, including what other cores on the same die are doing, as that impacts the same cache.

So if the error was to not test and optimize these routines under load, then the right course of action would be to do the opposite. And that leads me to believe that the best way to quantify the difference between cache polluting and non-cache polluting should likewise not be done on a quiescent system with a micro benchmark. Instead, we need a holistic performance test to get the truly best xor algorithm. In my current setup, the disks are so much faster than the single-threaded xor thread that the bottleneck is the xor speed. So, what does it matter if the xor routine doesn't pollute cache if the raid is so slow that programs are stuck in I/O wait all the time as the raid5 thread runs non-stop? Likewise, who cares what the top speed of a cache-polluting xor routine is if in the process it evicts so many cache pages belonging to the processes doing real work on the system that now cache reload becomes the bottleneck. The ultimate goal of either approach is overall *system* speed, not micro benchmark speed.

I would suggest a specific, system-wide workload test that involves a filesystem on a device that uses the particular raid level and parity routine you want to test; then you need to run that system workload and get the total time required to perform that specific work set, CPU time versus idle+I/O wait time in completing that work set, etc. Repeat the test for the various algorithms you wish to test, then analyze the results and go from there. I don't think you're going to get a valid run-time test for this; instead we would likely need to create a few heuristic rules that, combined with specific CPU properties, cause us to choose the right routine for the machine.

--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 2009-07-06 at 13:21 +1000, Neil Brown wrote:
> On Thursday July 2, heinzm@redhat.com wrote:
> >
> > Dan, Neil,

Hi,

back after > 4 days of Internet outage caused by lightning :-(

I'll respond to Neil's comments here in order to have a comparable microbenchmark based on his recommended change (and one bug I fixed; see below).

> >
> > like mentioned before I left to LinuxTag last week, here comes an initial
> > take on dm-raid45 warm/cold CPU cache xor speed optimization metrics.
> >
> > This shall give us the base to decide to keep or drop the dm-raid45
> > internal xor optimization magic or move (part of) it into the crypto
> > subsystem.
>
> Thanks for doing this.

You're welcome.

> > Intel results with 128 iterations each:
> > ---------------------------------------
> >
> > 1 stripe  : NB:10 111/80 HM:118 111/82
> > 2 stripes : NB:25 113/87 HM:103 112/91
> > 3 stripes : NB:24 115/93 HM:104 114/93
> > 4 stripes : NB:48 114/93 HM:80  114/93
> > 5 stripes : NB:38 113/94 HM:90  114/94
> > 6 stripes : NB:25 116/94 HM:103 114/94
> > 7 stripes : NB:25 115/95 HM:103 115/95
> > 8 stripes : NB:62 117/96 HM:66  116/95  <<<--- cold cache starts here
> > 9 stripes : NB:66 117/96 HM:62  116/95
> > 10 stripes: NB:73 117/96 HM:55  114/95
> > 11 stripes: NB:63 114/96 HM:65  112/95
> > 12 stripes: NB:51 111/96 HM:77  110/95
> > 13 stripes: NB:65 109/96 HM:63  112/95
>
> These results seem to suggest that the two different routines provide
> very similar results on this hardware, particularly when the cache is cold.
> The high degree of variability might be because you have dropped this:
>
> > -	/* Wait for next tick. */
> > -	for (j = jiffies; j == jiffies; )
> > -		;
>
> ??
> Without that, it could be running the test over anything from 4 to 5
> jiffies.
> I note that do_xor_speed in crypto/xor.c doesn't synchronise at the
> start either.  I think that is a bug.
> The variability seems to generally be close to 20%, which is consistent
> with the difference between 4 and 5.
> Could you put that loop back in and re-test?

Reintroduced, and reran the tests.

In addition to that, I fixed a flaw which led to dm-raid45.c:xor_optimize() running xor_speed() with chunks > raid devices, which doesn't make sense and led to longer test runs and erroneous chunk values (e.g. 7 when only 3 raid devices are configured). Hence we could end up with an algorithm claiming it was selected for > raid devices.

Here are the new results:

Intel Core i7:
--------------
1 stripe  : NB:54 114/94 HM:74  113/93
2 stripes : NB:57 116/94 HM:71  115/94
3 stripes : NB:64 115/94 HM:64  114/94
4 stripes : NB:51 112/94 HM:77  114/94
5 stripes : NB:77 115/94 HM:51  114/94
6 stripes : NB:25 111/89 HM:103 105/90
7 stripes : NB:13 105/91 HM:115 111/90
8 stripes : NB:27 108/92 HM:101 111/93
9 stripes : NB:29 113/92 HM:99  114/93
10 stripes: NB:41 110/92 HM:87  112/93
11 stripes: NB:34 105/92 HM:94  107/93
12 stripes: NB:51 114/93 HM:77  114/93
13 stripes: NB:54 115/94 HM:74  114/93
14 stripes: NB:64 115/94 HM:64  114/93

AMD Opteron:
------------
1 stripe  : NB:0 25/17 HM:128 48/38
2 stripes : NB:0 24/18 HM:128 46/36
3 stripes : NB:0 25/18 HM:128 47/37
4 stripes : NB:0 27/19 HM:128 48/41
5 stripes : NB:0 30/18 HM:128 49/40
6 stripes : NB:0 27/19 HM:128 49/40
7 stripes : NB:0 29/18 HM:128 49/39
8 stripes : NB:0 26/19 HM:128 49/40
9 stripes : NB:0 28/19 HM:128 51/41
10 stripes: NB:0 28/18 HM:128 50/41
11 stripes: NB:0 31/19 HM:128 49/40
12 stripes: NB:0 28/19 HM:128 50/40
13 stripes: NB:0 26/19 HM:128 50/40
14 stripes: NB:0 27/20 HM:128 49/40

Still too much variability...
> >
> > Opteron results with 128 iterations each:
> > -----------------------------------------
> > 1 stripe  : NB:0 30/20 HM:128 64/53
> > 2 stripes : NB:0 31/21 HM:128 68/55
> > 3 stripes : NB:0 31/22 HM:128 68/57
> > 4 stripes : NB:0 32/22 HM:128 70/61
> > 5 stripes : NB:0 32/22 HM:128 70/63
> > 6 stripes : NB:0 35/22 HM:128 70/64
> > 7 stripes : NB:0 32/23 HM:128 69/63
> > 8 stripes : NB:0 44/23 HM:128 76/65
> > 9 stripes : NB:0 43/23 HM:128 73/65
> > 10 stripes: NB:0 35/23 HM:128 72/64
> > 11 stripes: NB:0 35/24 HM:128 72/64
> > 12 stripes: NB:0 33/24 HM:128 72/65
> > 13 stripes: NB:0 33/23 HM:128 71/64
>
> Here your code seems to be 2-3 times faster!
> Can you check which function xor_block is using?
> If it is :
> xor: automatically using best checksumming function: ....
> then it might be worth disabling that test in calibrate_xor_blocks and
> see if it picks one that ends up being faster.

It picks the same sse function automatically/measured on both archs, with obvious variability:

[37414.875236] xor: automatically using best checksumming function: generic_sse
[37414.893930]    generic_sse: 12619.000 MB/sec
[37414.893932] xor: using function: generic_sse (12619.000 MB/sec)
[37445.679501] xor: measuring software checksum speed
[37445.696829]    generic_sse: 15375.000 MB/sec
[37445.696830] xor: using function: generic_sse (15375.000 MB/sec)

Will get to Doug's recommendation to run loaded benchmarks tomorrow...

Heinz

> There is still the fact that by using the cache for data that will be
> accessed once, we are potentially slowing down the rest of the system.
> i.e. the reason to avoid the cache is not just because it won't
> benefit the xor much, but because it will hurt other users.
> I don't know how to measure that effect :-(
> But if avoiding the cache makes xor 1/3 the speed of using the cache
> even though it is cold, then it would be hard to justify not using the
> cache I think.
> >
> > Questions/Recommendations:
> > --------------------------
> > Review the code changes and the data analysis please.
>
> It seems to mostly make sense
>  - the 'wait for next tick' should stay
>  - it would be interesting to see what the final choice of 'chunks'
>    was (i.e. how many to xor together at a time).
>
> Thanks!
>
> NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, 2009-07-07 at 14:38 -0400, Doug Ledford wrote:
> On Jul 5, 2009, at 11:21 PM, Neil Brown wrote:
> > Here your code seems to be 2-3 times faster!
> > Can you check which function xor_block is using?
> > If it is :
> > xor: automatically using best checksumming function: ....
> > then it might be worth disabling that test in calibrate_xor_blocks and
> > see if it picks one that ends up being faster.
> >
> > There is still the fact that by using the cache for data that will be
> > accessed once, we are potentially slowing down the rest of the system.
> > i.e. the reason to avoid the cache is not just because it won't
> > benefit the xor much, but because it will hurt other users.
> > I don't know how to measure that effect :-(
> > But if avoiding the cache makes xor 1/3 the speed of using the cache
> > even though it is cold, then it would be hard to justify not using the
> > cache I think.
>
> So, Heinz and I are actually both looking at xor speed issues, but
> from two different perspectives. While he's comparing some of the
> dmraid45 xor stuff to the xor_blocks routine in crypto/, I'm

<SNIP>

> So if the error was to not test and optimize these routines under
> load, then the right course of action would be to do the opposite.
> And that leads me to believe that the best way to quantify the
> difference between cache polluting and non-cache polluting should
> likewise not be done on a quiescent system with a micro benchmark.
> Instead, we need a holistic performance test to get the truly best xor
> algorithm. In my current setup, the disks are so much faster than the
> single threaded xor thread that the bottleneck is the xor speed. So,
> what does it matter if the xor routine doesn't pollute cache if the
> raid is so slow that programs are stuck in I/O wait all the time as
> the raid5 thread runs non-stop?
> Likewise, who cares what the top
> speed of a cache polluting xor routine is if in the process it evicts
> so many cache pages belonging to the processes doing real work on the
> system that now cache reload becomes the bottleneck. The ultimate
> goal of either approach is overall *system* speed, not micro benchmark
> speed. I would suggest a specific, system wide workload test that
> involves a filesystem on a device that uses the particular raid level
> and parity routine you want to test, and then you need to run that
> system workload and get a total time required to perform that specific
> work set, CPU time versus idle+I/O wait time in completing that work
> set, etc. Repeat the test for the various algorithms you wish to
> test, then analyze the results and go from there. I don't think
> you're going to get a valid run time test for this, instead we would
> likely need to create a few heuristic rules that, combined with
> specific CPU properties, cause us to choose the right routine for the
> machine.

Doug,

I extended dm-raid45's message interface to support changing the xor algorithm and # of chunks, allowing for changes of the algorithm being used at runtime. This I used to perform a bunch of mkfs write-intensive tests on the Intel Core i7 system as an initial write load test case.
The tests have been run on 8 disks faked onto one SSD using LVM (~200 MB/s sustained write throughput):

for a in xor_blocks
do
	for c in $(seq 2 6)
	do
		echo -e "$a $c\n---------------"
		dmsetup message r5 0 xor $a $c
		for i in $(seq 6)
		do
			time mkfs -t ext3 /dev/mapper/r5
		done
	done
done > xor_blocks.out 2>&1

for a in xor_8 xor_16 xor_32 xor_64
do
	for c in $(seq 2 8)
	do
		echo -e "$a $c\n---------------"
		dmsetup message r5 0 xor $a $c
		for i in $(seq 6)
		do
			time mkfs -t ext3 /dev/mapper/r5
		done
	done
done > xor_8-64.out 2>&1

Mapping table for r5:

0 146800640 raid45 core 2 8192 nosync raid5_la 7 64 128 8 -1 10 nosync 1 8 -1 \
/dev/tst/raiddev_1 0 /dev/tst/raiddev_2 0 /dev/tst/raiddev_3 0 /dev/tst/raiddev_4 0 \
/dev/tst/raiddev_5 0 /dev/tst/raiddev_6 0 /dev/tst/raiddev_7 0 /dev/tst/raiddev_8 0

I attached filtered output files xor_blocks_1.txt and xor_8-64_1.txt, which contain the time information for all the above algorithm/#chunks settings.

Real time minima:

# egrep '^real' xor_blocks_1.txt|sort|head -1
real	0m14.508s
# egrep '^real' xor_8-64_1.txt|sort|head -1
real	0m14.430s

System time minima:

[root@a4 dm-tests]# egrep '^sys' xor_blocks_1.txt|sort|head -1
sys	0m0.460s
# egrep '^sys' xor_8-64_1.txt|sort|head -1
sys	0m0.444s

User time is negligible.

This mkfs test case indicates better performance for certain dm-raid45 xor() settings vs. xor_blocks(). I can get to dbench etc. after my vacation in week 31.
Heinz

> --
> Doug Ledford <dledford@redhat.com>
>
> GPG KeyID: CFBFF194
> http://people.redhat.com/dledford
>
> InfiniBand Specific RPMS
> http://people.redhat.com/dledford/Infiniband

xor_blocks 2 --------------- real 0m14.513s user 0m0.000s sys 0m0.568s real 0m14.721s user 0m0.012s sys 0m0.476s real 0m14.792s user 0m0.016s sys 0m0.568s real 0m15.037s user 0m0.008s sys 0m0.512s real 0m14.514s user 0m0.016s sys 0m0.564s real 0m14.508s user 0m0.024s sys 0m0.512s xor_blocks 3 --------------- real 0m14.786s user 0m0.008s sys 0m0.504s real 0m14.538s user 0m0.004s sys 0m0.504s real 0m14.738s user 0m0.012s sys 0m0.516s real 0m14.704s user 0m0.016s sys 0m0.520s real 0m14.767s user 0m0.016s sys 0m0.500s real 0m14.510s user 0m0.020s sys 0m0.556s xor_blocks 4 --------------- real 0m14.643s user 0m0.004s sys 0m0.536s real 0m14.647s user 0m0.032s sys 0m0.512s real 0m14.748s user 0m0.020s sys 0m0.552s real 0m14.825s user 0m0.024s sys 0m0.520s real 0m14.829s user 0m0.008s sys 0m0.512s real 0m14.515s user 0m0.004s sys 0m0.536s xor_blocks 5 --------------- real 0m14.764s user 0m0.008s sys 0m0.524s real 0m14.593s user 0m0.012s sys 0m0.540s real 0m14.783s user 0m0.012s sys 0m0.504s real 0m14.632s user 0m0.008s sys 0m0.512s real 0m14.806s user 0m0.008s sys 0m0.488s real 0m14.780s user 0m0.012s sys 0m0.528s xor_blocks 6 --------------- real 0m14.813s user 0m0.012s sys 0m0.512s real 0m14.725s user 0m0.008s sys 0m0.524s real 0m14.518s user 0m0.016s sys 0m0.460s real 0m14.784s user 0m0.028s sys 0m0.548s real 0m14.994s user 0m0.012s sys 0m0.516s real 0m14.803s user 0m0.012s sys 0m0.512s xor_8 2 --------------- real 0m14.518s user 0m0.024s sys 0m0.504s real 0m14.611s user 0m0.016s sys 0m0.508s real 0m14.838s user 0m0.020s sys 0m0.500s real 0m14.837s user 0m0.008s sys 0m0.512s real 0m14.652s user 0m0.024s sys 0m0.460s real 0m14.954s user 0m0.016s sys 0m0.556s xor_8 3 --------------- real 0m14.866s user 0m0.004s sys 0m0.560s real 0m14.736s user 0m0.008s sys 0m0.560s real 0m14.643s
user 0m0.012s sys 0m0.444s real 0m14.817s user 0m0.012s sys 0m0.556s real 0m14.644s user 0m0.008s sys 0m0.496s real 0m14.747s user 0m0.008s sys 0m0.568s xor_8 4 --------------- real 0m14.504s user 0m0.000s sys 0m0.568s real 0m14.889s user 0m0.012s sys 0m0.516s real 0m14.813s user 0m0.020s sys 0m0.500s real 0m14.781s user 0m0.020s sys 0m0.496s real 0m14.657s user 0m0.012s sys 0m0.500s real 0m14.810s user 0m0.020s sys 0m0.488s xor_8 5 --------------- real 0m14.805s user 0m0.016s sys 0m0.524s real 0m14.956s user 0m0.024s sys 0m0.520s real 0m14.619s user 0m0.012s sys 0m0.468s real 0m14.902s user 0m0.008s sys 0m0.484s real 0m14.800s user 0m0.008s sys 0m0.512s real 0m14.866s user 0m0.008s sys 0m0.516s xor_8 6 --------------- real 0m14.834s user 0m0.032s sys 0m0.476s real 0m14.661s user 0m0.008s sys 0m0.560s real 0m14.809s user 0m0.016s sys 0m0.528s real 0m14.828s user 0m0.016s sys 0m0.568s real 0m14.801s user 0m0.008s sys 0m0.516s real 0m14.811s user 0m0.012s sys 0m0.524s xor_8 7 --------------- real 0m14.889s user 0m0.020s sys 0m0.520s real 0m14.525s user 0m0.012s sys 0m0.548s real 0m14.767s user 0m0.008s sys 0m0.560s real 0m14.803s user 0m0.012s sys 0m0.584s real 0m14.641s user 0m0.016s sys 0m0.608s real 0m14.810s user 0m0.016s sys 0m0.500s xor_8 8 --------------- real 0m14.719s user 0m0.016s sys 0m0.540s real 0m14.825s user 0m0.016s sys 0m0.572s real 0m14.842s user 0m0.008s sys 0m0.552s real 0m14.811s user 0m0.016s sys 0m0.508s real 0m14.518s user 0m0.012s sys 0m0.544s real 0m14.768s user 0m0.024s sys 0m0.500s xor_16 2 --------------- real 0m14.839s user 0m0.008s sys 0m0.576s real 0m14.517s user 0m0.020s sys 0m0.528s real 0m14.810s user 0m0.008s sys 0m0.532s real 0m14.888s user 0m0.028s sys 0m0.520s real 0m14.811s user 0m0.012s sys 0m0.544s real 0m14.794s user 0m0.012s sys 0m0.472s xor_16 3 --------------- real 0m14.766s user 0m0.008s sys 0m0.512s real 0m14.809s user 0m0.020s sys 0m0.488s real 0m14.582s user 0m0.008s sys 0m0.500s real 0m14.767s user 0m0.008s sys 
0m0.552s real 0m14.899s user 0m0.008s sys 0m0.528s real 0m14.812s user 0m0.004s sys 0m0.524s xor_16 4 --------------- real 0m14.827s user 0m0.004s sys 0m0.528s real 0m14.769s user 0m0.008s sys 0m0.588s real 0m14.541s user 0m0.012s sys 0m0.572s real 0m14.788s user 0m0.016s sys 0m0.592s real 0m15.482s user 0m0.004s sys 0m0.568s real 0m14.780s user 0m0.020s sys 0m0.524s xor_16 5 --------------- real 0m14.686s user 0m0.024s sys 0m0.500s real 0m14.782s user 0m0.012s sys 0m0.468s real 0m14.802s user 0m0.008s sys 0m0.456s real 0m14.896s user 0m0.008s sys 0m0.548s real 0m14.821s user 0m0.004s sys 0m0.532s real 0m14.806s user 0m0.028s sys 0m0.492s xor_16 6 --------------- real 0m14.735s user 0m0.004s sys 0m0.576s real 0m14.926s user 0m0.024s sys 0m0.564s real 0m14.912s user 0m0.016s sys 0m0.528s real 0m14.830s user 0m0.016s sys 0m0.492s real 0m14.751s user 0m0.020s sys 0m0.524s real 0m14.492s user 0m0.012s sys 0m0.500s xor_16 7 --------------- real 0m14.821s user 0m0.016s sys 0m0.444s real 0m14.714s user 0m0.012s sys 0m0.476s real 0m14.956s user 0m0.008s sys 0m0.544s real 0m14.755s user 0m0.012s sys 0m0.552s real 0m14.605s user 0m0.004s sys 0m0.488s real 0m14.750s user 0m0.012s sys 0m0.564s xor_16 8 --------------- real 0m14.702s user 0m0.012s sys 0m0.460s real 0m14.797s user 0m0.012s sys 0m0.472s real 0m14.629s user 0m0.016s sys 0m0.572s real 0m14.841s user 0m0.012s sys 0m0.488s real 0m14.768s user 0m0.020s sys 0m0.472s real 0m14.483s user 0m0.008s sys 0m0.532s xor_32 2 --------------- real 0m19.783s user 0m0.004s sys 0m0.528s real 0m14.670s user 0m0.012s sys 0m0.448s real 0m14.913s user 0m0.020s sys 0m0.496s real 0m14.816s user 0m0.012s sys 0m0.524s real 0m14.874s user 0m0.016s sys 0m0.560s real 0m14.815s user 0m0.004s sys 0m0.572s xor_32 3 --------------- real 0m14.751s user 0m0.016s sys 0m0.512s real 0m14.605s user 0m0.008s sys 0m0.508s real 0m14.699s user 0m0.004s sys 0m0.576s real 0m14.674s user 0m0.004s sys 0m0.512s real 0m14.872s user 0m0.012s sys 0m0.540s real 
0m14.801s user 0m0.024s sys 0m0.504s xor_32 4 --------------- real 0m14.780s user 0m0.028s sys 0m0.504s real 0m14.802s user 0m0.008s sys 0m0.500s real 0m14.624s user 0m0.008s sys 0m0.516s real 0m14.779s user 0m0.028s sys 0m0.536s real 0m14.953s user 0m0.012s sys 0m0.544s real 0m14.571s user 0m0.016s sys 0m0.500s xor_32 5 --------------- real 0m14.843s user 0m0.008s sys 0m0.544s real 0m14.822s user 0m0.016s sys 0m0.540s real 0m14.583s user 0m0.016s sys 0m0.520s real 0m15.138s user 0m0.008s sys 0m0.508s real 0m14.718s user 0m0.012s sys 0m0.548s real 0m14.547s user 0m0.012s sys 0m0.552s xor_32 6 --------------- real 0m14.744s user 0m0.012s sys 0m0.488s real 0m14.856s user 0m0.016s sys 0m0.532s real 0m14.717s user 0m0.024s sys 0m0.552s real 0m14.777s user 0m0.008s sys 0m0.564s real 0m14.761s user 0m0.016s sys 0m0.496s real 0m14.706s user 0m0.012s sys 0m0.560s xor_32 7 --------------- real 0m14.790s user 0m0.004s sys 0m0.568s real 0m14.797s user 0m0.016s sys 0m0.488s real 0m14.708s user 0m0.012s sys 0m0.512s real 0m14.838s user 0m0.016s sys 0m0.512s real 0m14.748s user 0m0.008s sys 0m0.476s real 0m14.507s user 0m0.008s sys 0m0.512s xor_32 8 --------------- real 0m15.055s user 0m0.004s sys 0m0.468s real 0m14.839s user 0m0.016s sys 0m0.564s real 0m14.551s user 0m0.020s sys 0m0.468s real 0m14.789s user 0m0.020s sys 0m0.488s real 0m14.495s user 0m0.004s sys 0m0.556s real 0m14.852s user 0m0.032s sys 0m0.552s xor_64 2 --------------- real 0m14.749s user 0m0.028s sys 0m0.472s real 0m14.576s user 0m0.016s sys 0m0.544s real 0m14.880s user 0m0.004s sys 0m0.496s real 0m14.789s user 0m0.016s sys 0m0.588s real 0m14.504s user 0m0.020s sys 0m0.568s real 0m14.847s user 0m0.016s sys 0m0.548s xor_64 3 --------------- real 0m14.812s user 0m0.012s sys 0m0.492s real 0m23.521s user 0m0.012s sys 0m0.552s real 0m14.580s user 0m0.004s sys 0m0.552s real 0m14.711s user 0m0.028s sys 0m0.524s real 0m14.817s user 0m0.016s sys 0m0.544s real 0m14.773s user 0m0.008s sys 0m0.468s xor_64 4 
--------------- real 0m14.722s user 0m0.008s sys 0m0.516s real 0m14.881s user 0m0.008s sys 0m0.520s real 0m14.821s user 0m0.012s sys 0m0.520s real 0m15.190s user 0m0.020s sys 0m0.456s real 0m14.780s user 0m0.016s sys 0m0.448s real 0m14.762s user 0m0.004s sys 0m0.564s xor_64 5 --------------- real 0m14.688s user 0m0.016s sys 0m0.488s real 0m14.559s user 0m0.004s sys 0m0.528s real 0m14.829s user 0m0.020s sys 0m0.520s real 0m14.818s user 0m0.016s sys 0m0.500s real 0m14.812s user 0m0.008s sys 0m0.500s real 0m14.804s user 0m0.004s sys 0m0.480s xor_64 6 --------------- real 0m14.742s user 0m0.024s sys 0m0.476s real 0m14.882s user 0m0.020s sys 0m0.528s real 0m14.589s user 0m0.012s sys 0m0.512s real 0m14.832s user 0m0.004s sys 0m0.504s real 0m14.638s user 0m0.012s sys 0m0.444s real 0m14.767s user 0m0.008s sys 0m0.536s xor_64 7 --------------- real 0m14.790s user 0m0.012s sys 0m0.560s real 0m14.749s user 0m0.016s sys 0m0.476s real 0m14.430s user 0m0.016s sys 0m0.540s real 0m14.694s user 0m0.012s sys 0m0.556s real 0m14.567s user 0m0.016s sys 0m0.488s real 0m14.753s user 0m0.016s sys 0m0.536s xor_64 8 --------------- real 0m14.816s user 0m0.008s sys 0m0.544s real 0m14.704s user 0m0.020s sys 0m0.516s real 0m14.613s user 0m0.012s sys 0m0.548s real 0m14.900s user 0m0.008s sys 0m0.532s real 0m14.586s user 0m0.012s sys 0m0.464s real 0m14.692s user 0m0.016s sys 0m0.520s -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Jul 10, 2009, at 11:23 AM, Heinz Mauelshagen wrote:
> Doug,
>
> I extended dm-raid45's message interface to support changing the xor
> algorithm and # of chunks, allowing for changes of the algorithm being
> used at runtime.

Very useful indeed. I may send you some routines to be tested at some point in the future if you don't mind ;-)

> This I used to perform a bunch of mkfs write intensive tests on the
> Intel Core i7 system as an initial write load test case. The tests have
> been run on 8 disks faked onto one SSD using LVM (~200MB sustained
> writes throughput):

That's a little slower than I think you need for a good test. I'm not even sure I'm satisfied that my current SATA array is sufficient, and I can get at least 500MB/s of write throughput to the disks using a raid0, possibly more if I can get a better eSATA port.

> for a in xor_blocks
> do
> 	for c in $(seq 2 6)
> 	do
> 		echo -e "$a $c\n---------------"
> 		dmsetup message r5 0 xor $a $c
> 		for i in $(seq 6)
> 		do
> 			time mkfs -t ext3 /dev/mapper/r5
> 		done
> 	done
> done > xor_blocks.out 2>&1
>
> for a in xor_8 xor_16 xor_32 xor_64
> do
> 	for c in $(seq 2 8)
> 	do
> 		echo -e "$a $c\n---------------"
> 		dmsetup message r5 0 xor $a $c
> 		for i in $(seq 6)
> 		do
> 			time mkfs -t ext3 /dev/mapper/r5
> 		done
> 	done
> done > xor_8-64.out 2>&1
>
> Mapping table for r5:
> 0 146800640 raid45 core 2 8192 nosync raid5_la 7 64 128 8 -1 10 nosync 1 8 -1 \
> /dev/tst/raiddev_1 0 /dev/tst/raiddev_2 0 /dev/tst/raiddev_3 0 /dev/tst/raiddev_4 0 \
> /dev/tst/raiddev_5 0 /dev/tst/raiddev_6 0 /dev/tst/raiddev_7 0 /dev/tst/raiddev_8 0
>
> I attached filtered output files xor_blocks_1.txt and xor_8-64_1.txt,
> which contain the time information for all the above algorithm/#chunks
> settings.
> Real time minima:
>
> # egrep '^real' xor_blocks_1.txt|sort|head -1
> real	0m14.508s
> # egrep '^real' xor_8-64_1.txt|sort|head -1
> real	0m14.430s
>
> System time minima:
>
> [root@a4 dm-tests]# egrep '^sys' xor_blocks_1.txt|sort|head -1
> sys	0m0.460s
> # egrep '^sys' xor_8-64_1.txt|sort|head -1
> sys	0m0.444s
>
> User time is negligible.
>
> This mkfs test case indicates better performance for certain dm-raid45
> xor() settings vs. xor_blocks(). I can get to dbench etc. after my
> vacation in week 31.

Thanks. This isn't too far off from what I would expect. I would say that real-world loads fall all along a spectrum from "create lots of writes, but do little to no work" to "does lots of work, and only sporadically writes". It's the latter end of this spectrum that is most likely to be helped by the cache-avoiding routines, while the former is not. So, one of the tests I had in mind was to use something like timing a complete kernel build, or doing a database load/report cycle, or some other things like that. Things that do actual work in the foreground while the raid is kept busy in the background. Of course, testing all the various points along the spectrum is needed, so this test gets us the first.

--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Sat, Jul 11, 2009 at 5:44 AM, Doug Ledford <dledford@redhat.com> wrote:
> Thanks. This isn't too far off from what I would expect. I would say that
> real world loads fall all along a spectrum from "create lots of writes, but
> do little to no work" to "does lots of work, and only sporadically writes".
> It's the latter end of this spectrum that is most likely to be helped by the
> cache avoiding routines, while the former is not. So, one of the tests I
> had in mind was to use something like timing a complete kernel build, or
> doing a database load/report cycle, or some other things like that. Things
> that do actual work in the foreground while the raid is kept busy in the
> background. Of course, testing all the various points along the spectrum is
> needed, so this test gets us the first.

This reminds me of the testing I did when quantifying the benefit of hardware-accelerated raid5. I played with kernel builds while resyncing to show the cache and cpu-utilization savings, but never got that to settle on a solid number. I took a look at Con Kolivas' "contest" benchmark [1], but ultimately just published plain iozone data [2]. The interesting bit is that cpu-limited random writes saw more throughput improvement than streaming writes, because the i/o processing did not need to compete with management of the stripe cache.

--
Dan

[1]: http://users.on.net/~ckolivas/contest/
[2]: http://sourceforge.net/projects/xscaleiop/files/MD RAID Acceleration/iop-iozone-graphs-20061010.tar.bz2/download

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 {drivers => /usr/src/linux/drivers}/md/dm-raid45.c |  139 ++++++++++++++------
 1 files changed, 98 insertions(+), 41 deletions(-)

diff --git a/drivers/md/dm-raid45.c b/drivers/md/dm-raid45.c
index 0c33fea..6ace975 100644
--- a/drivers/md/dm-raid45.c
+++ b/linux/drivers/md/dm-raid45.c
@@ -8,7 +8,28 @@
  *
  * Linux 2.6 Device Mapper RAID4 and RAID5 target.
  *
- * Supports:
+ * Tested-by: Intel; Marcin.Labun@intel.com, krzysztof.wojcik@intel.com
+ *
+ *
+ * Supports the following ATARAID vendor solutions (and SNIA DDF):
+ *
+ * Adaptec HostRAID ASR
+ * SNIA DDF1
+ * Highpoint 37x
+ * Highpoint 45x
+ * Intel IMSM
+ * Jmicron ATARAID
+ * LSI Logic MegaRAID
+ * NVidia RAID
+ * Promise FastTrack
+ * Silicon Image Medley
+ * VIA Software RAID
+ *
+ * via the dmraid application.
+ *
+ *
+ * Features:
+ *
  * o RAID4 with dedicated and selectable parity device
  * o RAID5 with rotating parity (left+right, symmetric+asymmetric)
  * o recovery of out of sync device for initial
@@ -37,7 +58,7 @@
  * ANALYZEME: recovery bandwidth
  */

-static const char *version = "v0.2594p";
+static const char *version = "v0.2596l";

 #include "dm.h"
 #include "dm-memcache.h"
@@ -101,9 +122,6 @@ static const char *version = "v0.2594p";
 /* Check value in range. */
 #define range_ok(i, min, max)	(i >= min && i <= max)

-/* Check argument is power of 2. */
-#define POWER_OF_2(a)	(!(a & (a - 1)))
-
 /* Structure access macros. */
 /* Derive raid_set from stripe_cache pointer. */
 #define RS(x)	container_of(x, struct raid_set, sc)
@@ -1848,10 +1866,10 @@ struct xor_func {
 	xor_function_t f;
 	const char *name;
 } static xor_funcs[] = {
-	{ xor_8, "xor_8" },
-	{ xor_16, "xor_16" },
-	{ xor_32, "xor_32" },
 	{ xor_64, "xor_64" },
+	{ xor_32, "xor_32" },
+	{ xor_16, "xor_16" },
+	{ xor_8, "xor_8" },
 	{ xor_blocks_wrapper, "xor_blocks" },
 };

@@ -3114,10 +3132,10 @@ static void _do_endios(struct raid_set *rs, struct stripe *stripe,
 			SetStripeReconstructed(stripe);

 			/* FIXME: reschedule to be written in case of read.
 			 */
-			// if (!RSDead && RSDegraded(rs) && !StripeRBW(stripe)) {
-			//	chunk_set(CHUNK(stripe, stripe->idx.recover), DIRTY);
-			//	stripe_chunks_rw(stripe);
-			// }
+			/* if (!RSDead && RSDegraded(rs) && !StripeRBW(stripe)) {
+				chunk_set(CHUNK(stripe, stripe->idx.recover), DIRTY);
+				stripe_chunks_rw(stripe);
+			} */

 			stripe->idx.recover = -1;
 		}
@@ -3257,7 +3275,7 @@ static void do_ios(struct raid_set *rs, struct bio_list *ios)
 		/* Check for recovering regions. */
 		sector = _sector(rs, bio);
 		r = region_state(rs, sector, DM_RH_RECOVERING);
-		if (unlikely(r && bio_data_dir(bio) == WRITE)) {
+		if (unlikely(r)) {
 			delay++;
 			/* Wait writing to recovering regions. */
 			dm_rh_delay_by_region(rh, bio,
@@ -3409,64 +3427,104 @@ static unsigned mbpers(struct raid_set *rs, unsigned speed)
 /*
  * Discover fastest xor algorithm and # of chunks combination.
  */
-/* Calculate speed for algorithm and # of chunks. */
+/* Calculate speed of particular algorithm and # of chunks. */
 static unsigned xor_speed(struct stripe *stripe)
 {
+	int ticks = 5;
 	unsigned r = 0;
-	unsigned long j;

-	/* Wait for next tick. */
-	for (j = jiffies; j == jiffies; )
-		;
+	/* Do xors for a few ticks. */
+	while (ticks--) {
+		unsigned xors = 0;
+		unsigned long j = jiffies;
+
+		while (j == jiffies) {
+			mb();
+			common_xor(stripe, stripe->io.size, 0, 0);
+			mb();
+			xors++;
+			mb();
+		}

-	/* Do xors for a full tick. */
-	for (j = jiffies; j == jiffies; ) {
-		mb();
-		common_xor(stripe, stripe->io.size, 0, 0);
-		mb();
-		r++;
+		if (xors > r)
+			r = xors;
 	}

 	return r;
 }

+/* Define for xor multi recovery stripe optimization runs. */
+#define DMRAID45_XOR_TEST
+
 /* Optimize xor algorithm for this RAID set.
  */
 static unsigned xor_optimize(struct raid_set *rs)
 {
-	unsigned chunks_max = 2, p = rs->set.raid_devs, speed_max = 0;
+	unsigned chunks_max = 2, p, speed_max = 0;
 	struct xor_func *f = ARRAY_END(xor_funcs), *f_max = NULL;
 	struct stripe *stripe;
+unsigned speed_hm = 0, speed_min = ~0, speed_xor_blocks = 0;

 	BUG_ON(list_empty(&rs->recover.stripes));
+#ifndef DMRAID45_XOR_TEST
 	stripe = list_first_entry(&rs->recover.stripes, struct stripe,
 				  lists[LIST_RECOVER]);

 	/* Must set uptodate so that xor() will belabour chunks. */
-	while (p--)
+	for (p = rs->set.raid_devs; p-- ;)
 		SetChunkUptodate(CHUNK(stripe, p));
+#endif

 	/* Try all xor functions. */
 	while (f-- > xor_funcs) {
 		unsigned speed;

-		/* Set actual xor function for common_xor(). */
-		rs->xor.f = f;
-		rs->xor.chunks = (f->f == xor_blocks_wrapper ?
-				  (MAX_XOR_BLOCKS + 1) : XOR_CHUNKS_MAX) + 1;
-
-		while (rs->xor.chunks-- > 2) {
-			speed = xor_speed(stripe);
-			if (speed > speed_max) {
-				speed_max = speed;
-				chunks_max = rs->xor.chunks;
-				f_max = f;
+#ifdef DMRAID45_XOR_TEST
+		list_for_each_entry(stripe, &rs->recover.stripes,
+				    lists[LIST_RECOVER]) {
+			for (p = rs->set.raid_devs; p-- ;)
+				SetChunkUptodate(CHUNK(stripe, p));
+#endif
+
+			/* Set actual xor function for common_xor(). */
+			rs->xor.f = f;
+			rs->xor.chunks = (f->f == xor_blocks_wrapper ?
+					  (MAX_XOR_BLOCKS + 1) :
+					  XOR_CHUNKS_MAX) + 1;
+
+			while (rs->xor.chunks-- > 2) {
+				speed = xor_speed(stripe);
+
+#ifdef DMRAID45_XOR_TEST
+				if (f->f == xor_blocks_wrapper) {
+					if (speed > speed_xor_blocks)
+						speed_xor_blocks = speed;
+				} else if (speed > speed_hm)
+					speed_hm = speed;
+
+				if (speed < speed_min)
+					speed_min = speed;
+#endif
+
+				if (speed > speed_max) {
+					speed_max = speed;
+					chunks_max = rs->xor.chunks;
+					f_max = f;
+				}
 			}
+#ifdef DMRAID45_XOR_TEST
 		}
+#endif
 	}

-	/* Memorize optimum parameters. */
+	/* Memorize optimal parameters. */
 	rs->xor.f = f_max;
 	rs->xor.chunks = chunks_max;
+#ifdef DMRAID45_XOR_TEST
+	DMINFO("%s stripes=%u min=%u xor_blocks=%u hm=%u max=%u",
+	       speed_max == speed_hm ? "HM" : "NB",
+	       rs->recover.recovery_stripes, speed_min,
+	       speed_xor_blocks, speed_hm, speed_max);
+#endif
 	return speed_max;
 }

@@ -3786,7 +3844,7 @@ static int get_raid_variable_parms(struct dm_target *ti, char **argv,
 		  "Invalid recovery switch; must be \"sync\" or \"nosync\"" },
 		{ 0,
 		  "Invalid number of recovery stripes;"
-		  "must be -1, > 0 and <= 16384",
+		  "must be -1, > 0 and <= 64",
 		  RECOVERY_STRIPES_MIN, RECOVERY_STRIPES_MAX,
 		  &vp->recovery_stripes_parm, &vp->recovery_stripes, NULL },
 	}, *varp;
@@ -3831,7 +3889,7 @@ static int get_raid_variable_parms(struct dm_target *ti, char **argv,

 		if (sscanf(*(argv++), "%d", &value) != 1 ||
 		    (value != -1 &&
-		     ((varp->action && !POWER_OF_2(value)) ||
+		     ((varp->action && !is_power_of_2(value)) ||
 		      !range_ok(value, varp->min, varp->max))))
 			TI_ERR(varp->errmsg);