Message ID:   1486010836-25228-4-git-send-email-anup.patel@broadcom.com (mailing list archive)
State:        Not Applicable
Delegated to: Herbert Xu
On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel <anup.patel@broadcom.com> wrote:
> The DMAENGINE framework assumes that if PQ offload is supported by a
> DMA device then all 256 PQ coefficients are supported. This assumption
> does not hold anymore because we now have BCM-SBA-RAID offload engine
> which supports PQ offload with limited number of PQ coefficients.
>
> This patch extends async_tx APIs to handle DMA devices with support
> for fewer PQ coefficients.
>
> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
> ---
>  crypto/async_tx/async_pq.c          |  3 +++
>  crypto/async_tx/async_raid6_recov.c | 12 ++++++++++--
>  include/linux/dmaengine.h           | 19 +++++++++++++++++++
>  include/linux/raid/pq.h             |  3 +++
>  4 files changed, 35 insertions(+), 2 deletions(-)

So, I hate the way async_tx does these checks on each operation, and
it's ok for me to say that because it's my fault. Really it's md that
should be validating engine offload capabilities once at the beginning
of time. I'd rather we move in that direction than continue to pile
onto a bad design.
On Thu, Feb 2, 2017 at 11:31 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel <anup.patel@broadcom.com> wrote:
> > The DMAENGINE framework assumes that if PQ offload is supported by a
> > DMA device then all 256 PQ coefficients are supported. This assumption
> > does not hold anymore because we now have BCM-SBA-RAID offload engine
> > which supports PQ offload with limited number of PQ coefficients.
> > [...]
>
> So, I hate the way async_tx does these checks on each operation, and
> it's ok for me to say that because it's my fault. Really it's md that
> should be validating engine offload capabilities once at the beginning
> of time. I'd rather we move in that direction than continue to pile
> onto a bad design.

Yes, indeed. All the async_tx APIs carry a lot of checks, and for a
high-throughput RAID offload engine these checks can add measurable
overhead.

I think doing the checks in Linux md would certainly be better, but
this would mean a lot of changes in Linux md as well as removing the
checks from async_tx.

Also, the async_tx APIs should not find a DMA channel on their own;
instead they should rely on Linux md to provide the DMA channel
pointer as a parameter.

It's better to do the checks cleanup in async_tx as a separate
patchset and keep this patchset simple.

Regards,
Anup
On Fri, Feb 3, 2017 at 2:59 AM, Anup Patel <anup.patel@broadcom.com> wrote:
> On Thu, Feb 2, 2017 at 11:31 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>> So, I hate the way async_tx does these checks on each operation, and
>> it's ok for me to say that because it's my fault. Really it's md that
>> should be validating engine offload capabilities once at the beginning
>> of time. I'd rather we move in that direction than continue to pile
>> onto a bad design.
>
> Yes, indeed. All async_tx APIs have lot of checks and for high throughput
> RAID offload engine these checks can add some overhead.
>
> I think doing checks in Linux md would be certainly better but this would
> mean lot of changes in Linux md as well as remove checks in async_tx.
>
> Also, async_tx APIs should not find DMA channel on its own instead it
> should rely on Linux md to provide DMA channel pointer as parameter.
>
> It's better to do checks cleanup in async_tx as separate patchset and
> keep this patchset simple.

That's been the problem with async_tx being broken like this for
years. Once you get this "small / simple" patch upstream, that
arguably makes async_tx a little bit worse, there is no longer any
motivation to fix the underlying issues. If you care about the long
term health of raid offload and are enabling new hardware support you
should first tackle the known problems with it before adding new
features.
On Sat, Feb 4, 2017 at 12:12 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Feb 3, 2017 at 2:59 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>> It's better to do checks cleanup in async_tx as separate patchset and
>> keep this patchset simple.
>
> That's been the problem with async_tx being broken like this for
> years. Once you get this "small / simple" patch upstream, that
> arguably makes async_tx a little bit worse, there is no longer any
> motivation to fix the underlying issues. If you care about the long
> term health of raid offload and are enabling new hardware support you
> should first tackle the known problems with it before adding new
> features.

Apart from the checks-related issue you pointed out, there are other
issues with the async_tx APIs, such as:

1. The mechanism for an update PQ (i.e. RAID6 update) operation in the
current async_tx APIs is to call async_gen_syndrome() twice with the
ASYNC_TX_PQ_XOR_DST flag set. Also, async_gen_syndrome() will always
prefer the SW approach when ASYNC_TX_PQ_XOR_DST is set. This means the
async_tx API forces the SW approach for the update PQ operation, and in
addition we require two async_gen_syndrome() calls to achieve it. This
limitation of async_gen_syndrome() reduces the performance of the
async_tx APIs. Instead, we should have a dedicated async_update_pq()
API which would allow RAID offload engine drivers (such as
BCM-FS4-RAID) to implement update PQ using HW offload, and this new
API would fall back to the SW approach using async_gen_syndrome() if
no DMA channel provides update PQ HW offload.

2. In our stress testing, we have observed that the dma_map_page() and
dma_unmap_page() calls used in the various async_tx APIs are the major
cause of overhead. If we directly call the DMA channel callbacks with
pre-DMA-mapped pages, then we get very high throughput. The async_tx
APIs should provide a way to pass pre-DMA-mapped pages so that Linux
MD can exploit this for better performance.

3. We really don't have a test module to stress/benchmark all the
async_tx APIs using multi-threading while batching a large number of
requests in each thread. This kind of test module is very much
required for performance benchmarking and for stressing
high-throughput (hundreds of Gbps) RAID offload engines (such as
BCM-FS4-RAID).

From the above, we already have an async_tx_test module to address
point 3. We also plan to address point 1, but this would also require
changes in Linux MD to use the new async_update_pq() API.

As you can see, this patchset is not the end of the story for us if we
want the best possible utilization of BCM-FS4-RAID.

Regards,
Anup
diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index f83de99..16c6526 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -187,6 +187,9 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,

 	BUG_ON(disks > 255 || !(P(blocks, disks) || Q(blocks, disks)));

+	if (device && dma_maxpqcoef(device) < src_cnt)
+		device = NULL;
+
 	if (device)
 		unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOWAIT);

diff --git a/crypto/async_tx/async_raid6_recov.c b/crypto/async_tx/async_raid6_recov.c
index 8fab627..2916f95 100644
--- a/crypto/async_tx/async_raid6_recov.c
+++ b/crypto/async_tx/async_raid6_recov.c
@@ -352,6 +352,7 @@ async_raid6_2data_recov(int disks, size_t bytes, int faila, int failb,
 {
 	void *scribble = submit->scribble;
 	int non_zero_srcs, i;
+	struct dma_chan *chan = async_dma_find_channel(DMA_PQ);

 	BUG_ON(faila == failb);
 	if (failb < faila)
@@ -359,12 +360,15 @@ async_raid6_2data_recov(int disks, size_t bytes, int faila, int failb,

 	pr_debug("%s: disks: %d len: %zu\n", __func__, disks, bytes);

+	if (chan && dma_maxpqcoef(chan->device) < RAID6_PQ_MAX_COEF)
+		chan = NULL;
+
 	/* if a dma resource is not available or a scribble buffer is not
 	 * available punt to the synchronous path.  In the 'dma not
 	 * available' case be sure to use the scribble buffer to
 	 * preserve the content of 'blocks' as the caller intended.
 	 */
-	if (!async_dma_find_channel(DMA_PQ) || !scribble) {
+	if (!chan || !scribble) {
 		void **ptrs = scribble ? scribble : (void **) blocks;

 		async_tx_quiesce(&submit->depend_tx);
@@ -432,15 +436,19 @@ async_raid6_datap_recov(int disks, size_t bytes, int faila,
 	void *scribble = submit->scribble;
 	int good_srcs, good, i;
 	struct page *srcs[2];
+	struct dma_chan *chan = async_dma_find_channel(DMA_PQ);

 	pr_debug("%s: disks: %d len: %zu\n", __func__, disks, bytes);

+	if (chan && dma_maxpqcoef(chan->device) < RAID6_PQ_MAX_COEF)
+		chan = NULL;
+
 	/* if a dma resource is not available or a scribble buffer is not
 	 * available punt to the synchronous path.  In the 'dma not
 	 * available' case be sure to use the scribble buffer to
 	 * preserve the content of 'blocks' as the caller intended.
 	 */
-	if (!async_dma_find_channel(DMA_PQ) || !scribble) {
+	if (!chan || !scribble) {
 		void **ptrs = scribble ? scribble : (void **) blocks;

 		async_tx_quiesce(&submit->depend_tx);
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index feee6ec..d938a8b 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -24,6 +24,7 @@
 #include <linux/scatterlist.h>
 #include <linux/bitmap.h>
 #include <linux/types.h>
+#include <linux/raid/pq.h>
 #include <asm/page.h>

 /**
@@ -668,6 +669,7 @@ struct dma_filter {
  * @cap_mask: one or more dma_capability flags
  * @max_xor: maximum number of xor sources, 0 if no capability
  * @max_pq: maximum number of PQ sources and PQ-continue capability
+ * @max_pqcoef: maximum number of PQ coefficients, 0 if all supported
  * @copy_align: alignment shift for memcpy operations
  * @xor_align: alignment shift for xor operations
  * @pq_align: alignment shift for pq operations
@@ -727,11 +729,13 @@ struct dma_device {
 	dma_cap_mask_t  cap_mask;
 	unsigned short max_xor;
 	unsigned short max_pq;
+	unsigned short max_pqcoef;
 	enum dmaengine_alignment copy_align;
 	enum dmaengine_alignment xor_align;
 	enum dmaengine_alignment pq_align;
 	enum dmaengine_alignment fill_align;
 	#define DMA_HAS_PQ_CONTINUE (1 << 15)
+	#define DMA_HAS_FEWER_PQ_COEF (1 << 15)

 	int dev_id;
 	struct device *dev;
@@ -1122,6 +1126,21 @@ static inline int dma_maxpq(struct dma_device *dma, enum dma_ctrl_flags flags)
 	BUG();
 }

+static inline void dma_set_maxpqcoef(struct dma_device *dma,
+				     unsigned short max_pqcoef)
+{
+	if (max_pqcoef < RAID6_PQ_MAX_COEF) {
+		dma->max_pqcoef = max_pqcoef;
+		dma->max_pqcoef |= DMA_HAS_FEWER_PQ_COEF;
+	}
+}
+
+static inline unsigned short dma_maxpqcoef(struct dma_device *dma)
+{
+	return (dma->max_pqcoef & DMA_HAS_FEWER_PQ_COEF) ?
+	       (dma->max_pqcoef & ~DMA_HAS_FEWER_PQ_COEF) : RAID6_PQ_MAX_COEF;
+}
+
 static inline size_t dmaengine_get_icg(bool inc, bool sgl, size_t icg,
 				       size_t dir_icg)
 {
diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 30f9453..f3a04bb 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -15,6 +15,9 @@

 #ifdef __KERNEL__

+/* Max number of PQ coefficients */
+#define RAID6_PQ_MAX_COEF 256
+
 /* Set to 1 to use kernel-wide empty_zero_page */
 #define RAID6_USE_EMPTY_ZERO_PAGE 0
 #include <linux/blkdev.h>