[v2,5/5] crypto: talitos: Add software backlog queue handling

Message ID 1425388897-5434-6-git-send-email-mort@bork.org (mailing list archive)
State Changes Requested
Delegated to: Herbert Xu

Commit Message

Martin Hicks March 3, 2015, 1:21 p.m. UTC
I was running into situations where the hardware FIFO was filling up, and
the code was returning EAGAIN to dm-crypt and just dropping the submitted
crypto request.

This adds support in talitos for a software backlog queue.  When a request
can't be queued to the hardware immediately, EBUSY is returned.  The queued
requests are dispatched to the hardware in the order received, as hardware
FIFO slots become available.
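
For reference, a caller that sets CRYPTO_TFM_REQ_MAY_BACKLOG is expected to
treat -EBUSY as "accepted but backlogged" rather than as a failure.  A
minimal, hypothetical caller-side sketch (not part of this patch):

	/* ret comes from e.g. crypto_ablkcipher_encrypt(req); 'done' is the
	 * completion signalled from the request's callback.
	 */
	switch (ret) {
	case 0:			/* completed synchronously */
		break;
	case -EINPROGRESS:	/* dispatched to the hardware FIFO */
	case -EBUSY:		/* backlogged in software, will still complete */
		wait_for_completion(&done);
		break;
	default:		/* real error, the request was dropped */
		break;
	}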

Signed-off-by: Martin Hicks <mort@bork.org>
---
 drivers/crypto/talitos.c |  135 ++++++++++++++++++++++++++++++++++++----------
 drivers/crypto/talitos.h |    3 ++
 2 files changed, 110 insertions(+), 28 deletions(-)

Comments

Kim Phillips March 4, 2015, 12:23 a.m. UTC | #1
On Tue,  3 Mar 2015 08:21:37 -0500
Martin Hicks <mort@bork.org> wrote:

> @@ -1170,6 +1237,8 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
>  						     edesc->dma_len,
>  						     DMA_BIDIRECTIONAL);
>  	edesc->req.desc = &edesc->desc;
> +	/* A copy of the crypto_async_request to use the crypto_queue backlog */
> +	memcpy(&edesc->req.base, areq, sizeof(struct crypto_async_request));

this seems backward, or, at least can be done more efficiently IMO:
talitos_cra_init should set the tfm's reqsize so the rest of
the driver can wholly embed its talitos_edesc (which should also
wholly encapsulate its talitos_request (i.e., not via a pointer))
into the crypto API's request handle allocation.  This
would absorb and eliminate the talitos_edesc kmalloc and frees, the
above memcpy, and replace the container_of after the
crypto_dequeue_request with an offset_of, right?
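
(Roughly, as an untested sketch of that idea for, e.g., the ablkcipher case:

	static int talitos_cra_init(struct crypto_tfm *tfm)
	{
		...
		/* reserve room for the edesc in every request */
		tfm->crt_ablkcipher.reqsize = sizeof(struct talitos_edesc);
		return 0;
	}

and then the entry points would take the edesc straight from the request,

	struct talitos_edesc *edesc = ablkcipher_request_ctx(areq);

instead of kmalloc'ing one in talitos_edesc_alloc().)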

When scatter-gather buffers are needed, we can assume a slower-path
and make them do their own allocations, since their sizes vary
depending on each request.  Of course, a pointer to those
allocations would need to be retained somewhere in the request
handle.

Only potential problem is getting the crypto API to set the GFP_DMA
flag in the allocation request, but presumably a
CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.

Kim
Horia Geantă March 5, 2015, 9:35 a.m. UTC | #2
On 3/4/2015 2:23 AM, Kim Phillips wrote:
> On Tue,  3 Mar 2015 08:21:37 -0500
> Martin Hicks <mort@bork.org> wrote:
> 
>> @@ -1170,6 +1237,8 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
>>  						     edesc->dma_len,
>>  						     DMA_BIDIRECTIONAL);
>>  	edesc->req.desc = &edesc->desc;
>> +	/* A copy of the crypto_async_request to use the crypto_queue backlog */
>> +	memcpy(&edesc->req.base, areq, sizeof(struct crypto_async_request));
> 
> this seems backward, or, at least can be done more efficiently IMO:
> talitos_cra_init should set the tfm's reqsize so the rest of
> the driver can wholly embed its talitos_edesc (which should also
> wholly encapsulate its talitos_request (i.e., not via a pointer))
> into the crypto API's request handle allocation.  This
> would absorb and eliminate the talitos_edesc kmalloc and frees, the
> above memcpy, and replace the container_of after the
> crypto_dequeue_request with an offset_of, right?
> 
> When scatter-gather buffers are needed, we can assume a slower-path
> and make them do their own allocations, since their sizes vary
> depending on each request.  Of course, a pointer to those
> allocations would need to be retained somewhere in the request
> handle.

Unfortunately the talitos_edesc structure size is variable most of the
time. Its exact size can only be established at "request time", not at
"tfm init time".

Fixed size would be sizeof(talitos_edesc).
Below are factors that influence the variable part, i.e. link_tbl in
talitos_edesc:
- whether any assoc / src / dst data is scattered
- icv_stashing (case when ICV checking is done in SW)

Still we'd be better with:
-crypto API allocates request + request context (i.e.
sizeof(talitos_edesc) + any alignment required)
-talitos driver allocates variable part of talitos_edesc (if needed)

instead of:
-crypto API allocates request
-talitos driver allocates talitos_edesc (fixed + variable)
-memcopy of the req.base (crypto_async_request) into talitos_edesc

both in terms of performance and readability.

At first look, the driver wouldn't change that much:
-talitos_cra_init() callback would have to set tfm.reqsize to
sizeof(talitos_edesc) + padding and also add the CRYPTO_TFM_REQ_DMA
indication in tfm.crt_flags
-talitos_edesc_alloc() logic would be pretty much the same, but would
allocate memory only for the link_tbl
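
(i.e., very roughly - untested sketch, with link_tbl turned into a pointer:

	edesc = ablkcipher_request_ctx(areq);	/* fixed part, from the request */
	if (dma_len)
		edesc->link_tbl = kmalloc(dma_len, GFP_DMA | flags);
)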

I'm willing to do these changes if needed.

> 
> Only potential problem is getting the crypto API to set the GFP_DMA
> flag in the allocation request, but presumably a
> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.

Right. And this flag would apply only to request __ctx[].

Herbert, would this be an acceptable addition to crypto API?

Thanks,
Horia


Kim Phillips March 6, 2015, 12:34 a.m. UTC | #3
On Thu, 5 Mar 2015 11:35:23 +0200
Horia Geantă <horia.geanta@freescale.com> wrote:

> On 3/4/2015 2:23 AM, Kim Phillips wrote:
> > On Tue,  3 Mar 2015 08:21:37 -0500
> > Martin Hicks <mort@bork.org> wrote:
> > 
> >> @@ -1170,6 +1237,8 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
> >>  						     edesc->dma_len,
> >>  						     DMA_BIDIRECTIONAL);
> >>  	edesc->req.desc = &edesc->desc;
> >> +	/* A copy of the crypto_async_request to use the crypto_queue backlog */
> >> +	memcpy(&edesc->req.base, areq, sizeof(struct crypto_async_request));
> > 
> > this seems backward, or, at least can be done more efficiently IMO:
> > talitos_cra_init should set the tfm's reqsize so the rest of
> > the driver can wholly embed its talitos_edesc (which should also
> > wholly encapsulate its talitos_request (i.e., not via a pointer))
> > into the crypto API's request handle allocation.  This
> > would absorb and eliminate the talitos_edesc kmalloc and frees, the
> > above memcpy, and replace the container_of after the
> > crypto_dequeue_request with an offset_of, right?
> > 
> > When scatter-gather buffers are needed, we can assume a slower-path
> > and make them do their own allocations, since their sizes vary
> > depending on each request.  Of course, a pointer to those
> > allocations would need to be retained somewhere in the request
> > handle.
> 
> Unfortunately talitos_edesc structure size is most of the times
> variable. Its exact size can only be established at "request time", and
> not at "tfm init time".

yes, I was suggesting a common minimum should be set in cra_init.

> Fixed size would be sizeof(talitos_edesc).
> Below are factors that influence the variable part, i.e. link_tbl in
> talitos_edesc:
> - whether any assoc / src / dst data is scattered
> - icv_stashing (case when ICV checking is done in SW)

both being slow(er) paths, IMO.

> Still we'd be better with:
> -crypto API allocates request + request context (i.e.
> sizeof(talitos_edesc) + any alignment required)
> -talitos driver allocates variable part of talitos_edesc (if needed)
> 
> instead of:
> -crypto API allocates request
> -talitos driver allocates talitos_edesc (fixed + variable)
> -memcopy of the req.base (crypto_async_request) into talitos_edesc
> 
> both in terms of performance and readability.

indeed.

> At first look, the driver wouldn't change that much:
> -talitos_cra_init() callback would have to set tfm.reqsize to
> sizeof(talitos_edesc) + padding and also add the CRYPTO_TFM_REQ_DMA
> indication in tfm.crt_flags
> -talitos_edesc_alloc() logic would be pretty much the same, but would
> allocate memory only for the link_tbl
> 
> I'm willing to do these changes if needed.

Please coordinate with Martin.

fwiw, caam could use this, too.

Kim
Herbert Xu March 6, 2015, 4:48 a.m. UTC | #4
On Thu, Mar 05, 2015 at 11:35:23AM +0200, Horia Geantă wrote:
> 
> > Only potential problem is getting the crypto API to set the GFP_DMA
> > flag in the allocation request, but presumably a
> > CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
> 
> Right. And this flag would apply only to request __ctx[].
> 
> Herbert, would this be an acceptable addition to crypto API?

How would such a flag work?

Cheers,
Horia Geantă March 9, 2015, 12:08 p.m. UTC | #5
On 3/6/2015 6:48 AM, Herbert Xu wrote:
> On Thu, Mar 05, 2015 at 11:35:23AM +0200, Horia Geantă wrote:
>>
>>> Only potential problem is getting the crypto API to set the GFP_DMA
>>> flag in the allocation request, but presumably a
>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>>
>> Right. And this flag would apply only to request __ctx[].
>>
>> Herbert, would this be an acceptable addition to crypto API?
> 
> How would such a flag work?

Hm, I thought that GFP_DMA memory could be allocated only for request
private ctx. This is obviously not the case.

*_request_alloc(tfm, gfp) crypto API functions would do:
if (crypto_tfm_get_flags(tfm) & CRYPTO_TFM_REQ_DMA)
        gfp |= GFP_DMA;
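
E.g. for the aead case, the helper could look roughly like this (sketch
only, with CRYPTO_TFM_REQ_DMA being the proposed, not-yet-existing flag):

static inline struct aead_request *aead_request_alloc(struct crypto_aead *tfm,
						       gfp_t gfp)
{
	struct aead_request *req;

	if (crypto_tfm_get_flags(crypto_aead_tfm(tfm)) & CRYPTO_TFM_REQ_DMA)
		gfp |= GFP_DMA;

	req = kmalloc(sizeof(*req) + crypto_aead_reqsize(tfm), gfp);
	if (likely(req))
		aead_request_set_tfm(req, tfm);

	return req;
}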

Thanks,
Horia


Horia Geantă March 16, 2015, 10:02 a.m. UTC | #6
On 3/4/2015 2:23 AM, Kim Phillips wrote:
> Only potential problem is getting the crypto API to set the GFP_DMA
> flag in the allocation request, but presumably a
> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.

Seems there are quite a few places that do not use the
{aead,ablkcipher,ahash}_request_alloc() API to allocate crypto requests.
Among them, IPsec and dm-crypt.
I've looked at the code and I don't think it can be converted to use
crypto API.

This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
places. Some of the maintainers do not agree, as you've seen.

An alternative would be for talitos to use the page allocator to get one or
two pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
= 8 kB), dma_map_page the area and manage it internally for talitos_desc
hw descriptors.
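
Roughly (hypothetical sketch, the priv->hw_desc_* field names are made up):

	/* in talitos_probe() */
	priv->hw_desc_pages = alloc_pages(GFP_DMA | GFP_KERNEL,
					  get_order(8 * 1024));
	priv->hw_desc_dma = dma_map_page(dev, priv->hw_desc_pages, 0,
					 8 * 1024, DMA_BIDIRECTIONAL);
	/* then hand out fixed 64B slots, one per channel fifo entry */
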
What do you think?

Thanks,
Horia


Kim Phillips March 17, 2015, 12:19 a.m. UTC | #7
On Mon, 16 Mar 2015 12:02:51 +0200
Horia Geantă <horia.geanta@freescale.com> wrote:

> On 3/4/2015 2:23 AM, Kim Phillips wrote:
> > Only potential problem is getting the crypto API to set the GFP_DMA
> > flag in the allocation request, but presumably a
> > CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
> 
> Seems there are quite a few places that do not use the
> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
> Among them, IPsec and dm-crypt.
> I've looked at the code and I don't think it can be converted to use
> crypto API.

why not?

> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
> places. Some of the maintainers do not agree, as you've seen.

would modifying the crypto API to either have a different
*_request_alloc() API, and/or adding calls to negotiate the GFP mask
between crypto users and drivers, e.g., get/set_gfp_mask, work?

> An alternative would be for talitos to use the page allocator to get 1 /
> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
> hw descriptors.
> What do you think?

There's a comment in esp_alloc_tmp(): "Use spare space in skb for
this where possible," which is ideally where we'd want to be (esp.
because that memory could already be DMA-able).  Your above
suggestion would be in the opposite direction of that.

Kim
Horia Geantă March 17, 2015, 5:58 p.m. UTC | #8
On 3/17/2015 2:19 AM, Kim Phillips wrote:
> On Mon, 16 Mar 2015 12:02:51 +0200
> Horia Geantă <horia.geanta@freescale.com> wrote:
> 
>> On 3/4/2015 2:23 AM, Kim Phillips wrote:
>>> Only potential problem is getting the crypto API to set the GFP_DMA
>>> flag in the allocation request, but presumably a
>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>>
>> Seems there are quite a few places that do not use the
>> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
>> Among them, IPsec and dm-crypt.
>> I've looked at the code and I don't think it can be converted to use
>> crypto API.
> 
> why not?

It would imply having 2 memory allocations, one for crypto request and
the other for the rest of the data bundled with the request (for IPsec
that would be ESN + space for IV + sg entries for authenticated-only
data and sk_buff extension, if needed).

Trying to have a single allocation by making ESN, IV etc. part of the
request private context requires modifying tfm.reqsize on the fly.
This won't work without adding some kind of locking for the tfm.

> 
>> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
>> places. Some of the maintainers do not agree, as you've seen.
> 
> would modifying the crypto API to either have a different
> *_request_alloc() API, and/or adding calls to negotiate the GFP mask
> between crypto users and drivers, e.g., get/set_gfp_mask, work?

I think what DaveM asked for was the change to be transparent.

Besides converting to *_request_alloc(), seems that all other options
require some extra awareness from the user.
Could you elaborate on the idea above?

> 
>> An alternative would be for talitos to use the page allocator to get 1 /
>> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
>> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
>> hw descriptors.
>> What do you think?
> 
> There's a comment in esp_alloc_tmp(): "Use spare space in skb for
> this where possible," which is ideally where we'd want to be (esp.

Ok, I'll check that. But note the "where possible" - finding room in the
skb to avoid the allocation won't always be the case, and then we're
back to square one.

> because that memory could already be DMA-able).  Your above
> suggestion would be in the opposite direction of that.

The proposal:
-removes dma (un)mapping on the fast path
-avoids requesting dma mappable memory for more than it's actually
needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
only its private context)
-for caam it has the added benefit of speeding the below search for the
offending descriptor in the SW ring from O(n) to O(1):
for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
	sw_idx = (tail + i) & (JOBR_DEPTH - 1);

	if (jrp->outring[hw_idx].desc ==
	    jrp->entinfo[sw_idx].desc_addr_dma)
		break; /* found */
}
(drivers/crypto/caam/jr.c - caam_dequeue)

Horia


Kim Phillips March 17, 2015, 10:03 p.m. UTC | #9
On Tue, 17 Mar 2015 19:58:55 +0200
Horia Geantă <horia.geanta@freescale.com> wrote:

> On 3/17/2015 2:19 AM, Kim Phillips wrote:
> > On Mon, 16 Mar 2015 12:02:51 +0200
> > Horia Geantă <horia.geanta@freescale.com> wrote:
> > 
> >> On 3/4/2015 2:23 AM, Kim Phillips wrote:
> >>> Only potential problem is getting the crypto API to set the GFP_DMA
> >>> flag in the allocation request, but presumably a
> >>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
> >>
> >> Seems there are quite a few places that do not use the
> >> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
> >> Among them, IPsec and dm-crypt.
> >> I've looked at the code and I don't think it can be converted to use
> >> crypto API.
> > 
> > why not?
> 
> It would imply having 2 memory allocations, one for crypto request and
> the other for the rest of the data bundled with the request (for IPsec
> that would be ESN + space for IV + sg entries for authenticated-only
> data and sk_buff extension, if needed).
> 
> Trying to have a single allocation by making ESN, IV etc. part of the
> request private context requires modifying tfm.reqsize on the fly.
> This won't work without adding some kind of locking for the tfm.

can't a common minimum tfm.reqsize be co-established up front, at
least for the fast path?

> >> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
> >> places. Some of the maintainers do not agree, as you've seen.
> > 
> > would modifying the crypto API to either have a different
> > *_request_alloc() API, and/or adding calls to negotiate the GFP mask
> > between crypto users and drivers, e.g., get/set_gfp_mask, work?
> 
> I think what DaveM asked for was the change to be transparent.
> 
> Besides converting to *_request_alloc(), seems that all other options
> require some extra awareness from the user.
> Could you elaborate on the idea above?

was merely suggesting communicating GFP flags anonymously across the
API, i.e., GFP_DMA wouldn't appear in user code.

> >> An alternative would be for talitos to use the page allocator to get 1 /
> >> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
> >> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
> >> hw descriptors.
> >> What do you think?
> > 
> > There's a comment in esp_alloc_tmp(): "Use spare space in skb for
> > this where possible," which is ideally where we'd want to be (esp.
> 
> Ok, I'll check that. But note the "where possible" - finding room in the
> skb to avoid the allocation won't always be the case, and then we're
> back to square one.
> 
> > because that memory could already be DMA-able).  Your above
> > suggestion would be in the opposite direction of that.
> 
> The proposal:
> -removes dma (un)mapping on the fast path

sure, but at the expense of additional complexity.

> -avoids requesting dma mappable memory for more than it's actually
> needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
> only its private context)

compared to the payload?  Plus, we have plenty of DMA space these
days.

> -for caam it has the added benefit of speeding the below search for the
> offending descriptor in the SW ring from O(n) to O(1):
> for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
> 	sw_idx = (tail + i) & (JOBR_DEPTH - 1);
> 
> 	if (jrp->outring[hw_idx].desc ==
> 	    jrp->entinfo[sw_idx].desc_addr_dma)
> 		break; /* found */
> }
> (drivers/crypto/caam/jr.c - caam_dequeue)

how?  The job ring h/w will still be spitting things out
out-of-order.

Plus, like I said, it's taking the problem in the wrong direction:
we need to strive to merge the allocation and mapping with the upper
layers as much as possible.

Kim
Horia Geantă March 19, 2015, 3:56 p.m. UTC | #10
On 3/18/2015 12:03 AM, Kim Phillips wrote:
> On Tue, 17 Mar 2015 19:58:55 +0200
> Horia Geantă <horia.geanta@freescale.com> wrote:
> 
>> On 3/17/2015 2:19 AM, Kim Phillips wrote:
>>> On Mon, 16 Mar 2015 12:02:51 +0200
>>> Horia Geantă <horia.geanta@freescale.com> wrote:
>>>
>>>> On 3/4/2015 2:23 AM, Kim Phillips wrote:
>>>>> Only potential problem is getting the crypto API to set the GFP_DMA
>>>>> flag in the allocation request, but presumably a
>>>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>>>>
>>>> Seems there are quite a few places that do not use the
>>>> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
>>>> Among them, IPsec and dm-crypt.
>>>> I've looked at the code and I don't think it can be converted to use
>>>> crypto API.
>>>
>>> why not?
>>
>> It would imply having 2 memory allocations, one for crypto request and
>> the other for the rest of the data bundled with the request (for IPsec
>> that would be ESN + space for IV + sg entries for authenticated-only
>> data and sk_buff extension, if needed).
>>
>> Trying to have a single allocation by making ESN, IV etc. part of the
>> request private context requires modifying tfm.reqsize on the fly.
>> This won't work without adding some kind of locking for the tfm.
> 
> can't a common minimum tfm.reqsize be co-established up front, at
> least for the fast path?

Indeed, for IPsec at tfm allocation time - esp_init_state() -
tfm.reqsize could be increased to account for what is known for a given
flow: ESN, IV and asg (S/G entries for authenticated-only data).
The layout would be:
aead request (fixed part)
private ctx of backend algorithm
seq_no_hi (if ESN)
IV
asg
sg <-- S/G table for skb_to_sgvec; how many entries is the question

Do you have a suggestion for how many S/G entries to preallocate for
representing the sk_buff data to be encrypted?
An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.
Btw, currently maximum number of fragments supported by the net stack
(MAX_SKB_FRAGS) is 16 or more.
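
(Spelled out as a struct, purely to illustrate the layout - the name and
the sg[] size here are made up:

struct esp_fast_ctx {			/* follows aead request + backend ctx */
	__be32 seqhi;			/* high sequence bits, ESN only */
	u8 iv[16];			/* large enough for the cipher IV */
	struct scatterlist asg[1];	/* authenticated-only data */
	struct scatterlist sg[4];	/* payload; 4 a la ESP_NUM_FAST_SG */
};
)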

>>>> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
>>>> places. Some of the maintainers do not agree, as you've seen.
>>>
>>> would modifying the crypto API to either have a different
>>> *_request_alloc() API, and/or adding calls to negotiate the GFP mask
>>> between crypto users and drivers, e.g., get/set_gfp_mask, work?
>>
>> I think what DaveM asked for was the change to be transparent.
>>
>> Besides converting to *_request_alloc(), seems that all other options
>> require some extra awareness from the user.
>> Could you elaborate on the idea above?
> 
> was merely suggesting communicating GFP flags anonymously across the
> API, i.e., GFP_DMA wouldn't appear in user code.

Meaning user would have to get_gfp_mask before allocating a crypto
request - i.e. instead of kmalloc(..., GFP_ATOMIC) to have
kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?
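
(If so, the helper itself would presumably be something as small as this -
hypothetical name, and CRYPTO_TFM_REQ_DMA is still the proposed flag:

static inline gfp_t crypto_aead_get_gfp_mask(struct crypto_aead *tfm)
{
	return (crypto_tfm_get_flags(crypto_aead_tfm(tfm)) &
		CRYPTO_TFM_REQ_DMA) ? GFP_DMA : 0;
}
)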

>>>> An alternative would be for talitos to use the page allocator to get 1 /
>>>> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
>>>> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
>>>> hw descriptors.
>>>> What do you think?
>>>
>>> There's a comment in esp_alloc_tmp(): "Use spare space in skb for
>>> this where possible," which is ideally where we'd want to be (esp.
>>
>> Ok, I'll check that. But note the "where possible" - finding room in the
>> skb to avoid the allocation won't always be the case, and then we're
>> back to square one.

So the skb cb is out of the question, being too small (48B).
Any idea what was the intention of the "TODO" - maybe to use the
tailroom in the skb data area?

>>> because that memory could already be DMA-able).  Your above
>>> suggestion would be in the opposite direction of that.
>>
>> The proposal:
>> -removes dma (un)mapping on the fast path
> 
> sure, but at the expense of additional complexity.

Right, there's no free lunch. But it's cheaper.

>> -avoids requesting dma mappable memory for more than it's actually
>> needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
>> only its private context)
> 
> compared to the payload?  Plus, we have plenty of DMA space these
> days.
> 
>> -for caam it has the added benefit of speeding the below search for the
>> offending descriptor in the SW ring from O(n) to O(1):
>> for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
>> 	sw_idx = (tail + i) & (JOBR_DEPTH - 1);
>>
>> 	if (jrp->outring[hw_idx].desc ==
>> 	    jrp->entinfo[sw_idx].desc_addr_dma)
>> 		break; /* found */
>> }
>> (drivers/crypto/caam/jr.c - caam_dequeue)
> 
> how?  The job ring h/w will still be spitting things out
> out-of-order.

jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
O(1):

dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
[...]
sw_idx = (jrp->outring[hw_idx].desc - desc_base) / JD_SIZE;

JD_SIZE would be 16 words (64B) - 13 words used for the h/w job
descriptor, 3 words can be used for something else.
Basically all JDs would be filled at a 64B-aligned offset in the memory
page.

> Plus, like I said, it's taking the problem in the wrong direction:
> we need to strive to merge the allocation and mapping with the upper
> layers as much as possible.

IMHO propagating the GFP_DMA from backend crypto implementations to
crypto API users doesn't seem feasible.
It's error-prone to audit all places that allocate crypto requests w/out
using *_request_alloc API.
And even if all these places would be identified:
-in some cases there's some heavy rework involved
-more places might show up in the future and there's no way to detect them

Horia

Kim Phillips March 19, 2015, 6:38 p.m. UTC | #11
On Thu, 19 Mar 2015 17:56:57 +0200
Horia Geantă <horia.geanta@freescale.com> wrote:

> On 3/18/2015 12:03 AM, Kim Phillips wrote:
> > On Tue, 17 Mar 2015 19:58:55 +0200
> > Horia Geantă <horia.geanta@freescale.com> wrote:
> > 
> >> On 3/17/2015 2:19 AM, Kim Phillips wrote:
> >>> On Mon, 16 Mar 2015 12:02:51 +0200
> >>> Horia Geantă <horia.geanta@freescale.com> wrote:
> >>>
> >>>> On 3/4/2015 2:23 AM, Kim Phillips wrote:
> >>>>> Only potential problem is getting the crypto API to set the GFP_DMA
> >>>>> flag in the allocation request, but presumably a
> >>>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
> >>>>
> >>>> Seems there are quite a few places that do not use the
> >>>> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
> >>>> Among them, IPsec and dm-crypt.
> >>>> I've looked at the code and I don't think it can be converted to use
> >>>> crypto API.
> >>>
> >>> why not?
> >>
> >> It would imply having 2 memory allocations, one for crypto request and
> >> the other for the rest of the data bundled with the request (for IPsec
> >> that would be ESN + space for IV + sg entries for authenticated-only
> >> data and sk_buff extension, if needed).
> >>
> >> Trying to have a single allocation by making ESN, IV etc. part of the
> >> request private context requires modifying tfm.reqsize on the fly.
> >> This won't work without adding some kind of locking for the tfm.
> > 
> > can't a common minimum tfm.reqsize be co-established up front, at
> > least for the fast path?
> 
> Indeed, for IPsec at tfm allocation time - esp_init_state() -
> tfm.reqsize could be increased to account for what is known for a given
> flow: ESN, IV and asg (S/G entries for authenticated-only data).
> The layout would be:
> aead request (fixed part)
> private ctx of backend algorithm
> seq_no_hi (if ESN)
> IV
> asg
> sg <-- S/G table for skb_to_sgvec; how many entries is the question
> 
> Do you have a suggestion for how many S/G entries to preallocate for
> representing the sk_buff data to be encrypted?
> An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.
> Btw, currently maximum number of fragments supported by the net stack
> (MAX_SKB_FRAGS) is 16 or more.
> 
> >>>> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
> >>>> places. Some of the maintainers do not agree, as you've seen.
> >>>
> >>> would modifying the crypto API to either have a different
> >>> *_request_alloc() API, and/or adding calls to negotiate the GFP mask
> >>> between crypto users and drivers, e.g., get/set_gfp_mask, work?
> >>
> >> I think what DaveM asked for was the change to be transparent.
> >>
> >> Besides converting to *_request_alloc(), seems that all other options
> >> require some extra awareness from the user.
> >> Could you elaborate on the idea above?
> > 
> > was merely suggesting communicating GFP flags anonymously across the
> > API, i.e., GFP_DMA wouldn't appear in user code.
> 
> Meaning user would have to get_gfp_mask before allocating a crypto
> request - i.e. instead of kmalloc(..., GFP_ATOMIC) to have
> kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?
> 
> >>>> An alternative would be for talitos to use the page allocator to get 1 /
> >>>> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
> >>>> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
> >>>> hw descriptors.
> >>>> What do you think?
> >>>
> >>> There's a comment in esp_alloc_tmp(): "Use spare space in skb for
> >>> this where possible," which is ideally where we'd want to be (esp.
> >>
> >> Ok, I'll check that. But note the "where possible" - finding room in the
> >> skb to avoid the allocation won't always be the case, and then we're
> >> back to square one.
> 
> So the skb cb is out of the question, being too small (48B).
> Any idea what was the intention of the "TODO" - maybe to use the
> tailroom in the skb data area?
> 
> >>> because that memory could already be DMA-able).  Your above
> >>> suggestion would be in the opposite direction of that.
> >>
> >> The proposal:
> >> -removes dma (un)mapping on the fast path
> > 
> > sure, but at the expense of additional complexity.
> 
> Right, there's no free lunch. But it's cheaper.
> 
> >> -avoids requesting dma mappable memory for more than it's actually
> >> needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
> >> only its private context)
> > 
> > compared to the payload?  Plus, we have plenty of DMA space these
> > days.
> > 
> >> -for caam it has the added benefit of speeding the below search for the
> >> offending descriptor in the SW ring from O(n) to O(1):
> >> for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
> >> 	sw_idx = (tail + i) & (JOBR_DEPTH - 1);
> >>
> >> 	if (jrp->outring[hw_idx].desc ==
> >> 	    jrp->entinfo[sw_idx].desc_addr_dma)
> >> 		break; /* found */
> >> }
> >> (drivers/crypto/caam/jr.c - caam_dequeue)
> > 
> > how?  The job ring h/w will still be spitting things out
> > out-of-order.
> 
> jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
> O(1):
> 
> dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
> [...]
> sw_idx = (jrp->outring[hw_idx].desc - desc_base) / JD_SIZE;
> 
> JD_SIZE would be 16 words (64B) - 13 words used for the h/w job
> descriptor, 3 words can be used for smth. else.
> Basically all JDs would be filled at a 64B-aligned offset in the memory
> page.

that assumes a linear mapping, which is a wrong assumption to make.

I also think you don't know how many times that loop above executes
in practice.

> > Plus, like I said, it's taking the problem in the wrong direction:
> > we need to strive to merge the allocation and mapping with the upper
> > layers as much as possible.
> 
> IMHO propagating the GFP_DMA from backend crypto implementations to
> crypto API users doesn't seem feasable.

should be.

> It's error-prone to audit all places that allocate crypto requests w/out
> using *_request_alloc API.

why is it error-prone?

> And even if all these places would be identified:
> -in some cases there's some heavy rework involved

so?

> -more places might show up in the future and there's no way to detect them

let them worry about that.

I leave the rest for netdev.

Kim
Patch

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index b0c85ce..bb7fba0 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -182,55 +182,118 @@  static int init_device(struct device *dev)
 	return 0;
 }
 
-/**
- * talitos_submit - submits a descriptor to the device for processing
- * @dev:	the SEC device to be used
- * @ch:		the SEC device channel to be used
- * @edesc:	the descriptor to be processed by the device
- *
- * desc must contain valid dma-mapped (bus physical) address pointers.
- * callback must check err and feedback in descriptor header
- * for device processing status.
- */
-int talitos_submit(struct device *dev, int ch, struct talitos_edesc *edesc)
+/* Dispatch 'request' if provided, otherwise a backlogged request */
+static int __talitos_handle_queue(struct device *dev, int ch,
+				  struct talitos_edesc *edesc,
+				  unsigned long *irq_flags)
 {
 	struct talitos_private *priv = dev_get_drvdata(dev);
-	struct talitos_request *request = &edesc->req;
-	unsigned long flags;
+	struct talitos_request *request;
+	struct crypto_async_request *areq;
 	int head;
+	int ret = -EINPROGRESS;
 
-	spin_lock_irqsave(&priv->chan[ch].head_lock, flags);
-
-	if (!atomic_inc_not_zero(&priv->chan[ch].submit_count)) {
+	if (!atomic_inc_not_zero(&priv->chan[ch].submit_count))
 		/* h/w fifo is full */
-		spin_unlock_irqrestore(&priv->chan[ch].head_lock, flags);
-		return -EAGAIN;
+		return -EBUSY;
+
+	if (!edesc) {
+		/* Dequeue the oldest request */
+		areq = crypto_dequeue_request(&priv->chan[ch].queue);
+		request = container_of(areq, struct talitos_request, base);
+	} else {
+		request = &edesc->req;
 	}
 
-	head = priv->chan[ch].head;
 	request->dma_desc = dma_map_single(dev, request->desc,
 					   sizeof(*request->desc),
 					   DMA_BIDIRECTIONAL);
 
 	/* increment fifo head */
+	head = priv->chan[ch].head;
 	priv->chan[ch].head = (priv->chan[ch].head + 1) & (priv->fifo_len - 1);
 
-	smp_wmb();
-	priv->chan[ch].fifo[head] = request;
+	spin_unlock_irqrestore(&priv->chan[ch].head_lock, *irq_flags);
+
+	/*
+	 * Mark a backlogged request as in-progress.
+	 */
+	if (!edesc) {
+		areq = request->context;
+		areq->complete(areq, -EINPROGRESS);
+	}
+
+	spin_lock_irqsave(&priv->chan[ch].head_lock, *irq_flags);
 
 	/* GO! */
+	priv->chan[ch].fifo[head] = request;
 	wmb();
 	out_be32(priv->chan[ch].reg + TALITOS_FF,
 		 upper_32_bits(request->dma_desc));
 	out_be32(priv->chan[ch].reg + TALITOS_FF_LO,
 		 lower_32_bits(request->dma_desc));
 
+	return ret;
+}
+
+/**
+ * talitos_submit - performs submissions of a new descriptors
+ *
+ * @dev:	the SEC device to be used
+ * @ch:		the SEC device channel to be used
+ * @edesc:	the request to be processed by the device
+ *
+ * edesc->req must contain valid dma-mapped (bus physical) address pointers.
+ * callback must check err and feedback in descriptor header
+ * for device processing status upon completion.
+ */
+int talitos_submit(struct device *dev, int ch, struct talitos_edesc *edesc)
+{
+	struct talitos_private *priv = dev_get_drvdata(dev);
+	struct talitos_request *request = &edesc->req;
+	unsigned long flags;
+	int ret = -EINPROGRESS;
+
+	spin_lock_irqsave(&priv->chan[ch].head_lock, flags);
+
+	if (priv->chan[ch].queue.qlen) {
+		/*
+		 * There are backlogged requests.  Just queue this new request
+		 * and dispatch the oldest backlogged request to the hardware.
+		 */
+		crypto_enqueue_request(&priv->chan[ch].queue,
+				       &request->base);
+		__talitos_handle_queue(dev, ch, NULL, &flags);
+		ret = -EBUSY;
+	} else {
+		ret = __talitos_handle_queue(dev, ch, edesc, &flags);
+		if (ret == -EBUSY)
+			/* Hardware FIFO is full */
+			crypto_enqueue_request(&priv->chan[ch].queue,
+					       &request->base);
+	}
+
 	spin_unlock_irqrestore(&priv->chan[ch].head_lock, flags);
 
-	return -EINPROGRESS;
+	return ret;
 }
 EXPORT_SYMBOL(talitos_submit);
 
+static int talitos_handle_queue(struct device *dev, int ch)
+{
+	struct talitos_private *priv = dev_get_drvdata(dev);
+	unsigned long flags;
+	int ret = -EINPROGRESS;
+
+	spin_lock_irqsave(&priv->chan[ch].head_lock, flags);
+	/* Queue backlogged requests as long as the hardware has room */
+	while (priv->chan[ch].queue.qlen && ret == -EINPROGRESS)
+		ret = __talitos_handle_queue(dev, ch, NULL, &flags);
+	spin_unlock_irqrestore(&priv->chan[ch].head_lock, flags);
+
+	return 0;
+}
+
 /*
  * process what was done, notify callback of error if not
  */
@@ -283,6 +346,8 @@  static void flush_channel(struct device *dev, int ch, int error, int reset_ch)
 	}
 
 	spin_unlock_irqrestore(&priv->chan[ch].tail_lock, flags);
+
+	talitos_handle_queue(dev, ch);
 }
 
 /*
@@ -1039,7 +1104,8 @@  static int ipsec_esp(struct talitos_edesc *edesc, struct aead_request *areq,
 	edesc->req.context = areq;
 
 	ret = talitos_submit(dev, ctx->ch, edesc);
-	if (ret != -EINPROGRESS) {
+	if (ret != -EINPROGRESS &&
+	    !(ret == -EBUSY && areq->base.flags & CRYPTO_TFM_REQ_MAY_BACKLOG)) {
 		ipsec_esp_unmap(dev, edesc, areq);
 		kfree(edesc);
 	}
@@ -1080,6 +1146,7 @@  static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 						 unsigned int ivsize,
 						 int icv_stashing,
 						 u32 cryptoflags,
+						 struct crypto_async_request *areq,
 						 bool encrypt)
 {
 	struct talitos_edesc *edesc;
@@ -1170,6 +1237,8 @@  static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 						     edesc->dma_len,
 						     DMA_BIDIRECTIONAL);
 	edesc->req.desc = &edesc->desc;
+	/* A copy of the crypto_async_request to use the crypto_queue backlog */
+	memcpy(&edesc->req.base, areq, sizeof(struct crypto_async_request));
 
 	return edesc;
 }
@@ -1184,7 +1253,7 @@  static struct talitos_edesc *aead_edesc_alloc(struct aead_request *areq, u8 *iv,
 	return talitos_edesc_alloc(ctx->dev, areq->assoc, areq->src, areq->dst,
 				   iv, areq->assoclen, areq->cryptlen,
 				   ctx->authsize, ivsize, icv_stashing,
-				   areq->base.flags, encrypt);
+				   areq->base.flags, &areq->base, encrypt);
 }
 
 static int aead_encrypt(struct aead_request *req)
@@ -1414,7 +1483,8 @@  static int common_nonsnoop(struct talitos_edesc *edesc,
 	edesc->req.context = areq;
 
 	ret = talitos_submit(dev, ctx->ch, edesc);
-	if (ret != -EINPROGRESS) {
+	if (ret != -EINPROGRESS &&
+	    !(ret == -EBUSY && areq->base.flags & CRYPTO_TFM_REQ_MAY_BACKLOG)) {
 		common_nonsnoop_unmap(dev, edesc, areq);
 		kfree(edesc);
 	}
@@ -1430,7 +1500,7 @@  static struct talitos_edesc *ablkcipher_edesc_alloc(struct ablkcipher_request *
 
 	return talitos_edesc_alloc(ctx->dev, NULL, areq->src, areq->dst,
 				   areq->info, 0, areq->nbytes, 0, ivsize, 0,
-				   areq->base.flags, encrypt);
+				   areq->base.flags, &areq->base, encrypt);
 }
 
 static int ablkcipher_encrypt(struct ablkcipher_request *areq)
@@ -1597,7 +1667,8 @@  static int common_nonsnoop_hash(struct talitos_edesc *edesc,
 	edesc->req.context = areq;
 
 	ret = talitos_submit(dev, ctx->ch, edesc);
-	if (ret != -EINPROGRESS) {
+	if (ret != -EINPROGRESS &&
+	    !(ret == -EBUSY && areq->base.flags & CRYPTO_TFM_REQ_MAY_BACKLOG)) {
 		common_nonsnoop_hash_unmap(dev, edesc, areq);
 		kfree(edesc);
 	}
@@ -1612,7 +1683,8 @@  static struct talitos_edesc *ahash_edesc_alloc(struct ahash_request *areq,
 	struct talitos_ahash_req_ctx *req_ctx = ahash_request_ctx(areq);
 
 	return talitos_edesc_alloc(ctx->dev, NULL, req_ctx->psrc, NULL, NULL, 0,
-				   nbytes, 0, 0, 0, areq->base.flags, false);
+				   nbytes, 0, 0, 0, areq->base.flags,
+				   &areq->base, false);
 }
 
 static int ahash_init(struct ahash_request *areq)
@@ -2690,6 +2762,13 @@  static int talitos_probe(struct platform_device *ofdev)
 		}
 
 		atomic_set(&priv->chan[i].submit_count, -priv->chfifo_len);
+
+		/*
+		 * The crypto_queue is used to manage the backlog only.  While
+		 * the hardware FIFO has space requests are dispatched
+		 * immediately.
+		 */
+		crypto_init_queue(&priv->chan[i].queue, 0);
 	}
 
 	dma_set_mask(dev, DMA_BIT_MASK(36));
diff --git a/drivers/crypto/talitos.h b/drivers/crypto/talitos.h
index 914eb15..2762e47 100644
--- a/drivers/crypto/talitos.h
+++ b/drivers/crypto/talitos.h
@@ -65,6 +65,7 @@  struct talitos_desc {
  * @context: caller context (optional)
  */
 struct talitos_request {
+	struct crypto_async_request base;
 	struct talitos_desc *desc;
 	dma_addr_t dma_desc;
 	void (*callback) (struct device *dev, struct talitos_desc *desc,
@@ -91,6 +92,8 @@  struct talitos_channel {
 	spinlock_t tail_lock ____cacheline_aligned;
 	/* index to next in-progress/done descriptor request */
 	int tail;
+
+	struct crypto_queue queue;
 };
 
 struct talitos_private {