[WIP,28/43] IB/core: Introduce new fast registration API

Message ID	1437548143-24893-29-git-send-email-sagig@mellanox.com (mailing list archive)
State	RFC
Headers	show Return-Path: <linux-rdma-owner@kernel.org> From: Sagi Grimberg <sagig@mellanox.com> To: linux-rdma@vger.kernel.org Cc: Liran Liss <liranl@mellanox.com>, Oren Duer <oren@mellanox.com> Subject: [PATCH WIP 28/43] IB/core: Introduce new fast registration API Date: Wed, 22 Jul 2015 09:55:28 +0300 Message-Id: <1437548143-24893-29-git-send-email-sagig@mellanox.com> In-Reply-To: <1437548143-24893-1-git-send-email-sagig@mellanox.com> References: <1437548143-24893-1-git-send-email-sagig@mellanox.com> Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk

Sagi Grimberg July 22, 2015, 6:55 a.m. UTC

The new fast registration is receiving a struct
scatterlist and converts it to a page list under
the verbs API. The user is provided with a new
verb ib_map_mr_sg, and a helper to set the send work
request structure.

The drivers are handed with a generic helper that
converts a scatterlist into a vector of pages.
Given that some drivers have a shadow mapped page list,
I expect that drivers might use their own routines to
avoid the extra copies.

The new registration API is added with fast_reg for
now, but once all drivers and ULPs will be ported, we
can drop the old registration API.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
---
 drivers/infiniband/core/verbs.c | 123 ++++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h         |  37 ++++++++++++
 2 files changed, 160 insertions(+)

Christoph Hellwig July 22, 2015, 4:50 p.m. UTC | #1

> +/**
> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> + * @mr:            memory region
> + * @sg:            dma mapped scatterlist
> + * @sg_nents:      number of entries in sg
> + * @access:        access permissions

I know moving the access flags here was my idea originally, but I seem
convinced by your argument that it might fit in better with the posting
helper.  Or did someone else come up with a better argument that mine
for moving it here?

> +int ib_map_mr_sg(struct ib_mr *mr,
> +		 struct scatterlist *sg,
> +		 unsigned short sg_nents,
> +		 unsigned int access)
> +{
> +	int rc;
> +
> +	if (!mr->device->map_mr_sg)
> +		return -ENOSYS;
> +
> +	rc = mr->device->map_mr_sg(mr, sg, sg_nents);

Do we really need a driver callout here?  It seems like we should
just do the map here, and then either have a flag for the mlx5 indirect
mapping, or if you want to keep the abstraction add the method at that
point but make it optional, so that all the other drivers don't need the
boilerplate code.

Also it seems like this returns 0/-error.  How do callers like SRP
see that it only did a partial mapping and it needs another MR?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg July 22, 2015, 4:56 p.m. UTC | #2

On 7/22/2015 7:50 PM, Christoph Hellwig wrote:
>> +/**
>> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
>> + * @mr:            memory region
>> + * @sg:            dma mapped scatterlist
>> + * @sg_nents:      number of entries in sg
>> + * @access:        access permissions
>
> I know moving the access flags here was my idea originally, but I seem
> convinced by your argument that it might fit in better with the posting
> helper.  Or did someone else come up with a better argument that mine
> for moving it here?

Not really. I was and still pretty indifferent about it...

>
>> +int ib_map_mr_sg(struct ib_mr *mr,
>> +		 struct scatterlist *sg,
>> +		 unsigned short sg_nents,
>> +		 unsigned int access)
>> +{
>> +	int rc;
>> +
>> +	if (!mr->device->map_mr_sg)
>> +		return -ENOSYS;
>> +
>> +	rc = mr->device->map_mr_sg(mr, sg, sg_nents);
>
> Do we really need a driver callout here?  It seems like we should
> just do the map here, and then either have a flag for the mlx5 indirect
> mapping, or if you want to keep the abstraction add the method at that
> point but make it optional, so that all the other drivers don't need the
> boilerplate code.

I commented on this bit in another reply. I think that several drivers
will want to use their own mappings. But I can change that if it's not
the case...

>
> Also it seems like this returns 0/-error.  How do callers like SRP
> see that it only did a partial mapping and it needs another MR?

Umm, I think SRP would need to iterate over the sg list and pass partial
SGs to the mapping (I can add a break; statement if we met sg_nents)

It's not perfect, but the idea was not to do backflips here.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 22, 2015, 5:44 p.m. UTC | #3

On Wed, Jul 22, 2015 at 09:50:12AM -0700, Christoph Hellwig wrote:
> > +/**
> > + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> > + * @mr:            memory region
> > + * @sg:            dma mapped scatterlist
> > + * @sg_nents:      number of entries in sg
> > + * @access:        access permissions
> 
> I know moving the access flags here was my idea originally, but I seem
> convinced by your argument that it might fit in better with the posting
> helper.  Or did someone else come up with a better argument that mine
> for moving it here?

I was hoping we'd move the DMA flush and translate into here and make
it mandatory. Is there any reason not to do that?

> > +int ib_map_mr_sg(struct ib_mr *mr,
> > +		 struct scatterlist *sg,
> > +		 unsigned short sg_nents,
> > +		 unsigned int access)
> > +{
> > +	int rc;
> > +
> > +	if (!mr->device->map_mr_sg)
> > +		return -ENOSYS;
> > +
> > +	rc = mr->device->map_mr_sg(mr, sg, sg_nents);
> 
> Do we really need a driver callout here?  It seems like we should

The call out makes sense to me..

The driver will convert the scatter list directly into whatever HW
representation it needs and prepare everything for posting. Every
driver has a different HW format, so it must be a callout.

> Also it seems like this returns 0/-error.  How do callers like SRP
> see that it only did a partial mapping and it needs another MR?

I would think it is an error to pass in more sg_nents than the MR was
created with, so SRP should never get a partial mapping as it should
never ask for more than max_entries.

(? Sagi, did I get the intent of this right?)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 22, 2015, 6:02 p.m. UTC | #4

On Wed, Jul 22, 2015 at 09:55:28AM +0300, Sagi Grimberg wrote:
> +/**
> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> + * @mr:            memory region
> + * @sg:            dma mapped scatterlist
> + * @sg_nents:      number of entries in sg
> + * @access:        access permissions

Again, related to my prior comments, please have two of these:

ib_map_mr_sg_rkey()
ib_map_mr_sg_lkey()

So we force ULPs to think about what they are doing properly, and we
get a chance to actually force lkey to be local use only for IB.

> +static inline void
> +ib_set_fastreg_wr(struct ib_mr *mr,
> +		  u32 key,

The key should come from MR. Once the above is split then it is
obvious which key to use.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig July 23, 2015, 9:19 a.m. UTC | #5

On Wed, Jul 22, 2015 at 11:44:01AM -0600, Jason Gunthorpe wrote:
> I was hoping we'd move the DMA flush and translate into here and make
> it mandatory. Is there any reason not to do that?

That would be a reason for passing in a direction, but it would also
up the question on what form we pass that access flag in.  The
old-school RDMA local/remote read/write flags, or a enum_dma_direction
and either a bool or separate functions for lkey/rkey.

Although I wonder if we really need to differentiate between rkey and
leky in this ib_map_mr_sg function, or if we should do it when
allocating the mr, i.e. in ib_alloc_mr.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg July 23, 2015, 10:15 a.m. UTC | #6

On 7/22/2015 8:44 PM, Jason Gunthorpe wrote:
> On Wed, Jul 22, 2015 at 09:50:12AM -0700, Christoph Hellwig wrote:
>>> +/**
>>> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
>>> + * @mr:            memory region
>>> + * @sg:            dma mapped scatterlist
>>> + * @sg_nents:      number of entries in sg
>>> + * @access:        access permissions
>>
>> I know moving the access flags here was my idea originally, but I seem
>> convinced by your argument that it might fit in better with the posting
>> helper.  Or did someone else come up with a better argument that mine
>> for moving it here?
>
> I was hoping we'd move the DMA flush and translate into here and make
> it mandatory. Is there any reason not to do that?

The reason I didn't added it in was so the ULPs can make sure they meet
the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
SG list set partials and iSER to detect gaps (they need to dma map
for that).

>
>>> +int ib_map_mr_sg(struct ib_mr *mr,
>>> +		 struct scatterlist *sg,
>>> +		 unsigned short sg_nents,
>>> +		 unsigned int access)
>>> +{
>>> +	int rc;
>>> +
>>> +	if (!mr->device->map_mr_sg)
>>> +		return -ENOSYS;
>>> +
>>> +	rc = mr->device->map_mr_sg(mr, sg, sg_nents);
>>
>> Do we really need a driver callout here?  It seems like we should
>
> The call out makes sense to me..
>
> The driver will convert the scatter list directly into whatever HW
> representation it needs and prepare everything for posting. Every
> driver has a different HW format, so it must be a callout.
>
>> Also it seems like this returns 0/-error.  How do callers like SRP
>> see that it only did a partial mapping and it needs another MR?
>
> I would think it is an error to pass in more sg_nents than the MR was
> created with, so SRP should never get a partial mapping as it should
> never ask for more than max_entries.
>
> (? Sagi, did I get the intent of this right?)

Error is returned when:
- sg_nents > max_entries
- sg has gaps
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg July 23, 2015, 10:19 a.m. UTC | #7

On 7/22/2015 9:02 PM, Jason Gunthorpe wrote:
> On Wed, Jul 22, 2015 at 09:55:28AM +0300, Sagi Grimberg wrote:
>> +/**
>> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
>> + * @mr:            memory region
>> + * @sg:            dma mapped scatterlist
>> + * @sg_nents:      number of entries in sg
>> + * @access:        access permissions
>
> Again, related to my prior comments, please have two of these:
>
> ib_map_mr_sg_rkey()
> ib_map_mr_sg_lkey()
>
> So we force ULPs to think about what they are doing properly, and we
> get a chance to actually force lkey to be local use only for IB.

The lkey/rkey decision is passed in the fastreg post_send().

ib_map_mr_sg is just a mapping API, not the registration itself.

>
>> +static inline void
>> +ib_set_fastreg_wr(struct ib_mr *mr,
>> +		  u32 key,
>
> The key should come from MR. Once the above is split then it is
> obvious which key to use.

IMO, it's obvious as it is. I don't see why should anyone get it
wrong.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 23, 2015, 4:03 p.m. UTC | #8

On Thu, Jul 23, 2015 at 02:19:55AM -0700, Christoph Hellwig wrote:
> Although I wonder if we really need to differentiate between rkey and
> leky in this ib_map_mr_sg function, or if we should do it when
> allocating the mr, i.e. in ib_alloc_mr.

The allocation is agnostic to the usage, the map is what solidifies
things into a certain use, effectively based on the access flags..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 23, 2015, 4:14 p.m. UTC | #9

On Thu, Jul 23, 2015 at 01:19:16PM +0300, Sagi Grimberg wrote:
> >Again, related to my prior comments, please have two of these:
> >
> >ib_map_mr_sg_rkey()
> >ib_map_mr_sg_lkey()
> >
> >So we force ULPs to think about what they are doing properly, and we
> >get a chance to actually force lkey to be local use only for IB.
> 
> The lkey/rkey decision is passed in the fastreg post_send().

That is too late to check the access flags.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg July 23, 2015, 4:47 p.m. UTC | #10

On 7/23/2015 7:14 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 01:19:16PM +0300, Sagi Grimberg wrote:
>>> Again, related to my prior comments, please have two of these:
>>>
>>> ib_map_mr_sg_rkey()
>>> ib_map_mr_sg_lkey()
>>>
>>> So we force ULPs to think about what they are doing properly, and we
>>> get a chance to actually force lkey to be local use only for IB.
>>
>> The lkey/rkey decision is passed in the fastreg post_send().
>
> That is too late to check the access flags.

Why? the access permissions are kept in the mr context?
I can move it to the post interface if it makes more sense.
the access is kind of out of place in the mapping routine anyway...

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 23, 2015, 5:55 p.m. UTC | #11

On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
> >I was hoping we'd move the DMA flush and translate into here and make
> >it mandatory. Is there any reason not to do that?
> 
> The reason I didn't added it in was so the ULPs can make sure they meet
> the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
> SG list set partials and iSER to detect gaps (they need to dma map
> for that).

The ULP can always get the sg list's virtual address to check for
gaps. Page aligned gaps are always OK.

BTW, the logic in ib_sg_to_pages should be checking that directly, as
coded, it won't work with swiotlb:

// Only the first SG entry can start unaligned
if (i && page_addr != dma_addr)
    return EINVAL;
// Only the last SG entry can end unaligned
if ((page_addr + dma_len) & PAGE_MASK != end_dma_addr)
 if (!is_last)
     return EINVAL;

Don't use sg->offset after dma mapping.

The biggest problem with checking the virtual address is
swiotlb. However, if swiotlb is used this API is basically broken as
swiotlb downgrades everything to a 2k alignment, which means we only
ever get 1 s/g entry.

To efficiently support swiotlb we'd need to see the driver be able to
work with a page size of IO_TLB_SEGSIZE (2k) so it can handle the
de-alignment that happens during bouncing.

My biggest problem with pushing the dma address up to the ULP is
basically that: The ULP has no idea what the driver can handle, maybe
the driver can handle the 2k pages.

So, that leaves a flow where the ULP does a basic sanity check on the
virtual side, then asks the IB core to map it. The mapping could still
fail because of swiotlb.

If the mapping fails, then the ULP has to bounce buffer, or MR split,
or totally fail.

For bounce buffer, all solutions have a DMA map/unmap cost, so it
doesn't matter if ib_map_mr_sg does that internally.

For MR fragment, the DMA mapping is still usable. Maybe we do need a
slightly different core API to help MR fragmentation? Sounds like NFS
uses this too?

 num_mrs = ib_mr_fragment_sg(&scatterlist);
 while (..)
   ib_map_fragment_sg(mr[i++],&scatterlist,&offset);

Perhaps?

Maybe that is even better because something like iser could do
the parallel:
 ib_mr_needs_fragment_sg(reference_mr,&scatterlist)

Which hides all the various restrictions in driver code.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 23, 2015, 6:42 p.m. UTC | #12

On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
> On 7/22/2015 8:44 PM, Jason Gunthorpe wrote:
> >On Wed, Jul 22, 2015 at 09:50:12AM -0700, Christoph Hellwig wrote:
> >>>+/**
> >>>+ * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> >>>+ * @mr:            memory region
> >>>+ * @sg:            dma mapped scatterlist
> >>>+ * @sg_nents:      number of entries in sg
> >>>+ * @access:        access permissions
> >>
> >>I know moving the access flags here was my idea originally, but I seem
> >>convinced by your argument that it might fit in better with the posting
> >>helper.  Or did someone else come up with a better argument that mine
> >>for moving it here?
> >
> >I was hoping we'd move the DMA flush and translate into here and make
> >it mandatory. Is there any reason not to do that?
> 
> The reason I didn't added it in was so the ULPs can make sure they meet
> the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
> SG list set partials and iSER to detect gaps (they need to dma map
> for that).

I would like to see the kdoc for ib_map_mr_sg explain exactly what is
required of the caller, maybe just hoist this bit from the
ib_sg_to_pages

Not entirely required if we are going to have an API to do the test..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 23, 2015, 6:51 p.m. UTC | #13

On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:

> >>>So we force ULPs to think about what they are doing properly, and we
> >>>get a chance to actually force lkey to be local use only for IB.
> >>
> >>The lkey/rkey decision is passed in the fastreg post_send().
> >
> >That is too late to check the access flags.
> 
> Why? the access permissions are kept in the mr context?

Sure, one could do if (key == mr->lkey) .. check lkey flags in the
post, but that seems silly considering we want the post inlined..

> I can move it to the post interface if it makes more sense.
> the access is kind of out of place in the mapping routine anyway...

All the dma routines have an access equivalent during map, I don't
think it is out of place..

To my mind, the map is the point where the MR should crystallize into
an rkey or lkey MR, not at the post.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg July 26, 2015, 8:54 a.m. UTC | #14

>
> I would like to see the kdoc for ib_map_mr_sg explain exactly what is
> required of the caller, maybe just hoist this bit from the
> ib_sg_to_pages

I'll add the kdoc.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg July 26, 2015, 9:37 a.m. UTC | #15

On 7/23/2015 8:55 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
>>> I was hoping we'd move the DMA flush and translate into here and make
>>> it mandatory. Is there any reason not to do that?
>>
>> The reason I didn't added it in was so the ULPs can make sure they meet
>> the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
>> SG list set partials and iSER to detect gaps (they need to dma map
>> for that).
>
> The ULP can always get the sg list's virtual address to check for
> gaps. Page aligned gaps are always OK.

I guess I can pull DMA mapping in there, but we will need an opposite
routine ib_umap_mr_sg() since it'll be weird if the ULP will do dma
unmap without doing the map...

>
> BTW, the logic in ib_sg_to_pages should be checking that directly, as
> coded, it won't work with swiotlb:
>
> // Only the first SG entry can start unaligned
> if (i && page_addr != dma_addr)
>      return EINVAL;
> // Only the last SG entry can end unaligned
> if ((page_addr + dma_len) & PAGE_MASK != end_dma_addr)
>   if (!is_last)
>       return EINVAL;
>
> Don't use sg->offset after dma mapping.
>
> The biggest problem with checking the virtual address is
> swiotlb. However, if swiotlb is used this API is basically broken as
> swiotlb downgrades everything to a 2k alignment, which means we only
> ever get 1 s/g entry.

Can you explain what do you mean by "downgrades everything to a 2k 
alignment"? If the ULP is responsible for a PAGE_SIZE alignment than
how would this get out of alignment with swiotlb?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg July 26, 2015, 9:45 a.m. UTC | #16

On 7/23/2015 9:51 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:
>
>>>>> So we force ULPs to think about what they are doing properly, and we
>>>>> get a chance to actually force lkey to be local use only for IB.
>>>>
>>>> The lkey/rkey decision is passed in the fastreg post_send().
>>>
>>> That is too late to check the access flags.
>>
>> Why? the access permissions are kept in the mr context?
>
> Sure, one could do if (key == mr->lkey) .. check lkey flags in the
> post, but that seems silly considering we want the post inlined..

Why should we check the lkey/rkey access flags in the post?

>
>> I can move it to the post interface if it makes more sense.
>> the access is kind of out of place in the mapping routine anyway...
>
> All the dma routines have an access equivalent during map, I don't
> think it is out of place..
>
> To my mind, the map is the point where the MR should crystallize into
> an rkey or lkey MR, not at the post.

I'm not sure I understand why the lkey/rkey should be set at the map
routine. To me, it seems more natural to map_mr_sg and then either
register the lkey or the rkey.

It's easy enough to move the key arg to ib_map_mr_sg, but I don't see a
good reason why at the moment.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 27, 2015, 5:04 p.m. UTC | #17

On Sun, Jul 26, 2015 at 12:37:55PM +0300, Sagi Grimberg wrote:
> I guess I can pull DMA mapping in there, but we will need an opposite
> routine ib_umap_mr_sg() since it'll be weird if the ULP will do dma
> unmap without doing the map...

Yes, and ideally it would help ULPs to order these operations
properly. eg we shouldn't be abusing the DMA API and unmapping before
invalidate completes by default. That breaks obscure stuff in various
ways...

> >The biggest problem with checking the virtual address is
> >swiotlb. However, if swiotlb is used this API is basically broken as
> >swiotlb downgrades everything to a 2k alignment, which means we only
> >ever get 1 s/g entry.
> 
> Can you explain what do you mean by "downgrades everything to a 2k
> alignment"? If the ULP is responsible for a PAGE_SIZE alignment than
> how would this get out of alignment with swiotlb?

swiotlb copies all DMA maps to a shared buffer below 4G so it can be
used with 32 bit devices.

The shared buffer is managed in a way that copies each s/g element to
a continuous 2k aligned subsection of the buffer.

Basically, swiotlb realigns everything that passes through it.

The DMA API allows this, so ultimately, code has to check the dma
physical address when concerned about alignment.. But we should not
expect this to commonly fail.

So, something like..

  if (!ib_does_sgl_fit_in_mr(mr,sg))
     .. bounce buffer ..

  if (!ib_map_mr_sg(mr,sg)) // does dma mapping and checks it
     .. bounce buffer ..

  .. post ..
  .. send invalidate ..
  .. catch invalidate completion ...

  ib_unmap_mr(mr); // does dma unmap

?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 27, 2015, 5:14 p.m. UTC | #18

On Sun, Jul 26, 2015 at 12:45:10PM +0300, Sagi Grimberg wrote:
> On 7/23/2015 9:51 PM, Jason Gunthorpe wrote:
> >On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:
> >
> >>>>>So we force ULPs to think about what they are doing properly, and we
> >>>>>get a chance to actually force lkey to be local use only for IB.
> >>>>
> >>>>The lkey/rkey decision is passed in the fastreg post_send().
> >>>
> >>>That is too late to check the access flags.
> >>
> >>Why? the access permissions are kept in the mr context?
> >
> >Sure, one could do if (key == mr->lkey) .. check lkey flags in the
> >post, but that seems silly considering we want the post inlined..
> 
> Why should we check the lkey/rkey access flags in the post?

Eh? It was your idea..

I just want to check the access flags and force lkey's to not have
ACCESS_REMOTE set without complaining loudly.

To do that you need to know if the mr is a lkey/rkey, and you need to
know the flags.

> >>I can move it to the post interface if it makes more sense.
> >>the access is kind of out of place in the mapping routine anyway...
> >
> >All the dma routines have an access equivalent during map, I don't
> >think it is out of place..
> >
> >To my mind, the map is the point where the MR should crystallize into
> >an rkey or lkey MR, not at the post.
> 
> I'm not sure I understand why the lkey/rkey should be set at the map
> routine. To me, it seems more natural to map_mr_sg and then either
> register the lkey or the rkey.

We need to check the access flags to put a stop to this remote access
lkey security problem. That means we need to label every MR as a lkey
or rkey MR.

No more MR's can be both nonsense.

Pick a place to do that and enforce that IB cannot have remote access
LKEYs.

My vote is to do that work in map, because I don't think it make any
sense in post (post should not fail)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Steve Wise July 27, 2015, 8:11 p.m. UTC | #19

On 7/27/2015 12:14 PM, Jason Gunthorpe wrote:
> On Sun, Jul 26, 2015 at 12:45:10PM +0300, Sagi Grimberg wrote:
>> On 7/23/2015 9:51 PM, Jason Gunthorpe wrote:
>>> On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:
>>>
>>>>>>> So we force ULPs to think about what they are doing properly, and we
>>>>>>> get a chance to actually force lkey to be local use only for IB.
>>>>>> The lkey/rkey decision is passed in the fastreg post_send().
>>>>> That is too late to check the access flags.
>>>> Why? the access permissions are kept in the mr context?
>>> Sure, one could do if (key == mr->lkey) .. check lkey flags in the
>>> post, but that seems silly considering we want the post inlined..
>> Why should we check the lkey/rkey access flags in the post?
> Eh? It was your idea..
>
> I just want to check the access flags and force lkey's to not have
> ACCESS_REMOTE set without complaining loudly.
>
> To do that you need to know if the mr is a lkey/rkey, and you need to
> know the flags.
>
>>>> I can move it to the post interface if it makes more sense.
>>>> the access is kind of out of place in the mapping routine anyway...
>>> All the dma routines have an access equivalent during map, I don't
>>> think it is out of place..
>>>
>>> To my mind, the map is the point where the MR should crystallize into
>>> an rkey or lkey MR, not at the post.
>> I'm not sure I understand why the lkey/rkey should be set at the map
>> routine. To me, it seems more natural to map_mr_sg and then either
>> register the lkey or the rkey.
> We need to check the access flags to put a stop to this remote access
> lkey security problem. That means we need to label every MR as a lkey
> or rkey MR.
>
> No more MR's can be both nonsense.

Well technically an MR with REMOTE_WRITE also has LOCAL_WRITE set. So 
you are proposing the core disallow a ULP from using the lkey for this 
type of MR?  Say in a RECV sge?



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 27, 2015, 8:29 p.m. UTC | #20

On Mon, Jul 27, 2015 at 03:11:04PM -0500, Steve Wise wrote:
> Well technically an MR with REMOTE_WRITE also has LOCAL_WRITE set. So you
> are proposing the core disallow a ULP from using the lkey for this type of
> MR?  Say in a RECV sge?

Yes, absolutely.

It is wrong anyhow, RECV isn't special, if you RECV into memory that
is exposed via a rkey MR, you have to invalidate that MR and fence DMA
before you can touch the buffer. Only very special, carefully
designed, cases could avoid that.

We don't have those cases, so lets just ban it. The only exception is
the iWarp RDMA READ thing.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Haggai Eran July 28, 2015, 11:20 a.m. UTC | #21

On 22/07/2015 09:55, Sagi Grimberg wrote:
> +/**
> + * ib_sg_to_pages() - Convert a sg list to a page vector
> + * @dev:           ib device
> + * @sgl:           dma mapped scatterlist
> + * @sg_nents:      number of entries in sg
> + * @max_pages:     maximum pages allowed
> + * @pages:         output page vector
> + * @npages:        output number of mapped pages
> + * @length:        output total byte length
> + * @offset:        output first byte offset
> + *
> + * Core service helper for drivers to convert a scatter
> + * list to a page vector. The assumption is that the
> + * sg must meet the following conditions:
> + * - Only the first sg is allowed to have an offset
> + * - All the elements are of the same size - PAGE_SIZE
> + * - The last element is allowed to have length less than
> + *   PAGE_SIZE
> + *
> + * If any of those conditions is not met, the routine will
> + * fail with EINVAL.
> + */
> +int ib_sg_to_pages(struct scatterlist *sgl,
> +		   unsigned short sg_nents,
> +		   unsigned short max_pages,
> +		   u64 *pages, u32 *npages,
> +		   u32 *length, u64 *offset)
> +{
> +	struct scatterlist *sg;
> +	u64 last_end_dma_addr = 0, last_page_addr = 0;
> +	unsigned int last_page_off = 0;
> +	int i, j = 0;
> +
> +	/* TODO: We can do better with huge pages */
> +
> +	*offset = sg_dma_address(&sgl[0]);
> +	*length = 0;
> +
> +	for_each_sg(sgl, sg, sg_nents, i) {
> +		u64 dma_addr = sg_dma_address(sg);
> +		unsigned int dma_len = sg_dma_len(sg);
> +		u64 end_dma_addr = dma_addr + dma_len;
> +		u64 page_addr = dma_addr & PAGE_MASK;
> +
> +		*length += dma_len;
> +
> +		/* Fail we ran out of pages */
> +		if (unlikely(j > max_pages))
> +			return -EINVAL;
> +
> +		if (i && sg->offset) {
> +			if (unlikely((last_end_dma_addr) != dma_addr)) {
> +				/* gap - fail */
> +				goto err;
> +			}
> +			if (last_page_off + dma_len < PAGE_SIZE) {
> +				/* chunk this fragment with the last */
> +				last_end_dma_addr += dma_len;
> +				last_page_off += dma_len;
> +				continue;
> +			} else {
> +				/* map starting from the next page */
> +				page_addr = last_page_addr + PAGE_SIZE;
> +				dma_len -= PAGE_SIZE - last_page_off;
> +			}
> +		}
> +
> +		do {
> +			pages[j++] = page_addr;
I think this line could overrun the pages buffer. The test above only checks
at the beginning of the sg, but with an sg larger than PAGE_SIZE, you could
still overrun.

> +			page_addr += PAGE_SIZE;
> +		} while (page_addr < end_dma_addr);
> +
> +		last_end_dma_addr = end_dma_addr;
> +		last_page_addr = end_dma_addr & PAGE_MASK;
> +		last_page_off = end_dma_addr & ~PAGE_MASK;
> +	}
> +
> +	*npages = j;
> +
> +	return 0;
> +err:
> +	pr_err("RDMA alignment violation\n");
> +	for_each_sg(sgl, sg, sg_nents, i) {
> +		u64 dma_addr = sg_dma_address(sg);
> +		unsigned int dma_len = sg_dma_len(sg);
> +
> +		pr_err("sg[%d]: offset=0x%x, dma_addr=0x%llx, dma_len=0x%x\n",
> +			i, sg->offset, dma_addr, dma_len);
> +	}
> +
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL(ib_sg_to_pages);

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg July 30, 2015, 7:13 a.m. UTC | #22

>> Can you explain what do you mean by "downgrades everything to a 2k
>> alignment"? If the ULP is responsible for a PAGE_SIZE alignment than
>> how would this get out of alignment with swiotlb?
>
> swiotlb copies all DMA maps to a shared buffer below 4G so it can be
> used with 32 bit devices.
>
> The shared buffer is managed in a way that copies each s/g element to
> a continuous 2k aligned subsection of the buffer.
>

Thanks for the explanation.

> Basically, swiotlb realigns everything that passes through it.

So this won't ever happen if the ULP will DMA map the SG and check
for gaps right?

Also, is it interesting to support swiotlb even if we don't have
any devices that require it (and should we expect one to ever exist)?

>
> The DMA API allows this, so ultimately, code has to check the dma
> physical address when concerned about alignment.. But we should not
> expect this to commonly fail.
>
> So, something like..
>
>    if (!ib_does_sgl_fit_in_mr(mr,sg))
>       .. bounce buffer ..

I don't understand the need for this is we do the same thing
if the actual mapping fails...

>
>    if (!ib_map_mr_sg(mr,sg)) // does dma mapping and checks it
>       .. bounce buffer ..

Each ULP would want to do something different, iser
will bounce but srp would need to use multiple mrs, nfs will
split the request.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe July 30, 2015, 4:36 p.m. UTC | #23

On Thu, Jul 30, 2015 at 10:13:09AM +0300, Sagi Grimberg wrote:

> >Basically, swiotlb realigns everything that passes through it.
> 
> So this won't ever happen if the ULP will DMA map the SG and check
> for gaps right?

Once mapped the physical address isn't going to change - but at some
point we must check the physical address directly.

> Also, is it interesting to support swiotlb even if we don't have
> any devices that require it (and should we expect one to ever exist)?

swiotlb is an obvious example, and totally uninteresting to support,
but we must correctly use the DMA API.

> >The DMA API allows this, so ultimately, code has to check the dma
> >physical address when concerned about alignment.. But we should not
> >expect this to commonly fail.
> >
> >So, something like..
> >
> >   if (!ib_does_sgl_fit_in_mr(mr,sg))
> >      .. bounce buffer ..
> 
> I don't understand the need for this is we do the same thing
> if the actual mapping fails...

Just performance. DMA mapping is potentially very expensive, the
common case to detect will be a sg that is virtually unaligned.

This virtual scan could be bundled insde the map, but if a ULP knows
it is page aligned already then that is just creating overhead..

I'm ambivalent..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig July 30, 2015, 4:39 p.m. UTC | #24

On Thu, Jul 30, 2015 at 10:36:31AM -0600, Jason Gunthorpe wrote:
> > Also, is it interesting to support swiotlb even if we don't have
> > any devices that require it (and should we expect one to ever exist)?
> 
> swiotlb is an obvious example, and totally uninteresting to support,
> but we must correctly use the DMA API.

Do we have a choice?  It seems like various setups with DMA restrictions
rely on it, including many Xen PV guests.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg Aug. 19, 2015, 11:56 a.m. UTC | #25

On 7/23/2015 8:55 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
>>> I was hoping we'd move the DMA flush and translate into here and make
>>> it mandatory. Is there any reason not to do that?
>>
>> The reason I didn't added it in was so the ULPs can make sure they meet
>> the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
>> SG list set partials and iSER to detect gaps (they need to dma map
>> for that).
>
> The ULP can always get the sg list's virtual address to check for
> gaps. Page aligned gaps are always OK.

So I had a go with moving the DMA mapping into ib_map_mr_sg() and
it turns out mapping somewhat poorly if the ULP _may_ register memory
or just send sg_lists (like storage targets over IB/iWARP). So the ULP
will sometimes use the DMA mapping and sometimes it won't... feels
kinda off to me...

it's much saner to do:
1. dma_map_sg
2. register / send-sg-list
3. unregister (if needed)
4. dma_unmap_sg

then:
1. if register - call ib_map_mr_sg (which calls dma_map_sg)
    else do dma_map_sg
2. if registered - call ib_dma_unmap_sg (which calles dma_unmap_sg)
    else do dma_unmap_sg

this kinda forces ULP to completely separate these code paths with
with very little sharing.

Also, at the moment, when ULPs are doing either FRWR or FMRs
its a pain to get a non-intrusive conversion.

I'm thinking we should keep dma_map_sg out of ib_map_mr_sg, and leave
it to the ULP like it does today (at least in the first stage...)

Thoughts?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Aug. 19, 2015, 12:52 p.m. UTC | #26

On Wed, Aug 19, 2015 at 02:56:24PM +0300, Sagi Grimberg wrote:
> So I had a go with moving the DMA mapping into ib_map_mr_sg() and
> it turns out mapping somewhat poorly if the ULP _may_ register memory
> or just send sg_lists (like storage targets over IB/iWARP). So the ULP
> will sometimes use the DMA mapping and sometimes it won't... feels
> kinda off to me...

Yes, it's odd.

> it's much saner to do:
> 1. dma_map_sg
> 2. register / send-sg-list
> 3. unregister (if needed)
> 4. dma_unmap_sg
> 
> then:
> 1. if register - call ib_map_mr_sg (which calls dma_map_sg)
>    else do dma_map_sg
> 2. if registered - call ib_dma_unmap_sg (which calles dma_unmap_sg)
>    else do dma_unmap_sg
> 
> this kinda forces ULP to completely separate these code paths with
> with very little sharing.
> 
> Also, at the moment, when ULPs are doing either FRWR or FMRs
> its a pain to get a non-intrusive conversion.
> 
> I'm thinking we should keep dma_map_sg out of ib_map_mr_sg, and leave
> it to the ULP like it does today (at least in the first stage...)

Keep it out for now.  I think we need to move the dma mapping into
the RDMA care rather sooner than later, but that must also include
ib_post_send/recv, so it's better done separately.

After having a look at the mess some drivers (ipath,qib,hfi & ehca)
cause with abuse of dma_map_ops I've got an even strong opion on
the whole subject now.  However I think we'll get more things done
if we split them into smaller steps.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg Aug. 19, 2015, 4:09 p.m. UTC | #27

>
> Keep it out for now.

Ok, I was also thinking on moving the access flags
to the work request again. It doesn't make much sense there
unless I go with what Jason suggested with ib_map_mr_[lkey|rkey]
to protect against remote access for lkeys in IB which to me, sounds
redundant at this point given that ULPs will set the access according
to iWARP anyway.

I'd prefer to get this right with a different helper like Steve
suggested:
int rdma_access_flags(int mr_roles);

This way we don't need to protect against it.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Aug. 19, 2015, 4:58 p.m. UTC | #28

On Wed, Aug 19, 2015 at 07:09:18PM +0300, Sagi Grimberg wrote:
> Ok, I was also thinking on moving the access flags
> to the work request again.

Yes, with the current code I don't think we need it in the MR.

> I'd prefer to get this right with a different helper like Steve
> suggested:
> int rdma_access_flags(int mr_roles);

We can start with that.  In the long run we really want to have
two higher level helpers to RDMA READ a scatterlist:

 - one for iWARP that uses an FR and RDMA READ WITH INVALIDATE
 - one of IB-like transports that just uses a READ with the
   local lkey

Right now every ULP that wants to support iWarp needs to duplicate
that code.  This leads to some curious situations like the NFS
server apparently always using FRs if available for this if my
reading of svc_rdma_accept() is correct, or the weird parallel
code pathes for IB vs iWarp in RDS:

hch@brick:~/work/linux/net/rds$ ls ib*
ib.c  ib_cm.c  ib.h  ib_rdma.c  ib_recv.c  ib_ring.c  ib_send.c
ib_stats.c  ib_sysctl.c
hch@brick:~/work/linux/net/rds$ ls iw*
iw.c  iw_cm.c  iw.h  iw_rdma.c  iw_recv.c  iw_ring.c  iw_send.c
iw_stats.c  iw_sysctl.c

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe Aug. 19, 2015, 5:37 p.m. UTC | #29

On Wed, Aug 19, 2015 at 02:56:24PM +0300, Sagi Grimberg wrote:
> On 7/23/2015 8:55 PM, Jason Gunthorpe wrote:
> >On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
> >>>I was hoping we'd move the DMA flush and translate into here and make
> >>>it mandatory. Is there any reason not to do that?
> >>
> >>The reason I didn't added it in was so the ULPs can make sure they meet
> >>the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
> >>SG list set partials and iSER to detect gaps (they need to dma map
> >>for that).
> >
> >The ULP can always get the sg list's virtual address to check for
> >gaps. Page aligned gaps are always OK.
> 
> So I had a go with moving the DMA mapping into ib_map_mr_sg() and
> it turns out mapping somewhat poorly if the ULP _may_ register memory
> or just send sg_lists (like storage targets over IB/iWARP). So the ULP
> will sometimes use the DMA mapping and sometimes it won't... feels
> kinda off to me...

You need to split the rkey and lkey API flows to pull this off - the
rkey side never needs to touch a sg, while the lkey side should always
try and use a sg first. I keep saying this: they have
fundamentally different ULP usages.

> 1. if register - call ib_map_mr_sg (which calls dma_map_sg)
>    else do dma_map_sg
> 2. if registered - call ib_dma_unmap_sg (which calles dma_unmap_sg)
>    else do dma_unmap_sg

From what I've seen in the ULPs the flow control is generally such
that the MR is 'consumed' even if it isn't used by a send.

So lkey usage is simply split into things that absolutely don't need a
MR, and things that maybe do. The maybe side can go ahead and always
consume the MR resource, but optimize the implementation to a SG list
to avoid a performance hit.

Then the whole API becomes symmetric. The ULP says, 'here is a
scatterlist list and a lkey MR, make me a ib_sg list' and the core
either packes it as is into the sg, or it spins up the MR and packs
that.

This lets the unmap be symmetric, as the core always dma_unmaps, but
only tears down the MR if it was used.

The cost is the lkey MR slot is always consumed, which should be OK
because SQE flow control bounds the number of concurrent MRs required,
so consuming a SQE but not a MR doesn't provide an advantage.

> Also, at the moment, when ULPs are doing either FRWR or FMRs
> its a pain to get a non-intrusive conversion.

Without FMR sharing API entry points it is going to be hard to unify
them..

ie the map and alloc API side certainly could be shared..

> I'm thinking we should keep dma_map_sg out of ib_map_mr_sg, and leave
> it to the ULP like it does today (at least in the first stage...)

I'm fine with first stage, but we still really do need to figure how
how to get better code sharing in our API here..

Maybe we can do the rkey side right away until we can figure out how
to harmonize the rkey sg/mr usage?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg Aug. 20, 2015, 10:05 a.m. UTC | #30

>> 1. if register - call ib_map_mr_sg (which calls dma_map_sg)
>>     else do dma_map_sg
>> 2. if registered - call ib_dma_unmap_sg (which calles dma_unmap_sg)
>>     else do dma_unmap_sg
>
>  From what I've seen in the ULPs the flow control is generally such
> that the MR is 'consumed' even if it isn't used by a send.

Not really. if registration is not needed, an MR is not consumed. In
fact, in svcrdma the IB code path never uses those, and the iWARP code
path always use those for RDMA_READs and not RDMA_WRITEs. Also, isert
use those only when signature is enabled and registration is required.

>
> So lkey usage is simply split into things that absolutely don't need a
> MR, and things that maybe do. The maybe side can go ahead and always
> consume the MR resource, but optimize the implementation to a SG list
> to avoid a performance hit.
>
> Then the whole API becomes symmetric. The ULP says, 'here is a
> scatterlist list and a lkey MR, make me a ib_sg list' and the core
> either packes it as is into the sg, or it spins up the MR and packs
> that.

Always consuming an MR resource is an extra lock acquire given these
are always kept in a pool structure.

>> I'm thinking we should keep dma_map_sg out of ib_map_mr_sg, and leave
>> it to the ULP like it does today (at least in the first stage...)
>
> I'm fine with first stage, but we still really do need to figure how
> how to get better code sharing in our API here..
>
> Maybe we can do the rkey side right away until we can figure out how
> to harmonize the rkey sg/mr usage?

I'm fine with that. I agree we still need to do better.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe Aug. 20, 2015, 7:04 p.m. UTC | #31

On Thu, Aug 20, 2015 at 01:05:59PM +0300, Sagi Grimberg wrote:
> >>1. if register - call ib_map_mr_sg (which calls dma_map_sg)
> >>    else do dma_map_sg
> >>2. if registered - call ib_dma_unmap_sg (which calles dma_unmap_sg)
> >>    else do dma_unmap_sg
> >
> > From what I've seen in the ULPs the flow control is generally such
> >that the MR is 'consumed' even if it isn't used by a send.
> 
> Not really. if registration is not needed, an MR is not consumed. In
> fact, in svcrdma the IB code path never uses those, and the iWARP code
> path always use those for RDMA_READs and not RDMA_WRITEs. Also, isert
> use those only when signature is enabled and registration is required.

The MR is not *used* but it should be 'consumed' - in the sense that
every RPC slot is associated (implicitly) with a MR, leaving the unused MR
in some kind of pool doesn't really help anything. Honestly, the MR
pool idea doesn't really help anything, it just makes confusion.

What should be pooled is the 'request slot' itself, in the sense that
if a request slot is in the 'ready to go' pool it is guarenteed to be
able to complete *any* request without blocking. That means the
MR/SQE/CQE resources are all ready to go. Any ancillary memory is
ready to use, etc.

The ULP should design its slot with the idea that it doesn't have to
allocate memory, or IB resources, or block, once the slot becomes
'ready to go'.

Review the discussion Chuck and I had on SQE flow control for a sense
on what that means. Understand why the lifetime of the MR and the SQE
and the slot are all convoluted together if RDMA is used correctly.

Trying to decouple the sub resources, ie by separately pooling the
MR/SQE/etc, is just unnecessary complexity, IMHO.. NFS client already
had serioues bugs in this area.

So, I turn to the idea that every ULP should work as the above, which
means when it gets to working on a 'slot' that implies there is an
actual struct ib_mr resource guaranteed available. This is why I
suggested using the 'struct ib_mr' to guide the SG construction even if
the actual HW MR isn't going to be used. The struct ib_mr is tied to
the slot, so using it has no cost.

-------

But, maybe that is too much of a shortcut, and thinking about it more
and all the other things I've written about.. Lets just directly
address this issue and add something called 'struct ib_op_slot'.

Data transfer would look like this:

 struct *ib_send_wr cur;

 cur = ib_slot_make_send(&req->op_slot,scatter_list);
 cur->next = ib_slot_make_rdma_read(&next_req->op,scatter_list,
 	          rkey,length);
 ib_post_send(qp,cur);

 [.. at CQE time ..]
 if (ib_slot_complete(qp,req->op_slot))
    [.. slot is now 'ready to go' ..]
 else
    [.. otherwise more stuff was posted, have to wait ...]

This forces the ULP to deal with many of the issues. Having a slot
means guarenteed minimum avaiable MR,SQE,CQE resources. That
guarenteed minimum avoids the messy API struggle in my prior writings.

.. and maybe the above is even thinking too small, to Christoph's
earlier musings, I wonder if a slot based middle API could hijack the
entire SCQ processing and have a per-slot callback scheme
instead. That kind of intervention is exactly what we'd need to
trivially hide the FMR difference.

... and now this provides enough context to start talking about common
helper APIs for common ULP things, like the rdma_read switch. The slot
has pre-allocated everything needed to handle the variations.

... which suddenly starts to be really viable because the slot
guarentees SQE availability too.

... and we start having the idea of a slot able to do certain tasks,
and codify that with API help at creation:

  struct nfs_rpc_slot
  {
     strict ib_op_slot slot;
  };

  struct ib_op_slot_attributes attrs;
  ib_init_slot_attrs(&attrs,ib_pd);

  ib_request_action(&attrs, "args describing RDMA read with N SGEs");
  if (ib_request_action("args describing a requirement for signature"))
      signature_supported = true;
  if (ib_request_action("args describing a requirement for non-page-aligned"))
      byte_sgl_supported = true;
  ib_request_action("args describing SEND with N SGEs");
  ib_request_action("args describing N RDMA reads each with N SGEs");

  for (required slot concurrency)
    ib_alloc_slot(&rpc.slot,&attrs);

Then the alloc just grabs everything required. ..mumble mumble.. some
way to flow into the QP/CQ allocation attributes too ..

Essentially, the ULP says 'here is what I want to do with this slot'
and the core code *guarentees* that if the slot is 'ready to go' then
'any single work of any requested type' can be queued without blocking
or memory allocation. Covers SQEs, CQEs, MRs, etc.

ib_request_action is a basic pattern that does various tests and ends
up doing:
  attrs->num_mrs = max(attrs->num_mrs, needed_for_this_action);
  attrs->num_mrs_sge = max(attrs->num_mrs_sge, needed_for_this_action);
  attrs->num_wr_sge = max(attrs->num_qp_sqe, needed_for_this_action);
  attrs->num_sqe = max(attrs->num_sqe, needed_for_this_action);
  attrs->num_cqe = max(attrs->num_cqe, needed_for_this_action);
[ie we compute the maximum allocation needed to satisfy the
 requested requirement]

Each request could fail, eg if signature is not supported then the
request_action will fail, so we have a more uniform way to talk about
send queue features.

... and the ULP could have a 'heavy' and 'light' slot pool if that
made some kind of sense for its work load.

So, that is a long road, but maybe there are reasonable interm stages?

Anyhow, conceptually, an idea. Eliminates the hated fmr pool concept,
cleans up bad thinking around queue flow control. Provides at least a
structure to abstract transport differences.

---------

It could look something like this:

 struct ib_op_slot
 {
    struct ib_mr **mr_list; // null terminated
    void *wr_memory;
    void *sg_memory;
    unsigned int num_sgs;
 };

 struct ib_send_wr *ib_slot_make_send(struct ib_op_slot *slot,
               const struct scatter_list *sgl)
 {
     dma_map(sgl);
     if (num_sges(sgl) < slot->num_sgs) {
        // send fits in the sg list
        struct ib_send_wr *wr = slot->wr_memory;
	wr->sg0list = slot->sg_memory;
	.. pack it in ..
	return wr;
     } else {
        // Need to spin up a MR..
	struct {
	   struct ib_send_wr frwr_wr;
	   struct ib_send_wr send_wr;
	} *wrs = slot->wr_memory;
	wrs->frwr_wr.next = &wrs->send_wr
	... pack it in ...
	return &wrs->frwr_wr;
    }
    // similar for FMR
 }

.. similar concept for rdma read, etc.
.. ib_request_action makes sure the wr_memory/sg_memory are pre-sized
   to accommodate the action. Add optional #ifdef'd debugging to check
   for bad ULP usage
.. function pointers could be used to provide special optimal versions
   if necessary
.. Complex things like signature just vanish from the API. ULP sees
   something like:

    if (ib_request_action("args describing a requirement for signature"))
       signature_supported = true;
    wr = ib_slot_make_rdma_write_signature(slot,....);

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Aug. 21, 2015, 6:34 a.m. UTC | #32

On Thu, Aug 20, 2015 at 01:04:13PM -0600, Jason Gunthorpe wrote:
> Trying to decouple the sub resources, ie by separately pooling the
> MR/SQE/etc, is just unnecessary complexity, IMHO.. NFS client already
> had serioues bugs in this area.
> 
> So, I turn to the idea that every ULP should work as the above, which
> means when it gets to working on a 'slot' that implies there is an
> actual struct ib_mr resource guaranteed available. This is why I
> suggested using the 'struct ib_mr' to guide the SG construction even if
> the actual HW MR isn't going to be used. The struct ib_mr is tied to
> the slot, so using it has no cost.

How is this going to work for drivers that might consumer multiple
MRs per request like SRP or similar upcoming block drivers?  Unless
you want to allocate a potentially large number of MRs for each
request that scheme doesn't work.

> This forces the ULP to deal with many of the issues. Having a slot
> means guarenteed minimum avaiable MR,SQE,CQE resources. That
> guarenteed minimum avoids the messy API struggle in my prior writings.
> 
> .. and maybe the above is even thinking too small, to Christoph's
> earlier musings, I wonder if a slot based middle API could hijack the
> entire SCQ processing and have a per-slot callback scheme
> instead. That kind of intervention is exactly what we'd need to
> trivially hide the FMR difference.

FYI, I have working early patches to do per-WR completion callback,
I'll post them after I get them into a slightly better shape.

As for your grand schemes:  I like some of the idea there, but we
need to get there gradually.  I'd much prefer to finish Sagi's simple
scheme, get my completion work in, add abstractions for RDMA READ and
WRITE scatterlist mapping and build things up slowly.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jason Gunthorpe Aug. 21, 2015, 6:08 p.m. UTC | #33

On Thu, Aug 20, 2015 at 11:34:58PM -0700, Christoph Hellwig wrote:

> How is this going to work for drivers that might consumer multiple
> MRs per request like SRP or similar upcoming block drivers?  Unless
> you want to allocate a potentially large number of MRs for each
> request that scheme doesn't work.

There are at least two approaches, and it depends on how flow control
to the driving layer works out. Look at what the ULP does when the
existing MR pool exhausts:
- Exhaustion is not allowed. In this model every slot must truely handle
  every required action without blocking. The ULP somehow wrangles
  things so pool exhaustion is not possible. NFS client is a
  good example.

  Where NFS client went wrong is that the MR alone is not enough,
  issuing a request requires SQE/CQE resources, failing to track that
  caused hard to find bugs.
- Exhaustion is allowed, and somehow the ULP is able to stop
  processing. In this case you'd just swap MRs for slots in the pool,
  probably having pools of different kinds of slots to optimize
  resource use.

  Pool draw down includes SQE/CQE/etc resources as well. A multiple
  rkey MR case would just draw down the required slots from the pool.

I suspect client side tends to lean toward the first option and target
side the second - targets can always do back pressure flow control by
simply halting RQE processing, and it makes alot of sense on a target
to globally pool slots across all client QPs.

This idea of a slot is just a higher level structure we can hang other
stuff off - like the sg/mr decision, the iwarp rdma read change, sqe
accounting.

We don't need to start with everything, but I'm looking at Sagi's
notes on trying to factor the lkey side code paths and thinking a
broader abstraction than raw MR is needed to solve that.

> FYI, I have working early patches to do per-WR completion callback,
> I'll post them after I get them into a slightly better shape.

Interesting..

> As for your grand schemes:  I like some of the idea there, but we
> need to get there gradually.  I'd much prefer to finish Sagi's simple
> scheme, get my completion work in, add abstractions for RDMA READ and
> WRITE scatterlist mapping and build things up slowly.

Yes, absolutely, we have to go slowly - but exploring how we can fit
this together in some other way can help guide some of the smaller
choices.

Sagi could drop the lkey side, getting the rkey side in order would be
nice enough. Something like this is a direction to address the
lkey side.

Ie we could 1:1 replace MR with 'slot' and use that to factor the lkey
code paths. Over time slot can grow organically to factor more code.

Slot would be a new object for the core, one that is guarenteed to
last from post->completion, that seems like exactly the sort of object
a completion callback scheme would benefit from. Guarenteed memory to
hang callback pointers/etc off.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[WIP,28/43] IB/core: Introduce new fast registration API

Commit Message

Comments

Patch