diff mbox series

[for-next,v2,2/2] RDMA/erdma: Support non-4K page size in doorbell allocation

Message ID 20230307102924.70577-3-chengyou@linux.alibaba.com (mailing list archive)
State Changes Requested
Delegated to: Jason Gunthorpe
Headers show
Series RDMA/erdma: Add non-4K page size support | expand

Commit Message

Cheng Xu March 7, 2023, 10:29 a.m. UTC
Doorbell resources are exposed to userspace by mmap. The size unit of mmap
is PAGE_SIZE, previous implementation can not work correctly if PAGE_SIZE
is not 4K. We support non-4K page size in this commit.

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_verbs.c | 40 ++++++++++-------------
 drivers/infiniband/hw/erdma/erdma_verbs.h |  5 ++-
 include/uapi/rdma/erdma-abi.h             |  5 ++-
 3 files changed, 23 insertions(+), 27 deletions(-)

Comments

Leon Romanovsky March 14, 2023, 10:23 a.m. UTC | #1
On Tue, Mar 07, 2023 at 06:29:24PM +0800, Cheng Xu wrote:
> Doorbell resources are exposed to userspace by mmap. The size unit of mmap
> is PAGE_SIZE, previous implementation can not work correctly if PAGE_SIZE
> is not 4K. We support non-4K page size in this commit.

Why do you need this information in rdma-core?
Can you use sysconf(_SC_PAGESIZE) there to understand the page size like
other providers?

Thanks
Cheng Xu March 14, 2023, 11:34 a.m. UTC | #2
On 3/14/23 7:31 PM, Cheng Xu wrote:
> 
> 
> On 3/14/23 6:23 PM, Leon Romanovsky wrote:
>> On Tue, Mar 07, 2023 at 06:29:24PM +0800, Cheng Xu wrote:
>>> Doorbell resources are exposed to userspace by mmap. The size unit of mmap
>>> is PAGE_SIZE, previous implementation can not work correctly if PAGE_SIZE
>>> is not 4K. We support non-4K page size in this commit.
>>
>> Why do you need this information in rdma-core?
>> Can you use sysconf(_SC_PAGESIZE) there to understand the page size like
>> other providers?
>>
> 
> I don't expose PAGE_SIZE to userspace in this patchset, but the *offset* in
> PAGE of the DBs:
> 
> diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
> index b7a0222f978f..57f8942a3c56 100644
> --- a/include/uapi/rdma/erdma-abi.h
> +++ b/include/uapi/rdma/erdma-abi.h
> @@ -40,10 +40,13 @@ struct erdma_uresp_alloc_ctx {
>  	__u32 dev_id;
>  	__u32 pad;
>  	__u32 sdb_type;
> -	__u32 sdb_offset;
> +	__u32 sdb_entid;
>  	__aligned_u64 sdb;
>  	__aligned_u64 rdb;
>  	__aligned_u64 cdb;
> +	__u32 sdb_off;
> +	__u32 rdb_off;
> +	__u32 cdb_off;
>  };
> 
> Our doorbell space is aligned to 4096
> 

Please ignore these two response. There is something wrong with my Thunderbird:
When I press ctrl + s, it send the email, not save it.

Sorry for this.

Cheng Xu

> 
> 
>> Thanks
Cheng Xu March 14, 2023, 11:50 a.m. UTC | #3
On 3/14/23 6:23 PM, Leon Romanovsky wrote:
> On Tue, Mar 07, 2023 at 06:29:24PM +0800, Cheng Xu wrote:
>> Doorbell resources are exposed to userspace by mmap. The size unit of mmap
>> is PAGE_SIZE, previous implementation can not work correctly if PAGE_SIZE
>> is not 4K. We support non-4K page size in this commit.
> 
> Why do you need this information in rdma-core?
> Can you use sysconf(_SC_PAGESIZE) there to understand the page size like
> other providers?
> 

I don't expose PAGE_SIZE to userspace in this patchset, but the *offset* in
PAGE of the DBs:

diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
index b7a0222f978f..57f8942a3c56 100644
--- a/include/uapi/rdma/erdma-abi.h
+++ b/include/uapi/rdma/erdma-abi.h
@@ -40,10 +40,13 @@ struct erdma_uresp_alloc_ctx {
 	__u32 dev_id;
 	__u32 pad;
 	__u32 sdb_type;
-	__u32 sdb_offset;
+	__u32 sdb_entid;
 	__aligned_u64 sdb;
 	__aligned_u64 rdb;
 	__aligned_u64 cdb;
+	__u32 sdb_off;
+	__u32 rdb_off;
+	__u32 cdb_off;
 };

Our doorbell space is aligned to 4096, this works fine when PAGE_SIZE is
also 4096, and the doorbell space starts from the mapped page. When
PAGE_SIZE is not 4096, the doorbell space may starts from the middle of
the mapped page.

For example, our SQ doorbell starts from the offset 4096 in PCIe bar 0.
When we map the first SQ doorbell to userspace when PAGE_SIZE is 64K,
the doorbell space starts from the offset 4096 in mmap returned address.

So the userspace needs to know the doorbell space offset in mmaped page.

Thanks,
Cheng Xu



> Thanks
Leon Romanovsky March 14, 2023, 2:10 p.m. UTC | #4
On Tue, Mar 14, 2023 at 07:50:19PM +0800, Cheng Xu wrote:
> 
> 
> On 3/14/23 6:23 PM, Leon Romanovsky wrote:
> > On Tue, Mar 07, 2023 at 06:29:24PM +0800, Cheng Xu wrote:
> >> Doorbell resources are exposed to userspace by mmap. The size unit of mmap
> >> is PAGE_SIZE, previous implementation can not work correctly if PAGE_SIZE
> >> is not 4K. We support non-4K page size in this commit.
> > 
> > Why do you need this information in rdma-core?
> > Can you use sysconf(_SC_PAGESIZE) there to understand the page size like
> > other providers?
> > 
> 
> I don't expose PAGE_SIZE to userspace in this patchset, but the *offset* in
> PAGE of the DBs:
> 
> diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
> index b7a0222f978f..57f8942a3c56 100644
> --- a/include/uapi/rdma/erdma-abi.h
> +++ b/include/uapi/rdma/erdma-abi.h
> @@ -40,10 +40,13 @@ struct erdma_uresp_alloc_ctx {
>  	__u32 dev_id;
>  	__u32 pad;
>  	__u32 sdb_type;
> -	__u32 sdb_offset;
> +	__u32 sdb_entid;
>  	__aligned_u64 sdb;
>  	__aligned_u64 rdb;
>  	__aligned_u64 cdb;
> +	__u32 sdb_off;
> +	__u32 rdb_off;
> +	__u32 cdb_off;
>  };
> 
> Our doorbell space is aligned to 4096, this works fine when PAGE_SIZE is
> also 4096, and the doorbell space starts from the mapped page. When
> PAGE_SIZE is not 4096, the doorbell space may starts from the middle of
> the mapped page.
> 
> For example, our SQ doorbell starts from the offset 4096 in PCIe bar 0.
> When we map the first SQ doorbell to userspace when PAGE_SIZE is 64K,
> the doorbell space starts from the offset 4096 in mmap returned address.
> 
> So the userspace needs to know the doorbell space offset in mmaped page.

And can't you preserve same alignment in the kernel for doorbells for every page size?
Just always start from 0.

> 
> Thanks,
> Cheng Xu
> 
> 
> 
> > Thanks
Cheng Xu March 15, 2023, 1:58 a.m. UTC | #5
On 3/14/23 10:10 PM, Leon Romanovsky wrote:
> On Tue, Mar 14, 2023 at 07:50:19PM +0800, Cheng Xu wrote:
>>
>>
<...>
>>
>> Our doorbell space is aligned to 4096, this works fine when PAGE_SIZE is
>> also 4096, and the doorbell space starts from the mapped page. When
>> PAGE_SIZE is not 4096, the doorbell space may starts from the middle of
>> the mapped page.
>>
>> For example, our SQ doorbell starts from the offset 4096 in PCIe bar 0.
>> When we map the first SQ doorbell to userspace when PAGE_SIZE is 64K,
>> the doorbell space starts from the offset 4096 in mmap returned address.
>>
>> So the userspace needs to know the doorbell space offset in mmaped page.
> 
> And can't you preserve same alignment in the kernel for doorbells for every page size?
> Just always start from 0.
> 

I've considered this option before, but unfortunately can't, at least for CQ DB.
The size of our PCIe bar 0 is 512K, and offset [484K, 508K] are CQ doorbells.
CQ doorbell space is located in offset [36K, 60K] when PAGE_SIZE = 64K, and can't
start from offset 0 in this case.

Another reason is that we want to organize SQ doorbell space in unit of 4096.
In current implementation, each ucontext will be assigned a SQ doorbell space
for both normal doorbell and direct wqe usage. Unit of 4096, compared with
larger unit, more ucontexts can be assigned exclusive doorbell space for direct
wqe.

Thanks,
Cheng Xu
Leon Romanovsky March 15, 2023, 10:22 a.m. UTC | #6
On Wed, Mar 15, 2023 at 09:58:06AM +0800, Cheng Xu wrote:
> 
> 
> On 3/14/23 10:10 PM, Leon Romanovsky wrote:
> > On Tue, Mar 14, 2023 at 07:50:19PM +0800, Cheng Xu wrote:
> >>
> >>
> <...>
> >>
> >> Our doorbell space is aligned to 4096, this works fine when PAGE_SIZE is
> >> also 4096, and the doorbell space starts from the mapped page. When
> >> PAGE_SIZE is not 4096, the doorbell space may starts from the middle of
> >> the mapped page.
> >>
> >> For example, our SQ doorbell starts from the offset 4096 in PCIe bar 0.
> >> When we map the first SQ doorbell to userspace when PAGE_SIZE is 64K,
> >> the doorbell space starts from the offset 4096 in mmap returned address.
> >>
> >> So the userspace needs to know the doorbell space offset in mmaped page.
> > 
> > And can't you preserve same alignment in the kernel for doorbells for every page size?
> > Just always start from 0.
> > 
> 
> I've considered this option before, but unfortunately can't, at least for CQ DB.
> The size of our PCIe bar 0 is 512K, and offset [484K, 508K] are CQ doorbells.
> CQ doorbell space is located in offset [36K, 60K] when PAGE_SIZE = 64K, and can't
> start from offset 0 in this case.
> 
> Another reason is that we want to organize SQ doorbell space in unit of 4096.
> In current implementation, each ucontext will be assigned a SQ doorbell space
> for both normal doorbell and direct wqe usage. Unit of 4096, compared with
> larger unit, more ucontexts can be assigned exclusive doorbell space for direct
> wqe.

I have a feeling that there is an existing API for it already.
Let's give a chance for Jason to chime in.

Thanks

> 
> Thanks,
> Cheng Xu
Jason Gunthorpe March 21, 2023, 2:30 p.m. UTC | #7
On Wed, Mar 15, 2023 at 12:22:10PM +0200, Leon Romanovsky wrote:
> On Wed, Mar 15, 2023 at 09:58:06AM +0800, Cheng Xu wrote:
> > 
> > 
> > On 3/14/23 10:10 PM, Leon Romanovsky wrote:
> > > On Tue, Mar 14, 2023 at 07:50:19PM +0800, Cheng Xu wrote:
> > >>
> > >>
> > <...>
> > >>
> > >> Our doorbell space is aligned to 4096, this works fine when PAGE_SIZE is
> > >> also 4096, and the doorbell space starts from the mapped page. When
> > >> PAGE_SIZE is not 4096, the doorbell space may starts from the middle of
> > >> the mapped page.
> > >>
> > >> For example, our SQ doorbell starts from the offset 4096 in PCIe bar 0.
> > >> When we map the first SQ doorbell to userspace when PAGE_SIZE is 64K,
> > >> the doorbell space starts from the offset 4096 in mmap returned address.
> > >>
> > >> So the userspace needs to know the doorbell space offset in mmaped page.
> > > 
> > > And can't you preserve same alignment in the kernel for doorbells for every page size?
> > > Just always start from 0.
> > > 
> > 
> > I've considered this option before, but unfortunately can't, at least for CQ DB.
> > The size of our PCIe bar 0 is 512K, and offset [484K, 508K] are CQ doorbells.
> > CQ doorbell space is located in offset [36K, 60K] when PAGE_SIZE = 64K, and can't
> > start from offset 0 in this case.
> > 
> > Another reason is that we want to organize SQ doorbell space in unit of 4096.
> > In current implementation, each ucontext will be assigned a SQ doorbell space
> > for both normal doorbell and direct wqe usage. Unit of 4096, compared with
> > larger unit, more ucontexts can be assigned exclusive doorbell space for direct
> > wqe.
> 
> I have a feeling that there is an existing API for it already.
> Let's give a chance for Jason to chime in.

This sounds similar to how mlx5 chops up its doorbell space

But I don't understand your device security model.

In mlx5 it is not allowed to share doorbells between unrelated
processes. Doorbells have built in security and a doorbell can only
trigger QP/CQ's that are explicitly linked to that doorbell.

Thus mlx5 is careful to only mmap doorbells that are linked to the
QPs. On 64K page size userspace receives alot of doorbells per mmap,
all linked to the same security context.

Improperly sharing MMIO pages can result in various security problems
if a hostile userspace can write arbitary things to MMIO space.

Here you seem to be talking about overmapping stuff. What is the
security argument that it is safe to leak to userspace parts of the
device MMIO BAR beyond that strictly cotnained to the single doorbell?

This has to be clearly explained in the commit message and a comment.

Jason
Cheng Xu March 22, 2023, 7:05 a.m. UTC | #8
On 3/21/23 10:30 PM, Jason Gunthorpe wrote:
> On Wed, Mar 15, 2023 at 12:22:10PM +0200, Leon Romanovsky wrote:
<...>
> 
> This sounds similar to how mlx5 chops up its doorbell space
> 
> But I don't understand your device security model.
> 
> In mlx5 it is not allowed to share doorbells between unrelated
> processes. Doorbells have built in security and a doorbell can only
> trigger QP/CQ's that are explicitly linked to that doorbell.
> 
> Thus mlx5 is careful to only mmap doorbells that are linked to the
> QPs. On 64K page size userspace receives alot of doorbells per mmap,
> all linked to the same security context.
> 
> Improperly sharing MMIO pages can result in various security problems
> if a hostile userspace can write arbitary things to MMIO space.
> 
> Here you seem to be talking about overmapping stuff. What is the
> security argument that it is safe to leak to userspace parts of the
> device MMIO BAR beyond that strictly cotnained to the single doorbell?

Thank you for your explanation. IIUC, this security mechanism of mlx5
needs the support of HW, and the HW can reject doorbells from unauthorized
doorbell space.

The current generation of erdma devices do not have this capability due to
implementation complexity. Without this HW capability, isolating the MMIO
space in software doesn't prevent the attack, because the malicious APPs
can map mmio itself, not through verbs interface.

Our consideration of security is mainly focused on the VM/VF level,
not at the process/ucontext level: no matter what the user does inside the
VM, it cannot affect other VMs. Therefore, The userspace isolation of mmio
inside one VM is incomplete and shared mmio pages cannot be avoided in
this generation.

> This has to be clearly explained in the commit message and a comment.

All right, I will do this in v3.

Thanks,
Cheng Xu
Jason Gunthorpe March 22, 2023, 11:54 a.m. UTC | #9
On Wed, Mar 22, 2023 at 03:05:29PM +0800, Cheng Xu wrote:

> The current generation of erdma devices do not have this capability due to
> implementation complexity. Without this HW capability, isolating the MMIO
> space in software doesn't prevent the attack, because the malicious APPs
> can map mmio itself, not through verbs interface.

This doesn't meet the security model of Linux, verbs HW is expected to
protect one process from another process.

if this is the case we should consider restricting this HW to
CAP_SYS_RAW_IO only.

You should come with an explanation why this HW is safe enough to
avoid this.

Jason
Cheng Xu March 22, 2023, 1:30 p.m. UTC | #10
On 3/22/23 7:54 PM, Jason Gunthorpe wrote:
> On Wed, Mar 22, 2023 at 03:05:29PM +0800, Cheng Xu wrote:
> 
>> The current generation of erdma devices do not have this capability due to
>> implementation complexity. Without this HW capability, isolating the MMIO
>> space in software doesn't prevent the attack, because the malicious APPs
>> can map mmio itself, not through verbs interface.
> 
> This doesn't meet the security model of Linux, verbs HW is expected to
> protect one process from another process.

OK, I see.

So the key point is that HW should restrict each process to use its own doorbell
space. If hardware can do this, share or do not share MMIO pages both will meet
the security requirement. Do I get it right? 

It seems that EFA uses shared MMIO pages with hardware security assurance.

> if this is the case we should consider restricting this HW to
> CAP_SYS_RAW_IO only.
> 

Please give us a chance to fix this issue first.

> You should come with an explanation why this HW is safe enough to
> avoid this.

I need to discuss with our HW guys and implement the similar security check
in HW, and this won't be long.

Thanks,
Cheng Xu
Jason Gunthorpe March 22, 2023, 2:01 p.m. UTC | #11
On Wed, Mar 22, 2023 at 09:30:41PM +0800, Cheng Xu wrote:
> 
> 
> On 3/22/23 7:54 PM, Jason Gunthorpe wrote:
> > On Wed, Mar 22, 2023 at 03:05:29PM +0800, Cheng Xu wrote:
> > 
> >> The current generation of erdma devices do not have this capability due to
> >> implementation complexity. Without this HW capability, isolating the MMIO
> >> space in software doesn't prevent the attack, because the malicious APPs
> >> can map mmio itself, not through verbs interface.
> > 
> > This doesn't meet the security model of Linux, verbs HW is expected to
> > protect one process from another process.
> 
> OK, I see.
> 
> So the key point is that HW should restrict each process to use its own doorbell
> space. If hardware can do this, share or do not share MMIO pages both will meet
> the security requirement. Do I get it right? 

HW can never do that, HW is supposed to rely on the system MMU to
isolate doorbell registers

The HW responsibility is to make doorbell MMIO registers safe in the
hands of other processes.

Simple doorbells that only 'kick' and don't convey any information are
probably safe to share, and don't require HW checks between the
doorbell page and the PD/QP/CQ/etc

Doorbells that deliver data - eg a head pointer - are not safe because
the wrong head pointer can corrupt the HW state. Process B must not be
able to corrupt the head pointer of a QP/CQ owned by Process A under
any circumstances. Definitely they cannot have access to the MMIO and
also the HW must ensure that writes coming from process B are rejected
if they touch resources owned by process a (eg by PD/QPN/CQN checks in
HW)

Doorbells that accept entire WQE's are definately not safe as a
hostile process could execute a WQE on a QP it does not own.

> It seems that EFA uses shared MMIO pages with hardware security assurance.

I'm not sure how EFA works, it writes this:

        EFA_SET(&db, EFA_IO_REGS_CQ_DB_CONSUMER_INDEX, cq->cc);
        EFA_SET(&db, EFA_IO_REGS_CQ_DB_CMD_SN, cq->cmd_sn & 0x3);
        EFA_SET(&db, EFA_IO_REGS_CQ_DB_ARM, arm);

But interestingly there is no CQN value, so I have no idea how it
associates the doorbell register with the CQ to load the data into.

If it is using a DB register offset to learn the CQN then obviously it
cannot share the same doorbell page (or associated CQNs) across verbs
FD contexts, that would be a security bug. EFA folks?

Jason
Gal Pressman March 22, 2023, 3:09 p.m. UTC | #12
On 22/03/2023 16:01, Jason Gunthorpe wrote:
>> It seems that EFA uses shared MMIO pages with hardware security assurance.
> 
> I'm not sure how EFA works, it writes this:
> 
>         EFA_SET(&db, EFA_IO_REGS_CQ_DB_CONSUMER_INDEX, cq->cc);
>         EFA_SET(&db, EFA_IO_REGS_CQ_DB_CMD_SN, cq->cmd_sn & 0x3);
>         EFA_SET(&db, EFA_IO_REGS_CQ_DB_ARM, arm);
> 
> But interestingly there is no CQN value, so I have no idea how it
> associates the doorbell register with the CQ to load the data into.
> 
> If it is using a DB register offset to learn the CQN then obviously it
> cannot share the same doorbell page (or associated CQNs) across verbs
> FD contexts, that would be a security bug. EFA folks?

The doorbell's BAR address is dictated by the device, and the CQN is
derived from the address. Each doorbell resides in a separate page,
there is no sharing of doorbells pages between different UARs.
Cheng Xu March 23, 2023, 6:57 a.m. UTC | #13
On 3/22/23 10:01 PM, Jason Gunthorpe wrote:
> On Wed, Mar 22, 2023 at 09:30:41PM +0800, Cheng Xu wrote:
>>
>>
>> On 3/22/23 7:54 PM, Jason Gunthorpe wrote:
>>> On Wed, Mar 22, 2023 at 03:05:29PM +0800, Cheng Xu wrote:
>>>
>>>> The current generation of erdma devices do not have this capability due to
>>>> implementation complexity. Without this HW capability, isolating the MMIO
>>>> space in software doesn't prevent the attack, because the malicious APPs
>>>> can map mmio itself, not through verbs interface.
>>>
>>> This doesn't meet the security model of Linux, verbs HW is expected to
>>> protect one process from another process.
>>
>> OK, I see.
>>
>> So the key point is that HW should restrict each process to use its own doorbell
>> space. If hardware can do this, share or do not share MMIO pages both will meet
>> the security requirement. Do I get it right? 
> 
> HW can never do that, HW is supposed to rely on the system MMU to
> isolate doorbell registers
> 
> The HW responsibility is to make doorbell MMIO registers safe in the
> hands of other processes.
> 
> Simple doorbells that only 'kick' and don't convey any information are
> probably safe to share, and don't require HW checks between the
> doorbell page and the PD/QP/CQ/etc
> 
> Doorbells that deliver data - eg a head pointer - are not safe because
> the wrong head pointer can corrupt the HW state. Process B must not be
> able to corrupt the head pointer of a QP/CQ owned by Process A under
> any circumstances. Definitely they cannot have access to the MMIO and
> also the HW must ensure that writes coming from process B are rejected
> if they touch resources owned by process a (eg by PD/QPN/CQN checks in
> HW)
> 
> Doorbells that accept entire WQE's are definately not safe as a
> hostile process could execute a WQE on a QP it does not own.
> 

It's much clear, thanks for your explanation and patience.

Back to erdma context, we have rethought our implementation. For QPs,
we have a field *wqe_index* in SQE/RQE, which indicates the validity
of the current WQE. Incorrect doorbell value from other processes can
not corrupt the QPC in hardware due to PI range and WQE content
validation in HW.

Unlike SQ/RQ, for CQ doorbell, It seems that we need some more works
to protect it. We have started analyzing the details.

Thanks,
Cheng Xu
Jason Gunthorpe March 23, 2023, 11:53 a.m. UTC | #14
On Thu, Mar 23, 2023 at 02:57:49PM +0800, Cheng Xu wrote:
> 
> 
> On 3/22/23 10:01 PM, Jason Gunthorpe wrote:
> > On Wed, Mar 22, 2023 at 09:30:41PM +0800, Cheng Xu wrote:
> >>
> >>
> >> On 3/22/23 7:54 PM, Jason Gunthorpe wrote:
> >>> On Wed, Mar 22, 2023 at 03:05:29PM +0800, Cheng Xu wrote:
> >>>
> >>>> The current generation of erdma devices do not have this capability due to
> >>>> implementation complexity. Without this HW capability, isolating the MMIO
> >>>> space in software doesn't prevent the attack, because the malicious APPs
> >>>> can map mmio itself, not through verbs interface.
> >>>
> >>> This doesn't meet the security model of Linux, verbs HW is expected to
> >>> protect one process from another process.
> >>
> >> OK, I see.
> >>
> >> So the key point is that HW should restrict each process to use its own doorbell
> >> space. If hardware can do this, share or do not share MMIO pages both will meet
> >> the security requirement. Do I get it right? 
> > 
> > HW can never do that, HW is supposed to rely on the system MMU to
> > isolate doorbell registers
> > 
> > The HW responsibility is to make doorbell MMIO registers safe in the
> > hands of other processes.
> > 
> > Simple doorbells that only 'kick' and don't convey any information are
> > probably safe to share, and don't require HW checks between the
> > doorbell page and the PD/QP/CQ/etc
> > 
> > Doorbells that deliver data - eg a head pointer - are not safe because
> > the wrong head pointer can corrupt the HW state. Process B must not be
> > able to corrupt the head pointer of a QP/CQ owned by Process A under
> > any circumstances. Definitely they cannot have access to the MMIO and
> > also the HW must ensure that writes coming from process B are rejected
> > if they touch resources owned by process a (eg by PD/QPN/CQN checks in
> > HW)
> > 
> > Doorbells that accept entire WQE's are definately not safe as a
> > hostile process could execute a WQE on a QP it does not own.
> > 
> 
> It's much clear, thanks for your explanation and patience.
> 
> Back to erdma context, we have rethought our implementation. For QPs,
> we have a field *wqe_index* in SQE/RQE, which indicates the validity
> of the current WQE. Incorrect doorbell value from other processes can
> not corrupt the QPC in hardware due to PI range and WQE content
> validation in HW.

No, validating the DB content is not acceptable security. The attacker
process can always generate valid content if it tries hard enough.

The only acceptable answer is to do like every other NIC did and link
the DB register to the HW object it is allowed to affect.

Jason
Cheng Xu March 23, 2023, 12:33 p.m. UTC | #15
On 3/23/23 7:53 PM, Jason Gunthorpe wrote:
> On Thu, Mar 23, 2023 at 02:57:49PM +0800, Cheng Xu wrote:
>>
>>
>> On 3/22/23 10:01 PM, Jason Gunthorpe wrote:
>>> On Wed, Mar 22, 2023 at 09:30:41PM +0800, Cheng Xu wrote:
>>>>
>>>>
>>>> On 3/22/23 7:54 PM, Jason Gunthorpe wrote:
<...>
>>
>> It's much clear, thanks for your explanation and patience.
>>
>> Back to erdma context, we have rethought our implementation. For QPs,
>> we have a field *wqe_index* in SQE/RQE, which indicates the validity
>> of the current WQE. Incorrect doorbell value from other processes can
>> not corrupt the QPC in hardware due to PI range and WQE content
>> validation in HW.
> 
> No, validating the DB content is not acceptable security. The attacker
> process can always generate valid content if it tries hard enough.
>

Oh, you may misunderstand what I said, our HW validates the *WQE* content,
not *DB* content. The attacker can not generate the WQE of other QPs. This
protection and correction is already implemented in our HW.

> The only acceptable answer is to do like every other NIC did and link
> the DB register to the HW object it is allowed to affect.
> 

Emm, still not acceptable with WQE content validation? If it's acceptable,
will reduce some works.

Thanks,
Cheng Xu
Jason Gunthorpe March 23, 2023, 1:05 p.m. UTC | #16
On Thu, Mar 23, 2023 at 08:33:53PM +0800, Cheng Xu wrote:
> 
> 
> On 3/23/23 7:53 PM, Jason Gunthorpe wrote:
> > On Thu, Mar 23, 2023 at 02:57:49PM +0800, Cheng Xu wrote:
> >>
> >>
> >> On 3/22/23 10:01 PM, Jason Gunthorpe wrote:
> >>> On Wed, Mar 22, 2023 at 09:30:41PM +0800, Cheng Xu wrote:
> >>>>
> >>>>
> >>>> On 3/22/23 7:54 PM, Jason Gunthorpe wrote:
> <...>
> >>
> >> It's much clear, thanks for your explanation and patience.
> >>
> >> Back to erdma context, we have rethought our implementation. For QPs,
> >> we have a field *wqe_index* in SQE/RQE, which indicates the validity
> >> of the current WQE. Incorrect doorbell value from other processes can
> >> not corrupt the QPC in hardware due to PI range and WQE content
> >> validation in HW.
> > 
> > No, validating the DB content is not acceptable security. The attacker
> > process can always generate valid content if it tries hard enough.
> >
> 
> Oh, you may misunderstand what I said, our HW validates the *WQE* content,
> not *DB* content. The attacker can not generate the WQE of other QPs. This
> protection and correction is already implemented in our HW.

Why are you talking about WQEs in a discussion about doorbell
security?

WQE's are read via DMA from their SQ/RQs - that path doesn't have a
doorbell security problem.

The issue comes if you try to deliver the WQE via a doorbell write.

Jason
Cheng Xu March 23, 2023, 2:10 p.m. UTC | #17
On 3/23/23 9:05 PM, Jason Gunthorpe wrote:
> On Thu, Mar 23, 2023 at 08:33:53PM +0800, Cheng Xu wrote:
>>
>>
>> On 3/23/23 7:53 PM, Jason Gunthorpe wrote:
>>> On Thu, Mar 23, 2023 at 02:57:49PM +0800, Cheng Xu wrote:
>>>>
>>>>
>>>> On 3/22/23 10:01 PM, Jason Gunthorpe wrote:
>>>>> On Wed, Mar 22, 2023 at 09:30:41PM +0800, Cheng Xu wrote:
>>>>>>
>>>>>>
>>>>>> On 3/22/23 7:54 PM, Jason Gunthorpe wrote:
>> <...>
>>>>
>>>> It's much clear, thanks for your explanation and patience.
>>>>
>>>> Back to erdma context, we have rethought our implementation. For QPs,
>>>> we have a field *wqe_index* in SQE/RQE, which indicates the validity
>>>> of the current WQE. Incorrect doorbell value from other processes can
>>>> not corrupt the QPC in hardware due to PI range and WQE content
>>>> validation in HW.
>>>
>>> No, validating the DB content is not acceptable security. The attacker
>>> process can always generate valid content if it tries hard enough.
>>>
>>
>> Oh, you may misunderstand what I said, our HW validates the *WQE* content,
>> not *DB* content. The attacker can not generate the WQE of other QPs. This
>> protection and correction is already implemented in our HW.
> 
> Why are you talking about WQEs in a discussion about doorbell
> security?

Malicious doorbell will corrupt the head pointer in QPC, and then invalid WQEs
will be processed. My point is that WQE validation can correct the head pointer
carried in malicious doorbell, and can prevent such attack.

It looks like the first kind of the doorbell types you mentioned before, but only
conveying the QPN.

Thanks,
Cheng Xu
Jason Gunthorpe March 23, 2023, 2:18 p.m. UTC | #18
On Thu, Mar 23, 2023 at 10:10:25PM +0800, Cheng Xu wrote:
> 
> 
> On 3/23/23 9:05 PM, Jason Gunthorpe wrote:
> > On Thu, Mar 23, 2023 at 08:33:53PM +0800, Cheng Xu wrote:
> >>
> >>
> >> On 3/23/23 7:53 PM, Jason Gunthorpe wrote:
> >>> On Thu, Mar 23, 2023 at 02:57:49PM +0800, Cheng Xu wrote:
> >>>>
> >>>>
> >>>> On 3/22/23 10:01 PM, Jason Gunthorpe wrote:
> >>>>> On Wed, Mar 22, 2023 at 09:30:41PM +0800, Cheng Xu wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 3/22/23 7:54 PM, Jason Gunthorpe wrote:
> >> <...>
> >>>>
> >>>> It's much clear, thanks for your explanation and patience.
> >>>>
> >>>> Back to erdma context, we have rethought our implementation. For QPs,
> >>>> we have a field *wqe_index* in SQE/RQE, which indicates the validity
> >>>> of the current WQE. Incorrect doorbell value from other processes can
> >>>> not corrupt the QPC in hardware due to PI range and WQE content
> >>>> validation in HW.
> >>>
> >>> No, validating the DB content is not acceptable security. The attacker
> >>> process can always generate valid content if it tries hard enough.
> >>>
> >>
> >> Oh, you may misunderstand what I said, our HW validates the *WQE* content,
> >> not *DB* content. The attacker can not generate the WQE of other QPs. This
> >> protection and correction is already implemented in our HW.
> > 
> > Why are you talking about WQEs in a discussion about doorbell
> > security?
> 
> Malicious doorbell will corrupt the head pointer in QPC, and then invalid WQEs
> will be processed. My point is that WQE validation can correct the head pointer
> carried in malicious doorbell, and can prevent such attack.

No, if the head pointer is incorrect an attack can stall the QP by
guessing a head pointer that is valid but before already submitted
WQEs.

There is no WQE based recovery for such a thing.

Jason
Cheng Xu March 26, 2023, 12:10 a.m. UTC | #19
On 3/23/23 10:18 PM, Jason Gunthorpe wrote:
> On Thu, Mar 23, 2023 at 10:10:25PM +0800, Cheng Xu wrote:
>>
>>
<...>
>>
>> Malicious doorbell will corrupt the head pointer in QPC, and then invalid WQEs
>> will be processed. My point is that WQE validation can correct the head pointer
>> carried in malicious doorbell, and can prevent such attack.
> 
> No, if the head pointer is incorrect an attack can stall the QP by
> guessing a head pointer that is valid but before already submitted
> WQEs.

You are right.

Thanks very much for this discussion and your advisement.

Cheng Xu

> 
> There is no WQE based recovery for such a thing.
> 
> Jason
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index 83e1b0d55977..0cd8dd61f569 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -1137,8 +1137,8 @@  void erdma_mmap_free(struct rdma_user_mmap_entry *rdma_entry)
 static void alloc_db_resources(struct erdma_dev *dev,
 			       struct erdma_ucontext *ctx)
 {
-	u32 bitmap_idx;
 	struct erdma_devattr *attrs = &dev->attrs;
+	u32 bitmap_idx, hw_page_idx;
 
 	if (attrs->disable_dwqe)
 		goto alloc_normal_db;
@@ -1151,11 +1151,9 @@  static void alloc_db_resources(struct erdma_dev *dev,
 		spin_unlock(&dev->db_bitmap_lock);
 
 		ctx->sdb_type = ERDMA_SDB_PAGE;
-		ctx->sdb_idx = bitmap_idx;
-		ctx->sdb_page_idx = bitmap_idx;
+		ctx->sdb_bitmap_idx = bitmap_idx;
 		ctx->sdb = dev->func_bar_addr + ERDMA_BAR_SQDB_SPACE_OFFSET +
-			   (bitmap_idx << PAGE_SHIFT);
-		ctx->sdb_page_off = 0;
+			   (bitmap_idx << ERDMA_HW_PAGE_SHIFT);
 
 		return;
 	}
@@ -1166,13 +1164,13 @@  static void alloc_db_resources(struct erdma_dev *dev,
 		spin_unlock(&dev->db_bitmap_lock);
 
 		ctx->sdb_type = ERDMA_SDB_ENTRY;
-		ctx->sdb_idx = bitmap_idx;
-		ctx->sdb_page_idx = attrs->dwqe_pages +
+		ctx->sdb_bitmap_idx = bitmap_idx;
+		hw_page_idx = attrs->dwqe_pages +
 				    bitmap_idx / ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
-		ctx->sdb_page_off = bitmap_idx % ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
 
+		ctx->sdb_entid = bitmap_idx % ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
 		ctx->sdb = dev->func_bar_addr + ERDMA_BAR_SQDB_SPACE_OFFSET +
-			   (ctx->sdb_page_idx << PAGE_SHIFT);
+			   (hw_page_idx << PAGE_SHIFT);
 
 		return;
 	}
@@ -1181,11 +1179,8 @@  static void alloc_db_resources(struct erdma_dev *dev,
 
 alloc_normal_db:
 	ctx->sdb_type = ERDMA_SDB_SHARED;
-	ctx->sdb_idx = 0;
-	ctx->sdb_page_idx = ERDMA_SDB_SHARED_PAGE_INDEX;
-	ctx->sdb_page_off = 0;
-
-	ctx->sdb = dev->func_bar_addr + (ctx->sdb_page_idx << PAGE_SHIFT);
+	ctx->sdb = dev->func_bar_addr +
+		   (ERDMA_SDB_SHARED_PAGE_INDEX << ERDMA_HW_PAGE_SHIFT);
 }
 
 static void erdma_uctx_user_mmap_entries_remove(struct erdma_ucontext *uctx)
@@ -1215,11 +1210,6 @@  int erdma_alloc_ucontext(struct ib_ucontext *ibctx, struct ib_udata *udata)
 	ctx->rdb = dev->func_bar_addr + ERDMA_BAR_RQDB_SPACE_OFFSET;
 	ctx->cdb = dev->func_bar_addr + ERDMA_BAR_CQDB_SPACE_OFFSET;
 
-	if (udata->outlen < sizeof(uresp)) {
-		ret = -EINVAL;
-		goto err_out;
-	}
-
 	ctx->sq_db_mmap_entry = erdma_user_mmap_entry_insert(
 		ctx, (void *)ctx->sdb, PAGE_SIZE, ERDMA_MMAP_IO_NC, &uresp.sdb);
 	if (!ctx->sq_db_mmap_entry) {
@@ -1243,9 +1233,13 @@  int erdma_alloc_ucontext(struct ib_ucontext *ibctx, struct ib_udata *udata)
 
 	uresp.dev_id = dev->pdev->device;
 	uresp.sdb_type = ctx->sdb_type;
-	uresp.sdb_offset = ctx->sdb_page_off;
+	uresp.sdb_entid = ctx->sdb_entid;
+	uresp.sdb_off = ctx->sdb & ~PAGE_MASK;
+	uresp.rdb_off = ctx->rdb & ~PAGE_MASK;
+	uresp.cdb_off = ctx->cdb & ~PAGE_MASK;
 
-	ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+	ret = ib_copy_to_udata(udata, &uresp,
+			       min(sizeof(uresp), udata->outlen));
 	if (ret)
 		goto err_out;
 
@@ -1264,9 +1258,9 @@  void erdma_dealloc_ucontext(struct ib_ucontext *ibctx)
 
 	spin_lock(&dev->db_bitmap_lock);
 	if (ctx->sdb_type == ERDMA_SDB_PAGE)
-		clear_bit(ctx->sdb_idx, dev->sdb_page);
+		clear_bit(ctx->sdb_bitmap_idx, dev->sdb_page);
 	else if (ctx->sdb_type == ERDMA_SDB_ENTRY)
-		clear_bit(ctx->sdb_idx, dev->sdb_entry);
+		clear_bit(ctx->sdb_bitmap_idx, dev->sdb_entry);
 
 	erdma_uctx_user_mmap_entries_remove(ctx);
 
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.h b/drivers/infiniband/hw/erdma/erdma_verbs.h
index e0a993bc032a..4dbef1483027 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.h
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.h
@@ -35,9 +35,8 @@  struct erdma_ucontext {
 	struct ib_ucontext ibucontext;
 
 	u32 sdb_type;
-	u32 sdb_idx;
-	u32 sdb_page_idx;
-	u32 sdb_page_off;
+	u32 sdb_bitmap_idx;
+	u32 sdb_entid;
 	u64 sdb;
 	u64 rdb;
 	u64 cdb;
diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
index b7a0222f978f..57f8942a3c56 100644
--- a/include/uapi/rdma/erdma-abi.h
+++ b/include/uapi/rdma/erdma-abi.h
@@ -40,10 +40,13 @@  struct erdma_uresp_alloc_ctx {
 	__u32 dev_id;
 	__u32 pad;
 	__u32 sdb_type;
-	__u32 sdb_offset;
+	__u32 sdb_entid;
 	__aligned_u64 sdb;
 	__aligned_u64 rdb;
 	__aligned_u64 cdb;
+	__u32 sdb_off;
+	__u32 rdb_off;
+	__u32 cdb_off;
 };
 
 #endif