diff mbox

[v7,07/12] dma-mapping: introduce dma_has_iommu()

Message ID 150732935473.22363.1853399637339625023.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Dan Williams Oct. 6, 2017, 10:35 p.m. UTC
Add a helper to determine if the dma mappings set up for a given device
are backed by an iommu. In particular, this lets code paths know that a
dma_unmap operation will revoke access to memory if the device can not
otherwise be quiesced. The need for this knowledge is driven by a need
to make RDMA transfers to DAX mappings safe. If the DAX file's block map
changes we need to be to reliably stop accesses to blocks that have been
freed or re-assigned to a new file.

Since PMEM+DAX is currently only enabled for x86, we only update the x86
iommu drivers.

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/base/dma-mapping.c  |   10 ++++++++++
 drivers/iommu/amd_iommu.c   |    6 ++++++
 drivers/iommu/intel-iommu.c |    6 ++++++
 include/linux/dma-mapping.h |    3 +++
 4 files changed, 25 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Woodhouse Oct. 6, 2017, 10:45 p.m. UTC | #1
On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
> Add a helper to determine if the dma mappings set up for a given device
> are backed by an iommu. In particular, this lets code paths know that a
> dma_unmap operation will revoke access to memory if the device can not
> otherwise be quiesced. The need for this knowledge is driven by a need
> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> changes we need to be to reliably stop accesses to blocks that have been
> freed or re-assigned to a new file.

"a dma_unmap operation revoke access to memory"... but it's OK that the
next *map* will give the same DMA address to someone else, right?
Dan Williams Oct. 6, 2017, 10:52 p.m. UTC | #2
On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
>> Add a helper to determine if the dma mappings set up for a given device
>> are backed by an iommu. In particular, this lets code paths know that a
>> dma_unmap operation will revoke access to memory if the device can not
>> otherwise be quiesced. The need for this knowledge is driven by a need
>> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> changes we need to be to reliably stop accesses to blocks that have been
>> freed or re-assigned to a new file.
>
> "a dma_unmap operation revoke access to memory"... but it's OK that the
> next *map* will give the same DMA address to someone else, right?

I'm assuming the next map will be to other physical addresses and a
different requester device since the memory is still registered
exclusively.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Woodhouse Oct. 6, 2017, 11:10 p.m. UTC | #3
On Fri, 2017-10-06 at 15:52 -0700, Dan Williams wrote:
> On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
> > > 
> > > Add a helper to determine if the dma mappings set up for a given device
> > > are backed by an iommu. In particular, this lets code paths know that a
> > > dma_unmap operation will revoke access to memory if the device can not
> > > otherwise be quiesced. The need for this knowledge is driven by a need
> > > to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> > > changes we need to be to reliably stop accesses to blocks that have been
> > > freed or re-assigned to a new file.
> > "a dma_unmap operation revoke access to memory"... but it's OK that the
> > next *map* will give the same DMA address to someone else, right?
>
> I'm assuming the next map will be to other physical addresses and a
> different requester device since the memory is still registered
> exclusively.

I meant the next map for this device/group.

It may well use the same virtual DMA address as the one you just
unmapped, yet actually map to a different physical address. So if the
DMA still occurs to the "old" address, that isn't revoked at all — it's
just going to the wrong physical location.

And if you are sure that the DMA will never happen, why do you need to
revoke the mapping in the first place?
Dan Williams Oct. 6, 2017, 11:12 p.m. UTC | #4
On Fri, Oct 6, 2017 at 3:52 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>> On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
>>> Add a helper to determine if the dma mappings set up for a given device
>>> are backed by an iommu. In particular, this lets code paths know that a
>>> dma_unmap operation will revoke access to memory if the device can not
>>> otherwise be quiesced. The need for this knowledge is driven by a need
>>> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>>> changes we need to be to reliably stop accesses to blocks that have been
>>> freed or re-assigned to a new file.
>>
>> "a dma_unmap operation revoke access to memory"... but it's OK that the
>> next *map* will give the same DMA address to someone else, right?
>
> I'm assuming the next map will be to other physical addresses and a
> different requester device since the memory is still registered
> exclusively.

[ chatted with Ashok ]

Yes, it seems we need a way to pin that IOVA as in use, but invalidate
it. Then we can wait for the unmap to occur when the memory is
unregistered to avoid this IOVA reuse problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Oct. 6, 2017, 11:15 p.m. UTC | #5
On Fri, Oct 6, 2017 at 4:10 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Fri, 2017-10-06 at 15:52 -0700, Dan Williams wrote:
>> On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>> >
>> > On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
>> > >
>> > > Add a helper to determine if the dma mappings set up for a given device
>> > > are backed by an iommu. In particular, this lets code paths know that a
>> > > dma_unmap operation will revoke access to memory if the device can not
>> > > otherwise be quiesced. The need for this knowledge is driven by a need
>> > > to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> > > changes we need to be to reliably stop accesses to blocks that have been
>> > > freed or re-assigned to a new file.
>> > "a dma_unmap operation revoke access to memory"... but it's OK that the
>> > next *map* will give the same DMA address to someone else, right?
>>
>> I'm assuming the next map will be to other physical addresses and a
>> different requester device since the memory is still registered
>> exclusively.
>
> I meant the next map for this device/group.
>
> It may well use the same virtual DMA address as the one you just
> unmapped, yet actually map to a different physical address. So if the
> DMA still occurs to the "old" address, that isn't revoked at all — it's
> just going to the wrong physical location.
>
> And if you are sure that the DMA will never happen, why do you need to
> revoke the mapping in the first place?

Right, crossed mails. The semantic I want is that the IOVA is
invalidated / starts throwing errors to the device because the address
it thought it was talking to has been remapped in the file. Once
userspace wakes up and responds to this invalidation event it can do
the actual unmap to make the IOVA reusable again.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Woodhouse Oct. 7, 2017, 11:08 a.m. UTC | #6
On Fri, 2017-10-06 at 16:15 -0700, Dan Williams wrote:
> 
> Right, crossed mails. The semantic I want is that the IOVA is
> invalidated / starts throwing errors to the device because the address
> it thought it was talking to has been remapped in the file. Once
> userspace wakes up and responds to this invalidation event it can do
> the actual unmap to make the IOVA reusable again.

So basically you want to unmap it by removing it from the page tables
and flushing the IOTLB, but you want the IOVA to still be reserved.

The normal device-facing DMA API doesn't give you that today. You could
do it with the IOMMU API though — that one does let you manage the IOVA
space yourself. You don't want that IOVA used again? Well don't use it
as the IOVA in a subsequent iommu_map() call then :)
Dan Williams Oct. 7, 2017, 11:33 p.m. UTC | #7
On Sat, Oct 7, 2017 at 4:08 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Fri, 2017-10-06 at 16:15 -0700, Dan Williams wrote:
>>
>> Right, crossed mails. The semantic I want is that the IOVA is
>> invalidated / starts throwing errors to the device because the address
>> it thought it was talking to has been remapped in the file. Once
>> userspace wakes up and responds to this invalidation event it can do
>> the actual unmap to make the IOVA reusable again.
>
> So basically you want to unmap it by removing it from the page tables
> and flushing the IOTLB, but you want the IOVA to still be reserved.
>
> The normal device-facing DMA API doesn't give you that today. You could
> do it with the IOMMU API though — that one does let you manage the IOVA
> space yourself. You don't want that IOVA used again? Well don't use it
> as the IOVA in a subsequent iommu_map() call then :)

Ah, nice. So I think I'll just add a dma_get_iommu_domain() so the
dma_ops implementation can exclude identity-mapped devices, and then
iommu_unmap() does the rest. Thanks for the pointer.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Oct. 9, 2017, 6:58 p.m. UTC | #8
On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
> otherwise be quiesced. The need for this knowledge is driven by a need
> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> changes we need to be to reliably stop accesses to blocks that have been
> freed or re-assigned to a new file.

If RDMA is driving this need, why not invalidate backing RDMA MRs
instead of requiring a IOMMU to do it? RDMA MR are finer grained and
do not suffer from the re-use problem David W. brought up with IOVAs..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Oct. 9, 2017, 7:05 p.m. UTC | #9
On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
>> otherwise be quiesced. The need for this knowledge is driven by a need
>> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> changes we need to be to reliably stop accesses to blocks that have been
>> freed or re-assigned to a new file.
>
> If RDMA is driving this need, why not invalidate backing RDMA MRs
> instead of requiring a IOMMU to do it? RDMA MR are finer grained and
> do not suffer from the re-use problem David W. brought up with IOVAs..

Sounds promising. All I want in the end is to be sure that the kernel
is enabled to stop any in-flight RDMA at will without asking
userspace. Does this require per-RDMA driver opt-in or is there a
common call that can be made?

Outside of that the re-use problem is already solved by just unmapping
(iommu_unmap()) the IOVA, but keeping it allocated until the eventual
dma_unmap_sg() at memory un-registration time frees it.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Oct. 9, 2017, 7:18 p.m. UTC | #10
On Mon, Oct 09, 2017 at 12:05:30PM -0700, Dan Williams wrote:
> On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
> >> otherwise be quiesced. The need for this knowledge is driven by a need
> >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> >> changes we need to be to reliably stop accesses to blocks that have been
> >> freed or re-assigned to a new file.
> >
> > If RDMA is driving this need, why not invalidate backing RDMA MRs
> > instead of requiring a IOMMU to do it? RDMA MR are finer grained and
> > do not suffer from the re-use problem David W. brought up with IOVAs..
> 
> Sounds promising. All I want in the end is to be sure that the kernel
> is enabled to stop any in-flight RDMA at will without asking
> userspace. Does this require per-RDMA driver opt-in or is there a
> common call that can be made?

I don't think this has ever come up in the context of an all-device MR
invalidate requirement. Drivers already have code to invalidate
specifc MRs, but to find all MRs that touch certain pages and then
invalidate them would be new code.

We also have ODP aware drivers that can retarget a MR to new
physical pages. If the block map changes DAX should synchronously
retarget the ODP MR, not halt DMA.

Most likely ODP & DAX would need to be used together to get robust
user applications, as having the user QP's go to an error state at
random times (due to DMA failures) during operation is never going to
be acceptable...

Perhaps you might want to initially only support ODP MR mappings with
DAX and then the DMA fencing issue goes away?

Cheers,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Oct. 9, 2017, 7:28 p.m. UTC | #11
On Mon, Oct 9, 2017 at 12:18 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Oct 09, 2017 at 12:05:30PM -0700, Dan Williams wrote:
>> On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:
>> > On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
>> >> otherwise be quiesced. The need for this knowledge is driven by a need
>> >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> >> changes we need to be to reliably stop accesses to blocks that have been
>> >> freed or re-assigned to a new file.
>> >
>> > If RDMA is driving this need, why not invalidate backing RDMA MRs
>> > instead of requiring a IOMMU to do it? RDMA MR are finer grained and
>> > do not suffer from the re-use problem David W. brought up with IOVAs..
>>
>> Sounds promising. All I want in the end is to be sure that the kernel
>> is enabled to stop any in-flight RDMA at will without asking
>> userspace. Does this require per-RDMA driver opt-in or is there a
>> common call that can be made?
>
> I don't think this has ever come up in the context of an all-device MR
> invalidate requirement. Drivers already have code to invalidate
> specifc MRs, but to find all MRs that touch certain pages and then
> invalidate them would be new code.
>
> We also have ODP aware drivers that can retarget a MR to new
> physical pages. If the block map changes DAX should synchronously
> retarget the ODP MR, not halt DMA.

Have a look at the patch [1], I don't touch the ODP path.

> Most likely ODP & DAX would need to be used together to get robust
> user applications, as having the user QP's go to an error state at
> random times (due to DMA failures) during operation is never going to
> be acceptable...

It's not random. The process that set up the mapping and registered
the memory gets SIGIO when someone else tries to modify the file map.
That process then gets /proc/sys/fs/lease-break-time seconds to fix
the problem before the kernel force revokes the DMA access.

It's otherwise not acceptable to allow DMA into random locations when
the file map changes.

> Perhaps you might want to initially only support ODP MR mappings with
> DAX and then the DMA fencing issue goes away?

I'd rather try to fix the non-ODP DAX case instead of just turning it off.

[1]: https://patchwork.kernel.org/patch/9991681/
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Oct. 10, 2017, 5:25 p.m. UTC | #12
On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote:

> > I don't think this has ever come up in the context of an all-device MR
> > invalidate requirement. Drivers already have code to invalidate
> > specifc MRs, but to find all MRs that touch certain pages and then
> > invalidate them would be new code.
> >
> > We also have ODP aware drivers that can retarget a MR to new
> > physical pages. If the block map changes DAX should synchronously
> > retarget the ODP MR, not halt DMA.
> 
> Have a look at the patch [1], I don't touch the ODP path.

But, does ODP work OK already? I'm not clear on that..

> > Most likely ODP & DAX would need to be used together to get robust
> > user applications, as having the user QP's go to an error state at
> > random times (due to DMA failures) during operation is never going to
> > be acceptable...
> 
> It's not random. The process that set up the mapping and registered
> the memory gets SIGIO when someone else tries to modify the file map.
> That process then gets /proc/sys/fs/lease-break-time seconds to fix
> the problem before the kernel force revokes the DMA access.

Well, the process can't fix the problem in bounded time, so it is
random if it will fail or not.

MR life time is under the control of the remote side, and time to
complete the network exchanges required to release the MRs is hard to
bound. So even if I implement SIGIO properly my app will still likely
have random QP failures under various cases and work loads. :(

This is why ODP should be the focus because this cannot work fully
reliably otherwise..

> > Perhaps you might want to initially only support ODP MR mappings with
> > DAX and then the DMA fencing issue goes away?
> 
> I'd rather try to fix the non-ODP DAX case instead of just turning it off.

Well, what about using SIGKILL if the lease-break-time hits? The
kernel will clean up the MRs when the process exits and this will
fence DMA to that memory.

But, still, if you really want to be fined graned, then I think
invalidating the impacted MR's is a better solution for RDMA than
trying to do it with the IOMMU...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Oct. 10, 2017, 5:39 p.m. UTC | #13
On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote:
>
>> > I don't think this has ever come up in the context of an all-device MR
>> > invalidate requirement. Drivers already have code to invalidate
>> > specifc MRs, but to find all MRs that touch certain pages and then
>> > invalidate them would be new code.
>> >
>> > We also have ODP aware drivers that can retarget a MR to new
>> > physical pages. If the block map changes DAX should synchronously
>> > retarget the ODP MR, not halt DMA.
>>
>> Have a look at the patch [1], I don't touch the ODP path.
>
> But, does ODP work OK already? I'm not clear on that..

It had better. If the mapping is invalidated I would hope that
generates an io fault that gets handled by the driver to setup the new
mapping. I don't see how it can work otherwise.

>> > Most likely ODP & DAX would need to be used together to get robust
>> > user applications, as having the user QP's go to an error state at
>> > random times (due to DMA failures) during operation is never going to
>> > be acceptable...
>>
>> It's not random. The process that set up the mapping and registered
>> the memory gets SIGIO when someone else tries to modify the file map.
>> That process then gets /proc/sys/fs/lease-break-time seconds to fix
>> the problem before the kernel force revokes the DMA access.
>
> Well, the process can't fix the problem in bounded time, so it is
> random if it will fail or not.
>
> MR life time is under the control of the remote side, and time to
> complete the network exchanges required to release the MRs is hard to
> bound. So even if I implement SIGIO properly my app will still likely
> have random QP failures under various cases and work loads. :(
>
> This is why ODP should be the focus because this cannot work fully
> reliably otherwise..

The lease break time is configurable. If that application can't
respond to a stop request within a timeout of its own choosing then it
should not be using DAX mappings.

>
>> > Perhaps you might want to initially only support ODP MR mappings with
>> > DAX and then the DMA fencing issue goes away?
>>
>> I'd rather try to fix the non-ODP DAX case instead of just turning it off.
>
> Well, what about using SIGKILL if the lease-break-time hits? The
> kernel will clean up the MRs when the process exits and this will
> fence DMA to that memory.

Can you point me to where the MR cleanup code fences DMA and quiesces
the device?

> But, still, if you really want to be fined graned, then I think
> invalidating the impacted MR's is a better solution for RDMA than
> trying to do it with the IOMMU...

If there's a better routine for handling ib_umem_lease_break() I'd
love to use it. Right now I'm reaching for the only tool I know for
kernel enforced revocation of DMA access.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Oct. 10, 2017, 6:05 p.m. UTC | #14
On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote:
> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe

> >> Have a look at the patch [1], I don't touch the ODP path.
> >
> > But, does ODP work OK already? I'm not clear on that..
> 
> It had better. If the mapping is invalidated I would hope that
> generates an io fault that gets handled by the driver to setup the new
> mapping. I don't see how it can work otherwise.

I would assume so too...

> > This is why ODP should be the focus because this cannot work fully
> > reliably otherwise..
> 
> The lease break time is configurable. If that application can't
> respond to a stop request within a timeout of its own choosing then it
> should not be using DAX mappings.

Well, no RDMA application can really do this, unless you set the
timeout to multiple minutes, on par with network timeouts.

Again, these details are why I think this kind of DAX and non ODP-MRs
are probably practically not too useful for a production system. Great
for test of course, but in that case SIGKILL would be fine too...

> > Well, what about using SIGKILL if the lease-break-time hits? The
> > kernel will clean up the MRs when the process exits and this will
> > fence DMA to that memory.
> 
> Can you point me to where the MR cleanup code fences DMA and quiesces
> the device?

Yes. The MR's are associated with an fd. When the fd is closed
ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through
all the objects, including MRs, and deletes them.

The specification for deleting a MR requires a synchronous fence with
the hardware. After MR deletion the hardware will not DMA to any pages
described by the old MR, and those pages will be unpinned.

> > But, still, if you really want to be fined graned, then I think
> > invalidating the impacted MR's is a better solution for RDMA than
> > trying to do it with the IOMMU...
> 
> If there's a better routine for handling ib_umem_lease_break() I'd
> love to use it. Right now I'm reaching for the only tool I know for
> kernel enforced revocation of DMA access.

Well, you'd have to code something in the MR code to keep track of DAX
MRs and issue an out of band invalidate to impacted MRs to create the
fence.

This probably needs some driver work, I'm not sure if all the hardware
can do out of band invalidate to any MR or not..

Generally speaking, in RDMA, when a new feature like this comes along
we have to push a lot of the work down to the driver authors, and the
approach has historically been that new features only work on some
hardware (as much as I dislike this, it is pragmatic)

So, not being able to support DAX on certain RDMA hardware is not
an unreasonable situation in our space.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Oct. 10, 2017, 8:17 p.m. UTC | #15
On Tue, Oct 10, 2017 at 11:05 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote:
>> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe
>
>> >> Have a look at the patch [1], I don't touch the ODP path.
>> >
>> > But, does ODP work OK already? I'm not clear on that..
>>
>> It had better. If the mapping is invalidated I would hope that
>> generates an io fault that gets handled by the driver to setup the new
>> mapping. I don't see how it can work otherwise.
>
> I would assume so too...
>
>> > This is why ODP should be the focus because this cannot work fully
>> > reliably otherwise..
>>
>> The lease break time is configurable. If that application can't
>> respond to a stop request within a timeout of its own choosing then it
>> should not be using DAX mappings.
>
> Well, no RDMA application can really do this, unless you set the
> timeout to multiple minutes, on par with network timeouts.

The default lease break timeout is 45 seconds on my system, so minutes
does not seem out of the question.

Also keep in mind that what triggers the lease break is another
application trying to write or punch holes in a file that is mapped
for RDMA. So, if the hardware can't handle the iommu mapping getting
invalidated asynchronously and the application can't react in the
lease break timeout period then the administrator should arrange for
the file to not be written or truncated while it is mapped.

It's already the case that get_user_pages() does not lock down file
associations, so if your application is contending with these types of
file changes it likely already has a problem keeping transactions in
sync with the file state even without DAX.

> Again, these details are why I think this kind of DAX and non ODP-MRs
> are probably practically not too useful for a production system. Great
> for test of course, but in that case SIGKILL would be fine too...
>
>> > Well, what about using SIGKILL if the lease-break-time hits? The
>> > kernel will clean up the MRs when the process exits and this will
>> > fence DMA to that memory.
>>
>> Can you point me to where the MR cleanup code fences DMA and quiesces
>> the device?
>
> Yes. The MR's are associated with an fd. When the fd is closed
> ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through
> all the objects, including MRs, and deletes them.
>
> The specification for deleting a MR requires a synchronous fence with
> the hardware. After MR deletion the hardware will not DMA to any pages
> described by the old MR, and those pages will be unpinned.
>
>> > But, still, if you really want to be fined graned, then I think
>> > invalidating the impacted MR's is a better solution for RDMA than
>> > trying to do it with the IOMMU...
>>
>> If there's a better routine for handling ib_umem_lease_break() I'd
>> love to use it. Right now I'm reaching for the only tool I know for
>> kernel enforced revocation of DMA access.
>
> Well, you'd have to code something in the MR code to keep track of DAX
> MRs and issue an out of band invalidate to impacted MRs to create the
> fence.
>
> This probably needs some driver work, I'm not sure if all the hardware
> can do out of band invalidate to any MR or not..

Ok.

>
> Generally speaking, in RDMA, when a new feature like this comes along
> we have to push a lot of the work down to the driver authors, and the
> approach has historically been that new features only work on some
> hardware (as much as I dislike this, it is pragmatic)
>
> So, not being able to support DAX on certain RDMA hardware is not
> an unreasonable situation in our space.

That makes sense, but it still seems to me that this proposed solution
allows more than enough ways to avoid that worst case scenario where
hardware reacts badly to iommu invalidation. Drivers that can do
better than iommu invalidation can arrange for a callback to do their
driver-specific action at lease break time. Hardware that can't should
be blacklisted from supporting DAX altogether. In other words this is
a starting point to incrementally enhance or disable specific drivers,
but with the assurance that the kernel can always do the safe thing
when / if the driver is missing a finer grained solution.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Oct. 12, 2017, 6:27 p.m. UTC | #16
On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote:

> Also keep in mind that what triggers the lease break is another
> application trying to write or punch holes in a file that is mapped
> for RDMA. So, if the hardware can't handle the iommu mapping getting
> invalidated asynchronously and the application can't react in the
> lease break timeout period then the administrator should arrange for
> the file to not be written or truncated while it is mapped.

That makes sense, but why not return ENOSYS or something to the app
trying to alter the file if the RDMA hardware can't support this
instead of having the RDMA app deal with this lease break weirdness?

> It's already the case that get_user_pages() does not lock down file
> associations, so if your application is contending with these types of
> file changes it likely already has a problem keeping transactions in
> sync with the file state even without DAX.

Yes, things go weird in non-ODP RDMA cases like this..

Also, just to clear, I would expect an app using the SIGIO interface
to basically halt ongoing RDMA, wait for MRs to become unused locally
and remotely, destroy the MRs, then somehow, establish new MRs that
cover the same logical map (eg what ODP would do transparently) after
the lease breaker has made their changes, then restart their IO.

Does your SIGIO approach have a race-free way to do that last steps?

> > So, not being able to support DAX on certain RDMA hardware is not
> > an unreasonable situation in our space.
> 
> That makes sense, but it still seems to me that this proposed solution
> allows more than enough ways to avoid that worst case scenario where
> hardware reacts badly to iommu invalidation.

Yes, although I am concerned that returning PCI-E errors is such an
unusual and untested path for some of our RDMA drivers that they may
malfunction badly...

Again, going back to the question of who would ever use this, I would
be very relucant to deploy a production configuration relying on the iommu
invalidate or SIGIO techniques, when ODP HW is available and works
flawlessly.

> be blacklisted from supporting DAX altogether. In other words this is
> a starting point to incrementally enhance or disable specific drivers,
> but with the assurance that the kernel can always do the safe thing
> when / if the driver is missing a finer grained solution.

Seems reasonable.. I think existing HW will have an easier time adding
invalidate, while new hardware really should implement ODP.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Oct. 12, 2017, 8:10 p.m. UTC | #17
On Thu, Oct 12, 2017 at 11:27 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote:
>
>> Also keep in mind that what triggers the lease break is another
>> application trying to write or punch holes in a file that is mapped
>> for RDMA. So, if the hardware can't handle the iommu mapping getting
>> invalidated asynchronously and the application can't react in the
>> lease break timeout period then the administrator should arrange for
>> the file to not be written or truncated while it is mapped.
>
> That makes sense, but why not return ENOSYS or something to the app
> trying to alter the file if the RDMA hardware can't support this
> instead of having the RDMA app deal with this lease break weirdness?

That's where I started, an inode flag that said "hands off, this file
is busy", but Christoph pointed out that we should reuse the same
mechanisms that pnfs is using. The pnfs protection scheme uses file
leases, and once the kernel decides that a lease needs to be broken /
layout needs to be recalled there is no stopping it, only delaying.

>> It's already the case that get_user_pages() does not lock down file
>> associations, so if your application is contending with these types of
>> file changes it likely already has a problem keeping transactions in
>> sync with the file state even without DAX.
>
> Yes, things go weird in non-ODP RDMA cases like this..
>
> Also, just to clear, I would expect an app using the SIGIO interface
> to basically halt ongoing RDMA, wait for MRs to become unused locally
> and remotely, destroy the MRs, then somehow, establish new MRs that
> cover the same logical map (eg what ODP would do transparently) after
> the lease breaker has made their changes, then restart their IO.
>
> Does your SIGIO approach have a race-free way to do that last steps?

After the SIGIO that's becomes a userspace / driver problem to quiesce
the I/O...

However, chatting this over with a few more people I have an alternate
solution that effectively behaves the same as how non-ODP hardware
handles this case of hole punch / truncation today. So, today if this
scenario happens on a page-cache backed mapping, the file blocks are
unmapped and the RDMA continues into pinned pages that are no longer
part of the file. We can achieve the same thing with the iommu, just
re-target the I/O into memory that isn't part of the file. That way
hardware does not see I/O errors and the DAX data consistency model is
no worse than the page-cache case.

>> > So, not being able to support DAX on certain RDMA hardware is not
>> > an unreasonable situation in our space.
>>
>> That makes sense, but it still seems to me that this proposed solution
>> allows more than enough ways to avoid that worst case scenario where
>> hardware reacts badly to iommu invalidation.
>
> Yes, although I am concerned that returning PCI-E errors is such an
> unusual and untested path for some of our RDMA drivers that they may
> malfunction badly...
>
> Again, going back to the question of who would ever use this, I would
> be very relucant to deploy a production configuration relying on the iommu
> invalidate or SIGIO techniques, when ODP HW is available and works
> flawlessly.

I don't think it is reasonable to tell people you need to throw away
your old hardware just because you want to target a DAX mapping.

>> be blacklisted from supporting DAX altogether. In other words this is
>> a starting point to incrementally enhance or disable specific drivers,
>> but with the assurance that the kernel can always do the safe thing
>> when / if the driver is missing a finer grained solution.
>
> Seems reasonable.. I think existing HW will have an easier time adding
> invalidate, while new hardware really should implement ODP.
>

Yeah, so if we go with 'remap' instead of 'invalidate' does that
address your concerns?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 13, 2017, 6:50 a.m. UTC | #18
On Thu, Oct 12, 2017 at 01:10:33PM -0700, Dan Williams wrote:
> On Thu, Oct 12, 2017 at 11:27 AM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote:
> >
> >> Also keep in mind that what triggers the lease break is another
> >> application trying to write or punch holes in a file that is mapped
> >> for RDMA. So, if the hardware can't handle the iommu mapping getting
> >> invalidated asynchronously and the application can't react in the
> >> lease break timeout period then the administrator should arrange for
> >> the file to not be written or truncated while it is mapped.
> >
> > That makes sense, but why not return ENOSYS or something to the app
> > trying to alter the file if the RDMA hardware can't support this
> > instead of having the RDMA app deal with this lease break weirdness?
> 
> That's where I started, an inode flag that said "hands off, this file
> is busy", but Christoph pointed out that we should reuse the same
> mechanisms that pnfs is using. The pnfs protection scheme uses file
> leases, and once the kernel decides that a lease needs to be broken /
> layout needs to be recalled there is no stopping it, only delaying.

That was just a suggestion - the important statement is that a hands
off flag is just a no-go.

> However, chatting this over with a few more people I have an alternate
> solution that effectively behaves the same as how non-ODP hardware
> handles this case of hole punch / truncation today. So, today if this
> scenario happens on a page-cache backed mapping, the file blocks are
> unmapped and the RDMA continues into pinned pages that are no longer
> part of the file. We can achieve the same thing with the iommu, just
> re-target the I/O into memory that isn't part of the file. That way
> hardware does not see I/O errors and the DAX data consistency model is
> no worse than the page-cache case.

Yikes.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 13, 2017, 7:09 a.m. UTC | #19
On Mon, Oct 09, 2017 at 01:18:20PM -0600, Jason Gunthorpe wrote:
> > > If RDMA is driving this need, why not invalidate backing RDMA MRs
> > > instead of requiring a IOMMU to do it? RDMA MR are finer grained and
> > > do not suffer from the re-use problem David W. brought up with IOVAs..
> > 
> > Sounds promising. All I want in the end is to be sure that the kernel
> > is enabled to stop any in-flight RDMA at will without asking
> > userspace. Does this require per-RDMA driver opt-in or is there a
> > common call that can be made?
> 
> I don't think this has ever come up in the context of an all-device MR
> invalidate requirement. Drivers already have code to invalidate
> specifc MRs, but to find all MRs that touch certain pages and then
> invalidate them would be new code.

The whole point is that we should not need that IFF we provide the
right interface.

If we have a new 'register memory with a lease', the driver (or in fact
probably the umem core for the drivers using it) has the lease associated
with the ib_umem structure, which will just need a backpointer from the
ib_umem to the to the MR to unregister it.

Which might be a good opportunity to break the user MR from the in-kernel
ones and merge it with ib_umem, but that's a different story..
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Oct. 13, 2017, 3:03 p.m. UTC | #20
On Fri, Oct 13, 2017 at 08:50:47AM +0200, Christoph Hellwig wrote:

> > However, chatting this over with a few more people I have an alternate
> > solution that effectively behaves the same as how non-ODP hardware
> > handles this case of hole punch / truncation today. So, today if this
> > scenario happens on a page-cache backed mapping, the file blocks are
> > unmapped and the RDMA continues into pinned pages that are no longer
> > part of the file. We can achieve the same thing with the iommu, just
> > re-target the I/O into memory that isn't part of the file. That way
> > hardware does not see I/O errors and the DAX data consistency model is
> > no worse than the page-cache case.
> 
> Yikes.

Well, as much as you say Yikes, Dan is correct, this does match the
semantics RDMA MR's already have. They become non-coherent if their
underlying object is changed, and there are many ways to get there.
I've never thought about it, but it does sound like ftruncate,
fallocate, etc on a normal file would break the MR coherency too??

There have been efforts in the past driven by the MPI people to
create, essentially, something like lease-break' SIGIO. Except it was
intended to be general, and wanted solve all the problems related with
MR de-coherence. This was complicated and never became acceptable to
mainline.

Instead ODP was developed, and ODP actually solves all the problem
sanely.

Thinking about it some more, and with your other comments on
get_user_pages in this thread, I tend to agree. It doesn't make sense
to develop a user space lease break API for MR's that is a DAX
specific feature.

Along the some lines, it also doesn't make sense to force-invalidate
MR's linked to DAX regions, while leaving MR's linked to other
regions that have the same problem alone.

If you want to make non-ODP MR's work better, then you need to have a
general overall solution to tell userspace when the MR becomes (or I
guess, is becoming) non-coherent, that covers all the cases that break
MR coherence, not just via DAX.

Otherwise, I think Dan is right, keeping the current semantic of
having MRs just do something wrong, but not corrupt memory, when they
loose coherence, is broadly consistent with how non-ODP MRs work today.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matan Barak Oct. 15, 2017, 3:14 p.m. UTC | #21
On Fri, Oct 13, 2017 at 6:03 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Fri, Oct 13, 2017 at 08:50:47AM +0200, Christoph Hellwig wrote:
>
>> > However, chatting this over with a few more people I have an alternate
>> > solution that effectively behaves the same as how non-ODP hardware
>> > handles this case of hole punch / truncation today. So, today if this
>> > scenario happens on a page-cache backed mapping, the file blocks are
>> > unmapped and the RDMA continues into pinned pages that are no longer
>> > part of the file. We can achieve the same thing with the iommu, just
>> > re-target the I/O into memory that isn't part of the file. That way
>> > hardware does not see I/O errors and the DAX data consistency model is
>> > no worse than the page-cache case.
>>
>> Yikes.
>
> Well, as much as you say Yikes, Dan is correct, this does match the
> semantics RDMA MR's already have. They become non-coherent if their
> underlying object is changed, and there are many ways to get there.
> I've never thought about it, but it does sound like ftruncate,
> fallocate, etc on a normal file would break the MR coherency too??
>
> There have been efforts in the past driven by the MPI people to
> create, essentially, something like lease-break' SIGIO. Except it was
> intended to be general, and wanted solve all the problems related with
> MR de-coherence. This was complicated and never became acceptable to
> mainline.
>
> Instead ODP was developed, and ODP actually solves all the problem
> sanely.
>
> Thinking about it some more, and with your other comments on
> get_user_pages in this thread, I tend to agree. It doesn't make sense
> to develop a user space lease break API for MR's that is a DAX
> specific feature.
>
> Along the some lines, it also doesn't make sense to force-invalidate
> MR's linked to DAX regions, while leaving MR's linked to other
> regions that have the same problem alone.
>
> If you want to make non-ODP MR's work better, then you need to have a
> general overall solution to tell userspace when the MR becomes (or I
> guess, is becoming) non-coherent, that covers all the cases that break
> MR coherence, not just via DAX.
>
> Otherwise, I think Dan is right, keeping the current semantic of
> having MRs just do something wrong, but not corrupt memory, when they
> loose coherence, is broadly consistent with how non-ODP MRs work today.
>

I agree, keeping the current semantics is probably the best thing we
could do. It's a trade-off between breaking existing applications,
having a new lease API for DAX or just failing DAX in particular (as
opposed to other cases). For stable mappings, what we have is probably
sufficient. For mappings which could be changed, it's unclear to me
how you could guarantee non-racy behavior that is bounded by a
pre-defined time and guarantee no user-space errors. On top of that,
ODP (should) already solve that problem transparently.

IMHO, using iommu for that and causing DMA errors just because the
lease broke isn't the right thing to do.

> Jason

Matan

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Oct. 15, 2017, 3:21 p.m. UTC | #22
On Sun, Oct 15, 2017 at 8:14 AM, Matan Barak <matanb@dev.mellanox.co.il> wrote:
[..]
> IMHO, using iommu for that and causing DMA errors just because the
> lease broke isn't the right thing to do.

Yes, see the current proposal over in this thread:

https://lists.01.org/pipermail/linux-nvdimm/2017-October/012885.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index e584eddef0a7..e1b5f103d90e 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -369,3 +369,13 @@  void dma_deconfigure(struct device *dev)
 	of_dma_deconfigure(dev);
 	acpi_dma_deconfigure(dev);
 }
+
+bool dma_has_iommu(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (ops && ops->has_iommu)
+		return ops->has_iommu(dev);
+	return false;
+}
+EXPORT_SYMBOL(dma_has_iommu);
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 51f8215877f5..873f899fcf57 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2271,6 +2271,11 @@  static struct protection_domain *get_domain(struct device *dev)
 	return domain;
 }
 
+static bool amd_dma_has_iommu(struct device *dev)
+{
+	return !IS_ERR(get_domain(dev));
+}
+
 static void update_device_table(struct protection_domain *domain)
 {
 	struct iommu_dev_data *dev_data;
@@ -2689,6 +2694,7 @@  static const struct dma_map_ops amd_iommu_dma_ops = {
 	.unmap_sg	= unmap_sg,
 	.dma_supported	= amd_iommu_dma_supported,
 	.mapping_error	= amd_iommu_mapping_error,
+	.has_iommu	= amd_dma_has_iommu,
 };
 
 static int init_reserved_iova_ranges(void)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..243ef42fdad4 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3578,6 +3578,11 @@  static int iommu_no_mapping(struct device *dev)
 	return 0;
 }
 
+static bool intel_dma_has_iommu(struct device *dev)
+{
+	return !iommu_no_mapping(dev);
+}
+
 static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 				     size_t size, int dir, u64 dma_mask)
 {
@@ -3872,6 +3877,7 @@  const struct dma_map_ops intel_dma_ops = {
 	.map_page = intel_map_page,
 	.unmap_page = intel_unmap_page,
 	.mapping_error = intel_mapping_error,
+	.has_iommu = intel_dma_has_iommu,
 #ifdef CONFIG_X86
 	.dma_supported = x86_dma_supported,
 #endif
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 29ce9815da87..659f122c18f5 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -128,6 +128,7 @@  struct dma_map_ops {
 				   enum dma_data_direction dir);
 	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
 	int (*dma_supported)(struct device *dev, u64 mask);
+	bool (*has_iommu)(struct device *dev);
 #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
 	u64 (*get_required_mask)(struct device *dev);
 #endif
@@ -221,6 +222,8 @@  static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
 }
 #endif
 
+extern bool dma_has_iommu(struct device *dev);
+
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
 					      size_t size,
 					      enum dma_data_direction dir,