[v2] arm64: do not set dma masks that device connection can't handle

Message ID 1483947002-16410-1-git-send-email-nikita.yoush@cogentembedded.com (mailing list archive)
State Not Applicable
Delegated to: Simon Horman

Commit Message

Nikita Yushchenko Jan. 9, 2017, 7:30 a.m. UTC
It is possible that a device is capable of 64-bit DMA addresses, and
its driver tries to set a wide DMA mask, but the bridge or bus used to
connect the device to the system can't handle wide addresses.

With swiotlb, memory above 4G can still be used by drivers for streaming
DMA, but *dev->dma_mask and dev->coherent_dma_mask must still keep values
that the hardware handles physically.

This patch enforces that. Based on original version by
Arnd Bergmann <arnd@arndb.de>, extended with coherent mask handling.

Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
CC: Arnd Bergmann <arnd@arndb.de>
---
Changes since v1:
- fixed issues noted by Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
  - save mask, not size
  - remove double empty line

 arch/arm64/Kconfig              |  3 +++
 arch/arm64/include/asm/device.h |  1 +
 arch/arm64/mm/dma-mapping.c     | 51 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)

Comments

Will Deacon Jan. 10, 2017, 11:51 a.m. UTC | #1
On Mon, Jan 09, 2017 at 10:30:02AM +0300, Nikita Yushchenko wrote:
> It is possible that a device is capable of 64-bit DMA addresses, and
> its driver tries to set a wide DMA mask, but the bridge or bus used to
> connect the device to the system can't handle wide addresses.
> 
> With swiotlb, memory above 4G can still be used by drivers for streaming
> DMA, but *dev->dma_mask and dev->coherent_dma_mask must still keep values
> that the hardware handles physically.
> 
> This patch enforces that. Based on original version by
> Arnd Bergmann <arnd@arndb.de>, extended with coherent mask handling.
> 
> Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
> CC: Arnd Bergmann <arnd@arndb.de>
> ---
> Changes since v1:
> - fixed issues noted by Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
>   - save mask, not size
>   - remove double empty line
> 
>  arch/arm64/Kconfig              |  3 +++
>  arch/arm64/include/asm/device.h |  1 +
>  arch/arm64/mm/dma-mapping.c     | 51 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 55 insertions(+)

I still don't think this patch is general enough. The problem you're seeing
with swiotlb seems to be exactly the same problem reported by Feng Kan over
at:

  http://lkml.kernel.org/r/CAL85gmA_SSCwM80TKdkZqEe+S1beWzDEvdki1kpkmUTDRmSP7g@mail.gmail.com

[read on; it was initially thought to be a hardware erratum, but it's
 actually the inability to restrict the DMA mask of the endpoint that's
 the problem]

The point here is that an IOMMU doesn't solve your issue, and the
IOMMU-backed DMA ops need the same treatment. In light of that, it really
feels to me like the DMA masks should be restricted in of_dma_configure
so that the parent mask is taken into account there, rather than hook
into each set of DMA ops to intercept set_dma_mask. We'd still need to
do something to stop dma_set_mask widening the mask if it was restricted
by of_dma_configure, but I think Robin (cc'd) was playing with that.
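
Something like the sketch below is what I have in mind - purely
illustrative, with 'bus_dma_mask' standing in for whatever
of_dma_configure would record from dma-ranges (the field doesn't exist
today):

	int dma_set_mask(struct device *dev, u64 mask)
	{
		if (!dev->dma_mask || !dma_supported(dev, mask))
			return -EIO;

		/* never let a driver widen the mask past the bus limit */
		mask &= dev->bus_dma_mask;
		*dev->dma_mask = mask;
		return 0;
	}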

Will
Nikita Yushchenko Jan. 10, 2017, 12:47 p.m. UTC | #2
Hi

> The point here is that an IOMMU doesn't solve your issue, and the
> IOMMU-backed DMA ops need the same treatment. In light of that, it really
> feels to me like the DMA masks should be restricted in of_dma_configure
> so that the parent mask is taken into account there, rather than hook
> into each set of DMA ops to intercept set_dma_mask. We'd still need to
> do something to stop dma_set_mask widening the mask if it was restricted
> by of_dma_configure, but I think Robin (cc'd) was playing with that.

What issue "IOMMU doesn't solve"?

The issue I'm trying to address is an inconsistency within the swiotlb
dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
mask is then used to decide whether bounce buffers are needed or not.
This inconsistency causes the NVMe + R-Car combo to not work (and to
corrupt memory instead).
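
In rough pseudo-code, the two halves look like this (a simplified
sketch, not the exact lib/swiotlb.c source; swiotlb_bounce_map() is a
stand-in name):

	/* (1) set_dma_mask path: any wide mask is stored unchecked */
	*dev->dma_mask = mask;

	/* (2) map path: the very same mask decides whether to bounce */
	dma_addr_t dev_addr = phys_to_dma(dev, phys);
	if (dma_capable(dev, dev_addr, size))
		return dev_addr;	/* wide mask: never bounces, and the
					 * 32-bit bus truncates the address */
	return swiotlb_bounce_map(dev, phys, size);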

I just can't figure out what similar issue the iommu case can have.
Do you mean that in the iommu case, the mask also must not be set wider
than its initial value? Why? What is the use of the mask in the iommu
case? Is there any real case where the iommu can't address all memory
existing in the system?

The NVMe maintainer has just stated that they expect
set_dma_mask(DMA_BIT_MASK(64)) to always succeed, and are going to fail
the driver probe if that call fails.  They claim that the architecture
must always be able to dma_map() whatever memory exists in the system -
via iommu or swiotlb or whatever. Their direction is to remove bounce
buffers from block and other layers.
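
(For reference, the probe-time pattern in question is roughly this,
paraphrased from drivers/nvme/host/pci.c of that era:)

	if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64)) &&
	    dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(32)))
		goto disable;	/* probe bails out entirely */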

With this direction, the semantics of the dma mask become even more
questionable. I'd say dma_mask is a candidate for removal (or for moving
to swiotlb's or iommu's local area).

Nikita
Arnd Bergmann Jan. 10, 2017, 1:12 p.m. UTC | #3
On Tuesday, January 10, 2017 3:47:25 PM CET Nikita Yushchenko wrote:
> 
> > The point here is that an IOMMU doesn't solve your issue, and the
> > IOMMU-backed DMA ops need the same treatment. In light of that, it really
> > feels to me like the DMA masks should be restricted in of_dma_configure
> > so that the parent mask is taken into account there, rather than hook
> > into each set of DMA ops to intercept set_dma_mask.

of_dma_configure() sets up a 32-bit mask, which is assumed to always work
in the kernel. We can't change it to a larger mask because that would
break drivers that have to restrict devices to 32-bit.
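
For reference, the relevant part of of_dma_configure() is roughly:

	/* from drivers/of/device.c, slightly abridged */
	dev->coherent_dma_mask = DMA_BIT_MASK(32);
	if (!dev->dma_mask)
		dev->dma_mask = &dev->coherent_dma_mask;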

If the bus addressing is narrower than 32 bits however, the initial mask
should probably be limited to whatever the bus supports, but that is not
the problem we are trying to solve here.

> > We'd still need to
> > do something to stop dma_set_mask widening the mask if it was restricted
> > by of_dma_configure, but I think Robin (cc'd) was playing with that.
> 
> What issue "IOMMU doesn't solve"?
> 
> The issue I'm trying to address is an inconsistency within the swiotlb
> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
> mask is then used to decide whether bounce buffers are needed or not.
> This inconsistency causes the NVMe + R-Car combo to not work (and to
> corrupt memory instead).

It's not just an inconsistency, it's a known bug that we really
need to fix.

> I just can't figure out what similar issue the iommu case can have.
> Do you mean that in the iommu case, the mask also must not be set wider
> than its initial value? Why? What is the use of the mask in the iommu
> case? Is there any real case where the iommu can't address all memory
> existing in the system?

I think the problem that Will is referring to is when the IOMMU has
a virtual address space that is wider than the DMA mask of the device:
In this case, dma_map_single() might return a dma_addr_t that is not
reachable by the device.
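
In driver terms, the hazard looks like this (illustrative, with the bit
widths taken from the thread above):

	/* device wired for 42 bits, IOMMU allocating from a 48-bit space */
	dma_addr_t iova = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	/* nothing guarantees iova < (1ULL << 42); if it isn't, the device
	 * emits a truncated address on the bus */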

I'd consider that a separate bug that needs to be worked around in
the IOMMU code.

> The NVMe maintainer has just stated that they expect
> set_dma_mask(DMA_BIT_MASK(64)) to always succeed, and are going to fail
> the driver probe if that call fails.  They claim that the architecture
> must always be able to dma_map() whatever memory exists in the system -
> via iommu or swiotlb or whatever. Their direction is to remove bounce
> buffers from block and other layers.
> 
> With this direction, the semantics of the dma mask become even more
> questionable. I'd say dma_mask is a candidate for removal (or for moving
> to swiotlb's or iommu's local area).

Removing dma_mask is not realistic any time soon; there are too many things
in the kernel that use it for one thing or another, so any changes here
have to be done really carefully. We definitely need the mask to support
architectures without swiotlb.

	Arnd
Robin Murphy Jan. 10, 2017, 1:25 p.m. UTC | #4
On 10/01/17 12:47, Nikita Yushchenko wrote:
> Hi
> 
>> The point here is that an IOMMU doesn't solve your issue, and the
>> IOMMU-backed DMA ops need the same treatment. In light of that, it really
>> feels to me like the DMA masks should be restricted in of_dma_configure
>> so that the parent mask is taken into account there, rather than hook
>> into each set of DMA ops to intercept set_dma_mask. We'd still need to
>> do something to stop dma_set_mask widening the mask if it was restricted
>> by of_dma_configure, but I think Robin (cc'd) was playing with that.
> 
> What issue "IOMMU doesn't solve"?
> 
> The issue I'm trying to address is an inconsistency within the swiotlb
> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
> mask is then used to decide whether bounce buffers are needed or not.
> This inconsistency causes the NVMe + R-Car combo to not work (and to
> corrupt memory instead).

The fundamental underlying problem is the "any wide mask is silently
accepted" part, and that applies equally to IOMMU ops as well.

> I just can't figure out what similar issue the iommu case can have.
> Do you mean that in the iommu case, the mask also must not be set wider
> than its initial value? Why? What is the use of the mask in the iommu
> case? Is there any real case where the iommu can't address all memory
> existing in the system?

There's a very subtle misunderstanding there - the DMA mask does not
describe the memory a device can address, it describes the range of
addresses the device is capable of generating. Yes, in the non-IOMMU
case they are equivalent, but once you put an IOMMU in between, the
problem is merely shifted from "what range of physical addresses can
this device access" to "what range of IOVAs is valid to give to this
device" - the fact that those IOVAs can map to any underlying physical
address only obviates the need for any bouncing at the memory end; it
doesn't remove the fact that the device has a hardware addressing
limitation which needs to be accommodated.

The thread Will linked to describes that equivalent version of your
problem - the IOMMU gives the device 48-bit addresses which get
erroneously truncated because it doesn't know that only 42 bits are
actually wired up. That situation still requires the device's DMA mask
to correctly describe its addressing capability just as yours does.

> The NVMe maintainer has just stated that they expect
> set_dma_mask(DMA_BIT_MASK(64)) to always succeed, and are going to fail
> the driver probe if that call fails.  They claim that the architecture
> must always be able to dma_map() whatever memory exists in the system -
> via iommu or swiotlb or whatever. Their direction is to remove bounce
> buffers from block and other layers.

I have to say I'm in full agreement with that - having subsystems do
their own bouncing is generally redundant, although there are some
edge-cases like MMC_BLOCK_BOUNCE (which is more about aligning and
coalescing buffers than about working around DMA limitations per se).

> With this direction, the semantics of the dma mask become even more
> questionable. I'd say dma_mask is a candidate for removal (or for moving
> to swiotlb's or iommu's local area).

We still need a way for drivers to communicate a device's probed
addressing capability to SWIOTLB, so there's always going to have to be
*some* sort of public interface. Personally, the change in semantics I'd
like to see is to make dma_set_mask() only fail if DMA is entirely
disallowed - in the normal case it would always succeed, but the DMA API
implementation would be permitted to set a smaller mask than requested
(this is effectively what the x86 IOMMU ops do already). The significant
work that would require, though, is changing all the drivers currently
using this sort of pattern:

	if (!dma_set_mask(dev, DMA_BIT_MASK(64)))
		/* put device into 64-bit mode */
	else if (!dma_set_mask(dev, DMA_BIT_MASK(32)))
		/* put device into 32-bit mode */
	else
		/* error */

to something like this:

	if (dma_set_mask(dev, DMA_BIT_MASK(64)))
		/* error */
	if (dma_get_mask(dev) > DMA_BIT_MASK(32))
		/* put device into 64-bit mode */
	else
		/* put device into 32-bit mode */

Which would be a pretty major job.

Robin.
Arnd Bergmann Jan. 10, 2017, 1:42 p.m. UTC | #5
On Tuesday, January 10, 2017 1:25:12 PM CET Robin Murphy wrote:
> On 10/01/17 12:47, Nikita Yushchenko wrote:
> >> The point here is that an IOMMU doesn't solve your issue, and the
> >> IOMMU-backed DMA ops need the same treatment. In light of that, it really
> >> feels to me like the DMA masks should be restricted in of_dma_configure
> >> so that the parent mask is taken into account there, rather than hook
> >> into each set of DMA ops to intercept set_dma_mask. We'd still need to
> >> do something to stop dma_set_mask widening the mask if it was restricted
> >> by of_dma_configure, but I think Robin (cc'd) was playing with that.
> > 
> > What issue "IOMMU doesn't solve"?
> > 
> > The issue I'm trying to address is an inconsistency within the swiotlb
> > dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
> > mask is then used to decide whether bounce buffers are needed or not.
> > This inconsistency causes the NVMe + R-Car combo to not work (and to
> > corrupt memory instead).
> 
> The fundamental underlying problem is the "any wide mask is silently
> accepted" part, and that applies equally to IOMMU ops as well.

It's a much rarer problem for the IOMMU case though, because it only
impacts devices that are restricted to addressing of less than 32-bits.

If you have an IOMMU enabled, the dma-mapping interface does not care
if the device can do wider than 32 bit addressing, as it will never
hand out IOVAs above 0xffffffff.

> > I just can't figure out what similar issue the iommu case can have.
> > Do you mean that in the iommu case, the mask also must not be set wider
> > than its initial value? Why? What is the use of the mask in the iommu
> > case? Is there any real case where the iommu can't address all memory
> > existing in the system?
> 
> There's a very subtle misunderstanding there - the DMA mask does not
> describe the memory a device can address, it describes the range of
> addresses the device is capable of generating. Yes, in the non-IOMMU
> case they are equivalent, but once you put an IOMMU in between, the
> problem is merely shifted from "what range of physical addresses can
> this device access" to "what range of IOVAs is valid to give to this
> device" - the fact that those IOVAs can map to any underlying physical
> address only obviates the need for any bouncing at the memory end; it
> doesn't remove the fact that the device has a hardware addressing
> limitation which needs to be accommodated.
> 
> The thread Will linked to describes that equivalent version of your
> problem - the IOMMU gives the device 48-bit addresses which get
> erroneously truncated because it doesn't know that only 42 bits are
> actually wired up. That situation still requires the device's DMA mask
> to correctly describe its addressing capability just as yours does.

That problem should only impact virtual machines which have a guest
bus address space covering more than 42 bits of physical RAM, whereas
the problem we have with swiotlb is for the dma-mapping interface.

> > With this direction, the semantics of the dma mask become even more
> > questionable. I'd say dma_mask is a candidate for removal (or for moving
> > to swiotlb's or iommu's local area).
> 
> We still need a way for drivers to communicate a device's probed
> addressing capability to SWIOTLB, so there's always going to have to be
> *some* sort of public interface. Personally, the change in semantics I'd
> like to see is to make dma_set_mask() only fail if DMA is entirely
> disallowed - in the normal case it would always succeed, but the DMA API
> implementation would be permitted to set a smaller mask than requested
> (this is effectively what the x86 IOMMU ops do already).

With swiotlb enabled, it only needs to fail if the mask does not contain
the swiotlb bounce buffer area, either because the start of RAM is outside
of the mask, or the bounce area has been allocated at the end of ZONE_DMA
and the mask is smaller than ZONE_DMA.
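
(That is essentially what swiotlb_dma_supported() already checks, give
or take helper names across kernel versions:)

	int swiotlb_dma_supported(struct device *hwdev, u64 mask)
	{
		/* usable iff the whole bounce area sits below the mask */
		return phys_to_dma(hwdev, io_tlb_end - 1) <= mask;
	}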

	Arnd
Nikita Yushchenko Jan. 10, 2017, 2:01 p.m. UTC | #6
>> What issue "IOMMU doesn't solve"?
>>
>> The issue I'm trying to address is an inconsistency within the swiotlb
>> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
>> mask is then used to decide whether bounce buffers are needed or not.
>> This inconsistency causes the NVMe + R-Car combo to not work (and to
>> corrupt memory instead).
> 
> The fundamental underlying problem is the "any wide mask is silently
> accepted" part, and that applies equally to IOMMU ops as well.

Is the just-posted version better?

It should cover the iommu case as well.

Nikita
Robin Murphy Jan. 10, 2017, 2:16 p.m. UTC | #7
On 10/01/17 13:42, Arnd Bergmann wrote:
> On Tuesday, January 10, 2017 1:25:12 PM CET Robin Murphy wrote:
>> On 10/01/17 12:47, Nikita Yushchenko wrote:
>>>> The point here is that an IOMMU doesn't solve your issue, and the
>>>> IOMMU-backed DMA ops need the same treatment. In light of that, it really
>>>> feels to me like the DMA masks should be restricted in of_dma_configure
>>>> so that the parent mask is taken into account there, rather than hook
>>>> into each set of DMA ops to intercept set_dma_mask. We'd still need to
>>>> do something to stop dma_set_mask widening the mask if it was restricted
>>>> by of_dma_configure, but I think Robin (cc'd) was playing with that.
>>>
>>> What issue "IOMMU doesn't solve"?
>>>
>>> The issue I'm trying to address is an inconsistency within the swiotlb
>>> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
>>> mask is then used to decide whether bounce buffers are needed or not.
>>> This inconsistency causes the NVMe + R-Car combo to not work (and to
>>> corrupt memory instead).
>>
>> The fundamental underlying problem is the "any wide mask is silently
>> accepted" part, and that applies equally to IOMMU ops as well.
> 
> It's a much rarer problem for the IOMMU case though, because it only
> impacts devices that are restricted to addressing of less than 32-bits.
> 
> If you have an IOMMU enabled, the dma-mapping interface does not care
> if the device can do wider than 32 bit addressing, as it will never
> hand out IOVAs above 0xffffffff.

I can assure you that it will - we constrain allocations to the
intersection of the IOMMU domain aperture (normally the IOMMU's physical
input address width) and the given device's DMA mask. If both of those
are >32 bits then >32-bit IOVAs will fall out. For the arm64/common
implementation I have prototyped a copy of the x86 optimisation which
always first tries to get 32-bit IOVAs for PCI devices, but even then it
can start returning higher addresses if the 32-bit space fills up.
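
(A sketch of that optimisation - illustrative, not the final code:)

	/* serve PCI devices from the 32-bit space first, like x86... */
	if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
		iova = alloc_iova(iovad, length,
				  DMA_BIT_MASK(32) >> shift, false);
	/* ...and only spill past 4GB once that space is exhausted */
	if (!iova)
		iova = alloc_iova(iovad, length, dma_limit >> shift, true);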

>>> I just can't figure out what similar issue the iommu case can have.
>>> Do you mean that in the iommu case, the mask also must not be set wider
>>> than its initial value? Why? What is the use of the mask in the iommu
>>> case? Is there any real case where the iommu can't address all memory
>>> existing in the system?
>>
>> There's a very subtle misunderstanding there - the DMA mask does not
>> describe the memory a device can address, it describes the range of
>> addresses the device is capable of generating. Yes, in the non-IOMMU
>> case they are equivalent, but once you put an IOMMU in between, the
>> problem is merely shifted from "what range of physical addresses can
>> this device access" to "what range of IOVAs is valid to give to this
>> device" - the fact that those IOVAs can map to any underlying physical
>> address only obviates the need for any bouncing at the memory end; it
>> doesn't remove the fact that the device has a hardware addressing
>> limitation which needs to be accommodated.
>>
>> The thread Will linked to describes that equivalent version of your
>> problem - the IOMMU gives the device 48-bit addresses which get
>> erroneously truncated because it doesn't know that only 42 bits are
>> actually wired up. That situation still requires the device's DMA mask
>> to correctly describe its addressing capability just as yours does.
> 
> That problem should only impact virtual machines which have a guest
> bus address space covering more than 42 bits of physical RAM, whereas
> the problem we have with swiotlb is for the dma-mapping interface.

As above, it impacts DMA API use for anything whose addressing
capability is narrower than the IOMMU's reported input size and whose
driver is able to blindly set a too-big DMA mask. It just happens to be
the case that the stars line up on most systems, and for 32-bit devices
that keep the default DMA mask.

I actually have a third variation of this problem involving a PCI root
complex which *could* drive full-width (40-bit) addresses, but won't,
due to the way its PCI<->AXI interface is programmed. That would require
even more complicated dma-ranges handling to describe the windows of
valid physical addresses which it *will* pass, so I'm not pressing the
issue - let's just get the basic DMA mask case fixed first.

>>> With this direction, the semantics of the dma mask become even more
>>> questionable. I'd say dma_mask is a candidate for removal (or for moving
>>> to swiotlb's or iommu's local area).
>>
>> We still need a way for drivers to communicate a device's probed
>> addressing capability to SWIOTLB, so there's always going to have to be
>> *some* sort of public interface. Personally, the change in semantics I'd
>> like to see is to make dma_set_mask() only fail if DMA is entirely
>> disallowed - in the normal case it would always succeed, but the DMA API
>> implementation would be permitted to set a smaller mask than requested
>> (this is effectively what the x86 IOMMU ops do already).
> 
> With swiotlb enabled, it only needs to fail if the mask does not contain
> the swiotlb bounce buffer area, either because the start of RAM is outside
> of the mask, or the bounce area has been allocated at the end of ZONE_DMA
> and the mask is smaller than ZONE_DMA.

Agreed, I'd managed to overlook that specific case, but I'd be inclined
to consider "impossible" a subset of "disallowed" still :)

Robin.
Christoph Hellwig Jan. 10, 2017, 2:51 p.m. UTC | #8
On Tue, Jan 10, 2017 at 03:47:25PM +0300, Nikita Yushchenko wrote:
> With this direction, the semantics of the dma mask become even more
> questionable. I'd say dma_mask is a candidate for removal (or for moving
> to swiotlb's or iommu's local area).

We need the dma mask so that the driver can advertise what addresses
the device supports.  Many old devices only support 32-bit DMA addressing,
and some less common ones just 24-bit or other weird ones.
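
(A sketch of the classic case - e.g. an ISA-era device with only 24
address lines, which must never be handed anything above 16MB:)

	if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(24)))
		dev_warn(dev, "no suitable DMA available\n");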
Christoph Hellwig Jan. 10, 2017, 2:57 p.m. UTC | #9
On Tue, Jan 10, 2017 at 01:25:12PM +0000, Robin Murphy wrote:
> We still need a way for drivers to communicate a device's probed
> addressing capability to SWIOTLB, so there's always going to have to be
> *some* sort of public interface. Personally, the change in semantics I'd
> like to see is to make dma_set_mask() only fail if DMA is entirely
> disallowed - in the normal case it would always succeed, but the DMA API
> implementation would be permitted to set a smaller mask than requested
> (this is effectively what the x86 IOMMU ops do already).

Yes, this sounds reasonable.

> The significant
> work that would require, though, is changing all the drivers currently
> using this sort of pattern:
> 
> 	if (!dma_set_mask(dev, DMA_BIT_MASK(64)))
> 		/* put device into 64-bit mode */
> 	else if (!dma_set_mask(dev, DMA_BIT_MASK(32)))
> 		/* put device into 32-bit mode */
> 	else
> 		/* error */

While we have this pattern in a lot of places it's already rather
pointless on most architectures as the first dma_set_mask call
won't ever fail for the most common dma_ops implementations.

> to something like this:
> 
> 	if (dma_set_mask(dev, DMA_BIT_MASK(64)))
> 		/* error */
> 	if (dma_get_mask(dev) > DMA_BIT_MASK(32))
> 		/* put device into 64-bit mode */
> 	else
> 		/* put device into 32-bit mode */
> 
> Which would be a pretty major job.

I don't think it's too bad.  Also for many modern devices there is no
need to put the device into a specific mode.  It's mostly a historic
issue from the PCI/PCI-X days with the less efficient DAC addressing
scheme.
Christoph Hellwig Jan. 10, 2017, 2:59 p.m. UTC | #10
On Tue, Jan 10, 2017 at 02:42:23PM +0100, Arnd Bergmann wrote:
> It's a much rarer problem for the IOMMU case though, because it only
> impacts devices that are restricted to addressing of less than 32-bits.
> 
> If you have an IOMMU enabled, the dma-mapping interface does not care
> if the device can do wider than 32 bit addressing, as it will never
> hand out IOVAs above 0xffffffff.

That's absolutely not the case.  IOMMUs can and do generate addresses
larger than 32-bit.  Also various platforms have modes where an IOMMU
can be used when <= 32-bit addresses are used and bypassed if full 64-bit
addressing is supported and I/O isolation is not explicitly requested.
Arnd Bergmann Jan. 10, 2017, 3:06 p.m. UTC | #11
On Tuesday, January 10, 2017 2:16:57 PM CET Robin Murphy wrote:
> On 10/01/17 13:42, Arnd Bergmann wrote:
> > On Tuesday, January 10, 2017 1:25:12 PM CET Robin Murphy wrote:
> >> On 10/01/17 12:47, Nikita Yushchenko wrote:
> >>>> The point here is that an IOMMU doesn't solve your issue, and the
> >>>> IOMMU-backed DMA ops need the same treatment. In light of that, it really
> >>>> feels to me like the DMA masks should be restricted in of_dma_configure
> >>>> so that the parent mask is taken into account there, rather than hook
> >>>> into each set of DMA ops to intercept set_dma_mask. We'd still need to
> >>>> do something to stop dma_set_mask widening the mask if it was restricted
> >>>> by of_dma_configure, but I think Robin (cc'd) was playing with that.
> >>>
> >>> What issue "IOMMU doesn't solve"?
> >>>
> >>> The issue I'm trying to address is an inconsistency within the swiotlb
> >>> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
> >>> mask is then used to decide whether bounce buffers are needed or not.
> >>> This inconsistency causes the NVMe + R-Car combo to not work (and to
> >>> corrupt memory instead).
> >>
> >> The fundamental underlying problem is the "any wide mask is silently
> >> accepted" part, and that applies equally to IOMMU ops as well.
> > 
> > It's a much rarer problem for the IOMMU case though, because it only
> > impacts devices that are restricted to addressing of less than 32-bits.
> > 
> > If you have an IOMMU enabled, the dma-mapping interface does not care
> > if the device can do wider than 32 bit addressing, as it will never
> > hand out IOVAs above 0xffffffff.
> 
> I can assure you that it will - we constrain allocations to the
> intersection of the IOMMU domain aperture (normally the IOMMU's physical
> input address width) and the given device's DMA mask. If both of those
> are >32 bits then >32-bit IOVAs will fall out. For the arm64/common
> implementation I have prototyped a copy of the x86 optimisation which
> always first tries to get 32-bit IOVAs for PCI devices, but even then it
> can start returning higher addresses if the 32-bit space fills up.

Ok, got it. I have to admit that most of my knowledge about the internals
of IOMMUs is from PowerPC of a few years ago, which couldn't do this at
all. I agree that we need to do the same thing on swiotlb and iommu then.

> >> The thread Will linked to describes that equivalent version of your
> >> problem - the IOMMU gives the device 48-bit addresses which get
> >> erroneously truncated because it doesn't know that only 42 bits are
> >> actually wired up. That situation still requires the device's DMA mask
> >> to correctly describe its addressing capability just as yours does.
> > 
> > That problem should only impact virtual machines which have a guest
> > bus address space covering more than 42 bits of physical RAM, whereas
> > the problem we have with swiotlb is for the dma-mapping interface.
> > 
> I actually have a third variation of this problem involving a PCI root
> complex which *could* drive full-width (40-bit) addresses, but won't,
> due to the way its PCI<->AXI interface is programmed. That would require
> even more complicated dma-ranges handling to describe the windows of
> valid physical addresses which it *will* pass, so I'm not pressing the
> issue - let's just get the basic DMA mask case fixed first.

Can you describe this a little more? We should at least try to not
make it harder to solve the next problem while solving this one,
so I'd like to understand the exact limitation you are hitting there.

	Arnd
Nikita Yushchenko Jan. 11, 2017, 12:37 p.m. UTC | #12
> I actually have a third variation of this problem involving a PCI root
> complex which *could* drive full-width (40-bit) addresses, but won't,
> due to the way its PCI<->AXI interface is programmed. That would require
> even more complicated dma-ranges handling to describe the windows of
> valid physical addresses which it *will* pass, so I'm not pressing the
> issue - let's just get the basic DMA mask case fixed first.

R-Car + NVMe is actually not a "basic case".

It has a PCI<->AXI interface involved.
PCI addresses are 64-bit, and the controller does handle 64-bit addresses
there. The mapping between PCI addresses and AXI addresses is defined. But
AXI is 32-bit.

The SoC has an iommu that could probably be used between the PCIe module
and RAM, although AFAIK nobody has made that work yet.

The board I work with has 4G of RAM, in 4 banks located at different
parts of the wide address space, and only one of them is below 4G. But if
the iommu is capable of translating addresses such that the 4 gigabyte
banks map to the first 4 gigabytes of address space, then all memory will
become available for DMA from the PCIe device.
Arnd Bergmann Jan. 11, 2017, 4:21 p.m. UTC | #13
On Wednesday, January 11, 2017 3:37:22 PM CET Nikita Yushchenko wrote:
> > I actually have a third variation of this problem involving a PCI root
> > complex which *could* drive full-width (40-bit) addresses, but won't,
> > due to the way its PCI<->AXI interface is programmed. That would require
> > even more complicated dma-ranges handling to describe the windows of
> > valid physical addresses which it *will* pass, so I'm not pressing the
> > issue - let's just get the basic DMA mask case fixed first.
> 
> R-Car + NVMe is actually not a "basic case".
> 
> It has a PCI<->AXI interface involved.
> PCI addresses are 64-bit, and the controller does handle 64-bit addresses
> there. The mapping between PCI addresses and AXI addresses is defined. But
> AXI is 32-bit.
> 
> The SoC has an iommu that could probably be used between the PCIe module
> and RAM, although AFAIK nobody has made that work yet.
> 
> The board I work with has 4G of RAM, in 4 banks located at different
> parts of the wide address space, and only one of them is below 4G. But if
> the iommu is capable of translating addresses such that the 4 gigabyte
> banks map to the first 4 gigabytes of address space, then all memory will
> become available for DMA from the PCIe device.

You can in theory handle this by defining your own platform specific
dma_map_ops, as we used to do in the old days. Unfortunately, the modern
way of using the generic IOVA allocation can't really handle it, so it's
unclear whether the work that would be necessary to support it (and the
long-term maintenance cost) outweigh the benefits.

The more likely option here is to try harder to get the IOMMU working
(or show that it's impossible but make sure the next chip gets it right).

	Arnd
Robin Murphy Jan. 11, 2017, 6:28 p.m. UTC | #14
On 11/01/17 12:37, Nikita Yushchenko wrote:
>> I actually have a third variation of this problem involving a PCI root
>> complex which *could* drive full-width (40-bit) addresses, but won't,
>> due to the way its PCI<->AXI interface is programmed. That would require
>> even more complicated dma-ranges handling to describe the windows of
>> valid physical addresses which it *will* pass, so I'm not pressing the
>> issue - let's just get the basic DMA mask case fixed first.
> 
> R-Car + NVMe is actually not a "basic case".

I meant "basic" in terms of what needs to be done in Linux - simply
preventing device drivers from overwriting the DT-configured DMA mask
will make everything work as well as it possibly can on R-Car,
with or without the IOMMU, since apparently all you need is to
ensure a PCI device never gets given a DMA address above 4GB. The
situation where PCI devices *can* DMA to all of physical memory, but
can't use arbitrary addresses *outside* it - which only becomes a
problem with an IOMMU - is considerably trickier.

> It has a PCI<->AXI interface involved.
> PCI addresses are 64-bit, and the controller does handle 64-bit addresses
> there. The mapping between PCI addresses and AXI addresses is defined. But
> AXI is 32-bit.
> 
> The SoC has an iommu that could probably be used between the PCIe module
> and RAM, although AFAIK nobody has made that work yet.
> 
> The board I work with has 4G of RAM, in 4 banks located at different
> parts of the wide address space, and only one of them is below 4G. But if
> the iommu is capable of translating addresses such that the 4 gigabyte
> banks map to the first 4 gigabytes of address space, then all memory will
> become available for DMA from the PCIe device.

The aforementioned situation on Juno is similar yet different - the PLDA
XR3 root complex uses an address-based lookup table to translate
outgoing PCI memory space transactions to AXI bus addresses with the
appropriate attributes, in power-of-two-sized regions. The firmware
configures 3 LUT entries - 2GB at 0x8000_0000 and 8GB at 0x8_8000_0000
with cache-coherent attributes to cover the DRAM areas, plus a small one
with device attributes covering the GICv2m MSI frame. The issue is that
there is no "no match" translation, so any transaction not within one of
those regions never makes it out of the root complex at all.

That's fine in normal operation, as there's nothing outside those
regions in the physical memory map a PCI device should be accessing
anyway, but turning on the SMMU is another matter - since the IOVA
allocator runs top-down, a PCI device with a 64-bit DMA mask will do a
dma_map or dma_alloc, get the physical page mapped to an IOVA up around
FF_FFFF_F000 (the SMMU will constrain things to the system bus width of
40 bits), then try to access that address and get a termination straight
back from the RC. Similarly, a KVM guest which wants to place its memory
at arbitrary locations and expect device passthrough to work is going to
have a bad time.

I don't know if it's feasible to have the firmware set the LUT up
differently, as that might lead to other problems when not using the
SMMU, and/or just require far more than the 8 available LUT entries
(assuming they have to be non-overlapping - I'm not 100% sure and
documentation is sparse). Thus it seems appropriate to describe the
currently valid PCI-AXI translations with dma-ranges, but then we'd have
multiple entries - last time I looked Linux simply ignores all but the
last one in that case - which can't be combined into a simple bitmask,
so I'm not entirely sure where to go from there. Especially as so far it
seems to be a problem exclusive to one not-widely-available ageing
early-access development platform...

It happens that limiting all PCI DMA masks to 32 bits would bodge around
this problem thanks to the current IOVA allocator behaviour, but that's
pretty yuck, and would force unnecessary bouncing for the non-SMMU case.
My other hack to carve up IOVA domains to reserve all addresses not
matching memblocks is hardly any more realistic, hence why the SMMU is
in the Juno DT in a change-it-at-your-own-peril "disabled" state ;)

Robin.

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1117421..afb2c08 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -216,6 +216,9 @@  config NEED_DMA_MAP_STATE
 config NEED_SG_DMA_LENGTH
 	def_bool y
 
+config ARCH_HAS_DMA_SET_COHERENT_MASK
+	def_bool y
+
 config SMP
 	def_bool y
 
diff --git a/arch/arm64/include/asm/device.h b/arch/arm64/include/asm/device.h
index 243ef25..a57e7bb 100644
--- a/arch/arm64/include/asm/device.h
+++ b/arch/arm64/include/asm/device.h
@@ -22,6 +22,7 @@  struct dev_archdata {
 	void *iommu;			/* private IOMMU data */
 #endif
 	bool dma_coherent;
+	u64 parent_dma_mask;
 };
 
 struct pdev_archdata {
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index e040827..5ab15ce 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -352,6 +352,30 @@  static int __swiotlb_dma_supported(struct device *hwdev, u64 mask)
 	return 1;
 }
 
+static int __swiotlb_set_dma_mask(struct device *dev, u64 mask)
+{
+	/* device is not DMA capable */
+	if (!dev->dma_mask)
+		return -EIO;
+
+	/* mask is below swiotlb bounce buffer, so fail */
+	if (!swiotlb_dma_supported(dev, mask))
+		return -EIO;
+
+	/*
+	 * because of the swiotlb, we can return success for
+	 * larger masks, but need to ensure that bounce buffers
+	 * are used above parent_dma_mask, so set that as
+	 * the effective mask.
+	 */
+	if (mask > dev->archdata.parent_dma_mask)
+		mask = dev->archdata.parent_dma_mask;
+
+	*dev->dma_mask = mask;
+
+	return 0;
+}
+
 static struct dma_map_ops swiotlb_dma_ops = {
 	.alloc = __dma_alloc,
 	.free = __dma_free,
@@ -367,8 +391,23 @@  static struct dma_map_ops swiotlb_dma_ops = {
 	.sync_sg_for_device = __swiotlb_sync_sg_for_device,
 	.dma_supported = __swiotlb_dma_supported,
 	.mapping_error = swiotlb_dma_mapping_error,
+	.set_dma_mask = __swiotlb_set_dma_mask,
 };
 
+int dma_set_coherent_mask(struct device *dev, u64 mask)
+{
+	if (!dma_supported(dev, mask))
+		return -EIO;
+
+	if (get_dma_ops(dev) == &swiotlb_dma_ops &&
+	    mask > dev->archdata.parent_dma_mask)
+		mask = dev->archdata.parent_dma_mask;
+
+	dev->coherent_dma_mask = mask;
+	return 0;
+}
+EXPORT_SYMBOL(dma_set_coherent_mask);
+
 static int __init atomic_pool_init(void)
 {
 	pgprot_t prot = __pgprot(PROT_NORMAL_NC);
@@ -958,6 +997,18 @@  void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 	if (!dev->archdata.dma_ops)
 		dev->archdata.dma_ops = &swiotlb_dma_ops;
 
+	/*
+	 * we don't yet support buses that have a non-zero mapping.
+	 *  Let's hope we won't need it
+	 */
+	WARN_ON(dma_base != 0);
+
+	/*
+	 * Whatever the parent bus can set. A device must not set
+	 * a DMA mask larger than this.
+	 */
+	dev->archdata.parent_dma_mask = size - 1;
+
 	dev->archdata.dma_coherent = coherent;
 	__iommu_setup_dma_ops(dev, dma_base, size, iommu);
 }