
For the problem when using swiotlb

Message ID 20141121180925.GG19783@e104818-lin.cambridge.arm.com (mailing list archive)
State New, archived

Commit Message

Catalin Marinas Nov. 21, 2014, 6:09 p.m. UTC
On Fri, Nov 21, 2014 at 05:51:19PM +0000, Catalin Marinas wrote:
> On Fri, Nov 21, 2014 at 05:04:28PM +0000, Arnd Bergmann wrote:
> > On Friday 21 November 2014 16:57:09 Catalin Marinas wrote:
> > > There is a scenario where smaller mask would work on arm64. For example
> > > Juno, you can have 2GB of RAM in the 32-bit phys range (starting at
> > > 0x80000000). A device with 31-bit mask and a dma_pfn_offset of
> > > 0x80000000 would still work (there isn't any but just as an example). So
> > > the check in dma_alloc_coherent() would be something like:
> > > 
> > > 	phys_to_dma(top of ZONE_DMA) - dma_pfn_offset <= coherent_dma_mask
> > > 
> > > (or assuming RAM starts at 0 and ignoring dma_pfn_offset for now)
> > > 
> > > If the condition above fails, dma_alloc_coherent() would no longer fall
> > > back to swiotlb but issue a dev_warn() and return NULL.
> > 
> > Ah, that looks like it should work on all architectures, very nice.
> > How about checking this condition, and then printing a small warning
> > (dev_warn, not WARN_ON) and setting the dma_mask pointer to NULL?
> 
> I would not add the above ZONE_DMA check to of_dma_configure(). For
> example on arm64, we may not support a small coherent_dma_mask but the
> same value for dma_mask could be fine via swiotlb bouncing (or IOMMU).
> However, that's an arch-specific decision. Maybe after the above setting
> of dev->coherent_dma_mask in of_dma_configure(), we could add:
> 
> 	if (!dma_supported(dev, dev->coherent_dma_mask))
> 		dev->dma_mask = NULL;
> 
> The arch dma_supported() can check the swiotlb bouncing or ZONE_DMA
> limits.

More precisely, that's what I meant:

Comments

Arnd Bergmann Nov. 24, 2014, 8:12 p.m. UTC | #1
On Friday 21 November 2014 18:09:25 Catalin Marinas wrote:
> On Fri, Nov 21, 2014 at 05:51:19PM +0000, Catalin Marinas wrote:
> > On Fri, Nov 21, 2014 at 05:04:28PM +0000, Arnd Bergmann wrote:
> > > On Friday 21 November 2014 16:57:09 Catalin Marinas wrote:
> > > > There is a scenario where smaller mask would work on arm64. For example
> > > > Juno, you can have 2GB of RAM in the 32-bit phys range (starting at
> > > > 0x80000000). A device with 31-bit mask and a dma_pfn_offset of
> > > > 0x80000000 would still work (there isn't any but just as an example). So
> > > > the check in dma_alloc_coherent() would be something like:
> > > > 
> > > > 	phys_to_dma(top of ZONE_DMA) - dma_pfn_offset <= coherent_dma_mask
> > > > 
> > > > (or assuming RAM starts at 0 and ignoring dma_pfn_offset for now)
> > > > 
> > > > If the condition above fails, dma_alloc_coherent() would no longer fall
> > > > back to swiotlb but issue a dev_warn() and return NULL.
> > > 
> > > Ah, that looks like it should work on all architectures, very nice.
> > > How about checking this condition, and then printing a small warning
> > > (dev_warn, not WARN_ON) and setting the dma_mask pointer to NULL?
> > 
> > I would not add the above ZONE_DMA check to of_dma_configure(). For
> > example on arm64, we may not support a small coherent_dma_mask but the
> > same value for dma_mask could be fine via swiotlb bouncing (or IOMMU).
> > However, that's an arch-specific decision. Maybe after the above setting
> > of dev->coherent_dma_mask in of_dma_configure(), we could add:
> > 

You seem to implement the opposite:

> +	/*
> +	 * If the bus dma-ranges property specifies a size smaller than 4GB,
> +	 * the device would not be capable of accessing the whole 32-bit
> +	 * space, so reduce the default coherent_dma_mask accordingly.
> +	 */
> +	if (size && size < (1ULL << 32))
> +		dev->coherent_dma_mask = DMA_BIT_MASK(ilog2(size));
> +
> +	/*
> +	 * Set dma_mask to coherent_dma_mask by default if the architecture
> +	 * code has not set it and DMA on such mask is supported.
> +	 */
> +	if (!dev->dma_mask && dma_supported(dev, dev->coherent_dma_mask))
> +		dev->dma_mask = &dev->coherent_dma_mask;
>  }
>  

Here, coherent_dma_mask wouldn't work while dma_mask might be
fine in case of swiotlb, but you set a nonzero coherent_dma_mask
and an invalid dma_mask.

	Arnd
Catalin Marinas Nov. 25, 2014, 10:58 a.m. UTC | #2
On Mon, Nov 24, 2014 at 08:12:09PM +0000, Arnd Bergmann wrote:
> On Friday 21 November 2014 18:09:25 Catalin Marinas wrote:
> > On Fri, Nov 21, 2014 at 05:51:19PM +0000, Catalin Marinas wrote:
> > > On Fri, Nov 21, 2014 at 05:04:28PM +0000, Arnd Bergmann wrote:
> > > > On Friday 21 November 2014 16:57:09 Catalin Marinas wrote:
> > > > > There is a scenario where smaller mask would work on arm64. For example
> > > > > Juno, you can have 2GB of RAM in the 32-bit phys range (starting at
> > > > > 0x80000000). A device with 31-bit mask and a dma_pfn_offset of
> > > > > 0x80000000 would still work (there isn't any but just as an example). So
> > > > > the check in dma_alloc_coherent() would be something like:
> > > > > 
> > > > > 	phys_to_dma(top of ZONE_DMA) - dma_pfn_offset <= coherent_dma_mask
> > > > > 
> > > > > (or assuming RAM starts at 0 and ignoring dma_pfn_offset for now)
> > > > > 
> > > > > If the condition above fails, dma_alloc_coherent() would no longer fall
> > > > > back to swiotlb but issue a dev_warn() and return NULL.
> > > > 
> > > > Ah, that looks like it should work on all architectures, very nice.
> > > > How about checking this condition, and then printing a small warning
> > > > (dev_warn, not WARN_ON) and setting the dma_mask pointer to NULL?
> > > 
> > > I would not add the above ZONE_DMA check to of_dma_configure(). For
> > > example on arm64, we may not support a small coherent_dma_mask but the
> > > same value for dma_mask could be fine via swiotlb bouncing (or IOMMU).
> > > However, that's an arch-specific decision. Maybe after the above setting
> > > of dev->coherent_dma_mask in of_dma_configure(), we could add:
> 
> You seem to implement the opposite:

Possibly, but I had something else in mind.

> > +	/*
> > +	 * If the bus dma-ranges property specifies a size smaller than 4GB,
> > +	 * the device would not be capable of accessing the whole 32-bit
> > +	 * space, so reduce the default coherent_dma_mask accordingly.
> > +	 */
> > +	if (size && size < (1ULL << 32))
> > +		dev->coherent_dma_mask = DMA_BIT_MASK(ilog2(size));
> > +
> > +	/*
> > +	 * Set dma_mask to coherent_dma_mask by default if the architecture
> > +	 * code has not set it and DMA on such mask is supported.
> > +	 */
> > +	if (!dev->dma_mask && dma_supported(dev, dev->coherent_dma_mask))
> > +		dev->dma_mask = &dev->coherent_dma_mask;
> >  }
> 
> Here, coherent_dma_mask wouldn't work while dma_mask might be
> fine in case of swiotlb, but you set a nonzero coherent_dma_mask
> and an invalid dma_mask.

My assumption is that dma_supported() only checks the validity of
dma_mask (not the coherent one). On arm64 it is currently routed to
swiotlb_dma_supported() which returns true if the swiotlb bounce buffer
is within that mask. So if the coherent_dma_mask is enough for swiotlb
bounce buffer, we point dma_mask to it. Otherwise, there is no way to do
streaming DMA to such mask, hence setting it to NULL.

Since we don't have a coherent_dma_supported() function, we defer the
validity check of coherent_dma_mask to dma_alloc_coherent() (and here we
can remove bouncing since swiotlb has relatively small buffers).

There is a slight downside: if dma_supported() on the default 32-bit mask
fails, we end up with dma_mask == NULL, and a driver calling
dma_set_mask_and_coherent(64-bit) would then fail to set dma_mask even
though dma-ranges allows it. Is this a real scenario (not for arm64,
where we allow DMA into the 32-bit range via ZONE_DMA)?
Russell King - ARM Linux Nov. 25, 2014, 11:29 a.m. UTC | #3
On Tue, Nov 25, 2014 at 10:58:15AM +0000, Catalin Marinas wrote:
> Since we don't have a coherent_dma_supported() function, we defer the
> validity check of coherent_dma_mask to dma_alloc_coherent() (and here we
> can remove bouncing since swiotlb has relatively small buffers).

Bouncing of coherent DMA buffers is insane; if you have to bounce them,
they're by definition not coherent.

Think about one of the common uses of coherent DMA buffers: ring buffers,
where both the CPU and the DMA agent write to the ring:

- CPU writes to ring, loading address and length, then writing to the
  status word for the ring entry.
- DMA reads the ring status word, sees it owns the entry, processes it,
  DMA writes to the ring status word to give it back.

What this means is that if you are bouncing the buffer, you are copying
it whole-sale between the CPU visible version and the DMA visible
version, which means that you can miss DMA updates to it.  So, bouncing
a coherent DMA buffer is simply not an acceptable implementation for
dma_alloc_coherent().

As for validity of masks, it is defined in the DMA API documentation that
if a DMA mask is suitable for the streaming APIs, then it is also suitable
for the coherent APIs.  The reverse is left open, and so may not
necessarily be true.

In other words:

	err = dma_set_mask(dev, mask);
	if (err == 0)
		assert(dma_set_coherent_mask(dev, mask) == 0);

must always succeed, but reversing the two calls has no guarantee.

Note that there seems to be only one driver which has different coherent
and streaming DMA masks today:

drivers/media/pci/sta2x11/sta2x11_vip.c:
        if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(26))) {
        err = dma_set_coherent_mask(&vip->pdev->dev, DMA_BIT_MASK(29));
Catalin Marinas Nov. 25, 2014, 12:23 p.m. UTC | #4
On Tue, Nov 25, 2014 at 11:29:05AM +0000, Russell King - ARM Linux wrote:
> On Tue, Nov 25, 2014 at 10:58:15AM +0000, Catalin Marinas wrote:
> > Since we don't have a coherent_dma_supported() function, we defer the
> > validity check of coherent_dma_mask to dma_alloc_coherent() (and here we
> > can remove bouncing since swiotlb has relatively small buffers).
> 
> Bouncing of coherent DMA buffers is insane; if you have to bounce them,
> they're by definition not coherent.

"Bouncing" is not a proper term here. What I meant is not using the
swiotlb bounce buffers for the dma_alloc_coherent(). The
swiotlb_alloc_coherent() allows this but it doesn't do any
copying/bouncing in such scenario (it would not make sense, as you said
below).

> Think about one of the common uses of coherent DMA buffers: ring buffers,
> where both the CPU and the DMA agent write to the ring:
> 
> - CPU writes to ring, loading address and length, then writing to the
>   status word for the ring entry.
> - DMA reads the ring status word, sees it owns the entry, processes it,
>   DMA writes to the ring status word to give it back.
> 
> What this means is that if you are bouncing the buffer, you are copying
> it whole-sale between the CPU visible version and the DMA visible
> version, which means that you can miss DMA updates to it.  So, bouncing
> a coherent DMA buffer is simply not an acceptable implementation for
> dma_alloc_coherent().

I agree, not arguing against this.

> As for validity of masks, it is defined in the DMA API documentation that
> if a DMA mask is suitable for the streaming APIs, then it is also suitable
> for the coherent APIs.  The reverse is left open, and so may not
> necessarily be true.
> 
> In other words:
> 
> 	err = dma_set_mask(dev, mask);
> 	if (err == 0)
> 		assert(dma_set_coherent_mask(dev, mask) == 0);
> 
> must always succeed, but reversing the two calls has no guarantee.

That would succeed on arm64 currently.

The problem is that swiotlb bounce buffers allow for smaller than 32-bit
dma_mask for streaming DMA since it can do bouncing. With coherent
allocations, we could allow the same smaller than 32-bit
coherent_dma_mask, however allocations may fail since they fall back to
the swiotlb reserved buffer which is relatively small.

What I want to avoid is threads like this where people ask for bigger
swiotlb buffers for coherent allocations most likely because they do the
wrong thing with their dma masks (or dma-ranges property). If the arm64
dma_alloc_coherent() code avoids calling swiotlb_alloc_coherent() and
just limits itself to ZONE_DMA, we make it clear that the swiotlb
buffers are only used for streaming DMA (bouncing).

Once we avoid swiotlb buffers for coherent DMA, the problem is that even
though your assertion above would pass, dma_alloc_coherent() may fail if
coherent_dma_mask is smaller than the end of ZONE_DMA (and pfn offsets
considered). Basically breaking the assumption that you mentioned above
with regards to coherent_dma_mask suitability.

I think the best we can get is for dma_supported() to ignore
swiotlb_dma_supported() and simply check for ZONE_DMA (similar to the
arm32 one). We guarantee that swiotlb bounce buffers are within ZONE_DMA
already, so no need for calling swiotlb_dma_supported(). If we ever have
devices with smaller than 32-bit mask on arm64, we will have to reduce
ZONE_DMA.

> Note that there seems to be only one driver which has different coherent
> and streaming DMA masks today:
> 
> drivers/media/pci/sta2x11/sta2x11_vip.c:
>         if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(26))) {
>         err = dma_set_coherent_mask(&vip->pdev->dev, DMA_BIT_MASK(29));

This would work if we have a ZONE_DMA of 29-bit while the swiotlb bounce
buffer within 26-bit. If we go with the above ZONE_DMA only check, we
would need ZONE_DMA of 26-bit.
Ding Tianhong Nov. 27, 2014, 2:36 a.m. UTC | #5
On 2014/11/25 20:23, Catalin Marinas wrote:
> On Tue, Nov 25, 2014 at 11:29:05AM +0000, Russell King - ARM Linux wrote:
>> On Tue, Nov 25, 2014 at 10:58:15AM +0000, Catalin Marinas wrote:
>>> Since we don't have a coherent_dma_supported() function, we defer the
>>> validity check of coherent_dma_mask to dma_alloc_coherent() (and here we
>>> can remove bouncing since swiotlb has relatively small buffers).
>>
>> Bouncing of coherent DMA buffers is insane; if you have to bounce them,
>> they're by definition not coherent.
> 
> "Bouncing" is not a proper term here. What I meant is not using the
> swiotlb bounce buffers for the dma_alloc_coherent(). The
> swiotlb_alloc_coherent() allows this but it doesn't do any
> copying/bouncing in such scenario (it would not make sense, as you said
> below).
> 

After this long discussion, I think the idea is clear: drivers should set
a correct dma_mask and coherent_dma_mask so that their buffers avoid
going through swiotlb, and my case is fixed this way.

On my aarch64 board, the coherent dma mask is always bigger than the
swiotlb bounce buffer limit, so dma_supported() makes no difference here,
but it may be useful for other boards. Thanks everyone.

Regards
Ding

>> Think about one of the common uses of coherent DMA buffers: ring buffers,
>> where both the CPU and the DMA agent write to the ring:
>>
>> - CPU writes to ring, loading address and length, then writing to the
>>   status word for the ring entry.
>> - DMA reads the ring status word, sees it owns the entry, processes it,
>>   DMA writes to the ring status word to give it back.
>>
>> What this means is that if you are bouncing the buffer, you are copying
>> it whole-sale between the CPU visible version and the DMA visible
>> version, which means that you can miss DMA updates to it.  So, bouncing
>> a coherent DMA buffer is simply not an acceptable implementation for
>> dma_alloc_coherent().
> 
> I agree, not arguing against this.
> 
>> As for validity of masks, it is defined in the DMA API documentation that
>> if a DMA mask is suitable for the streaming APIs, then it is also suitable
>> for the coherent APIs.  The reverse is left open, and so may not
>> necessarily be true.
>>
>> In other words:
>>
>> 	err = dma_set_mask(dev, mask);
>> 	if (err == 0)
>> 		assert(dma_set_coherent_mask(dev, mask) == 0);
>>
>> must always succeed, but reversing the two calls has no guarantee.
> 
> That would succeed on arm64 currently.
> 
> The problem is that swiotlb bounce buffers allow for smaller than 32-bit
> dma_mask for streaming DMA since it can do bouncing. With coherent
> allocations, we could allow the same smaller than 32-bit
> coherent_dma_mask, however allocations may fail since they fall back to
> the swiotlb reserved buffer which is relatively small.
> 
> What I want to avoid is threads like this where people ask for bigger
> swiotlb buffers for coherent allocations most likely because they do the
> wrong thing with their dma masks (or dma-ranges property). If the arm64
> dma_alloc_coherent() code avoids calling swiotlb_alloc_coherent() and
> just limits itself to ZONE_DMA, we make it clear that the swiotlb
> buffers are only used for streaming DMA (bouncing).
> 
> Once we avoid swiotlb buffers for coherent DMA, the problem is that even
> though your assertion above would pass, dma_alloc_coherent() may fail if
> coherent_dma_mask is smaller than the end of ZONE_DMA (and pfn offsets
> considered). Basically breaking the assumption that you mentioned above
> with regards to coherent_dma_mask suitability.
> 
> I think the best we can get is for dma_supported() to ignore
> swiotlb_dma_supported() and simply check for ZONE_DMA (similar to the
> arm32 one). We guarantee that swiotlb bounce buffers are within ZONE_DMA
> already, so no need for calling swiotlb_dma_supported(). If we ever have
> devices with smaller than 32-bit mask on arm64, we will have to reduce
> ZONE_DMA.
> 
>> Note that there seems to be only one driver which has different coherent
>> and streaming DMA masks today:
>>
>> drivers/media/pci/sta2x11/sta2x11_vip.c:
>>         if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(26))) {
>>         err = dma_set_coherent_mask(&vip->pdev->dev, DMA_BIT_MASK(29));
> 
> This would work if we have a ZONE_DMA of 29-bit while the swiotlb bounce
> buffer within 26-bit. If we go with the above ZONE_DMA only check, we
> would need ZONE_DMA of 26-bit.
>

Patch

diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index 3b64d0bf5bba..4fcdfed6a6df 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -172,13 +172,6 @@  static void of_dma_configure(struct device *dev)
 	dev->coherent_dma_mask = DMA_BIT_MASK(32);
 
 	/*
-	 * Set it to coherent_dma_mask by default if the architecture
-	 * code has not set it.
-	 */
-	if (!dev->dma_mask)
-		dev->dma_mask = &dev->coherent_dma_mask;
-
-	/*
 	 * if dma-coherent property exist, call arch hook to setup
 	 * dma coherent operations.
 	 */
@@ -188,18 +181,32 @@  static void of_dma_configure(struct device *dev)
 	}
 
 	/*
-	 * if dma-ranges property doesn't exist - just return else
-	 * setup the dma offset
+	 * If dma-ranges property exists - setup the dma offset.
 	 */
 	ret = of_dma_get_range(dev->of_node, &dma_addr, &paddr, &size);
 	if (ret < 0) {
 		dev_dbg(dev, "no dma range information to setup\n");
-		return;
+		size = 0;
+	} else {
+		/* DMA ranges found. Calculate and set dma_pfn_offset */
+		dev->dma_pfn_offset = PFN_DOWN(paddr - dma_addr);
+		dev_dbg(dev, "dma_pfn_offset(%#08lx)\n", dev->dma_pfn_offset);
 	}
 
-	/* DMA ranges found. Calculate and set dma_pfn_offset */
-	dev->dma_pfn_offset = PFN_DOWN(paddr - dma_addr);
-	dev_dbg(dev, "dma_pfn_offset(%#08lx)\n", dev->dma_pfn_offset);
+	/*
+	 * If the bus dma-ranges property specifies a size smaller than 4GB,
+	 * the device would not be capable of accessing the whole 32-bit
+	 * space, so reduce the default coherent_dma_mask accordingly.
+	 */
+	if (size && size < (1ULL << 32))
+		dev->coherent_dma_mask = DMA_BIT_MASK(ilog2(size));
+
+	/*
+	 * Set dma_mask to coherent_dma_mask by default if the architecture
+	 * code has not set it and DMA on such mask is supported.
+	 */
+	if (!dev->dma_mask && dma_supported(dev, dev->coherent_dma_mask))
+		dev->dma_mask = &dev->coherent_dma_mask;
 }
 
 /**