[RFC,v5,3/8] iommu: add a new capable IOMMU_CAP_MERGING

Message ID 1559733114-4221-4-git-send-email-yoshihiro.shimoda.uh@renesas.com (mailing list archive)
State New, archived
Series treewide: improve R-Car SDHI performance

Commit Message

Yoshihiro Shimoda June 5, 2019, 11:11 a.m. UTC
This patch adds a new capable IOMMU_CAP_MERGING to check whether
the IOVA would be contiguous strictly if a device requires and
the IOMMU driver has the capable.

Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
---
 drivers/iommu/dma-iommu.c | 26 ++++++++++++++++++++++++--
 include/linux/iommu.h     |  1 +
 2 files changed, 25 insertions(+), 2 deletions(-)

Comments

Robin Murphy June 5, 2019, 12:21 p.m. UTC | #1
On 05/06/2019 12:11, Yoshihiro Shimoda wrote:
> This patch adds a new capable IOMMU_CAP_MERGING to check whether
> the IOVA would be contiguous strictly if a device requires and
> the IOMMU driver has the capable.
> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> ---
>   drivers/iommu/dma-iommu.c | 26 ++++++++++++++++++++++++--
>   include/linux/iommu.h     |  1 +
>   2 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 034caae..ecf1a04 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -847,11 +847,16 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   	dma_addr_t iova;
>   	size_t iova_len = 0;
>   	unsigned long mask = dma_get_seg_boundary(dev);
> -	int i;
> +	int i, ret;
> +	bool iova_contiguous = false;
>   
>   	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
>   		iommu_dma_sync_sg_for_device(dev, sg, nents, dir);
>   
> +	if (dma_get_iova_contiguous(dev) &&
> +	    iommu_capable(dev->bus, IOMMU_CAP_MERGING))
> +		iova_contiguous = true;
> +
>   	/*
>   	 * Work out how much IOVA space we need, and align the segments to
>   	 * IOVA granules for the IOMMU driver to handle. With some clever
> @@ -867,6 +872,13 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   		sg_dma_len(s) = s_length;
>   		s->offset -= s_iova_off;
>   		s_length = iova_align(iovad, s_length + s_iova_off);
> +		/*
> +		 * Check whether the IOVA would be contiguous strictly if
> +		 * a device requires and the IOMMU driver has the capable.
> +		 */
> +		if (iova_contiguous && i > 0 &&
> +		    (s_iova_off || s->length != s_length))
> +			return 0;
>   		s->length = s_length;
>   
>   		/*
> @@ -902,8 +914,18 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   	if (iommu_map_sg(domain, iova, sg, nents, prot) < iova_len)
>   		goto out_free_iova;
>   
> -	return __finalise_sg(dev, sg, nents, iova);
> +	ret = __finalise_sg(dev, sg, nents, iova);
> +	/*
> +	 * Check whether the sg entry is single if a device requires and
> +	 * the IOMMU driver has the capable.
> +	 */
> +	if (iova_contiguous && ret != 1)
> +		goto out_unmap_sg;

I don't see that just failing really gives this option any value. 
Clearly the MMC driver has to do *something* to handle the failure (plus 
presumably the case of not having IOMMU DMA ops at all), which begs the 
question of why it couldn't just do whatever that is anyway, without all 
this infrastructure. For starters, it would be a far simpler and less 
invasive patch:

	if (dma_map_sg(...) > 1) {
		dma_unmap_sg(...);
		/* split into multiple requests and try again */
	}
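
For illustration, the same fallback with the real dma_map_sg() signature
filled in (a sketch only; it assumes a DMA_TO_DEVICE transfer, and the
split-and-retry path stays driver-specific):

	int count = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);

	if (count == 0) {
		/* mapping failed outright */
	} else if (count > 1) {
		dma_unmap_sg(dev, sgl, nents, DMA_TO_DEVICE);
		/* split into multiple requests and try again */
	}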

But then it would make even more sense to just have the driver be 
proactive about its special requirement in the first place, and simply 
validate the list before it even tries to map it:

	for_each_sg(sgl, sg, n, i)
		if ((i > 0 && sg->offset % PAGE_SIZE) ||
		    (i < n - 1 && sg->length % PAGE_SIZE))
			/* segment will not be mergeable */
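
Wrapped up as a self-contained helper (a sketch; sg_list_mergeable() is a
made-up name, not an existing API, and it needs <linux/scatterlist.h>):

	static bool sg_list_mergeable(struct scatterlist *sgl, int n)
	{
		struct scatterlist *sg;
		int i;

		for_each_sg(sgl, sg, n, i) {
			/* Every segment but the first must start page-aligned. */
			if (i > 0 && sg->offset % PAGE_SIZE)
				return false;
			/* Every segment but the last must end page-aligned. */
			if (i < n - 1 && sg->length % PAGE_SIZE)
				return false;
		}
		return true;
	}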

For reference, I think v4l2 and possibly some areas of DRM already do 
something vaguely similar to judge whether they get contiguous buffers 
or not.

> +
> +	return ret;
>   
> +out_unmap_sg:
> +	iommu_dma_unmap_sg(dev, sg, nents, dir, attrs);
>   out_free_iova:
>   	iommu_dma_free_iova(cookie, iova, iova_len);
>   out_restore_sg:
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 91af22a..f971dd3 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -104,6 +104,7 @@ enum iommu_cap {
>   					   transactions */
>   	IOMMU_CAP_INTR_REMAP,		/* IOMMU supports interrupt isolation */
>   	IOMMU_CAP_NOEXEC,		/* IOMMU_NOEXEC flag */
> +	IOMMU_CAP_MERGING,		/* IOMMU supports segments merging */

This isn't a 'capability' of the IOMMU - "segment merging" equates to 
just remapping pages, and there's already a fundamental assumption that 
IOMMUs are capable of that. Plus it's very much a DMA API concept, so 
hardly belongs in the IOMMU API anyway.

All in all, I'm struggling to see the point of this. Although it's not a 
DMA API guarantee, iommu-dma already merges scatterlists as aggressively 
as it is allowed to, and will continue to do so for the foreseeable 
future (since it avoids considerable complication in the IOVA 
allocation), so if you want to make sure iommu_dma_map_sg() merges an 
entire list, just don't give it a non-mergeable list. And if you still 
really really want dma_map_sg() to have a behaviour of "merge to a 
single segment or fail", then that should at least be a DMA API 
attribute, which could in principle be honoured by bounce-buffering 
implementations as well.
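
As a sketch of what such an attribute could look like
(DMA_ATTR_SINGLE_SEGMENT is purely hypothetical, named here only for
illustration):

	int count = dma_map_sg_attrs(dev, sgl, nents, dir,
				     DMA_ATTR_SINGLE_SEGMENT);
	if (count == 0)
		/* could not merge to a single segment: fall back or split */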

And if the problem is really that you're not getting merging because of 
exposing the wrong parameters to the DMA API and/or the block layer, or 
that you just can't quite express your requirement to the block layer in 
the first place, then that should really be tackled at the source rather 
than worked around further down in the stack.

Robin.
Christoph Hellwig June 5, 2019, 12:38 p.m. UTC | #2
On Wed, Jun 05, 2019 at 01:21:59PM +0100, Robin Murphy wrote:
> And if the problem is really that you're not getting merging because of 
> exposing the wrong parameters to the DMA API and/or the block layer, or 
> that you just can't quite express your requirement to the block layer in 
> the first place, then that should really be tackled at the source rather 
> than worked around further down in the stack.

The problem is that we need a way to communicate to the block layer
that more than a single segment is ok IFF the DMA API instance supports
merging.  And of course the answer will depend on further parameters
like the maximum merged segment size and alignment for the segment.

We'll need some way to communicate that, but I don't really think this
series is the way to go.

We'd really need something hanging off the device (or exposed through a
query API) that tells us under what circumstances the dma map ops
implementation can merge.  The driver can then communicate that to the
block layer so that the block layer doesn't split requests up when
reaching the segment limit.
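
The rough shape being described might look like this (purely illustrative;
neither dma_get_merge_params() nor struct dma_merge_params exists):

	struct dma_merge_params {
		unsigned long boundary;		/* merge alignment mask */
		unsigned int max_segment_size;	/* max merged segment */
	};

	/* Filled in by the dma_map_ops implementation, if it can merge. */
	bool dma_get_merge_params(struct device *dev,
				  struct dma_merge_params *p);

	/* Driver side: pass the parameters on to the block layer. */
	if (dma_get_merge_params(dev, &p)) {
		blk_queue_virt_boundary(q, p.boundary);
		blk_queue_max_segment_size(q, p.max_segment_size);
	}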
Sergei Shtylyov June 5, 2019, 4:17 p.m. UTC | #3
On 06/05/2019 02:11 PM, Yoshihiro Shimoda wrote:

> This patch adds a new capable IOMMU_CAP_MERGING to check whether
> the IOVA would be contiguous strictly if a device requires and
> the IOMMU driver has the capable.

   s/has/is/? Or capable what?

> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> ---
>  drivers/iommu/dma-iommu.c | 26 ++++++++++++++++++++++++--
>  include/linux/iommu.h     |  1 +
>  2 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 034caae..ecf1a04 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
[...]
> @@ -867,6 +872,13 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>  		sg_dma_len(s) = s_length;
>  		s->offset -= s_iova_off;
>  		s_length = iova_align(iovad, s_length + s_iova_off);
> +		/*
> +		 * Check whether the IOVA would be contiguous strictly if
> +		 * a device requires and the IOMMU driver has the capable.

   Same question here...

> +		 */
> +		if (iova_contiguous && i > 0 &&
> +		    (s_iova_off || s->length != s_length))
> +			return 0;
>  		s->length = s_length;
>  
>  		/*
> @@ -902,8 +914,18 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>  	if (iommu_map_sg(domain, iova, sg, nents, prot) < iova_len)
>  		goto out_free_iova;
>  
> -	return __finalise_sg(dev, sg, nents, iova);
> +	ret = __finalise_sg(dev, sg, nents, iova);
> +	/*
> +	 * Check whether the sg entry is single if a device requires and
> +	 * the IOMMU driver has the capable.

   You meant capability, perhaps?

[...]

MBR, Sergei
Yoshihiro Shimoda June 6, 2019, 5:53 a.m. UTC | #4
Hi Robin,

Thank you for your comments!

> From: Robin Murphy, Sent: Wednesday, June 5, 2019 9:22 PM
<snip>
> > @@ -902,8 +914,18 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >   	if (iommu_map_sg(domain, iova, sg, nents, prot) < iova_len)
> >   		goto out_free_iova;
> >
> > -	return __finalise_sg(dev, sg, nents, iova);
> > +	ret = __finalise_sg(dev, sg, nents, iova);
> > +	/*
> > +	 * Check whether the sg entry is single if a device requires and
> > +	 * the IOMMU driver has the capable.
> > +	 */
> > +	if (iova_contiguous && ret != 1)
> > +		goto out_unmap_sg;
> 
> I don't see that just failing really gives this option any value.
> Clearly the MMC driver has to do *something* to handle the failure (plus
> presumably the case of not having IOMMU DMA ops at all), which begs the
> question of why it couldn't just do whatever that is anyway, without all
> this infrastructure. For starters, it would be a far simpler and less
> invasive patch:
> 
> 	if (dma_map_sg(...) > 1) {
> 		dma_unmap_sg(...);
> 		/* split into multiple requests and try again */
> 	}

I understood it.

> But then it would make even more sense to just have the driver be
> proactive about its special requirement in the first place, and simply
> validate the list before it even tries to map it:
> 
> 	for_each_sg(sgl, sg, n, i)
> 		if ((i > 0 && sg->offset % PAGE_SIZE) ||
> 		    (i < n - 1 && sg->length % PAGE_SIZE))
> 			/* segment will not be mergeable */

In a previous version, I made such code [1].
But I think I misunderstood Christoph's comments [2] [3].

[1]
https://patchwork.kernel.org/patch/10970047/

[2]
https://marc.info/?l=linux-renesas-soc&m=155956751811689&w=2

[3]
https://marc.info/?l=linux-renesas-soc&m=155852814607202&w=2

> For reference, I think v4l2 and possibly some areas of DRM already do
> something vaguely similar to judge whether they get contiguous buffers
> or not.

I see. I'll check these areas later.

> > +
> > +	return ret;
> >
> > +out_unmap_sg:
> > +	iommu_dma_unmap_sg(dev, sg, nents, dir, attrs);
> >   out_free_iova:
> >   	iommu_dma_free_iova(cookie, iova, iova_len);
> >   out_restore_sg:
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 91af22a..f971dd3 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -104,6 +104,7 @@ enum iommu_cap {
> >   					   transactions */
> >   	IOMMU_CAP_INTR_REMAP,		/* IOMMU supports interrupt isolation */
> >   	IOMMU_CAP_NOEXEC,		/* IOMMU_NOEXEC flag */
> > +	IOMMU_CAP_MERGING,		/* IOMMU supports segments merging */
> 
> This isn't a 'capability' of the IOMMU - "segment merging" equates to
> just remapping pages, and there's already a fundamental assumption that
> IOMMUs are capable of that. Plus it's very much a DMA API concept, so
> hardly belongs in the IOMMU API anyway.

I got it.

> All in all, I'm struggling to see the point of this. Although it's not a
> DMA API guarantee, iommu-dma already merges scatterlists as aggressively
> as it is allowed to, and will continue to do so for the foreseeable
> future (since it avoids considerable complication in the IOVA
> allocation), so if you want to make sure iommu_dma_map_sg() merges an
> entire list, just don't give it a non-mergeable list.

Thank you for the explanation. I didn't know that a driver should not
give it a non-mergeable list.

> And if you still
> really really want dma_map_sg() to have a behaviour of "merge to a
> single segment or fail", then that should at least be a DMA API
> attribute, which could in principle be honoured by bounce-buffering
> implementations as well.

I got it. For this patch series, it seems I have to modify the block layer
so that such a new DMA API is not needed, though.

> And if the problem is really that you're not getting merging because of
> exposing the wrong parameters to the DMA API and/or the block layer, or
> that you just can't quite express your requirement to the block layer in
> the first place, then that should really be tackled at the source rather
> than worked around further down in the stack.

I'll reply on Christoph email about this topic later.

Best regards,
Yoshihiro Shimoda

> Robin.
Yoshihiro Shimoda June 6, 2019, 6:28 a.m. UTC | #5
Hi Christoph,

Thank you for your comments!

> From: Christoph Hellwig, Sent: Wednesday, June 5, 2019 9:38 PM
> 
> On Wed, Jun 05, 2019 at 01:21:59PM +0100, Robin Murphy wrote:
> > And if the problem is really that you're not getting merging because of
> > exposing the wrong parameters to the DMA API and/or the block layer, or
> > that you just can't quite express your requirement to the block layer in
> > the first place, then that should really be tackled at the source rather
> > than worked around further down in the stack.
> 
> The problem is that we need a way to communicate to the block layer
> that more than a single segment is ok IFF the DMA API instance supports
> merging.  And of course the answer will depend on further parameters
> like the maximum merged segment size and alignment for the segment.

I'm afraid I don't understand why we need a way to communicate to
the block layer that more than a single segment is ok IFF the DMA API
instance supports merging.

> We'll need some way to communicate that, but I don't really think this
> series is the way to go.

I should discard patches 1/8 through 4/8.

> We'd really need something hanging off the device (or exposed through a
> query API) that tells us under what circumstances the dma map ops
> implementation can merge.  The driver can then communicate that to the
> block layer so that the block layer doesn't split requests up when
> reaching the segment limit.

The block layer already has a "max_segment_size" limit for each device, so
regardless of whether it can merge the segments, we can use that limit.
Is my understanding incorrect?

Best regards,
Yoshihiro Shimoda
Christoph Hellwig June 6, 2019, 7 a.m. UTC | #6
On Thu, Jun 06, 2019 at 06:28:47AM +0000, Yoshihiro Shimoda wrote:
> > The problem is that we need a way to communicate to the block layer
> > that more than a single segment is ok IFF the DMA API instance supports
> > merging.  And of course the answer will depend on further parameters
> > like the maximum merged segment size and alignment for the segment.
> 
> I'm afraid I don't understand why we need a way to communicate to
> the block layer that more than a single segment is ok IFF the DMA API
> instance supports merging.

Assume a device (which I think is your case) that only supports a single
segment in hardware.  In that case we set max_segments to 1 if no
IOMMU is present.  But if we have a merge capable IOMMU we can set
max_segments to unlimited (or some software limit for scatterlist
allocation), as long as we set a virt_boundary matching what the IOMMU
expects, and max_sectors_kb isn't larger than the max IOMMU mapping
size.  Now we could probably just open code this in the driver, but
I'd feel much happier having a block layer helper like this:

bool blk_can_use_iommu_merging(struct request_queue *q, struct device *dev)
{
	if (!IOMMU_CAN_MERGE_SEGMENTS(dev))
		return false;

	blk_queue_virt_boundary(q, IOMMU_PAGE_SIZE(dev));
	blk_queue_max_segment_size(q, IOMMU_MAX_SEGMENT_SIZE(dev));
	return true;
}

and the driver then does:

	if (blk_can_use_iommu_merging(q, dev)) {
		blk_queue_max_segments(q, MAX_SW_SEGMENTS);
		// initialize sg mempool, etc..
	}


Where the SCREAMING pseudo code calls are something we need to find a
good API for.

And thinking about it, the backend doesn't need to be an iommu, swiotlb
could handle this as well, which might be interesting for devices
that need to bounce buffer anyway.  IIRC mmc actually has some code
to copy multiple segments into a bounce buffer somewhere.
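
One way the SCREAMING placeholders might be filled in, as a sketch: it
assumes iommu-dma is in use, takes the smallest IOMMU page granule as the
merge boundary, and blk_can_use_iommu_merging() itself remains hypothetical:

	#include <linux/blkdev.h>
	#include <linux/dma-mapping.h>
	#include <linux/iommu.h>

	bool blk_can_use_iommu_merging(struct request_queue *q,
				       struct device *dev)
	{
		struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
		unsigned long granule;

		/* No IOMMU translation means no guaranteed merging. */
		if (!device_iommu_mapped(dev) || !domain)
			return false;

		/* Smallest page size the IOMMU can map. */
		granule = 1UL << __ffs(domain->pgsize_bitmap);

		blk_queue_virt_boundary(q, granule - 1);
		blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));
		return true;
	}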

> The block layer already has a "max_segment_size" limit for each device, so
> regardless of whether it can merge the segments, we can use that limit.
> Is my understanding incorrect?

Yes.
Yoshihiro Shimoda June 7, 2019, 5:41 a.m. UTC | #7
Hi Christoph,

> From: Christoph Hellwig, Sent: Thursday, June 6, 2019 4:01 PM
> 
> On Thu, Jun 06, 2019 at 06:28:47AM +0000, Yoshihiro Shimoda wrote:
> > > The problem is that we need a way to communicate to the block layer
> > > that more than a single segment is ok IFF the DMA API instance supports
> > > merging.  And of course the answer will depend on further parameters
> > > like the maximum merged segment size and alignment for the segment.
> >
> > I'm afraid I don't understand why we need a way to communicate to
> > the block layer that more than a single segment is ok IFF the DMA API
> > instance supports merging.
> 
> Assume a device (which I think is your case) that only supports a single
> segment in hardware.  In that case we set max_segments to 1 if no
> IOMMU is present.  But if we have a merge capable IOMMU we can set
> max_segments to unlimited (or some software limit for scatterlist
> allocation), as long as we set a virt_boundary matching what the IOMMU
> expects, and max_sectors_kb isn't larger than the max IOMMU mapping
> size.  Now we could probably just open code this in the driver, but
> I'd feel much happier having a block layer helper like this:

Thank you for the explanation in detail!

> bool blk_can_use_iommu_merging(struct request_queue *q, struct device *dev)
> {
> 	if (!IOMMU_CAN_MERGE_SEGMENTS(dev))
> 		return false;

As Robin mentioned, all IOMMUs can merge segments, so we don't need
this condition, IIUC. However, this should check whether the device is mapped
on an IOMMU by using device_iommu_mapped().

> 	blk_queue_virt_boundary(q, IOMMU_PAGE_SIZE(dev));
> 	blk_queue_max_segment_size(q, IOMMU_MAX_SEGMENT_SIZE(dev));

By the way, I reported an issue [1] and I'm thinking the dma_is_direct()
environment (especially for swiotlb) also needs such max_segment_size
changes somehow.
What do you think?

[1]
https://marc.info/?l=linux-block&m=155954415603356&w=2

> 	return true;
> }
> 
> and the driver then does:
> 
> 	if (blk_can_use_iommu_merging(q, dev)) {
> 		blk_queue_max_segments(q, MAX_SW_SEGMENTS);
> 		// initialize sg mempool, etc..
> 	}

In this case, I assume that "the driver" is ./drivers/mmc/core/queue.c,
not any drivers/mmc/host/ code.

> Where the SCREAMING pseudo code calls are something we need to find a
> good API for.

I assumed
 - IOMMU_PAGE_SIZE(dev) = dma_get_seg_boundary(dev);
 - IOMMU_MAX_SEGMENT_SIZE(dev) = dma_get_max_seg_size(dev);

I could not find an existing "IOMMU_PAGE_SIZE(dev)" for now.
If those assumptions are right, I'll add such a new API.
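
For reference, what those two existing helpers return today (note that
dma_get_seg_boundary() yields a boundary mask rather than a page size, so
it may not be a drop-in for IOMMU_PAGE_SIZE()):

	/* A segment must not cross a (mask + 1)-aligned boundary. */
	unsigned long boundary_mask = dma_get_seg_boundary(dev);

	/* Maximum size of a single DMA segment for this device. */
	unsigned int max_seg_size = dma_get_max_seg_size(dev);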

> And thinking about it, the backend doesn't need to be an iommu, swiotlb
> could handle this as well, which might be interesting for devices
> that need to bounce buffer anyway.  IIRC mmc actually has some code
> to copy multiple segments into a bounce buffer somewhere.

I see. So, as I mentioned above, it seems that swiotlb support is also needed.
IIUC, mmc no longer has a bounce buffer since the commit [2].

[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/mmc/core?h=v5.2-rc3&id=de3ee99b097dd51938276e3af388cd4ad0f2750a

> > The block layer already has a "max_segment_size" limit for each device, so
> > regardless of whether it can merge the segments, we can use that limit.
> > Is my understanding incorrect?
> 
> Yes.

Now I understand that the block layer's max_segment_size differs from the IOMMU's.

Best regards,
Yoshihiro Shimoda
Christoph Hellwig June 7, 2019, 5:49 a.m. UTC | #8
On Fri, Jun 07, 2019 at 05:41:56AM +0000, Yoshihiro Shimoda wrote:
> > bool blk_can_use_iommu_merging(struct request_queue *q, struct device *dev)
> > {
> > 	if (!IOMMU_CAN_MERGE_SEGMENTS(dev))
> > 		return false;
> 
> As Robin mentioned, all IOMMUs can merge segments, so we don't need
> this condition, IIUC. However, this should check whether the device is mapped
> on an IOMMU by using device_iommu_mapped().

There are plenty of dma_map_ops based drivers that can't merge segments.
Examples:

 - arch/ia64/sn/pci/pci_dma.c
 - arch/mips/jazz/jazzdma.c
 - arch/sparc/mm/io-unit.c
 - arch/sparc/mm/iommu.c
 - arch/x86/kernel/pci-calgary_64.c

Never mind the direct mapping, swiotlb and other weirdos.

> 
> > 	blk_queue_virt_boundary(q, IOMMU_PAGE_SIZE(dev));
> > 	blk_queue_max_segment_size(q, IOMMU_MAX_SEGMENT_SIZE(dev));
> 
> By the way, I reported an issue [1] and I'm thinking the dma_is_direct()
> environment (especially for swiotlb) also needs such max_segment_size
> changes somehow.
> What do you think?
> 
> [1]
> https://marc.info/?l=linux-block&m=155954415603356&w=2

That doesn't seem to be related to the segment merging.  I'll take
a look, but next time please Cc the author of a suspect commit if
you have already bisected things.
Yoshihiro Shimoda June 7, 2019, 6:01 a.m. UTC | #9
Hi Christoph,

> From: Christoph Hellwig, Sent: Friday, June 7, 2019 2:50 PM
> 
> On Fri, Jun 07, 2019 at 05:41:56AM +0000, Yoshihiro Shimoda wrote:
> > > bool blk_can_use_iommu_merging(struct request_queue *q, struct device *dev)
> > > {
> > > 	if (!IOMMU_CAN_MERGE_SEGMENTS(dev))
> > > 		return false;
> >
> > As Robin mentioned, all IOMMUs can merge segments, so we don't need
> > this condition, IIUC. However, this should check whether the device is mapped
> > on an IOMMU by using device_iommu_mapped().
> 
> There are plenty of dma_map_ops based drivers that can't merge segments.
> Examples:
> 
>  - arch/ia64/sn/pci/pci_dma.c
>  - arch/mips/jazz/jazzdma.c
>  - arch/sparc/mm/io-unit.c
>  - arch/sparc/mm/iommu.c
>  - arch/x86/kernel/pci-calgary_64.c

Thank you for the pointers. I'll check that code.

> Never mind the direct mapping, swiotlb and other weirdos.

I got it.

> > > 	blk_queue_virt_boundary(q, IOMMU_PAGE_SIZE(dev));
> > > 	blk_queue_max_segment_size(q, IOMMU_MAX_SEGMENT_SIZE(dev));
> >
> > By the way, I reported an issue [1] and I'm thinking the dma_is_direct()
> > environment (especially for swiotlb) also needs such max_segment_size
> > changes somehow.
> > What do you think?
> >
> > [1]
> > https://marc.info/?l=linux-block&m=155954415603356&w=2
> 
> That doesn't seem to be related to the segment merging.  I'll take
> a look, but next time please Cc the author of a suspect commit if
> you have already bisected things.

Oops. I'll Cc the author next time.

Best regards,
Yoshihiro Shimoda

Patch

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 034caae..ecf1a04 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -847,11 +847,16 @@  static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	dma_addr_t iova;
 	size_t iova_len = 0;
 	unsigned long mask = dma_get_seg_boundary(dev);
-	int i;
+	int i, ret;
+	bool iova_contiguous = false;
 
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_sg_for_device(dev, sg, nents, dir);
 
+	if (dma_get_iova_contiguous(dev) &&
+	    iommu_capable(dev->bus, IOMMU_CAP_MERGING))
+		iova_contiguous = true;
+
 	/*
 	 * Work out how much IOVA space we need, and align the segments to
 	 * IOVA granules for the IOMMU driver to handle. With some clever
@@ -867,6 +872,13 @@  static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		sg_dma_len(s) = s_length;
 		s->offset -= s_iova_off;
 		s_length = iova_align(iovad, s_length + s_iova_off);
+		/*
+		 * Check whether the IOVA would be contiguous strictly if
+		 * a device requires and the IOMMU driver has the capable.
+		 */
+		if (iova_contiguous && i > 0 &&
+		    (s_iova_off || s->length != s_length))
+			return 0;
 		s->length = s_length;
 
 		/*
@@ -902,8 +914,18 @@  static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (iommu_map_sg(domain, iova, sg, nents, prot) < iova_len)
 		goto out_free_iova;
 
-	return __finalise_sg(dev, sg, nents, iova);
+	ret = __finalise_sg(dev, sg, nents, iova);
+	/*
+	 * Check whether the sg entry is single if a device requires and
+	 * the IOMMU driver has the capable.
+	 */
+	if (iova_contiguous && ret != 1)
+		goto out_unmap_sg;
+
+	return ret;
 
+out_unmap_sg:
+	iommu_dma_unmap_sg(dev, sg, nents, dir, attrs);
 out_free_iova:
 	iommu_dma_free_iova(cookie, iova, iova_len);
 out_restore_sg:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 91af22a..f971dd3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -104,6 +104,7 @@  enum iommu_cap {
 					   transactions */
 	IOMMU_CAP_INTR_REMAP,		/* IOMMU supports interrupt isolation */
 	IOMMU_CAP_NOEXEC,		/* IOMMU_NOEXEC flag */
+	IOMMU_CAP_MERGING,		/* IOMMU supports segments merging */
 };
 
 /*