[16/18] dmaengine/amba-pl08x: Add support for sg len greater than one for slave transfers
diff mbox

Message ID 5d691ab0c4f447c9f324213d8d740ac61d1739a1.1311936524.git.viresh.kumar@st.com
State New, archived
Headers show

Commit Message

Viresh KUMAR July 29, 2011, 10:49 a.m. UTC
Untill now, sg_len greater than one is not supported. This patch adds support to
do that.

Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
---
 drivers/dma/amba-pl08x.c   |  288 ++++++++++++++++++++++++++------------------
 include/linux/amba/pl08x.h |   22 +++-
 2 files changed, 184 insertions(+), 126 deletions(-)

Comments

Linus Walleij July 31, 2011, 12:41 a.m. UTC | #1
2011/7/29 Viresh Kumar <viresh.kumar@st.com>:

> Untill now, sg_len greater than one is not supported. This patch adds support to
> do that.
>
> Signed-off-by: Viresh Kumar <viresh.kumar@st.com>

Thanks, was on my TODO once.
Acked-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij
Russell King - ARM Linux Aug. 14, 2011, 8:36 a.m. UTC | #2
On Fri, Jul 29, 2011 at 04:19:26PM +0530, Viresh Kumar wrote:
> Untill now, sg_len greater than one is not supported. This patch adds support to
> do that.

I'm not sure that this is the correct approach.  memcpy()s can only be
used with one single buffer, so the sg stuff for that (and the unmapping
support) doesn't make sense at all.

The only place where this makes sense is the slave sg stuff.  I wonder
whether we can better deal with that by having the LLI setup code deal
with one SG list entry at a time, and chain each together.

Something I've also been pondering which is related to this is linking
together DMA operations using the LLI chaining when interrupts and
callbacks aren't required.  It's a very similar problem.  Other DMA
engine drivers do this.

Finally, note that some of the PL08x code assumes that for any TXD,
the LLI list is a contiguous array.  See the first part of
pl08x_getbytes_chan(). That needs fixing as soon as we start going to
more than one SG list entry.
Linus Walleij Aug. 15, 2011, 10:11 a.m. UTC | #3
On Sun, Aug 14, 2011 at 10:36 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Fri, Jul 29, 2011 at 04:19:26PM +0530, Viresh Kumar wrote:
>> Untill now, sg_len greater than one is not supported. This patch adds support to
>> do that.
>
> I'm not sure that this is the correct approach.  memcpy()s can only be
> used with one single buffer, so the sg stuff for that (and the unmapping
> support) doesn't make sense at all.

I agree, that part is fishy.

> The only place where this makes sense is the slave sg stuff.  I wonder
> whether we can better deal with that by having the LLI setup code deal
> with one SG list entry at a time, and chain each together.
>
> Something I've also been pondering which is related to this is linking
> together DMA operations using the LLI chaining when interrupts and
> callbacks aren't required.  It's a very similar problem.  Other DMA
> engine drivers do this.

Yeah. Viresh can you hold this patch off for the moment, move it to
the end of the series and see if we can find some more optimal
approach for this?

> Finally, note that some of the PL08x code assumes that for any TXD,
> the LLI list is a contiguous array.  See the first part of
> pl08x_getbytes_chan(). That needs fixing as soon as we start going to
> more than one SG list entry.

Actually, after the removal of the phantom boundary in the PL08x
driver, the LLI handling code starts to resemble the stuff inside
the coh901318.c driver an *awful lot*, so I'm now ever more
convinced that that thing is a PL08x derivate. (I have asked the
hardware designers but the people who made it are unreachable.)

And the coh901318 code handles this by properly walking the LLIs
in memory IIRC, so could be a good inspiration.

I am aware that we need to consolidate that LLI handling so
the coh901318 and PL08x share the same LLI code atleast, or I
may even be able to merge all into pl08x.c. I have a mental note
to get to that once we've got PL08x into shape, until then it
can atleast be used for inspiration.

Thanks,
Linus Walleij
Viresh KUMAR Aug. 18, 2011, 8:38 a.m. UTC | #4
Hi Russell & Linus,

Sorry for being late to reply.

On 8/14/2011 2:06 PM, Russell King - ARM Linux wrote:
> On Fri, Jul 29, 2011 at 04:19:26PM +0530, Viresh Kumar wrote:
>> Untill now, sg_len greater than one is not supported. This patch adds support to
>> do that.
> 
> I'm not sure that this is the correct approach.  memcpy()s can only be
> used with one single buffer, so the sg stuff for that (and the unmapping
> support) doesn't make sense at all.
> 

I am not sure if i get this completely. In memcpy, we still don't support
more than one sg. We have created a new member in txd, which keeps track of
data (addresses, len).

While unmapping, we are unmapping only single dsg. I think i don't need to
use list_for_each_entry() there, as there is only one element present in list.
But can still keep that.

> The only place where this makes sense is the slave sg stuff.  I wonder
> whether we can better deal with that by having the LLI setup code deal
> with one SG list entry at a time, and chain each together.
> 
> Something I've also been pondering which is related to this is linking
> together DMA operations using the LLI chaining when interrupts and
> callbacks aren't required.  It's a very similar problem.  Other DMA
> engine drivers do this.
> 

Are you talking about linking all sg's together or linking multiple calls to
prep_dma_memcpy? I am handling the first.

> Finally, note that some of the PL08x code assumes that for any TXD,
> the LLI list is a contiguous array.  See the first part of
> pl08x_getbytes_chan(). That needs fixing as soon as we start going to
> more than one SG list entry.

We still have one contiguous array for LLI list. In pl08x_fill_llis_for_desc()
i am using same txd->llis_va for all sg's.

Sorry if i misunderstood your concerns.
Russell King - ARM Linux Aug. 21, 2011, 8:33 a.m. UTC | #5
On Thu, Aug 18, 2011 at 02:08:19PM +0530, Viresh Kumar wrote:
> On 8/14/2011 2:06 PM, Russell King - ARM Linux wrote:
> > On Fri, Jul 29, 2011 at 04:19:26PM +0530, Viresh Kumar wrote:
> >> Untill now, sg_len greater than one is not supported. This patch adds support to
> >> do that.
> > 
> > I'm not sure that this is the correct approach.  memcpy()s can only be
> > used with one single buffer, so the sg stuff for that (and the unmapping
> > support) doesn't make sense at all.
> > 
> 
> I am not sure if i get this completely. In memcpy, we still don't support
> more than one sg. We have created a new member in txd, which keeps track of
> data (addresses, len).

Yes, but we shouldn't need to translate it into any kind of scatterlist.

> While unmapping, we are unmapping only single dsg. I think i don't need to
> use list_for_each_entry() there, as there is only one element present in list.
> But can still keep that.

Correct.

> > The only place where this makes sense is the slave sg stuff.  I wonder
> > whether we can better deal with that by having the LLI setup code deal
> > with one SG list entry at a time, and chain each together.
> > 
> > Something I've also been pondering which is related to this is linking
> > together DMA operations using the LLI chaining when interrupts and
> > callbacks aren't required.  It's a very similar problem.  Other DMA
> > engine drivers do this.
> 
> Are you talking about linking all sg's together or linking multiple calls to
> prep_dma_memcpy? I am handling the first.

I'm talking about letting the hardware process as many txds as possible
without calling back into the driver to have it setup the next txd.

> > Finally, note that some of the PL08x code assumes that for any TXD,
> > the LLI list is a contiguous array.  See the first part of
> > pl08x_getbytes_chan(). That needs fixing as soon as we start going to
> > more than one SG list entry.
> 
> We still have one contiguous array for LLI list. In pl08x_fill_llis_for_desc()
> i am using same txd->llis_va for all sg's.

That's ok then.
Viresh KUMAR Aug. 23, 2011, 4:22 a.m. UTC | #6
On 8/21/2011 2:03 PM, Russell King - ARM Linux wrote:
>> > I am not sure if i get this completely. In memcpy, we still don't support
>> > more than one sg. We have created a new member in txd, which keeps track of
>> > data (addresses, len).
> Yes, but we shouldn't need to translate it into any kind of scatterlist.
> 

Ok. I way out is keeping two separate variables in txd, list for slave transfers,
and pointer to single element for memcpy. And that looks to be even bad to me.
Why waste memory, for second variable. Or create union of both.

One more thing, we can actually have scatter gather in memcpy too in future. This will be
helpful then also. But surely that's something not implemented currently.
Vinod Koul Aug. 23, 2011, 5:20 a.m. UTC | #7
On Tue, 2011-08-23 at 09:52 +0530, Viresh Kumar wrote:
> On 8/21/2011 2:03 PM, Russell King - ARM Linux wrote:
> >> > I am not sure if i get this completely. In memcpy, we still don't support
> >> > more than one sg. We have created a new member in txd, which keeps track of
> >> > data (addresses, len).
> > Yes, but we shouldn't need to translate it into any kind of scatterlist.
> > 
> 
> Ok. I way out is keeping two separate variables in txd, list for slave transfers,
> and pointer to single element for memcpy. And that looks to be even bad to me.
> Why waste memory, for second variable. Or create union of both.
> 
> One more thing, we can actually have scatter gather in memcpy too in future. This will be
> helpful then also. But surely that's something not implemented currently.
> 
There are few points for consideration:
1) Clients will calls tx_submit() and submit the descriptors, this
should push your descriptor to queued list
2) On issue_pending() you need to push the queued descriptors to dmac.
If your hardware can transfer multiple descriptors, then you should use
this and push all available descriptors to hardware.

This way you ensure optimal performance and give LLI for memcpy even
when you think API is not available.

If there are some use cases where we need scatterlist support for
memcpy, and if above doesn't satisfy, should be okay to add it.
Viresh KUMAR Aug. 26, 2011, 3:41 a.m. UTC | #8
On 8/23/2011 9:52 AM, Viresh Kumar wrote:
> On 8/21/2011 2:03 PM, Russell King - ARM Linux wrote:
>>>> I am not sure if i get this completely. In memcpy, we still don't support
>>>> more than one sg. We have created a new member in txd, which keeps track of
>>>> data (addresses, len).
>> Yes, but we shouldn't need to translate it into any kind of scatterlist.
>>
> 
> Ok. I way out is keeping two separate variables in txd, list for slave transfers,
> and pointer to single element for memcpy. And that looks to be even bad to me.
> Why waste memory, for second variable. Or create union of both.
> 
> One more thing, we can actually have scatter gather in memcpy too in future. This will be
> helpful then also. But surely that's something not implemented currently.
> 

Russell/Linus,

Probably this is the only issue left in this patch. (Sorry if i missed some other points)
Please suggest what should i modify here, so that Vinod can go ahead and push pending
patches too.
Viresh KUMAR Aug. 26, 2011, 8:51 a.m. UTC | #9
On 8/26/2011 1:33 PM, Linus Walleij wrote:
> Patches are individual for that very reason - so they can be applied one by one.
> 
> I already suggested to Vinod to merge the other patches as far as possible,
> for certain 1 thru 15 could be merged right off with some minor
> manual fixup. Then we can fix this one patch.

Actually, Vinod has already applied earlier 17 patches. Now only 3 are left, this one
and two after it.

I just wanted to know, whether i need to do any other modification in this patch or not?
Viresh KUMAR Sept. 1, 2011, 10:07 a.m. UTC | #10
On 8/26/2011 2:21 PM, Viresh Kumar wrote:
> I just wanted to know, whether i need to do any other modification in this patch or not?

Hello,

Do i need to update anything in this patch? Or can this be pushed as it is?
Vinod Koul Sept. 7, 2011, 6:42 p.m. UTC | #11
On Thu, 2011-09-01 at 15:37 +0530, Viresh Kumar wrote:
> On 8/26/2011 2:21 PM, Viresh Kumar wrote:
> > I just wanted to know, whether i need to do any other modification in this patch or not?
> 
> Hello,
> 
> Do i need to update anything in this patch? Or can this be pushed as it is?
> 
I thought Linus W, and Russell had some comments on this. 

--
~Vinod
Linus Walleij Sept. 7, 2011, 11:01 p.m. UTC | #12
2011/9/7 Koul, Vinod <vinod.koul@intel.com>:
> On Thu, 2011-09-01 at 15:37 +0530, Viresh Kumar wrote:
>> On 8/26/2011 2:21 PM, Viresh Kumar wrote:
>> > I just wanted to know, whether i need to do any other modification in this patch or not?
>>
>> Hello,
>>
>> Do i need to update anything in this patch? Or can this be pushed as it is?
>>
> I thought Linus W, and Russell had some comments on this.

I think the patch brings valuable functionality we don't want to loose when
there is a solution. Basically the dmaengine has a contract to handle
sglists of any lengths and it's a pity that we don't, and I suspect Viresh
cannot use the driver for MMC unless something like this is added, so
Acked-by: Linus Walleij <linus.walleij@linaro.org>

BUT I think it is possible to rewrite it a bit later so as to get a better
handling of this. Isn't Russells initial remark that the LLI:s can simply just
take in the entire sglist at once true?

Check for example in drivers/dma/coh901318.c which evidently has
a LLI format similar or identical to what pl08x use (which is another
reason to refactor this later):

It will just fill in LLIs for all the elements of the sglist, and off you go.

Like this:

       /* The dma only supports transmitting packages up to
         * MAX_DMA_PACKET_SIZE. Calculate to total number of
         * dma elemts required to send the entire sg list
         */
        for_each_sg(sgl, sg, sg_len, i) {
                unsigned int factor;
                size = sg_dma_len(sg);

                if (size <= MAX_DMA_PACKET_SIZE) {
                        len++;
                        continue;
                }

                factor = size >> MAX_DMA_PACKET_SIZE_SHIFT;
                if ((factor << MAX_DMA_PACKET_SIZE_SHIFT) < size)
                        factor++;

                len += factor;
        }

        pr_debug("Allocate %d lli:s for this transfer\n", len);
        lli = coh901318_lli_alloc(&cohc->base->pool, len);

        if (lli == NULL)
                goto err_dma_alloc;

        /* initiate allocated lli list */
        ret = coh901318_lli_fill_sg(&cohc->base->pool, lli, sgl, sg_len,
                                    cohc_dev_addr(cohc),
                                    ctrl_chained,
                                    ctrl,
                                    ctrl_last,
                                    direction, COH901318_CX_CTRL_TC_IRQ_ENABLE);
        if (ret)
                goto err_lli_fill;


The rest of interesting code is in coh901318_lli_fill_sg() as you can
see.

Evidently this need to be fixed as part of factoring out the LLI handling
from PL08x and coh901318 into a single one. If we ever get to it.

Yours,
Linus Walleij
Viresh KUMAR Sept. 8, 2011, 3:50 a.m. UTC | #13
On 9/8/2011 4:31 AM, Linus Walleij wrote:
> I think the patch brings valuable functionality we don't want to loose when
> there is a solution. Basically the dmaengine has a contract to handle
> sglists of any lengths and it's a pity that we don't, and I suspect Viresh
> cannot use the driver for MMC unless something like this is added, so
> Acked-by: Linus Walleij <linus.walleij@linaro.org>
> 

Thanks Again.

> BUT I think it is possible to rewrite it a bit later so as to get a better
> handling of this. Isn't Russells initial remark that the LLI:s can simply just
> take in the entire sglist at once true?

If i am getting this clearly, the concern is "why to queue separate transfers for
individual sg's? Better would be to prepare the complete list at once and
start the transfer, so that DMA stops only after finishing all sg's
passed from user." Is this what you are pointing at?

If yes, then the same is done in this patch too. An array for llis is allocated at 
the start, then for each sg i prepare lli list from this array. Last lli from one sg
is followed by first lli from next sg. And so i get a continuous chain of llis.
Linus Walleij Sept. 8, 2011, 10:29 a.m. UTC | #14
On Thu, Sep 8, 2011 at 5:50 AM, Viresh Kumar <viresh.kumar@st.com> wrote:

> If i am getting this clearly, the concern is "why to queue separate transfers for
> individual sg's? Better would be to prepare the complete list at once and
> start the transfer, so that DMA stops only after finishing all sg's
> passed from user." Is this what you are pointing at?

Yes.

> If yes, then the same is done in this patch too. An array for llis is allocated at
> the start, then for each sg i prepare lli list from this array. Last lli from one sg
> is followed by first lli from next sg. And so i get a continuous chain of llis.

OK so I guess I was lost in the code ...

So this is mainy cached as txd->dsg_list so you can quickly retrieve the
number of bytes pending in the LLI by traversing that sglist.

This is better than what the coh901318 does, because that driver
resorts to going into the physical LLIs themselves to retrieve this
info.

It also seems like this will play nice with Per Forlin's MMC
speed-up patches, so that will become effective for your MMC
usecases.

Now I really like this patch.

Sorry for being such a slow learner!
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij

Patch
diff mbox

diff --git a/drivers/dma/amba-pl08x.c b/drivers/dma/amba-pl08x.c
index 9eea0d1..ddc06bd 100644
--- a/drivers/dma/amba-pl08x.c
+++ b/drivers/dma/amba-pl08x.c
@@ -354,7 +354,9 @@  static u32 pl08x_getbytes_chan(struct pl08x_dma_chan *plchan)
 	if (!list_empty(&plchan->pend_list)) {
 		struct pl08x_txd *txdi;
 		list_for_each_entry(txdi, &plchan->pend_list, node) {
-			bytes += txdi->len;
+			struct pl08x_sg *dsg;
+			list_for_each_entry(dsg, &txd->dsg_list, node)
+				bytes += dsg->len;
 		}
 	}
 
@@ -564,6 +566,7 @@  static int pl08x_fill_llis_for_desc(struct pl08x_driver_data *pl08x,
 	u32 cctl, early_bytes = 0;
 	size_t max_bytes_per_lli, total_bytes = 0;
 	struct pl08x_lli *llis_va;
+	struct pl08x_sg *dsg;
 
 	txd->llis_va = dma_pool_alloc(pl08x->pool, GFP_NOWAIT, &txd->llis_bus);
 	if (!txd->llis_va) {
@@ -573,13 +576,9 @@  static int pl08x_fill_llis_for_desc(struct pl08x_driver_data *pl08x,
 
 	pl08x->pool_ctr++;
 
-	/* Get the default CCTL */
-	cctl = txd->cctl;
-
 	bd.txd = txd;
-	bd.srcbus.addr = txd->src_addr;
-	bd.dstbus.addr = txd->dst_addr;
 	bd.lli_bus = (pl08x->lli_buses & PL08X_AHB2) ? PL080_LLI_LM_AHB2 : 0;
+	cctl = txd->cctl;
 
 	/* Find maximum width of the source bus */
 	bd.srcbus.maxwidth =
@@ -591,46 +590,27 @@  static int pl08x_fill_llis_for_desc(struct pl08x_driver_data *pl08x,
 		pl08x_get_bytes_for_cctl((cctl & PL080_CONTROL_DWIDTH_MASK) >>
 				PL080_CONTROL_DWIDTH_SHIFT);
 
-	/* Set up the bus widths to the maximum */
-	bd.srcbus.buswidth = bd.srcbus.maxwidth;
-	bd.dstbus.buswidth = bd.dstbus.maxwidth;
+	list_for_each_entry(dsg, &txd->dsg_list, node) {
+		cctl = txd->cctl;
 
-	/* We need to count this down to zero */
-	bd.remainder = txd->len;
+		bd.srcbus.addr = dsg->src_addr;
+		bd.dstbus.addr = dsg->dst_addr;
+		bd.remainder = dsg->len;
+		bd.srcbus.buswidth = bd.srcbus.maxwidth;
+		bd.dstbus.buswidth = bd.dstbus.maxwidth;
 
-	pl08x_choose_master_bus(&bd, &mbus, &sbus, cctl);
+		pl08x_choose_master_bus(&bd, &mbus, &sbus, cctl);
 
-	dev_vdbg(&pl08x->adev->dev, "src=0x%08x%s/%u dst=0x%08x%s/%u len=%zu\n",
-		 bd.srcbus.addr, cctl & PL080_CONTROL_SRC_INCR ? "+" : "",
-		 bd.srcbus.buswidth,
-		 bd.dstbus.addr, cctl & PL080_CONTROL_DST_INCR ? "+" : "",
-		 bd.dstbus.buswidth,
-		 bd.remainder);
-	dev_vdbg(&pl08x->adev->dev, "mbus=%s sbus=%s\n",
-		 mbus == &bd.srcbus ? "src" : "dst",
-		 sbus == &bd.srcbus ? "src" : "dst");
-
-	/*
-	 * Send byte by byte for following cases
-	 * - Less than a bus width available
-	 * - until master bus is aligned
-	 */
-	if (bd.remainder < mbus->buswidth)
-		early_bytes = bd.remainder;
-	else if ((mbus->addr) % (mbus->buswidth)) {
-		early_bytes = mbus->buswidth - (mbus->addr) % (mbus->buswidth);
-		if ((bd.remainder - early_bytes) < mbus->buswidth)
-			early_bytes = bd.remainder;
-	}
+		dev_vdbg(&pl08x->adev->dev, "src=0x%08x%s/%u dst=0x%08x%s/%u len=%zu\n",
+			bd.srcbus.addr, cctl & PL080_CONTROL_SRC_INCR ? "+" : "",
+			bd.srcbus.buswidth,
+			bd.dstbus.addr, cctl & PL080_CONTROL_DST_INCR ? "+" : "",
+			bd.dstbus.buswidth,
+			bd.remainder);
+		dev_vdbg(&pl08x->adev->dev, "mbus=%s sbus=%s\n",
+			mbus == &bd.srcbus ? "src" : "dst",
+			sbus == &bd.srcbus ? "src" : "dst");
 
-	if (early_bytes) {
-		dev_vdbg(&pl08x->adev->dev, "%s byte width LLIs "
-				"(remain 0x%08x)\n", __func__, bd.remainder);
-		prep_byte_width_lli(&bd, &cctl, early_bytes, num_llis++,
-				&total_bytes);
-	}
-
-	if (bd.remainder) {
 		/*
 		 * Zero length is only allowed if all these requirements are
 		 * met:
@@ -664,79 +644,111 @@  static int pl08x_fill_llis_for_desc(struct pl08x_driver_data *pl08x,
 		}
 
 		/*
-		 * Master now aligned
-		 * - if slave is not then we must set its width down
+		 * Send byte by byte for following cases
+		 * - Less than a bus width available
+		 * - until master bus is aligned
 		 */
-		if (sbus->addr % sbus->buswidth) {
-			dev_dbg(&pl08x->adev->dev,
-				"%s set down bus width to one byte\n",
-				 __func__);
+		if (bd.remainder < mbus->buswidth)
+			early_bytes = bd.remainder;
+		else if ((mbus->addr) % (mbus->buswidth)) {
+			early_bytes = mbus->buswidth - (mbus->addr) %
+				(mbus->buswidth);
+			if ((bd.remainder - early_bytes) < mbus->buswidth)
+				early_bytes = bd.remainder;
+		}
 
-			sbus->buswidth = 1;
+		if (early_bytes) {
+			dev_vdbg(&pl08x->adev->dev, "%s byte width LLIs "
+				"(remain 0x%08x)\n", __func__, bd.remainder);
+			prep_byte_width_lli(&bd, &cctl, early_bytes, num_llis++,
+				&total_bytes);
 		}
 
-		/* Bytes transferred = tsize * src width, not MIN(buswidths) */
-		max_bytes_per_lli = bd.srcbus.buswidth *
-			PL080_CONTROL_TRANSFER_SIZE_MASK;
+		if (bd.remainder) {
+			/*
+			 * Master now aligned
+			 * - if slave is not then we must set its width down
+			 */
+			if (sbus->addr % sbus->buswidth) {
+				dev_dbg(&pl08x->adev->dev,
+					"%s set down bus width to one byte\n",
+					__func__);
 
-		/*
-		 * Make largest possible LLIs until less than one bus
-		 * width left
-		 */
-		while (bd.remainder > (mbus->buswidth - 1)) {
-			size_t lli_len, tsize, width;
+				sbus->buswidth = 1;
+			}
 
 			/*
-			 * If enough left try to send max possible,
-			 * otherwise try to send the remainder
+			 * Bytes transferred = tsize * src width, not
+			 * MIN(buswidths)
 			 */
-			lli_len = min(bd.remainder, max_bytes_per_lli);
+			max_bytes_per_lli = bd.srcbus.buswidth *
+				PL080_CONTROL_TRANSFER_SIZE_MASK;
+			dev_vdbg(&pl08x->adev->dev, "%s max bytes per lli = "
+				"%zu\n", __func__, max_bytes_per_lli);
 
 			/*
-			 * Check against maximum bus alignment: Calculate actual
-			 * transfer size in relation to bus width and get a
-			 * maximum remainder of the highest bus width - 1
+			 * Make largest possible LLIs until less than one bus
+			 * width left
 			 */
-			width = max(mbus->buswidth, sbus->buswidth);
-			lli_len = (lli_len / width) * width;
-			tsize = lli_len / bd.srcbus.buswidth;
+			while (bd.remainder > (mbus->buswidth - 1)) {
+				size_t lli_len, tsize, width;
 
-			dev_vdbg(&pl08x->adev->dev,
+				/*
+				 * If enough left try to send max possible,
+				 * otherwise try to send the remainder
+				 */
+				lli_len = min(bd.remainder, max_bytes_per_lli);
+
+				/*
+				 * Check against maximum bus alignment:
+				 * Calculate actual transfer size in relation to
+				 * bus width an get a maximum remainder of the
+				 * highest bus width - 1
+				 */
+				width = max(mbus->buswidth, sbus->buswidth);
+				lli_len = (lli_len / width) * width;
+				tsize = lli_len / bd.srcbus.buswidth;
+
+				dev_vdbg(&pl08x->adev->dev,
 					"%s fill lli with single lli chunk of "
 					"size 0x%08zx (remainder 0x%08zx)\n",
 					__func__, lli_len, bd.remainder);
 
-			cctl = pl08x_cctl_bits(cctl, bd.srcbus.buswidth,
+				cctl = pl08x_cctl_bits(cctl, bd.srcbus.buswidth,
 					bd.dstbus.buswidth, tsize);
-			pl08x_fill_lli_for_desc(&bd, num_llis++, lli_len, cctl);
-			total_bytes += lli_len;
-		}
+				pl08x_fill_lli_for_desc(&bd, num_llis++,
+						lli_len, cctl);
+				total_bytes += lli_len;
+			}
 
-		/*
-		 * Send any odd bytes
-		 */
-		if (bd.remainder) {
-			dev_vdbg(&pl08x->adev->dev,
-				"%s align with boundary, send odd bytes "
-				"(remain %zu)\n", __func__, bd.remainder);
-			prep_byte_width_lli(&bd, &cctl, bd.remainder,
-					num_llis++, &total_bytes);
+			/*
+			 * Send any odd bytes
+			 */
+			if (bd.remainder) {
+				dev_vdbg(&pl08x->adev->dev,
+					"%s align with boundary, send odd bytes"
+					" (remain %zu)\n", __func__,
+					bd.remainder);
+				prep_byte_width_lli(&bd, &cctl, bd.remainder,
+						num_llis++, &total_bytes);
+			}
 		}
-	}
 
-	if (total_bytes != txd->len) {
-		dev_err(&pl08x->adev->dev,
-			"%s size of encoded lli:s don't match total txd, "
-			"transferred 0x%08zx from size 0x%08zx\n", __func__,
-			total_bytes, txd->len);
-		return 0;
-	}
+		if (total_bytes != dsg->len) {
+			dev_err(&pl08x->adev->dev,
+				"%s size of encoded lli:s don't match total txd"
+				", transferred 0x%08zx from size 0x%08zx\n",
+				__func__,
+				total_bytes, dsg->len);
+			return 0;
+		}
 
-	if (num_llis >= MAX_NUM_TSFR_LLIS) {
-		dev_err(&pl08x->adev->dev,
-			"%s need to increase MAX_NUM_TSFR_LLIS from 0x%08x\n",
-			__func__, (u32) MAX_NUM_TSFR_LLIS);
-		return 0;
+		if (num_llis >= MAX_NUM_TSFR_LLIS) {
+			dev_err(&pl08x->adev->dev,
+				"%s need to increase MAX_NUM_TSFR_LLIS from "
+				"0x%08x\n", __func__, (u32) MAX_NUM_TSFR_LLIS);
+			return 0;
+		}
 	}
 
 	llis_va = txd->llis_va;
@@ -769,11 +781,18 @@  static int pl08x_fill_llis_for_desc(struct pl08x_driver_data *pl08x,
 static void pl08x_free_txd(struct pl08x_driver_data *pl08x,
 		struct pl08x_txd *txd)
 {
+	struct pl08x_sg *dsg, *_dsg;
+
 	/* Free the LLI */
 	dma_pool_free(pl08x->pool, txd->llis_va, txd->llis_bus);
 
 	pl08x->pool_ctr--;
 
+	list_for_each_entry_safe(dsg, _dsg, &txd->dsg_list, node) {
+		list_del(&dsg->node);
+		kfree(dsg);
+	}
+
 	kfree(txd);
 }
 
@@ -1228,6 +1247,7 @@  static struct pl08x_txd *pl08x_get_txd(struct pl08x_dma_chan *plchan,
 		txd->tx.flags = flags;
 		txd->tx.tx_submit = pl08x_tx_submit;
 		INIT_LIST_HEAD(&txd->node);
+		INIT_LIST_HEAD(&txd->dsg_list);
 
 		/* Always enable error and terminal interrupts */
 		txd->ccfg = PL080_CONFIG_ERR_IRQ_MASK |
@@ -1246,6 +1266,7 @@  static struct dma_async_tx_descriptor *pl08x_prep_dma_memcpy(
 	struct pl08x_dma_chan *plchan = to_pl08x_chan(chan);
 	struct pl08x_driver_data *pl08x = plchan->host;
 	struct pl08x_txd *txd;
+	struct pl08x_sg *dsg;
 	int ret;
 
 	txd = pl08x_get_txd(plchan, flags);
@@ -1255,10 +1276,19 @@  static struct dma_async_tx_descriptor *pl08x_prep_dma_memcpy(
 		return NULL;
 	}
 
+	dsg = kzalloc(sizeof(struct pl08x_sg), GFP_NOWAIT);
+	if (!dsg) {
+		pl08x_free_txd(pl08x, txd);
+		dev_err(&pl08x->adev->dev, "%s no memory for pl080 sg\n",
+				__func__);
+		return NULL;
+	}
+	list_add_tail(&dsg->node, &txd->dsg_list);
+
 	txd->direction = DMA_NONE;
-	txd->src_addr = src;
-	txd->dst_addr = dest;
-	txd->len = len;
+	dsg->src_addr = src;
+	dsg->dst_addr = dest;
+	dsg->len = len;
 
 	/* Set platform data for m2m */
 	txd->ccfg |= PL080_FLOW_MEM2MEM << PL080_CONFIG_FLOW_CONTROL_SHIFT;
@@ -1287,17 +1317,11 @@  static struct dma_async_tx_descriptor *pl08x_prep_slave_sg(
 	struct pl08x_dma_chan *plchan = to_pl08x_chan(chan);
 	struct pl08x_driver_data *pl08x = plchan->host;
 	struct pl08x_txd *txd;
+	struct pl08x_sg *dsg;
+	struct scatterlist *sg;
+	dma_addr_t slave_addr;
 	int ret, tmp;
 
-	/*
-	 * Current implementation ASSUMES only one sg
-	 */
-	if (sg_len != 1) {
-		dev_err(&pl08x->adev->dev, "%s prepared too long sglist\n",
-			__func__);
-		BUG();
-	}
-
 	dev_dbg(&pl08x->adev->dev, "%s prepare transaction of %d bytes from"
 			" %s\n", __func__, sgl->length, plchan->name);
 
@@ -1318,17 +1342,15 @@  static struct dma_async_tx_descriptor *pl08x_prep_slave_sg(
 	 * channel target address dynamically at runtime.
 	 */
 	txd->direction = direction;
-	txd->len = sgl->length;
 
 	if (direction == DMA_TO_DEVICE) {
 		txd->cctl = plchan->dst_cctl;
-		txd->src_addr = sgl->dma_address;
-		txd->dst_addr = plchan->dst_addr;
+		slave_addr = plchan->dst_addr;
 	} else if (direction == DMA_FROM_DEVICE) {
 		txd->cctl = plchan->src_cctl;
-		txd->src_addr = plchan->src_addr;
-		txd->dst_addr = sgl->dma_address;
+		slave_addr = plchan->src_addr;
 	} else {
+		pl08x_free_txd(pl08x, txd);
 		dev_err(&pl08x->adev->dev,
 			"%s direction unsupported\n", __func__);
 		return NULL;
@@ -1342,6 +1364,26 @@  static struct dma_async_tx_descriptor *pl08x_prep_slave_sg(
 
 	txd->ccfg |= tmp << PL080_CONFIG_FLOW_CONTROL_SHIFT;
 
+	for_each_sg(sgl, sg, sg_len, tmp) {
+		dsg = kzalloc(sizeof(struct pl08x_sg), GFP_NOWAIT);
+		if (!dsg) {
+			pl08x_free_txd(pl08x, txd);
+			dev_err(&pl08x->adev->dev, "%s no mem for pl080 sg\n",
+					__func__);
+			return NULL;
+		}
+		list_add_tail(&dsg->node, &txd->dsg_list);
+
+		dsg->len = sg_dma_len(sg);
+		if (direction == DMA_TO_DEVICE) {
+			dsg->src_addr = sg_phys(sg);
+			dsg->dst_addr = slave_addr;
+		} else {
+			dsg->src_addr = slave_addr;
+			dsg->dst_addr = sg_phys(sg);
+		}
+	}
+
 	ret = pl08x_prep_channel_resources(plchan, txd);
 	if (ret)
 		return NULL;
@@ -1439,22 +1481,28 @@  static void pl08x_ensure_on(struct pl08x_driver_data *pl08x)
 static void pl08x_unmap_buffers(struct pl08x_txd *txd)
 {
 	struct device *dev = txd->tx.chan->device->dev;
+	struct pl08x_sg *dsg;
 
 	if (!(txd->tx.flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
 		if (txd->tx.flags & DMA_COMPL_SRC_UNMAP_SINGLE)
-			dma_unmap_single(dev, txd->src_addr, txd->len,
-				DMA_TO_DEVICE);
-		else
-			dma_unmap_page(dev, txd->src_addr, txd->len,
-				DMA_TO_DEVICE);
+			list_for_each_entry(dsg, &txd->dsg_list, node)
+				dma_unmap_single(dev, dsg->src_addr, dsg->len,
+						DMA_TO_DEVICE);
+		else {
+			list_for_each_entry(dsg, &txd->dsg_list, node)
+				dma_unmap_page(dev, dsg->src_addr, dsg->len,
+						DMA_TO_DEVICE);
+		}
 	}
 	if (!(txd->tx.flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
 		if (txd->tx.flags & DMA_COMPL_DEST_UNMAP_SINGLE)
-			dma_unmap_single(dev, txd->dst_addr, txd->len,
-				DMA_FROM_DEVICE);
+			list_for_each_entry(dsg, &txd->dsg_list, node)
+				dma_unmap_single(dev, dsg->dst_addr, dsg->len,
+						DMA_FROM_DEVICE);
 		else
-			dma_unmap_page(dev, txd->dst_addr, txd->len,
-				DMA_FROM_DEVICE);
+			list_for_each_entry(dsg, &txd->dsg_list, node)
+				dma_unmap_page(dev, dsg->dst_addr, dsg->len,
+						DMA_FROM_DEVICE);
 	}
 }
 
diff --git a/include/linux/amba/pl08x.h b/include/linux/amba/pl08x.h
index febb0da..9090342 100644
--- a/include/linux/amba/pl08x.h
+++ b/include/linux/amba/pl08x.h
@@ -107,12 +107,24 @@  struct pl08x_phy_chan {
 };
 
 /**
+ * struct pl08x_sg - structure containing data per sg
+ * @src_addr: src address of sg
+ * @dst_addr: dst address of sg
+ * @len: transfer len in bytes
+ * @node: node for txd's dsg_list
+ */
+struct pl08x_sg {
+	dma_addr_t src_addr;
+	dma_addr_t dst_addr;
+	size_t len;
+	struct list_head node;
+};
+
+/**
  * struct pl08x_txd - wrapper for struct dma_async_tx_descriptor
  * @tx: async tx descriptor
  * @node: node for txd list for channels
- * @src_addr: src address of txd
- * @dst_addr: dst address of txd
- * @len: transfer len in bytes
+ * @dsg_list: list of children sg's
  * @direction: direction of transfer
  * @llis_bus: DMA memory address (physical) start for the LLIs
  * @llis_va: virtual memory address start for the LLIs
@@ -122,10 +134,8 @@  struct pl08x_phy_chan {
 struct pl08x_txd {
 	struct dma_async_tx_descriptor tx;
 	struct list_head node;
+	struct list_head dsg_list;
 	enum dma_data_direction	direction;
-	dma_addr_t src_addr;
-	dma_addr_t dst_addr;
-	size_t len;
 	dma_addr_t llis_bus;
 	struct pl08x_lli *llis_va;
 	/* Default cctl value for LLIs */