diff mbox

ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info

Message ID 43389276-E591-4E09-AB84-491C2CB2D9A7@martin.sperl.org (mailing list archive)
State New, archived
Headers show

Commit Message

Martin Sperl April 2, 2014, 4:06 p.m. UTC
Hi!

I have finished a first fully working revision of the DMA-only driven
SPI bus driver for the BCM2835 which is now quite efficient with
regards to CPU-resources.

But to make the most out of it it will require that the real device drivers
are also optimized for thru-put to avoid unnecessarily rerun some
computations/transformations. 

These "optimizations" are mostly:
* use spi_messages that are created on the heap and do not need to
  get setup/torn down after each invocation (typically on the stack)
* use the (previously discussed) optimize interface to "fix" spi_messages
  in such a way that minimal of processing needs to get done prior to
  submission to DMA (translation to dma, verify if all parameters are valid)
  - patch against 3.13 below
* thinking of how to schedule transfers in such a way that while a complete
  callback is being processed more transfers are already scheduled and can
  run making the best use of the time available.
And they apply mostly to drivers that potentially generate a high number
of SPI messages in a short period of time.

With this we can "reduce" a sequence of typically 7-10 
commands to the device down to 3 spi_messages of which the first 2 are
scheduled one after the other (note that this requires locking, so that no
race condition can occur on callback, which I found out the hard way may
happen even in a GPIO irq handler getting interrupted by a DMA irq handling
the callback) and then optimize them once avoiding the need to
verify/transform the messages 1000 times per second.

This means that there are (typically) no visible gaps between the
the 3 spi_messages and that the SPI clock is running with minimal "gaps".

For an image of this optimized case as seen by a logic analyser,
please see here:
http://www.raspberrypi.org/forums/download/file.php?id=6335.

For comparisson here also an image showing the PIO in kernel spi_bcm2835
driver: http://www.raspberrypi.org/forums/download/file.php?id=6332

For example a sequence with 10 transfers with a total of 53 bytes
to/from the
device at 8MHz bus speed takes 84us, which is fairly close to the 53us that
is be the ideal transfer time for 53 bytes (without taking CS (de)asserting
into account)

The way the drivers have been designed we are also minimizing the number of
interrupts - for such a "complex" transfer we have a total of only 4
interrupts per CAN-message (2 related to GPIO (generic+pin-specific) and
2 to trigger the complete callbacks for 2 of the 3 spi_messages)

As the original spi-bcm2835 PIO-driver is known to trigger a lot of
interrupts and consumes a lot of CPU (which is unfortunately not accounted
for in vmstat)
I also have done some tests to see how much time is really left for
other work by user processes.

So I have taken the following scenario:
* CAN bus at 250khz
* running at 100% utilization with extended frames and 8 bytes data
  (note that would mean an equivalent of 25% utilization on 1MHz CAN bus)
* resulting in 1532 CAN messages/second which means one message every 
  0.616ms
* running a compile of the spi-bcm2835 driver several times measuring
  with "time" focusing on real/elapsed time
* also observing via vmstat 10

This have been tested against the following scenarios:
* no SPI drivers loaded
* MCP2515 optimize disabled + spi-bcm2835 currently in kernel
* MCP2515 optimize enabled + spi-bcm2835 currently in kernel
* MCP2515 optimize disabled + spi-bcm2835dma
* MCP2515 optimize enabled + spi-bcm2835dma

The difference between 2 and 3 shows the "gain" of avoiding running
the _spi_async validations 4800 times per second.
while the difference between 3 and 4 shows the advantage of the
use of DMA scheduling - even with the overhead for the "creation"
of the DMA sequences on every call
and the difference between 4 and 5 shows how much can be the net-gain
when optimizing/preparing those SPI messages once.
And you can also compare all of them against an idle system (option 1)

So here the results:

driver         optimize real_compile interrupts/s irq2EOT
none                 NA          50s          300     N/A                
spi-bcm2835(=PIO)   off         120s        45000   293us
spi-bcm2835(=PIO)    on         115s        45000   290us 
spi-bcm2835dma      off          76s         6700   172us
spi-bcm2835dma       on          60s         6700    82us

(irq2EOT = time between IRQ line goes down and the CS goes high after the 
last transfer)

So you can see that the optimize makes a slight difference for PIO.
You can also see that the costly DMA setup (which adds latencies to the
initial scheduling of the first byte) is still a lot more CPU friendly
than all those interrupts when using PIO.

And finally the optimize with DMA gives us closer to "idle" time
measurements - the difference is in parts the time spent in the IRQ
handlers (GPIO and DMA), but also regarding the fact that the DMAs are
producing load on the memory bus reading those control-blocks
and have an impact on L2 Cache evictions.

Obviously this "optimization" is most important for drivers that are running
spi_messages at high rates, as there the gain becomes most significant.
But it also applies to spi_messages where we transfer say 4096 bytes.

On the BCM2835 an SPI-interrupt gets triggered in PIO mode every 12 bytes
so that the interrupt handler can read and fill the FIFO before it there is
an overrun - so for those 4096 bytes we would need to handle 341 interrupts.
Assuming 8MHz BUS rate we would ideally transfer this in 0.04 seconds.

So transferring 1 MByte of data would take exactly one second and trigger
83252 interrupts - that is lots of interrupts!

By comparison the DMA driver would (in the unoptimized case) maybe take 30us
at the start of the transfer to create the DMAs to issue and then run with a
single interrupt at the end of the transfer (but on the bcm2835 it would
need to limit itself to 32 transfers of 32K each, but still a single IRQ).

In the optimized case we would run the transfer within 9us from issuing the
spi_async call.

Note that those usec measurements have been taken by "poking" GPIOs to
up/down via writel and measuring the pin-changes via a logic-analyzer as
logging via kprint kills performance and does not allow for insight over
longer periods of time.

So I hope these measurements convince you that the optimize patch below is
"worth" adding - it also includes a means for the device driver to
"vary" some parts of the spi message (source, destination, length, speed,
delay_usec), which [wc]ould change between invocations of spi_async.

As for the spi-bcm2835dma driver itself:
it is still work in progress, but I am having some questions with regards to
submitting it - some of it has to do with designing it in such a way that
parts of it may at some point may in the future get incorporated into the
spi-framework.

So some facts:
* driver still does not fully follow coding styles
* not all vary cases are not all handled correctly and fixing those requires
* it allows the use of ANY GPIO as CS not the limit of effective 2 CS
  exposed by the hardware.
  this actually started as a workaround for a buggy CS implementation in
  the HW block which might take CS up for a few 100th of a usec creating
  hickups with some devices (which only shows up when programming the
  SPI-registers via DMA transfers), but the step from making 2 GPIO work
  to make all GPIO work as CS is not really complicated...
* right now I am not really using dma-engine, as it is too CPU intensive
  setting a complex SPI message up (a typical read+write SPI message takes
  typically 21 DMA transfers)
* instead I started a new dma-fragment interface that has:
** dma_links, which are probably similar to some structures in dma_engine
** dma_fragments, which are chains of dma_links and/or dma_fragments
   the structure of which does not change and the data inside only changes
   in a few places
** dma_fragment_caches, which are caches of fragments which can get reused
   quickly without heavy setup overhead (this avoids mallocs,
   dma_pool_fetch)
   these caches also are responsible for creating fragments if needed and
   tearing them down
   there are also fetch and merge methods to create compound dma_fragments
   for final scheduling
 * there are also some "common" spi_dma_fragment parts that could easily
  get merged into the spi-core.
* there is some provision to see if an spi message contains is of a
  single/double transfer type and in those cases take an "optimized"
  spi_message (done by the driver) instead and fill it with the data
  from the submitted spi_message and make use of all those "vary" cases.
  this would reduce the DMA creation cost for those synchronous transfers
  (spi_write_then_read, spi_write, spi_read), but this is not fully in place
* per HW-document it requires 2 DMAs to handle the full transfer, but as per
  interrupt latencies unfortunately we require a 3rd DMA channel which only
  triggers the interrupt to schedule complete (the DMA needs to be "idle"
  long enough for the kernel to detect which IRQ source was responsible for
  the wakeup, so using TX-DMA is not an option, as the kernel is sometimes
  too slow to handle the interrupt before the next TX transfer is scheduled
  in the pipelined DMA case)
* currently tested with a focus on the CAN controller mcp2515 (minor testing
  with other devices)

In the end I have the following "distinct" fragments and caches for
each of those:
* spi_merged_fragment: an empty fragment used from cache into which all
  the below fragments get merged (avoids calls to malloc,... to be fast)
* spi_setup: setup SPI registers for the next transfer includes:
** setting clock
** enabling SPI-HW+resetting FIFOs
** CS assert
** preparing TX-DMA channel
  - on bcm2835 a total of 6 DMAs
* spi_transfer: do a single transfer (in the BCM2835 if the transfer is NOT
  a multiple of 4, then after the transfer the FIFOS will have to be reset)
  - on bcm2835 a total of 2 DMAs (1 RX, 1 TX)
* spi_delay: if a single delay is configured then this does the delay in DMA
  - on bcm2835 a total of 1 DMAs
* spi_csdelay: if there is a cs_change then this is responsible for it -
  - it will also do a delay prior to CS-up and keeps CS deasserted for half
  a SPI clock cycle.
  - on bcm2835 a total of 3 DMAs (delay, CS-deassert,delay)
* trigger_interrupt: trigger an interrupt - typically to call complete
  - on bcm2835 a total of 3 DMAs (2 RX, 1IRQ DMA essentially just
  triggering the interrupt, but doing nothing else)
* marks the spi_message as finished (this actually copies the 1MHz clock
  counter of the BCM2835 to a location to mark it as finished - we would
  even have absolute timings when the transfer finished (same could be
  applied for start of spi_message) if the spi core would want to expose
  such information.
  - on bcm2835 a total of 1 DMAs

So a single read/write transfer is actually made up of the following:
* spi_merged_fragment containing:
** spi_setup    (6CBs)
** spi_transfer (2CBs)
** spi_csdelay  (3CBs)
** trigger_irq  (3CBs)
** finished     (1CBs)
so a total of 15 DMA transfers for a single SPI_message with one
spi_transfer.

Each additional transfer (without cs_change) adds an additional 6 CBs
If the number of transfers so far is a multiple of 4 (full words), then
this is reduced to an additional 2 CBs(=Control Block/dma_link)

I hope these look "generic enough" to be reusable...

There is also one question here with regards to semantics:
When do we need to call the "complete" method?
When we have received all the data or when the CS has been de-asserted
(after the delay_usec)?

The reason why I ask is because for optimizations some of the irq scheduling
latencies could get hidden by triggering the complete while CS is still
asserted.

Note that I decided to go with the approach of fetch from cache and merge
(where the cache functionality is hiding some of the work about releasing
the structures again) instead of having separate link/unlink methods that
need to get called explicitly (more method pointers in the structures,...)

The central pieces of bus_driver specific code for creating/scheduling the
DMA which corresponds to a spi_message with the above framework(s) looks
like this:
-----------------------------------------------------------------------------
struct spi_merged_dma_fragment *bcm2835dma_spi_message_to_dma_fragment(
        struct spi_message *msg, int flags, gfp_t gfpflags)
{
        struct spi_device *spi    = msg->spi;
        struct spi_master *master = spi->master;
        struct bcm2835dma_spi *bs = spi_master_get_devdata(master);

        struct spi_merged_dma_fragment *merged;
        struct spi_transfer *xfer;
        int err=0;
        int is_last;

        /* some optimizations - it might help if we knew the length... */
        /* check if we got a frame that is of a single transfer */
        if ( list_is_singular(&msg->transfers) ) {
                /* check if we got something in the structure we could
use */
        }

        /* fetch a merged fragment */
        merged = (typeof(merged))
                dma_fragment_cache_fetch(
                        &bs->fragment_merged,
                        gfpflags);
        if (! merged)
                return NULL;

        /* initialize some fields */
        merged->message       = msg;
        merged->transfer      = NULL;
        merged->last_transfer = NULL;
        merged->dma_fragment.link_head = NULL;
        merged->dma_fragment.link_tail = NULL;
        merged->complete_data = NULL;
        merged->needs_spi_setup = 1;
        /* now start iterating the transfers */
        list_for_each_entry(xfer, &msg->transfers, transfer_list) {
                /* check if we are the last in the list */
                is_last = list_is_last(&xfer->transfer_list,
                                        &msg->transfers);
                /* assign the current transfer */
                merged->transfer = xfer;
                /* do we need to reconfigure spi
                   compared to the last transfer */
                if (! merged->needs_spi_setup) {
                        if (merged->last_transfer->speed_hz
                                != xfer->speed_hz)
                                merged->needs_spi_setup = 1;
                        else if (merged->last_transfer->tx_nbits
                                != xfer->tx_nbits)
                                merged->needs_spi_setup = 1;
                        else if (merged->last_transfer->rx_nbits
                                != xfer->rx_nbits)
                                merged->needs_spi_setup = 1;
                        else if (merged->last_transfer->bits_per_word
                                != xfer->bits_per_word)
                                merged->needs_spi_setup = 1;
                }
                /* if we have no last_transfer,
                   then we need to setup spi */
                if (merged->needs_spi_setup) {
                        err = spi_merged_dma_fragment_merge_fragment_cache(
                                &bs->fragment_setup_spi,
                                merged,
                                gfpflags);
                        if (err)
                                goto error;
                        merged->needs_spi_setup = 0;
                }
                /* add transfer if the transfer length is not 0
                   or if we vary length */
                if ( (xfer->len)
                        /* || (xfer->vary & SPI_OPTIMIZE_VARY_LENGTH) */) {
                        /* schedule transfer */
                        err = spi_merged_dma_fragment_merge_fragment_cache(
                                &bs->fragment_transfer,
                                merged,
                                gfpflags);
                        if (err)
                                goto error;
                        /* set last transfer */
                        merged->last_transfer=xfer;
                }
                /* add cs_change with optional extra delay
                   if requested or last in sequence */
                if ((xfer->cs_change)||(is_last)) {
                        err = spi_merged_dma_fragment_merge_fragment_cache(
                                &bs->fragment_cs_deselect,
                                merged,
                                gfpflags);
                } else if ( (xfer->delay_usecs)
                        /* || (xfer->vary & SPI_OPTIMIZE_VARY_DELAY) */) {
                        /* or add a delay if requested */
                        err = spi_merged_dma_fragment_merge_fragment_cache(
                                &bs->fragment_delay,
                                merged,
                                gfpflags);
                }

                if (err)
                        goto error;
        }
        /* and add an interrupt if we got a callback to handle
         * if there is no callback, then we do not need to release it
         * immediately - even for prepared messages
         */
        if (msg->complete) {
                err=spi_merged_dma_fragment_merge_fragment_cache(
                        &bs->fragment_trigger_irq,
                        merged,
                        gfpflags);
                if (err)
                        goto error;
        }

        /* and the "end of transfer flag"
         * - must be the last to avoid races... */
        err=spi_merged_dma_fragment_merge_fragment_cache(
                &bs->fragment_message_finished,
                merged,
                gfpflags);
        if (err)
                goto error;

        /* reset transfers, as these are invalid by the time
         * we run the transforms */
        merged->transfer      = NULL;
        merged->last_transfer = NULL;

        /* and return it */
        return merged;

error:
        dev_printk(KERN_ERR,&spi->dev,
                "bcm2835dma_spi_message_to_dma_fragment: err=%i\n",
                err);
        spi_merged_dma_fragment_dump(
                merged,
                &msg->spi->dev,
                0,0,
                &bcm2835_dma_link_dump
                );
        return NULL;
}

static int bcm2835dma_spi_message_optimize(struct spi_message *message)
{
        message->state = bcm2835dma_spi_message_to_dma_fragment(
                message,
                0,
                GFP_ATOMIC);
        if (!message->state)
                return -ENOMEM;
        if (unlikely(debug_dma & DEBUG_DMA_OPTIMIZE)) {
                dev_printk(KERN_INFO,&message->spi->dev,
                        "Optimizing %pf %pf\n",message,message->state);
                if (debug_dma & DEBUG_DMA_OPTIMIZE_DUMP_FRAGMENT) {
                        spi_merged_dma_fragment_dump(message->state,
                                                &message->spi->dev,
                                                0,0,
                                                &bcm2835_dma_link_dump
                                );
                }
        }

        return 0;
}

static void bcm2835dma_spi_message_unoptimize(struct spi_message *msg)
{
        dma_fragment_release(msg->state);
        msg->state=NULL;
        if (unlikely(debug_dma&DEBUG_DMA_OPTIMIZE)) {
                dev_printk(KERN_INFO,&msg->spi->dev,
                        "Unoptimizing %pf\n",msg);
        }
}

static int bcm2835dma_spi_transfer(struct spi_device *spi,
                                struct spi_message *message)
{
        int err=0;
        struct spi_merged_dma_fragment *merged;
        struct spi_master *master = spi->master;
        struct bcm2835dma_spi *bs = spi_master_get_devdata(master);
        unsigned long flags;

        /* fetch DMA fragment */
#ifdef SPI_HAVE_OPTIMIZE
        if ((message->is_optimized) ) {
                merged = message->state;
        } else
#endif
        {
                merged = bcm2835dma_spi_message_to_dma_fragment(
                        message,
                        0,
                        GFP_ATOMIC);
                message->state = merged;
        }
        if (!merged)
                return -ENOMEM;
        /* assign some values */
        message->actual_length = 0;

        /* and execute the pre-transforms */
        err = spi_merged_dma_fragment_execute_pre_dma_transforms(
                merged,merged,GFP_ATOMIC);
        if (err)
                goto error;

        if (unlikely(debug_dma & DEBUG_DMA_ASYNC)) {
                dev_printk(KERN_INFO,&message->spi->dev,
                        "Scheduling Message %pf fragment %pf\n",
                        message,merged);
                if (debug_dma & DEBUG_DMA_ASYNC_DUMP_FRAGMENT) {
                        spi_merged_dma_fragment_dump(merged,
                                                &message->spi->dev,
                                                0,0,
                                                &bcm2835_dma_link_dump
                                );
                }
        }

        /* statistics on message submission */
        spin_lock_irqsave(&master->queue_lock,flags);
        bs->count_spi_messages++;
#ifdef SPI_HAVE_OPTIMIZE
        if ((message->is_optimized) )
                bs->count_spi_optimized_messages++;
#endif
        spin_unlock_irqrestore(&master->queue_lock,flags);

        /* and schedule it */
        if (merged) {
                bcm2835dma_schedule_dma_fragment(message);
        }

        /* and return */
        return 0;
error:
        dev_printk(KERN_ERR,&spi->dev,"spi_transfer_failed: %i",err);
        dma_fragment_release(&merged->dma_fragment);
        return -EPERM;
}

static int bcm2835dma_spi_probe(struct platform_device *pdev)
{
...
        master->transfer = bcm2835dma_spi_transfer;
#ifdef SPI_HAVE_OPTIMIZE
        if (use_optimize) {
                master->optimize_message =
                        bcm2835dma_spi_message_optimize;
                master->unoptimize_message =
                        bcm2835dma_spi_message_unoptimize;
        }
#endif
...
}

-----------------------------------------------------------------------------
The diff (against 3.13) to the framework to allow spi_optimize to work:

  *    are not GPIOs (driven by the SPI controller itself).
@@ -412,7 +418,8 @@ struct spi_master {
                    struct spi_message *message);
     int (*unprepare_message)(struct spi_master *master,
                  struct spi_message *message);
-
+    int (*optimize_message)(struct spi_message *message);
+    void (*unoptimize_message)(struct spi_message *message);
     /*
      * These hooks are for drivers that use a generic implementation
      * of transfer_one_message() provied by the core.
@@ -506,6 +513,8 @@ extern struct spi_master *spi_busnum_to_master(u16
busnum);
  * @delay_usecs: microseconds to delay after this transfer before
  *    (optionally) changing the chipselect status, then starting
  *    the next transfer or completing this @spi_message.
+ * @vary: allows a driver to mark a SPI transfer as "modifyable" on the
+ *      specific pieces of information
  * @transfer_list: transfers are sequenced through @spi_message.transfers
  *
  * SPI transfers always write the same number of bytes as they read.
@@ -584,6 +593,12 @@ struct spi_transfer {
     u8        bits_per_word;
     u16        delay_usecs;
     u32        speed_hz;
+#define SPI_OPTIMIZE_VARY_TX_BUF               (1<<0)
+#define SPI_OPTIMIZE_VARY_RX_BUF               (1<<1)
+#define SPI_OPTIMIZE_VARY_SPEED_HZ             (1<<2)
+#define SPI_OPTIMIZE_VARY_DELAY_USECS          (1<<3)
+#define SPI_OPTIMIZE_VARY_LENGTH               (1<<4)
+    u32             vary;
 
     struct list_head transfer_list;
 };
@@ -594,6 +609,9 @@ struct spi_transfer {
  * @spi: SPI device to which the transaction is queued
  * @is_dma_mapped: if true, the caller provided both dma and cpu virtual
  *    addresses for each transfer buffer
+ * @is_optimized: if true, then the spi_message has been processed with
+ *      spi_message_optimize() - @state belongs to the spi-driver now
+ *      and may not get set by the driver
  * @complete: called to report transaction completions
  * @context: the argument to complete() when it's called
  * @actual_length: the total number of bytes that were transferred in all
@@ -622,6 +640,8 @@ struct spi_message {
     struct spi_device    *spi;
 
     unsigned        is_dma_mapped:1;
+#define SPI_HAVE_OPTIMIZE
+    unsigned                is_optimized:1;
 
     /* REVISIT:  we might want a flag affecting the behavior of the
      * last transfer ... allowing things like "read 16 bit length L"
@@ -655,6 +675,9 @@ static inline void spi_message_init(struct
spi_message *m)
     INIT_LIST_HEAD(&m->transfers);
 }
 
+int spi_message_optimize(struct spi_device *s,struct spi_message *m);
+void spi_message_unoptimize(struct spi_message *m);
+
 static inline void
 spi_message_add_tail(struct spi_transfer *t, struct spi_message *m)
 {

-----------------------------------------------------------------------------

So the questions are:
Are there any major questions on this design?
Does it look acceptable from a design perspective?
Is it generic enough that it might get used with other devices as well?
Do we need all those "VARY" cases or would we need to vary by more fields?
I still think that the optimize/unoptimize_message methods are quite similar
to prepare/unprepare_message so the question is if these can not get merged
into a single pair of methods.

Some of those influence the back-end parts that do the setup of the initial
and link-time transfers (which I have not shared yet and which need most
cleanup)

Also it influences the final design on vary in the which is still a bit more
complicated (and as mentioned not fully working at this moment).

As soon I have some guidance which way to go I will do a final cleanup and
and more documentation and share it in the form of patches...

Also the question is how I should initially submit those "generic" components.
Should I create a helper.h and helper.c with those components, so that they
can get split out later or should I keep them separate from the start.

Thanks,
       Martin

P.s: people interrested in the still "unfit" code as of now please look at
the out of tree driver at: https://github.com/msperl/spi-bcm2835

p.p.s: I hope the formatting of the email is fine - if not, then
apologies...


--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Mark Brown April 2, 2014, 6:15 p.m. UTC | #1
On Wed, Apr 02, 2014 at 06:06:11PM +0200, Martin Sperl wrote:

Just a general meta comment: your mail is over 700 lines long, that's
way too long.  Please split things up into smaller chunks or take a
higher level view as something this long is just asking far too much of
the reader especially without work to structure the text.  Things like
splitting out the device specifics help a lot.

I've just skimmed through this for now, I'll try to find time to read in
more detail but there's a backlog of patches I need to get to.

> * right now I am not really using dma-engine, as it is too CPU intensive
>   setting a complex SPI message up (a typical read+write SPI message takes
>   typically 21 DMA transfers)
> * instead I started a new dma-fragment interface that has:

This needs to be resolved for mainlining, if dmaengine is not up to the
job we need to either decide to replace it wholesale or extend it (more
likely the latter) rather than just doing something custom.  I do see
the need to improve the DMA APIs here but that needs to be done in a way
that's joined up with other work on DMA.

> * there is some provision to see if an spi message contains is of a
>   single/double transfer type and in those cases take an "optimized"
>   spi_message (done by the driver) instead and fill it with the data
>   from the submitted spi_message and make use of all those "vary" cases.
>   this would reduce the DMA creation cost for those synchronous transfers
>   (spi_write_then_read, spi_write, spi_read), but this is not fully in place

I don't know what a "single/double transfer type" is.

> There is also one question here with regards to semantics:
> When do we need to call the "complete" method?
> When we have received all the data or when the CS has been de-asserted
> (after the delay_usec)?

Users might reasonably assume that the transfer has finished at the time
any callback runs and that has to include the /CS change otherwise the
device might not have reacted.

> So the questions are:
> Are there any major questions on this design?
> Does it look acceptable from a design perspective?
> Is it generic enough that it might get used with other devices as well?

I was having a bit of a wood from the trees problem following the design
here but broadly there does seem to be a reasonable amount of usefulness
and reusability here.  A lot of systems are going to struggle to use the
full feature set due to restrictions on things like minimum DMA transfer
sizes and not being able to DMA to control registers and like I say the
DMA API issues need to be addressed.  Trying to do chip selects with DMA
is also going to be a lot of fun in the general case, especially with
GPIOs.

I also suspect that there is a cutoff point where PIO makes sense even
if we can DMA.

> Do we need all those "VARY" cases or would we need to vary by more fields?

The main things are probably replacing and changing the data.

> I still think that the optimize/unoptimize_message methods are quite similar
> to prepare/unprepare_message so the question is if these can not get merged
> into a single pair of methods.

Drivers are currently using prepare_message() for actual hardware setup
specific to the message that's done around actual transmission whereas
the optimisation can be done very distant to the actual use of the
message.

> Also the question is how I should initially submit those "generic" components.
> Should I create a helper.h and helper.c with those components, so that they
> can get split out later or should I keep them separate from the start.

I'm not sure what those helper bits are but in general keeping things so
that they can be reused is best.
Martin Sperl April 2, 2014, 8:40 p.m. UTC | #2
Thanks for the feedback.

Will try to keep emails shorter - I wanted to get everything out in one
and document it also trying to answer questions I thought would come up.

So I will take it that I can continue from where I am right now.

Just to summarize the current "modules" and their respective files:

spi-optimize - essentially the patch shared with my original email touching:
include/linux/spi/spi.h
driver/spi/spi.c

maybe adding some documentation on how to "optimize" high-volume drivers
to make optimal use of the interface.

basic generic dma-fragment support (structures/methods):
include/linux/dma-fragment.h
drivers/dma/dma-fragment.c
(maybe add: Documentation/dma-fragment.txt)

bcm2835 specific fragment support (dumping, scheduling, linking):
include/linux/dma/bcm2835-dma.h
drivers/dma/bcm2835-dma.c

generic spi specific dma-fragment code: 
include/linux/spi/spi-dmafragment.h
drivers/spi/spi-dmafragment.c
mostly about extensions pre/post dma transforms for vary support 
- also handling DMA-mapping, and data for the state-machine used in transformation
from spi-message to dma_fragment)
this could get integrated with spi.h/spi.c if it is deemed sensible

spi-bcm2835dma driver:
include/linux/spi/bcm2835.h (SPI DMA registers and Bits)
drivers/spi/spi-bcm2835dma.h (datastructures and defines)
drivers/spi/spi-bcm2835dma_drv.c (driver itself)
drivers/spi/spi-bcm2835dma_frag.c (the code filling/creating the fragments)

Does this look "ok" from an approach perspective?

I will share one module after the other for initial review.

I will brush up the code for the generic DMA fragment code, so that it
becomes presentable and will share that part in the hope that it live in
parallel to dmaengine and/or get somehow integrated - possibly as another target
besides cyclic.

It is really mostly management of those dma blocks so that they can get
joined together easily as well as the management of a group of those blocks.

Also it is designed to be generic that is why there is the bcm2835 specific code
as well as a separate module.

As such drivers using this specific interface will need to know enough about
the HW itself, my assumption is that the driver will need to know about the DMA
engine used to work properly - so I left all sorts of abstractions out of the 
design - this can get added later if needed...
Anyway this should come with deeper integration with dma-engine.

With regards to VARY: I have used the "options" provided to cover most of the
bases that I thought I would need to "auto-optimize" those simple cases of
spi_read, spi_write, spi_write_then_read. The VARY_DELAY_USEC was just added as
an after-thought, as it is not so "complex" to enable this as a feature...

As for the measurements on effective CPU utilization shared originally - is the
data shared understood/convincing? Does it need more explanation?

Ciao, Martin

P.s: as an afterthought: I actually think that I could implement a DMA driven
bit-bang SPI bus driver with up to 8 data-lines using the above dma_fragment 
approach - not sure about the effective clock speed that this could run... 
But right now it is not worth pursuing that further. --
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown April 3, 2014, 10:02 p.m. UTC | #3
On Wed, Apr 02, 2014 at 10:40:41PM +0200, Martin Sperl wrote:

> Will try to keep emails shorter - I wanted to get everything out in one
> and document it also trying to answer questions I thought would come up.

Sending several mails might've helped too - the way a patch series is
structured is actually a fairly good way of breaking things up for
comprehnsibility.

> Just to summarize the current "modules" and their respective files:

> spi-optimize - essentially the patch shared with my original email touching:
> include/linux/spi/spi.h
> driver/spi/spi.c

> maybe adding some documentation on how to "optimize" high-volume drivers
> to make optimal use of the interface.

There should be some win from this purely from the framework too even
without drivers doing anything.

> Does this look "ok" from an approach perspective?

Broadly.  Like I say the DMA stuff is the biggest alarm bell - if it's
not playing nicely with dmaengine that'll need to be dealt with.

> I will share one module after the other for initial review.

It's OK to post the lot at once as part of a series but it needs to be
split up into separate logical patches.  Hopefully we can get the
externally visible optimisation stuff merged relatively quickly so that
drivers can start building on it.

> I will brush up the code for the generic DMA fragment code, so that it
> becomes presentable and will share that part in the hope that it live in
> parallel to dmaengine and/or get somehow integrated - possibly as another target
> besides cyclic.

Yes, something like that.  The most important thing is to make it usable
with any driver.

> As such drivers using this specific interface will need to know enough about
> the HW itself, my assumption is that the driver will need to know about the DMA
> engine used to work properly - so I left all sorts of abstractions out of the 
> design - this can get added later if needed...
> Anyway this should come with deeper integration with dma-engine.

That would seem very surprising - I'd really have expected that we'd be
able to expose enough capability information from the DMA controllers to
allow fairly generic code; there's several controllers that have to work
over multiple SoCs.

> P.s: as an afterthought: I actually think that I could implement a DMA driven
> bit-bang SPI bus driver with up to 8 data-lines using the above dma_fragment 
> approach - not sure about the effective clock speed that this could run... 
> But right now it is not worth pursuing that further. 

Right, and it does depend on being able to DMA to set GPIOs which is
challenging in the general case.
Martin Sperl April 4, 2014, 2:17 p.m. UTC | #4
Hi Mark!

This is again a long email - trying to answer your questions/concerns.

On 04.04.2014 00:02, Mark Brown wrote:
> There should be some win from this purely from the framework too even
> without drivers doing anything.
> 


If the device-driver does not do anything, then there is no cost involved with
framework (ok - besides an additional if "(!msg->is_optimized) ...")

If the bus driver does not support optimization there is still some win. 

I have shared the "optimization" data already, but here again the overview:

Running a compile of the driver several times (measuring elapsed time)
with different driver/optimize combinations:

driver         optimize real_compile interrupts/s irq2EOT
none                 NA          50s          300     N/A                
spi-bcm2835         off         120s        45000   293us
spi-bcm2835          on         115s        45000   290us 
spi-bcm2835dma      off          76s         6700   172us
spi-bcm2835dma       on          60s         6700    82us

For the "default" driver essentially the CPU cycles available to userspace
went up from 41.6% to 43.5% (=50s/115s). It is not much, but it is still something. This is achived by cutting out "_spi_verify", which is mostly
"_spi_async()" in the current code.

But if you take now the optimize + DMA-driver case: we have 83.3% (=50s/60s)
CPU available for userspace. And without optimize: 65.8% (=50s/76s).
And both of those numbers are big wins!

Note that the first version of the driver did not implement caching for
fragments, but was rebuilding the full DMA-chain on the fly each time, the
available CPU cycles was somewhere in the 45-50% range, so better than the 
stock driver, but not by much. Merging only 5 Fragments is way more efficient
than building 19 dma controlblocks from scratch including the time for 
allocation, filling with data, deallocation,...

As for "generic/existing unoptimized" device-drivers - as mentioned - there is
the idea of providing an auto-optimize option for common spi_read, spi_write, 
spi_write_then_read type cases (by making use of VARY and optimize on some 
driver-prepared messages)

For the framework there might also be the chance to do some optimizations of
its own when "spi_optimize" gets called for a message.
There the framework might want to call the spi_prepare methods only once.
But I do not fully know the use-cases and semantics for prepare inside the
framework - you say it is different from the optimize I vision.

A side-effect of optimize means that ownership of state and queue members is
transferred to the framework/bus_driver and only those fields flagged by vary
may change. There may be some optimizations possible for the framework based
on this "transfer of ownership"...

> That would seem very surprising - I'd really have expected that we'd be
> able to expose enough capability information from the DMA controllers to
> allow fairly generic code; there's several controllers that have to work
> over multiple SoCs.
> 


It is mostly related to knowing the specific registers which you need to set...
How to make it more abstract I have not yet figured it out.

But it might boil down to something like that:
* create Fragment
* add Poke(frag,Data,bus_address(register))
* add Poke ...

As of now I am more explicit yet, which is also due to the fact that I want
to be able to handle a few transfer cases together (write only, read only,
read-write), which require slightly different DMA parameters - and the VARY
interface should allow me to handle all together with minimal setup overhead.

But for this to know you need to "know" the DMA capabilities to make the most
of it - maybe some abstraction is possible there as well...

But it is still complicated by the fact that the driver needs to use 3 DMA 
channels to drive SPI. As mentioned actually 2, but the 3rd is needed to 
stably trigger a completion interrupt without any race conditions, that 
would inhibit the DMA interrupt to really get called (irq-flag might have
been cleared already).

So this is quite specific to the DMA + SPI implementation.

>> P.s: as an afterthought: I actually think that I could implement a DMA driven
>> bit-bang SPI bus driver with up to 8 data-lines using the above dma_fragment 
>> approach - not sure about the effective clock speed that this could run... 
>> But right now it is not worth pursuing that further. 
>> 
> Right, and it does depend on being able to DMA to set GPIOs which is
> challenging in the general case.

"pulling" GPIO up/down - on the BCM2835 it is fairly simple:
to set a GPIO: write to GPIOSET registers with the corresponding bits
 (1<<GPIOPIN) set.
To clear it: write to GPIOCLEAR registers again with the same mask.
So DMA can set all or 0 GPIO pins together.

One can set/clear up to 32 GPIO with a single writel or DMA.

Drawback is that it needs two writes to set an exact value for multiple GPIOs
and under some circumstances you need to be aware of what you do.

This feature is probably due to the "dual" CPU design ARM + VC4/GPU, which
allows to work the GPIO pins from both sides without any concurrency issues
(as long as the ownership of the specific pin is clear).
The concurrency is serialized between ARM, GPU and DMA via the common AXI bus.

Unfortunately the same is NOT possible for changing GPIO directions /
alternate functions (but this is supposed to be rarer, so it can get
arbitrated between components...)

> Broadly.  Like I say the DMA stuff is the biggest alarm bell - if it's
> not playing nicely with dmaengine that'll need to be dealt with.
> 

As for DMA-engine: The driver should (for the moment) also work with minimal
changes also on the foundation kernel - there is a much bigger user base 
there, that use it for LCD displays, CAN controllers, ADCs and more - so it
gets more exposure to different devices than I can access myself.

But still: I believe I must get the basics right first before I can start
addressing DMAengine.

And one of the issues I have with DMA-engine is that you always have to set up
tear down the DMA-transfers (at least the way I understood it) and that is why
I created this generic DMA-fragment interface which can also cache some of
those DMA artifacts and allows chaining them in individual order.

So the idea is to take that to build the DMA-control block chain and then pass
it on to the dma-engine.

Still a lot of things are missing - for example if the DMA is already running
and there is another DMA fragment to execute the driver chains those fragments
together in the hope that the DMA will continue and pick it up.

Here the stats for 53M received CAN messages:
root@raspberrypi:~/spi-bcm2835# cat /sys/class/spi_master/spi0/stats
bcm2835dma_stats_info - 0.1
total spi_messages:     160690872
optimized spi_messages: 160690870
started dma:            53768175
linked to running dma:  106922697
last dma_schedule type: linked
dma interrupts:         107127237
queued messages:        0

As explained, my highly optimized device driver schedules 3 spi_messages.
the first 2 together the 3rd in the complete function of the 1st message.
And the counters for "linked to running dma" is about double the counter
of "started DMA".

The first spi_message will need to get stated normally (as it is typically
idle) while the 2nd and 3rd are typically linked.

If you do the math this is linking happens in 66.54% of all spi_messages.
Under ideal circumstances this value should be 66.666666% (=2/3).

So there are times when the ARM is slightly too slow and typically the 3rd 
message is really scheduled only when the DMA has already stopped.
Running for more than 2 days with 500M CAN-messages did not show any further races (but the scheduling needs to make heavy use of dsb() that this does not
happen....)

This kind of thing is something that DMA-engine does not support as of now.
But prior to getting something like this accepted it first needs a proof
that it works... 

And this is the POC that shows that it is possible and gives huge gains
(at least on some platforms)...

Hope this answers your questions.

Ciao, Martin--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown April 10, 2014, 10:35 p.m. UTC | #5
On Fri, Apr 04, 2014 at 04:17:24PM +0200, martin sperl wrote:
> On 04.04.2014 00:02, Mark Brown wrote:

> > There should be some win from this purely from the framework too even
> > without drivers doing anything.

> If the device-driver does not do anything, then there is no cost involved with
> framework (ok - besides an additional if "(!msg->is_optimized) ...")

> If the bus driver does not support optimization there is still some win. 

Right, I was more pointing out that you were underselling here - you
were talking only about the benefits when everything is updated.  This
also means we can usefully merge partial support, addressing only some
of the improvements is still going to give some wins and makes it easier
to digest the code overall.

> For the framework there might also be the chance to do some optimizations of
> its own when "spi_optimize" gets called for a message.
> There the framework might want to call the spi_prepare methods only once.
> But I do not fully know the use-cases and semantics for prepare inside the
> framework - you say it is different from the optimize I vision.

The main thing I'd like to do is actually coalesce adjacent transfers
within a message if they are compatible and we're doing DMA so we end up
with fewer hardware interactions overall.  There's a fairly common
pattern of writing multiple registers (for firmware download or setting
coefficients) where you have a small transfer for a register then a
larger one for a data block which would benefit.

For prepare I'd expect we can get to skipping reconfiguring the hardware
entirely when not needed even without an optimisation step, though the
tradeoff might not be worth it.  This may be required more than once in
a message though.

> > That would seem very surprising - I'd really have expected that we'd be
> > able to expose enough capability information from the DMA controllers to
> > allow fairly generic code; there's several controllers that have to work
> > over multiple SoCs.

> It is mostly related to knowing the specific registers which you need to set...
> How to make it more abstract I have not yet figured it out.

> But it might boil down to something like that:
> * create Fragment
> * add Poke(frag,Data,bus_address(register))
> * add Poke ...

Registers where?  If it's in the DMA controller we can just add ops for
the DMA controller can't we?

> But it is still complicated by the fact that the driver needs to use 3 DMA 
> channels to drive SPI. As mentioned actually 2, but the 3rd is needed to 
> stably trigger a completion interrupt without any race conditions, that 
> would inhibit the DMA interrupt to really get called (irq-flag might have
> been cleared already).

Right, that's a peculiarity of your system (hopefully) which makes life
difficult for that.  The Samsung controller has a vaugely similar thing
where the transmit interrupt goes off too early for what we want so we
need to ensure there's always a read if we're doing interrupts but that
is a lot easier to work around.

> > Right, and it does depend on being able to DMA to set GPIOs which is
> > challenging in the general case.

> "pulling" GPIO up/down - on the BCM2835 it is fairly simple:
> to set a GPIO: write to GPIOSET registers with the corresponding bits
>  (1<<GPIOPIN) set.
> To clear it: write to GPIOCLEAR registers again with the same mask.
> So DMA can set all or 0 GPIO pins together.

That's quite convenient for this sort of use - lots of controllers just
have a register where you need to set the values of some GPIOs so you
end up doing a read/modify/write which is obviously hard with DMA.  We
would need gpiolib to export an interface to allow this operation; it
should be doable but I expect it's like the regular bulk set and tends
to get bogged down.

> > Broadly.  Like I say the DMA stuff is the biggest alarm bell - if it's
> > not playing nicely with dmaengine that'll need to be dealt with.

> As for DMA-engine: The driver should (for the moment) also work with minimal
> changes also on the foundation kernel - there is a much bigger user base 
> there, that use it for LCD displays, CAN controllers, ADCs and more - so it
> gets more exposure to different devices than I can access myself.

It's good to talk early so the ideas are in people's heads if nothing
else, it also helps ensure everyone is travelling in the same direction.

> But still: I believe I must get the basics right first before I can start
> addressing DMAengine.

Sounds like you already have a pretty good prototype?

> And one of the issues I have with DMA-engine is that you always have to set up
> tear down the DMA-transfers (at least the way I understood it) and that is why
> I created this generic DMA-fragment interface which can also cache some of
> those DMA artifacts and allows chaining them in individual order.

> So the idea is to take that to build the DMA-control block chain and then pass
> it on to the dma-engine.

Right, that's the biggest thing I can see in what you're suggesting for
dmaengine and it sounds like something other users could also benefit
from so I'd not expect massive pushback.

> This kind of thing is something that DMA-engine does not support as of now.
> But prior to getting something like this accepted it first needs a proof
> that it works... 

On the other hand you may find that what you've got already is enough
for people, or get a feel for the sort of things people are looking for
to show a win.  Sometimes you'll even get "oh, that's a really good idea
I'll just add the feature" (if you're very lucky).

BTW I have acquired a RPi for some other use - is the CAN board you are
using something I can order?  I'm not sure if I'd ever have much time to
look at it but I'm generally trying to expand the range of SPI hardware
I have available for test.
Martin Sperl April 11, 2014, 12:40 p.m. UTC | #6
On 11.04.2014 00:35, Mark Brown wrote:
> Right, I was more pointing out that you were underselling here - you were 
> talking only about the benefits when everything is updated. This also means
> we can usefully merge partial support, addressing only some of the 
> improvements is still going to give some wins and makes it easier to digest
> the code overall.
Well - I thought I had shown this with my shared timing-data already... 
(it was just a 5s gain for the PIO driver, but still...)

> The main thing I'd like to do is actually coalesce adjacent transfers within
> a message if they are compatible and we're doing DMA so we end up with fewer
> hardware interactions overall. There's a fairly common pattern of writing
> multiple registers (for firmware download or setting coefficients) where you
> have a small transfer for a register then a larger one for a data block which
> would benefit.

I just found out something similar just a few days ago: 
doing a write then read with 2+1 is more effort to handle than a read/write
of 3 bytes and some moving of data arround. Also it actually saves memory,
as an spi_transfer is bigger than those 3 extra bytes needed.
With the DMA Driver this saved me 4us on overall transfer time -  It dropped
from 52us to 47us. For comparisson: the real data transfer for those 36 
bytes (without 7x CS) takes 35us - so we are pretty close to ideal...

There is not so much of a difference with the PIO drivers as they take 200us
for the same, but it sees a drop on the number of interrupts (typically 
1 IRQ/transfer), which increases CPU availability for userspace...
Redesigning the interrupt-handler to push bytes spanning multiple 
spi_transfers into the fifo together would help, but then it becomes much 
more complex - at least the spi-bcm2835 does call "complete(&bs->done);"
when it has finished the transfer.

I believe I will add all those findings to the documentation with the 
spi_optimize patch...

> Registers where?  If it's in the DMA controller we can just add ops for
> the DMA controller can't we?

Possibly - still working on cleaning up the code and making it as generic
as possible

> Right, that's a peculiarity of your system (hopefully) which makes life
> difficult for that. The Samsung controller has a vaugely similar thing
> where the transmit interrupt goes off too early for what we want so we 
> need to ensure there's always a read if we're doing interrupts but that
> is a lot easier to work around.

Specs says to use 2 Interrupts one for feeding data to FIFO the other to
read data from FIFO. Probably one could use multiple interleaved DMAs for
reads and writes, but then you would be limited to 32 byte transfers and
would have some "gaps" between the fifos filling/emptying - this way at 
least 65535 bytes can get sent with a single transfer...

The interrupt DMA is more on the tricky side - if we had to have an 
interrupt every time an spi_message finished, it would be possible with
2 DMAs, but as we have the possibility of chaining multiple spi_messages
together for higher thruput, then this "waiting" for interrupt is not
possible (IRQ flag gets sometimes cleared by the next DMA controlblock),
so we need to have another "stable" IRQ source and hence DMA3.
(actually I was using the TX DMA for that, which gets started by the
RX-DMA, so we have a race-time between DMA for IRQ and the interrupt
handler on the CPU reading the registers quickly enough to find out the
source of the IRQ)

> That's quite convenient for this sort of use - lots of controllers
> just have a register where you need to set the values of some GPIOs so
> you end up doing a read/modify/write which is obviously hard with DMA.
> We would need gpiolib to export an interface to allow this operation; 
> it should be doable but I expect it's like the regular bulk set and 
> tends to get bogged down.

yes - it is most convenient - but on the other hand it is also a bit of
a pain if you need to do that on the ARM, as you have to do 2 writes for
a single change in gpiolib and ordering could become a problem with 
bitbanged SPI, but it is possible (zero first, set last...

> Sounds like you already have a pretty good prototype?
> 
As said: it mostly works, but i need to cleanup the code before I can
post it, but before I do that I had to finish the CAN driver that makes
use of the features to see what pitfalls we still have.

One of those found is the fact that an optimized message can only reliably
get used with spi_async. For spi_sync use it has to have the complete 
function defined when optimizing, as it would otherwise not implement a
callback IRQ in DMA. And then when spi_sync sets the complete function
the DMA would still issue the optimized message without interrupts.

Also there is one other race-condition, that happens sometimes quickly
sometimes after 48 hours, but I think I might have an idea but for that
I need more info on the event itself...

> Right, that's the biggest thing I can see in what you're suggesting for
> dmaengine and it sounds like something other users could also benefit
> from so I'd not expect massive pushback.
> 
> 
>> This kind of thing is something that DMA-engine does not support as of now.
>> But prior to getting something like this accepted it first needs a proof
>> that it works... 
>> 
> On the other hand you may find that what you've got already is enough
> for people, or get a feel for the sort of things people are looking for
> to show a win.  Sometimes you'll even get "oh, that's a really good idea
> I'll just add the feature" (if you're very lucky).
> 
So who else should I involve as soon as I have presentable code?
(coding standard issues I have already fixed - mostly...)

> BTW I have acquired a RPi for some other use - is the CAN board you are
> using something I can order?  I'm not sure if I'd ever have much time to
> look at it but I'm generally trying to expand the range of SPI hardware
> I have available for test.

Could be as simple as a bread-board + mcp2515 + mcp2551 + 16MHz Oszillator 
+ 4 Resistors + 2-3 Capacitors for the chips...

Other people have bought the pican module, which essentially implements
the same, but is specific to the PI...
http://skpang.co.uk/catalog/pican-canbus-board-for-raspberry-pi-p-1196.html

And in both cases you probably will need a peer which accepts messages or
sends them...

It is a bit tricky - especially the setting up of the correct Baud rate
on the CAN-Bus side - so people are complaining that it is hard to make
it work...

Ciao,
        Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown April 21, 2014, 10:20 p.m. UTC | #7
On Fri, Apr 11, 2014 at 02:40:07PM +0200, martin sperl wrote:
> On 11.04.2014 00:35, Mark Brown wrote:

> > The main thing I'd like to do is actually coalesce adjacent transfers within
> > a message if they are compatible and we're doing DMA so we end up with fewer
> > hardware interactions overall. There's a fairly common pattern of writing
> > multiple registers (for firmware download or setting coefficients) where you
> > have a small transfer for a register then a larger one for a data block which
> > would benefit.
> 
> I just found out something similar just a few days ago: 
> doing a write then read with 2+1 is more effort to handle than a read/write
> of 3 bytes and some moving of data arround. Also it actually saves memory,
> as an spi_transfer is bigger than those 3 extra bytes needed.
> With the DMA Driver this saved me 4us on overall transfer time -  It dropped
> from 52us to 47us. For comparisson: the real data transfer for those 36 
> bytes (without 7x CS) takes 35us - so we are pretty close to ideal...

It can be even nicer for drivers where there's some minimum limit on the
size of transfers (quite a lot of them), it can sometimes allow short
writes to be converted from being PIO to being combined with an adjacent
DMAs.  Possibly even two PIOs making a DMA, though that case is more on
the edge.

> There is not so much of a difference with the PIO drivers as they take 200us
> for the same, but it sees a drop on the number of interrupts (typically 
> 1 IRQ/transfer), which increases CPU availability for userspace...
> Redesigning the interrupt-handler to push bytes spanning multiple 
> spi_transfers into the fifo together would help, but then it becomes much 
> more complex - at least the spi-bcm2835 does call "complete(&bs->done);"
> when it has finished the transfer.

It's probably not worth doing unless it's factored out into the
framework and affects lots of drivers; it seems most likely to get done
as part of a general improvement in the ability to drive the queue from
interrupt context.  Most systems won't be able to do the fully DMAed
pipeline the Pi can.

> > Right, that's a peculiarity of your system (hopefully) which makes life
> > difficult for that. The Samsung controller has a vaugely similar thing
> > where the transmit interrupt goes off too early for what we want so we 
> > need to ensure there's always a read if we're doing interrupts but that
> > is a lot easier to work around.

> Specs says to use 2 Interrupts one for feeding data to FIFO the other to
> read data from FIFO. Probably one could use multiple interleaved DMAs for
> reads and writes, but then you would be limited to 32 byte transfers and
> would have some "gaps" between the fifos filling/emptying - this way at 
> least 65535 bytes can get sent with a single transfer...

Sure, that's standard though looking at the code in drivers it seems
there's a fairly common pattern of just ignoring one of the completions,
usually the transmit one, at the SPI level since we can guarantee that
the other will always come later.

> The interrupt DMA is more on the tricky side - if we had to have an 
> interrupt every time an spi_message finished, it would be possible with
> 2 DMAs, but as we have the possibility of chaining multiple spi_messages
> together for higher thruput, then this "waiting" for interrupt is not
> possible (IRQ flag gets sometimes cleared by the next DMA controlblock),
> so we need to have another "stable" IRQ source and hence DMA3.
> (actually I was using the TX DMA for that, which gets started by the
> RX-DMA, so we have a race-time between DMA for IRQ and the interrupt
> handler on the CPU reading the registers quickly enough to find out the
> source of the IRQ)

I think it's worth implementing the three DMA solution as an
optimisation on the two DMA version partly to get the big part of the
win from the two DMAs integrated but also because that's going to be the
more generally useful pattern.  It also lets the tricky bit with the
extra DMA be considered separately.

> One of those found is the fact that an optimized message can only reliably
> get used with spi_async. For spi_sync use it has to have the complete
> function defined when optimizing, as it would otherwise not implement a
> callback IRQ in DMA. And then when spi_sync sets the complete function
> the DMA would still issue the optimized message without interrupts.

Hrm.  We could provide a flag saying this will be used synchronously, or
more likely given driver use patterns the other way around.  Doesn't
seem quite elegant though.  Or insert a dummy transfer (eg, some
register read) which does interrupt into the chain after the real one to
do the callback.

> > On the other hand you may find that what you've got already is enough
> > for people, or get a feel for the sort of things people are looking for
> > to show a win.  Sometimes you'll even get "oh, that's a really good idea
> > I'll just add the feature" (if you're very lucky).

> So who else should I involve as soon as I have presentable code?
> (coding standard issues I have already fixed - mostly...)

The dmaengine maintainers for the DMA stuff - Vinod Koul
<vinod.koul@intel.com> and Dan Williams <dan.j.williams@intel.com> - and
the GPIO maintainers for that - Linus Walleij <linus.walleij@linaro.org>
and Alexandre Courbot <gnurou@gmail.com> (check MAINTAINERS).  Plus
anyone working on the drivers you modified.

> the same, but is specific to the PI...
> http://skpang.co.uk/catalog/pican-canbus-board-for-raspberry-pi-p-1196.html

> And in both cases you probably will need a peer which accepts messages or
> sends them...

> It is a bit tricky - especially the setting up of the correct Baud rate
> on the CAN-Bus side - so people are complaining that it is hard to make
> it work...

Hrm, OK.  Might be too fiddly, dunno.  I do also keep thinking I should
get a FPGA board and impelement some test devices that way to let me
exercise the framework, it's not exactly realistic though.
diff mbox

Patch

diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index d745f95..24a54aa 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -1598,15 +1598,12 @@  int spi_setup(struct spi_device *spi)
 }
 EXPORT_SYMBOL_GPL(spi_setup);
 
-static int __spi_async(struct spi_device *spi, struct spi_message *message)
+static inline int __spi_verify(struct spi_device *spi,
+            struct spi_message *message)
 {
     struct spi_master *master = spi->master;
     struct spi_transfer *xfer;
 
-    message->spi = spi;
-
-    trace_spi_message_submit(message);
-
     if (list_empty(&message->transfers))
         return -EINVAL;
     if (!message->complete)
@@ -1705,9 +1702,28 @@  static int __spi_async(struct spi_device *spi,
struct spi_message *message)
                 return -EINVAL;
         }
     }
+    return 0;
+}
+
+static int __spi_async(struct spi_device *spi, struct spi_message *message)
+{
+    struct spi_master *master = spi->master;
+    int ret = 0;
+
+    trace_spi_message_submit(message);
+
+    if (message->is_optimized) {
+        if (spi != message->spi)
+            return -EINVAL;
+    } else {
+        message->spi = spi;
+        ret = __spi_verify(spi,message);
+        if (ret)
+            return ret;
+    }
 
     message->status = -EINPROGRESS;
-    return master->transfer(spi, message);
+    return spi->master->transfer(spi, message);
 }
 
 /**
@@ -1804,6 +1820,48 @@  int spi_async_locked(struct spi_device *spi,
struct spi_message *message)
 }
 EXPORT_SYMBOL_GPL(spi_async_locked);
 
+/**
+ * spi_message_optimize - optimize a message for repeated use minimizing
+ *   processing overhead
+ *
+ * @spi: device with which data will be exchanged
+ * @message: describes the data transfers, including completion callback
+ * Context: can sleep
+ */
+int spi_message_optimize(struct spi_device *spi,
+            struct spi_message *message)
+{
+    int ret = 0;
+    if (message->is_optimized)
+        spi_message_unoptimize(message);
+
+    message->spi = spi;
+    ret = __spi_verify(spi,message);
+    if (ret)
+        return ret;
+
+    if (spi->master->optimize_message)
+        ret = spi->master->optimize_message(message);
+    if (ret)
+        return ret;
+
+    message->is_optimized = 1;
+
+    return 0;
+}
+EXPORT_SYMBOL_GPL(spi_message_optimize);
+
+void spi_message_unoptimize(struct spi_message *message)
+{
+    if (!message->is_optimized)
+        return;
+
+    if (message->spi->master->unoptimize_message)
+        message->spi->master->unoptimize_message(message);
+
+    message->is_optimized = 0;
+}
+EXPORT_SYMBOL_GPL(spi_message_unoptimize);
 
 /*-------------------------------------------------------------------------*/
 
diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
index 8c62ba7..5206038 100644
--- a/include/linux/spi/spi.h
+++ b/include/linux/spi/spi.h
@@ -287,6 +287,12 @@  static inline void spi_unregister_driver(struct
spi_driver *sdrv)
  *              spi_finalize_current_transfer() so the subsystem can issue
  *                the next transfer
  * @unprepare_message: undo any work done by prepare_message().
+ * @optimize_message: allow computation of optimizations of a spi message
+ *                    this may set up the corresponding DMA transfers
+ *                    or do other work that need not get computed every
+ *                    time a message gets executed
+ *                    Not called from interrupt context.
+ * @unoptimize_message: undo any work done by @optimize_message().
  * @cs_gpios: Array of GPIOs to use as chip select lines; one per CS
  *    number. Any individual value may be -ENOENT for CS lines that