Message ID: 43389276-E591-4E09-AB84-491C2CB2D9A7@martin.sperl.org (mailing list archive)
State: New, archived
On Wed, Apr 02, 2014 at 06:06:11PM +0200, Martin Sperl wrote:

Just a general meta comment: your mail is over 700 lines long, which is way too long. Please split things up into smaller chunks or take a higher-level view; something this long is asking far too much of the reader, especially without work to structure the text. Things like splitting out the device specifics help a lot. I've just skimmed through this for now; I'll try to find time to read it in more detail but there's a backlog of patches I need to get to.

> * right now I am not really using dma-engine, as it is too CPU intensive
>   setting a complex SPI message up (a typical read+write SPI message takes
>   typically 21 DMA transfers)
> * instead I started a new dma-fragment interface that has:

This needs to be resolved for mainlining. If dmaengine is not up to the job we need to either decide to replace it wholesale or extend it (more likely the latter) rather than just doing something custom. I do see the need to improve the DMA APIs here but that needs to be done in a way that's joined up with other work on DMA.

> * there is some provision to see if an spi message is of a
>   single/double transfer type and in those cases take an "optimized"
>   spi_message (done by the driver) instead and fill it with the data
>   from the submitted spi_message and make use of all those "vary" cases.
>   this would reduce the DMA creation cost for those synchronous transfers
>   (spi_write_then_read, spi_write, spi_read), but this is not fully in place

I don't know what a "single/double transfer type" is.

> There is also one question here with regards to semantics:
> When do we need to call the "complete" method?
> When we have received all the data or when the CS has been de-asserted
> (after the delay_usec)?

Users might reasonably assume that the transfer has finished at the time any callback runs, and that has to include the /CS change, otherwise the device might not have reacted.

> So the questions are:
> Are there any major questions on this design?
> Does it look acceptable from a design perspective?
> Is it generic enough that it might get used with other devices as well?

I was having a bit of a wood from the trees problem following the design here, but broadly there does seem to be a reasonable amount of usefulness and reusability. A lot of systems are going to struggle to use the full feature set due to restrictions on things like minimum DMA transfer sizes and not being able to DMA to control registers, and like I say the DMA API issues need to be addressed. Trying to do chip selects with DMA is also going to be a lot of fun in the general case, especially with GPIOs. I also suspect that there is a cutoff point where PIO makes sense even if we can DMA.

> Do we need all those "VARY" cases or would we need to vary by more fields?

The main things are probably replacing and changing the data.

> I still think that the optimize/unoptimize_message methods are quite similar
> to prepare/unprepare_message so the question is if these can not get merged
> into a single pair of methods.

Drivers are currently using prepare_message() for actual hardware setup specific to the message that's done around actual transmission, whereas the optimisation can be done very distant to the actual use of the message.

> Also the question is how I should initially submit those "generic" components.
> Should I create a helper.h and helper.c with those components, so that they
> can get split out later or should I keep them separate from the start.
I'm not sure what those helper bits are but in general keeping things so that they can be reused is best.
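For concreteness, a minimal sketch of how a device driver might use the proposed optimize interface. The entry points (spi_message_optimize/spi_message_unoptimize, the is_optimized flag) are taken from the patch at the end of this thread; the example_* names and the callback body are made up for illustration:

static struct spi_message msg;
static struct spi_transfer xfer;
static u8 tx[4], rx[4];

static void example_complete(void *context)
{
        /* runs after the last transfer, including the /CS change */
}

static int example_init(struct spi_device *spi)
{
        spi_message_init(&msg);
        xfer.tx_buf = tx;
        xfer.rx_buf = rx;
        xfer.len = sizeof(tx);
        spi_message_add_tail(&xfer, &msg);
        msg.complete = example_complete;

        /* verify the message and let the bus driver set it up - once */
        return spi_message_optimize(spi, &msg);
}

static int example_hot_path(struct spi_device *spi)
{
        /* only the contents of tx[] change between submissions */
        return spi_async(spi, &msg);
}

static void example_exit(void)
{
        spi_message_unoptimize(&msg);
}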
Thanks for the feedback. Will try to keep emails shorter - I wanted to get everything out in one and document it, also trying to answer questions I thought would come up. So I will take it that I can continue from where I am right now.

Just to summarize the current "modules" and their respective files:

spi-optimize - essentially the patch shared with my original email, touching:
  include/linux/spi/spi.h
  drivers/spi/spi.c
  maybe adding some documentation on how to "optimize" high-volume drivers
  to make optimal use of the interface

basic generic dma-fragment support (structures/methods):
  include/linux/dma-fragment.h
  drivers/dma/dma-fragment.c
  (maybe add: Documentation/dma-fragment.txt)

bcm2835-specific fragment support (dumping, scheduling, linking):
  include/linux/dma/bcm2835-dma.h
  drivers/dma/bcm2835-dma.c

generic SPI-specific dma-fragment code:
  include/linux/spi/spi-dmafragment.h
  drivers/spi/spi-dmafragment.c
  (mostly about extensions: pre/post-DMA transforms for vary support - also
  handling DMA mapping, and data for the state machine used in the
  transformation from spi_message to dma_fragment; this could get integrated
  with spi.h/spi.c if that is deemed sensible)

spi-bcm2835dma driver:
  include/linux/spi/bcm2835.h (SPI DMA registers and bits)
  drivers/spi/spi-bcm2835dma.h (data structures and defines)
  drivers/spi/spi-bcm2835dma_drv.c (the driver itself)
  drivers/spi/spi-bcm2835dma_frag.c (the code filling/creating the fragments)

Does this look "ok" from an approach perspective?

I will share one module after the other for initial review. I will brush up the code for the generic DMA-fragment code so that it becomes presentable and will share that part in the hope that it can live in parallel to dmaengine and/or somehow get integrated - possibly as another target besides cyclic. It is really mostly management of those DMA blocks so that they can get joined together easily, as well as management of a group of those blocks. Also it is designed to be generic - that is why the bcm2835-specific code is a separate module.

As such, drivers using this specific interface will need to know enough about the HW itself; my assumption is that the driver will need to know about the DMA engine used to work properly - so I left all sorts of abstractions out of the design. This can get added later if needed... Anyway, this should come with deeper integration with dma-engine.

With regards to VARY: I have used the "options" provided to cover most of the bases that I thought I would need to "auto-optimize" those simple cases of spi_read, spi_write, spi_write_then_read. The VARY_DELAY_USEC was just added as an afterthought, as it is not so "complex" to enable this as a feature...

As for the measurements on effective CPU utilization shared originally - is the data shared understood/convincing? Does it need more explanation?

Ciao, Martin

P.s: as an afterthought: I actually think that I could implement a DMA-driven bit-bang SPI bus driver with up to 8 data lines using the above dma_fragment approach - not sure about the effective clock speed this could run at... But right now it is not worth pursuing that further.
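The generic dma-fragment code itself was not posted with this mail, so the following is only a hypothetical sketch of the idea as described above (all names are made up): a fragment is a pre-built, cacheable chain of hardware control blocks plus the transforms that patch it per message.

/*
 * Hypothetical sketch of the dma-fragment idea - the real interface
 * had not been posted at this point. A fragment owns a pre-allocated
 * chain of hardware control blocks that can be cached and linked to
 * other fragments instead of being rebuilt for every spi_message.
 */
struct dma_fragment_cache;              /* per-fragment-type free list, details omitted */

struct dma_link {
        void             *cb;           /* HW control block (coherent memory) */
        dma_addr_t       cb_dma;        /* bus address of that control block */
        struct list_head node;          /* position inside the fragment */
};

struct dma_fragment {
        struct list_head dma_links;     /* control blocks in execution order */
        struct list_head transforms;    /* pre/post-DMA transforms (vary support) */
        struct list_head cache_node;    /* entry in a per-type fragment cache */
};

/* fetch a fragment from the cache (or allocate one) and link it
 * behind an already assembled chain */
struct dma_fragment *dma_fragment_cache_get(struct dma_fragment_cache *cache);
void dma_fragment_chain(struct dma_fragment *head, struct dma_fragment *frag);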
On Wed, Apr 02, 2014 at 10:40:41PM +0200, Martin Sperl wrote:

> Will try to keep emails shorter - I wanted to get everything out in one
> and document it also trying to answer questions I thought would come up.

Sending several mails might've helped too - the way a patch series is structured is actually a fairly good way of breaking things up for comprehensibility.

> Just to summarize the current "modules" and their respective files:
> spi-optimize - essentially the patch shared with my original email touching:
>   include/linux/spi/spi.h
>   drivers/spi/spi.c
> maybe adding some documentation on how to "optimize" high-volume drivers
> to make optimal use of the interface.

There should be some win from this purely from the framework too, even without drivers doing anything.

> Does this look "ok" from an approach perspective?

Broadly. Like I say, the DMA stuff is the biggest alarm bell - if it's not playing nicely with dmaengine that'll need to be dealt with.

> I will share one module after the other for initial review.

It's OK to post the lot at once as part of a series but it needs to be split up into separate logical patches. Hopefully we can get the externally visible optimisation stuff merged relatively quickly so that drivers can start building on it.

> I will brush up the code for the generic DMA fragment code, so that it
> becomes presentable and will share that part in the hope that it can live
> in parallel to dmaengine and/or get somehow integrated - possibly as
> another target besides cyclic.

Yes, something like that. The most important thing is to make it usable with any driver.

> As such drivers using this specific interface will need to know enough about
> the HW itself, my assumption is that the driver will need to know about the DMA
> engine used to work properly - so I left all sorts of abstractions out of the
> design - this can get added later if needed...
> Anyway this should come with deeper integration with dma-engine.

That would seem very surprising - I'd really have expected that we'd be able to expose enough capability information from the DMA controllers to allow fairly generic code; there are several controllers that have to work over multiple SoCs.

> P.s: as an afterthought: I actually think that I could implement a DMA driven
> bit-bang SPI bus driver with up to 8 data-lines using the above dma_fragment
> approach - not sure about the effective clock speed that this could run...
> But right now it is not worth pursuing that further.

Right, and it does depend on being able to DMA to set GPIOs, which is challenging in the general case.
Hi Mark!

This is again a long email - trying to answer your questions/concerns.

On 04.04.2014 00:02, Mark Brown wrote:
> There should be some win from this purely from the framework too even
> without drivers doing anything.

If the device driver does not do anything, then there is no cost involved with the framework (ok - besides an additional "if (!msg->is_optimized) ..."). If the bus driver does not support optimization there is still some win.

I have shared the "optimization" data already, but here again the overview. Running a compile of the driver several times (measuring elapsed time) with different driver/optimize combinations:

driver          optimize  real_compile  interrupts/s  irq2EOT
none            N/A        50s            300         N/A
spi-bcm2835     off       120s          45000         293us
spi-bcm2835     on        115s          45000         290us
spi-bcm2835dma  off        76s           6700         172us
spi-bcm2835dma  on         60s           6700          82us

For the "default" driver the CPU cycles available to userspace essentially went up from 41.6% to 43.5% (=50s/115s). It is not much, but it is still something. This is achieved by cutting out "__spi_verify", which is most of "__spi_async()" in the current code.

But if you now take the optimize + DMA-driver case: we have 83.3% (=50s/60s) CPU available for userspace. And without optimize: 65.8% (=50s/76s). Both of those numbers are big wins!

Note that the first version of the driver did not implement caching for fragments but was rebuilding the full DMA chain on the fly each time; the available CPU cycles were somewhere in the 45-50% range - better than the stock driver, but not by much. Merging only 5 fragments is way more efficient than building 19 DMA control blocks from scratch, including the time for allocation, filling with data, deallocation, ...

As for "generic/existing unoptimized" device drivers - as mentioned - there is the idea of providing an auto-optimize option for the common spi_read, spi_write, spi_write_then_read type cases (by making use of VARY and optimizing some driver-prepared messages).

For the framework there might also be the chance to do some optimizations of its own when "spi_optimize" gets called for a message. There the framework might want to call the spi_prepare methods only once. But I do not fully know the use-cases and semantics for prepare inside the framework - you say it is different from the optimize I envision. A side effect of optimize is that ownership of the state and queue members is transferred to the framework/bus driver, and only those fields flagged by vary may change. There may be some optimizations possible for the framework based on this "transfer of ownership"...

> That would seem very surprising - I'd really have expected that we'd be
> able to expose enough capability information from the DMA controllers to
> allow fairly generic code; there's several controllers that have to work
> over multiple SoCs.

It is mostly related to knowing the specific registers which you need to set... How to make it more abstract I have not yet figured out. But it might boil down to something like this:
* create Fragment
* add Poke(frag, Data, bus_address(register))
* add Poke ...

As of now I am still more explicit, which is also due to the fact that I want to be able to handle a few transfer cases together (write only, read only, read-write), which require slightly different DMA parameters - and the VARY interface should allow me to handle all of them together with minimal setup overhead. But for this you need to "know" the DMA capabilities to make the most of it - maybe some abstraction is possible there as well...
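A hypothetical rendering of that Poke idea, reusing the sketched dma_fragment structures from earlier in the thread (dma_link_alloc() and bcm2835_cb_fill() are made-up helpers):

/*
 * Hypothetical: append one control block to a fragment that writes a
 * constant value to a device register ("poke"). Anything that has to
 * vary per message would be fixed up later by the transform machinery.
 */
int dma_fragment_add_poke(struct dma_fragment *frag, u32 value,
                          dma_addr_t reg)
{
        struct dma_link *link = dma_link_alloc(frag);   /* made-up helper */

        if (!link)
                return -ENOMEM;

        /* controller-specific control block: "store value to reg" */
        bcm2835_cb_fill(link->cb, value, reg);          /* made-up helper */

        list_add_tail(&link->node, &frag->dma_links);
        return 0;
}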
But it is still complicated by the fact that the driver needs to use 3 DMA channels to drive SPI. As mentioned, it is actually 2, but the 3rd is needed to stably trigger a completion interrupt without any race conditions that would prevent the DMA interrupt from really getting called (the irq-flag might have been cleared already). So this is quite specific to the DMA + SPI implementation.

>> P.s: as an afterthought: I actually think that I could implement a DMA driven
>> bit-bang SPI bus driver with up to 8 data-lines using the above dma_fragment
>> approach - not sure about the effective clock speed that this could run...
>> But right now it is not worth pursuing that further.
>
> Right, and it does depend on being able to DMA to set GPIOs which is
> challenging in the general case.

"Pulling" GPIO up/down is fairly simple on the BCM2835: to set a GPIO, write to the GPIOSET registers with the corresponding bits (1<<GPIOPIN) set; to clear it, write to the GPIOCLEAR registers, again with the same mask. So DMA can set all or no GPIO pins together; one can set/clear up to 32 GPIOs with a single writel or DMA. The drawback is that it needs two writes to set an exact value for multiple GPIOs, and under some circumstances you need to be aware of what you are doing.

This feature is probably due to the "dual" CPU design (ARM + VC4/GPU), which allows working the GPIO pins from both sides without any concurrency issues (as long as the ownership of the specific pin is clear). The concurrency is serialized between ARM, GPU and DMA via the common AXI bus. Unfortunately the same is NOT possible for changing GPIO directions / alternate functions (but this is supposed to be rarer, so it can get arbitrated between components...)
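A minimal sketch of that set/clear scheme (register offsets as given in the BCM2835 peripherals datasheet; gpio_base is an assumed ioremap of the GPIO block):

/*
 * Each write only touches the pins whose bits are set, so there is no
 * read-modify-write - which is exactly what makes these registers
 * usable as a plain DMA store target as well.
 */
#define GPSET0  0x1c    /* set pins 0..31 */
#define GPCLR0  0x28    /* clear pins 0..31 */

static void __iomem *gpio_base; /* assumed mapping of the GPIO block */

static inline void gpio_pin_set(unsigned int pin)
{
        writel(1 << pin, gpio_base + GPSET0);
}

static inline void gpio_pin_clear(unsigned int pin)
{
        writel(1 << pin, gpio_base + GPCLR0);
}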
> Broadly. Like I say the DMA stuff is the biggest alarm bell - if it's
> not playing nicely with dmaengine that'll need to be dealt with.

As for DMA-engine: the driver should (for the moment) also work with minimal changes on the foundation kernel - there is a much bigger user base there that uses it for LCD displays, CAN controllers, ADCs and more - so it gets more exposure to different devices than I can access myself. But still: I believe I must get the basics right first before I can start addressing DMAengine.

One of the issues I have with DMA-engine is that you always have to set up and tear down the DMA transfers (at least the way I understood it), and that is why I created this generic DMA-fragment interface, which can also cache some of those DMA artifacts and allows chaining them in individual order. So the idea is to take that to build the DMA control-block chain and then pass it on to the dma-engine.

Still a lot of things are missing - for example, if the DMA is already running and there is another DMA fragment to execute, the driver chains those fragments together in the hope that the DMA will continue and pick it up. Here the stats for 53M received CAN messages:

root@raspberrypi:~/spi-bcm2835# cat /sys/class/spi_master/spi0/stats
bcm2835dma_stats_info - 0.1
total spi_messages:      160690872
optimized spi_messages:  160690870
started dma:             53768175
linked to running dma:   106922697
last dma_schedule type:  linked
dma interrupts:          107127237
queued messages:         0

As explained, my highly optimized device driver schedules 3 spi_messages: the first 2 together, the 3rd in the complete function of the 1st message. And the counter for "linked to running dma" is about double the counter for "started dma". The first spi_message will need to get started normally (as the DMA is typically idle) while the 2nd and 3rd are typically linked. If you do the math, this linking happens for 66.54% of all spi_messages; under ideal circumstances this value would be 66.666666% (=2/3). So there are times when the ARM is slightly too slow, and then typically the 3rd message is really scheduled only when the DMA has already stopped.

Running for more than 2 days with 500M CAN messages did not show any further races (but the scheduling needs to make heavy use of dsb() so that this does not happen...)

This kind of thing is something that DMA-engine does not support as of now. But prior to getting something like this accepted it first needs proof that it works... And this is the POC that shows that it is possible and gives huge gains (at least on some platforms)...

Hope this answers your questions.

Ciao, Martin
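For reference, a rough sketch of the linking counted by "linked to running dma" above. The control-block layout and register names follow the BCM2835 datasheet, but chan_base is an assumed mapping of the DMA channel and the helper itself is made up:

/*
 * BCM2835 DMA control blocks end with the bus address of the next
 * control block; writing it while the channel runs chains the new
 * fragment in. If the channel has already stopped, it has to be
 * started by hand - the window between the two is the race that the
 * dsb() use above is about.
 */
struct bcm2835_dma_cb {
        u32 info;       /* transfer information */
        u32 src;        /* source bus address */
        u32 dst;        /* destination bus address */
        u32 len;        /* transfer length */
        u32 stride;
        u32 next;       /* bus address of next control block, 0 = stop */
        u32 pad[2];
};

#define DMA_CS          0x00    /* control/status */
#define DMA_CONBLK_AD   0x04    /* address of current control block */
#define DMA_CS_ACTIVE   BIT(0)

static void chain_or_start(void __iomem *chan_base,
                           struct bcm2835_dma_cb *tail, dma_addr_t first_cb)
{
        tail->next = first_cb;  /* link the new fragment behind the old tail */
        dsb();                  /* make the link visible before checking */

        if (!(readl(chan_base + DMA_CS) & DMA_CS_ACTIVE)) {
                /* channel already stopped: start the new chain explicitly */
                writel(first_cb, chan_base + DMA_CONBLK_AD);
                writel(DMA_CS_ACTIVE, chan_base + DMA_CS);
        }
}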
On Fri, Apr 04, 2014 at 04:17:24PM +0200, martin sperl wrote:
> On 04.04.2014 00:02, Mark Brown wrote:

> > There should be some win from this purely from the framework too even
> > without drivers doing anything.

> If the device-driver does not do anything, then there is no cost involved with
> framework (ok - besides an additional if "(!msg->is_optimized) ...")
> If the bus driver does not support optimization there is still some win.

Right, I was more pointing out that you were underselling here - you were talking only about the benefits when everything is updated. This also means we can usefully merge partial support; addressing only some of the improvements is still going to give some wins and makes it easier to digest the code overall.

> For the framework there might also be the chance to do some optimizations of
> its own when "spi_optimize" gets called for a message.
> There the framework might want to call the spi_prepare methods only once.
> But I do not fully know the use-cases and semantics for prepare inside the
> framework - you say it is different from the optimize I envision.

The main thing I'd like to do is actually coalesce adjacent transfers within a message if they are compatible and we're doing DMA, so we end up with fewer hardware interactions overall. There's a fairly common pattern of writing multiple registers (for firmware download or setting coefficients) where you have a small transfer for a register then a larger one for a data block which would benefit.

For prepare I'd expect we can get to skipping reconfiguring the hardware entirely when not needed even without an optimisation step, though the tradeoff might not be worth it. This may be required more than once in a message though.

> > That would seem very surprising - I'd really have expected that we'd be
> > able to expose enough capability information from the DMA controllers to
> > allow fairly generic code; there's several controllers that have to work
> > over multiple SoCs.

> It is mostly related to knowing the specific registers which you need to set...
> How to make it more abstract I have not yet figured out.
> But it might boil down to something like that:
> * create Fragment
> * add Poke(frag,Data,bus_address(register))
> * add Poke ...

Registers where? If it's in the DMA controller we can just add ops for the DMA controller, can't we?

> But it is still complicated by the fact that the driver needs to use 3 DMA
> channels to drive SPI. As mentioned actually 2, but the 3rd is needed to
> stably trigger a completion interrupt without any race conditions, that
> would inhibit the DMA interrupt to really get called (irq-flag might have
> been cleared already).

Right, that's a peculiarity of your system (hopefully) which makes life difficult for that. The Samsung controller has a vaguely similar thing where the transmit interrupt goes off too early for what we want, so we need to ensure there's always a read if we're doing interrupts, but that is a lot easier to work around.

> > Right, and it does depend on being able to DMA to set GPIOs which is
> > challenging in the general case.

> "pulling" GPIO up/down - on the BCM2835 it is fairly simple:
> to set a GPIO: write to GPIOSET registers with the corresponding bits
> (1<<GPIOPIN) set.
> To clear it: write to GPIOCLEAR registers again with the same mask.
> So DMA can set all or 0 GPIO pins together.
That's quite convenient for this sort of use - lots of controllers just have a register where you need to set the values of some GPIOs, so you end up doing a read/modify/write, which is obviously hard with DMA. We would need gpiolib to export an interface to allow this operation; it should be doable but I expect it's like the regular bulk set and tends to get bogged down.

> > Broadly. Like I say the DMA stuff is the biggest alarm bell - if it's
> > not playing nicely with dmaengine that'll need to be dealt with.

> As for DMA-engine: The driver should (for the moment) also work with minimal
> changes also on the foundation kernel - there is a much bigger user base
> there, that use it for LCD displays, CAN controllers, ADCs and more - so it
> gets more exposure to different devices than I can access myself.

It's good to talk early so the ideas are in people's heads if nothing else; it also helps ensure everyone is travelling in the same direction.

> But still: I believe I must get the basics right first before I can start
> addressing DMAengine.

Sounds like you already have a pretty good prototype?

> And one of the issues I have with DMA-engine is that you always have to set up
> and tear down the DMA-transfers (at least the way I understood it) and that is
> why I created this generic DMA-fragment interface which can also cache some of
> those DMA artifacts and allows chaining them in individual order.
> So the idea is to take that to build the DMA-control block chain and then pass
> it on to the dma-engine.

Right, that's the biggest thing I can see in what you're suggesting for dmaengine, and it sounds like something other users could also benefit from, so I'd not expect massive pushback.

> This kind of thing is something that DMA-engine does not support as of now.
> But prior to getting something like this accepted it first needs a proof
> that it works...

On the other hand you may find that what you've got already is enough for people, or get a feel for the sort of things people are looking for to show a win. Sometimes you'll even get "oh, that's a really good idea, I'll just add the feature" (if you're very lucky).

BTW I have acquired a RPi for some other use - is the CAN board you are using something I can order? I'm not sure if I'd ever have much time to look at it but I'm generally trying to expand the range of SPI hardware I have available for test.
On 11.04.2014 00:35, Mark Brown wrote:
> Right, I was more pointing out that you were underselling here - you were
> talking only about the benefits when everything is updated. This also means
> we can usefully merge partial support, addressing only some of the
> improvements is still going to give some wins and makes it easier to digest
> the code overall.

Well - I thought I had shown this with my shared timing data already... (it was just a 5s gain for the PIO driver, but still...)

> The main thing I'd like to do is actually coalesce adjacent transfers within
> a message if they are compatible and we're doing DMA so we end up with fewer
> hardware interactions overall. There's a fairly common pattern of writing
> multiple registers (for firmware download or setting coefficients) where you
> have a small transfer for a register then a larger one for a data block which
> would benefit.

I just found out something similar a few days ago: doing a write-then-read with 2+1 bytes is more effort to handle than a single read/write of 3 bytes and some moving of data around. It actually also saves memory, as an spi_transfer is bigger than those 3 extra bytes needed. With the DMA driver this saved me 4us on overall transfer time - it dropped from 52us to 47us. For comparison: the real data transfer for those 36 bytes (without 7x CS) takes 35us - so we are pretty close to ideal...

There is not so much of a difference with the PIO drivers, as they take 200us for the same, but it sees a drop in the number of interrupts (typically 1 IRQ/transfer), which increases CPU availability for userspace... Redesigning the interrupt handler to push bytes spanning multiple spi_transfers into the FIFO together would help, but then it becomes much more complex - at least spi-bcm2835 calls "complete(&bs->done);" when it has finished the transfer.

I believe I will add all those findings to the documentation with the spi_optimize patch... (a concrete sketch of this coalescing follows at the end of this mail)

> Registers where? If it's in the DMA controller we can just add ops for
> the DMA controller can't we?

Possibly - still working on cleaning up the code and making it as generic as possible.

> Right, that's a peculiarity of your system (hopefully) which makes life
> difficult for that. The Samsung controller has a vaguely similar thing
> where the transmit interrupt goes off too early for what we want so we
> need to ensure there's always a read if we're doing interrupts but that
> is a lot easier to work around.

The spec says to use 2 interrupts: one for feeding data to the FIFO, the other to read data from the FIFO. Probably one could use multiple interleaved DMAs for reads and writes, but then you would be limited to 32-byte transfers and would have some "gaps" between the FIFOs filling/emptying - this way at least 65535 bytes can get sent with a single transfer...

The interrupt DMA is more on the tricky side - if we had to have an interrupt every time an spi_message finished, it would be possible with 2 DMAs, but as we have the possibility of chaining multiple spi_messages together for higher throughput, this "waiting" for an interrupt is not possible (the IRQ flag sometimes gets cleared by the next DMA control block), so we need another "stable" IRQ source and hence DMA 3. (Actually I was using the TX DMA for that, which gets started by the RX DMA, so we have a race between the DMA for the IRQ and the interrupt handler on the CPU reading the registers quickly enough to find out the source of the IRQ.)
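To make the 2+1 example above concrete, a minimal sketch of coalescing a write-then-read into one full-duplex transfer. The command/register values are made up, and the buffers are shown on the stack for brevity (with DMA they should live in DMA-safe memory):

/*
 * Instead of one 2-byte write transfer followed by a 1-byte read
 * transfer, a single 3-byte full-duplex transfer is used; the answer
 * arrives while the dummy byte is clocked out and is then picked out
 * of the rx buffer.
 */
static int read_reg_coalesced(struct spi_device *spi, u8 cmd, u8 reg,
                              u8 *value)
{
        u8 tx[3] = { cmd, reg, 0x00 };  /* command, address, dummy clocks */
        u8 rx[3];
        struct spi_transfer xfer = {
                .tx_buf = tx,
                .rx_buf = rx,
                .len    = 3,
        };
        struct spi_message msg;
        int ret;

        spi_message_init(&msg);
        spi_message_add_tail(&xfer, &msg);

        ret = spi_sync(spi, &msg);
        if (ret)
                return ret;

        *value = rx[2]; /* the byte clocked in during the dummy byte */
        return 0;
}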
> That's quite convenient for this sort of use - lots of controllers
> just have a register where you need to set the values of some GPIOs so
> you end up doing a read/modify/write which is obviously hard with DMA.
> We would need gpiolib to export an interface to allow this operation;
> it should be doable but I expect it's like the regular bulk set and
> tends to get bogged down.

Yes - it is most convenient. But on the other hand it is also a bit of a pain if you need to do that on the ARM, as you have to do 2 writes for a single change in gpiolib, and ordering could become a problem with bit-banged SPI - but it is possible (zero first, set last...).

> Sounds like you already have a pretty good prototype?

As said: it mostly works, but I need to clean up the code before I can post it. Before doing that I had to finish the CAN driver that makes use of the features, to see what pitfalls we still have. One of those found is the fact that an optimized message can only reliably get used with spi_async. For spi_sync use it has to have the complete function defined when optimizing, as it would otherwise not implement a callback IRQ in DMA - and then, when spi_sync sets the complete function, the DMA would still issue the optimized message without interrupts.

Also there is one other race condition that happens sometimes quickly, sometimes after 48 hours; I think I might have an idea, but for that I need more info on the event itself...

> Right, that's the biggest thing I can see in what you're suggesting for
> dmaengine and it sounds like something other users could also benefit
> from so I'd not expect massive pushback.

>> This kind of thing is something that DMA-engine does not support as of now.
>> But prior to getting something like this accepted it first needs a proof
>> that it works...

> On the other hand you may find that what you've got already is enough
> for people, or get a feel for the sort of things people are looking for
> to show a win. Sometimes you'll even get "oh, that's a really good idea
> I'll just add the feature" (if you're very lucky).

So who else should I involve as soon as I have presentable code? (Coding-standard issues I have already fixed - mostly...)

> BTW I have acquired a RPi for some other use - is the CAN board you are
> using something I can order? I'm not sure if I'd ever have much time to
> look at it but I'm generally trying to expand the range of SPI hardware
> I have available for test.

It could be as simple as a breadboard + MCP2515 + MCP2551 + 16MHz oscillator + 4 resistors + 2-3 capacitors for the chips... Other people have bought the PiCAN module, which essentially implements the same but is specific to the Pi:
http://skpang.co.uk/catalog/pican-canbus-board-for-raspberry-pi-p-1196.html

In both cases you will probably need a peer which accepts messages or sends them... It is a bit tricky - especially setting up the correct baud rate on the CAN bus side - so people are complaining that it is hard to make it work...

Ciao, Martin
On Fri, Apr 11, 2014 at 02:40:07PM +0200, martin sperl wrote:
> On 11.04.2014 00:35, Mark Brown wrote:

> > The main thing I'd like to do is actually coalesce adjacent transfers within
> > a message if they are compatible and we're doing DMA so we end up with fewer
> > hardware interactions overall. There's a fairly common pattern of writing
> > multiple registers (for firmware download or setting coefficients) where you
> > have a small transfer for a register then a larger one for a data block which
> > would benefit.

> I just found out something similar just a few days ago:
> doing a write then read with 2+1 is more effort to handle than a read/write
> of 3 bytes and some moving of data around. Also it actually saves memory,
> as an spi_transfer is bigger than those 3 extra bytes needed.
> With the DMA Driver this saved me 4us on overall transfer time - It dropped
> from 52us to 47us. For comparison: the real data transfer for those 36
> bytes (without 7x CS) takes 35us - so we are pretty close to ideal...

It can be even nicer for drivers where there's some minimum limit on the size of transfers (quite a lot of them); it can sometimes allow short writes to be converted from PIO to being combined with adjacent DMAs. Possibly even two PIOs making a DMA, though that case is more on the edge.

> There is not so much of a difference with the PIO drivers as they take 200us
> for the same, but it sees a drop on the number of interrupts (typically
> 1 IRQ/transfer), which increases CPU availability for userspace...
> Redesigning the interrupt-handler to push bytes spanning multiple
> spi_transfers into the fifo together would help, but then it becomes much
> more complex - at least the spi-bcm2835 does call "complete(&bs->done);"
> when it has finished the transfer.

It's probably not worth doing unless it's factored out into the framework and affects lots of drivers; it seems most likely to get done as part of a general improvement in the ability to drive the queue from interrupt context. Most systems won't be able to do the fully DMAed pipeline the Pi can.

> > Right, that's a peculiarity of your system (hopefully) which makes life
> > difficult for that. The Samsung controller has a vaguely similar thing
> > where the transmit interrupt goes off too early for what we want so we
> > need to ensure there's always a read if we're doing interrupts but that
> > is a lot easier to work around.

> The spec says to use 2 interrupts: one for feeding data to the FIFO, the
> other to read data from the FIFO. Probably one could use multiple
> interleaved DMAs for reads and writes, but then you would be limited to
> 32-byte transfers and would have some "gaps" between the FIFOs
> filling/emptying - this way at least 65535 bytes can get sent with a
> single transfer...

Sure, that's standard, though looking at the code in drivers it seems there's a fairly common pattern of just ignoring one of the completions, usually the transmit one, at the SPI level, since we can guarantee that the other will always come later.

> The interrupt DMA is more on the tricky side - if we had to have an
> interrupt every time an spi_message finished, it would be possible with
> 2 DMAs, but as we have the possibility of chaining multiple spi_messages
> together for higher throughput, this "waiting" for an interrupt is not
> possible (the IRQ flag sometimes gets cleared by the next DMA control
> block), so we need another "stable" IRQ source and hence DMA 3.
> (Actually I was using the TX DMA for that, which gets started by the
> RX DMA, so we have a race between the DMA for the IRQ and the interrupt
> handler on the CPU reading the registers quickly enough to find out the
> source of the IRQ.)

I think it's worth implementing the three-DMA solution as an optimisation on the two-DMA version, partly to get the big part of the win from the two DMAs integrated, but also because that's going to be the more generally useful pattern. It also lets the tricky bit with the extra DMA be considered separately.

> One of those found is the fact that an optimized message can only reliably
> get used with spi_async. For spi_sync use it has to have the complete
> function defined when optimizing, as it would otherwise not implement a
> callback IRQ in DMA. And then when spi_sync sets the complete function
> the DMA would still issue the optimized message without interrupts.

Hrm. We could provide a flag saying this will be used synchronously, or more likely given driver use patterns the other way around. Doesn't seem quite elegant though. Or insert a dummy transfer (eg, some register read) which does interrupt into the chain after the real one to do the callback.

> > On the other hand you may find that what you've got already is enough
> > for people, or get a feel for the sort of things people are looking for
> > to show a win. Sometimes you'll even get "oh, that's a really good idea
> > I'll just add the feature" (if you're very lucky).

> So who else should I involve as soon as I have presentable code?
> (coding standard issues I have already fixed - mostly...)

The dmaengine maintainers for the DMA stuff - Vinod Koul <vinod.koul@intel.com> and Dan Williams <dan.j.williams@intel.com> - and the GPIO maintainers for that - Linus Walleij <linus.walleij@linaro.org> and Alexandre Courbot <gnurou@gmail.com> (check MAINTAINERS). Plus anyone working on the drivers you modified.

> the same, but is specific to the PI...
> http://skpang.co.uk/catalog/pican-canbus-board-for-raspberry-pi-p-1196.html
> And in both cases you probably will need a peer which accepts messages or
> sends them...
> It is a bit tricky - especially the setting up of the correct Baud rate
> on the CAN-Bus side - so people are complaining that it is hard to make
> it work...

Hrm, OK. Might be too fiddly, dunno. I do also keep thinking I should get an FPGA board and implement some test devices that way to let me exercise the framework; it's not exactly realistic though.
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index d745f95..24a54aa 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -1598,15 +1598,12 @@ int spi_setup(struct spi_device *spi)
 }
 EXPORT_SYMBOL_GPL(spi_setup);
 
-static int __spi_async(struct spi_device *spi, struct spi_message *message)
+static inline int __spi_verify(struct spi_device *spi,
+			       struct spi_message *message)
 {
 	struct spi_master *master = spi->master;
 	struct spi_transfer *xfer;
 
-	message->spi = spi;
-
-	trace_spi_message_submit(message);
-
 	if (list_empty(&message->transfers))
 		return -EINVAL;
 	if (!message->complete)
@@ -1705,9 +1702,28 @@ static int __spi_async(struct spi_device *spi, struct spi_message *message)
 			return -EINVAL;
 		}
 	}
+	return 0;
+}
+
+static int __spi_async(struct spi_device *spi, struct spi_message *message)
+{
+	struct spi_master *master = spi->master;
+	int ret = 0;
+
+	trace_spi_message_submit(message);
+
+	if (message->is_optimized) {
+		if (spi != message->spi)
+			return -EINVAL;
+	} else {
+		message->spi = spi;
+		ret = __spi_verify(spi, message);
+		if (ret)
+			return ret;
+	}
 
 	message->status = -EINPROGRESS;
-	return master->transfer(spi, message);
+	return spi->master->transfer(spi, message);
 }
 
 /**
@@ -1804,6 +1820,48 @@ int spi_async_locked(struct spi_device *spi, struct spi_message *message)
 }
 EXPORT_SYMBOL_GPL(spi_async_locked);
 
+/**
+ * spi_message_optimize - optimize a message for repeated use minimizing
+ *	processing overhead
+ *
+ * @spi: device with which data will be exchanged
+ * @message: describes the data transfers, including completion callback
+ * Context: can sleep
+ */
+int spi_message_optimize(struct spi_device *spi,
+			 struct spi_message *message)
+{
+	int ret = 0;
+	if (message->is_optimized)
+		spi_message_unoptimize(message);
+
+	message->spi = spi;
+	ret = __spi_verify(spi, message);
+	if (ret)
+		return ret;
+
+	if (spi->master->optimize_message)
+		ret = spi->master->optimize_message(message);
+	if (ret)
+		return ret;
+
+	message->is_optimized = 1;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(spi_message_optimize);
+
+void spi_message_unoptimize(struct spi_message *message)
+{
+	if (!message->is_optimized)
+		return;
+
+	if (message->spi->master->unoptimize_message)
+		message->spi->master->unoptimize_message(message);
+
+	message->is_optimized = 0;
+}
+EXPORT_SYMBOL_GPL(spi_message_unoptimize);
 
 /*-------------------------------------------------------------------------*/
 
diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
index 8c62ba7..5206038 100644
--- a/include/linux/spi/spi.h
+++ b/include/linux/spi/spi.h
@@ -287,6 +287,12 @@ static inline void spi_unregister_driver(struct spi_driver *sdrv)
  *	spi_finalize_current_transfer() so the subsystem can issue
  *	the next transfer
  * @unprepare_message: undo any work done by prepare_message().
+ * @optimize_message: allow computation of optimizations of a spi message
+ *	this may set up the corresponding DMA transfers
+ *	or do other work that need not get computed every
+ *	time a message gets executed
+ *	Not called from interrupt context.
+ * @unoptimize_message: undo any work done by @optimize_message().
  * @cs_gpios: Array of GPIOs to use as chip select lines; one per CS
  *	number. Any individual value may be -ENOENT for CS lines that