Message ID | 20190110134433.15672-1-joro@8bytes.org (mailing list archive)
---|---
Series | Fix virtio-blk issue with SWIOTLB
On Thu, Jan 10, 2019 at 02:44:30PM +0100, Joerg Roedel wrote:
> Hi,
>
> there is a problem with virtio-blk driven devices when virtio-ring uses SWIOTLB through the DMA-API. This happens for example in AMD-SEV enabled guests, where the guest RAM is mostly encrypted and all emulated DMA has to happen to/from the SWIOTLB aperture.
>
> The problem is a limitation of the SWIOTLB implementation, which does not support allocations larger than 256kb. When the virtio-blk driver tries to read/write a block larger than that, the allocation of the dma-handle fails and an IO error is reported.

s/allocations/mappings/, right? We don't use the swiotlb buffer for coherent allocations anymore, and I don't think virtio-blk uses them anyway.

> This patch-set adds a check to the virtio code for whether it might be using SWIOTLB bounce buffering, and limits the maximum segment size in the virtio-blk driver in this case so that it doesn't try to do larger reads/writes.

I really don't like the fact that we hardcode swiotlb specifics here. This needs to be indirect through the dma ops or struct device, as various iommu implementations also have limits (although usually much larger ones).
Hi Christoph,

On Thu, Jan 10, 2019 at 02:59:52PM +0100, Christoph Hellwig wrote:
> On Thu, Jan 10, 2019 at 02:44:30PM +0100, Joerg Roedel wrote:
> > The problem is a limitation of the SWIOTLB implementation, which does not support allocations larger than 256kb. When the virtio-blk driver tries to read/write a block larger than that, the allocation of the dma-handle fails and an IO error is reported.
>
> s/allocations/mappings/, right? We don't use the swiotlb buffer for coherent allocations anymore, and I don't think virtio-blk uses them anyway.

'Allocation' in the sense that address ranges are allocated, not memory; but 'mappings' would also be right.

> I really don't like the fact that we hardcode swiotlb specifics here. This needs to be indirect through the dma ops or struct device, as various iommu implementations also have limits (although usually much larger ones).

I thought about exposing it through the DMA-API, but didn't go that route because I didn't want to extend a generic API for some SWIOTLB specifics. But if there are more implementations that have a size limitation, it certainly makes sense to put it into the DMA-API. I'll change that in the next version.

Thanks,

	Joerg
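(For illustration only: a minimal sketch of the kind of DMA-API-level limit being discussed here. The helper name is made up; the constants are the ones that give SWIOTLB its 256kb per-mapping ceiling, IO_TLB_SEGSIZE slots of 1 << IO_TLB_SHIFT bytes each.)

	#include <linux/swiotlb.h>

	/*
	 * Sketch: the largest contiguous mapping the SWIOTLB bounce buffer
	 * can handle is IO_TLB_SEGSIZE (128) slots of 2kb (1 << IO_TLB_SHIFT)
	 * each, i.e. 256kb with the default configuration.  A DMA-API helper
	 * would return this when the device bounces through SWIOTLB and
	 * "no limit" otherwise; whether that is wired through the dma ops or
	 * struct device is exactly what is being debated above.
	 */
	static size_t swiotlb_max_mapping_size_sketch(void)
	{
		return IO_TLB_SEGSIZE << IO_TLB_SHIFT;
	}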
On 2019/1/10 9:44 PM, Joerg Roedel wrote:
> Hi,
>
> there is a problem with virtio-blk driven devices when virtio-ring uses SWIOTLB through the DMA-API. This happens for example in AMD-SEV enabled guests, where the guest RAM is mostly encrypted and all emulated DMA has to happen to/from the SWIOTLB aperture.
>
> The problem is a limitation of the SWIOTLB implementation, which does not support allocations larger than 256kb. When the virtio-blk driver tries to read/write a block larger than that, the allocation of the dma-handle fails and an IO error is reported.
>
> This patch-set adds a check to the virtio code for whether it might be using SWIOTLB bounce buffering, and limits the maximum segment size in the virtio-blk driver in this case so that it doesn't try to do larger reads/writes.

Just wondering if my understanding is correct: IOMMU_PLATFORM must be set for all virtio devices under AMD-SEV guests?

Thanks

> Please review.
>
> Thanks,
>
>	Joerg
>
> Joerg Roedel (3):
>   swiotlb: Export maximum allocation size
>   virtio: Introduce virtio_max_dma_size()
>   virtio-blk: Consider virtio_max_dma_size() for maximum segment size
>
>  drivers/block/virtio_blk.c   | 10 ++++++----
>  drivers/virtio/virtio_ring.c | 11 +++++++++++
>  include/linux/swiotlb.h      | 12 ++++++++++++
>  include/linux/virtio.h       |  2 ++
>  4 files changed, 31 insertions(+), 4 deletions(-)
On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set for all virtio devices under AMD-SEV guests?

Yes, that is correct. Emulated DMA can only happen on the SWIOTLB aperture, because that memory is not encrypted. The guest then bounces the data to its encrypted memory.

Regards,

	Joerg
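(For context, a sketch of why every emulated DMA goes through SWIOTLB under SEV; this paraphrases the x86 memory-encryption setup rather than quoting it, so treat the exact shape as an assumption.)

	#include <linux/init.h>
	#include <linux/swiotlb.h>
	#include <linux/mem_encrypt.h>

	/*
	 * Sketch: when SEV is active, x86 forces SWIOTLB on during boot, so
	 * all streaming DMA for emulated devices is bounced through the
	 * (decrypted) SWIOTLB aperture instead of touching encrypted guest
	 * pages directly.
	 */
	static void __init force_swiotlb_for_sev_sketch(void)
	{
		if (sev_active())
			swiotlb_force = SWIOTLB_FORCE;
	}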
On 2019/1/11 5:15 PM, Joerg Roedel wrote:
> On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
>> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set for all virtio devices under AMD-SEV guests?
>
> Yes, that is correct. Emulated DMA can only happen on the SWIOTLB aperture, because that memory is not encrypted. The guest then bounces the data to its encrypted memory.
>
> Regards,
>
>	Joerg

Thanks. Have you tested vhost-net in this case? I suspect it may not work.
On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
>
> On 2019/1/11 5:15 PM, Joerg Roedel wrote:
> > On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
> > > Just wondering if my understanding is correct: IOMMU_PLATFORM must be set for all virtio devices under AMD-SEV guests?
> > Yes, that is correct. Emulated DMA can only happen on the SWIOTLB aperture, because that memory is not encrypted. The guest then bounces the data to its encrypted memory.
> >
> > Regards,
> >
> >	Joerg
>
> Thanks. Have you tested vhost-net in this case? I suspect it may not work.

Which brings me back to my pet peeve that we need to take action so that virtio uses the proper dma mapping API by default, with quirks for legacy cases. The magic bypass it uses is just causing problems over problems.
On 2019/1/14 5:50 PM, Christoph Hellwig wrote:
> On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
>> On 2019/1/11 5:15 PM, Joerg Roedel wrote:
>>> On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
>>>> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set for all virtio devices under AMD-SEV guests?
>>> Yes, that is correct. Emulated DMA can only happen on the SWIOTLB aperture, because that memory is not encrypted. The guest then bounces the data to its encrypted memory.
>>>
>>> Regards,
>>>
>>>	Joerg
>>
>> Thanks. Have you tested vhost-net in this case? I suspect it may not work.
> Which brings me back to my pet peeve that we need to take action so that virtio uses the proper dma mapping API by default, with quirks for legacy cases. The magic bypass it uses is just causing problems over problems.

Yes, I fully agree with you. This is probably an exact example of such a problem.

Thanks
On Mon, Jan 14, 2019 at 08:41:37PM +0800, Jason Wang wrote:
>
> On 2019/1/14 5:50 PM, Christoph Hellwig wrote:
> > On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
> > > On 2019/1/11 5:15 PM, Joerg Roedel wrote:
> > > > On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
> > > > > Just wondering if my understanding is correct: IOMMU_PLATFORM must be set for all virtio devices under AMD-SEV guests?
> > > > Yes, that is correct. Emulated DMA can only happen on the SWIOTLB aperture, because that memory is not encrypted. The guest then bounces the data to its encrypted memory.
> > > >
> > > > Regards,
> > > >
> > > >	Joerg
> > >
> > > Thanks. Have you tested vhost-net in this case? I suspect it may not work.
> > Which brings me back to my pet peeve that we need to take action so that virtio uses the proper dma mapping API by default, with quirks for legacy cases. The magic bypass it uses is just causing problems over problems.
>
> Yes, I fully agree with you. This is probably an exact example of such a problem.
>
> Thanks

I don't think so - the issue is really that the DMA API does not yet handle the SEV case 100% correctly. I suspect passthrough devices would have the same issue.

In fact whoever sets IOMMU_PLATFORM is completely unaffected by Christoph's pet peeve.

Christoph is saying that !IOMMU_PLATFORM devices should hide the compatibility code in a special per-device DMA API implementation. Which would be fine, especially if we can manage not to introduce a bunch of indirect calls all over the place and hurt performance. It's just that the benefit is unlikely to be big (e.g. we can't also get rid of the virtio specific memory barriers), so no one was motivated enough to work on it.
On 1/14/19 12:20 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 14, 2019 at 08:41:37PM +0800, Jason Wang wrote:
>>
>> On 2019/1/14 5:50 PM, Christoph Hellwig wrote:
>>> On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
>>>> On 2019/1/11 5:15 PM, Joerg Roedel wrote:
>>>>> On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
>>>>>> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set for all virtio devices under AMD-SEV guests?
>>>>> Yes, that is correct. Emulated DMA can only happen on the SWIOTLB aperture, because that memory is not encrypted. The guest then bounces the data to its encrypted memory.
>>>>>
>>>>> Regards,
>>>>>
>>>>>	Joerg
>>>>
>>>> Thanks. Have you tested vhost-net in this case? I suspect it may not work.
>>> Which brings me back to my pet peeve that we need to take action so that virtio uses the proper dma mapping API by default, with quirks for legacy cases. The magic bypass it uses is just causing problems over problems.
>>
>> Yes, I fully agree with you. This is probably an exact example of such a problem.
>>
>> Thanks
>
> I don't think so - the issue is really that the DMA API does not yet handle the SEV case 100% correctly. I suspect passthrough devices would have the same issue.

In the case of SEV, emulated DMA is performed through the SWIOTLB, which bounces the encrypted buffers. The issue reported here will happen on any platform which is making use of SWIOTLB; we could easily reproduce the virtio-blk failure if we configure swiotlb=force in a non-SEV guest. Unfortunately, in the SEV case SWIOTLB is a must. As Joerg highlighted, the main issue is a limitation of the SWIOTLB: it does not support allocations/mappings larger than 256kb.

> In fact whoever sets IOMMU_PLATFORM is completely unaffected by Christoph's pet peeve.
>
> Christoph is saying that !IOMMU_PLATFORM devices should hide the compatibility code in a special per-device DMA API implementation. Which would be fine, especially if we can manage not to introduce a bunch of indirect calls all over the place and hurt performance. It's just that the benefit is unlikely to be big (e.g. we can't also get rid of the virtio specific memory barriers), so no one was motivated enough to work on it.
On 14/01/2019 18:20, Michael S. Tsirkin wrote:
> On Mon, Jan 14, 2019 at 08:41:37PM +0800, Jason Wang wrote:
>>
>> On 2019/1/14 5:50 PM, Christoph Hellwig wrote:
>>> On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
>>>> On 2019/1/11 5:15 PM, Joerg Roedel wrote:
>>>>> On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
>>>>>> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set for all virtio devices under AMD-SEV guests?
>>>>> Yes, that is correct. Emulated DMA can only happen on the SWIOTLB aperture, because that memory is not encrypted. The guest then bounces the data to its encrypted memory.
>>>>>
>>>>> Regards,
>>>>>
>>>>>	Joerg
>>>>
>>>> Thanks. Have you tested vhost-net in this case? I suspect it may not work.
>>> Which brings me back to my pet peeve that we need to take action so that virtio uses the proper dma mapping API by default, with quirks for legacy cases. The magic bypass it uses is just causing problems over problems.
>>
>> Yes, I fully agree with you. This is probably an exact example of such a problem.
>>
>> Thanks
>
> I don't think so - the issue is really that the DMA API does not yet handle the SEV case 100% correctly. I suspect passthrough devices would have the same issue.

Huh? Regardless of which virtio devices use it or not, the DMA API is handling the SEV case as correctly as it possibly can, by forcing everything through the unencrypted bounce buffer. If the segments being mapped are too big for that bounce buffer in the first place, there's nothing it can possibly do except fail, gracefully or otherwise.

Now, in theory, yes, the real issue at hand is not unique to virtio-blk nor SEV - any driver whose device has a sufficiently large DMA segment size and which manages to get sufficient physically-contiguous memory could technically generate a scatterlist segment longer than SWIOTLB can handle. However, in practice that basically never happens, not least because very few drivers ever override the default 64K DMA segment limit. AFAICS nothing in drivers/virtio is calling dma_set_max_seg_size() or otherwise assigning any dma_parms to replace the defaults either, so the really interesting question here is: how are these apparently-out-of-spec 256K segments getting generated at all?

Robin.
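(For reference, a minimal sketch - not from this thread - of how a driver would normally raise the 64K default Robin mentions; the function name and the 1MB value are made up for illustration.)

	#include <linux/device.h>
	#include <linux/dma-mapping.h>
	#include <linux/sizes.h>

	/*
	 * Sketch: without dma_parms, dma_get_max_seg_size() reports the 64K
	 * default, which SWIOTLB can always bounce.  A driver that wants
	 * larger segments has to provide dma_parms and then call
	 * dma_set_max_seg_size().
	 */
	static int example_set_seg_size(struct device *dev)
	{
		dev->dma_parms = devm_kzalloc(dev, sizeof(*dev->dma_parms),
					      GFP_KERNEL);
		if (!dev->dma_parms)
			return -ENOMEM;

		return dma_set_max_seg_size(dev, SZ_1M);	/* illustrative value */
	}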
On Mon, Jan 14, 2019 at 01:20:45PM -0500, Michael S. Tsirkin wrote:
> I don't think so - the issue is really that the DMA API does not yet handle the SEV case 100% correctly. I suspect passthrough devices would have the same issue.

The DMA API handles the SEV case perfectly. It's just that virtio_blk allows huge segments that virtio does not generally support, but that is not related to SEV in any way.

> In fact whoever sets IOMMU_PLATFORM is completely unaffected by Christoph's pet peeve.

No, the above happens only when we set IOMMU_PLATFORM.

> Christoph is saying that !IOMMU_PLATFORM devices should hide the compatibility code in a special per-device DMA API implementation. Which would be fine, especially if we can manage not to introduce a bunch of indirect calls all over the place and hurt performance. It's just that the benefit is unlikely to be big (e.g. we can't also get rid of the virtio specific memory barriers), so no one was motivated enough to work on it.

No. The problem is that we still haven't fully specified what IOMMU_PLATFORM and !IOMMU_PLATFORM actually mean. Your "ACCESS_PLATFORM/ORDER_PLATFORM" commit in the virtio-spec repo improves it a little bit, but it is still far from enough.

As a start, VIRTIO_F_ORDER_PLATFORM and VIRTIO_F_ACCESS_PLATFORM absolutely MUST be set for hardware implementations. Otherwise said hardware has no chance of working on anything but the most x86-like systems.

Second, software implementations SHOULD set VIRTIO_F_ACCESS_PLATFORM, because otherwise we can't add proper handling for things like SEV or the IBM "secure hypervisor" thing.

Last but not least, a lot of wording outside the area describing these flags really needs some major updates in terms of DMA access.
On Mon, Jan 14, 2019 at 07:12:08PM +0000, Robin Murphy wrote:
> Now, in theory, yes, the real issue at hand is not unique to virtio-blk nor SEV - any driver whose device has a sufficiently large DMA segment size and which manages to get sufficient physically-contiguous memory could technically generate a scatterlist segment longer than SWIOTLB can handle. However, in practice that basically never happens, not least because very few drivers ever override the default 64K DMA segment limit. AFAICS nothing in drivers/virtio is calling dma_set_max_seg_size() or otherwise assigning any dma_parms to replace the defaults either, so the really interesting question here is: how are these apparently-out-of-spec 256K segments getting generated at all?

drivers/block/virtio_blk.c:virtblk_probe():

	/* Host can optionally specify maximum segment size and number of
	 * segments. */
	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SIZE_MAX,
				   struct virtio_blk_config, size_max, &v);
	if (!err)
		blk_queue_max_segment_size(q, v);
	else
		blk_queue_max_segment_size(q, -1U);

So it really is virtio_blk that is special here. The host could set VIRTIO_BLK_F_SIZE_MAX to paper over the problem, but I really think we need a dma_max_segment_size API that is wired up through the dma_map_ops to sort this out for real.
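(For illustration, the shape of the fix this series proposes - patch 3 in the diffstat above - would be roughly the following change inside virtblk_probe(); this is a sketch based on the cover letter, not the literal patch.)

	u32 v, max_size;

	/* Upper bound imposed by the transport: with SWIOTLB bounce
	 * buffering in use this would be the 256kb bounce limit, otherwise
	 * effectively unlimited.  virtio_max_dma_size() is the helper the
	 * cover letter introduces. */
	max_size = virtio_max_dma_size(vdev);

	/* Host can optionally specify maximum segment size and number of
	 * segments. */
	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SIZE_MAX,
				   struct virtio_blk_config, size_max, &v);
	if (!err)
		max_size = min(max_size, v);

	blk_queue_max_segment_size(q, max_size);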
On Mon, Jan 14, 2019 at 07:12:08PM +0000, Robin Murphy wrote:
> On 14/01/2019 18:20, Michael S. Tsirkin wrote:
> > I don't think so - the issue is really that the DMA API does not yet handle the SEV case 100% correctly. I suspect passthrough devices would have the same issue.
>
> Huh? Regardless of which virtio devices use it or not, the DMA API is handling the SEV case as correctly as it possibly can, by forcing everything through the unencrypted bounce buffer. If the segments being mapped are too big for that bounce buffer in the first place, there's nothing it can possibly do except fail, gracefully or otherwise.

It seems reasonable to be able to ask it what its capabilities are, though.

> Now, in theory, yes, the real issue at hand is not unique to virtio-blk nor SEV - any driver whose device has a sufficiently large DMA segment size and which manages to get sufficient physically-contiguous memory could technically generate a scatterlist segment longer than SWIOTLB can handle. However, in practice that basically never happens, not least because very few drivers ever override the default 64K DMA segment limit. AFAICS nothing in drivers/virtio is calling dma_set_max_seg_size() or otherwise assigning any dma_parms to replace the defaults either, so the really interesting question here is: how are these apparently-out-of-spec 256K segments getting generated at all?
>
> Robin.

I guess this is what you are looking for:

	/* Host can optionally specify maximum segment size and number of
	 * segments. */
	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SIZE_MAX,
				   struct virtio_blk_config, size_max, &v);
	if (!err)
		blk_queue_max_segment_size(q, v);
	else
		blk_queue_max_segment_size(q, -1U);

virtio isn't the only caller with a value >64K:

$ git grep -A1 blk_queue_max_segment_size
Documentation/block/biodoc.txt: blk_queue_max_segment_size(q, max_seg_size)
Documentation/block/biodoc.txt-         Maximum size of a clustered segment, 64kB default.
--
block/blk-settings.c: * blk_queue_max_segment_size - set max segment size for blk_rq_map_sg
block/blk-settings.c- * @q:  the request queue for the device
--
block/blk-settings.c:void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
block/blk-settings.c-{
--
block/blk-settings.c:EXPORT_SYMBOL(blk_queue_max_segment_size);
block/blk-settings.c-
--
drivers/block/mtip32xx/mtip32xx.c:      blk_queue_max_segment_size(dd->queue, 0x400000);
drivers/block/mtip32xx/mtip32xx.c-      blk_queue_io_min(dd->queue, 4096);
--
drivers/block/nbd.c:    blk_queue_max_segment_size(disk->queue, UINT_MAX);
drivers/block/nbd.c-    blk_queue_max_segments(disk->queue, USHRT_MAX);
--
drivers/block/ps3disk.c:        blk_queue_max_segment_size(queue, dev->bounce_size);
drivers/block/ps3disk.c-
--
drivers/block/ps3vram.c:        blk_queue_max_segment_size(queue, BLK_MAX_SEGMENT_SIZE);
drivers/block/ps3vram.c-        blk_queue_max_hw_sectors(queue, BLK_SAFE_MAX_SECTORS);
--
drivers/block/rbd.c:    blk_queue_max_segment_size(q, UINT_MAX);
drivers/block/rbd.c-    blk_queue_io_min(q, objset_bytes);
--
drivers/block/sunvdc.c: blk_queue_max_segment_size(q, PAGE_SIZE);
drivers/block/sunvdc.c-
--
drivers/block/virtio_blk.c:             blk_queue_max_segment_size(q, v);
drivers/block/virtio_blk.c-     else
drivers/block/virtio_blk.c:             blk_queue_max_segment_size(q, -1U);
drivers/block/virtio_blk.c-
--
drivers/block/xen-blkfront.c:   blk_queue_max_segment_size(rq, PAGE_SIZE);
drivers/block/xen-blkfront.c-
--
drivers/cdrom/gdrom.c:  blk_queue_max_segment_size(gd.gdrom_rq, 0x40000);
drivers/cdrom/gdrom.c-  gd.disk->queue = gd.gdrom_rq;
--
drivers/memstick/core/ms_block.c:       blk_queue_max_segment_size(msb->queue,
drivers/memstick/core/ms_block.c-                                  MS_BLOCK_MAX_PAGES * msb->page_size);
--
drivers/memstick/core/mspro_block.c:    blk_queue_max_segment_size(msb->queue,
drivers/memstick/core/mspro_block.c-                               MSPRO_BLOCK_MAX_PAGES * msb->page_size);
--
drivers/mmc/core/queue.c:       blk_queue_max_segment_size(mq->queue, host->max_seg_size);
drivers/mmc/core/queue.c-
--
drivers/s390/block/dasd.c:      blk_queue_max_segment_size(q, PAGE_SIZE);
drivers/s390/block/dasd.c-      blk_queue_segment_boundary(q, PAGE_SIZE - 1);
--
drivers/scsi/be2iscsi/be_main.c:        blk_queue_max_segment_size(sdev->request_queue, 65536);
drivers/scsi/be2iscsi/be_main.c-        return 0;
--
drivers/scsi/scsi_debug.c:      blk_queue_max_segment_size(sdp->request_queue, -1U);
drivers/scsi/scsi_debug.c-      if (sdebug_no_uld)
--
drivers/scsi/scsi_lib.c:        blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));
drivers/scsi/scsi_lib.c-
--
drivers/scsi/ufs/ufshcd.c:      blk_queue_max_segment_size(q, PRDT_DATA_BYTE_COUNT_MAX);
drivers/scsi/ufs/ufshcd.c-
--
include/linux/blkdev.h:extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
include/linux/blkdev.h-extern void blk_queue_max_discard_sectors(struct request_queue *q,
--
include/linux/mmc/host.h:       unsigned int            max_seg_size;   /* see blk_queue_max_segment_size */
include/linux/mmc/host.h-       unsigned short          max_segs;       /* see blk_queue_max_segments */

Some of these devices are probably not going to work well if passed through to an SEV guest. Going back to virtio: at some level virtio is like a stacking device, so it does not necessarily need a limit.
On Mon, Jan 14, 2019 at 09:19:35PM +0100, Christoph Hellwig wrote:
> > Christoph is saying that !IOMMU_PLATFORM devices should hide the compatibility code in a special per-device DMA API implementation. Which would be fine, especially if we can manage not to introduce a bunch of indirect calls all over the place and hurt performance. It's just that the benefit is unlikely to be big (e.g. we can't also get rid of the virtio specific memory barriers), so no one was motivated enough to work on it.
>
> No.

Oh, ok then.

> The problem is that we still haven't fully specified what IOMMU_PLATFORM and !IOMMU_PLATFORM actually mean. Your "ACCESS_PLATFORM/ORDER_PLATFORM" commit in the virtio-spec repo improves it a little bit, but it is still far from enough.
>
> As a start, VIRTIO_F_ORDER_PLATFORM and VIRTIO_F_ACCESS_PLATFORM absolutely MUST be set for hardware implementations. Otherwise said hardware has no chance of working on anything but the most x86-like systems.

I think you are exaggerating a little here. Limited access with e.g. a need to flush caches is common on lower end but less common on higher end systems used for virtualization. As you point out, that changes with e.g. SEV, but then SEV is a pretty niche thing so far. So I would probably make that a SHOULD.

> Second, software implementations SHOULD set VIRTIO_F_ACCESS_PLATFORM, because otherwise we can't add proper handling for things like SEV or the IBM "secure hypervisor" thing.

Yes. If the host does not have access to all of memory it SHOULD set VIRTIO_F_ACCESS_PLATFORM. It seems rather straightforward to add conformance clauses for this, thanks for reminding me.

> Last but not least, a lot of wording outside the area describing these flags really needs some major updates in terms of DMA access.

Thanks for the reminder. Yes, generally the spec tends to still say e.g. "physical address" where it really means a "DMA address" in Linux terms. Whether changing that will make things much clearer I'm not sure; e.g. hardware designers don't much care, I guess. If you want to list the most problematic cases, e.g. in a github issue, we might get to clarifying them sooner. Thanks!
On Mon, Jan 14, 2019 at 01:20:45PM -0500, Michael S. Tsirkin wrote:
> Which would be fine, especially if we can manage not to introduce a bunch of indirect calls all over the place and hurt performance.

Which indirect calls? In the case of unset dma_ops the DMA-API functions call directly into the dma-direct implementation, no indirect calls at all.

Regards,

	Joerg
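(For illustration, the dispatch Joerg is describing looks roughly like the following in the 5.0-era include/linux/dma-mapping.h; this is simplified from memory, so treat the exact code as an approximation.)

	static inline dma_addr_t dma_map_page_attrs(struct device *dev,
			struct page *page, size_t offset, size_t size,
			enum dma_data_direction dir, unsigned long attrs)
	{
		const struct dma_map_ops *ops = get_dma_ops(dev);

		if (dma_is_direct(ops))
			/* no dma_ops installed: direct call, no indirection */
			return dma_direct_map_page(dev, page, offset, size,
						   dir, attrs);
		/* IOMMU or other implementation: indirect call through ops */
		return ops->map_page(dev, page, offset, size, dir, attrs);
	}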
On Mon, Jan 14, 2019 at 03:48:47PM -0500, Michael S. Tsirkin wrote:
> I think you are exaggerating a little here. Limited access with e.g. a need to flush caches is common on lower end but less common on higher end systems used for virtualization. As you point out, that changes with e.g. SEV, but then SEV is a pretty niche thing so far.
>
> So I would probably make that a SHOULD.

The problem is that without using DMA ops you can't just plug the device into a random system, which is a total no-go for peripherals. And cache flushing is not that uncommon, e.g. every sparc system needs it, many arm/arm64 ones, some power ones, lots of mips ones, etc.

But cache flushing is just one side of it - lots of systems have either mandatory or optional IOMMUs, and without the platform access flag that doesn't work. As does the case where you have a DMA offset, which is extremely common on arm and power systems.

So you might find a few non-x86 systems you can plug it in and it'll just work, but it won't be all that many. And even on x86 your device will be completely broken if the user dares to use an IOMMU.

> > flags really needs some major updates in terms of DMA access.
>
> Thanks for the reminder. Yes, generally the spec tends to still say e.g. "physical address" where it really means a "DMA address" in Linux terms. Whether changing that will make things much clearer I'm not sure; e.g. hardware designers don't much care, I guess.

At least say bus address - as said above, the CPU and bus concepts of physical addresses are different on many systems, including x86 when you use IOMMUs.
On Tue, Jan 15, 2019 at 09:37:42AM +0100, Joerg Roedel wrote:
> On Mon, Jan 14, 2019 at 01:20:45PM -0500, Michael S. Tsirkin wrote:
> > Which would be fine, especially if we can manage not to introduce a bunch of indirect calls all over the place and hurt performance.
>
> Which indirect calls? In the case of unset dma_ops the DMA-API functions call directly into the dma-direct implementation, no indirect calls at all.

True. But the NULL-ops dma-direct case still has two issues that might not work for virtio:

 (a) if the architecture supports devices that are not DMA coherent it will dip into a special allocator and do cache maintenance for streaming mappings. Although it would be a bit of a hack, we could work around this in virtio by doing something like:

	#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
	    defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
	    defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
		dev->dma_coherent = true;
	#endif

     except that won't work for mips, which has a mode where it uses a system-level instead of a device-level coherency flag and thus doesn't look at this struct device field

 (b) we can still mangle the DMA address, either using the dma_pfn_offset field in struct device, or by a full override of __phys_to_dma / __dma_to_phys by the architecture code. The first could be solved using a hack like the one above, but the latter would be a little harder. In the long run I'd love to get rid of that hook and have everyone use the generic offset code, but for that we first need to support multiple ranges with different offsets and do quite some nasty arch code surgery.

So for the legacy virtio case I fear we need to keep a local dma mapping implementation for now. I just wish no recent hypervisor would ever offer devices in this broken legacy mode..

> Regards,
>
>	Joerg
---end quoted text---
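(For reference, the dma_pfn_offset mangling mentioned in (b) is roughly the following generic default from the dma-direct code of that era; architectures may also override __phys_to_dma() entirely, which is the harder case Christoph describes. Reproduced from memory, so treat it as an approximation.)

	static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
	{
		dma_addr_t dev_addr = (dma_addr_t)paddr;

		/* shift the CPU physical address by the per-device PFN offset */
		return dev_addr - ((dma_addr_t)dev->dma_pfn_offset << PAGE_SHIFT);
	}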
On Tue, Jan 15, 2019 at 02:20:19PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 15, 2019 at 09:37:42AM +0100, Joerg Roedel wrote:
> > On Mon, Jan 14, 2019 at 01:20:45PM -0500, Michael S. Tsirkin wrote:
> > > Which would be fine, especially if we can manage not to introduce a bunch of indirect calls all over the place and hurt performance.
> >
> > Which indirect calls? In the case of unset dma_ops the DMA-API functions call directly into the dma-direct implementation, no indirect calls at all.
>
> True. But the NULL-ops dma-direct case still has two issues that might not work for virtio:
>
>  (a) if the architecture supports devices that are not DMA coherent it will dip into a special allocator and do cache maintenance for streaming mappings. Although it would be a bit of a hack, we could work around this in virtio by doing something like:
>
>	#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
>	    defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
>	    defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
>		dev->dma_coherent = true;
>	#endif
>
>      except that won't work for mips, which has a mode where it uses a system-level instead of a device-level coherency flag and thus doesn't look at this struct device field
>
>  (b) we can still mangle the DMA address, either using the dma_pfn_offset field in struct device, or by a full override of __phys_to_dma / __dma_to_phys by the architecture code. The first could be solved using a hack like the one above, but the latter would be a little harder. In the long run I'd love to get rid of that hook and have everyone use the generic offset code, but for that we first need to support multiple ranges with different offsets and do quite some nasty arch code surgery.
>
> So for the legacy virtio case I fear we need to keep a local dma mapping implementation for now. I just wish no recent hypervisor would ever offer devices in this broken legacy mode..

IIUC some emulated hardware has no reasonable way to specify that some devices bypass the IOMMU, and no cheap way to cover all memory with an IOMMU mapping. So I suspect !ACCESS_PLATFORM will stay a supported configuration forever :(

> > Regards,
> >
> >	Joerg
> ---end quoted text---