[0/3] Fix virtio-blk issue with SWIOTLB

Message ID 20190110134433.15672-1-joro@8bytes.org

Message

Joerg Roedel Jan. 10, 2019, 1:44 p.m. UTC
Hi,

there is a problem with virtio-blk driven devices when
virtio-ring uses SWIOTLB through the DMA-API. This happens
for example in AMD-SEV enabled guests, where the guest RAM
is mostly encrypted and all emulated DMA has to happen
to/from the SWIOTLB aperture.

The problem is a limitation of the SWIOTLB implementation,
which does not support allocations larger than 256kb. When
the virtio-blk driver tries to read/write a block larger
than that, the allocation of the dma-handle fails and an IO
error is reported.
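
For reference, the 256kb figure falls out of the SWIOTLB slot
geometry; a sketch of the relevant constants, as defined in
include/linux/swiotlb.h around this time:

	/* A SWIOTLB mapping is made from IO_TLB_SEGSIZE contiguous
	 * slots of (1 << IO_TLB_SHIFT) bytes each, and a single
	 * mapping cannot span more than one such segment. */
	#define IO_TLB_SHIFT	11	/* 2048 bytes per slot */
	#define IO_TLB_SEGSIZE	128	/* slots per contiguous mapping */
	/* => maximum size = 128 * 2048 = 262144 bytes = 256kb */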

This patch-set adds a check to the virtio code for whether it
might be using SWIOTLB bounce buffering and limits the
maximum segment size in the virtio-blk driver in this case,
so that it doesn't try to do larger reads/writes.
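
A minimal sketch of the virtio-blk side (the helper name is the one
patch 2 introduces; the exact code here is illustrative, not the
patch itself):

	/* In virtblk_probe(): never advertise segments larger than the
	 * transport can map in one go.  virtio_max_dma_size() would
	 * return SIZE_MAX when no bounce buffering is in use. */
	size_t max_size = virtio_max_dma_size(vdev);

	blk_queue_max_segment_size(q, min_t(size_t, max_size, U32_MAX));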

Please review.

Thanks,

	Joerg

Joerg Roedel (3):
  swiotlb: Export maximum allocation size
  virtio: Introduce virtio_max_dma_size()
  virtio-blk: Consider virtio_max_dma_size() for maximum segment size

 drivers/block/virtio_blk.c   | 10 ++++++----
 drivers/virtio/virtio_ring.c | 11 +++++++++++
 include/linux/swiotlb.h      | 12 ++++++++++++
 include/linux/virtio.h       |  2 ++
 4 files changed, 31 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig Jan. 10, 2019, 1:59 p.m. UTC | #1
On Thu, Jan 10, 2019 at 02:44:30PM +0100, Joerg Roedel wrote:
> Hi,
> 
> there is a problem with virtio-blk driven devices when
> virtio-ring uses SWIOTLB through the DMA-API. This happens
> for example in AMD-SEV enabled guests, where the guest RAM
> is mostly encrypted and all emulated DMA has to happen
> to/from the SWIOTLB aperture.
> 
> The problem is a limitation of the SWIOTLB implementation,
> which does not support allocations larger than 256kb. When
> the virtio-blk driver tries to read/write a block larger
> than that, the allocation of the dma-handle fails and an IO
> error is reported.

s/allocations/mappings/, right?  We don't use the swiotlb
buffer for the coherent allocations anymore, and I don't think
virtio-blk uses them anyway.

> This patch-set adds a check to the virtio code for whether it
> might be using SWIOTLB bounce buffering and limits the
> maximum segment size in the virtio-blk driver in this case,
> so that it doesn't try to do larger reads/writes.

I really don't like the fact that we hardcode swiotlb specifics.
This needs to be indirect through the dma ops or struct device,
as various iommu implementations also have limits (although
usually much larger ones).
Joerg Roedel Jan. 10, 2019, 2:26 p.m. UTC | #2
Hi Christoph,

On Thu, Jan 10, 2019 at 02:59:52PM +0100, Christoph Hellwig wrote:
> On Thu, Jan 10, 2019 at 02:44:30PM +0100, Joerg Roedel wrote:

> > The problem is a limitation of the SWIOTLB implementation,
> > which does not support allocations larger than 256kb. When
> > the virtio-blk driver tries to read/write a block larger
> > than that, the allocation of the dma-handle fails and an IO
> > error is reported.
> 
> s/allocations/mappings/, right?  We don't use the swiotlb
> buffer for the coherent allocations anymore, and I don't think
> virtio-blk uses them anyway.

'Allocation' in the sense that there are address ranges allocated, not
memory, but mappings would also be right.

> I really don't like the fact that we hardcode swiotlb specifics.
> This needs to be indirect through the dma ops or struct device,
> as various iommu implementations also have limits (although
> usually much larger ones).

I thought about exposing it through the DMA-API, but didn't go that
route as I didn't want to extend a generic API for some SWIOTLB
specifics. But if there are more implementations that have a size
limitation it certainly makes sense to put it into the DMA-API. I'll
change that in the next version.
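
As a sketch, the indirection could look like this (roughly the shape
of the dma_max_mapping_size() interface that later grew out of this
discussion; simplified and illustrative):

	/* Optional new callback in struct dma_map_ops: */
	struct dma_map_ops {
		/* ... existing callbacks ... */
		size_t (*max_mapping_size)(struct device *dev);
	};

	size_t dma_max_mapping_size(struct device *dev)
	{
		const struct dma_map_ops *ops = get_dma_ops(dev);

		if (ops && ops->max_mapping_size)
			return ops->max_mapping_size(dev);
		return SIZE_MAX;	/* no known limit */
	}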

Thanks,

	Joerg
Jason Wang Jan. 11, 2019, 3:29 a.m. UTC | #3
On 2019/1/10 下午9:44, Joerg Roedel wrote:
> Hi,
>
> there is a problem with virtio-blk driven devices when
> virtio-ring uses SWIOTLB through the DMA-API. This happens
> for example in AMD-SEV enabled guests, where the guest RAM
> is mostly encrypted and all emulated DMA has to happen
> to/from the SWIOTLB aperture.
>
> The problem is a limitation of the SWIOTLB implementation,
> which does not support allocations larger than 256kb. When
> the virtio-blk driver tries to read/write a block larger
> than that, the allocation of the dma-handle fails and an IO
> error is reported.
>
> This patch-set adds a check to the virtio code for whether it
> might be using SWIOTLB bounce buffering and limits the
> maximum segment size in the virtio-blk driver in this case,
> so that it doesn't try to do larger reads/writes.


Just wondering if my understanding is correct: IOMMU_PLATFORM must be set
for all virtio devices under AMD-SEV guests?

Thanks


>
> Please review.
>
> Thanks,
>
> 	Joerg
>
> Joerg Roedel (3):
>    swiotlb: Export maximum allocation size
>    virtio: Introduce virtio_max_dma_size()
>    virtio-blk: Consider virtio_max_dma_size() for maximum segment size
>
>   drivers/block/virtio_blk.c   | 10 ++++++----
>   drivers/virtio/virtio_ring.c | 11 +++++++++++
>   include/linux/swiotlb.h      | 12 ++++++++++++
>   include/linux/virtio.h       |  2 ++
>   4 files changed, 31 insertions(+), 4 deletions(-)
>
Joerg Roedel Jan. 11, 2019, 9:15 a.m. UTC | #4
On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set
> for all virtio devices under AMD-SEV guests?

Yes, that is correct. Emulated DMA can only happen on the SWIOTLB
aperture, because that memory is not encrypted. The guest bounces the
data then to its encrypted memory.
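
For context, that decision is made in drivers/virtio/virtio_ring.c;
an abridged sketch of vring_use_dma_api() as of this era:

	static bool vring_use_dma_api(struct virtio_device *vdev)
	{
		/* IOMMU_PLATFORM negotiated: use the DMA API, which
		 * under SEV routes all transfers through the SWIOTLB
		 * aperture. */
		if (!virtio_has_iommu_quirk(vdev))
			return true;

		/* Xen grant mappings always need the DMA API. */
		if (xen_domain())
			return true;

		/* Legacy bypass: hand guest-physical addresses to the
		 * host directly. */
		return false;
	}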

Regards,

	Joerg
Jason Wang Jan. 14, 2019, 9:41 a.m. UTC | #5
On 2019/1/11 下午5:15, Joerg Roedel wrote:
> On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
>> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set
>> for all virtio devices under AMD-SEV guests?
> Yes, that is correct. Emulated DMA can only happen on the SWIOTLB
> aperture, because that memory is not encrypted. The guest bounces the
> data then to its encrypted memory.
>
> Regards,
>
> 	Joerg


Thanks. Have you tested vhost-net in this case? I suspect it may not work
Christoph Hellwig Jan. 14, 2019, 9:50 a.m. UTC | #6
On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
>
> On 2019/1/11 下午5:15, Joerg Roedel wrote:
>> On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
>>> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set
>>> for all virtio devices under AMD-SEV guests?
>> Yes, that is correct. Emulated DMA can only happen on the SWIOTLB
>> aperture, because that memory is not encrypted. The guest bounces the
>> data then to its encrypted memory.
>>
>> Regards,
>>
>> 	Joerg
>
>
> Thanks. Have you tested vhost-net in this case? I suspect it may not work

Which brings me back to my pet peeve that we need to take action so
that virtio uses the proper dma mapping API by default with quirks
for legacy cases.  The magic bypass it uses is just causing problems
over problems.
Jason Wang Jan. 14, 2019, 12:41 p.m. UTC | #7
On 2019/1/14 下午5:50, Christoph Hellwig wrote:
> On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
>> On 2019/1/11 下午5:15, Joerg Roedel wrote:
>>> On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
>>>> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set
>>>> for all virtio devices under AMD-SEV guests?
>>> Yes, that is correct. Emulated DMA can only happen on the SWIOTLB
>>> aperture, because that memory is not encrypted. The guest bounces the
>>> data then to its encrypted memory.
>>>
>>> Regards,
>>>
>>> 	Joerg
>>
>> Thanks. Have you tested vhost-net in this case? I suspect it may not work
> Which brings me back to my pet peeve that we need to take action so
> that virtio uses the proper dma mapping API by default with quirks
> for legacy cases.  The magic bypass it uses is just causing problems
> over problems.


Yes, I fully agree with you. This is probably an exact example of such a
problem.

Thanks
Michael S. Tsirkin Jan. 14, 2019, 6:20 p.m. UTC | #8
On Mon, Jan 14, 2019 at 08:41:37PM +0800, Jason Wang wrote:
> 
> On 2019/1/14 下午5:50, Christoph Hellwig wrote:
> > On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
> > > On 2019/1/11 下午5:15, Joerg Roedel wrote:
> > > > On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
> > > > > Just wondering if my understanding is correct: IOMMU_PLATFORM must be set
> > > > > for all virtio devices under AMD-SEV guests?
> > > > Yes, that is correct. Emulated DMA can only happen on the SWIOTLB
> > > > aperture, because that memory is not encrypted. The guest bounces the
> > > > data then to its encrypted memory.
> > > > 
> > > > Regards,
> > > > 
> > > > 	Joerg
> > > 
> > > Thanks. Have you tested vhost-net in this case? I suspect it may not work
> > Which brings me back to my pet peeve that we need to take action so
> > that virtio uses the proper dma mapping API by default with quirks
> > for legacy cases.  The magic bypass it uses is just causing problems
> > over problems.
> 
> 
> Yes, I fully agree with you. This is probably an exact example of such a
> problem.
> 
> Thanks

I don't think so - the issue is really that DMA API does not yet handle
the SEV case 100% correctly. I suspect passthrough devices would have
the same issue.

In fact whoever sets IOMMU_PLATFORM is completely unaffected by
Christoph's pet peeve.

Christoph is saying that !IOMMU_PLATFORM devices should hide the
compatibility code in a special per-device DMA API implementation.
Which would be fine especially if we can manage not to introduce a bunch
of indirect calls all over the place and hurt performance.  It's just
that the benefit is unlikely to be big (e.g. we can't also get rid of
the virtio specific memory barriers) so no one was motivated enough to
work on it.
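
A rough sketch of that per-device indirection (set_dma_ops() is the
existing kernel interface; the virtio_legacy_* callbacks are purely
hypothetical placeholders):

	/* Hypothetical: route legacy (!IOMMU_PLATFORM) virtio devices
	 * through their own trivial DMA ops instead of the open-coded
	 * bypass in virtio_ring.c. */
	static const struct dma_map_ops virtio_legacy_dma_ops = {
		.map_page	= virtio_legacy_map_page,	/* dma addr == phys addr */
		.unmap_page	= virtio_legacy_unmap_page,	/* no-op */
		/* ... */
	};

	set_dma_ops(&vdev->dev, &virtio_legacy_dma_ops);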
Brijesh Singh Jan. 14, 2019, 7:09 p.m. UTC | #9
On 1/14/19 12:20 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 14, 2019 at 08:41:37PM +0800, Jason Wang wrote:
>>
>> On 2019/1/14 下午5:50, Christoph Hellwig wrote:
>>> On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
>>>> On 2019/1/11 下午5:15, Joerg Roedel wrote:
>>>>> On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
>>>>>> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set
>>>>>> for all virtio devices under AMD-SEV guests?
>>>>> Yes, that is correct. Emulated DMA can only happen on the SWIOTLB
>>>>> aperture, because that memory is not encrypted. The guest bounces the
>>>>> data then to its encrypted memory.
>>>>>
>>>>> Regards,
>>>>>
>>>>> 	Joerg
>>>>
>>>> Thanks. Have you tested vhost-net in this case? I suspect it may not work
>>> Which brings me back to my pet peeve that we need to take action so
>>> that virtio uses the proper dma mapping API by default with quirks
>>> for legacy cases.  The magic bypass it uses is just causing problems
>>> over problems.
>>
>>
>> Yes, I fully agree with you. This is probably an exact example of such a
>> problem.
>>
>> Thanks
> 
> I don't think so - the issue is really that DMA API does not yet handle
> the SEV case 100% correctly. I suspect passthrough devices would have
> the same issue.
> 


In case of SEV, emulated DMA is performed through the SWIOTLB
(which bounces the encrypted buffers). The issue reported here will
happen on any platform which is making use of SWIOTLB. We could
easily reproduce the virtio-blk failure if we configure
swiotlb=force in a non-SEV guest. Unfortunately, in case of SEV the
SWIOTLB is a must. As Joerg highlighted, the main issue is a limitation
of the SWIOTLB: it does not support allocations/mappings larger than 256kb.


> In fact whoever sets IOMMU_PLATFORM is completely unaffected by
> Christoph's pet peeve.
> 
> Christoph is saying that !IOMMU_PLATFORM devices should hide the
> compatibility code in a special per-device DMA API implementation.
> Which would be fine especially if we can manage not to introduce a bunch
> of indirect calls all over the place and hurt performance.  It's just
> that the benefit is unlikely to be big (e.g. we can't also get rid of
> the virtio specific memory barriers) so no one was motivated enough to
> work on it.
>
Robin Murphy Jan. 14, 2019, 7:12 p.m. UTC | #10
On 14/01/2019 18:20, Michael S. Tsirkin wrote:
> On Mon, Jan 14, 2019 at 08:41:37PM +0800, Jason Wang wrote:
>>
>> On 2019/1/14 下午5:50, Christoph Hellwig wrote:
>>> On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
>>>> On 2019/1/11 下午5:15, Joerg Roedel wrote:
>>>>> On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
>>>>>> Just wondering if my understanding is correct: IOMMU_PLATFORM must be set
>>>>>> for all virtio devices under AMD-SEV guests?
>>>>> Yes, that is correct. Emulated DMA can only happen on the SWIOTLB
>>>>> aperture, because that memory is not encrypted. The guest bounces the
>>>>> data then to its encrypted memory.
>>>>>
>>>>> Regards,
>>>>>
>>>>> 	Joerg
>>>>
>>>> Thanks. Have you tested vhost-net in this case? I suspect it may not work
>>> Which brings me back to my pet peeve that we need to take action so
>>> that virtio uses the proper dma mapping API by default with quirks
>>> for legacy cases.  The magic bypass it uses is just causing problems
>>> over problems.
>>
>>
>> Yes, I fully agree with you. This is probably an exact example of such a
>> problem.
>>
>> Thanks
> 
> I don't think so - the issue is really that DMA API does not yet handle
> the SEV case 100% correctly. I suspect passthrough devices would have
> the same issue.

Huh? Regardless of which virtio devices use it or not, the DMA API is 
handling the SEV case as correctly as it possibly can, by forcing 
everything through the unencrypted bounce buffer. If the segments being 
mapped are too big for that bounce buffer in the first place, there's 
nothing it can possibly do except fail, gracefully or otherwise.

Now, in theory, yes, the real issue at hand is not unique to virtio-blk 
nor SEV - any driver whose device has a sufficiently large DMA segment 
size and who manages to get sufficient physically-contiguous memory 
could technically generate a scatterlist segment longer than SWIOTLB can 
handle. However, in practice that basically never happens, not least 
because very few drivers ever override the default 64K DMA segment 
limit. AFAICS nothing in drivers/virtio is calling 
dma_set_max_seg_size() or otherwise assigning any dma_parms to replace 
the defaults either, so the really interesting question here is how are 
these apparently-out-of-spec 256K segments getting generated at all?
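
The 64K default mentioned above comes from dma_get_max_seg_size(),
roughly (per include/linux/dma-mapping.h of this era):

	static inline unsigned int dma_get_max_seg_size(struct device *dev)
	{
		/* Drivers opt in to larger segments with
		 * dma_set_max_seg_size(); without dma_parms the limit
		 * defaults to 64K. */
		if (dev->dma_parms && dev->dma_parms->max_segment_size)
			return dev->dma_parms->max_segment_size;
		return SZ_64K;
	}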

Robin.
Christoph Hellwig Jan. 14, 2019, 8:19 p.m. UTC | #11
On Mon, Jan 14, 2019 at 01:20:45PM -0500, Michael S. Tsirkin wrote:
> I don't think so - the issue is really that DMA API does not yet handle
> the SEV case 100% correctly. I suspect passthrough devices would have
> the same issue.

The DMA API handles the SEV case perfectly.  It's just that virtio_blk
supports huge segments that virtio does not generally support, but that
is not related to SEV in any way.

> In fact whoever sets IOMMU_PLATFORM is completely unaffected by
> Christoph's pet peeve.

No, the above happens only when we set IOMMU_PLATFORM.

> Christoph is saying that !IOMMU_PLATFORM devices should hide the
> compatibility code in a special per-device DMA API implementation.
> Which would be fine especially if we can manage not to introduce a bunch
> of indirect calls all over the place and hurt performance.  It's just
> that the benefit is unlikely to be big (e.g. we can't also get rid of
> the virtio specific memory barriers) so no one was motivated enough to
> work on it.

No.  The problem is that we still haven't fully specified what
IOMMU_PLATFORM and !IOMMU_PLATFORM actually mean.  Your
"ACCESS_PLATFORM/ORDER_PLATFORM" commit in the virtio-spec repo
improves it a little bit, but it is still far from enough.

As a start VIRTIO_F_ORDER_PLATFORM and VIRTIO_F_ACCESS_PLATFORM
absolutely MUST be set for hardware implementations.  Otherwise said
hardware has no chance of working on anything but the most x86-like
systems.

Second, software implementations SHOULD set VIRTIO_F_ACCESS_PLATFORM,
because otherwise we can't add proper handling for things like SEV or
the IBM "secure hypervisor" thing.

Last but not least a lot of wording outside the area describing these
flags really needs some major updates in terms of DMA access.
Christoph Hellwig Jan. 14, 2019, 8:22 p.m. UTC | #12
On Mon, Jan 14, 2019 at 07:12:08PM +0000, Robin Murphy wrote:
> Now, in theory, yes, the real issue at hand is not unique to virtio-blk nor 
> SEV - any driver whose device has a sufficiently large DMA segment size and 
> who manages to get sufficient physically-contiguous memory could 
> technically generate a scatterlist segment longer than SWIOTLB can handle. 
> However, in practice that basically never happens, not least because very 
> few drivers ever override the default 64K DMA segment limit. AFAICS nothing 
> in drivers/virtio is calling dma_set_max_seg_size() or otherwise assigning 
> any dma_parms to replace the defaults either, so the really interesting 
> question here is how are these apparently-out-of-spec 256K segments getting 
> generated at all?

drivers/block/virtio_blk.c:virtblk_probe():

	/* Host can optionally specify maximum segment size and number of
	 * segments. */
	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SIZE_MAX,
				   struct virtio_blk_config, size_max, &v);
	if (!err)
		blk_queue_max_segment_size(q, v);
	else
		blk_queue_max_segment_size(q, -1U);

So it really is virtio_blk that is special here.  The host could
set VIRTIO_BLK_F_SIZE_MAX to paper over the problem, but I really
think we need a dma_max_segment_size API that is wired up through
the dma_map_ops to sort this out for real.
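
The SWIOTLB side of such an API is tiny; a sketch of what it could
look like (roughly what later landed in mainline as
swiotlb_max_mapping_size()):

	/* Largest contiguous bounce mapping a device can get:
	 * 128 slots * 2048 bytes = 256kb. */
	size_t swiotlb_max_mapping_size(struct device *dev)
	{
		return ((size_t)1 << IO_TLB_SHIFT) * IO_TLB_SEGSIZE;
	}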
Michael S. Tsirkin Jan. 14, 2019, 8:29 p.m. UTC | #13
On Mon, Jan 14, 2019 at 07:12:08PM +0000, Robin Murphy wrote:
> On 14/01/2019 18:20, Michael S. Tsirkin wrote:
> > On Mon, Jan 14, 2019 at 08:41:37PM +0800, Jason Wang wrote:
> > > 
> > > On 2019/1/14 下午5:50, Christoph Hellwig wrote:
> > > > On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote:
> > > > > On 2019/1/11 下午5:15, Joerg Roedel wrote:
> > > > > > On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote:
> > > > > > > Just wondering if my understanding is correct: IOMMU_PLATFORM must be set
> > > > > > > for all virtio devices under AMD-SEV guests?
> > > > > > Yes, that is correct. Emulated DMA can only happen on the SWIOTLB
> > > > > > aperture, because that memory is not encrypted. The guest bounces the
> > > > > > data then to its encrypted memory.
> > > > > > 
> > > > > > Regards,
> > > > > > 
> > > > > > 	Joerg
> > > > > 
> > > > Thanks. Have you tested vhost-net in this case? I suspect it may not work
> > > > Which brings me back to my pet peeve that we need to take action so
> > > > that virtio uses the proper dma mapping API by default with quirks
> > > > for legacy cases.  The magic bypass it uses is just causing problems
> > > > over problems.
> > > 
> > > 
> > > Yes, I fully agree with you. This is probably an exact example of such a
> > > problem.
> > > 
> > > Thanks
> > 
> > I don't think so - the issue is really that DMA API does not yet handle
> > the SEV case 100% correctly. I suspect passthrough devices would have
> > the same issue.
> 
> Huh? Regardless of which virtio devices use it or not, the DMA API is
> handling the SEV case as correctly as it possibly can, by forcing everything
> through the unencrypted bounce buffer. If the segments being mapped are too
> big for that bounce buffer in the first place, there's nothing it can
> possibly do except fail, gracefully or otherwise.

It seems reasonable to be able to ask it what its capabilities are,
though.

> Now, in theory, yes, the real issue at hand is not unique to virtio-blk nor
> SEV - any driver whose device has a sufficiently large DMA segment size and
> who manages to get sufficient physically-contiguous memory could technically
> generate a scatterlist segment longer than SWIOTLB can handle. However, in
> practice that basically never happens, not least because very few drivers
> ever override the default 64K DMA segment limit. AFAICS nothing in
> drivers/virtio is calling dma_set_max_seg_size() or otherwise assigning any
> dma_parms to replace the defaults either, so the really interesting question
> here is how are these apparently-out-of-spec 256K segments getting generated
> at all?
> 
> Robin.

I guess this is what you are looking for:

        /* Host can optionally specify maximum segment size and number of
         * segments. */
        err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SIZE_MAX,
                                   struct virtio_blk_config, size_max, &v);
        if (!err)
                blk_queue_max_segment_size(q, v);
        else
                blk_queue_max_segment_size(q, -1U);

virtio isn't the only caller with a value >64K:

$ git grep -A1 blk_queue_max_segment_size
Documentation/block/biodoc.txt: blk_queue_max_segment_size(q, max_seg_size)
Documentation/block/biodoc.txt-         Maximum size of a clustered segment, 64kB default.
--
block/blk-settings.c: * blk_queue_max_segment_size - set max segment size for blk_rq_map_sg
block/blk-settings.c- * @q:  the request queue for the device
--
block/blk-settings.c:void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
block/blk-settings.c-{
--
block/blk-settings.c:EXPORT_SYMBOL(blk_queue_max_segment_size);
block/blk-settings.c-
--
drivers/block/mtip32xx/mtip32xx.c:      blk_queue_max_segment_size(dd->queue, 0x400000);
drivers/block/mtip32xx/mtip32xx.c-      blk_queue_io_min(dd->queue, 4096);
--
drivers/block/nbd.c:    blk_queue_max_segment_size(disk->queue, UINT_MAX);
drivers/block/nbd.c-    blk_queue_max_segments(disk->queue, USHRT_MAX);
--
drivers/block/ps3disk.c:        blk_queue_max_segment_size(queue, dev->bounce_size);
drivers/block/ps3disk.c-
--
drivers/block/ps3vram.c:        blk_queue_max_segment_size(queue, BLK_MAX_SEGMENT_SIZE);
drivers/block/ps3vram.c-        blk_queue_max_hw_sectors(queue, BLK_SAFE_MAX_SECTORS);
--
drivers/block/rbd.c:    blk_queue_max_segment_size(q, UINT_MAX);
drivers/block/rbd.c-    blk_queue_io_min(q, objset_bytes);
--
drivers/block/sunvdc.c: blk_queue_max_segment_size(q, PAGE_SIZE);
drivers/block/sunvdc.c-
--
drivers/block/virtio_blk.c:             blk_queue_max_segment_size(q, v);
drivers/block/virtio_blk.c-     else
drivers/block/virtio_blk.c:             blk_queue_max_segment_size(q, -1U);
drivers/block/virtio_blk.c-
--
drivers/block/xen-blkfront.c:   blk_queue_max_segment_size(rq, PAGE_SIZE);
drivers/block/xen-blkfront.c-
--
drivers/cdrom/gdrom.c:  blk_queue_max_segment_size(gd.gdrom_rq, 0x40000);
drivers/cdrom/gdrom.c-  gd.disk->queue = gd.gdrom_rq;
--
drivers/memstick/core/ms_block.c:       blk_queue_max_segment_size(msb->queue,
drivers/memstick/core/ms_block.c-                                  MS_BLOCK_MAX_PAGES * msb->page_size);
--
drivers/memstick/core/mspro_block.c:    blk_queue_max_segment_size(msb->queue,
drivers/memstick/core/mspro_block.c-                               MSPRO_BLOCK_MAX_PAGES * msb->page_size);
--
drivers/mmc/core/queue.c:       blk_queue_max_segment_size(mq->queue, host->max_seg_size);
drivers/mmc/core/queue.c-
--
drivers/s390/block/dasd.c:      blk_queue_max_segment_size(q, PAGE_SIZE);
drivers/s390/block/dasd.c-      blk_queue_segment_boundary(q, PAGE_SIZE - 1);
--
drivers/scsi/be2iscsi/be_main.c:        blk_queue_max_segment_size(sdev->request_queue, 65536);
drivers/scsi/be2iscsi/be_main.c-        return 0;
--
drivers/scsi/scsi_debug.c:      blk_queue_max_segment_size(sdp->request_queue, -1U);
drivers/scsi/scsi_debug.c-      if (sdebug_no_uld)
--
drivers/scsi/scsi_lib.c:        blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));
drivers/scsi/scsi_lib.c-
--
drivers/scsi/ufs/ufshcd.c:      blk_queue_max_segment_size(q, PRDT_DATA_BYTE_COUNT_MAX);
drivers/scsi/ufs/ufshcd.c-
--
include/linux/blkdev.h:extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
include/linux/blkdev.h-extern void blk_queue_max_discard_sectors(struct request_queue *q,
--
include/linux/mmc/host.h:       unsigned int            max_seg_size;   /* see blk_queue_max_segment_size */
include/linux/mmc/host.h-       unsigned short          max_segs;       /* see blk_queue_max_segments */


Some of these devices are probably not going to work well if
passed through to a SEV guest.

Going back to virtio, at some level virtio is like a stacking device so
it does not necessarily need a limit.
Michael S. Tsirkin Jan. 14, 2019, 8:48 p.m. UTC | #14
On Mon, Jan 14, 2019 at 09:19:35PM +0100, Christoph Hellwig wrote:
> > Christoph is saying that !IOMMU_PLATFORM devices should hide the
> > compatibility code in a special per-device DMA API implementation.
> > Which would be fine especially if we can manage not to introduce a bunch
> > of indirect calls all over the place and hurt performance.  It's just
> > that the benefit is unlikely to be big (e.g. we can't also get rid of
> > the virtio specific memory barriers) so no one was motivated enough to
> > work on it.
> 
> No.  

Oh ok then.

> The problem is that we still haven't fully specified what
> IOMMU_PLATFORM and !IOMMU_PLATFORM actually mean.  Your
> "ACCESS_PLATFORM/ORDER_PLATFORM" commit in the virtio-spec repo
> improves it a little bit, but it is still far from enough.
> 
> As a start VIRTIO_F_ORDER_PLATFORM and VIRTIO_F_ACCESS_PLATFORM
> absolutely MUST be set for hardware implementations.  Otherwise said
> hardware has no chance of working on anything but the most x86-like
> systems.

I think you are exaggerating a little here.  Limited access, e.g. the
need to flush caches, is common on lower-end systems but less common on
higher-end systems used for virtualization.  As you point out, that
changes with e.g. SEV, but then SEV is a pretty niche thing so far.

So I would make that a SHOULD probably.

> Second, software implementations SHOULD set VIRTIO_F_ACCESS_PLATFORM,
> because otherwise we can't add proper handling for things like SEV or
> the IBM "secure hypervisor" thing.

Yes. If the host does not have access to all of memory, it SHOULD set
VIRTIO_F_ACCESS_PLATFORM.

It seems rather straightforward to add conformance clauses
for this, thanks for reminding me.


> Last but not least a lot of wording outside the area describing these
> flags really needs some major updates in terms of DMA access.

Thanks for the reminder. Yes, generally the spec tends to still say e.g.
"physical address" where it really means a "DMA address" in Linux terms.
Whether changing that will make things much clearer I'm not sure; e.g.
hardware designers don't much care, I guess.

If you want to list the most problematic cases, e.g. in a github issue,
we might get to clarifying them sooner.

Thanks!
Joerg Roedel Jan. 15, 2019, 8:37 a.m. UTC | #15
On Mon, Jan 14, 2019 at 01:20:45PM -0500, Michael S. Tsirkin wrote:
> Which would be fine especially if we can manage not to introduce a bunch
> of indirect calls all over the place and hurt performance.

Which indirect calls? In case of unset dma_ops the DMA-API functions
call directly into the dma-direct implementation, no indirect calls at
all.

Regards,

	Joerg
Christoph Hellwig Jan. 15, 2019, 1:09 p.m. UTC | #16
On Mon, Jan 14, 2019 at 03:48:47PM -0500, Michael S. Tsirkin wrote:
> I think you are exaggerating a little here.  Limited access, e.g. the
> need to flush caches, is common on lower-end systems but less common on
> higher-end systems used for virtualization.  As you point out, that
> changes with e.g. SEV, but then SEV is a pretty niche thing so far.
> 
> So I would make that a SHOULD probably.

The problem is that without using DMA ops you can't just plug the device
into a random system, which is a total no-go for peripherals.

And cache flushing is not that uncommon, e.g. every sparc system
needs it, many arm/arm64 ones, some power ones, lots of mips ones, etc.

But cache flushing is just one side of it - lots of systems have either
mandatory or optional IOMMUs, and without the platform access flag that
doesn't work.  The same goes for the case where you have a DMA offset,
which is extremely common on arm and power systems.  So you might find a
few non-x86 systems you can plug it into and it'll just work, but it
won't be all that many.  And even on x86 your device will be completely
broken if the user dares to use an IOMMU.

> > flags really needs some major updates in terms of DMA access.
> 
> Thanks for the reminder. Yes, generally the spec tends to still say e.g.
> "physical address" where it really means a "DMA address" in Linux terms.
> Whether changing that will make things much clearer I'm not sure; e.g.
> hardware designers don't much care, I guess.

At least say bus address - as said above, the CPU and bus concepts of
physical addresses are different in many systems, including x86
when you use IOMMUs.
Christoph Hellwig Jan. 15, 2019, 1:20 p.m. UTC | #17
On Tue, Jan 15, 2019 at 09:37:42AM +0100, Joerg Roedel wrote:
> On Mon, Jan 14, 2019 at 01:20:45PM -0500, Michael S. Tsirkin wrote:
> > Which would be fine especially if we can manage not to introduce a bunch
> > of indirect calls all over the place and hurt performance.
> 
> Which indirect calls? In case of unset dma_ops the DMA-API functions
> call directly into the dma-direct implementation, no indirect calls at
> all.

True.  But the NULL-ops dma direct case still has two issues that might
not work for virtio:

 (a) if the architecture supports devices that are not DMA coherent it
     will dip into a special allocator and do cache maintenance for
     streaming mappings.  Although it would be a bit of a hack, we could
     work around this in virtio doing something like:

#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
    defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
    defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
	dev->dma_coherent = true;
#endif

    except that won't work for mips, which has a mode where it uses a
    system-level instead of a device-level coherency flag and thus
    doesn't look at this struct device field

 (b) we can still mangle the DMA address, either using the
     dma_pfn_offset field in struct device, or by a full override
     of __phys_to_dma / __dma_to_phys by the architecture code.
     The first could be solved using a hack like the one above,
     but the latter would be a little harder.  In the long run
     I'd love to get rid of that hook and have everyone use the
     generic offset code, but for that we first need to support
     multiple ranges with different offset and do quite some
     nasty arch code surgery.
 
So for the legacy virtio case I fear we need to keep a local dma mapping
implementation for now.  I just wish no recent hypervisor would ever
offer devices in this broken legacy mode...
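
For reference on (b), the address mangling in the NULL-ops direct path
looks roughly like this (cf. __phys_to_dma() in
include/linux/dma-direct.h of this era; a simplified sketch):

	static inline dma_addr_t __phys_to_dma(struct device *dev,
					       phys_addr_t paddr)
	{
		/* Apply the constant per-device offset between CPU
		 * physical and bus addresses; architectures may
		 * override this entirely. */
		return (dma_addr_t)paddr -
			((dma_addr_t)dev->dma_pfn_offset << PAGE_SHIFT);
	}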

> 
> Regards,
> 
> 	Joerg
---end quoted text---
Michael S. Tsirkin Jan. 16, 2019, 2:16 p.m. UTC | #18
On Tue, Jan 15, 2019 at 02:20:19PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 15, 2019 at 09:37:42AM +0100, Joerg Roedel wrote:
> > On Mon, Jan 14, 2019 at 01:20:45PM -0500, Michael S. Tsirkin wrote:
> > > Which would be fine especially if we can manage not to introduce a bunch
> > > of indirect calls all over the place and hurt performance.
> > 
> > Which indirect calls? In case of unset dma_ops the DMA-API functions
> > call directly into the dma-direct implementation, no indirect calls at
> > all.
> 
> True.  But the NULL-ops dma direct case still has two issues that might
> not work for virtio:
> 
>  (a) if the architecture supports devices that are not DMA coherent it
>      will dip into a special allocator and do cache maintenance for
>      streaming mappings.  Although it would be a bit of a hack, we could
>      work around this in virtio doing something like:
> 
> #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
>     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
>     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
> 	dev->dma_coherent = true;
> #endif
> 
>     except that won't work for mips, which has a mode where it uses a
>     system-level instead of a device-level coherency flag and thus
>     doesn't look at this struct device field
> 
>  (b) we can still mangle the DMA address, either using the
>      dma_pfn_offset field in struct device, or by a full override
>      of __phys_to_dma / __dma_to_phys by the architecture code.
>      The first could be solved using a hack like the one above,
>      but the latter would be a little harder.  In the long run
>      I'd love to get rid of that hook and have everyone use the
>      generic offset code, but for that we first need to support
>      multiple ranges with different offset and do quite some
>      nasty arch code surgery.
>  
> So for the legacy virtio case I fear we need to keep a local dma mapping
> implementation for now.  I just wish no recent hypervisor would ever
> offer devices in this broken legacy mode...

IIUC some emulated hardware has no reasonable way to specify that
some devices bypass the IOMMU, and no cheap way to cover all
memory with an IOMMU mapping.

So I suspect !ACCESS_PLATFORM will stay a supported configuration
forever :(



> > 
> > Regards,
> > 
> > 	Joerg
> ---end quoted text---