[RFC,V2,0/7] Do not read from descriptor ring

Message ID 20210423080942.2997-1-jasowang@redhat.com (mailing list archive)

Message

Jason Wang April 23, 2021, 8:09 a.m. UTC
Hi:

Sometimes, the driver doesn't trust the device. This usually happens
for an encrypted VM or VDUSE[1]. In both cases, a technology like
swiotlb is used to prevent the device from poking at or mangling
memory. But this is not sufficient, since the current virtio driver
may trust what is stored in the descriptor table (coherent mapping)
when performing DMA operations like unmap and bounce, so the device
may exploit the behaviour of swiotlb to perform attacks[2].

To protect against a malicious device, this series stores and uses
the descriptor metadata in an auxiliary structure which cannot be
accessed via swiotlb, instead of the copies in the descriptor table.
This means the descriptor table is write-only from the driver's point
of view.
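
To make the idea concrete, here is a minimal userspace sketch, not the
actual patch. It assumes a hypothetical ring with a device-visible table
and a driver-private shadow array; the names `struct ring`, `ring_add()`
and `ring_unmap_len()` are invented for illustration and do not appear
in virtio_ring.c.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 4

/* Shared with the device: the driver only ever writes to it. */
struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Driver-private copy (hypothetical), never mapped for the device. */
struct desc_extra {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

struct ring {
	struct desc table[RING_SIZE];       /* device-visible */
	struct desc_extra extra[RING_SIZE]; /* driver-private */
};

/* Publish a descriptor and remember its metadata privately. */
static void ring_add(struct ring *r, unsigned int i,
		     uint64_t addr, uint32_t len)
{
	r->table[i].addr = addr;
	r->table[i].len = len;
	r->extra[i].addr = addr;
	r->extra[i].len = len;
}

/* Length to unmap: read from the private copy, never from the table. */
static uint32_t ring_unmap_len(const struct ring *r, unsigned int i)
{
	return r->extra[i].len;
}
```

Even if the device overwrites `table[i].len` after the descriptor has
been published, the value used for unmap stays the one the driver
recorded.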

In fact, we've almost achieved that already for the packed virtqueue;
we just need to fix a corner case in the handling of mapping errors.
For the split virtqueue we simply follow what is done for packed.

Note that we don't duplicate descriptor metadata for indirect
descriptors, since they use a streaming mapping which is read-only, so
they are safe as long as the metadata of the non-indirect descriptors
is correct.

For split virtqueue, the change increases the footprint due to the
auxiliary metadata, but the cost is almost negligible in simple tests
like pktgen or netperf.
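
As a rough illustration of the footprint increase, the sketch below
compares per-queue descriptor state with and without a shadow copy. The
split-ring descriptor layout follows the virtio specification; the
shadow struct and the helper `ring_footprint()` are assumptions made for
this example, and the layout in the actual series may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split-ring descriptor fields as defined by the virtio specification. */
struct vring_desc_model {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Hypothetical driver-private shadow entry carrying the same fields. */
struct desc_extra_model {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Bytes of descriptor state for a ring with num entries. */
static size_t ring_footprint(size_t num, int with_extra)
{
	size_t bytes = num * sizeof(struct vring_desc_model);

	if (with_extra)
		bytes += num * sizeof(struct desc_extra_model);
	return bytes;
}
```

Under these assumptions a 256-entry ring goes from 4096 to 8192 bytes of
descriptor state, which is the kind of extra cache pressure the
paragraph above refers to.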

Lightly tested with packed on/off, iommu on/off, swiotlb force/off in
the guest.

Please review.

Changes from V1:
- Always use auxiliary metadata for split virtqueue
- Don't read from the descriptor when detaching an indirect descriptor

[1]
https://lore.kernel.org/netdev/fab615ce-5e13-a3b3-3715-a4203b4ab010@redhat.com/T/
[2]
https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b

Jason Wang (7):
  virtio-ring: maintain next in extra state for packed virtqueue
  virtio_ring: rename vring_desc_extra_packed
  virtio-ring: factor out desc_extra allocation
  virtio_ring: secure handling of mapping errors
  virtio_ring: introduce virtqueue_desc_add_split()
  virtio: use err label in __vring_new_virtqueue()
  virtio-ring: store DMA metadata in desc_extra for split virtqueue

 drivers/virtio/virtio_ring.c | 201 +++++++++++++++++++++++++----------
 1 file changed, 144 insertions(+), 57 deletions(-)

Comments

Jason Wang May 6, 2021, 3:20 a.m. UTC | #1
Hi Michael:

Our QE sees no regression in the perf test on a 10G card, but some
regression (5%-10%) on a 40G card.

I think this is expected since we increase the footprint. Are you OK
with this, so that we can try to optimize on top, or do you have other
ideas?

Thanks


Michael S. Tsirkin May 6, 2021, 8:12 a.m. UTC | #2
On Thu, May 06, 2021 at 11:20:30AM +0800, Jason Wang wrote:
> Hi Michael:
> 
> Our QE see no regression on the perf test for 10G but some regressions
> (5%-10%) on 40G card.
> 
> I think this is expected since we increase the footprint, are you OK with
> this and we can try to optimize on top or you have other ideas?
> 
> Thanks

Let's try for just a bit, won't make this window anyway:

I have an old idea. Add a way to find out that unmap is a nop
(or more exactly does not use the address/length).
Then in that case even with DMA API we do not need
the extra data. Hmm?


Christoph Hellwig May 6, 2021, 12:38 p.m. UTC | #3
On Thu, May 06, 2021 at 04:12:17AM -0400, Michael S. Tsirkin wrote:
> Let's try for just a bit, won't make this window anyway:
> 
> I have an old idea. Add a way to find out that unmap is a nop
> (or more exactly does not use the address/length).
> Then in that case even with DMA API we do not need
> the extra data. Hmm?

So we actually do have a check for that from the early days of the DMA
API, but it only works at compile time: CONFIG_NEED_DMA_MAP_STATE.

But given how rare configs without an iommu or swiotlb are these days,
it has stopped being very useful.  Unfortunately a runtime version is
not entirely trivial, but maybe if we allow for false positives we
could do something like this:

bool dma_direct_need_state(struct device *dev)
{
	/* some areas could not be covered by any map at all */
	if (dev->dma_range_map)
		return true;
	/* unencrypted DMA may be bounced, which uses address/length */
	if (force_dma_unencrypted(dev))
		return true;
	if (dma_direct_need_sync(dev))
		return true;
	/* a narrower mask may force bouncing; erring towards true is safe */
	return *dev->dma_mask != DMA_BIT_MASK(64);
}

bool dma_need_state(struct device *dev)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (dma_map_direct(dev, ops))
		return dma_direct_need_state(dev);
	return ops->unmap_page ||
		ops->sync_single_for_cpu || ops->sync_single_for_device;
}
Stefan Hajnoczi May 13, 2021, 4:27 p.m. UTC | #4
On Fri, Apr 23, 2021 at 04:09:35PM +0800, Jason Wang wrote:
> Sometimes, the driver doesn't trust the device. This usually
> happens for an encrypted VM or VDUSE[1].

Thanks for doing this.

Can you describe the overall memory safety model that virtio drivers
must follow? For example:

- Driver-to-device buffers must be on dedicated pages to avoid
  information leaks.

- Driver-to-device buffers must be on dedicated pages to avoid memory
  corruption.

When I say "pages" I guess it's the IOMMU page size that matters?

What is the memory access granularity of VDUSE?

I'm asking these questions because there is driver code that exposes
kernel memory to the device and I'm not sure it's safe. For example:

  static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
                  struct scatterlist *data_sg, bool have_data)
  {
          struct scatterlist hdr, status, *sgs[3];
          unsigned int num_out = 0, num_in = 0;

          sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
                            ^^^^^^^^^^^^^
          sgs[num_out++] = &hdr;

          if (have_data) {
                  if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT))
                          sgs[num_out++] = data_sg;
                  else
                          sgs[num_out + num_in++] = data_sg;
          }

          sg_init_one(&status, &vbr->status, sizeof(vbr->status));
                               ^^^^^^^^^^^^
          sgs[num_out + num_in++] = &status;

          return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
  }

I guess the drivers don't need to be modified as long as swiotlb is used
to bounce the buffers through "insecure" memory so that the memory
surrounding the buffers is not exposed?

Stefan
Yongji Xie May 14, 2021, 6:06 a.m. UTC | #5
On Fri, May 14, 2021 at 12:27 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, Apr 23, 2021 at 04:09:35PM +0800, Jason Wang wrote:
> > Sometimes, the driver doesn't trust the device. This usually
> > happens for an encrypted VM or VDUSE[1].
>
> Thanks for doing this.
>
> Can you describe the overall memory safety model that virtio drivers
> must follow? For example:
>
> - Driver-to-device buffers must be on dedicated pages to avoid
>   information leaks.
>
> - Driver-to-device buffers must be on dedicated pages to avoid memory
>   corruption.
>
> When I say "pages" I guess it's the IOMMU page size that matters?
>
> What is the memory access granularity of VDUSE?
>

Currently we use PAGE_SIZE as the access granularity. I think it should
be safe to access the driver-to-device buffers in the VDUSE case
because we also use a bounce-buffering mechanism, like swiotlb does.

Thanks,
Yongji
Jason Wang May 14, 2021, 7:29 a.m. UTC | #6
On Fri, May 14, 2021 at 12:27 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, Apr 23, 2021 at 04:09:35PM +0800, Jason Wang wrote:
> > Sometimes, the driver doesn't trust the device. This usually
> > happens for an encrypted VM or VDUSE[1].
>
> Thanks for doing this.
>
> Can you describe the overall memory safety model that virtio drivers
> must follow?

My understanding is that, basically, the driver should not trust the
device (since the driver doesn't know what kind of device it is
trying to drive):

1) For any read-only metadata (required at the spec level) which is
mapped as coherent, the driver should not depend on metadata that is
stored in a place that could be written by the device. This is what
this series tries to achieve.
2) For other metadata that is produced by the device, we need to make
sure there's no malicious device-triggered behavior; this is somewhat
similar to what vhost does. No DoS, loops, kernel bugs and so on.
3) swiotlb is a must to enforce memory access isolation (VDUSE or
encrypted VM).
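
Point 2 could look roughly like the sketch below: before acting on a
completion reported by the device, the driver checks it against its own
private bookkeeping. This is only an illustration under assumed names
(`struct posted_buf`, `struct used_elem`, `used_elem_valid()`); it is
not code from this series or from vhost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 8

/* Driver-private record of what was posted for each descriptor head. */
struct posted_buf {
	bool in_flight;
	uint32_t len; /* bytes the device is allowed to write */
};

/* Completion as reported by the (untrusted) device. */
struct used_elem {
	uint32_t id;
	uint32_t len;
};

/*
 * Accept a completion only if it refers to a buffer that is really
 * outstanding and does not claim more bytes than were posted.
 */
static bool used_elem_valid(const struct posted_buf *posted,
			    const struct used_elem *e)
{
	if (e->id >= QUEUE_SIZE)
		return false; /* out-of-range index */
	if (!posted[e->id].in_flight)
		return false; /* spurious or repeated completion */
	if (e->len > posted[e->id].len)
		return false; /* claims more than was mapped */
	return true;
}
```

A completion that fails the check would be treated as a device error
instead of being used to index driver state.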

> For example:
>
> - Driver-to-device buffers must be on dedicated pages to avoid
>   information leaks.

It looks to me that if swiotlb is used, we don't need this, since the
bouncing is done at byte granularity, not page granularity.

But if swiotlb is not used, we need to enforce this.

>
> - Driver-to-device buffers must be on dedicated pages to avoid memory
>   corruption.

Similar to the above.

>
> When I say "pages" I guess it's the IOMMU page size that matters?
>

And the IOTLB page size.

> What is the memory access granularity of VDUSE?

It has a swiotlb, but the access and bouncing are done per byte.

>
> I'm asking these questions because there is driver code that exposes
> kernel memory to the device and I'm not sure it's safe. For example:
>
>   static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
>                   struct scatterlist *data_sg, bool have_data)
>   {
>           struct scatterlist hdr, status, *sgs[3];
>           unsigned int num_out = 0, num_in = 0;
>
>           sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
>                             ^^^^^^^^^^^^^
>           sgs[num_out++] = &hdr;
>
>           if (have_data) {
>                   if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT))
>                           sgs[num_out++] = data_sg;
>                   else
>                           sgs[num_out + num_in++] = data_sg;
>           }
>
>           sg_init_one(&status, &vbr->status, sizeof(vbr->status));
>                                ^^^^^^^^^^^^
>           sgs[num_out + num_in++] = &status;
>
>           return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
>   }
>
> I guess the drivers don't need to be modified as long as swiotlb is used
> to bounce the buffers through "insecure" memory so that the memory
> surrounding the buffers is not exposed?

Yes, swiotlb won't bounce the whole page. So I think it's safe.
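
A small userspace model of why byte-granular bouncing protects the
surrounding memory is sketched below. It is not swiotlb code; `struct
bounce`, `bounce_map()` and `bounce_unmap()` are hypothetical names, and
the private fields would live outside device-visible memory in a real
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SLOT 2048

struct bounce {
	unsigned char slot[BOUNCE_SLOT]; /* the only device-visible part */
	void *orig;  /* private: where the data really lives */
	size_t len;  /* private: how many bytes were mapped */
};

/* Map: copy exactly len bytes into the slot, nothing around them. */
static int bounce_map(struct bounce *b, void *buf, size_t len)
{
	if (len > BOUNCE_SLOT)
		return -1;
	memset(b->slot, 0, sizeof(b->slot)); /* no stale data in the slot */
	memcpy(b->slot, buf, len);
	b->orig = buf;
	b->len = len;
	return 0;
}

/* Unmap: copy back only the privately recorded length. */
static void bounce_unmap(struct bounce *b)
{
	memcpy(b->orig, b->slot, b->len);
}
```

Because the copy-back length comes from the private record, a device
that scribbles over the whole slot still cannot touch the bytes next to
the original buffer.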

Thanks

>
> Stefan
Jason Wang May 14, 2021, 7:30 a.m. UTC | #7
On Fri, May 14, 2021 at 2:07 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> Now we use PAGE_SIZE as the access granularity. I think it should be
> safe to access the Driver-to-device buffers in VDUSE case because we
> also use bounce-buffering mechanism like swiotlb does.
>
> Thanks,
> Yongji
>

Yes. While at it, I wonder if it's possible to reuse the swiotlb code
for VDUSE, or to have some common library for this. Otherwise there
would be duplicated code (and bugs).

Thanks
Yongji Xie May 14, 2021, 8:40 a.m. UTC | #8
On Fri, May 14, 2021 at 3:31 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> Yes. While at it, I wonder if it's possible to reuse the swiotlb code
> for VDUSE, or to have some common library for this. Otherwise there
> would be duplicated code (and bugs).
>

I think there are still some gaps between the VDUSE code and the
swiotlb code. For example, swiotlb allocates and uses contiguous
memory for bouncing but VDUSE doesn't. swiotlb works in singleton mode
(designed for a platform IOMMU), but VDUSE is based on an on-chip
IOMMU (supports multiple instances). So we will need some extra work
if we want a common library for both of them.

And since the only duplicated code now is swiotlb_bounce() (swiotlb)
and do_bounce() (VDUSE), I prefer to do this work in the future rather
than in the current series.

Thanks,
Yongji
Michael S. Tsirkin May 14, 2021, 11:13 a.m. UTC | #9
On Thu, May 06, 2021 at 01:38:29PM +0100, Christoph Hellwig wrote:
> On Thu, May 06, 2021 at 04:12:17AM -0400, Michael S. Tsirkin wrote:
> > Let's try for just a bit, won't make this window anyway:
> > 
> > I have an old idea. Add a way to find out that unmap is a nop
> > (or more exactly does not use the address/length).
> > Then in that case even with DMA API we do not need
> > the extra data. Hmm?
> 
> So we actually do have a check for that from the early days of the DMA
> API, but it only works at compile time: CONFIG_NEED_DMA_MAP_STATE.
> 
> But given how rare configs without an iommu or swiotlb are these days,
> it has stopped being very useful.  Unfortunately a runtime version is
> not entirely trivial, but maybe if we allow for false positives we
> could do something like this:
> 
> bool dma_direct_need_state(struct device *dev)
> {
> 	/* some areas could not be covered by any map at all */
> 	if (dev->dma_range_map)
> 		return true;
> 	/* unencrypted DMA may be bounced, which uses address/length */
> 	if (force_dma_unencrypted(dev))
> 		return true;
> 	if (dma_direct_need_sync(dev))
> 		return true;
> 	/* a narrower mask may force bouncing; erring towards true is safe */
> 	return *dev->dma_mask != DMA_BIT_MASK(64);
> }
> 
> bool dma_need_state(struct device *dev)
> {
> 	const struct dma_map_ops *ops = get_dma_ops(dev);
> 
> 	if (dma_map_direct(dev, ops))
> 		return dma_direct_need_state(dev);
> 	return ops->unmap_page ||
> 		ops->sync_single_for_cpu || ops->sync_single_for_device;
> }

Yea that sounds like a good idea. We will need to document that.


Something like:

/*
 * dma_need_state - report whether unmap calls use the address and length
 * @dev: device to query
 *
 * This is a runtime version of CONFIG_NEED_DMA_MAP_STATE.
 *
 * Return the value indicating whether dma_unmap_* and dma_sync_* calls
 * for the device use the DMA state parameters passed to them.
 * The DMA state parameters are: scatter/gather list/table, address and
 * length.
 *
 * If dma_need_state returns false then DMA state parameters are
 * ignored by all dma_unmap_* and dma_sync_* calls, so it is safe to
 * pass 0 for address and length, and DMA_UNMAP_SG_TABLE_INVALID and
 * DMA_UNMAP_SG_LIST_INVALID for s/g table and list respectively.
 * If dma_need_state returns true then DMA state might
 * be used and so the actual values are required.
 */

And we will need DMA_UNMAP_SG_TABLE_INVALID and
DMA_UNMAP_SG_LIST_INVALID as pointers to an empty global table and list
for calls such as dma_unmap_sgtable that dereference pointers before checking
they are used.
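
For what it's worth, a caller could use such a helper roughly as in the
userspace model below. dma_need_state() is only a proposal in this
thread, so the model replaces it with a plain flag; `struct model_dev`,
`model_map()` and `model_unmap_args()` are hypothetical names used for
illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a device; need_state models the proposed helper. */
struct model_dev {
	bool need_state;
};

struct model_extra {
	uint64_t addr;
	uint32_t len;
};

static unsigned int shadow_writes; /* counts private metadata stores */

/* Record unmap state only when the DMA layer would actually use it. */
static void model_map(const struct model_dev *dev,
		      struct model_extra *extra,
		      uint64_t addr, uint32_t len)
{
	if (!dev->need_state)
		return;
	extra->addr = addr;
	extra->len = len;
	shadow_writes++;
}

/* Arguments to hand to unmap: real values if needed, zeroes otherwise. */
static struct model_extra model_unmap_args(const struct model_dev *dev,
					   const struct model_extra *extra)
{
	struct model_extra zero = { 0, 0 };

	return dev->need_state ? *extra : zero;
}
```

The point of the design is that configurations where unmap ignores its
arguments pay nothing for the shadow metadata, while configurations that
bounce still get the values the driver recorded itself.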


Does this look good?

The table/list variants are for consistency; virtio specifically does
not use s/g at the moment, but it seems nicer than leaving
users to wonder what to do about these.

Thoughts? Jason, want to try implementing?
Stefan Hajnoczi May 14, 2021, 11:16 a.m. UTC | #10
On Fri, May 14, 2021 at 03:29:20PM +0800, Jason Wang wrote:
> Yes, swiotlb won't bounce the whole page. So I think it's safe.

Thanks Jason and Yongji Xie for clarifying. Seems like swiotlb or a
similar mechanism can handle byte-granularity isolation, so the drivers
do not need to worry about information leaks or memory corruption
outside the mapped byte range.

We still need to audit virtio guest drivers to ensure they don't trust
data that can be modified by the device. I will look at virtio-blk and
virtio-fs next week.

Stefan
Yongji Xie May 14, 2021, 11:27 a.m. UTC | #11
On Fri, May 14, 2021 at 7:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> Thanks Jason and Yongji Xie for clarifying. Seems like swiotlb or a
> similar mechanism can handle byte-granularity isolation, so the drivers
> do not need to worry about information leaks or memory corruption
> outside the mapped byte range.
>
> We still need to audit virtio guest drivers to ensure they don't trust
> data that can be modified by the device. I will look at virtio-blk and
> virtio-fs next week.
>

Oh, that's great. Thank you!

I have also done some audit work recently and will send a new version
for review next Monday.

Thanks,
Yongji
Michael S. Tsirkin May 14, 2021, 11:36 a.m. UTC | #12
On Fri, May 14, 2021 at 07:27:22PM +0800, Yongji Xie wrote:
> On Fri, May 14, 2021 at 7:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Fri, May 14, 2021 at 03:29:20PM +0800, Jason Wang wrote:
> > > On Fri, May 14, 2021 at 12:27 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Fri, Apr 23, 2021 at 04:09:35PM +0800, Jason Wang wrote:
> > > > > Sometimes, the driver doesn't trust the device. This is usually
> > > > > happens for the encrtpyed VM or VDUSE[1].
> > > >
> > > > Thanks for doing this.
> > > >
> > > > Can you describe the overall memory safety model that virtio drivers
> > > > must follow?
> > >
> > > My understanding is that, basically the driver should not trust the
> > > device (since the driver doesn't know what kind of device that it
> > > tries to drive)
> > >
> > > 1) For any read only metadata (required at the spec level) which is
> > > mapped as coherent, driver should not depend on the metadata that is
> > > stored in a place that could be wrote by the device. This is what this
> > > series tries to achieve.
> > > 2) For other metadata that is produced by the device, need to make
> > > sure there's no malicious device triggered behavior, this is somehow
> > > similar to what vhost did. No DOS, loop, kernel bug and other stuffs.
> > > 3) swiotb is a must to enforce memory access isolation. (VDUSE or encrypted VM)
> > >
> > > > For example:
> > > >
> > > > - Driver-to-device buffers must be on dedicated pages to avoid
> > > >   information leaks.
> > >
> > > It looks to me if swiotlb is used, we don't need this since the
> > > bouncing is not done at byte not page.
> > >
> > > But if swiotlb is not used, we need to enforce this.
> > >
> > > >
> > > > - Driver-to-device buffers must be on dedicated pages to avoid memory
> > > >   corruption.
> > >
> > > Similar to the above.
> > >
> > > >
> > > > When I say "pages" I guess it's the IOMMU page size that matters?
> > > >
> > >
> > > And the IOTLB page size.
> > >
> > > > What is the memory access granularity of VDUSE?
> > >
> > > It has an swiotlb, but the access and bouncing is done per byte.
> > >
> > > >
> > > > I'm asking these questions because there is driver code that exposes
> > > > kernel memory to the device and I'm not sure it's safe. For example:
> > > >
> > > >   static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
> > > >                   struct scatterlist *data_sg, bool have_data)
> > > >   {
> > > >           struct scatterlist hdr, status, *sgs[3];
> > > >           unsigned int num_out = 0, num_in = 0;
> > > >
> > > >           sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
> > > >                             ^^^^^^^^^^^^^
> > > >           sgs[num_out++] = &hdr;
> > > >
> > > >           if (have_data) {
> > > >                   if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT))
> > > >                           sgs[num_out++] = data_sg;
> > > >                   else
> > > >                           sgs[num_out + num_in++] = data_sg;
> > > >           }
> > > >
> > > >           sg_init_one(&status, &vbr->status, sizeof(vbr->status));
> > > >                                ^^^^^^^^^^^^
> > > >           sgs[num_out + num_in++] = &status;
> > > >
> > > >           return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
> > > >   }
> > > >
> > > > I guess the drivers don't need to be modified as long as swiotlb is used
> > > > to bounce the buffers through "insecure" memory so that the memory
> > > > surrounding the buffers is not exposed?
> > >
> > > Yes, swiotlb won't bounce the whole page. So I think it's safe.
> >
> > Thanks Jason and Yongji Xie for clarifying. Seems like swiotlb or a
> > similar mechanism can handle byte-granularity isolation so the drivers
> > do not need to worry about information leaks or memory corruption outside
> > the mapped byte range.
> >
> > We still need to audit virtio guest drivers to ensure they don't trust
> > data that can be modified by the device. I will look at virtio-blk and
> > virtio-fs next week.
> >
> 
> Oh, that's great. Thank you!
> 
> I also did some audit work these days and will send a new version for
> reviewing next Monday.
> 
> Thanks,
> Yongji

Doing it in a way that won't hurt performance for simple
configs that trust the device is a challenge though.
Pls take a look at the discussion with Christoph for some ideas
on how to do this.
Yongji Xie May 14, 2021, 1:58 p.m. UTC | #13
On Fri, May 14, 2021 at 7:36 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, May 14, 2021 at 07:27:22PM +0800, Yongji Xie wrote:
> > On Fri, May 14, 2021 at 7:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Fri, May 14, 2021 at 03:29:20PM +0800, Jason Wang wrote:
> > > > On Fri, May 14, 2021 at 12:27 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Fri, Apr 23, 2021 at 04:09:35PM +0800, Jason Wang wrote:
> > > > > > Sometimes, the driver doesn't trust the device. This is usually
> > > > > > happens for the encrtpyed VM or VDUSE[1].
> > > > >
> > > > > Thanks for doing this.
> > > > >
> > > > > Can you describe the overall memory safety model that virtio drivers
> > > > > must follow?
> > > >
> > > > My understanding is that, basically the driver should not trust the
> > > > device (since the driver doesn't know what kind of device that it
> > > > tries to drive)
> > > >
> > > > 1) For any read only metadata (required at the spec level) which is
> > > > mapped as coherent, driver should not depend on the metadata that is
> > > > stored in a place that could be written by the device. This is what this
> > > > series tries to achieve.
> > > > 2) For other metadata that is produced by the device, we need to make
> > > > sure there's no malicious device-triggered behavior; this is somewhat
> > > > similar to what vhost did. No DoS, loops, kernel bugs and so on.
> > > > 3) swiotlb is a must to enforce memory access isolation. (VDUSE or encrypted VM)
> > > >
> > > > > For example:
> > > > >
> > > > > - Driver-to-device buffers must be on dedicated pages to avoid
> > > > >   information leaks.
> > > >
> > > > It looks to me that if swiotlb is used, we don't need this, since the
> > > > bouncing is done per byte, not per page.
> > > >
> > > > But if swiotlb is not used, we need to enforce this.
> > > >
> > > > >
> > > > > - Driver-to-device buffers must be on dedicated pages to avoid memory
> > > > >   corruption.
> > > >
> > > > Similar to the above.
> > > >
> > > > >
> > > > > When I say "pages" I guess it's the IOMMU page size that matters?
> > > > >
> > > >
> > > > And the IOTLB page size.
> > > >
> > > > > What is the memory access granularity of VDUSE?
> > > >
> > > > It has an swiotlb, but the access and bouncing is done per byte.
> > > >
> > > > >
> > > > > I'm asking these questions because there is driver code that exposes
> > > > > kernel memory to the device and I'm not sure it's safe. For example:
> > > > >
> > > > >   static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
> > > > >                   struct scatterlist *data_sg, bool have_data)
> > > > >   {
> > > > >           struct scatterlist hdr, status, *sgs[3];
> > > > >           unsigned int num_out = 0, num_in = 0;
> > > > >
> > > > >           sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
> > > > >                             ^^^^^^^^^^^^^
> > > > >           sgs[num_out++] = &hdr;
> > > > >
> > > > >           if (have_data) {
> > > > >                   if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT))
> > > > >                           sgs[num_out++] = data_sg;
> > > > >                   else
> > > > >                           sgs[num_out + num_in++] = data_sg;
> > > > >           }
> > > > >
> > > > >           sg_init_one(&status, &vbr->status, sizeof(vbr->status));
> > > > >                                ^^^^^^^^^^^^
> > > > >           sgs[num_out + num_in++] = &status;
> > > > >
> > > > >           return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
> > > > >   }
> > > > >
> > > > > I guess the drivers don't need to be modified as long as swiotlb is used
> > > > > to bounce the buffers through "insecure" memory so that the memory
> > > > > surrounding the buffers is not exposed?
> > > >
> > > > Yes, swiotlb won't bounce the whole page. So I think it's safe.
> > >
> > > Thanks Jason and Yongji Xie for clarifying. Seems like swiotlb or a
> > > similar mechanism can handle byte-granularity isolation so the drivers
> > > do not need to worry about information leaks or memory corruption outside
> > > the mapped byte range.
> > >
> > > We still need to audit virtio guest drivers to ensure they don't trust
> > > data that can be modified by the device. I will look at virtio-blk and
> > > virtio-fs next week.
> > >
> >
> > Oh, that's great. Thank you!
> >
> > I also did some audit work these days and will send a new version for
> > reviewing next Monday.
> >
> > Thanks,
> > Yongji
>
> Doing it in a way that won't hurt performance for simple
> configs that trust the device is a challenge though.
> Pls take a look at the discussion with Christoph for some ideas
> on how to do this.
>

I see. Thanks for the reminder.

Thanks,
Yongji
Jason Wang June 4, 2021, 5:38 a.m. UTC | #14
On 2021/5/14 7:13 PM, Michael S. Tsirkin wrote:
> On Thu, May 06, 2021 at 01:38:29PM +0100, Christoph Hellwig wrote:
>> On Thu, May 06, 2021 at 04:12:17AM -0400, Michael S. Tsirkin wrote:
>>> Let's try for just a bit, won't make this window anyway:
>>>
>>> I have an old idea. Add a way to find out that unmap is a nop
>>> (or more exactly does not use the address/length).
>>> Then in that case even with DMA API we do not need
>>> the extra data. Hmm?
>> So we actually do have a check for that from the early days of the DMA
>> API, but it only works at compile time: CONFIG_NEED_DMA_MAP_STATE.
>>
>> But given how rare configs without an iommu or swiotlb are these days
>> it has stopped to be very useful.  Unfortunately a runtime-version is
>> not entirely trivial, but maybe if we allow for false positives we
>> could do something like this
>>
>> bool dma_direct_need_state(struct device *dev)
>> {
>> 	/* some areas could not be covered by any map at all */
>> 	if (dev->dma_range_map)
>> 		return false;
>> 	if (force_dma_unencrypted(dev))
>> 		return false;
>> 	if (dma_direct_need_sync(dev))
>> 		return false;
>> 	return *dev->dma_mask == DMA_BIT_MASK(64);
>> }
>>
>> bool dma_need_state(struct device *dev)
>> {
>> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>>
>> 	if (dma_map_direct(dev, ops))
>> 		return dma_direct_need_state(dev);
>> 	return ops->unmap_page ||
>> 		ops->sync_single_for_cpu || ops->sync_single_for_device;
>> }
> Yea that sounds like a good idea. We will need to document that.
>
>
> Something like:
>
> /*
>   * dma_need_state - report whether unmap calls use the address and length
>   * @dev: device to query
>   *
>   * This is a runtime version of CONFIG_NEED_DMA_MAP_STATE.
>   *
>   * Return the value indicating whether dma_unmap_* and dma_sync_* calls for the device
>   * use the DMA state parameters passed to them.
>   * The DMA state parameters are: scatter/gather list/table, address and
>   * length.
>   *
>   * If dma_need_state returns false then DMA state parameters are
>   * ignored by all dma_unmap_* and dma_sync_* calls, so it is safe to pass 0 for
>   * address and length, and DMA_UNMAP_SG_TABLE_INVALID and
>   * DMA_UNMAP_SG_LIST_INVALID for s/g table and length respectively.
>   * If dma_need_state returns true then DMA state might
>   * be used and so the actual values are required.
>   */
>
> And we will need DMA_UNMAP_SG_TABLE_INVALID and
> DMA_UNMAP_SG_LIST_INVALID as pointers to an empty global table and list
> for calls such as dma_unmap_sgtable that dereference pointers before checking
> they are used.
>
>
> Does this look good?
>
> The table/length variants are for consistency, virtio specifically does
> not use s/g at the moment, but it seems nicer than leaving
> users wonder what to do about these.
>
> Thoughts? Jason want to try implementing?


I can add it to my todo list. If other people are interested in
this, please let us know.

But this is just about saving the effort of unmap, and it doesn't
eliminate the necessity of using private memory (addr, length) for the
metadata for validating the device inputs.

And just to clarify, the slight regression we see is when testing without
VIRTIO_F_ACCESS_PLATFORM, which means the DMA API is not used.

So I will go ahead and post a formal version of this series and we can
start from there.

Thanks


>
Michael S. Tsirkin July 11, 2021, 4:08 p.m. UTC | #15
On Fri, Jun 04, 2021 at 01:38:01PM +0800, Jason Wang wrote:
> 
> On 2021/5/14 7:13 PM, Michael S. Tsirkin wrote:
> > On Thu, May 06, 2021 at 01:38:29PM +0100, Christoph Hellwig wrote:
> > > On Thu, May 06, 2021 at 04:12:17AM -0400, Michael S. Tsirkin wrote:
> > > > Let's try for just a bit, won't make this window anyway:
> > > > 
> > > > I have an old idea. Add a way to find out that unmap is a nop
> > > > (or more exactly does not use the address/length).
> > > > Then in that case even with DMA API we do not need
> > > > the extra data. Hmm?
> > > So we actually do have a check for that from the early days of the DMA
> > > API, but it only works at compile time: CONFIG_NEED_DMA_MAP_STATE.
> > > 
> > > But given how rare configs without an iommu or swiotlb are these days
> > > it has stopped to be very useful.  Unfortunately a runtime-version is
> > > not entirely trivial, but maybe if we allow for false positives we
> > > could do something like this
> > > 
> > > bool dma_direct_need_state(struct device *dev)
> > > {
> > > 	/* some areas could not be covered by any map at all */
> > > 	if (dev->dma_range_map)
> > > 		return false;
> > > 	if (force_dma_unencrypted(dev))
> > > 		return false;
> > > 	if (dma_direct_need_sync(dev))
> > > 		return false;
> > > 	return *dev->dma_mask == DMA_BIT_MASK(64);
> > > }
> > > 
> > > bool dma_need_state(struct device *dev)
> > > {
> > > 	const struct dma_map_ops *ops = get_dma_ops(dev);
> > > 
> > > 	if (dma_map_direct(dev, ops))
> > > 		return dma_direct_need_state(dev);
> > > 	return ops->unmap_page ||
> > > 		ops->sync_single_for_cpu || ops->sync_single_for_device;
> > > }
> > Yea that sounds like a good idea. We will need to document that.
> > 
> > 
> > Something like:
> > 
> > /*
> >   * dma_need_state - report whether unmap calls use the address and length
> > >   * @dev: device to query
> >   *
> >   * This is a runtime version of CONFIG_NEED_DMA_MAP_STATE.
> >   *
> >   * Return the value indicating whether dma_unmap_* and dma_sync_* calls for the device
> >   * use the DMA state parameters passed to them.
> >   * The DMA state parameters are: scatter/gather list/table, address and
> >   * length.
> >   *
> >   * If dma_need_state returns false then DMA state parameters are
> >   * ignored by all dma_unmap_* and dma_sync_* calls, so it is safe to pass 0 for
> >   * address and length, and DMA_UNMAP_SG_TABLE_INVALID and
> >   * DMA_UNMAP_SG_LIST_INVALID for s/g table and length respectively.
> >   * If dma_need_state returns true then DMA state might
> >   * be used and so the actual values are required.
> >   */
> > 
> > And we will need DMA_UNMAP_SG_TABLE_INVALID and
> > DMA_UNMAP_SG_LIST_INVALID as pointers to an empty global table and list
> > for calls such as dma_unmap_sgtable that dereference pointers before checking
> > they are used.
> > 
> > 
> > Does this look good?
> > 
> > The table/length variants are for consistency, virtio specifically does
> > not use s/g at the moment, but it seems nicer than leaving
> > users wonder what to do about these.
> > 
> > Thoughts? Jason want to try implementing?
> 
> 
> I can add it to my todo list. If other people are interested in this,
> please let us know.
> 
> But this is just about saving the effort of unmap, and it doesn't eliminate
> the necessity of using private memory (addr, length) for the metadata for
> validating the device inputs.


Besides unmap, why do we need to validate the address? The length can
typically be validated by specific drivers; not all of them even use it.
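
To illustrate what driver-side length validation could look like, here is a minimal userspace sketch. The names `posted_buf` and `used_len_ok` are hypothetical and not taken from any existing driver; the only assumption is that the driver remembers how many bytes it allowed the device to write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver-private record of a buffer posted to the device. */
struct posted_buf {
	void *data;      /* driver-owned buffer */
	size_t capacity; /* bytes the device was allowed to write */
};

/*
 * Return true if the used length reported by the device fits in the
 * buffer the driver posted. The length is device-controlled, so it is
 * checked against driver-private state before it is used.
 */
static bool used_len_ok(const struct posted_buf *buf, uint32_t used_len)
{
	return (size_t)used_len <= buf->capacity;
}
```

A driver that never looks at the used length has nothing to check here, which matches the remark that not all drivers use it.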

> And just to clarify, the slight regression we see is testing without
> VIRTIO_F_ACCESS_PLATFORM which means DMA API is not used.

I guess this is due to extra cache pressure? Maybe create yet another
array just for DMA state ...

> So I will go to post a formal version of this series and we can start from
> there.
> 
> Thanks
> 
> 
> >
Jason Wang July 12, 2021, 3:07 a.m. UTC | #16
On 2021/7/12 12:08 AM, Michael S. Tsirkin wrote:
> On Fri, Jun 04, 2021 at 01:38:01PM +0800, Jason Wang wrote:
>> On 2021/5/14 7:13 PM, Michael S. Tsirkin wrote:
>>> On Thu, May 06, 2021 at 01:38:29PM +0100, Christoph Hellwig wrote:
>>>> On Thu, May 06, 2021 at 04:12:17AM -0400, Michael S. Tsirkin wrote:
>>>>> Let's try for just a bit, won't make this window anyway:
>>>>>
>>>>> I have an old idea. Add a way to find out that unmap is a nop
>>>>> (or more exactly does not use the address/length).
>>>>> Then in that case even with DMA API we do not need
>>>>> the extra data. Hmm?
>>>> So we actually do have a check for that from the early days of the DMA
>>>> API, but it only works at compile time: CONFIG_NEED_DMA_MAP_STATE.
>>>>
>>>> But given how rare configs without an iommu or swiotlb are these days
>>>> it has stopped to be very useful.  Unfortunately a runtime-version is
>>>> not entirely trivial, but maybe if we allow for false positives we
>>>> could do something like this
>>>>
>>>> bool dma_direct_need_state(struct device *dev)
>>>> {
>>>> 	/* some areas could not be covered by any map at all */
>>>> 	if (dev->dma_range_map)
>>>> 		return false;
>>>> 	if (force_dma_unencrypted(dev))
>>>> 		return false;
>>>> 	if (dma_direct_need_sync(dev))
>>>> 		return false;
>>>> 	return *dev->dma_mask == DMA_BIT_MASK(64);
>>>> }
>>>>
>>>> bool dma_need_state(struct device *dev)
>>>> {
>>>> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>>>>
>>>> 	if (dma_map_direct(dev, ops))
>>>> 		return dma_direct_need_state(dev);
>>>> 	return ops->unmap_page ||
>>>> 		ops->sync_single_for_cpu || ops->sync_single_for_device;
>>>> }
>>> Yea that sounds like a good idea. We will need to document that.
>>>
>>>
>>> Something like:
>>>
>>> /*
>>>    * dma_need_state - report whether unmap calls use the address and length
>>>    * @dev: device to query
>>>    *
>>>    * This is a runtime version of CONFIG_NEED_DMA_MAP_STATE.
>>>    *
>>>    * Return the value indicating whether dma_unmap_* and dma_sync_* calls for the device
>>>    * use the DMA state parameters passed to them.
>>>    * The DMA state parameters are: scatter/gather list/table, address and
>>>    * length.
>>>    *
>>>    * If dma_need_state returns false then DMA state parameters are
>>>    * ignored by all dma_unmap_* and dma_sync_* calls, so it is safe to pass 0 for
>>>    * address and length, and DMA_UNMAP_SG_TABLE_INVALID and
>>>    * DMA_UNMAP_SG_LIST_INVALID for s/g table and length respectively.
>>>    * If dma_need_state returns true then DMA state might
>>>    * be used and so the actual values are required.
>>>    */
>>>
>>> And we will need DMA_UNMAP_SG_TABLE_INVALID and
>>> DMA_UNMAP_SG_LIST_INVALID as pointers to an empty global table and list
>>> for calls such as dma_unmap_sgtable that dereference pointers before checking
>>> they are used.
>>>
>>>
>>> Does this look good?
>>>
>>> The table/length variants are for consistency, virtio specifically does
>>> not use s/g at the moment, but it seems nicer than leaving
>>> users wonder what to do about these.
>>>
>>> Thoughts? Jason want to try implementing?
>>
>> I can add it to my todo list. If other people are interested in this,
>> please let us know.
>>
>> But this is just about saving the effort of unmap, and it doesn't eliminate
>> the necessity of using private memory (addr, length) for the metadata for
>> validating the device inputs.
>
> Besides unmap, why do we need to validate address?


Sorry, it's not actually validating; the driver doesn't do any
validation. As the subject says, the driver will just use the metadata
stored in the desc_state instead of the one stored in the descriptor ring.
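
A minimal userspace sketch of that idea follows. All names (`ring_desc`, `desc_extra`, `toy_vq`, `toy_add_desc`, `toy_unmap_args`) are hypothetical simplifications and not the code from the series: the driver writes each descriptor to the shared ring, keeps a private copy, and later derives the unmap arguments from the private copy only.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified descriptor as laid out in the ring; the device can write it. */
struct ring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Driver-private copy, kept in memory the device cannot reach. */
struct desc_extra {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

struct toy_vq {
	struct ring_desc *ring;   /* shared with the device */
	struct desc_extra *extra; /* private to the driver */
	unsigned int num;
};

/* Publish a descriptor and remember exactly what was written. */
static void toy_add_desc(struct toy_vq *vq, unsigned int i,
			 uint64_t addr, uint32_t len, uint16_t flags)
{
	vq->ring[i].addr = addr;
	vq->ring[i].len = len;
	vq->ring[i].flags = flags;

	vq->extra[i].addr = addr;
	vq->extra[i].len = len;
	vq->extra[i].flags = flags;
}

/* Unmap arguments come from the private copy; the ring is never read. */
static void toy_unmap_args(const struct toy_vq *vq, unsigned int i,
			   uint64_t *addr, uint32_t *len)
{
	*addr = vq->extra[i].addr;
	*len = vq->extra[i].len;
}
```

With this layout, a device that overwrites `ring[i]` after the driver published it cannot influence what gets unmapped, since the ring is write-only from the driver's point of view.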


>   length can be
> typically validated by specific drivers - not all of them even use it ..
>
>> And just to clarify, the slight regression we see is testing without
>> VIRTIO_F_ACCESS_PLATFORM which means DMA API is not used.
> I guess this is due to extra cache pressure?


Yes.


> Maybe create yet another
> array just for DMA state ...


I'm not sure I get this; this is basically what we use:

struct vring_desc_extra {
         dma_addr_t addr;                /* Buffer DMA addr. */
         u32 len;                        /* Buffer length. */
         u16 flags;                      /* Descriptor flags. */
         u16 next;                       /* The next desc state in a list. */
};

Except for the "next" the rest are all DMA state.
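
For illustration, the "next" field is what lets the driver follow a descriptor chain without reading the ring. The sketch below is a userspace approximation with hypothetical names (`toy_desc_extra`, `toy_chain_len`, `TOY_DESC_F_NEXT`); `uint64_t` stands in for `dma_addr_t`, and the bound on the walk is an assumption added here rather than something described in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as VRING_DESC_F_NEXT: the chain continues via "next". */
#define TOY_DESC_F_NEXT 1

/* Userspace stand-in for the private per-descriptor metadata. */
struct toy_desc_extra {
	uint64_t addr;  /* buffer DMA address */
	uint32_t len;   /* buffer length */
	uint16_t flags; /* descriptor flags */
	uint16_t next;  /* next descriptor in the chain */
};

/*
 * Walk a chain starting at head using only the private metadata.
 * Returns the number of descriptors in the chain, or -1 if an index is
 * out of range or the chain is longer than the ring (i.e. it loops).
 */
static int toy_chain_len(const struct toy_desc_extra *extra,
			 unsigned int num, unsigned int head)
{
	unsigned int i = head;
	unsigned int n = 0;

	for (;;) {
		if (i >= num || n >= num)
			return -1;
		n++;
		if (!(extra[i].flags & TOY_DESC_F_NEXT))
			return (int)n;
		i = extra[i].next;
	}
}
```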

Thanks


>
>> So I will go to post a formal version of this series and we can start from
>> there.
>>
>> Thanks
>>
>>
Michael S. Tsirkin July 12, 2021, 12:58 p.m. UTC | #17
On Mon, Jul 12, 2021 at 11:07:44AM +0800, Jason Wang wrote:
> 
> On 2021/7/12 12:08 AM, Michael S. Tsirkin wrote:
> > On Fri, Jun 04, 2021 at 01:38:01PM +0800, Jason Wang wrote:
> > > On 2021/5/14 7:13 PM, Michael S. Tsirkin wrote:
> > > > On Thu, May 06, 2021 at 01:38:29PM +0100, Christoph Hellwig wrote:
> > > > > On Thu, May 06, 2021 at 04:12:17AM -0400, Michael S. Tsirkin wrote:
> > > > > > Let's try for just a bit, won't make this window anyway:
> > > > > > 
> > > > > > I have an old idea. Add a way to find out that unmap is a nop
> > > > > > (or more exactly does not use the address/length).
> > > > > > Then in that case even with DMA API we do not need
> > > > > > the extra data. Hmm?
> > > > > So we actually do have a check for that from the early days of the DMA
> > > > > API, but it only works at compile time: CONFIG_NEED_DMA_MAP_STATE.
> > > > > 
> > > > > But given how rare configs without an iommu or swiotlb are these days
> > > > > it has stopped to be very useful.  Unfortunately a runtime-version is
> > > > > not entirely trivial, but maybe if we allow for false positives we
> > > > > could do something like this
> > > > > 
> > > > > bool dma_direct_need_state(struct device *dev)
> > > > > {
> > > > > 	/* some areas could not be covered by any map at all */
> > > > > 	if (dev->dma_range_map)
> > > > > 		return false;
> > > > > 	if (force_dma_unencrypted(dev))
> > > > > 		return false;
> > > > > 	if (dma_direct_need_sync(dev))
> > > > > 		return false;
> > > > > 	return *dev->dma_mask == DMA_BIT_MASK(64);
> > > > > }
> > > > > 
> > > > > bool dma_need_state(struct device *dev)
> > > > > {
> > > > > 	const struct dma_map_ops *ops = get_dma_ops(dev);
> > > > > 
> > > > > 	if (dma_map_direct(dev, ops))
> > > > > 		return dma_direct_need_state(dev);
> > > > > 	return ops->unmap_page ||
> > > > > 		ops->sync_single_for_cpu || ops->sync_single_for_device;
> > > > > }
> > > > Yea that sounds like a good idea. We will need to document that.
> > > > 
> > > > 
> > > > Something like:
> > > > 
> > > > /*
> > > >    * dma_need_state - report whether unmap calls use the address and length
> > > >    * @dev: device to query
> > > >    *
> > > >    * This is a runtime version of CONFIG_NEED_DMA_MAP_STATE.
> > > >    *
> > > >    * Return the value indicating whether dma_unmap_* and dma_sync_* calls for the device
> > > >    * use the DMA state parameters passed to them.
> > > >    * The DMA state parameters are: scatter/gather list/table, address and
> > > >    * length.
> > > >    *
> > > >    * If dma_need_state returns false then DMA state parameters are
> > > >    * ignored by all dma_unmap_* and dma_sync_* calls, so it is safe to pass 0 for
> > > >    * address and length, and DMA_UNMAP_SG_TABLE_INVALID and
> > > >    * DMA_UNMAP_SG_LIST_INVALID for s/g table and length respectively.
> > > >    * If dma_need_state returns true then DMA state might
> > > >    * be used and so the actual values are required.
> > > >    */
> > > > 
> > > > And we will need DMA_UNMAP_SG_TABLE_INVALID and
> > > > DMA_UNMAP_SG_LIST_INVALID as pointers to an empty global table and list
> > > > for calls such as dma_unmap_sgtable that dereference pointers before checking
> > > > they are used.
> > > > 
> > > > 
> > > > Does this look good?
> > > > 
> > > > The table/length variants are for consistency, virtio specifically does
> > > > not use s/g at the moment, but it seems nicer than leaving
> > > > users wonder what to do about these.
> > > > 
> > > > Thoughts? Jason want to try implementing?
> > > 
> > > I can add it to my todo list. If other people are interested in this,
> > > please let us know.
> > > 
> > > But this is just about saving the effort of unmap, and it doesn't eliminate
> > > the necessity of using private memory (addr, length) for the metadata for
> > > validating the device inputs.
> > 
> > Besides unmap, why do we need to validate address?
> 
> 
> Sorry, it's not actually validating; the driver doesn't do any validation.
> As the subject says, the driver will just use the metadata stored in the
> desc_state instead of the one stored in the descriptor ring.
> 
> 
> >   length can be
> > typically validated by specific drivers - not all of them even use it ..
> > 
> > > And just to clarify, the slight regression we see is testing without
> > > VIRTIO_F_ACCESS_PLATFORM which means DMA API is not used.
> > I guess this is due to extra cache pressure?
> 
> 
> Yes.
> 
> 
> > Maybe create yet another
> > array just for DMA state ...
> 
> 
> I'm not sure I get this, we use this basically:
> 
> struct vring_desc_extra {
>         dma_addr_t addr;                /* Buffer DMA addr. */
>         u32 len;                        /* Buffer length. */
>         u16 flags;                      /* Descriptor flags. */
>         u16 next;                       /* The next desc state in a list. */
> };
> 
> Except for the "next" the rest are all DMA state.
> 
> Thanks


I am talking about the dma_need_state idea, where we interrogate the DMA
API to figure out whether unmap is actually a nop.
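
Roughly, a separate array for DMA state could then be allocated only when it is needed. The following is a userspace sketch under stated assumptions: `dma_need_state()` exists only as a proposal in this thread, so the sketch takes its answer as a plain boolean, and every name in it (`toy_dma_state`, `toy_ring`, `toy_ring_init`, `toy_ring_record`, `toy_ring_destroy`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-descriptor DMA state, only meaningful if unmap uses addr/len. */
struct toy_dma_state {
	uint64_t addr;
	uint32_t len;
};

struct toy_ring {
	unsigned int num;
	bool need_state;             /* answer of the proposed runtime query */
	struct toy_dma_state *state; /* NULL when unmap ignores its arguments */
};

/* Allocate the DMA state array only when unmap would actually use it. */
static int toy_ring_init(struct toy_ring *r, unsigned int num, bool need_state)
{
	r->num = num;
	r->need_state = need_state;
	r->state = NULL;
	if (!need_state)
		return 0;
	r->state = calloc(num, sizeof(*r->state));
	return r->state ? 0 : -1;
}

/* Record addr/len for a later unmap; a no-op when no state is needed. */
static void toy_ring_record(struct toy_ring *r, unsigned int i,
			    uint64_t addr, uint32_t len)
{
	if (!r->need_state)
		return;
	r->state[i].addr = addr;
	r->state[i].len = len;
}

static void toy_ring_destroy(struct toy_ring *r)
{
	free(r->state);
	r->state = NULL;
}
```

In the case where no state is needed, nothing extra is allocated or touched per descriptor, which is the cache-pressure saving being discussed.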

> 
> > 
> > > So I will go to post a formal version of this series and we can start from
> > > there.
> > > 
> > > Thanks
> > > 
> > >