mbox series

[net-next,0/3] vhost: accelerate metadata access through vmap()

Message ID 20181213101022.12475-1-jasowang@redhat.com (mailing list archive)
Headers show
Series vhost: accelerate metadata access through vmap() | expand

Message

Jason Wang Dec. 13, 2018, 10:10 a.m. UTC
Hi:

This series tries to access virtqueue metadata through kernel virtual
address instead of copy_user() friends since they had too much
overheads like checks, spec barriers or even hardware feature
toggling.

Test shows about 24% improvement on TX PPS. It should benefit other
cases as well.

Please review

Jason Wang (3):
  vhost: generalize adding used elem
  vhost: fine grain userspace memory accessors
  vhost: access vq metadata through kernel virtual address

 drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.h |  11 ++
 2 files changed, 266 insertions(+), 26 deletions(-)

Comments

Michael S. Tsirkin Dec. 13, 2018, 3:27 p.m. UTC | #1
On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.

Userspace accesses through remapping tricks and next time there's a need
for a new barrier we are left to figure it out by ourselves.  I don't
like the idea I have to say.  As a first step, why don't we switch to
unsafe_put_user/unsafe_get_user etc?
That would be more of an apples to apples comparison, would it not?


> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.
> 
> Please review
> 
> Jason Wang (3):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>  drivers/vhost/vhost.h |  11 ++
>  2 files changed, 266 insertions(+), 26 deletions(-)
> 
> -- 
> 2.17.1
Michael S. Tsirkin Dec. 13, 2018, 8:12 p.m. UTC | #2
On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.
> 
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.
> 
> Please review

I think the idea of speeding up userspace access is a good one.
However I think that moving all checks to start is way too aggressive.
Instead, let's batch things up but let's not keep them
around forever.
Here are some ideas:


1. Disable preemption, process a small number of small packets
   directly in an atomic context. This should cut latency
   down significantly, the tricky part is to only do it
   on a light load and disable this
   for the streaming case otherwise it's unfair.
   This might fail, if it does just bounce things out to
   a thread.

2. Switch to unsafe_put_user/unsafe_get_user,
   and batch up multiple accesses.

3. Allow adding a fixup point manually,
   such that multiple independent get_user accesses
   can get a single fixup (will allow better compiler
   optimizations).





> Jason Wang (3):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>  drivers/vhost/vhost.h |  11 ++
>  2 files changed, 266 insertions(+), 26 deletions(-)
> 
> -- 
> 2.17.1
Jason Wang Dec. 14, 2018, 3:42 a.m. UTC | #3
On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
> Userspace accesses through remapping tricks and next time there's a need
> for a new barrier we are left to figure it out by ourselves.


I don't get here, do you mean spec barriers? It's completely unnecessary 
for vhost which is kernel thread. And even if you're right, vhost is not 
the only place, there's lots of vmap() based accessing in kernel. Think 
in another direction, this means we won't suffer form unnecessary 
barriers for kthread like vhost in the future, we will manually pick the 
one we really need (but it should have little possibility).

Please notice we only access metdata through remapping not the data 
itself. This idea has been used for high speed userspace backend for 
years, e.g packet socket or recent AF_XDP. The only difference is the 
page was remap to from kernel to userspace.


>    I don't
> like the idea I have to say.  As a first step, why don't we switch to
> unsafe_put_user/unsafe_get_user etc?


Several reasons:

- They only have x86 variant, it won't have any difference for the rest 
of architecture.

- unsafe_put_user/unsafe_get_user is not sufficient for accessing 
structures (e.g accessing descriptor) or arrays (batching).

- Unless we can batch at least the accessing of two places in three of 
avail, used and descriptor in one run. There will be no difference. E.g 
we can batch updating used ring, but it won't make any difference in 
this case.


> That would be more of an apples to apples comparison, would it not?


Apples to apples comparison only help if we are the No.1. But the fact 
is we are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is 
the fastest method AFAIK.


Thanks


>
>
>> Test shows about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>>
>> Please review
>>
>> Jason Wang (3):
>>    vhost: generalize adding used elem
>>    vhost: fine grain userspace memory accessors
>>    vhost: access vq metadata through kernel virtual address
>>
>>   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>>   drivers/vhost/vhost.h |  11 ++
>>   2 files changed, 266 insertions(+), 26 deletions(-)
>>
>> -- 
>> 2.17.1
Jason Wang Dec. 14, 2018, 4:29 a.m. UTC | #4
On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
>>
>> Test shows about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>>
>> Please review
> I think the idea of speeding up userspace access is a good one.
> However I think that moving all checks to start is way too aggressive.


So did packet and AF_XDP. Anyway, sharing address space and access them 
directly is the fastest way. Performance is the major consideration for 
people to choose backend. Compare to userspace implementation, vhost 
does not have security advantages at any level. If vhost is still slow, 
people will start to develop backends based on e.g AF_XDP.


> Instead, let's batch things up but let's not keep them
> around forever.
> Here are some ideas:
>
>
> 1. Disable preemption, process a small number of small packets
>     directly in an atomic context. This should cut latency
>     down significantly, the tricky part is to only do it
>     on a light load and disable this
>     for the streaming case otherwise it's unfair.
>     This might fail, if it does just bounce things out to
>     a thread.


I'm not sure what context you meant here. Is this for TX path of TUN? 
But a fundamental difference is my series is targeted for extreme heavy 
load not light one, 100% cpu for vhost is expected.


>
> 2. Switch to unsafe_put_user/unsafe_get_user,
>     and batch up multiple accesses.


As I said, unless we can batch accessing of two difference places of 
three of avail, descriptor and used. It won't help for batching the 
accessing of a single place like used. I'm even not sure this can be 
done consider the case of packed virtqueue, we have a single descriptor 
ring. Batching through unsafe helpers may not help in this case since 
it's equivalent to safe ones . And This requires non trivial refactoring 
of vhost. And such refactoring itself make give us noticeable impact 
(e.g it may lead regression).


>
> 3. Allow adding a fixup point manually,
>     such that multiple independent get_user accesses
>     can get a single fixup (will allow better compiler
>     optimizations).
>

So for metadata access, I don't see how you suggest here can help in the 
case of heavy workload.

For data access, this may help but I've played to batch the data copy to 
reduce SMAP/spec barriers in vhost-net but I don't see performance 
improvement.

Thanks


>
>
>
>> Jason Wang (3):
>>    vhost: generalize adding used elem
>>    vhost: fine grain userspace memory accessors
>>    vhost: access vq metadata through kernel virtual address
>>
>>   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>>   drivers/vhost/vhost.h |  11 ++
>>   2 files changed, 266 insertions(+), 26 deletions(-)
>>
>> -- 
>> 2.17.1
Michael S. Tsirkin Dec. 14, 2018, 12:33 p.m. UTC | #5
On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
> 
> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends since they had too much
> > > overheads like checks, spec barriers or even hardware feature
> > > toggling.
> > Userspace accesses through remapping tricks and next time there's a need
> > for a new barrier we are left to figure it out by ourselves.
> 
> 
> I don't get here, do you mean spec barriers?

I mean the next barrier people decide to put into userspace
memory accesses.

> It's completely unnecessary for
> vhost which is kernel thread.

It's defence in depth. Take a look at the commit that added them.
And yes quite possibly in most cases we actually have a spec
barrier in the validation phase. If we do let's use the
unsafe variants so they can be found.

> And even if you're right, vhost is not the
> only place, there's lots of vmap() based accessing in kernel.

For sure. But if one can get by without get user pages, one
really should. Witness recently uncovered mess with file
backed storage.

> Think in
> another direction, this means we won't suffer form unnecessary barriers for
> kthread like vhost in the future, we will manually pick the one we really
> need

I personally think we should err on the side of caution not on the side of
performance.

> (but it should have little possibility).

History seems to teach otherwise.

> Please notice we only access metdata through remapping not the data itself.
> This idea has been used for high speed userspace backend for years, e.g
> packet socket or recent AF_XDP.

I think their justification for the higher risk is that they are mostly
designed for priveledged userspace.

> The only difference is the page was remap to
> from kernel to userspace.

At least that avoids the g.u.p mess.

> 
> >    I don't
> > like the idea I have to say.  As a first step, why don't we switch to
> > unsafe_put_user/unsafe_get_user etc?
> 
> 
> Several reasons:
> 
> - They only have x86 variant, it won't have any difference for the rest of
> architecture.

Is there an issue on other architectures? If yes they can be extended
there.

> - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
> (e.g accessing descriptor) or arrays (batching).

So you want unsafe_copy_xxx_user? I can do this. Hang on will post.

> - Unless we can batch at least the accessing of two places in three of
> avail, used and descriptor in one run. There will be no difference. E.g we
> can batch updating used ring, but it won't make any difference in this case.
> 

So let's batch them all?


> > That would be more of an apples to apples comparison, would it not?
> 
> 
> Apples to apples comparison only help if we are the No.1. But the fact is we
> are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is the
> fastest method AFAIK.
> 
> 
> Thanks

We need to speed up the packet access itself too though.
You can't vmap all of guest memory.


> 
> > 
> > 
> > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > cases as well.
> > > 
> > > Please review
> > > 
> > > Jason Wang (3):
> > >    vhost: generalize adding used elem
> > >    vhost: fine grain userspace memory accessors
> > >    vhost: access vq metadata through kernel virtual address
> > > 
> > >   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> > >   drivers/vhost/vhost.h |  11 ++
> > >   2 files changed, 266 insertions(+), 26 deletions(-)
> > > 
> > > -- 
> > > 2.17.1
Michael S. Tsirkin Dec. 14, 2018, 12:52 p.m. UTC | #6
On Fri, Dec 14, 2018 at 12:29:54PM +0800, Jason Wang wrote:
> 
> On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
> > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends since they had too much
> > > overheads like checks, spec barriers or even hardware feature
> > > toggling.
> > > 
> > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > cases as well.
> > > 
> > > Please review
> > I think the idea of speeding up userspace access is a good one.
> > However I think that moving all checks to start is way too aggressive.
> 
> 
> So did packet and AF_XDP. Anyway, sharing address space and access them
> directly is the fastest way. Performance is the major consideration for
> people to choose backend. Compare to userspace implementation, vhost does
> not have security advantages at any level. If vhost is still slow, people
> will start to develop backends based on e.g AF_XDP.
> 

Let them what's wrong with that?

> > Instead, let's batch things up but let's not keep them
> > around forever.
> > Here are some ideas:
> > 
> > 
> > 1. Disable preemption, process a small number of small packets
> >     directly in an atomic context. This should cut latency
> >     down significantly, the tricky part is to only do it
> >     on a light load and disable this
> >     for the streaming case otherwise it's unfair.
> >     This might fail, if it does just bounce things out to
> >     a thread.
> 
> 
> I'm not sure what context you meant here. Is this for TX path of TUN? But a
> fundamental difference is my series is targeted for extreme heavy load not
> light one, 100% cpu for vhost is expected.

Interesting. You only shared a TCP RR result though.
What's the performance gain in a heavy load case?

> 
> > 
> > 2. Switch to unsafe_put_user/unsafe_get_user,
> >     and batch up multiple accesses.
> 
> 
> As I said, unless we can batch accessing of two difference places of three
> of avail, descriptor and used. It won't help for batching the accessing of a
> single place like used. I'm even not sure this can be done consider the case
> of packed virtqueue, we have a single descriptor ring.

So that's one of the reasons packed should be faster. Single access
and you get the descriptor no messy redirects. Somehow your
benchmarking so far didn't show a gain with vhost and
packed though - do you know what's wrong?

> Batching through
> unsafe helpers may not help in this case since it's equivalent to safe ones
> . And This requires non trivial refactoring of vhost. And such refactoring
> itself make give us noticeable impact (e.g it may lead regression).
> 
> 
> > 
> > 3. Allow adding a fixup point manually,
> >     such that multiple independent get_user accesses
> >     can get a single fixup (will allow better compiler
> >     optimizations).
> > 
> 
> So for metadata access, I don't see how you suggest here can help in the
> case of heavy workload.
> 
> For data access, this may help but I've played to batch the data copy to
> reduce SMAP/spec barriers in vhost-net but I don't see performance
> improvement.
> 
> Thanks

So how about we try to figure what's going on actually?
Can you drop the barriers and show the same gain?
E.g. vmap does not use a huge page IIRC so in fact it
can be slower than direct access. It's not a magic
faster way.



> 
> > 
> > 
> > 
> > > Jason Wang (3):
> > >    vhost: generalize adding used elem
> > >    vhost: fine grain userspace memory accessors
> > >    vhost: access vq metadata through kernel virtual address
> > > 
> > >   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> > >   drivers/vhost/vhost.h |  11 ++
> > >   2 files changed, 266 insertions(+), 26 deletions(-)
> > > 
> > > -- 
> > > 2.17.1
Michael S. Tsirkin Dec. 14, 2018, 3:16 p.m. UTC | #7
On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.
> 
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.

BTW if the issue is all the error checking, maybe we should
consider using something like uaccess_catch and check
in a single place.


> Please review
> 
> Jason Wang (3):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>  drivers/vhost/vhost.h |  11 ++
>  2 files changed, 266 insertions(+), 26 deletions(-)
> 
> -- 
> 2.17.1
Michael S. Tsirkin Dec. 14, 2018, 3:31 p.m. UTC | #8
On Fri, Dec 14, 2018 at 07:33:20AM -0500, Michael S. Tsirkin wrote:
> > - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
> > (e.g accessing descriptor) or arrays (batching).
> 
> So you want unsafe_copy_xxx_user? I can do this. Hang on will post.

Like this basically? Warning: completely untested.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index b5e58cc0c5e7..d2afd70793ca 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -728,5 +728,10 @@ do {										\
 	if (unlikely(__gu_err)) goto err_label;					\
 } while (0)
 
+#define unsafe_copy_from_user(to, from, n)					\
+	 __copy_user_ll(to, (__force const void *)from, n)
+#define unsafe_copy_to_user(to, from, n)					\
+	 __copy_user_ll((__force void *)to, from, n)
+
 #endif /* _ASM_X86_UACCESS_H */
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index efe79c1cdd47..b9d515ba2920 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -271,6 +271,8 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 #define user_access_end() do { } while (0)
 #define unsafe_get_user(x, ptr, err) do { if (unlikely(__get_user(x, ptr))) goto err; } while (0)
 #define unsafe_put_user(x, ptr, err) do { if (unlikely(__put_user(x, ptr))) goto err; } while (0)
+#define unsafe_copy_to_user(from, to, n) copy_to_user(from, to, n)
+#define unsafe_copy_from_user(from, to, n) copy_from_user(from, to, n)
 #endif
 
 #ifdef CONFIG_HARDENED_USERCOPY
David Miller Dec. 15, 2018, 7:43 p.m. UTC | #9
From: Jason Wang <jasowang@redhat.com>
Date: Fri, 14 Dec 2018 12:29:54 +0800

> 
> On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>> Hi:
>>>
>>> This series tries to access virtqueue metadata through kernel virtual
>>> address instead of copy_user() friends since they had too much
>>> overheads like checks, spec barriers or even hardware feature
>>> toggling.
>>>
>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>> cases as well.
>>>
>>> Please review
>> I think the idea of speeding up userspace access is a good one.
>> However I think that moving all checks to start is way too aggressive.
> 
> 
> So did packet and AF_XDP. Anyway, sharing address space and access
> them directly is the fastest way. Performance is the major
> consideration for people to choose backend. Compare to userspace
> implementation, vhost does not have security advantages at any
> level. If vhost is still slow, people will start to develop backends
> based on e.g AF_XDP.

Exactly, this is precisely how this kind of problem should be solved.

Michael, I strongly support the approach Jason is taking here, and I
would like to ask you to seriously reconsider your objections.

Thank you.
Michael S. Tsirkin Dec. 16, 2018, 7:57 p.m. UTC | #10
On Sat, Dec 15, 2018 at 11:43:08AM -0800, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Fri, 14 Dec 2018 12:29:54 +0800
> 
> > 
> > On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
> >> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> >>> Hi:
> >>>
> >>> This series tries to access virtqueue metadata through kernel virtual
> >>> address instead of copy_user() friends since they had too much
> >>> overheads like checks, spec barriers or even hardware feature
> >>> toggling.
> >>>
> >>> Test shows about 24% improvement on TX PPS. It should benefit other
> >>> cases as well.
> >>>
> >>> Please review
> >> I think the idea of speeding up userspace access is a good one.
> >> However I think that moving all checks to start is way too aggressive.
> > 
> > 
> > So did packet and AF_XDP. Anyway, sharing address space and access
> > them directly is the fastest way. Performance is the major
> > consideration for people to choose backend. Compare to userspace
> > implementation, vhost does not have security advantages at any
> > level. If vhost is still slow, people will start to develop backends
> > based on e.g AF_XDP.
> 
> Exactly, this is precisely how this kind of problem should be solved.
> 
> Michael, I strongly support the approach Jason is taking here, and I
> would like to ask you to seriously reconsider your objections.
> 
> Thank you.

Okay. Won't be the first time I'm wrong.

Let's say we ignore security aspects, but we need to make sure the
following all keep working (broken with this revision):
- file backed memory (I didn't see where we mark memory dirty -
  if we don't we get guest memory corruption on close, if we do
  then host crash as https://lwn.net/Articles/774411/ seems to apply here?)
- THP
- auto-NUMA

Because vhost isn't like AF_XDP where you can just tell people "use
hugetlbfs" and "data is removed on close" - people are using it in lots
of configurations with guest memory shared between rings and unrelated
data.

Jason, thoughts on these?
Jason Wang Dec. 24, 2018, 8:32 a.m. UTC | #11
On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> This series tries to access virtqueue metadata through kernel virtual
>>>> address instead of copy_user() friends since they had too much
>>>> overheads like checks, spec barriers or even hardware feature
>>>> toggling.
>>> Userspace accesses through remapping tricks and next time there's a need
>>> for a new barrier we are left to figure it out by ourselves.
>>
>> I don't get here, do you mean spec barriers?
> I mean the next barrier people decide to put into userspace
> memory accesses.
>
>> It's completely unnecessary for
>> vhost which is kernel thread.
> It's defence in depth. Take a look at the commit that added them.
> And yes quite possibly in most cases we actually have a spec
> barrier in the validation phase. If we do let's use the
> unsafe variants so they can be found.


unsafe variants can only work if you can batch userspace access. This is 
not necessarily the case for light load.


>
>> And even if you're right, vhost is not the
>> only place, there's lots of vmap() based accessing in kernel.
> For sure. But if one can get by without get user pages, one
> really should. Witness recently uncovered mess with file
> backed storage.


We only pin metadata pages, I don't believe they will be used by any DMA.


>
>> Think in
>> another direction, this means we won't suffer form unnecessary barriers for
>> kthread like vhost in the future, we will manually pick the one we really
>> need
> I personally think we should err on the side of caution not on the side of
> performance.


So what you suggest may lead unnecessary performance regression 
(10%-20%) which is part of the goal of this series. We should audit and 
only use the one we really need instead of depending on copy_user() 
friends().

If we do it our own, it could be slow for for security fix but it's no 
less safe than before with performance kept.


>
>> (but it should have little possibility).
> History seems to teach otherwise.


What case did you mean here?


>
>> Please notice we only access metdata through remapping not the data itself.
>> This idea has been used for high speed userspace backend for years, e.g
>> packet socket or recent AF_XDP.
> I think their justification for the higher risk is that they are mostly
> designed for priveledged userspace.


I think it's the same with TUN/TAP, privileged process can pass them to 
unprivileged ones.


>
>> The only difference is the page was remap to
>> from kernel to userspace.
> At least that avoids the g.u.p mess.


I'm still not very clear at the point. We only pin 2 or 4 pages, they're 
several other cases that will pin much more.


>
>>>     I don't
>>> like the idea I have to say.  As a first step, why don't we switch to
>>> unsafe_put_user/unsafe_get_user etc?
>>
>> Several reasons:
>>
>> - They only have x86 variant, it won't have any difference for the rest of
>> architecture.
> Is there an issue on other architectures? If yes they can be extended
> there.


Consider the unexpected amount of work and in the best case it can give 
the same performance to vmap(). I'm not sure it's worth.


>
>> - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
>> (e.g accessing descriptor) or arrays (batching).
> So you want unsafe_copy_xxx_user? I can do this. Hang on will post.
>
>> - Unless we can batch at least the accessing of two places in three of
>> avail, used and descriptor in one run. There will be no difference. E.g we
>> can batch updating used ring, but it won't make any difference in this case.
>>
> So let's batch them all?


Batching might not help for the case of light load. And we need to 
measure the gain/cost of batching itself.


>
>
>>> That would be more of an apples to apples comparison, would it not?
>>
>> Apples to apples comparison only help if we are the No.1. But the fact is we
>> are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is the
>> fastest method AFAIK.
>>
>>
>> Thanks
> We need to speed up the packet access itself too though.
> You can't vmap all of guest memory.


This series only pin and vmap very few pages (metadata).

Thanks


>
>
>>>
>>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>>> cases as well.
>>>>
>>>> Please review
>>>>
>>>> Jason Wang (3):
>>>>     vhost: generalize adding used elem
>>>>     vhost: fine grain userspace memory accessors
>>>>     vhost: access vq metadata through kernel virtual address
>>>>
>>>>    drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>>>>    drivers/vhost/vhost.h |  11 ++
>>>>    2 files changed, 266 insertions(+), 26 deletions(-)
>>>>
>>>> -- 
>>>> 2.17.1
Jason Wang Dec. 24, 2018, 8:44 a.m. UTC | #12
On 2018/12/17 上午3:57, Michael S. Tsirkin wrote:
> On Sat, Dec 15, 2018 at 11:43:08AM -0800, David Miller wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Date: Fri, 14 Dec 2018 12:29:54 +0800
>>
>>> On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
>>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>>> Hi:
>>>>>
>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>> address instead of copy_user() friends since they had too much
>>>>> overheads like checks, spec barriers or even hardware feature
>>>>> toggling.
>>>>>
>>>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>>>> cases as well.
>>>>>
>>>>> Please review
>>>> I think the idea of speeding up userspace access is a good one.
>>>> However I think that moving all checks to start is way too aggressive.
>>>
>>> So did packet and AF_XDP. Anyway, sharing address space and access
>>> them directly is the fastest way. Performance is the major
>>> consideration for people to choose backend. Compare to userspace
>>> implementation, vhost does not have security advantages at any
>>> level. If vhost is still slow, people will start to develop backends
>>> based on e.g AF_XDP.
>> Exactly, this is precisely how this kind of problem should be solved.
>>
>> Michael, I strongly support the approach Jason is taking here, and I
>> would like to ask you to seriously reconsider your objections.
>>
>> Thank you.
> Okay. Won't be the first time I'm wrong.
>
> Let's say we ignore security aspects, but we need to make sure the
> following all keep working (broken with this revision):
> - file backed memory (I didn't see where we mark memory dirty -
>    if we don't we get guest memory corruption on close, if we do
>    then host crash as https://lwn.net/Articles/774411/ seems to apply here?)


We only pin metadata pages, so I don't think they can be used for DMA. 
So it was probably not an issue. The real issue is zerocopy codes, maybe 
it's time to disable it by default?


> - THP


We will miss 2 or 4 pages for THP, I wonder whether or not it's measurable.


> - auto-NUMA


I'm not sure auto-NUMA will help for the case of IPC. It can damage the 
performance in the worst case if vhost and userspace are running in two 
different nodes. Anyway I can measure.


>
> Because vhost isn't like AF_XDP where you can just tell people "use
> hugetlbfs" and "data is removed on close" - people are using it in lots
> of configurations with guest memory shared between rings and unrelated
> data.


This series doesn't share data, only metadata is shared.


>
> Jason, thoughts on these?
>

Based on the above, I can measure the impact of THP to see how it impacts.

For unsafe variants, it can only work for when we can batch the access 
and it needs non trivial rework on the vhost codes with unexpected 
amount of work for archs other than x86. I'm not sure it's worth to try.

Thanks
Michael S. Tsirkin Dec. 24, 2018, 6:12 p.m. UTC | #13
On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
> 
> On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
> > On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
> > > On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > > > Hi:
> > > > > 
> > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > address instead of copy_user() friends since they had too much
> > > > > overheads like checks, spec barriers or even hardware feature
> > > > > toggling.
> > > > Userspace accesses through remapping tricks and next time there's a need
> > > > for a new barrier we are left to figure it out by ourselves.
> > > 
> > > I don't get here, do you mean spec barriers?
> > I mean the next barrier people decide to put into userspace
> > memory accesses.
> > 
> > > It's completely unnecessary for
> > > vhost which is kernel thread.
> > It's defence in depth. Take a look at the commit that added them.
> > And yes quite possibly in most cases we actually have a spec
> > barrier in the validation phase. If we do let's use the
> > unsafe variants so they can be found.
> 
> 
> unsafe variants can only work if you can batch userspace access. This is not
> necessarily the case for light load.


Do we care a lot about the light load? How would you benchmark it?


> 
> > 
> > > And even if you're right, vhost is not the
> > > only place, there's lots of vmap() based accessing in kernel.
> > For sure. But if one can get by without get user pages, one
> > really should. Witness recently uncovered mess with file
> > backed storage.
> 
> 
> We only pin metadata pages, I don't believe they will be used by any DMA.

It doesn't matter really, if you dirty pages behind the MM back
the problem is there.

> 
> > 
> > > Think in
> > > another direction, this means we won't suffer form unnecessary barriers for
> > > kthread like vhost in the future, we will manually pick the one we really
> > > need
> > I personally think we should err on the side of caution not on the side of
> > performance.
> 
> 
> So what you suggest may lead unnecessary performance regression (10%-20%)
> which is part of the goal of this series. We should audit and only use the
> one we really need instead of depending on copy_user() friends().
> 
> If we do it our own, it could be slow for for security fix but it's no less
> safe than before with performance kept.
> 
> 
> > 
> > > (but it should have little possibility).
> > History seems to teach otherwise.
> 
> 
> What case did you mean here?
> 
> 
> > 
> > > Please notice we only access metdata through remapping not the data itself.
> > > This idea has been used for high speed userspace backend for years, e.g
> > > packet socket or recent AF_XDP.
> > I think their justification for the higher risk is that they are mostly
> > designed for priveledged userspace.
> 
> 
> I think it's the same with TUN/TAP, privileged process can pass them to
> unprivileged ones.
> 
> 
> > 
> > > The only difference is the page was remap to
> > > from kernel to userspace.
> > At least that avoids the g.u.p mess.
> 
> 
> I'm still not very clear at the point. We only pin 2 or 4 pages, they're
> several other cases that will pin much more.
> 
> 
> > 
> > > >     I don't
> > > > like the idea I have to say.  As a first step, why don't we switch to
> > > > unsafe_put_user/unsafe_get_user etc?
> > > 
> > > Several reasons:
> > > 
> > > - They only have x86 variant, it won't have any difference for the rest of
> > > architecture.
> > Is there an issue on other architectures? If yes they can be extended
> > there.
> 
> 
> Consider the unexpected amount of work and in the best case it can give the
> same performance to vmap(). I'm not sure it's worth.
> 
> 
> > 
> > > - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
> > > (e.g accessing descriptor) or arrays (batching).
> > So you want unsafe_copy_xxx_user? I can do this. Hang on will post.
> > 
> > > - Unless we can batch at least the accessing of two places in three of
> > > avail, used and descriptor in one run. There will be no difference. E.g we
> > > can batch updating used ring, but it won't make any difference in this case.
> > > 
> > So let's batch them all?
> 
> 
> Batching might not help for the case of light load. And we need to measure
> the gain/cost of batching itself.
> 
> 
> > 
> > 
> > > > That would be more of an apples to apples comparison, would it not?
> > > 
> > > Apples to apples comparison only help if we are the No.1. But the fact is we
> > > are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is the
> > > fastest method AFAIK.
> > > 
> > > 
> > > Thanks
> > We need to speed up the packet access itself too though.
> > You can't vmap all of guest memory.
> 
> 
> This series only pin and vmap very few pages (metadata).
> 
> Thanks
> 
> 
> > 
> > 
> > > > 
> > > > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > > > cases as well.
> > > > > 
> > > > > Please review
> > > > > 
> > > > > Jason Wang (3):
> > > > >     vhost: generalize adding used elem
> > > > >     vhost: fine grain userspace memory accessors
> > > > >     vhost: access vq metadata through kernel virtual address
> > > > > 
> > > > >    drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> > > > >    drivers/vhost/vhost.h |  11 ++
> > > > >    2 files changed, 266 insertions(+), 26 deletions(-)
> > > > > 
> > > > > -- 
> > > > > 2.17.1
Michael S. Tsirkin Dec. 24, 2018, 7:09 p.m. UTC | #14
On Mon, Dec 24, 2018 at 04:44:14PM +0800, Jason Wang wrote:
> 
> On 2018/12/17 上午3:57, Michael S. Tsirkin wrote:
> > On Sat, Dec 15, 2018 at 11:43:08AM -0800, David Miller wrote:
> > > From: Jason Wang <jasowang@redhat.com>
> > > Date: Fri, 14 Dec 2018 12:29:54 +0800
> > > 
> > > > On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > > > > Hi:
> > > > > > 
> > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > address instead of copy_user() friends since they had too much
> > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > toggling.
> > > > > > 
> > > > > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > > > > cases as well.
> > > > > > 
> > > > > > Please review
> > > > > I think the idea of speeding up userspace access is a good one.
> > > > > However I think that moving all checks to start is way too aggressive.
> > > > 
> > > > So did packet and AF_XDP. Anyway, sharing address space and access
> > > > them directly is the fastest way. Performance is the major
> > > > consideration for people to choose backend. Compare to userspace
> > > > implementation, vhost does not have security advantages at any
> > > > level. If vhost is still slow, people will start to develop backends
> > > > based on e.g AF_XDP.
> > > Exactly, this is precisely how this kind of problem should be solved.
> > > 
> > > Michael, I strongly support the approach Jason is taking here, and I
> > > would like to ask you to seriously reconsider your objections.
> > > 
> > > Thank you.
> > Okay. Won't be the first time I'm wrong.
> > 
> > Let's say we ignore security aspects, but we need to make sure the
> > following all keep working (broken with this revision):
> > - file backed memory (I didn't see where we mark memory dirty -
> >    if we don't we get guest memory corruption on close, if we do
> >    then host crash as https://lwn.net/Articles/774411/ seems to apply here?)
> 
> 
> We only pin metadata pages, so I don't think they can be used for DMA. So it
> was probably not an issue. The real issue is zerocopy codes, maybe it's time
> to disable it by default?
> 
> 
> > - THP
> 
> 
> We will miss 2 or 4 pages for THP, I wonder whether or not it's measurable.
> 
> 
> > - auto-NUMA
> 
> 
> I'm not sure auto-NUMA will help for the case of IPC. It can damage the
> performance in the worst case if vhost and userspace are running in two
> different nodes. Anyway I can measure.
> 
> 
> > 
> > Because vhost isn't like AF_XDP where you can just tell people "use
> > hugetlbfs" and "data is removed on close" - people are using it in lots
> > of configurations with guest memory shared between rings and unrelated
> > data.
> 
> 
> This series doesn't share data, only metadata is shared.

Let me clarify - I mean that metadata is in same huge page with
unrelated guest data. 

> 
> > 
> > Jason, thoughts on these?
> > 
> 
> Based on the above, I can measure the impact of THP to see how it impacts.
> 
> For unsafe variants, it can only work for when we can batch the access and
> it needs non trivial rework on the vhost codes with unexpected amount of
> work for archs other than x86. I'm not sure it's worth to try.
> 
> Thanks

Yes I think we need better APIs in vhost. Right now
we have an API to get and translate a single buffer.
We should have one that gets a batch of descriptors
and stores it, then one that translates this batch.

IMHO this will benefit everyone even if we do vmap due to
better code locality.
Jason Wang Dec. 25, 2018, 10:09 a.m. UTC | #15
On 2018/12/25 上午2:12, Michael S. Tsirkin wrote:
> On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
>> On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
>>> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>>>> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
>>>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>>>> Hi:
>>>>>>
>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>> address instead of copy_user() friends since they had too much
>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>> toggling.
>>>>> Userspace accesses through remapping tricks and next time there's a need
>>>>> for a new barrier we are left to figure it out by ourselves.
>>>> I don't get here, do you mean spec barriers?
>>> I mean the next barrier people decide to put into userspace
>>> memory accesses.
>>>
>>>> It's completely unnecessary for
>>>> vhost which is kernel thread.
>>> It's defence in depth. Take a look at the commit that added them.
>>> And yes quite possibly in most cases we actually have a spec
>>> barrier in the validation phase. If we do let's use the
>>> unsafe variants so they can be found.
>>
>> unsafe variants can only work if you can batch userspace access. This is not
>> necessarily the case for light load.
>
> Do we care a lot about the light load? How would you benchmark it?
>

If we can reduce the latency that's will be more than what we expect.

1 byte TCP_RR shows 1.5%-2% improvement.


>>>> And even if you're right, vhost is not the
>>>> only place, there's lots of vmap() based accessing in kernel.
>>> For sure. But if one can get by without get user pages, one
>>> really should. Witness recently uncovered mess with file
>>> backed storage.
>>
>> We only pin metadata pages, I don't believe they will be used by any DMA.
> It doesn't matter really, if you dirty pages behind the MM back
> the problem is there.


Ok, but the usual case is anonymous pages, do we use file backed pages 
for user of vhost? And even if we use sometime, according to the pointer 
it's not something that can fix, RFC has been posted to solve this issue.

Thanks
Michael S. Tsirkin Dec. 25, 2018, 12:52 p.m. UTC | #16
On Tue, Dec 25, 2018 at 06:09:19PM +0800, Jason Wang wrote:
> 
> On 2018/12/25 上午2:12, Michael S. Tsirkin wrote:
> > On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
> > > On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
> > > > On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
> > > > > On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> > > > > > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > > > > > Hi:
> > > > > > > 
> > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > toggling.
> > > > > > Userspace accesses through remapping tricks and next time there's a need
> > > > > > for a new barrier we are left to figure it out by ourselves.
> > > > > I don't get here, do you mean spec barriers?
> > > > I mean the next barrier people decide to put into userspace
> > > > memory accesses.
> > > > 
> > > > > It's completely unnecessary for
> > > > > vhost which is kernel thread.
> > > > It's defence in depth. Take a look at the commit that added them.
> > > > And yes quite possibly in most cases we actually have a spec
> > > > barrier in the validation phase. If we do let's use the
> > > > unsafe variants so they can be found.
> > > 
> > > unsafe variants can only work if you can batch userspace access. This is not
> > > necessarily the case for light load.
> > 
> > Do we care a lot about the light load? How would you benchmark it?
> > 
> 
> If we can reduce the latency that's will be more than what we expect.
> 
> 1 byte TCP_RR shows 1.5%-2% improvement.

It's nice but not great. E.g. adaptive polling would be
a better approach to work on latency imho.

> 
> > > > > And even if you're right, vhost is not the
> > > > > only place, there's lots of vmap() based accessing in kernel.
> > > > For sure. But if one can get by without get user pages, one
> > > > really should. Witness recently uncovered mess with file
> > > > backed storage.
> > > 
> > > We only pin metadata pages, I don't believe they will be used by any DMA.
> > It doesn't matter really, if you dirty pages behind the MM back
> > the problem is there.
> 
> 
> Ok, but the usual case is anonymous pages, do we use file backed pages for
> user of vhost?

Some people use file backed pages for vms.
Nothing prevents them from using vhost as well.

> And even if we use sometime, according to the pointer it's
> not something that can fix, RFC has been posted to solve this issue.
> 
> Thanks

Except it's not broken if we don't to gup + write.
So yea, wait for rfc to be merged.
Jason Wang Dec. 26, 2018, 3:59 a.m. UTC | #17
On 2018/12/25 下午8:52, Michael S. Tsirkin wrote:
> On Tue, Dec 25, 2018 at 06:09:19PM +0800, Jason Wang wrote:
>> On 2018/12/25 上午2:12, Michael S. Tsirkin wrote:
>>> On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
>>>> On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
>>>>> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>>>>>> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>>>> address instead of copy_user() friends since they had too much
>>>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>>>> toggling.
>>>>>>> Userspace accesses through remapping tricks and next time there's a need
>>>>>>> for a new barrier we are left to figure it out by ourselves.
>>>>>> I don't get here, do you mean spec barriers?
>>>>> I mean the next barrier people decide to put into userspace
>>>>> memory accesses.
>>>>>
>>>>>> It's completely unnecessary for
>>>>>> vhost which is kernel thread.
>>>>> It's defence in depth. Take a look at the commit that added them.
>>>>> And yes quite possibly in most cases we actually have a spec
>>>>> barrier in the validation phase. If we do let's use the
>>>>> unsafe variants so they can be found.
>>>> unsafe variants can only work if you can batch userspace access. This is not
>>>> necessarily the case for light load.
>>> Do we care a lot about the light load? How would you benchmark it?
>>>
>> If we can reduce the latency that's will be more than what we expect.
>>
>> 1 byte TCP_RR shows 1.5%-2% improvement.
> It's nice but not great. E.g. adaptive polling would be
> a better approach to work on latency imho.


Actually this is another advantages of vmap():

For split ring, we will poll avail idx

For packed ring, we will poll wrap counter

Either of which can not be batched.


>
>>>>>> And even if you're right, vhost is not the
>>>>>> only place, there's lots of vmap() based accessing in kernel.
>>>>> For sure. But if one can get by without get user pages, one
>>>>> really should. Witness recently uncovered mess with file
>>>>> backed storage.
>>>> We only pin metadata pages, I don't believe they will be used by any DMA.
>>> It doesn't matter really, if you dirty pages behind the MM back
>>> the problem is there.
>>
>> Ok, but the usual case is anonymous pages, do we use file backed pages for
>> user of vhost?
> Some people use file backed pages for vms.
> Nothing prevents them from using vhost as well.


Ok.


>
>> And even if we use sometime, according to the pointer it's
>> not something that can fix, RFC has been posted to solve this issue.
>>
>> Thanks
> Except it's not broken if we don't to gup + write.
> So yea, wait for rfc to be merged.
>

Yes.

Thanks