[RFC,V3,0/5] Hi:

Message ID: 20181229124656.3900-1-jasowang@redhat.com

Message

Jason Wang Dec. 29, 2018, 12:46 p.m. UTC
This series tries to access virtqueue metadata through kernel virtual
address instead of copy_user() friends, since they have too much
overhead from checks, spec barriers or even hardware feature
toggling.

Tests show about 24% improvement on TX PPS. It should benefit other
cases as well.
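
To illustrate the idea, here is a minimal sketch (not the actual patch
code; "avail_kva" is a hypothetical field standing in for the vmap'ed
address the series keeps per virtqueue):

/* Conventional path: every access goes through the uaccess machinery
 * and pays for the checks, the speculation barrier and, with SMAP,
 * the stac/clac toggling.
 */
static inline int peek_avail_idx_uaccess(struct vhost_virtqueue *vq,
                                         __virtio16 *idx)
{
        return __get_user(*idx, &vq->avail->idx);
}

/* This series: the metadata pages are pinned and vmap()ed once at
 * prefetch time, so the hot path becomes a plain load.
 */
static inline __virtio16 peek_avail_idx_vmap(struct vhost_virtqueue *vq)
{
        return READ_ONCE(((struct vring_avail *)vq->avail_kva)->idx);
}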

Changes from V2:
- fix buggy range overlapping check
- tear down MMU notifier during vhost ioctl to make sure invalidation
  request can read metadata userspace address and vq size without
  holding vq mutex.
Changes from V1:
- instead of pinning pages, use MMU notifier to invalidate vmaps and
  remap during metadata prefetch
- fix build warning on MIPS

Jason Wang (5):
  vhost: generalize adding used elem
  vhost: fine grain userspace memory accessors
  vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
  vhost: introduce helpers to get the size of metadata area
  vhost: access vq metadata through kernel virtual address

 drivers/vhost/net.c   |   4 +-
 drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
 drivers/vhost/vhost.h |  15 +-
 3 files changed, 384 insertions(+), 51 deletions(-)

Comments

Michael S. Tsirkin Jan. 2, 2019, 8:47 p.m. UTC | #1
On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends, since they have too much
> overhead from checks, spec barriers or even hardware feature
> toggling.

Will review, thanks!
One question that comes to mind is whether it's all about bypassing
stac/clac.  Could you please include a performance comparison with
nosmap?

> 
> Tests show about 24% improvement on TX PPS. It should benefit other
> cases as well.
>
> Changes from V2:
> - fix buggy range overlapping check
> - tear down MMU notifier during vhost ioctl to make sure invalidation
>   request can read metadata userspace address and vq size without
>   holding vq mutex.
> Changes from V1:
> - instead of pinning pages, use MMU notifier to invalidate vmaps and
>   remap during metadata prefetch
> - fix build warning on MIPS
> 
> Jason Wang (5):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
>   vhost: introduce helpers to get the size of metadata area
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/net.c   |   4 +-
>  drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
>  drivers/vhost/vhost.h |  15 +-
>  3 files changed, 384 insertions(+), 51 deletions(-)
> 
> -- 
> 2.17.1
Michael S. Tsirkin Jan. 4, 2019, 9:41 p.m. UTC | #2
On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends, since they have too much
> overhead from checks, spec barriers or even hardware feature
> toggling.


I think it's a reasonable approach.
However I need to look at whether and which mmu notifiers are invoked before
writeback. Do you know?

> Tests show about 24% improvement on TX PPS. It should benefit other
> cases as well.
> 
> Changes from V2:
> - fix buggy range overlapping check
> - tear down MMU notifier during vhost ioctl to make sure invalidation
>   request can read metadata userspace address and vq size without
>   holding vq mutex.
> Changes from V1:
> - instead of pinning pages, use MMU notifier to invalidate vmaps and
>   remap during metadata prefetch
> - fix build warning on MIPS
> 
> Jason Wang (5):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
>   vhost: introduce helpers to get the size of metadata area
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/net.c   |   4 +-
>  drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
>  drivers/vhost/vhost.h |  15 +-
>  3 files changed, 384 insertions(+), 51 deletions(-)
> 
> -- 
> 2.17.1
Jason Wang Jan. 7, 2019, 2:19 a.m. UTC | #3
On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends, since they have too much
>> overhead from checks, spec barriers or even hardware feature
>> toggling.
> Will review, thanks!
> One question that comes to mind is whether it's all about bypassing
> stac/clac.  Could you please include a performance comparison with
> nosmap?
>

On machine without SMAP (Sandy Bridge):

Before: 4.8Mpps

After: 5.2Mpps

On machine with SMAP (Broadwell):

Before: 5.0Mpps

After: 6.1Mpps

No smap: 7.5Mpps


Thanks
Michael S. Tsirkin Jan. 7, 2019, 3:28 a.m. UTC | #4
On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> 
> On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends, since they have too much
> > > overhead from checks, spec barriers or even hardware feature
> > > toggling.
> > Will review, thanks!
> > One question that comes to mind is whether it's all about bypassing
> > stac/clac.  Could you please include a performance comparison with
> > nosmap?
> > 
> 
> On machine without SMAP (Sandy Bridge):
> 
> Before: 4.8Mpps
> 
> After: 5.2Mpps

OK so would you say it's really unsafe versus safe accesses?
Or would you say it's just better-written code?

> On machine with SMAP (Broadwell):
> 
> Before: 5.0Mpps
> 
> After: 6.1Mpps
> 
> No smap: 7.5Mpps
> 
> 
> Thanks


no smap being before or after?
Jason Wang Jan. 7, 2019, 3:53 a.m. UTC | #5
On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
>> On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>> This series tries to access virtqueue metadata through kernel virtual
>>>> address instead of copy_user() friends, since they have too much
>>>> overhead from checks, spec barriers or even hardware feature
>>>> toggling.
>>> Will review, thanks!
>>> One question that comes to mind is whether it's all about bypassing
>>> stac/clac.  Could you please include a performance comparison with
>>> nosmap?
>>>
>> On machine without SMAP (Sandy Bridge):
>>
>> Before: 4.8Mpps
>>
>> After: 5.2Mpps
> OK so would you say it's really unsafe versus safe accesses?
> Or would you say it's just better-written code?


It's the effect of removing the speculation barrier.


>
>> On machine with SMAP (Broadwell):
>>
>> Before: 5.0Mpps
>>
>> After: 6.1Mpps
>>
>> No smap: 7.5Mpps
>>
>>
>> Thanks
>
> no smap being before or after?
>

Let me clarify:


Before (SMAP on): 5.0Mpps

Before (SMAP off): 7.5Mpps

After (SMAP on): 6.1Mpps


Thanks
Michael S. Tsirkin Jan. 7, 2019, 4:17 a.m. UTC | #6
On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> 
> On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > address instead of copy_user() friends, since they have too much
> > > > > overhead from checks, spec barriers or even hardware feature
> > > > > toggling.
> > > > Will review, thanks!
> > > > One question that comes to mind is whether it's all about bypassing
> > > > stac/clac.  Could you please include a performance comparison with
> > > > nosmap?
> > > > 
> > > On machine without SMAP (Sandy Bridge):
> > > 
> > > Before: 4.8Mpps
> > > 
> > > After: 5.2Mpps
> > OK so would you say it's really unsafe versus safe accesses?
> > Or would you say it's just better-written code?
> 
> 
> It's the effect of removing the speculation barrier.


You mean __uaccess_begin_nospec introduced by
commit 304ec1b050310548db33063e567123fae8fd0301
?

So fundamentally we do access_ok checks when supplying
the memory table to the kernel thread, and we should
do the spec barrier there.

Then we can just create and use a variant of uaccess macros that does
not include the barrier?

Or, how about moving the barrier into access_ok?
This way repeated accesses with a single access_ok get a bit faster.
CC Dan Williams on this idea.
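
Roughly, the idea is something like this (a sketch only, using the new
two-argument access_ok(); "access_ok_barrier" is a made-up name):

/* Pay the speculation barrier once, at validation time, so that the
 * hot-path accessors can use __get_user()-style variants without it.
 */
#define access_ok_barrier(addr, size)               \
({                                                  \
        bool __ok = access_ok(addr, size);          \
        if (__ok)                                   \
                barrier_nospec();                   \
        __ok;                                       \
})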



> 
> > 
> > > On machine with SMAP (Broadwell):
> > > 
> > > Before: 5.0Mpps
> > > 
> > > After: 6.1Mpps
> > > 
> > > No smap: 7.5Mpps
> > > 
> > > 
> > > Thanks
> > 
> > no smap being before or after?
> > 
> 
> Let me clarify:
> 
> 
> Before (SMAP on): 5.0Mpps
> 
> Before (SMAP off): 7.5Mpps
> 
> After (SMAP on): 6.1Mpps
> 
> 
> Thanks

How about after + smap off?

And maybe we want a module option just for the vhost thread to keep smap
off generally since almost all it does is copy stuff from userspace into
kernel anyway. Because what the above numbers show is that we really
really want a solution that isn't limited to just meta-data access,
and I really do not see how any such solution could not also be
used to make meta-data access fast.
Jason Wang Jan. 7, 2019, 6:50 a.m. UTC | #7
On 2019/1/7 12:17 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
>> On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
>>> On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
>>>> On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
>>>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>> address instead of copy_user() friends, since they have too much
>>>>>> overhead from checks, spec barriers or even hardware feature
>>>>>> toggling.
>>>>> Will review, thanks!
>>>>> One question that comes to mind is whether it's all about bypassing
>>>>> stac/clac.  Could you please include a performance comparison with
>>>>> nosmap?
>>>>>
>>>> On machine without SMAP (Sandy Bridge):
>>>>
>>>> Before: 4.8Mpps
>>>>
>>>> After: 5.2Mpps
>>> OK so would you say it's really unsafe versus safe accesses?
>>> Or would you say it's just better-written code?
>>
>> It's the effect of removing the speculation barrier.
>
> You mean __uaccess_begin_nospec introduced by
> commit 304ec1b050310548db33063e567123fae8fd0301
> ?

Yes.


>
> So fundamentally we do access_ok checks when supplying
> the memory table to the kernel thread, and we should
> do the spec barrier there.
>
> Then we can just create and use a variant of uaccess macros that does
> not include the barrier?


The unsafe ones?


>
> Or, how about moving the barrier into access_ok?
> This way repeated accesses with a single access_ok get a bit faster.
> CC Dan Williams on this idea.


The problem is, e.g., the vhost control path: during mem table validation,
we don't even want to access the user addresses there, so the spec barrier
is not needed.


>
>
>>>> On machine with SMAP (Broadwell):
>>>>
>>>> Before: 5.0Mpps
>>>>
>>>> After: 6.1Mpps
>>>>
>>>> No smap: 7.5Mpps
>>>>
>>>>
>>>> Thanks
>>> no smap being before or after?
>>>
>> Let me clarify:
>>
>>
>> Before (SMAP on): 5.0Mpps
>>
>> Before (SMAP off): 7.5Mpps
>>
>> After (SMAP on): 6.1Mpps
>>
>>
>> Thanks
> How about after + smap off?


After (SMAP off): 8.0Mpps


>
> And maybe we want a module option just for the vhost thread to keep smap
> off generally since almost all it does is copy stuff from userspace into
> kernel anyway. Because what the above numbers show is that we really
> really want a solution that isn't limited to just meta-data access,
> and I really do not see how any such solution could not also be
> used to make meta-data access fast.


As we've discussed in another thread for the previous version, this
requires lots of changes; the main issue is that SMAP state is not
saved/restored on explicit schedule(). Even if it were, since vhost
calls into lots of net/block code, any kind of uaccess in that code
needs to understand this special request from vhost, e.g. you probably
need to invent a new kind of iov iterator that does not touch SMAP at
all. And I'm not sure this is the only thing we need to deal with.

So I still prefer to:

1) speed up the metadata access through vmap + MMU notifier (a rough
sketch of the idea follows below)

2) speed up the data copy with batched copy (unsafe ones or other new
interfaces)
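
The rough shape of (1), with hypothetical helper names and a simplified
notifier signature (the details live in patch 5 of this series):

/* On any invalidation that overlaps the metadata, tear down the kernel
 * mapping; the next vq_meta_prefetch() re-pins and re-vmap()s the
 * pages under the vq mutex.
 */
static void vhost_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end)
{
        struct vhost_dev *dev =
                container_of(mn, struct vhost_dev, mmu_notifier);

        vhost_invalidate_vmaps(dev, start, end);   /* hypothetical */
}

static const struct mmu_notifier_ops vhost_notifier_ops = {
        .invalidate_range_start = vhost_invalidate_range_start,
};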

Thanks
Jason Wang Jan. 7, 2019, 6:58 a.m. UTC | #8
On 2019/1/5 5:41 AM, Michael S. Tsirkin wrote:
> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends, since they have too much
>> overhead from checks, spec barriers or even hardware feature
>> toggling.
>
> I think it's a reasonable approach.
> However I need to look at whether and which mmu notifiers are invoked before
> writeback. Do you know?


I don't know but just looking at the MMU notifier ops definition, 
there's no such callback if my understanding is correct.
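
For reference, the mmu_notifier_ops hooks around this time (from
include/linux/mmu_notifier.h, listed from memory, so worth double
checking) are all about invalidation and page aging:

/* struct mmu_notifier_ops, roughly:
 *   .release
 *   .clear_flush_young, .clear_young, .test_young
 *   .change_pte
 *   .invalidate_range_start, .invalidate_range_end
 *   .invalidate_range
 * None of these is tied to page writeback.
 */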

Thanks


>
>> Tests show about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>>
>> Changes from V2:
>> - fix buggy range overlapping check
>> - tear down MMU notifier during vhost ioctl to make sure invalidation
>>    request can read metadata userspace address and vq size without
>>    holding vq mutex.
>> Changes from V1:
>> - instead of pinning pages, use MMU notifier to invalidate vmaps and
>>    remap during metadata prefetch
>> - fix build warning on MIPS
>>
>> Jason Wang (5):
>>    vhost: generalize adding used elem
>>    vhost: fine grain userspace memory accessors
>>    vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
>>    vhost: introduce helpers to get the size of metadata area
>>    vhost: access vq metadata through kernel virtual address
>>
>>   drivers/vhost/net.c   |   4 +-
>>   drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
>>   drivers/vhost/vhost.h |  15 +-
>>   3 files changed, 384 insertions(+), 51 deletions(-)
>>
>> -- 
>> 2.17.1
Dan Williams Jan. 7, 2019, 7:15 a.m. UTC | #9
On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> >
> > On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > address instead of copy_user() friends, since they have too much
> > > > > > overhead from checks, spec barriers or even hardware feature
> > > > > > toggling.
> > > > > Will review, thanks!
> > > > > One question that comes to mind is whether it's all about bypassing
> > > > > stac/clac.  Could you please include a performance comparison with
> > > > > nosmap?
> > > > >
> > > > On machine without SMAP (Sandy Bridge):
> > > >
> > > > Before: 4.8Mpps
> > > >
> > > > After: 5.2Mpps
> > > OK so would you say it's really unsafe versus safe accesses?
> > > Or would you say it's just better-written code?
> >
> >
> > It's the effect of removing the speculation barrier.
>
>
> You mean __uaccess_begin_nospec introduced by
> commit 304ec1b050310548db33063e567123fae8fd0301
> ?
>
> So fundamentally we do access_ok checks when supplying
> the memory table to the kernel thread, and we should
> do the spec barrier there.
>
> Then we can just create and use a variant of uaccess macros that does
> not include the barrier?
>
> Or, how about moving the barrier into access_ok?
> This way repeated accesses with a single access_ok get a bit faster.
> CC Dan Williams on this idea.

It would be interesting to see how expensive re-doing the address
limit check is compared to the speculation barrier. I.e. just switch
vhost_get_user() to use get_user() rather than __get_user(). That will
sanitize the pointer in the speculative path without a barrier.

I recall we had an access_ok() conversion discussion with this result here:

https://lkml.org/lkml/2018/1/17/929

...but it sounds like you are proposing a smaller scope fixup for the
vhost use case? Something like barrier_nospec() in the success path
for all vhost access_ok() checks and then a get_user() variant that
disables the barrier.
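
For reference, a rough C model of that sanitization (the real x86
get_user() does the equivalent in assembly with a cmp/sbb/and
sequence):

#include <linux/nospec.h>

/* If ptr is below limit, the mask is all ones and ptr passes through
 * unchanged; on a mispredicted bounds check the mask is all zeroes,
 * so the speculative dereference goes to NULL rather than to an
 * attacker-chosen address.  No lfence needed.
 */
static inline unsigned long sanitize_user_addr(unsigned long ptr,
                                               unsigned long limit)
{
        return ptr & array_index_mask_nospec(ptr, limit);
}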
Michael S. Tsirkin Jan. 7, 2019, 2:11 p.m. UTC | #10
On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > >
> > > On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > address instead of copy_user() friends, since they have too much
> > > > > > > overhead from checks, spec barriers or even hardware feature
> > > > > > > toggling.
> > > > > > Will review, thanks!
> > > > > > One question that comes to mind is whether it's all about bypassing
> > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > nosmap?
> > > > > >
> > > > > On machine without SMAP (Sandy Bridge):
> > > > >
> > > > > Before: 4.8Mpps
> > > > >
> > > > > After: 5.2Mpps
> > > > OK so would you say it's really unsafe versus safe accesses?
> > > > Or would you say it's just better-written code?
> > >
> > >
> > > It's the effect of removing the speculation barrier.
> >
> >
> > You mean __uaccess_begin_nospec introduced by
> > commit 304ec1b050310548db33063e567123fae8fd0301
> > ?
> >
> > So fundamentally we do access_ok checks when supplying
> > the memory table to the kernel thread, and we should
> > do the spec barrier there.
> >
> > Then we can just create and use a variant of uaccess macros that does
> > not include the barrier?
> >
> > Or, how about moving the barrier into access_ok?
> > This way repeated accesses with a single access_ok get a bit faster.
> > CC Dan Williams on this idea.
> 
> It would be interesting to see how expensive re-doing the address
> limit check is compared to the speculation barrier. I.e. just switch
> vhost_get_user() to use get_user() rather than __get_user(). That will
> sanitize the pointer in the speculative path without a barrier.

Hmm it's way cheaper even though IIRC it's measurable.
Jason, would you like to try?
Although frankly __get_user being slower than get_user feels very wrong.
Not yet sure what to do exactly but would you agree?


> I recall we had an access_ok() conversion discussion with this result here:
> 
> https://lkml.org/lkml/2018/1/17/929

Sorry let me try to clarify. IIUC speculating access_ok once
is harmless. As Linus said the problem is with "_subsequent_
accesses that can then be used to perturb the cache".

Thus:

1. if (!access_ok)
2.	return
3.  get_user
4. if (!access_ok)
5.	return
6.  get_user

Your proposal that Linus nacked was to effectively add a barrier after
lines 2 and 5 (also using the array_index_nospec trick for speed),
right? Unfortunately that needs a big API change.

I am asking about adding barrier_nospec within access_ok.
Thus effectively before lines 1 and 4.
access_ok will be slower but after all the point of access_ok is
to then access the same memory multiple times.
So we should be making __get_user faster and access_ok slower ...

> ...but it sounds like you are proposing a smaller scope fixup for the
> vhost use case? Something like barrier_nospec() in the success path
> for all vhost access_ok() checks and then a get_user() variant that
> disables the barrier.

Maybe we'll have to. Except I hope vhost won't end up being the
only user otherwise it will be hard to maintain.
Michael S. Tsirkin Jan. 7, 2019, 2:37 p.m. UTC | #11
On Mon, Jan 07, 2019 at 02:50:17PM +0800, Jason Wang wrote:
> 
> On 2019/1/7 12:17 PM, Michael S. Tsirkin wrote:
> > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > address instead of copy_user() friends, since they have too much
> > > > > > > overhead from checks, spec barriers or even hardware feature
> > > > > > > toggling.
> > > > > > Will review, thanks!
> > > > > > One question that comes to mind is whether it's all about bypassing
> > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > nosmap?
> > > > > > 
> > > > > On machine without SMAP (Sandy Bridge):
> > > > > 
> > > > > Before: 4.8Mpps
> > > > > 
> > > > > After: 5.2Mpps
> > > > OK so would you say it's really unsafe versus safe accesses?
> > > > Or would you say it's just better-written code?
> > > 
> > > It's the effect of removing the speculation barrier.
> > 
> > You mean __uaccess_begin_nospec introduced by
> > commit 304ec1b050310548db33063e567123fae8fd0301
> > ?
> 
> Yes.
> 
> 
> > 
> > So fundamentally we do access_ok checks when supplying
> > the memory table to the kernel thread, and we should
> > do the spec barrier there.
> > 
> > Then we can just create and use a variant of uaccess macros that does
> > not include the barrier?
> 
> 
> The unsafe ones?

Fundamentally yes.


> 
> > 
> > Or, how about moving the barrier into access_ok?
> > This way repeated accesses with a single access_ok get a bit faster.
> > CC Dan Williams on this idea.
> 
> 
> The problem is, e.g., the vhost control path: during mem table validation, we
> don't even want to access the user addresses there, so the spec barrier is not needed.

Again spec barrier is not needed as such at all. It's defence in depth.
And mem table init is a slow path. So we can stick a barrier there and it
won't be a problem for anyone.

> 
> > 
> > 
> > > > > On machine with SMAP (Broadwell):
> > > > > 
> > > > > Before: 5.0Mpps
> > > > > 
> > > > > After: 6.1Mpps
> > > > > 
> > > > > No smap: 7.5Mpps
> > > > > 
> > > > > 
> > > > > Thanks
> > > > no smap being before or after?
> > > > 
> > > Let me clarify:
> > > 
> > > 
> > > Before (SMAP on): 5.0Mpps
> > > 
> > > Before (SMAP off): 7.5Mpps
> > > 
> > > After (SMAP on): 6.1Mpps
> > > 
> > > 
> > > Thanks
> > How about after + smap off?
> 
> 
> After (SMAP off): 8.0Mpps
> 
> > 
> > And maybe we want a module option just for the vhost thread to keep smap
> > off generally since almost all it does is copy stuff from userspace into
> > kernel anyway. Because what the above numbers show is that we really
> > really want a solution that isn't limited to just meta-data access,
> > and I really do not see how any such solution could not also be
> > used to make meta-data access fast.
> 
> 
> As we've discussed in another thread for the previous version, this requires
> lots of changes; the main issue is that SMAP state is not saved/restored on
> explicit schedule().

I wonder how expensive reading eflags can be?
If it's cheap we can just check EFLAGS.AC and rerun stac if needed.
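
Presumably something like this on the schedule-in path (x86-only
sketch; whether native_save_fl() is cheap enough there is exactly the
question):

/* Re-enable user access if EFLAGS.AC was lost across a reschedule. */
static inline void vhost_fixup_ac(void)
{
        if (!(native_save_fl() & X86_EFLAGS_AC))
                stac();
}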

> Even if it were, since vhost calls into lots of net/block code,
> any kind of uaccess in that code needs to understand this special request
> from vhost, e.g. you probably need to invent a new kind of iov iterator that
> does not touch SMAP at all. And I'm not sure this is the only thing we need
> to deal with.


Well we wanted to move packet processing from tun into vhost anyway right?

> 
> So I still prefer to:
> 
> 1) speed up the metadata access through vmap + MMU notifier
> 
> 2) speed up the data copy with batched copy (unsafe ones or other new
> interfaces)
> 
> Thanks

I just guess once you do (2) you will want to rework (1) to use
the new interfaces. So all the effort you are now investing in (1)
will be wasted. Just my $.02.
Michael S. Tsirkin Jan. 7, 2019, 2:47 p.m. UTC | #12
On Mon, Jan 07, 2019 at 02:58:08PM +0800, Jason Wang wrote:
> 
> On 2019/1/5 5:41 AM, Michael S. Tsirkin wrote:
> > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends, since they have too much
> > > overhead from checks, spec barriers or even hardware feature
> > > toggling.
> > 
> > I think it's a reasonable approach.
> > However I need to look at whether and which mmu notifiers are invoked before
> > writeback. Do you know?
> 
> 
> I don't know but just looking at the MMU notifier ops definition, there's no
> such callback if my understanding is correct.
> 
> Thanks

In that case how are you making sure used ring updates are written back?
If they aren't, the guest will crash ...

> 
> > 
> > > Tests show about 24% improvement on TX PPS. It should benefit other
> > > cases as well.
> > > 
> > > Changes from V2:
> > > - fix buggy range overlapping check
> > > - tear down MMU notifier during vhost ioctl to make sure invalidation
> > >    request can read metadata userspace address and vq size without
> > >    holding vq mutex.
> > > Changes from V1:
> > > - instead of pinning pages, use MMU notifier to invalidate vmaps and
> > >    remap during metadata prefetch
> > > - fix build warning on MIPS
> > > 
> > > Jason Wang (5):
> > >    vhost: generalize adding used elem
> > >    vhost: fine grain userspace memory accessors
> > >    vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
> > >    vhost: introduce helpers to get the size of metadata area
> > >    vhost: access vq metadata through kernel virtual address
> > > 
> > >   drivers/vhost/net.c   |   4 +-
> > >   drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
> > >   drivers/vhost/vhost.h |  15 +-
> > >   3 files changed, 384 insertions(+), 51 deletions(-)
> > > 
> > > -- 
> > > 2.17.1
Dan Williams Jan. 7, 2019, 9:39 p.m. UTC | #13
On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > >
> > > > > On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > > On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > address instead of copy_user() friends, since they have too much
> > > > > > > > overhead from checks, spec barriers or even hardware feature
> > > > > > > > toggling.
> > > > > > > Will review, thanks!
> > > > > > > One question that comes to mind is whether it's all about bypassing
> > > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > > nosmap?
> > > > > > >
> > > > > > On machine without SMAP (Sandy Bridge):
> > > > > >
> > > > > > Before: 4.8Mpps
> > > > > >
> > > > > > After: 5.2Mpps
> > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > Or would you say it's just better-written code?
> > > >
> > > >
> > > > It's the effect of removing the speculation barrier.
> > >
> > >
> > > You mean __uaccess_begin_nospec introduced by
> > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > ?
> > >
> > > So fundamentally we do access_ok checks when supplying
> > > the memory table to the kernel thread, and we should
> > > do the spec barrier there.
> > >
> > > Then we can just create and use a variant of uaccess macros that does
> > > not include the barrier?
> > >
> > > Or, how about moving the barrier into access_ok?
> > > This way repeated accesses with a single access_ok get a bit faster.
> > > CC Dan Williams on this idea.
> >
> > It would be interesting to see how expensive re-doing the address
> > limit check is compared to the speculation barrier. I.e. just switch
> > vhost_get_user() to use get_user() rather than __get_user(). That will
> > sanitize the pointer in the speculative path without a barrier.
>
> Hmm it's way cheaper even though IIRC it's measurable.
> Jason, would you like to try?
> Although frankly __get_user being slower than get_user feels very wrong.
> Not yet sure what to do exactly but would you agree?

Agree. __get_user() being faster than get_user() defeats the whole
point of converting code paths to the access_ok() + __get_user()
pattern.

>
>
> > I recall we had an access_ok() conversion discussion with this result here:
> >
> > https://lkml.org/lkml/2018/1/17/929
>
> Sorry let me try to clarify. IIUC speculating access_ok once
> is harmless. As Linus said the problem is with "_subsequent_
> accesses that can then be used to perturb the cache".
>
> Thus:
>
> 1. if (!access_ok)
> 2.      return
> 3.  get_user
> 4. if (!access_ok)
> 5.      return
> 6.  get_user
>
> Your proposal that Linus nacked was to effectively add a barrier after
> lines 2 and 5 (also using the array_index_nospec trick for speed),
> right? Unfortunately that needs a big API change.
>
> I am asking about adding barrier_nospec within access_ok.
> Thus effectively before lines 1 and 4.
> access_ok will be slower but after all the point of access_ok is
> to then access the same memory multiple times.

If the barrier is before lines 1 and 4 then it offers no real
protection as far as I can see. It will start speculating again on the
user-controlled pointer value after the barrier resolves.

> So we should be making __get_user faster and access_ok slower ...

I agree, but then the barrier always needs to be after the access_ok()
check unconditionally called in the return path from access_ok(). At
that point it's back to the implementation that Linus nak'd, or I'm
missing some other detail.

...but maybe if it is limited to a new access_ok_nospec() then the
concern is addressed? Then rename current __get_user() to
__get_user_nospec() and make a new __get_user() that is back to being
optimal.
Michael S. Tsirkin Jan. 7, 2019, 10:25 p.m. UTC | #14
On Mon, Jan 07, 2019 at 01:39:15PM -0800, Dan Williams wrote:
> On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > > >
> > > > > > On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > > > On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > > address instead of copy_user() friends, since they have too much
> > > > > > > > > overhead from checks, spec barriers or even hardware feature
> > > > > > > > > toggling.
> > > > > > > > Will review, thanks!
> > > > > > > > One question that comes to mind is whether it's all about bypassing
> > > > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > > > nosmap?
> > > > > > > >
> > > > > > > On machine without SMAP (Sandy Bridge):
> > > > > > >
> > > > > > > Before: 4.8Mpps
> > > > > > >
> > > > > > > After: 5.2Mpps
> > > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > > Or would you say it's just better-written code?
> > > > >
> > > > >
> > > > > It's the effect of removing the speculation barrier.
> > > >
> > > >
> > > > You mean __uaccess_begin_nospec introduced by
> > > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > > ?
> > > >
> > > > So fundamentally we do access_ok checks when supplying
> > > > the memory table to the kernel thread, and we should
> > > > do the spec barrier there.
> > > >
> > > > Then we can just create and use a variant of uaccess macros that does
> > > > not include the barrier?
> > > >
> > > > Or, how about moving the barrier into access_ok?
> > > > This way repeated accesses with a single access_ok get a bit faster.
> > > > CC Dan Williams on this idea.
> > >
> > > It would be interesting to see how expensive re-doing the address
> > > limit check is compared to the speculation barrier. I.e. just switch
> > > vhost_get_user() to use get_user() rather than __get_user(). That will
> > > sanitize the pointer in the speculative path without a barrier.
> >
> > Hmm it's way cheaper even though IIRC it's measurable.
> > Jason, would you like to try?
> > Although frankly __get_user being slower than get_user feels very wrong.
> > Not yet sure what to do exactly but would you agree?
> 
> Agree. __get_user() being faster than get_user() defeats the whole
> point of converting code paths to the access_ok() + __get_user()
> pattern.

Did you mean the reverse?

> >
> >
> > > I recall we had an access_ok() conversion discussion with this result here:
> > >
> > > https://lkml.org/lkml/2018/1/17/929
> >
> > Sorry let me try to clarify. IIUC speculating access_ok once
> > is harmless. As Linus said the problem is with "_subsequent_
> > accesses that can then be used to perturb the cache".
> >
> > Thus:
> >
> > 1. if (!access_ok)
> > 2.      return
> > 3.  get_user
> > 4. if (!access_ok)
> > 5.      return
> > 6.  get_user
> >
> > Your proposal that Linus nacked was to effectively add a barrier after
> > lines 2 and 5 (also using the array_index_nospec trick for speed),
> > right? Unfortunately that needs a big API change.
> >
> > I am asking about adding barrier_nospec within access_ok.
> > Thus effectively before lines 1 and 4.
> > access_ok will be slower but after all the point of access_ok is
> > to then access the same memory multiple times.
> 
> If the barrier is before lines 1 and 4 then it offers no real
> protection as far as I can see. It will start speculating again on the
> user-controlled pointer value after the barrier resolves.
> 
> > So we should be making __get_user faster and access_ok slower ...
> 
> I agree, but then the barrier always needs to be after the access_ok()
> check unconditionally called in the return path from access_ok(). At
> that point it's back to the implementation that Linus nak'd, or I'm
> missing some other detail.
> 
> ...but maybe if it is limited to a new access_ok_nospec() then the
> concern is addressed? Then rename current __get_user() to
> __get_user_nospec() and make a new __get_user() that is back to being
> optimal.
Dan Williams Jan. 7, 2019, 10:44 p.m. UTC | #15
On Mon, Jan 7, 2019 at 2:25 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Jan 07, 2019 at 01:39:15PM -0800, Dan Williams wrote:
> > On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > > > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > > > >
> > > > > > > On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > > > > On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > > > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > > > address instead of copy_user() friends, since they have too much
> > > > > > > > > > overhead from checks, spec barriers or even hardware feature
> > > > > > > > > > toggling.
> > > > > > > > > Will review, thanks!
> > > > > > > > > One question that comes to mind is whether it's all about bypassing
> > > > > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > > > > nosmap?
> > > > > > > > >
> > > > > > > > On machine without SMAP (Sandy Bridge):
> > > > > > > >
> > > > > > > > Before: 4.8Mpps
> > > > > > > >
> > > > > > > > After: 5.2Mpps
> > > > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > > > Or would you say it's just better-written code?
> > > > > >
> > > > > >
> > > > > > It's the effect of removing the speculation barrier.
> > > > >
> > > > >
> > > > > You mean __uaccess_begin_nospec introduced by
> > > > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > > > ?
> > > > >
> > > > > So fundamentally we do access_ok checks when supplying
> > > > > the memory table to the kernel thread, and we should
> > > > > do the spec barrier there.
> > > > >
> > > > > Then we can just create and use a variant of uaccess macros that does
> > > > > not include the barrier?
> > > > >
> > > > > Or, how about moving the barrier into access_ok?
> > > > > This way repeated accesses with a single access_ok get a bit faster.
> > > > > CC Dan Williams on this idea.
> > > >
> > > > It would be interesting to see how expensive re-doing the address
> > > > limit check is compared to the speculation barrier. I.e. just switch
> > > > vhost_get_user() to use get_user() rather than __get_user(). That will
> > > > sanitize the pointer in the speculative path without a barrier.
> > >
> > > Hmm it's way cheaper even though IIRC it's measurable.
> > > Jason, would you like to try?
> > > Although frankly __get_user being slower than get_user feels very wrong.
> > > Not yet sure what to do exactly but would you agree?
> >
> > Agree. __get_user() being faster than get_user() defeats the whole
> > point of converting code paths to the access_ok() + __get_user()
> > pattern.
>
> Did you mean the reverse?

Hmm, no... I'll rephrase: __get_user() should have lower overhead than
get_user().
Jason Wang Jan. 8, 2019, 10:01 a.m. UTC | #16
On 2019/1/7 10:37 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 02:50:17PM +0800, Jason Wang wrote:
>> On 2019/1/7 12:17 PM, Michael S. Tsirkin wrote:
>>> On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
>>>> On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
>>>>> On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
>>>>>> On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
>>>>>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>>>> address instead of copy_user() friends, since they have too much
>>>>>>>> overhead from checks, spec barriers or even hardware feature
>>>>>>>> toggling.
>>>>>>> Will review, thanks!
>>>>>>> One question that comes to mind is whether it's all about bypassing
>>>>>>> stac/clac.  Could you please include a performance comparison with
>>>>>>> nosmap?
>>>>>>>
>>>>>> On machine without SMAP (Sandy Bridge):
>>>>>>
>>>>>> Before: 4.8Mpps
>>>>>>
>>>>>> After: 5.2Mpps
>>>>> OK so would you say it's really unsafe versus safe accesses?
>>>>> Or would you say it's just better-written code?
>>>> It's the effect of removing the speculation barrier.
>>> You mean __uaccess_begin_nospec introduced by
>>> commit 304ec1b050310548db33063e567123fae8fd0301
>>> ?
>> Yes.
>>
>>
>>> So fundamentally we do access_ok checks when supplying
>>> the memory table to the kernel thread, and we should
>>> do the spec barrier there.
>>>
>>> Then we can just create and use a variant of uaccess macros that does
>>> not include the barrier?
>>
>> The unsafe ones?
> Fundamentally yes.
>
>
>>> Or, how about moving the barrier into access_ok?
>>> This way repeated accesses with a single access_ok get a bit faster.
>>> CC Dan Williams on this idea.
>>
>> The problem is, e.g., the vhost control path: during mem table validation, we
>> don't even want to access the user addresses there, so the spec barrier is not needed.
> Again spec barrier is not needed as such at all. It's defence in depth.
> And mem table init is a slow path. So we can stick a barrier there and it
> won't be a problem for anyone.


Considering it's a generic helper, for defence in depth we should keep it
around the places where we do the real userspace memory accesses.


>
>>>
>>>>>> On machine with SMAP (Broadwell):
>>>>>>
>>>>>> Before: 5.0Mpps
>>>>>>
>>>>>> After: 6.1Mpps
>>>>>>
>>>>>> No smap: 7.5Mpps
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>> no smap being before or after?
>>>>>
>>>> Let me clarify:
>>>>
>>>>
>>>> Before (SMAP on): 5.0Mpps
>>>>
>>>> Before (SMAP off): 7.5Mpps
>>>>
>>>> After (SMAP on): 6.1Mpps
>>>>
>>>>
>>>> Thanks
>>> How about after + smap off?
>>
>> After (SMAP off): 8.0Mpps
>>
>>> And maybe we want a module option just for the vhost thread to keep smap
>>> off generally since almost all it does is copy stuff from userspace into
>>> kernel anyway. Because what the above numbers show is that we really
>>> really want a solution that isn't limited to just meta-data access,
>>> and I really do not see how any such solution could not also be
>>> used to make meta-data access fast.
>>
>> As we've discussed in another thread for the previous version, this requires
>> lots of changes; the main issue is that SMAP state is not saved/restored on
>> explicit schedule().
> I wonder how expensive reading eflags can be?
> If it's cheap we can just check EFLAGS.AC and rerun stac if needed.


Probably not expensive, but considering vhost is probably the only user,
is it really worth doing? If we do vmap + batched copy, most of the code
is still under the protection of SMAP but the performance is almost the
same. Isn't this a much better solution?


>
>> Even if it were, since vhost calls into lots of net/block code,
>> any kind of uaccess in that code needs to understand this special request
>> from vhost, e.g. you probably need to invent a new kind of iov iterator that
>> does not touch SMAP at all. And I'm not sure this is the only thing we need
>> to deal with.
>
> Well we wanted to move packet processing from tun into vhost anyway right?


Yes, but what about other devices? And we should deal with the zerocopy
path. It's not a small amount of refactoring and work.


>
>> So I still prefer to:
>>
>> 1) speed up the metadata access through vmap + MMU notifier
>>
>> 2) speed up the data copy with batched copy (unsafe ones or other new
>> interfaces)
>>
>> Thanks
> I just guess once you do (2) you will want to rework (1) to use
> the new interfaces.



Do you mean batching? Batched copy is much easier, just a few lines of
code if the unsafe variants are ready, or we can invent new safe
variants. But it would still be slower than vmap. And what's more, vmap
does not conflict with batching.


>   So all the effort you are now investing in (1)
> will be wasted. Just my $.02.
>

Speeding up metadata access is much easier, and vmap is the fastest
method, so we can benefit from it soon. Speeding up the data copy
requires much more work. And in the future, if the kernel or vhost is
ready for some new API and perf numbers prove its advantage, it doesn't
hurt to switch.


Thanks
Jason Wang Jan. 8, 2019, 10:12 a.m. UTC | #17
On 2019/1/7 10:47 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 02:58:08PM +0800, Jason Wang wrote:
>> On 2019/1/5 5:41 AM, Michael S. Tsirkin wrote:
>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>> This series tries to access virtqueue metadata through kernel virtual
>>>> address instead of copy_user() friends, since they have too much
>>>> overhead from checks, spec barriers or even hardware feature
>>>> toggling.
>>> I think it's a reasonable approach.
>>> However I need to look at whether and which mmu notifiers are invoked before
>>> writeback. Do you know?
>>
>> I don't know but just looking at the MMU notifier ops definition, there's no
>> such callback if my understanding is correct.
>>
>> Thanks
> In that case how are you making sure used ring updates are written back?
> If they aren't, the guest will crash ...


I think this is the writeback issue you mentioned earlier. I haven't done
a follow-up on the pointer, but it looks to me like some work is ongoing
to fix the issue.

I can investigate it more, but it's not something new; consider the case
of VFIO.

Thanks
Jason Wang Jan. 8, 2019, 11:42 a.m. UTC | #18
On 2019/1/7 10:11 PM, Michael S. Tsirkin wrote:
> On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
>> On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
>>>> On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
>>>>> On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
>>>>>> On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
>>>>>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>>>> address instead of copy_user() friends, since they have too much
>>>>>>>> overhead from checks, spec barriers or even hardware feature
>>>>>>>> toggling.
>>>>>>> Will review, thanks!
>>>>>>> One question that comes to mind is whether it's all about bypassing
>>>>>>> stac/clac.  Could you please include a performance comparison with
>>>>>>> nosmap?
>>>>>>>
>>>>>> On machine without SMAP (Sandy Bridge):
>>>>>>
>>>>>> Before: 4.8Mpps
>>>>>>
>>>>>> After: 5.2Mpps
>>>>> OK so would you say it's really unsafe versus safe accesses?
>>>>> Or would you say it's just better-written code?
>>>>
>>>> It's the effect of removing the speculation barrier.
>>>
>>> You mean __uaccess_begin_nospec introduced by
>>> commit 304ec1b050310548db33063e567123fae8fd0301
>>> ?
>>>
>>> So fundamentally we do access_ok checks when supplying
>>> the memory table to the kernel thread, and we should
>>> do the spec barrier there.
>>>
>>> Then we can just create and use a variant of uaccess macros that does
>>> not include the barrier?
>>>
>>> Or, how about moving the barrier into access_ok?
>>> This way repeated accesses with a single access_ok get a bit faster.
>>> CC Dan Williams on this idea.
>> It would be interesting to see how expensive re-doing the address
>> limit check is compared to the speculation barrier. I.e. just switch
>> vhost_get_user() to use get_user() rather than __get_user(). That will
>> sanitize the pointer in the speculative path without a barrier.
> Hmm it's way cheaper even though IIRC it's measurable.
> Jason, would you like to try?


0.5% regression after using get_user()/put_user()/...


> Although frankly __get_user being slower than get_user feels very wrong.
> Not yet sure what to do exactly but would you agree?
>
>
>> I recall we had an access_ok() conversion discussion with this result here:
>>
>> https://lkml.org/lkml/2018/1/17/929
> Sorry let me try to clarify. IIUC speculating access_ok once
> is harmless. As Linus said the problem is with "_subsequent_
> accesses that can then be used to perturb the cache".
>
> Thus:
>
> 1. if (!access_ok)
> 2.	return
> 3.  get_user
> 4. if (!access_ok)
> 5.	return
> 6.  get_user
>
> Your proposal that Linus nacked was to effectively add a barrier after
> lines 2 and 5 (also using the array_index_nospec trick for speed),
> right? Unfortunately that needs a big API change.
>
> I am asking about adding barrier_nospec within access_ok.
> Thus effectively before lines 1 and 4.
> access_ok will be slower but after all the point of access_ok is
> to then access the same memory multiple times.
> So we should be making __get_user faster and access_ok slower ...


And the barrier_nospec() is completely necessary if you want to do
writes instead of reads.


Thanks


>
>> ...but it sounds like you are proposing a smaller scope fixup for the
>> vhost use case? Something like barrier_nospec() in the success path
>> for all vhost access_ok() checks and then a get_user() variant that
>> disables the barrier.
> Maybe we'll have to. Except I hope vhost won't end up being the
> only user otherwise it will be hard to maintain.
>
>
Michael S. Tsirkin Jan. 9, 2019, 4:31 a.m. UTC | #19
On Mon, Jan 07, 2019 at 02:44:24PM -0800, Dan Williams wrote:
> On Mon, Jan 7, 2019 at 2:25 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Jan 07, 2019 at 01:39:15PM -0800, Dan Williams wrote:
> > > On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > > > > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > > > > >
> > > > > > > > On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> > > > > > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > > > > > On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > > > > address instead of copy_user() friends, since they have too much
> > > > > > > > > > > overhead from checks, spec barriers or even hardware feature
> > > > > > > > > > > toggling.
> > > > > > > > > > Will review, thanks!
> > > > > > > > > > One question that comes to mind is whether it's all about bypassing
> > > > > > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > > > > > nosmap?
> > > > > > > > > >
> > > > > > > > > On machine without SMAP (Sandy Bridge):
> > > > > > > > >
> > > > > > > > > Before: 4.8Mpps
> > > > > > > > >
> > > > > > > > > After: 5.2Mpps
> > > > > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > > > > Or would you say it's just better-written code?
> > > > > > >
> > > > > > >
> > > > > > > It's the effect of removing the speculation barrier.
> > > > > >
> > > > > >
> > > > > > You mean __uaccess_begin_nospec introduced by
> > > > > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > > > > ?
> > > > > >
> > > > > > So fundamentally we do access_ok checks when supplying
> > > > > > the memory table to the kernel thread, and we should
> > > > > > do the spec barrier there.
> > > > > >
> > > > > > Then we can just create and use a variant of uaccess macros that does
> > > > > > not include the barrier?
> > > > > >
> > > > > > Or, how about moving the barrier into access_ok?
> > > > > > This way repeated accesses with a single access_ok get a bit faster.
> > > > > > CC Dan Williams on this idea.
> > > > >
> > > > > It would be interesting to see how expensive re-doing the address
> > > > > limit check is compared to the speculation barrier. I.e. just switch
> > > > > vhost_get_user() to use get_user() rather than __get_user(). That will
> > > > > sanitize the pointer in the speculative path without a barrier.
> > > >
> > > > Hmm it's way cheaper even though IIRC it's measurable.
> > > > Jason, would you like to try?
> > > > Although frankly __get_user being slower than get_user feels very wrong.
> > > > Not yet sure what to do exactly but would you agree?
> > >
> > > Agree. __get_user() being faster than get_user() defeats the whole
> > > point of converting code paths to the access_ok() + __get_user()
> > > pattern.
> >
> > Did you mean the reverse?
> 
> Hmm, no... I'll rephrase: __get_user() should have lower overhead than
> get_user().

Right ...

Linus, given that you just changed all users of access_ok anyway, do
you still think that the access_ok() conversion to return a speculation
sanitized pointer or NULL is too big a conversion?

It was previously discarded here:
https://lkml.org/lkml/2018/1/17/929
but at that point we didn't have numbers and there was an understandable
rush to ship something safe.

At this point I think that vhost can show very measurable gains from
this conversion.

Thanks,
Linus Torvalds Jan. 9, 2019, 5:19 a.m. UTC | #20
On Tue, Jan 8, 2019 at 8:31 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Linus, given that you just changed all users of access_ok anyway, do
> you still think that the access_ok() conversion to return a speculation
> sanitized pointer or NULL is too big a conversion?

I didn't actually change a single access_ok().

I changed the (very few) users of "user_access_begin()" to do an
access_ok() in them. There were 8 of them total.

It turns out that two of those cases (the strn*_user() ones) found
bugs in the implementation of access_ok() of two architectures, and
then looking at the others found that six more architectures also had
problems, but those weren't actually because of any access_ok()
changes, they were pre-existing issues. So we definitely had
unfortunate bugs in access_ok(), but they were mostly the benign kind
(ir the "use arguments twice - a real potential bug, but not one that
actually likely makes any difference to existing users)

Changing all 600+ users of access_ok() would be painful.

That said, one thing I *would* like to do is to just get rid of
__get_user() and __put_user() entirely. Or rather, just make them do
exactly the same thing that the normal "get_user()"/"put_user()"
functions do.

And then, _within_ the case of get_user()/put_user(), doing the
access_ok() as a data dependency rather than a lfence should be easy
enough.
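
For context, the user_access_begin() pattern referred to above now
looks roughly like this (sketch, error handling trimmed; the
access_ok() happens inside user_access_begin()):

static int read_pair(u32 __user *uaddr, u32 *a, u32 *b)
{
        /* One check and one stac/clac pair cover both reads; the
         * unsafe_get_user() calls do no checking of their own.
         */
        if (!user_access_begin(uaddr, 2 * sizeof(u32)))
                return -EFAULT;
        unsafe_get_user(*a, uaddr, efault);
        unsafe_get_user(*b, uaddr + 1, efault);
        user_access_end();
        return 0;

efault:
        user_access_end();
        return -EFAULT;
}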

                     Linus
Jason Wang Jan. 11, 2019, 8:59 a.m. UTC | #21
On 2019/1/8 6:12 PM, Jason Wang wrote:
>
> On 2019/1/7 10:47 PM, Michael S. Tsirkin wrote:
>> On Mon, Jan 07, 2019 at 02:58:08PM +0800, Jason Wang wrote:
>>> On 2019/1/5 5:41 AM, Michael S. Tsirkin wrote:
>>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>> address instead of copy_user() friends, since they have too much
>>>>> overhead from checks, spec barriers or even hardware feature
>>>>> toggling.
>>>> I think it's a reasonable approach.
>>>> However I need to look at whether and which mmu notifiers are 
>>>> invoked before
>>>> writeback. Do you know?
>>>
>>> I don't know but just looking at the MMU notifier ops definition, 
>>> there's no
>>> such callback if my understanding is correct.
>>>
>>> Thanks
>> In that case how are you making sure used ring updates are written back?
>> If they aren't, the guest will crash ...
>
>
> I think this is the writeback issue you mentioned earlier. I haven't done a
> follow-up on the pointer, but it looks to me like some work is ongoing to fix
> the issue.
>
> I can investigate it more, but it's not something new; consider the
> case of VFIO.
>
> Thanks


OK, after some investigation: the GUP + dirty pages issue is not easy to
fix, so it may still take a while.

An idea is to switch back to copy_user() friends if we find the metadata
pages are not anonymous.
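
Roughly (hypothetical helper; the accessors would fall back to the
copy_user() path whenever the vmap is absent):

/* Only take the vmap fast path when every metadata page is anonymous;
 * file-backed pages keep going through copy_user() friends so that
 * dirtying and writeback stay correct.
 */
static bool vhost_pages_vmappable(struct page **pages, int npages)
{
        int i;

        for (i = 0; i < npages; i++)
                if (!PageAnon(pages[i]))
                        return false;
        return true;
}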

Does this sound good to you?

Thanks