[RFC,V2,0/5] vhost: accelerate metadata access through vmap()

Message ID: 1551856692-3384-1-git-send-email-jasowang@redhat.com

Message

Jason Wang March 6, 2019, 7:18 a.m. UTC
This series tries to access virtqueue metadata through kernel virtual
addresses instead of copy_user() and friends, since those have too much
overhead from checks, speculation barriers or even hardware feature
toggling. This is done by setting up a kernel address with vmap() and
registering an MMU notifier for invalidation.

Tests show about a 24% improvement in TX PPS. TCP_STREAM doesn't see an
obvious improvement.

Thanks

Changes from V4:
- use invalidate_range() instead of invalidate_range_start()
- track dirty pages
Changes from V3:
- don't try to use vmap for file backed pages
- rebase to master
Changes from V2:
- fix buggy range overlapping check
- tear down MMU notifier during vhost ioctl to make sure invalidation
  requests can read the metadata userspace address and vq size without
  holding the vq mutex.
Changes from V1:
- instead of pinning pages, use MMU notifier to invalidate vmaps and
  remap during metadata prefetch
- fix build warning on MIPS

Jason Wang (5):
  vhost: generalize adding used elem
  vhost: fine grain userspace memory accessors
  vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
  vhost: introduce helpers to get the size of metadata area
  vhost: access vq metadata through kernel virtual address

 drivers/vhost/net.c   |   6 +-
 drivers/vhost/vhost.c | 434 ++++++++++++++++++++++++++++++++++++++++++++------
 drivers/vhost/vhost.h |  18 ++-
 3 files changed, 407 insertions(+), 51 deletions(-)

Comments

Christoph Hellwig March 8, 2019, 2:12 p.m. UTC | #1
On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling. This is done through setup kernel address through vmap() and
> resigter MMU notifier for invalidation.
> 
> Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
> obvious improvement.

How is this going to work for CPUs with virtually tagged caches?
Jason Wang March 11, 2019, 7:13 a.m. UTC | #2
On 2019/3/8 下午10:12, Christoph Hellwig wrote:
> On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling. This is done through setup kernel address through vmap() and
>> resigter MMU notifier for invalidation.
>>
>> Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
>> obvious improvement.
> How is this going to work for CPUs with virtually tagged caches?


Anything different that you worry? I can have a test but do you know any 
archs that use virtual tag cache?

Thanks
Michael S. Tsirkin March 11, 2019, 1:59 p.m. UTC | #3
On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
> 
> On 2019/3/8 下午10:12, Christoph Hellwig wrote:
> > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends since they had too much
> > > overheads like checks, spec barriers or even hardware feature
> > > toggling. This is done through setup kernel address through vmap() and
> > > resigter MMU notifier for invalidation.
> > > 
> > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
> > > obvious improvement.
> > How is this going to work for CPUs with virtually tagged caches?
> 
> 
> Anything different that you worry?

If caches have virtual tags then kernel and userspace view of memory
might not be automatically in sync if they access memory
through different virtual addresses. You need to do things like
flush_cache_page, probably multiple times.
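
As a minimal sketch of that pattern (illustrative only, not code from this
series; the helper name is made up), writing through a kernel alias and then
flushing it so the userspace alias observes the update:

#include <linux/highmem.h>
#include <linux/mm.h>

/* Hypothetical helper: write one byte to a user-visible page through its
 * kernel-virtual alias.  On a virtually tagged cache the kernel alias has
 * to be flushed, otherwise userspace may keep reading stale data through
 * its own virtual address.
 */
static void write_user_page_byte(struct page *page, unsigned int off, u8 val)
{
	u8 *kaddr = kmap_atomic(page);	/* kernel alias of the page */

	kaddr[off] = val;		/* dirties only the kernel alias */
	kunmap_atomic(kaddr);
	flush_dcache_page(page);	/* make the write visible to the user alias */
}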

> I can have a test but do you know any
> archs that use virtual tag cache?

sparc I believe.

> Thanks
David Miller March 11, 2019, 6:14 p.m. UTC | #4
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Mon, 11 Mar 2019 09:59:28 -0400

> On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
>> 
>> On 2019/3/8 下午10:12, Christoph Hellwig wrote:
>> > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
>> > > This series tries to access virtqueue metadata through kernel virtual
>> > > address instead of copy_user() friends since they had too much
>> > > overheads like checks, spec barriers or even hardware feature
>> > > toggling. This is done through setup kernel address through vmap() and
>> > > resigter MMU notifier for invalidation.
>> > > 
>> > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
>> > > obvious improvement.
>> > How is this going to work for CPUs with virtually tagged caches?
>> 
>> 
>> Anything different that you worry?
> 
> If caches have virtual tags then kernel and userspace view of memory
> might not be automatically in sync if they access memory
> through different virtual addresses. You need to do things like
> flush_cache_page, probably multiple times.

"flush_dcache_page()"
Jason Wang March 12, 2019, 2:59 a.m. UTC | #5
On 2019/3/12 上午2:14, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Mon, 11 Mar 2019 09:59:28 -0400
>
>> On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
>>> On 2019/3/8 下午10:12, Christoph Hellwig wrote:
>>>> On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>> address instead of copy_user() friends since they had too much
>>>>> overheads like checks, spec barriers or even hardware feature
>>>>> toggling. This is done through setup kernel address through vmap() and
>>>>> resigter MMU notifier for invalidation.
>>>>>
>>>>> Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
>>>>> obvious improvement.
>>>> How is this going to work for CPUs with virtually tagged caches?
>>>
>>> Anything different that you worry?
>> If caches have virtual tags then kernel and userspace view of memory
>> might not be automatically in sync if they access memory
>> through different virtual addresses. You need to do things like
>> flush_cache_page, probably multiple times.
> "flush_dcache_page()"


I get this. Then I think the current set_bit_to_user() is suspicious, we 
probably miss a flush_dcache_page() there:


static int set_bit_to_user(int nr, void __user *addr)
{
         unsigned long log = (unsigned long)addr;
         struct page *page;
         void *base;
         int bit = nr + (log % PAGE_SIZE) * 8;
         int r;

         r = get_user_pages_fast(log, 1, 1, &page);
         if (r < 0)
                 return r;
         BUG_ON(r != 1);
         base = kmap_atomic(page);
         set_bit(bit, base);
         kunmap_atomic(base);
         set_page_dirty_lock(page);
         put_page(page);
         return 0;
}
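
For illustration, the variant being suggested here would add a single flush
after the unmap (a sketch only; whether that flush is really required is
exactly what the rest of the thread debates):

         base = kmap_atomic(page);
         set_bit(bit, base);
         kunmap_atomic(base);
         flush_dcache_page(page);   /* the flush suspected to be missing */
         set_page_dirty_lock(page);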

Thanks
Michael S. Tsirkin March 12, 2019, 3:52 a.m. UTC | #6
On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> 
> On 2019/3/12 上午2:14, David Miller wrote:
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > Date: Mon, 11 Mar 2019 09:59:28 -0400
> > 
> > > On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
> > > > On 2019/3/8 下午10:12, Christoph Hellwig wrote:
> > > > > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > address instead of copy_user() friends since they had too much
> > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > toggling. This is done through setup kernel address through vmap() and
> > > > > > resigter MMU notifier for invalidation.
> > > > > > 
> > > > > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
> > > > > > obvious improvement.
> > > > > How is this going to work for CPUs with virtually tagged caches?
> > > > 
> > > > Anything different that you worry?
> > > If caches have virtual tags then kernel and userspace view of memory
> > > might not be automatically in sync if they access memory
> > > through different virtual addresses. You need to do things like
> > > flush_cache_page, probably multiple times.
> > "flush_dcache_page()"
> 
> 
> I get this. Then I think the current set_bit_to_user() is suspicious, we
> probably miss a flush_dcache_page() there:
> 
> 
> static int set_bit_to_user(int nr, void __user *addr)
> {
>         unsigned long log = (unsigned long)addr;
>         struct page *page;
>         void *base;
>         int bit = nr + (log % PAGE_SIZE) * 8;
>         int r;
> 
>         r = get_user_pages_fast(log, 1, 1, &page);
>         if (r < 0)
>                 return r;
>         BUG_ON(r != 1);
>         base = kmap_atomic(page);
>         set_bit(bit, base);
>         kunmap_atomic(base);
>         set_page_dirty_lock(page);
>         put_page(page);
>         return 0;
> }
> 
> Thanks

I think you are right. The correct fix though is to re-implement
it using asm and handle the pagefault, not gup.
Three atomic ops per bit is way too expensive.
James Bottomley March 12, 2019, 5:14 a.m. UTC | #7
On Tue, 2019-03-12 at 10:59 +0800, Jason Wang wrote:
> On 2019/3/12 上午2:14, David Miller wrote:
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > Date: Mon, 11 Mar 2019 09:59:28 -0400
> > 
> > > On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
> > > > On 2019/3/8 下午10:12, Christoph Hellwig wrote:
> > > > > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> > > > > > This series tries to access virtqueue metadata through
> > > > > > kernel virtual
> > > > > > address instead of copy_user() friends since they had too
> > > > > > much
> > > > > > overheads like checks, spec barriers or even hardware
> > > > > > feature
> > > > > > toggling. This is done through setup kernel address through
> > > > > > vmap() and
> > > > > > resigter MMU notifier for invalidation.
> > > > > > 
> > > > > > Test shows about 24% improvement on TX PPS. TCP_STREAM
> > > > > > doesn't see
> > > > > > obvious improvement.
> > > > > 
> > > > > How is this going to work for CPUs with virtually tagged
> > > > > caches?
> > > > 
> > > > Anything different that you worry?
> > > 
> > > If caches have virtual tags then kernel and userspace view of
> > > memory
> > > might not be automatically in sync if they access memory
> > > through different virtual addresses. You need to do things like
> > > flush_cache_page, probably multiple times.
> > 
> > "flush_dcache_page()"
> 
> 
> I get this. Then I think the current set_bit_to_user() is suspicious,
> we 
> probably miss a flush_dcache_page() there:
> 
> 
> static int set_bit_to_user(int nr, void __user *addr)
> {
>          unsigned long log = (unsigned long)addr;
>          struct page *page;
>          void *base;
>          int bit = nr + (log % PAGE_SIZE) * 8;
>          int r;
> 
>          r = get_user_pages_fast(log, 1, 1, &page);
>          if (r < 0)
>                  return r;
>          BUG_ON(r != 1);
>          base = kmap_atomic(page);
>          set_bit(bit, base);
>          kunmap_atomic(base);

This sequence should be OK.  get_user_pages() contains a flush which
clears the cache above the user virtual address, so on kmap, the page
is coherent at the new alias.  On parisc at least, kunmap embodies a
flush_dcache_page() which pushes any changes in the cache above the
kernel virtual address back to main memory and makes it coherent again
for the user alias to pick it up.

James
Jason Wang March 12, 2019, 7:17 a.m. UTC | #8
On 2019/3/12 上午11:52, Michael S. Tsirkin wrote:
> On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
>> On 2019/3/12 上午2:14, David Miller wrote:
>>> From: "Michael S. Tsirkin" <mst@redhat.com>
>>> Date: Mon, 11 Mar 2019 09:59:28 -0400
>>>
>>>> On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
>>>>> On 2019/3/8 下午10:12, Christoph Hellwig wrote:
>>>>>> On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
>>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>>> address instead of copy_user() friends since they had too much
>>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>>> toggling. This is done through setup kernel address through vmap() and
>>>>>>> resigter MMU notifier for invalidation.
>>>>>>>
>>>>>>> Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
>>>>>>> obvious improvement.
>>>>>> How is this going to work for CPUs with virtually tagged caches?
>>>>> Anything different that you worry?
>>>> If caches have virtual tags then kernel and userspace view of memory
>>>> might not be automatically in sync if they access memory
>>>> through different virtual addresses. You need to do things like
>>>> flush_cache_page, probably multiple times.
>>> "flush_dcache_page()"
>>
>> I get this. Then I think the current set_bit_to_user() is suspicious, we
>> probably miss a flush_dcache_page() there:
>>
>>
>> static int set_bit_to_user(int nr, void __user *addr)
>> {
>>          unsigned long log = (unsigned long)addr;
>>          struct page *page;
>>          void *base;
>>          int bit = nr + (log % PAGE_SIZE) * 8;
>>          int r;
>>
>>          r = get_user_pages_fast(log, 1, 1, &page);
>>          if (r < 0)
>>                  return r;
>>          BUG_ON(r != 1);
>>          base = kmap_atomic(page);
>>          set_bit(bit, base);
>>          kunmap_atomic(base);
>>          set_page_dirty_lock(page);
>>          put_page(page);
>>          return 0;
>> }
>>
>> Thanks
> I think you are right. The correct fix though is to re-implement
> it using asm and handling pagefault, not gup.


I agree but it needs to introduce new helpers in asm  for all archs 
which is not trivial. At least for -stable, we need the flush?


> Three atomic ops per bit is way to expensive.


Yes.

Thanks
Jason Wang March 12, 2019, 7:51 a.m. UTC | #9
On 2019/3/12 下午1:14, James Bottomley wrote:
> On Tue, 2019-03-12 at 10:59 +0800, Jason Wang wrote:
>> On 2019/3/12 上午2:14, David Miller wrote:
>>> From: "Michael S. Tsirkin" <mst@redhat.com>
>>> Date: Mon, 11 Mar 2019 09:59:28 -0400
>>>
>>>> On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
>>>>> On 2019/3/8 下午10:12, Christoph Hellwig wrote:
>>>>>> On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
>>>>>>> This series tries to access virtqueue metadata through
>>>>>>> kernel virtual
>>>>>>> address instead of copy_user() friends since they had too
>>>>>>> much
>>>>>>> overheads like checks, spec barriers or even hardware
>>>>>>> feature
>>>>>>> toggling. This is done through setup kernel address through
>>>>>>> vmap() and
>>>>>>> resigter MMU notifier for invalidation.
>>>>>>>
>>>>>>> Test shows about 24% improvement on TX PPS. TCP_STREAM
>>>>>>> doesn't see
>>>>>>> obvious improvement.
>>>>>> How is this going to work for CPUs with virtually tagged
>>>>>> caches?
>>>>> Anything different that you worry?
>>>> If caches have virtual tags then kernel and userspace view of
>>>> memory
>>>> might not be automatically in sync if they access memory
>>>> through different virtual addresses. You need to do things like
>>>> flush_cache_page, probably multiple times.
>>> "flush_dcache_page()"
>>
>> I get this. Then I think the current set_bit_to_user() is suspicious,
>> we
>> probably miss a flush_dcache_page() there:
>>
>>
>> static int set_bit_to_user(int nr, void __user *addr)
>> {
>>           unsigned long log = (unsigned long)addr;
>>           struct page *page;
>>           void *base;
>>           int bit = nr + (log % PAGE_SIZE) * 8;
>>           int r;
>>
>>           r = get_user_pages_fast(log, 1, 1, &page);
>>           if (r < 0)
>>                   return r;
>>           BUG_ON(r != 1);
>>           base = kmap_atomic(page);
>>           set_bit(bit, base);
>>           kunmap_atomic(base);
> This sequence should be OK.  get_user_pages() contains a flush which
> clears the cache above the user virtual address, so on kmap, the page
> is coherent at the new alias.  On parisc at least, kunmap embodies a
> flush_dcache_page() which pushes any changes in the cache above the
> kernel virtual address back to main memory and makes it coherent again
> for the user alias to pick it up.


It would be good if kmap()/kunmap() can do this but looks like we can 
not assume this? For example, sparc's flush_dcache_page() doesn't do 
flush_dcache_page(). And bio_copy_data_iter() do flush_dcache_page() 
after kunmap_atomic().
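
The pattern referred to looks roughly like this (a sketch of the shape of
bio_copy_data_iter(), not verbatim kernel code):

	src_p = kmap_atomic(src_page);
	dst_p = kmap_atomic(dst_page);
	memcpy(dst_p + dst_off, src_p + src_off, bytes);
	kunmap_atomic(dst_p);
	kunmap_atomic(src_p);
	flush_dcache_page(dst_page);	/* explicit flush after the unmap */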

Thanks


>
> James
>
Jason Wang March 12, 2019, 7:53 a.m. UTC | #10
On 2019/3/12 下午3:51, Jason Wang wrote:
>
> On 2019/3/12 下午1:14, James Bottomley wrote:
>> On Tue, 2019-03-12 at 10:59 +0800, Jason Wang wrote:
>>> On 2019/3/12 上午2:14, David Miller wrote:
>>>> From: "Michael S. Tsirkin" <mst@redhat.com>
>>>> Date: Mon, 11 Mar 2019 09:59:28 -0400
>>>>
>>>>> On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
>>>>>> On 2019/3/8 下午10:12, Christoph Hellwig wrote:
>>>>>>> On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
>>>>>>>> This series tries to access virtqueue metadata through
>>>>>>>> kernel virtual
>>>>>>>> address instead of copy_user() friends since they had too
>>>>>>>> much
>>>>>>>> overheads like checks, spec barriers or even hardware
>>>>>>>> feature
>>>>>>>> toggling. This is done through setup kernel address through
>>>>>>>> vmap() and
>>>>>>>> resigter MMU notifier for invalidation.
>>>>>>>>
>>>>>>>> Test shows about 24% improvement on TX PPS. TCP_STREAM
>>>>>>>> doesn't see
>>>>>>>> obvious improvement.
>>>>>>> How is this going to work for CPUs with virtually tagged
>>>>>>> caches?
>>>>>> Anything different that you worry?
>>>>> If caches have virtual tags then kernel and userspace view of
>>>>> memory
>>>>> might not be automatically in sync if they access memory
>>>>> through different virtual addresses. You need to do things like
>>>>> flush_cache_page, probably multiple times.
>>>> "flush_dcache_page()"
>>>
>>> I get this. Then I think the current set_bit_to_user() is suspicious,
>>> we
>>> probably miss a flush_dcache_page() there:
>>>
>>>
>>> static int set_bit_to_user(int nr, void __user *addr)
>>> {
>>>           unsigned long log = (unsigned long)addr;
>>>           struct page *page;
>>>           void *base;
>>>           int bit = nr + (log % PAGE_SIZE) * 8;
>>>           int r;
>>>
>>>           r = get_user_pages_fast(log, 1, 1, &page);
>>>           if (r < 0)
>>>                   return r;
>>>           BUG_ON(r != 1);
>>>           base = kmap_atomic(page);
>>>           set_bit(bit, base);
>>>           kunmap_atomic(base);
>> This sequence should be OK.  get_user_pages() contains a flush which
>> clears the cache above the user virtual address, so on kmap, the page
>> is coherent at the new alias.  On parisc at least, kunmap embodies a
>> flush_dcache_page() which pushes any changes in the cache above the
>> kernel virtual address back to main memory and makes it coherent again
>> for the user alias to pick it up.
>
>
> It would be good if kmap()/kunmap() can do this but looks like we can 
> not assume this? For example, sparc's flush_dcache_page() 


Sorry, I meant kunmap_atomic().

Thanks


> doesn't do flush_dcache_page(). And bio_copy_data_iter() do 
> flush_dcache_page() after kunmap_atomic().
>
> Thanks
>
>
>>
>> James
>>
>
Michael S. Tsirkin March 12, 2019, 11:54 a.m. UTC | #11
On Tue, Mar 12, 2019 at 03:17:00PM +0800, Jason Wang wrote:
> 
> On 2019/3/12 上午11:52, Michael S. Tsirkin wrote:
> > On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> > > On 2019/3/12 上午2:14, David Miller wrote:
> > > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > > Date: Mon, 11 Mar 2019 09:59:28 -0400
> > > > 
> > > > > On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
> > > > > > On 2019/3/8 下午10:12, Christoph Hellwig wrote:
> > > > > > > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > > toggling. This is done through setup kernel address through vmap() and
> > > > > > > > resigter MMU notifier for invalidation.
> > > > > > > > 
> > > > > > > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
> > > > > > > > obvious improvement.
> > > > > > > How is this going to work for CPUs with virtually tagged caches?
> > > > > > Anything different that you worry?
> > > > > If caches have virtual tags then kernel and userspace view of memory
> > > > > might not be automatically in sync if they access memory
> > > > > through different virtual addresses. You need to do things like
> > > > > flush_cache_page, probably multiple times.
> > > > "flush_dcache_page()"
> > > 
> > > I get this. Then I think the current set_bit_to_user() is suspicious, we
> > > probably miss a flush_dcache_page() there:
> > > 
> > > 
> > > static int set_bit_to_user(int nr, void __user *addr)
> > > {
> > >          unsigned long log = (unsigned long)addr;
> > >          struct page *page;
> > >          void *base;
> > >          int bit = nr + (log % PAGE_SIZE) * 8;
> > >          int r;
> > > 
> > >          r = get_user_pages_fast(log, 1, 1, &page);
> > >          if (r < 0)
> > >                  return r;
> > >          BUG_ON(r != 1);
> > >          base = kmap_atomic(page);
> > >          set_bit(bit, base);
> > >          kunmap_atomic(base);
> > >          set_page_dirty_lock(page);
> > >          put_page(page);
> > >          return 0;
> > > }
> > > 
> > > Thanks
> > I think you are right. The correct fix though is to re-implement
> > it using asm and handling pagefault, not gup.
> 
> 
> I agree but it needs to introduce new helpers in asm  for all archs which is
> not trivial.

We can have a generic implementation using kmap.

> At least for -stable, we need the flush?
> 
> 
> > Three atomic ops per bit is way to expensive.
> 
> 
> Yes.
> 
> Thanks

See James's reply - I stand corrected we do kunmap so no need to flush.
James Bottomley March 12, 2019, 3:46 p.m. UTC | #12
On Tue, 2019-03-12 at 07:54 -0400, Michael S. Tsirkin wrote:
> On Tue, Mar 12, 2019 at 03:17:00PM +0800, Jason Wang wrote:
> > 
> > On 2019/3/12 上午11:52, Michael S. Tsirkin wrote:
> > > On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
[...]
> > At least for -stable, we need the flush?
> > 
> > 
> > > Three atomic ops per bit is way to expensive.
> > 
> > 
> > Yes.
> > 
> > Thanks
> 
> See James's reply - I stand corrected we do kunmap so no need to
> flush.

Well, I said that's what we do on Parisc.  The cachetlb document
definitely says if you alter the data between kmap and kunmap you are
responsible for the flush.  It's just that flush_dcache_page() is a no-
op on x86 so they never remember to add it and since it will crash
parisc if you get it wrong we finally gave up trying to make them.

But that's the point: it is a no-op on your favourite architecture so
it costs you nothing to add it.

James
Andrea Arcangeli March 12, 2019, 8:04 p.m. UTC | #13
On Tue, Mar 12, 2019 at 08:46:50AM -0700, James Bottomley wrote:
> On Tue, 2019-03-12 at 07:54 -0400, Michael S. Tsirkin wrote:
> > On Tue, Mar 12, 2019 at 03:17:00PM +0800, Jason Wang wrote:
> > > 
> > > On 2019/3/12 上午11:52, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> [...]
> > > At least for -stable, we need the flush?
> > > 
> > > 
> > > > Three atomic ops per bit is way to expensive.
> > > 
> > > 
> > > Yes.
> > > 
> > > Thanks
> > 
> > See James's reply - I stand corrected we do kunmap so no need to
> > flush.
> 
> Well, I said that's what we do on Parisc.  The cachetlb document
> definitely says if you alter the data between kmap and kunmap you are
> responsible for the flush.  It's just that flush_dcache_page() is a no-
> op on x86 so they never remember to add it and since it will crash
> parisc if you get it wrong we finally gave up trying to make them.
> 
> But that's the point: it is a no-op on your favourite architecture so
> it costs you nothing to add it.

Yes, the fact Parisc gave up and is doing it on kunmap is reasonable
approach for Parisc, but it doesn't move the needle as far as vhost
common code is concerned, because other archs don't flush any cache on
kunmap.

So either all other archs give up trying to optimize, or vhost still
has to call flush_dcache_page() after kunmap.

Which means after we fix vhost to add the flush_dcache_page after
kunmap, Parisc will get a double hit (but it also means Parisc was the
only one of those archs needed explicit cache flushes, where vhost
worked correctly so far.. so it kinds of proofs your point of giving
up being the safe choice).

Thanks,
Andrea
James Bottomley March 12, 2019, 8:53 p.m. UTC | #14
On Tue, 2019-03-12 at 16:04 -0400, Andrea Arcangeli wrote:
> On Tue, Mar 12, 2019 at 08:46:50AM -0700, James Bottomley wrote:
> > On Tue, 2019-03-12 at 07:54 -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 12, 2019 at 03:17:00PM +0800, Jason Wang wrote:
> > > > 
> > > > On 2019/3/12 上午11:52, Michael S. Tsirkin wrote:
> > > > > On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> > 
> > [...]
> > > > At least for -stable, we need the flush?
> > > > 
> > > > 
> > > > > Three atomic ops per bit is way to expensive.
> > > > 
> > > > 
> > > > Yes.
> > > > 
> > > > Thanks
> > > 
> > > See James's reply - I stand corrected we do kunmap so no need to
> > > flush.
> > 
> > Well, I said that's what we do on Parisc.  The cachetlb document
> > definitely says if you alter the data between kmap and kunmap you
> > are responsible for the flush.  It's just that flush_dcache_page()
> > is a no-op on x86 so they never remember to add it and since it
> > will crash parisc if you get it wrong we finally gave up trying to
> > make them.
> > 
> > But that's the point: it is a no-op on your favourite architecture
> > so it costs you nothing to add it.
> 
> Yes, the fact Parisc gave up and is doing it on kunmap is reasonable
> approach for Parisc, but it doesn't move the needle as far as vhost
> common code is concerned, because other archs don't flush any cache
> on kunmap.
> 
> So either all other archs give up trying to optimize, or vhost still
> has to call flush_dcache_page() after kunmap.

I've got to say: optimize what?  What code do we ever have in the
kernel that kmap's a page and then doesn't do anything with it? You can
guarantee that on kunmap the page is either referenced (needs
invalidating) or updated (needs flushing). The in-kernel use of kmap is
always

kmap
do something with the mapped page
kunmap

In a very short interval.  It seems just a simplification to make
kunmap do the flush if needed rather than try to have the users
remember.  The thing which makes this really simple is that on most
architectures flush and invalidate is the same operation.  If you
really want to optimize you can use the referenced and dirty bits on
the kmapped pte to tell you what operation to do, but if your flush is
your invalidate, you simply assume the data needs flushing on kunmap
without checking anything.
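
A hypothetical sketch of that simplification (the helper below is made up
and assumes an architecture where the flush operation is also the
invalidate):

	/* Unmap a kmapped page and unconditionally flush/invalidate its
	 * kernel alias, so callers don't have to remember the flush.
	 */
	static inline void kunmap_atomic_flush(void *kaddr, struct page *page)
	{
		flush_kernel_dcache_page(page);	/* flush == invalidate here */
		kunmap_atomic(kaddr);
	}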

> Which means after we fix vhost to add the flush_dcache_page after
> kunmap, Parisc will get a double hit (but it also means Parisc was
> the only one of those archs needed explicit cache flushes, where
> vhost worked correctly so far.. so it kinds of proofs your point of
> giving up being the safe choice).

What double hit?  If there's no cache to flush then cache flush is a
no-op.  It's also a highly pipelineable no-op because the CPU has the L1
cache within easy reach.  The only event when flush takes a large
amount of time is if we actually have dirty data to write back to main
memory.

James
Andrea Arcangeli March 12, 2019, 9:11 p.m. UTC | #15
On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> I've got to say: optimize what?  What code do we ever have in the
> kernel that kmap's a page and then doesn't do anything with it? You can
> guarantee that on kunmap the page is either referenced (needs
> invalidating) or updated (needs flushing). The in-kernel use of kmap is
> always
> 
> kmap
> do something with the mapped page
> kunmap
> 
> In a very short interval.  It seems just a simplification to make
> kunmap do the flush if needed rather than try to have the users
> remember.  The thing which makes this really simple is that on most
> architectures flush and invalidate is the same operation.  If you
> really want to optimize you can use the referenced and dirty bits on
> the kmapped pte to tell you what operation to do, but if your flush is
> your invalidate, you simply assume the data needs flushing on kunmap
> without checking anything.

Except other archs like arm64 and sparc do the cache flushing on
copy_to_user_page and copy_user_page, not on kunmap.

#define copy_user_page(to,from,vaddr,pg) __cpu_copy_user_page(to, from, vaddr)
void __cpu_copy_user_page(void *kto, const void *kfrom, unsigned long vaddr)
{
	struct page *page = virt_to_page(kto);
	copy_page(kto, kfrom);
	flush_dcache_page(page);
}
#define copy_user_page(to, from, vaddr, page)	\
	do {	copy_page(to, from);		\
		sparc_flush_page_to_ram(page);	\
	} while (0)

And they do nothing on kunmap:

static inline void kunmap(struct page *page)
{
	BUG_ON(in_interrupt());
	if (!PageHighMem(page))
		return;
	kunmap_high(page);
}
void kunmap_high(struct page *page)
{
	unsigned long vaddr;
	unsigned long nr;
	unsigned long flags;
	int need_wakeup;
	unsigned int color = get_pkmap_color(page);
	wait_queue_head_t *pkmap_map_wait;

	lock_kmap_any(flags);
	vaddr = (unsigned long)page_address(page);
	BUG_ON(!vaddr);
	nr = PKMAP_NR(vaddr);

	/*
	 * A count must never go down to zero
	 * without a TLB flush!
	 */
	need_wakeup = 0;
	switch (--pkmap_count[nr]) {
	case 0:
		BUG();
	case 1:
		/*
		 * Avoid an unnecessary wake_up() function call.
		 * The common case is pkmap_count[] == 1, but
		 * no waiters.
		 * The tasks queued in the wait-queue are guarded
		 * by both the lock in the wait-queue-head and by
		 * the kmap_lock.  As the kmap_lock is held here,
		 * no need for the wait-queue-head's lock.  Simply
		 * test if the queue is empty.
		 */
		pkmap_map_wait = get_pkmap_wait_queue_head(color);
		need_wakeup = waitqueue_active(pkmap_map_wait);
	}
	unlock_kmap_any(flags);

	/* do wake-up, if needed, race-free outside of the spin lock */
	if (need_wakeup)
		wake_up(pkmap_map_wait);
}
static inline void kunmap(struct page *page)
{
}

because they already did it just above.


> > Which means after we fix vhost to add the flush_dcache_page after
> > kunmap, Parisc will get a double hit (but it also means Parisc was
> > the only one of those archs needed explicit cache flushes, where
> > vhost worked correctly so far.. so it kinds of proofs your point of
> > giving up being the safe choice).
> 
> What double hit?  If there's no cache to flush then cache flush is a
> no-op.  It's also a highly piplineable no-op because the CPU has the L1
> cache within easy reach.  The only event when flush takes a large
> amount time is if we actually have dirty data to write back to main
> memory.

The double hit is in parisc copy_to_user_page:

#define copy_to_user_page(vma, page, vaddr, dst, src, len) \
do { \
	flush_cache_page(vma, vaddr, page_to_pfn(page)); \
	memcpy(dst, src, len); \
	flush_kernel_dcache_range_asm((unsigned long)dst, (unsigned long)dst + len); \
} while (0)

That is executed just before kunmap:

static inline void kunmap(struct page *page)
{
	flush_kernel_dcache_page_addr(page_address(page));
}

Can't argue about the fact your "safer" kunmap is safer, but we cannot
rely on common code unless we remove some optimization from the common
code abstractions and we make all archs do kunmap like parisc.

Thanks,
Andrea
James Bottomley March 12, 2019, 9:19 p.m. UTC | #16
I think we might be talking past each other.  Let me try the double
flush first

On Tue, 2019-03-12 at 17:11 -0400, Andrea Arcangeli wrote:
> On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> > > Which means after we fix vhost to add the flush_dcache_page after
> > > kunmap, Parisc will get a double hit (but it also means Parisc
> > > was
> > > the only one of those archs needed explicit cache flushes, where
> > > vhost worked correctly so far.. so it kinds of proofs your point
> > > of
> > > giving up being the safe choice).
> > 
> > What double hit?  If there's no cache to flush then cache flush is
> > a no-op.  It's also a highly piplineable no-op because the CPU has
> > the L1 cache within easy reach.  The only event when flush takes a
> > large amount time is if we actually have dirty data to write back
> > to main memory.
> 
> The double hit is in parisc copy_to_user_page:
> 
> #define copy_to_user_page(vma, page, vaddr, dst, src, len) \
> do { \
> 	flush_cache_page(vma, vaddr, page_to_pfn(page)); \
> 	memcpy(dst, src, len); \
> 	flush_kernel_dcache_range_asm((unsigned long)dst, (unsigned
> long)dst + len); \
> } while (0)
> 
> That is executed just before kunmap:
> 
> static inline void kunmap(struct page *page)
> {
> 	flush_kernel_dcache_page_addr(page_address(page));
> }

I mean in the sequence

flush_dcache_page(page);
flush_dcache_page(page);

The first flush_dcache_page did all the work and the second is a
tightly pipelined no-op.  That's what I mean by there not really being
a double hit.

James
Andrea Arcangeli March 12, 2019, 9:53 p.m. UTC | #17
On Tue, Mar 12, 2019 at 02:19:15PM -0700, James Bottomley wrote:
> I mean in the sequence
> 
> flush_dcache_page(page);
> flush_dcache_page(page);
> 
> The first flush_dcache_page did all the work and the second it a
> tightly pipelined no-op.  That's what I mean by there not really being
> a double hit.

Ok I wasn't sure it was clear there was a double (profiling) hit on
that function.

void flush_kernel_dcache_page_addr(void *addr)
{
	unsigned long flags;

	flush_kernel_dcache_page_asm(addr);
	purge_tlb_start(flags);
	pdtlb_kernel(addr);
	purge_tlb_end(flags);
}

#define purge_tlb_start(flags)	spin_lock_irqsave(&pa_tlb_lock, flags)
#define purge_tlb_end(flags)	spin_unlock_irqrestore(&pa_tlb_lock, flags)

You got a system-wide spinlock in there that won't just go away the
second time. So it's a bit more than a tightly pipelined "noop".

Your logic of adding the flush on kunmap makes sense, all I'm saying
is that it's sacrificing some performance for safety. You asked
"optimized what", I meant to optimize away all the above quoted code
that will end running twice for each vhost set_bit when it should run
just once like in other archs. And it clearly paid off until now
(until now it run just once and it was the only safe one).

Before we can leverage your idea to flush the dcache on kunmap in
common code without having to sacrifice performance in arch code, we'd
need to change all other archs to add the cache flushes on kunmap too,
and then remove the cache flushes from the other places like copy_page
or we'd waste CPU. Then you'd have the best of both worlds, no double
flush and kunmap would be enough.

Thanks,
Andrea
James Bottomley March 12, 2019, 10:02 p.m. UTC | #18
On Tue, 2019-03-12 at 17:53 -0400, Andrea Arcangeli wrote:
> On Tue, Mar 12, 2019 at 02:19:15PM -0700, James Bottomley wrote:
> > I mean in the sequence
> > 
> > flush_dcache_page(page);
> > flush_dcache_page(page);
> > 
> > The first flush_dcache_page did all the work and the second it a
> > tightly pipelined no-op.  That's what I mean by there not really
> > being
> > a double hit.
> 
> Ok I wasn't sure it was clear there was a double (profiling) hit on
> that function.
> 
> void flush_kernel_dcache_page_addr(void *addr)
> {
> 	unsigned long flags;
> 
> 	flush_kernel_dcache_page_asm(addr);
> 	purge_tlb_start(flags);
> 	pdtlb_kernel(addr);
> 	purge_tlb_end(flags);
> }
> 
> #define purge_tlb_start(flags)	spin_lock_irqsave(&pa_tlb_lock,
> flags)
> #define purge_tlb_end(flags)	spin_unlock_irqrestore(&pa_tlb_lo
> ck, flags)
> 
> You got a system-wide spinlock in there that won't just go away the
> second time. So it's a bit more than a tightly pipelined "noop".

Well, yes, guilty as charged.  That particular bit of code is a
workaround for an N class system which has an internal cross-CPU
coherency bus but helpfully crashes if two different CPUs try to use it
at once.  Since the N class was a huge power hog, I thought they'd all
been decommissioned and this was an irrelevant anachronism (or at the
very least runtime patched).

> Your logic of adding the flush on kunmap makes sense, all I'm saying
> is that it's sacrificing some performance for safety. You asked
> "optimized what", I meant to optimize away all the above quoted code
> that will end running twice for each vhost set_bit when it should run
> just once like in other archs. And it clearly paid off until now
> (until now it run just once and it was the only safe one).

I'm sure there must be workarounds elsewhere in the other arch code
otherwise things like this, which appear all over drivers/, wouldn't
work:

drivers/scsi/isci/request.c:1430

	kaddr = kmap_atomic(page);
	memcpy(kaddr + sg->offset, src_addr, copy_len);
	kunmap_atomic(kaddr);

the sequence dirties the kernel virtual address but doesn't flush
before doing kunmap.  There are hundreds of other examples which is why
I think adding flush_kernel_dcache_page() is an already lost cause.

> Before we can leverage your idea to flush the dcache on kunmap in
> common code without having to sacrifice performance in arch code,
> we'd need to change all other archs to add the cache flushes on
> kunmap too, and then remove the cache flushes from the other places
> like copy_page or we'd waste CPU. Then you'd have the best of both
> words, no double flush and kunmap would be enough.

Actually copy_user_page() is unused in the main kernel.  The big
problem is copy_user_highpage() but that's mostly highly optimised by
the VIPT architectures (in other words you can fiddle with kmap without
impacting it).

James
Andrea Arcangeli March 12, 2019, 10:50 p.m. UTC | #19
On Tue, Mar 12, 2019 at 03:02:54PM -0700, James Bottomley wrote:
> I'm sure there must be workarounds elsewhere in the other arch code
> otherwise things like this, which appear all over drivers/, wouldn't
> work:
> 
> drivers/scsi/isci/request.c:1430
> 
> 	kaddr = kmap_atomic(page);
> 	memcpy(kaddr + sg->offset, src_addr, copy_len);
> 	kunmap_atomic(kaddr);
> 

Are you sure "page" is an userland page with an alias address?

	sg->page_link = (unsigned long)virt_to_page(addr);

page_link seems to point to kernel memory.

I found an apparent solution like parisc on arm 32bit:

void __kunmap_atomic(void *kvaddr)
{
	unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
	int idx, type;

	if (kvaddr >= (void *)FIXADDR_START) {
		type = kmap_atomic_idx();
		idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id();

		if (cache_is_vivt())
			__cpuc_flush_dcache_area((void *)vaddr, PAGE_SIZE);

However on arm 64bit kunmap_atomic is not implemented at all and other
32bit implementations don't do it, for example sparc seems to do the
cache flush too if the kernel is built with CONFIG_DEBUG_HIGHMEM
(which makes the flushing conditional to the debug option).

The kunmap_atomic where fixmap is used flushes the tlb lazily, so
even on 32bit you can't be sure there was a tlb flush for each
single page you unmapped, so it's hard to see how the above can work
safely if "page" were a userland page mapped with an aliased
CPU cache.

> the sequence dirties the kernel virtual address but doesn't flush
> before doing kunmap.  There are hundreds of other examples which is why
> I think adding flush_kernel_dcache_page() is an already lost cause.

In lots of cases kmap is needed to just modify kernel memory not to
modify userland memory (where get/put_user is more commonly used
instead..), there's no cache aliasing in such case.

> Actually copy_user_page() is unused in the main kernel.  The big
> problem is copy_user_highpage() but that's mostly highly optimised by
> the VIPT architectures (in other words you can fiddle with kmap without
> impacting it).

copy_user_page is not unused, it's called precisely by
copy_user_highpage, which is why the cache flushes are done inside
copy_user_page.

static inline void copy_user_highpage(struct page *to, struct page *from,
	unsigned long vaddr, struct vm_area_struct *vma)
{
	char *vfrom, *vto;

	vfrom = kmap_atomic(from);
	vto = kmap_atomic(to);
	copy_user_page(vto, vfrom, vaddr, to);
	kunmap_atomic(vto);
	kunmap_atomic(vfrom);
}
James Bottomley March 12, 2019, 10:57 p.m. UTC | #20
On Tue, 2019-03-12 at 18:50 -0400, Andrea Arcangeli wrote:
> On Tue, Mar 12, 2019 at 03:02:54PM -0700, James Bottomley wrote:
> > I'm sure there must be workarounds elsewhere in the other arch code
> > otherwise things like this, which appear all over drivers/,
> > wouldn't
> > work:
> > 
> > drivers/scsi/isci/request.c:1430
> > 
> > 	kaddr = kmap_atomic(page);
> > 	memcpy(kaddr + sg->offset, src_addr, copy_len);
> > 	kunmap_atomic(kaddr);
> > 
> 
> Are you sure "page" is an userland page with an alias address?
> 
> 	sg->page_link = (unsigned long)virt_to_page(addr);

Yes, it's an element of a scatter gather list, which may be either a
kernel page or a user page, but is usually the latter.

> page_link seems to point to kernel memory.
> 
> I found an apparent solution like parisc on arm 32bit:
> 
> void __kunmap_atomic(void *kvaddr)
> {
> 	unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
> 	int idx, type;
> 
> 	if (kvaddr >= (void *)FIXADDR_START) {
> 		type = kmap_atomic_idx();
> 		idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR *
> smp_processor_id();
> 
> 		if (cache_is_vivt())
> 			__cpuc_flush_dcache_area((void *)vaddr,
> PAGE_SIZE);
> 
> However on arm 64bit kunmap_atomic is not implemented at all and
> other 32bit implementations don't do it, for example sparc seems to
> do the cache flush too if the kernel is built with
> CONFIG_DEBUG_HIGHMEM (which makes the flushing conditional to the
> debug option).
> 
> The kunmap_atomic where fixmap is used, is flushing the tlb lazily so
> even on 32bit you can't even be sure if there was a tlb flush for
> each single page you unmapped, so it's hard to see how the above can
> work safe, is "page" would have been an userland page mapped with
> aliased CPU cache.
> 
> > the sequence dirties the kernel virtual address but doesn't flush
> > before doing kunmap.  There are hundreds of other examples which is
> > why I think adding flush_kernel_dcache_page() is an already lost
> > cause.
> 
> In lots of cases kmap is needed to just modify kernel memory not to
> modify userland memory (where get/put_user is more commonly used
> instead..), there's no cache aliasing in such case.

That's why I picked drivers/.  The use case in there is mostly kmap to
put a special value into a scatter gather list entry.

> > Actually copy_user_page() is unused in the main kernel.  The big
> > problem is copy_user_highpage() but that's mostly highly optimised
> > by the VIPT architectures (in other words you can fiddle with kmap
> > without impacting it).
> 
> copy_user_page is not unused, it's called precisely by
> copy_user_highpage, which is why the cache flushes are done inside
> copy_user_page.
> 
> static inline void copy_user_highpage(struct page *to, struct page
> *from,
> 	unsigned long vaddr, struct vm_area_struct *vma)
> {
> 	char *vfrom, *vto;
> 
> 	vfrom = kmap_atomic(from);
> 	vto = kmap_atomic(to);
> 	copy_user_page(vto, vfrom, vaddr, to);
> 	kunmap_atomic(vto);
> 	kunmap_atomic(vfrom);
> }

That's the asm/generic implementation.  Most VIPT architectures
override it.

James
Christoph Hellwig March 13, 2019, 4:05 p.m. UTC | #21
On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> I've got to say: optimize what?  What code do we ever have in the
> kernel that kmap's a page and then doesn't do anything with it? You can
> guarantee that on kunmap the page is either referenced (needs
> invalidating) or updated (needs flushing). The in-kernel use of kmap is
> always
> 
> kmap
> do something with the mapped page
> kunmap
> 
> In a very short interval.  It seems just a simplification to make
> kunmap do the flush if needed rather than try to have the users
> remember.  The thing which makes this really simple is that on most
> architectures flush and invalidate is the same operation.  If you
> really want to optimize you can use the referenced and dirty bits on
> the kmapped pte to tell you what operation to do, but if your flush is
> your invalidate, you simply assume the data needs flushing on kunmap
> without checking anything.

I agree that this would be a good way to simplify the API.   Now
we'd just need volunteers to implement this for all architectures
that need cache flushing and then remove the explicit flushing in
the callers..

> > Which means after we fix vhost to add the flush_dcache_page after
> > kunmap, Parisc will get a double hit (but it also means Parisc was
> > the only one of those archs needed explicit cache flushes, where
> > vhost worked correctly so far.. so it kinds of proofs your point of
> > giving up being the safe choice).
> 
> What double hit?  If there's no cache to flush then cache flush is a
> no-op.  It's also a highly piplineable no-op because the CPU has the L1
> cache within easy reach.  The only event when flush takes a large
> amount time is if we actually have dirty data to write back to main
> memory.

I've heard people complaining that on some microarchitectures even
no-op cache flushes are relatively expensive.  Don't ask me why,
but if we can easily avoid double flushes we should do that.
James Bottomley March 13, 2019, 4:37 p.m. UTC | #22
On Wed, 2019-03-13 at 09:05 -0700, Christoph Hellwig wrote:
> On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> > I've got to say: optimize what?  What code do we ever have in the
> > kernel that kmap's a page and then doesn't do anything with it? You
> > can
> > guarantee that on kunmap the page is either referenced (needs
> > invalidating) or updated (needs flushing). The in-kernel use of
> > kmap is
> > always
> > 
> > kmap
> > do something with the mapped page
> > kunmap
> > 
> > In a very short interval.  It seems just a simplification to make
> > kunmap do the flush if needed rather than try to have the users
> > remember.  The thing which makes this really simple is that on most
> > architectures flush and invalidate is the same operation.  If you
> > really want to optimize you can use the referenced and dirty bits
> > on the kmapped pte to tell you what operation to do, but if your
> > flush is your invalidate, you simply assume the data needs flushing
> > on kunmap without checking anything.
> 
> I agree that this would be a good way to simplify the API.   Now
> we'd just need volunteers to implement this for all architectures
> that need cache flushing and then remove the explicit flushing in
> the callers..

Well, it's already done on parisc ...  I can help with this if we agree
it's the best way forward.  It's really only architectures that
implement flush_dcache_page that would need modifying.

It may also improve performance because some kmap/use/flush/kunmap
sequences have flush_dcache_page() instead of
flush_kernel_dcache_page() and the former is hugely expensive and
usually unnecessary because GUP already flushed all the user aliases.

In the interests of full disclosure the reason we do it for parisc is
because our later machines have problems even with clean aliases.  So
on most VIPT systems, doing kmap/read/kunmap creates a fairly harmless
clean alias.  Technically it should be invalidated, because if you
remap the same page to the same colour you get cached stale data, but
in practice the data is expired from the cache long before that
happens, so the problem is almost never seen if the flush is forgotten.
 Our problem is on the P9xxx processor: they have a L1/L2 VIPT L3 PIPT
cache.  As the L1/L2 caches expire clean data, they place the expiring
contents into L3, but because L3 is PIPT, the stale alias suddenly
becomes the default for any read of the physical page because any
update which dirtied the cache line often gets written to main memory
and placed into the L3 as clean *before* the clean alias in L1/L2 gets
expired, so the older clean alias replaces it.

Our only recourse is to kill all aliases with prejudice before the
kernel loses ownership.

> > > Which means after we fix vhost to add the flush_dcache_page after
> > > kunmap, Parisc will get a double hit (but it also means Parisc
> > > was the only one of those archs needed explicit cache flushes,
> > > where vhost worked correctly so far.. so it kinds of proofs your
> > > point of giving up being the safe choice).
> > 
> > What double hit?  If there's no cache to flush then cache flush is
> > a no-op.  It's also a highly piplineable no-op because the CPU has
> > the L1 cache within easy reach.  The only event when flush takes a
> > large amount time is if we actually have dirty data to write back
> > to main memory.
> 
> I've heard people complaining that on some microarchitectures even
> no-op cache flushes are relatively expensive.  Don't ask me why,
> but if we can easily avoid double flushes we should do that.

It's still not entirely free for us.  Our internal cache line is around
32 bytes (some have 16 and some have 64) but that means we need 128
flushes for a page ... we definitely can't pipeline them all.  So I
agree duplicate flush elimination would be a small improvement.

James
Michael S. Tsirkin March 14, 2019, 10:42 a.m. UTC | #23
On Wed, Mar 13, 2019 at 09:37:08AM -0700, James Bottomley wrote:
> On Wed, 2019-03-13 at 09:05 -0700, Christoph Hellwig wrote:
> > On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> > > I've got to say: optimize what?  What code do we ever have in the
> > > kernel that kmap's a page and then doesn't do anything with it? You
> > > can
> > > guarantee that on kunmap the page is either referenced (needs
> > > invalidating) or updated (needs flushing). The in-kernel use of
> > > kmap is
> > > always
> > > 
> > > kmap
> > > do something with the mapped page
> > > kunmap
> > > 
> > > In a very short interval.  It seems just a simplification to make
> > > kunmap do the flush if needed rather than try to have the users
> > > remember.  The thing which makes this really simple is that on most
> > > architectures flush and invalidate is the same operation.  If you
> > > really want to optimize you can use the referenced and dirty bits
> > > on the kmapped pte to tell you what operation to do, but if your
> > > flush is your invalidate, you simply assume the data needs flushing
> > > on kunmap without checking anything.
> > 
> > I agree that this would be a good way to simplify the API.   Now
> > we'd just need volunteers to implement this for all architectures
> > that need cache flushing and then remove the explicit flushing in
> > the callers..
> 
> Well, it's already done on parisc ...  I can help with this if we agree
> it's the best way forward.  It's really only architectures that
> implement flush_dcache_page that would need modifying.
> 
> It may also improve performance because some kmap/use/flush/kunmap
> sequences have flush_dcache_page() instead of
> flush_kernel_dcache_page() and the former is hugely expensive and
> usually unnecessary because GUP already flushed all the user aliases.
> 
> In the interests of full disclosure the reason we do it for parisc is
> because our later machines have problems even with clean aliases.  So
> on most VIPT systems, doing kmap/read/kunmap creates a fairly harmless
> clean alias.  Technically it should be invalidated, because if you
> remap the same page to the same colour you get cached stale data, but
> in practice the data is expired from the cache long before that
> happens, so the problem is almost never seen if the flush is forgotten.
>  Our problem is on the P9xxx processor: they have a L1/L2 VIPT L3 PIPT
> cache.  As the L1/L2 caches expire clean data, they place the expiring
> contents into L3, but because L3 is PIPT, the stale alias suddenly
> becomes the default for any read of they physical page because any
> update which dirtied the cache line often gets written to main memory
> and placed into the L3 as clean *before* the clean alias in L1/L2 gets
> expired, so the older clean alias replaces it.
> 
> Our only recourse is to kill all aliases with prejudice before the
> kernel loses ownership.
> 
> > > > Which means after we fix vhost to add the flush_dcache_page after
> > > > kunmap, Parisc will get a double hit (but it also means Parisc
> > > > was the only one of those archs needed explicit cache flushes,
> > > > where vhost worked correctly so far.. so it kinds of proofs your
> > > > point of giving up being the safe choice).
> > > 
> > > What double hit?  If there's no cache to flush then cache flush is
> > > a no-op.  It's also a highly piplineable no-op because the CPU has
> > > the L1 cache within easy reach.  The only event when flush takes a
> > > large amount time is if we actually have dirty data to write back
> > > to main memory.
> > 
> > I've heard people complaining that on some microarchitectures even
> > no-op cache flushes are relatively expensive.  Don't ask me why,
> > but if we can easily avoid double flushes we should do that.
> 
> It's still not entirely free for us.  Our internal cache line is around
> 32 bytes (some have 16 and some have 64) but that means we need 128
> flushes for a page ... we definitely can't pipeline them all.  So I
> agree duplicate flush elimination would be a small improvement.
> 
> James

I suspect we'll keep the copyXuser path around for 32 bit anyway -
right Jason?
So we can also keep using that on parisc...
Jason Wang March 14, 2019, 1:49 p.m. UTC | #24
On 2019/3/14 下午6:42, Michael S. Tsirkin wrote:
>>>>> Which means after we fix vhost to add the flush_dcache_page after
>>>>> kunmap, Parisc will get a double hit (but it also means Parisc
>>>>> was the only one of those archs needed explicit cache flushes,
>>>>> where vhost worked correctly so far.. so it kinds of proofs your
>>>>> point of giving up being the safe choice).
>>>> What double hit?  If there's no cache to flush then cache flush is
>>>> a no-op.  It's also a highly piplineable no-op because the CPU has
>>>> the L1 cache within easy reach.  The only event when flush takes a
>>>> large amount time is if we actually have dirty data to write back
>>>> to main memory.
>>> I've heard people complaining that on some microarchitectures even
>>> no-op cache flushes are relatively expensive.  Don't ask me why,
>>> but if we can easily avoid double flushes we should do that.
>> It's still not entirely free for us.  Our internal cache line is around
>> 32 bytes (some have 16 and some have 64) but that means we need 128
>> flushes for a page ... we definitely can't pipeline them all.  So I
>> agree duplicate flush elimination would be a small improvement.
>>
>> James
> I suspect we'll keep the copyXuser path around for 32 bit anyway -
> right Jason?


Yes since we don't want to slow down 32bit.

Thanks


> So we can also keep using that on parisc...
>
> --
Andrea Arcangeli March 14, 2019, 7:33 p.m. UTC | #25
Hello Jason,

On Thu, Mar 14, 2019 at 09:49:03PM +0800, Jason Wang wrote:
> Yes since we don't want to slow down 32bit.

If you've a lot of ram there's no justification to stick to a 32bit
kernel, so I don't think there's need to maintain a separate model
just for 32bit. I really wouldn't care about the performance of 32bit
with >700MB of RAM if that would cause any maintenance burden. Let's
focus on the best 64bit implementation that will work equally
optimally on 32bit with <= 700M of RAM.

Talking to Jerome about the set_page_dirty issue, he raised the point
of what happens if two threads call an mmu notifier invalidate
simultaneously. The first mmu notifier could call set_page_dirty and
then proceed into try_to_free_buffers or page_mkclean, and the
concurrent mmu notifier that arrives second must not call
set_page_dirty a second time.

With KVM sptes mappings and vhost mappings you would call
set_page_dirty (if you invoked gup with FOLL_WRITE) only when
effectively tearing down any secondary mapping (you've got pointers in
both cases for the mapping). So there's no way to risk a double
set_page_dirty from concurrent mmu notifier invalidate because the
invalidate takes a lock when it has to teardown the mapping and so
set_page_dirty is only run in the first invalidate method and not in
the second. In the spte case even better, as you wouldn't need to call
it even at teardown time unless the spte is dirty (if shadow mask
allows dirty sptes with EPT2 or NPT or shadow MMU).

If you instead had to invalidate a secondary MMU mapping that isn't
tracked by the driver (again: not vhost nor KVM case), you could have
used the dirty bit of the kernel pagetable to call set_page_dirty and
disambiguate but that's really messy, and it would prevent the use of
gigapages in the direct mapping too and it'd require vmap for 4k
tracking.

To make sure set_page_dirty is run a single time no matter whether the
invalidate knows when a mapping is torn down, I suggested the below
model:

  access = FOLL_WRITE

repeat:
  page = gup_fast(access)
  put_page(page) /* need a way to drop FOLL_GET from gup_fast instead! */

  spin_lock(mmu_notifier_lock);
  if (race with invalidate) {
    spin_unlock..
    goto repeat;
  }
  if (access == FOLL_WRITE)
    set_page_dirty(page)
  establish writable mapping in secondary MMU on page
  spin_unlock

(replace spin_lock with mutex_lock for vhost of course if you stick to
a mutex and _start/_end instead of non-sleepable ->invalidate_range)

"race with invalidate" is the usual "mmu_notifier_retry" in kvm_host.h
to be implemented for vhost.

We could add a FOLL_DIRTY flag to add to FOLL_TOUCH to move the
set_page_dirty inside GUP forced (currently it's not forced if the
linux pte is already dirty). And we could remove FOLL_GET.

Then if you have the ability to disambiguate which is the first
invalidate that tears down a mapping to any given page (vhost can do
that trivially, it's just a pointer to a page struct to kmap), in the
mmu notifier invalidate just before dropping the spinlock you would
do this check:

def vhost_invalidate_range_start():
   [..]
   spin_lock(mmu_notifier_lock);
   [..]
   if (vhost->page_pointer) {
      if (access == FOLL_WRITE)
	VM_WARN_ON(!PageDirty(vhost->page_pointer));
      vhost->page_pointer = NULL;
      /* no put_page, already done at gup time */
   }
   spin_unlock(..

Generally speaking set_page_dirty is safer done after the last
modification to the data of the page. However the way stable page
works, as long as the mmu notifier invalidate didn't run, the PG_dirty
cannot go away.
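
For clarity, here is the same model rendered as a self-contained C
sketch. It is purely illustrative: struct vhost_map, invalidate_seq and
the function names are invented, the counter plays the role of
mmu_notifier_retry, and get_user_pages_fast() is assumed to take the
(addr, nr_pages, write, pages) arguments of this kernel generation; the
real vhost code would hang this off the vq and its mutex as noted above.

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/mutex.h>

/* Illustrative only: not the real vhost data structures. */
struct vhost_map {
        struct mutex lock;              /* the "mmu_notifier_lock" of the model */
        unsigned long invalidate_seq;   /* bumped by every invalidate */
        unsigned long uaddr;            /* userspace address of the metadata page */
        struct page *page;              /* non-NULL while the secondary mapping is live */
        void *addr;                     /* kmap()ed address of that page */
};

static int vhost_map_page(struct vhost_map *map, bool write)
{
        struct page *page;
        unsigned long seq;

repeat:
        seq = READ_ONCE(map->invalidate_seq);
        if (get_user_pages_fast(map->uaddr, 1, write, &page) != 1)
                return -EFAULT;
        put_page(page);         /* drop FOLL_GET right away, as in the model */

        mutex_lock(&map->lock);
        if (seq != READ_ONCE(map->invalidate_seq)) {
                /* raced with an invalidate: the page may be stale, retry */
                mutex_unlock(&map->lock);
                goto repeat;
        }
        if (write)
                set_page_dirty(page);   /* the single, pre-invalidate set_page_dirty */
        map->page = page;
        map->addr = kmap(page);         /* establish the secondary mapping */
        mutex_unlock(&map->lock);
        return 0;
}

/*
 * The first invalidate tears the mapping down; a concurrent second one
 * finds ->page == NULL and does nothing, so set_page_dirty is never
 * needed twice and the PageDirty check can stay a debug assertion.
 */
static void vhost_map_invalidate(struct vhost_map *map, bool write)
{
        mutex_lock(&map->lock);
        map->invalidate_seq++;
        if (map->page) {
                if (write)
                        VM_WARN_ON(!PageDirty(map->page));
                kunmap(map->page);
                map->page = NULL;       /* no put_page(): already dropped at gup time */
        }
        mutex_unlock(&map->lock);
}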

So this model solves the issue with guaranteeing a single
set_page_dirty is run before page_mkclean or try_to_free_buffers can
run, even for drivers that implement the invalidate as a generic "TLB
flush" across the whole secondary MMU and that cannot disambiguate the
first invalidate from a second invalidate if they're issued
concurrently on the same address by two different CPUs.

So for those drivers that can disambiguate trivially (like KVM and
vhost) we'll just add a debug check in the invalidate to validate the
common code for all mmu notifier users.

This is the solution for RDMA hardware and everything that can support
mmu notifiers too, and they can take infinitely long secondary MMU
mappings without interfering with stable pages at all (i.e. long term
pins but without pinning), perfectly safe and transparent to the whole
stable page code.

I think O_DIRECT for stable pages shall be solved taking the page lock
or writeback lock or a new rwsem in the inode that is taken for
writing by page_mkclean and try_to_free_buffers and for reading by
outstanding O_DIRECT in flight I/O, like I suggested probably ages ago
but then we only made GUP take the page pin, which is fine for mmu
notifier actually (except those didn't exist back then). To solve
O_DIRECT we can leverage the 100% guarantee that the pin will be
dropped ASAP and stop page_mkclean and stop or trylock in
try_to_free_buffers in such case.
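
Sketched in kernel terms, with entirely hypothetical names (no such
lock exists today), the idea would look something like:

#include <linux/fs.h>
#include <linux/rwsem.h>

/* hypothetical: a per-inode rwsem guarding "stable page" writeback */
struct inode_stable_pages {
        struct rw_semaphore dio_rwsem;
};

/* O_DIRECT submission path: taken for reading, dropped at I/O completion */
static void dio_pin_pages(struct inode_stable_pages *isp)
{
        down_read(&isp->dio_rwsem);
        /* ... gup the user pages and issue the I/O ... */
}

static void dio_complete_pages(struct inode_stable_pages *isp)
{
        /* ... put the pages ... */
        up_read(&isp->dio_rwsem);
}

/*
 * page_mkclean()/try_to_free_buffers() side: taken for writing, so it
 * waits for (or trylocks against) any in-flight O_DIRECT on the inode.
 */
static bool stable_pages_begin_clean(struct inode_stable_pages *isp,
                                     bool may_block)
{
        if (!may_block)
                return down_write_trylock(&isp->dio_rwsem);
        down_write(&isp->dio_rwsem);
        return true;
}

static void stable_pages_end_clean(struct inode_stable_pages *isp)
{
        up_write(&isp->dio_rwsem);
}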

mm_take_all_locks is a major hurdle that prevents usage in the O_DIRECT
case: even if we "cache it", a fork(); write(); exit() loop will still
cause heavy lock overhead. MMU notifier registration isn't
intended to happen in fast and frequent paths like the write()
syscall. Dropping mm_take_all_locks would bring other downsides: a
regular quiescent point can happen in between _start/_end, and since
_start must always be called first, all the mmu notifier retry counters
we rely on would break. One way would be to add srcu_read_lock _before_
you can call mm_has_notifiers(mm); then yes we could replace
mm_take_all_locks with synchronize_srcu. It would save a lot of CPU
and a ton of locked operations, but it'd potentially increase the
latency of the registration so the first O_DIRECT write() in a process
could still potentially stall (still better than mm_take_all_locks
which would use a lot more CPU and hurt SMP scalability in threaded
programs). The downside is all VM fast paths would get some overhead
because of srcu_read_lock even when mmu notifier is not registered,
which is what we carefully avoided by taking a larger hit in the
registration with mm_take_all_locks. This is why I don't think mmu
notifier is a good solution to solve O_DIRECT stable pages, even though
in theory O_DIRECT could use the exact same model as vhost to solve
stable pages.

If we solve O_DIRECT with new per-page locking or a new rwsem inode
lock leveraging the fact we're guaranteed the pin to go away ASAP,
what's left is the case of PCI devices mapping userland memory for
indefinite amount of time that cannot support MMU notifier because of
hardware limitations.

Mentioning virtualization as a case taking a long term PIN is incorrect;
that hasn't happened since the introduction of MMU notifier.

vfio for device assignment to KVM takes the long term pins, but that's
because the iommus may not support the mmu notifier; mmu notifier
could solve the vfio case too.

PCI devices that pretend to keep a constant mapping on userland
virtual memory and that cannot support MMU notifier because they lack
a secondary MMU, cripple the Linux VM and there's no solution to
that. Even if we solve the stable pages problem, they will still
practically disable all advanced VM features.

I think it would be ok to solve the stable pages in 3 different ways:

1) mmu notifier as above when mmu_notifier_register doesn't need to
   run often and the hardware can support it

2) O_DIRECT with new locking stopping page_mkclean/try_to_free_buffers
   until I/O completion, leveraging the fact the pin&lock are
   guaranteed to be dropped&released ASAP

3) something else for pci devices that cannot support MMU notifier
   because of hardware limitations, bounce buffers would be fine as
   well

I'm not even sure if in 3) it is worth worrying about being able to
routinely flush to disk the dirty data, but then bounce buffers could
solve that. Altering the page mapped in the pte sounds like a complex
solution when you could copy the physical page just before issuing the
I/O in the writeback code. To find if a GUP pin exists it's enough to
check what KSM does:

		/*
		 * Check that no O_DIRECT or similar I/O is in progress on the
		 * page
		 */
		if (page_mapcount(page) + 1 + swapped != page_count(page)) {

That can give false positives (even through random pins coming from
speculative cache lookups), but not false negatives.
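
Wrapped as a helper the check is a one-liner (the function name is
invented; in KSM "swapped" comes from PageSwapCache(page) and the +1
accounts for the extra reference the caller itself holds on the page):

#include <linux/mm.h>

static bool page_maybe_gup_pinned(struct page *page, int swapped)
{
        /*
         * More references than mappings (+1 for our own reference,
         * + swapcache) means someone else, e.g. an in-flight O_DIRECT,
         * holds a pin. False positives possible, false negatives not.
         */
        return page_mapcount(page) + 1 + swapped != page_count(page);
}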

Thanks,
Andrea
Jason Wang March 15, 2019, 4:39 a.m. UTC | #26
On 2019/3/15 上午3:33, Andrea Arcangeli wrote:
> Hello Jason,
>
> On Thu, Mar 14, 2019 at 09:49:03PM +0800, Jason Wang wrote:
>> Yes since we don't want to slow down 32bit.
> If you've a lot of ram there's no justification to stick to a 32bit
> kernel, so I don't think there's need to maintain a separate model
> just for 32bit. I really wouldn't care about the performance of 32bit
> with >700MB of RAM if that would cause any maintenance burden. Let's
> focus on the best 64bit implementation that will work equally
> optimally on 32bit with <= 700M of RAM.


Yes, but there are probably still some reasons to keep the copy_user()
friends as a fallback. When we have a large virtqueue, the ring may
occupy more than one page, which means the VA might not be contiguous
when using kmap(). Instead of doing tricks in the accessors, maybe it's
simpler and better to just fall back to copy_user() in this case. We
meet a similar issue when the software device IOTLB is used for vhost.
And in the gup example below, we can simply fall back when we race with
the invalidation.

Michael also tends to keep the copy_user(): he suggested using
copy_user() for VIVT archs so there's no need for an explicit
flush_dcache_page(). And he also wants a module parameter for falling
back to copy_user(), e.g. for debugging purposes.
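
As a rough illustration of such a fallback gate (the module parameter,
the helper name and the single-page restriction are assumptions of this
sketch, aimed at the kmap()-style direct mapping discussed here rather
than at the posted vmap() series):

#include <linux/module.h>
#include <linux/mm.h>

/* hypothetical knob, mirroring the suggestion of a debug fallback */
static bool vmap_metadata = true;
module_param(vmap_metadata, bool, 0644);
MODULE_PARM_DESC(vmap_metadata,
                 "Access vq metadata via direct mapping instead of copy_user");

/*
 * Decide whether a metadata range can go through the direct mapping
 * path or must fall back to copy_user():
 *  - the knob may force the fallback (debugging, VIVT archs, ...);
 *  - a range crossing a page boundary is not virtually contiguous
 *    with per-page kmap(), so fall back as well.
 */
static bool vhost_meta_uses_vmap(unsigned long uaddr, size_t len)
{
        if (!vmap_metadata)
                return false;
        if ((uaddr & PAGE_MASK) != ((uaddr + len - 1) & PAGE_MASK))
                return false;   /* spans more than one page */
        return true;
}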


>
> Talking to Jerome about the set_page_dirty issue, he raised the point
> of what happens if two threads call a mmu notifier invalidate
> simultaneously. The first mmu notifier could call set_page_dirty and
> then proceed in try_to_free_buffers or page_mkclean, and then the
> concurrent mmu notifier that arrives second must not call
> set_page_dirty a second time.
>
> With KVM sptes mappings and vhost mappings you would call
> set_page_dirty (if you invoked gup with FOLL_WRITE) only when
> effectively tearing down any secondary mapping (you've got pointers in
> both cases for the mapping). So there's no way to risk a double
> set_page_dirty from concurrent mmu notifier invalidate because the
> invalidate takes a lock when it has to teardown the mapping and so
> set_page_dirty is only run in the first invalidate method and not in
> the second. In the spte case even better, as you wouldn't need to call
> it even at teardown time unless the spte is dirty (if shadow mask
> allows dirty sptes with EPT2 or NPT or shadow MMU).


I see, that indeed sounds better.


>
> If you instead had to invalidate a secondary MMU mapping that isn't
> tracked by the driver (again: not vhost nor KVM case), you could have
> used the dirty bit of the kernel pagetable to call set_page_dirty and
> disambiguate but that's really messy, and it would prevent the use of
> gigapages in the direct mapping too and it'd require vmap for 4k
> tracking.
>
> To make sure set_page_dirty is run a single time no matter whether the
> invalidate knows when a mapping is torn down, I suggested the below
> model:
>
>    access = FOLL_WRITE
>
> repeat:
>    page = gup_fast(access)
>    put_page(page) /* need a way to drop FOLL_GET from gup_fast instead! */
>
>    spin_lock(mmu_notifier_lock);
>    if (race with invalidate) {
>      spin_unlock..
>      goto repeat;
>    }
>    if (access == FOLL_WRITE)
>      set_page_dirty(page)
>    establish writable mapping in secondary MMU on page
>    spin_unlock
>
> (replace spin_lock with mutex_lock for vhost of course if you stick to
> a mutex and _start/_end instead of non-sleepable ->invalidate_range)


Yes, I'll probably stick to the vq mutex since the invalidation needs to
be synchronized with the whole packet processing routine.


> "race with invalidate" is the usual "mmu_notifier_retry" in kvm_host.h
> to be implemented for vhost.
>
> We could add a FOLL_DIRTY flag to add to FOLL_TOUCH to move the
> set_page_dirty inside GUP forced (currently it's not forced if the
> linux pte is already dirty). And we could remove FOLL_GET.
>
> Then if you have the ability to disambiguate which is the first
> invalidate that tears down a mapping to any given page (vhost can do
> that trivially, it's just a pointer to a page struct to kmap), in the
> mmu notifier invalidate just before dropping the spinlock you would
> do this check:
>
> def vhost_invalidate_range_start():
>     [..]
>     spin_lock(mmu_notifier_lock);
>     [..]
>     if (vhost->page_pointer) {
>        if (access == FOLL_WRITE)
> 	VM_WARN_ON(!PageDirty(vhost->page_pointer));
>        vhost->page_pointer = NULL;
>        /* no put_page, already done at gup time */
>     }
>     spin_unlock(..
>
> Generally speaking set_page_dirty is safer done after the last
> modification to the data of the page. However the way stable page
> works, as long as the mmu notifier invalidate didn't run, the PG_dirty
> cannot go away.


Ok.


>
> So this model solves the issue with guaranteeing a single
> set_page_dirty is run before page_mkclean or try_to_free_buffers can
> run, even for drivers that implement the invalidate as a generic "TLB
> flush" across the whole secondary MMU and that cannot disambiguate the
> first invalidate from a second invalidate if they're issued
> concurrently on the same address by two different CPUs.
>
> So for those drivers that can disambiguate trivially (like KVM and
> vhost) we'll just add a debug check in the invalidate to validate the
> common code for all mmu notifier users.
>
> This is the solution for RDMA hardware and everything that can support
> mmu notifiers too, and they can take infinitely long secondary MMU
> mappings without interfering with stable pages at all (i.e. long term
> pins but without pinning), perfectly safe and transparent to the whole
> stable page code.
>
> I think O_DIRECT for stable pages shall be solved taking the page lock
> or writeback lock or a new rwsem in the inode that is taken for
> writing by page_mkclean and try_to_free_buffers and for reading by
> outstanding O_DIRECT in flight I/O, like I suggested probably ages ago
> but then we only made GUP take the page pin, which is fine for mmu
> notifier actually (except those didn't exist back then). To solve
> O_DIRECT we can leverage the 100% guarantee that the pin will be
> dropped ASAP and stop page_mkclean and stop or trylock in
> try_to_free_buffers in such case.
>
> mm_take_all_locks is major hurdle that prevents usage in O_DIRECT
> case, even if we "cache it" if you fork(); write; exit() in a loop
> it'll still cause heavy lock overhead. MMU notifier registration isn't
> intended to happen in fast and frequent paths like the write()
> syscall. Dropping mm_take_all_locks would bring other downsides: a
> regular quiescent point can happen in between _start/_end and _start
> must be always called first all the mmu notifier retry counters we
> rely on would break. One way would be to add srcu_read_lock _before_
> you can call mm_has_notifiers(mm), then yes we could replace
> mm_take_all_locks with synchronize_srcu. It would save a lot of CPU
> and a ton of locked operations, but it'd potentially increase the
> latency of the registration so the first O_DIRECT write() in a process
> could still potentially stall (still better than mm_take_all_locks
> which would use a lot more CPU and hurt SMP scalability in threaded
> programs). The downside is all VM fast paths would get some overhead
> because of srcu_read_lock even when mmu notifier is not registered,
> which is what we carefully avoided by taking a larger hit in the
> registration with mm_take_all_locks. This is why I don't think mmu
> notifier is a good solution to solve O_DIRECT stable pages even in
> theory O_DIRECT could use the exact same model as vhost to solve
> stable pages.
>
> If we solve O_DIRECT with new per-page locking or a new rwsem inode
> lock leveraging the fact we're guaranteed the pin to go away ASAP,
> what's left is the case of PCI devices mapping userland memory for
> indefinite amount of time that cannot support MMU notifier because of
> hardware limitations.


Yes, and this is part of the issue we met in the vhost TX zerocopy code.
What's more interesting is that an skb could be held by a software
layer, e.g. a qdisc, as well. MMU notifier could be used for dealing
with such software holding, e.g. by copying packets into new pages. But
this may require more thought since it may race with the networking
stack.


>
> Mentioning virtualization as a case taking long term PIN is incorrect,
> that didn't happen since the introduction of MMU notifier.
>
> vfio for device assignment to KVM takes the long term pins, but that's
> because the iommus may not support the mmu notifier, mmu notifier
> could solve the vfio case too.


This might require support for page faults from the IOMMU hardware.


>
> PCI devices that pretend to keep a constant mapping on userland
> virtual memory and that cannot support MMU notifier because they lack
> a secondary MMU, cripple the Linux VM and there's no solution to
> that. Even if we solve the stable pages problem, they will still
> practically disable all advanced VM features.
>
> I think it would be ok to solve the stable pages in 3 different ways:
>
> 1) mmu notifier as above when mmu_notifier_register doesn't need to
>     run often and the hardware can support it
>
> 2) O_DIRECT with new locking stopping page_mkclean/try_to_free_buffers
>     until I/O completion, leveraging the fact the pin&lock are
>     guaranteed to be dropped&released ASAP
>
> 3) something else for pci devices that cannot support MMU notifier
>     because of hardware limitations, bounce buffers would be fine as
>     well
>
> I'm not even sure if in 3) it is worth worrying about being able to
> routinely flush to disk the dirty data, but then bounce buffers could
> solve that. Altering the page mapped in the pte sounds like a complex
> solution when you could copy the physical page just before issuing the
> I/O in the writeback code. To find if a GUP pin exists it's enough to
> check what KSM does:
>
> 		/*
> 		 * Check that no O_DIRECT or similar I/O is in progress on the
> 		 * page
> 		 */
> 		if (page_mapcount(page) + 1 + swapped != page_count(page)) {
>
> That can give false positives (even through random pins coming from
> speculative cache lookups), but not false negatives.
>
> Thanks,
> Andrea


Thanks for the patient and detailed explanation with lots of
background. It helps a lot to understand the whole picture.