[RFC,3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly

Message ID ZGdlff80rZ6v4LNO@MiWiFi-R3L-srv (mailing list archive)
State New
Series [RFC,1/3] mm/vmalloc.c: try to flush vmap_area one by one

Commit Message

Baoquan He May 19, 2023, 12:03 p.m. UTC
After a vb_free() invocation, the va is purged and put onto the purge
tree/list if the entire vmap_block is dirty. If it is not entirely dirty,
the vmap_block stays in the percpu vmap_block_queue list, as in the two
diagrams below:

(1)
  |-----|------------|-----------|-------|
  |dirty|still mapped|   dirty   | free  |

(2)
  |------------------------------|-------|
  |         dirty                | free  |

In the current _vm_unmap_aliases(), to reclaim those unmapped ranges and
flush them, it iterates the percpu vbq and calculates the flush range
from each vmap_block, covering both cases above. Then it calls
purge_fragmented_blocks_allcpus() to purge the vmap_blocks of case (2),
since no mapping exists in them any more, and puts their va into the
purge tree/list. Later, __purge_vmap_area_lazy() calculates the flush
range from the purge list again. This obviously takes the vmap_block va
of case (2) into account twice.

So move purge_fragmented_blocks_allcpus() up to purge the vmap_block va
of case (2) first; then only the dirty ranges of case (1) need to be
iterated and counted. With this change, counting the dirty region of a
vmap_block in case (1) now happens under vmap_purge_lock, which makes
the flush range calculation more reasonable and accurate by avoiding
concurrent operations on other CPUs.

Also rename _vm_unmap_aliases() to vm_unmap_aliases(), since it has no
caller other than the old vm_unmap_aliases().

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 mm/vmalloc.c | 45 ++++++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

Comments

Thomas Gleixner May 19, 2023, 2:17 p.m. UTC | #1
On Fri, May 19 2023 at 20:03, Baoquan He wrote:
> @@ -2245,6 +2260,9 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
>  
>  	might_sleep();
>  
> +	mutex_lock(&vmap_purge_lock);
> +	purge_fragmented_blocks_allcpus();
> +
>  	for_each_possible_cpu(cpu) {

That still does TWO iterations over all possible CPUs, while this really
could be one.
Thomas Gleixner May 19, 2023, 6:38 p.m. UTC | #2
On Fri, May 19 2023 at 20:03, Baoquan He wrote:
> After vb_free() invocation, the va will be purged and put into purge
> tree/list if the entire vmap_block is dirty. If not entirely dirty, the
> vmap_block is still in percpu vmap_block_queue list, just like below two
> graphs:
>
> (1)
>   |-----|------------|-----------|-------|
>   |dirty|still mapped|   dirty   | free  |
>
> 2)
>   |------------------------------|-------|
>   |         dirty                | free  |
>
> In the current _vm_unmap_aliases(), to reclaim those unmapped range and
> flush, it will iterate percpu vbq to calculate the range from vmap_block
> like above two cases. Then call purge_fragmented_blocks_allcpus()
> to purge the vmap_block in case 2 since no mapping exists right now,
> and put these purged vmap_block va into purge tree/list. Then in
> __purge_vmap_area_lazy(), it will continue calculating the flush range
> from purge list. Obviously, this will take vmap_block va in the 2nd case
> into account twice.

Which made me look deeper into purge_fragmented_blocks()

	list_for_each_entry_rcu(vb, &vbq->free, free_list) {

		if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
			continue;

                spin_lock(&vb->lock);
		if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) {

That means if an allocation does not find something in the free list
then this can happen:

vaddr = vb_alloc(size)
     vaddr = new_vmap_block(order, gfp_mask);

vb_free(vaddr, size)
     vb->dirty = 1ULL << order;

purge_fragmented_blocks()
      purge(most_recently_allocated_block);

vaddr = vb_alloc(size)
     vaddr = new_vmap_block(order, gfp_mask);

How does that make sense?

That block would have hundreds of pages left and is the most recently
allocated. So the next vb_alloc() has to reallocate one instead of using
the one which was allocated just before.

This clearly lacks a free space check so that blocks which have more
free space than a certain threshold are not thrown away prematurely.
Maybe it wants an age check too, so that blocks which are unused for a
long time can be recycled, but that's an orthogonal issue.
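
For illustration, a rough, untested sketch of such a free space check on
top of the existing purge_fragmented_blocks() loop could look like the
below. VMAP_PURGE_THRESHOLD is a made-up name and value here, not an
existing symbol; the point is only to skip blocks which still have plenty
of usable space:

	/* Hypothetical cut-off: keep blocks with more than 3/4 free space. */
	#define VMAP_PURGE_THRESHOLD	(3 * VMAP_BBMAP_BITS / 4)

	list_for_each_entry_rcu(vb, &vbq->free, free_list) {

		/* Don't throw away blocks which are still mostly usable. */
		if (vb->free > VMAP_PURGE_THRESHOLD)
			continue;

		if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
			continue;

		spin_lock(&vb->lock);
		...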


That aside your patch does still not address what I pointed out to you
and what my patch cures:

		 pages     bits    dirtymin            dirtymax
   vb_alloc(A)   255     0 -  254  VMAP_BBMAP_BITS     0
   vb_alloc(B)   255   255 -  509  VMAP_BBMAP_BITS     0
   vb_alloc(C)   255   510 -  764  VMAP_BBMAP_BITS     0
   vb_alloc(D)   255   765 - 1020  VMAP_BBMAP_BITS     0

The block stays on the free list because there are still 4 pages left
and it stays there until either _all_ free space is used or _all_
allocated space is freed.

Now the first allocation gets freed:

   vb_free(A)    255     0 - 254  0                    254

From there on _every_ invocation of __purge_vmap_area_lazy() will see
this range as long as the block is on the free list:

   list_for_each_entry_rcu(vb, &vbq->free, free_list) {
	spin_lock(&vb->lock);
	if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {

because this condition is true. So this flushes the same range over and
over, no?

This flush range gets larger over time the more allocations are freed up
to the point where the block vanishes from the free list.

By resetting vb->dirty_min/max the freed range is only flushed _once_,
no? The resulting flush range might still be excessively large as I
pointed out before:

1) Flush after freeing A

   vb_free(A)    2     0 - 1  0			  1
   Flush                      VMAP_BBMAP_BITS     0    <- correct
   vb_free(C)    2     6 - 7  6                   7
   Flush                      VMAP_BBMAP_BITS     0    <- correct

2) No flush between freeing A and C

   vb_free(A)    2     0 - 1  0			  1
   vb_free(C)    2     6 - 7  0                   7     
   Flush                      VMAP_BBMAP_BITS     0     <- overbroad flush

3) No flush between freeing A, C, B

   vb_free(A)    2     0 - 1  0			  1
   vb_free(B)    2     6 - 7  0                   7
   vb_free(C)    2     2 - 5  0                   7
   Flush                      VMAP_BBMAP_BITS     0    <- correct

Obviously case 2 could be

   vb_free(A)    2     0 - 1     0		     1
   vb_free(X)    2  1000 - 1001  1000                1001

   So that flush via purge_vmap_area_list() will ask to flush 1002 pages
   instead of 4, right? Again, that does not make sense.
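
For reference, a minimal sketch of the dirty_min/dirty_max reset idea
inside the _vm_unmap_aliases() iteration, i.e. clearing the per-block
dirty window once it has been accounted into the global flush range.
This only illustrates the principle, it is not the actual cure patch:

	spin_lock(&vb->lock);
	if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
		unsigned long va_start = vb->va->va_start;
		unsigned long s, e;

		s = va_start + (vb->dirty_min << PAGE_SHIFT);
		e = va_start + ((vb->dirty_max + 1) << PAGE_SHIFT);

		start = min(s, start);
		end   = max(e, end);

		/* Reset the window so this range is not flushed again. */
		vb->dirty_min = VMAP_BBMAP_BITS;
		vb->dirty_max = 0;

		flush = true;
	}
	spin_unlock(&vb->lock);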


The other issue I pointed out:

Assume the block has (for simplicity) 255 allocations size of 4 pages,
again free space of 4 pages.

254 allocations are freed, which means there is one remaining
mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
time.

Now the last allocation is freed and the block is moved to the
purge_vmap_area_list, which then does a full flush of the complete area,
i.e. 4MB in that case, while in fact it only needs to flush 2 pages.


Also these intermediate flushes are inconsistent versus how fully
utilized blocks are handled:

vb_alloc()
      if (vb->free == 0)
          list_del(vb->free_list);

So all allocations which are freed after that point stay unflushed until
the last allocation is freed which moves the block to the
purge_vmap_area_list, where it gets a full VA range flush.

IOW, for blocks on the free list this cares about unflushed mappings of
freed spaces, but for fully utilized blocks with freed spaces it does
obviously not matter, right?

So either we care about flushing the mappings of freed spaces or we do
not, but caring in one case and ignoring it in the other case is
inconsistent at best.

Thanks,

        tglx
Baoquan He May 19, 2023, 11:46 p.m. UTC | #3
On 05/19/23 at 08:38pm, Thomas Gleixner wrote:
> On Fri, May 19 2023 at 20:03, Baoquan He wrote:
> > After vb_free() invocation, the va will be purged and put into purge
> > tree/list if the entire vmap_block is dirty. If not entirely dirty, the
> > vmap_block is still in percpu vmap_block_queue list, just like below two
> > graphs:
> >
> > (1)
> >   |-----|------------|-----------|-------|
> >   |dirty|still mapped|   dirty   | free  |
> >
> > 2)
> >   |------------------------------|-------|
> >   |         dirty                | free  |
> >
> > In the current _vm_unmap_aliases(), to reclaim those unmapped range and
> > flush, it will iterate percpu vbq to calculate the range from vmap_block
> > like above two cases. Then call purge_fragmented_blocks_allcpus()
> > to purge the vmap_block in case 2 since no mapping exists right now,
> > and put these purged vmap_block va into purge tree/list. Then in
> > __purge_vmap_area_lazy(), it will continue calculating the flush range
> > from purge list. Obviously, this will take vmap_block va in the 2nd case
> > into account twice.
> 
> Which made me look deeper into purge_fragmented_blocks()
> 
> 	list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> 
> 		if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
> 			continue;
> 
>                 spin_lock(&vb->lock);
> 		if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) {
> 
> That means if an allocation does not find something in the free list
> then this can happen:
> 
> vaddr = vb_alloc(size)
>      vaddr = new_vmap_block(order, gfp_mask);
> 
> vb_free(vaddr, size)
>      vb->dirty = 1ULL << order;
> 
> purge_fragmented_blocks()
>       purge(most_recently_allocated_block);
> 
> vaddr = vb_alloc(size)
>      vaddr = new_vmap_block(order, gfp_mask);
> 
> How does that make sense?
> 
> That block would have hundreds of pages left and is the most recently
> allocated. So the next vb_alloc() has to reallocate one instead of using
> the one which was allocated just before.
> 
> This clearly lacks a free space check so that blocks which have more
> free space than a certain threshold are not thrown away prematurely.
> Maybe it wants an age check too, so that blocks which are unused for a
> long time can be recycled, but that's an orthogonal issue.

You are right, the vmap_block alloc/free path does have the issue you
pointed out here. What I can say in its defence is that it should be fine
as long as VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we
can see, the lazy flush is only triggered when lazy_max_pages() is hit,
or when alloc_vmap_area() can't find an available range. If either of
these happens, it means we really need to flush and reclaim the unmapped
areas into the free list/tree since the vmalloc address space has run
out. Even though a vmap_block may have much free space left, it still
needs to be purged to cope with the emergency.

So, would it be better if we picked VM_FLUSH_RESET_PERMS memory out and
flushed it alone, and set a threshold for vmap_block purging?

> 
> 
> That aside your patch does still not address what I pointed out to you
> and what my patch cures:
> 
> 		 pages     bits    dirtymin            dirtymax
>    vb_alloc(A)   255     0 -  254  VMAP_BBMAP_BITS     0
>    vb_alloc(B)   255   255 -  509  VMAP_BBMAP_BITS     0
>    vb_alloc(C)   255   510 -  764  VMAP_BBMAP_BITS     0
>    vb_alloc(D)   255   765 - 1020  VMAP_BBMAP_BITS     0
> 
> The block stays on the free list because there are still 4 pages left
> and it stays there until either _all_ free space is used or _all_
> allocated space is freed.
> 
> Now the first allocation gets freed:
> 
>    vb_free(A)    255     0 - 254  0                    254
> 
> From there on _every_ invocation of __purge_vmap_area_lazy() will see
> this range as long as the block is on the free list:
> 
>    list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> 	spin_lock(&vb->lock);
> 	if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
> 
> because this condition is true. So this flushes the same range over and
> over, no?
> 
> This flush range gets larger over time the more allocations are freed up
> to the point where the block vanishes from the free list.
> 
> By resetting vb->dirty_min/max the freed range is only flushed _once_,
> no? The resulting flush range might still be excessively large as I
> pointed out before:
> 
> 1) Flush after freeing A
> 
>    vb_free(A)    2     0 - 1  0			  1
>    Flush                      VMAP_BBMAP_BITS     0    <- correct
>    vb_free(C)    2     6 - 7  6                   7
>    Flush                      VMAP_BBMAP_BITS     0    <- correct
> 
> 2) No flush between freeing A and C
> 
>    vb_free(A)    2     0 - 1  0			  1
>    vb_free(C)    2     6 - 7  0                   7     
>    Flush                      VMAP_BBMAP_BITS     0     <- overbroad flush
> 
> 3) No flush between freeing A, C, B
> 
>    vb_free(A)    2     0 - 1  0			  1
>    vb_free(B)    2     6 - 7  0                   7
>    vb_free(C)    2     2 - 5  0                   7
>    Flush                      VMAP_BBMAP_BITS     0    <- correct
> 
> Obviously case 2 could be
> 
>    vb_free(A)    2     0 - 1     0		     1
>    vb_free(X)    2  1000 - 1001  1000                1001
> 
>    So that flush via purge_vmap_area_list() will ask to flush 1002 pages
>    instead of 4, right? Again, that does not make sense.

Yes, I got your point now. I didn't read your cure code carefully, sorry
for that.

> 
> The other issue I pointed out:
> 
> Assume the block has (for simplicity) 255 allocations size of 4 pages,
> again free space of 4 pages.
> 
> 254 allocations are freed, which means there is one remaining
> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
> time.
> 
> Now the last allocation is freed and the block is moved to the
> purge_vmap_area_list, which then does a full flush of the complete area,
> i.e. 4MB in that case, while in fact it only needs to flush 2 pages.

That's easy to fix. For a vmap_block, I have marked va->flags with
VMAP_RAM|VMAP_BLOCK. When flushing the va in the purge list, we can skip
the vmap_block va. I don't know how you will tackle the per-va flush
Nadav pointed out, so I will not give a draft code change.

> 
> 
> Also these intermediate flushes are inconsistent versus how fully
> utilized blocks are handled:
> 
> vb_alloc()
>       if (vb->free == 0)
>           list_del(vb->free_list);
> 
> So all allocations which are freed after that point stay unflushed until
> the last allocation is freed which moves the block to the
> purge_vmap_area_list, where it gets a full VA range flush.

It may be risky if they stay unflushed until the last allocation is freed.
We use the vm_map_ram() interface to map the passed-in pages into the
vmalloc area. Once vb_free() is called, the sub-region has been unmapped
and the user may have released the pages. Users of vm_unmap_aliases() may
be impacted if we don't flush the areas freed with vb_free(). In reality,
those areas have been unmapped, but stale TLB entries may still exist.
Not very sure about that.

If we can hold off the vmap_block flush until it is purged without risk,
it will save us a lot of trouble.

> 
> IOW, for blocks on the free list this cares about unflushed mappings of
> freed spaces, but for fully utilized blocks with freed spaces it does
> obviously not matter, right?

Yes, but it depends on how we flush them: flush them each time there is a
dirty range, or hold the flush until the block is purged, if holding is
allowed.

> 
> So either we care about flushing the mappings of freed spaces or we do
> not, but caring in one case and ignoring it in the other case is
> inconsistent at best.
>
Thomas Gleixner May 21, 2023, 11:10 p.m. UTC | #4
On Sat, May 20 2023 at 07:46, Baoquan He wrote:
> On 05/19/23 at 08:38pm, Thomas Gleixner wrote:
>> That block would have hundreds of pages left and is the most recently
>> allocated. So the next vb_alloc() has to reallocate one instead of using
>> the one which was allocated just before.
>> 
>> This clearly lacks a free space check so that blocks which have more
>> free space than a certain threshold are not thrown away prematurely.
>> Maybe it wants an age check too, so that blocks which are unused for a
>> long time can be recycled, but that's an orthogonal issue.
>
> You are right, the vmap_block alloc/free does have the issue you pointed
> out here. What I can defend is that it should be fine if
> VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the
> lazy flush will only be triggered when lazy_max_pages() is met, or
> alloc_vmap_area() can't find an available range. If these two happens,
> means we really need to flush and reclaim the unmapped area into free
> list/tree since the vmalloc address space has run out. Even though the
> vmap_block has mach free space left, still need be purged to cope with
> an emergency.
>
> So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and

That might be counterproductive depending on the usage scenario
especially as that BPF filtering is gaining traction.

> set a threshold for vmap_block purging, is it better?

The point is that there is no differentiation between:

  1) Purging freed areas, i.e. when vmap_lazy_nr hits the threshold,
     from drain_vmap_area_work()

  2) Reclaiming vmalloc address space from pcpu_get_vm_areas()

  3) _unmap_aliases()

#1 You really don't want to reclaim vmap blocks from the per cpu free
   lists, unless they have only a few free pages left and no active
   mappings.

   There is no reason to zap vmap blocks unconditionally, simply because
   reclaiming all lazy areas drains at least

      32MB * fls(num_online_cpus())

   per invocation which is plenty, no?

#2 You might want to try #1 first and if that does not help then reclaim
   all vmap blocks on the per cpu lists which have no active mappings
   unconditionally.

#3 _unmap_aliases() requires to touch everything because the caller has
   no clue which vmap_area used a particular page and the vmap_area lost
   that information too.

   Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the
   vmap area first and then cares about the flush. That in turn requires
   a full walk of _all_ vmap areas including the one which was just
   added to the purge list.

   I can see the point that this is used to combine outstanding TLB
   flushes and do the housekeeping of purging freed areas, but like #1
   there is no real good reason to zap usable vmap blocks
   unconditionally.

I'm all for consolidating functionality, but that swiss army knife
approach of one fits it all does not make sense to me.

>> The other issue I pointed out:
>> 
>> Assume the block has (for simplicity) 255 allocations size of 4 pages,
>> again free space of 4 pages.
>> 
>> 254 allocations are freed, which means there is one remaining
>> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
>> time.
>> 
>> Now the last allocation is freed and the block is moved to the
>> purge_vmap_area_list, which then does a full flush of the complete area,
>> i.e. 4MB in that case, while in fact it only needs to flush 2 pages.
>
> It's easy to fix. For vmap_block, I have marked it in va->flag with
> VMAP_RAM|VMAP_BLOCK. When flushing va in purge list, we can skip
> vmap_block va.

How can you skip that? The last 2 pages in that VA still need to be
flushed, no?

But VA has no information about already done partial flushes via the
vmap_block, so this requires flushing the full VA range unconditionally.

That made me think about the following scenario:

vm_map_ram()
  vb_alloc()
    // Uses page P
    vb->free -= 1UL << order;
    if (vb->free == 0)
      list_del_rcu(&vb->free_list);

vm_unmap_ram()
  vb_free()
    Does not free the last mapping of the vmap_block
    Clears the page table entry for page P, but does not flush TLB

__free_page(P)

page P gets reused

vm_unmap_aliases()

  Does not find the VA which used page P because neither the VB is in
  free_list nor the VA is in the purge_list

  Unless _vm_unmap_aliases() finds a large enough range to cover
  page P or ends up with a flush_tlb_all() the stale TLB(s) persists.

No?

Same problem in this scenario:

addr = vm_map_ram()
  vb_alloc()
    vb->free -= 1UL << order;
    if (vb->free == 0)
      list_del_rcu(&vb->free_list);

set_memory_*(addr)
  vm_unmap_aliases()

    Does not find the VA which contains @addr because neither the VB is in
    free_list nor the VA is in the purge_list

    Unless _vm_unmap_aliases() finds a large enough range to cover
    @addr or ends up with a flush_tlb_all() the stale TLB(s) persists.

No?

> I don't know how you will tackle the per va flush Nadav pointed out,
> so I will not give a dtaft code.

It's not that hard.

An array of ranges which can be memcpy()'ed into that existing x86
percpu flush info thing, which avoids the cache line issue Nadav is
concerned of.

That array would have obviously a limited number of entries as anything
with multiple ranges is most likely going to end up in a full flush
anyway. The drain_vmap_area_work() case definitely does so as
vmap_lazy_nr is guaranteed to be at least covering 32MB/PAGE_SIZE worth
of pages.
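
As a rough sketch of that array idea (all names hypothetical, not an
existing kernel interface): collect a small, fixed number of ranges and
let the consumer fall back to a full flush once the array overflows:

	#define NR_FLUSH_RANGES	8	/* assumed small limit */

	struct flush_range {
		unsigned long start;
		unsigned long end;
	};

	struct flush_ranges {
		unsigned int		nr;	/* > NR_FLUSH_RANGES means "flush everything" */
		struct flush_range	ranges[NR_FLUSH_RANGES];
	};

	static void flush_range_add(struct flush_ranges *fr, unsigned long start,
				    unsigned long end)
	{
		/* Record the range while there is room, otherwise just count. */
		if (fr->nr < NR_FLUSH_RANGES) {
			fr->ranges[fr->nr].start = start;
			fr->ranges[fr->nr].end   = end;
		}
		fr->nr++;
	}

	/* Consumer: flush the recorded ranges, or do flush_tlb_all() on overflow. */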

I just got distracted and did not come around to finish it for various
reasons. One of them is to make sense of this code :)

>> Also these intermediate flushes are inconsistent versus how fully
>> utilized blocks are handled:
>> 
>> vb_alloc()
>>       if (vb->free == 0)
>>           list_del(vb->free_list);
>> 
>> So all allocations which are freed after that point stay unflushed until
>> the last allocation is freed which moves the block to the
>> purge_vmap_area_list, where it gets a full VA range flush.
>
> That may be risky if stay unflushed until the last allocation is freed.
> We use vm_map_ram() interface to map passed in pages into vmalloc area.
> If vb_free() is called, the sub-region has been unmapped and user maybe
> have released the pages. user of vm_unmap_aliases() may be impacted if
> we don't flush those area freed with vb_free(). In reality, those areas
> have been unmapped, while there's still TLB existing. Not very sure
> about that.
>
> If we can hold the vmap_block flush until purging it w/o risk, it will
> save us many troubles.

The point is that the TLBs _must_ be flushed

  1) Before the virtual address space occupied by a vmap_area is
     released and can be reused.

     So one might argue that flushing TLBs early is good to detect use
     after free.

     The DEBUG_PAGE_ALLOC case flushes right after
     vunmap_range_noflush(), which is the proper thing to do for
     debugging as any UAF will be immediately detected after the
     function returns.

     The production case flushes lazy and fully utilized blocks are not
     on the free list and therefore not flushed until they are
     purged.

     The partial flush via _vm_unmap_aliases() solves a completely
     different problem. See #2/#3 below

  2) Before a page, which might have been reused before the lazy flush
     happened, can be used again. Also when the mapping attributes of a
     page are changed via set_memory_*()

     That's what vm_unmap_aliases() -> _vm_unmap_aliases() is for. 

  3) When an executable mapping is torn down

     That's the vfree() + VM_FLUSH_RESET_PERMS case which also ends up
     in _vm_unmap_aliases() 

I completely understand that there needs to be a balance between the
high utilization use case and the occasional one which made us look into
this. That's fine, but there needs to be a way to make it least overhead
for both.

/me tries to find some spare cycles ...

Thanks,

        tglx
Baoquan He May 22, 2023, 11:21 a.m. UTC | #5
On 05/22/23 at 01:10am, Thomas Gleixner wrote:
> On Sat, May 20 2023 at 07:46, Baoquan He wrote:
> > On 05/19/23 at 08:38pm, Thomas Gleixner wrote:
> >> That block would have hundreds of pages left and is the most recently
> >> allocated. So the next vb_alloc() has to reallocate one instead of using
> >> the one which was allocated just before.
> >> 
> >> This clearly lacks a free space check so that blocks which have more
> >> free space than a certain threshold are not thrown away prematurely.
> >> Maybe it wants an age check too, so that blocks which are unused for a
> >> long time can be recycled, but that's an orthogonal issue.
> >
> > You are right, the vmap_block alloc/free does have the issue you pointed
> > out here. What I can defend is that it should be fine if
> > VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the
> > lazy flush will only be triggered when lazy_max_pages() is met, or
> > alloc_vmap_area() can't find an available range. If these two happens,
> > means we really need to flush and reclaim the unmapped area into free
> > list/tree since the vmalloc address space has run out. Even though the
> > vmap_block has mach free space left, still need be purged to cope with
> > an emergency.
> >
> > So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and
> 
> That might be counterproductive depending on the usage scenario
> especially as that BPF filtering is gaining traction.
> 
> > set a threshold for vmap_block purging, is it better?
> 
> The point is that there is no differentiation between:
> 
>   1) Purging freed areas, i.e. when the vmap_layz_nr hits the threshold,
>      from drain_vmap_area_work()
> 
>   2) Reclaiming vmalloc address space from pcpu_get_vm_areas()
> 
>   3) _unmap_aliases()
> 
> #1 You really don't want to reclaim vmap blocks from the per cpu free
>    lists, unless they have only a few free pages left and no active
>    mappings.
> 
>    There is no reason to zap vmap blocks unconditionally, simply because
>    reclaiming all lazy areas drains at least
> 
>       32MB * fls(num_online_cpus())
> 
>    per invocation which is plenty, no?
> 
> #2 You might want to try #1 first and if that does not help then reclaim
>    all vmap blocks on the per cpu lists which have no active mappings
>    unconditionally.
> 
> #3 _unmap_aliases() requires to touch everything because the caller has
>    no clue which vmap_area used a particular page and the vmap_area lost
>    that information too.
> 
>    Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the
>    vmap area first and then cares about the flush. That in turn requires
>    a full walk of _all_ vmap areas including the one which was just
>    added to the purge list.
> 
>    I can see the point that this is used to combine outstanding TLB
>    flushes and do the housekeeping of purging freed areas, but like #1
>    there is no real good reason to zap usable vmap blocks
>    unconditionally.
> 
> I'm all for consolidating functionality, but that swiss army knife
> approach of one fits it all does not make sense to me.

OK, got it.

> 
> >> The other issue I pointed out:
> >> 
> >> Assume the block has (for simplicity) 255 allocations size of 4 pages,
> >> again free space of 4 pages.
> >> 
> >> 254 allocations are freed, which means there is one remaining
> >> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
> >> time.
> >> 
> >> Now the last allocation is freed and the block is moved to the
> >> purge_vmap_area_list, which then does a full flush of the complete area,
> >> i.e. 4MB in that case, while in fact it only needs to flush 2 pages.

Agree, it sounds problematic. With your cure code, we can check that when
we flush each va of the purge list. If we keep the vmap_block from being
freed until it has been checked and freed from the purge list, we can
flush the last two pages of the range in __purge_vmap_area_lazy(); please
see the next comment on how we could fix it.

> >
> > It's easy to fix. For vmap_block, I have marked it in va->flag with
> > VMAP_RAM|VMAP_BLOCK. When flushing va in purge list, we can skip
> > vmap_block va.
> 
> How can you skip that? The last 2 pages in that VA still need to be
> flushed, no?
> 
> But VA has no information about already done partial flushes via the
> vmap_block, so this requires flushing the full VA range unconditionally.

Right, I see the problem you spotted with the illustrations. For the
last 2 pages in that VA, we can keep the vmap_block around until we
iterate the purge list in __purge_vmap_area_lazy(). Since you will
iterate the purge list to add each va range to an array, we could do
something like the draft code below to fix it. Of course this is on top
of your cure code.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5ca55b357148..4b11a32df49d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	unsigned int num_purged_areas = 0;
 	struct list_head local_purge_list;
 	struct vmap_area *va, *n_va;
+	struct vmap_block vb;
 
 	lockdep_assert_held(&vmap_purge_lock);
 
@@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	list_replace_init(&purge_vmap_area_list, &local_purge_list);
 	spin_unlock(&purge_vmap_area_lock);
 
+	vb = container_of(va, struct vmap_block, va);
+	if ((va->flags & VMAP_FLAGS_MASK == VMAP_RAM|VMAP_BLOCK) &&
+	    (vb->dirty_max)) {
+		s = vb->dirty_min << PAGE_SHIFT;
+		e = vb->dirty_max << PAGE_SHIFT;
+		kfree(vb);
+	}
+
 	if (unlikely(list_empty(&local_purge_list)))
 		goto out;
 
@@ -2083,7 +2092,6 @@ static void free_vmap_block(struct vmap_block *vb)
 	spin_unlock(&vmap_area_lock);
 
 	free_vmap_area_noflush(vb->va);
-	kfree_rcu(vb, rcu_head);
 }
 
 static void purge_fragmented_blocks(int cpu)
@@ -2163,11 +2171,6 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
 		vaddr = vmap_block_vaddr(vb->va->va_start, pages_off);
 		vb->free -= 1UL << order;
 		bitmap_set(vb->used_map, pages_off, (1UL << order));
-		if (vb->free == 0) {
-			spin_lock(&vbq->lock);
-			list_del_rcu(&vb->free_list);
-			spin_unlock(&vbq->lock);
-		}
 
 		spin_unlock(&vb->lock);
 		break;

> 
> That made me think about the following scenario:
> 
> vm_map_ram()
>   vb_alloc()
>     // Uses page P
>     vb->free -= 1UL << order;
>     if (vb->free == 0)
>       list_del_rcu(&vb->free_list);
> 
> vm_unmap_ram()
>   vb_free()
>     Does not free the last mapping of the vmap_block
>     Clears the page table entry for page P, but does not flush TLB
> 
> __free_page(P)
> 
> page P gets reused
> 
> vm_unmap_aliases()
> 
>   Does not find the VA which used page P because neither the VB is in
>   free_list nor the VA is in the purge_list
> 
>   Unless _vm_unmap_aliases() finds a large enough range to cover
>   page P or ends up with a flush_tlb_all() the stale TLB(s) persists.
> 
> No?
> 
> Same problem in this scenario:
> 
> addr = vm_map_ram()
>   vb_alloc()
>     vb->free -= 1UL << order;
>     if (vb->free == 0)
>       list_del_rcu(&vb->free_list);
> 
> set_memory_*(addr)
>   vm_unmap_aliases()
> 
>     Does not find the VA which contains @addr because neither the VB is in
>     free_list nor the VA is in the purge_list
> 
>     Unless _vm_unmap_aliases() finds a large enough range to cover
>     @addr or ends up with a flush_tlb_all() the stale TLB(s) persists.
> 
> No?
> 
> > I don't know how you will tackle the per va flush Nadav pointed out,
> > so I will not give a dtaft code.
> 
> It's not that hard.
> 
> An array of ranges which can be memcpy()'ed into that existing x86
> percpu flush info thing, which avoids the cache line issue Nadav is
> concerned of.
> 
> That array would have obviously a limited number of entries as anything
> with multiple ranges is most likely going to end up in a full flush
> anyway. The drain_vmap_area_work() case definitely does so as
> vmap_lazy_nr is guaranteed to be at least covering 32MB/PAGE_SIZE worth
> of pages.
> 
> I just got distracted and did not come around to finish it for various
> reasons. One of them is to make sense of this code :)

OK, looking forward to seeing it done when it's convenient to you.

> 
> >> Also these intermediate flushes are inconsistent versus how fully
> >> utilized blocks are handled:
> >> 
> >> vb_alloc()
> >>       if (vb->free == 0)
> >>           list_del(vb->free_list);
> >> 
> >> So all allocations which are freed after that point stay unflushed until
> >> the last allocation is freed which moves the block to the
> >> purge_vmap_area_list, where it gets a full VA range flush.
> >
> > That may be risky if stay unflushed until the last allocation is freed.
> > We use vm_map_ram() interface to map passed in pages into vmalloc area.
> > If vb_free() is called, the sub-region has been unmapped and user maybe
> > have released the pages. user of vm_unmap_aliases() may be impacted if
> > we don't flush those area freed with vb_free(). In reality, those areas
> > have been unmapped, while there's still TLB existing. Not very sure
> > about that.
> >
> > If we can hold the vmap_block flush until purging it w/o risk, it will
> > save us many troubles.
> 
> The point is that the TLBs _must_ be flushed
> 
>   1) Before the virtual address space occupied by a vmap_area is
>      released and can be reused.
> 
>      So one might argue that flushing TLBs early is good to detect use
>      after free.
> 
>      The DEBUG_PAGE_ALLOC case flushes right after
>      vunmap_range_noflush(), which is the proper thing to do for
>      debugging as any UAF will be immediately detected after the
>      function returns.
> 
>      The production case flushes lazy and fully utilized blocks are not
>      on the free list and therefore not flushed until they are
>      purged.
> 
>      The partial flush via _vm_unmap_aliases() solves a completely
>      different problem. See #2/#3 below
> 
>   2) Before a page, which might have been reused before the lazy flush
>      happened, can be used again. Also when the mapping attributes of a
>      page are changed via set_memory_*()
> 
>      That's what vm_unmap_aliases() -> _vm_unmap_aliases() is for. 
> 
>   3) When an executable mapping is torn down
> 
>      That's the vfree() + VM_FLUSH_RESET_PERMS case which also ends up
>      in _vm_unmap_aliases() 


I see it now, thanks.
> 
> I completely understand that there needs to be a balance between the
> high utilization use case and the occasional one which made us look into
> this. That's fine, but there needs to be a way to make it least overhead
> for both.

Agree. Based on all the discussions, it should be much better if the
three things below are done. I haven't considered whether adding a dirty
bitmap to vmap_block is necessary and acceptable.

1) A threshold for vmap_block purging, e.g. 3/4 or 2/3 of VMAP_BBMAP_BITS;
2) your cure patch + the draft code I pasted above;
3) adding each va range to an array and passing it to flush_tlb_kernel_xx();

> 
> /me tries to find some spare cycles ...
> 
> Thanks,
> 
>         tglx
> 
> 
> 
>
Thomas Gleixner May 22, 2023, 12:02 p.m. UTC | #6
On Mon, May 22 2023 at 19:21, Baoquan He wrote:
> On 05/22/23 at 01:10am, Thomas Gleixner wrote:
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5ca55b357148..4b11a32df49d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	unsigned int num_purged_areas = 0;
>  	struct list_head local_purge_list;
>  	struct vmap_area *va, *n_va;
> +	struct vmap_block vb;
>  
>  	lockdep_assert_held(&vmap_purge_lock);
>  
> @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	list_replace_init(&purge_vmap_area_list, &local_purge_list);
>  	spin_unlock(&purge_vmap_area_lock);
>  
> +	vb = container_of(va, struct vmap_block, va);

This cannot work vmap_area is not embedded in vmap_block. vmap_block::va
is a pointer. vmap_area does not link back to vmap_block, so there is no
way to find it based on a vmap_area.

Aside of that va is not initialized here :)
Baoquan He May 22, 2023, 2:34 p.m. UTC | #7
On 05/22/23 at 02:02pm, Thomas Gleixner wrote:
> On Mon, May 22 2023 at 19:21, Baoquan He wrote:
> > On 05/22/23 at 01:10am, Thomas Gleixner wrote:
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 5ca55b357148..4b11a32df49d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  	unsigned int num_purged_areas = 0;
> >  	struct list_head local_purge_list;
> >  	struct vmap_area *va, *n_va;
> > +	struct vmap_block vb;
> >  
> >  	lockdep_assert_held(&vmap_purge_lock);
> >  
> > @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  	list_replace_init(&purge_vmap_area_list, &local_purge_list);
> >  	spin_unlock(&purge_vmap_area_lock);
> >  
> > +	vb = container_of(va, struct vmap_block, va);
> 
> This cannot work vmap_area is not embedded in vmap_block. vmap_block::va
> is a pointer. vmap_area does not link back to vmap_block, so there is no
> way to find it based on a vmap_area.

Oh, the code is buggy. va->flags can tell whether it's a vmap_block, and
then we can deduce the vb pointer.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5ca55b357148..73d6ce441351 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
        unsigned int num_purged_areas = 0;
        struct list_head local_purge_list;
        struct vmap_area *va, *n_va;
+       struct vmap_block vb;
 
        lockdep_assert_held(&vmap_purge_lock);
 
@@ -1736,6 +1737,15 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
        list_replace_init(&purge_vmap_area_list, &local_purge_list);
        spin_unlock(&purge_vmap_area_lock);
 
+       if (va->flags & VMAP_FLAGS_MASK == VMAP_RAM|VMAP_BLOCK) {
+               vb = container_of(va, struct vmap_block, va);
+               if (vb->dirty_max) {
			/*This is pseudo code for illustration*/
+                       s = vb->dirty_min << PAGE_SHIFT;
+                       e = vb->dirty_max << PAGE_SHIFT;
+               }
+               kfree(vb);
+       }
+
        if (unlikely(list_empty(&local_purge_list)))
                goto out;
 
@@ -2083,7 +2093,6 @@ static void free_vmap_block(struct vmap_block *vb)
        spin_unlock(&vmap_area_lock);
 
        free_vmap_area_noflush(vb->va);
-       kfree_rcu(vb, rcu_head);
 }

> 
> Aside of that va is not initialized here :)

Oh, this is not real code, just an illustration of how it could calculate
and flush the last two pages of the vmap_block. Once you have the per-va
flushing via array patch, I can work out a formal code change based on
that.
Thomas Gleixner May 22, 2023, 8:21 p.m. UTC | #8
On Mon, May 22 2023 at 22:34, Baoquan He wrote:
> On 05/22/23 at 02:02pm, Thomas Gleixner wrote:
>> > @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>> >  	list_replace_init(&purge_vmap_area_list, &local_purge_list);
>> >  	spin_unlock(&purge_vmap_area_lock);
>> >  
>> > +	vb = container_of(va, struct vmap_block, va);
>> 
>> This cannot work vmap_area is not embedded in vmap_block. vmap_block::va
>> is a pointer. vmap_area does not link back to vmap_block, so there is no
>> way to find it based on a vmap_area.
>
> Oh, the code is buggy. va->flags can tell if it's vmap_block, then we
> can deduce the vb pointer.

No. It _CANNOT_ work whether you check the flags or not.

struct foo {
       .....
       struct bar bar;
};

container_of(ptr_to_bar, struct foo, bar) returns the pointer to the
struct foo which has struct bar embedded.

But

struct foo {
       .....
       struct bar *bar;
};

cannot do that because ptr_to_bar points to some object which is
completely disconnected from struct foo.

Care to look at the implementation of container_of()?

Here is what it boils down to:

  void *member_pointer = bar;
  
  p = (struct foo *)(member_pointer - offsetof(struct foo, bar));

So it uses the pointer to bar and subtracts the offset of bar in struct
foo. This obviously can only work when struct bar is embedded in struct
foo.

Lets assume that *bar is the first member of foo, i.e. offset of *bar in
struct foo is 0

  p = (struct foo *)(member_pointer - 0);

So you end up with

  p == member_pointer == bar

But you won't get there because the static_assert() in container_of()
will catch that and the compiler will tell you in colourful ways.
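
A tiny stand-alone userspace example of the difference, with made-up
struct names, only to illustrate the offset math described above:

	#include <stddef.h>

	#define container_of(ptr, type, member) \
		((type *)((char *)(ptr) - offsetof(type, member)))

	struct bar { int x; };

	struct foo_embedded {
		long		cookie;
		struct bar	bar;	/* embedded: container_of() works */
	};

	struct foo_pointer {
		long		cookie;
		struct bar	*bar;	/* pointer: the pointed-to object lives elsewhere */
	};

	int main(void)
	{
		struct foo_embedded fe = { .cookie = 1 };
		struct bar *p = &fe.bar;

		/* Valid: p lives inside fe, so the offset math lands on fe. */
		struct foo_embedded *owner = container_of(p, struct foo_embedded, bar);
		(void)owner;

		/*
		 * For struct foo_pointer there is no way back: the struct bar
		 * object is disconnected from the containing struct, so
		 * subtracting offsetof() from its address yields garbage.
		 * The kernel's container_of() additionally rejects the
		 * mismatched member type at compile time.
		 */
		return 0;
	}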

Once the vmap area is handed over for cleanup, the vmap block is gone,
and even if you let it stay around, the vmap area does not have any
information about where to find the block.

You'd need to have a pointer to the vmap block in vmap area or embed
vmap area into vmap block.

See?

Thanks,

        tglx
Thomas Gleixner May 22, 2023, 8:44 p.m. UTC | #9
On Mon, May 22 2023 at 22:21, Thomas Gleixner wrote:
> Lets assume that *bar is the first member of foo, i.e. offset of *bar in
> struct foo is 0
>
>   p = (struct foo *)(member_pointer - 0);
>
> So you end up with
>
>   p == member_pointer == bar
>
> But you won't get there because the static_assert() in container_of()
> will catch that and the compiler will tell you in colourful ways.
>
> Once the vmap area is handed over for cleaning up the vmap block is gone
> and even if you let it stay around then the vmap area does not have any
> information where to find the block.
>
> You'd need to have a pointer to the vmap block in vmap area or embed
> vmap area into vmap block.

The latter would require to:

  - split alloc_vmap_area() apart

  - sprinkle 'if (vmap_area_is_vmap_block(va))' all over the place

  - do a lot of other nasty things

Not sure if that's worth it. There are some other options to pursue.

Thanks,

        tglx
Baoquan He May 23, 2023, 9:35 a.m. UTC | #10
On 05/22/23 at 10:21pm, Thomas Gleixner wrote:
> On Mon, May 22 2023 at 22:34, Baoquan He wrote:
> > On 05/22/23 at 02:02pm, Thomas Gleixner wrote:
> >> > @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >> >  	list_replace_init(&purge_vmap_area_list, &local_purge_list);
> >> >  	spin_unlock(&purge_vmap_area_lock);
> >> >  
> >> > +	vb = container_of(va, struct vmap_block, va);
> >> 
> >> This cannot work vmap_area is not embedded in vmap_block. vmap_block::va
> >> is a pointer. vmap_area does not link back to vmap_block, so there is no
> >> way to find it based on a vmap_area.
> >
> > Oh, the code is buggy. va->flags can tell if it's vmap_block, then we
> > can deduce the vb pointer.
> 
> No. It _CANNOT_ work whether you check the flags or not.
> 
> struct foo {
>        .....
>        struct bar bar;
> };
> 
> container_of(ptr_to_bar, struct foo, bar) returns the pointer to the
> struct foo which has struct bar embedded.
> 
> But
> 
> struct foo {
>        .....
>        struct bar *bar;
> };
> 
> cannot do that because ptr_to_bar points to some object which is
> completely disconnected from struct foo.
> 
> Care to look at the implementation of container_of()?
> 
> Here is what it boils down to:
> 
>   void *member_pointer = bar;
>   
>   p = (struct foo *)(member_pointer - offsetof(struct foo, bar);
> 
> So it uses the pointer to bar and subtracts the offset of bar in struct
> foo. This obviously can only work when struct bar is embedded in struct
> foo.
> 
> Lets assume that *bar is the first member of foo, i.e. offset of *bar in
> struct foo is 0
> 
>   p = (struct foo *)(member_pointer - 0);
> 
> So you end up with
> 
>   p == member_pointer == bar
> 
> But you won't get there because the static_assert() in container_of()
> will catch that and the compiler will tell you in colourful ways.

Thanks a lot, I have learned it now. I never noticed that container_of()
is not suitable for the pointer-member case.

> 
> Once the vmap area is handed over for cleaning up the vmap block is gone
> and even if you let it stay around then the vmap area does not have any
> information where to find the block.
> 
> You'd need to have a pointer to the vmap block in vmap area or embed
> vmap area into vmap block.

Got it now. Embedding vmap_area into vmap_block doesn't seem feasible
because the va needs to be reused when it is inserted into
free_vmap_area_root/list. Adding a pointer to the vmap_block looks doable.

Since a vm_map_ram area doesn't have a vm_struct associated with it, we
can reuse the space of '->vm' to add a vb pointer, like below. Since
there are places in the existing code where 'if (!va->vm)' is used to
check whether it's a normal va, we need to be careful to find all of them
and replace them with new, tighter checks. I will send a draft code
change after this is done and testing has passed.

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..e2ba6d59d679 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -15,6 +15,7 @@
 struct vm_area_struct;         /* vma defining user mapping in mm_types.h */
 struct notifier_block;         /* in notifier.h */
 struct iov_iter;               /* in uio.h */
+struct vmap_block;             /* in mm/vmalloc.c */
 
 /* bits in flags of vmalloc's vm_struct below */
 #define VM_IOREMAP             0x00000001      /* ioremap() and friends */
@@ -76,6 +77,7 @@ struct vmap_area {
        union {
                unsigned long subtree_max_size; /* in "free" tree */
                struct vm_struct *vm;           /* in "busy" tree */
+               struct vmap_block *vb;           /* in "busy and purge" tree */
        };
        unsigned long flags; /* mark type of vm_map_ram area */
 };
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c0f80982eb06..d97343271e27 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2061,6 +2061,10 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
                return ERR_PTR(err);
        }
 
+       spin_lock(&vmap_area_lock);
+       va->vb = vb;
+       spin_unlock(&vmap_area_lock);
+
        vbq = raw_cpu_ptr(&vmap_block_queue);
        spin_lock(&vbq->lock);
        list_add_tail_rcu(&vb->free_list, &vbq->free);

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 87134dd8abc3..9f7cbd6182ad 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2236,8 +2236,23 @@  static void vb_free(unsigned long addr, unsigned long size)
 		spin_unlock(&vb->lock);
 }
 
-static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
+/**
+ * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
+ *
+ * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
+ * to amortize TLB flushing overheads. What this means is that any page you
+ * have now, may, in a former life, have been mapped into kernel virtual
+ * address by the vmap layer and so there might be some CPUs with TLB entries
+ * still referencing that page (additional to the regular 1:1 kernel mapping).
+ *
+ * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can
+ * be sure that none of the pages we have control over will have any aliases
+ * from the vmap layer.
+ */
+void vm_unmap_aliases(void)
 {
+	unsigned long start = ULONG_MAX, end = 0;
+	bool flush = false;
 	int cpu;
 
 	if (unlikely(!vmap_initialized))
@@ -2245,6 +2260,9 @@  static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 
 	might_sleep();
 
+	mutex_lock(&vmap_purge_lock);
+	purge_fragmented_blocks_allcpus();
+
 	for_each_possible_cpu(cpu) {
 		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
 		struct vmap_block *vb;
@@ -2262,40 +2280,17 @@  static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 				start = min(s, start);
 				end   = max(e, end);
 
-				flush = 1;
+				flush = true;
 			}
 			spin_unlock(&vb->lock);
 		}
 		rcu_read_unlock();
 	}
 
-	mutex_lock(&vmap_purge_lock);
-	purge_fragmented_blocks_allcpus();
 	if (!__purge_vmap_area_lazy(start, end) && flush)
 		flush_tlb_kernel_range(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
-
-/**
- * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
- *
- * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
- * to amortize TLB flushing overheads. What this means is that any page you
- * have now, may, in a former life, have been mapped into kernel virtual
- * address by the vmap layer and so there might be some CPUs with TLB entries
- * still referencing that page (additional to the regular 1:1 kernel mapping).
- *
- * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can
- * be sure that none of the pages we have control over will have any aliases
- * from the vmap layer.
- */
-void vm_unmap_aliases(void)
-{
-	unsigned long start = ULONG_MAX, end = 0;
-	int flush = 0;
-
-	_vm_unmap_aliases(start, end, flush);
-}
 EXPORT_SYMBOL_GPL(vm_unmap_aliases);
 
 /**