Message ID: ZGdlff80rZ6v4LNO@MiWiFi-R3L-srv (mailing list archive)
State: New, archived
Series: [RFC,1/3] mm/vmalloc.c: try to flush vmap_area one by one
On Fri, May 19 2023 at 20:03, Baoquan He wrote:
> @@ -2245,6 +2260,9 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
>
>  	might_sleep();
>
> +	mutex_lock(&vmap_purge_lock);
> +	purge_fragmented_blocks_allcpus();
> +
>  	for_each_possible_cpu(cpu) {

That still does TWO iterations over all possible CPUs, while this really
could be one.
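[Editorial aside: a minimal sketch, not the patch under discussion, of how the fragmented-block purge could be folded into the existing per-CPU walk so only one pass over the possible CPUs is needed. purge_fragmented_block() and free_purged_blocks() are assumed helper names, not existing functions.]

```c
/*
 * Hedged sketch only: collect fully unmapped blocks and account the dirty
 * ranges of the remaining ones in a single per-CPU walk. The helpers
 * purge_fragmented_block() and free_purged_blocks() are assumptions.
 */
static void _vm_unmap_aliases_onepass(unsigned long start, unsigned long end,
				      bool flush)
{
	LIST_HEAD(purge_list);
	int cpu;

	mutex_lock(&vmap_purge_lock);

	for_each_possible_cpu(cpu) {
		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
		struct vmap_block *vb;

		rcu_read_lock();
		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
			spin_lock(&vb->lock);

			/* Fully unmapped block: queue it for purging ... */
			if (purge_fragmented_block(vb, vbq, &purge_list)) {
				spin_unlock(&vb->lock);
				continue;
			}

			/* ... otherwise account its unflushed dirty range. */
			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
				unsigned long s = vb->va->va_start +
						  (vb->dirty_min << PAGE_SHIFT);
				unsigned long e = vb->va->va_start +
						  ((vb->dirty_max + 1) << PAGE_SHIFT);

				start = min(s, start);
				end = max(e, end);
				flush = true;
			}
			spin_unlock(&vb->lock);
		}
		rcu_read_unlock();
	}

	free_purged_blocks(&purge_list);

	if (!__purge_vmap_area_lazy(start, end) && flush)
		flush_tlb_kernel_range(start, end);
	mutex_unlock(&vmap_purge_lock);
}
```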
On Fri, May 19 2023 at 20:03, Baoquan He wrote: > After vb_free() invocation, the va will be purged and put into purge > tree/list if the entire vmap_block is dirty. If not entirely dirty, the > vmap_block is still in percpu vmap_block_queue list, just like below two > graphs: > > (1) > |-----|------------|-----------|-------| > |dirty|still mapped| dirty | free | > > 2) > |------------------------------|-------| > | dirty | free | > > In the current _vm_unmap_aliases(), to reclaim those unmapped range and > flush, it will iterate percpu vbq to calculate the range from vmap_block > like above two cases. Then call purge_fragmented_blocks_allcpus() > to purge the vmap_block in case 2 since no mapping exists right now, > and put these purged vmap_block va into purge tree/list. Then in > __purge_vmap_area_lazy(), it will continue calculating the flush range > from purge list. Obviously, this will take vmap_block va in the 2nd case > into account twice. Which made me look deeper into purge_fragmented_blocks() list_for_each_entry_rcu(vb, &vbq->free, free_list) { if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS)) continue; spin_lock(&vb->lock); if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) { That means if an allocation does not find something in the free list then this can happen: vaddr = vb_alloc(size) vaddr = new_vmap_block(order, gfp_mask); vb_free(vaddr, size) vb->dirty = 1ULL << order; purge_fragmented_blocks() purge(most_recently_allocated_block); vaddr = vb_alloc(size) vaddr = new_vmap_block(order, gfp_mask); How does that make sense? That block would have hundreds of pages left and is the most recently allocated. So the next vb_alloc() has to reallocate one instead of using the one which was allocated just before. This clearly lacks a free space check so that blocks which have more free space than a certain threshold are not thrown away prematurely. Maybe it wants an age check too, so that blocks which are unused for a long time can be recycled, but that's an orthogonal issue. That aside your patch does still not address what I pointed out to you and what my patch cures: pages bits dirtymin dirtymax vb_alloc(A) 255 0 - 254 VMAP_BBMAP_BITS 0 vb_alloc(B) 255 255 - 509 VMAP_BBMAP_BITS 0 vb_alloc(C) 255 510 - 764 VMAP_BBMAP_BITS 0 vb_alloc(D) 255 765 - 1020 VMAP_BBMAP_BITS 0 The block stays on the free list because there are still 4 pages left and it stays there until either _all_ free space is used or _all_ allocated space is freed. Now the first allocation gets freed: vb_free(A) 255 0 - 254 0 254 From there on _every_ invocation of __purge_vmap_area_lazy() will see this range as long as the block is on the free list: list_for_each_entry_rcu(vb, &vbq->free, free_list) { spin_lock(&vb->lock); if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) { because this condition is true. So this flushes the same range over and over, no? This flush range gets larger over time the more allocations are freed up to the point where the block vanishes from the free list. By resetting vb->dirty_min/max the freed range is only flushed _once_, no? 
The resulting flush range might still be excessively large as I pointed
out before:

 1) Flush after freeing A

    vb_free(A)     2       0 -    1        0              1
    Flush                              VMAP_BBMAP_BITS    0   <- correct
    vb_free(C)     2       6 -    7        6              7
    Flush                              VMAP_BBMAP_BITS    0   <- correct

 2) No flush between freeing A and C

    vb_free(A)     2       0 -    1        0              1
    vb_free(C)     2       6 -    7        0              7
    Flush                              VMAP_BBMAP_BITS    0   <- overbroad flush

 3) No flush between freeing A, C, B

    vb_free(A)     2       0 -    1        0              1
    vb_free(B)     2       6 -    7        0              7
    vb_free(C)     2       2 -    5        0              7
    Flush                              VMAP_BBMAP_BITS    0   <- correct

Obviously case 2 could be

    vb_free(A)     2       0 -    1        0              1
    vb_free(X)     2    1000 - 1001     1000           1001

So that flush via purge_vmap_area_list() will ask to flush 1002 pages
instead of 4, right? Again, that does not make sense.

The other issue I pointed out:

Assume the block has (for simplicity) 255 allocations size of 4 pages,
again free space of 4 pages.

254 allocations are freed, which means there is one remaining
mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
time.

Now the last allocation is freed and the block is moved to the
purge_vmap_area_list, which then does a full flush of the complete area,
i.e. 4MB in that case, while in fact it only needs to flush 2 pages.

Also these intermediate flushes are inconsistent versus how fully
utilized blocks are handled:

    vb_alloc()
        if (vb->free == 0)
            list_del(vb->free_list);

So all allocations which are freed after that point stay unflushed until
the last allocation is freed which moves the block to the
purge_vmap_area_list, where it gets a full VA range flush.

IOW, for blocks on the free list this cares about unflushed mappings of
freed spaces, but for fully utilized blocks with freed spaces it does
obviously not matter, right?

So either we care about flushing the mappings of freed spaces or we do
not, but caring in one case and ignoring it in the other case is
inconsistent at best.

Thanks,

        tglx
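[Editorial aside: a minimal sketch of the "flush only once" idea discussed above. It is not the cure patch referenced in the thread; the helper name is made up. The point is simply that once a block's recorded dirty range has been folded into the pending flush range, the tracking is reset so later _vm_unmap_aliases() walks do not flush the same freed range again.]

```c
/*
 * Illustrative sketch only: account the block's recorded dirty range into
 * the pending flush range, then reset the per-block tracking so the same
 * freed range is flushed only once. Helper name is an assumption.
 */
static bool vb_account_dirty_once(struct vmap_block *vb,
				  unsigned long *start, unsigned long *end)
{
	bool flush = false;

	spin_lock(&vb->lock);
	/* dirty_max != 0 means an unflushed freed range has been recorded. */
	if (vb->dirty_max && vb->dirty != VMAP_BBMAP_BITS) {
		unsigned long s = vb->va->va_start + (vb->dirty_min << PAGE_SHIFT);
		unsigned long e = vb->va->va_start + ((vb->dirty_max + 1) << PAGE_SHIFT);

		*start = min(*start, s);
		*end = max(*end, e);

		/* The range is now part of the pending flush: forget it here. */
		vb->dirty_min = VMAP_BBMAP_BITS;
		vb->dirty_max = 0;
		flush = true;
	}
	spin_unlock(&vb->lock);

	return flush;
}
```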
On 05/19/23 at 08:38pm, Thomas Gleixner wrote: > On Fri, May 19 2023 at 20:03, Baoquan He wrote: > > After vb_free() invocation, the va will be purged and put into purge > > tree/list if the entire vmap_block is dirty. If not entirely dirty, the > > vmap_block is still in percpu vmap_block_queue list, just like below two > > graphs: > > > > (1) > > |-----|------------|-----------|-------| > > |dirty|still mapped| dirty | free | > > > > 2) > > |------------------------------|-------| > > | dirty | free | > > > > In the current _vm_unmap_aliases(), to reclaim those unmapped range and > > flush, it will iterate percpu vbq to calculate the range from vmap_block > > like above two cases. Then call purge_fragmented_blocks_allcpus() > > to purge the vmap_block in case 2 since no mapping exists right now, > > and put these purged vmap_block va into purge tree/list. Then in > > __purge_vmap_area_lazy(), it will continue calculating the flush range > > from purge list. Obviously, this will take vmap_block va in the 2nd case > > into account twice. > > Which made me look deeper into purge_fragmented_blocks() > > list_for_each_entry_rcu(vb, &vbq->free, free_list) { > > if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS)) > continue; > > spin_lock(&vb->lock); > if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) { > > That means if an allocation does not find something in the free list > then this can happen: > > vaddr = vb_alloc(size) > vaddr = new_vmap_block(order, gfp_mask); > > vb_free(vaddr, size) > vb->dirty = 1ULL << order; > > purge_fragmented_blocks() > purge(most_recently_allocated_block); > > vaddr = vb_alloc(size) > vaddr = new_vmap_block(order, gfp_mask); > > How does that make sense? > > That block would have hundreds of pages left and is the most recently > allocated. So the next vb_alloc() has to reallocate one instead of using > the one which was allocated just before. > > This clearly lacks a free space check so that blocks which have more > free space than a certain threshold are not thrown away prematurely. > Maybe it wants an age check too, so that blocks which are unused for a > long time can be recycled, but that's an orthogonal issue. You are right, the vmap_block alloc/free does have the issue you pointed out here. What I can defend is that it should be fine if VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the lazy flush will only be triggered when lazy_max_pages() is met, or alloc_vmap_area() can't find an available range. If these two happens, means we really need to flush and reclaim the unmapped area into free list/tree since the vmalloc address space has run out. Even though the vmap_block has mach free space left, still need be purged to cope with an emergency. So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and set a threshold for vmap_block purging, is it better? > > > That aside your patch does still not address what I pointed out to you > and what my patch cures: > > pages bits dirtymin dirtymax > vb_alloc(A) 255 0 - 254 VMAP_BBMAP_BITS 0 > vb_alloc(B) 255 255 - 509 VMAP_BBMAP_BITS 0 > vb_alloc(C) 255 510 - 764 VMAP_BBMAP_BITS 0 > vb_alloc(D) 255 765 - 1020 VMAP_BBMAP_BITS 0 > > The block stays on the free list because there are still 4 pages left > and it stays there until either _all_ free space is used or _all_ > allocated space is freed. 
> > Now the first allocation gets freed: > > vb_free(A) 255 0 - 254 0 254 > > From there on _every_ invocation of __purge_vmap_area_lazy() will see > this range as long as the block is on the free list: > > list_for_each_entry_rcu(vb, &vbq->free, free_list) { > spin_lock(&vb->lock); > if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) { > > because this condition is true. So this flushes the same range over and > over, no? > > This flush range gets larger over time the more allocations are freed up > to the point where the block vanishes from the free list. > > By resetting vb->dirty_min/max the freed range is only flushed _once_, > no? The resulting flush range might still be excessively large as I > pointed out before: > > 1) Flush after freeing A > > vb_free(A) 2 0 - 1 0 1 > Flush VMAP_BBMAP_BITS 0 <- correct > vb_free(C) 2 6 - 7 6 7 > Flush VMAP_BBMAP_BITS 0 <- correct > > 2) No flush between freeing A and C > > vb_free(A) 2 0 - 1 0 1 > vb_free(C) 2 6 - 7 0 7 > Flush VMAP_BBMAP_BITS 0 <- overbroad flush > > 3) No flush between freeing A, C, B > > vb_free(A) 2 0 - 1 0 1 > vb_free(B) 2 6 - 7 0 7 > vb_free(C) 2 2 - 5 0 7 > Flush VMAP_BBMAP_BITS 0 <- correct > > Obviously case 2 could be > > vb_free(A) 2 0 - 1 0 1 > vb_free(X) 2 1000 - 1001 1000 1001 > > So that flush via purge_vmap_area_list() will ask to flush 1002 pages > instead of 4, right? Again, that does not make sense. Yes, I got your point now. I didn't read your cure code carefully, sorry for that. > > The other issue I pointed out: > > Assume the block has (for simplicity) 255 allocations size of 4 pages, > again free space of 4 pages. > > 254 allocations are freed, which means there is one remaining > mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over > time. > > Now the last allocation is freed and the block is moved to the > purge_vmap_area_list, which then does a full flush of the complete area, > i.e. 4MB in that case, while in fact it only needs to flush 2 pages. It's easy to fix. For vmap_block, I have marked it in va->flag with VMAP_RAM|VMAP_BLOCK. When flushing va in purge list, we can skip vmap_block va. I don't know how you will tackle the per va flush Nadav pointed out, so I will not give a dtaft code. > > > Also these intermediate flushes are inconsistent versus how fully > utilized blocks are handled: > > vb_alloc() > if (vb->free == 0) > list_del(vb->free_list); > > So all allocations which are freed after that point stay unflushed until > the last allocation is freed which moves the block to the > purge_vmap_area_list, where it gets a full VA range flush. That may be risky if stay unflushed until the last allocation is freed. We use vm_map_ram() interface to map passed in pages into vmalloc area. If vb_free() is called, the sub-region has been unmapped and user maybe have released the pages. user of vm_unmap_aliases() may be impacted if we don't flush those area freed with vb_free(). In reality, those areas have been unmapped, while there's still TLB existing. Not very sure about that. If we can hold the vmap_block flush until purging it w/o risk, it will save us many troubles. > > IOW, for blocks on the free list this cares about unflushed mappings of > freed spaces, but for fully utilized blocks with freed spaces it does > obviously not matter, right? Yes, while depends on how we flush them. Flush them each time if there's dirty, or hold the flush until purged if holding is allowed. 
> 
> So either we care about flushing the mappings of freed spaces or we do
> not, but caring in one case and ignoring it in the other case is
> inconsistent at best.
> 
On Sat, May 20 2023 at 07:46, Baoquan He wrote: > On 05/19/23 at 08:38pm, Thomas Gleixner wrote: >> That block would have hundreds of pages left and is the most recently >> allocated. So the next vb_alloc() has to reallocate one instead of using >> the one which was allocated just before. >> >> This clearly lacks a free space check so that blocks which have more >> free space than a certain threshold are not thrown away prematurely. >> Maybe it wants an age check too, so that blocks which are unused for a >> long time can be recycled, but that's an orthogonal issue. > > You are right, the vmap_block alloc/free does have the issue you pointed > out here. What I can defend is that it should be fine if > VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the > lazy flush will only be triggered when lazy_max_pages() is met, or > alloc_vmap_area() can't find an available range. If these two happens, > means we really need to flush and reclaim the unmapped area into free > list/tree since the vmalloc address space has run out. Even though the > vmap_block has mach free space left, still need be purged to cope with > an emergency. > > So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and That might be counterproductive depending on the usage scenario especially as that BPF filtering is gaining traction. > set a threshold for vmap_block purging, is it better? The point is that there is no differentiation between: 1) Purging freed areas, i.e. when the vmap_layz_nr hits the threshold, from drain_vmap_area_work() 2) Reclaiming vmalloc address space from pcpu_get_vm_areas() 3) _unmap_aliases() #1 You really don't want to reclaim vmap blocks from the per cpu free lists, unless they have only a few free pages left and no active mappings. There is no reason to zap vmap blocks unconditionally, simply because reclaiming all lazy areas drains at least 32MB * fls(num_online_cpus()) per invocation which is plenty, no? #2 You might want to try #1 first and if that does not help then reclaim all vmap blocks on the per cpu lists which have no active mappings unconditionally. #3 _unmap_aliases() requires to touch everything because the caller has no clue which vmap_area used a particular page and the vmap_area lost that information too. Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the vmap area first and then cares about the flush. That in turn requires a full walk of _all_ vmap areas including the one which was just added to the purge list. I can see the point that this is used to combine outstanding TLB flushes and do the housekeeping of purging freed areas, but like #1 there is no real good reason to zap usable vmap blocks unconditionally. I'm all for consolidating functionality, but that swiss army knife approach of one fits it all does not make sense to me. >> The other issue I pointed out: >> >> Assume the block has (for simplicity) 255 allocations size of 4 pages, >> again free space of 4 pages. >> >> 254 allocations are freed, which means there is one remaining >> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over >> time. >> >> Now the last allocation is freed and the block is moved to the >> purge_vmap_area_list, which then does a full flush of the complete area, >> i.e. 4MB in that case, while in fact it only needs to flush 2 pages. > > It's easy to fix. For vmap_block, I have marked it in va->flag with > VMAP_RAM|VMAP_BLOCK. When flushing va in purge list, we can skip > vmap_block va. How can you skip that? 
The last 2 pages in that VA still need to be flushed, no? But VA has no information about already done partial flushes via the vmap_block, so this requires flushing the full VA range unconditionally. That made me think about the following scenario: vm_map_ram() vb_alloc() // Uses page P vb->free -= 1UL << order; if (vb->free == 0) list_del_rcu(&vb->free_list); vm_unmap_ram() vb_free() Does not free the last mapping of the vmap_block Clears the page table entry for page P, but does not flush TLB __free_page(P) page P gets reused vm_unmap_aliases() Does not find the VA which used page P because neither the VB is in free_list nor the VA is in the purge_list Unless _vm_unmap_aliases() finds a large enough range to cover page P or ends up with a flush_tlb_all() the stale TLB(s) persists. No? Same problem in this scenario: addr = vm_map_ram() vb_alloc() vb->free -= 1UL << order; if (vb->free == 0) list_del_rcu(&vb->free_list); set_memory_*(addr) vm_unmap_aliases() Does not find the VA which contains @addr because neither the VB is in free_list nor the VA is in the purge_list Unless _vm_unmap_aliases() finds a large enough range to cover @addr or ends up with a flush_tlb_all() the stale TLB(s) persists. No? > I don't know how you will tackle the per va flush Nadav pointed out, > so I will not give a dtaft code. It's not that hard. An array of ranges which can be memcpy()'ed into that existing x86 percpu flush info thing, which avoids the cache line issue Nadav is concerned of. That array would have obviously a limited number of entries as anything with multiple ranges is most likely going to end up in a full flush anyway. The drain_vmap_area_work() case definitely does so as vmap_lazy_nr is guaranteed to be at least covering 32MB/PAGE_SIZE worth of pages. I just got distracted and did not come around to finish it for various reasons. One of them is to make sense of this code :) >> Also these intermediate flushes are inconsistent versus how fully >> utilized blocks are handled: >> >> vb_alloc() >> if (vb->free == 0) >> list_del(vb->free_list); >> >> So all allocations which are freed after that point stay unflushed until >> the last allocation is freed which moves the block to the >> purge_vmap_area_list, where it gets a full VA range flush. > > That may be risky if stay unflushed until the last allocation is freed. > We use vm_map_ram() interface to map passed in pages into vmalloc area. > If vb_free() is called, the sub-region has been unmapped and user maybe > have released the pages. user of vm_unmap_aliases() may be impacted if > we don't flush those area freed with vb_free(). In reality, those areas > have been unmapped, while there's still TLB existing. Not very sure > about that. > > If we can hold the vmap_block flush until purging it w/o risk, it will > save us many troubles. The point is that the TLBs _must_ be flushed 1) Before the virtual address space occupied by a vmap_area is released and can be reused. So one might argue that flushing TLBs early is good to detect use after free. The DEBUG_PAGE_ALLOC case flushes right after vunmap_range_noflush(), which is the proper thing to do for debugging as any UAF will be immediately detected after the function returns. The production case flushes lazy and fully utilized blocks are not on the free list and therefore not flushed until they are purged. The partial flush via _vm_unmap_aliases() solves a completely different problem. See #2/#3 below 2) Before a page, which might have been reused before the lazy flush happened, can be used again. 
   Also when the mapping attributes of a page are changed via
   set_memory_*().

   That's what vm_unmap_aliases() -> _vm_unmap_aliases() is for.

3) When an executable mapping is torn down.

   That's the vfree() + VM_FLUSH_RESET_PERMS case which also ends up
   in _vm_unmap_aliases().

I completely understand that there needs to be a balance between the
high utilization use case and the occasional one which made us look into
this. That's fine, but there needs to be a way to make it least overhead
for both.

/me tries to find some spare cycles ...

Thanks,

        tglx
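[Editorial aside: a minimal sketch of the reclaim policy argued for in #1/#2 above — purge a block from the per-CPU free list only when it has no live mappings and, unless the caller forces it, little usable free space left. The threshold value, the helper name, and the force parameter are assumptions for illustration, not part of any posted patch.]

```c
/* Hypothetical threshold; the discussion only argues for "some" threshold. */
#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)

/*
 * Sketch: decide whether a vmap_block on the per-CPU free list may be
 * reclaimed. 'force' would be set by address-space-pressure callers such
 * as the pcpu_get_vm_areas() path (#2), but not by drain work (#1).
 */
static bool vb_is_purgeable(const struct vmap_block *vb, bool force)
{
	/* Never purge a block that still contains live mappings. */
	if (vb->free + vb->dirty != VMAP_BBMAP_BITS ||
	    vb->dirty == VMAP_BBMAP_BITS)
		return false;

	/* Keep blocks with plenty of free space unless explicitly forced. */
	return force || vb->free < VMAP_PURGE_THRESHOLD;
}
```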
On 05/22/23 at 01:10am, Thomas Gleixner wrote: > On Sat, May 20 2023 at 07:46, Baoquan He wrote: > > On 05/19/23 at 08:38pm, Thomas Gleixner wrote: > >> That block would have hundreds of pages left and is the most recently > >> allocated. So the next vb_alloc() has to reallocate one instead of using > >> the one which was allocated just before. > >> > >> This clearly lacks a free space check so that blocks which have more > >> free space than a certain threshold are not thrown away prematurely. > >> Maybe it wants an age check too, so that blocks which are unused for a > >> long time can be recycled, but that's an orthogonal issue. > > > > You are right, the vmap_block alloc/free does have the issue you pointed > > out here. What I can defend is that it should be fine if > > VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the > > lazy flush will only be triggered when lazy_max_pages() is met, or > > alloc_vmap_area() can't find an available range. If these two happens, > > means we really need to flush and reclaim the unmapped area into free > > list/tree since the vmalloc address space has run out. Even though the > > vmap_block has mach free space left, still need be purged to cope with > > an emergency. > > > > So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and > > That might be counterproductive depending on the usage scenario > especially as that BPF filtering is gaining traction. > > > set a threshold for vmap_block purging, is it better? > > The point is that there is no differentiation between: > > 1) Purging freed areas, i.e. when the vmap_layz_nr hits the threshold, > from drain_vmap_area_work() > > 2) Reclaiming vmalloc address space from pcpu_get_vm_areas() > > 3) _unmap_aliases() > > #1 You really don't want to reclaim vmap blocks from the per cpu free > lists, unless they have only a few free pages left and no active > mappings. > > There is no reason to zap vmap blocks unconditionally, simply because > reclaiming all lazy areas drains at least > > 32MB * fls(num_online_cpus()) > > per invocation which is plenty, no? > > #2 You might want to try #1 first and if that does not help then reclaim > all vmap blocks on the per cpu lists which have no active mappings > unconditionally. > > #3 _unmap_aliases() requires to touch everything because the caller has > no clue which vmap_area used a particular page and the vmap_area lost > that information too. > > Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the > vmap area first and then cares about the flush. That in turn requires > a full walk of _all_ vmap areas including the one which was just > added to the purge list. > > I can see the point that this is used to combine outstanding TLB > flushes and do the housekeeping of purging freed areas, but like #1 > there is no real good reason to zap usable vmap blocks > unconditionally. > > I'm all for consolidating functionality, but that swiss army knife > approach of one fits it all does not make sense to me. OK, got it. > > >> The other issue I pointed out: > >> > >> Assume the block has (for simplicity) 255 allocations size of 4 pages, > >> again free space of 4 pages. > >> > >> 254 allocations are freed, which means there is one remaining > >> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over > >> time. > >> > >> Now the last allocation is freed and the block is moved to the > >> purge_vmap_area_list, which then does a full flush of the complete area, > >> i.e. 
4MB in that case, while in fact it only needs to flush 2 pages. Agree, it sounds problematic. With your cure code, we can check that when we flush per va of purge list. If we can keep vmap_block() not freed until it's checked and freed in purge list, we can flush the last two pages of range in __purge_vmap_area_lazy(), please see next comment about how we could do to fix it. > > > > It's easy to fix. For vmap_block, I have marked it in va->flag with > > VMAP_RAM|VMAP_BLOCK. When flushing va in purge list, we can skip > > vmap_block va. > > How can you skip that? The last 2 pages in that VA still need to be > flushed, no? > > But VA has no information about already done partial flushes via the > vmap_block, so this requires flushing the full VA range unconditionally. Right, I see the problem you spotted with the illutrations. For the last 2 pages in that VA, we can keep vmap_block until we iterate the purge list in __purge_vmap_area_lazy(). Since you will iterate the purge list to add each va range to an array, we can do like below draft code to fix it. Surely this is on top of your cure code. diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 5ca55b357148..4b11a32df49d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) unsigned int num_purged_areas = 0; struct list_head local_purge_list; struct vmap_area *va, *n_va; + struct vmap_block vb; lockdep_assert_held(&vmap_purge_lock); @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) list_replace_init(&purge_vmap_area_list, &local_purge_list); spin_unlock(&purge_vmap_area_lock); + vb = container_of(va, struct vmap_block, va); + if ((va->flags & VMAP_FLAGS_MASK == VMAP_RAM|VMAP_BLOCK) && + (vb->dirty_max)) { + s = vb->dirty_min << PAGE_SHIFT; + e = vb->dirty_max << PAGE_SHIFT; + kfree(vb); + } + if (unlikely(list_empty(&local_purge_list))) goto out; @@ -2083,7 +2092,6 @@ static void free_vmap_block(struct vmap_block *vb) spin_unlock(&vmap_area_lock); free_vmap_area_noflush(vb->va); - kfree_rcu(vb, rcu_head); } static void purge_fragmented_blocks(int cpu) @@ -2163,11 +2171,6 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask) vaddr = vmap_block_vaddr(vb->va->va_start, pages_off); vb->free -= 1UL << order; bitmap_set(vb->used_map, pages_off, (1UL << order)); - if (vb->free == 0) { - spin_lock(&vbq->lock); - list_del_rcu(&vb->free_list); - spin_unlock(&vbq->lock); - } spin_unlock(&vb->lock); break; > > That made me think about the following scenario: > > vm_map_ram() > vb_alloc() > // Uses page P > vb->free -= 1UL << order; > if (vb->free == 0) > list_del_rcu(&vb->free_list); > > vm_unmap_ram() > vb_free() > Does not free the last mapping of the vmap_block > Clears the page table entry for page P, but does not flush TLB > > __free_page(P) > > page P gets reused > > vm_unmap_aliases() > > Does not find the VA which used page P because neither the VB is in > free_list nor the VA is in the purge_list > > Unless _vm_unmap_aliases() finds a large enough range to cover > page P or ends up with a flush_tlb_all() the stale TLB(s) persists. > > No? 
> > Same problem in this scenario: > > addr = vm_map_ram() > vb_alloc() > vb->free -= 1UL << order; > if (vb->free == 0) > list_del_rcu(&vb->free_list); > > set_memory_*(addr) > vm_unmap_aliases() > > Does not find the VA which contains @addr because neither the VB is in > free_list nor the VA is in the purge_list > > Unless _vm_unmap_aliases() finds a large enough range to cover > @addr or ends up with a flush_tlb_all() the stale TLB(s) persists. > > No? > > > I don't know how you will tackle the per va flush Nadav pointed out, > > so I will not give a dtaft code. > > It's not that hard. > > An array of ranges which can be memcpy()'ed into that existing x86 > percpu flush info thing, which avoids the cache line issue Nadav is > concerned of. > > That array would have obviously a limited number of entries as anything > with multiple ranges is most likely going to end up in a full flush > anyway. The drain_vmap_area_work() case definitely does so as > vmap_lazy_nr is guaranteed to be at least covering 32MB/PAGE_SIZE worth > of pages. > > I just got distracted and did not come around to finish it for various > reasons. One of them is to make sense of this code :) OK, looking forward to seeing it done when it's convenient to you. > > >> Also these intermediate flushes are inconsistent versus how fully > >> utilized blocks are handled: > >> > >> vb_alloc() > >> if (vb->free == 0) > >> list_del(vb->free_list); > >> > >> So all allocations which are freed after that point stay unflushed until > >> the last allocation is freed which moves the block to the > >> purge_vmap_area_list, where it gets a full VA range flush. > > > > That may be risky if stay unflushed until the last allocation is freed. > > We use vm_map_ram() interface to map passed in pages into vmalloc area. > > If vb_free() is called, the sub-region has been unmapped and user maybe > > have released the pages. user of vm_unmap_aliases() may be impacted if > > we don't flush those area freed with vb_free(). In reality, those areas > > have been unmapped, while there's still TLB existing. Not very sure > > about that. > > > > If we can hold the vmap_block flush until purging it w/o risk, it will > > save us many troubles. > > The point is that the TLBs _must_ be flushed > > 1) Before the virtual address space occupied by a vmap_area is > released and can be reused. > > So one might argue that flushing TLBs early is good to detect use > after free. > > The DEBUG_PAGE_ALLOC case flushes right after > vunmap_range_noflush(), which is the proper thing to do for > debugging as any UAF will be immediately detected after the > function returns. > > The production case flushes lazy and fully utilized blocks are not > on the free list and therefore not flushed until they are > purged. > > The partial flush via _vm_unmap_aliases() solves a completely > different problem. See #2/#3 below > > 2) Before a page, which might have been reused before the lazy flush > happened, can be used again. Also when the mapping attributes of a > page are changed via set_memory_*() > > That's what vm_unmap_aliases() -> _vm_unmap_aliases() is for. > > 3) When an executable mapping is torn down > > That's the vfree() + VM_FLUSH_RESET_PERMS case which also ends up > in _vm_unmap_aliases() I see it now, thanks. > > I completely understand that there needs to be a balance between the > high utilization use case and the occasional one which made us look into > this. That's fine, but there needs to be a way to make it least overhead > for both. Agree. 
Based on all the discussion, it should be much better if the three things
below are done. I haven't considered whether adding a dirty bitmap to
vmap_block is necessary and acceptable.

1) a threshold for vmap_block purging, e.g. 3/4 or 2/3 of VMAP_BBMAP_BITS;
2) your cure patch + the draft code I pasted above;
3) each va range added to an array and passed to flush_tlb_kernel_xx();

> 
> /me tries to find some spare cycles ...
> 
> Thanks,
> 
>         tglx
On Mon, May 22 2023 at 19:21, Baoquan He wrote:
> On 05/22/23 at 01:10am, Thomas Gleixner wrote:
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5ca55b357148..4b11a32df49d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	unsigned int num_purged_areas = 0;
>  	struct list_head local_purge_list;
>  	struct vmap_area *va, *n_va;
> +	struct vmap_block vb;
>
>  	lockdep_assert_held(&vmap_purge_lock);
>
> @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	list_replace_init(&purge_vmap_area_list, &local_purge_list);
>  	spin_unlock(&purge_vmap_area_lock);
>
> +	vb = container_of(va, struct vmap_block, va);

This cannot work: vmap_area is not embedded in vmap_block. vmap_block::va
is a pointer, and vmap_area does not link back to vmap_block, so there is
no way to find the block based on a vmap_area.

Aside of that, va is not initialized here :)
On 05/22/23 at 02:02pm, Thomas Gleixner wrote: > On Mon, May 22 2023 at 19:21, Baoquan He wrote: > > On 05/22/23 at 01:10am, Thomas Gleixner wrote: > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > index 5ca55b357148..4b11a32df49d 100644 > > --- a/mm/vmalloc.c > > +++ b/mm/vmalloc.c > > @@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) > > unsigned int num_purged_areas = 0; > > struct list_head local_purge_list; > > struct vmap_area *va, *n_va; > > + struct vmap_block vb; > > > > lockdep_assert_held(&vmap_purge_lock); > > > > @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) > > list_replace_init(&purge_vmap_area_list, &local_purge_list); > > spin_unlock(&purge_vmap_area_lock); > > > > + vb = container_of(va, struct vmap_block, va); > > This cannot work vmap_area is not embedded in vmap_block. vmap_block::va > is a pointer. vmap_area does not link back to vmap_block, so there is no > way to find it based on a vmap_area. Oh, the code is buggy. va->flags can tell if it's vmap_block, then we can deduce the vb pointer. diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 5ca55b357148..73d6ce441351 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) unsigned int num_purged_areas = 0; struct list_head local_purge_list; struct vmap_area *va, *n_va; + struct vmap_block vb; lockdep_assert_held(&vmap_purge_lock); @@ -1736,6 +1737,15 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) list_replace_init(&purge_vmap_area_list, &local_purge_list); spin_unlock(&purge_vmap_area_lock); + if (va->flags & VMAP_FLAGS_MASK == VMAP_RAM|VMAP_BLOCK) { + vb = container_of(va, struct vmap_block, va); + if (vb->dirty_max) { /*This is pseudo code for illustration*/ + s = vb->dirty_min << PAGE_SHIFT; + e = vb->dirty_max << PAGE_SHIFT; + } + kfree(vb); + } + if (unlikely(list_empty(&local_purge_list))) goto out; @@ -2083,7 +2093,6 @@ static void free_vmap_block(struct vmap_block *vb) spin_unlock(&vmap_area_lock); free_vmap_area_noflush(vb->va); - kfree_rcu(vb, rcu_head); } > > Aside of that va is not initialized here :) Oh, this is not real code, just to illustrate how it can calculate and flush the last two pages of vmap_block. If you have the per va flushing via array patch, I can work out formal code change based on that.
On Mon, May 22 2023 at 22:34, Baoquan He wrote: > On 05/22/23 at 02:02pm, Thomas Gleixner wrote: >> > @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) >> > list_replace_init(&purge_vmap_area_list, &local_purge_list); >> > spin_unlock(&purge_vmap_area_lock); >> > >> > + vb = container_of(va, struct vmap_block, va); >> >> This cannot work vmap_area is not embedded in vmap_block. vmap_block::va >> is a pointer. vmap_area does not link back to vmap_block, so there is no >> way to find it based on a vmap_area. > > Oh, the code is buggy. va->flags can tell if it's vmap_block, then we > can deduce the vb pointer. No. It _CANNOT_ work whether you check the flags or not. struct foo { ..... struct bar bar; }; container_of(ptr_to_bar, struct foo, bar) returns the pointer to the struct foo which has struct bar embedded. But struct foo { ..... struct bar *bar; }; cannot do that because ptr_to_bar points to some object which is completely disconnected from struct foo. Care to look at the implementation of container_of()? Here is what it boils down to: void *member_pointer = bar; p = (struct foo *)(member_pointer - offsetof(struct foo, bar); So it uses the pointer to bar and subtracts the offset of bar in struct foo. This obviously can only work when struct bar is embedded in struct foo. Lets assume that *bar is the first member of foo, i.e. offset of *bar in struct foo is 0 p = (struct foo *)(member_pointer - 0); So you end up with p == member_pointer == bar But you won't get there because the static_assert() in container_of() will catch that and the compiler will tell you in colourful ways. Once the vmap area is handed over for cleaning up the vmap block is gone and even if you let it stay around then the vmap area does not have any information where to find the block. You'd need to have a pointer to the vmap block in vmap area or embed vmap area into vmap block. See? Thanks, tglx
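[Editorial aside: a small, self-contained illustration of the container_of() point made above — the outer structure can only be recovered when the member is embedded, because all the macro does is subtract the member's offset from the member's address. The simplified macro below omits the kernel's static_assert() type check mentioned in the mail.]

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified container_of(): pure offset arithmetic, no type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct bar { int x; };

struct foo_embedded {
	long cookie;
	struct bar bar;		/* embedded: container_of() works */
};

struct foo_pointer {
	long cookie;
	struct bar *bar;	/* pointer: the pointee is a separate object */
};

int main(void)
{
	struct foo_embedded fe = { .cookie = 42, .bar = { .x = 1 } };
	struct bar separate = { .x = 2 };
	struct foo_pointer fp = { .cookie = 7, .bar = &separate };

	/* Works: &fe.bar lies inside fe, offset subtraction recovers fe. */
	struct foo_embedded *ok = container_of(&fe.bar, struct foo_embedded, bar);
	printf("embedded case: cookie=%ld\n", ok->cookie);	/* prints 42 */

	/*
	 * Cannot work: fp.bar points at 'separate', a completely disconnected
	 * object that carries no information about fp. Subtracting the offset
	 * would just compute a bogus address near 'separate'. The kernel's
	 * container_of() rejects this at compile time via static_assert().
	 */
	(void)fp;
	return 0;
}
```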
On Mon, May 22 2023 at 22:21, Thomas Gleixner wrote:
> Lets assume that *bar is the first member of foo, i.e. offset of *bar in
> struct foo is 0
>
>   p = (struct foo *)(member_pointer - 0);
>
> So you end up with
>
>   p == member_pointer == bar
>
> But you won't get there because the static_assert() in container_of()
> will catch that and the compiler will tell you in colourful ways.
>
> Once the vmap area is handed over for cleaning up the vmap block is gone
> and even if you let it stay around then the vmap area does not have any
> information where to find the block.
>
> You'd need to have a pointer to the vmap block in vmap area or embed
> vmap area into vmap block.

The latter would require to:

  - split alloc_vmap_area() apart

  - sprinkle 'if (vmap_area_is_vmap_block(va))' all over the place

  - do a lot of other nasty things

Not sure if that's worth it. There are some other options to pursue.

Thanks,

        tglx
On 05/22/23 at 10:21pm, Thomas Gleixner wrote: > On Mon, May 22 2023 at 22:34, Baoquan He wrote: > > On 05/22/23 at 02:02pm, Thomas Gleixner wrote: > >> > @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) > >> > list_replace_init(&purge_vmap_area_list, &local_purge_list); > >> > spin_unlock(&purge_vmap_area_lock); > >> > > >> > + vb = container_of(va, struct vmap_block, va); > >> > >> This cannot work vmap_area is not embedded in vmap_block. vmap_block::va > >> is a pointer. vmap_area does not link back to vmap_block, so there is no > >> way to find it based on a vmap_area. > > > > Oh, the code is buggy. va->flags can tell if it's vmap_block, then we > > can deduce the vb pointer. > > No. It _CANNOT_ work whether you check the flags or not. > > struct foo { > ..... > struct bar bar; > }; > > container_of(ptr_to_bar, struct foo, bar) returns the pointer to the > struct foo which has struct bar embedded. > > But > > struct foo { > ..... > struct bar *bar; > }; > > cannot do that because ptr_to_bar points to some object which is > completely disconnected from struct foo. > > Care to look at the implementation of container_of()? > > Here is what it boils down to: > > void *member_pointer = bar; > > p = (struct foo *)(member_pointer - offsetof(struct foo, bar); > > So it uses the pointer to bar and subtracts the offset of bar in struct > foo. This obviously can only work when struct bar is embedded in struct > foo. > > Lets assume that *bar is the first member of foo, i.e. offset of *bar in > struct foo is 0 > > p = (struct foo *)(member_pointer - 0); > > So you end up with > > p == member_pointer == bar > > But you won't get there because the static_assert() in container_of() > will catch that and the compiler will tell you in colourful ways. Thanks a lot, learn it now. I never noticed container_of() is not suitable for pointer member of struct case. > > Once the vmap area is handed over for cleaning up the vmap block is gone > and even if you let it stay around then the vmap area does not have any > information where to find the block. > > You'd need to have a pointer to the vmap block in vmap area or embed > vmap area into vmap block. Got it now. Embedding vmap_area into vmap_block seems not feasible because va need be reused when inserting into free_vmap_area_root/list. Adding a pointer to vmap_block looks do-able. Since vm_map_ram area doesn't have vm_struct associated with it, we can reuse the space of '->vm' to add vb pointer like below. Since in the existing code there are places where we use 'if(!va->vm)' to check if it's a normal va, we need be careful to find all of them out and replace with new and tighter checking. Will give a draft code change after all is done and testing is passed. 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index c720be70c8dd..e2ba6d59d679 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -15,6 +15,7 @@ struct vm_area_struct; /* vma defining user mapping in mm_types.h */ struct notifier_block; /* in notifier.h */ struct iov_iter; /* in uio.h */ +struct vmap_block; /* in mm/vmalloc.c */ /* bits in flags of vmalloc's vm_struct below */ #define VM_IOREMAP 0x00000001 /* ioremap() and friends */ @@ -76,6 +77,7 @@ struct vmap_area { union { unsigned long subtree_max_size; /* in "free" tree */ struct vm_struct *vm; /* in "busy" tree */ + struct vmap_block *vb; /* in "busy and purge" tree */ }; unsigned long flags; /* mark type of vm_map_ram area */ }; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index c0f80982eb06..d97343271e27 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2061,6 +2061,10 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) return ERR_PTR(err); } + spin_lock(&vmap_area_lock); + va->vb = vb; + spin_unlock(&vmap_area_lock); + vbq = raw_cpu_ptr(&vmap_block_queue); spin_lock(&vbq->lock); list_add_tail_rcu(&vb->free_list, &vbq->free);
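[Editorial aside: a minimal sketch, building on the draft above, of how the proposed va->vb back-pointer might be consumed when walking the purge list, so that for a vm_map_ram block only the still-unflushed dirty range is added to the flush range instead of the whole block. The helper name and the exact checks are assumptions; the kfree() assumes the earlier draft in this thread that defers freeing the vmap_block to the purge path.]

```c
/*
 * Hedged sketch: compute the flush contribution of one purged vmap_area.
 * For a vm_map_ram block, use the block's dirty_min/max tracking reached
 * via the proposed va->vb back-pointer; otherwise flush the full extent.
 */
static void va_flush_range(struct vmap_area *va,
			   unsigned long *start, unsigned long *end)
{
	struct vmap_block *vb;

	if ((va->flags & VMAP_FLAGS_MASK) != (VMAP_RAM | VMAP_BLOCK) || !va->vb) {
		/* Regular lazily freed area: flush its full extent. */
		*start = min(*start, va->va_start);
		*end = max(*end, va->va_end);
		return;
	}

	vb = va->vb;
	if (vb->dirty_max) {
		/* Only the still-unflushed part of the block needs flushing. */
		*start = min(*start, va->va_start + (vb->dirty_min << PAGE_SHIFT));
		*end = max(*end, va->va_start + ((vb->dirty_max + 1) << PAGE_SHIFT));
	}

	/* Per the earlier draft, the block is only freed here, not in vb_free(). */
	kfree(vb);
}
```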
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 87134dd8abc3..9f7cbd6182ad 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2236,8 +2236,23 @@ static void vb_free(unsigned long addr, unsigned long size) spin_unlock(&vb->lock); } -static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) +/** + * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer + * + * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily + * to amortize TLB flushing overheads. What this means is that any page you + * have now, may, in a former life, have been mapped into kernel virtual + * address by the vmap layer and so there might be some CPUs with TLB entries + * still referencing that page (additional to the regular 1:1 kernel mapping). + * + * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can + * be sure that none of the pages we have control over will have any aliases + * from the vmap layer. + */ +void vm_unmap_aliases(void) { + unsigned long start = ULONG_MAX, end = 0; + bool flush = false; int cpu; if (unlikely(!vmap_initialized)) @@ -2245,6 +2260,9 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) might_sleep(); + mutex_lock(&vmap_purge_lock); + purge_fragmented_blocks_allcpus(); + for_each_possible_cpu(cpu) { struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu); struct vmap_block *vb; @@ -2262,40 +2280,17 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) start = min(s, start); end = max(e, end); - flush = 1; + flush = true; } spin_unlock(&vb->lock); } rcu_read_unlock(); } - mutex_lock(&vmap_purge_lock); - purge_fragmented_blocks_allcpus(); if (!__purge_vmap_area_lazy(start, end) && flush) flush_tlb_kernel_range(start, end); mutex_unlock(&vmap_purge_lock); } - -/** - * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer - * - * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily - * to amortize TLB flushing overheads. What this means is that any page you - * have now, may, in a former life, have been mapped into kernel virtual - * address by the vmap layer and so there might be some CPUs with TLB entries - * still referencing that page (additional to the regular 1:1 kernel mapping). - * - * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can - * be sure that none of the pages we have control over will have any aliases - * from the vmap layer. - */ -void vm_unmap_aliases(void) -{ - unsigned long start = ULONG_MAX, end = 0; - int flush = 0; - - _vm_unmap_aliases(start, end, flush); -} EXPORT_SYMBOL_GPL(vm_unmap_aliases); /**
After vb_free() invocation, the va will be purged and put into the purge
tree/list if the entire vmap_block is dirty. If not entirely dirty, the
vmap_block stays in the percpu vmap_block_queue list, like the two
layouts below:

(1)
 |-----|------------|-----------|-------|
 |dirty|still mapped|   dirty   | free  |

(2)
 |------------------------------|-------|
 |            dirty             | free  |

In the current _vm_unmap_aliases(), to reclaim those unmapped ranges and
flush, it iterates the percpu vbq to calculate the range from each
vmap_block as in the two cases above. It then calls
purge_fragmented_blocks_allcpus() to purge the vmap_block of case (2),
since no mapping exists any more, and puts the purged vmap_block vas into
the purge tree/list. Afterwards, __purge_vmap_area_lazy() calculates the
flush range from the purge list again. Obviously, this takes the
vmap_block va of case (2) into account twice.

So move purge_fragmented_blocks_allcpus() up to purge the vmap_block vas
of case (2) first; then the loop only needs to iterate over and count the
dirty ranges of case (1). With this change, counting the dirty region of
a vmap_block in case (1) now happens inside the vmap_purge_lock protected
region, which makes the flush range calculation more reasonable and
accurate by avoiding concurrent operations on other CPUs.

Also rename _vm_unmap_aliases() to vm_unmap_aliases(), since there is no
caller other than the old vm_unmap_aliases().

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 mm/vmalloc.c | 45 ++++++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)