Message ID | 20191003013325.2614-1-leonardo@linux.ibm.com (mailing list archive) |
---|---|
Headers | show |
Series | Introduces new count-based method for tracking lockless pagetable walks | expand |
On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote: > If a process (qemu) with a lot of CPUs (128) try to munmap() a large > chunk of memory (496GB) mapped with THP, it takes an average of 275 > seconds, which can cause a lot of problems to the load (in qemu case, > the guest will lock for this time). > > Trying to find the source of this bug, I found out most of this time is > spent on serialize_against_pte_lookup(). This function will take a lot > of time in smp_call_function_many() if there is more than a couple CPUs > running the user process. Since it has to happen to all THP mapped, it > will take a very long time for large amounts of memory. > > By the docs, serialize_against_pte_lookup() is needed in order to avoid > pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless > pagetable walk, to happen concurrently with THP splitting/collapsing. > > It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[], > after interrupts are re-enabled. > Since, interrupts are (usually) disabled during lockless pagetable > walk, and serialize_against_pte_lookup will only return after > interrupts are enabled, it is protected. This is something entirely specific to Power, you shouldn't be touching generic code at all. Also, I'm not sure I understand things properly. So serialize_against_pte_lookup() wants to wait for all currently out-standing __find_linux_pte() instances (which are very similar to gup_fast). It seems to want to do this before flushing the THP TLB for some reason; why? Should not THP observe the normal page table freeing rules which includes a RCU-like grace period like this already. Why is THP special here? This doesn't seem adequately explained. Also, specifically to munmap(), this seems entirely superfluous, munmap() uses the normal page-table freeing code and should be entirely fine without additional waiting. Furthermore, Power never accurately tracks mm_cpumask(), so using that makes the whole thing more expensive than it needs to be. Also, I suppose that is buggered vs file backed THP.
Hello Peter, thanks for the feedback! On Thu, 2019-10-03 at 09:29 +0200, Peter Zijlstra wrote: > On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote: > > If a process (qemu) with a lot of CPUs (128) try to munmap() a large > > chunk of memory (496GB) mapped with THP, it takes an average of 275 > > seconds, which can cause a lot of problems to the load (in qemu case, > > the guest will lock for this time). > > > > Trying to find the source of this bug, I found out most of this time is > > spent on serialize_against_pte_lookup(). This function will take a lot > > of time in smp_call_function_many() if there is more than a couple CPUs > > running the user process. Since it has to happen to all THP mapped, it > > will take a very long time for large amounts of memory. > > > > By the docs, serialize_against_pte_lookup() is needed in order to avoid > > pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless > > pagetable walk, to happen concurrently with THP splitting/collapsing. > > > > It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[], > > after interrupts are re-enabled. > > Since, interrupts are (usually) disabled during lockless pagetable > > walk, and serialize_against_pte_lookup will only return after > > interrupts are enabled, it is protected. > > This is something entirely specific to Power, you shouldn't be touching > generic code at all. Up to v4, I was declaring dummy functions so it would not mess up with other archs: http://patchwork.ozlabs.org/patch/1168779/ But I was recommended to create a generic function that could guide the way to archs: http://patchwork.ozlabs.org/patch/1168775/ The idea was to concentrate all routines of beginning/ending lockless pagetable walks on these functions, and call them instead of irq_disable/irq_enable. Then it was easy to place the refcount-based tracking in these functions. It should only be enabled in case the config chooses to do so. > > Also, I'm not sure I understand things properly. > > So serialize_against_pte_lookup() wants to wait for all currently > out-standing __find_linux_pte() instances (which are very similar to > gup_fast). > > It seems to want to do this before flushing the THP TLB for some reason; > why? Should not THP observe the normal page table freeing rules which > includes a RCU-like grace period like this already. > > Why is THP special here? This doesn't seem adequately explained. "It's necessary to monitor lockless pagetable walks, in order to avoid doing THP splitting/collapsing during them." If a there is a THP split/collapse during the lockless pagetable walk, the returned ptep can be a pointing to an invalid pte. To avoid that, the pmd is updated, then serialize_against_pte_lookup is ran. Serialize runs a do_nothing in all cpu in cpu_mask. So, after all cpus finish running do_nothing(), there is a guarantee that if there is any 'lockless pagetable walk' it is running on top of a updated version of this pmd, and so, collapsing/splitting THP is safe. > > Also, specifically to munmap(), this seems entirely superfluous, > munmap() uses the normal page-table freeing code and should be entirely > fine without additional waiting. To be honest, I remember it being needed in munmap case, but I really don't remember the details. I will take a deeper look and come back with this answer. > Furthermore, Power never accurately tracks mm_cpumask(), so using that > makes the whole thing more expensive than it needs to be. Also, I > suppose that is buggered vs file backed THP. That accuracy of mm_cpumask is above my knowledge right now. =) I agree that it's to expensive to do that. That's why I suggested this method, that can check if there is any 'lockless pagetable walk' running before trying to serialize. It reduced the waiting time a lot for large amounts of memory. (more details on cover letter) Best regards, Leonardo Brás
On 10/3/19 1:36 PM, Leonardo Bras wrote: > On Thu, 2019-10-03 at 09:29 +0200, Peter Zijlstra wrote: >> On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote: ... >> This is something entirely specific to Power, you shouldn't be touching >> generic code at all. > > Up to v4, I was declaring dummy functions so it would not mess up with > other archs: http://patchwork.ozlabs.org/patch/1168779/ > > But I was recommended to create a generic function that could guide the > way to archs: http://patchwork.ozlabs.org/patch/1168775/ Yes. And to clarify, I was assuming that the changes to mm/gup.c were required in order to accomplish your goals. Given that assumption, I wanted the generic code to be "proper", and that's what that feedback is about. If you can somehow do it entirely as an arch-specific thing, then probably that's even better. Although the other questions about file-backed THP make it sound like some rethinking across the board is required now. thanks,
On Thu, 2019-10-03 at 13:49 -0700, John Hubbard wrote: > Yes. And to clarify, I was assuming that the changes to mm/gup.c were > required in order to accomplish your goals. Given that assumption, I > wanted the generic code to be "proper", and that's what that feedback > is about. You assumed right. On my counting approach it's necessary count all 'lockless pagetable walks', including the ones in generic code. And, I think even without the counting approach, it was a good way focus this 'lockless pagetable walk' routine in a single place. > > Although the other questions about file-backed THP > make it sound like some rethinking across the board is required now. Yeap, I need to better understand how the file-backed THP problem works. Thanks, Leonardo Brás
On Thu, Oct 03, 2019 at 05:36:31PM -0300, Leonardo Bras wrote: > > Also, I'm not sure I understand things properly. > > > > So serialize_against_pte_lookup() wants to wait for all currently > > out-standing __find_linux_pte() instances (which are very similar to > > gup_fast). > > > > It seems to want to do this before flushing the THP TLB for some reason; > > why? Should not THP observe the normal page table freeing rules which > > includes a RCU-like grace period like this already. > > > > Why is THP special here? This doesn't seem adequately explained. > > "It's necessary to monitor lockless pagetable walks, in order to avoid > doing THP splitting/collapsing during them." > > If a there is a THP split/collapse during the lockless pagetable walk, > the returned ptep can be a pointing to an invalid pte. So the whole premise of lockless page-table walks (gup_fast) is that it can work on in-flux page-tables. Specifically gup_fast() never returns PTEs, only struct page *, and only if it can increment the page refcount. In order to enable this, page-table pages are RCU(-like) freed, such that even if we access page-tables that have (concurrently) been unlinked, we'll not UaF (see asm-generic/tlb.h, the comment at HAVE_RCU_TABLE_FREE). IOW, the worst case if not getting a struct page *. I really don't see how THP splitting/collapsing is special here, either we see the PMD and find a struct page * or we see a PTE and find the same struct page * (compound page head). The only thing that needs to be guaranteed is that both PTEs and PMD page-tables are valid. Is this not so? > To avoid that, the pmd is updated, then serialize_against_pte_lookup is > ran. Serialize runs a do_nothing in all cpu in cpu_mask. > > So, after all cpus finish running do_nothing(), there is a guarantee > that if there is any 'lockless pagetable walk' it is running on top of > a updated version of this pmd, and so, collapsing/splitting THP is > safe. But why would it matter?! It would find the same struct page * through either version of the page-tables. *confused* > > Also, specifically to munmap(), this seems entirely superfluous, > > munmap() uses the normal page-table freeing code and should be entirely > > fine without additional waiting. > > To be honest, I remember it being needed in munmap case, but I really > don't remember the details. I will take a deeper look and come back > with this answer. munmap does normal mmu_gather page-table teardown, the THP PMD should be RCU-like freed just like any other PMD. Which should be perfectly safe vs lockless page-table walks. If you can find anything there that isn't right, please explain that in detail and we'll need to look hard at fixing _that_. > > Furthermore, Power never accurately tracks mm_cpumask(), so using that > > makes the whole thing more expensive than it needs to be. Also, I > > suppose that is buggered vs file backed THP. > > That accuracy of mm_cpumask is above my knowledge right now. =) Basically PowerPC only ever sets bits in there, unlike x86 that also clears bits (at expense, but it's worth it for us).
On Fri, Oct 04, 2019 at 01:42:36PM +0200, Peter Zijlstra wrote: > If you can find anything there that isn't right, please explain that in > detail and we'll need to look hard at fixing _that_. Also, I can't imagine Nick is happy with 128 CPUs banging on that atomic counter, esp. since atomics are horrifically slow on Power.