Message ID | 20181013013200.206928-3-joel@joelfernandes.org (mailing list archive)
---|---
State | New, archived
Series | Add support for fast mremap
On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> Android needs to mremap large regions of memory during memory management
> related operations.

Just curious: why?

> +        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> +            || old_end - old_addr < PMD_SIZE)

The || goes on the first line.

> +        } else if (extent == PMD_SIZE && IS_ENABLED(CONFIG_HAVE_MOVE_PMD)) {

Overly long line.
On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > Android needs to mremap large regions of memory during memory management
> > related operations.
>
> Just curious: why?

In Android we have a requirement of moving a large (up to a GB now, but may
grow bigger in future) memory range from one location to another. This move
operation has to happen when the application threads are paused for this
operation. Therefore, an inefficient move like it is now (for example 250ms
on arm64) will cause response time issues for applications, which is not
acceptable. Huge pages cannot be used in such memory ranges to avoid this
inefficiency as (when the application threads are running) our fault handlers
are designed to process 4KB pages at a time, to keep response times low. So
using huge pages in this context can, again, cause response time issues.

Also, the mremap syscall waiting for quarter of a second for a large mremap
is quite weird and we ought to improve it where possible.

> > +        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > +            || old_end - old_addr < PMD_SIZE)
>
> The || goes on the first line.

Ok, fixed.

> > +        } else if (extent == PMD_SIZE && IS_ENABLED(CONFIG_HAVE_MOVE_PMD)) {
>
> Overly long line.

Ok, fixed. Preview of updated patch is below.

thanks,

- Joel

------8<---

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v3)

Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is not
enabled. The bottleneck is move_page_tables, which is copying each pte at a
time, and can be really slow across a large map. Turning on THP may not be a
viable option, and is not for us. This patch speeds up the performance for
non-THP systems by copying at the PMD level when possible.

The speed up is three orders of magnitude. On a 1GB mremap, the mremap
completion time drops from 160-250 milliseconds to 380-400 microseconds.

Before:
Total mremap time for 1GB data: 242321014 nanoseconds.
Total mremap time for 1GB data: 196842467 nanoseconds.
Total mremap time for 1GB data: 167051162 nanoseconds.

After:
Total mremap time for 1GB data: 385781 nanoseconds.
Total mremap time for 1GB data: 388959 nanoseconds.
Total mremap time for 1GB data: 402813 nanoseconds.

In case THP is enabled, the optimization is mostly skipped except in certain
situations. I also flush the tlb every time we do this optimization since I
couldn't find a way to determine if the low-level PTEs are dirty. It is seen
that the cost of doing so is not much compared to the improvement, on both
x86-64 and arm64.

Cc: minchan@kernel.org
Cc: pantin@google.com
Cc: hughd@google.com
Cc: lokeshgidra@google.com
Cc: dancol@google.com
Cc: mhocko@kernel.org
Cc: kirill@shutemov.name
Cc: akpm@linux-foundation.org
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/Kconfig |  5 ++++
 mm/mremap.c  | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..9724fe39884f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -518,6 +518,11 @@ config HAVE_IRQ_TIME_ACCOUNTING
           Archs need to ensure they use a high enough resolution clock to
           support irq time accounting and then call enable_sched_clock_irqtime().
 
+config HAVE_MOVE_PMD
+        bool
+        help
+          Archs that select this are able to move page tables at the PMD level.
+
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
         bool
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 9e68a02a52b1..a8dd98a59975 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
                 drop_rmap_locks(vma);
 }
 
+static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+                unsigned long new_addr, unsigned long old_end,
+                pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
+{
+        spinlock_t *old_ptl, *new_ptl;
+        struct mm_struct *mm = vma->vm_mm;
+
+        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK) ||
+            old_end - old_addr < PMD_SIZE)
+                return false;
+
+        /*
+         * The destination pmd shouldn't be established, free_pgtables()
+         * should have release it.
+         */
+        if (WARN_ON(!pmd_none(*new_pmd)))
+                return false;
+
+        /*
+         * We don't have to worry about the ordering of src and dst
+         * ptlocks because exclusive mmap_sem prevents deadlock.
+         */
+        old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+        if (old_ptl) {
+                pmd_t pmd;
+
+                new_ptl = pmd_lockptr(mm, new_pmd);
+                if (new_ptl != old_ptl)
+                        spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+                /* Clear the pmd */
+                pmd = *old_pmd;
+                pmd_clear(old_pmd);
+
+                VM_BUG_ON(!pmd_none(*new_pmd));
+
+                /* Set the new pmd */
+                set_pmd_at(mm, new_addr, new_pmd, pmd);
+                if (new_ptl != old_ptl)
+                        spin_unlock(new_ptl);
+                spin_unlock(old_ptl);
+
+                *need_flush = true;
+                return true;
+        }
+        return false;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
                 unsigned long old_addr, struct vm_area_struct *new_vma,
                 unsigned long new_addr, unsigned long len,
@@ -239,7 +287,25 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
                         split_huge_pmd(vma, old_pmd, old_addr);
                         if (pmd_trans_unstable(old_pmd))
                                 continue;
+                } else if (extent == PMD_SIZE &&
+                           IS_ENABLED(CONFIG_HAVE_MOVE_PMD)) {
+                        /*
+                         * If the extent is PMD-sized, try to speed the move by
+                         * moving at the PMD level if possible.
+                         */
+                        bool moved;
+
+                        if (need_rmap_locks)
+                                take_rmap_locks(vma);
+                        moved = move_normal_pmd(vma, old_addr, new_addr,
+                                        old_end, old_pmd, new_pmd,
+                                        &need_flush);
+                        if (need_rmap_locks)
+                                drop_rmap_locks(vma);
+                        if (moved)
+                                continue;
                 }
+
                 if (pte_alloc(new_vma->vm_mm, new_pmd))
                         break;
                 next = (new_addr + PMD_SIZE) & PMD_MASK;
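The thread does not include the test program behind the "Total mremap time for
1GB data" numbers quoted in the changelog. A minimal userspace sketch along
the following lines should exercise the same path; the layout, the 2MB
alignment, and the use of MREMAP_FIXED are assumptions made for illustration
here, not Joel's actual benchmark. With THP disabled and CONFIG_HAVE_MOVE_PMD
enabled, both addresses being PMD-aligned lets the new fast path handle every
2MB extent.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define GB      (1UL << 30)
#define PMD_SZ  (1UL << 21)     /* 2MB with 4K pages on x86-64/arm64 */

int main(void)
{
        struct timespec ts, te;

        /* Reserve a window large enough for source + destination, then
         * align the source to PMD size so the PMD-level path can apply. */
        char *raw = mmap(NULL, 2 * GB + PMD_SZ, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (raw == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        char *src = (char *)(((uintptr_t)raw + PMD_SZ - 1) & ~(PMD_SZ - 1));
        char *dst = src + GB;   /* also PMD-aligned */

        munmap(dst, GB);        /* leave the destination window unmapped */
        memset(src, 1, GB);     /* fault in 1GB of page tables to move */

        clock_gettime(CLOCK_MONOTONIC, &ts);
        void *p = mremap(src, GB, GB, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
        clock_gettime(CLOCK_MONOTONIC, &te);
        if (p == MAP_FAILED) {
                perror("mremap");
                return 1;
        }

        long long ns = (te.tv_sec - ts.tv_sec) * 1000000000LL +
                       (te.tv_nsec - ts.tv_nsec);
        printf("Total mremap time for 1GB data: %lld nanoseconds.\n", ns);
        return 0;
}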
On 10/16/18 12:33 AM, Joel Fernandes wrote:
> On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
>> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
>>> Android needs to mremap large regions of memory during memory management
>>> related operations.
>>
>> Just curious: why?
>
> In Android we have a requirement of moving a large (up to a GB now, but may
> grow bigger in future) memory range from one location to another.

I think Christoph's "why?" was about the requirement, not why it hurts
applications. I admit I'm now also curious :)

> This move
> operation has to happen when the application threads are paused for this
> operation. Therefore, an inefficient move like it is now (for example 250ms
> on arm64) will cause response time issues for applications, which is not
> acceptable. Huge pages cannot be used in such memory ranges to avoid this
> inefficiency as (when the application threads are running) our fault handlers
> are designed to process 4KB pages at a time, to keep response times low. So
> using huge pages in this context can, again, cause response time issues.
>
> Also, the mremap syscall waiting for quarter of a second for a large mremap
> is quite weird and we ought to improve it where possible.
On Tue, Oct 16, 2018 at 01:29:52PM +0200, Vlastimil Babka wrote:
> On 10/16/18 12:33 AM, Joel Fernandes wrote:
> > On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
> >> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> >>> Android needs to mremap large regions of memory during memory management
> >>> related operations.
> >>
> >> Just curious: why?
> >
> > In Android we have a requirement of moving a large (up to a GB now, but may
> > grow bigger in future) memory range from one location to another.
>
> I think Christoph's "why?" was about the requirement, not why it hurts
> applications. I admit I'm now also curious :)

This issue was discovered when we wanted to be able to move the physical
pages of a memory range to another location quickly so that, after the
application threads are resumed, UFFDIO_REGISTER_MODE_MISSING userfaultfd
faults can be received on the original memory range. The actual operations
performed on the memory range are beyond the scope of this discussion. The
user threads continue to refer to the old address which will now fault. The
reason we want to retain the old memory range and receive faults there is to
avoid the need to fix the addresses all over the address space of the threads
after we finish performing operations on them in the fault handlers, so we
mremap it and receive faults at the old addresses.

Does that answer your question?

thanks,

- Joel
On 10/16/18 9:43 PM, Joel Fernandes wrote:
> On Tue, Oct 16, 2018 at 01:29:52PM +0200, Vlastimil Babka wrote:
>> On 10/16/18 12:33 AM, Joel Fernandes wrote:
>>> On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
>>>> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
>>>>> Android needs to mremap large regions of memory during memory management
>>>>> related operations.
>>>>
>>>> Just curious: why?
>>>
>>> In Android we have a requirement of moving a large (up to a GB now, but may
>>> grow bigger in future) memory range from one location to another.
>>
>> I think Christoph's "why?" was about the requirement, not why it hurts
>> applications. I admit I'm now also curious :)
>
> This issue was discovered when we wanted to be able to move the physical
> pages of a memory range to another location quickly so that, after the
> application threads are resumed, UFFDIO_REGISTER_MODE_MISSING userfaultfd
> faults can be received on the original memory range. The actual operations
> performed on the memory range are beyond the scope of this discussion. The
> user threads continue to refer to the old address which will now fault. The
> reason we want to retain the old memory range and receive faults there is to
> avoid the need to fix the addresses all over the address space of the threads
> after we finish performing operations on them in the fault handlers, so we
> mremap it and receive faults at the old addresses.
>
> Does that answer your question?

Yes, interesting, thanks!
Vlastimil

> thanks,
>
> - Joel
>
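To make the userfaultfd flow described above concrete, here is a rough sketch
of how the vacated old range could be re-established and registered for
MISSING faults. It is illustrative only: the helper name, the MAP_FIXED
re-mapping step, and the error handling are assumptions on my part, not code
from the Android implementation.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: after the pages have been mremap()'d away, map the
 * old range again as empty anonymous memory and register it so that any
 * access by the resumed application threads raises a MISSING userfaultfd
 * event instead of touching the moved pages.
 */
int watch_old_range(void *old_addr, size_t len)
{
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)old_addr, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };

        /* Re-establish an (empty) anonymous mapping at the old address. */
        if (mmap(old_addr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
                return -1;

        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        if (uffd < 0)
                return -1;
        if (ioctl(uffd, UFFDIO_API, &api) || ioctl(uffd, UFFDIO_REGISTER, &reg)) {
                close(uffd);
                return -1;
        }

        /*
         * A monitor thread would now read(2) struct uffd_msg events from
         * uffd and resolve each fault with UFFDIO_COPY or UFFDIO_ZEROPAGE.
         */
        return uffd;
}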
On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 9e68a02a52b1..2fd163cff406 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>                  drop_rmap_locks(vma);
>  }
>
> +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> +                unsigned long new_addr, unsigned long old_end,
> +                pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> +{
> +        spinlock_t *old_ptl, *new_ptl;
> +        struct mm_struct *mm = vma->vm_mm;
> +
> +        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> +            || old_end - old_addr < PMD_SIZE)
> +                return false;
> +
> +        /*
> +         * The destination pmd shouldn't be established, free_pgtables()
> +         * should have release it.
> +         */
> +        if (WARN_ON(!pmd_none(*new_pmd)))
> +                return false;
> +
> +        /*
> +         * We don't have to worry about the ordering of src and dst
> +         * ptlocks because exclusive mmap_sem prevents deadlock.
> +         */
> +        old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> +        if (old_ptl) {

How can it ever be false?

> +                pmd_t pmd;
> +
> +                new_ptl = pmd_lockptr(mm, new_pmd);
> +                if (new_ptl != old_ptl)
> +                        spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> +                /* Clear the pmd */
> +                pmd = *old_pmd;
> +                pmd_clear(old_pmd);
> +
> +                VM_BUG_ON(!pmd_none(*new_pmd));
> +
> +                /* Set the new pmd */
> +                set_pmd_at(mm, new_addr, new_pmd, pmd);
> +                if (new_ptl != old_ptl)
> +                        spin_unlock(new_ptl);
> +                spin_unlock(old_ptl);
> +
> +                *need_flush = true;
> +                return true;
> +        }
> +        return false;
> +}
> +
On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 9e68a02a52b1..2fd163cff406 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> >                  drop_rmap_locks(vma);
> >  }
> >
> > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > +                unsigned long new_addr, unsigned long old_end,
> > +                pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > +{
> > +        spinlock_t *old_ptl, *new_ptl;
> > +        struct mm_struct *mm = vma->vm_mm;
> > +
> > +        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > +            || old_end - old_addr < PMD_SIZE)
> > +                return false;
> > +
> > +        /*
> > +         * The destination pmd shouldn't be established, free_pgtables()
> > +         * should have release it.
> > +         */
> > +        if (WARN_ON(!pmd_none(*new_pmd)))
> > +                return false;
> > +
> > +        /*
> > +         * We don't have to worry about the ordering of src and dst
> > +         * ptlocks because exclusive mmap_sem prevents deadlock.
> > +         */
> > +        old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > +        if (old_ptl) {
>
> How can it ever be false?
>
> > +                pmd_t pmd;
> > +
> > +                new_ptl = pmd_lockptr(mm, new_pmd);

Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
the code applies, why not just reuse as much as possible? The same comments
w.r.t mmap_sem helping protect against lock order issues applies as well.

> > +                if (new_ptl != old_ptl)
> > +                        spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > +
> > +                /* Clear the pmd */
> > +                pmd = *old_pmd;
> > +                pmd_clear(old_pmd);
> > +
> > +                VM_BUG_ON(!pmd_none(*new_pmd));
> > +
> > +                /* Set the new pmd */
> > +                set_pmd_at(mm, new_addr, new_pmd, pmd);
> > +                if (new_ptl != old_ptl)
> > +                        spin_unlock(new_ptl);
> > +                spin_unlock(old_ptl);
> > +
> > +                *need_flush = true;
> > +                return true;
> > +        }
> > +        return false;
> > +}
> > +
> --
> Kirill A. Shutemov
On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 9e68a02a52b1..2fd163cff406 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > >                  drop_rmap_locks(vma);
> > >  }
> > >
> > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > +                unsigned long new_addr, unsigned long old_end,
> > > +                pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > +{
> > > +        spinlock_t *old_ptl, *new_ptl;
> > > +        struct mm_struct *mm = vma->vm_mm;
> > > +
> > > +        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > +            || old_end - old_addr < PMD_SIZE)
> > > +                return false;
> > > +
> > > +        /*
> > > +         * The destination pmd shouldn't be established, free_pgtables()
> > > +         * should have release it.
> > > +         */
> > > +        if (WARN_ON(!pmd_none(*new_pmd)))
> > > +                return false;
> > > +
> > > +        /*
> > > +         * We don't have to worry about the ordering of src and dst
> > > +         * ptlocks because exclusive mmap_sem prevents deadlock.
> > > +         */
> > > +        old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > +        if (old_ptl) {
> >
> > How can it ever be false?
> >
> > > +                pmd_t pmd;
> > > +
> > > +                new_ptl = pmd_lockptr(mm, new_pmd);
>
> Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> the code applies, why not just reuse as much as possible? The same comments
> w.r.t mmap_sem helping protect against lock order issues applies as well.

pmd_lock() cannot fail, but __pmd_trans_huge_lock() can. We should not copy
the code blindly.
On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > index 9e68a02a52b1..2fd163cff406 100644
> > > > --- a/mm/mremap.c
> > > > +++ b/mm/mremap.c
> > > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > > >                  drop_rmap_locks(vma);
> > > >  }
> > > >
> > > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > > +                unsigned long new_addr, unsigned long old_end,
> > > > +                pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > +{
> > > > +        spinlock_t *old_ptl, *new_ptl;
> > > > +        struct mm_struct *mm = vma->vm_mm;
> > > > +
> > > > +        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > +            || old_end - old_addr < PMD_SIZE)
> > > > +                return false;
> > > > +
> > > > +        /*
> > > > +         * The destination pmd shouldn't be established, free_pgtables()
> > > > +         * should have release it.
> > > > +         */
> > > > +        if (WARN_ON(!pmd_none(*new_pmd)))
> > > > +                return false;
> > > > +
> > > > +        /*
> > > > +         * We don't have to worry about the ordering of src and dst
> > > > +         * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > +         */
> > > > +        old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > +        if (old_ptl) {
> > >
> > > How can it ever be false?

Kirill,
It cannot, you are right. I'll remove the test.

By the way, there are new changes upstream by Linus which flush the TLB
before releasing the ptlock instead of after. I'm guessing that patch came
about because of reviews of this patch and someone spotted an issue in the
existing code :)

Anyway the patch in concern is:
eb66ae030829 ("mremap: properly flush TLB before releasing the page")

I need to rebase on top of that with appropriate modifications, but I worry
that this patch will slow down performance since we have to flush at every
PMD/PTE move before releasing the ptlock. Where as with my patch, the
intention is to flush only at once in the end of move_page_tables. When I
tried to flush TLB on every PMD move, it was quite slow on my arm64 device [2].

Further observation [1] is, it seems like the move_huge_pmds and move_ptes code
is a bit sub optimal in the sense, we are acquiring and releasing the same
ptlock for a bunch of PMDs if the said PMDs are on the same page-table page
right? Instead we can do better by acquiring and release the ptlock less
often.

I think this observation [1] and the frequent TLB flush issue [2] can be solved
by acquiring the ptlock once for a bunch of PMDs, move them all, then flush
the tlb and then release the ptlock, and then proceed to doing the same thing
for the PMDs in the next page-table page. What do you think?

- Joel
On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
[...]
> > > +                pmd_t pmd;
> > > +
> > > +                new_ptl = pmd_lockptr(mm, new_pmd);
>
> Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> the code applies, why not just reuse as much as possible? The same comments
> w.r.t mmap_sem helping protect against lock order issues applies as well.

I thought about this and when I looked into it, it seemed there are subtle
differences that make such sharing not worth it (or not possible).

- Joel
On Wed, Oct 24, 2018 at 07:09:07PM -0700, Joel Fernandes wrote:
> On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > index 9e68a02a52b1..2fd163cff406 100644
> > > > > --- a/mm/mremap.c
> > > > > +++ b/mm/mremap.c
> > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > > > >                  drop_rmap_locks(vma);
> > > > >  }
> > > > >
> > > > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > > > +                unsigned long new_addr, unsigned long old_end,
> > > > > +                pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > +{
> > > > > +        spinlock_t *old_ptl, *new_ptl;
> > > > > +        struct mm_struct *mm = vma->vm_mm;
> > > > > +
> > > > > +        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > +            || old_end - old_addr < PMD_SIZE)
> > > > > +                return false;
> > > > > +
> > > > > +        /*
> > > > > +         * The destination pmd shouldn't be established, free_pgtables()
> > > > > +         * should have release it.
> > > > > +         */
> > > > > +        if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > +                return false;
> > > > > +
> > > > > +        /*
> > > > > +         * We don't have to worry about the ordering of src and dst
> > > > > +         * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > +         */
> > > > > +        old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > +        if (old_ptl) {
> > > >
> > > > How can it ever be false?
>
> Kirill,
> It cannot, you are right. I'll remove the test.
>
> By the way, there are new changes upstream by Linus which flush the TLB
> before releasing the ptlock instead of after. I'm guessing that patch came
> about because of reviews of this patch and someone spotted an issue in the
> existing code :)
>
> Anyway the patch in concern is:
> eb66ae030829 ("mremap: properly flush TLB before releasing the page")
>
> I need to rebase on top of that with appropriate modifications, but I worry
> that this patch will slow down performance since we have to flush at every
> PMD/PTE move before releasing the ptlock. Where as with my patch, the
> intention is to flush only at once in the end of move_page_tables. When I
> tried to flush TLB on every PMD move, it was quite slow on my arm64 device [2].
>
> Further observation [1] is, it seems like the move_huge_pmds and move_ptes code
> is a bit sub optimal in the sense, we are acquiring and releasing the same
> ptlock for a bunch of PMDs if the said PMDs are on the same page-table page
> right? Instead we can do better by acquiring and release the ptlock less
> often.
>
> I think this observation [1] and the frequent TLB flush issue [2] can be solved
> by acquiring the ptlock once for a bunch of PMDs, move them all, then flush
> the tlb and then release the ptlock, and then proceed to doing the same thing
> for the PMDs in the next page-table page. What do you think?

Yeah, that's a viable optimization.

The tricky part is that one PMD page table can have PMD entries of
different types: THP, page table that you can move as whole and the one
that you cannot (for any reason).

If we cannot move the PMD entry as a whole and must go to PTE page table
we would need to drop PMD ptl and take PTE ptl (it might be the same lock
in some configurations).

Also we don't want to take PMD lock unless it's required.

I expect it to be not very trivial to get everything right. But take a
shot :)
On Thu, Oct 25, 2018 at 01:19:00PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 24, 2018 at 07:09:07PM -0700, Joel Fernandes wrote:
> > On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > > index 9e68a02a52b1..2fd163cff406 100644
> > > > > > --- a/mm/mremap.c
> > > > > > +++ b/mm/mremap.c
> > > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > > > > >                  drop_rmap_locks(vma);
> > > > > >  }
> > > > > >
> > > > > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > > > > +                unsigned long new_addr, unsigned long old_end,
> > > > > > +                pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > > +{
> > > > > > +        spinlock_t *old_ptl, *new_ptl;
> > > > > > +        struct mm_struct *mm = vma->vm_mm;
> > > > > > +
> > > > > > +        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > > +            || old_end - old_addr < PMD_SIZE)
> > > > > > +                return false;
> > > > > > +
> > > > > > +        /*
> > > > > > +         * The destination pmd shouldn't be established, free_pgtables()
> > > > > > +         * should have release it.
> > > > > > +         */
> > > > > > +        if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > > +                return false;
> > > > > > +
> > > > > > +        /*
> > > > > > +         * We don't have to worry about the ordering of src and dst
> > > > > > +         * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > > +         */
> > > > > > +        old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > > +        if (old_ptl) {
> > > > >
> > > > > How can it ever be false?
> >
> > Kirill,
> > It cannot, you are right. I'll remove the test.
> >
> > By the way, there are new changes upstream by Linus which flush the TLB
> > before releasing the ptlock instead of after. I'm guessing that patch came
> > about because of reviews of this patch and someone spotted an issue in the
> > existing code :)
> >
> > Anyway the patch in concern is:
> > eb66ae030829 ("mremap: properly flush TLB before releasing the page")
> >
> > I need to rebase on top of that with appropriate modifications, but I worry
> > that this patch will slow down performance since we have to flush at every
> > PMD/PTE move before releasing the ptlock. Where as with my patch, the
> > intention is to flush only at once in the end of move_page_tables. When I
> > tried to flush TLB on every PMD move, it was quite slow on my arm64 device [2].
> >
> > Further observation [1] is, it seems like the move_huge_pmds and move_ptes code
> > is a bit sub optimal in the sense, we are acquiring and releasing the same
> > ptlock for a bunch of PMDs if the said PMDs are on the same page-table page
> > right? Instead we can do better by acquiring and release the ptlock less
> > often.
> >
> > I think this observation [1] and the frequent TLB flush issue [2] can be solved
> > by acquiring the ptlock once for a bunch of PMDs, move them all, then flush
> > the tlb and then release the ptlock, and then proceed to doing the same thing
> > for the PMDs in the next page-table page. What do you think?
>
> Yeah, that's a viable optimization.
>
> The tricky part is that one PMD page table can have PMD entries of
> different types: THP, page table that you can move as whole and the one
> that you cannot (for any reason).
>
> If we cannot move the PMD entry as a whole and must go to PTE page table
> we would need to drop PMD ptl and take PTE ptl (it might be the same lock
> in some configurations).
>
> Also we don't want to take PMD lock unless it's required.
>
> I expect it to be not very trivial to get everything right. But take a
> shot :)

Yes, that is exactly the issue I hit when I attempted it. :) The locks need
to be released if we do something different on the next loop iteration. It
complicates the code and I'm not sure if it is worth it in the long run.

On x86 at least, I don't see any perf issues with the TLB-flush per-PMD move,
so the patch is Ok there. On arm64, it negates the performance benefit even
though it's not any worse than what we are doing currently at the PTE level.

My thinking is to take it slow and get the patch in in its current state,
since it improves x86. Then as a next step, look into why the arm64 tlb
flushes are that expensive and look into optimizing that. On arm64 I am
testing on a 4.9 kernel so I'm wondering if there are any optimizations since
4.9 that can help speed it up there. After that, if all else fails about
speeding up arm64, then I look into developing the cleanest possible solution
where we can keep the lock held for longer and flush lesser.

thanks,

- Joel
On Wed, Oct 24, 2018 at 07:13:50PM -0700, Joel Fernandes wrote:
> On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> [...]
> > > > +                pmd_t pmd;
> > > > +
> > > > +                new_ptl = pmd_lockptr(mm, new_pmd);
> >
> > Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> > the code applies, why not just reuse as much as possible? The same comments
> > w.r.t mmap_sem helping protect against lock order issues applies as well.
>
> I thought about this and when I looked into it, it seemed there are subtle
> differences that make such sharing not worth it (or not possible).
>

Could you elaborate on them?

Thanks,
Balbir Singh.
Hi Balbir,

On Sat, Oct 27, 2018 at 09:21:02PM +1100, Balbir Singh wrote:
> On Wed, Oct 24, 2018 at 07:13:50PM -0700, Joel Fernandes wrote:
> > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > [...]
> > > > > +                pmd_t pmd;
> > > > > +
> > > > > +                new_ptl = pmd_lockptr(mm, new_pmd);
> > >
> > > Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> > > the code applies, why not just reuse as much as possible? The same comments
> > > w.r.t mmap_sem helping protect against lock order issues applies as well.
> >
> > I thought about this and when I looked into it, it seemed there are subtle
> > differences that make such sharing not worth it (or not possible).
> >
>
> Could you elaborate on them?

The move_huge_pmd function is defined only for CONFIG_TRANSPARENT_HUGEPAGE
so we cannot reuse it to begin with, since we have it disabled on our
systems. I am not sure if it is a good idea to split that out and refactor it
for reuse, especially since our case is quite simple compared to huge pages.

There are also a couple of subtle differences between move_normal_pmd and
move_huge_pmd. At least 2 of them are:

1. We don't concern ourselves with the PMD dirty bit, since the pages being
moved are normal pages and the soft-dirty bit accounting is at the PTE level;
since we are not moving PTEs, we don't need to do that.

2. The locking is simpler as Kirill pointed out: pmd_lock cannot fail, however
__pmd_trans_huge_lock can.

I feel it is not super useful to refactor move_huge_pmd to support our case
especially since move_normal_pmd is quite small, so IMHO the benefit of code
reuse isn't there very much.

Do let me know your thoughts and thanks for your interest in this.

thanks,

- Joel
On Sat, Oct 27, 2018 at 12:39:17PM -0700, Joel Fernandes wrote:
> Hi Balbir,
>
> On Sat, Oct 27, 2018 at 09:21:02PM +1100, Balbir Singh wrote:
> > On Wed, Oct 24, 2018 at 07:13:50PM -0700, Joel Fernandes wrote:
> > > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > [...]
> > > > > > +                pmd_t pmd;
> > > > > > +
> > > > > > +                new_ptl = pmd_lockptr(mm, new_pmd);
> > > >
> > > > Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> > > > the code applies, why not just reuse as much as possible? The same comments
> > > > w.r.t mmap_sem helping protect against lock order issues applies as well.
> > >
> > > I thought about this and when I looked into it, it seemed there are subtle
> > > differences that make such sharing not worth it (or not possible).
> > >
> >
> > Could you elaborate on them?
>
> The move_huge_pmd function is defined only for CONFIG_TRANSPARENT_HUGEPAGE
> so we cannot reuse it to begin with, since we have it disabled on our
> systems. I am not sure if it is a good idea to split that out and refactor it
> for reuse, especially since our case is quite simple compared to huge pages.
>
> There are also a couple of subtle differences between move_normal_pmd and
> move_huge_pmd. At least 2 of them are:
>
> 1. We don't concern ourselves with the PMD dirty bit, since the pages being
> moved are normal pages and the soft-dirty bit accounting is at the PTE level;
> since we are not moving PTEs, we don't need to do that.
>
> 2. The locking is simpler as Kirill pointed out: pmd_lock cannot fail, however
> __pmd_trans_huge_lock can.
>
> I feel it is not super useful to refactor move_huge_pmd to support our case
> especially since move_normal_pmd is quite small, so IMHO the benefit of code
> reuse isn't there very much.
>

My big concern is that any bug fixes will need to monitor both paths. Do you
see a big overhead in checking the soft dirty bit? The locking is a little
different. Having said that, I am not strictly opposed to the extra code, just
concerned about missing fixes/updates as we find them.

> Do let me know your thoughts and thanks for your interest in this.
>

Balbir Singh.
On Fri, Oct 26, 2018 at 02:11:48PM -0700, Joel Fernandes wrote:
> My thinking is to take it slow and get the patch in in its current state,
> since it improves x86. Then as a next step, look into why the arm64 tlb
> flushes are that expensive and look into optimizing that. On arm64 I am
> testing on a 4.9 kernel so I'm wondering if there are any optimizations since
> 4.9 that can help speed it up there. After that, if all else fails about
> speeding up arm64, then I look into developing the cleanest possible solution
> where we can keep the lock held for longer and flush lesser.

We rewrote a good chunk of the arm64 TLB invalidation and core mmu_gather
code this merge window, so please do have another look at -rc1!

Will
diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..9724fe39884f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -518,6 +518,11 @@ config HAVE_IRQ_TIME_ACCOUNTING
           Archs need to ensure they use a high enough resolution clock to
           support irq time accounting and then call enable_sched_clock_irqtime().
 
+config HAVE_MOVE_PMD
+        bool
+        help
+          Archs that select this are able to move page tables at the PMD level.
+
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
         bool
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 9e68a02a52b1..2fd163cff406 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
                 drop_rmap_locks(vma);
 }
 
+static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+                unsigned long new_addr, unsigned long old_end,
+                pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
+{
+        spinlock_t *old_ptl, *new_ptl;
+        struct mm_struct *mm = vma->vm_mm;
+
+        if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
+            || old_end - old_addr < PMD_SIZE)
+                return false;
+
+        /*
+         * The destination pmd shouldn't be established, free_pgtables()
+         * should have release it.
+         */
+        if (WARN_ON(!pmd_none(*new_pmd)))
+                return false;
+
+        /*
+         * We don't have to worry about the ordering of src and dst
+         * ptlocks because exclusive mmap_sem prevents deadlock.
+         */
+        old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+        if (old_ptl) {
+                pmd_t pmd;
+
+                new_ptl = pmd_lockptr(mm, new_pmd);
+                if (new_ptl != old_ptl)
+                        spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+                /* Clear the pmd */
+                pmd = *old_pmd;
+                pmd_clear(old_pmd);
+
+                VM_BUG_ON(!pmd_none(*new_pmd));
+
+                /* Set the new pmd */
+                set_pmd_at(mm, new_addr, new_pmd, pmd);
+                if (new_ptl != old_ptl)
+                        spin_unlock(new_ptl);
+                spin_unlock(old_ptl);
+
+                *need_flush = true;
+                return true;
+        }
+        return false;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
                 unsigned long old_addr, struct vm_area_struct *new_vma,
                 unsigned long new_addr, unsigned long len,
@@ -239,7 +287,24 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
                         split_huge_pmd(vma, old_pmd, old_addr);
                         if (pmd_trans_unstable(old_pmd))
                                 continue;
+                } else if (extent == PMD_SIZE && IS_ENABLED(CONFIG_HAVE_MOVE_PMD)) {
+                        /*
+                         * If the extent is PMD-sized, try to speed the move by
+                         * moving at the PMD level if possible.
+                         */
+                        bool moved;
+
+                        if (need_rmap_locks)
+                                take_rmap_locks(vma);
+                        moved = move_normal_pmd(vma, old_addr, new_addr,
+                                        old_end, old_pmd, new_pmd,
+                                        &need_flush);
+                        if (need_rmap_locks)
+                                drop_rmap_locks(vma);
+                        if (moved)
+                                continue;
                 }
+
                 if (pte_alloc(new_vma->vm_mm, new_pmd))
                         break;
                 next = (new_addr + PMD_SIZE) & PMD_MASK;
Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is not
enabled. The bottleneck is move_page_tables, which is copying each pte at a
time, and can be really slow across a large map. Turning on THP may not be a
viable option, and is not for us. This patch speeds up the performance for
non-THP systems by copying at the PMD level when possible.

The speed up is three orders of magnitude. On a 1GB mremap, the mremap
completion time drops from 160-250 milliseconds to 380-400 microseconds.

Before:
Total mremap time for 1GB data: 242321014 nanoseconds.
Total mremap time for 1GB data: 196842467 nanoseconds.
Total mremap time for 1GB data: 167051162 nanoseconds.

After:
Total mremap time for 1GB data: 385781 nanoseconds.
Total mremap time for 1GB data: 388959 nanoseconds.
Total mremap time for 1GB data: 402813 nanoseconds.

In case THP is enabled, the optimization is skipped. I also flush the tlb
every time we do this optimization since I couldn't find a way to determine
if the low-level PTEs are dirty. It is seen that the cost of doing so is not
much compared to the improvement, on both x86-64 and arm64.

Cc: minchan@kernel.org
Cc: pantin@google.com
Cc: hughd@google.com
Cc: lokeshgidra@google.com
Cc: dancol@google.com
Cc: mhocko@kernel.org
Cc: kirill@shutemov.name
Cc: akpm@linux-foundation.org
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/Kconfig |  5 ++++
 mm/mremap.c  | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)