Message ID: 20200715135011.42743-1-kirill.shutemov@linux.intel.com (mailing list archive)
State: New, archived
Series: [PATCHv2] mm: Fix warning in move_normal_pmd()
On Wed, Jul 15, 2020 at 6:50 AM Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
>
> mremap(2) does not allow source and destination regions to overlap, but
> shift_arg_pages() calls move_page_tables() directly and in this case the
> source and destination overlap often. It

Actually, before we do this patch (which I think is a good idea), I'd like Naresh to test the attached patch.

And Kirill, Joel, mind looking it over too.

IT IS ENTIRELY UNTESTED.

But I hope the concept (and the code) is fairly obvious: it makes move_page_tables() try to align the range to be moved, if possible.

That *should* actually remove the warning that Naresh sees, for the simple reason that now the PMD moving case will *always* trigger when the stack movement is PMD-aligned, and you never see the "first do a few small pages, then do the PMD optimization" case.

And that should mean that we don't ever hit the "oh, we already have a target pmd" thing, because the "move the whole pmd" case will clear the ones it has already taken care of, and you never have that "oh, there's an empty pmd where the pages were moved away one by one".

And that means that for this case, it's _ok_ to overlap (as long as we copy downwards).

What do you think?

Ok, so I could easily have screwed up the patch, even if it's conceptually fairly simple. The details are small, but they needed some care. The thing _looks_ fine to me, both on a source level and when looking at the generated code, and I made sure to use a lot of small helper functions and a couple of helper macros to make each individual piece look clear and obvious.

But obviously a single "Oh, that test is the wrong way around", or a missing mask inversion, or whatever, could completely break the code. So no guarantees, but it looks fairly straightforward to me.

Naresh - this patch does *NOT* make sense to test together with Kirill's (or Joel's) patch that says "don't do the PMD optimization at all for overlapping cases". Those patches will hide the issue, and make the new code only work for mremap(), not for the initial stack setup that caused the original warning.

Joel - I think this patch makes sense _regardless_ of the special execve() usage case, in that it extends your "let's just copy things at a whole PMD level" logic to even the case where the beginning wasn't PMD-aligned (or the length wasn't sufficient).

Obviously it only triggers when the source and destination are mutually aligned, and if there are no other vma's around those first/last PMD entries. Which might mean that it almost never triggers in practice, but looking at the generated code, it sure looks like it's cheap enough to test.

Didn't you have some test-cases for your pmd optimized movement cases, at least for timing?

Linus
Hi Linus,

On Wed, Jul 15, 2020 at 11:36:59AM -0700, Linus Torvalds wrote:
> On Wed, Jul 15, 2020 at 6:50 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > mremap(2) does not allow source and destination regions to overlap, but
> > shift_arg_pages() calls move_page_tables() directly and in this case the
> > source and destination overlap often. It
>
> Actually, before we do this patch (which I think is a good idea), I'd
> like Naresh to test the attached patch.
>
> And Kirill, Joel, mind looking it over too.
>
> IT IS ENTIRELY UNTESTED.

Seems a clever idea, and was something I wanted as well. Thanks. Some comments below:

> But I hope the concept (and the code) is fairly obvious: it makes
> move_page_tables() try to align the range to be moved, if possible.
>
> [...]
>
> And that means that for this case, it's _ok_ to overlap (as long as we
> copy downwards).
>
> What do you think?

I might not have fully understood the code, but I get the basic idea it is aiming for. Basically you want to handle the cases where the address space is free on both sides, so that you can trigger the PMD-moving optimization.

Regarding the ADDR_AFTER_NEXT checks, shouldn't you check for:

	if (ADDR_AFTER_NEXT(ALIGN(*old_addr + *len, PMD_SIZE), old))
		return;

and for the len calculation, I did not follow what you did, but I think you meant something like this? Does the following reduce to what you did? At least this is a bit more readable I think:

	*len += (ALIGN(*new_addr + *len, PMD_SIZE) - (*new_addr + *len));

which reduces to:

	*len = ALIGN(*new_addr + *len, PMD_SIZE) - *new_addr;

Also, you did "len +=", it should be "*len +=" in this function.

In try_to_align_start(), everything looks correct to me.

> [...]
>
> Obviously it only triggers when the source and destination are
> mutually aligned, and if there are no other vma's around those
> first/last PMD entries. Which might mean that it almost never triggers
> in practice, but looking at the generated code, it sure looks like
> it's cheap enough to test.

Oh ok — we had some requirements at my old job about moving large address spaces which were aligned, so it triggered a lot there. Also, folks on the ChromeOS team tell me it is helping them.

> Didn't you have some test-cases for your pmd optimized movement cases,
> at least for timing?

Yes, I had a simple mremap() test which was moving a 1GB map and measuring performance. Once I get a chance I can try it out with this optimization and trigger it with unaligned addresses as well.

thanks,

- Joel
On Wed, Jul 15, 2020 at 11:36:59AM -0700, Linus Torvalds wrote:
> On Wed, Jul 15, 2020 at 6:50 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > mremap(2) does not allow source and destination regions to overlap, but
> > shift_arg_pages() calls move_page_tables() directly and in this case the
> > source and destination overlap often. It
>
> Actually, before we do this patch (which I think is a good idea), I'd
> like Naresh to test the attached patch.
>
> And Kirill, Joel, mind looking it over too.

I don't understand the 'len' calculation in try_to_align_end(). IIUC, it increases 'len' by PMD_SIZE if 'new_addr + len' is not aligned to PMD_SIZE. That doesn't make sense to me. Maybe

	*len = round_up(*new_addr + *len, PMD_SIZE) - *new_addr;

or something?

BUT

I *think* there's a bigger problem with the patch:

For the stack relocation case both VMAs are the same and always(?) the only VMA around at the time. It means neither ADDR_BEFORE_PREV nor ADDR_AFTER_NEXT is going to stop us.

Consider the following case, before and after try_to_align_start():

	          before     after
	old_addr: 0x0123000  0x0000000
	new_addr: 0x1123000  0x1000000
	len:      0x1000000  0x1123000

(4k PAGE_SIZE, 2M PMD_SIZE)

On the first iteration we would attempt to move 0x0-0x200000 to 0x1000000-0x1200000 and step onto the same WARN(), no?
On Wed, Jul 15, 2020 at 04:54:28PM -0400, Joel Fernandes wrote:
> Hi Linus,
>
> Regarding the ADDR_AFTER_NEXT checks, shouldn't you check for:
>
> 	if (ADDR_AFTER_NEXT(ALIGN(*old_addr + *len, PMD_SIZE), old))
> 		return;

See the comment to ADDR_AFTER_NEXT().
On Wed, Jul 15, 2020 at 1:54 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Regarding the ADDR_AFTER_NEXT checks, shouldn't you check for:
>
> 	if (ADDR_AFTER_NEXT(ALIGN(*old_addr + *len, PMD_SIZE), old))
> 		return;

No, there's even a comment to that effect. Instead, ADDR_AFTER_NEXT() aligns the next address _down_ to the PMD boundary.

Because otherwise, what can happen is:

 - you're on an architecture that has a separate address space for users,

 - you're the next-to-last VMA in that address space,

 - you're in the last PMD.

And now "ALIGN(*old_addr + *len, PMD_SIZE)" will wrap, and become 0, and you think it's ok to move the whole PMD, because it's now smaller than the start address of the next VMA. It's _not_ ok, because you'd be moving that next-vma data too.

> and for the len calculation, I did not follow what you did, but I think you
> meant something like this? Does the following reduce to what you did? At
> least this is a bit more readable I think:
>
> 	*len += (ALIGN(*new_addr + *len, PMD_SIZE) - (*new_addr + *len));

Yes, right you are. I actually wrote that first (except I added a helper variable for that "*new_addr + *len" thing), and then I decided it can be simplified. And simplified it wrong ;)

> Also you did "len +=", it should be "*len +=" in this function.

That's indeed a plain stupid bug ;)

Naresh - don't test that version. The bugs Joel found just make the math wrong, so it won't work. The concept was solid, the implementation not so much ;)

Linus
On Wed, Jul 15, 2020 at 1:55 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I don't understand 'len' calculation in try_to_align_end().

Yeah, Joel found the same thing. You don't understand it, because it's garbage.

> BUT
>
> I *think* there's a bigger problem with the patch:
>
> For stack relocation case both VMAs are the same and always(?) the only
> VMA around at the time. It means none of ADDR_BEFORE_PREV and
> ADDR_AFTER_NEXT are going to stop us.

But for the stack relocation case, that should actually be fine. We are moving the whole thing. Or maybe I missed something.

> Consider the following case, before and after try_to_align_start():
>
> 	          before     after
> 	old_addr: 0x0123000  0x0000000
> 	new_addr: 0x1123000  0x1000000
> 	len:      0x1000000  0x1123000

That's the "move up" case that fundamentally doesn't work anyway, because it will corrupt the data as it moves it. The stack relocation only moves down.

Linus
On Wed, Jul 15, 2020 at 2:31 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Naresh - don't test that version. The bugs Joel found just make the
> math wrong, so it won't work.

Here's a new version with the thing that Joel and Kirill both noticed hopefully fixed.

But I probably screwed it up again. I guess I should test it, but I don't have any really relevant environment (the plain mremap() case should have shown the obvious bugs, though, so that's just an excuse for my laziness).

Linus
On Wed, Jul 15, 2020 at 2:35 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> The stack relocation only moves down.

.. and that may not be true on a grow-up stack. Christ, that code is hard to read.

But I think Kirill is right, and I'm wrong. So that "try to expand" code is only ok when non-overlapping, or when moving down.

Naresh - from a testing standpoint, that doesn't make any difference: I'd still like you to test on x86. We always shift the stack down there.

Linus
On Wed, Jul 15, 2020 at 02:43:00PM -0700, Linus Torvalds wrote:
> Here's a new version with the thing that Joel and Kirill both noticed
> hopefully fixed.
>
> But I probably screwed it up again. I guess I should test it, but I
> don't have any really relevant environment (the plain mremap() case
> should have shown the obvious bugs, though, so that's just an excuse
> for my laziness)

Sorry, but the patch is broken. It overcopies entries in some cases: it doesn't correctly handle moving PTEs from the middle of a VMA.

The test case below rightly segfaults without the patch when trying to access DST_ADDR[0], but not with the patch. It creates a PTE entry at DST_ADDR that doesn't belong to any VMA. :/

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#define SRC_ADDR ((char *)(4UL << 30))
	#define DST_ADDR ((char *)(5UL << 30))

	int main(void)
	{
		mmap(SRC_ADDR, 2UL << 20, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		madvise(SRC_ADDR, 2UL << 20, MADV_NOHUGEPAGE);
		memset(SRC_ADDR, 0x33, 2UL << 20);

		mremap(SRC_ADDR + 4096, 4096, 4096,
		       MREMAP_MAYMOVE | MREMAP_FIXED, DST_ADDR + 4096);

		printf("0: %#x 4096: %#x\n", DST_ADDR[0], DST_ADDR[4096]);

		return 0;
	}
On Wed, Jul 15, 2020 at 3:22 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Sorry, but the patch is broken.

I should take up some other hobby. Maybe knitting. But then I'd probably end up with sweaters with three arms, and there's a very limited market for that.

Oh well.

Linus
On Wed, Jul 15, 2020 at 3:22 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Sorry, but the patch is broken.

.. instead of taking up knitting - which I'd invariably also screw up - I took a look.

Yeah, in addition to checking the vm_prev/vm_next vma's, we need to check the limits of the 'vma' itself, because we may not be moving the whole vma. So the "extend upwards" can only happen if the end address matches the end address of the current vma (and vice versa for the "extend down" case).

Then it would hopefully work.

But now I've screwed it up twice, and have a splitting headache, so rather than stare at this cross-eyed, I'll take a break and hope that somebody more competent than me looks at the code.

Linus
On Wed, Jul 15, 2020 at 3:57 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> But now I've screwed it up twice, and have a splitting headache, so
> rather than stare at this cross-eyed, I'll take a break and hope that
> somebody more competent than me looks at the code.

I lied. I had a couple of pending pulls, so I couldn't go lie down anyway, so I tried to take another look.

It *might* be as simple as this incremental thing on top, but then I think that code will no longer trigger on the stack movement case at all, since there we don't work with the whole vma. So if we want that to work, we'd have to fix that up too.

And this might be broken anyway. My track record isn't looking so hot right now.

Linus
On Wed, Jul 15, 2020 at 4:04 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> It *might* be as simple as this incremental thing on top

No, it needs to be

	+	if (*old_addr + *len < old->vm_end)
	+		return;

in try_to_align_end(), of course.

Now I'm going for a lie-down, because this cross-eyed thing isn't working.

Linus
On Thu, 16 Jul 2020 at 04:49, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Wed, Jul 15, 2020 at 4:04 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It *might* be as simple as this incremental thing on top
>
> No, it needs to be
>
> 	+	if (*old_addr + *len < old->vm_end)
> 	+		return;
>
> in try_to_align_end(), of course.
>
> Now I'm going for a lie-down, because this cross-eyed thing isn't working.

Just want to double check. Here is the diff after those two patches applied. Please correct me if it is wrong. This patch applied on top of Linus mainline master branch. I am starting my test cycles.

---
 mm/mremap.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index 6b153dc05fe4..4961c18d2008 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -254,6 +254,77 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 
 	return true;
 }
+
+#define ADDR_BEFORE_PREV(addr, vma) \
+	((vma)->vm_prev && (addr) < (vma)->vm_prev->vm_end)
+
+static inline void try_to_align_start(unsigned long *len,
+	struct vm_area_struct *old, unsigned long *old_addr,
+	struct vm_area_struct *new, unsigned long *new_addr)
+{
+	if (*old_addr > old->vm_start)
+		return;
+
+	if (ADDR_BEFORE_PREV(*old_addr & PMD_MASK, old))
+		return;
+
+	if (ADDR_BEFORE_PREV(*new_addr & PMD_MASK, new))
+		return;
+
+	/* Bingo! */
+	*len += *new_addr & ~PMD_MASK;
+	*old_addr &= PMD_MASK;
+	*new_addr &= PMD_MASK;
+}
+
+/*
+ * When aligning the end, avoid ALIGN() (which can overflow
+ * if the user space is the full address space, and overshoot
+ * the vm_start of the next vma).
+ *
+ * Align the upper limit down instead, and check that it's not
+ * in the same PMD as the end.
+ */
+#define ADDR_AFTER_NEXT(addr, vma) \
+	((vma)->vm_next && (addr) > (PMD_MASK & (vma)->vm_next->vm_start))
+
+static inline void try_to_align_end(unsigned long *len,
+	struct vm_area_struct *old, unsigned long *old_addr,
+	struct vm_area_struct *new, unsigned long *new_addr)
+{
+	if (*old_addr < old->vm_end)
+		return;
+
+	if (ADDR_AFTER_NEXT(*old_addr + *len, old))
+		return;
+
+	if (ADDR_AFTER_NEXT(*new_addr + *len, new))
+		return;
+
+	/* Mutual alignment means this is same for new/old addr */
+	*len = ALIGN(*new_addr + *len, PMD_SIZE) - *new_addr;
+}
+
+/*
+ * The PMD move case is much more efficient, so if we have the
+ * mutually aligned case, try to see if we can extend the
+ * beginning and end to be aligned too.
+ *
+ * The pointer dereferences look bad, but with inlining, the
+ * compiler will sort it out.
+ */
+static inline void try_to_align_range(unsigned long *len,
+	struct vm_area_struct *old, unsigned long *old_addr,
+	struct vm_area_struct *new, unsigned long *new_addr)
+{
+	if ((*old_addr ^ *new_addr) & ~PMD_MASK)
+		return;
+
+	try_to_align_start(len, old, old_addr, new, new_addr);
+	try_to_align_end(len, old, old_addr, new, new_addr);
+}
+#else
+#define try_to_align_range(len,old,olda,new,newa) do { } while(0);
 #endif
 
 unsigned long move_page_tables(struct vm_area_struct *vma,
@@ -272,6 +343,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			old_addr, old_end);
 	mmu_notifier_invalidate_range_start(&range);
 
+	try_to_align_range(&len, vma, &old_addr, new_vma, &new_addr);
+
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
On Thu, 16 Jul 2020 at 12:07, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> On Thu, 16 Jul 2020 at 04:49, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > No, it needs to be
> >
> > 	+	if (*old_addr + *len < old->vm_end)
> > 	+		return;
> >
> > in try_to_align_end(), of course.
>
> Just want to double check. Here is the diff after those two patches applied.
> Please correct me if it is wrong. This patch applied on top of Linus
> mainline master branch. I am starting my test cycles.

Sorry, this patch did not solve the reported problem. I still notice the warning:

[  760.004318] ------------[ cut here ]------------
[  760.009039] WARNING: CPU: 3 PID: 461 at mm/mremap.c:230 move_page_tables+0x818/0x840
[  760.016848] Modules linked in: x86_pkg_temp_thermal fuse
[  760.022220] CPU: 3 PID: 461 Comm: true Not tainted 5.8.0-rc5 #1
[  760.028198] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018
[  760.035651] EIP: move_page_tables+0x818/0x840

ref: https://lkft.validation.linaro.org/scheduler/job/1574277#L12221

> [ quoted patch snipped - identical to the diff above ]

- Naresh
On Thu, 16 Jul 2020 at 04:49, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> No, it needs to be
>
> 	+	if (*old_addr + *len < old->vm_end)
> 	+		return;
>
> in try_to_align_end(), of course.

I have modified the patch with the above change. Sorry, the proposed patch [1] did not solve the reported problem.

Link to the warning on i386:
https://lkft.validation.linaro.org/scheduler/job/1574295#L2055

Link to the patch which was applied on top of Linus' master and tested:
[1] https://pastebin.com/2gTxi3pj

I am happy to test the next version patch.

- Naresh
On Thu, Jul 16, 2020 at 12:53:23PM +0530, Naresh Kamboju wrote:
> On Thu, 16 Jul 2020 at 12:07, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >
> > Just want to double check. Here is the diff after those two patches applied.
> > This patch applied on top of Linus mainline master branch.
> > I am starting my test cycles.
>
> Sorry this patch (the below pasted) did not solve the reported problem.

As Linus said, it does not trigger on the stack movement anymore, so it is not going to fix the issue, but it may help to increase coverage of the optimization.
On Wed, Jul 15, 2020 at 04:18:44PM -0700, Linus Torvalds wrote:
> No, it needs to be
>
> 	+	if (*old_addr + *len < old->vm_end)
> 	+		return;
>
> in try_to_align_end(), of course.

Okay, this should work, but I'm not convinced that it gives much: the number of cases covered by the optimization is not going to be high.

It can also lead to a performance regression: for a small mremap(), if only one side of the range got aligned and there's no PMD_SIZE range to move, the kernel will still iterate over PTEs, but it would need to handle more pte_none()s than without the patch.
On Thu, Jul 16, 2020 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> It can also lead to performance regression: for small mremap() if only one
> side of the range got aligned and there's no PMD_SIZE range to move,
> kernel will still iterate over PTEs, but it would need to handle more
> pte_none()s than without the patch.

Ack, I've dropped the patch from my queue of experiments, because it doesn't work for the case I wanted to do, and the other cases could regress, as you say.

Plus, considering how many problems that patch had, I decided it wasn't as simple as I initially thought it would be anyway ;)

Joel - while it's gone from my mind, if you're still interested in this, maybe you can do something _similar_ to that patch, except perhaps also start out checking that the initial size is large enough for this to make sense even when one of the sides doesn't align, for example.

(It might be as simple as checking that the initial 'len' is at least PMD_SIZE - then you're guaranteed that whichever side gets aligned, it's not doing extra work because the other side didn't.)

Linus
On Thu, Jul 16, 2020 at 1:55 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Joel - while it's gone from my mind, if you're still interested in
> this, maybe you can do something _similar_ to that patch, except perhaps
> also start out checking that the initial size is large enough for this
> to make sense even when one of the sides doesn't align, for example.
>
> (It might be as simple as checking that the initial 'len' is at least
> PMD_SIZE - then you're guaranteed that whichever side gets aligned,
> it's not doing extra work because the other side didn't.)

Hi Linus,

Yes, I'm quite interested in doing a similar patch and also adding some test cases to make sure it always works properly. Probably I can add some kselftest cases doing mremap on various ranges. I think it will be a great optimization to trigger the PMD move more often.

I'll work on it and get back hopefully soon on this. Thanks!

- Joel
diff --git a/mm/mremap.c b/mm/mremap.c
index 5dd572d57ca9..340a96a29cbb 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -245,6 +245,26 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	unsigned long extent, next, old_end;
 	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
+	bool overlaps;
+
+	/*
+	 * shift_arg_pages() can call move_page_tables() on overlapping ranges.
+	 * In this case we cannot use move_normal_pmd() because destination pmd
+	 * might be established page table: move_ptes() doesn't free page
+	 * table.
+	 */
+	if (old_addr > new_addr) {
+		overlaps = old_addr - new_addr < len;
+	} else {
+		overlaps = new_addr - old_addr < len;
+
+		/*
+		 * We are iterating over ranges forward. It means we cannot
+		 * handle overlapping ranges with new_addr > old_addr without
+		 * risking data corruption. Don't do this.
+		 */
+		WARN_ON(overlaps);
+	}
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
@@ -282,7 +302,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			split_huge_pmd(vma, old_pmd, old_addr);
 			if (pmd_trans_unstable(old_pmd))
 				continue;
-		} else if (extent == PMD_SIZE) {
+		} else if (!overlaps && extent == PMD_SIZE) {
 #ifdef CONFIG_HAVE_MOVE_PMD
 			/*
 			 * If the extent is PMD-sized, try to speed the move by