diff mbox series

mm: thp: fix soft dirty for migration when split

Message ID 20181206084604.17167-1-peterx@redhat.com (mailing list archive)
State New, archived
Headers show
Series mm: thp: fix soft dirty for migration when split | expand

Commit Message

Peter Xu Dec. 6, 2018, 8:46 a.m. UTC
When splitting a huge migrating PMD, we'll transfer the soft dirty bit
from the huge page to the small pages.  However we may fetch the wrong
value, since the bit is read with pmd_soft_dirty() even though the
entry is actually a migration entry.  Fix it up.

CC: Andrea Arcangeli <aarcange@redhat.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Michal Hocko <mhocko@suse.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
CC: Souptick Joarder <jrdr.linux@gmail.com>
CC: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---

I noticed this during code reading.  Only compile tested.  I'm sending
a patch directly for review comments since it's relatively
straightforward and not easy to test.  Please have a look, thanks.
---
 mm/huge_memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Peter Xu Dec. 7, 2018, 3:34 a.m. UTC | #1
On Thu, Dec 06, 2018 at 04:46:04PM +0800, Peter Xu wrote:
> When splitting a huge migrating PMD, we'll transfer the soft dirty bit
> from the huge page to the small pages.  However we're possibly using a
> wrong data since when fetching the bit we're using pmd_soft_dirty()
> upon a migration entry.  Fix it up.

Note that if my understanding of the problem is correct, then without
the patch there is a chance to lose some of the soft dirty bits in the
migrating pmd pages (on x86_64 we're fetching bit 11, which is part of
the swap offset, instead of bit 2), and it could potentially corrupt
the memory of a userspace program which depends on the dirty bit.

> 
> CC: Andrea Arcangeli <aarcange@redhat.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> CC: Matthew Wilcox <willy@infradead.org>
> CC: Michal Hocko <mhocko@suse.com>
> CC: Dave Jiang <dave.jiang@intel.com>
> CC: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> CC: Souptick Joarder <jrdr.linux@gmail.com>
> CC: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> CC: linux-mm@kvack.org
> CC: linux-kernel@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> 
> I noticed this during code reading.  Only compile tested.  I'm sending
> a patch directly for review comments since it's relatively
> straightforward and not easy to test.  Please have a look, thanks.
> ---
>  mm/huge_memory.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f2d19e4fe854..fb0787c3dd3b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2161,7 +2161,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		SetPageDirty(page);
>  	write = pmd_write(old_pmd);
>  	young = pmd_young(old_pmd);
> -	soft_dirty = pmd_soft_dirty(old_pmd);
> +	if (unlikely(pmd_migration))
> +		soft_dirty = pmd_swp_soft_dirty(old_pmd);
> +	else
> +		soft_dirty = pmd_soft_dirty(old_pmd);
>  
>  	/*
>  	 * Withdraw the table only after we mark the pmd entry invalid.
> -- 
> 2.17.1
> 

Regards,
Konstantin Khlebnikov Dec. 10, 2018, 4:50 p.m. UTC | #2
On Fri, Dec 7, 2018 at 6:34 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Dec 06, 2018 at 04:46:04PM +0800, Peter Xu wrote:
> > When splitting a huge migrating PMD, we'll transfer the soft dirty bit
> > from the huge page to the small pages.  However we're possibly using a
> > wrong data since when fetching the bit we're using pmd_soft_dirty()
> > upon a migration entry.  Fix it up.
>
> Note that if my understanding is correct about the problem then if
> without the patch there is chance to lose some of the dirty bits in
> the migrating pmd pages (on x86_64 we're fetching bit 11 which is part
> of swap offset instead of bit 2) and it could potentially corrupt the
> memory of an userspace program which depends on the dirty bit.

It seems this code is broken in the pmd_migration case:

	old_pmd = pmdp_invalidate(vma, haddr, pmd);

#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
	pmd_migration = is_pmd_migration_entry(old_pmd);
	if (pmd_migration) {
		swp_entry_t entry;

		entry = pmd_to_swp_entry(old_pmd);
		page = pfn_to_page(swp_offset(entry));
	} else
#endif
		page = pmd_page(old_pmd);
	VM_BUG_ON_PAGE(!page_count(page), page);
	page_ref_add(page, HPAGE_PMD_NR - 1);
	if (pmd_dirty(old_pmd))
		SetPageDirty(page);
	write = pmd_write(old_pmd);
	young = pmd_young(old_pmd);
	soft_dirty = pmd_soft_dirty(old_pmd);

Not just soft_dirty - all of these bits (dirty, write, young) have a
different encoding, or are not present at all, in a migration entry.

>
> >
> > CC: Andrea Arcangeli <aarcange@redhat.com>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > CC: Matthew Wilcox <willy@infradead.org>
> > CC: Michal Hocko <mhocko@suse.com>
> > CC: Dave Jiang <dave.jiang@intel.com>
> > CC: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> > CC: Souptick Joarder <jrdr.linux@gmail.com>
> > CC: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> > CC: linux-mm@kvack.org
> > CC: linux-kernel@vger.kernel.org
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >
> > I noticed this during code reading.  Only compile tested.  I'm sending
> > a patch directly for review comments since it's relatively
> > straightforward and not easy to test.  Please have a look, thanks.
> > ---
> >  mm/huge_memory.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index f2d19e4fe854..fb0787c3dd3b 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2161,7 +2161,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >               SetPageDirty(page);
> >       write = pmd_write(old_pmd);
> >       young = pmd_young(old_pmd);
> > -     soft_dirty = pmd_soft_dirty(old_pmd);
> > +     if (unlikely(pmd_migration))
> > +             soft_dirty = pmd_swp_soft_dirty(old_pmd);
> > +     else
> > +             soft_dirty = pmd_soft_dirty(old_pmd);
> >
> >       /*
> >        * Withdraw the table only after we mark the pmd entry invalid.
> > --
> > 2.17.1
> >
>
> Regards,
>
> --
> Peter Xu
>
Peter Xu Dec. 11, 2018, 4:48 a.m. UTC | #3
On Mon, Dec 10, 2018 at 07:50:52PM +0300, Konstantin Khlebnikov wrote:
> On Fri, Dec 7, 2018 at 6:34 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Dec 06, 2018 at 04:46:04PM +0800, Peter Xu wrote:
> > > When splitting a huge migrating PMD, we'll transfer the soft dirty bit
> > > from the huge page to the small pages.  However we're possibly using a
> > > wrong data since when fetching the bit we're using pmd_soft_dirty()
> > > upon a migration entry.  Fix it up.
> >
> > Note that if my understanding is correct about the problem then if
> > without the patch there is chance to lose some of the dirty bits in
> > the migrating pmd pages (on x86_64 we're fetching bit 11 which is part
> > of swap offset instead of bit 2) and it could potentially corrupt the
> > memory of an userspace program which depends on the dirty bit.
> 
> It seems this code is broken in the pmd_migration case:
> 
> old_pmd = pmdp_invalidate(vma, haddr, pmd);
> 
> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> pmd_migration = is_pmd_migration_entry(old_pmd);
> if (pmd_migration) {
> swp_entry_t entry;
> 
> entry = pmd_to_swp_entry(old_pmd);
> page = pfn_to_page(swp_offset(entry));
> } else
> #endif
> page = pmd_page(old_pmd);
> VM_BUG_ON_PAGE(!page_count(page), page);
> page_ref_add(page, HPAGE_PMD_NR - 1);
> if (pmd_dirty(old_pmd))
> SetPageDirty(page);
> write = pmd_write(old_pmd);
> young = pmd_young(old_pmd);
> soft_dirty = pmd_soft_dirty(old_pmd);
> 
> Not just soft_dirty - all of these bits (dirty, write, young) have a different
> encoding, or are not present at all, in a migration entry.

Hi, Konstantin,

Actually I noticed that, but I thought it didn't hurt, since neither
the write nor the young flag is used when applying them to the small
pages with pmd_migration==true.  But indeed there's at least one
unexpected side effect that I missed: an extra call to SetPageDirty().
I'll repost soon.  Thanks!
Konstantin Khlebnikov Dec. 11, 2018, 1:12 p.m. UTC | #4
On Tue, Dec 11, 2018 at 7:48 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Dec 10, 2018 at 07:50:52PM +0300, Konstantin Khlebnikov wrote:
> > On Fri, Dec 7, 2018 at 6:34 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Dec 06, 2018 at 04:46:04PM +0800, Peter Xu wrote:
> > > > When splitting a huge migrating PMD, we'll transfer the soft dirty bit
> > > > from the huge page to the small pages.  However we're possibly using a
> > > > wrong data since when fetching the bit we're using pmd_soft_dirty()
> > > > upon a migration entry.  Fix it up.
> > >
> > > Note that if my understanding is correct about the problem then if
> > > without the patch there is chance to lose some of the dirty bits in
> > > the migrating pmd pages (on x86_64 we're fetching bit 11 which is part
> > > of swap offset instead of bit 2) and it could potentially corrupt the
> > > memory of an userspace program which depends on the dirty bit.
> >
> > It seems this code is broken in the pmd_migration case:
> >
> > old_pmd = pmdp_invalidate(vma, haddr, pmd);
> >
> > #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > pmd_migration = is_pmd_migration_entry(old_pmd);
> > if (pmd_migration) {
> > swp_entry_t entry;
> >
> > entry = pmd_to_swp_entry(old_pmd);
> > page = pfn_to_page(swp_offset(entry));
> > } else
> > #endif
> > page = pmd_page(old_pmd);
> > VM_BUG_ON_PAGE(!page_count(page), page);
> > page_ref_add(page, HPAGE_PMD_NR - 1);
> > if (pmd_dirty(old_pmd))
> > SetPageDirty(page);
> > write = pmd_write(old_pmd);
> > young = pmd_young(old_pmd);
> > soft_dirty = pmd_soft_dirty(old_pmd);
> >
> > Not just soft_dirty - all of these bits (dirty, write, young) have a different
> > encoding, or are not present at all, in a migration entry.
>
> Hi, Konstantin,
>
> Actually I noticed it but I thought it didn't hurt since both
> write/young flags are not used at all when applying to the small pages
> when pmd_migration==true.  But indeed there's at least an unexpected
> side effect of an extra call to SetPageDirty() that I missed.

"write" is used for making smaller migration entry:

swp_entry = make_migration_entry(page + i, write);

>
>
> I'll repost soon.  Thanks!
>
> --
> Peter Xu
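
Following the discussion above, the reposted fix would need to read
every flag according to the entry type, not only soft_dirty.  A sketch
of that shape, reconstructed from the thread (an illustration, not
necessarily the exact code that was eventually merged):

```c
	if (unlikely(pmd_migration)) {
		swp_entry_t entry;

		entry = pmd_to_swp_entry(old_pmd);
		page = pfn_to_page(swp_offset(entry));
		/* writability is encoded in the migration entry type */
		write = is_write_migration_entry(entry);
		/* a migration entry carries no accessed bit */
		young = false;
		soft_dirty = pmd_swp_soft_dirty(old_pmd);
	} else {
		page = pmd_page(old_pmd);
		/* pmd_dirty() is only meaningful for a present pmd */
		if (pmd_dirty(old_pmd))
			SetPageDirty(page);
		write = pmd_write(old_pmd);
		young = pmd_young(old_pmd);
		soft_dirty = pmd_soft_dirty(old_pmd);
	}
```

This also moves the SetPageDirty() call into the non-migration branch,
addressing the stray side effect noted in message #3.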
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f2d19e4fe854..fb0787c3dd3b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2161,7 +2161,10 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		SetPageDirty(page);
 	write = pmd_write(old_pmd);
 	young = pmd_young(old_pmd);
-	soft_dirty = pmd_soft_dirty(old_pmd);
+	if (unlikely(pmd_migration))
+		soft_dirty = pmd_swp_soft_dirty(old_pmd);
+	else
+		soft_dirty = pmd_soft_dirty(old_pmd);
 
 	/*
 	 * Withdraw the table only after we mark the pmd entry invalid.