[11/12] dax: Disable huge page handling
diff mbox

Message ID 1457637535-21633-12-git-send-email-jack@suse.cz
State New
Headers show

Commit Message

Jan Kara March 10, 2016, 7:18 p.m. UTC
Currently the handling of huge pages for DAX is racy. For example the
following can happen:

CPU0 (THP write fault)			CPU1 (normal read fault)

__dax_pmd_fault()			__dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
					get_block(inode, block, &bh, 0)
					  -> not mapped
  if (!buffer_mapped(&bh) && write)
    get_block(inode, block, &bh, 1) -> allocates blocks
  truncate_pagecache_range(inode, lstart, lend);
					dax_load_hole();

This results in data corruption since process on CPU1 won't see changes
into the file done by CPU0.

The race can happen even if two normal faults race however with THP the
situation is even worse because the two faults don't operate on the same
entries in the radix tree and we want to use these entries for
serialization. So disable THP support in DAX code for now.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 2 +-
 include/linux/dax.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Dan Williams March 10, 2016, 7:34 p.m. UTC | #1
On Thu, Mar 10, 2016 at 11:18 AM, Jan Kara <jack@suse.cz> wrote:
> Currently the handling of huge pages for DAX is racy. For example the
> following can happen:
>
> CPU0 (THP write fault)                  CPU1 (normal read fault)
>
> __dax_pmd_fault()                       __dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
>                                         get_block(inode, block, &bh, 0)
>                                           -> not mapped
>   if (!buffer_mapped(&bh) && write)
>     get_block(inode, block, &bh, 1) -> allocates blocks
>   truncate_pagecache_range(inode, lstart, lend);
>                                         dax_load_hole();
>
> This results in data corruption since process on CPU1 won't see changes
> into the file done by CPU0.
>
> The race can happen even if two normal faults race however with THP the
> situation is even worse because the two faults don't operate on the same
> entries in the radix tree and we want to use these entries for
> serialization. So disable THP support in DAX code for now.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c            | 2 +-
>  include/linux/dax.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 3951237ff248..7148fcdb2c92 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -715,7 +715,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  }
>  EXPORT_SYMBOL_GPL(dax_fault);
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  /*
>   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
>   * more often than one might expect in the below function.
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 4b63923e1f8d..fd28d824254b 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
>  }
>  #endif
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
>                                 unsigned int flags, get_block_t);
>  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> --
> 2.6.2
>

Maybe switch to marking FS_DAX_PMD as "depends on BROKEN" again?  That
way we re-use the same mechanism as the check for the presence of
ZONE_DEVICE / struct page for the given pfn.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara March 10, 2016, 7:52 p.m. UTC | #2
On Thu 10-03-16 11:34:39, Dan Williams wrote:
> On Thu, Mar 10, 2016 at 11:18 AM, Jan Kara <jack@suse.cz> wrote:
> > Currently the handling of huge pages for DAX is racy. For example the
> > following can happen:
> >
> > CPU0 (THP write fault)                  CPU1 (normal read fault)
> >
> > __dax_pmd_fault()                       __dax_fault()
> >   get_block(inode, block, &bh, 0) -> not mapped
> >                                         get_block(inode, block, &bh, 0)
> >                                           -> not mapped
> >   if (!buffer_mapped(&bh) && write)
> >     get_block(inode, block, &bh, 1) -> allocates blocks
> >   truncate_pagecache_range(inode, lstart, lend);
> >                                         dax_load_hole();
> >
> > This results in data corruption since process on CPU1 won't see changes
> > into the file done by CPU0.
> >
> > The race can happen even if two normal faults race however with THP the
> > situation is even worse because the two faults don't operate on the same
> > entries in the radix tree and we want to use these entries for
> > serialization. So disable THP support in DAX code for now.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c            | 2 +-
> >  include/linux/dax.h | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 3951237ff248..7148fcdb2c92 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -715,7 +715,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_fault);
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  /*
> >   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> >   * more often than one might expect in the below function.
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 4b63923e1f8d..fd28d824254b 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
> >  }
> >  #endif
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> >                                 unsigned int flags, get_block_t);
> >  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> > --
> > 2.6.2
> >
> 
> Maybe switch to marking FS_DAX_PMD as "depends on BROKEN" again?  That
> way we re-use the same mechanism as the check for the presence of
> ZONE_DEVICE / struct page for the given pfn.

Yeah, maybe I could do that. At this point PMD fault handler would not even
compile but I could possibly massage it so that it will work with the new
locking unless you try mixing PMD and PTE faults...

								Honza

Patch
diff mbox

diff --git a/fs/dax.c b/fs/dax.c
index 3951237ff248..7148fcdb2c92 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -715,7 +715,7 @@  int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if 0
 /*
  * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
  * more often than one might expect in the below function.
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 4b63923e1f8d..fd28d824254b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -29,7 +29,7 @@  static inline struct page *read_dax_sector(struct block_device *bdev,
 }
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if 0
 int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
 				unsigned int flags, get_block_t);
 int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,