diff mbox

[2/2] dax: fix device-dax fault handling so that it would fallback

Message ID 148910376733.27406.9048213402296101821.stgit@djiang5-desk3.ch.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dave Jiang March 9, 2017, 11:56 p.m. UTC
Jeff Moyer reports that:
"
With a device dax alignment of 4KB or 2MB, I get sigbus when running the
attached fio job file for the current kernel (4.11.0-rc1+).  If I
specify an alignment of 1GB, it works.

I turned on debug output, and saw that it was failing in the huge fault
code.

[ 4614.138357] dax dax1.0: dax_open
[ 4614.154838] dax dax1.0: dax_mmap
[ 4614.171898] dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 - 0x7f0ce0800000)
[ 4614.211720] dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf600) failed
[ 4614.568911] dax dax1.0: dax_release

fio config for reproduce:
[global]
ioengine=dev-dax
direct=0
filename=/dev/dax0.0
bs=2m

[write]
rw=write

[read]
stonewall
rw=read
"

It looks like the code does not fallback at all when handling faults. The
fault handlers are missing some boundary checking for the VMA. Also, we need
to fall back if the alignment requested is less than the P-entry size.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dax/dax.c |   22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

Comments

Dan Williams March 10, 2017, 12:03 a.m. UTC | #1
On Thu, Mar 9, 2017 at 3:56 PM, Dave Jiang <dave.jiang@intel.com> wrote:
> Jeff Moyer reports that:
> "
> With a device dax alignment of 4KB or 2MB, I get sigbus when running the
> attached fio job file for the current kernel (4.11.0-rc1+).  If I
> specify an alignment of 1GB, it works.
>
> I turned on debug output, and saw that it was failing in the huge fault
> code.
>
> [ 4614.138357] dax dax1.0: dax_open
> [ 4614.154838] dax dax1.0: dax_mmap
> [ 4614.171898] dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 - 0x7f0ce0800000)
> [ 4614.211720] dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf600) failed
> [ 4614.568911] dax dax1.0: dax_release
>
> fio config for reproduce:
> [global]
> ioengine=dev-dax
> direct=0
> filename=/dev/dax0.0
> bs=2m
>
> [write]
> rw=write
>
> [read]
> stonewall
> rw=read
> "
>
> It looks like the code does not fallback at all when handling faults. The
> fault handlers are missing some boundary checking for the VMA. Also, we need
> to fall back if the alignment requested is less than the P-entry size.
>
> Reported-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/dax/dax.c |   22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> index 174690a..5c7998e 100644
> --- a/drivers/dax/dax.c
> +++ b/drivers/dax/dax.c
> @@ -480,12 +480,21 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
>                 return VM_FAULT_SIGBUS;
>         }
>
> +       /* fall back if we are outside of the VMA */
> +       if (pmd_addr < vmf->vma->vm_start)
> +               return VM_FAULT_FALLBACK;
> +       if ((pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
> +               return VM_FAULT_FALLBACK;

I think these are already handled by check_vma(), and we only want to
fallback if dax_region->align < PMD_SIZE, otherwise we're violating
device-dax's guaranteed fault granularity.

> +
> +       if (dax_region->align < PMD_SIZE)
> +               return VM_FAULT_FALLBACK;
> +
>         pgoff = linear_page_index(vmf->vma, pmd_addr);
>         phys = pgoff_to_phys(dax_dev, pgoff, PMD_SIZE);
>         if (phys == -1) {
>                 dev_dbg(dev, "%s: pgoff_to_phys(%#lx) failed\n", __func__,
>                                 pgoff);
> -               return VM_FAULT_SIGBUS;
> +               return VM_FAULT_FALLBACK;

If we get past all the checks above then this is a failure.

>         }
>
>         pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> @@ -519,12 +528,21 @@ static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
>                 return VM_FAULT_SIGBUS;
>         }
>
> +       /* fall back if we are outside of the VMA */
> +       if (pud_addr < vmf->vma->vm_start)
> +               return VM_FAULT_FALLBACK;
> +       if ((pud_addr + PUD_SIZE) > vmf->vma->vm_end)
> +               return VM_FAULT_FALLBACK;
> +
> +       if (dax_region->align < PUD_SIZE)
> +               return VM_FAULT_FALLBACK;
> +
>         pgoff = linear_page_index(vmf->vma, pud_addr);
>         phys = pgoff_to_phys(dax_dev, pgoff, PUD_SIZE);
>         if (phys == -1) {
>                 dev_dbg(dev, "%s: pgoff_to_phys(%#lx) failed\n", __func__,
>                                 pgoff);
> -               return VM_FAULT_SIGBUS;
> +               return VM_FAULT_FALLBACK;

Same comments as above. The dax_region->align check should catch all
the valid fallback cases and check_vma() should clean up everything
else.

>         }
>
>         pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
>
Dave Jiang March 10, 2017, 12:11 a.m. UTC | #2
On 03/09/2017 05:03 PM, Dan Williams wrote:
> On Thu, Mar 9, 2017 at 3:56 PM, Dave Jiang <dave.jiang@intel.com> wrote:
>> Jeff Moyer reports that:
>> "
>> With a device dax alignment of 4KB or 2MB, I get sigbus when running the
>> attached fio job file for the current kernel (4.11.0-rc1+).  If I
>> specify an alignment of 1GB, it works.
>>
>> I turned on debug output, and saw that it was failing in the huge fault
>> code.
>>
>> [ 4614.138357] dax dax1.0: dax_open
>> [ 4614.154838] dax dax1.0: dax_mmap
>> [ 4614.171898] dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 - 0x7f0ce0800000)
>> [ 4614.211720] dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf600) failed
>> [ 4614.568911] dax dax1.0: dax_release
>>
>> fio config for reproduce:
>> [global]
>> ioengine=dev-dax
>> direct=0
>> filename=/dev/dax0.0
>> bs=2m
>>
>> [write]
>> rw=write
>>
>> [read]
>> stonewall
>> rw=read
>> "
>>
>> It looks like the code does not fallback at all when handling faults. The
>> fault handlers are missing some boundary checking for the VMA. Also, we need
>> to fall back if the alignment requested is less than the P-entry size.
>>
>> Reported-by: Jeff Moyer <jmoyer@redhat.com>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> ---
>>  drivers/dax/dax.c |   22 ++++++++++++++++++++--
>>  1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>> index 174690a..5c7998e 100644
>> --- a/drivers/dax/dax.c
>> +++ b/drivers/dax/dax.c
>> @@ -480,12 +480,21 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
>>                 return VM_FAULT_SIGBUS;
>>         }
>>
>> +       /* fall back if we are outside of the VMA */
>> +       if (pmd_addr < vmf->vma->vm_start)
>> +               return VM_FAULT_FALLBACK;
>> +       if ((pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
>> +               return VM_FAULT_FALLBACK;
> 
> I think these are already handled by check_vma(), and we only want to
> fallback if dax_region->align < PMD_SIZE, otherwise we're violating
> device-dax's guaranteed fault granularity.

I was seeing pud_addr < vmf->vma->vm_start in the pud handler. It
doesn't look like check_vma() checks the boundaries. Also, it wouldn't
know the pud_addr or pmd_addr. Or are you saying it doesn't matter?

> 
>> +
>> +       if (dax_region->align < PMD_SIZE)
>> +               return VM_FAULT_FALLBACK;
>> +
>>         pgoff = linear_page_index(vmf->vma, pmd_addr);
>>         phys = pgoff_to_phys(dax_dev, pgoff, PMD_SIZE);
>>         if (phys == -1) {
>>                 dev_dbg(dev, "%s: pgoff_to_phys(%#lx) failed\n", __func__,
>>                                 pgoff);
>> -               return VM_FAULT_SIGBUS;
>> +               return VM_FAULT_FALLBACK;
> 
> If we get past all the checks above then this is a failure.
> 
>>         }
>>
>>         pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
>> @@ -519,12 +528,21 @@ static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
>>                 return VM_FAULT_SIGBUS;
>>         }
>>
>> +       /* fall back if we are outside of the VMA */
>> +       if (pud_addr < vmf->vma->vm_start)
>> +               return VM_FAULT_FALLBACK;
>> +       if ((pud_addr + PUD_SIZE) > vmf->vma->vm_end)
>> +               return VM_FAULT_FALLBACK;
>> +
>> +       if (dax_region->align < PUD_SIZE)
>> +               return VM_FAULT_FALLBACK;
>> +
>>         pgoff = linear_page_index(vmf->vma, pud_addr);
>>         phys = pgoff_to_phys(dax_dev, pgoff, PUD_SIZE);
>>         if (phys == -1) {
>>                 dev_dbg(dev, "%s: pgoff_to_phys(%#lx) failed\n", __func__,
>>                                 pgoff);
>> -               return VM_FAULT_SIGBUS;
>> +               return VM_FAULT_FALLBACK;
> 
> Same comments as above. The dax_region->align check should catch all
> the valid fallback cases and check_vma() should clean up everything
> else.
> 
>>         }
>>
>>         pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
>>
Dan Williams March 10, 2017, 12:33 a.m. UTC | #3
On Thu, Mar 9, 2017 at 4:11 PM, Dave Jiang <dave.jiang@intel.com> wrote:
>
>
> On 03/09/2017 05:03 PM, Dan Williams wrote:
>> On Thu, Mar 9, 2017 at 3:56 PM, Dave Jiang <dave.jiang@intel.com> wrote:
>>> Jeff Moyer reports that:
>>> "
>>> With a device dax alignment of 4KB or 2MB, I get sigbus when running the
>>> attached fio job file for the current kernel (4.11.0-rc1+).  If I
>>> specify an alignment of 1GB, it works.
>>>
>>> I turned on debug output, and saw that it was failing in the huge fault
>>> code.
>>>
>>> [ 4614.138357] dax dax1.0: dax_open
>>> [ 4614.154838] dax dax1.0: dax_mmap
>>> [ 4614.171898] dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 - 0x7f0ce0800000)
>>> [ 4614.211720] dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf600) failed
>>> [ 4614.568911] dax dax1.0: dax_release
>>>
>>> fio config for reproduce:
>>> [global]
>>> ioengine=dev-dax
>>> direct=0
>>> filename=/dev/dax0.0
>>> bs=2m
>>>
>>> [write]
>>> rw=write
>>>
>>> [read]
>>> stonewall
>>> rw=read
>>> "
>>>
>>> It looks like the code does not fallback at all when handling faults. The
>>> fault handlers are missing some boundary checking for the VMA. Also, we need
>>> to fall back if the alignment requested is less than the P-entry size.
>>>
>>> Reported-by: Jeff Moyer <jmoyer@redhat.com>
>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>> ---
>>>  drivers/dax/dax.c |   22 ++++++++++++++++++++--
>>>  1 file changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>>> index 174690a..5c7998e 100644
>>> --- a/drivers/dax/dax.c
>>> +++ b/drivers/dax/dax.c
>>> @@ -480,12 +480,21 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
>>>                 return VM_FAULT_SIGBUS;
>>>         }
>>>
>>> +       /* fall back if we are outside of the VMA */
>>> +       if (pmd_addr < vmf->vma->vm_start)
>>> +               return VM_FAULT_FALLBACK;
>>> +       if ((pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
>>> +               return VM_FAULT_FALLBACK;
>>
>> I think these are already handled by check_vma(), and we only want to
>> fallback if dax_region->align < PMD_SIZE, otherwise we're violating
>> device-dax's guaranteed fault granularity.
>
> I was seeing pud_addr < vmf->vma->vm_start in the pud handler. It
> doesn't look like check_vma() checks the boundaries.

Sorry, yes, we do need to do the vma boundary checks in
__dax_dev_{pmd,pud}_fault() since check_vma() does not know the fault
size. We just can't fallback blindly.

> Also, it wouldn't
> know the pud_addr or pmd_addr. Or are you saying it doesn't matter?

If the fault does not align to the vma we need to fall back, but only
if the current fault size is > dax_region->align and the fallback
fault size is >= dax_region->align.
diff mbox

Patch

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 174690a..5c7998e 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -480,12 +480,21 @@  static int __dax_dev_pmd_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 	}
 
+	/* fall back if we are outside of the VMA */
+	if (pmd_addr < vmf->vma->vm_start)
+		return VM_FAULT_FALLBACK;
+	if ((pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
+		return VM_FAULT_FALLBACK;
+
+	if (dax_region->align < PMD_SIZE)
+		return VM_FAULT_FALLBACK;
+
 	pgoff = linear_page_index(vmf->vma, pmd_addr);
 	phys = pgoff_to_phys(dax_dev, pgoff, PMD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "%s: pgoff_to_phys(%#lx) failed\n", __func__,
 				pgoff);
-		return VM_FAULT_SIGBUS;
+		return VM_FAULT_FALLBACK;
 	}
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
@@ -519,12 +528,21 @@  static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 	}
 
+	/* fall back if we are outside of the VMA */
+	if (pud_addr < vmf->vma->vm_start)
+		return VM_FAULT_FALLBACK;
+	if ((pud_addr + PUD_SIZE) > vmf->vma->vm_end)
+		return VM_FAULT_FALLBACK;
+
+	if (dax_region->align < PUD_SIZE)
+		return VM_FAULT_FALLBACK;
+
 	pgoff = linear_page_index(vmf->vma, pud_addr);
 	phys = pgoff_to_phys(dax_dev, pgoff, PUD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "%s: pgoff_to_phys(%#lx) failed\n", __func__,
 				pgoff);
-		return VM_FAULT_SIGBUS;
+		return VM_FAULT_FALLBACK;
 	}
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);