diff mbox

[v2] dax: fix NULL pointer in __dax_pmd_fault()

Message ID 20150922141333.c28e3c5d800267937ca7b29a@linux-foundation.org (mailing list archive)
State Not Applicable
Headers show

Commit Message

Andrew Morton Sept. 22, 2015, 9:13 p.m. UTC
On Tue, 22 Sep 2015 13:36:22 -0600 Ross Zwisler <ross.zwisler@linux.intel.com> wrote:

> The following commit:
> 
> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
> 	DAX")
> 
> moved some code in __dax_pmd_fault() that was responsible for zeroing
> newly allocated PMD pages.  The new location didn't properly set up
> 'kaddr', though, so when run this code resulted in a NULL pointer BUG.
> 
> Fix this by getting the correct 'kaddr' via bdev_direct_access().

Why the heck didn't gcc warn?

I had a fiddle:


gcc warns about the first printk, but not about the second.  So that
"if (...) return ..." seems to have defeated gcc uninitialized-var
detection.  wtf?

> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -569,8 +569,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
>  		goto fallback;
>  
> +	sector = bh.b_blocknr << (blkbits - 9);
> +
>  	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
>  		int i;
> +
> +		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
> +						bh.b_size);
> +		if (length < 0) {
> +			result = VM_FAULT_SIGBUS;
> +			goto out;
> +		}
> +		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> +			goto fallback;
> +
>  		for (i = 0; i < PTRS_PER_PMD; i++)
>  			clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
>  		wmb_pmem();

hm, that's a lot of copy-n-paste.  Do we really need to run
bdev_direct_access() twice?  Will `kaddr' and `pfn' change?

Comments

Dan Williams Sept. 22, 2015, 9:25 p.m. UTC | #1
On Tue, Sep 22, 2015 at 2:13 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 22 Sep 2015 13:36:22 -0600 Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
>
>> The following commit:
>>
>> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
>>       DAX")
>>
>> moved some code in __dax_pmd_fault() that was responsible for zeroing
>> newly allocated PMD pages.  The new location didn't properly set up
>> 'kaddr', though, so when run this code resulted in a NULL pointer BUG.
>>
>> Fix this by getting the correct 'kaddr' via bdev_direct_access().
>
> Why the heck didn't gcc warn?
>
> I had a fiddle:
>
> --- a/fs/dax.c~a
> +++ a/fs/dax.c
> @@ -529,15 +529,18 @@ int __dax_pmd_fault(struct vm_area_struc
>         unsigned long pmd_addr = address & PMD_MASK;
>         bool write = flags & FAULT_FLAG_WRITE;
>         long length;
> -       void __pmem *kaddr;
> +       void *kaddr;
>         pgoff_t size, pgoff;
>         sector_t block, sector;
>         unsigned long pfn;
>         int result = 0;
>
> +//     printk("%p\n", kaddr);
> +
>         /* Fall back to PTEs if we're going to COW */
>         if (write && !(vma->vm_flags & VM_SHARED))
>                 return VM_FAULT_FALLBACK;
> +       printk("%p\n", kaddr);
>         /* If the PMD would extend outside the VMA */
>         if (pmd_addr < vma->vm_start)
>                 return VM_FAULT_FALLBACK;
>
> gcc warns about the first printk, but not about the second.  So that
> "if (...) return ..." seems to have defeated gcc uninitialized-var
> detection.  wtf?
>
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -569,8 +569,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>>       if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
>>               goto fallback;
>>
>> +     sector = bh.b_blocknr << (blkbits - 9);
>> +
>>       if (buffer_unwritten(&bh) || buffer_new(&bh)) {
>>               int i;
>> +
>> +             length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
>> +                                             bh.b_size);
>> +             if (length < 0) {
>> +                     result = VM_FAULT_SIGBUS;
>> +                     goto out;
>> +             }
>> +             if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
>> +                     goto fallback;
>> +
>>               for (i = 0; i < PTRS_PER_PMD; i++)
>>                       clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
>>               wmb_pmem();
>
> hm, that's a lot of copy-n-paste.  Do we really need to run
> bdev_direct_access() twice?  Will `kaddr' and `pfn' change?
>

They shouldn't change, but I'm working on a fix for handling the race
of unbinding the pmem device while that kaddr is in use (unbind
invalidates kaddr).  The proposal is a dax_map_bh()/dax_unmap_bh()
interface to temporarily pin the mapping around each usage.
Dave Chinner Sept. 22, 2015, 11:30 p.m. UTC | #2
On Tue, Sep 22, 2015 at 02:25:19PM -0700, Dan Williams wrote:
> On Tue, Sep 22, 2015 at 2:13 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Tue, 22 Sep 2015 13:36:22 -0600 Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
> >
> >> The following commit:
> >>
> >> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
> >>       DAX")
> >>
> >> moved some code in __dax_pmd_fault() that was responsible for zeroing
> >> newly allocated PMD pages.  The new location didn't properly set up
> >> 'kaddr', though, so when run this code resulted in a NULL pointer BUG.
> >>
> >> Fix this by getting the correct 'kaddr' via bdev_direct_access().
> >
> > Why the heck didn't gcc warn?
> >
> > I had a fiddle:
> >
> > --- a/fs/dax.c~a
> > +++ a/fs/dax.c
> > @@ -529,15 +529,18 @@ int __dax_pmd_fault(struct vm_area_struc
> >         unsigned long pmd_addr = address & PMD_MASK;
> >         bool write = flags & FAULT_FLAG_WRITE;
> >         long length;
> > -       void __pmem *kaddr;
> > +       void *kaddr;
> >         pgoff_t size, pgoff;
> >         sector_t block, sector;
> >         unsigned long pfn;
> >         int result = 0;
> >
> > +//     printk("%p\n", kaddr);
> > +
> >         /* Fall back to PTEs if we're going to COW */
> >         if (write && !(vma->vm_flags & VM_SHARED))
> >                 return VM_FAULT_FALLBACK;
> > +       printk("%p\n", kaddr);
> >         /* If the PMD would extend outside the VMA */
> >         if (pmd_addr < vma->vm_start)
> >                 return VM_FAULT_FALLBACK;
> >
> > gcc warns about the first printk, but not about the second.  So that
> > "if (...) return ..." seems to have defeated gcc uninitialized-var
> > detection.  wtf?
> >
> >> --- a/fs/dax.c
> >> +++ b/fs/dax.c
> >> @@ -569,8 +569,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >>       if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
> >>               goto fallback;
> >>
> >> +     sector = bh.b_blocknr << (blkbits - 9);
> >> +
> >>       if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> >>               int i;
> >> +
> >> +             length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
> >> +                                             bh.b_size);
> >> +             if (length < 0) {
> >> +                     result = VM_FAULT_SIGBUS;
> >> +                     goto out;
> >> +             }
> >> +             if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> >> +                     goto fallback;
> >> +
> >>               for (i = 0; i < PTRS_PER_PMD; i++)
> >>                       clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
> >>               wmb_pmem();
> >
> > hm, that's a lot of copy-n-paste.  Do we really need to run
> > bdev_direct_access() twice?  Will `kaddr' and `pfn' change?
> >
> 
> They shouldn't change, but I'm working on a fix for handling the race
> of unbinding the pmem device while that kaddr is in use (unbind
> invalidates kaddr).

Exactly what does "unbinding the pmem device" mean, and why can
(parts of) the pmem device "go away" when there are active
references to it?

> The proposal is a dax_map_bh()/dax_unmap_bh()
> interface to temporarily pin the mapping around each usage.

Which mapping? The bufferhead maps file offset to filesystem block
addresses, so I'm not sure what problem you are actually refering
to here...

Cheers,

Dave.
Dan Williams Sept. 23, 2015, 3 a.m. UTC | #3
On Tue, Sep 22, 2015 at 4:30 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Sep 22, 2015 at 02:25:19PM -0700, Dan Williams wrote:
>> On Tue, Sep 22, 2015 at 2:13 PM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>> > On Tue, 22 Sep 2015 13:36:22 -0600 Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
>> >
>> >> The following commit:
>> >>
>> >> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
>> >>       DAX")
>> >>
>> >> moved some code in __dax_pmd_fault() that was responsible for zeroing
>> >> newly allocated PMD pages.  The new location didn't properly set up
>> >> 'kaddr', though, so when run this code resulted in a NULL pointer BUG.
>> >>
>> >> Fix this by getting the correct 'kaddr' via bdev_direct_access().
>> >
>> > Why the heck didn't gcc warn?
>> >
>> > I had a fiddle:
>> >
>> > --- a/fs/dax.c~a
>> > +++ a/fs/dax.c
>> > @@ -529,15 +529,18 @@ int __dax_pmd_fault(struct vm_area_struc
>> >         unsigned long pmd_addr = address & PMD_MASK;
>> >         bool write = flags & FAULT_FLAG_WRITE;
>> >         long length;
>> > -       void __pmem *kaddr;
>> > +       void *kaddr;
>> >         pgoff_t size, pgoff;
>> >         sector_t block, sector;
>> >         unsigned long pfn;
>> >         int result = 0;
>> >
>> > +//     printk("%p\n", kaddr);
>> > +
>> >         /* Fall back to PTEs if we're going to COW */
>> >         if (write && !(vma->vm_flags & VM_SHARED))
>> >                 return VM_FAULT_FALLBACK;
>> > +       printk("%p\n", kaddr);
>> >         /* If the PMD would extend outside the VMA */
>> >         if (pmd_addr < vma->vm_start)
>> >                 return VM_FAULT_FALLBACK;
>> >
>> > gcc warns about the first printk, but not about the second.  So that
>> > "if (...) return ..." seems to have defeated gcc uninitialized-var
>> > detection.  wtf?
>> >
>> >> --- a/fs/dax.c
>> >> +++ b/fs/dax.c
>> >> @@ -569,8 +569,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>> >>       if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
>> >>               goto fallback;
>> >>
>> >> +     sector = bh.b_blocknr << (blkbits - 9);
>> >> +
>> >>       if (buffer_unwritten(&bh) || buffer_new(&bh)) {
>> >>               int i;
>> >> +
>> >> +             length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
>> >> +                                             bh.b_size);
>> >> +             if (length < 0) {
>> >> +                     result = VM_FAULT_SIGBUS;
>> >> +                     goto out;
>> >> +             }
>> >> +             if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
>> >> +                     goto fallback;
>> >> +
>> >>               for (i = 0; i < PTRS_PER_PMD; i++)
>> >>                       clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
>> >>               wmb_pmem();
>> >
>> > hm, that's a lot of copy-n-paste.  Do we really need to run
>> > bdev_direct_access() twice?  Will `kaddr' and `pfn' change?
>> >
>>
>> They shouldn't change, but I'm working on a fix for handling the race
>> of unbinding the pmem device while that kaddr is in use (unbind
>> invalidates kaddr).
>
> Exactly what does "unbinding the pmem device" mean,

echo namespace0.0 > /sys/bus/nd/drivers/nd_pmem/unbind

> and why can
> (parts of) the pmem device "go away" when there are active
> references to it?

Normally we have outstanding i/o requests to hold off
blk_cleanup_queue(), but in the dax case we don't have any mechanism
(yet) to flag the queue as busy.  I have some patches to add a
percpu_refcount for this purpose.

>
>> The proposal is a dax_map_bh()/dax_unmap_bh()
>> interface to temporarily pin the mapping around each usage.
>
> Which mapping? The bufferhead maps file offset to filesystem block
> addresses, so I'm not sure what problem you are actually refering
> to here...

The kaddr is coming from the devm_memremap() in the pmem driver that
gets unmapped after the device is released by the driver.
Dave Chinner Sept. 23, 2015, 9:04 a.m. UTC | #4
On Tue, Sep 22, 2015 at 08:00:29PM -0700, Dan Williams wrote:
> On Tue, Sep 22, 2015 at 4:30 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Sep 22, 2015 at 02:25:19PM -0700, Dan Williams wrote:
> >> On Tue, Sep 22, 2015 at 2:13 PM, Andrew Morton
> >> <akpm@linux-foundation.org> wrote:
> >> > On Tue, 22 Sep 2015 13:36:22 -0600 Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
> >> >
> >> >> The following commit:
> >> >>
> >> >> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
> >> >>       DAX")
> >> >>
> >> >> moved some code in __dax_pmd_fault() that was responsible for zeroing
> >> >> newly allocated PMD pages.  The new location didn't properly set up
> >> >> 'kaddr', though, so when run this code resulted in a NULL pointer BUG.
> >> >>
> >> >> Fix this by getting the correct 'kaddr' via bdev_direct_access().
> >> >
> >> > Why the heck didn't gcc warn?
> >> >
> >> > I had a fiddle:
> >> >
> >> > --- a/fs/dax.c~a
> >> > +++ a/fs/dax.c
> >> > @@ -529,15 +529,18 @@ int __dax_pmd_fault(struct vm_area_struc
> >> >         unsigned long pmd_addr = address & PMD_MASK;
> >> >         bool write = flags & FAULT_FLAG_WRITE;
> >> >         long length;
> >> > -       void __pmem *kaddr;
> >> > +       void *kaddr;
> >> >         pgoff_t size, pgoff;
> >> >         sector_t block, sector;
> >> >         unsigned long pfn;
> >> >         int result = 0;
> >> >
> >> > +//     printk("%p\n", kaddr);
> >> > +
> >> >         /* Fall back to PTEs if we're going to COW */
> >> >         if (write && !(vma->vm_flags & VM_SHARED))
> >> >                 return VM_FAULT_FALLBACK;
> >> > +       printk("%p\n", kaddr);
> >> >         /* If the PMD would extend outside the VMA */
> >> >         if (pmd_addr < vma->vm_start)
> >> >                 return VM_FAULT_FALLBACK;
> >> >
> >> > gcc warns about the first printk, but not about the second.  So that
> >> > "if (...) return ..." seems to have defeated gcc uninitialized-var
> >> > detection.  wtf?
> >> >
> >> >> --- a/fs/dax.c
> >> >> +++ b/fs/dax.c
> >> >> @@ -569,8 +569,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >> >>       if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
> >> >>               goto fallback;
> >> >>
> >> >> +     sector = bh.b_blocknr << (blkbits - 9);
> >> >> +
> >> >>       if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> >> >>               int i;
> >> >> +
> >> >> +             length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
> >> >> +                                             bh.b_size);
> >> >> +             if (length < 0) {
> >> >> +                     result = VM_FAULT_SIGBUS;
> >> >> +                     goto out;
> >> >> +             }
> >> >> +             if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> >> >> +                     goto fallback;
> >> >> +
> >> >>               for (i = 0; i < PTRS_PER_PMD; i++)
> >> >>                       clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
> >> >>               wmb_pmem();
> >> >
> >> > hm, that's a lot of copy-n-paste.  Do we really need to run
> >> > bdev_direct_access() twice?  Will `kaddr' and `pfn' change?
> >> >
> >>
> >> They shouldn't change, but I'm working on a fix for handling the race
> >> of unbinding the pmem device while that kaddr is in use (unbind
> >> invalidates kaddr).
> >
> > Exactly what does "unbinding the pmem device" mean,
> 
> echo namespace0.0 > /sys/bus/nd/drivers/nd_pmem/unbind

That tells me "how", not "what"..... :/

> > and why can
> > (parts of) the pmem device "go away" when there are active
> > references to it?
> 
> Normally we have outstanding i/o requests to hold off
> blk_cleanup_queue(), but in the dax case we don't have any mechanism
> (yet) to flag the queue as busy.  I have some patches to add a
> percpu_refcount for this purpose.

So this comes back to fact we allow a block device to be torn down
and freed while a filesystem has active references to it?

> >> The proposal is a dax_map_bh()/dax_unmap_bh()
> >> interface to temporarily pin the mapping around each usage.
> >
> > Which mapping? The bufferhead maps file offset to filesystem block
> > addresses, so I'm not sure what problem you are actually refering
> > to here...
> 
> The kaddr is coming from the devm_memremap() in the pmem driver that
> gets unmapped after the device is released by the driver.

Perhaps the better solution is to not tear down the block device
until all active references have gone away? i.e. unbind puts the
device into a persistent error state and forces all active mappings
to refault. Hence all future accesses error out and then when the
user unmounts the unhappy filesystem the last reference to the
blockdev goes away and the mappings can be torn down safely...

Cheers,

Dave.
Boaz Harrosh Sept. 24, 2015, 8:50 a.m. UTC | #5
On 09/23/2015 12:04 PM, Dave Chinner wrote:
> On Tue, Sep 22, 2015 at 08:00:29PM -0700, Dan Williams wrote:
<>
>> The kaddr is coming from the devm_memremap() in the pmem driver that
>> gets unmapped after the device is released by the driver.
> 
> Perhaps the better solution is to not tear down the block device
> until all active references have gone away? i.e. unbind puts the
> device into a persistent error state and forces all active mappings
> to refault. Hence all future accesses error out and then when the
> user unmounts the unhappy filesystem the last reference to the
> blockdev goes away and the mappings can be torn down safely...
> 

Me too

> Cheers,
> 
> Dave.
>
Dan Williams Sept. 24, 2015, 4:06 p.m. UTC | #6
On Wed, Sep 23, 2015 at 2:04 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Sep 22, 2015 at 08:00:29PM -0700, Dan Williams wrote:
>> The kaddr is coming from the devm_memremap() in the pmem driver that
>> gets unmapped after the device is released by the driver.
>
> Perhaps the better solution is to not tear down the block device
> until all active references have gone away? i.e. unbind puts the
> device into a persistent error state and forces all active mappings
> to refault. Hence all future accesses error out and then when the
> user unmounts the unhappy filesystem the last reference to the
> blockdev goes away and the mappings can be torn down safely...

In fact this is how it already works in the block layer, it's just
that the pmem driver was not participating in that mechanism.  The
filesystem prevents the gendisk and hosting driver module from going
away via the heavyweight get_disk().  The gendisk keeps the
request_queue from being de-allocated, but the queue can go "dead" to
new requests at any time.  Single-queue based drivers take the
queue_lock and check blk_queue_dying() before allowing new requests.
Multi-queue drivers take a lighter-weight approach and try to get a
new "live" reference from a percpu_refcount.

When the backing device is unplugged or otherwise unbound from its
driver it calls blk_cleanup_queue() in its shutdown path.  That marks
the queue dead and flushes any outstanding requests.  From that point
forward all requests end in error until the final put_disk().

This is what I came up with for pmem:
https://lists.01.org/pipermail/linux-nvdimm/2015-September/002206.html
diff mbox

Patch

--- a/fs/dax.c~a
+++ a/fs/dax.c
@@ -529,15 +529,18 @@  int __dax_pmd_fault(struct vm_area_struc
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
 	long length;
-	void __pmem *kaddr;
+	void *kaddr;
 	pgoff_t size, pgoff;
 	sector_t block, sector;
 	unsigned long pfn;
 	int result = 0;
 
+//	printk("%p\n", kaddr);
+
 	/* Fall back to PTEs if we're going to COW */
 	if (write && !(vma->vm_flags & VM_SHARED))
 		return VM_FAULT_FALLBACK;
+	printk("%p\n", kaddr);
 	/* If the PMD would extend outside the VMA */
 	if (pmd_addr < vma->vm_start)
 		return VM_FAULT_FALLBACK;