diff mbox

[01/23] mm: Introduce revoke_file_mappings.

Message ID 1243893048-17031-1-git-send-email-ebiederm@xmission.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Eric W. Biederman June 1, 2009, 9:50 p.m. UTC
From: Eric W. Biederman <ebiederm@xmission.com>

When the backing store of a file becomes inaccessible we need a function
to remove that file from the page tables and arrange for page faults
to receive SIGBUS until the file is unmapped.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
 include/linux/mm.h |    2 +
 mm/memory.c        |   98 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+), 0 deletions(-)

Comments

Andrew Morton June 1, 2009, 10:25 p.m. UTC | #1
On Mon,  1 Jun 2009 14:50:26 -0700
"Eric W. Biederman" <ebiederm@xmission.com> wrote:

> +static void revoke_vma(struct vm_area_struct *vma)

This looks odd.

> +{
> +	struct file *file = vma->vm_file;
> +	struct address_space *mapping = file->f_mapping;
> +	unsigned long start_addr, end_addr, size;
> +	struct mm_struct *mm;
> +
> +	start_addr = vma->vm_start;
> +	end_addr = vma->vm_end;

We take a copy of start_addr/end_addr (and this end_addr value is never used)

> +	/* Switch out the locks so I can maninuplate this under the mm sem.
> +	 * Needed so I can call vm_ops->close.
> +	 */
> +	mm = vma->vm_mm;
> +	atomic_inc(&mm->mm_users);
> +	spin_unlock(&mapping->i_mmap_lock);
> +
> +	/* Block page faults and other code modifying the mm. */
> +	down_write(&mm->mmap_sem);
> +
> +	/* Lookup a vma for my file address */
> +	vma = find_vma(mm, start_addr);

Then we look up a vma.  Is there reason to believe that this will
differ from the incoming arg which we just overwrote?  Maybe the code
is attempting to handle racing concurrent mmap/munmap activity?  If so,
what are the implications of this?

I _think_ that what the function is attempting to do is "unmap the vma
which covers the address at vma->start_addr".  If so, why not just pass
it that virtual address?

Anyway, it's all a bit obscure and I do think that the semantics and
behaviour should be carefully explained in a comment, no?

> +	if (vma->vm_file != file)
> +		goto out;

This strengthens the theory that some sort of race-management is
happening here.

> +	start_addr = vma->vm_start;
> +	end_addr   = vma->vm_end;
> +	size	   = end_addr - start_addr;
> +
> +	/* Unlock the pages */
> +	if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
> +		mm->locked_vm -= vma_pages(vma);
> +		vma->vm_flags &= ~VM_LOCKED;
> +	}
> +
> +	/* Unmap the vma */
> +	zap_page_range(vma, start_addr, size, NULL);
> +
> +	/* Unlink the vma from the file */
> +	unlink_file_vma(vma);
> +
> +	/* Close the vma */
> +	if (vma->vm_ops && vma->vm_ops->close)
> +		vma->vm_ops->close(vma);
> +	fput(vma->vm_file);
> +	vma->vm_file = NULL;
> +	if (vma->vm_flags & VM_EXECUTABLE)
> +		removed_exe_file_vma(vma->vm_mm);
> +
> +	/* Repurpose the vma  */
> +	vma->vm_private_data = NULL;
> +	vma->vm_ops = &revoked_vm_ops;
> +	vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR);
> +out:
> +	up_write(&mm->mmap_sem);
> +	spin_lock(&mapping->i_mmap_lock);
> +}

Also, I'm not a bit fan of the practice of overwriting the value of a
formal argument, especially in a function which is this large and
complex.  It makes the code harder to follow, because the one variable
holds two conceptually different things within the span of the same
function.  And it adds risk that someone will will later access a field
of *vma and it will be the wrong vma.  Worse, the bug is only exposed
under exeedingly rare conditions.

So..  Use a new local, please.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 2, 2009, 12:12 a.m. UTC | #2
Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon,  1 Jun 2009 14:50:26 -0700
> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>
>> +static void revoke_vma(struct vm_area_struct *vma)
>
> This looks odd.
>
>> +{
>> +	struct file *file = vma->vm_file;
>> +	struct address_space *mapping = file->f_mapping;
>> +	unsigned long start_addr, end_addr, size;
>> +	struct mm_struct *mm;
>> +
>> +	start_addr = vma->vm_start;
>> +	end_addr = vma->vm_end;
>
> We take a copy of start_addr/end_addr (and this end_addr value is never used)
A foolish consistency.

>> +	/* Switch out the locks so I can maninuplate this under the mm sem.
>> +	 * Needed so I can call vm_ops->close.
>> +	 */
>> +	mm = vma->vm_mm;
>> +	atomic_inc(&mm->mm_users);
>> +	spin_unlock(&mapping->i_mmap_lock);
>> +
>> +	/* Block page faults and other code modifying the mm. */
>> +	down_write(&mm->mmap_sem);
>> +
>> +	/* Lookup a vma for my file address */
>> +	vma = find_vma(mm, start_addr);
>
> Then we look up a vma.  Is there reason to believe that this will
> differ from the incoming arg which we just overwrote?  Maybe the code
> is attempting to handle racing concurrent mmap/munmap activity?  If so,
> what are the implications of this?

Yes it is.  The file based index is only safe while we hold the i_mmap_lock.
The manipulation that needs to happen under the mmap_sem.

So I drop all of the locks and restart.  And use the time honored
kernel practice of relooking up the thing I am going to manipulate.
As long as it is for the same file I don't care.

> I _think_ that what the function is attempting to do is "unmap the vma
> which covers the address at vma->start_addr".  If so, why not just pass
> it that virtual address?

Actually it is  unmapping a vma for the file I am revoking.  I hand it
one and then it does an address space jig.

> Anyway, it's all a bit obscure and I do think that the semantics and
> behaviour should be carefully explained in a comment, no?
>
>> +	if (vma->vm_file != file)
>> +		goto out;
>
> This strengthens the theory that some sort of race-management is
> happening here.

Totally.  I dropped all of my locks so I am having to restart in
a different locking context.

>> +	start_addr = vma->vm_start;
>> +	end_addr   = vma->vm_end;
>> +	size	   = end_addr - start_addr;
>> +
>> +	/* Unlock the pages */
>> +	if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
>> +		mm->locked_vm -= vma_pages(vma);
>> +		vma->vm_flags &= ~VM_LOCKED;
>> +	}
>> +
>> +	/* Unmap the vma */
>> +	zap_page_range(vma, start_addr, size, NULL);
>> +
>> +	/* Unlink the vma from the file */
>> +	unlink_file_vma(vma);
>> +
>> +	/* Close the vma */
>> +	if (vma->vm_ops && vma->vm_ops->close)
>> +		vma->vm_ops->close(vma);
>> +	fput(vma->vm_file);
>> +	vma->vm_file = NULL;
>> +	if (vma->vm_flags & VM_EXECUTABLE)
>> +		removed_exe_file_vma(vma->vm_mm);
>> +
>> +	/* Repurpose the vma  */
>> +	vma->vm_private_data = NULL;
>> +	vma->vm_ops = &revoked_vm_ops;
>> +	vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR);
>> +out:
>> +	up_write(&mm->mmap_sem);
>> +	spin_lock(&mapping->i_mmap_lock);
>> +}
>
> Also, I'm not a bit fan of the practice of overwriting the value of a
> formal argument, especially in a function which is this large and
> complex.  It makes the code harder to follow, because the one variable
> holds two conceptually different things within the span of the same
> function.  And it adds risk that someone will will later access a field
> of *vma and it will be the wrong vma.  Worse, the bug is only exposed
> under exeedingly rare conditions.
>
> So..  Use a new local, please.

We can never legitimately have more than one vma manipulated in this function.
As for the rest.  I guess I just assumed that the reader of the code would
have a basic understanding of the locking rules for those data structures.

Certainly the worst thing I suffer from is being close to the code,
and not realizing which pieces are not obvious to a naive observer.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..5d7480d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -808,6 +808,8 @@  static inline void unmap_shared_mapping_range(struct address_space *mapping,
 extern int vmtruncate(struct inode * inode, loff_t offset);
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 
+extern void revoke_file_mappings(struct file *file);
+
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access);
diff --git a/mm/memory.c b/mm/memory.c
index 4126dd1..5cbee3b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -55,6 +55,7 @@ 
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
+#include <linux/file.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -2358,6 +2359,103 @@  void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static int revoked_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static struct vm_operations_struct revoked_vm_ops = {
+	.fault	= revoked_fault,
+};
+
+static void revoke_vma(struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	unsigned long start_addr, end_addr, size;
+	struct mm_struct *mm;
+
+	start_addr = vma->vm_start;
+	end_addr = vma->vm_end;
+
+	/* Switch out the locks so I can maninuplate this under the mm sem.
+	 * Needed so I can call vm_ops->close.
+	 */
+	mm = vma->vm_mm;
+	atomic_inc(&mm->mm_users);
+	spin_unlock(&mapping->i_mmap_lock);
+
+	/* Block page faults and other code modifying the mm. */
+	down_write(&mm->mmap_sem);
+
+	/* Lookup a vma for my file address */
+	vma = find_vma(mm, start_addr);
+	if (vma->vm_file != file)
+		goto out;
+
+	start_addr = vma->vm_start;
+	end_addr   = vma->vm_end;
+	size	   = end_addr - start_addr;
+
+	/* Unlock the pages */
+	if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
+		mm->locked_vm -= vma_pages(vma);
+		vma->vm_flags &= ~VM_LOCKED;
+	}
+
+	/* Unmap the vma */
+	zap_page_range(vma, start_addr, size, NULL);
+
+	/* Unlink the vma from the file */
+	unlink_file_vma(vma);
+
+	/* Close the vma */
+	if (vma->vm_ops && vma->vm_ops->close)
+		vma->vm_ops->close(vma);
+	fput(vma->vm_file);
+	vma->vm_file = NULL;
+	if (vma->vm_flags & VM_EXECUTABLE)
+		removed_exe_file_vma(vma->vm_mm);
+
+	/* Repurpose the vma  */
+	vma->vm_private_data = NULL;
+	vma->vm_ops = &revoked_vm_ops;
+	vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR);
+out:
+	up_write(&mm->mmap_sem);
+	spin_lock(&mapping->i_mmap_lock);
+}
+
+void revoke_file_mappings(struct file *file)
+{
+	/* After a file has been marked dead update the vmas */
+	struct address_space *mapping = file->f_mapping;
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+
+	spin_lock(&mapping->i_mmap_lock);
+
+restart_tree:
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
+		/* Skip quickly over vmas that do not need to be touched */
+		if (vma->vm_file != file)
+			continue;
+		revoke_vma(vma);
+		goto restart_tree;
+	}
+
+restart_list:
+	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
+		/* Skip quickly over vmas that do not need to be touched */
+		if (vma->vm_file != file)
+			continue;
+		revoke_vma(vma);
+		goto restart_list;
+	}
+
+	spin_unlock(&mapping->i_mmap_lock);
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used