diff mbox

proc: get a reference to the owning module when opening file

Message ID 1423623907-6950-1-git-send-email-nicolas.iooss_linux@m4x.org (mailing list archive)
State New, archived
Headers show

Commit Message

Nicolas Iooss Feb. 11, 2015, 3:05 a.m. UTC
When a module creates a file in procfs and a program uses the file with
mmap, the .owner field of the file_operations structure is ignored.
This allows removing the module while the file is still being used.

Therefore this sequence of events leads to a kernel oops:

* load a module which creates a file in procfs with an mmap operation
* open the file
* use mmap on it
* rmmod the module
* call unmap

Here are parts of the oops caused by unmap:

[ 1234.337725] BUG: unable to handle kernel paging request at ffffffffa030a268
[ 1234.338007] IP: [<ffffffff811bf054>] remove_vma+0x24/0x70
[ 1234.338007] PGD 1c17067 PUD 1c18063 PMD 3d713067 PTE 0
[ 1234.338007] Oops: 0000 [#1] SMP
[ 1234.338007] Modules linked in: fuse rpcsec_gss_krb5 nfsv4
dns_resolver nfs fscache cfg80211 rfkill ppdev bochs_drm virtio_net
serio_raw parport_pc ttm pvpanic parport drm_kms_helper drm i2c_piix4
nfsd auth_rpcgss nfs_acl lockd sunrpc virtio_blk virtio_pci virtio_ring
virtio ata_generic pata_acpi [last unloaded: procfs_mmap]

[ 1234.338007] Call Trace:
[ 1234.338007]  [<ffffffff811c155f>] do_munmap+0x27f/0x3b0
[ 1234.338007]  [<ffffffff811c16d1>] vm_munmap+0x41/0x60
[ 1234.338007]  [<ffffffff811c2652>] SyS_munmap+0x22/0x30
[ 1234.338007]  [<ffffffff817301a9>] system_call_fastpath+0x16/0x1b

Fix this by making proc_reg_open grab a reference to the module owning
pde->proc_fops.

More information and example code to reproduce this oops can be found on
https://bugzilla.kernel.org/show_bug.cgi?id=92511

Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
---
 fs/proc/inode.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Al Viro Feb. 17, 2015, 8:14 p.m. UTC | #1
On Wed, Feb 11, 2015 at 11:05:07AM +0800, Nicolas Iooss wrote:
> When a module creates a file in procfs and a program uses the file with
> mmap, the .owner field of the file_operations structure is ignored.
> This allows removing the module while the file is still being used.
> 
> Therefore this sequence of events leads to a kernel oops:
> 
> * load a module which creates a file in procfs with an mmap operation
> * open the file
> * use mmap on it
> * rmmod the module
> * call unmap
> 
> Here are parts of the oops caused by unmap:
> 
> [ 1234.337725] BUG: unable to handle kernel paging request at ffffffffa030a268
> [ 1234.338007] IP: [<ffffffff811bf054>] remove_vma+0x24/0x70
> [ 1234.338007] PGD 1c17067 PUD 1c18063 PMD 3d713067 PTE 0
> [ 1234.338007] Oops: 0000 [#1] SMP
> [ 1234.338007] Modules linked in: fuse rpcsec_gss_krb5 nfsv4
> dns_resolver nfs fscache cfg80211 rfkill ppdev bochs_drm virtio_net
> serio_raw parport_pc ttm pvpanic parport drm_kms_helper drm i2c_piix4
> nfsd auth_rpcgss nfs_acl lockd sunrpc virtio_blk virtio_pci virtio_ring
> virtio ata_generic pata_acpi [last unloaded: procfs_mmap]
> 
> [ 1234.338007] Call Trace:
> [ 1234.338007]  [<ffffffff811c155f>] do_munmap+0x27f/0x3b0
> [ 1234.338007]  [<ffffffff811c16d1>] vm_munmap+0x41/0x60
> [ 1234.338007]  [<ffffffff811c2652>] SyS_munmap+0x22/0x30
> [ 1234.338007]  [<ffffffff817301a9>] system_call_fastpath+0x16/0x1b
> 
> Fix this by making proc_reg_open grab a reference to the module owning
> pde->proc_fops.

NAK.

procfs doesn't pin modules while the file is opened, by design.  And
it's not about to start.  Background:

* ->owner protects the wrong thing - it protects _code_ in methods, but
not the data structures.  Preventing rmmod might act as a proxy in some
cases, but only when objects are _not_ dynamically created/removed.

* what we want is to make sure that no new method calls might be issued
after procfs removal and that procfs removal will wait for method
calls already in progress to complete.  Again, the only way it's relevant
to rmmod is that procfs removal is often triggered by module_exit.
BTW, debugfs is seriously broken in that respect - it does protect the
module, all right, but have an object removed earlier and you are fucked.

* mmap is a big can of worms in that respect, mostly in terms of desired
semantics.  What should be done to existing mappings when object is removed?
Sure, no new ones should be allowed, that much is obvious.  But what should
happen to VMAs already there and what should happen to pages covered by
those?  IMO the effect of truncate() on shared file mappings is the best
approximation for what we want there, but there's a considerable amount of
PITA in the vm_operations_struct that needs to be sorted out first.  In
particular, ->open() and ->close() are often badly racy *and* there are
all kinds of bogisities with code making assumptions about impossibility of
getting several VMAs over the same object (fork() isn't the only thing that
could lead to that; munmap() the middle will do it as well and VM_DONTCOPY
won't prevent the latter).

* as it is, mmap() in procfs has _very_ limited use - /proc/bus/pci/*/*
is all there, it's never modular and it never uses any data structures
past ->mmap() anyway.  All pages are already mapped after that point,
->access() is the only possibly non-trivial method and even that won't
look at anything other than vma itself (no looking into ->vm_file, etc.).

Looks like you went and added another ->mmap() user.  I'm not at all sure
that procfs or debugfs are good places for that, but that depends on the
details of what you are trying to do.  _IF_ you really need it in procfs,
welcome to cleaning the vm_operations_struct out.  Won't be fun - I had
taken stabs at that several times, never got more than a few isolated
patches out of any of those attempts...  I have notes from the last such
work and if you are serious about digging through that crap, I can exhumate
those...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas Iooss Feb. 21, 2015, 11:52 a.m. UTC | #2
On 02/18/2015 04:14 AM, Al Viro wrote:
> On Wed, Feb 11, 2015 at 11:05:07AM +0800, Nicolas Iooss wrote:
>> Fix this by making proc_reg_open grab a reference to the module owning
>> pde->proc_fops.
> 
> NAK.
> 
> procfs doesn't pin modules while the file is opened, by design.  And
> it's not about to start.  Background:
> 
> * ->owner protects the wrong thing - it protects _code_ in methods, but
> not the data structures.  Preventing rmmod might act as a proxy in some
> cases, but only when objects are _not_ dynamically created/removed.

All right. I've found the relevant documentation for this [1]: what you
say is already well documented and I was wrongly assuming things based
on my experience with debugfs and character devices.  This bad
assumption came partly from the fact that do_dentry_open() takes a
reference to inode->i_fop->owner [2], which is only released in
__fput().  For character devices the file_operation structure is the one
given to register_chrdev() (thanks to a call to replace_fops() [3]) and
for debugfs files this structure is directly the one given to
debugfs_create_file() [4].  As a consequence the semantics of ->owner is
implicitly extended to protect the module which did register_chrdev() or
debugfs_create_file() from being unloaded while the file is used.  If
this extension is a side-effect which must not be generalized, I agree
with your point of view.

> * what we want is to make sure that no new method calls might be issued
> after procfs removal and that procfs removal will wait for method
> calls already in progress to complete.  Again, the only way it's relevant
> to rmmod is that procfs removal is often triggered by module_exit.
> BTW, debugfs is seriously broken in that respect - it does protect the
> module, all right, but have an object removed earlier and you are fucked.

Now I understand why commit 881adb853583 [5] was needed for procfs
instead of using try_module_get()/module_put().  Thank you for having
made this clear.

For the debugfs part, I though that debugfs_remove() did something
similar to unlink() for usual filesystems so that even if the file is
used, the inode is only unlinked from debugfs and the file is not
released.  It is of course not safe to free data which could be used by
the file without the file being released first.  Nevertheless I
considered that the scopes were made so that, while the file was being
used, the module which has registered the file couldn't be removed.
This is currently true for debugfs but not for procfs (by design).

> * mmap is a big can of worms in that respect, mostly in terms of desired
> semantics.  What should be done to existing mappings when object is removed?
> Sure, no new ones should be allowed, that much is obvious.  But what should
> happen to VMAs already there and what should happen to pages covered by
> those?  IMO the effect of truncate() on shared file mappings is the best
> approximation for what we want there, but there's a considerable amount of
> PITA in the vm_operations_struct that needs to be sorted out first.  In
> particular, ->open() and ->close() are often badly racy *and* there are
> all kinds of bogisities with code making assumptions about impossibility of
> getting several VMAs over the same object (fork() isn't the only thing that
> could lead to that; munmap() the middle will do it as well and VM_DONTCOPY
> won't prevent the latter).

I'm not familiar with mm/ code but vma->vm_file holds a reference to the
file which has been used by mmap(), which seems to mean that so long a
mapping exists, the file is guaranteed not to be released.  This
assumption is useful when the lifetime of the object used by mmap() can
be controlled, for example when it simply is allocated memory.  If this
lifetime can't be controlled and a mapped object is removed when
mappings still exist, I hope there is a way to trigger the vm_ops->fault
handler to give the module some control over resulting faults (I haven't
checked the code to see how drivers handle such a situation).  Of
course, this works only if the module providing vma->vm_ops and
vma->vm_ops->fault is still loaded in memory, which is not guaranteed as
the vma does not take a reference to the module.  This is currently not
an issue because modules which implement ->mmap() use filesystems where
"vma->vm_file = get_file(file)" (in mmap_region [6]) also prevents the
module from being unloaded.

Another possibility consists in making a module close every vma which
has a ->vm_ops in it when it is unloaded, which seems quite hard to do
in a safe way and useless for the modules which are currently in the
kernel tree.

> * as it is, mmap() in procfs has _very_ limited use - /proc/bus/pci/*/*
> is all there, it's never modular and it never uses any data structures
> past ->mmap() anyway.  All pages are already mapped after that point,
> ->access() is the only possibly non-trivial method and even that won't
> look at anything other than vma itself (no looking into ->vm_file, etc.).
> 
> Looks like you went and added another ->mmap() user.  I'm not at all sure
> that procfs or debugfs are good places for that, but that depends on the
> details of what you are trying to do.  _IF_ you really need it in procfs,
> welcome to cleaning the vm_operations_struct out.  Won't be fun - I had
> taken stabs at that several times, never got more than a few isolated
> patches out of any of those attempts...  I have notes from the last such
> work and if you are serious about digging through that crap, I can exhumate
> those...

I'm basically developing a project where I created an out-of-tree module
which uses mmap() as a kind of communication channel between a kernel
module and a userspace program (in short, a ring buffer).  The virtual
file which is used for mmap() was initially created in debugfs, but as
this filesystem is not always mounted on production systems, I needed
another one and ended up using procfs.  I did not consider creating a
character or block device in /dev as I wasn't at ease with grabbing a
major number which could potentially conflict with other out-of-tree
drivers.  As using procfs seems not to be fine, I guess I'll anyway use
a block device with a major number in the experimental range (240-254).

In short, the module I'm writing does not have to be in procfs, and I've
got funnier things to work on than cleaning up vm_operations_struct.

Thank you for your detailed reply.

Nicolas


[1]
https://www.kernel.org/doc/htmldocs/kernel-hacking/routines-module-use-counters.html
[2]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/open.c?id=v3.19#n716
[3]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/char_dev.c?id=v3.19#n407
[4]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/debugfs/inode.c?id=v3.19#n52
[5]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=881adb85358309ea9c6f707394002719982ec607
[6]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/mm/mmap.c?id=v3.19#n1625


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh Feb. 22, 2015, 3:46 p.m. UTC | #3
On 02/21/2015 01:52 PM, Nicolas Iooss wrote:
<>
> 
> I'm basically developing a project where I created an out-of-tree module
> which uses mmap() as a kind of communication channel between a kernel
> module and a userspace program (in short, a ring buffer).  The virtual
> file which is used for mmap() was initially created in debugfs, but as
> this filesystem is not always mounted on production systems, I needed
> another one and ended up using procfs.  I did not consider creating a
> character or block device in /dev as I wasn't at ease with grabbing a
> major number which could potentially conflict with other out-of-tree
> drivers.  As using procfs seems not to be fine, I guess I'll anyway use
> a block device with a major number in the experimental range (240-254).
> 
> In short, the module I'm writing does not have to be in procfs, and I've
> got funnier things to work on than cleaning up vm_operations_struct.
> 

I know that some in-tree drivers do similar things with shmem (tmpfs).
shmem even has an in-kernel API for drivers to create and lock pre-allocated
files, so when user-mode loads they already find what they need.

You should check it out it should have a much better mmap semantics
for you, since it is a real FS.

> Thank you for your detailed reply.
> 
> Nicolas
> 
<>

Good luck
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 8420a2f80811..5df17cb730fa 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -329,12 +329,16 @@  static int proc_reg_open(struct inode *inode, struct file *file)
 		kfree(pdeo);
 		return -ENOENT;
 	}
+	fops_get(pde->proc_fops);
 	open = pde->proc_fops->open;
 	release = pde->proc_fops->release;
 
 	if (open)
 		rv = open(inode, file);
 
+	if (rv != 0)
+		fops_put(pde->proc_fops);
+
 	if (rv == 0 && release) {
 		/* To know what to release. */
 		pdeo->file = file;
@@ -361,6 +365,7 @@  static int proc_reg_release(struct inode *inode, struct file *file)
 		}
 	}
 	spin_unlock(&pde->pde_unload_lock);
+	fops_put(pde->proc_fops);
 	return 0;
 }