diff mbox

[3/3] eventfd: add internal reference counting to fix notifier race conditions

Message ID alpine.DEB.1.10.0906201202170.25534@makko.or.mcafeemobile.com (mailing list archive)
State New, archived
Headers show

Commit Message

Davide Libenzi June 20, 2009, 9:17 p.m. UTC
On Fri, 19 Jun 2009, Gregory Haskins wrote:

> > In the POLLIN event, you schedule_work(&irqfd->inject) and there are no 
> > races there AFAICS (you basically do not care of anything eventfd memory 
> > related at all).
> > For POLLHUP, you do:
> >
> > 	spin_lock(irqfd->slock);
> > 	if (irqfd->wqh)
> > 		schedule_work(&irqfd->inject);
> > 	irqfd->wqh = NULL;
> > 	spin_unlock(irqfd->slock);
> >
> > In your work function you notice the POLLHUP condition and take proper 
> > action (dunno what it is in your case).
> > In your kvm_irqfd_release() function:
> >
> > 	spin_lock(irqfd->slock);
> > 	if (irqfd->wqh)
> > 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > 	irqfd->wqh = NULL;
> > 	spin_unlock(irqfd->slock);
> >
> > Any races in there?
> >   
> 
> Yes, for one you have an ABBA deadlock on the irqfd->slock and wqh->lock
> (recall wqh has to be locked to fix that other race I mentioned).

Yep, true. How about we go in little steps?
How about the one below?
When you register your poll callback, you get a reference with 
eventfd_refget().
At that point we have no more worries about the eventfd memory (namely 
"wqh") going away.
Symmetrically, you use eventfd_refput() once you done with it.
So we basically de-couple the internal VFS refcount, from the eventfd 
context memory refcount.
Where are the non-solveable races on the IRQfd side, of an approach like 
this one?



> Yes, understood.  What I was trying to gently say is that the one-liner
> proposal alone is still broken afaict.  However, if there is another
> solution that works that you like better than 133-liner I posted, I am
> more than willing to help analyze it.  In the end, I only care that this
> is fixed.

Is it so?
What I noticed here, is that you went from "Detailed Mode" (in the emails 
where you were pushing the new bits), to "Hint Mode" (as below) when it 
was time to see if there were simpler solutions to the problem.

> (As a hint, I think I fixed 4-5 races with these patches, so there are a
> few others still lurking as well)

Not knowing the details of IRQfd, hinting only is not going to help 
anything in this case.



- Davide



---
 fs/eventfd.c            |   37 ++++++++++++++++++++++++++++++++++++-
 include/linux/eventfd.h |    6 ++++++
 2 files changed, 42 insertions(+), 1 deletion(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Gregory Haskins June 21, 2009, 1:05 a.m. UTC | #1
Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
>   
>>> In the POLLIN event, you schedule_work(&irqfd->inject) and there are no 
>>> races there AFAICS (you basically do not care of anything eventfd memory 
>>> related at all).
>>> For POLLHUP, you do:
>>>
>>> 	spin_lock(irqfd->slock);
>>> 	if (irqfd->wqh)
>>> 		schedule_work(&irqfd->inject);
>>> 	irqfd->wqh = NULL;
>>> 	spin_unlock(irqfd->slock);
>>>
>>> In your work function you notice the POLLHUP condition and take proper 
>>> action (dunno what it is in your case).
>>> In your kvm_irqfd_release() function:
>>>
>>> 	spin_lock(irqfd->slock);
>>> 	if (irqfd->wqh)
>>> 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>> 	irqfd->wqh = NULL;
>>> 	spin_unlock(irqfd->slock);
>>>
>>> Any races in there?
>>>   
>>>       
>> Yes, for one you have an ABBA deadlock on the irqfd->slock and wqh->lock
>> (recall wqh has to be locked to fix that other race I mentioned).
>>     
>
> Yep, true. How about we go in little steps?
> How about the one below?
> When you register your poll callback, you get a reference with 
> eventfd_refget().
>   

I was thinking about this last night/this morning and that was
essentially what I was going to propose next as a more palatable+simpler
solution.

> At that point we have no more worries about the eventfd memory (namely 
> "wqh") going away.
> Symmetrically, you use eventfd_refput() once you done with it.
> So we basically de-couple the internal VFS refcount, from the eventfd 
> context memory refcount.
> Where are the non-solveable races on the IRQfd side, of an approach like 
> this one?
>   
None... I think that can work.  The core requirements boil down to: 1)
locking the POLLHUP (as we already discussed), and 2) holding a kref (or
equivelent) to the eventfd_ctx that the client can release once they are
fully out.  After that, it looks like any vanilla wait-queue in the kernel.


>
>
>   
>> Yes, understood.  What I was trying to gently say is that the one-liner
>> proposal alone is still broken afaict.  However, if there is another
>> solution that works that you like better than 133-liner I posted, I am
>> more than willing to help analyze it.  In the end, I only care that this
>> is fixed.
>>     
>
> Is it so?
> What I noticed here, is that you went from "Detailed Mode" (in the emails 
> where you were pushing the new bits), to "Hint Mode" (as below) when it 
> was time to see if there were simpler solutions to the problem.
>   

I wasn't being purposely coy.  It was late friday and I was getting
pressure from the wife to get off the computer ;)  Plus, all the details
were in my patch, so I figured there wasn't a giant need to rehash since
it could just be gleened.  In any case, based on this current reply I
think we are on the same page now.

>   
>> (As a hint, I think I fixed 4-5 races with these patches, so there are a
>> few others still lurking as well)
>>     
>
> Not knowing the details of IRQfd, hinting only is not going to help 
> anything in this case.
>
>
>
> - Davide
>
>
>
> ---
>  fs/eventfd.c            |   37 ++++++++++++++++++++++++++++++++++++-
>  include/linux/eventfd.h |    6 ++++++
>  2 files changed, 42 insertions(+), 1 deletion(-)
>   

I haven't had a chance to go over the patch below in detail, but the
30,000 foot view is that it is in line with what I was thinking this
morning.  So, its a preliminary "looks good".  I'll take a closer look ASAP

Thanks Davide,
-Greg

> Index: linux-2.6.mod/fs/eventfd.c
> ===================================================================
> --- linux-2.6.mod.orig/fs/eventfd.c	2009-06-20 13:08:22.000000000 -0700
> +++ linux-2.6.mod/fs/eventfd.c	2009-06-20 14:00:23.000000000 -0700
> @@ -17,8 +17,10 @@
>  #include <linux/eventfd.h>
>  #include <linux/syscalls.h>
>  #include <linux/module.h>
> +#include <linux/kref.h>
>  
>  struct eventfd_ctx {
> +	struct kref kref;
>  	wait_queue_head_t wqh;
>  	/*
>  	 * Every time that a write(2) is performed on an eventfd, the
> @@ -59,9 +61,19 @@ int eventfd_signal(struct file *file, in
>  }
>  EXPORT_SYMBOL_GPL(eventfd_signal);
>  
> +static void eventfd_free(struct kref *kref)
> +{
> +	struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
> +
> +	kfree(ctx);
> +}
> +
>  static int eventfd_release(struct inode *inode, struct file *file)
>  {
> -	kfree(file->private_data);
> +	struct eventfd_ctx *ctx = file->private_data;
> +
> +	wake_up_poll(&ctx->wqh, POLLHUP);
> +	kref_put(&ctx->kref, eventfd_free);
>  	return 0;
>  }
>  
> @@ -201,6 +213,28 @@ struct file *eventfd_fget(int fd)
>  }
>  EXPORT_SYMBOL_GPL(eventfd_fget);
>  
> +int eventfd_refget(struct file *file)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +
> +	if (file->f_op != &eventfd_fops)
> +		return -EINVAL;
> +	kref_get(&ctx->kref);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(eventfd_refget);
> +
> +int eventfd_refput(struct file *file)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +
> +	if (file->f_op != &eventfd_fops)
> +		return -EINVAL;
> +	kref_put(&ctx->kref, eventfd_free);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(eventfd_refput);
> +
>  SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
>  {
>  	int fd;
> @@ -217,6 +251,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, 
>  	if (!ctx)
>  		return -ENOMEM;
>  
> +	kref_init(&ctx->kref);
>  	init_waitqueue_head(&ctx->wqh);
>  	ctx->count = count;
>  	ctx->flags = flags;
> Index: linux-2.6.mod/include/linux/eventfd.h
> ===================================================================
> --- linux-2.6.mod.orig/include/linux/eventfd.h	2009-06-20 13:08:22.000000000 -0700
> +++ linux-2.6.mod/include/linux/eventfd.h	2009-06-20 13:59:09.000000000 -0700
> @@ -29,12 +29,18 @@
>  
>  struct file *eventfd_fget(int fd);
>  int eventfd_signal(struct file *file, int n);
> +int eventfd_refget(struct file *file);
> +int eventfd_refput(struct file *file);
>  
>  #else /* CONFIG_EVENTFD */
>  
>  #define eventfd_fget(fd) ERR_PTR(-ENOSYS)
>  static inline int eventfd_signal(struct file *file, int n)
>  { return 0; }
> +static inline int eventfd_refget(struct file *file)
> +{ return -ENOSYS; }
> +static inline int eventfd_refput(struct file *file)
> +{ return -ENOSYS; }
>  
>  #endif /* CONFIG_EVENTFD */
>  
>
diff mbox

Patch

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c	2009-06-20 13:08:22.000000000 -0700
+++ linux-2.6.mod/fs/eventfd.c	2009-06-20 14:00:23.000000000 -0700
@@ -17,8 +17,10 @@ 
 #include <linux/eventfd.h>
 #include <linux/syscalls.h>
 #include <linux/module.h>
+#include <linux/kref.h>
 
 struct eventfd_ctx {
+	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
 	 * Every time that a write(2) is performed on an eventfd, the
@@ -59,9 +61,19 @@  int eventfd_signal(struct file *file, in
 }
 EXPORT_SYMBOL_GPL(eventfd_signal);
 
+static void eventfd_free(struct kref *kref)
+{
+	struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+	kfree(ctx);
+}
+
 static int eventfd_release(struct inode *inode, struct file *file)
 {
-	kfree(file->private_data);
+	struct eventfd_ctx *ctx = file->private_data;
+
+	wake_up_poll(&ctx->wqh, POLLHUP);
+	kref_put(&ctx->kref, eventfd_free);
 	return 0;
 }
 
@@ -201,6 +213,28 @@  struct file *eventfd_fget(int fd)
 }
 EXPORT_SYMBOL_GPL(eventfd_fget);
 
+int eventfd_refget(struct file *file)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+
+	if (file->f_op != &eventfd_fops)
+		return -EINVAL;
+	kref_get(&ctx->kref);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_refget);
+
+int eventfd_refput(struct file *file)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+
+	if (file->f_op != &eventfd_fops)
+		return -EINVAL;
+	kref_put(&ctx->kref, eventfd_free);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_refput);
+
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
 	int fd;
@@ -217,6 +251,7 @@  SYSCALL_DEFINE2(eventfd2, unsigned int, 
 	if (!ctx)
 		return -ENOMEM;
 
+	kref_init(&ctx->kref);
 	init_waitqueue_head(&ctx->wqh);
 	ctx->count = count;
 	ctx->flags = flags;
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h	2009-06-20 13:08:22.000000000 -0700
+++ linux-2.6.mod/include/linux/eventfd.h	2009-06-20 13:59:09.000000000 -0700
@@ -29,12 +29,18 @@ 
 
 struct file *eventfd_fget(int fd);
 int eventfd_signal(struct file *file, int n);
+int eventfd_refget(struct file *file);
+int eventfd_refput(struct file *file);
 
 #else /* CONFIG_EVENTFD */
 
 #define eventfd_fget(fd) ERR_PTR(-ENOSYS)
 static inline int eventfd_signal(struct file *file, int n)
 { return 0; }
+static inline int eventfd_refget(struct file *file)
+{ return -ENOSYS; }
+static inline int eventfd_refput(struct file *file)
+{ return -ENOSYS; }
 
 #endif /* CONFIG_EVENTFD */