diff mbox

[v2,2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

Message ID 4A35C269.7050209@novell.com (mailing list archive)
State New, archived
Headers show

Commit Message

Gregory Haskins June 15, 2009, 3:39 a.m. UTC
Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> +static void
>>>> +irqfd_disconnect(struct _irqfd *irqfd)
>>>> +{
>>>> +	struct kvm *kvm;
>>>> +
>>>> +	mutex_lock(&irqfd->lock);
>>>> +
>>>> +	kvm = rcu_dereference(irqfd->kvm);
>>>> +	rcu_assign_pointer(irqfd->kvm, NULL);
>>>> +
>>>> +	mutex_unlock(&irqfd->lock);
>>>> +
>>>> +	if (!kvm)
>>>> +		return;
>>>>  
>>>>  	mutex_lock(&kvm->lock);
>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>> +	list_del(&irqfd->list);
>>>>  	mutex_unlock(&kvm->lock);
>>>> +
>>>> +	/*
>>>> +	 * It is important to not drop the kvm reference until the next grace
>>>> +	 * period because there might be lockless references in flight up
>>>> +	 * until then
>>>> +	 */
>>>> +	synchronize_srcu(&irqfd->srcu);
>>>> +	kvm_put_kvm(kvm);
>>>>  }
>>>>     
>>>>         
>>> So irqfd object will persist after kvm goes away, until eventfd is closed?
>>>   
>>>       
>> Yep, by design.  It becomes part of the eventfd and is thus associated
>> with its lifetime.  Consider it as if we made our own anon-fd
>> implementation for irqfd and the lifetime looks similar.  The difference
>> is that we are reusing eventfd and its interface semantics.
>>     
>>>   
>>>       
>>>>  
>>>>  static int
>>>>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>>>  {
>>>>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>>>> +	unsigned long flags = (unsigned long)key;
>>>>  
>>>> -	/*
>>>> -	 * The wake_up is called with interrupts disabled.  Therefore we need
>>>> -	 * to defer the IRQ injection until later since we need to acquire the
>>>> -	 * kvm->lock to do so.
>>>> -	 */
>>>> -	schedule_work(&irqfd->work);
>>>> +	if (flags & POLLIN)
>>>> +		/*
>>>> +		 * The POLLIN wake_up is called with interrupts disabled.
>>>> +		 * Therefore we need to defer the IRQ injection until later
>>>> +		 * since we need to acquire the kvm->lock to do so.
>>>> +		 */
>>>> +		schedule_work(&irqfd->inject);
>>>> +
>>>> +	if (flags & POLLHUP) {
>>>> +		/*
>>>> +		 * The POLLHUP is called unlocked, so it theoretically should
>>>> +		 * be safe to remove ourselves from the wqh using the locked
>>>> +		 * variant of remove_wait_queue()
>>>> +		 */
>>>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>> +		flush_work(&irqfd->inject);
>>>> +		irqfd_disconnect(irqfd);
>>>> +
>>>> +		cleanup_srcu_struct(&irqfd->srcu);
>>>> +		kfree(irqfd);
>>>> +	}
>>>>  
>>>>  	return 0;
>>>>  }
>>>>     
>>>>         
>>> And it is removed by this function when eventfd is closed.
>>> But what prevents the kvm module from going away, meanwhile?
>>>   
>>>       
>> Well, we hold a reference to struct kvm until we call
>> irqfd_disconnect().  If kvm closes first, we disconnect and disassociate
>> all references to kvm leaving irqfd->kvm = NULL.  Likewise, if irqfd
>> closes first, we disassociate with kvm with the above quoted logic.  In
>> either case, we are holding a kvm reference up until that "disconnect"
>> point.  Therefore kvm should not be able to disappear before that
>> disconnect, and after that point we do not care.
>>     
>
> Yes, we do care.
>
> Here's the scenario in more detail:
>
> - kvm is closed
> - irq disconnect is called
> - kvm is put
> - kvm module is removed: all irqs are disconnected
> - eventfd closes and triggers callback into removed kvm module
> - crash
>   

[ lightbulb turns on]

Ah, now I see the point you were making.  I thought you were talking
about the .text in kvm_set_irq() (which would be protected by my
kvm_get_kvm() reference afaict).  But you are actually talking about the
irqfd .text itself.  Indeed, you are correct that is this currently a
race.  Good catch!

>   
>> If that is not sufficient to prevent kvm.ko from going away in the
>> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I
>> believe everything is actually ok here.
>>
>> -Greg
>>
>>     
>
>
> BTW, why can't we remove irqfds in kvm_release?
>   

Well, this would be ideal but we run into that bi-directional reference
thing that we talked about earlier and we both agree is non-trivial to
solve.  Solving this locking problem would incidentally also pave the
way for restoring the DEASSIGN feature, so patches welcome!  In the
meantime, I think we can close the hole you found with the following
patch (build-tested only):

commit f3a8dccc9e815599438e9feb0ea53e8eb10ad2b3
Author: Gregory Haskins <ghaskins@novell.com>
Date:   Sun Jun 14 23:37:49 2009 -0400

    KVM: make irqfd take kvm.ko module reference
   
    Michael Tsirkin pointed out that we currently have a race between
someone
    holding an irqfd reference and an rmmod against kvm.ko.  This patch
closes
    that hole by making sure that irqfd holds a kvm.ko reference for its
lifetime.
   
    Found-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Gregory Haskins <ghaskins@novell.com>

 
        return 0;
@@ -176,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
        if (ret < 0)
                goto fail;
 
+       __module_get(THIS_MODULE);
        kvm_get_kvm(kvm);
 
        mutex_lock(&kvm->lock);

Comments

Michael S. Tsirkin June 15, 2009, 9:46 a.m. UTC | #1
On Sun, Jun 14, 2009 at 11:39:21PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote:
> >   
> >> Michael S. Tsirkin wrote:
> >>     
> >>> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
> >>>   
> >>>       
> >>>> +static void
> >>>> +irqfd_disconnect(struct _irqfd *irqfd)
> >>>> +{
> >>>> +	struct kvm *kvm;
> >>>> +
> >>>> +	mutex_lock(&irqfd->lock);
> >>>> +
> >>>> +	kvm = rcu_dereference(irqfd->kvm);
> >>>> +	rcu_assign_pointer(irqfd->kvm, NULL);
> >>>> +
> >>>> +	mutex_unlock(&irqfd->lock);
> >>>> +
> >>>> +	if (!kvm)
> >>>> +		return;
> >>>>  
> >>>>  	mutex_lock(&kvm->lock);
> >>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >>>> +	list_del(&irqfd->list);
> >>>>  	mutex_unlock(&kvm->lock);
> >>>> +
> >>>> +	/*
> >>>> +	 * It is important to not drop the kvm reference until the next grace
> >>>> +	 * period because there might be lockless references in flight up
> >>>> +	 * until then
> >>>> +	 */
> >>>> +	synchronize_srcu(&irqfd->srcu);
> >>>> +	kvm_put_kvm(kvm);
> >>>>  }
> >>>>     
> >>>>         
> >>> So irqfd object will persist after kvm goes away, until eventfd is closed?
> >>>   
> >>>       
> >> Yep, by design.  It becomes part of the eventfd and is thus associated
> >> with its lifetime.  Consider it as if we made our own anon-fd
> >> implementation for irqfd and the lifetime looks similar.  The difference
> >> is that we are reusing eventfd and its interface semantics.
> >>     
> >>>   
> >>>       
> >>>>  
> >>>>  static int
> >>>>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >>>>  {
> >>>>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
> >>>> +	unsigned long flags = (unsigned long)key;
> >>>>  
> >>>> -	/*
> >>>> -	 * The wake_up is called with interrupts disabled.  Therefore we need
> >>>> -	 * to defer the IRQ injection until later since we need to acquire the
> >>>> -	 * kvm->lock to do so.
> >>>> -	 */
> >>>> -	schedule_work(&irqfd->work);
> >>>> +	if (flags & POLLIN)
> >>>> +		/*
> >>>> +		 * The POLLIN wake_up is called with interrupts disabled.
> >>>> +		 * Therefore we need to defer the IRQ injection until later
> >>>> +		 * since we need to acquire the kvm->lock to do so.
> >>>> +		 */
> >>>> +		schedule_work(&irqfd->inject);
> >>>> +
> >>>> +	if (flags & POLLHUP) {
> >>>> +		/*
> >>>> +		 * The POLLHUP is called unlocked, so it theoretically should
> >>>> +		 * be safe to remove ourselves from the wqh using the locked
> >>>> +		 * variant of remove_wait_queue()
> >>>> +		 */
> >>>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >>>> +		flush_work(&irqfd->inject);
> >>>> +		irqfd_disconnect(irqfd);
> >>>> +
> >>>> +		cleanup_srcu_struct(&irqfd->srcu);
> >>>> +		kfree(irqfd);
> >>>> +	}
> >>>>  
> >>>>  	return 0;
> >>>>  }
> >>>>     
> >>>>         
> >>> And it is removed by this function when eventfd is closed.
> >>> But what prevents the kvm module from going away, meanwhile?
> >>>   
> >>>       
> >> Well, we hold a reference to struct kvm until we call
> >> irqfd_disconnect().  If kvm closes first, we disconnect and disassociate
> >> all references to kvm leaving irqfd->kvm = NULL.  Likewise, if irqfd
> >> closes first, we disassociate with kvm with the above quoted logic.  In
> >> either case, we are holding a kvm reference up until that "disconnect"
> >> point.  Therefore kvm should not be able to disappear before that
> >> disconnect, and after that point we do not care.
> >>     
> >
> > Yes, we do care.
> >
> > Here's the scenario in more detail:
> >
> > - kvm is closed
> > - irq disconnect is called
> > - kvm is put
> > - kvm module is removed: all irqs are disconnected
> > - eventfd closes and triggers callback into removed kvm module
> > - crash
> >   
> 
> [ lightbulb turns on]
> 
> Ah, now I see the point you were making.  I thought you were talking
> about the .text in kvm_set_irq() (which would be protected by my
> kvm_get_kvm() reference afaict).  But you are actually talking about the
> irqfd .text itself.  Indeed, you are correct that is this currently a
> race.  Good catch!
> 
> >   
> >> If that is not sufficient to prevent kvm.ko from going away in the
> >> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I
> >> believe everything is actually ok here.
> >>
> >> -Greg
> >>
> >>     
> >
> >
> > BTW, why can't we remove irqfds in kvm_release?
> >   
> 
> Well, this would be ideal but we run into that bi-directional reference
> thing that we talked about earlier and we both agree is non-trivial to
> solve.  Solving this locking problem would incidentally also pave the
> way for restoring the DEASSIGN feature, so patches welcome!

So far the only workable approach that I see is reverting the POLLHUP
patch. I agree it looks pretty, but DEASSIGN and closing the races is
more important IMO. And locking will definitely become much simpler.

> In the meantime, I think we can close the hole you found with the
> following patch (build-tested only):
> 
> commit f3a8dccc9e815599438e9feb0ea53e8eb10ad2b3
> Author: Gregory Haskins <ghaskins@novell.com>
> Date:   Sun Jun 14 23:37:49 2009 -0400
> 
>     KVM: make irqfd take kvm.ko module reference
>    
>     Michael Tsirkin pointed out that we currently have a race between someone
>     holding an irqfd reference and an rmmod against kvm.ko.  This patch closes
>     that hole by making sure that irqfd holds a kvm.ko reference for its lifetime.
>    
>     Found-by: Michael S. Tsirkin <mst@redhat.com>
>     Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> 
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 2c8028c..67e4eca 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -29,6 +29,7 @@
>  #include <linux/list.h>
>  #include <linux/eventfd.h>
>  #include <linux/srcu.h>
> +#include <linux/module.h>
>  
>  /*
>   * --------------------------------------------------------------------
> @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int
> sync, void
>  *key)
>  
>                 cleanup_srcu_struct(&irqfd->srcu);
>                 kfree(irqfd);
> +               module_put(THIS_MODULE);
>         }
>  
>         return 0;

module_put(THIS_MODULE) is always a bug unless you know that someone has
a reference to the current module: the module could go away between this
call and returning from function.

> @@ -176,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>         if (ret < 0)
>                 goto fail;
>  
> +       __module_get(THIS_MODULE);
>         kvm_get_kvm(kvm);
>  
>         mutex_lock(&kvm->lock);
> 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 15, 2009, 12:08 p.m. UTC | #2
Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2009 at 11:39:21PM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> Michael S. Tsirkin wrote:
>>>>     
>>>>         
>>>>> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> +static void
>>>>>> +irqfd_disconnect(struct _irqfd *irqfd)
>>>>>> +{
>>>>>> +	struct kvm *kvm;
>>>>>> +
>>>>>> +	mutex_lock(&irqfd->lock);
>>>>>> +
>>>>>> +	kvm = rcu_dereference(irqfd->kvm);
>>>>>> +	rcu_assign_pointer(irqfd->kvm, NULL);
>>>>>> +
>>>>>> +	mutex_unlock(&irqfd->lock);
>>>>>> +
>>>>>> +	if (!kvm)
>>>>>> +		return;
>>>>>>  
>>>>>>  	mutex_lock(&kvm->lock);
>>>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>>>> +	list_del(&irqfd->list);
>>>>>>  	mutex_unlock(&kvm->lock);
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * It is important to not drop the kvm reference until the next grace
>>>>>> +	 * period because there might be lockless references in flight up
>>>>>> +	 * until then
>>>>>> +	 */
>>>>>> +	synchronize_srcu(&irqfd->srcu);
>>>>>> +	kvm_put_kvm(kvm);
>>>>>>  }
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> So irqfd object will persist after kvm goes away, until eventfd is closed?
>>>>>   
>>>>>       
>>>>>           
>>>> Yep, by design.  It becomes part of the eventfd and is thus associated
>>>> with its lifetime.  Consider it as if we made our own anon-fd
>>>> implementation for irqfd and the lifetime looks similar.  The difference
>>>> is that we are reusing eventfd and its interface semantics.
>>>>     
>>>>         
>>>>>   
>>>>>       
>>>>>           
>>>>>>  
>>>>>>  static int
>>>>>>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>>>>>  {
>>>>>>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>>>>>> +	unsigned long flags = (unsigned long)key;
>>>>>>  
>>>>>> -	/*
>>>>>> -	 * The wake_up is called with interrupts disabled.  Therefore we need
>>>>>> -	 * to defer the IRQ injection until later since we need to acquire the
>>>>>> -	 * kvm->lock to do so.
>>>>>> -	 */
>>>>>> -	schedule_work(&irqfd->work);
>>>>>> +	if (flags & POLLIN)
>>>>>> +		/*
>>>>>> +		 * The POLLIN wake_up is called with interrupts disabled.
>>>>>> +		 * Therefore we need to defer the IRQ injection until later
>>>>>> +		 * since we need to acquire the kvm->lock to do so.
>>>>>> +		 */
>>>>>> +		schedule_work(&irqfd->inject);
>>>>>> +
>>>>>> +	if (flags & POLLHUP) {
>>>>>> +		/*
>>>>>> +		 * The POLLHUP is called unlocked, so it theoretically should
>>>>>> +		 * be safe to remove ourselves from the wqh using the locked
>>>>>> +		 * variant of remove_wait_queue()
>>>>>> +		 */
>>>>>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>>>> +		flush_work(&irqfd->inject);
>>>>>> +		irqfd_disconnect(irqfd);
>>>>>> +
>>>>>> +		cleanup_srcu_struct(&irqfd->srcu);
>>>>>> +		kfree(irqfd);
>>>>>> +	}
>>>>>>  
>>>>>>  	return 0;
>>>>>>  }
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> And it is removed by this function when eventfd is closed.
>>>>> But what prevents the kvm module from going away, meanwhile?
>>>>>   
>>>>>       
>>>>>           
>>>> Well, we hold a reference to struct kvm until we call
>>>> irqfd_disconnect().  If kvm closes first, we disconnect and disassociate
>>>> all references to kvm leaving irqfd->kvm = NULL.  Likewise, if irqfd
>>>> closes first, we disassociate with kvm with the above quoted logic.  In
>>>> either case, we are holding a kvm reference up until that "disconnect"
>>>> point.  Therefore kvm should not be able to disappear before that
>>>> disconnect, and after that point we do not care.
>>>>     
>>>>         
>>> Yes, we do care.
>>>
>>> Here's the scenario in more detail:
>>>
>>> - kvm is closed
>>> - irq disconnect is called
>>> - kvm is put
>>> - kvm module is removed: all irqs are disconnected
>>> - eventfd closes and triggers callback into removed kvm module
>>> - crash
>>>   
>>>       
>> [ lightbulb turns on]
>>
>> Ah, now I see the point you were making.  I thought you were talking
>> about the .text in kvm_set_irq() (which would be protected by my
>> kvm_get_kvm() reference afaict).  But you are actually talking about the
>> irqfd .text itself.  Indeed, you are correct that is this currently a
>> race.  Good catch!
>>
>>     
>>>   
>>>       
>>>> If that is not sufficient to prevent kvm.ko from going away in the
>>>> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I
>>>> believe everything is actually ok here.
>>>>
>>>> -Greg
>>>>
>>>>     
>>>>         
>>> BTW, why can't we remove irqfds in kvm_release?
>>>   
>>>       
>> Well, this would be ideal but we run into that bi-directional reference
>> thing that we talked about earlier and we both agree is non-trivial to
>> solve.  Solving this locking problem would incidentally also pave the
>> way for restoring the DEASSIGN feature, so patches welcome!
>>     
>
> So far the only workable approach that I see is reverting the POLLHUP
> patch. I agree it looks pretty, but DEASSIGN and closing the races is
> more important IMO. And locking will definitely become much simpler.
>
>   
>> In the meantime, I think we can close the hole you found with the
>> following patch (build-tested only):
>>
>> commit f3a8dccc9e815599438e9feb0ea53e8eb10ad2b3
>> Author: Gregory Haskins <ghaskins@novell.com>
>> Date:   Sun Jun 14 23:37:49 2009 -0400
>>
>>     KVM: make irqfd take kvm.ko module reference
>>    
>>     Michael Tsirkin pointed out that we currently have a race between someone
>>     holding an irqfd reference and an rmmod against kvm.ko.  This patch closes
>>     that hole by making sure that irqfd holds a kvm.ko reference for its lifetime.
>>    
>>     Found-by: Michael S. Tsirkin <mst@redhat.com>
>>     Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 2c8028c..67e4eca 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -29,6 +29,7 @@
>>  #include <linux/list.h>
>>  #include <linux/eventfd.h>
>>  #include <linux/srcu.h>
>> +#include <linux/module.h>
>>  
>>  /*
>>   * --------------------------------------------------------------------
>> @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int
>> sync, void
>>  *key)
>>  
>>                 cleanup_srcu_struct(&irqfd->srcu);
>>                 kfree(irqfd);
>> +               module_put(THIS_MODULE);
>>         }
>>  
>>         return 0;
>>     
>
> module_put(THIS_MODULE) is always a bug unless you know that someone has
> a reference to the current module: the module could go away between this
> call and returning from function.
>   

Hmm.  I understand what you are saying conceptually (i.e. the .text
could get yanked before we hit the next line of code, in this case the
"return 0").  However, holding a reference when you _know_ someone else
holds a reference to me says that one of the references is redundant. 
In addition, there is certainly plenty of precedence for
module_put(THIS_MODULE) all throughout the kernel (including
module_put_and_exit()).  Are those broken as well?

In any case, one of the patches I have in queue to push to Davide for
eventfd may provide a good solution to this problem as well, so I will
get that polished up today.

Thanks Michael,
-Greg

>   
>> @@ -176,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>         if (ret < 0)
>>                 goto fail;
>>  
>> +       __module_get(THIS_MODULE);
>>         kvm_get_kvm(kvm);
>>  
>>         mutex_lock(&kvm->lock);
>>
>>
>>
>>     
>
>
>
Michael S. Tsirkin June 15, 2009, 12:54 p.m. UTC | #3
On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
> >> @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int
> >> sync, void
> >>  *key)
> >>  
> >>                 cleanup_srcu_struct(&irqfd->srcu);
> >>                 kfree(irqfd);
> >> +               module_put(THIS_MODULE);
> >>         }
> >>  
> >>         return 0;
> >>     
> >
> > module_put(THIS_MODULE) is always a bug unless you know that someone has
> > a reference to the current module: the module could go away between this
> > call and returning from function.
> >   
> 
> Hmm.  I understand what you are saying conceptually (i.e. the .text
> could get yanked before we hit the next line of code, in this case the
> "return 0").  However, holding a reference when you _know_ someone else
> holds a reference to me says that one of the references is redundant. 
> In addition, there is certainly plenty of precedence for
> module_put(THIS_MODULE) all throughout the kernel (including
> module_put_and_exit()).  Are those broken as well?

Maybe not, but I don't know why. It works fine as long as you don't
unload any modules though :) Rusty, could you enlighten us please?
Rusty Russell June 18, 2009, 5:16 a.m. UTC | #4
On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote:
> On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
> > Hmm.  I understand what you are saying conceptually (i.e. the .text
> > could get yanked before we hit the next line of code, in this case the
> > "return 0").  However, holding a reference when you _know_ someone else
> > holds a reference to me says that one of the references is redundant.
> > In addition, there is certainly plenty of precedence for
> > module_put(THIS_MODULE) all throughout the kernel (including
> > module_put_and_exit()).  Are those broken as well?
>
> Maybe not, but I don't know why. It works fine as long as you don't
> unload any modules though :) Rusty, could you enlighten us please?

Yep, they're almost all broken.  A few have comments indicating that someone 
else is holding a reference (eg. loopback).

But at some point you give up playing whack-a-mole for random drivers.

module_put_and_exit() does *not* have this problem, BTW.

Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 18, 2009, 6:49 a.m. UTC | #5
On Thu, Jun 18, 2009 at 02:46:30PM +0930, Rusty Russell wrote:
> On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote:
> > On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
> > > Hmm.  I understand what you are saying conceptually (i.e. the .text
> > > could get yanked before we hit the next line of code, in this case the
> > > "return 0").  However, holding a reference when you _know_ someone else
> > > holds a reference to me says that one of the references is redundant.
> > > In addition, there is certainly plenty of precedence for
> > > module_put(THIS_MODULE) all throughout the kernel (including
> > > module_put_and_exit()).  Are those broken as well?
> >
> > Maybe not, but I don't know why. It works fine as long as you don't
> > unload any modules though :) Rusty, could you enlighten us please?
> 
> Yep, they're almost all broken.  A few have comments indicating that someone 
> else is holding a reference (eg. loopback).
> 
> But at some point you give up playing whack-a-mole for random drivers.
> 
> module_put_and_exit() does *not* have this problem, BTW.
> 
> Rusty.

I see that, the .text for module_put_and_exit is never modular itself.
Thanks, Rusty!

BTW, Gregory, this can be used to fix the race in the design: create a
thread and let it drop the module reference with module_put_and_exit.
Which will work, but I guess at this point we should ask ourselves
whether all the hearburn with srcu, threads and module references is
better than just asking the user to call and ioctl.
Gregory Haskins June 18, 2009, noon UTC | #6
Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 02:46:30PM +0930, Rusty Russell wrote:
>   
>> On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote:
>>     
>>> On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
>>>       
>>>> Hmm.  I understand what you are saying conceptually (i.e. the .text
>>>> could get yanked before we hit the next line of code, in this case the
>>>> "return 0").  However, holding a reference when you _know_ someone else
>>>> holds a reference to me says that one of the references is redundant.
>>>> In addition, there is certainly plenty of precedence for
>>>> module_put(THIS_MODULE) all throughout the kernel (including
>>>> module_put_and_exit()).  Are those broken as well?
>>>>         
>>> Maybe not, but I don't know why. It works fine as long as you don't
>>> unload any modules though :) Rusty, could you enlighten us please?
>>>       
>> Yep, they're almost all broken.  A few have comments indicating that someone 
>> else is holding a reference (eg. loopback).
>>
>> But at some point you give up playing whack-a-mole for random drivers.
>>
>> module_put_and_exit() does *not* have this problem, BTW.
>>
>> Rusty.
>>     
>
> I see that, the .text for module_put_and_exit is never modular itself.
> Thanks, Rusty!
>   

Ah!  That is the trick I wasn't understanding.
> BTW, Gregory, this can be used to fix the race in the design: create a
> thread and let it drop the module reference with module_put_and_exit.
>   

I had thought of doing something like this initially too, but I think
its racy as well.  Ultimately, you need to make sure the eventfd
callback is completely out before its safe to run, and deferring to a
thread would not change this race.  The only sane way I can see to do
that is to have the caller infrastructure annotate the event somehow
(either directly with a module_put(), or indirectly with some kind of
state transition that can be tracked with something like
synchronize_sched().
> Which will work, but I guess at this point we should ask ourselves
> whether all the hearburn with srcu, threads and module references is
> better than just asking the user to call and ioctl.
>   

I am starting to agree with you, here. :)

Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
conversation re: the module_put() races.  I only tied it into the
current thread because the eventfd_notifier_register() thread gave me a
convenient way to hook some other context to do the module_put().  In
the long term, the srcu changes are for the can_sleep() stuff.  So on
that note, lets see if I can convince Davide that the srcu stuff is not
so evil before we revert the POLLHUP patches, since the module_put() fix
is trivial once that is in place.

Thanks Michael (and Rusty),
-Greg
Michael S. Tsirkin June 18, 2009, 12:22 p.m. UTC | #7
On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2009 at 02:46:30PM +0930, Rusty Russell wrote:
> >   
> >> On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote:
> >>     
> >>> On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
> >>>       
> >>>> Hmm.  I understand what you are saying conceptually (i.e. the .text
> >>>> could get yanked before we hit the next line of code, in this case the
> >>>> "return 0").  However, holding a reference when you _know_ someone else
> >>>> holds a reference to me says that one of the references is redundant.
> >>>> In addition, there is certainly plenty of precedence for
> >>>> module_put(THIS_MODULE) all throughout the kernel (including
> >>>> module_put_and_exit()).  Are those broken as well?
> >>>>         
> >>> Maybe not, but I don't know why. It works fine as long as you don't
> >>> unload any modules though :) Rusty, could you enlighten us please?
> >>>       
> >> Yep, they're almost all broken.  A few have comments indicating that someone 
> >> else is holding a reference (eg. loopback).
> >>
> >> But at some point you give up playing whack-a-mole for random drivers.
> >>
> >> module_put_and_exit() does *not* have this problem, BTW.
> >>
> >> Rusty.
> >>     
> >
> > I see that, the .text for module_put_and_exit is never modular itself.
> > Thanks, Rusty!
> >   
> 
> Ah!  That is the trick I wasn't understanding.
> > BTW, Gregory, this can be used to fix the race in the design: create a
> > thread and let it drop the module reference with module_put_and_exit.
> >   
> 
> I had thought of doing something like this initially too, but I think
> its racy as well.  Ultimately, you need to make sure the eventfd
> callback is completely out before its safe to run, and deferring to a
> thread would not change this race.  The only sane way I can see to do
> that is to have the caller infrastructure annotate the event somehow
> (either directly with a module_put(), or indirectly with some kind of
> state transition that can be tracked with something like
> synchronize_sched().

Here's what one could do: create a thread for each irqfd, and increment
module ref count, put that thread to sleep.  When done with
irqfd, don't delete it and don't decrement module refcount, wake thread
instead.  thread kills irqfd and calls module_put_and_exit.

I don't think it's racy but I won't call it sane either.

> > Which will work, but I guess at this point we should ask ourselves
> > whether all the hearburn with srcu, threads and module references is
> > better than just asking the user to call and ioctl.
> >   
> 
> I am starting to agree with you, here. :)
> 
> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
> conversation re: the module_put() races.  I only tied it into the
> current thread because the eventfd_notifier_register() thread gave me a
> convenient way to hook some other context to do the module_put().  In
> the long term, the srcu changes are for the can_sleep() stuff.  So on
> that note, lets see if I can convince Davide that the srcu stuff is not
> so evil before we revert the POLLHUP patches, since the module_put() fix
> is trivial once that is in place.

Can this help with DEASSIGN as well? We need it for migration.

> Thanks Michael (and Rusty),
> -Greg
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 18, 2009, 2:03 p.m. UTC | #8
Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>
>>> BTW, Gregory, this can be used to fix the race in the design: create a
>>> thread and let it drop the module reference with module_put_and_exit.
>>>   
>>>       
>> I had thought of doing something like this initially too, but I think
>> its racy as well.  Ultimately, you need to make sure the eventfd
>> callback is completely out before its safe to run, and deferring to a
>> thread would not change this race.  The only sane way I can see to do
>> that is to have the caller infrastructure annotate the event somehow
>> (either directly with a module_put(), or indirectly with some kind of
>> state transition that can be tracked with something like
>> synchronize_sched().
>>     
>
> Here's what one could do: create a thread for each irqfd, and increment
> module ref count, put that thread to sleep.  When done with
> irqfd, don't delete it and don't decrement module refcount, wake thread
> instead.  thread kills irqfd and calls module_put_and_exit.
>
> I don't think it's racy


I believe it is. How would you prevent the thread from doing the
module_put_and_exit() before the eventfd callback thread is known to
have exited the relevant .text section?

All this talk does give me an idea, tho.  Ill make a patch.

 

>   
>>> Which will work, but I guess at this point we should ask ourselves
>>> whether all the hearburn with srcu, threads and module references is
>>> better than just asking the user to call and ioctl.
>>>   
>>>       
>> I am starting to agree with you, here. :)
>>
>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
>> conversation re: the module_put() races.  I only tied it into the
>> current thread because the eventfd_notifier_register() thread gave me a
>> convenient way to hook some other context to do the module_put().  In
>> the long term, the srcu changes are for the can_sleep() stuff.  So on
>> that note, lets see if I can convince Davide that the srcu stuff is not
>> so evil before we revert the POLLHUP patches, since the module_put() fix
>> is trivial once that is in place.
>>     
>
> Can this help with DEASSIGN as well? We need it for migration.
>   

No, but afaict you do not need this for migration anyway.  Migrate the
GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
relevant across a migration anyway?  I would think not, but admittedly I
know little about how qemu/kvm migration actually works.

Regards,
-Greg
Michael S. Tsirkin June 18, 2009, 2:35 p.m. UTC | #9
On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
> >   
> >> Michael S. Tsirkin wrote:
> >>     
> >>
> >>> BTW, Gregory, this can be used to fix the race in the design: create a
> >>> thread and let it drop the module reference with module_put_and_exit.
> >>>   
> >>>       
> >> I had thought of doing something like this initially too, but I think
> >> its racy as well.  Ultimately, you need to make sure the eventfd
> >> callback is completely out before its safe to run, and deferring to a
> >> thread would not change this race.  The only sane way I can see to do
> >> that is to have the caller infrastructure annotate the event somehow
> >> (either directly with a module_put(), or indirectly with some kind of
> >> state transition that can be tracked with something like
> >> synchronize_sched().
> >>     
> >
> > Here's what one could do: create a thread for each irqfd, and increment
> > module ref count, put that thread to sleep.  When done with
> > irqfd, don't delete it and don't decrement module refcount, wake thread
> > instead.  thread kills irqfd and calls module_put_and_exit.
> >
> > I don't think it's racy
> 
> 
> I believe it is. How would you prevent the thread from doing the
> module_put_and_exit() before the eventfd callback thread is known to
> have exited the relevant .text section?

Right.

> All this talk does give me an idea, tho.  Ill make a patch.

OK, but ask yourself whether this bag of tricks is worth it, and whether
we'll find another hole later. Let's reserve the trickiness for
fast-path, where it's needed, and keep at least the assign/deassign simple.

> >   
> >>> Which will work, but I guess at this point we should ask ourselves
> >>> whether all the hearburn with srcu, threads and module references is
> >>> better than just asking the user to call and ioctl.
> >>>   
> >>>       
> >> I am starting to agree with you, here. :)
> >>
> >> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
> >> conversation re: the module_put() races.  I only tied it into the
> >> current thread because the eventfd_notifier_register() thread gave me a
> >> convenient way to hook some other context to do the module_put().  In
> >> the long term, the srcu changes are for the can_sleep() stuff.  So on
> >> that note, lets see if I can convince Davide that the srcu stuff is not
> >> so evil before we revert the POLLHUP patches, since the module_put() fix
> >> is trivial once that is in place.
> >>     
> >
> > Can this help with DEASSIGN as well? We need it for migration.
> >   
> 
> No, but afaict you do not need this for migration anyway.  Migrate the
> GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
> relevant across a migration anyway?  I would think not, but admittedly I
> know little about how qemu/kvm migration actually works.

Yes but that's not live migration. For live migration, the trick is that
you are running locally but send changes to remote guest.  For that, we
need to put qemu in the middle between the device and the guest, so it
can detect activity and update the remote side.

And the best way to do that is to take poll eventfd that device assigns
and push eventfd that kvm polls. To switch between this setup
and the one where kvm polls the ventfd from device directly,
you need deassign.
Gregory Haskins June 18, 2009, 4:29 p.m. UTC | #10
Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> Michael S. Tsirkin wrote:
>>>>     
>>>>
>>>>         
>>>>> BTW, Gregory, this can be used to fix the race in the design: create a
>>>>> thread and let it drop the module reference with module_put_and_exit.
>>>>>   
>>>>>       
>>>>>           
>>>> I had thought of doing something like this initially too, but I think
>>>> its racy as well.  Ultimately, you need to make sure the eventfd
>>>> callback is completely out before its safe to run, and deferring to a
>>>> thread would not change this race.  The only sane way I can see to do
>>>> that is to have the caller infrastructure annotate the event somehow
>>>> (either directly with a module_put(), or indirectly with some kind of
>>>> state transition that can be tracked with something like
>>>> synchronize_sched().
>>>>     
>>>>         
>>> Here's what one could do: create a thread for each irqfd, and increment
>>> module ref count, put that thread to sleep.  When done with
>>> irqfd, don't delete it and don't decrement module refcount, wake thread
>>> instead.  thread kills irqfd and calls module_put_and_exit.
>>>
>>> I don't think it's racy
>>>       
>> I believe it is. How would you prevent the thread from doing the
>> module_put_and_exit() before the eventfd callback thread is known to
>> have exited the relevant .text section?
>>     
>
> Right.
>
>   
>> All this talk does give me an idea, tho.  Ill make a patch.
>>     
>
> OK, but ask yourself whether this bag of tricks is worth it, and whether
> we'll find another hole later. Let's reserve the trickiness for
> fast-path, where it's needed, and keep at least the assign/deassign simple.
>   

Understood.  OTOH, going back to the model where two steps are needed
for close() is ugly too, so I don't want to just give up and revert that
fix too easily.  At some point we will call it one way or the other, but
I am not there quite yet.
>   
>>>   
>>>       
>>>>> Which will work, but I guess at this point we should ask ourselves
>>>>> whether all the hearburn with srcu, threads and module references is
>>>>> better than just asking the user to call and ioctl.
>>>>>   
>>>>>       
>>>>>           
>>>> I am starting to agree with you, here. :)
>>>>
>>>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
>>>> conversation re: the module_put() races.  I only tied it into the
>>>> current thread because the eventfd_notifier_register() thread gave me a
>>>> convenient way to hook some other context to do the module_put().  In
>>>> the long term, the srcu changes are for the can_sleep() stuff.  So on
>>>> that note, lets see if I can convince Davide that the srcu stuff is not
>>>> so evil before we revert the POLLHUP patches, since the module_put() fix
>>>> is trivial once that is in place.
>>>>     
>>>>         
>>> Can this help with DEASSIGN as well? We need it for migration.
>>>   
>>>       
>> No, but afaict you do not need this for migration anyway.  Migrate the
>> GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
>> relevant across a migration anyway?  I would think not, but admittedly I
>> know little about how qemu/kvm migration actually works.
>>     
>
> Yes but that's not live migration. For live migration, the trick is that
> you are running locally but send changes to remote guest.  For that, we
> need to put qemu in the middle between the device and the guest, so it
> can detect activity and update the remote side.
>
> And the best way to do that is to take poll eventfd that device assigns
> and push eventfd that kvm polls. To switch between this setup
> and the one where kvm polls the ventfd from device directly,
> you need deassign.
>   

So its still not clear why the distinction between
deassign-the-gsi-but-leave-the-fd-valid is needed over a simple
close().  Can you elaborate?

-Greg
Michael S. Tsirkin June 19, 2009, 3:37 p.m. UTC | #11
On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
> >   
> >> Michael S. Tsirkin wrote:
> >>     
> >>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
> >>>   
> >>>       
> >>>> Michael S. Tsirkin wrote:
> >>>>     
> >>>>
> >>>>         
> >>>>> BTW, Gregory, this can be used to fix the race in the design: create a
> >>>>> thread and let it drop the module reference with module_put_and_exit.
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> I had thought of doing something like this initially too, but I think
> >>>> its racy as well.  Ultimately, you need to make sure the eventfd
> >>>> callback is completely out before its safe to run, and deferring to a
> >>>> thread would not change this race.  The only sane way I can see to do
> >>>> that is to have the caller infrastructure annotate the event somehow
> >>>> (either directly with a module_put(), or indirectly with some kind of
> >>>> state transition that can be tracked with something like
> >>>> synchronize_sched().
> >>>>     
> >>>>         
> >>> Here's what one could do: create a thread for each irqfd, and increment
> >>> module ref count, put that thread to sleep.  When done with
> >>> irqfd, don't delete it and don't decrement module refcount, wake thread
> >>> instead.  thread kills irqfd and calls module_put_and_exit.
> >>>
> >>> I don't think it's racy
> >>>       
> >> I believe it is. How would you prevent the thread from doing the
> >> module_put_and_exit() before the eventfd callback thread is known to
> >> have exited the relevant .text section?
> >>     
> >
> > Right.
> >
> >   
> >> All this talk does give me an idea, tho.  Ill make a patch.
> >>     
> >
> > OK, but ask yourself whether this bag of tricks is worth it, and whether
> > we'll find another hole later. Let's reserve the trickiness for
> > fast-path, where it's needed, and keep at least the assign/deassign simple.
> >   
> 
> Understood.  OTOH, going back to the model where two steps are needed
> for close() is ugly too, so I don't want to just give up and revert that
> fix too easily.  At some point we will call it one way or the other, but
> I am not there quite yet.
> >   
> >>>   
> >>>       
> >>>>> Which will work, but I guess at this point we should ask ourselves
> >>>>> whether all the hearburn with srcu, threads and module references is
> >>>>> better than just asking the user to call and ioctl.
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> I am starting to agree with you, here. :)
> >>>>
> >>>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
> >>>> conversation re: the module_put() races.  I only tied it into the
> >>>> current thread because the eventfd_notifier_register() thread gave me a
> >>>> convenient way to hook some other context to do the module_put().  In
> >>>> the long term, the srcu changes are for the can_sleep() stuff.  So on
> >>>> that note, lets see if I can convince Davide that the srcu stuff is not
> >>>> so evil before we revert the POLLHUP patches, since the module_put() fix
> >>>> is trivial once that is in place.
> >>>>     
> >>>>         
> >>> Can this help with DEASSIGN as well? We need it for migration.
> >>>   
> >>>       
> >> No, but afaict you do not need this for migration anyway.  Migrate the
> >> GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
> >> relevant across a migration anyway?  I would think not, but admittedly I
> >> know little about how qemu/kvm migration actually works.
> >>     
> >
> > Yes but that's not live migration. For live migration, the trick is that
> > you are running locally but send changes to remote guest.  For that, we
> > need to put qemu in the middle between the device and the guest, so it
> > can detect activity and update the remote side.
> >
> > And the best way to do that is to take poll eventfd that device assigns
> > and push eventfd that kvm polls. To switch between this setup
> > and the one where kvm polls the ventfd from device directly,
> > you need deassign.
> >   
> 
> So its still not clear why the distinction between
> deassign-the-gsi-but-leave-the-fd-valid is needed over a simple
> close().  Can you elaborate?


The fd needs to be left assigned to the device, so that we can poll
the fd and get events, then forward them to kvm.
Gregory Haskins June 19, 2009, 4:07 p.m. UTC | #12
Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> Michael S. Tsirkin wrote:
>>>>     
>>>>         
>>>>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> Michael S. Tsirkin wrote:
>>>>>>     
>>>>>>
>>>>>>         
>>>>>>             
>>>>>>> BTW, Gregory, this can be used to fix the race in the design: create a
>>>>>>> thread and let it drop the module reference with module_put_and_exit.
>>>>>>>   
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>> I had thought of doing something like this initially too, but I think
>>>>>> its racy as well.  Ultimately, you need to make sure the eventfd
>>>>>> callback is completely out before its safe to run, and deferring to a
>>>>>> thread would not change this race.  The only sane way I can see to do
>>>>>> that is to have the caller infrastructure annotate the event somehow
>>>>>> (either directly with a module_put(), or indirectly with some kind of
>>>>>> state transition that can be tracked with something like
>>>>>> synchronize_sched().
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> Here's what one could do: create a thread for each irqfd, and increment
>>>>> module ref count, put that thread to sleep.  When done with
>>>>> irqfd, don't delete it and don't decrement module refcount, wake thread
>>>>> instead.  thread kills irqfd and calls module_put_and_exit.
>>>>>
>>>>> I don't think it's racy
>>>>>       
>>>>>           
>>>> I believe it is. How would you prevent the thread from doing the
>>>> module_put_and_exit() before the eventfd callback thread is known to
>>>> have exited the relevant .text section?
>>>>     
>>>>         
>>> Right.
>>>
>>>   
>>>       
>>>> All this talk does give me an idea, tho.  Ill make a patch.
>>>>     
>>>>         
>>> OK, but ask yourself whether this bag of tricks is worth it, and whether
>>> we'll find another hole later. Let's reserve the trickiness for
>>> fast-path, where it's needed, and keep at least the assign/deassign simple.
>>>   
>>>       
>> Understood.  OTOH, going back to the model where two steps are needed
>> for close() is ugly too, so I don't want to just give up and revert that
>> fix too easily.  At some point we will call it one way or the other, but
>> I am not there quite yet.
>>     
>>>   
>>>       
>>>>>   
>>>>>       
>>>>>           
>>>>>>> Which will work, but I guess at this point we should ask ourselves
>>>>>>> whether all the hearburn with srcu, threads and module references is
>>>>>>> better than just asking the user to call and ioctl.
>>>>>>>   
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>> I am starting to agree with you, here. :)
>>>>>>
>>>>>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
>>>>>> conversation re: the module_put() races.  I only tied it into the
>>>>>> current thread because the eventfd_notifier_register() thread gave me a
>>>>>> convenient way to hook some other context to do the module_put().  In
>>>>>> the long term, the srcu changes are for the can_sleep() stuff.  So on
>>>>>> that note, lets see if I can convince Davide that the srcu stuff is not
>>>>>> so evil before we revert the POLLHUP patches, since the module_put() fix
>>>>>> is trivial once that is in place.
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> Can this help with DEASSIGN as well? We need it for migration.
>>>>>   
>>>>>       
>>>>>           
>>>> No, but afaict you do not need this for migration anyway.  Migrate the
>>>> GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
>>>> relevant across a migration anyway?  I would think not, but admittedly I
>>>> know little about how qemu/kvm migration actually works.
>>>>     
>>>>         
>>> Yes but that's not live migration. For live migration, the trick is that
>>> you are running locally but send changes to remote guest.  For that, we
>>> need to put qemu in the middle between the device and the guest, so it
>>> can detect activity and update the remote side.
>>>
>>> And the best way to do that is to take poll eventfd that device assigns
>>> and push eventfd that kvm polls. To switch between this setup
>>> and the one where kvm polls the ventfd from device directly,
>>> you need deassign.
>>>   
>>>       
>> So its still not clear why the distinction between
>> deassign-the-gsi-but-leave-the-fd-valid is needed over a simple
>> close().  Can you elaborate?
>>     
>
>
> The fd needs to be left assigned to the device, so that we can poll
> the fd and get events, then forward them to kvm.
>   

Ah, ok.  Now I get what you are trying to do.

Well, per the PM I sent you this morning, I figured out the magic to
resolve the locking issues.  So we should be able to add DEASSIGN logic
soon, I hope.

-Greg
diff mbox

Patch

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 2c8028c..67e4eca 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -29,6 +29,7 @@ 
 #include <linux/list.h>
 #include <linux/eventfd.h>
 #include <linux/srcu.h>
+#include <linux/module.h>
 
 /*
  * --------------------------------------------------------------------
@@ -123,6 +124,7 @@  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int
sync, void
 *key)
 
                cleanup_srcu_struct(&irqfd->srcu);
                kfree(irqfd);
+               module_put(THIS_MODULE);
        }