Xen 4.7 crash

Message ID e3748d9a-3ec0-33e2-4be1-9b0972b69415@citrix.com (mailing list archive)
State New, archived

Commit Message

Andrew Cooper June 1, 2016, 9:24 p.m. UTC
On 01/06/2016 21:45, Aaron Cornelius wrote:
>>
>>> However, since I only have 1 domain active at a time, I'm not sure why I
>>> should run out of VM IDs.
>>
>> Sounds like a VMID resource leak.  Check to see whether it is freed properly
>> in domain_destroy().
>>
>> ~Andrew
> That would be my assumption.  But as far as I can tell, arch_domain_destroy() calls p2m_teardown() which calls p2m_free_vmid(), and none of the functionality related to freeing a VM ID appears to have changed in years.

The VMID handling looks suspect.  It can be called repeatedly during
domain destruction, and it will repeatedly clear the same bit out of the
vmid_mask.


Having said that, I can't explain why that bug would result in the
symptoms you are seeing.  It is also possible that your issue is memory
corruption from a separate source.

Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with
vmid_alloc_lock held) to see which vmid is being allocated/freed?
After the initial boot of the system, you should see the same vmid being
allocated and freed for each of your domains.

~Andrew
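
The instrumentation requested above can be tried out in user space first. The following is a minimal sketch, not the Xen code: `vmid_used`, `model_alloc_vmid()`, `model_free_vmid()`, `model_reset()` and `cycle_domains()` are hypothetical names, the pool size of 256 comes from the thread, and treating VMID 0 as reserved is an assumption. It logs every allocation and free, and can optionally skip some frees to imitate a leak.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_VMID     256   /* pool size mentioned in the thread */
#define INVALID_VMID 0     /* assumption: VMID 0 is reserved and never handed out */

static bool vmid_used[MAX_VMID];   /* stand-in for the vmid_mask bitmap */

/* Forget all allocations so a new experiment starts from a clean pool. */
static void model_reset(void)
{
    memset(vmid_used, 0, sizeof(vmid_used));
}

/* Model of an allocator: take the lowest free VMID, or return -1 if none is left. */
static int model_alloc_vmid(int domid)
{
    for (int v = INVALID_VMID + 1; v < MAX_VMID; v++) {
        if (!vmid_used[v]) {
            vmid_used[v] = true;
            printf("d%d: alloc vmid %d\n", domid, v);
            return v;
        }
    }
    printf("d%d: VMID pool exhausted\n", domid);
    return -1;
}

/* Model of the free path: release the VMID and log which one it was. */
static void model_free_vmid(int domid, int vmid)
{
    assert(vmid > INVALID_VMID && vmid < MAX_VMID && vmid_used[vmid]);
    vmid_used[vmid] = false;
    printf("d%d: free vmid %d\n", domid, vmid);
}

/*
 * Create and destroy `cycles` domains, one at a time.  If `leak_every` is
 * non-zero, every leak_every-th destroy skips the free (a simulated leak).
 * Returns the number of domains that obtained a VMID.
 */
static int cycle_domains(int cycles, int leak_every)
{
    int ok = 0;

    for (int domid = 1; domid <= cycles; domid++) {
        int vmid = model_alloc_vmid(domid);

        if (vmid < 0)
            break;
        ok++;
        if (leak_every == 0 || domid % leak_every != 0)
            model_free_vmid(domid, vmid);
    }
    return ok;
}
```

Calling `cycle_domains(1200, 0)` from a `main()` logs the same VMID being allocated and freed on every cycle, which is the healthy pattern described above. With `leak_every` set to 1 the model runs out of VMIDs after 255 domains, so a failure that only shows up after more than a thousand domains would have to come from an intermittent leak or from something other than the free path.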

Comments

Julien Grall June 1, 2016, 10:18 p.m. UTC | #1
Hi Andrew,

On 01/06/2016 22:24, Andrew Cooper wrote:
> On 01/06/2016 21:45, Aaron Cornelius wrote:
>>>
>>>> However, since I only have 1 domain active at a time, I'm not sure why I
>>>> should run out of VM IDs.
>>>
>>> Sounds like a VMID resource leak.  Check to see whether it is freed properly
>>> in domain_destroy().
>>>
>>> ~Andrew
>> That would be my assumption.  But as far as I can tell, arch_domain_destroy() calls p2m_teardown() which calls p2m_free_vmid(), and none of the functionality related to freeing a VM ID appears to have changed in years.
>
> The VMID handling looks suspect.  It can be called repeatedly during
> domain destruction, and it will repeatedly clear the same bit out of the
> vmid_mask.

Can you explain how p2m_free_vmid can be called multiple times?

We have the following path:
    arch_domain_destroy -> p2m_teardown -> p2m_free_vmid.

And I can find only 3 calls of arch_domain_destroy, which should only be
done once per domain.

If arch_domain_destroy is called multiple times, p2m_free_vmid will not
be the only place where Xen will be in trouble.

> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 838d004..7adb39a 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
>      struct p2m_domain *p2m = &d->arch.p2m;
>      spin_lock(&vmid_alloc_lock);
>      if ( p2m->vmid != INVALID_VMID )
> -        clear_bit(p2m->vmid, vmid_mask);
> +    {
> +        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
> +        p2m->vmid = INVALID_VMID;
> +    }
>
>      spin_unlock(&vmid_alloc_lock);
>  }
>
> Having said that, I can't explain why that bug would result in the
> symptoms you are seeing.  It is also possible that your issue is memory
> corruption from a separate source.
>
> Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with
> vmid_alloc_lock held) to see which vmid is being allocated/freed?
> After the initial boot of the system, you should see the same vmid being
> allocated and freed for each of your domains.

Looking quickly at the log, the domain is dom1101. However, the
maximum number of VMIDs supported is 256, so the exhaustion might be a
race somewhere.

I would be interested to get a reproducer. I wrote a script to cycle a
domain (create/destroy) in a loop, and I have not seen any issue after
1200 cycles (and counting).

Cheers,

Andrew Cooper June 1, 2016, 10:26 p.m. UTC | #2
On 01/06/2016 23:18, Julien Grall wrote:
> Hi Andrew,
>
> On 01/06/2016 22:24, Andrew Cooper wrote:
>> On 01/06/2016 21:45, Aaron Cornelius wrote:
>>>>
>>>>> However, since I only have 1 domain active at a time, I'm not sure
>>>>> why I should run out of VM IDs.
>>>>
>>>> Sounds like a VMID resource leak.  Check to see whether it is freed
>>>> properly
>>>> in domain_destroy().
>>>>
>>>> ~Andrew
>>> That would be my assumption.  But as far as I can tell,
>>> arch_domain_destroy() calls p2m_teardown() which calls
>>> p2m_free_vmid(), and none of the functionality related to freeing a
>>> VM ID appears to have changed in years.
>>
>> The VMID handling looks suspect.  It can be called repeatedly during
>> domain destruction, and it will repeatedly clear the same bit out of the
>> vmid_mask.
>
> Can you explain how p2m_free_vmid can be called multiple times?
>
> We have the following path:
>    arch_domain_destroy -> p2m_teardown -> p2m_free_vmid.
>
> And I can find only 3 calls of arch_domain_destroy, which should only be
> done once per domain.
>
> If arch_domain_destroy is called multiple times, p2m_free_vmid will not
> be the only place where Xen will be in trouble.

You are correct.  I was getting my phases of domain destruction mixed
up.  arch_domain_destroy() is called strictly once, after the RCU
reference count of the domain has dropped to 0.

>
>> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
>> index 838d004..7adb39a 100644
>> --- a/xen/arch/arm/p2m.c
>> +++ b/xen/arch/arm/p2m.c
>> @@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
>>      struct p2m_domain *p2m = &d->arch.p2m;
>>      spin_lock(&vmid_alloc_lock);
>>      if ( p2m->vmid != INVALID_VMID )
>> -        clear_bit(p2m->vmid, vmid_mask);
>> +    {
>> +        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
>> +        p2m->vmid = INVALID_VMID;
>> +    }
>>
>>      spin_unlock(&vmid_alloc_lock);
>>  }
>>
>> Having said that, I can't explain why that bug would result in the
>> symptoms you are seeing.  It is also possible that your issue is memory
>> corruption from a separate source.
>>
>> Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with
>> vmid_alloc_lock held) to see which vmid is being allocated/freed?
>> After the initial boot of the system, you should see the same vmid being
>> allocated and freed for each of your domains.
>
> Looking quickly at the log, the domain is dom1101. However, the
> maximum number of VMIDs supported is 256, so the exhaustion might be a
> race somewhere.
>
> I would be interested to get a reproducer. I wrote a script to cycle a
> domain (create/destroy) in a loop, and I have not seen any issue after
> 1200 cycles (and counting).

Given that my previous thought was wrong, I am going to suggest that
some other form of memory corruption is a more likely cause.

~Andrew

Patch

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 838d004..7adb39a 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1393,7 +1393,10 @@  static void p2m_free_vmid(struct domain *d)
     struct p2m_domain *p2m = &d->arch.p2m;
     spin_lock(&vmid_alloc_lock);
     if ( p2m->vmid != INVALID_VMID )
-        clear_bit(p2m->vmid, vmid_mask);
+    {
+        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
+        p2m->vmid = INVALID_VMID;
+    }

     spin_unlock(&vmid_alloc_lock);
 }
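
One thing to check before using the hunk above as is: it places test_and_clear_bit() inside ASSERT(). If the ASSERT() in the tree being built does not evaluate its argument in non-debug builds (an assumption here, though common for assert-style macros), the bit would never be cleared in such a build and every destroyed domain would leak its VMID. The sketch below is a user-space model of that hazard, not hypervisor code: `RELEASE_ASSERT`, `model_test_and_clear()`, `struct model_p2m`, `free_vmid_in_assert()` and `free_vmid_safe()` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VMID     256
#define INVALID_VMID 0     /* assumption: 0 means "no VMID assigned" */

static bool vmid_used[MAX_VMID];   /* stand-in for the vmid_mask bitmap */

struct model_p2m {
    int vmid;
};

/* Model of a test-and-clear primitive: clear the bit, return its old value. */
static bool model_test_and_clear(int bit)
{
    bool old = vmid_used[bit];

    vmid_used[bit] = false;
    return old;
}

/* Hypothetical non-debug assertion: the argument is type-checked but never evaluated. */
#define RELEASE_ASSERT(p) do { if (0 && (p)) {} } while (0)

/* Shape of the hunk above: the side effect lives inside the assertion. */
static void free_vmid_in_assert(struct model_p2m *p2m)
{
    if (p2m->vmid != INVALID_VMID) {
        RELEASE_ASSERT(model_test_and_clear(p2m->vmid));   /* never executed */
        p2m->vmid = INVALID_VMID;
    }
}

/* Safer shape: perform the side effect unconditionally, assert on its result. */
static void free_vmid_safe(struct model_p2m *p2m)
{
    if (p2m->vmid != INVALID_VMID) {
        bool was_set = model_test_and_clear(p2m->vmid);    /* always executed */

        RELEASE_ASSERT(was_set);
        (void)was_set;
        p2m->vmid = INVALID_VMID;
    }
}
```

In this model `free_vmid_in_assert()` leaves the bit set while still resetting `p2m->vmid`, so the VMID can never be handed out again, whereas `free_vmid_safe()` clears it in every build. Setting `p2m->vmid` to `INVALID_VMID` also turns a second call into a no-op, which is the double-free guard the patch was aiming for.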