
xen-blkfront: use old rinfo after enomem during migration

Message ID 1543468665-22795-1-git-send-email-manjunath.b.patil@oracle.com (mailing list archive)
State New, archived
Headers show
Series xen-blkfront: use old rinfo after enomem during migration | expand

Commit Message

Manjunath Patil Nov. 29, 2018, 5:17 a.m. UTC
Hi,
Feel free to suggest/comment on this.

I am trying to do the following at the dst side during migration now:
1. Don't free the old rinfo in blkif_free(). Instead just clean (zero) it.
2. Store the old rinfo and nr_rings into temp variables in negotiate_mq()
3. Let nr_rings get re-calculated based on backend data
4. Try allocating new memory based on the new nr_rings
5.
  a. If the memory allocation succeeds
     - free the old rinfo and proceed to use the new rinfo
  b. If the memory allocation fails
     - use the old rinfo
     - adjust nr_rings to the lower of the new and old nr_rings

-Thanks,
Manjunath
--
During migration, the device resume on the dst side can fail to allocate the
rinfo structs. Each rinfo is about 80K in size, so allocating 4 [typical] such
rings needs an order-7 allocation [a 512KB chunk], thereby increasing the
chance of memory allocation failure. A device resume failure during migration
leaves the processes accessing the device in a hung state.

This patch aims to reuse the old rinfo in case of memory allocation failure.

Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
---
 drivers/block/xen-blkfront.c |   46 +++++++++++++++++++++++++++++++++++------
 1 files changed, 39 insertions(+), 7 deletions(-)

Comments

Boris Ostrovsky Nov. 30, 2018, 8:42 p.m. UTC | #1
On 11/29/18 12:17 AM, Manjunath Patil wrote:
> Hi,
> Feel free to suggest/comment on this.
>
> I am trying to do the following at dst during the migration now.
> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
> 2. Store the old rinfo and nr_rings into temp variables in negotiate_mq()
> 3. let nr_rings get re-calculated based on backend data
> 4. try allocating new memory based on new nr_rings

Since I suspect the number of rings will likely be the same, why not reuse
the rings in the common case?
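
A minimal sketch of what that common-case reuse in negotiate_mq() could look
like (variable names follow the patch below; the exact placement is an
assumption, not code from this series):

	/* Sketch only (not from this series): reuse the old array when the
	 * ring count did not change across migration, and only fall back to
	 * kcalloc() -- and the ENOMEM handling from the patch -- otherwise. */
	if (unlikely(info->connected == BLKIF_STATE_SUSPENDED) &&
	    info->nr_rings == nr_rings_old) {
		info->rinfo = rinfo_old;
		rinfo_old = NULL;	/* prevent the later kfree() of the reused array */
		nr_rings_old = 0;
	} else {
		info->rinfo = kcalloc(info->nr_rings,
				      sizeof(struct blkfront_ring_info),
				      GFP_KERNEL);
	}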


> 5. 
>   a. If memory allocation is a success
>      - free the old rinfo and proceed to use the new rinfo
>   b. If memory allocation is a failure
>      - use the old the rinfo
>      - adjust the nr_rings to the lowest of new nr_rings and old nr_rings


> @@ -1918,10 +1936,24 @@ static int negotiate_mq(struct blkfront_info *info)
>  			      sizeof(struct blkfront_ring_info),
>  			      GFP_KERNEL);
>  	if (!info->rinfo) {
> -		xenbus_dev_fatal(info->xbdev, -ENOMEM, "allocating ring_info structure");
> -		info->nr_rings = 0;
> -		return -ENOMEM;
> -	}
> +		if (unlikely(nr_rings_old)) {
> +			/* We might waste some memory if
> +			 * info->nr_rings < nr_rings_old
> +			 */
> +			info->rinfo = rinfo_old;
> +			if (info->nr_rings > nr_rings_old)
> +				info->nr_rings = nr_rings_old;
> +			xenbus_dev_fatal(info->xbdev, -ENOMEM,


Why xenbus_dev_fatal()?

-boris


> +			"reusing old ring_info structure(new ring size=%d)",
> +				info->nr_rings);
> +		} else {
> +			xenbus_dev_fatal(info->xbdev, -ENOMEM,
> +				"allocating ring_info structure");
> +			info->nr_rings = 0;
> +			return -ENOMEM;
> +		}
> +	} else if (unlikely(nr_rings_old))
> +		kfree(rinfo_old);
>  
>  	for (i = 0; i < info->nr_rings; i++) {
>  		struct blkfront_ring_info *rinfo;
Manjunath Patil Nov. 30, 2018, 9:49 p.m. UTC | #2
Thank you, Boris, for your comments. I removed the faulty email address of mine.

Replies inline.
On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>> Hi,
>> Feel free to suggest/comment on this.
>>
>> I am trying to do the following at dst during the migration now.
>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
>> 2. Store the old rinfo and nr_rings into temp variables in negotiate_mq()
>> 3. let nr_rings get re-calculated based on backend data
>> 4. try allocating new memory based on new nr_rings
> Since I suspect number of rings will likely be the same why not reuse
> the rings in the common case?
I thought attaching devices would happen more often than migration, hence I
did not want to add an extra check for
   - whether I am inside the migration code path, and
   - whether the new nr_rings is equal to the old nr_rings

Sure, adding such a check would avoid the memory allocation altogether in the
migration path, but it would add a little overhead for normal device addition.

Do you think it's worth adding that change?
>
>
>> 5.
>>    a. If memory allocation is a success
>>       - free the old rinfo and proceed to use the new rinfo
>>    b. If memory allocation is a failure
>>       - use the old the rinfo
>>       - adjust the nr_rings to the lowest of new nr_rings and old nr_rings
>
>> @@ -1918,10 +1936,24 @@ static int negotiate_mq(struct blkfront_info *info)
>>   			      sizeof(struct blkfront_ring_info),
>>   			      GFP_KERNEL);
>>   	if (!info->rinfo) {
>> -		xenbus_dev_fatal(info->xbdev, -ENOMEM, "allocating ring_info structure");
>> -		info->nr_rings = 0;
>> -		return -ENOMEM;
>> -	}
>> +		if (unlikely(nr_rings_old)) {
>> +			/* We might waste some memory if
>> +			 * info->nr_rings < nr_rings_old
>> +			 */
>> +			info->rinfo = rinfo_old;
>> +			if (info->nr_rings > nr_rings_old)
>> +				info->nr_rings = nr_rings_old;
>> +			xenbus_dev_fatal(info->xbdev, -ENOMEM,
>
> Why xenbus_dev_fatal()?
I wanted to make sure that this message is seen on the console by default, so
that we know an ENOMEM event happened and we recovered from it.
What do you suggest instead? xenbus_dev_error()?
>
> -boris
>
>
>> +			"reusing old ring_info structure(new ring size=%d)",
>> +				info->nr_rings);
>> +		} else {
>> +			xenbus_dev_fatal(info->xbdev, -ENOMEM,
>> +				"allocating ring_info structure");
>> +			info->nr_rings = 0;
>> +			return -ENOMEM;
>> +		}
>> +	} else if (unlikely(nr_rings_old))
>> +		kfree(rinfo_old);
>>   
>>   	for (i = 0; i < info->nr_rings; i++) {
>>   		struct blkfront_ring_info *rinfo;
Boris Ostrovsky Nov. 30, 2018, 10:33 p.m. UTC | #3
On 11/30/18 4:49 PM, Manjunath Patil wrote:
> Thank you Boris for your comments. I removed faulty email of mine.
>
> replies inline.
> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>> Hi,
>>> Feel free to suggest/comment on this.
>>>
>>> I am trying to do the following at dst during the migration now.
>>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
>>> 2. Store the old rinfo and nr_rings into temp variables in
>>> negotiate_mq()
>>> 3. let nr_rings get re-calculated based on backend data
>>> 4. try allocating new memory based on new nr_rings
>> Since I suspect number of rings will likely be the same why not reuse
>> the rings in the common case?
> I thought attaching devices will be more often than migration. Hence
> did not want add to an extra check for
>   - if I am inside migration code path and
>   - if new nr_rings is equal to old nr_rings or not
>
> Sure addition of such a thing would avoid the memory allocation
> altogether in migration path,
> but it would add a little overhead for normal device addition.
>
> Do you think its worth adding that change?


IMO a couple of extra checks are not going to make much difference.

I wonder though --- have you actually seen a case where the allocation failed
and the changes provided in this patch made things work? I am asking because
right after negotiate_mq() we will call setup_blkring() and it will want to
allocate a bunch of memory. A failure there is fatal (to ring setup). So it
seems to me that you will survive negotiate_mq() but then will likely fail
soon after.


>>
>>
>>> 5.
>>>    a. If memory allocation is a success
>>>       - free the old rinfo and proceed to use the new rinfo
>>>    b. If memory allocation is a failure
>>>       - use the old the rinfo
>>>       - adjust the nr_rings to the lowest of new nr_rings and old
>>> nr_rings
>>
>>> @@ -1918,10 +1936,24 @@ static int negotiate_mq(struct blkfront_info
>>> *info)
>>>                     sizeof(struct blkfront_ring_info),
>>>                     GFP_KERNEL);
>>>       if (!info->rinfo) {
>>> -        xenbus_dev_fatal(info->xbdev, -ENOMEM, "allocating
>>> ring_info structure");
>>> -        info->nr_rings = 0;
>>> -        return -ENOMEM;
>>> -    }
>>> +        if (unlikely(nr_rings_old)) {
>>> +            /* We might waste some memory if
>>> +             * info->nr_rings < nr_rings_old
>>> +             */
>>> +            info->rinfo = rinfo_old;
>>> +            if (info->nr_rings > nr_rings_old)
>>> +                info->nr_rings = nr_rings_old;
>>> +            xenbus_dev_fatal(info->xbdev, -ENOMEM,
>>
>> Why xenbus_dev_fatal()?
> I wanted to make sure that this msg is seen on console by default. So
> that we know there was a enomem event happened and we recovered from it.
> What do you suggest instead? xenbus_dev_error?

Neither. xenbus_dev_fatal() is going to change the connection state, so it is
certainly not what we want. And even xenbus_dev_error() doesn't look like the
right thing to do since, as far as block device setup is concerned, there are
no errors.

Maybe pr_warn().

-boris


>>
>> -boris
>>
>>
>>> +            "reusing old ring_info structure(new ring size=%d)",
>>> +                info->nr_rings);
>>> +        } else {
>>> +            xenbus_dev_fatal(info->xbdev, -ENOMEM,
>>> +                "allocating ring_info structure");
>>> +            info->nr_rings = 0;
>>> +            return -ENOMEM;
>>> +        }
>>> +    } else if (unlikely(nr_rings_old))
>>> +        kfree(rinfo_old);
>>>         for (i = 0; i < info->nr_rings; i++) {
>>>           struct blkfront_ring_info *rinfo;
>
Manjunath Patil Dec. 2, 2018, 8:31 p.m. UTC | #4
On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:

> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>> Thank you Boris for your comments. I removed faulty email of mine.
>>
>> replies inline.
>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>>> Hi,
>>>> Feel free to suggest/comment on this.
>>>>
>>>> I am trying to do the following at dst during the migration now.
>>>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
>>>> 2. Store the old rinfo and nr_rings into temp variables in
>>>> negotiate_mq()
>>>> 3. let nr_rings get re-calculated based on backend data
>>>> 4. try allocating new memory based on new nr_rings
>>> Since I suspect number of rings will likely be the same why not reuse
>>> the rings in the common case?
>> I thought attaching devices will be more often than migration. Hence
>> did not want add to an extra check for
>>    - if I am inside migration code path and
>>    - if new nr_rings is equal to old nr_rings or not
>>
>> Sure addition of such a thing would avoid the memory allocation
>> altogether in migration path,
>> but it would add a little overhead for normal device addition.
>>
>> Do you think its worth adding that change?
>
> IMO a couple of extra checks are not going to make much difference.
I will add this change.
>
> I wonder though --- have you actually seen the case where you did fail
> allocation and changes provided in this patch made things work? I am
> asking because right after negotiate_mq() we will call setup_blkring()
> and it will want to allocate bunch of memory. A failure there is fatal
> (to ring setup). So it seems to me that you will survive negotiate_mq()
> but then will likely fail soon after.
I have noticed the ENOMEM inside negotiate_mq() on a ct machine. When I
included my patch, I manually triggered the ENOMEM using a debug flag.
The patch works for the ENOMEM inside negotiate_mq().

As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
might hit it in setup_blkring() as well.
We should add a similar change for the blkif_sring struct as well.

I will make this change as well and send a new patch-set for review.
>
>>>
>>>> 5.
>>>>     a. If memory allocation is a success
>>>>        - free the old rinfo and proceed to use the new rinfo
>>>>     b. If memory allocation is a failure
>>>>        - use the old the rinfo
>>>>        - adjust the nr_rings to the lowest of new nr_rings and old
>>>> nr_rings
>>>> @@ -1918,10 +1936,24 @@ static int negotiate_mq(struct blkfront_info
>>>> *info)
>>>>                      sizeof(struct blkfront_ring_info),
>>>>                      GFP_KERNEL);
>>>>        if (!info->rinfo) {
>>>> -        xenbus_dev_fatal(info->xbdev, -ENOMEM, "allocating
>>>> ring_info structure");
>>>> -        info->nr_rings = 0;
>>>> -        return -ENOMEM;
>>>> -    }
>>>> +        if (unlikely(nr_rings_old)) {
>>>> +            /* We might waste some memory if
>>>> +             * info->nr_rings < nr_rings_old
>>>> +             */
>>>> +            info->rinfo = rinfo_old;
>>>> +            if (info->nr_rings > nr_rings_old)
>>>> +                info->nr_rings = nr_rings_old;
>>>> +            xenbus_dev_fatal(info->xbdev, -ENOMEM,
>>> Why xenbus_dev_fatal()?
>> I wanted to make sure that this msg is seen on console by default. So
>> that we know there was a enomem event happened and we recovered from it.
>> What do you suggest instead? xenbus_dev_error?
> Neither. xenbus_dev_fatal() is going to change connection state so it is
> certainly not what we want. And even xenbus_dev_error() doesn't look
> like the right thing to do since as far as block device setup is
> concerned there are no errors.
>
> Maybe pr_warn().
I will include this.
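
For illustration, the reuse branch could then report the fallback with
something like the following (the exact message wording is an assumption, not
from a posted patch):

	/* Sketch only: replace the xenbus_dev_fatal() call in the reuse
	 * branch with a plain kernel log warning. */
	pr_warn("xen-blkfront: ENOMEM, reusing old ring_info (nr_rings=%u)\n",
		info->nr_rings);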

Thank you for your comments.
>
> -boris
>
>
>>> -boris
>>>
>>>
>>>> +            "reusing old ring_info structure(new ring size=%d)",
>>>> +                info->nr_rings);
>>>> +        } else {
>>>> +            xenbus_dev_fatal(info->xbdev, -ENOMEM,
>>>> +                "allocating ring_info structure");
>>>> +            info->nr_rings = 0;
>>>> +            return -ENOMEM;
>>>> +        }
>>>> +    } else if (unlikely(nr_rings_old))
>>>> +        kfree(rinfo_old);
>>>>          for (i = 0; i < info->nr_rings; i++) {
>>>>            struct blkfront_ring_info *rinfo;
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel
Boris Ostrovsky Dec. 3, 2018, 4:07 p.m. UTC | #5
On 12/2/18 3:31 PM, Manjunath Patil wrote:
> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>
>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>>> Thank you Boris for your comments. I removed faulty email of mine.
>>>
>>> replies inline.
>>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>>>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>>>> Hi,
>>>>> Feel free to suggest/comment on this.
>>>>>
>>>>> I am trying to do the following at dst during the migration now.
>>>>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
>>>>> 2. Store the old rinfo and nr_rings into temp variables in
>>>>> negotiate_mq()
>>>>> 3. let nr_rings get re-calculated based on backend data
>>>>> 4. try allocating new memory based on new nr_rings
>>>> Since I suspect number of rings will likely be the same why not reuse
>>>> the rings in the common case?
>>> I thought attaching devices will be more often than migration. Hence
>>> did not want add to an extra check for
>>>    - if I am inside migration code path and
>>>    - if new nr_rings is equal to old nr_rings or not
>>>
>>> Sure addition of such a thing would avoid the memory allocation
>>> altogether in migration path,
>>> but it would add a little overhead for normal device addition.
>>>
>>> Do you think its worth adding that change?
>>
>> IMO a couple of extra checks are not going to make much difference.
> I will add this change
>>
>> I wonder though --- have you actually seen the case where you did fail
>> allocation and changes provided in this patch made things work? I am
>> asking because right after negotiate_mq() we will call setup_blkring()
>> and it will want to allocate bunch of memory. A failure there is fatal
>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>> but then will likely fail soon after.
> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I
> included my patch, I manually triggered the ENOMEM using a debug flag.
> The patch works for ENOMEM inside negotiate_mq().
>
> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
> might hit it in setup_blkring() as well.
> We should add the similar change to blkif_sring struct as well.


Won't you have a similar issue with other frontends, say, netfront?


-boris
Dongli Zhang Dec. 4, 2018, 1:14 a.m. UTC | #6
Hi Boris,

On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
> On 12/2/18 3:31 PM, Manjunath Patil wrote:
>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>>
>>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>>>> Thank you Boris for your comments. I removed faulty email of mine.
>>>>
>>>> replies inline.
>>>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>>>>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>>>>> Hi,
>>>>>> Feel free to suggest/comment on this.
>>>>>>
>>>>>> I am trying to do the following at dst during the migration now.
>>>>>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
>>>>>> 2. Store the old rinfo and nr_rings into temp variables in
>>>>>> negotiate_mq()
>>>>>> 3. let nr_rings get re-calculated based on backend data
>>>>>> 4. try allocating new memory based on new nr_rings
>>>>> Since I suspect number of rings will likely be the same why not reuse
>>>>> the rings in the common case?
>>>> I thought attaching devices will be more often than migration. Hence
>>>> did not want add to an extra check for
>>>>    - if I am inside migration code path and
>>>>    - if new nr_rings is equal to old nr_rings or not
>>>>
>>>> Sure addition of such a thing would avoid the memory allocation
>>>> altogether in migration path,
>>>> but it would add a little overhead for normal device addition.
>>>>
>>>> Do you think its worth adding that change?
>>>
>>> IMO a couple of extra checks are not going to make much difference.
>> I will add this change
>>>
>>> I wonder though --- have you actually seen the case where you did fail
>>> allocation and changes provided in this patch made things work? I am
>>> asking because right after negotiate_mq() we will call setup_blkring()
>>> and it will want to allocate bunch of memory. A failure there is fatal
>>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>>> but then will likely fail soon after.
>> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I
>> included my patch, I manually triggered the ENOMEM using a debug flag.
>> The patch works for ENOMEM inside negotiate_mq().
>>
>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
>> might hit it in setup_blkring() as well.
>> We should add the similar change to blkif_sring struct as well.
> 
> 
> Won't you have a similar issue with other frontends, say, netfront?

I think the kmalloc() fails not because of OOM.

In fact, the size of "blkfront_ring_info" is large. When a domU has 4
queues/rings, the size of 4 blkfront_ring_info structs can be about 300+ KB.

There is a chance that a kmalloc() of 300+ KB would fail.


For netfront, a kmalloc() of 8 'struct netfront_queue' seems to consume <70 KB?
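
As a rough sanity check of those numbers (the sizes below are approximations,
not measurements from this thread), the allocation order falls out like this:

	#include <stdio.h>

	int main(void)
	{
		/* Approximate sizes: ~80 KB per blkfront_ring_info, 4 rings. */
		unsigned long need = 4 * 80 * 1024UL;
		unsigned long block = 4096;	/* one page */
		int order = 0;

		/* The buddy allocator hands out power-of-two page blocks, so
		 * round the request up to the next such block. */
		while (block < need) {
			block <<= 1;
			order++;
		}

		printf("need=%lu KB -> block=%lu KB (order %d)\n",
		       need / 1024, block / 1024, order);
		return 0;	/* prints: need=320 KB -> block=512 KB (order 7) */
	}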

Dongli Zhang
Boris Ostrovsky Dec. 4, 2018, 2:16 a.m. UTC | #7
On 12/3/18 8:14 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
>> On 12/2/18 3:31 PM, Manjunath Patil wrote:
>>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>>>
>>>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>>>>> Thank you Boris for your comments. I removed faulty email of mine.
>>>>>
>>>>> replies inline.
>>>>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>>>>>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>>>>>> Hi,
>>>>>>> Feel free to suggest/comment on this.
>>>>>>>
>>>>>>> I am trying to do the following at dst during the migration now.
>>>>>>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
>>>>>>> 2. Store the old rinfo and nr_rings into temp variables in
>>>>>>> negotiate_mq()
>>>>>>> 3. let nr_rings get re-calculated based on backend data
>>>>>>> 4. try allocating new memory based on new nr_rings
>>>>>> Since I suspect number of rings will likely be the same why not reuse
>>>>>> the rings in the common case?
>>>>> I thought attaching devices will be more often than migration. Hence
>>>>> did not want add to an extra check for
>>>>>    - if I am inside migration code path and
>>>>>    - if new nr_rings is equal to old nr_rings or not
>>>>>
>>>>> Sure addition of such a thing would avoid the memory allocation
>>>>> altogether in migration path,
>>>>> but it would add a little overhead for normal device addition.
>>>>>
>>>>> Do you think its worth adding that change?
>>>> IMO a couple of extra checks are not going to make much difference.
>>> I will add this change
>>>> I wonder though --- have you actually seen the case where you did fail
>>>> allocation and changes provided in this patch made things work? I am
>>>> asking because right after negotiate_mq() we will call setup_blkring()
>>>> and it will want to allocate bunch of memory. A failure there is fatal
>>>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>>>> but then will likely fail soon after.
>>> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I
>>> included my patch, I manually triggered the ENOMEM using a debug flag.
>>> The patch works for ENOMEM inside negotiate_mq().
>>>
>>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
>>> might hit it in setup_blkring() as well.
>>> We should add the similar change to blkif_sring struct as well.
>>
>> Won't you have a similar issue with other frontends, say, netfront?
> I think the kmalloc is failed not because of OOM.
>
> In fact, the size of "blkfront_ring_info" is large. When domU have 4
> queues/rings, the size of 4 blkfront_ring_info can be about 300+ KB.
>
> There is chance that kmalloc() 300+ KB would fail.
>
>
> About netfront, to kmalloc() 8 'struct netfront_queue' seems consumes <70 KB?

TBH these look like comparable sizes to me. I am not convinced that these
changes will make a difference. If the number of rings on the source and
destination were the same I'd absolutely agree with this patch, but since you
are trying to handle different sizes the code becomes somewhat more complex,
and I am not sure it's worth it. (Can you actually give me an example of when
we can expect the number of rings to change during migration?)

But others may think differently.


-boris
Manjunath Patil Dec. 4, 2018, 2:49 a.m. UTC | #8
On 12/3/2018 6:16 PM, Boris Ostrovsky wrote:

> On 12/3/18 8:14 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
>>> On 12/2/18 3:31 PM, Manjunath Patil wrote:
>>>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>>>>
>>>>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>>>>>> Thank you Boris for your comments. I removed faulty email of mine.
>>>>>>
>>>>>> replies inline.
>>>>>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>>>>>>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>>>>>>> Hi,
>>>>>>>> Feel free to suggest/comment on this.
>>>>>>>>
>>>>>>>> I am trying to do the following at dst during the migration now.
>>>>>>>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
>>>>>>>> 2. Store the old rinfo and nr_rings into temp variables in
>>>>>>>> negotiate_mq()
>>>>>>>> 3. let nr_rings get re-calculated based on backend data
>>>>>>>> 4. try allocating new memory based on new nr_rings
>>>>>>> Since I suspect number of rings will likely be the same why not reuse
>>>>>>> the rings in the common case?
>>>>>> I thought attaching devices will be more often than migration. Hence
>>>>>> did not want add to an extra check for
>>>>>>     - if I am inside migration code path and
>>>>>>     - if new nr_rings is equal to old nr_rings or not
>>>>>>
>>>>>> Sure addition of such a thing would avoid the memory allocation
>>>>>> altogether in migration path,
>>>>>> but it would add a little overhead for normal device addition.
>>>>>>
>>>>>> Do you think its worth adding that change?
>>>>> IMO a couple of extra checks are not going to make much difference.
>>>> I will add this change
>>>>> I wonder though --- have you actually seen the case where you did fail
>>>>> allocation and changes provided in this patch made things work? I am
>>>>> asking because right after negotiate_mq() we will call setup_blkring()
>>>>> and it will want to allocate bunch of memory. A failure there is fatal
>>>>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>>>>> but then will likely fail soon after.
>>>> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I
>>>> included my patch, I manually triggered the ENOMEM using a debug flag.
>>>> The patch works for ENOMEM inside negotiate_mq().
>>>>
>>>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
>>>> might hit it in setup_blkring() as well.
>>>> We should add the similar change to blkif_sring struct as well.
>>> Won't you have a similar issue with other frontends, say, netfront?
>> I think the kmalloc is failed not because of OOM.
>>
>> In fact, the size of "blkfront_ring_info" is large. When domU have 4
>> queues/rings, the size of 4 blkfront_ring_info can be about 300+ KB.
>>
>> There is chance that kmalloc() 300+ KB would fail.
>>
>>
>> About netfront, to kmalloc() 8 'struct netfront_queue' seems consumes <70 KB?
> TBH these look like comparable sizes to me.  I am not convinced that
> these changes will make a difference. If the number of rings on source
> and destination were the same I'd absolutely agree with this patch but
> since you are trying to handle different sizes the code becomes somewhat
> more complex, and I am not sure it's worth it. (Can you actually give me
> an example of when we can expect number of rings to change during
> migration?)
>
> But others may think differently.
Hi Boris,
I think an allocation of a 300KB chunk [order 7] is more likely to fail than a
70KB one [order 5], especially under memory pressure.
If it comes to that, I think we should fix this too.

The no. of rings in most cases remains 4 thanks to the xen_blkif_max_queues
module parameter.
If the src host has allocated fewer than 4 [maybe the vcpus given to this dom0
were fewer than 4], then we can expect the dst side to allocate more than the
src side, and vice versa.
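
For reference, the ring count is negotiated roughly like this in the driver
(paraphrased from negotiate_mq(); the exact surrounding code may differ),
which is why src and dst can end up with different nr_rings:

	/* xen_blkif_max_queues defaults to 4 and is clamped to the number of
	 * frontend vCPUs at driver init time. */
	backend_max_queues = xenbus_read_unsigned(info->xbdev->otherend,
						  "multi-queue-max-queues", 1);
	info->nr_rings = min(backend_max_queues, xen_blkif_max_queues);
	/* We need at least one ring. */
	if (!info->nr_rings)
		info->nr_rings = 1;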

-Thanks,
Manjunath
>
> -boris
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel
Dongli Zhang Dec. 4, 2018, 3:17 a.m. UTC | #9
Hi Manjunath,

On 12/04/2018 10:49 AM, Manjunath Patil wrote:
> On 12/3/2018 6:16 PM, Boris Ostrovsky wrote:
> 
>> On 12/3/18 8:14 PM, Dongli Zhang wrote:
>>> Hi Boris,
>>>
>>> On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
>>>> On 12/2/18 3:31 PM, Manjunath Patil wrote:
>>>>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>>>>>
>>>>>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>>>>>>> Thank you Boris for your comments. I removed faulty email of mine.
>>>>>>>
>>>>>>> replies inline.
>>>>>>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>>>>>>>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>>>>>>>> Hi,
>>>>>>>>> Feel free to suggest/comment on this.
>>>>>>>>>
>>>>>>>>> I am trying to do the following at dst during the migration now.
>>>>>>>>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
>>>>>>>>> 2. Store the old rinfo and nr_rings into temp variables in
>>>>>>>>> negotiate_mq()
>>>>>>>>> 3. let nr_rings get re-calculated based on backend data
>>>>>>>>> 4. try allocating new memory based on new nr_rings
>>>>>>>> Since I suspect number of rings will likely be the same why not reuse
>>>>>>>> the rings in the common case?
>>>>>>> I thought attaching devices will be more often than migration. Hence
>>>>>>> did not want add to an extra check for
>>>>>>>     - if I am inside migration code path and
>>>>>>>     - if new nr_rings is equal to old nr_rings or not
>>>>>>>
>>>>>>> Sure addition of such a thing would avoid the memory allocation
>>>>>>> altogether in migration path,
>>>>>>> but it would add a little overhead for normal device addition.
>>>>>>>
>>>>>>> Do you think its worth adding that change?
>>>>>> IMO a couple of extra checks are not going to make much difference.
>>>>> I will add this change
>>>>>> I wonder though --- have you actually seen the case where you did fail
>>>>>> allocation and changes provided in this patch made things work? I am
>>>>>> asking because right after negotiate_mq() we will call setup_blkring()
>>>>>> and it will want to allocate bunch of memory. A failure there is fatal
>>>>>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>>>>>> but then will likely fail soon after.
>>>>> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I
>>>>> included my patch, I manually triggered the ENOMEM using a debug flag.
>>>>> The patch works for ENOMEM inside negotiate_mq().
>>>>>
>>>>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
>>>>> might hit it in setup_blkring() as well.
>>>>> We should add the similar change to blkif_sring struct as well.
>>>> Won't you have a similar issue with other frontends, say, netfront?
>>> I think the kmalloc is failed not because of OOM.
>>>
>>> In fact, the size of "blkfront_ring_info" is large. When domU have 4
>>> queues/rings, the size of 4 blkfront_ring_info can be about 300+ KB.
>>>
>>> There is chance that kmalloc() 300+ KB would fail.
>>>
>>>
>>> About netfront, to kmalloc() 8 'struct netfront_queue' seems consumes <70 KB?
>> TBH these look like comparable sizes to me.  I am not convinced that
>> these changes will make a difference. If the number of rings on source
>> and destination were the same I'd absolutely agree with this patch but
>> since you are trying to handle different sizes the code becomes somewhat
>> more complex, and I am not sure it's worth it. (Can you actually give me
>> an example of when we can expect number of rings to change during
>> migration?)
>>
>> But others may think differently.
> Hi Boris,
> I think allocation of 300KB chunk[order 7 allocation] is more likely to fail
> than 70KB[order 5] especially under memory pressure.
> If it comes to that, I think we should fix this too.
> 
> The no.of rings in most cases remain 4 thanks to xen_blkif_max_queues module
> parameter.
> If the src host has allocated less than 4[may be vpcu given to this dom0 were
> less than 4], then we can expect the dst to allocate more than src side and vice
> versa.

xen_blkif_max_queues is tunable, so the size to kmalloc() would be larger when
both xen_blkif_max_queues and the dom0 vcpu count are large.

Dongli Zhang
Jürgen Groß Dec. 4, 2018, 5:57 a.m. UTC | #10
On 04/12/2018 02:14, Dongli Zhang wrote:
> Hi Boris,
> 
> On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
>> On 12/2/18 3:31 PM, Manjunath Patil wrote:
>>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>>>
>>>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>>>>> Thank you Boris for your comments. I removed faulty email of mine.
>>>>>
>>>>> replies inline.
>>>>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>>>>>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>>>>>> Hi,
>>>>>>> Feel free to suggest/comment on this.
>>>>>>>
>>>>>>> I am trying to do the following at dst during the migration now.
>>>>>>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it.
>>>>>>> 2. Store the old rinfo and nr_rings into temp variables in
>>>>>>> negotiate_mq()
>>>>>>> 3. let nr_rings get re-calculated based on backend data
>>>>>>> 4. try allocating new memory based on new nr_rings
>>>>>> Since I suspect number of rings will likely be the same why not reuse
>>>>>> the rings in the common case?
>>>>> I thought attaching devices will be more often than migration. Hence
>>>>> did not want add to an extra check for
>>>>>    - if I am inside migration code path and
>>>>>    - if new nr_rings is equal to old nr_rings or not
>>>>>
>>>>> Sure addition of such a thing would avoid the memory allocation
>>>>> altogether in migration path,
>>>>> but it would add a little overhead for normal device addition.
>>>>>
>>>>> Do you think its worth adding that change?
>>>>
>>>> IMO a couple of extra checks are not going to make much difference.
>>> I will add this change
>>>>
>>>> I wonder though --- have you actually seen the case where you did fail
>>>> allocation and changes provided in this patch made things work? I am
>>>> asking because right after negotiate_mq() we will call setup_blkring()
>>>> and it will want to allocate bunch of memory. A failure there is fatal
>>>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>>>> but then will likely fail soon after.
>>> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I
>>> included my patch, I manually triggered the ENOMEM using a debug flag.
>>> The patch works for ENOMEM inside negotiate_mq().
>>>
>>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
>>> might hit it in setup_blkring() as well.
>>> We should add the similar change to blkif_sring struct as well.
>>
>>
>> Won't you have a similar issue with other frontends, say, netfront?
> 
> I think the kmalloc is failed not because of OOM.
> 
> In fact, the size of "blkfront_ring_info" is large. When domU have 4
> queues/rings, the size of 4 blkfront_ring_info can be about 300+ KB.
> 
> There is chance that kmalloc() 300+ KB would fail.

So kmalloc() might not be the best choice. Any reason why you don't
change it to vmalloc()? This should address the problem in a much
simpler way.
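
For what it's worth, a minimal sketch of that approach (assuming the
kvcalloc()/kvfree() helpers are acceptable here; this is not a posted patch):

	/* negotiate_mq(): let the allocation fall back to vmalloc() for large
	 * ring counts instead of requiring physically contiguous memory. */
	info->rinfo = kvcalloc(info->nr_rings,
			       sizeof(struct blkfront_ring_info),
			       GFP_KERNEL);

	/* blkif_free(): the matching free. */
	kvfree(info->rinfo);
	info->rinfo = NULL;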


Juergen

Patch

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0ed4b20..041ba67 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1353,9 +1353,17 @@  static void blkif_free(struct blkfront_info *info, int suspend)
 	for (i = 0; i < info->nr_rings; i++)
 		blkif_free_ring(&info->rinfo[i]);
 
-	kfree(info->rinfo);
-	info->rinfo = NULL;
-	info->nr_rings = 0;
+	if (unlikely(info->connected == BLKIF_STATE_SUSPENDED)) {
+		/* We are migrating. You may reuse it. Just clean. */
+		for (i = 0; i < info->nr_rings; i++) {
+			memset(&info->rinfo[i], 0,
+				sizeof(struct blkfront_ring_info));
+		}
+	} else {
+		kfree(info->rinfo);
+		info->rinfo = NULL;
+		info->nr_rings = 0;
+	}
 }
 
 struct copy_from_grant {
@@ -1903,6 +1911,16 @@  static int negotiate_mq(struct blkfront_info *info)
 {
 	unsigned int backend_max_queues;
 	unsigned int i;
+	struct blkfront_ring_info *rinfo_old = NULL;
+	unsigned int nr_rings_old = 0;
+
+	/* Migrating. We did not free old rinfo. Reuse it if possible */
+	if (unlikely(info->connected == BLKIF_STATE_SUSPENDED)) {
+		nr_rings_old = info->nr_rings;
+		rinfo_old = info->rinfo;
+		info->rinfo = NULL;
+		info->nr_rings = 0;
+	}
 
 	BUG_ON(info->nr_rings);
 
@@ -1918,10 +1936,24 @@  static int negotiate_mq(struct blkfront_info *info)
 			      sizeof(struct blkfront_ring_info),
 			      GFP_KERNEL);
 	if (!info->rinfo) {
-		xenbus_dev_fatal(info->xbdev, -ENOMEM, "allocating ring_info structure");
-		info->nr_rings = 0;
-		return -ENOMEM;
-	}
+		if (unlikely(nr_rings_old)) {
+			/* We might waste some memory if
+			 * info->nr_rings < nr_rings_old
+			 */
+			info->rinfo = rinfo_old;
+			if (info->nr_rings > nr_rings_old)
+				info->nr_rings = nr_rings_old;
+			xenbus_dev_fatal(info->xbdev, -ENOMEM,
+			"reusing old ring_info structure(new ring size=%d)",
+				info->nr_rings);
+		} else {
+			xenbus_dev_fatal(info->xbdev, -ENOMEM,
+				"allocating ring_info structure");
+			info->nr_rings = 0;
+			return -ENOMEM;
+		}
+	} else if (unlikely(nr_rings_old))
+		kfree(rinfo_old);
 
 	for (i = 0; i < info->nr_rings; i++) {
 		struct blkfront_ring_info *rinfo;