[V3,0/4] Introduce Advanced Watch Dog module

Message ID 20191119123016.13740-1-chen.zhang@intel.com (mailing list archive)

Message

Zhang Chen Nov. 19, 2019, 12:30 p.m. UTC
From: Zhang Chen <chen.zhang@intel.com>

Advanced Watch Dog (AWD) is a universal monitoring module on the VMM side.
It can be used to detect a network outage (VMM to guest, VMM to VMM, or VMM
to another remote server) and then perform a previously configured operation.
The current AWD patch simply accepts any input as the signal to refresh the
watchdog timer; a specific interactive protocol could also be defined here.
For the output, the user can pre-write commands or messages in the AWD
opt-script. We noticed that there is no way for VMMs to communicate directly.
Some people may think we don't need such a thing (upper-layer software like
OpenStack can handle it), but in working with real customers we found that in
some cases they need a lightweight and efficient mechanism to solve practical
problems (OpenStack is too heavy).
For example: when AWD detects a lost connection with the paired node, it can
send a message to the admin, notify another VMM, send a QMP command to QEMU to
perform an operation such as restarting the VM, build a VMM heartbeat system,
etc.
It gives users basic VM/host network monitoring tools and a basic fault
tolerance and recovery solution.
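
The behaviour described above (any input refreshes the timer, and a timeout
triggers a pre-set operation) can be illustrated with a small sketch. This is
not the code from net/awd.c; the class name WatchdogSketch, its methods, and
the fire-once behaviour are assumptions made purely for illustration.

```python
import time
from typing import Callable


class WatchdogSketch:
    """Toy model: any input refreshes the timer; a timeout runs an action once.

    Hypothetical illustration only, not the AWD implementation.
    """

    def __init__(self, timeout_ms: int, on_timeout: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_ms / 1000.0
        self.on_timeout = on_timeout
        self.clock = clock              # injectable so the logic can be tested
        self.last_input = clock()
        self.fired = False

    def feed(self, data: bytes) -> None:
        # The content of `data` is ignored: any input counts as a refresh.
        self.last_input = self.clock()

    def poll(self) -> bool:
        # Run the pre-set action once when no input arrived within the timeout.
        # Firing only once is an assumption of this sketch.
        if not self.fired and self.clock() - self.last_input >= self.timeout_s:
            self.fired = True
            self.on_timeout()
        return self.fired
```

In this sketch the caller is expected to invoke poll() periodically, for
example from a worker thread, and to pass whatever the opt-script should do
as the on_timeout callback.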

Demo usage (for COLO heartbeat service):

In primary node:

-chardev socket,id=h1,host=3.3.3.3,port=9009,server,nowait
-chardev socket,id=heartbeat0,host=3.3.3.3,port=4445
-object iothread,id=iothread1
-object advanced-watchdog,id=heart1,server=on,awd_node=h1,notification_node=heartbeat0,opt_script=colo_opt_script_path,iothread=iothread1,pulse_interval=1000,timeout=5000

In secondary node:

-monitor tcp::4445,server,nowait
-chardev socket,id=h1,host=3.3.3.3,port=9009,reconnect=1
-chardev socket,id=heart1,host=3.3.3.8,port=4445
-object iothread,id=iothread1
-object advanced-watchdog,id=heart1,server=off,awd_node=h1,notification_node=heart1,opt_script=colo_secondary_opt_script,iothread=iothread1,timeout=10000
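
Since the cover letter says the current patch treats any input as a refresh
signal, a peer that is not itself a QEMU instance could in principle keep the
watchdog alive by writing bytes to the awd_node socket. The sketch below
assumes exactly that; the function name send_pulses and the "pulse\n" payload
are hypothetical, and the series does not define a payload format.

```python
import socket
import time


def send_pulses(sock: socket.socket, count: int, interval_s: float = 1.0,
                payload: bytes = b"pulse\n") -> None:
    """Write `payload` to `sock` `count` times, pausing between writes.

    The payload is arbitrary: this sketch relies on the statement that any
    input refreshes the watchdog timer.
    """
    for i in range(count):
        sock.sendall(payload)
        if i + 1 < count:
            time.sleep(interval_s)


# Possible usage against the primary node's `h1` chardev from the demo
# (address and port copied from the command line above):
#
#   with socket.create_connection(("3.3.3.3", 9009)) as sock:
#       send_pulses(sock, count=10, interval_s=1.0)
```

Passing an already connected socket keeps the function independent of how
the connection is made, so the same code works for TCP or a local socket
pair.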


V3:
 - Rebased on QEMU 4.2.0-rc1 code.
 - Fixed the commit message issue.

V2:
 - Addressed Philippe's comments: added a configure selector for AWD.

Initial:
 - Initial version.


Zhang Chen (4):
  net/awd.c: Introduce Advanced Watch Dog module framework
  net/awd.c: Initailize input/output chardev
  net/awd.c: Load advanced watch dog worker thread job
  vl.c: Make Advanced Watch Dog delayed initialization

 configure         |   9 +
 net/Makefile.objs |   1 +
 net/awd.c         | 491 ++++++++++++++++++++++++++++++++++++++++++++++
 qemu-options.hx   |   6 +
 vl.c              |   7 +
 5 files changed, 514 insertions(+)
 create mode 100644 net/awd.c

Comments

Zhang Chen Dec. 8, 2019, 5:52 p.m. UTC | #1
Hi All~

No news for a long time.

Please give me more comments about this series.


Thanks

Zhang Chen

Paolo Bonzini Dec. 9, 2019, 9:08 a.m. UTC | #2
On 08/12/19 18:52, Zhang, Chen wrote:
> Hi All~
> 
> No news for a long time.
> 
> Please give me more comments about this series.

Sorry, people were probably busy with the QEMU release candidates.

Even before looking at the code, the series is completely missing
documentation on how to use it and on the chardev protocol.  The
documentation should go in docs/ and should be written as reStructuredText.

The qemu-options.hx patches also lack documentation about the properties
accepted by the new object.

In particular:

>> -chardev socket,id=h1,host=3.3.3.3,port=9009,server,nowait
>> -chardev socket,id=heartbeat0,host=3.3.3.3,port=4445
>> -object iothread,id=iothread2
>> -object
>> advanced-watchdog,id=heart1,server=on,awd_node=h1,notification_node=heartbeat0,opt_script=colo_opt_script_path,iothread=iothread1,pulse_interval=1000,timeout=5000

What are the two sockets for, and what should be in colo_opt_script_path?

>>
>> In secondary node:
>>
>> -monitor tcp::4445,server,nowait
>> -chardev socket,id=h1,host=3.3.3.3,port=9009,reconnect=1
>> -chardev socket,id=heart1,host=3.3.3.8,port=4445
>> -object iothread,id=iothread1
>> -object
>> advanced-watchdog,id=heart1,server=off,awd_node=h1,notification_node=heart1,opt_script=colo_secondary_opt_script,iothread=iothread1,timeout=10000

Same here.

Paolo
Zhang Chen Dec. 10, 2019, 12:29 a.m. UTC | #3
On 12/9/2019 5:08 PM, Paolo Bonzini wrote:
> On 08/12/19 18:52, Zhang, Chen wrote:
>> Hi All~
>>
>> No news for a long time.
>>
>> Please give me more comments about this series.
> Sorry, people were probably busy with the QEMU release candidates.
>
> Even before looking at the code, the series is completely missing
> documentation on how to use it and on the chardev protocol.  The
> documentation should go in docs/ and should be written as restructuredText.
>
> The qemu-options.hx patches also lack documentation about the properties
> accepted by the new object.

OK, I will add documentation in docs/ and in qemu-options.hx in the next
version.

For the chardev protocol, the current implementation just uses plaintext,
which makes it easy to connect AWD to other user-defined nodes. I am not
very familiar with this area. Do you have any suggestions? Would a more
general protocol be better? Or does Jason have any suggestions?


>
> In particular:
>
>>> -chardev socket,id=h1,host=3.3.3.3,port=9009,server,nowait
>>> -chardev socket,id=heartbeat0,host=3.3.3.3,port=4445
>>> -object iothread,id=iothread2
>>> -object
>>> advanced-watchdog,id=heart1,server=on,awd_node=h1,notification_node=heartbeat0,opt_script=colo_opt_script_path,iothread=iothread1,pulse_interval=1000,timeout=5000
> What are the two sockets for, and what should be in colo_opt_script_path?

Anything the user wants to send when a timeout occurs. For example, if a
timeout is detected, AWD sends the quit command to QEMU:

colo_opt_script_path=/tmp/qemu-qmp-quit.script

------------------------------------

qemu-qmp-quit.script:

  { "execute": "quit" }

------------------------------------
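
Because the script content is sent as-is on timeout, it may be worth
checking a script before pointing opt_script at it. The sketch below is a
standalone helper, not part of the series; the name parse_qmp_script is
hypothetical, and allowing several concatenated JSON objects in one script
is an assumption (the example above contains only one).

```python
import json
from typing import Any, Dict, List


def parse_qmp_script(text: str) -> List[Dict[str, Any]]:
    """Parse one or more concatenated JSON objects and check each one
    looks like a QMP command, i.e. has a string "execute" member."""
    decoder = json.JSONDecoder()
    commands: List[Dict[str, Any]] = []
    pos = 0
    while True:
        # raw_decode() does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        if not isinstance(obj, dict) or not isinstance(obj.get("execute"), str):
            raise ValueError("not a QMP command: %r" % (obj,))
        commands.append(obj)
    return commands
```

Malformed JSON raises json.JSONDecodeError, which is a subclass of
ValueError, so a caller can catch ValueError for both kinds of failure. Note
also that a QMP monitor normally expects a qmp_capabilities command before
any other command; how AWD deals with that is not stated in this thread.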


Thanks

Zhang Chen

>
>>> In secondary node:
>>>
>>> -monitor tcp::4445,server,nowait
>>> -chardev socket,id=h1,host=3.3.3.3,port=9009,reconnect=1
>>> -chardev socket,id=heart1,host=3.3.3.8,port=4445
>>> -object iothread,id=iothread1
>>> -object
>>> advanced-watchdog,id=heart1,server=off,awd_node=h1,notification_node=heart1,opt_script=colo_secondary_opt_script,iothread=iothread1,timeout=10000
> Same here.
>
> Paolo
>