[0/4] colo: Introduce resource agent and high-level test

Message ID: cover.1574356137.git.lukasstraub2@web.de

Message

Lukas Straub Nov. 21, 2019, 5:49 p.m. UTC
Hello Everyone,
These patches introduce a resource agent for use with the Pacemaker CRM and a
high-level test utilizing it for testing qemu COLO.

The resource agent manages qemu COLO including continuous replication.

Currently, the second test case (where the peer qemu is frozen) fails on primary
failover, because qemu hangs while removing the replication-related block nodes.
Note that this also happens in real-world tests when cutting power to the peer
host, so this needs to be fixed.

Based-on: <cover.1571925699.git.lukasstraub2@web.de>
([PATCH v7 0/4] colo: Add support for continuous replication)

Lukas Straub (4):
  block/quorum.c: stable children names
  colo: Introduce resource agent
  colo: Introduce high-level test
  MAINTAINERS: Add myself as maintainer for COLO resource agent

 MAINTAINERS                            |    6 +
 block/quorum.c                         |    6 +
 scripts/colo-resource-agent/colo       | 1026 ++++++++++++++++++++++++
 scripts/colo-resource-agent/crm_master |   44 +
 tests/acceptance/colo.py               |  444 ++++++++++
 5 files changed, 1526 insertions(+)
 create mode 100755 scripts/colo-resource-agent/colo
 create mode 100755 scripts/colo-resource-agent/crm_master
 create mode 100644 tests/acceptance/colo.py

--
2.20.1

Comments

Dr. David Alan Gilbert Nov. 22, 2019, 9:46 a.m. UTC | #1
* Lukas Straub (lukasstraub2@web.de) wrote:
> Hello Everyone,
> These patches introduce a resource agent for use with the Pacemaker CRM and a
> high-level test utilizing it for testing qemu COLO.
> 
> The resource agent manages qemu COLO including continuous replication.
> 
> Currently, the second test case (where the peer qemu is frozen) fails on primary
> failover, because qemu hangs while removing the replication-related block nodes.
> Note that this also happens in real-world tests when cutting power to the peer
> host, so this needs to be fixed.

Do you understand why that happens? Is it that it's trying to finish a
read/write to the dead partner?

Dave

> Based-on: <cover.1571925699.git.lukasstraub2@web.de>
> ([PATCH v7 0/4] colo: Add support for continuous replication)
> 
> Lukas Straub (4):
>   block/quorum.c: stable children names
>   colo: Introduce resource agent
>   colo: Introduce high-level test
>   MAINTAINERS: Add myself as maintainer for COLO resource agent
> 
>  MAINTAINERS                            |    6 +
>  block/quorum.c                         |    6 +
>  scripts/colo-resource-agent/colo       | 1026 ++++++++++++++++++++++++
>  scripts/colo-resource-agent/crm_master |   44 +
>  tests/acceptance/colo.py               |  444 ++++++++++
>  5 files changed, 1526 insertions(+)
>  create mode 100755 scripts/colo-resource-agent/colo
>  create mode 100755 scripts/colo-resource-agent/crm_master
>  create mode 100644 tests/acceptance/colo.py
> 
> --
> 2.20.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Lukas Straub Nov. 27, 2019, 9:11 p.m. UTC | #2
On Fri, 22 Nov 2019 09:46:46 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Lukas Straub (lukasstraub2@web.de) wrote:
> > Hello Everyone,
> > These patches introduce a resource agent for use with the Pacemaker CRM and a
> > high-level test utilizing it for testing qemu COLO.
> >
> > The resource agent manages qemu COLO including continuous replication.
> >
> > Currently, the second test case (where the peer qemu is frozen) fails on primary
> > failover, because qemu hangs while removing the replication-related block nodes.
> > Note that this also happens in real-world tests when cutting power to the peer
> > host, so this needs to be fixed.
>
> Do you understand why that happens? Is it that it's trying to finish a
> read/write to the dead partner?
>
> Dave

I haven't looked into it too closely yet, but it's often hanging in bdrv_flush()
while removing the replication blockdev, and of course that's probably because
the NBD client is waiting for a reply. So I tried the workaround below, which
actively kills the TCP connection; with it, the test passes, though I haven't
tested it in the real world yet.

A proper solution to this would probably be a "force" parameter for blockdev-del,
which skips all flushing and aborts all in-flight I/O. Or we could add a timeout
to the NBD client.
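
For illustration, a minimal sketch of what driving such a hypothetical
"force" parameter over QMP could look like; the socket path and node name
are placeholders, and blockdev-del does not actually accept "force" today:

import json
import socket

def qmp_command(f, msg):
    # Send one QMP command and read back one reply line.
    f.write(json.dumps(msg).encode() + b'\n')
    f.flush()
    return json.loads(f.readline())

sock = socket.socket(socket.AF_UNIX)
sock.connect("/var/run/colo-qmp.sock")  # placeholder socket path
f = sock.makefile("rwb")
json.loads(f.readline())                # consume the QMP greeting
qmp_command(f, {"execute": "qmp_capabilities"})

# Hypothetical: "force" would skip flushing and abort in-flight I/O;
# current blockdev-del takes only "node-name".
qmp_command(f, {"execute": "blockdev-del",
                "arguments": {"node-name": "replication0",  # placeholder
                              "force": True}})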

Regards,
Lukas Straub

diff --git a/scripts/colo-resource-agent/colo b/scripts/colo-resource-agent/colo
index 5fd9cfc0b5..62210af2a1 100755
--- a/scripts/colo-resource-agent/colo
+++ b/scripts/colo-resource-agent/colo
@@ -935,6 +935,7 @@ def qemu_colo_notify():
            and HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname):
             fd = qmp_open()
             peer = qmp_get_nbd_remote(fd)
+            os.system("sudo ss -K dst %s dport = %s" % (peer, NBD_PORT))
             if peer == str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname):
                 if qmp_check_resync(fd) != None:
                     qmp_cancel_resync(fd)
Lukas Straub Dec. 18, 2019, 9:27 a.m. UTC | #3
On Wed, 27 Nov 2019 22:11:34 +0100
Lukas Straub <lukasstraub2@web.de> wrote:

> On Fri, 22 Nov 2019 09:46:46 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>
> > * Lukas Straub (lukasstraub2@web.de) wrote:
> > > Hello Everyone,
> > > These patches introduce a resource agent for use with the Pacemaker CRM and a
> > > high-level test utilizing it for testing qemu COLO.
> > >
> > > The resource agent manages qemu COLO including continuous replication.
> > >
> > > Currently, the second test case (where the peer qemu is frozen) fails on primary
> > > failover, because qemu hangs while removing the replication-related block nodes.
> > > Note that this also happens in real-world tests when cutting power to the peer
> > > host, so this needs to be fixed.
> >
> > Do you understand why that happens? Is it that it's trying to finish a
> > read/write to the dead partner?
> >
> > Dave
>
> I haven't looked into it too closely yet, but it's often hanging in bdrv_flush()
> while removing the replication blockdev, and of course that's probably because
> the NBD client is waiting for a reply. So I tried the workaround below, which
> actively kills the TCP connection; with it, the test passes, though I haven't
> tested it in the real world yet.
>

In the real cluster, sometimes qemu even hangs while connecting to QMP (after
remote poweroff). But I currently don't have the time to look into it.

Still, a failing test is better than no test. Could we mark this test as known-bad
and fix this issue later? How should I mark it as known-bad? By tag? Or with a
warning in the log?

Regards,
Lukas Straub
Dr. David Alan Gilbert Dec. 18, 2019, 7:46 p.m. UTC | #4
* Lukas Straub (lukasstraub2@web.de) wrote:
> On Wed, 27 Nov 2019 22:11:34 +0100
> Lukas Straub <lukasstraub2@web.de> wrote:
> 
> > On Fri, 22 Nov 2019 09:46:46 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >
> > > * Lukas Straub (lukasstraub2@web.de) wrote:
> > > > Hello Everyone,
> > > > These patches introduce a resource agent for use with the Pacemaker CRM and a
> > > > high-level test utilizing it for testing qemu COLO.
> > > >
> > > > The resource agent manages qemu COLO including continuous replication.
> > > >
> > > > Currently, the second test case (where the peer qemu is frozen) fails on primary
> > > > failover, because qemu hangs while removing the replication-related block nodes.
> > > > Note that this also happens in real-world tests when cutting power to the peer
> > > > host, so this needs to be fixed.
> > >
> > > Do you understand why that happens? Is it that it's trying to finish a
> > > read/write to the dead partner?
> > >
> > > Dave
> >
> > I haven't looked into it too closely yet, but it's often hanging in bdrv_flush()
> > while removing the replication blockdev, and of course that's probably because
> > the NBD client is waiting for a reply. So I tried the workaround below, which
> > actively kills the TCP connection; with it, the test passes, though I haven't
> > tested it in the real world yet.
> >
> 
> In the real cluster, sometimes qemu even hangs while connecting to QMP (after
> remote poweroff). But I currently don't have the time to look into it.

That doesn't surprise me too much; QMP is mostly handled in the main
thread, as are a lot of other things, and because of that my assumption
for a while has been that this is where COLO hangs. However, there's a
way to fix it.

A while ago, Peter Xu added a feature called 'out of band' to QMP; you
can open a QMP connection, set the OOB feature, and then commands that
are marked as OOB are executed off the main thread on that connection.

At the moment we've just got the one real OOB command, 'migrate-recover',
which is used for recovering postcopy from a similar failure to the COLO
case.
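
Concretely, the handshake Dave describes looks roughly like this (a
sketch, assuming a QMP unix socket at ./qmp.sock and a qemu new enough
to advertise the "oob" capability; a real client also has to cope with
events and out-of-order OOB replies, which is glossed over here):

import json
import socket

def qmp_command(f, msg):
    # Send one QMP command and read back one reply line.
    f.write(json.dumps(msg).encode() + b'\n')
    f.flush()
    return json.loads(f.readline())

sock = socket.socket(socket.AF_UNIX)
sock.connect("./qmp.sock")  # assumed socket path
f = sock.makefile("rwb")

greeting = json.loads(f.readline())
assert "oob" in greeting["QMP"]["capabilities"]  # server offers OOB

# Opt in to out-of-band execution on this connection.
qmp_command(f, {"execute": "qmp_capabilities",
                "arguments": {"enable": ["oob"]}})

# "exec-oob" instead of "execute" runs the command off the main thread,
# so it still works when the main loop is stuck in e.g. bdrv_flush().
qmp_command(f, {"exec-oob": "migrate-recover",
                "arguments": {"uri": "tcp:0:4444"}})  # example URI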

To fix this, you'd have to convert colo-lost-heartbeat to be an OOB
command; note it's not that trivial, because you have to make sure the
code that runs as part of the OOB command doesn't take any locks that
could block on something in the main thread. It can set flags, start
new threads, perhaps call shutdown() on a socket, but it takes some
thinking about.
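
In QAPI terms that would mean adding 'allow-oob': true to the command
definition and auditing its handler as above. On the wire, reusing the
qmp_command helper from the earlier sketch and assuming the conversion
had actually happened (it has not, as of this thread), failover would
then no longer depend on a responsive main loop:

# x-colo-lost-heartbeat is the QMP spelling of the command; issuing it
# out-of-band is hypothetical until the command is marked allow-oob.
qmp_command(f, {"exec-oob": "x-colo-lost-heartbeat"})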


> Still, a failing test is better than no test. Could we mark this test as known-bad
> and fix this issue later? How should I mark it as known-bad? By tag? Or with a
> warning in the log?

Not sure about that; cc'ing Thomas. Maybe thuth knows?

Dave

> Regards,
> Lukas Straub
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK