[49/51] io/channel-watch: Fix socket watch on Windows

Message ID	20220824094029.1634519-50-bmeng.cn@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> From: Bin Meng <bmeng.cn@gmail.com> To: qemu-devel@nongnu.org Cc: Bin Meng <bin.meng@windriver.com>, =?utf-8?q?Daniel_P=2E_Berrang=C3=A9?= <berrange@redhat.com> Subject: [PATCH 49/51] io/channel-watch: Fix socket watch on Windows Date: Wed, 24 Aug 2022 17:40:27 +0800 Message-Id: <20220824094029.1634519-50-bmeng.cn@gmail.com> In-Reply-To: <20220824094029.1634519-1-bmeng.cn@gmail.com> References: <20220824094029.1634519-1-bmeng.cn@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2607:f8b0:4864:20::634; envelope-from=bmeng.cn@gmail.com; helo=mail-pl1-x634.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	tests/qtest: Enable running qtest on Windows \| expand [00/51] tests/qtest: Enable running qtest on Windows [01/51] tests/qtest: Use g_setenv() [02/51] tests/qtest: Use g_mkdtemp() [03/51] block: Unify the get_tmp_filename() implementation [04/51] semihosting/arm-compat-semi: Avoid using hardcoded /tmp [05/51] tcg: Avoid using hardcoded /tmp [06/51] util/qemu-sockets: Use g_get_tmp_dir() to get the directory for temporary files [07/51] tests: Avoid using hardcoded /tmp in test cases [08/51] block/vvfat: Unify the mkdir() call [09/51] fsdev/virtfs-proxy-helper: Use g_mkdir_with_parents() [10/51] hw/usb: dev-mtp: Use g_mkdir_with_parents() [11/51] qga/commands-posix-ssh: Use g_mkdir_with_parents() [12/51] tests: Use g_mkdir_with_parents() [13/51] tests/qtest: migration-test: Handle link() for win32 [14/51] backends/tpm: Exclude headers and macros that don't exist on win32 [15/51] tests/qtest: Adapt {m48t59,rtc}-test cases for win32 [16/51] tests/qtest: Build e1000e-test for posix only [17/51] tests/qtest: Build virtio-net-test for posix only [18/51] tests/qtest: Build cases that use memory-backend-file for posix only [19/51] tests/qtest: Build test-filter-{mirror, redirector} cases for posix only [20/51] tests/qtest: i440fx-test: Skip running request_{bios, pflash} for win32 [21/51] tests/qtest: migration-test: Skip running test_migrate_fd_proto on win32 [22/51] tests/qtest: qmp-test: Skip running test_qmp_oob for win32 [23/51] accel/qtest: Support qtest accelerator for Windows [24/51] tests/qtest: libqos: Drop inclusion of <sys/wait.h> [25/51] tests/qtest: libqos: Rename malloc.h to libqos-malloc.h [26/51] tests/qtest: libqtest: Move global_qtest definition back to libqtest.c [27/51] tests/qtest: Use send/recv for socket communication [28/51] tests/qtest: libqtest: Exclude the *_fds APIs for win32 [29/51] tests/qtest: libqtest: Install signal handler via signal() [30/51] tests: Skip iotests and qtest when '--without-default-devices' [31/51] tests/qtest: Support libqtest to build and run on Windows [32/51] tests/qtest: Fix ERROR_SHARING_VIOLATION for win32 [33/51] tests/qtest: {ahci, ide}-test: Use relative path for temporary files [34/51] tests/qtest: bios-tables-test: Adapt the case for win32 [35/51] tests/qtest: device-plug-test: Reverse the usage of double/single quotes [36/51] tests/qtest: machine-none-test: Use double quotes to pass the cpu option [37/51] tests/qtest: migration-test: Disable IO redirection for win32 [38/51] tests/qtest: {ahci,ide}-test: Open file in binary mode [39/51] tests/qtest: virtio-net-failover: Disable migration tests for win32 [40/51] chardev/char-file: Add FILE_SHARE_WRITE when openning the file for win32 [41/51] tests/qtest: migration-test: Kill "to" after migration is canceled [42/51] hw/ppc: spapr: Use qemu_vfree() to free spapr->htab [43/51] tests/qtest: npcm7xx_emc-test: Skip running test_{tx, rx} on win32 [44/51] tests/qtest: microbit-test: Fix socket access for win32 [45/51] tests/qtest: prom-env-test: Use double quotes to pass the prom-env option [46/51] tests/qtest: libqtest: Replace the call to close a socket with closesocket() [47/51] tests/qtest: libqtest: Correct the timeout unit of blocking receive calls for win32 [48/51] io/channel-watch: Drop a superfluous '#ifdef WIN32' [49/51] io/channel-watch: Fix socket watch on Windows [50/51] .gitlab-ci.d/windows.yml: Increase the timeout to the runner limit [51/51] docs/devel: testing: Document writing portable test cases

Bin Meng Aug. 24, 2022, 9:40 a.m. UTC

From: Bin Meng <bin.meng@windriver.com>

Random failure was observed when running qtests on Windows due to
"Broken pipe" detected by qmp_fd_receive(). What happened is that
the qtest executable sends testing data over a socket to the QEMU
under test but no response is received. The errno of the recv()
call from the qtest executable indicates ETIMEOUT, due to the qmp
chardev's tcp_chr_read() is never called to receive testing data
hence no response is sent to the other side.

tcp_chr_read() is registered as the callback of the socket watch
GSource. The reason of the callback not being called by glib, is
that the source check fails to indicate the source is ready. There
are two socket watch sources created to monitor the same socket
event object from the char-socket backend in update_ioc_handlers().
During the source check phase, qio_channel_socket_source_check()
calls WSAEnumNetworkEvents() to discovers occurrences of network
events for the indicated socket, clear internal network event records,
and reset the event object. Testing shows that if we don't reset the
event object by not passing the event handle to WSAEnumNetworkEvents()
the symptom goes away and qtest runs very stably.

It looks we don't need to call WSAEnumNetworkEvents() at all, as we
don't parse the result of WSANETWORKEVENTS returned from this API.
We use select() to poll the socket status. Fix this instability by
dropping the WSAEnumNetworkEvents() call.

Signed-off-by: Bin Meng <bin.meng@windriver.com>
---
During the testing, I removed the following codes in update_ioc_handlers():

    remove_hup_source(s);
    s->hup_source = qio_channel_create_watch(s->ioc, G_IO_HUP);
    g_source_set_callback(s->hup_source, (GSourceFunc)tcp_chr_hup,
                          chr, NULL);
    g_source_attach(s->hup_source, chr->gcontext);

and such change also makes the symptom go away.

And if I moved the above codes to the beginning, before the call to
io_add_watch_poll(), the symptom also goes away.

It seems two sources watching on the same socket event object is
the key that leads to the instability. The order of adding a source
watch seems to also play a role but I can't explain why.
Hopefully a Windows and glib expert could explain this behavior.

 io/channel-watch.c | 4 ----
 1 file changed, 4 deletions(-)

Marc-André Lureau Sept. 1, 2022, 12:58 p.m. UTC | #1

Hi

On Wed, Aug 24, 2022 at 2:49 PM Bin Meng <bmeng.cn@gmail.com> wrote:

> From: Bin Meng <bin.meng@windriver.com>
>
> Random failure was observed when running qtests on Windows due to
> "Broken pipe" detected by qmp_fd_receive(). What happened is that
> the qtest executable sends testing data over a socket to the QEMU
> under test but no response is received. The errno of the recv()
> call from the qtest executable indicates ETIMEOUT, due to the qmp
> chardev's tcp_chr_read() is never called to receive testing data
> hence no response is sent to the other side.
>
> tcp_chr_read() is registered as the callback of the socket watch
> GSource. The reason of the callback not being called by glib, is
> that the source check fails to indicate the source is ready. There
> are two socket watch sources created to monitor the same socket
> event object from the char-socket backend in update_ioc_handlers().
>
During the source check phase, qio_channel_socket_source_check()
> calls WSAEnumNetworkEvents() to discovers occurrences of network
> events for the indicated socket, clear internal network event records,
> and reset the event object. Testing shows that if we don't reset the
> event object by not passing the event handle to WSAEnumNetworkEvents()
> the symptom goes away and qtest runs very stably.
>
> It looks we don't need to call WSAEnumNetworkEvents() at all, as we
> don't parse the result of WSANETWORKEVENTS returned from this API.
> We use select() to poll the socket status. Fix this instability by
> dropping the WSAEnumNetworkEvents() call.
>
> Signed-off-by: Bin Meng <bin.meng@windriver.com>
>

What clears the event then?



> ---
> During the testing, I removed the following codes in update_ioc_handlers():
>
>     remove_hup_source(s);
>     s->hup_source = qio_channel_create_watch(s->ioc, G_IO_HUP);
>     g_source_set_callback(s->hup_source, (GSourceFunc)tcp_chr_hup,
>                           chr, NULL);
>     g_source_attach(s->hup_source, chr->gcontext);
>
> and such change also makes the symptom go away.
>
> And if I moved the above codes to the beginning, before the call to
> io_add_watch_poll(), the symptom also goes away.
>
> It seems two sources watching on the same socket event object is
> the key that leads to the instability. The order of adding a source
> watch seems to also play a role but I can't explain why.
> Hopefully a Windows and glib expert could explain this behavior.
>
>
Feel free to leave that comment in the commit message.

This is strange, as both sources should have different events, clearing one
shouldn't affect the other.

I guess it's WSAEnumNetworkEvents clearing of the internal network event
records that is problematic.

Can you check if you replace the call with ResetEvent() everything works?



>  io/channel-watch.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/io/channel-watch.c b/io/channel-watch.c
> index 89f3c8a88a..e34d86e810 100644
> --- a/io/channel-watch.c
> +++ b/io/channel-watch.c
> @@ -115,17 +115,13 @@ static gboolean
>  qio_channel_socket_source_check(GSource *source)
>  {
>      static struct timeval tv0;
> -
>      QIOChannelSocketSource *ssource = (QIOChannelSocketSource *)source;
> -    WSANETWORKEVENTS ev;
>      fd_set rfds, wfds, xfds;
>
>      if (!ssource->condition) {
>          return 0;
>      }
>
> -    WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> -
>      FD_ZERO(&rfds);
>      FD_ZERO(&wfds);
>      FD_ZERO(&xfds);
>

Unrelated, after this chunk, there is
        FD_SET((SOCKET)ssource->socket, &rfds);

it seems we can drop the cast. Feel free to send another patch.

Bin Meng Sept. 4, 2022, 6:24 a.m. UTC | #2

On Thu, Sep 1, 2022 at 8:58 PM Marc-André Lureau
<marcandre.lureau@gmail.com> wrote:
>
> Hi
>
> On Wed, Aug 24, 2022 at 2:49 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>>
>> From: Bin Meng <bin.meng@windriver.com>
>>
>> Random failure was observed when running qtests on Windows due to
>> "Broken pipe" detected by qmp_fd_receive(). What happened is that
>> the qtest executable sends testing data over a socket to the QEMU
>> under test but no response is received. The errno of the recv()
>> call from the qtest executable indicates ETIMEOUT, due to the qmp
>> chardev's tcp_chr_read() is never called to receive testing data
>> hence no response is sent to the other side.
>>
>> tcp_chr_read() is registered as the callback of the socket watch
>> GSource. The reason of the callback not being called by glib, is
>> that the source check fails to indicate the source is ready. There
>> are two socket watch sources created to monitor the same socket
>> event object from the char-socket backend in update_ioc_handlers().
>>
>> During the source check phase, qio_channel_socket_source_check()
>> calls WSAEnumNetworkEvents() to discovers occurrences of network
>> events for the indicated socket, clear internal network event records,
>> and reset the event object. Testing shows that if we don't reset the
>> event object by not passing the event handle to WSAEnumNetworkEvents()
>> the symptom goes away and qtest runs very stably.
>>
>> It looks we don't need to call WSAEnumNetworkEvents() at all, as we
>> don't parse the result of WSANETWORKEVENTS returned from this API.
>> We use select() to poll the socket status. Fix this instability by
>> dropping the WSAEnumNetworkEvents() call.
>>
>> Signed-off-by: Bin Meng <bin.meng@windriver.com>
>
>
> What clears the event then?
>

It seems we don't need to clear the event as everything still works as expected.

>>
>> ---
>> During the testing, I removed the following codes in update_ioc_handlers():
>>
>>     remove_hup_source(s);
>>     s->hup_source = qio_channel_create_watch(s->ioc, G_IO_HUP);
>>     g_source_set_callback(s->hup_source, (GSourceFunc)tcp_chr_hup,
>>                           chr, NULL);
>>     g_source_attach(s->hup_source, chr->gcontext);
>>
>> and such change also makes the symptom go away.
>>
>> And if I moved the above codes to the beginning, before the call to
>> io_add_watch_poll(), the symptom also goes away.
>>
>> It seems two sources watching on the same socket event object is
>> the key that leads to the instability. The order of adding a source
>> watch seems to also play a role but I can't explain why.
>> Hopefully a Windows and glib expert could explain this behavior.
>>
>
> Feel free to leave that comment in the commit message.

Sure, will add the above message into the commit in v2.

>
> This is strange, as both sources should have different events, clearing one shouldn't affect the other.

Both sources have the same event, as one QIO channel only has one
socket, and one socket can only be bound to one event.

>
> I guess it's WSAEnumNetworkEvents clearing of the internal network event records that is problematic.
>
> Can you check if you replace the call with ResetEvent() everything works?

No, ResetEvent() does not work, and is not recommended by MSDN [1]
too, which says:

The proper way to reset the state of an event object used with the
WSAEventSelect function is to pass the handle of the event object to
the WSAEnumNetworkEvents function in the hEventObject parameter. This
will reset the event object and adjust the status of active FD events
on the socket in an atomic fashion.

[1] https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsaeventselect

>
>
>>
>>  io/channel-watch.c | 4 ----
>>  1 file changed, 4 deletions(-)
>>
>> diff --git a/io/channel-watch.c b/io/channel-watch.c
>> index 89f3c8a88a..e34d86e810 100644
>> --- a/io/channel-watch.c
>> +++ b/io/channel-watch.c
>> @@ -115,17 +115,13 @@ static gboolean
>>  qio_channel_socket_source_check(GSource *source)
>>  {
>>      static struct timeval tv0;
>> -
>>      QIOChannelSocketSource *ssource = (QIOChannelSocketSource *)source;
>> -    WSANETWORKEVENTS ev;
>>      fd_set rfds, wfds, xfds;
>>
>>      if (!ssource->condition) {
>>          return 0;
>>      }
>>
>> -    WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
>> -
>>      FD_ZERO(&rfds);
>>      FD_ZERO(&wfds);
>>      FD_ZERO(&xfds);
>
>
> Unrelated, after this chunk, there is
>         FD_SET((SOCKET)ssource->socket, &rfds);
>
> it seems we can drop the cast. Feel free to send another patch.
>

Yeah, this cast is unnecessary. Will spin a new patch in v2. Thanks!

Regards,
Bin

Marc-André Lureau Sept. 5, 2022, 6:04 a.m. UTC | #3

Hi

On Sun, Sep 4, 2022 at 10:24 AM Bin Meng <bmeng.cn@gmail.com> wrote:

> On Thu, Sep 1, 2022 at 8:58 PM Marc-André Lureau
> <marcandre.lureau@gmail.com> wrote:
> >
> > Hi
> >
> > On Wed, Aug 24, 2022 at 2:49 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> >>
> >> From: Bin Meng <bin.meng@windriver.com>
> >>
> >> Random failure was observed when running qtests on Windows due to
> >> "Broken pipe" detected by qmp_fd_receive(). What happened is that
> >> the qtest executable sends testing data over a socket to the QEMU
> >> under test but no response is received. The errno of the recv()
> >> call from the qtest executable indicates ETIMEOUT, due to the qmp
> >> chardev's tcp_chr_read() is never called to receive testing data
> >> hence no response is sent to the other side.
> >>
> >> tcp_chr_read() is registered as the callback of the socket watch
> >> GSource. The reason of the callback not being called by glib, is
> >> that the source check fails to indicate the source is ready. There
> >> are two socket watch sources created to monitor the same socket
> >> event object from the char-socket backend in update_ioc_handlers().
> >>
> >> During the source check phase, qio_channel_socket_source_check()
> >> calls WSAEnumNetworkEvents() to discovers occurrences of network
> >> events for the indicated socket, clear internal network event records,
> >> and reset the event object. Testing shows that if we don't reset the
> >> event object by not passing the event handle to WSAEnumNetworkEvents()
> >> the symptom goes away and qtest runs very stably.
> >>
> >> It looks we don't need to call WSAEnumNetworkEvents() at all, as we
> >> don't parse the result of WSANETWORKEVENTS returned from this API.
> >> We use select() to poll the socket status. Fix this instability by
> >> dropping the WSAEnumNetworkEvents() call.
> >>
> >> Signed-off-by: Bin Meng <bin.meng@windriver.com>
> >
> >
> > What clears the event then?
> >
>
> It seems we don't need to clear the event as everything still works as
> expected.
>

Well, it can "work" but are you sure it doesn't have a busy loop?

>>
> >> ---
> >> During the testing, I removed the following codes in
> update_ioc_handlers():
> >>
> >>     remove_hup_source(s);
> >>     s->hup_source = qio_channel_create_watch(s->ioc, G_IO_HUP);
> >>     g_source_set_callback(s->hup_source, (GSourceFunc)tcp_chr_hup,
> >>                           chr, NULL);
> >>     g_source_attach(s->hup_source, chr->gcontext);
> >>
> >> and such change also makes the symptom go away.
> >>
> >> And if I moved the above codes to the beginning, before the call to
> >> io_add_watch_poll(), the symptom also goes away.
> >>
> >> It seems two sources watching on the same socket event object is
> >> the key that leads to the instability. The order of adding a source
> >> watch seems to also play a role but I can't explain why.
> >> Hopefully a Windows and glib expert could explain this behavior.
> >>
> >
> > Feel free to leave that comment in the commit message.
>
> Sure, will add the above message into the commit in v2.
>
> >
> > This is strange, as both sources should have different events, clearing
> one shouldn't affect the other.
>
> Both sources have the same event, as one QIO channel only has one
> socket, and one socket can only be bound to one event.
>

 "one socket can only be bound to one event", is that a WSAEventSelect
limitation?


> >
> > I guess it's WSAEnumNetworkEvents clearing of the internal network event
> records that is problematic.
> >
> > Can you check if you replace the call with ResetEvent() everything works?
>
> No, ResetEvent() does not work, and is not recommended by MSDN [1]
> too, which says:
>

It probably works to some extent (I see glib is using it
https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/giowin32.c#L903), What
you mean is that it doesn't solve the issue, I guess.

>
> The proper way to reset the state of an event object used with the
> WSAEventSelect function is to pass the handle of the event object to
> the WSAEnumNetworkEvents function in the hEventObject parameter. This
> will reset the event object and adjust the status of active FD events
> on the socket in an atomic fashion.
>
>
This is not what you want though if you have multiple event objects for the
same socket.


> [1]
> https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsaeventselect
>
> >
> >
> >>
> >>  io/channel-watch.c | 4 ----
> >>  1 file changed, 4 deletions(-)
> >>
> >> diff --git a/io/channel-watch.c b/io/channel-watch.c
> >> index 89f3c8a88a..e34d86e810 100644
> >> --- a/io/channel-watch.c
> >> +++ b/io/channel-watch.c
> >> @@ -115,17 +115,13 @@ static gboolean
> >>  qio_channel_socket_source_check(GSource *source)
> >>  {
> >>      static struct timeval tv0;
> >> -
> >>      QIOChannelSocketSource *ssource = (QIOChannelSocketSource *)source;
> >> -    WSANETWORKEVENTS ev;
> >>      fd_set rfds, wfds, xfds;
> >>
> >>      if (!ssource->condition) {
> >>          return 0;
> >>      }
> >>
> >> -    WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> >> -
> >>      FD_ZERO(&rfds);
> >>      FD_ZERO(&wfds);
> >>      FD_ZERO(&xfds);
> >
> >
> > Unrelated, after this chunk, there is
> >         FD_SET((SOCKET)ssource->socket, &rfds);
> >
> > it seems we can drop the cast. Feel free to send another patch.
> >
>
> Yeah, this cast is unnecessary. Will spin a new patch in v2. Thanks!
>
> Regards,
> Bin
>

Bin Meng Sept. 5, 2022, 6:13 a.m. UTC | #4

On Mon, Sep 5, 2022 at 2:04 PM Marc-André Lureau
<marcandre.lureau@gmail.com> wrote:
>
> Hi
>
> On Sun, Sep 4, 2022 at 10:24 AM Bin Meng <bmeng.cn@gmail.com> wrote:
>>
>> On Thu, Sep 1, 2022 at 8:58 PM Marc-André Lureau
>> <marcandre.lureau@gmail.com> wrote:
>> >
>> > Hi
>> >
>> > On Wed, Aug 24, 2022 at 2:49 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>> >>
>> >> From: Bin Meng <bin.meng@windriver.com>
>> >>
>> >> Random failure was observed when running qtests on Windows due to
>> >> "Broken pipe" detected by qmp_fd_receive(). What happened is that
>> >> the qtest executable sends testing data over a socket to the QEMU
>> >> under test but no response is received. The errno of the recv()
>> >> call from the qtest executable indicates ETIMEOUT, due to the qmp
>> >> chardev's tcp_chr_read() is never called to receive testing data
>> >> hence no response is sent to the other side.
>> >>
>> >> tcp_chr_read() is registered as the callback of the socket watch
>> >> GSource. The reason of the callback not being called by glib, is
>> >> that the source check fails to indicate the source is ready. There
>> >> are two socket watch sources created to monitor the same socket
>> >> event object from the char-socket backend in update_ioc_handlers().
>> >>
>> >> During the source check phase, qio_channel_socket_source_check()
>> >> calls WSAEnumNetworkEvents() to discovers occurrences of network
>> >> events for the indicated socket, clear internal network event records,
>> >> and reset the event object. Testing shows that if we don't reset the
>> >> event object by not passing the event handle to WSAEnumNetworkEvents()
>> >> the symptom goes away and qtest runs very stably.
>> >>
>> >> It looks we don't need to call WSAEnumNetworkEvents() at all, as we
>> >> don't parse the result of WSANETWORKEVENTS returned from this API.
>> >> We use select() to poll the socket status. Fix this instability by
>> >> dropping the WSAEnumNetworkEvents() call.
>> >>
>> >> Signed-off-by: Bin Meng <bin.meng@windriver.com>
>> >
>> >
>> > What clears the event then?
>> >
>>
>> It seems we don't need to clear the event as everything still works as expected.
>
>
> Well, it can "work" but are you sure it doesn't have a busy loop?

You mean busy loop in g_poll()?

>> >>
>> >> ---
>> >> During the testing, I removed the following codes in update_ioc_handlers():
>> >>
>> >>     remove_hup_source(s);
>> >>     s->hup_source = qio_channel_create_watch(s->ioc, G_IO_HUP);
>> >>     g_source_set_callback(s->hup_source, (GSourceFunc)tcp_chr_hup,
>> >>                           chr, NULL);
>> >>     g_source_attach(s->hup_source, chr->gcontext);
>> >>
>> >> and such change also makes the symptom go away.
>> >>
>> >> And if I moved the above codes to the beginning, before the call to
>> >> io_add_watch_poll(), the symptom also goes away.
>> >>
>> >> It seems two sources watching on the same socket event object is
>> >> the key that leads to the instability. The order of adding a source
>> >> watch seems to also play a role but I can't explain why.
>> >> Hopefully a Windows and glib expert could explain this behavior.
>> >>
>> >
>> > Feel free to leave that comment in the commit message.
>>
>> Sure, will add the above message into the commit in v2.
>>
>> >
>> > This is strange, as both sources should have different events, clearing one shouldn't affect the other.
>>
>> Both sources have the same event, as one QIO channel only has one
>> socket, and one socket can only be bound to one event.
>
>
>  "one socket can only be bound to one event", is that a WSAEventSelect limitation?
>

Yes, please see the MSDN:

It is not possible to specify different event objects for different
network events. The following code will not work; the second call will
cancel the effects of the first, and only the FD_WRITE network event
will be associated with hEventObject2:

rc = WSAEventSelect(s, hEventObject1, FD_READ);
rc = WSAEventSelect(s, hEventObject2, FD_WRITE); //bad

>>
>> >
>> > I guess it's WSAEnumNetworkEvents clearing of the internal network event records that is problematic.
>> >
>> > Can you check if you replace the call with ResetEvent() everything works?
>>
>> No, ResetEvent() does not work, and is not recommended by MSDN [1]
>> too, which says:
>
>
> It probably works to some extent (I see glib is using it https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/giowin32.c#L903), What you mean is that it doesn't solve the issue, I guess.

Correct, it does not solve the issue.

>>
>>
>> The proper way to reset the state of an event object used with the
>> WSAEventSelect function is to pass the handle of the event object to
>> the WSAEnumNetworkEvents function in the hEventObject parameter. This
>> will reset the event object and adjust the status of active FD events
>> on the socket in an atomic fashion.
>>
>
> This is not what you want though if you have multiple event objects for the same socket.
>
>>
>> [1] https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsaeventselect

Regards,
Bin

Clément Chigot Sept. 5, 2022, 8:10 a.m. UTC | #5

Hi all,

I did reach the same issue while trying to connect a gdb to qemu on
Windows hosts. Some inputs send by gdb aren't getting correctly pulled,
they will be retrieved only once g_poll times out.

As you explained earlier, the issue, AFAICT, is that WSAEnumNetworkEvents
will reset the internal events before select can detect them.
Sadly, I didn't find any way to adjust the current code using select to
avoid that. As select and any cleaner (ever WSAEnumNetworkEvents or
WSEResetEvent) cannot be called in an atomic way, it seems that events
can always be received between the reset and the select. At least, all
my attempts failed.

The only solution I've found is to move completely to the Windows API
and stop calling select. This, however, needs some tricks. In Particular, we
need to "remove" the support of GSources having only G_IO_HUP.
This is already kind of done as we currently don't detect them in
qio_channel_socket_source_check anyway.
This is mandatory because of the two GSources created on the same socket.
IIRC, the first one (I'll call it the master GSource) is created during
the initialization of the channel-socket by update_ioc_handlers and will
have only G_IO_HUP to catch up.
The second one is created during the "prepare" phase of the first one,
in io_watch_poll_prepare. This one will have all the events we want
to pull (G_IO_IN here).
As they are referring to the same socket, the Windows API cannot know
on which GSources we want to catch which events. The solution is then
to avoid WSAEnumNetworkEvents for the master GSource which only has
G_IO_HUP (or for any GSource having only that).
As I said above, the current code doesn't do anything with it anyway.
So, IMO, it's safe to do so.

I'll send you my patch attached. I was planning to send it in the following
weeks anyway. I was just waiting to be sure everything looks fine on our
CI. Feel free to test and modify it if needed.

PS: I don't know if it will correctly extend if I simply attach it to
my mail. If not, tell me I'll simply copy-paste it, even if it might
mess up the space/tab stuff.

> >> >>
> >> >> ---
> >> >> During the testing, I removed the following codes in update_ioc_handlers():
> >> >>
> >> >>     remove_hup_source(s);
> >> >>     s->hup_source = qio_channel_create_watch(s->ioc, G_IO_HUP);
> >> >>     g_source_set_callback(s->hup_source, (GSourceFunc)tcp_chr_hup,
> >> >>                           chr, NULL);
> >> >>     g_source_attach(s->hup_source, chr->gcontext);
> >> >>
> >> >> and such change also makes the symptom go away.
> >> >>
> >> >> And if I moved the above codes to the beginning, before the call to
> >> >> io_add_watch_poll(), the symptom also goes away.
> >> >>
> >> >> It seems two sources watching on the same socket event object is
> >> >> the key that leads to the instability. The order of adding a source
> >> >> watch seems to also play a role but I can't explain why.
> >> >> Hopefully a Windows and glib expert could explain this behavior.
> >> >>
> >> >
> >> > Feel free to leave that comment in the commit message.
> >>
> >> Sure, will add the above message into the commit in v2.
> >>
> >> >
> >> > This is strange, as both sources should have different events, clearing one shouldn't affect the other.
> >>
> >> Both sources have the same event, as one QIO channel only has one
> >> socket, and one socket can only be bound to one event.
> >
> >
> >  "one socket can only be bound to one event", is that a WSAEventSelect limitation?
> >
>
> Yes, please see the MSDN:
>
> It is not possible to specify different event objects for different
> network events. The following code will not work; the second call will
> cancel the effects of the first, and only the FD_WRITE network event
> will be associated with hEventObject2:
>
> rc = WSAEventSelect(s, hEventObject1, FD_READ);
> rc = WSAEventSelect(s, hEventObject2, FD_WRITE); //bad

Yes, the Windows API is handled at socket levels. That's why having
two GSources on the same sockets is problematic.
Note that maybe there is a mix to be done between your patch with
the update_ioc_handlers part to be removed and my patch which improves
the polling of channel-watch.

Thanks,
Clément

Bin Meng Sept. 6, 2022, 7 a.m. UTC | #6

Hi Clément,

On Mon, Sep 5, 2022 at 4:10 PM Clément Chigot <chigot@adacore.com> wrote:
>
> Hi all,
>
> I did reach the same issue while trying to connect a gdb to qemu on
> Windows hosts. Some inputs send by gdb aren't getting correctly pulled,
> they will be retrieved only once g_poll times out.
>
> As you explained earlier, the issue, AFAICT, is that WSAEnumNetworkEvents
> will reset the internal events before select can detect them.

No, I don't think WSAEnumNetworkEvents and select cannot be used together.

MSDN says:
https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-select

"The select function has no effect on the persistence of socket events
registered with WSAAsyncSelect or WSAEventSelect."

That sounds to me like current usage in QEMU codes does not have a problem.

> Sadly, I didn't find any way to adjust the current code using select to
> avoid that. As select and any cleaner (ever WSAEnumNetworkEvents or
> WSEResetEvent) cannot be called in an atomic way, it seems that events
> can always be received between the reset and the select. At least, all
> my attempts failed.

According to MSDN:
https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsaeventselect

"Having successfully recorded the occurrence of the network event (by
setting the corresponding bit in the internal network event record)
and signaled the associated event object, no further actions are taken
for that network event until the application makes the function call
that implicitly reenables the setting of that network event and
signaling of the associated event object."

So events will be kept unsignaled after they are signaled, until the
reenable routine is called. For example, recv() for the FD_READ event.

>
> The only solution I've found is to move completely to the Windows API
> and stop calling select. This, however, needs some tricks. In Particular, we
> need to "remove" the support of GSources having only G_IO_HUP.
> This is already kind of done as we currently don't detect them in
> qio_channel_socket_source_check anyway.
> This is mandatory because of the two GSources created on the same socket.
> IIRC, the first one (I'll call it the master GSource) is created during
> the initialization of the channel-socket by update_ioc_handlers and will
> have only G_IO_HUP to catch up.
> The second one is created during the "prepare" phase of the first one,
> in io_watch_poll_prepare. This one will have all the events we want
> to pull (G_IO_IN here).
> As they are referring to the same socket, the Windows API cannot know
> on which GSources we want to catch which events. The solution is then

I think Windows knows which events to catch. As WSAEventSelect() in
channel-watch.c::qio_channel_create_socket_watch() tells Windows which
event it should monitor.

Both the "master" and "child" GSources are watching on the same socket
with G_IO_IN condition (see qio_channel_create_socket_watch), and GLib
is smart enough to merge these two resources into one in the event
handle array which is passed to WaitForMultipleObjectsEx() in
g_poll().

I checked your patch, what you did seems to be something one would
naturally write, but what is currently in the QEMU sources seems to be
written intentionally.

+Paolo Bonzini , you are the one who implemented the socket watch on
Windows. Could you please help analyze this issue?

> to avoid WSAEnumNetworkEvents for the master GSource which only has
> G_IO_HUP (or for any GSource having only that).
> As I said above, the current code doesn't do anything with it anyway.
> So, IMO, it's safe to do so.
>
> I'll send you my patch attached. I was planning to send it in the following
> weeks anyway. I was just waiting to be sure everything looks fine on our
> CI. Feel free to test and modify it if needed.

I tested your patch. Unfortunately there is still one test case
(migration-test.exe) throwing up the "Broken pipe" message.

Can you test my patch instead to see if your gdb issue can be fixed?

>
> PS: I don't know if it will correctly extend if I simply attach it to
> my mail. If not, tell me I'll simply copy-paste it, even if it might
> mess up the space/tab stuff.
>
> > >> >>
> > >> >> ---
> > >> >> During the testing, I removed the following codes in update_ioc_handlers():
> > >> >>
> > >> >>     remove_hup_source(s);
> > >> >>     s->hup_source = qio_channel_create_watch(s->ioc, G_IO_HUP);
> > >> >>     g_source_set_callback(s->hup_source, (GSourceFunc)tcp_chr_hup,
> > >> >>                           chr, NULL);
> > >> >>     g_source_attach(s->hup_source, chr->gcontext);
> > >> >>
> > >> >> and such change also makes the symptom go away.
> > >> >>
> > >> >> And if I moved the above codes to the beginning, before the call to
> > >> >> io_add_watch_poll(), the symptom also goes away.
> > >> >>
> > >> >> It seems two sources watching on the same socket event object is
> > >> >> the key that leads to the instability. The order of adding a source
> > >> >> watch seems to also play a role but I can't explain why.
> > >> >> Hopefully a Windows and glib expert could explain this behavior.
> > >> >>
> > >> >
> > >> > Feel free to leave that comment in the commit message.
> > >>
> > >> Sure, will add the above message into the commit in v2.
> > >>
> > >> >
> > >> > This is strange, as both sources should have different events, clearing one shouldn't affect the other.
> > >>
> > >> Both sources have the same event, as one QIO channel only has one
> > >> socket, and one socket can only be bound to one event.
> > >
> > >
> > >  "one socket can only be bound to one event", is that a WSAEventSelect limitation?
> > >
> >
> > Yes, please see the MSDN:
> >
> > It is not possible to specify different event objects for different
> > network events. The following code will not work; the second call will
> > cancel the effects of the first, and only the FD_WRITE network event
> > will be associated with hEventObject2:
> >
> > rc = WSAEventSelect(s, hEventObject1, FD_READ);
> > rc = WSAEventSelect(s, hEventObject2, FD_WRITE); //bad
>
> Yes, the Windows API is handled at socket levels. That's why having
> two GSources on the same sockets is problematic.
> Note that maybe there is a mix to be done between your patch with
> the update_ioc_handlers part to be removed and my patch which improves
> the polling of channel-watch.

Regards,
Bin

Clément Chigot Sept. 6, 2022, 7:41 a.m. UTC | #7

Hi Bin,

> On Mon, Sep 5, 2022 at 4:10 PM Clément Chigot <chigot@adacore.com> wrote:
> >
> > Hi all,
> >
> > I did reach the same issue while trying to connect a gdb to qemu on
> > Windows hosts. Some inputs send by gdb aren't getting correctly pulled,
> > they will be retrieved only once g_poll times out.
> >
> > As you explained earlier, the issue, AFAICT, is that WSAEnumNetworkEvents
> > will reset the internal events before select can detect them.
>
> No, I don't think WSAEnumNetworkEvents and select cannot be used together.
>
> MSDN says:
> https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-select
>
> "The select function has no effect on the persistence of socket events
> registered with WSAAsyncSelect or WSAEventSelect."
>
> That sounds to me like current usage in QEMU codes does not have a problem.

Yes, but WSAEnumNetworkEvents has effects. Thus, it might reset the
events before
select detects them no ?
Moreover, I'm not sure what they mean by "persistence of WSAEventSelect".
Does that mean that select doesn't change the events wanted to be
notified as set
by WSAEventSelect or that select isn't resetting the events once
retrieved, as you
imply ?
Moreover, the current code is creating the event object with
CreateEvent(NULL, FALSE, FALSE, NULL). The first FALSE implies that we want
an auto-reset. The MSDN doc says:

"Boolean that specifies whether a manual-reset or auto-reset event
object is created.
If TRUE, then you must use the ResetEvent function to manually reset
the state to
nonsignaled. If FALSE, the system automatically resets the state to nonsignaled
after a single waiting thread has been released."

However, I'm not sure if I understand correctly what "after a single
waiting thread
has been released." signified. Is the reset occuring after recv() or
after select or
WSAEnumNetworkEvents calls ? AFAIR, I've tried to move to manual-reset with
the current qemu but it doesn't change anything.

> > Sadly, I didn't find any way to adjust the current code using select to
> > avoid that. As select and any cleaner (ever WSAEnumNetworkEvents or
> > WSEResetEvent) cannot be called in an atomic way, it seems that events
> > can always be received between the reset and the select. At least, all
> > my attempts failed.
>
> According to MSDN:
> https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsaeventselect
>
> "Having successfully recorded the occurrence of the network event (by
> setting the corresponding bit in the internal network event record)
> and signaled the associated event object, no further actions are taken
> for that network event until the application makes the function call
> that implicitly reenables the setting of that network event and
> signaling of the associated event object."
>
> So events will be kept unsignaled after they are signaled, until the
> reenable routine is called. For example, recv() for the FD_READ event.

In my understanding, WSAEnumNetworkEvents states the opposite:
https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsaenumnetworkevents

"The Windows Sockets provider guarantees that the operations of copying
the network event record, clearing it and resetting any associated event
object are atomic, such that the next occurrence of a nominated network
event will cause the event object to become set."

> > The only solution I've found is to move completely to the Windows API
> > and stop calling select. This, however, needs some tricks. In Particular, we
> > need to "remove" the support of GSources having only G_IO_HUP.
> > This is already kind of done as we currently don't detect them in
> > qio_channel_socket_source_check anyway.
> > This is mandatory because of the two GSources created on the same socket.
> > IIRC, the first one (I'll call it the master GSource) is created during
> > the initialization of the channel-socket by update_ioc_handlers and will
> > have only G_IO_HUP to catch up.
> > The second one is created during the "prepare" phase of the first one,
> > in io_watch_poll_prepare. This one will have all the events we want
> > to pull (G_IO_IN here).
> > As they are referring to the same socket, the Windows API cannot know
> > on which GSources we want to catch which events. The solution is then
>
> I think Windows knows which events to catch. As WSAEventSelect() in
> channel-watch.c::qio_channel_create_socket_watch() tells Windows which
> event it should monitor.
>
> Both the "master" and "child" GSources are watching on the same socket
> with G_IO_IN condition (see qio_channel_create_socket_watch), and GLib
> is smart enough to merge these two resources into one in the event
> handle array which is passed to WaitForMultipleObjectsEx() in
> g_poll().

I've forgotten to mention it. But the current code only fails with newer glib
versions. I wasn't able to test all of them but it's working with 2.54 and
starts failing with versions after 2.60.
The qemu master isn't supporting 2.54 anymore. Thus I've tested that with
our internal qemu-6 which runs with 2.54. When upgrading glib, it starts
behaving like our issue.
Sadly, I wasn't able to test with glib-2.56/2.58 nor to find anything relevant
in glib code which would explain our issue. (But I wasn't aware of the
two GSources at that time so maybe it's worth investigating again).

> I checked your patch, what you did seems to be something one would
> naturally write, but what is currently in the QEMU sources seems to be
> written intentionally.
>
> +Paolo Bonzini , you are the one who implemented the socket watch on
> Windows. Could you please help analyze this issue?
>
> > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > G_IO_HUP (or for any GSource having only that).
> > As I said above, the current code doesn't do anything with it anyway.
> > So, IMO, it's safe to do so.
> >
> > I'll send you my patch attached. I was planning to send it in the following
> > weeks anyway. I was just waiting to be sure everything looks fine on our
> > CI. Feel free to test and modify it if needed.
>
> I tested your patch. Unfortunately there is still one test case
> (migration-test.exe) throwing up the "Broken pipe" message.

I must say I didn't fully test it against qemu testsuite yet. Maybe there are
some refinements to be done. "Broken pipe" might be linked to the missing
G_IO_HUP support.

> Can you test my patch instead to see if your gdb issue can be fixed?

Yeah sure. I'll try to do it this afternoon.

Bin Meng Sept. 6, 2022, 8:14 a.m. UTC | #8

Hi Clément,

On Tue, Sep 6, 2022 at 3:41 PM Clément Chigot <chigot@adacore.com> wrote:
>
> Hi Bin,
>
> > On Mon, Sep 5, 2022 at 4:10 PM Clément Chigot <chigot@adacore.com> wrote:
> > >
> > > Hi all,
> > >
> > > I did reach the same issue while trying to connect a gdb to qemu on
> > > Windows hosts. Some inputs send by gdb aren't getting correctly pulled,
> > > they will be retrieved only once g_poll times out.
> > >
> > > As you explained earlier, the issue, AFAICT, is that WSAEnumNetworkEvents
> > > will reset the internal events before select can detect them.
> >
> > No, I don't think WSAEnumNetworkEvents and select cannot be used together.
> >
> > MSDN says:
> > https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-select
> >
> > "The select function has no effect on the persistence of socket events
> > registered with WSAAsyncSelect or WSAEventSelect."
> >
> > That sounds to me like current usage in QEMU codes does not have a problem.
>
> Yes, but WSAEnumNetworkEvents has effects. Thus, it might reset the
> events before
> select detects them no ?

Even if the event *handle* is reset, select can still detect the
network event has happened. I think the above MSDN statement is trying
to explain this.

> Moreover, I'm not sure what they mean by "persistence of WSAEventSelect".
> Does that mean that select doesn't change the events wanted to be
> notified as set
> by WSAEventSelect or that select isn't resetting the events once
> retrieved, as you
> imply ?

My understanding is that select just determines the socket status
neither from the event handle associated with the socket, nor does it
change anything on the event handle.

> Moreover, the current code is creating the event object with
> CreateEvent(NULL, FALSE, FALSE, NULL). The first FALSE implies that we want
> an auto-reset. The MSDN doc says:
>
> "Boolean that specifies whether a manual-reset or auto-reset event
> object is created.
> If TRUE, then you must use the ResetEvent function to manually reset
> the state to
> nonsignaled. If FALSE, the system automatically resets the state to nonsignaled
> after a single waiting thread has been released."

Yes, I think this is a bug however when I changed to CreateEvent(NULL,
TRUE, FALSE, NULL) or WSACreateEvent() there are still "Broken pipe"
errors ...

> However, I'm not sure if I understand correctly what "after a single
> waiting thread
> has been released." signified. Is the reset occuring after recv() or
> after select or
> WSAEnumNetworkEvents calls ? AFAIR, I've tried to move to manual-reset with
> the current qemu but it doesn't change anything.

The same here :(

>
> > > Sadly, I didn't find any way to adjust the current code using select to
> > > avoid that. As select and any cleaner (ever WSAEnumNetworkEvents or
> > > WSEResetEvent) cannot be called in an atomic way, it seems that events
> > > can always be received between the reset and the select. At least, all
> > > my attempts failed.
> >
> > According to MSDN:
> > https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsaeventselect
> >
> > "Having successfully recorded the occurrence of the network event (by
> > setting the corresponding bit in the internal network event record)
> > and signaled the associated event object, no further actions are taken
> > for that network event until the application makes the function call
> > that implicitly reenables the setting of that network event and
> > signaling of the associated event object."
> >
> > So events will be kept unsignaled after they are signaled, until the
> > reenable routine is called. For example, recv() for the FD_READ event.
>
> In my understanding, WSAEnumNetworkEvents states the opposite:
> https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsaenumnetworkevents
>
> "The Windows Sockets provider guarantees that the operations of copying
> the network event record, clearing it and resetting any associated event
> object are atomic, such that the next occurrence of a nominated network
> event will cause the event object to become set."

My interpretation of this does not conflict with the WSAEventSelect().
I think they are just two things.

>
> > > The only solution I've found is to move completely to the Windows API
> > > and stop calling select. This, however, needs some tricks. In Particular, we
> > > need to "remove" the support of GSources having only G_IO_HUP.
> > > This is already kind of done as we currently don't detect them in
> > > qio_channel_socket_source_check anyway.
> > > This is mandatory because of the two GSources created on the same socket.
> > > IIRC, the first one (I'll call it the master GSource) is created during
> > > the initialization of the channel-socket by update_ioc_handlers and will
> > > have only G_IO_HUP to catch up.
> > > The second one is created during the "prepare" phase of the first one,
> > > in io_watch_poll_prepare. This one will have all the events we want
> > > to pull (G_IO_IN here).
> > > As they are referring to the same socket, the Windows API cannot know
> > > on which GSources we want to catch which events. The solution is then
> >
> > I think Windows knows which events to catch. As WSAEventSelect() in
> > channel-watch.c::qio_channel_create_socket_watch() tells Windows which
> > event it should monitor.
> >
> > Both the "master" and "child" GSources are watching on the same socket
> > with G_IO_IN condition (see qio_channel_create_socket_watch), and GLib
> > is smart enough to merge these two resources into one in the event
> > handle array which is passed to WaitForMultipleObjectsEx() in
> > g_poll().
>
> I've forgotten to mention it. But the current code only fails with newer glib
> versions. I wasn't able to test all of them but it's working with 2.54 and
> starts failing with versions after 2.60.
> The qemu master isn't supporting 2.54 anymore. Thus I've tested that with
> our internal qemu-6 which runs with 2.54. When upgrading glib, it starts
> behaving like our issue.
> Sadly, I wasn't able to test with glib-2.56/2.58 nor to find anything relevant
> in glib code which would explain our issue. (But I wasn't aware of the
> two GSources at that time so maybe it's worth investigating again).
>

Ah, good to know glib version matters.

> > I checked your patch, what you did seems to be something one would
> > naturally write, but what is currently in the QEMU sources seems to be
> > written intentionally.
> >
> > +Paolo Bonzini , you are the one who implemented the socket watch on
> > Windows. Could you please help analyze this issue?
> >
> > > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > > G_IO_HUP (or for any GSource having only that).
> > > As I said above, the current code doesn't do anything with it anyway.
> > > So, IMO, it's safe to do so.
> > >
> > > I'll send you my patch attached. I was planning to send it in the following
> > > weeks anyway. I was just waiting to be sure everything looks fine on our
> > > CI. Feel free to test and modify it if needed.
> >
> > I tested your patch. Unfortunately there is still one test case
> > (migration-test.exe) throwing up the "Broken pipe" message.
>
> I must say I didn't fully test it against qemu testsuite yet. Maybe there are
> some refinements to be done. "Broken pipe" might be linked to the missing
> G_IO_HUP support.
>
> > Can you test my patch instead to see if your gdb issue can be fixed?
>
> Yeah sure. I'll try to do it this afternoon.

Thanks!

Regards,
Bin

Clément Chigot Sept. 6, 2022, 12:06 p.m. UTC | #9

> > > I checked your patch, what you did seems to be something one would
> > > naturally write, but what is currently in the QEMU sources seems to be
> > > written intentionally.
> > >
> > > +Paolo Bonzini , you are the one who implemented the socket watch on
> > > Windows. Could you please help analyze this issue?
> > >
> > > > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > > > G_IO_HUP (or for any GSource having only that).
> > > > As I said above, the current code doesn't do anything with it anyway.
> > > > So, IMO, it's safe to do so.
> > > >
> > > > I'll send you my patch attached. I was planning to send it in the following
> > > > weeks anyway. I was just waiting to be sure everything looks fine on our
> > > > CI. Feel free to test and modify it if needed.
> > >
> > > I tested your patch. Unfortunately there is still one test case
> > > (migration-test.exe) throwing up the "Broken pipe" message.
> >
> > I must say I didn't fully test it against qemu testsuite yet. Maybe there are
> > some refinements to be done. "Broken pipe" might be linked to the missing
> > G_IO_HUP support.
> >
> > > Can you test my patch instead to see if your gdb issue can be fixed?
> >
> > Yeah sure. I'll try to do it this afternoon.

I can't explain how mad at me I am... I'm pretty sure your patch was the first
thing I've tried when I encountered this issue. But it wasn't working
or IIRC the
issue went away but that was because the polling was actually disabled (looping
indefinitely)...I'm suspecting that I already had changed the CreateEvent for
WSACreateEvent which forces you to handle the reset.
Finally, I end up struggling reworking the whole check function...
But yeah, your patch does work fine on my gdb issues too.

And I guess the events are reset when recv() is being called because of the
auto-reset feature set up by CreateEvent().
IIUC, what Marc-André means by busy loop is the polling being looping
indefinitely as I encountered. I can ensure that this patch doesn't do that.
It can be easily checked by setting the env variable G_MAIN_POLL_DEBUG.
It'll show what g_poll is doing and it's normally always available on
Windows.

Anyway, we'll wait for Paolo to see if he remembers why he had to call
WSAEnumNetworkEvents. Otherwize, let's go for your patch. Mine might
be a good start to improve the whole polling on Windows but if it doesn't
work in your case, it then needs some refinements.

Thanks,
Clément

Bin Meng Sept. 7, 2022, 5:07 a.m. UTC | #10

Hi Clément,

On Tue, Sep 6, 2022 at 8:06 PM Clément Chigot <chigot@adacore.com> wrote:
>
> > > > I checked your patch, what you did seems to be something one would
> > > > naturally write, but what is currently in the QEMU sources seems to be
> > > > written intentionally.
> > > >
> > > > +Paolo Bonzini , you are the one who implemented the socket watch on
> > > > Windows. Could you please help analyze this issue?
> > > >
> > > > > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > > > > G_IO_HUP (or for any GSource having only that).
> > > > > As I said above, the current code doesn't do anything with it anyway.
> > > > > So, IMO, it's safe to do so.
> > > > >
> > > > > I'll send you my patch attached. I was planning to send it in the following
> > > > > weeks anyway. I was just waiting to be sure everything looks fine on our
> > > > > CI. Feel free to test and modify it if needed.
> > > >
> > > > I tested your patch. Unfortunately there is still one test case
> > > > (migration-test.exe) throwing up the "Broken pipe" message.
> > >
> > > I must say I didn't fully test it against qemu testsuite yet. Maybe there are
> > > some refinements to be done. "Broken pipe" might be linked to the missing
> > > G_IO_HUP support.
> > >
> > > > Can you test my patch instead to see if your gdb issue can be fixed?
> > >
> > > Yeah sure. I'll try to do it this afternoon.
>
> I can't explain how mad at me I am... I'm pretty sure your patch was the first
> thing I've tried when I encountered this issue. But it wasn't working
> or IIRC the
> issue went away but that was because the polling was actually disabled (looping
> indefinitely)...I'm suspecting that I already had changed the CreateEvent for
> WSACreateEvent which forces you to handle the reset.
> Finally, I end up struggling reworking the whole check function...
> But yeah, your patch does work fine on my gdb issues too.

Good to know this patch works for you too.

> And I guess the events are reset when recv() is being called because of the
> auto-reset feature set up by CreateEvent().
> IIUC, what Marc-André means by busy loop is the polling being looping
> indefinitely as I encountered. I can ensure that this patch doesn't do that.
> It can be easily checked by setting the env variable G_MAIN_POLL_DEBUG.
> It'll show what g_poll is doing and it's normally always available on
> Windows.
>
> Anyway, we'll wait for Paolo to see if he remembers why he had to call
> WSAEnumNetworkEvents. Otherwize, let's go for your patch. Mine might
> be a good start to improve the whole polling on Windows but if it doesn't
> work in your case, it then needs some refinements.
>

Yeah, this issue bugged me quite a lot. If we want to reset the event
in qio_channel_socket_source_check(), we will have to do the following
to make sure qtests are happy.

diff --git a/io/channel-watch.c b/io/channel-watch.c
index 43d38494f7..f1e1650b81 100644
--- a/io/channel-watch.c
+++ b/io/channel-watch.c
@@ -124,8 +124,6 @@ qio_channel_socket_source_check(GSource *source)
return 0;
}
- WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
-
FD_ZERO(&rfds);
FD_ZERO(&wfds);
FD_ZERO(&xfds);
@@ -153,6 +151,10 @@ qio_channel_socket_source_check(GSource *source)
ssource->revents |= G_IO_PRI;
}
+ if (ssource->revents) {
+ WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
+ }
+
return ssource->revents;
}

Removing "if (ssource->revents)" won't work.

It seems to me that resetting the event twice (one time with the
master Gsource, and the other time with the child GSource) causes some
bizarre behavior. But MSDN [1] says

    "Resetting an event that is already reset has no effect."

[1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent

Regards,
Bin

Bin Meng Sept. 14, 2022, 8:08 a.m. UTC | #11

On Wed, Sep 7, 2022 at 1:07 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>
> Hi Clément,
>
> On Tue, Sep 6, 2022 at 8:06 PM Clément Chigot <chigot@adacore.com> wrote:
> >
> > > > > I checked your patch, what you did seems to be something one would
> > > > > naturally write, but what is currently in the QEMU sources seems to be
> > > > > written intentionally.
> > > > >
> > > > > +Paolo Bonzini , you are the one who implemented the socket watch on
> > > > > Windows. Could you please help analyze this issue?
> > > > >
> > > > > > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > > > > > G_IO_HUP (or for any GSource having only that).
> > > > > > As I said above, the current code doesn't do anything with it anyway.
> > > > > > So, IMO, it's safe to do so.
> > > > > >
> > > > > > I'll send you my patch attached. I was planning to send it in the following
> > > > > > weeks anyway. I was just waiting to be sure everything looks fine on our
> > > > > > CI. Feel free to test and modify it if needed.
> > > > >
> > > > > I tested your patch. Unfortunately there is still one test case
> > > > > (migration-test.exe) throwing up the "Broken pipe" message.
> > > >
> > > > I must say I didn't fully test it against qemu testsuite yet. Maybe there are
> > > > some refinements to be done. "Broken pipe" might be linked to the missing
> > > > G_IO_HUP support.
> > > >
> > > > > Can you test my patch instead to see if your gdb issue can be fixed?
> > > >
> > > > Yeah sure. I'll try to do it this afternoon.
> >
> > I can't explain how mad at me I am... I'm pretty sure your patch was the first
> > thing I've tried when I encountered this issue. But it wasn't working
> > or IIRC the
> > issue went away but that was because the polling was actually disabled (looping
> > indefinitely)...I'm suspecting that I already had changed the CreateEvent for
> > WSACreateEvent which forces you to handle the reset.
> > Finally, I end up struggling reworking the whole check function...
> > But yeah, your patch does work fine on my gdb issues too.
>
> Good to know this patch works for you too.
>
> > And I guess the events are reset when recv() is being called because of the
> > auto-reset feature set up by CreateEvent().
> > IIUC, what Marc-André means by busy loop is the polling being looping
> > indefinitely as I encountered. I can ensure that this patch doesn't do that.
> > It can be easily checked by setting the env variable G_MAIN_POLL_DEBUG.
> > It'll show what g_poll is doing and it's normally always available on
> > Windows.
> >
> > Anyway, we'll wait for Paolo to see if he remembers why he had to call
> > WSAEnumNetworkEvents. Otherwize, let's go for your patch. Mine might
> > be a good start to improve the whole polling on Windows but if it doesn't
> > work in your case, it then needs some refinements.
> >
>
> Yeah, this issue bugged me quite a lot. If we want to reset the event
> in qio_channel_socket_source_check(), we will have to do the following
> to make sure qtests are happy.
>
> diff --git a/io/channel-watch.c b/io/channel-watch.c
> index 43d38494f7..f1e1650b81 100644
> --- a/io/channel-watch.c
> +++ b/io/channel-watch.c
> @@ -124,8 +124,6 @@ qio_channel_socket_source_check(GSource *source)
> return 0;
> }
> - WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> -
> FD_ZERO(&rfds);
> FD_ZERO(&wfds);
> FD_ZERO(&xfds);
> @@ -153,6 +151,10 @@ qio_channel_socket_source_check(GSource *source)
> ssource->revents |= G_IO_PRI;
> }
> + if (ssource->revents) {
> + WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> + }
> +
> return ssource->revents;
> }
>
> Removing "if (ssource->revents)" won't work.
>
> It seems to me that resetting the event twice (one time with the
> master Gsource, and the other time with the child GSource) causes some
> bizarre behavior. But MSDN [1] says
>
>     "Resetting an event that is already reset has no effect."
>
> [1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent
>

Paolo, any comments about this issue?

Regards,
Bin

Bin Meng Sept. 21, 2022, 1:02 a.m. UTC | #12

On Wed, Sep 14, 2022 at 4:08 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>
> On Wed, Sep 7, 2022 at 1:07 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> >
> > Hi Clément,
> >
> > On Tue, Sep 6, 2022 at 8:06 PM Clément Chigot <chigot@adacore.com> wrote:
> > >
> > > > > > I checked your patch, what you did seems to be something one would
> > > > > > naturally write, but what is currently in the QEMU sources seems to be
> > > > > > written intentionally.
> > > > > >
> > > > > > +Paolo Bonzini , you are the one who implemented the socket watch on
> > > > > > Windows. Could you please help analyze this issue?
> > > > > >
> > > > > > > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > > > > > > G_IO_HUP (or for any GSource having only that).
> > > > > > > As I said above, the current code doesn't do anything with it anyway.
> > > > > > > So, IMO, it's safe to do so.
> > > > > > >
> > > > > > > I'll send you my patch attached. I was planning to send it in the following
> > > > > > > weeks anyway. I was just waiting to be sure everything looks fine on our
> > > > > > > CI. Feel free to test and modify it if needed.
> > > > > >
> > > > > > I tested your patch. Unfortunately there is still one test case
> > > > > > (migration-test.exe) throwing up the "Broken pipe" message.
> > > > >
> > > > > I must say I didn't fully test it against qemu testsuite yet. Maybe there are
> > > > > some refinements to be done. "Broken pipe" might be linked to the missing
> > > > > G_IO_HUP support.
> > > > >
> > > > > > Can you test my patch instead to see if your gdb issue can be fixed?
> > > > >
> > > > > Yeah sure. I'll try to do it this afternoon.
> > >
> > > I can't explain how mad at me I am... I'm pretty sure your patch was the first
> > > thing I've tried when I encountered this issue. But it wasn't working
> > > or IIRC the
> > > issue went away but that was because the polling was actually disabled (looping
> > > indefinitely)...I'm suspecting that I already had changed the CreateEvent for
> > > WSACreateEvent which forces you to handle the reset.
> > > Finally, I end up struggling reworking the whole check function...
> > > But yeah, your patch does work fine on my gdb issues too.
> >
> > Good to know this patch works for you too.
> >
> > > And I guess the events are reset when recv() is being called because of the
> > > auto-reset feature set up by CreateEvent().
> > > IIUC, what Marc-André means by busy loop is the polling being looping
> > > indefinitely as I encountered. I can ensure that this patch doesn't do that.
> > > It can be easily checked by setting the env variable G_MAIN_POLL_DEBUG.
> > > It'll show what g_poll is doing and it's normally always available on
> > > Windows.
> > >
> > > Anyway, we'll wait for Paolo to see if he remembers why he had to call
> > > WSAEnumNetworkEvents. Otherwize, let's go for your patch. Mine might
> > > be a good start to improve the whole polling on Windows but if it doesn't
> > > work in your case, it then needs some refinements.
> > >
> >
> > Yeah, this issue bugged me quite a lot. If we want to reset the event
> > in qio_channel_socket_source_check(), we will have to do the following
> > to make sure qtests are happy.
> >
> > diff --git a/io/channel-watch.c b/io/channel-watch.c
> > index 43d38494f7..f1e1650b81 100644
> > --- a/io/channel-watch.c
> > +++ b/io/channel-watch.c
> > @@ -124,8 +124,6 @@ qio_channel_socket_source_check(GSource *source)
> > return 0;
> > }
> > - WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > -
> > FD_ZERO(&rfds);
> > FD_ZERO(&wfds);
> > FD_ZERO(&xfds);
> > @@ -153,6 +151,10 @@ qio_channel_socket_source_check(GSource *source)
> > ssource->revents |= G_IO_PRI;
> > }
> > + if (ssource->revents) {
> > + WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > + }
> > +
> > return ssource->revents;
> > }
> >
> > Removing "if (ssource->revents)" won't work.
> >
> > It seems to me that resetting the event twice (one time with the
> > master Gsource, and the other time with the child GSource) causes some
> > bizarre behavior. But MSDN [1] says
> >
> >     "Resetting an event that is already reset has no effect."
> >
> > [1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent
> >
>
> Paolo, any comments about this issue?
>

v2 series has been sent out, and this patch remains unchanged.

Paolo, still would appreciate your comments.

Regards,
Bin

Bin Meng Sept. 28, 2022, 6:10 a.m. UTC | #13

Hi Paolo,

On Wed, Sep 21, 2022 at 9:02 AM Bin Meng <bmeng.cn@gmail.com> wrote:
>
> On Wed, Sep 14, 2022 at 4:08 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> >
> > On Wed, Sep 7, 2022 at 1:07 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > >
> > > Hi Clément,
> > >
> > > On Tue, Sep 6, 2022 at 8:06 PM Clément Chigot <chigot@adacore.com> wrote:
> > > >
> > > > > > > I checked your patch, what you did seems to be something one would
> > > > > > > naturally write, but what is currently in the QEMU sources seems to be
> > > > > > > written intentionally.
> > > > > > >
> > > > > > > +Paolo Bonzini , you are the one who implemented the socket watch on
> > > > > > > Windows. Could you please help analyze this issue?
> > > > > > >
> > > > > > > > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > > > > > > > G_IO_HUP (or for any GSource having only that).
> > > > > > > > As I said above, the current code doesn't do anything with it anyway.
> > > > > > > > So, IMO, it's safe to do so.
> > > > > > > >
> > > > > > > > I'll send you my patch attached. I was planning to send it in the following
> > > > > > > > weeks anyway. I was just waiting to be sure everything looks fine on our
> > > > > > > > CI. Feel free to test and modify it if needed.
> > > > > > >
> > > > > > > I tested your patch. Unfortunately there is still one test case
> > > > > > > (migration-test.exe) throwing up the "Broken pipe" message.
> > > > > >
> > > > > > I must say I didn't fully test it against qemu testsuite yet. Maybe there are
> > > > > > some refinements to be done. "Broken pipe" might be linked to the missing
> > > > > > G_IO_HUP support.
> > > > > >
> > > > > > > Can you test my patch instead to see if your gdb issue can be fixed?
> > > > > >
> > > > > > Yeah sure. I'll try to do it this afternoon.
> > > >
> > > > I can't explain how mad at me I am... I'm pretty sure your patch was the first
> > > > thing I've tried when I encountered this issue. But it wasn't working
> > > > or IIRC the
> > > > issue went away but that was because the polling was actually disabled (looping
> > > > indefinitely)...I'm suspecting that I already had changed the CreateEvent for
> > > > WSACreateEvent which forces you to handle the reset.
> > > > Finally, I end up struggling reworking the whole check function...
> > > > But yeah, your patch does work fine on my gdb issues too.
> > >
> > > Good to know this patch works for you too.
> > >
> > > > And I guess the events are reset when recv() is being called because of the
> > > > auto-reset feature set up by CreateEvent().
> > > > IIUC, what Marc-André means by busy loop is the polling being looping
> > > > indefinitely as I encountered. I can ensure that this patch doesn't do that.
> > > > It can be easily checked by setting the env variable G_MAIN_POLL_DEBUG.
> > > > It'll show what g_poll is doing and it's normally always available on
> > > > Windows.
> > > >
> > > > Anyway, we'll wait for Paolo to see if he remembers why he had to call
> > > > WSAEnumNetworkEvents. Otherwize, let's go for your patch. Mine might
> > > > be a good start to improve the whole polling on Windows but if it doesn't
> > > > work in your case, it then needs some refinements.
> > > >
> > >
> > > Yeah, this issue bugged me quite a lot. If we want to reset the event
> > > in qio_channel_socket_source_check(), we will have to do the following
> > > to make sure qtests are happy.
> > >
> > > diff --git a/io/channel-watch.c b/io/channel-watch.c
> > > index 43d38494f7..f1e1650b81 100644
> > > --- a/io/channel-watch.c
> > > +++ b/io/channel-watch.c
> > > @@ -124,8 +124,6 @@ qio_channel_socket_source_check(GSource *source)
> > > return 0;
> > > }
> > > - WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > > -
> > > FD_ZERO(&rfds);
> > > FD_ZERO(&wfds);
> > > FD_ZERO(&xfds);
> > > @@ -153,6 +151,10 @@ qio_channel_socket_source_check(GSource *source)
> > > ssource->revents |= G_IO_PRI;
> > > }
> > > + if (ssource->revents) {
> > > + WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > > + }
> > > +
> > > return ssource->revents;
> > > }
> > >
> > > Removing "if (ssource->revents)" won't work.
> > >
> > > It seems to me that resetting the event twice (one time with the
> > > master Gsource, and the other time with the child GSource) causes some
> > > bizarre behavior. But MSDN [1] says
> > >
> > >     "Resetting an event that is already reset has no effect."
> > >
> > > [1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent
> > >
> >
> > Paolo, any comments about this issue?
> >
>
> v2 series has been sent out, and this patch remains unchanged.
>
> Paolo, still would appreciate your comments.
>

Ping?

Regards,
Bin

Bin Meng Oct. 6, 2022, 3:03 a.m. UTC | #14

Hi Paolo,

On Wed, Sep 28, 2022 at 2:10 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>
> Hi Paolo,
>
> On Wed, Sep 21, 2022 at 9:02 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> >
> > On Wed, Sep 14, 2022 at 4:08 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > >
> > > On Wed, Sep 7, 2022 at 1:07 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > >
> > > > Hi Clément,
> > > >
> > > > On Tue, Sep 6, 2022 at 8:06 PM Clément Chigot <chigot@adacore.com> wrote:
> > > > >
> > > > > > > > I checked your patch, what you did seems to be something one would
> > > > > > > > naturally write, but what is currently in the QEMU sources seems to be
> > > > > > > > written intentionally.
> > > > > > > >
> > > > > > > > +Paolo Bonzini , you are the one who implemented the socket watch on
> > > > > > > > Windows. Could you please help analyze this issue?
> > > > > > > >
> > > > > > > > > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > > > > > > > > G_IO_HUP (or for any GSource having only that).
> > > > > > > > > As I said above, the current code doesn't do anything with it anyway.
> > > > > > > > > So, IMO, it's safe to do so.
> > > > > > > > >
> > > > > > > > > I'll send you my patch attached. I was planning to send it in the following
> > > > > > > > > weeks anyway. I was just waiting to be sure everything looks fine on our
> > > > > > > > > CI. Feel free to test and modify it if needed.
> > > > > > > >
> > > > > > > > I tested your patch. Unfortunately there is still one test case
> > > > > > > > (migration-test.exe) throwing up the "Broken pipe" message.
> > > > > > >
> > > > > > > I must say I didn't fully test it against qemu testsuite yet. Maybe there are
> > > > > > > some refinements to be done. "Broken pipe" might be linked to the missing
> > > > > > > G_IO_HUP support.
> > > > > > >
> > > > > > > > Can you test my patch instead to see if your gdb issue can be fixed?
> > > > > > >
> > > > > > > Yeah sure. I'll try to do it this afternoon.
> > > > >
> > > > > I can't explain how mad at me I am... I'm pretty sure your patch was the first
> > > > > thing I've tried when I encountered this issue. But it wasn't working
> > > > > or IIRC the
> > > > > issue went away but that was because the polling was actually disabled (looping
> > > > > indefinitely)...I'm suspecting that I already had changed the CreateEvent for
> > > > > WSACreateEvent which forces you to handle the reset.
> > > > > Finally, I end up struggling reworking the whole check function...
> > > > > But yeah, your patch does work fine on my gdb issues too.
> > > >
> > > > Good to know this patch works for you too.
> > > >
> > > > > And I guess the events are reset when recv() is being called because of the
> > > > > auto-reset feature set up by CreateEvent().
> > > > > IIUC, what Marc-André means by busy loop is the polling being looping
> > > > > indefinitely as I encountered. I can ensure that this patch doesn't do that.
> > > > > It can be easily checked by setting the env variable G_MAIN_POLL_DEBUG.
> > > > > It'll show what g_poll is doing and it's normally always available on
> > > > > Windows.
> > > > >
> > > > > Anyway, we'll wait for Paolo to see if he remembers why he had to call
> > > > > WSAEnumNetworkEvents. Otherwize, let's go for your patch. Mine might
> > > > > be a good start to improve the whole polling on Windows but if it doesn't
> > > > > work in your case, it then needs some refinements.
> > > > >
> > > >
> > > > Yeah, this issue bugged me quite a lot. If we want to reset the event
> > > > in qio_channel_socket_source_check(), we will have to do the following
> > > > to make sure qtests are happy.
> > > >
> > > > diff --git a/io/channel-watch.c b/io/channel-watch.c
> > > > index 43d38494f7..f1e1650b81 100644
> > > > --- a/io/channel-watch.c
> > > > +++ b/io/channel-watch.c
> > > > @@ -124,8 +124,6 @@ qio_channel_socket_source_check(GSource *source)
> > > > return 0;
> > > > }
> > > > - WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > > > -
> > > > FD_ZERO(&rfds);
> > > > FD_ZERO(&wfds);
> > > > FD_ZERO(&xfds);
> > > > @@ -153,6 +151,10 @@ qio_channel_socket_source_check(GSource *source)
> > > > ssource->revents |= G_IO_PRI;
> > > > }
> > > > + if (ssource->revents) {
> > > > + WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > > > + }
> > > > +
> > > > return ssource->revents;
> > > > }
> > > >
> > > > Removing "if (ssource->revents)" won't work.
> > > >
> > > > It seems to me that resetting the event twice (one time with the
> > > > master Gsource, and the other time with the child GSource) causes some
> > > > bizarre behavior. But MSDN [1] says
> > > >
> > > >     "Resetting an event that is already reset has no effect."
> > > >
> > > > [1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent
> > > >
> > >
> > > Paolo, any comments about this issue?
> > >
> >
> > v2 series has been sent out, and this patch remains unchanged.
> >
> > Paolo, still would appreciate your comments.
> >
>
> Ping?
>

Ping? Can you please comment??

Regards,
Bin

Bin Meng Oct. 11, 2022, 10:42 a.m. UTC | #15

Hi Paolo,

On Thu, Oct 6, 2022 at 11:03 AM Bin Meng <bmeng.cn@gmail.com> wrote:
>
> Hi Paolo,
>
> On Wed, Sep 28, 2022 at 2:10 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> >
> > Hi Paolo,
> >
> > On Wed, Sep 21, 2022 at 9:02 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> > >
> > > On Wed, Sep 14, 2022 at 4:08 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > >
> > > > On Wed, Sep 7, 2022 at 1:07 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > >
> > > > > Hi Clément,
> > > > >
> > > > > On Tue, Sep 6, 2022 at 8:06 PM Clément Chigot <chigot@adacore.com> wrote:
> > > > > >
> > > > > > > > > I checked your patch, what you did seems to be something one would
> > > > > > > > > naturally write, but what is currently in the QEMU sources seems to be
> > > > > > > > > written intentionally.
> > > > > > > > >
> > > > > > > > > +Paolo Bonzini , you are the one who implemented the socket watch on
> > > > > > > > > Windows. Could you please help analyze this issue?
> > > > > > > > >
> > > > > > > > > > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > > > > > > > > > G_IO_HUP (or for any GSource having only that).
> > > > > > > > > > As I said above, the current code doesn't do anything with it anyway.
> > > > > > > > > > So, IMO, it's safe to do so.
> > > > > > > > > >
> > > > > > > > > > I'll send you my patch attached. I was planning to send it in the following
> > > > > > > > > > weeks anyway. I was just waiting to be sure everything looks fine on our
> > > > > > > > > > CI. Feel free to test and modify it if needed.
> > > > > > > > >
> > > > > > > > > I tested your patch. Unfortunately there is still one test case
> > > > > > > > > (migration-test.exe) throwing up the "Broken pipe" message.
> > > > > > > >
> > > > > > > > I must say I didn't fully test it against qemu testsuite yet. Maybe there are
> > > > > > > > some refinements to be done. "Broken pipe" might be linked to the missing
> > > > > > > > G_IO_HUP support.
> > > > > > > >
> > > > > > > > > Can you test my patch instead to see if your gdb issue can be fixed?
> > > > > > > >
> > > > > > > > Yeah sure. I'll try to do it this afternoon.
> > > > > >
> > > > > > I can't explain how mad at me I am... I'm pretty sure your patch was the first
> > > > > > thing I've tried when I encountered this issue. But it wasn't working
> > > > > > or IIRC the
> > > > > > issue went away but that was because the polling was actually disabled (looping
> > > > > > indefinitely)...I'm suspecting that I already had changed the CreateEvent for
> > > > > > WSACreateEvent which forces you to handle the reset.
> > > > > > Finally, I end up struggling reworking the whole check function...
> > > > > > But yeah, your patch does work fine on my gdb issues too.
> > > > >
> > > > > Good to know this patch works for you too.
> > > > >
> > > > > > And I guess the events are reset when recv() is being called because of the
> > > > > > auto-reset feature set up by CreateEvent().
> > > > > > IIUC, what Marc-André means by busy loop is the polling being looping
> > > > > > indefinitely as I encountered. I can ensure that this patch doesn't do that.
> > > > > > It can be easily checked by setting the env variable G_MAIN_POLL_DEBUG.
> > > > > > It'll show what g_poll is doing and it's normally always available on
> > > > > > Windows.
> > > > > >
> > > > > > Anyway, we'll wait for Paolo to see if he remembers why he had to call
> > > > > > WSAEnumNetworkEvents. Otherwize, let's go for your patch. Mine might
> > > > > > be a good start to improve the whole polling on Windows but if it doesn't
> > > > > > work in your case, it then needs some refinements.
> > > > > >
> > > > >
> > > > > Yeah, this issue bugged me quite a lot. If we want to reset the event
> > > > > in qio_channel_socket_source_check(), we will have to do the following
> > > > > to make sure qtests are happy.
> > > > >
> > > > > diff --git a/io/channel-watch.c b/io/channel-watch.c
> > > > > index 43d38494f7..f1e1650b81 100644
> > > > > --- a/io/channel-watch.c
> > > > > +++ b/io/channel-watch.c
> > > > > @@ -124,8 +124,6 @@ qio_channel_socket_source_check(GSource *source)
> > > > > return 0;
> > > > > }
> > > > > - WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > > > > -
> > > > > FD_ZERO(&rfds);
> > > > > FD_ZERO(&wfds);
> > > > > FD_ZERO(&xfds);
> > > > > @@ -153,6 +151,10 @@ qio_channel_socket_source_check(GSource *source)
> > > > > ssource->revents |= G_IO_PRI;
> > > > > }
> > > > > + if (ssource->revents) {
> > > > > + WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > > > > + }
> > > > > +
> > > > > return ssource->revents;
> > > > > }
> > > > >
> > > > > Removing "if (ssource->revents)" won't work.
> > > > >
> > > > > It seems to me that resetting the event twice (one time with the
> > > > > master Gsource, and the other time with the child GSource) causes some
> > > > > bizarre behavior. But MSDN [1] says
> > > > >
> > > > >     "Resetting an event that is already reset has no effect."
> > > > >
> > > > > [1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent
> > > > >
> > > >
> > > > Paolo, any comments about this issue?
> > > >
> > >
> > > v2 series has been sent out, and this patch remains unchanged.
> > >
> > > Paolo, still would appreciate your comments.
> > >
> >
> > Ping?
> >
>
> Ping? Can you please comment??
>

Ping?

Regards,
Bin

Bin Meng Oct. 17, 2022, 12:21 p.m. UTC | #16

+more people

On Tue, Oct 11, 2022 at 6:42 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>
> Hi Paolo,
>
> On Thu, Oct 6, 2022 at 11:03 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> >
> > Hi Paolo,
> >
> > On Wed, Sep 28, 2022 at 2:10 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > >
> > > Hi Paolo,
> > >
> > > On Wed, Sep 21, 2022 at 9:02 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > >
> > > > On Wed, Sep 14, 2022 at 4:08 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > >
> > > > > On Wed, Sep 7, 2022 at 1:07 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > >
> > > > > > Hi Clément,
> > > > > >
> > > > > > On Tue, Sep 6, 2022 at 8:06 PM Clément Chigot <chigot@adacore.com> wrote:
> > > > > > >
> > > > > > > > > > I checked your patch, what you did seems to be something one would
> > > > > > > > > > naturally write, but what is currently in the QEMU sources seems to be
> > > > > > > > > > written intentionally.
> > > > > > > > > >
> > > > > > > > > > +Paolo Bonzini , you are the one who implemented the socket watch on
> > > > > > > > > > Windows. Could you please help analyze this issue?
> > > > > > > > > >
> > > > > > > > > > > to avoid WSAEnumNetworkEvents for the master GSource which only has
> > > > > > > > > > > G_IO_HUP (or for any GSource having only that).
> > > > > > > > > > > As I said above, the current code doesn't do anything with it anyway.
> > > > > > > > > > > So, IMO, it's safe to do so.
> > > > > > > > > > >
> > > > > > > > > > > I'll send you my patch attached. I was planning to send it in the following
> > > > > > > > > > > weeks anyway. I was just waiting to be sure everything looks fine on our
> > > > > > > > > > > CI. Feel free to test and modify it if needed.
> > > > > > > > > >
> > > > > > > > > > I tested your patch. Unfortunately there is still one test case
> > > > > > > > > > (migration-test.exe) throwing up the "Broken pipe" message.
> > > > > > > > >
> > > > > > > > > I must say I didn't fully test it against qemu testsuite yet. Maybe there are
> > > > > > > > > some refinements to be done. "Broken pipe" might be linked to the missing
> > > > > > > > > G_IO_HUP support.
> > > > > > > > >
> > > > > > > > > > Can you test my patch instead to see if your gdb issue can be fixed?
> > > > > > > > >
> > > > > > > > > Yeah sure. I'll try to do it this afternoon.
> > > > > > >
> > > > > > > I can't explain how mad at me I am... I'm pretty sure your patch was the first
> > > > > > > thing I've tried when I encountered this issue. But it wasn't working
> > > > > > > or IIRC the
> > > > > > > issue went away but that was because the polling was actually disabled (looping
> > > > > > > indefinitely)...I'm suspecting that I already had changed the CreateEvent for
> > > > > > > WSACreateEvent which forces you to handle the reset.
> > > > > > > Finally, I end up struggling reworking the whole check function...
> > > > > > > But yeah, your patch does work fine on my gdb issues too.
> > > > > >
> > > > > > Good to know this patch works for you too.
> > > > > >
> > > > > > > And I guess the events are reset when recv() is being called because of the
> > > > > > > auto-reset feature set up by CreateEvent().
> > > > > > > IIUC, what Marc-André means by busy loop is the polling being looping
> > > > > > > indefinitely as I encountered. I can ensure that this patch doesn't do that.
> > > > > > > It can be easily checked by setting the env variable G_MAIN_POLL_DEBUG.
> > > > > > > It'll show what g_poll is doing and it's normally always available on
> > > > > > > Windows.
> > > > > > >
> > > > > > > Anyway, we'll wait for Paolo to see if he remembers why he had to call
> > > > > > > WSAEnumNetworkEvents. Otherwize, let's go for your patch. Mine might
> > > > > > > be a good start to improve the whole polling on Windows but if it doesn't
> > > > > > > work in your case, it then needs some refinements.
> > > > > > >
> > > > > >
> > > > > > Yeah, this issue bugged me quite a lot. If we want to reset the event
> > > > > > in qio_channel_socket_source_check(), we will have to do the following
> > > > > > to make sure qtests are happy.
> > > > > >
> > > > > > diff --git a/io/channel-watch.c b/io/channel-watch.c
> > > > > > index 43d38494f7..f1e1650b81 100644
> > > > > > --- a/io/channel-watch.c
> > > > > > +++ b/io/channel-watch.c
> > > > > > @@ -124,8 +124,6 @@ qio_channel_socket_source_check(GSource *source)
> > > > > > return 0;
> > > > > > }
> > > > > > - WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > > > > > -
> > > > > > FD_ZERO(&rfds);
> > > > > > FD_ZERO(&wfds);
> > > > > > FD_ZERO(&xfds);
> > > > > > @@ -153,6 +151,10 @@ qio_channel_socket_source_check(GSource *source)
> > > > > > ssource->revents |= G_IO_PRI;
> > > > > > }
> > > > > > + if (ssource->revents) {
> > > > > > + WSAEnumNetworkEvents(ssource->socket, ssource->ioc->event, &ev);
> > > > > > + }
> > > > > > +
> > > > > > return ssource->revents;
> > > > > > }
> > > > > >
> > > > > > Removing "if (ssource->revents)" won't work.
> > > > > >
> > > > > > It seems to me that resetting the event twice (one time with the
> > > > > > master Gsource, and the other time with the child GSource) causes some
> > > > > > bizarre behavior. But MSDN [1] says
> > > > > >
> > > > > >     "Resetting an event that is already reset has no effect."
> > > > > >
> > > > > > [1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent
> > > > > >
> > > > >
> > > > > Paolo, any comments about this issue?
> > > > >
> > > >
> > > > v2 series has been sent out, and this patch remains unchanged.
> > > >
> > > > Paolo, still would appreciate your comments.
> > > >
> > >
> > > Ping?
> > >
> >
> > Ping? Can you please comment??
> >
>
> Ping?
>

Hi,

Paolo remains silent. Please let me know who else could approve this
change. Thanks.

Regards,
Bin

Daniel P. Berrangé Oct. 17, 2022, 12:30 p.m. UTC | #17

On Mon, Oct 17, 2022 at 08:21:37PM +0800, Bin Meng wrote:
> +more people
> 
> On Tue, Oct 11, 2022 at 6:42 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> >
> > Hi Paolo,
> >
> > On Thu, Oct 6, 2022 at 11:03 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> > >
> > > Hi Paolo,
> > >
> > > On Wed, Sep 28, 2022 at 2:10 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > >
> > > > Hi Paolo,
> > > >
> > > > On Wed, Sep 21, 2022 at 9:02 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > >
> > > > > On Wed, Sep 14, 2022 at 4:08 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Sep 7, 2022 at 1:07 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > > > It seems to me that resetting the event twice (one time with the
> > > > > > > master Gsource, and the other time with the child GSource) causes some
> > > > > > > bizarre behavior. But MSDN [1] says
> > > > > > >
> > > > > > >     "Resetting an event that is already reset has no effect."
> > > > > > >
> > > > > > > [1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent
> > > > > > >
> > > > > >
> > > > > > Paolo, any comments about this issue?
> > > > >
> > > > > v2 series has been sent out, and this patch remains unchanged.
> > > > >
> > > > > Paolo, still would appreciate your comments.
> > > >
> > > > Ping?
> > >
> > > Ping? Can you please comment??
> >
> > Ping?
> 
> Paolo remains silent. Please let me know who else could approve this
> change. Thanks.

Given there has been plenty of time for objecting, I'll queue this
patch on the basis that you've tested it on a real Windows host
and found it better than what we have today.

With regards,
Daniel

Daniel P. Berrangé Oct. 17, 2022, 12:35 p.m. UTC | #18

On Wed, Aug 24, 2022 at 05:40:27PM +0800, Bin Meng wrote:
> From: Bin Meng <bin.meng@windriver.com>
> 
> Random failure was observed when running qtests on Windows due to
> "Broken pipe" detected by qmp_fd_receive(). What happened is that
> the qtest executable sends testing data over a socket to the QEMU
> under test but no response is received. The errno of the recv()
> call from the qtest executable indicates ETIMEOUT, due to the qmp
> chardev's tcp_chr_read() is never called to receive testing data
> hence no response is sent to the other side.
> 
> tcp_chr_read() is registered as the callback of the socket watch
> GSource. The reason of the callback not being called by glib, is
> that the source check fails to indicate the source is ready. There
> are two socket watch sources created to monitor the same socket
> event object from the char-socket backend in update_ioc_handlers().
> During the source check phase, qio_channel_socket_source_check()
> calls WSAEnumNetworkEvents() to discovers occurrences of network
> events for the indicated socket, clear internal network event records,
> and reset the event object. Testing shows that if we don't reset the
> event object by not passing the event handle to WSAEnumNetworkEvents()
> the symptom goes away and qtest runs very stably.
> 
> It looks we don't need to call WSAEnumNetworkEvents() at all, as we
> don't parse the result of WSANETWORKEVENTS returned from this API.
> We use select() to poll the socket status. Fix this instability by
> dropping the WSAEnumNetworkEvents() call.
> 
> Signed-off-by: Bin Meng <bin.meng@windriver.com>
> ---
> During the testing, I removed the following codes in update_ioc_handlers():
> 
>     remove_hup_source(s);
>     s->hup_source = qio_channel_create_watch(s->ioc, G_IO_HUP);
>     g_source_set_callback(s->hup_source, (GSourceFunc)tcp_chr_hup,
>                           chr, NULL);
>     g_source_attach(s->hup_source, chr->gcontext);
> 
> and such change also makes the symptom go away.
> 
> And if I moved the above codes to the beginning, before the call to
> io_add_watch_poll(), the symptom also goes away.
> 
> It seems two sources watching on the same socket event object is
> the key that leads to the instability. The order of adding a source
> watch seems to also play a role but I can't explain why.
> Hopefully a Windows and glib expert could explain this behavior.
> 
>  io/channel-watch.c | 4 ----
>  1 file changed, 4 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

and queued


With regards,
Daniel

Bin Meng Oct. 17, 2022, 1:03 p.m. UTC | #19

On Mon, Oct 17, 2022 at 8:30 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Mon, Oct 17, 2022 at 08:21:37PM +0800, Bin Meng wrote:
> > +more people
> >
> > On Tue, Oct 11, 2022 at 6:42 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > >
> > > Hi Paolo,
> > >
> > > On Thu, Oct 6, 2022 at 11:03 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > >
> > > > Hi Paolo,
> > > >
> > > > On Wed, Sep 28, 2022 at 2:10 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > >
> > > > > Hi Paolo,
> > > > >
> > > > > On Wed, Sep 21, 2022 at 9:02 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Sep 14, 2022 at 4:08 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Sep 7, 2022 at 1:07 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > > > > It seems to me that resetting the event twice (one time with the
> > > > > > > > master Gsource, and the other time with the child GSource) causes some
> > > > > > > > bizarre behavior. But MSDN [1] says
> > > > > > > >
> > > > > > > >     "Resetting an event that is already reset has no effect."
> > > > > > > >
> > > > > > > > [1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent
> > > > > > > >
> > > > > > >
> > > > > > > Paolo, any comments about this issue?
> > > > > >
> > > > > > v2 series has been sent out, and this patch remains unchanged.
> > > > > >
> > > > > > Paolo, still would appreciate your comments.
> > > > >
> > > > > Ping?
> > > >
> > > > Ping? Can you please comment??
> > >
> > > Ping?
> >
> > Paolo remains silent. Please let me know who else could approve this
> > change. Thanks.
>
> Given there has been plenty of time for objecting, I'll queue this
> patch on the basis that you've tested it on a real Windows host
> and found it better than what we have today.
>

Thank you Daniel!

Please queue the following patches from v5 instead.

Message-ids:

20221006151927.2079583-15-bmeng.cn@gmail.com
20221006151927.2079583-16-bmeng.cn@gmail.com
20221006151927.2079583-17-bmeng.cn@gmail.com

Regards,
Bin

Daniel P. Berrangé Oct. 17, 2022, 1:29 p.m. UTC | #20

On Mon, Oct 17, 2022 at 09:03:15PM +0800, Bin Meng wrote:
> On Mon, Oct 17, 2022 at 8:30 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Mon, Oct 17, 2022 at 08:21:37PM +0800, Bin Meng wrote:
> > > +more people
> > >
> > > On Tue, Oct 11, 2022 at 6:42 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > >
> > > > Hi Paolo,
> > > >
> > > > On Thu, Oct 6, 2022 at 11:03 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > >
> > > > > Hi Paolo,
> > > > >
> > > > > On Wed, Sep 28, 2022 at 2:10 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > >
> > > > > > Hi Paolo,
> > > > > >
> > > > > > On Wed, Sep 21, 2022 at 9:02 AM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Sep 14, 2022 at 4:08 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Sep 7, 2022 at 1:07 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> > > > > > > > > It seems to me that resetting the event twice (one time with the
> > > > > > > > > master Gsource, and the other time with the child GSource) causes some
> > > > > > > > > bizarre behavior. But MSDN [1] says
> > > > > > > > >
> > > > > > > > >     "Resetting an event that is already reset has no effect."
> > > > > > > > >
> > > > > > > > > [1] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent
> > > > > > > > >
> > > > > > > >
> > > > > > > > Paolo, any comments about this issue?
> > > > > > >
> > > > > > > v2 series has been sent out, and this patch remains unchanged.
> > > > > > >
> > > > > > > Paolo, still would appreciate your comments.
> > > > > >
> > > > > > Ping?
> > > > >
> > > > > Ping? Can you please comment??
> > > >
> > > > Ping?
> > >
> > > Paolo remains silent. Please let me know who else could approve this
> > > change. Thanks.
> >
> > Given there has been plenty of time for objecting, I'll queue this
> > patch on the basis that you've tested it on a real Windows host
> > and found it better than what we have today.
> >
> 
> Thank you Daniel!
> 
> Please queue the following patches from v5 instead.
> 
> Message-ids:
> 
> 20221006151927.2079583-15-bmeng.cn@gmail.com
> 20221006151927.2079583-16-bmeng.cn@gmail.com
> 20221006151927.2079583-17-bmeng.cn@gmail.com

Ok, have done so.

With regards,
Daniel

[49/51] io/channel-watch: Fix socket watch on Windows

Commit Message

Comments

Patch