Message ID | 20240607150021.121536-1-alexander.ivanov@virtuozzo.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | nbd: Prevent NULL pointer dereference in nbd_blockdev_client_closed() | expand |
There is a bug reproducer in the attachment. On 6/7/24 17:00, Alexander Ivanov wrote: > In some cases, the NBD server can be stopped before > nbd_blockdev_client_closed() is called, causing the nbd_server variable > to be nullified. This leads to a NULL pointer dereference when accessing > nbd_server. > > Add a NULL check for nbd_server to the nbd_blockdev_client_closed() > function to prevent NULL pointer dereference. > > Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> > --- > blockdev-nbd.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/blockdev-nbd.c b/blockdev-nbd.c > index 213012435f..fb1f30ae0d 100644 > --- a/blockdev-nbd.c > +++ b/blockdev-nbd.c > @@ -52,6 +52,9 @@ int nbd_server_max_connections(void) > static void nbd_blockdev_client_closed(NBDClient *client, bool ignored) > { > nbd_client_put(client); > + if (nbd_server == NULL) { > + return; > + } > assert(nbd_server->connections > 0); > nbd_server->connections--; > nbd_update_server_watch(nbd_server);
On Sat, Jun 08, 2024 at 11:36:59AM GMT, Alexander Ivanov wrote: > There is a bug reproducer in the attachment. Summarizing the reproducer, you are repeatedly calling QMP nbd-server-start/nbd-server-stop on qemu as NBD server in one thread, and repeatedly calling 'qemu-nbd -L' in another, to try and provoke the race where the server is stopped while qemu-nbd -L is still trying to grab information. > > > On 6/7/24 17:00, Alexander Ivanov wrote: > > In some cases, the NBD server can be stopped before > > nbd_blockdev_client_closed() is called, causing the nbd_server variable > > to be nullified. This leads to a NULL pointer dereference when accessing > > nbd_server. Am I correct that the NULL pointer deref was in qemu-nbd in your reproducer, and not in qemu-kvm? > > > > Add a NULL check for nbd_server to the nbd_blockdev_client_closed() > > function to prevent NULL pointer dereference. > > > > Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> > > --- > > blockdev-nbd.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/blockdev-nbd.c b/blockdev-nbd.c > > index 213012435f..fb1f30ae0d 100644 > > --- a/blockdev-nbd.c > > +++ b/blockdev-nbd.c > > @@ -52,6 +52,9 @@ int nbd_server_max_connections(void) > > static void nbd_blockdev_client_closed(NBDClient *client, bool ignored) > > { > > nbd_client_put(client); > > + if (nbd_server == NULL) { > > + return; > > + } > > assert(nbd_server->connections > 0); While this does indeed avoid the NULL dereference right here, I still want to understand why nbd_server is getting set to NULL while there is still an active client, and make sure we don't have any other NULL derefs. I'll respond again once I've studied the code a bit more. > > nbd_server->connections--; > > nbd_update_server_watch(nbd_server); > > -- > Best regards, > Alexander Ivanov
On 6/10/24 14:33, Eric Blake wrote: > On Sat, Jun 08, 2024 at 11:36:59AM GMT, Alexander Ivanov wrote: >> There is a bug reproducer in the attachment. > Summarizing the reproducer, you are repeatedly calling QMP > nbd-server-start/nbd-server-stop on qemu as NBD server in one thread, > and repeatedly calling 'qemu-nbd -L' in another, to try and provoke > the race where the server is stopped while qemu-nbd -L is still trying > to grab information. Yes, you are right. > >> >> On 6/7/24 17:00, Alexander Ivanov wrote: >>> In some cases, the NBD server can be stopped before >>> nbd_blockdev_client_closed() is called, causing the nbd_server variable >>> to be nullified. This leads to a NULL pointer dereference when accessing >>> nbd_server. > Am I correct that the NULL pointer deref was in qemu-nbd in your > reproducer, and not in qemu-kvm? No, it happened in qemu-kvm. > >>> Add a NULL check for nbd_server to the nbd_blockdev_client_closed() >>> function to prevent NULL pointer dereference. >>> >>> Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> >>> --- >>> blockdev-nbd.c | 3 +++ >>> 1 file changed, 3 insertions(+) >>> >>> diff --git a/blockdev-nbd.c b/blockdev-nbd.c >>> index 213012435f..fb1f30ae0d 100644 >>> --- a/blockdev-nbd.c >>> +++ b/blockdev-nbd.c >>> @@ -52,6 +52,9 @@ int nbd_server_max_connections(void) >>> static void nbd_blockdev_client_closed(NBDClient *client, bool ignored) >>> { >>> nbd_client_put(client); >>> + if (nbd_server == NULL) { >>> + return; >>> + } >>> assert(nbd_server->connections > 0); > While this does indeed avoid the NULL dereference right here, I still > want to understand why nbd_server is getting set to NULL while there > is still an active client, and make sure we don't have any other NULL > derefs. I'll respond again once I've studied the code a bit more. I agree that the patch fixes only symptoms, but haven't succeeded with the bug roots investigation, and decided to raise this issue with the community. Also, later I reproduced another bug with the patch applied and with the same reproducer. There was an assert(nbd_server->connections > 0); violation in nbd_blockdev_client_closed(). It happens much less frequently, however. Think the reasons are similar to the original bug. > >>> nbd_server->connections--; >>> nbd_update_server_watch(nbd_server); >> -- >> Best regards, >> Alexander Ivanov
Hello Eric, Do you have any ideas about the bug? Thank you. On 6/10/24 14:33, Eric Blake wrote: > On Sat, Jun 08, 2024 at 11:36:59AM GMT, Alexander Ivanov wrote: >> There is a bug reproducer in the attachment. > Summarizing the reproducer, you are repeatedly calling QMP > nbd-server-start/nbd-server-stop on qemu as NBD server in one thread, > and repeatedly calling 'qemu-nbd -L' in another, to try and provoke > the race where the server is stopped while qemu-nbd -L is still trying > to grab information. > >> >> On 6/7/24 17:00, Alexander Ivanov wrote: >>> In some cases, the NBD server can be stopped before >>> nbd_blockdev_client_closed() is called, causing the nbd_server variable >>> to be nullified. This leads to a NULL pointer dereference when accessing >>> nbd_server. > Am I correct that the NULL pointer deref was in qemu-nbd in your > reproducer, and not in qemu-kvm? > >>> Add a NULL check for nbd_server to the nbd_blockdev_client_closed() >>> function to prevent NULL pointer dereference. >>> >>> Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> >>> --- >>> blockdev-nbd.c | 3 +++ >>> 1 file changed, 3 insertions(+) >>> >>> diff --git a/blockdev-nbd.c b/blockdev-nbd.c >>> index 213012435f..fb1f30ae0d 100644 >>> --- a/blockdev-nbd.c >>> +++ b/blockdev-nbd.c >>> @@ -52,6 +52,9 @@ int nbd_server_max_connections(void) >>> static void nbd_blockdev_client_closed(NBDClient *client, bool ignored) >>> { >>> nbd_client_put(client); >>> + if (nbd_server == NULL) { >>> + return; >>> + } >>> assert(nbd_server->connections > 0); > While this does indeed avoid the NULL dereference right here, I still > want to understand why nbd_server is getting set to NULL while there > is still an active client, and make sure we don't have any other NULL > derefs. I'll respond again once I've studied the code a bit more. > >>> nbd_server->connections--; >>> nbd_update_server_watch(nbd_server); >> -- >> Best regards, >> Alexander Ivanov
Ping? On 6/7/24 17:00, Alexander Ivanov wrote: > static void nbd_blockdev_client_closed(NBDClient *client, bool ignored) > { > nbd_client_put(client); > + if (nbd_server == NULL) { > + return; > + } > assert(nbd_server->connections > 0); > nbd_server->connections--; > nbd_update_server_watch(nbd_server);
On Fri, Jun 28, 2024 at 11:58:37AM GMT, Alexander Ivanov wrote: > Ping? > > On 6/7/24 17:00, Alexander Ivanov wrote: > > static void nbd_blockdev_client_closed(NBDClient *client, bool ignored) > > { > > nbd_client_put(client); > > + if (nbd_server == NULL) { > > + return; > > + } > > assert(nbd_server->connections > 0); > > nbd_server->connections--; > > nbd_update_server_watch(nbd_server); > I've spent more time looking at this, and I'm stumped. Maybe Dan can help out. If I understand correctly, here's the race situation we have: qemu main loop coroutine client QMP nbd-server-start - calls nbd_server_start - assigns nbd_server = g_new0() - calls qio_net_listener_open_sync - calls nbd_server_update_watch - calls qio_net_listener_set_client_func(, nbd_accept, ) - waiting for client is now in coroutine - returns control to QMP main loop opens TCP socket - qio notices incoming connection - calls nbd_accept callback - starts NBD handshake nbd_client_new(, nbd_blockdev_client_closed) QMP nbd-server-stop - calls nbd_server_stop - calls nbd_server_free - calls qio_net_listener_disconnect() - marks qio channel as useless - sets nbd_server = NULL - qio sees channel is useless, disconnects client deals gracefully with dead connection - calls nbd_blockdev_client_closed callback - segfault since nbd_server->connections derefs NULL Thread 1 "qemu-kvm" received signal SIGSEGV, Segmentation fault. 0x000055555600af59 in nbd_blockdev_client_closed (client=0x55555810dfc0, ignored=false) at ../blockdev-nbd.c:56 56 assert(nbd_server->connections > 0); (gdb) p nbd_server $1 = (NBDServerData *) 0x0 (gdb) bt #0 0x000055555600af59 in nbd_blockdev_client_closed (client=0x55555810dfc0, ignored=false) at ../blockdev-nbd.c:56 #1 0x0000555555ff72f9 in client_close (client=0x55555810dfc0, negotiated=false) at ../nbd/server.c:1624 #2 0x0000555555ffbc49 in nbd_co_client_start (opaque=0x55555810dfc0) at ../nbd/server.c:3198 #3 0x0000555556220629 in coroutine_trampoline (i0=1488434896, i1=21845) Worse, the race could go another way: if another QMP nbd-server-start is called fast enough before the coroutine finishes the nbd_accept code from the first connection, it could attempt to modify the second nbd_server object, which may be associated with a completely different socket. As far as I can tell, the problem stems from the fact that the attempt to establish the connection with the client is continuing in a background coroutine despite our efforts to honor the QMP nbd-server-stop command. But I'm not sure on the proper way to inform qio to abandon all incoming traffic to a given net listener. Or maybe I should just change the semantics of QMP nbd-server-stop to fail if there are known connections from and nbd_accept() - but I still don't want that to wait indefinitely, so I still want some way to tell the qio channel that the server is going away and to tear down sockets as soon as possible. As a stopgap, something like this might prevent the SEGV: diff --git i/blockdev-nbd.c w/blockdev-nbd.c index 213012435f4..2f5ea094407 100644 --- i/blockdev-nbd.c +++ w/blockdev-nbd.c @@ -277,6 +277,12 @@ void qmp_nbd_server_stop(Error **errp) blk_exp_close_all_type(BLOCK_EXPORT_TYPE_NBD); + if (qio_net_listener_is_connected(nbd_server->listener) || + nbd_server->connections > 0) { + error_setg(errp, "NBD server still has connected clients"); + return; + } + nbd_server_free(nbd_server); nbd_server = NULL; } but it's not as graceful as I'd like (it would be nicer to have the nbd-server-stop command wait until it knows all connections are cleaned, instead of failing up front).
diff --git a/blockdev-nbd.c b/blockdev-nbd.c index 213012435f..fb1f30ae0d 100644 --- a/blockdev-nbd.c +++ b/blockdev-nbd.c @@ -52,6 +52,9 @@ int nbd_server_max_connections(void) static void nbd_blockdev_client_closed(NBDClient *client, bool ignored) { nbd_client_put(client); + if (nbd_server == NULL) { + return; + } assert(nbd_server->connections > 0); nbd_server->connections--; nbd_update_server_watch(nbd_server);
In some cases, the NBD server can be stopped before nbd_blockdev_client_closed() is called, causing the nbd_server variable to be nullified. This leads to a NULL pointer dereference when accessing nbd_server. Add a NULL check for nbd_server to the nbd_blockdev_client_closed() function to prevent NULL pointer dereference. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> --- blockdev-nbd.c | 3 +++ 1 file changed, 3 insertions(+)