mbox series

[v4,0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large

Message ID 20230628152726.110295-1-bmeng@tinylab.org (mailing list archive)
Headers show
Series net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large | expand

Message

Bin Meng June 28, 2023, 3:27 p.m. UTC
Current codes using a brute-force traversal of all file descriptors
do not scale on a system where the maximum number of file descriptors
is set to a very large value (e.g.: in a Docker container of Manjaro
distribution it is set to 1073741816). QEMU just looks frozen during
start-up.

The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
doesn't need to manually close the fds for child process as the proper
O_CLOEXEC flag should have been set properly on files with its own
codes, QEMU uses a huge number of 3rd party libraries and we don't
trust them to reliably be using O_CLOEXEC on everything they open.

Modern Linux and BSDs have the close_range() call we can use to do the
job, and on Linux we have one more way to walk through /proc/self/fd
to complete the task efficiently, which is what qemu_close_range()
does, a new API we add in util/osdep.c.

V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/

Changes in v4:
- add 'first > last' check logic
- reorder the ifdefs logic
- change i to unsigned int type
- use qemu_strtoi() instead of atoi()
- limit last upper value to sysconf(_SC_OPEN_MAX) - 1
- call sysconf directly instead of using a variable
- put fd on its own line

Changes in v3:
- fix win32 build failure
- limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)

Changes in v2:
- new patch: "tests/tcg/cris: Fix the coding style"
- new patch: "tests/tcg/cris: Correct the off-by-one error"
- new patch: "util/async-teardown: Fall back to close fds one by one"
- new patch: "util/osdep: Introduce qemu_close_range()"
- new patch: "util/async-teardown: Use qemu_close_range() to close fds"
- Change to use qemu_close_range() to close fds for child process efficiently
- v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/

Bin Meng (4):
  tests/tcg/cris: Fix the coding style
  tests/tcg/cris: Correct the off-by-one error
  util/async-teardown: Fall back to close fds one by one
  util/osdep: Introduce qemu_close_range()

Zhangjin Wu (2):
  util/async-teardown: Use qemu_close_range() to close fds
  net: tap: Use qemu_close_range() to close fds

 include/qemu/osdep.h                |  1 +
 net/tap.c                           | 24 ++++++------
 tests/tcg/cris/libc/check_openpf5.c | 57 +++++++++++++--------------
 util/async-teardown.c               | 37 +-----------------
 util/osdep.c                        | 60 +++++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 78 deletions(-)

Comments

Michael Tokarev June 29, 2023, 8:33 a.m. UTC | #1
28.06.2023 18:27, Bin Meng wrote:
> 
> Current codes using a brute-force traversal of all file descriptors
> do not scale on a system where the maximum number of file descriptors
> is set to a very large value (e.g.: in a Docker container of Manjaro
> distribution it is set to 1073741816). QEMU just looks frozen during
> start-up.

So, the same question as before. *Why* do we close all filedescriptors
to begin with?

Thanks,

/mjt
Daniel P. Berrangé June 29, 2023, 9:05 a.m. UTC | #2
On Thu, Jun 29, 2023 at 11:33:29AM +0300, Michael Tokarev wrote:
> 28.06.2023 18:27, Bin Meng wrote:
> > 
> > Current codes using a brute-force traversal of all file descriptors
> > do not scale on a system where the maximum number of file descriptors
> > is set to a very large value (e.g.: in a Docker container of Manjaro
> > distribution it is set to 1073741816). QEMU just looks frozen during
> > start-up.
> 
> So, the same question as before. *Why* do we close all filedescriptors
> to begin with?

The O_CLOSEXEC flag is a terrible concept, as the default behaviour of
file descriptors is to be leaked into all child processes, unless code
takes explicit action to set O_CLOEXEC in every case. Even if they are
diligent about their own code, apps developers can have zero confidence
that every library they use has set O_CLOEXEC. Not just set it after the
FD is open, but set it atomically when the the FD is open, because
threads create race conditions if not atomically set.

Leaking FDs is a security risk, and QEMU is an especially security
critical application. QEMU needs stronger guarantees that O_CLOEXEC
can offer, and mass-close before execve is the only viable option
to achieve this.

With regards,
Daniel
Bin Meng July 9, 2023, 3:47 p.m. UTC | #3
On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
>
>
> Current codes using a brute-force traversal of all file descriptors
> do not scale on a system where the maximum number of file descriptors
> is set to a very large value (e.g.: in a Docker container of Manjaro
> distribution it is set to 1073741816). QEMU just looks frozen during
> start-up.
>
> The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> doesn't need to manually close the fds for child process as the proper
> O_CLOEXEC flag should have been set properly on files with its own
> codes, QEMU uses a huge number of 3rd party libraries and we don't
> trust them to reliably be using O_CLOEXEC on everything they open.
>
> Modern Linux and BSDs have the close_range() call we can use to do the
> job, and on Linux we have one more way to walk through /proc/self/fd
> to complete the task efficiently, which is what qemu_close_range()
> does, a new API we add in util/osdep.c.
>
> V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>
> Changes in v4:
> - add 'first > last' check logic
> - reorder the ifdefs logic
> - change i to unsigned int type
> - use qemu_strtoi() instead of atoi()
> - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
> - call sysconf directly instead of using a variable
> - put fd on its own line
>
> Changes in v3:
> - fix win32 build failure
> - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
>
> Changes in v2:
> - new patch: "tests/tcg/cris: Fix the coding style"
> - new patch: "tests/tcg/cris: Correct the off-by-one error"
> - new patch: "util/async-teardown: Fall back to close fds one by one"
> - new patch: "util/osdep: Introduce qemu_close_range()"
> - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
> - Change to use qemu_close_range() to close fds for child process efficiently
> - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>
> Bin Meng (4):
>   tests/tcg/cris: Fix the coding style
>   tests/tcg/cris: Correct the off-by-one error
>   util/async-teardown: Fall back to close fds one by one
>   util/osdep: Introduce qemu_close_range()
>
> Zhangjin Wu (2):
>   util/async-teardown: Use qemu_close_range() to close fds
>   net: tap: Use qemu_close_range() to close fds
>

Ping for 8.1?
Jason Wang July 10, 2023, 3:05 a.m. UTC | #4
On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>
> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
> >
> >
> > Current codes using a brute-force traversal of all file descriptors
> > do not scale on a system where the maximum number of file descriptors
> > is set to a very large value (e.g.: in a Docker container of Manjaro
> > distribution it is set to 1073741816). QEMU just looks frozen during
> > start-up.
> >
> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> > doesn't need to manually close the fds for child process as the proper
> > O_CLOEXEC flag should have been set properly on files with its own
> > codes, QEMU uses a huge number of 3rd party libraries and we don't
> > trust them to reliably be using O_CLOEXEC on everything they open.
> >
> > Modern Linux and BSDs have the close_range() call we can use to do the
> > job, and on Linux we have one more way to walk through /proc/self/fd
> > to complete the task efficiently, which is what qemu_close_range()
> > does, a new API we add in util/osdep.c.
> >
> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >
> > Changes in v4:
> > - add 'first > last' check logic
> > - reorder the ifdefs logic
> > - change i to unsigned int type
> > - use qemu_strtoi() instead of atoi()
> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
> > - call sysconf directly instead of using a variable
> > - put fd on its own line
> >
> > Changes in v3:
> > - fix win32 build failure
> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
> >
> > Changes in v2:
> > - new patch: "tests/tcg/cris: Fix the coding style"
> > - new patch: "tests/tcg/cris: Correct the off-by-one error"
> > - new patch: "util/async-teardown: Fall back to close fds one by one"
> > - new patch: "util/osdep: Introduce qemu_close_range()"
> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
> > - Change to use qemu_close_range() to close fds for child process efficiently
> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >
> > Bin Meng (4):
> >   tests/tcg/cris: Fix the coding style
> >   tests/tcg/cris: Correct the off-by-one error
> >   util/async-teardown: Fall back to close fds one by one
> >   util/osdep: Introduce qemu_close_range()
> >
> > Zhangjin Wu (2):
> >   util/async-teardown: Use qemu_close_range() to close fds
> >   net: tap: Use qemu_close_range() to close fds
> >
>
> Ping for 8.1?

Queued.

Thanks
Markus Armbruster July 10, 2023, 6:07 a.m. UTC | #5
Jason Wang <jasowang@redhat.com> writes:

> On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>>
>> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
>> >
>> >
>> > Current codes using a brute-force traversal of all file descriptors
>> > do not scale on a system where the maximum number of file descriptors
>> > is set to a very large value (e.g.: in a Docker container of Manjaro
>> > distribution it is set to 1073741816). QEMU just looks frozen during
>> > start-up.
>> >
>> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
>> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
>> > doesn't need to manually close the fds for child process as the proper
>> > O_CLOEXEC flag should have been set properly on files with its own
>> > codes, QEMU uses a huge number of 3rd party libraries and we don't
>> > trust them to reliably be using O_CLOEXEC on everything they open.
>> >
>> > Modern Linux and BSDs have the close_range() call we can use to do the
>> > job, and on Linux we have one more way to walk through /proc/self/fd
>> > to complete the task efficiently, which is what qemu_close_range()
>> > does, a new API we add in util/osdep.c.
>> >
>> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>> >
>> > Changes in v4:
>> > - add 'first > last' check logic
>> > - reorder the ifdefs logic
>> > - change i to unsigned int type
>> > - use qemu_strtoi() instead of atoi()
>> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
>> > - call sysconf directly instead of using a variable
>> > - put fd on its own line
>> >
>> > Changes in v3:
>> > - fix win32 build failure
>> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
>> >
>> > Changes in v2:
>> > - new patch: "tests/tcg/cris: Fix the coding style"
>> > - new patch: "tests/tcg/cris: Correct the off-by-one error"
>> > - new patch: "util/async-teardown: Fall back to close fds one by one"
>> > - new patch: "util/osdep: Introduce qemu_close_range()"
>> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
>> > - Change to use qemu_close_range() to close fds for child process efficiently
>> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>> >
>> > Bin Meng (4):
>> >   tests/tcg/cris: Fix the coding style
>> >   tests/tcg/cris: Correct the off-by-one error
>> >   util/async-teardown: Fall back to close fds one by one
>> >   util/osdep: Introduce qemu_close_range()
>> >
>> > Zhangjin Wu (2):
>> >   util/async-teardown: Use qemu_close_range() to close fds
>> >   net: tap: Use qemu_close_range() to close fds
>> >
>>
>> Ping for 8.1?
>
> Queued.

There are review questions open on PATCH 4+5.
Jason Wang July 11, 2023, 2:40 a.m. UTC | #6
On Mon, Jul 10, 2023 at 2:07 PM Markus Armbruster <armbru@redhat.com> wrote:
>
> Jason Wang <jasowang@redhat.com> writes:
>
> > On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> >>
> >> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
> >> >
> >> >
> >> > Current codes using a brute-force traversal of all file descriptors
> >> > do not scale on a system where the maximum number of file descriptors
> >> > is set to a very large value (e.g.: in a Docker container of Manjaro
> >> > distribution it is set to 1073741816). QEMU just looks frozen during
> >> > start-up.
> >> >
> >> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> >> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> >> > doesn't need to manually close the fds for child process as the proper
> >> > O_CLOEXEC flag should have been set properly on files with its own
> >> > codes, QEMU uses a huge number of 3rd party libraries and we don't
> >> > trust them to reliably be using O_CLOEXEC on everything they open.
> >> >
> >> > Modern Linux and BSDs have the close_range() call we can use to do the
> >> > job, and on Linux we have one more way to walk through /proc/self/fd
> >> > to complete the task efficiently, which is what qemu_close_range()
> >> > does, a new API we add in util/osdep.c.
> >> >
> >> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >> >
> >> > Changes in v4:
> >> > - add 'first > last' check logic
> >> > - reorder the ifdefs logic
> >> > - change i to unsigned int type
> >> > - use qemu_strtoi() instead of atoi()
> >> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
> >> > - call sysconf directly instead of using a variable
> >> > - put fd on its own line
> >> >
> >> > Changes in v3:
> >> > - fix win32 build failure
> >> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
> >> >
> >> > Changes in v2:
> >> > - new patch: "tests/tcg/cris: Fix the coding style"
> >> > - new patch: "tests/tcg/cris: Correct the off-by-one error"
> >> > - new patch: "util/async-teardown: Fall back to close fds one by one"
> >> > - new patch: "util/osdep: Introduce qemu_close_range()"
> >> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
> >> > - Change to use qemu_close_range() to close fds for child process efficiently
> >> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >> >
> >> > Bin Meng (4):
> >> >   tests/tcg/cris: Fix the coding style
> >> >   tests/tcg/cris: Correct the off-by-one error
> >> >   util/async-teardown: Fall back to close fds one by one
> >> >   util/osdep: Introduce qemu_close_range()
> >> >
> >> > Zhangjin Wu (2):
> >> >   util/async-teardown: Use qemu_close_range() to close fds
> >> >   net: tap: Use qemu_close_range() to close fds
> >> >
> >>
> >> Ping for 8.1?
> >
> > Queued.
>
> There are review questions open on PATCH 4+5.

Sorry, I missed that. I've dropped them from the queue now.

Thanks

>