Message ID | 20230628152726.110295-1-bmeng@tinylab.org (mailing list archive) |
---|---|
Headers | show |
Series | net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large | expand |
28.06.2023 18:27, Bin Meng wrote: > > Current codes using a brute-force traversal of all file descriptors > do not scale on a system where the maximum number of file descriptors > is set to a very large value (e.g.: in a Docker container of Manjaro > distribution it is set to 1073741816). QEMU just looks frozen during > start-up. So, the same question as before. *Why* do we close all filedescriptors to begin with? Thanks, /mjt
On Thu, Jun 29, 2023 at 11:33:29AM +0300, Michael Tokarev wrote: > 28.06.2023 18:27, Bin Meng wrote: > > > > Current codes using a brute-force traversal of all file descriptors > > do not scale on a system where the maximum number of file descriptors > > is set to a very large value (e.g.: in a Docker container of Manjaro > > distribution it is set to 1073741816). QEMU just looks frozen during > > start-up. > > So, the same question as before. *Why* do we close all filedescriptors > to begin with? The O_CLOSEXEC flag is a terrible concept, as the default behaviour of file descriptors is to be leaked into all child processes, unless code takes explicit action to set O_CLOEXEC in every case. Even if they are diligent about their own code, apps developers can have zero confidence that every library they use has set O_CLOEXEC. Not just set it after the FD is open, but set it atomically when the the FD is open, because threads create race conditions if not atomically set. Leaking FDs is a security risk, and QEMU is an especially security critical application. QEMU needs stronger guarantees that O_CLOEXEC can offer, and mass-close before execve is the only viable option to achieve this. With regards, Daniel
On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote: > > > Current codes using a brute-force traversal of all file descriptors > do not scale on a system where the maximum number of file descriptors > is set to a very large value (e.g.: in a Docker container of Manjaro > distribution it is set to 1073741816). QEMU just looks frozen during > start-up. > > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU > doesn't need to manually close the fds for child process as the proper > O_CLOEXEC flag should have been set properly on files with its own > codes, QEMU uses a huge number of 3rd party libraries and we don't > trust them to reliably be using O_CLOEXEC on everything they open. > > Modern Linux and BSDs have the close_range() call we can use to do the > job, and on Linux we have one more way to walk through /proc/self/fd > to complete the task efficiently, which is what qemu_close_range() > does, a new API we add in util/osdep.c. > > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/ > > Changes in v4: > - add 'first > last' check logic > - reorder the ifdefs logic > - change i to unsigned int type > - use qemu_strtoi() instead of atoi() > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1 > - call sysconf directly instead of using a variable > - put fd on its own line > > Changes in v3: > - fix win32 build failure > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX) > > Changes in v2: > - new patch: "tests/tcg/cris: Fix the coding style" > - new patch: "tests/tcg/cris: Correct the off-by-one error" > - new patch: "util/async-teardown: Fall back to close fds one by one" > - new patch: "util/osdep: Introduce qemu_close_range()" > - new patch: "util/async-teardown: Use qemu_close_range() to close fds" > - Change to use qemu_close_range() to close fds for child process efficiently > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/ > > Bin Meng (4): > tests/tcg/cris: Fix the coding style > tests/tcg/cris: Correct the off-by-one error > util/async-teardown: Fall back to close fds one by one > util/osdep: Introduce qemu_close_range() > > Zhangjin Wu (2): > util/async-teardown: Use qemu_close_range() to close fds > net: tap: Use qemu_close_range() to close fds > Ping for 8.1?
On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote: > > On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote: > > > > > > Current codes using a brute-force traversal of all file descriptors > > do not scale on a system where the maximum number of file descriptors > > is set to a very large value (e.g.: in a Docker container of Manjaro > > distribution it is set to 1073741816). QEMU just looks frozen during > > start-up. > > > > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel > > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU > > doesn't need to manually close the fds for child process as the proper > > O_CLOEXEC flag should have been set properly on files with its own > > codes, QEMU uses a huge number of 3rd party libraries and we don't > > trust them to reliably be using O_CLOEXEC on everything they open. > > > > Modern Linux and BSDs have the close_range() call we can use to do the > > job, and on Linux we have one more way to walk through /proc/self/fd > > to complete the task efficiently, which is what qemu_close_range() > > does, a new API we add in util/osdep.c. > > > > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/ > > > > Changes in v4: > > - add 'first > last' check logic > > - reorder the ifdefs logic > > - change i to unsigned int type > > - use qemu_strtoi() instead of atoi() > > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1 > > - call sysconf directly instead of using a variable > > - put fd on its own line > > > > Changes in v3: > > - fix win32 build failure > > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX) > > > > Changes in v2: > > - new patch: "tests/tcg/cris: Fix the coding style" > > - new patch: "tests/tcg/cris: Correct the off-by-one error" > > - new patch: "util/async-teardown: Fall back to close fds one by one" > > - new patch: "util/osdep: Introduce qemu_close_range()" > > - new patch: "util/async-teardown: Use qemu_close_range() to close fds" > > - Change to use qemu_close_range() to close fds for child process efficiently > > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/ > > > > Bin Meng (4): > > tests/tcg/cris: Fix the coding style > > tests/tcg/cris: Correct the off-by-one error > > util/async-teardown: Fall back to close fds one by one > > util/osdep: Introduce qemu_close_range() > > > > Zhangjin Wu (2): > > util/async-teardown: Use qemu_close_range() to close fds > > net: tap: Use qemu_close_range() to close fds > > > > Ping for 8.1? Queued. Thanks
Jason Wang <jasowang@redhat.com> writes: > On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote: >> >> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote: >> > >> > >> > Current codes using a brute-force traversal of all file descriptors >> > do not scale on a system where the maximum number of file descriptors >> > is set to a very large value (e.g.: in a Docker container of Manjaro >> > distribution it is set to 1073741816). QEMU just looks frozen during >> > start-up. >> > >> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel >> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU >> > doesn't need to manually close the fds for child process as the proper >> > O_CLOEXEC flag should have been set properly on files with its own >> > codes, QEMU uses a huge number of 3rd party libraries and we don't >> > trust them to reliably be using O_CLOEXEC on everything they open. >> > >> > Modern Linux and BSDs have the close_range() call we can use to do the >> > job, and on Linux we have one more way to walk through /proc/self/fd >> > to complete the task efficiently, which is what qemu_close_range() >> > does, a new API we add in util/osdep.c. >> > >> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/ >> > >> > Changes in v4: >> > - add 'first > last' check logic >> > - reorder the ifdefs logic >> > - change i to unsigned int type >> > - use qemu_strtoi() instead of atoi() >> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1 >> > - call sysconf directly instead of using a variable >> > - put fd on its own line >> > >> > Changes in v3: >> > - fix win32 build failure >> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX) >> > >> > Changes in v2: >> > - new patch: "tests/tcg/cris: Fix the coding style" >> > - new patch: "tests/tcg/cris: Correct the off-by-one error" >> > - new patch: "util/async-teardown: Fall back to close fds one by one" >> > - new patch: "util/osdep: Introduce qemu_close_range()" >> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds" >> > - Change to use qemu_close_range() to close fds for child process efficiently >> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/ >> > >> > Bin Meng (4): >> > tests/tcg/cris: Fix the coding style >> > tests/tcg/cris: Correct the off-by-one error >> > util/async-teardown: Fall back to close fds one by one >> > util/osdep: Introduce qemu_close_range() >> > >> > Zhangjin Wu (2): >> > util/async-teardown: Use qemu_close_range() to close fds >> > net: tap: Use qemu_close_range() to close fds >> > >> >> Ping for 8.1? > > Queued. There are review questions open on PATCH 4+5.
On Mon, Jul 10, 2023 at 2:07 PM Markus Armbruster <armbru@redhat.com> wrote: > > Jason Wang <jasowang@redhat.com> writes: > > > On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote: > >> > >> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote: > >> > > >> > > >> > Current codes using a brute-force traversal of all file descriptors > >> > do not scale on a system where the maximum number of file descriptors > >> > is set to a very large value (e.g.: in a Docker container of Manjaro > >> > distribution it is set to 1073741816). QEMU just looks frozen during > >> > start-up. > >> > > >> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel > >> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU > >> > doesn't need to manually close the fds for child process as the proper > >> > O_CLOEXEC flag should have been set properly on files with its own > >> > codes, QEMU uses a huge number of 3rd party libraries and we don't > >> > trust them to reliably be using O_CLOEXEC on everything they open. > >> > > >> > Modern Linux and BSDs have the close_range() call we can use to do the > >> > job, and on Linux we have one more way to walk through /proc/self/fd > >> > to complete the task efficiently, which is what qemu_close_range() > >> > does, a new API we add in util/osdep.c. > >> > > >> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/ > >> > > >> > Changes in v4: > >> > - add 'first > last' check logic > >> > - reorder the ifdefs logic > >> > - change i to unsigned int type > >> > - use qemu_strtoi() instead of atoi() > >> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1 > >> > - call sysconf directly instead of using a variable > >> > - put fd on its own line > >> > > >> > Changes in v3: > >> > - fix win32 build failure > >> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX) > >> > > >> > Changes in v2: > >> > - new patch: "tests/tcg/cris: Fix the coding style" > >> > - new patch: "tests/tcg/cris: Correct the off-by-one error" > >> > - new patch: "util/async-teardown: Fall back to close fds one by one" > >> > - new patch: "util/osdep: Introduce qemu_close_range()" > >> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds" > >> > - Change to use qemu_close_range() to close fds for child process efficiently > >> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/ > >> > > >> > Bin Meng (4): > >> > tests/tcg/cris: Fix the coding style > >> > tests/tcg/cris: Correct the off-by-one error > >> > util/async-teardown: Fall back to close fds one by one > >> > util/osdep: Introduce qemu_close_range() > >> > > >> > Zhangjin Wu (2): > >> > util/async-teardown: Use qemu_close_range() to close fds > >> > net: tap: Use qemu_close_range() to close fds > >> > > >> > >> Ping for 8.1? > > > > Queued. > > There are review questions open on PATCH 4+5. Sorry, I missed that. I've dropped them from the queue now. Thanks >