diff mbox series

[v2] btrfs-progs: fix race window during mkfs

Message ID eda4915edce8006a1578082817733a9af74a9b97.1678860378.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series [v2] btrfs-progs: fix race window during mkfs | expand

Commit Message

Qu Wenruo March 15, 2023, 6:06 a.m. UTC
[BUG]
There is an internal bug report that, after mkfs.btrfs there is a chance
that no /dev/disk/by-uuid/<uuid> soft link is not created at all.

[CAUSE]
That uuid soft link is created by udev, which listens inotify
IN_CLOSE_WRITE events from all block devices.

After such IN_CLOSE_WRITE event is triggered, udev would *disable*
inotify for that block device, and do a blkid scan on it.
After the blkid scan is done, re-enable the inotify listening.

This means normally mkfs tools should open the fd, do all the writes,
and close the fd after everything is done.

But unfortunately for mkfs.btrfs, it's not the case, we have a lot of
phases seperated by different close() calls:

  open_ctree() would open fds of each involved device
  and close them at close_ctree()
  Only after close_ctree() we have a valid superblock -\
                                                       |
|<------- A -------->|<--------- B --------->|<------- C ------->|
          |                      |
          |                      `- open a new fd for make_btrfs()
          |                         and close it before open_ctree()
          |                         The device contains invalid sb.
          |
          `- open a new fd for each device, then call
             btrfs_prepare_device(), then close the fd.
             The device would contain no valid superblock.

If at the close() of phase A udev event is triggered, while doing udev
scan we go into phase C (but before the new valid super blocks written),
udev would only see no superblock or invalid superblock.

Then phase C finished, udev resume its inotify listening, but at this
timing mkfs is finished, while udev only see the premature data from
phase A, and missed the IN_CLOSE_WRITE events from phase C.

[FIX]
Instead of open and close a new fd for each device, re-use the fd opened
during prepare_one_device(), and close all the fds until close_ctree()
is called.

By this, although we may still have race between close_ctree() and
explicit close() calls, at least udev can always see the properly
written super blocks.

To compensate the change, some extra cleanups are made:

- Do not touch @device_count
  Which makes later prepare_ctx iteration much easier.

- Remove top-level @fd variable
  Instead go with prepare_ctx[i].fd.

- Do not open with O_RDWR in test_dev_for_mkfs()
  as test_dev_for_mkfs() would close the fd, if we go O_RDWR, it can
  cause the udev race.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Fix the call site in test_dev_for_mkfs()
---
 mkfs/common.c |  2 +-
 mkfs/main.c   | 65 ++++++++++++++++++++++-----------------------------
 2 files changed, 29 insertions(+), 38 deletions(-)

Comments

David Sterba April 19, 2023, 7:45 p.m. UTC | #1
On Wed, Mar 15, 2023 at 02:06:54PM +0800, Qu Wenruo wrote:
> [BUG]
> There is an internal bug report that, after mkfs.btrfs there is a chance
> that no /dev/disk/by-uuid/<uuid> soft link is not created at all.
> 
> [CAUSE]
> That uuid soft link is created by udev, which listens inotify
> IN_CLOSE_WRITE events from all block devices.
> 
> After such IN_CLOSE_WRITE event is triggered, udev would *disable*
> inotify for that block device, and do a blkid scan on it.
> After the blkid scan is done, re-enable the inotify listening.
> 
> This means normally mkfs tools should open the fd, do all the writes,
> and close the fd after everything is done.
> 
> But unfortunately for mkfs.btrfs, it's not the case, we have a lot of
> phases seperated by different close() calls:
> 
>   open_ctree() would open fds of each involved device
>   and close them at close_ctree()
>   Only after close_ctree() we have a valid superblock -\
>                                                        |
> |<------- A -------->|<--------- B --------->|<------- C ------->|
>           |                      |
>           |                      `- open a new fd for make_btrfs()
>           |                         and close it before open_ctree()
>           |                         The device contains invalid sb.
>           |
>           `- open a new fd for each device, then call
>              btrfs_prepare_device(), then close the fd.
>              The device would contain no valid superblock.
> 
> If at the close() of phase A udev event is triggered, while doing udev
> scan we go into phase C (but before the new valid super blocks written),
> udev would only see no superblock or invalid superblock.
> 
> Then phase C finished, udev resume its inotify listening, but at this
> timing mkfs is finished, while udev only see the premature data from
> phase A, and missed the IN_CLOSE_WRITE events from phase C.
> 
> [FIX]
> Instead of open and close a new fd for each device, re-use the fd opened
> during prepare_one_device(), and close all the fds until close_ctree()
> is called.
> 
> By this, although we may still have race between close_ctree() and
> explicit close() calls, at least udev can always see the properly
> written super blocks.
> 
> To compensate the change, some extra cleanups are made:
> 
> - Do not touch @device_count
>   Which makes later prepare_ctx iteration much easier.
> 
> - Remove top-level @fd variable
>   Instead go with prepare_ctx[i].fd.
> 
> - Do not open with O_RDWR in test_dev_for_mkfs()
>   as test_dev_for_mkfs() would close the fd, if we go O_RDWR, it can
>   cause the udev race.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Fix the call site in test_dev_for_mkfs()

Added to devel. Thanks for writing down the analysis, the interactions
between udev and open/close are tricky.
Anand Jain April 20, 2023, 7:25 a.m. UTC | #2
On 15/3/23 14:06, Qu Wenruo wrote:
> [BUG]
> There is an internal bug report that, after mkfs.btrfs there is a chance
> that no /dev/disk/by-uuid/<uuid> soft link is not created at all.
> 
> [CAUSE]
> That uuid soft link is created by udev, which listens inotify
> IN_CLOSE_WRITE events from all block devices.
> 
> After such IN_CLOSE_WRITE event is triggered, udev would *disable*
> inotify for that block device, and do a blkid scan on it.
> After the blkid scan is done, re-enable the inotify listening.
> 
> This means normally mkfs tools should open the fd, do all the writes,
> and close the fd after everything is done.
> 
> But unfortunately for mkfs.btrfs, it's not the case, we have a lot of
> phases seperated by different close() calls:
> 
>    open_ctree() would open fds of each involved device
>    and close them at close_ctree()
>    Only after close_ctree() we have a valid superblock -\
>                                                         |
> |<------- A -------->|<--------- B --------->|<------- C ------->|
>            |                      |
>            |                      `- open a new fd for make_btrfs()
>            |                         and close it before open_ctree()
>            |                         The device contains invalid sb.
>            |
>            `- open a new fd for each device, then call
>               btrfs_prepare_device(), then close the fd.
>               The device would contain no valid superblock.
> 
> If at the close() of phase A udev event is triggered, while doing udev
> scan we go into phase C (but before the new valid super blocks written),
> udev would only see no superblock or invalid superblock.
> 
> Then phase C finished, udev resume its inotify listening, but at this
> timing mkfs is finished, while udev only see the premature data from
> phase A, and missed the IN_CLOSE_WRITE events from phase C.
> 
> [FIX]
> Instead of open and close a new fd for each device, re-use the fd opened
> during prepare_one_device(), and close all the fds until close_ctree()
> is called.
> 
> By this, although we may still have race between close_ctree() and
> explicit close() calls, at least udev can always see the properly
> written super blocks.
> 
> To compensate the change, some extra cleanups are made:
> 
> - Do not touch @device_count
>    Which makes later prepare_ctx iteration much easier.
> 
> - Remove top-level @fd variable
>    Instead go with prepare_ctx[i].fd.
> 
> - Do not open with O_RDWR in test_dev_for_mkfs()
>    as test_dev_for_mkfs() would close the fd, if we go O_RDWR, it can
>    cause the udev race.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Sorry, I missed this earlier. Thanks for the nice fix and cleanup!

Reviewed-by: Anand Jain <anand.jain@oracle.com>
diff mbox series

Patch

diff --git a/mkfs/common.c b/mkfs/common.c
index d77688ba584d..1988bc88aff9 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -1085,7 +1085,7 @@  int test_dev_for_mkfs(const char *file, int force_overwrite)
 	if (ret)
 		return 1;
 	/* check if the device is busy */
-	fd = open(file, O_RDWR|O_EXCL);
+	fd = open(file, O_RDONLY|O_EXCL);
 	if (fd < 0) {
 		error("unable to open %s: %m", file);
 		return 1;
diff --git a/mkfs/main.c b/mkfs/main.c
index f5e34cbda612..2595100d800b 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -72,6 +72,7 @@  static bool opt_zoned = true;
 static int opt_oflags = O_RDWR;
 
 struct prepare_device_progress {
+	int fd;
 	char *file;
 	u64 dev_block_count;
 	u64 block_count;
@@ -965,23 +966,21 @@  fail:
 static void *prepare_one_device(void *ctx)
 {
 	struct prepare_device_progress *prepare_ctx = ctx;
-	int fd;
 
-	fd = open(prepare_ctx->file, opt_oflags);
-	if (fd < 0) {
+	prepare_ctx->fd = open(prepare_ctx->file, opt_oflags);
+	if (prepare_ctx->fd < 0) {
 		error("unable to open %s: %m", prepare_ctx->file);
 		prepare_ctx->ret = -errno;
 		return NULL;
 	}
-	prepare_ctx->ret = btrfs_prepare_device(fd, prepare_ctx->file,
+	prepare_ctx->ret = btrfs_prepare_device(prepare_ctx->fd,
+				prepare_ctx->file,
 				&prepare_ctx->dev_block_count,
 				prepare_ctx->block_count,
 				(bconf.verbose ? PREP_DEVICE_VERBOSE : 0) |
 				(opt_zero_end ? PREP_DEVICE_ZERO_END : 0) |
 				(opt_discard ? PREP_DEVICE_DISCARD : 0) |
 				(opt_zoned ? PREP_DEVICE_ZONED : 0));
-	close(fd);
-
 	return NULL;
 }
 
@@ -992,7 +991,6 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_trans_handle *trans;
 	struct open_ctree_flags ocf = { 0 };
-	int fd = -1;
 	int ret = 0;
 	int close_ret;
 	int i;
@@ -1244,8 +1242,9 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 		}
 	}
 
-	while (device_count-- > 0) {
+	for (i = 0; i < device_count; i++) {
 		file = argv[optind++];
+
 		if (source_dir_set && path_exists(file) == 0)
 			ret = 0;
 		else if (path_is_block_device(file) == 1)
@@ -1391,6 +1390,7 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 	if (source_dir_set) {
 		int oflags = O_RDWR;
 		struct stat statbuf;
+		int fd;
 
 		if (path_exists(file) == 0)
 			oflags |= O_CREAT;
@@ -1521,12 +1521,6 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 		goto error;
 	}
 
-	device_count--;
-	fd = open(file, opt_oflags);
-	if (fd < 0) {
-		error("unable to open %s: %m", file);
-		goto error;
-	}
 	dev_block_count = prepare_ctx[0].dev_block_count;
 	if (block_count && block_count > dev_block_count) {
 		error("%s is smaller than requested size, expected %llu, found %llu",
@@ -1567,7 +1561,7 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 	else
 		mkfs_cfg.zone_size = 0;
 
-	ret = make_btrfs(fd, &mkfs_cfg);
+	ret = make_btrfs(prepare_ctx[0].fd, &mkfs_cfg);
 	if (ret) {
 		errno = -ret;
 		error("error during mkfs: %m");
@@ -1581,8 +1575,7 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 		error("open ctree failed");
 		goto error;
 	}
-	close(fd);
-	fd = -1;
+
 	root = fs_info->fs_root;
 
 	ret = create_metadata_block_groups(root, mixed, &allocation);
@@ -1635,36 +1628,29 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 	if (device_count == 0)
 		goto raid_groups;
 
-	while (device_count-- > 0) {
-		int dev_index = argc - saved_optind - device_count - 1;
-		file = argv[optind++];
-
-		fd = open(file, opt_oflags);
-		if (fd < 0) {
-			error("unable to open %s: %m", file);
-			goto error;
-		}
-		ret = btrfs_device_already_in_root(root, fd,
+	for (i = 1; i < device_count; i++) {
+		ret = btrfs_device_already_in_root(root, prepare_ctx[i].fd,
 						   BTRFS_SUPER_INFO_OFFSET);
 		if (ret) {
 			error("skipping duplicate device %s in the filesystem",
 				file);
-			close(fd);
 			continue;
 		}
-		dev_block_count = prepare_ctx[dev_index].dev_block_count;
+		dev_block_count = prepare_ctx[i].dev_block_count;
 
-		if (prepare_ctx[dev_index].ret) {
-			errno = -prepare_ctx[dev_index].ret;
+		if (prepare_ctx[i].ret) {
+			errno = -prepare_ctx[i].ret;
 			error("unable to prepare device %s: %m",
-					prepare_ctx[dev_index].file);
+					prepare_ctx[i].file);
 			goto error;
 		}
 
-		ret = btrfs_add_to_fsid(trans, root, fd, file, dev_block_count,
+		ret = btrfs_add_to_fsid(trans, root, prepare_ctx[i].fd,
+					prepare_ctx[i].file, dev_block_count,
 					sectorsize, sectorsize, sectorsize);
 		if (ret) {
-			error("unable to add %s to filesystem: %d", file, ret);
+			error("unable to add %s to filesystem: %d",
+			      prepare_ctx[i].file, ret);
 			goto error;
 		}
 		if (bconf.verbose >= 2) {
@@ -1838,6 +1824,10 @@  out:
 	}
 
 	btrfs_close_all_devices();
+	if (prepare_ctx) {
+		for (i = 0; i < device_count; i++)
+			close(prepare_ctx[i].fd);
+	}
 	free(t_prepare);
 	free(prepare_ctx);
 	free(label);
@@ -1846,9 +1836,10 @@  out:
 	return !!ret;
 
 error:
-	if (fd > 0)
-		close(fd);
-
+	if (prepare_ctx) {
+		for (i = 0; i < device_count; i++)
+			close(prepare_ctx[i].fd);
+	}
 	free(t_prepare);
 	free(prepare_ctx);
 	free(label);