diff mbox series

[2/2] reftable/stack: fix race in up-to-date check

Message ID 713e51a25c1c4cfa830db97f71cd7c39e85864d4.1705585037.git.ps@pks.im (mailing list archive)
State Accepted
Commit 4f36b8597c8f1c20e465ade4159896fb592e34c0
Series reftable/stack: fix race in up-to-date check

Commit Message

Patrick Steinhardt Jan. 18, 2024, 1:41 p.m. UTC
In 6fdfaf15a0 (reftable/stack: use stat info to avoid re-reading stack
list, 2024-01-11) we have introduced a new mechanism to avoid re-reading
the table list in case stat(3P) figures out that the stack didn't change
since the last time we read it.

While this change significantly improved performance when writing many
refs, it can unfortunately lead to false negatives in very specific
scenarios. Given two processes A and B, there is a feasible sequence of
events that causes us to accidentally treat the table list as up-to-date
even though it changed:

  1. A reads the reftable stack and caches its stat info.

  2. B updates the stack, appending a new table to "tables.list". This
     will both use a new inode and result in a different file size, thus
     invalidating A's cache in theory.

  3. B decides to auto-compact the stack and merges two tables. The file
     size now matches what A has cached again. Furthermore, the
     filesystem may decide to recycle the inode number of the file we
     have replaced in (2) because it is not in use anymore.

  4. A reloads the reftable stack. Neither the inode number nor the
     file size changed. If the timestamps did not change either then we
     think the cached copy of our stack is up-to-date.

In fact, the commit introduced three related issues:

  - Non-POSIX compliant systems may not report proper `st_dev` and
    `st_ino` values in stat(3P), which made us rely solely on the
    file's potentially coarse-grained mtime and ctime.

  - `stat_validity_check()` and friends may end up not comparing
    `st_dev` and `st_ino` depending on the "core.checkstat" config,
    again reducing the signal to the mtime and ctime.

  - `st_ino` can be recycled, rendering the check moot even on
    POSIX-compliant systems.

Given that POSIX defines that "The st_ino and st_dev fields taken
together uniquely identify the file within the system", these issues
caused the most important signal for establishing file identity to be
ignored or rendered useless in some cases.

Refactor the code to stop using `stat_validity_check()`. Instead, we
manually stat(3P) the file descriptors to make relevant information
available. On Windows and MSYS2 the result will have both `st_dev` and
`st_ino` set to 0, which allows us to address the first issue by not
using the stat-based cache in that case. It also allows us to make sure
that we always compare `st_dev` and `st_ino`, addressing the second
issue.
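
The two conditions described above can be sketched as follows. This is
only an illustration, not the code from the patch: the helper names
`stat_identity_usable()` and `same_file()` are made up here, and the
patch open-codes both conditions instead.

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: stat info is only a usable identity signal when
 * the platform filled in both fields. A result with both fields set to 0
 * (as described for Windows and MSYS2) is rejected.
 */
static int stat_identity_usable(const struct stat *st)
{
	assert(st);
	return st->st_dev != 0 && st->st_ino != 0;
}

/*
 * Hypothetical helper: two stat results refer to the same file only if
 * both the device and the inode number match. Size and timestamps are
 * deliberately not part of the comparison.
 */
static int same_file(const struct stat *a, const struct stat *b)
{
	assert(a && b);
	return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}
```

A caller would cache the result of fstat(3P) only when
`stat_identity_usable()` holds, and later compare it against a fresh
stat(3P) of the path with `same_file()`.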

The third issue of inode recycling can be addressed by keeping the file
descriptor of "tables.list" open during the lifetime of the reftable
stack. As the file will still exist on disk even though it has been
unlinked, it is impossible for its inode to be recycled as long as the
file descriptor is still open.
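
The effect of holding the descriptor open can be demonstrated with a
small standalone sketch. It assumes a POSIX filesystem; the helpers
`replace_file()` and `pinned_inode_is_not_recycled()` as well as the
file contents are invented for illustration and are not part of the
reftable code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `content` via rename(3P). */
static int replace_file(const char *path, const char *content)
{
	char tmp[4096];
	size_t len = strlen(content);
	int n, fd;

	n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, content, len) != (ssize_t)len) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) < 0 || rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}

/*
 * Returns 1 if the final file has a different identity than the one we
 * pinned with an open descriptor, 0 if the identity matches, and -1 on
 * error. The file at `path` is removed before returning.
 */
static int pinned_inode_is_not_recycled(const char *path)
{
	struct stat pinned, current;
	int fd, ret = -1;

	assert(path);
	/* "Process A": read the list and keep its descriptor open. */
	if (replace_file(path, "aaaa\n") < 0)
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		goto out;
	if (fstat(fd, &pinned) < 0)
		goto out;

	/* "Process B": append an entry, then compact to the old size. */
	if (replace_file(path, "aaaa\nbbbb\n") < 0 ||
	    replace_file(path, "cccc\n") < 0)
		goto out;

	if (stat(path, &current) < 0)
		goto out;
	ret = !(pinned.st_dev == current.st_dev &&
		pinned.st_ino == current.st_ino);
out:
	if (fd >= 0)
		close(fd);
	unlink(path);
	return ret;
}
```

The final file has the same size as the one A read, but because A's
descriptor still references the unlinked original, the filesystem
cannot hand out that inode number again. The identity comparison thus
reports a change.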

This should address the race in a POSIX-compliant way. The only real
downside is that this mechanism cannot be used on non-POSIX-compliant
systems like Windows. But we at least have the second-level caching
mechanism in place that compares contents of "tables.list" with the
currently loaded list of tables.
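
The second-level check amounts to comparing the names read from the
list file against the names of the currently loaded tables. A minimal
sketch, assuming both are NULL-terminated string arrays; the name
`table_names_equal()` is hypothetical and not necessarily what the
reftable library uses.

```c
#include <assert.h>
#include <string.h>

/*
 * Compare two NULL-terminated arrays of table names. Returns 1 when both
 * lists contain the same names in the same order, 0 otherwise.
 */
static int table_names_equal(const char *const *a, const char *const *b)
{
	size_t i;

	assert(a && b);
	for (i = 0; a[i] && b[i]; i++)
		if (strcmp(a[i], b[i]))
			return 0;
	/* Equal only if both lists ended at the same index. */
	return a[i] == b[i];
}
```

This requires reading the whole list file, which is why it serves as
the fallback rather than as the fast path.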

This new mechanism performs roughly the same as the previous one that
relied on `stat_validity_check()`:

  Benchmark 1: update-ref: create many refs (HEAD~)
    Time (mean ± σ):      4.754 s ±  0.026 s    [User: 2.204 s, System: 2.549 s]
    Range (min … max):    4.694 s …  4.802 s    20 runs

  Benchmark 2: update-ref: create many refs (HEAD)
    Time (mean ± σ):      4.721 s ±  0.020 s    [User: 2.194 s, System: 2.527 s]
    Range (min … max):    4.691 s …  4.753 s    20 runs

  Summary
    update-ref: create many refs (HEAD~) ran
      1.01 ± 0.01 times faster than update-ref: create many refs (HEAD)

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c  | 99 +++++++++++++++++++++++++++++++++++++++++++----
 reftable/stack.h  |  4 +-
 reftable/system.h |  1 -
 3 files changed, 95 insertions(+), 9 deletions(-)

Comments

Junio C Hamano Jan. 18, 2024, 8:07 p.m. UTC | #1
Patrick Steinhardt <ps@pks.im> writes:

> This should address the race in a POSIX-compliant way. The only real
> downside is that this mechanism cannot be used on non-POSIX-compliant
> systems like Windows. But we at least have the second-level caching
> mechanism in place that compares contents of "tables.list" with the
> currently loaded list of tables.

OK.


> +	/*
> +	 * Cache stat information in case it provides a useful signal to us.
> +	 * According to POSIX, "The st_ino and st_dev fields taken together
> +	 * uniquely identify the file within the system." That being said,
> +	 * Windows is not POSIX compliant and we do not have these fields
> +	 * available. So the information we have there is insufficient to
> +	 * determine whether two file descriptors point to the same file.
> +	 *
> +	 * While we could fall back to using other signals like the file's
> +	 * mtime, those are not sufficient to avoid races. We thus refrain from
> +	 * using the stat cache on such systems and fall back to the secondary
> +	 * caching mechanism, which is to check whether contents of the file
> +	 * have changed.

OK.

> +	 *
> +	 * On other systems which are POSIX compliant we must keep the file
> +	 * descriptor open. This is to avoid a race condition where two
> +	 * processes access the reftable stack at the same point in time:
> +	 *
> +	 *   1. A reads the reftable stack and caches its stat info.
> +	 *
> +	 *   2. B updates the stack, appending a new table to "tables.list".
> +	 *      This will both use a new inode and result in a different file
> +	 *      size, thus invalidating A's cache in theory.
> +	 *
> +	 *   3. B decides to auto-compact the stack and merges two tables. The
> +	 *      file size now matches what A has cached again. Furthermore, the
> +	 *      filesystem may decide to recycle the inode number of the file
> +	 *      we have replaced in (2) because it is not in use anymore.
> +	 *
> +	 *   4. A reloads the reftable stack. Neither the inode number nor the
> +	 *      file size changed. If the timestamps did not change either then
> +	 *      we think the cached copy of our stack is up-to-date.
> +	 *
> +	 * By keeping the file descriptor open the inode number cannot be
> +	 * recycled, mitigating the race.
> +	 */

This is nasty.  Well diagnosed and fixed.

Will queue.

Thanks.

Jeff King Jan. 20, 2024, 1:05 a.m. UTC | #2
On Thu, Jan 18, 2024 at 02:41:56PM +0100, Patrick Steinhardt wrote:

> Refactor the code to stop using `stat_validity_check()`. Instead, we
> manually stat(3P) the file descriptors to make relevant information
> available. On Windows and MSYS2 the result will have both `st_dev` and
> `st_ino` set to 0, which allows us to address the first issue by not
> using the stat-based cache in that case. It also allows us to make sure
> that we always compare `st_dev` and `st_ino`, addressing the second
> issue.

I didn't think too hard about the details, but does this mean that
every user of stat_validity_check() has the same issue? The other big
one is packed-refs (for which the code was originally written). Should
this fix go into that API?

-Peff

Patrick Steinhardt Jan. 22, 2024, 10:32 a.m. UTC | #3
On Fri, Jan 19, 2024 at 08:05:59PM -0500, Jeff King wrote:
> On Thu, Jan 18, 2024 at 02:41:56PM +0100, Patrick Steinhardt wrote:
> 
> > Refactor the code to stop using `stat_validity_check()`. Instead, we
> > manually stat(3P) the file descriptors to make relevant information
> > available. On Windows and MSYS2 the result will have both `st_dev` and
> > `st_ino` set to 0, which allows us to address the first issue by not
> > using the stat-based cache in that case. It also allows us to make sure
> > that we always compare `st_dev` and `st_ino`, addressing the second
> > issue.
> 
> I didn't think too hard about the details, but does this mean that
> every user of stat_validity_check() has the same issue? The other big
> one is packed-refs (for which the code was originally written). Should
> this fix go into that API?

In theory, the issue is the same for the `packed-refs` file. But in
practice it's much less likely to be an issue:

  - The file gets rewritten a lot less frequently than the "tables.list"
    file, making the race less likely in the first place. It can only
    happen when git-pack-refs(1) races with concurrent readers, whereas
    it can happen for any two concurrent processes with reftables.

  - Due to entries in the `packed-refs` being of variable length it's
    less likely that the size will be exactly the same after the file
    has been rewritten. For reftables we have constant-length entries in
    the "tables.list", so it's likely to happen there.

  - It is very unlikely that we have the same issue with inode reuse
    with the packed-refs file. The only reason why we have it with the
    reftable backend is that it is very likely that we end up writing to
    "tables.list" twice, once for the normal update and once for auto
    compaction.

So overall, I doubt that it's all that critical in practice for the
packed-refs backend. It _can_ happen, but chances are significantly
lower. I cannot recall a single report of this issue, which underscores
how unlikely it seems to be for the files backend.

Also, applying the same fix for the packed-refs would essentially mean
that the caching mechanism is now ineffective on Windows systems where
we do not have proper `st_dev` and `st_ino` values available. I think
this is a no-go in the context of packed-refs because we don't have a
second-level caching mechanism like we do in the reftable backend. It's
not great that we have to reread the "tables.list" file on Windows
systems for now, but at least it's better than having to reread the
complete "packed-refs" file like we'd have to do with the files backend.

Patrick

Jeff King Jan. 23, 2024, 12:32 a.m. UTC | #4
On Mon, Jan 22, 2024 at 11:32:14AM +0100, Patrick Steinhardt wrote:

> > I didn't think too hard about the details, but does this mean that
> > every user of stat_validity_check() has the same issue? The other big
> > one is packed-refs (for which the code was originally written). Should
> > this fix go into that API?
> 
> In theory, the issue is the same for the `packed-refs` file. But in
> practice it's much less likely to be an issue:

Thanks for laying this all out. It does concern me a little that there's
still a possible race, because they can be so hard to catch and debug in
practice. But I think you make a compelling argument that it's probably
not happening a lot in practice, and especially...

> Also, applying the same fix for the packed-refs would essentially mean
> that the caching mechanism is now ineffective on Windows systems where
> we do not have proper `st_dev` and `st_ino` values available. I think
> this is a no-go in the context of packed-refs because we don't have a
> second-level caching mechanism like we do in the reftable backend. It's
> not great that we have to reread the "tables.list" file on Windows
> systems for now, but at least it's better than having to reread the
> complete "packed-refs" file like we'd have to do with the files backend.

...here that the performance profile is so different. If the "fix"
means re-reading the whole packed-refs file constantly, that's going
to be quite noticeable.

Given that this race has been here forever-ish, I agree with you that we
should leave it be.

-Peff

Patch

diff --git a/reftable/stack.c b/reftable/stack.c
index 705cfb6caa..77a387a86c 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -66,6 +66,7 @@  int reftable_new_stack(struct reftable_stack **dest, const char *dir,
 	strbuf_addstr(&list_file_name, "/tables.list");
 
 	p->list_file = strbuf_detach(&list_file_name, NULL);
+	p->list_fd = -1;
 	p->reftable_dir = xstrdup(dir);
 	p->config = config;
 
@@ -175,7 +176,12 @@  void reftable_stack_destroy(struct reftable_stack *st)
 		st->readers_len = 0;
 		FREE_AND_NULL(st->readers);
 	}
-	stat_validity_clear(&st->list_validity);
+
+	if (st->list_fd >= 0) {
+		close(st->list_fd);
+		st->list_fd = -1;
+	}
+
 	FREE_AND_NULL(st->list_file);
 	FREE_AND_NULL(st->reftable_dir);
 	reftable_free(st);
@@ -375,11 +381,59 @@  static int reftable_stack_reload_maybe_reuse(struct reftable_stack *st,
 		sleep_millisec(delay);
 	}
 
-	stat_validity_update(&st->list_validity, fd);
-
 out:
-	if (err)
-		stat_validity_clear(&st->list_validity);
+	/*
+	 * Invalidate the stat cache. It is sufficient to only close the file
+	 * descriptor and keep the cached stat info because we never use the
+	 * latter when the former is negative.
+	 */
+	if (st->list_fd >= 0) {
+		close(st->list_fd);
+		st->list_fd = -1;
+	}
+
+	/*
+	 * Cache stat information in case it provides a useful signal to us.
+	 * According to POSIX, "The st_ino and st_dev fields taken together
+	 * uniquely identify the file within the system." That being said,
+	 * Windows is not POSIX compliant and we do not have these fields
+	 * available. So the information we have there is insufficient to
+	 * determine whether two file descriptors point to the same file.
+	 *
+	 * While we could fall back to using other signals like the file's
+	 * mtime, those are not sufficient to avoid races. We thus refrain from
+	 * using the stat cache on such systems and fall back to the secondary
+	 * caching mechanism, which is to check whether contents of the file
+	 * have changed.
+	 *
+	 * On other systems which are POSIX compliant we must keep the file
+	 * descriptor open. This is to avoid a race condition where two
+	 * processes access the reftable stack at the same point in time:
+	 *
+	 *   1. A reads the reftable stack and caches its stat info.
+	 *
+	 *   2. B updates the stack, appending a new table to "tables.list".
+	 *      This will both use a new inode and result in a different file
+	 *      size, thus invalidating A's cache in theory.
+	 *
+	 *   3. B decides to auto-compact the stack and merges two tables. The
+	 *      file size now matches what A has cached again. Furthermore, the
+	 *      filesystem may decide to recycle the inode number of the file
+	 *      we have replaced in (2) because it is not in use anymore.
+	 *
+	 *   4. A reloads the reftable stack. Neither the inode number nor the
+	 *      file size changed. If the timestamps did not change either then
+	 *      we think the cached copy of our stack is up-to-date.
+	 *
+	 * By keeping the file descriptor open the inode number cannot be
+	 * recycled, mitigating the race.
+	 */
+	if (!err && fd >= 0 && !fstat(fd, &st->list_st) &&
+	    st->list_st.st_dev && st->list_st.st_ino) {
+		st->list_fd = fd;
+		fd = -1;
+	}
+
 	if (fd >= 0)
 		close(fd);
 	free_names(names);
@@ -396,8 +450,39 @@  static int stack_uptodate(struct reftable_stack *st)
 	int err;
 	int i = 0;
 
-	if (stat_validity_check(&st->list_validity, st->list_file))
-		return 0;
+	/*
+	 * When we have cached stat information available then we use it to
+	 * verify whether the file has been rewritten.
+	 *
+	 * Note that we explicitly do not want to use `stat_validity_check()`
+	 * and friends here because they may end up not comparing the `st_dev`
+	 * and `st_ino` fields. These functions thus cannot guarantee that we
+	 * indeed still have the same file.
+	 */
+	if (st->list_fd >= 0) {
+		struct stat list_st;
+
+		if (stat(st->list_file, &list_st) < 0) {
+			/*
+			 * It's fine for "tables.list" to not exist. In that
+			 * case, we have to refresh when the loaded stack has
+			 * any readers.
+			 */
+			if (errno == ENOENT)
+				return !!st->readers_len;
+			return REFTABLE_IO_ERROR;
+		}
+
+		/*
+		 * When "tables.list" refers to the same file we can assume
+		 * that it didn't change. This is because we always use
+		 * rename(3P) to update the file and never write to it
+		 * directly.
+		 */
+		if (st->list_st.st_dev == list_st.st_dev &&
+		    st->list_st.st_ino == list_st.st_ino)
+			return 0;
+	}
 
 	err = read_lines(st->list_file, &names);
 	if (err < 0)
diff --git a/reftable/stack.h b/reftable/stack.h
index 3f80cc598a..c1e3efa899 100644
--- a/reftable/stack.h
+++ b/reftable/stack.h
@@ -14,8 +14,10 @@  license that can be found in the LICENSE file or at
 #include "reftable-stack.h"
 
 struct reftable_stack {
-	struct stat_validity list_validity;
+	struct stat list_st;
 	char *list_file;
+	int list_fd;
+
 	char *reftable_dir;
 	int disable_auto_compact;
 
diff --git a/reftable/system.h b/reftable/system.h
index 2cc7adf271..6b74a81514 100644
--- a/reftable/system.h
+++ b/reftable/system.h
@@ -12,7 +12,6 @@  license that can be found in the LICENSE file or at
 /* This header glues the reftable library to the rest of Git */
 
 #include "git-compat-util.h"
-#include "statinfo.h"
 #include "strbuf.h"
 #include "hash-ll.h" /* hash ID, sizes.*/
 #include "dir.h" /* remove_dir_recursively, for tests.*/