
[7/8] reftable/stack: fix corruption on concurrent compaction

Message ID b746565bf0ff80fff60821bdeb79867ad646c142.1722435214.git.ps@pks.im (mailing list archive)
State Superseded
Series reftable: improvements and fixes for compaction

Commit Message

Patrick Steinhardt July 31, 2024, 2:15 p.m. UTC
The locking employed by compaction uses the following schema:

  1. Lock "tables.list" and verify that it matches the version we have
     loaded in core.

  2. Lock each of the tables in the user-supplied range of tables that
     we are supposed to compact. These locks prohibit any concurrent
     process from compacting those tables while we are doing so.

  3. Unlock "tables.list". This enables concurrent processes to add new
     tables to the stack, but also allows them to compact tables outside
     of the range of tables that we have locked.

  4. Perform the compaction.

  5. Lock "tables.list" again.

  6. Move the compacted table into place.

  7. Write the new order of tables, including the compacted table, into
     the lockfile.

  8. Commit the lockfile into place.
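
Expressed as a condensed code sketch (signatures abbreviated and error
handling omitted; table_lock_path() is a hypothetical helper named only
for illustration, and this is not a literal excerpt of
stack_compact_range()):

    hold_lock_file_for_update(&list_lock, st->list_file, 0);  /* (1) */
    if (stack_uptodate(st) != 0)
        goto abort;                     /* stack changed under us */
    for (i = first; i <= last; i++)                           /* (2) */
        hold_lock_file_for_update(&table_locks[i - first],
                                  table_lock_path(st, i), 0);
    rollback_lock_file(&list_lock);                           /* (3) */
    stack_compact_locked(st, first, last, &new_table);        /* (4) */
    hold_lock_file_for_update(&list_lock, st->list_file, 0);  /* (5) */
    rename_tempfile(&new_table, new_table_path);              /* (6) */
    write_in_full(get_lock_file_fd(&list_lock),               /* (7) */
                  tables_list_buf.buf, tables_list_buf.len);
    commit_lock_file(&list_lock);                             /* (8) */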

Letting concurrent processes modify the "tables.list" file while we are
doing the compaction is very much part of the design and thus expected.
After all, it may take some time to compact tables in the case where we
are compacting a lot of tables or very large ones.

But there is a bug in the code. Suppose we have two processes which are
compacting two slices of the stack. Given that we lock each of the
tables before compacting them, we know that the slices must be disjoint
from each other. But regardless of that, the compaction performed by one
process will always impact what the other process needs to write to the
"tables.list" file.

Right now, we do not check whether the "tables.list" file has been
changed after we have locked it for the second time in (5). This has the
consequence that we will always commit the old, cached in-core tables to
disk without respecting what the other process has written. This
scenario would then lead to data loss and corruption.

This can even happen in the simpler case of one compacting process and
one writing process. The table newly appended by the writing process
would get discarded by the compacting process because the latter never
sees the new table.
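
A concrete sequence with hypothetical table names illustrates the loss:

    in-core stack and on-disk "tables.list":  A B C
    compactor:  locks A, B, C; unlocks "tables.list"; compacts into X
    writer:     appends D; on-disk "tables.list" is now A B C D
    compactor:  relocks "tables.list" and writes it from its stale
                in-core state as just "X", so the writer's D is lost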

Fix this bug by re-checking whether our stack is still up to date after
locking for the second time. If it isn't, then we adjust the indices of
the tables to replace so that they refer to the updated stack.
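
As a worked example with hypothetical names: say the in-core stack is
A B C D E and we compact the range first=2..last=3 (tables C and D).
While we hold those table locks, a concurrent process compacts A and B
into a new table AB, so the updated "tables.list" reads AB C D E. Our
compacted tables are now found at new_offset=1, which gives
first_to_replace = 1 and last_to_replace = last + (new_offset - first)
= 3 + (1 - 2) = 2.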

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c | 101 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 96 insertions(+), 5 deletions(-)

Comments

Justin Tobler Aug. 1, 2024, 1:04 a.m. UTC | #1
On 24/07/31 04:15PM, Patrick Steinhardt wrote:
> The locking employed by compaction uses the following schema:
> 
>   1. Lock "tables.list" and verify that it matches the version we have
>      loaded in core.
> 
>   2. Lock each of the tables in the user-supplied range of tables that
>      we are supposed to compact. These locks prohibit any concurrent
>      process to compact those tables while we are doing that.
> 
>   3. Unlock "tables.list". This enables concurrent processes to add new
>      tables to the stack, but also allows them to compact tables outside
>      of the range of tables that we have locked.
> 
>   4. Perform the compaction.
> 
>   5. Lock "tables.list" again.
> 
>   6. Move the compacted table into place.
> 
>   7. Write the new order of tables, including the compacted table, into
>      the lockfile.
> 
>   8. Commit the lockfile into place.
> 
> Letting concurrent processes modify the "tables.list" file while we are
> doing the compaction is very much part of the design and thus expected.
> After all, it may take some time to compact tables in the case where we
> are compacting a lot or very large tables.

s/or/of/

> But there is a bug in the code. Suppose we have two processes which are
> compacting two slices of the table. Given that we lock each of the
> tables before compacting them, we know that the slices must be disjunct
> from each other. But regardless of that, compaction performed by one
> process will always impact what the other process needs to write to the
> "tables.list" file.

I'm not quite sure I understand at this point how it is possible for two
compaction operations to be performed concurrently. Wouldn't there
always be overlap between the two compaction segments, thus causing one
of the operations to be unable to acquire all of the required locks and
abort?

> Right now , we do not check whether the "tables.list" has been

s/now ,/now,/

> changed after we have locked it for the second time in (5). This has the
> consequence that we will always commit the old, cached in-core tables to
> disk without paying to respect what the other process has written. This
> scenario would then lead to data loss and corruption.

If a concurrent compaction happens though, it would mess up the indices
and cause problems when writing the "tables.list" file. That would not
be good.

> This can even happen in the simpler case of one compacting process and
> one writing process. The newly-appended table by the writing process
> would get discarded by the compacting process because it never sees the
> new table.

This is indeed a problem. Since we don't reload the stack, we are
unaware of any concurrently appended tables, causing them to not be
written to the new "tables.list" file. Scary.

> Fix this bug by re-checking whether our stack is still up to date after
> locking for the second time. If it isn't, then we adjust the indices of
> tables to replace in the updated stack.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/stack.c | 101 ++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 96 insertions(+), 5 deletions(-)
> 
> diff --git a/reftable/stack.c b/reftable/stack.c
> index 9cc91a262c..2b1ac58120 100644
> --- a/reftable/stack.c
> +++ b/reftable/stack.c
> @@ -1021,7 +1021,9 @@ static int stack_compact_range(struct reftable_stack *st,
>  	struct lock_file *table_locks = NULL;
>  	struct tempfile *new_table = NULL;
>  	int is_empty_table = 0, err = 0;
> +	size_t first_to_replace, last_to_replace;
>  	size_t i, nlocks = 0;
> +	char **names = NULL;
>  
>  	if (first > last || (!expiry && first == last)) {
>  		err = 0;
> @@ -1124,6 +1126,94 @@ static int stack_compact_range(struct reftable_stack *st,
>  		}
>  	}
>  
> +	/*
> +	 * As we have unlocked the stack while compacting our slice of tables
> +	 * it may have happened that a concurrently running process has updated
> +	 * the stack while we were compacting. In that case, we need to check
> +	 * whether the tables that we have just compacted still exist in the
> +	 * stack in the exact same order as we have compacted them.
> +	 *
> +	 * If they do exist, then it is fine to continue and replace those
> +	 * tables with our compacted version. If they don't, then we need to
> +	 * abort.
> +	 */
> +	err = stack_uptodate(st);
> +	if (err < 0)
> +		goto done;
> +	if (err > 0) {
> +		ssize_t new_offset = -1;
> +		int fd;
> +
> +		fd = open(st->list_file, O_RDONLY);
> +		if (fd < 0) {
> +			err = REFTABLE_IO_ERROR;
> +			goto done;
> +		}
> +
> +		err = fd_read_lines(fd, &names);

Reading `names` here will include all tables that were appended
concurrently, which we need in order to accurately rewrite the new
"tables.list". Makes sense.

> +		close(fd);
> +		if (err < 0)
> +			goto done;
> +
> +		/*
> +		 * Search for the offset of the first table that we have
> +		 * compacted in the updated "tables.list" file.
> +		 */
> +		for (size_t i = 0; names[i]; i++) {
> +			if (strcmp(names[i], st->readers[first]->name))
> +				continue;
> +
> +			/*
> +			 * We have found the first entry. Verify that all the
> +			 * subsequent tables we have compacted still exist in
> +			 * the modified stack in the exact same order as we
> +			 * have compacted them.
> +			 */
> +			for (size_t j = 1; j < last - first + 1; j++) {
> +				const char *old = first + j < st->merged->stack_len ?
> +					st->readers[first + j]->name : NULL;
> +				const char *new = names[i + j];
> +
> +				/*
> +				 * If some entries are missing or in case the tables
> +				 * have changed then we need to bail out. Again, this
> +				 * shouldn't ever happen because we have locked the
> +				 * tables we are compacting.
> +				 */
> +				if (!old || !new || strcmp(old, new)) {
> +					err = REFTABLE_OUTDATED_ERROR;
> +					goto done;
> +				}
> +			}
> +
> +			new_offset = i;
> +			break;
> +		}
> +
> +		/*
> +		 * In case we didn't find our compacted tables in the stack we
> +		 * need to bail out. In theory, this should have never happened
> +		 * because we locked the tables we are compacting.
> +		 */
> +		if (new_offset < 0) {
> +			err = REFTABLE_OUTDATED_ERROR;
> +			goto done;
> +		}
> +
> +		/*
> +		 * We have found the new range that we want to replace, so
> +		 * let's update the range of tables that we want to replace.
> +		 */
> +		last_to_replace = last + (new_offset - first);
> +		first_to_replace = new_offset;
> +	} else {
> +		REFTABLE_CALLOC_ARRAY(names, st->merged->stack_len + 1);

I was confused at first by the `stack_len` + 1. The extra element is
NULL, which tells us there are no more tables to add to the list,
correct? It looks like `fd_read_lines()` also adds an extra element.

> +		for (size_t i = 0; i < st->merged->stack_len; i++)
> +			names[i] = xstrdup(st->readers[i]->name);
> +		last_to_replace = last;
> +		first_to_replace = first;
> +	}
> +
>  	/*
>  	 * If the resulting compacted table is not empty, then we need to move
>  	 * it into place now.
> @@ -1146,12 +1236,12 @@ static int stack_compact_range(struct reftable_stack *st,
>  	 * have just written. In case the compacted table became empty we
>  	 * simply skip writing it.
>  	 */
> -	for (i = 0; i < first; i++)
> -		strbuf_addf(&tables_list_buf, "%s\n", st->readers[i]->name);
> +	for (i = 0; i < first_to_replace; i++)
> +		strbuf_addf(&tables_list_buf, "%s\n", names[i]);
>  	if (!is_empty_table)
>  		strbuf_addf(&tables_list_buf, "%s\n", new_table_name.buf);
> -	for (i = last + 1; i < st->merged->stack_len; i++)
> -		strbuf_addf(&tables_list_buf, "%s\n", st->readers[i]->name);
> +	for (i = last_to_replace + 1; names[i]; i++)
> +		strbuf_addf(&tables_list_buf, "%s\n", names[i]);

The content of `names` is now up to date, along with the starting and
ending indices for the segment being compacted. This allows us to omit
the correct segment of tables.
>  
>  	err = write_in_full(get_lock_file_fd(&tables_list_lock),
>  			    tables_list_buf.buf, tables_list_buf.len);
> @@ -1204,9 +1294,10 @@ static int stack_compact_range(struct reftable_stack *st,
>  	delete_tempfile(&new_table);
>  	strbuf_release(&new_table_name);
>  	strbuf_release(&new_table_path);
> -
>  	strbuf_release(&tables_list_buf);
>  	strbuf_release(&table_name);
> +	free_names(names);
> +
>  	return err;
>  }
>  
> -- 
> 2.46.0.dirty
>
Patrick Steinhardt Aug. 1, 2024, 8:41 a.m. UTC | #2
On Wed, Jul 31, 2024 at 08:04:17PM -0500, Justin Tobler wrote:
> On 24/07/31 04:15PM, Patrick Steinhardt wrote:
> > But there is a bug in the code. Suppose we have two processes which are
> > compacting two slices of the table. Given that we lock each of the
> > tables before compacting them, we know that the slices must be disjunct
> > from each other. But regardless of that, compaction performed by one
> > process will always impact what the other process needs to write to the
> > "tables.list" file.
> 
> I'm not quite sure I understand at this point how it is possible for two
> compaction operations to be performed concurrently. Wouldn't there
> always be overlap between the two compaction segments thus causing one
> of the operations to be unable to acquire all of the required locks and
> abort?

In practice we cannot assume anything about how another process compacts
tables. While we can assume something about how a particular version of
Git compacts tables, we cannot assume anything about future versions of
Git or about alternate implementations of Git. The reftable backend
allows for compacting only a subset of tables, and the heuristic is not
mandated by the on-disk format except that the tables that we are about
to compact need to be next to each other in the stack.

Furthermore, with the next patch, we also handle it gracefully when some
parts of the stack are locked already. Thus, it can easily happen that
process A compacts tables 1 to 3, whereas process B will try to compact
tables 1 to 5, fail to acquire the lock for table 3, and then reduce the
range it compacts to tables 4 to 5.
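
As a hypothetical timeline:

    stack:      1   2   3   4   5
    process A:  locks tables 1-3, compacts them
    process B:  wants 1-5, finds table 3 locked, locks and compacts 4-5

The two ranges are disjoint, so both compactions proceed concurrently,
and both will eventually want to rewrite "tables.list".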

> > changed after we have locked it for the second time in (5). This has the
> > consequence that we will always commit the old, cached in-core tables to
> > disk without paying to respect what the other process has written. This
> > scenario would then lead to data loss and corruption.
> 
> If a concurrent compaction happens though, it would mess up the indices
> and cause problems when writting the "tables.list" file. That would not
> be good.

Yup.

> > This can even happen in the simpler case of one compacting process and
> > one writing process. The newly-appended table by the writing process
> > would get discarded by the compacting process because it never sees the
> > new table.
> 
> This is indeed a problem. Since we don't reload the stack, we are
> unaware of any concurrently append tables causing them to not be
> written in the new "tables.list" file. Scary

Indeed.

> > +		/*
> > +		 * We have found the new range that we want to replace, so
> > +		 * let's update the range of tables that we want to replace.
> > +		 */
> > +		last_to_replace = last + (new_offset - first);
> > +		first_to_replace = new_offset;
> > +	} else {
> > +		REFTABLE_CALLOC_ARRAY(names, st->merged->stack_len + 1);
> 
> I was confused at first by the `stack_len` + 1. The extra element is
> NULL which tells us there are no more tables to add to the list,
> correct? It looks like `fd_read_lines()` also adds an extra element.

Yes, that's the reason why we have it. We end up passing `names` to
`free_names()`, which uses `NULL` as a sentinel value to know when to
stop iterating over the array's entries.
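
A minimal sketch of that convention (assumed shape, not a literal
excerpt of the reftable code):

    #include <stdlib.h>

    /* "a" is a NULL-terminated array of heap-allocated strings. */
    void free_names(char **a)
    {
        char **p;
        if (!a)
            return;
        for (p = a; *p; p++)
            free(*p);   /* free each name... */
        free(a);        /* ...then the array itself */
    }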

I'll add a comment.

Thanks for your review. I'll wait a bit longer for some more feedback
before sending out another version of this patch series.

Patrick

Patch

diff --git a/reftable/stack.c b/reftable/stack.c
index 9cc91a262c..2b1ac58120 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -1021,7 +1021,9 @@  static int stack_compact_range(struct reftable_stack *st,
 	struct lock_file *table_locks = NULL;
 	struct tempfile *new_table = NULL;
 	int is_empty_table = 0, err = 0;
+	size_t first_to_replace, last_to_replace;
 	size_t i, nlocks = 0;
+	char **names = NULL;
 
 	if (first > last || (!expiry && first == last)) {
 		err = 0;
@@ -1124,6 +1126,94 @@  static int stack_compact_range(struct reftable_stack *st,
 		}
 	}
 
+	/*
+	 * As we have unlocked the stack while compacting our slice of tables
+	 * it may have happened that a concurrently running process has updated
+	 * the stack while we were compacting. In that case, we need to check
+	 * whether the tables that we have just compacted still exist in the
+	 * stack in the exact same order as we have compacted them.
+	 *
+	 * If they do exist, then it is fine to continue and replace those
+	 * tables with our compacted version. If they don't, then we need to
+	 * abort.
+	 */
+	err = stack_uptodate(st);
+	if (err < 0)
+		goto done;
+	if (err > 0) {
+		ssize_t new_offset = -1;
+		int fd;
+
+		fd = open(st->list_file, O_RDONLY);
+		if (fd < 0) {
+			err = REFTABLE_IO_ERROR;
+			goto done;
+		}
+
+		err = fd_read_lines(fd, &names);
+		close(fd);
+		if (err < 0)
+			goto done;
+
+		/*
+		 * Search for the offset of the first table that we have
+		 * compacted in the updated "tables.list" file.
+		 */
+		for (size_t i = 0; names[i]; i++) {
+			if (strcmp(names[i], st->readers[first]->name))
+				continue;
+
+			/*
+			 * We have found the first entry. Verify that all the
+			 * subsequent tables we have compacted still exist in
+			 * the modified stack in the exact same order as we
+			 * have compacted them.
+			 */
+			for (size_t j = 1; j < last - first + 1; j++) {
+				const char *old = first + j < st->merged->stack_len ?
+					st->readers[first + j]->name : NULL;
+				const char *new = names[i + j];
+
+				/*
+				 * If some entries are missing or in case the tables
+				 * have changed then we need to bail out. Again, this
+				 * shouldn't ever happen because we have locked the
+				 * tables we are compacting.
+				 */
+				if (!old || !new || strcmp(old, new)) {
+					err = REFTABLE_OUTDATED_ERROR;
+					goto done;
+				}
+			}
+
+			new_offset = i;
+			break;
+		}
+
+		/*
+		 * In case we didn't find our compacted tables in the stack we
+		 * need to bail out. In theory, this should have never happened
+		 * because we locked the tables we are compacting.
+		 */
+		if (new_offset < 0) {
+			err = REFTABLE_OUTDATED_ERROR;
+			goto done;
+		}
+
+		/*
+		 * We have found the new range that we want to replace, so
+		 * let's update the range of tables that we want to replace.
+		 */
+		last_to_replace = last + (new_offset - first);
+		first_to_replace = new_offset;
+	} else {
+		REFTABLE_CALLOC_ARRAY(names, st->merged->stack_len + 1);
+		for (size_t i = 0; i < st->merged->stack_len; i++)
+			names[i] = xstrdup(st->readers[i]->name);
+		last_to_replace = last;
+		first_to_replace = first;
+	}
+
 	/*
 	 * If the resulting compacted table is not empty, then we need to move
 	 * it into place now.
@@ -1146,12 +1236,12 @@  static int stack_compact_range(struct reftable_stack *st,
 	 * have just written. In case the compacted table became empty we
 	 * simply skip writing it.
 	 */
-	for (i = 0; i < first; i++)
-		strbuf_addf(&tables_list_buf, "%s\n", st->readers[i]->name);
+	for (i = 0; i < first_to_replace; i++)
+		strbuf_addf(&tables_list_buf, "%s\n", names[i]);
 	if (!is_empty_table)
 		strbuf_addf(&tables_list_buf, "%s\n", new_table_name.buf);
-	for (i = last + 1; i < st->merged->stack_len; i++)
-		strbuf_addf(&tables_list_buf, "%s\n", st->readers[i]->name);
+	for (i = last_to_replace + 1; names[i]; i++)
+		strbuf_addf(&tables_list_buf, "%s\n", names[i]);
 
 	err = write_in_full(get_lock_file_fd(&tables_list_lock),
 			    tables_list_buf.buf, tables_list_buf.len);
@@ -1204,9 +1294,10 @@  static int stack_compact_range(struct reftable_stack *st,
 	delete_tempfile(&new_table);
 	strbuf_release(&new_table_name);
 	strbuf_release(&new_table_path);
-
 	strbuf_release(&tables_list_buf);
 	strbuf_release(&table_name);
+	free_names(names);
+
 	return err;
 }