mbox series

[v4,0/8] refs: add reflog support to `git refs migrate`

Message ID 20241216-320-git-refs-migrate-reflogs-v4-0-d7cd3f197453@gmail.com (mailing list archive)
Headers show
Series refs: add reflog support to `git refs migrate` | expand

Message

Karthik Nayak Dec. 16, 2024, 4:44 p.m. UTC
The `git refs migrate` command was introduced in
25a0023f28 (builtin/refs: new command to migrate ref storage formats,
2024-06-06) to support migrating from one reference backend to another.

One limitation of the feature was that it didn't support migrating
repositories which contained reflogs. This isn't a requirement on the
server side as repositories are stored as bare repositories (which do
not contain any reflogs). Clients however generally use reflogs and
until now couldn't use the `git refs migrate` command to migrate their
repositories to the new reftable format.

One of the issues for adding reflog support is that the ref transactions
don't support reflogs additions:
  1. While there is REF_LOG_ONLY flag, there is no function to utilize
  the flag and add reflogs.
  2. reference backends generally sort the updates by the refname. This
  wouldn't work for reflogs which need to ensure that they maintain the
  order of creation.
  3. In the files backend, reflog entries are added by obtaining locks
  on the refs themselves. This means each update in the transaction, will
  obtain a ref_lock. This paradigm fails to accompany the fact that there
  could be multiple reflog updates for a refname in a single transaction.
  4. The backends check for duplicate entries, which doesn't make sense
  in the context of adding multiple reflogs for a given refname.

We overcome these issue we make the following changes:
  - Update the ref_update structure to also include the committer
  information. Using this, we can add a new function which only adds
  reflog updates to the transaction.
  - Add an index field to the ref_update structure, this will help order
  updates in pre-defined order, this fixes #2.
  - While the ideal fix for #3 would be to actually introduce reflog
  locks, this wouldn't be possible without breaking backward
  compatibility. So we add a count field to the existing ref_lock. With
  this, multiple reflog updates can share a single ref_lock.

Overall, this series is a bit more involved, and I would appreciate it
if it receives a bit more scrutiny.

The series is based on top of e66fd72e97 (The fourteenth batch,
2024-12-06) with `kn/reftable-writer-log-write-verify` merged in.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
Changes in v4:
- Fix broken tests, due to two reasons in patch 8/8:
  - The `index` field in `reflog_migration_data` wasn't initialized to
    0. This specifically doesn't break the test, but causes undefined
    behavior. Fix this by using designated initializers.
  - The strbuf within `migration_data` wasn't initialized when the flow
    exited early, causing memory leaks. Fix this too by using designated
    initializers.
- Thanks to Junio for reporting and Patrick for helping shed some light
  on these broken tests.
- Link to v3: https://lore.kernel.org/r/20241215-320-git-refs-migrate-reflogs-v3-0-4127fe707b98@gmail.com

Changes in v3:
- patch 5: Use `xstrdup_or_null` unconditionally.
- patch 6: In `transaction_refname_valid()` use the transaction flags
  to identify reflogs. Update the documentation to also mention the
  purpose of the `index` field.
- patch 8: Instead of allocating an strbuf for each reflog entry, we
  store and re-use one in the migration callback data.
- patch 8: Don't use a global index increment for all reflogs entries,
  instead create and use one per reflog.
- patch 8: Avoid setting the first reflog index to `1`. This would default
  to `0` as the first index, which is okay, since the index is incremented
  for consequtive reflog entries.
- Small typo fixes.
- Thanks to Christian and Patrick for the review!
- Link to v2: https://lore.kernel.org/all/20241213-320-git-refs-migrate-reflogs-v2-0-f28312cdb6c0@gmail.com/

Changes in v2:
- Split patch 5 into two separate patches. This should make it easier to
  review and reduce cognitive load in a single patch.
- In reftable backend, instead of using `strmapint` to ensure we have
  new update_indexes for reflogs with the same refname, we now use the
  already available `update->index` field to increment the update_index.
- Cleanup the code and follow some of the better practices.
- Add some clarity to the commit messages.
- Link to v1: https://lore.kernel.org/r/20241209-320-git-refs-migrate-reflogs-v1-0-d4bc37ee860f@gmail.com

---
Karthik Nayak (8):
      refs: include committer info in `ref_update` struct
      refs: add `index` field to `struct ref_udpate`
      refs/files: add count field to ref_lock
      refs: extract out refname verification in transactions
      refs: add `committer_info` to `ref_transaction_add_update()`
      refs: introduce the `ref_transaction_update_reflog` function
      refs: allow multiple reflog entries for the same refname
      refs: add support for migrating reflogs

 Documentation/git-refs.txt |   2 -
 refs.c                     | 168 +++++++++++++++++++++++++++++++++------------
 refs.h                     |  14 ++++
 refs/files-backend.c       | 131 +++++++++++++++++++++++------------
 refs/refs-internal.h       |   9 +++
 refs/reftable-backend.c    |  53 +++++++++++---
 t/t1460-refs-migrate.sh    |  73 ++++++++++++++------
 7 files changed, 329 insertions(+), 121 deletions(-)
---

Range-diff versus v3:

1:  7989ca0679 = 1:  34fb6a475e refs: include committer info in `ref_update` struct
2:  12acd7b4bb = 2:  4badd2b8ec refs: add `index` field to `struct ref_udpate`
3:  0d13c0d09b = 3:  0d8673c2fe refs/files: add count field to ref_lock
4:  d4073cd9dc = 4:  dec4b3c6f6 refs: extract out refname verification in transactions
5:  5a3d242955 = 5:  b1cc1bb242 refs: add `committer_info` to `ref_transaction_add_update()`
6:  fedd93f113 = 6:  f669e87498 refs: introduce the `ref_transaction_update_reflog` function
7:  b4465ee0c5 = 7:  53c0d2b62b refs: allow multiple reflog entries for the same refname
8:  8760610904 ! 8:  1cf30a5c3a refs: add support for migrating reflogs
    @@ Commit message
         the reflogs from the old reference backend. This is to ensure that the
         order is maintained in the new backend.
     
    +    Helped-by: Patrick Steinhardt <ps@pks.im>
         Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
     
      ## Documentation/git-refs.txt ##
    @@ refs.c: static int migrate_one_ref(const char *refname, const char *referent UNU
     +static int migrate_one_reflog(const char *refname, void *cb_data)
     +{
     +	struct migration_data *migration_data = cb_data;
    -+	struct reflog_migration_data data;
    -+
    -+	data.refname = refname;
    -+	data.old_refs = migration_data->old_refs;
    -+	data.transaction = migration_data->transaction;
    -+	data.errbuf = migration_data->errbuf;
    -+	data.sb = &migration_data->sb;
    ++	struct reflog_migration_data data = {
    ++		.refname = refname,
    ++		.old_refs = migration_data->old_refs,
    ++		.transaction = migration_data->transaction,
    ++		.errbuf = migration_data->errbuf,
    ++		.sb = &migration_data->sb,
    ++	};
     +
     +	return refs_for_each_reflog_ent(migration_data->old_refs, refname,
     +					migrate_one_reflog_entry, &data);
    @@ refs.c: static int move_files(const char *from_path, const char *to_path, struct
      {
      	struct worktree **worktrees = get_worktrees();
     @@ refs.c: int repo_migrate_ref_storage_format(struct repository *repo,
    + 	struct ref_store *old_refs = NULL, *new_refs = NULL;
      	struct ref_transaction *transaction = NULL;
      	struct strbuf new_gitdir = STRBUF_INIT;
    - 	struct migration_data data;
    +-	struct migration_data data;
     -	size_t reflog_count = 0;
    ++	struct migration_data data = {
    ++		.sb = STRBUF_INIT,
    ++	};
      	int did_migrate_refs = 0;
      	int ret;
      
    @@ refs.c: int repo_migrate_ref_storage_format(struct repository *repo,
      	 */
      	strbuf_addf(&new_gitdir, "%s/%s", old_refs->gitdir, "ref_migration.XXXXXX");
      	if (!mkdtemp(new_gitdir.buf)) {
    -@@ refs.c: int repo_migrate_ref_storage_format(struct repository *repo,
    - 	data.old_refs = old_refs;
    - 	data.transaction = transaction;
    - 	data.errbuf = errbuf;
    -+	strbuf_init(&data.sb, 0);
    - 
    - 	/*
    - 	 * We need to use the internal `do_for_each_ref()` here so that we can
     @@ refs.c: int repo_migrate_ref_storage_format(struct repository *repo,
      	if (ret < 0)
      		goto done;


--- 

base-commit: 09245f4b75863f4e94dac7feebaafce53a26965f
change-id: 20241111-320-git-refs-migrate-reflogs-a53e3a6cffc9

Thanks
- Karthik

Comments

Patrick Steinhardt Dec. 17, 2024, 6:59 a.m. UTC | #1
On Mon, Dec 16, 2024 at 05:44:25PM +0100, Karthik Nayak wrote:
> Changes in v4:
> - Fix broken tests, due to two reasons in patch 8/8:
>   - The `index` field in `reflog_migration_data` wasn't initialized to
>     0. This specifically doesn't break the test, but causes undefined
>     behavior. Fix this by using designated initializers.
>   - The strbuf within `migration_data` wasn't initialized when the flow
>     exited early, causing memory leaks. Fix this too by using designated
>     initializers.
> - Thanks to Junio for reporting and Patrick for helping shed some light
>   on these broken tests.
> - Link to v3: https://lore.kernel.org/r/20241215-320-git-refs-migrate-reflogs-v3-0-4127fe707b98@gmail.com

The range-diff looks as expected, so this version of the series looks
good to me. Thanks!

Patrick
Karthik Nayak Dec. 17, 2024, 9:35 a.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> On Mon, Dec 16, 2024 at 05:44:25PM +0100, Karthik Nayak wrote:
>> Changes in v4:
>> - Fix broken tests, due to two reasons in patch 8/8:
>>   - The `index` field in `reflog_migration_data` wasn't initialized to
>>     0. This specifically doesn't break the test, but causes undefined
>>     behavior. Fix this by using designated initializers.
>>   - The strbuf within `migration_data` wasn't initialized when the flow
>>     exited early, causing memory leaks. Fix this too by using designated
>>     initializers.
>> - Thanks to Junio for reporting and Patrick for helping shed some light
>>   on these broken tests.
>> - Link to v3: https://lore.kernel.org/r/20241215-320-git-refs-migrate-reflogs-v3-0-4127fe707b98@gmail.com
>
> The range-diff looks as expected, so this version of the series looks
> good to me. Thanks!
>
> Patrick

Thanks Patrick for your review!
Junio C Hamano Dec. 17, 2024, 9:28 p.m. UTC | #3
karthik nayak <karthik.188@gmail.com> writes:

> Patrick Steinhardt <ps@pks.im> writes:
>
>> The range-diff looks as expected, so this version of the series looks
>> good to me. Thanks!
>>
>> Patrick
>
> Thanks Patrick for your review!

Thanks, both.  Let's queue it again.
Toon Claes Dec. 19, 2024, 7:32 p.m. UTC | #4
Karthik Nayak <karthik.188@gmail.com> writes:

> The `git refs migrate` command was introduced in
> 25a0023f28 (builtin/refs: new command to migrate ref storage formats,
> 2024-06-06) to support migrating from one reference backend to another.
>
> One limitation of the feature was that it didn't support migrating
> repositories which contained reflogs. This isn't a requirement on the
> server side as repositories are stored as bare repositories (which do
> not contain any reflogs). Clients however generally use reflogs and
> until now couldn't use the `git refs migrate` command to migrate their
> repositories to the new reftable format.
>
> One of the issues for adding reflog support is that the ref transactions
> don't support reflogs additions:
>   1. While there is REF_LOG_ONLY flag, there is no function to utilize
>   the flag and add reflogs.
>   2. reference backends generally sort the updates by the refname. This
>   wouldn't work for reflogs which need to ensure that they maintain the
>   order of creation.
>   3. In the files backend, reflog entries are added by obtaining locks
>   on the refs themselves. This means each update in the transaction, will
>   obtain a ref_lock. This paradigm fails to accompany the fact that there
>   could be multiple reflog updates for a refname in a single transaction.
>   4. The backends check for duplicate entries, which doesn't make sense
>   in the context of adding multiple reflogs for a given refname.
>
> We overcome these issue we make the following changes:
>   - Update the ref_update structure to also include the committer
>   information. Using this, we can add a new function which only adds
>   reflog updates to the transaction.

Out of interest, I see various changes happen around committer info. But
why is the committer info more relevant for reflog updates, in contrast
to normal ref updates?
Karthik Nayak Dec. 20, 2024, 11:23 a.m. UTC | #5
Toon Claes <toon@iotcl.com> writes:

> Karthik Nayak <karthik.188@gmail.com> writes:
>
>> The `git refs migrate` command was introduced in
>> 25a0023f28 (builtin/refs: new command to migrate ref storage formats,
>> 2024-06-06) to support migrating from one reference backend to another.
>>
>> One limitation of the feature was that it didn't support migrating
>> repositories which contained reflogs. This isn't a requirement on the
>> server side as repositories are stored as bare repositories (which do
>> not contain any reflogs). Clients however generally use reflogs and
>> until now couldn't use the `git refs migrate` command to migrate their
>> repositories to the new reftable format.
>>
>> One of the issues for adding reflog support is that the ref transactions
>> don't support reflogs additions:
>>   1. While there is REF_LOG_ONLY flag, there is no function to utilize
>>   the flag and add reflogs.
>>   2. reference backends generally sort the updates by the refname. This
>>   wouldn't work for reflogs which need to ensure that they maintain the
>>   order of creation.
>>   3. In the files backend, reflog entries are added by obtaining locks
>>   on the refs themselves. This means each update in the transaction, will
>>   obtain a ref_lock. This paradigm fails to accompany the fact that there
>>   could be multiple reflog updates for a refname in a single transaction.
>>   4. The backends check for duplicate entries, which doesn't make sense
>>   in the context of adding multiple reflogs for a given refname.
>>
>> We overcome these issue we make the following changes:
>>   - Update the ref_update structure to also include the committer
>>   information. Using this, we can add a new function which only adds
>>   reflog updates to the transaction.
>
> Out of interest, I see various changes happen around committer info. But
> why is the committer info more relevant for reflog updates, in contrast
> to normal ref updates?
>
> --
> Toon

The committer info is metadata around a ref update. It is only stored in
the reflogs. So they're only relevant for reflogs! For e.g. in the files
backend, we can see:

  $ cat .git/logs/HEAD | tail -1
  7a929cb27ca6df69d4db64b008e27e002c691028
bc8e00178b671da2b845ccbba175f6c093ed6949 Karthik Nayak
<karthik.188@google.com> 1734624674 +0100	commit: c10

Here the 'Karthik Nayak <karthik.188@gmail.com> 1734624673 +0100'
section is the committer_info. Whereas if you see the reference itself

  $ cat .git/HEAD
  ref: refs/heads/master

It only contains reference information. So, the `committer_info` is
not needed for regular ref updates. Earlier we dynamically obtain the
information when a reflog was being created. But for migration of
existing reflogs we need to pass this information from the source to the
ref transaction mechanism. So we pass the `committer_info` through the
layers.

Thank you for the review.