diff mbox series

t7527 intermittent failure on macOS APFS and possible fix

Message ID YvZbGAf+82WtNXcJ@danh.dev (mailing list archive)
State New, archived
Headers show
Series t7527 intermittent failure on macOS APFS and possible fix | expand

Commit Message

Đoàn Trần Công Danh Aug. 12, 2022, 1:52 p.m. UTC
Hello,

Running t7527 on macOS with encrypted APFS filesystem.
I observes intermittent failure, however, when I manually check the
test cases, they're all passed.

I suspected fileystem caching issue.
I added those sync-s into test steps and the test pass.
I'm not sure if this is the intending "fix" for the tests
since we're testing the fsmonitor with t7527.

Please advise!

P/S: When debugging, I also found out that:
"test-tool fsmonitor-client query" doesn't write the final newline
character, thus making the output harder to read. The diff also have
the final newline added.

----- 8< -------
---- >8 --------------

Comments

Junio C Hamano Aug. 12, 2022, 6:08 p.m. UTC | #1
Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:

> Running t7527 on macOS with encrypted APFS filesystem.
> I observes intermittent failure, however, when I manually check the
> test cases, they're all passed.
>
> I suspected fileystem caching issue.
> I added those sync-s into test steps and the test pass.
> I'm not sure if this is the intending "fix" for the tests
> since we're testing the fsmonitor with t7527.
>
> Please advise!

fsmonitor experts, please do.

My gut feeling is that, unless the production code internally calls
the equivalent of "sync" done here and the failure in tests are
coming from the fact that the "sync" is missing in "test-tool
fsmonitor-client" (i.e. test-tool does not emulate the real world
closely enough and fails in cases where the machinery does not fail
in the real world), these "sync" calls only sweep the problem under
the rug.

> P/S: When debugging, I also found out that:
> "test-tool fsmonitor-client query" doesn't write the final newline
> character, thus making the output harder to read. The diff also have
> the final newline added.
>
> ----- 8< -------
> diff --git a/t/helper/test-fsmonitor-client.c b/t/helper/test-fsmonitor-client.c
> index 54a4856c48..98d6cf1440 100644
> --- a/t/helper/test-fsmonitor-client.c
> +++ b/t/helper/test-fsmonitor-client.c
> @@ -55,6 +55,7 @@ static int do_send_query(const char *token)
>  
>  	write_in_full(1, answer.buf, answer.len);
>  	strbuf_release(&answer);
> +	write_in_full(1, "\n", 1);
>  
>  	return 0;
>  }
> @@ -77,6 +78,7 @@ static int do_send_flush(void)
>  
>  	write_in_full(1, answer.buf, answer.len);
>  	strbuf_release(&answer);
> +	write_in_full(1, "\n", 1);
>  
>  	return 0;
>  }

Aren't these protocol drivers?  If the protocol is defined without
the trailing LF, would it make sense to update only the sending end
to do this?  Or does the protocol makes it clear that a trailing LF,
or lack of it, should be tolerated by all the implementations?

If we are absolutely sure that no implementation of the other side
will get upset by seeing an extra LF, It would be fine, but as the
original code wants to call write_in_full(), it would be more
preferrable to do it this way instead, I suspect.

+	strbuf_complete(&answer, '\n');
	write_in_full(1, answer.buf, answer.len);
	strbuf_release(&answer);
Đoàn Trần Công Danh Aug. 13, 2022, 12:16 a.m. UTC | #2
On 2022-08-12 11:08:30-0700, Junio C Hamano <gitster@pobox.com> wrote:
> Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:
> 
> > Running t7527 on macOS with encrypted APFS filesystem.
> > I observes intermittent failure, however, when I manually check the
> > test cases, they're all passed.
> >
> > I suspected fileystem caching issue.
> > I added those sync-s into test steps and the test pass.
> > I'm not sure if this is the intending "fix" for the tests
> > since we're testing the fsmonitor with t7527.
> >
> > Please advise!
> 
> fsmonitor experts, please do.
> 
> My gut feeling is that, unless the production code internally calls
> the equivalent of "sync" done here and the failure in tests are
> coming from the fact that the "sync" is missing in "test-tool
> fsmonitor-client" (i.e. test-tool does not emulate the real world
> closely enough and fails in cases where the machinery does not fail
> in the real world), these "sync" calls only sweep the problem under
> the rug.

It's my gut feeling, too.
Anyway, I'm running the test again, too confirm your suggestion on
t/helper/test-fsmonitor-client.c

t7527.63 (Unicode nfc/nfd) also failed intermittently, too.
Here is the content of unicode.trace:
---- 8< ----
$ cat unicode.trace
statfs('/path/to/src/git/t/trash directory.t7527-builtin-fsmonitor/test_unicode') [type 0x0000001a][flags 0x04909080] 'apfs'
statfs('/path/to/src/git/t/trash directory.t7527-builtin-fsmonitor/test_unicode') [type 0x0000001a][flags 0x04909080] 'apfs'
Watching: worktree '/path/to/src/git/t/trash directory.t7527-builtin-fsmonitor/test_unicode'
statfs('/path/to/src/git/t/trash directory.t7527-builtin-fsmonitor/test_unicode') [type 0x0000001a][flags 0x04909080] 'apfs'
requested token: quit
---------- >8 ------------------

> 
> > P/S: When debugging, I also found out that:
> > "test-tool fsmonitor-client query" doesn't write the final newline
> > character, thus making the output harder to read. The diff also have
> > the final newline added.
> >
> > ----- 8< -------
> > diff --git a/t/helper/test-fsmonitor-client.c b/t/helper/test-fsmonitor-client.c
> > index 54a4856c48..98d6cf1440 100644
> > --- a/t/helper/test-fsmonitor-client.c
> > +++ b/t/helper/test-fsmonitor-client.c
> > @@ -55,6 +55,7 @@ static int do_send_query(const char *token)
> >  
> >  	write_in_full(1, answer.buf, answer.len);
> >  	strbuf_release(&answer);
> > +	write_in_full(1, "\n", 1);
> >  
> >  	return 0;
> >  }
> > @@ -77,6 +78,7 @@ static int do_send_flush(void)
> >  
> >  	write_in_full(1, answer.buf, answer.len);
> >  	strbuf_release(&answer);
> > +	write_in_full(1, "\n", 1);
> >  
> >  	return 0;
> >  }
> 
> Aren't these protocol drivers?

The "answer" strbuf is the response from fsmonitor daemon, I think.
The write_in_full to fd 1 is test-tool writes down the answer to
stdout.

> If the protocol is defined without
> the trailing LF, would it make sense to update only the sending end
> to do this?  Or does the protocol makes it clear that a trailing LF,
> or lack of it, should be tolerated by all the implementations?
> 
> If we are absolutely sure that no implementation of the other side
> will get upset by seeing an extra LF, It would be fine, but as the
> original code wants to call write_in_full(), it would be more
> preferrable to do it this way instead, I suspect.
> 
> +	strbuf_complete(&answer, '\n');
> 	write_in_full(1, answer.buf, answer.len);
> 	strbuf_release(&answer);

This could work, since we don't send "answer" back to
fsmonitor-daemon.
Junio C Hamano Aug. 13, 2022, 12:45 a.m. UTC | #3
Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:

>> Aren't these protocol drivers?
>
> The "answer" strbuf is the response from fsmonitor daemon, I think.
> The write_in_full to fd 1 is test-tool writes down the answer to
> stdout.

Ahh, then please scratch everything I said about this part.  Your
original implementation makes it more clear that we are "dumping"
what we got from the daemon, and then adding LF for our own
readability.  I somehow mistook the code to be preparing and sending
protocol message to the other end (iow, fd #1 was connected to
"them").

Sorry about the gotcha, and thanks for straightening me out.
Jeff King Aug. 14, 2022, 6:52 a.m. UTC | #4
On Fri, Aug 12, 2022 at 08:52:24PM +0700, Đoàn Trần Công Danh wrote:

> Running t7527 on macOS with encrypted APFS filesystem.
> I observes intermittent failure, however, when I manually check the
> test cases, they're all passed.
> 
> I suspected fileystem caching issue.
> I added those sync-s into test steps and the test pass.
> I'm not sure if this is the intending "fix" for the tests
> since we're testing the fsmonitor with t7527.

I don't know anything about your problem area, but some tools in our
test suite that might help you get better results with an intermittent
problem (if you didn't already know about them):

  - add:

      GIT_TEST_OPTS = -x --verbose-log

    into config.mak. Then when a "make test" run fails, you can see the
    log in t/test-results/t7527-*.out.

  - try running:

      ./t7527-* --stress

    which will run a bunch of instances simultaneously, with the idea
    that the load will introduce timing issues that will trigger the
    bug. The failed instance will produce a detailed log on stdout.

That might help you get more output from the failing cases.

-Peff
Jeff Hostetler Aug. 17, 2022, 8:23 p.m. UTC | #5
On 8/12/22 9:52 AM, Đoàn Trần Công Danh wrote:
> Hello,
> 
> Running t7527 on macOS with encrypted APFS filesystem.
> I observes intermittent failure, however, when I manually check the
> test cases, they're all passed.
> 
> I suspected fileystem caching issue.
> I added those sync-s into test steps and the test pass.
> I'm not sure if this is the intending "fix" for the tests
> since we're testing the fsmonitor with t7527.
> 
> Please advise!
> 
> P/S: When debugging, I also found out that:
> "test-tool fsmonitor-client query" doesn't write the final newline
> character, thus making the output harder to read. The diff also have
> the final newline added.

Thanks for the report.  Sorry for the delay, I'm still digging thru
my email backlog after a long vacation.  I'll take a look at this
and get back to you shortly.

Thanks,
Jeff
diff mbox series

Patch

diff --git a/t/helper/test-fsmonitor-client.c b/t/helper/test-fsmonitor-client.c
index 54a4856c48..98d6cf1440 100644
--- a/t/helper/test-fsmonitor-client.c
+++ b/t/helper/test-fsmonitor-client.c
@@ -55,6 +55,7 @@  static int do_send_query(const char *token)
 
 	write_in_full(1, answer.buf, answer.len);
 	strbuf_release(&answer);
+	write_in_full(1, "\n", 1);
 
 	return 0;
 }
@@ -77,6 +78,7 @@  static int do_send_flush(void)
 
 	write_in_full(1, answer.buf, answer.len);
 	strbuf_release(&answer);
+	write_in_full(1, "\n", 1);
 
 	return 0;
 }
diff --git a/t/t7527-builtin-fsmonitor.sh b/t/t7527-builtin-fsmonitor.sh
index 56c0dfffea..3f2cef251d 100755
--- a/t/t7527-builtin-fsmonitor.sh
+++ b/t/t7527-builtin-fsmonitor.sh
@@ -430,6 +430,7 @@  test_expect_success 'edit some files' '
 	edit_files &&
 
 	test-tool fsmonitor-client query --token 0 &&
+	sync &&
 
 	grep "^event: dir1/modified$"  .git/trace &&
 	grep "^event: dir2/modified$"  .git/trace &&
@@ -445,6 +446,7 @@  test_expect_success 'create some files' '
 	create_files &&
 
 	test-tool fsmonitor-client query --token 0 &&
+	sync &&
 
 	grep "^event: dir1/new$" .git/trace &&
 	grep "^event: dir2/new$" .git/trace &&
@@ -459,6 +461,7 @@  test_expect_success 'delete some files' '
 	delete_files &&
 
 	test-tool fsmonitor-client query --token 0 &&
+	sync &&
 
 	grep "^event: dir1/delete$" .git/trace &&
 	grep "^event: dir2/delete$" .git/trace &&
@@ -473,6 +476,7 @@  test_expect_success 'rename some files' '
 	rename_files &&
 
 	test-tool fsmonitor-client query --token 0 &&
+	sync &&
 
 	grep "^event: dir1/rename$"  .git/trace &&
 	grep "^event: dir2/rename$"  .git/trace &&
@@ -490,6 +494,7 @@  test_expect_success 'rename directory' '
 	mv dirtorename dirrenamed &&
 
 	test-tool fsmonitor-client query --token 0 &&
+	sync &&
 
 	grep "^event: dirtorename/*$" .git/trace &&
 	grep "^event: dirrenamed/*$"  .git/trace
@@ -503,6 +508,7 @@  test_expect_success 'file changes to directory' '
 	file_to_directory &&
 
 	test-tool fsmonitor-client query --token 0 &&
+	sync &&
 
 	grep "^event: delete$"     .git/trace &&
 	grep "^event: delete/new$" .git/trace
@@ -516,6 +522,7 @@  test_expect_success 'directory changes to a file' '
 	directory_to_file &&
 
 	test-tool fsmonitor-client query --token 0 &&
+	sync &&
 
 	grep "^event: dir1$" .git/trace
 '