diff mbox series

[RFC,v2] khash_test.c : add unit tests

Message ID 20230627132026.3204186-1-siddhartth@google.com (mailing list archive)
State New, archived
Headers show
Series [RFC,v2] khash_test.c : add unit tests | expand

Commit Message

Siddharth Singh June 27, 2023, 1:20 p.m. UTC
Signed-off-by: Siddharth Singh <siddhartth@google.com>
---
Since v1
- rewrote the code keeping coding style guidelines in mind.
- refactored tests to improve their implementation..
- added a few more tests that cause collisions in the hashmap.
In the v1 patch, there were concerns that new tests were unnecessary because of upstream tests. However, I believe it would be beneficial to have tests based on the khash implementation in the git codebase because the existing tests[1] for khash do not use the same code for khash as the git codebase. 
E.g. in khash.h file of “attractivechaos/klib/khash.h” [2]. There's an implementation for `KHASH_MAP_INIT_INT64` which isn’t defined in “git/git/khash.h”[3]. So, there’s a possibility that the khash.h in a different repository might move forward and end up being different from out implementation, so writing tests for “git/git/khash.h” would ensure that our tests are relevant to the actual usage of the library.
While my colleagues are currently investigating whether C Tap harness is the best way to test libified code, this RFC is intended to discuss the content of unit tests and what other tests would be useful for khash.h. I am unsure of what tests will be valuable in the future, so I would appreciate your thoughts on the matter. Many test cases are easier to construct in C TAP harness than in test-tool. For example, In C TAP harness, non-string values can be used directly in hash maps. However, in test-tool, non-string values must be passed as an argument to a shell executable, parsed into the correct type, and then operated on.
I'm also very new to writing unit tests in C and git codebase in general, so I'll appreciate constructive feedback and discussion..
With this RFC, I hope you can help me answer these questions.
What are the unit test cases to include in khash.h?
What are the other types of tests that would be useful for khash.h?
[1] https://github.com/attractivechaos/klib/blob/master/test/khash_test.c
[2] https://github.com/attractivechaos/klib/blob/master/khash.h
[3] https://github.com/git/git/blob/master/khash.h

 Makefile       |   2 +
 t/khash_test.c | 214 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 216 insertions(+)
 create mode 100644 t/khash_test.c

Comments

Glen Choo June 30, 2023, 12:53 a.m. UTC | #1
Siddharth Singh <siddhartth@google.com> writes:

> - rewrote the code keeping coding style guidelines in mind.

I noticed several style issues in the patch. I'll avoid commenting on
those here so that we can focus on the other, big questions.

> I'm also very new to writing unit tests in C and git codebase in general, so I'll appreciate constructive feedback and discussion..

Okay. I think this is a good chance for the ML to decide on what sorts
of unit test practices we want too.

> What are the unit test cases to include in khash.h?
> What are the other types of tests that would be useful for khash.h?

For a third party library, I mostly agree with upthread commenters that
testing the correctness of the _library_ code isn't that valuable on its
own, but unit tests are also useful for demonstrating how the code
should be used. In a lot of projects, unit tests are a pseudo-official
demonstration of how an interface should be used, and that's sometimes
better than seeing the code in production, because unit tests are
typically a lot simpler and easier to reason about.

As an aside, this is something that a unit test framework will do _much
better_ than test-tool, because it makes it easy to read and write these
idtiomatic demos without worrying about how the test-tool executable
will be used.

I think that if we consider these two factors together (correctness and
developer education), the fact that we don't have great documentation of
khash.h in-tree, and that the external docs aren't that comprehensive
either, I think it would be useful to exercise each of the khash
'functions' to show how they work.

As Taylor mentioned upthread, I think the most valuable thing to test is
the way our code uses khash.h (not the library code itself). IOW:

  KHASH_INIT(oid_set, struct object_id, int, 0, oidhash_by_value, oideq_by_value)
  KHASH_INIT(oid_map, struct object_id, void *, 1, oidhash_by_value, oideq_by_value)
  KHASH_INIT(oid_pos, struct object_id, int, 1, oidhash_by_value, oideq_by_value)

So we would focus on oidhash_by_value and oideq_by_value in particular,
since that's provided by us. It would be very useful to verify that
these functions are doing exactly what we think they should be doing,
and we could do this with _every_ khash function to really explore the
API surface.

In fact... I suspect that they _not_ doing what we think they should,
or at the very least, might be displaying some very unintuitive
behavior. object_id contains both the hash value _and_ the hash
algorithm that generated it. If we have the same hash value, but
generated by different algorithms, are they considered the same equal or
not? To me, the intuitive answer is "no", but oidhash_by_value() only
considers the hash value and ignores the algorithm altogether! I think
that testing collisions between SHA1 and SHA256 values would be very
useful.

> +/*
> +  These tests are for base assumptions; if they fails, library is broken
> +*/
> +int hash_string_succeeds() {
> +  const char *str = "test_string";
> +  khint_t value = __ac_X31_hash_string(str);
> +  return value == 0xf1a6305e;
> +}
> +
> +int hash_string_with_non_alphabets_succeeds() {
> +  const char *str = "test_string #2";
> +  khint_t value = __ac_X31_hash_string(str);
> +  return value == 0xfa970771;
> +}

For a unit test, it is a best practice to test the external interface
that callers are depending on, and to avoid implementation
details/internals. If we do the latter, we end up with brittle tests
that break on non-important changes. E.g. we definitely should not be
testing the exact value of the hash function. The khash library
maintainer could decide to change the hash function, and we should not
be broken by that.

For khash.h, it appears that that anything prepended with "__" is meant
to be internal, and everything else is external. E.g.
__ac_X31_hash_string is certainly not meant to be used directly by end
users., especially since there's a human-friendly macro for it:

  #define kh_str_hash_func(key) __ac_X31_hash_string(key)

However, because something is external, doesn't mean we need to be
testing its implementation - we still shouldn't be testing the
return value of kh_str_hash_func(). We may want to test certain
_properties_ of the external function (e.g. does it provide good
collision resistance?), but in this case, I think we can trust that what
the library author wrote is a good default.

> +KHASH_INIT(str, const char *, int *, 1, kh_str_hash_func, kh_str_hash_equal);

As mentioned earlier, I think testing oid_[set|map|pos] would be much
more useful, and we should try to demonstrate each of the
macros/functions and how they interact, e.g. if I kh_put_*, I should
expect to be able to kh_get_* afterwards.

> +
> +int initialized_hashmap_pointer_not_null() {
> +  kh_str_t *hashmap = kh_init_str();
> +
> +  if(hashmap != NULL){
> +    free(hashmap);
> +    return 1;
> +  }
> +  return 0;
> +}

I don't think this is a useful test. It would be extremely weird for the
return value to be NULL in regular usage.

> +int put_key_into_hashmap_once_succeeds() {
> +  int ret, value;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if(!ret)
> +    return 0;
> +  return pos != 0;
> +}

I don't think it makes sense to only check the return value, we should
check that it had the intended effect - that we can read back the value.
There is a later test that does this, so I'd prefer we remove this.

Comment on the test framework, not the test: I think it makes sense to
assert that "ret" is nonzero, but this "if..return" way of asserting
makes it impossible to tell what has broken. It also adds a lot of
cognitive overhead when trying to parse the assertion. I'll echo Phillip
Wood's comments and say that I think this is definitely not a framework
I want to use.

> +int put_same_key_into_hashmap_twice_fails() {
> +  int first_return_value, second_return_value, value;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  kh_put_str(hashmap, "test_key", &first_return_value);
> +  kh_put_str(hashmap, "test_key", &second_return_value);
> +  return !second_return_value;
> +}

This test seems reasonable.

> +int put_value_into_hashmap_once_succeeds() {
> +  int ret, value = 14;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  return *kh_value(hashmap, pos) == value;
> +}

Checking the inserted value makes sense to me I don't really understand
khash.h, but is this the recommended way to get values back? kh_value()
looks like just a small macro around an internal array, so we are just
checking that *&value == value. I would have expected that we would call
kh_get to get "pos", then call kh_value on that.

Also, do we have to kh_put_str("some_key") and then immediately set it
again with kh_key(pos)? That seems odd, and I don't see other callers
doing that all the time.

Since this is the second time I see "if (!ret)", I should ask whether
this is meant as an assertion (checking the correctness of the code), or
for flow control (the code might return either 0 or 1, and I need to
handle both). If it is the former, I think it's unnecessary to
assert on it multiple times. The latter does not belong in a unit test -
we should already _know_ what we want to be returned. Btw, it is a huge
flaw in the test framework that both look exactly the same, so I'll say
again that I really don't like it.

> +int put_same_value_into_hashmap_twice_fails() {
> +  int first_return_value, second_return_value, value = 14;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &first_return_value);
> +  if (!first_return_value)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  kh_put_str(hashmap, "test_key", &second_return_value);
> +  return !second_return_value;
> +}

I don't see how this is different from the previous test that tested for
collisions.

> +int get_existing_kv_from_hashmap_succeeds() {
> +  int ret, value =14;
> +  int expected = value;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  return *kh_value(hashmap, kh_get_str(hashmap, "test_key")) == expected;
> +}

Isn't this identical to put_value_into_hashmap_once_succeeds()?

Also, this assertion is harder to read than necessary because we have
both "expected" and "value" - the different names suggest that they are
different, but they're really the same. Let's just use one of them.

> +int get_nonexisting_kv_from_hashmap_fails() {
> +  int value = 14;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  return !kh_get_str(hashmap, "test_key");
> +}

Looks reasonable.

> +int deletion_from_hashmap_sets_flag() {
> +  int ret, value = 14;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  if(!kh_exist(hashmap, pos))
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  kh_del_str(hashmap, pos);
> +  return !kh_exist(hashmap, pos);
> +}

What is the "flag" being set? If it is the khash-internal flag that
marks that an entry is deleted, that's an unimportant implementation
detail - the important thing is that the value is deleted. Let's just
rename this to "deletion_works" or something.

> +
> +int deletion_from_hashmap_updates_size() {
> +  int ret, value = 14, current_size;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  current_size = hashmap->size;
> +  kh_del_str(hashmap, pos);
> +  return hashmap->size + 1 == current_size;
> +}

The interface is kh_size(), so let's use that instead of referencing the
.size member correctly.

> +int deletion_from_hashmap_does_not_update_kv() {
> +  int ret, value = 14, current_size;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  kh_del_str(hashmap, pos);
> +  return !strcmp(kh_key(hashmap, pos),"test_key");
> +}

Interesting! If we kh_del_(pos) but read it back with kh_key(), we can
still read it back. This has some implications on how khash should be
used - we should be extremely careful if we pass around a returned value
from kh_get_ or kh_put_ because we don't know if it's deleted or not.
Instead, we should prefer kh_get_(). Very useful demonstration.

> +int update_value_after_deletion_succeeds() {
> +  int ret, value1 = 14, value2 = 15;
> +  khint_t pos1, pos2;
> +  kh_str_t *hashmap = kh_init_str();
> +  // adding the kv to the hashmap
> +  pos1 = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos1) = xstrdup("test_key");
> +  kh_value(hashmap, pos1) = &value1;
> +  // delete the kv from the hashmap
> +  kh_del_str(hashmap, pos1);
> +  // adding the same key with different value
> +  pos2 = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos2) = xstrdup("test_key");
> +  kh_value(hashmap, pos2) = &value2;
> +  // check if the value is different
> +  return *kh_value(hashmap, kh_get_str(hashmap, "test_key")) == value2;
> +}

Looks reasonable.

> +int hashmap_size_matches_number_of_added_elements() {
> +  int ret, value1 = 14, value2 = 15, value3 = 16;
> +  khint_t pos1, pos2, pos3;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos1 = kh_put_str(hashmap, "test_key1", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos1) = xstrdup("test_key1");
> +  kh_value(hashmap, pos1) = &value1;
> +  pos2 = kh_put_str(hashmap, "test_key2", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos2) = xstrdup("test_key2");
> +  kh_value(hashmap, pos2) = &value2;
> +  pos3 = kh_put_str(hashmap, "test_key3", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos3) = xstrdup("test_key3");
> +  kh_value(hashmap, pos3) = &value3;
> +  return kh_size(hashmap) == 3;
> +}

Looks reasonable.

I would also exercise kh_end, kh_begin, kh_clear, kh_foreach and
kh_foreach_value

kh_destroy might be useful to test too, in case there are oddities
there, but it's probably not that interesting

kh_n_buckets doesn't seem useful to test - it looks too specific to the
impkementation. kh_resize looks like an implementation detail that we
shouldn't test either.
Glen Choo June 30, 2023, 1:05 a.m. UTC | #2
Some comments on mechanical issues...

It seems that you dropped the other participants from your reply.

  https://git-scm.com/docs/MyFirstContribution#v2-git-send-email

Do CC them as it helps them keep track of discussions they were in. I
think they will not mind if you send a new message to this thread, CC
them, and explain that you didn't mean to omit them.

Siddharth Singh <siddhartth@google.com> writes:

> Signed-off-by: Siddharth Singh <siddhartth@google.com>
> ---
> Since v1
> - rewrote the code keeping coding style guidelines in mind.
> - refactored tests to improve their implementation..
> - added a few more tests that cause collisions in the hashmap.
> In the v1 patch, there were concerns that new tests were unnecessary because of upstream tests. However, I believe it would be beneficial to have tests based on the khash implementation in the git codebase because the existing tests[1] for khash do not use the same code for khash as the git codebase. 
> E.g. in khash.h file of “attractivechaos/klib/khash.h” [2]. There's an implementation for `KHASH_MAP_INIT_INT64` which isn’t defined in “git/git/khash.h”[3]. So, there’s a possibility that the khash.h in a different repository might move forward and end up being different from out implementation, so writing tests for “git/git/khash.h” would ensure that our tests are relevant to the actual usage of the library.
> While my colleagues are currently investigating whether C Tap harness is the best way to test libified code, this RFC is intended to discuss the content of unit tests and what other tests would be useful for khash.h. I am unsure of what tests will be valuable in the future, so I would appreciate your thoughts on the matter. Many test cases are easier to construct in C TAP harness than in test-tool. For example, In C TAP harness, non-string values can be used directly in hash maps. However, in test-tool, non-string values must be passed as an argument to a shell executable, parsed into the correct type, and then operated on.
> I'm also very new to writing unit tests in C and git codebase in general, so I'll appreciate constructive feedback and discussion..

Do rewrap the line to 80 characters, otherwise it is quite hard to read
on most text-based clients. You can preview your patches with "git
send-email --annotate" and see if they are a reasonable column width.

> diff --git a/Makefile b/Makefile
> index 660c72d72e..d3ad2fa23c 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -3847,11 +3847,13 @@ $(UNIT_TEST_RUNNER): t/runtests.c
>  
>  UNIT_TEST_PROGS += t/unit-tests/unit-test-t
>  UNIT_TEST_PROGS += t/unit-tests/strbuf-t
> +UNIT_TEST_PROGS += t/unit-tests/khash-t
>  
>  $(UNIT_TEST_PROGS): $(CTAP_OBJS) $(LIBS)
>  	$(QUIET)mkdir -p t/unit-tests
>  	$(QUIET_CC)$(CC) -It -o t/unit-tests/unit-test-t t/unit-test.c $(CTAP_OBJS)
>  	$(QUIET_CC)$(CC) -It -o t/unit-tests/strbuf-t t/strbuf-test.c -DSKIP_COMMON_MAIN common-main.c $(CTAP_OBJS) $(LIBS)

Many reviewers like to apply the patches for review. It's typically
assumed that patches are based on 'master', but this looks like it is
not. That's okay, though you should call it out. If it is based on
something not in Junio's tree (and thus a reviewer can't apply your
patch), it would be appreciated if you share this commit via a Git fork,
e.g. GitHub.

> +/*
> +  These tests are for base assumptions; if they fails, library is broken

Missing *.

> +int hash_string_succeeds() {
> +  const char *str = "test_string";

We use tabs, not spaces. I think "make style" catches this.

> +int initialized_hashmap_pointer_not_null() {
> +  kh_str_t *hashmap = kh_init_str();
> +
> +  if(hashmap != NULL){

"if ", not "if". "make style" will catch this.

> +int update_value_after_deletion_succeeds() {
> +  int ret, value1 = 14, value2 = 15;
> +  khint_t pos1, pos2;
> +  kh_str_t *hashmap = kh_init_str();
> +  // adding the kv to the hashmap

Comments use /* */, not //.
Siddharth Singh July 7, 2023, 3:04 p.m. UTC | #3
On Thu, Jun 29, 2023 at 05:53:16PM -0700, Glen Choo wrote:
> Siddharth Singh <siddhartth@google.com> writes:
> 
> > - rewrote the code keeping coding style guidelines in mind.
> 
> I noticed several style issues in the patch. I'll avoid commenting on
> those here so that we can focus on the other, big questions.
> 
> > I'm also very new to writing unit tests in C and git codebase in general, so I'll appreciate constructive feedback and discussion..
> 
> Okay. I think this is a good chance for the ML to decide on what sorts
> of unit test practices we want too.
> 
> > What are the unit test cases to include in khash.h?
> > What are the other types of tests that would be useful for khash.h?
> 
> For a third party library, I mostly agree with upthread commenters that
> testing the correctness of the _library_ code isn't that valuable on its
> own, but unit tests are also useful for demonstrating how the code
> should be used. In a lot of projects, unit tests are a pseudo-official
> demonstration of how an interface should be used, and that's sometimes
> better than seeing the code in production, because unit tests are
> typically a lot simpler and easier to reason about.
> 
> As an aside, this is something that a unit test framework will do _much
> better_ than test-tool, because it makes it easy to read and write these
> idtiomatic demos without worrying about how the test-tool executable
> will be used.
> 
> I think that if we consider these two factors together (correctness and
> developer education), the fact that we don't have great documentation of
> khash.h in-tree, and that the external docs aren't that comprehensive
> either, I think it would be useful to exercise each of the khash
> 'functions' to show how they work.
> 
> As Taylor mentioned upthread, I think the most valuable thing to test is
> the way our code uses khash.h (not the library code itself). IOW:
> 
>   KHASH_INIT(oid_set, struct object_id, int, 0, oidhash_by_value, oideq_by_value)
>   KHASH_INIT(oid_map, struct object_id, void *, 1, oidhash_by_value, oideq_by_value)
>   KHASH_INIT(oid_pos, struct object_id, int, 1, oidhash_by_value, oideq_by_value)
> 
> So we would focus on oidhash_by_value and oideq_by_value in particular,
> since that's provided by us. It would be very useful to verify that
> these functions are doing exactly what we think they should be doing,
> and we could do this with _every_ khash function to really explore the
> API surface.
> 
> In fact... I suspect that they _not_ doing what we think they should,
> or at the very least, might be displaying some very unintuitive
> behavior. object_id contains both the hash value _and_ the hash
> algorithm that generated it. If we have the same hash value, but
> generated by different algorithms, are they considered the same equal or
> not? To me, the intuitive answer is "no", but oidhash_by_value() only
> considers the hash value and ignores the algorithm altogether! I think
> that testing collisions between SHA1 and SHA256 values would be very
> useful.
> 
> > +/*
> > +  These tests are for base assumptions; if they fails, library is broken
> > +*/
> > +int hash_string_succeeds() {
> > +  const char *str = "test_string";
> > +  khint_t value = __ac_X31_hash_string(str);
> > +  return value == 0xf1a6305e;
> > +}
> > +
> > +int hash_string_with_non_alphabets_succeeds() {
> > +  const char *str = "test_string #2";
> > +  khint_t value = __ac_X31_hash_string(str);
> > +  return value == 0xfa970771;
> > +}
> 
> For a unit test, it is a best practice to test the external interface
> that callers are depending on, and to avoid implementation
> details/internals. If we do the latter, we end up with brittle tests
> that break on non-important changes. E.g. we definitely should not be
> testing the exact value of the hash function. The khash library
> maintainer could decide to change the hash function, and we should not
> be broken by that.
> 
> For khash.h, it appears that that anything prepended with "__" is meant
> to be internal, and everything else is external. E.g.
> __ac_X31_hash_string is certainly not meant to be used directly by end
> users., especially since there's a human-friendly macro for it:
> 
>   #define kh_str_hash_func(key) __ac_X31_hash_string(key)
> 
> However, because something is external, doesn't mean we need to be
> testing its implementation - we still shouldn't be testing the
> return value of kh_str_hash_func(). We may want to test certain
> _properties_ of the external function (e.g. does it provide good
> collision resistance?), but in this case, I think we can trust that what
> the library author wrote is a good default.
> 
> > +KHASH_INIT(str, const char *, int *, 1, kh_str_hash_func, kh_str_hash_equal);
> 
> As mentioned earlier, I think testing oid_[set|map|pos] would be much
> more useful, and we should try to demonstrate each of the
> macros/functions and how they interact, e.g. if I kh_put_*, I should
> expect to be able to kh_get_* afterwards.
> 
> > +
> > +int initialized_hashmap_pointer_not_null() {
> > +  kh_str_t *hashmap = kh_init_str();
> > +
> > +  if(hashmap != NULL){
> > +    free(hashmap);
> > +    return 1;
> > +  }
> > +  return 0;
> > +}
> 
> I don't think this is a useful test. It would be extremely weird for the
> return value to be NULL in regular usage.

While it is indeed uncommon for the return value of kh_init_str() to be null in
regular usage, there could be certain cases where the hashmap pointer might be
null. One possible scenario is if there is an error during memory allocation for
the hashmap. This can happen if the system runs out of memory or if there are
other memory-related issues. In such cases, the kh_init_str() function might
return a null pointer to indicate the failure to allocate memory for the hashmap.

>
> > +int put_key_into_hashmap_once_succeeds() {
> > +  int ret, value;
> > +  khint_t pos;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  pos = kh_put_str(hashmap, "test_key", &ret);
> > +  if(!ret)
> > +    return 0;
> > +  return pos != 0;
> > +}
> 
> I don't think it makes sense to only check the return value, we should
> check that it had the intended effect - that we can read back the value.
> There is a later test that does this, so I'd prefer we remove this.
> 
> Comment on the test framework, not the test: I think it makes sense to
> assert that "ret" is nonzero, but this "if..return" way of asserting
> makes it impossible to tell what has broken. It also adds a lot of
> cognitive overhead when trying to parse the assertion. I'll echo Phillip
> Wood's comments and say that I think this is definitely not a framework
> I want to use.
> 
> > +int put_same_key_into_hashmap_twice_fails() {
> > +  int first_return_value, second_return_value, value;
> > +  khint_t pos;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  kh_put_str(hashmap, "test_key", &first_return_value);
> > +  kh_put_str(hashmap, "test_key", &second_return_value);
> > +  return !second_return_value;
> > +}
> 
> This test seems reasonable.
> 
> > +int put_value_into_hashmap_once_succeeds() {
> > +  int ret, value = 14;
> > +  khint_t pos;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  pos = kh_put_str(hashmap, "test_key", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  kh_key(hashmap, pos) = xstrdup("test_key");
> > +  kh_value(hashmap, pos) = &value;
> > +  return *kh_value(hashmap, pos) == value;
> > +}
> 
> Checking the inserted value makes sense to me I don't really understand
> khash.h, but is this the recommended way to get values back? kh_value()
> looks like just a small macro around an internal array, so we are just
> checking that *&value == value. I would have expected that we would call
> kh_get to get "pos", then call kh_value on that.
> 
> Also, do we have to kh_put_str("some_key") and then immediately set it
> again with kh_key(pos)? That seems odd, and I don't see other callers
> doing that all the time.
> 
> Since this is the second time I see "if (!ret)", I should ask whether
> this is meant as an assertion (checking the correctness of the code), or
> for flow control (the code might return either 0 or 1, and I need to
> handle both). If it is the former, I think it's unnecessary to
> assert on it multiple times. The latter does not belong in a unit test -
> we should already _know_ what we want to be returned. Btw, it is a huge
> flaw in the test framework that both look exactly the same, so I'll say
> again that I really don't like it.
> 

Regarding the usage of kh_value(), I saw this part of code[1] This suggests
that it may be the recommended approach in this context.

For kh_put_str, if i understand correctly, we first allocate a bucket then
then set it using kh_key, here's a use case[2] that does it similarly.

Regarding the usage of if (!ret), it seems to serve as an assertion or error
checking mechanism. If the return value ret is 0, it likely indicates an
error condition. It's worth mentioning that the similarity between the
assertion and flow control statements could be a flaw in the test framework,
as you pointed out.

> > +int put_same_value_into_hashmap_twice_fails() {
> > +  int first_return_value, second_return_value, value = 14;
> > +  khint_t pos;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  pos = kh_put_str(hashmap, "test_key", &first_return_value);
> > +  if (!first_return_value)
> > +    return 0;
> > +  kh_key(hashmap, pos) = xstrdup("test_key");
> > +  kh_value(hashmap, pos) = &value;
> > +  kh_put_str(hashmap, "test_key", &second_return_value);
> > +  return !second_return_value;
> > +}
> 
> I don't see how this is different from the previous test that tested for
> collisions.

I also attempt to assign a value to the hashmap here, but this will not be
successful due to a collision. The intent of this test was different, however
 
> > +int get_existing_kv_from_hashmap_succeeds() {
> > +  int ret, value =14;
> > +  int expected = value;
> > +  khint_t pos;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  pos = kh_put_str(hashmap, "test_key", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  kh_key(hashmap, pos) = xstrdup("test_key");
> > +  kh_value(hashmap, pos) = &value;
> > +  return *kh_value(hashmap, kh_get_str(hashmap, "test_key")) == expected;
> > +}
> 
> Isn't this identical to put_value_into_hashmap_once_succeeds()?
> 
> Also, this assertion is harder to read than necessary because we have
> both "expected" and "value" - the different names suggest that they are
> different, but they're really the same. Let's just use one of them.
> 

I agree they both end up same implementation wise.

> > +int get_nonexisting_kv_from_hashmap_fails() {
> > +  int value = 14;
> > +  khint_t pos;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  return !kh_get_str(hashmap, "test_key");
> > +}
> 
> Looks reasonable.
> 
> > +int deletion_from_hashmap_sets_flag() {
> > +  int ret, value = 14;
> > +  khint_t pos;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  pos = kh_put_str(hashmap, "test_key", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  if(!kh_exist(hashmap, pos))
> > +    return 0;
> > +  kh_key(hashmap, pos) = xstrdup("test_key");
> > +  kh_value(hashmap, pos) = &value;
> > +  kh_del_str(hashmap, pos);
> > +  return !kh_exist(hashmap, pos);
> > +}
> 
> What is the "flag" being set? If it is the khash-internal flag that
> marks that an entry is deleted, that's an unimportant implementation
> detail - the important thing is that the value is deleted. Let's just
> rename this to "deletion_works" or something.
> 
Oka, sounds good.
> > +
> > +int deletion_from_hashmap_updates_size() {
> > +  int ret, value = 14, current_size;
> > +  khint_t pos;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  pos = kh_put_str(hashmap, "test_key", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  kh_key(hashmap, pos) = xstrdup("test_key");
> > +  kh_value(hashmap, pos) = &value;
> > +  current_size = hashmap->size;
> > +  kh_del_str(hashmap, pos);
> > +  return hashmap->size + 1 == current_size;
> > +}
> 
> The interface is kh_size(), so let's use that instead of referencing the
> .size member correctly.

Okay.

> > +int deletion_from_hashmap_does_not_update_kv() {
> > +  int ret, value = 14, current_size;
> > +  khint_t pos;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  pos = kh_put_str(hashmap, "test_key", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  kh_key(hashmap, pos) = xstrdup("test_key");
> > +  kh_value(hashmap, pos) = &value;
> > +  kh_del_str(hashmap, pos);
> > +  return !strcmp(kh_key(hashmap, pos),"test_key");
> > +}
> 
> Interesting! If we kh_del_(pos) but read it back with kh_key(), we can
> still read it back. This has some implications on how khash should be
> used - we should be extremely careful if we pass around a returned value
> from kh_get_ or kh_put_ because we don't know if it's deleted or not.
> Instead, we should prefer kh_get_(). Very useful demonstration.
> 
Yes, we are still able to get the value but to truly know if something
was deleted or not, we need to use kh_exist.

> > +int update_value_after_deletion_succeeds() {
> > +  int ret, value1 = 14, value2 = 15;
> > +  khint_t pos1, pos2;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  // adding the kv to the hashmap
> > +  pos1 = kh_put_str(hashmap, "test_key", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  kh_key(hashmap, pos1) = xstrdup("test_key");
> > +  kh_value(hashmap, pos1) = &value1;
> > +  // delete the kv from the hashmap
> > +  kh_del_str(hashmap, pos1);
> > +  // adding the same key with different value
> > +  pos2 = kh_put_str(hashmap, "test_key", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  kh_key(hashmap, pos2) = xstrdup("test_key");
> > +  kh_value(hashmap, pos2) = &value2;
> > +  // check if the value is different
> > +  return *kh_value(hashmap, kh_get_str(hashmap, "test_key")) == value2;
> > +}
> 
> Looks reasonable.
> 
> > +int hashmap_size_matches_number_of_added_elements() {
> > +  int ret, value1 = 14, value2 = 15, value3 = 16;
> > +  khint_t pos1, pos2, pos3;
> > +  kh_str_t *hashmap = kh_init_str();
> > +  pos1 = kh_put_str(hashmap, "test_key1", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  kh_key(hashmap, pos1) = xstrdup("test_key1");
> > +  kh_value(hashmap, pos1) = &value1;
> > +  pos2 = kh_put_str(hashmap, "test_key2", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  kh_key(hashmap, pos2) = xstrdup("test_key2");
> > +  kh_value(hashmap, pos2) = &value2;
> > +  pos3 = kh_put_str(hashmap, "test_key3", &ret);
> > +  if (!ret)
> > +    return 0;
> > +  kh_key(hashmap, pos3) = xstrdup("test_key3");
> > +  kh_value(hashmap, pos3) = &value3;
> > +  return kh_size(hashmap) == 3;
> > +}
> 
> Looks reasonable.
> 
> I would also exercise kh_end, kh_begin, kh_clear, kh_foreach and
> kh_foreach_value
> 
> kh_destroy might be useful to test too, in case there are oddities
> there, but it's probably not that interesting
> 
> kh_n_buckets doesn't seem useful to test - it looks too specific to the
> impkementation. kh_resize looks like an implementation detail that we
> shouldn't test either.
>
Glen Choo July 7, 2023, 9:12 p.m. UTC | #4
Siddharth Singh <siddhartth@google.com> writes:

>> > +int initialized_hashmap_pointer_not_null() {
>> > +  kh_str_t *hashmap = kh_init_str();
>> > +
>> > +  if(hashmap != NULL){
>> > +    free(hashmap);
>> > +    return 1;
>> > +  }
>> > +  return 0;
>> > +}
>> 
>> I don't think this is a useful test. It would be extremely weird for the
>> return value to be NULL in regular usage.
>
> While it is indeed uncommon for the return value of kh_init_str() to be null in
> regular usage, there could be certain cases where the hashmap pointer might be
> null. One possible scenario is if there is an error during memory allocation for
> the hashmap. This can happen if the system runs out of memory or if there are
> other memory-related issues. In such cases, the kh_init_str() function might
> return a null pointer to indicate the failure to allocate memory for the hashmap.

Ah, yes, that is what I expected. By "regular usage", I meant the case
where the allocation succeeds. Nevertheless, I don't think this test
demonstrates or asserts anything that isn't easily observed in other
tests, though I am open to be proven wrong.

>> > +int put_value_into_hashmap_once_succeeds() {
>> > +  int ret, value = 14;
>> > +  khint_t pos;
>> > +  kh_str_t *hashmap = kh_init_str();
>> > +  pos = kh_put_str(hashmap, "test_key", &ret);
>> > +  if (!ret)
>> > +    return 0;
>> > +  kh_key(hashmap, pos) = xstrdup("test_key");
>> > +  kh_value(hashmap, pos) = &value;
>> > +  return *kh_value(hashmap, pos) == value;
>> > +}
>> Also, do we have to kh_put_str("some_key") and then immediately set it
>> again with kh_key(pos)? That seems odd, and I don't see other callers
>> doing that all the time.
>
> Regarding the usage of kh_value(), I saw this part of code[1] This suggests
> that it may be the recommended approach in this context.

What 'context' were you thinking of when you say "recommended approach
in this context"? (Did you mean kh_put_str() vs kh_put_oid_*()?) I
didn't see such an approach in most kh_put_*() call sites. I couldn't
find the link you're referencing, but I assume it is the following lines
in delta-islands.c:

	int hash_ret;
	khiter_t pos = kh_put_str(remote_islands, island_name, &hash_ret);

	if (hash_ret) {
		kh_key(remote_islands, pos) = xstrdup(island_name);
		kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct remote_island));
	}

In which case, the use case (as I understand it at least) seems quite
different:

- When kh_put_str() 'returns' hash_ret = 0, it means there's already a
  matching entry and we should do nothing.

- Otherwise, kh_put_str() actually creates a new matching entry, and now
  we want to allocate memory to store the entry, so we overwrite the key
  with "kh_key() = xstrdup()". We could have avoided this by doing
  something like:

    char *key = xstrdup(island_name)
    khiter_t pos = kh_put_str(remote_islands, key, &hash_ret);

    if (hash_ret) {
      kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct remote_island));
    } else {
      free(key);
    }

  but allocations are expensive, so we should avoid allocating before we
  know it's needed.

>> > +int put_same_value_into_hashmap_twice_fails() {
>> > +  int first_return_value, second_return_value, value = 14;
>> > +  khint_t pos;
>> > +  kh_str_t *hashmap = kh_init_str();
>> > +  pos = kh_put_str(hashmap, "test_key", &first_return_value);
>> > +  if (!first_return_value)
>> > +    return 0;
>> > +  kh_key(hashmap, pos) = xstrdup("test_key");
>> > +  kh_value(hashmap, pos) = &value;
>> > +  kh_put_str(hashmap, "test_key", &second_return_value);
>> > +  return !second_return_value;
>> > +}
>> 
>> I don't see how this is different from the previous test that tested for
>> collisions.
>
> I also attempt to assign a value to the hashmap here, but this will not be
> successful due to a collision. The intent of this test was different, however

Hm, what was the different intent? Is it to test this slightly different
case where the value is assigned?

My reasoning is:

- Would assigning a value result in different behavior?
- Would a reasonable user expect that it would result in different
  behavior?

If the answer to both of those is 'no' (which I believe it is), then it
is not worth testing.
diff mbox series

Patch

diff --git a/Makefile b/Makefile
index 660c72d72e..d3ad2fa23c 100644
--- a/Makefile
+++ b/Makefile
@@ -3847,11 +3847,13 @@  $(UNIT_TEST_RUNNER): t/runtests.c
 
 UNIT_TEST_PROGS += t/unit-tests/unit-test-t
 UNIT_TEST_PROGS += t/unit-tests/strbuf-t
+UNIT_TEST_PROGS += t/unit-tests/khash-t
 
 $(UNIT_TEST_PROGS): $(CTAP_OBJS) $(LIBS)
 	$(QUIET)mkdir -p t/unit-tests
 	$(QUIET_CC)$(CC) -It -o t/unit-tests/unit-test-t t/unit-test.c $(CTAP_OBJS)
 	$(QUIET_CC)$(CC) -It -o t/unit-tests/strbuf-t t/strbuf-test.c -DSKIP_COMMON_MAIN common-main.c $(CTAP_OBJS) $(LIBS)
+	$(QUIET_CC)$(CC) -It -o t/unit-tests/khash-t t/khash_test.c -DSKIP_COMMON_MAIN common-main.c $(CTAP_OBJS) $(LIBS)
 
 .PHONY: unit-tests
 unit-tests: $(UNIT_TEST_PROGS) $(UNIT_TEST_RUNNER)
diff --git a/t/khash_test.c b/t/khash_test.c
new file mode 100644
index 0000000000..d1a43add13
--- /dev/null
+++ b/t/khash_test.c
@@ -0,0 +1,214 @@ 
+#include "../git-compat-util.h"
+#include "../khash.h"
+#include <tap/basic.h>
+
+/*
+  These tests are for base assumptions; if they fails, library is broken
+*/
+int hash_string_succeeds() {
+  const char *str = "test_string";
+  khint_t value = __ac_X31_hash_string(str);
+  return value == 0xf1a6305e;
+}
+
+int hash_string_with_non_alphabets_succeeds() {
+  const char *str = "test_string #2";
+  khint_t value = __ac_X31_hash_string(str);
+  return value == 0xfa970771;
+}
+
+KHASH_INIT(str, const char *, int *, 1, kh_str_hash_func, kh_str_hash_equal);
+
+int initialized_hashmap_pointer_not_null() {
+  kh_str_t *hashmap = kh_init_str();
+
+  if(hashmap != NULL){
+    free(hashmap);
+    return 1;
+  }
+  return 0;
+}
+
+int put_key_into_hashmap_once_succeeds() {
+  int ret, value;
+  khint_t pos;
+  kh_str_t *hashmap = kh_init_str();
+  pos = kh_put_str(hashmap, "test_key", &ret);
+  if(!ret)
+    return 0;
+  return pos != 0;
+}
+
+int put_same_key_into_hashmap_twice_fails() {
+  int first_return_value, second_return_value, value;
+  khint_t pos;
+  kh_str_t *hashmap = kh_init_str();
+  kh_put_str(hashmap, "test_key", &first_return_value);
+  kh_put_str(hashmap, "test_key", &second_return_value);
+  return !second_return_value;
+}
+
+int put_value_into_hashmap_once_succeeds() {
+  int ret, value = 14;
+  khint_t pos;
+  kh_str_t *hashmap = kh_init_str();
+  pos = kh_put_str(hashmap, "test_key", &ret);
+  if (!ret)
+    return 0;
+  kh_key(hashmap, pos) = xstrdup("test_key");
+  kh_value(hashmap, pos) = &value;
+  return *kh_value(hashmap, pos) == value;
+}
+
+int put_same_value_into_hashmap_twice_fails() {
+  int first_return_value, second_return_value, value = 14;
+  khint_t pos;
+  kh_str_t *hashmap = kh_init_str();
+  pos = kh_put_str(hashmap, "test_key", &first_return_value);
+  if (!first_return_value)
+    return 0;
+  kh_key(hashmap, pos) = xstrdup("test_key");
+  kh_value(hashmap, pos) = &value;
+  kh_put_str(hashmap, "test_key", &second_return_value);
+  return !second_return_value;
+}
+
+int get_existing_kv_from_hashmap_succeeds() {
+  int ret, value =14;
+  int expected = value;
+  khint_t pos;
+  kh_str_t *hashmap = kh_init_str();
+  pos = kh_put_str(hashmap, "test_key", &ret);
+  if (!ret)
+    return 0;
+  kh_key(hashmap, pos) = xstrdup("test_key");
+  kh_value(hashmap, pos) = &value;
+  return *kh_value(hashmap, kh_get_str(hashmap, "test_key")) == expected;
+}
+
+int get_nonexisting_kv_from_hashmap_fails() {
+  int value = 14;
+  khint_t pos;
+  kh_str_t *hashmap = kh_init_str();
+  return !kh_get_str(hashmap, "test_key");
+}
+
+int deletion_from_hashmap_sets_flag() {
+  int ret, value = 14;
+  khint_t pos;
+  kh_str_t *hashmap = kh_init_str();
+  pos = kh_put_str(hashmap, "test_key", &ret);
+  if (!ret)
+    return 0;
+  if(!kh_exist(hashmap, pos))
+    return 0;
+  kh_key(hashmap, pos) = xstrdup("test_key");
+  kh_value(hashmap, pos) = &value;
+  kh_del_str(hashmap, pos);
+  return !kh_exist(hashmap, pos);
+}
+
+int deletion_from_hashmap_updates_size() {
+  int ret, value = 14, current_size;
+  khint_t pos;
+  kh_str_t *hashmap = kh_init_str();
+  pos = kh_put_str(hashmap, "test_key", &ret);
+  if (!ret)
+    return 0;
+  kh_key(hashmap, pos) = xstrdup("test_key");
+  kh_value(hashmap, pos) = &value;
+  current_size = hashmap->size;
+  kh_del_str(hashmap, pos);
+  return hashmap->size + 1 == current_size;
+}
+
+int deletion_from_hashmap_does_not_update_kv() {
+  int ret, value = 14, current_size;
+  khint_t pos;
+  kh_str_t *hashmap = kh_init_str();
+  pos = kh_put_str(hashmap, "test_key", &ret);
+  if (!ret)
+    return 0;
+  kh_key(hashmap, pos) = xstrdup("test_key");
+  kh_value(hashmap, pos) = &value;
+  kh_del_str(hashmap, pos);
+  return !strcmp(kh_key(hashmap, pos),"test_key");
+}
+
+int update_value_after_deletion_succeeds() {
+  int ret, value1 = 14, value2 = 15;
+  khint_t pos1, pos2;
+  kh_str_t *hashmap = kh_init_str();
+  // adding the kv to the hashmap
+  pos1 = kh_put_str(hashmap, "test_key", &ret);
+  if (!ret)
+    return 0;
+  kh_key(hashmap, pos1) = xstrdup("test_key");
+  kh_value(hashmap, pos1) = &value1;
+  // delete the kv from the hashmap
+  kh_del_str(hashmap, pos1);
+  // adding the same key with different value
+  pos2 = kh_put_str(hashmap, "test_key", &ret);
+  if (!ret)
+    return 0;
+  kh_key(hashmap, pos2) = xstrdup("test_key");
+  kh_value(hashmap, pos2) = &value2;
+  // check if the value is different
+  return *kh_value(hashmap, kh_get_str(hashmap, "test_key")) == value2;
+}
+
+int hashmap_size_matches_number_of_added_elements() {
+  int ret, value1 = 14, value2 = 15, value3 = 16;
+  khint_t pos1, pos2, pos3;
+  kh_str_t *hashmap = kh_init_str();
+  pos1 = kh_put_str(hashmap, "test_key1", &ret);
+  if (!ret)
+    return 0;
+  kh_key(hashmap, pos1) = xstrdup("test_key1");
+  kh_value(hashmap, pos1) = &value1;
+  pos2 = kh_put_str(hashmap, "test_key2", &ret);
+  if (!ret)
+    return 0;
+  kh_key(hashmap, pos2) = xstrdup("test_key2");
+  kh_value(hashmap, pos2) = &value2;
+  pos3 = kh_put_str(hashmap, "test_key3", &ret);
+  if (!ret)
+    return 0;
+  kh_key(hashmap, pos3) = xstrdup("test_key3");
+  kh_value(hashmap, pos3) = &value3;
+  return kh_size(hashmap) == 3;
+}
+
+int main(void) {
+  plan(14);
+
+  ok(hash_string_succeeds(), "X31 hash value of a string is correct");
+  ok(hash_string_with_non_alphabets_succeeds(),
+     "Get X31 hash value for a string with non-alphabets in it.");
+  ok(initialized_hashmap_pointer_not_null(),
+     "Successfully initialize a hashmap and it's not NULL");
+  ok(put_key_into_hashmap_once_succeeds(),
+     "Put a key into hashmap that doesn't exist.");
+  ok(put_same_key_into_hashmap_twice_fails(),
+     "Put a key into hashmap twice that causes collision.");
+  ok(put_value_into_hashmap_once_succeeds(),
+     "Put a unique key-value pair into hashmap.");
+  ok(put_same_value_into_hashmap_twice_fails(),
+     "Put a key-value pair into hashmap twice that causes collision.");
+  ok(get_existing_kv_from_hashmap_succeeds(),
+     "Get the value from a key that exists in hashmap");
+  ok(get_nonexisting_kv_from_hashmap_fails(),
+     "Get the value from a key that doesn't exist in hashmap");
+  ok(deletion_from_hashmap_sets_flag(),
+     "Delete the key-value from hashmap and the flag is set to false");
+  ok(deletion_from_hashmap_updates_size(),
+     "Delete the key-value from hashmap and the size changes");
+  ok(deletion_from_hashmap_does_not_update_kv(),
+     "Delete the key-value from hashmap but it can be fetched using iterator");
+  ok(update_value_after_deletion_succeeds(),
+     "Updates the value of a key after deleting it");
+  ok(hashmap_size_matches_number_of_added_elements(),
+     "size of hashmap is correct after adding 3 elements");
+
+  return 0;
+}