Message ID | 20250124122217.250925-1-usmanakinyemi202@gmail.com (mailing list archive) |
---|---|
Series | Introduce os-version Capability with Configurable Options |
Usman Akinyemi <usmanakinyemi202@gmail.com> writes:

> For debugging, statistical analysis, and security purposes, it can
> be valuable for Git servers to know the operating system the clients
> are using.

OK. I think the reorganization done in this round makes it much easier to see what is going on in each step. Very well done.

The only remaining issue from my point of view is if we really want this as a separate and new knob with capability, or if we would be better off to carry this kind of extra piece of information by enhancing existing "agent" capability. Given what Web Browsers do in their UA strings, it does feel cumbersome for analytics tools to pay attention to two separate input sources (os-version and agent).

Has somebody brought up any downsides of cramming the OS information to the existing agent thing? I have not thought of any possible downsides since I made this suggestion in a previous review of this topic, but I may be missing something obvious, so...

Thanks.
On Fri, Jan 24, 2025 at 7:39 PM Junio C Hamano <gitster@pobox.com> wrote:

> The only remaining issue from my point of view is if we really want
> this as a separate and new knob with capability, or if we would be
> better off to carry this kind of extra piece of information by
> enhancing existing "agent" capability. Given what Web Browsers do
> in their UA strings, it does feel cumbersome for analytics tools to
> pay attention to two separate input sources (os-version and agent).
>
> Has somebody brought up any downsides of cramming the OS information
> to the existing agent thing? I have not thought of any possible
> downsides since I made this suggestion in a previous review of this
> topic, but I may be missing something obvious, so...

My opinion is that it isn't a good idea to enhance the existing "agent" capability. Yeah, it goes in the same direction as what web browsers have been doing with the User-Agent header, but I think web browsers are an especially bad example that we should strive not to follow.

According to Wikipedia (https://en.wikipedia.org/wiki/User-Agent_header) the format for the User-Agent header is now "Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions]", for example "Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405". This is obviously very difficult to parse for everyone including analytics tools and is not very flexible either. It serves as a way to pass information about available features, but leaks some privacy information in the process. The fact that it's used to pass information about available features has led to a lot of user agent spoofing which means that analytics, statistics and debugging are likely harder than they need to be.

When Git developed capabilities and the "agent" capability, the doc took care of saying things that it "MUST NOT be used to programmatically assume the presence or absence of particular features". This was done to go in the direction of not passing more information through this "agent" capability but instead use separate ones.

So I think we should just avoid putting other things in the "agent" capability to avoid what happened to the User-Agent header in browsers and to stay true to our original intent to have a different capability for each advertised piece of information or feature.
Christian Couder <christian.couder@gmail.com> writes:

> information in the process. The fact that it's used to pass
> information about available features has led to a lot of user agent
> spoofing which means that analytics, statistics and debugging are
> likely harder than they need to be.

Yes, that is a valid viewpoint, but ...

> When Git developed capabilities and the "agent" capability, the doc
> took care of saying things that it "MUST NOT be used to
> programmatically assume the presence or absence of particular
> features".

... the proposed os-version thing has the same wording in its documentation, doesn't it? What is being added is not to be used in a way that requires parsing and trusting the result.

So unless your point is that users (like those who parse User-Agent string by browsers) will do the wrong thing and assume these strings are usable for feature detection anyway so we should make it easier to parse, I'd have to disagree. If we are not aiming to make it easier to parse and assume certain things that we do not want them to, I do not see why we need to have the pieces of information in two separate capabilities.

Thanks.
On Mon, Jan 27, 2025 at 4:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > information in the process. The fact that it's used to pass
> > information about available features has led to a lot of user agent
> > spoofing which means that analytics, statistics and debugging are
> > likely harder than they need to be.
>
> Yes, that is a valid viewpoint, but ...
>
> > When Git developed capabilities and the "agent" capability, the doc
> > took care of saying things that it "MUST NOT be used to
> > programmatically assume the presence or absence of particular
> > features".
>
> ... the proposed os-version thing has the same wording in its
> documentation, doesn't it?

Yeah, we repeat it to make sure that users read it. I am fine with refactoring that wording if we think that having it once is enough.

> What is being added is not to be used
> in a way that requires parsing and trusting the result.

Why not? If server people want to do OS stats on their clients, for example, why shouldn't they parse and trust the result?

> So unless your point is that users (like those who parse User-Agent
> string by browsers) will do the wrong thing and assume these strings
> are usable for feature detection anyway so we should make it easier
> to parse, I'd have to disagree.

We should make it easy to parse because people will use this field (otherwise why are we adding it?), and we want to make it easy to use rather than hard just because we are nice to our users. I think we should not assume that they will do the wrong thing, especially if our docs are clear about how it shouldn't be used.

> If we are not aiming to make it
> easier to parse and assume certain things that we do not want them
> to, I do not see why we need to have the pieces of information in
> two separate capabilities.

I think it's just the right thing to make it easy to parse. Doing OS stats on the server side doesn't need to be unnecessarily hard.

By the way, if we put the OS information in the "agent" capability, how do we separate it from the existing "package/version" content and make it easy to parse? I don't see a good solution because GIT_USER_AGENT could be used, and the config option to not show the OS name could be used too. Also we don't know what could be in the "version" part. The doc says that the agent part is typically of the form "package/version" but doesn't require it.

Thanks.
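To make the parsing question above concrete, here is a minimal sketch, not taken from the patch series. It assumes capabilities arrive as "key=value" strings; the function name `parse_capabilities`, the "os-version=Linux" value, and the combined agent string are illustrative assumptions only.

```python
def parse_capabilities(lines):
    """Split 'key[=value]' capability strings into a dict.

    Hypothetical helper for illustration; a capability without '='
    is stored with a value of None.
    """
    caps = {}
    for line in lines:
        key, sep, value = line.partition("=")
        caps[key] = value if sep else None
    return caps


# Separate capabilities: the OS is its own field and needs no further parsing.
separate = parse_capabilities(["agent=git/2.48.0", "os-version=Linux"])
print(separate["os-version"])

# Combined: the OS is buried in a free-form agent value. Because the agent
# is only "typically" package/version and can be overridden through
# GIT_USER_AGENT, there is no reliable rule for splitting it further.
combined = parse_capabilities(["agent=git/2.48.0 Linux"])
print(combined["agent"])
```

With separate capabilities a server reads one key per piece of information; with a combined value it can only treat the whole agent string as one unit.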
Christian Couder <christian.couder@gmail.com> writes:

> By the way, if we put the OS information in the "agent" capability,
> how do we separate it from the existing "package/version" content and
> make it easy to parse?

Do NOT parse, period.

If three "things" that talk the Git protocol on the other end of the connection give "Linux git/2.48.0", and "macOS libgit2/1.9.0", and "Windows git/2.47.1" as their (enhanced) "agent" strings, there is no "ah, this one is 1.9.0 which is way older than 2.47.1 so it must be missing features X and Y" the users of the information are allowed to infer.

Just take it as a single opaque string, and group identical ones.

In the above scenario, we found three different kinds now. Maybe we'll accumulate the counts and notice that there are N times as many connections whose agent string begins with "Windows" as "Linux" and "macOS" combined or something. That would be an offline analysis, and forcing users to do the stats offline would reduce the temptation to use it for purposes other than its intended one.

You may find "ImNotTellingYou" and may wonder what OS the user is really using, but they do not want to tell you, so you honor their wish.

> I don't see a good solution because
> GIT_USER_AGENT could be used, and the config option to not show the OS
> name could be used too.

That is a good privacy measure.

> Also we don't know what could be in the "version" part. The doc says
> that the agent part is typically of the form "package/version" but
> doesn't require it.

Exactly. I would think it is a feature, and the way to treat the string in line with the philosophy behind that feature is to take it as a single opaque thing.
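The "group identical ones" approach described above could look roughly like the following sketch. It is an illustration under assumptions, not part of Git: the helper names `tally_agents` and `count_with_prefix` are hypothetical, and the sample strings are the ones used in this message.

```python
from collections import Counter


def tally_agents(agent_strings):
    """Count identical agent strings without interpreting their contents."""
    return Counter(agent_strings)


def count_with_prefix(counts, prefix):
    """Offline analysis: total connections whose agent string begins with prefix."""
    return sum(n for agent, n in counts.items() if agent.startswith(prefix))


# Sample log of agent strings, each kept as a single opaque value.
observed = [
    "Linux git/2.48.0",
    "macOS libgit2/1.9.0",
    "Windows git/2.47.1",
    "Windows git/2.47.1",
    "ImNotTellingYou",
]

counts = tally_agents(observed)
for agent, n in counts.most_common():
    print(f"{n:3d}  {agent}")

print("Windows connections:", count_with_prefix(counts, "Windows"))
```

Nothing here compares version numbers or infers features; a string such as "ImNotTellingYou" simply becomes one more bucket in the tally.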
On Fri, Jan 31, 2025 at 10:07 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > By the way, if we put the OS information in the "agent" capability,
> > how do we separate it from the existing "package/version" content and
> > make it easy to parse?
>
> Do NOT parse, period.
>
> If three "things" that talk the Git protocol on the other end of the
> connection give "Linux git/2.48.0", and "macOS libgit2/1.9.0", and
> "Windows git/2.47.1" as their (enhanced) "agent" strings, there is
> no "ah, this one is 1.9.0 which is way older than 2.47.1 so it must be
> missing features X and Y" the users of the information are allowed
> to infer.

Hi Junio,

Do you have any concerns about "git/2.47.1 Windows" instead of "Windows git/2.47.1"?

Thank you.

>
> Just take it as a single opaque string, and group identical ones.
>
> In the above scenario, we found three different kinds now. Maybe
> we'll accumulate the counts and notice that there are N times as
> many connections whose agent string begins with "Windows" as "Linux"
> and "macOS" combined or something. That would be an offline
> analysis, and forcing users to do the stats offline would reduce the
> temptation to use it for purposes other than its intended one.
>
> You may find "ImNotTellingYou" and may wonder what OS the user is
> really using, but they do not want to tell you, so you honor their
> wish.
>
> > I don't see a good solution because
> > GIT_USER_AGENT could be used, and the config option to not show the OS
> > name could be used too.
>
> That is a good privacy measure.
>
> > Also we don't know what could be in the "version" part. The doc says
> > that the agent part is typically of the form "package/version" but
> > doesn't require it.
>
> Exactly. I would think it is a feature, and the way to treat the
> string in line with the philosophy behind that feature is to take it
> as a single opaque thing.
On Fri, Jan 31, 2025 at 10:07 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > By the way, if we put the OS information in the "agent" capability,
> > how do we separate it from the existing "package/version" content and
> > make it easy to parse?
>
> Do NOT parse, period.
>
> If three "things" that talk the Git protocol on the other end of the
> connection give "Linux git/2.48.0", and "macOS libgit2/1.9.0", and
> "Windows git/2.47.1" as their (enhanced) "agent" strings, there is
> no "ah, this one is 1.9.0 which is way older than 2.47.1 so it must be
> missing features X and Y" the users of the information are allowed
> to infer.
>
> Just take it as a single opaque string, and group identical ones.
>
> In the above scenario, we found three different kinds now. Maybe
> we'll accumulate the counts and notice that there are N times as
> many connections whose agent string begins with "Windows" as "Linux"
> and "macOS" combined or something. That would be an offline
> analysis, and forcing users to do the stats offline would reduce the
> temptation to use it for purposes other than its intended one.
>
> You may find "ImNotTellingYou" and may wonder what OS the user is
> really using, but they do not want to tell you, so you honor their
> wish.

While the current implementation allows users to specify this form of string, i.e. "ImNotTellingYou", for the agent value, it is not mentioned in the docs. I will add it in the next iteration.

>
> > I don't see a good solution because
> > GIT_USER_AGENT could be used, and the config option to not show the OS
> > name could be used too.
>
> That is a good privacy measure.
>
> > Also we don't know what could be in the "version" part. The doc says
> > that the agent part is typically of the form "package/version" but
> > doesn't require it.
>
> Exactly. I would think it is a feature, and the way to treat the
> string in line with the philosophy behind that feature is to take it
> as a single opaque thing.
Usman Akinyemi <usmanakinyemi202@gmail.com> writes:

> Do you have any concerns about "git/2.47.1 Windows" instead of
> "Windows git/2.47.1"?

Either is fine. I expect that

(1) Implementors on _our_ side will do the sensible thing and reviewers help them to make sure, where the definition of "the sensible thing" will be that whatever order we pick, we consistently use that same order. If "git/2.47.1 Windows" is how GfW identifies itself, "git/2.48.1 Linux" or "git/2.49.0 macOS" would be its contemporary counterparts, and _our_ binaries would not identify themselves as "Linux git/2.49.0".

(2) Implementors of third-party reimplementations of Git will just mimic what we will do, as long as we tell them our intention (i.e. this is a single opaque unparsable string to be collected for statistics, nothing more) clearly enough.

(3) Most users are lazy and/or trusting enough that only a very small minority of privacy-conscious folks would configure it away, making their "IamNotTellingYou" agent string merely an insignificant noise in the statistics.
Usman Akinyemi <usmanakinyemi202@gmail.com> writes:

>> You may find "ImNotTellingYou" and may wonder what OS the user is
>> really using, but they do not want to tell you, so you honor their
>> wish.
> While the current implementation allows users to specify this form of string,
> i.e. "ImNotTellingYou", for the agent value, it is not mentioned in the docs.
> I will add it in the next iteration.

OK. You may want to wait to hear others' opinions, though, for at least the time it takes for the earth to rotate once.

Thanks.