[v2] tr2: log parent process name

It can be useful to tell who invoked Git - was it invoked manually by a
user via CLI or script? By an IDE? In some cases - like the 'repo'
tool - we can influence the source code and set the
GIT_TRACE2_PARENT_SID environment variable from the caller process. In
the case of 'repo', that parent SID is manipulated to include the string
"repo", which means we can positively identify when Git was invoked by
the 'repo' tool. However,
identifying parents that way requires both that we know which tools
invoke Git and that we have the ability to modify the source code of
those tools. It cannot scale to keep up with the various IDEs and
wrappers which use Git, most of which we don't know about. Learning
which tools and wrappers invoke Git, and how, would give us insight to
decide where to improve Git's usability and performance.

Unfortunately, there's no reliable cross-platform way to gather the name
of the parent process. If procfs is present, we can use that; otherwise
we will need to discover the name another way. However, the process ID
should be sufficient regardless of platform.

Git for Windows gathers similar information and logs it as a "data_json"
event. However, since "data_json" has a variable format, it is difficult
to parse effectively in some languages; instead, let's pursue a
dedicated "cmd_ancestry" event to record information about the ancestry
of the current process in a consistent, parseable way.

Git for Windows also gathers information about more than one parent. On
Linux, further ancestry info can be gathered with procfs, but it's
unwieldy to do so. In the interest of later moving Git for Windows
ancestry logging to the 'cmd_ancestry' event, and in the interest of
later adding more ancestry to the Linux implementation - or of adding
this functionality to other platforms which have an easier time walking
the process tree - let's make 'cmd_ancestry' accept an array of
parentage.
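
As a rough illustration of what "an array of parentage" could look like
on the wire, here is a minimal sketch, not the code in this patch. The
helper names (put, put_str, format_ancestry) and the "ancestry" field
name are assumptions made up for the example; a real implementation
would go through the trace2 target code and its JSON writer.

```c
#include <stddef.h>

/* Hypothetical bounded output buffer, used only by this sketch. */
struct buf {
	char *s;
	size_t len, pos;
	int overflow;
};

/* Append one character, always leaving room for the trailing NUL. */
static void put(struct buf *b, char c)
{
	if (b->pos + 1 < b->len)
		b->s[b->pos++] = c;
	else
		b->overflow = 1;
}

static void put_str(struct buf *b, const char *s)
{
	while (*s)
		put(b, *s++);
}

/*
 * Hypothetical formatter: write a "cmd_ancestry"-style JSON line for a
 * NULL-terminated list of ancestor names (nearest parent first) into
 * `out`. The "ancestry" field name is an assumption for illustration.
 * Returns 0 on success, -1 if `out` is too small.
 */
static int format_ancestry(const char **names, char *out, size_t len)
{
	struct buf b = { out, len, 0, 0 };
	const char *p;
	size_t i;

	if (!len)
		return -1;
	put_str(&b, "{\"event\":\"cmd_ancestry\",\"ancestry\":[");
	for (i = 0; names[i]; i++) {
		if (i)
			put(&b, ',');
		put(&b, '"');
		for (p = names[i]; *p; p++) {
			/* backslash-escape quotes and backslashes */
			if (*p == '"' || *p == '\\')
				put(&b, '\\');
			/* replace control characters rather than escape them */
			put(&b, (unsigned char)*p < 0x20 ? '?' : *p);
		}
		put(&b, '"');
	}
	put_str(&b, "]}");
	out[b.pos] = '\0';
	return b.overflow ? -1 : 0;
}
```

A platform that can only name the immediate parent would pass a
one-element list; a platform that can walk the process tree would pass
more entries without changing the event's shape.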

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---

Hi folks, the comments I received in v1 were of two varieties:
1) "There are better ways to make this platform-safe", and
2) "Your commit message doesn't convince me".
Since I sent v1, though, I also learned a little more about procfs, and
about the trace2 structure overall, so there are some pretty significant
differences from v1:

- I took a look at Jeff H's advice on using a "data_json" event to log
  this and decided it would be a little more flexible to add a new event
  instead. If we want, it'd be feasible to then shoehorn the GfW parent
  tree stuff into this new event too. Doing it this way is definitely
  easier to parse for Google's trace analysis system (which for now
  completely skips "data_json" as it's polymorphic), and also - I think
  - means that we can add more fields later on if we need to (thread
  info, different fields than just /proc/n/comm like exec path, argv,
  whatever).
- Jonathan N also pointed out to me that /proc/n/comm exists, and logs
  the "command name" - excluding argv, excluding path, etc. It seems
  like this is a little safer about keeping personal information of the
  form "myscript.sh --password=hunter2" out of the traces, but would
  still be worrisome for something like "mysupersecretproject.sh". I'm
  not sure whether that means we still want to guard it with a config
  flag, though.
- I also added a lot to the commit message; hopefully it's not too
  rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
  wasn't going to cut it.
- As for testing, I followed the lead of GfW's parentage info - "this
  isn't portable so writing tests for it will suck, just scrub it from
  the tests". Maybe it makes sense to do some more
  platform-specific-ness in the test suite instead? I wasn't sure.
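
For readers unfamiliar with /proc/n/comm, the lookup described in the
list above can be sketched roughly as follows. This is a minimal
illustration assuming a Linux system with procfs mounted at /proc; the
helper name read_comm is hypothetical and this is not the code in
compat/procinfo.c.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the command name of process `pid` from
 * /proc/<pid>/comm into `buf`. The file holds only the command name
 * (no path, no argv), which Linux truncates to 15 characters.
 * Returns 0 on success, -1 if procfs is unavailable or the read fails.
 */
static int read_comm(pid_t pid, char *buf, size_t len)
{
	char path[64];
	FILE *fp;

	if (!len)
		return -1;
	snprintf(path, sizeof(path), "/proc/%ld/comm", (long)pid);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	if (!fgets(buf, (int)len, fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);
	buf[strcspn(buf, "\n")] = '\0'; /* comm ends with a newline */
	return 0;
}
```

Calling read_comm(getppid(), name, sizeof(name)) would yield the
immediate parent's name; on a platform without procfs the fopen() fails
and the caller has to fall back to some other mechanism.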

Thanks, all.
 - Emily

 Makefile                  |  5 ++++
 compat/procinfo.c         | 53 +++++++++++++++++++++++++++++++++++++++
 config.mak.uname          |  1 +
 git-compat-util.h         |  6 +++++
 t/t0210/scrub_normal.perl |  6 +++++
 t/t0211/scrub_perf.perl   |  5 ++++
 t/t0212/parse_events.perl |  5 +++-
 trace2.c                  | 13 ++++++++++
 trace2.h                  | 12 ++++++++-
 trace2/tr2_tgt.h          |  3 +++
 trace2/tr2_tgt_event.c    | 21 ++++++++++++++++
 trace2/tr2_tgt_normal.c   | 19 ++++++++++++++
 trace2/tr2_tgt_perf.c     | 16 ++++++++++++
 13 files changed, 163 insertions(+), 2 deletions(-)
 create mode 100644 compat/procinfo.c

Message ID	20210520210546.4129620-1-emilyshaffer@google.com (mailing list archive)
State	Superseded
Headers	Return-Path: <git-owner@kernel.org> Date: Thu, 20 May 2021 14:05:46 -0700 In-Reply-To: <20210507002908.1495061-1-emilyshaffer@google.com> Message-Id: <20210520210546.4129620-1-emilyshaffer@google.com> Mime-Version: 1.0 Subject: [PATCH v2] tr2: log parent process name From: Emily Shaffer <emilyshaffer@google.com> To: git@vger.kernel.org Cc: Emily Shaffer <emilyshaffer@google.com>, " =?utf-8?b?w4Z2YXIgQXJuZmo=?= =?utf-8?b?w7Zyw7AgQmphcm1hc29u?= " <avarab@gmail.com>, Junio C Hamano <gitster@pobox.com>, Jeff Hostetler <git@jeffhostetler.com>, Bagas Sanjaya <bagasdotme@gmail.com> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	[v2] tr2: log parent process name
