mbox series

[v4,00/22] git-p4: Various code tidy-ups

Message ID 20220210164627.279520-1-jholdsworth@nvidia.com (mailing list archive)
Headers show
Series git-p4: Various code tidy-ups | expand

Message

Joel Holdsworth Feb. 10, 2022, 4:46 p.m. UTC
This patch set contains multiple patches to improve consistency and
tidyness of the git-p4 script's code style.

Many of these patches have been driven by the guidlines contained in the
Python PEP8 "Style Guide for Python Code" and were applied using a
mixture of human intervention, and tools including autopep8 and
pycodestyle.

This patch-set stops short of bringing git-p4 into full PEP8 compliance,
most notably in regard to the following items:

  - There is no patch to apply the recommended column limit of
    79-characters,
  - There is no patch to correct hanging indents of multi-line
    declarations such as multi-line function delcarations, function
    invocations, etc.

Patches to correct these items may be provided later.

This fourth version of the patch-set corrects the rebasing errors that
were submitted in the previous version.

Joel Holdsworth (22):
  git-p4: add blank lines between functions and class definitions
  git-p4: remove unneeded semicolons from statements
  git-p4: indent with 4-spaces
  git-p4: improve consistency of docstring formatting
  git-p4: convert descriptive class and function comments into
    docstrings
  git-p4: remove commented code
  git-p4: sort and de-duplcate pylint disable list
  git-p4: remove padding from lists, tuples and function arguments
  git-p4: remove spaces around default arguments
  git-p4: removed brackets when assigning multiple return values
  git-p4: place a single space after every comma
  git-p4: remove extraneous spaces before function arguments
  git-p4: remove redundant backslash-continuations inside brackets
  git-p4: remove spaces between dictionary keys and colons
  git-p4: ensure every comment has a single #
  git-p4: ensure there is a single space around all operators
  git-p4: normalize indentation of lines in conditionals
  git-p4: compare to singletons with "is" and "is not"
  git-p4: only seperate code blocks by a single empty line
  git-p4: move inline comments to line above
  git-p4: seperate multiple statements onto seperate lines
  git-p4: sort imports

 git-p4.py | 882 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 525 insertions(+), 357 deletions(-)

Comments

Tao Klerks April 2, 2022, 12:14 p.m. UTC | #1
On Thu, Feb 10, 2022 at 5:46 PM Joel Holdsworth <jholdsworth@nvidia.com> wrote:
>
> This patch set contains multiple patches to improve consistency and
> tidyness of the git-p4 script's code style.
>
> Many of these patches have been driven by the guidlines contained in the
> Python PEP8 "Style Guide for Python Code" and were applied using a
> mixture of human intervention, and tools including autopep8 and
> pycodestyle.
>
> This patch-set stops short of bringing git-p4 into full PEP8 compliance,
> most notably in regard to the following items:
>
>   - There is no patch to apply the recommended column limit of
>     79-characters,
>   - There is no patch to correct hanging indents of multi-line
>     declarations such as multi-line function delcarations, function
>     invocations, etc.
>

I love the direction of cleaning this up and making it compliant with
*something* :)

I've tried to run pylint on the previous state and state after this
patch, but unfortunately there's a *lot* of noise either way - from
all the "pylint: disable" entries in the script I have to assume that
at some point it was compliant with *some* pylint version, but at the
moment it's very far from any sort of compliance with checks I can run
(both before and after this patchset).

I have a few questions about the changes - I don't think they're
specific to any single commit so I'll list them here:
1) Is there a tool that can be used to check for PEP8 compliance, and
shows only the two remaining issues you highlighted above?
2) What is the relationship between "git-p4" and "git-p4.py"? Before
this patchset they are identical except for the shebang line, after
this patchset all these fixes are applied only to one of them. I
assume the changes should be made to both files in a coordinated
fashion?
3) Which of the "pylint: disable" entries remain meaningful after
these changes, if any? I imagine "disable=wrong-import-order" for
example makes no sense?

I personally have an interest in making sure this script keeps running
correctly under python2, so I plan to test this when I can. I imagine
this is already accounted for in the t98xx suite somewhere, but I
haven't found it.

Thanks,
Tao
Tao Klerks April 2, 2022, 5:45 p.m. UTC | #2
On Sat, Apr 2, 2022 at 2:14 PM Tao Klerks <tao@klerks.biz> wrote:
>
> 2) What is the relationship between "git-p4" and "git-p4.py"? Before
> this patchset they are identical except for the shebang line, after
> this patchset all these fixes are applied only to one of them. I
> assume the changes should be made to both files in a coordinated
> fashion?

Please ignore this question - after some playing around I finally
realized "git-p4" is a build product, and ended up different because I
had not built yet after applying your patch.
Joel Holdsworth April 4, 2022, 2:46 p.m. UTC | #3
> I've tried to run pylint on the previous state and state after this patch, but
> unfortunately there's a *lot* of noise either way - from all the "pylint:
> disable" entries in the script I have to assume that at some point it was
> compliant with *some* pylint version, but at the moment it's very far from
> any sort of compliance with checks I can run (both before and after this
> patchset).

I'm using pylint 2.12.2 on Python 3. I agree there are still many pylint warnings - especially if we start trying to cut down the disable list. However, quite a few of them are false positives.

The majority of the unsuppressed warnings I get from pylint suggest changes requiring Python 3 e.g. conversion to F-Strings.

> 1) Is there a tool that can be used to check for PEP8 compliance, and shows
> only the two remaining issues you highlighted above?

I used autopep8 to locate the issue I fixed here. It's not nearly as thorough as pylint, but PEP8 itself is very patchy in regard to the issues the authors chose to address.

> 3) Which of the "pylint: disable" entries remain meaningful after these
> changes, if any? I imagine "disable=wrong-import-order" for example makes
> no sense?

I don't know if it is possible to cut the list down any more yet. However, "wrong-import-order" certainly is required to suppress warnings about the Python version check mixed into the import list.

> I personally have an interest in making sure this script keeps running correctly
> under python2, so I plan to test this when I can. I imagine this is already
> accounted for in the t98xx suite somewhere, but I haven't found it.

Can you give me a bit more information about this, because I was going to try and push for the git-p4 to discard Python 2 support later this year in a bid to simplify the code, reduce the testing burden, and move toward cleaner pylint output.

Thanks for your comments
Joel
Tao Klerks April 5, 2022, 4:27 a.m. UTC | #4
On Mon, Apr 4, 2022 at 4:46 PM Joel Holdsworth <jholdsworth@nvidia.com> wrote:
>
> I'm using pylint 2.12.2 on Python 3. I agree there are still many pylint warnings - especially if we start trying to cut down the disable list. However, quite a few of them are false positives.
>
[...]
>
> I used autopep8 to locate the issue I fixed here. It's not nearly as thorough as pylint, but PEP8 itself is very patchy in regard to the issues the authors chose to address.
>

Thx for the feedback! I will play.

>
> > I personally have an interest in making sure this script keeps running correctly
> > under python2, so I plan to test this when I can. I imagine this is already
> > accounted for in the t98xx suite somewhere, but I haven't found it.
>
> Can you give me a bit more information about this, because I was going to try and push for the git-p4 to discard Python 2 support later this year in a bid to simplify the code, reduce the testing burden, and move toward cleaner pylint output.
>

Now that you question it, I'm thinking a bit about how I got there:

I work with a Perforce repository that has a long and colorful
history, with users from many countries on at least three different
platforms (windows, linux, osx), and as such both user metadata and
changelist descriptions have used inconsistent encodings.

When running git-p4 under python2, any author and CL description
metadata that is not valid utf8 is simply carried over as-is, and the
"invalidly encoded" data just carries into the git repo. (that can
later be interesting when working with the resulting git repo in
python3, of course - at a minimum decoding errors then need to be
swallowed, when minor data loss is acceptable)

Under python3, the import fails when some of this stuff gets decoded
into unicode strings. Obviously the right thing to do is to figure out
exactly what fails under python3, and figure out how to fix it, but I
have historically just swept it under the rug and kept using python2
which doesn't bother trying to "interpret" the inconsistently-encoded
byte sequences.

I'll try to look into this and see whether git-p4 itself can be made
to be more forgiving under python3, or whether there is an even better
solution where inconsistently-encoded Perforce metadata can somehow be
harmonized. Either way, I'll try to contribute some
test_expect_failure tests.
Tao Klerks April 5, 2022, 11:29 a.m. UTC | #5
On Tue, Apr 5, 2022 at 6:27 AM Tao Klerks <tao@klerks.biz> wrote:
>
> On Mon, Apr 4, 2022 at 4:46 PM Joel Holdsworth <jholdsworth@nvidia.com> wrote:
> >
> >
> > Can you give me a bit more information about this, because I was going to try and push for the git-p4 to discard Python 2 support later this year in a bid to simplify the code, reduce the testing burden, and move toward cleaner pylint output.
> >
>
> I'll try to look into this and see whether git-p4 itself can be made
> to be more forgiving under python3, or whether there is an even better
> solution where inconsistently-encoded Perforce metadata can somehow be
> harmonized. Either way, I'll try to contribute some
> test_expect_failure tests.

So, an initial test suggests that a recent version of git-p4 at least
doesn't fail in the same way under python3, in the face of at least
some of these encoding issues. I don't know yet whether failures will
occur in other places, nor whether the not-failing behavior is better,
worse or the same as I had under python2, but it seems plausible that
I won't be filing any test_expect_failure tests after all, and will
instead say "yay, python3 ftw!"
Joel Holdsworth April 5, 2022, 12:04 p.m. UTC | #6
> So, an initial test suggests that a recent version of git-p4 at least doesn't fail in
> the same way under python3, in the face of at least some of these encoding
> issues. I don't know yet whether failures will occur in other places, nor
> whether the not-failing behavior is better, worse or the same as I had under
> python2, but it seems plausible that I won't be filing any test_expect_failure
> tests after all, and will instead say "yay, python3 ftw!"

That would be fabulous.

I myself have a repository that has a variety of such issues. A common case is CP-1252 Smart Quote characters produced on Windows which are incompatible with UTF-8, without explicit conversion.

However, a lot of these problems can be avoided by simply avoiding conversion to text in the first place. In many cases the incoming data doesn't need to be converted and can be passed around as binary. I'm slowly working toward this goal, and once this patch-set it merged I have a couple of other decoding patches in the pipeline. If you have any other failure cases, please do submit them as test cases, or bug reports at least.

I would prefer the script to discard Python 2 support, but even if the consensus is to retain it, Python 3 forces us to address these issues which is a great step in the right direction.

Joel
Tao Klerks April 5, 2022, 5:16 p.m. UTC | #7
On Tue, Apr 5, 2022 at 2:04 PM Joel Holdsworth <jholdsworth@nvidia.com> wrote:
>
> > So, an initial test suggests that a recent version of git-p4 at least doesn't fail in
> > the same way under python3, in the face of at least some of these encoding
> > issues. I don't know yet whether failures will occur in other places, nor
> > whether the not-failing behavior is better, worse or the same as I had under
> > python2, but it seems plausible that I won't be filing any test_expect_failure
> > tests after all, and will instead say "yay, python3 ftw!"
>
> That would be fabulous.
>

Indeed, but completely untrue in the end. I had simply lost my
"python3" override in the update (and failed to notice). It works just
like before, unfortunately, so I'm back on track to try to reproduce
under controlled conditions, and propose fixes.

> I myself have a repository that has a variety of such issues. A common case is CP-1252 Smart Quote characters produced on Windows which are incompatible with UTF-8, without explicit conversion.
>

I was perplexed as to how you could still have these issues, if they
had magically cleared up for me. Now I know - they had not.

> However, a lot of these problems can be avoided by simply avoiding conversion to text in the first place. In many cases the incoming data doesn't need to be converted and can be passed around as binary.

That's certainly the behavior in git-p4 under python2, but I'm pretty
sure it's not "right". Git has a formal and coherent strategy for
encoding - commit (meta)data is interpreted as utf-8 unless an
"encoding" header is present in the commit. Git's output is utf-8
(with on-the-fly re-encoding if commits do have a different encoding
specified), unless you explicitly ask it to output in a different
encoding. These processes are "forgiving", in that a set of bytes that
cannot be interpreted as utf-8, in a commit that does not explicitly
specify an encoding, will be output as-is rather than being rejected
or warned about... But there is a "right way" to store data. When
CP-1252 (high-range) bytes are simply "passed through" into the git
data by python2, the resulting git commits cannot be displayed
correctly by most or any git clients - they expect utf-8.

On the other hand, in Perforce that does not appear to be the case at
all: as far as I can tell, p4v in windows is simply using windows's
active "legacy codepage", and reading or writing bytes accordingly
(CP1252 in my case); in windows cmd, same thing by default; in linux,
where utf8 is the norm nowadays, I get "unprintable character" output
when something was submitted from a CP-1252 environment (same as I do
for the python2-imported git commits, on any platform). If I switch a
windows cmd terminal to utf8 ("chcp 65001"), it behaves the same. If I
create a pending changelist with non-ANSI characters in Linux they get
submitted as utf8-encoded, and when I view the same pending changelist
in windows (from the command-line in its default legacy codepage, or
in p4v which also seems to use the legacy codepage), I see the classic
utf8-interpreted-as-CP1252 artifact "Æ" instead of "Æ".

I don't see any sort of "codepage selection" configuration either in
the perforce clients, or in the connection / server relationship.
Perforce tooling seems to simply assume that the Operating Systems of
all users will be configured with the same codepage - the bytes are
the bytes, clients encode and decode them however they will, and there
is simply no encoding metadata. US/western Windows clients, modern
linux clients, older mac clients, and cyrillic-configured clients, etc
will all have produced mutually-unintelligible content in the same
repo over the years.

In *my* environment, a majority of users have been using Windows over
the years to write commit messages, and most will have had CP-1252
configured, so I believe the most reasonable behavior would be "try
interpreting the bytes as utf-8, and if that fails then fall back to
interpreting as CP-1252, and either way, store the result in git as
utf-8" - but I don't know whether that is the right strategy for a
majority of environments. This strategy makes sense for me because
it's very rare for something that *can* be correctly read as utf-8 to,
in fact, not be intended to be utf-8; that was kind of the point in
the standard's development - so starting by *trying* to interpret as
utf-8 is a safe bet, and has value if *any* of your clients are likely
to have used it. In my environment there are linux users, and in fact
p4 user management happens in linux, so user names/descriptions are
also utf-8.

For other environments I suspect the "fallback codepage" should be
configurable (I don't know how codepage identifiers work
cross-platform in python, that sounds like it might be fun). You could
imagine having an "auto-detect" fallback option, but for the reasons
identified above this would need to operate on a per-string basis to
have high value, and I think CL descriptions (and user names) are
simply too short for autodetection to be a viable option.

Another reasonable strategy (less optimal for my environment) would be
to follow the "python2" approach of writing the bytes into git as they
were read from p4, but also additionally *telling* git what the
encoding is expected to be, by setting the "encoding" header
accordingly in the fast-import data. In my case that wouldn't work
well because, again, author names are utf-8, but commit messages are
more commonly CP-1252. Git expects all the strings in the commit to
use the same encoding (unless I've missed something).


> I'm slowly working toward this goal, and once this patch-set it merged I have a couple of other decoding patches in the pipeline. If you have any other failure cases, please do submit them as test cases, or bug reports at least.
>

I'd love to know what approach you're looking at!

I only use git-p4 for mirroring p4 activity into git, and not the
other way around, so my perspective on this whole topic might be a bit
skewed or short-sighted. For example, if we do end up storing the data
in git as utf-8, as git expects, then when someone comes to create and
submit p4 changelists from git activity, they might presumably want
that stuff to be output-encoded as CP-1252, if that is the more common
convention on their Perforce server...?

> I would prefer the script to discard Python 2 support, but even if the consensus is to retain it, Python 3 forces us to address these issues which is a great step in the right direction.
>

Right - discontinuing python 2 forces the elaboration of *some* sort
of strategy, besides the python2 default of "oh, just import
effectively-corrupt bytestreams into the git repo". That sounds like a
good thing :)
Tao Klerks April 11, 2022, 9:54 a.m. UTC | #8
On Tue, Apr 5, 2022 at 7:16 PM Tao Klerks <tao@klerks.biz> wrote:
>
> On Tue, Apr 5, 2022 at 2:04 PM Joel Holdsworth <jholdsworth@nvidia.com> wrote:
> >
> > > So, an initial test suggests that a recent version of git-p4 at least doesn't fail in
> > > the same way under python3, in the face of at least some of these encoding
> > > issues. I don't know yet whether failures will occur in other places, nor
> > > whether the not-failing behavior is better, worse or the same as I had under
> > > python2, but it seems plausible that I won't be filing any test_expect_failure
> > > tests after all, and will instead say "yay, python3 ftw!"
> >
> > That would be fabulous.
> >
>

I went a little further in the end, and implemented a full "fix" that
I believe makes sense. I've just submitted as RFC.

> > I myself have a repository that has a variety of such issues. A common case is CP-1252 Smart Quote characters produced on Windows which are incompatible with UTF-8, without explicit conversion.
> >
>

I'd love to know how the proposed patch works for you.

> > However, a lot of these problems can be avoided by simply avoiding conversion to text in the first place. In many cases the incoming data doesn't need to be converted and can be passed around as binary.
>
> That's certainly the behavior in git-p4 under python2, but I'm pretty
> sure it's not "right". Git has a formal and coherent strategy for
> encoding - commit (meta)data is interpreted as utf-8 unless an
> "encoding" header is present in the commit. Git's output is utf-8
> (with on-the-fly re-encoding if commits do have a different encoding
> specified), unless you explicitly ask it to output in a different
> encoding. These processes are "forgiving", in that a set of bytes that
> cannot be interpreted as utf-8, in a commit that does not explicitly
> specify an encoding, will be output as-is rather than being rejected
> or warned about... But there is a "right way" to store data. When
> CP-1252 (high-range) bytes are simply "passed through" into the git
> data by python2, the resulting git commits cannot be displayed
> correctly by most or any git clients - they expect utf-8.
>
> On the other hand, in Perforce that does not appear to be the case at
> all: as far as I can tell, p4v in windows is simply using windows's
> active "legacy codepage", and reading or writing bytes accordingly
> (CP1252 in my case); in windows cmd, same thing by default; in linux,
> where utf8 is the norm nowadays, I get "unprintable character" output
> when something was submitted from a CP-1252 environment (same as I do
> for the python2-imported git commits, on any platform). If I switch a
> windows cmd terminal to utf8 ("chcp 65001"), it behaves the same. If I
> create a pending changelist with non-ANSI characters in Linux they get
> submitted as utf8-encoded, and when I view the same pending changelist
> in windows (from the command-line in its default legacy codepage, or
> in p4v which also seems to use the legacy codepage), I see the classic
> utf8-interpreted-as-CP1252 artifact "Æ" instead of "Æ".
>
> I don't see any sort of "codepage selection" configuration either in
> the perforce clients, or in the connection / server relationship.
> Perforce tooling seems to simply assume that the Operating Systems of
> all users will be configured with the same codepage - the bytes are
> the bytes, clients encode and decode them however they will, and there
> is simply no encoding metadata. US/western Windows clients, modern
> linux clients, older mac clients, and cyrillic-configured clients, etc
> will all have produced mutually-unintelligible content in the same
> repo over the years.
>
> In *my* environment, a majority of users have been using Windows over
> the years to write commit messages, and most will have had CP-1252
> configured, so I believe the most reasonable behavior would be "try
> interpreting the bytes as utf-8, and if that fails then fall back to
> interpreting as CP-1252, and either way, store the result in git as
> utf-8" - but I don't know whether that is the right strategy for a
> majority of environments. This strategy makes sense for me because
> it's very rare for something that *can* be correctly read as utf-8 to,
> in fact, not be intended to be utf-8; that was kind of the point in
> the standard's development - so starting by *trying* to interpret as
> utf-8 is a safe bet, and has value if *any* of your clients are likely
> to have used it. In my environment there are linux users, and in fact
> p4 user management happens in linux, so user names/descriptions are
> also utf-8.
>
> For other environments I suspect the "fallback codepage" should be
> configurable (I don't know how codepage identifiers work
> cross-platform in python, that sounds like it might be fun). You could
> imagine having an "auto-detect" fallback option, but for the reasons
> identified above this would need to operate on a per-string basis to
> have high value, and I think CL descriptions (and user names) are
> simply too short for autodetection to be a viable option.
>
> Another reasonable strategy (less optimal for my environment) would be
> to follow the "python2" approach of writing the bytes into git as they
> were read from p4, but also additionally *telling* git what the
> encoding is expected to be, by setting the "encoding" header
> accordingly in the fast-import data. In my case that wouldn't work
> well because, again, author names are utf-8, but commit messages are
> more commonly CP-1252. Git expects all the strings in the commit to
> use the same encoding (unless I've missed something).
>

After working through this, I decided there was no practical value in
using the "encoding" header, *given* that the real encodings are
technically unknowable. There doesn't seem to be any advantage to
keeping the original bytes and slapping an "encoding" header into
place, over simply converting the original bytes into utf-8 following
an assumed (overridable) fallback encoding.

That said, I did keep the original python2 "just copy the bytes over"
strategy in place, and made it available under python3, in case anyone
(you?) prefer that over the attempt to convert to utf-8.

>
> > I'm slowly working toward this goal, and once this patch-set it merged I have a couple of other decoding patches in the pipeline. If you have any other failure cases, please do submit them as test cases, or bug reports at least.
> >
>

I did not find any issues and did mark "Reviewed-By" for your "git-p4:
Various code tidy-ups" v5 patchset, fwiw. (when I eventually found it
in my gmail spam...:( )

Examples of my problem cases are included in my new "[RFC] git-p4:
improve encoding handling to support inconsistent encodings" patch.
This patch has minor conflicts with your code tidy-ups, but if I'm
reading Junio's latest updates correctly, the tidy-ups are headed to
"next" so I will re-roll when they get there.

>
> > I would prefer the script to discard Python 2 support, but even if the consensus is to retain it, Python 3 forces us to address these issues which is a great step in the right direction.
> >
>

Given the (unintentional, but significant) behavior differences
between running under python2 and python3, I decided that it would be
best to support existing behaviors either way - be able to test the
new fallback behavior under python2, be able to maintain the old
behavior under python3, and be able to maintain python3's strict
behavior as well if desired (if, for example, cp1252 is likely to be a
bad guess, and you want to handle problems explicitly).

Obviously, the submitted patch is RFC - please let me know if you see
a better path forward!

Thanks,
Tao