mbox series

[v3,0/5] Fix a whole host of nvmem registration/cleanup issues

Message ID Y7RezbPSGrO37NZZ@shell.armlinux.org.uk (mailing list archive)
Headers show
Series Fix a whole host of nvmem registration/cleanup issues | expand

Message

Russell King (Oracle) Jan. 3, 2023, 4:58 p.m. UTC
Hi,

This series fixes a whole host of nvmem registration/error cleanup
issues that have been identified by both Hector and myself. It is a
substantial rework of my original patch fixing the first problem.

The first most obvious problem is the race between nvmem registration
and use, which leads to sporadic failures of drivers to probe at boot
time.

While fixing this, it has been noticed that a recent fix to check the
return value of dev_set_name() introduced a new bug where wp_gpio was
not being put in that newly introduced error path.

Then there's a fix for a previous fix which itself purports to fix
another bug, but results in the allocated ID being leaked. Fix for a
fix for a fix is not good!

Then there's an error in the docbook documentation for wp_gpio (it's
listed as wp-gpio instead) but as nothing seems to set wp_gpio, we
might as well get rid of it - which also solves the issue that we
call gpiod_put() on this whether we own it or not.

Lastly, there's a fix for yet another spurious white-space in this
code, one of what seems to be a long history of past white-space
fixes.

These patches have been individually build-tested in the order of
posting, but not run-time tested except for the entire series.

 drivers/nvmem/core.c           | 51 ++++++++++++++++++------------------------
 include/linux/nvmem-provider.h |  2 --
 2 files changed, 22 insertions(+), 31 deletions(-)

Comments

Hector Martin Jan. 3, 2023, 6:12 p.m. UTC | #1
On 04/01/2023 01.58, Russell King (Oracle) wrote:
> Hi,
> 
> This series fixes a whole host of nvmem registration/error cleanup
> issues that have been identified by both Hector and myself. It is a
> substantial rework of my original patch fixing the first problem.
> 
> The first most obvious problem is the race between nvmem registration
> and use, which leads to sporadic failures of drivers to probe at boot
> time.
> 
> While fixing this, it has been noticed that a recent fix to check the
> return value of dev_set_name() introduced a new bug where wp_gpio was
> not being put in that newly introduced error path.
> 
> Then there's a fix for a previous fix which itself purports to fix
> another bug, but results in the allocated ID being leaked. Fix for a
> fix for a fix is not good!
> 
> Then there's an error in the docbook documentation for wp_gpio (it's
> listed as wp-gpio instead) but as nothing seems to set wp_gpio, we
> might as well get rid of it - which also solves the issue that we
> call gpiod_put() on this whether we own it or not.
> 
> Lastly, there's a fix for yet another spurious white-space in this
> code, one of what seems to be a long history of past white-space
> fixes.
> 
> These patches have been individually build-tested in the order of
> posting, but not run-time tested except for the entire series.
> 
>  drivers/nvmem/core.c           | 51 ++++++++++++++++++------------------------
>  include/linux/nvmem-provider.h |  2 --
>  2 files changed, 22 insertions(+), 31 deletions(-)
> 

Uhh. The series itself looks fine as far as fixing the problems, but I
fail to see how this is any better than my attempt as far as backporting
or commit atomicity goes. Patch #4 fixes the newer gpio leak bug *and*
half fixes the race condition bug, then patch #5 completes the race
condition fix but now depends on #4, meaning you're left with exactly
the same backporting mess since now you can't apply #5 to older kernels
and #4 only to newer ones. Splitting the commits like this buys you nothing.

I thought we were doing minimal backportable fixes to solve this, but
your commit message for #4 literally says "While a minimal fix for this
would be to add the gpiod_put() call, we can do better if we split
device_register() [...]"... and then that whole "let's do better" part
is what breaks the backportability again.

And then of course if you *do* manage to queue at least #4 to be
backported to a newer subset of stable trees, #3 certainly isn't going
to get backported itself (since it's just removing dead code, not
eligible for stable since it fixes no actual bugs), but then you're left
with the same
broken-on-paper-except-nobody-uses-it-anyway-so-it-doesn't-matter
situation my v2 left us in for those stable kernels.

That said, thanks for identifying that nobody uses the functionality I
supposedly regressed (in a tiny corner case code path where it was
already broken anyway) in my v2, and therefore I didn't actually regress
anything in practice and strictly fixed real bugs.

- Hector
Russell King (Oracle) Jan. 3, 2023, 8:56 p.m. UTC | #2
Really not interested in your politics. Not interested in fixing this
problem.

I'll use these patches to fix the problem in my tree. I don't care about
mainline.

On Wed, Jan 04, 2023 at 03:12:44AM +0900, Hector Martin wrote:
> On 04/01/2023 01.58, Russell King (Oracle) wrote:
> > Hi,
> > 
> > This series fixes a whole host of nvmem registration/error cleanup
> > issues that have been identified by both Hector and myself. It is a
> > substantial rework of my original patch fixing the first problem.
> > 
> > The first most obvious problem is the race between nvmem registration
> > and use, which leads to sporadic failures of drivers to probe at boot
> > time.
> > 
> > While fixing this, it has been noticed that a recent fix to check the
> > return value of dev_set_name() introduced a new bug where wp_gpio was
> > not being put in that newly introduced error path.
> > 
> > Then there's a fix for a previous fix which itself purports to fix
> > another bug, but results in the allocated ID being leaked. Fix for a
> > fix for a fix is not good!
> > 
> > Then there's an error in the docbook documentation for wp_gpio (it's
> > listed as wp-gpio instead) but as nothing seems to set wp_gpio, we
> > might as well get rid of it - which also solves the issue that we
> > call gpiod_put() on this whether we own it or not.
> > 
> > Lastly, there's a fix for yet another spurious white-space in this
> > code, one of what seems to be a long history of past white-space
> > fixes.
> > 
> > These patches have been individually build-tested in the order of
> > posting, but not run-time tested except for the entire series.
> > 
> >  drivers/nvmem/core.c           | 51 ++++++++++++++++++------------------------
> >  include/linux/nvmem-provider.h |  2 --
> >  2 files changed, 22 insertions(+), 31 deletions(-)
> > 
> 
> Uhh. The series itself looks fine as far as fixing the problems, but I
> fail to see how this is any better than my attempt as far as backporting
> or commit atomicity goes. Patch #4 fixes the newer gpio leak bug *and*
> half fixes the race condition bug, then patch #5 completes the race
> condition fix but now depends on #4, meaning you're left with exactly
> the same backporting mess since now you can't apply #5 to older kernels
> and #4 only to newer ones. Splitting the commits like this buys you nothing.
> 
> I thought we were doing minimal backportable fixes to solve this, but
> your commit message for #4 literally says "While a minimal fix for this
> would be to add the gpiod_put() call, we can do better if we split
> device_register() [...]"... and then that whole "let's do better" part
> is what breaks the backportability again.
> 
> And then of course if you *do* manage to queue at least #4 to be
> backported to a newer subset of stable trees, #3 certainly isn't going
> to get backported itself (since it's just removing dead code, not
> eligible for stable since it fixes no actual bugs), but then you're left
> with the same
> broken-on-paper-except-nobody-uses-it-anyway-so-it-doesn't-matter
> situation my v2 left us in for those stable kernels.
> 
> That said, thanks for identifying that nobody uses the functionality I
> supposedly regressed (in a tiny corner case code path where it was
> already broken anyway) in my v2, and therefore I didn't actually regress
> anything in practice and strictly fixed real bugs.
> 
> - Hector
>
Srinivas Kandagatla Jan. 3, 2023, 9:15 p.m. UTC | #3
Hi Russell,


On 03/01/2023 16:58, Russell King (Oracle) wrote:
> Hi,
> 
> This series fixes a whole host of nvmem registration/error cleanup
> issues that have been identified by both Hector and myself. It is a
> substantial rework of my original patch fixing the first problem.
> 
> The first most obvious problem is the race between nvmem registration
> and use, which leads to sporadic failures of drivers to probe at boot
> time.
> 
> While fixing this, it has been noticed that a recent fix to check the
> return value of dev_set_name() introduced a new bug where wp_gpio was
> not being put in that newly introduced error path.
> 
> Then there's a fix for a previous fix which itself purports to fix
> another bug, but results in the allocated ID being leaked. Fix for a
> fix for a fix is not good!
> 
> Then there's an error in the docbook documentation for wp_gpio (it's
> listed as wp-gpio instead) but as nothing seems to set wp_gpio, we
> might as well get rid of it - which also solves the issue that we
> call gpiod_put() on this whether we own it or not.
> 
> Lastly, there's a fix for yet another spurious white-space in this
> code, one of what seems to be a long history of past white-space
> fixes.
> 
> These patches have been individually build-tested in the order of
> posting, but not run-time tested except for the entire series.


thanks for fixing these issues, I have applied these after fixing the 
subject on all the patches, as it ended up with email ids in subject.


thanks,
Srini
> 
>   drivers/nvmem/core.c           | 51 ++++++++++++++++++------------------------
>   include/linux/nvmem-provider.h |  2 --
>   2 files changed, 22 insertions(+), 31 deletions(-)
>
Hector Martin Jan. 4, 2023, 1:15 a.m. UTC | #4
On 04/01/2023 06.15, Srinivas Kandagatla wrote:
> Hi Russell,
> 
> 
> On 03/01/2023 16:58, Russell King (Oracle) wrote:
>> Hi,
>>
>> This series fixes a whole host of nvmem registration/error cleanup
>> issues that have been identified by both Hector and myself. It is a
>> substantial rework of my original patch fixing the first problem.
>>
>> The first most obvious problem is the race between nvmem registration
>> and use, which leads to sporadic failures of drivers to probe at boot
>> time.
>>
>> While fixing this, it has been noticed that a recent fix to check the
>> return value of dev_set_name() introduced a new bug where wp_gpio was
>> not being put in that newly introduced error path.
>>
>> Then there's a fix for a previous fix which itself purports to fix
>> another bug, but results in the allocated ID being leaked. Fix for a
>> fix for a fix is not good!
>>
>> Then there's an error in the docbook documentation for wp_gpio (it's
>> listed as wp-gpio instead) but as nothing seems to set wp_gpio, we
>> might as well get rid of it - which also solves the issue that we
>> call gpiod_put() on this whether we own it or not.
>>
>> Lastly, there's a fix for yet another spurious white-space in this
>> code, one of what seems to be a long history of past white-space
>> fixes.
>>
>> These patches have been individually build-tested in the order of
>> posting, but not run-time tested except for the entire series.
> 
> 
> thanks for fixing these issues, I have applied these after fixing the 
> subject on all the patches, as it ended up with email ids in subject.

Right. I see none of the issues you two lectured me about actually
mattered, it was all for show, and this isn't getting backported anyway.

Good job you two. The day I finally rage quit the kernel after enough of
this nonsense and make a big dossier of just how fucked up the kernel
maintainer community's attitude is, I will be sure to use this as an
example.

That day is not today though. But I certainly won't be upstreaming any
more patches to nvmem.

- Hector
Russell King (Oracle) Jan. 4, 2023, 10:29 a.m. UTC | #5
On Wed, Jan 04, 2023 at 10:15:14AM +0900, Hector Martin wrote:
> On 04/01/2023 06.15, Srinivas Kandagatla wrote:
> > Hi Russell,
> > 
> > 
> > On 03/01/2023 16:58, Russell King (Oracle) wrote:
> >> Hi,
> >>
> >> This series fixes a whole host of nvmem registration/error cleanup
> >> issues that have been identified by both Hector and myself. It is a
> >> substantial rework of my original patch fixing the first problem.
> >>
> >> The first most obvious problem is the race between nvmem registration
> >> and use, which leads to sporadic failures of drivers to probe at boot
> >> time.
> >>
> >> While fixing this, it has been noticed that a recent fix to check the
> >> return value of dev_set_name() introduced a new bug where wp_gpio was
> >> not being put in that newly introduced error path.
> >>
> >> Then there's a fix for a previous fix which itself purports to fix
> >> another bug, but results in the allocated ID being leaked. Fix for a
> >> fix for a fix is not good!
> >>
> >> Then there's an error in the docbook documentation for wp_gpio (it's
> >> listed as wp-gpio instead) but as nothing seems to set wp_gpio, we
> >> might as well get rid of it - which also solves the issue that we
> >> call gpiod_put() on this whether we own it or not.
> >>
> >> Lastly, there's a fix for yet another spurious white-space in this
> >> code, one of what seems to be a long history of past white-space
> >> fixes.
> >>
> >> These patches have been individually build-tested in the order of
> >> posting, but not run-time tested except for the entire series.
> > 
> > 
> > thanks for fixing these issues, I have applied these after fixing the 
> > subject on all the patches, as it ended up with email ids in subject.
> 
> Right. I see none of the issues you two lectured me about actually
> mattered, it was all for show, and this isn't getting backported anyway.
> 
> Good job you two. The day I finally rage quit the kernel after enough of
> this nonsense and make a big dossier of just how fucked up the kernel
> maintainer community's attitude is, I will be sure to use this as an
> example.
> 
> That day is not today though. But I certainly won't be upstreaming any
> more patches to nvmem.

You've really little clue, have you.

I really don't see that you'll *ever* get apple hardware properly
functional in mainline. Good luck maintaining a fork of the kernel
for ever into the future.

I've had enough of you.
Russell King (Oracle) Jan. 10, 2023, 4:25 p.m. UTC | #6
On Tue, Jan 03, 2023 at 08:56:32PM +0000, Russell King (Oracle) wrote:
> Really not interested in your politics. Not interested in fixing this
> problem.
> 
> I'll use these patches to fix the problem in my tree. I don't care about
> mainline.

Having thought this over, this was an unfair over-reaction for which
I'd like to apologise. It was proving to be a very stressful couple
of days.

> > Uhh. The series itself looks fine as far as fixing the problems, but I
> > fail to see how this is any better than my attempt as far as backporting
> > or commit atomicity goes. Patch #4 fixes the newer gpio leak bug *and*
> > half fixes the race condition bug,

There are two choices for patch for:

1. add gpiod_put().
2. make use of the gpiod_put() already present in the release function.

Either way, patch 5 depends on patch 4, and there's no way around that.
This does not mean that they can't be back ported though - patch 4
can be backported to the appropriate point, and further backported
merely by dropping one hunk - and thus it becomes a preparatory patch
for patch 5, and even more so, separates the conversion from
device_register() to device_initialize()+device_add() from the actual
fix itself (which in patch 5 merely moves device_add() and rearranges
the cleanup.)

However, where our patches differ is that my series fixes one problem
in one patch, rather than trying to address multiple problems in one
patch. As has been pointed out, this is a documented requirement in
the submitting-patches.rst document, which has been there for a very
long time. You had been pointed to this document already over this
point.

Therefore, I believe my series to be a technically better approach
which addresses several more issues while conforming to the "Solve
only one problem per patch." requirement which can be trivially
backported - and I truely believe that even patch 4 complies with
the requirement in submitting-patches.rst. I certainly do not
believe patch 4 is a kind of "partial" fix for the race condition,
since it in no way changes the presence of the race.

I hope that we can continue to work to get the Apple M1 supported more
fully in the future.
Hector Martin Jan. 13, 2023, 11:55 a.m. UTC | #7
On 11/01/2023 01.25, Russell King (Oracle) wrote:
> On Tue, Jan 03, 2023 at 08:56:32PM +0000, Russell King (Oracle) wrote:
>> Really not interested in your politics. Not interested in fixing this
>> problem.
>>
>> I'll use these patches to fix the problem in my tree. I don't care about
>> mainline.
> 
> Having thought this over, this was an unfair over-reaction for which
> I'd like to apologise. It was proving to be a very stressful couple
> of days.

Thank you.

> 
>>> Uhh. The series itself looks fine as far as fixing the problems, but I
>>> fail to see how this is any better than my attempt as far as backporting
>>> or commit atomicity goes. Patch #4 fixes the newer gpio leak bug *and*
>>> half fixes the race condition bug,
> 
> There are two choices for patch for:
> 
> 1. add gpiod_put().
> 2. make use of the gpiod_put() already present in the release function.
> 
> Either way, patch 5 depends on patch 4, and there's no way around that.

How so? My idea was to just add gpiod_put() as one commit, do the race
condition fix as another, and optionally clean up as a third commit
after that to unify the error paths. I just tried it and it effectively
avoids the inter-patch dependency.

Done like this, both patches apply cleanly to the 6.1 stable tree, while
the race fix applies cleanly to at least the 6.0 and 5.15 stable trees
(haven't tried going further back). No manual backports needed. It just
works with a Cc: stable@.

> However, where our patches differ is that my series fixes one problem
> in one patch, rather than trying to address multiple problems in one
> patch. As has been pointed out, this is a documented requirement in
> the submitting-patches.rst document, which has been there for a very
> long time. You had been pointed to this document already over this
> point.

I'd like to direct your attention to submitting-patches.rst:

> For example, if your changes include both bug fixes and performance
> enhancements for a single driver, separate those changes into two
> or more patches.

Note how it says *bug fixes* (plural), and asks you to separate the
changes into *two or more* patches (where one is the performance
enhancements, and the other the multiple bug fixes, in the two case).

The author of the document clearly understood that sometimes it makes
perfect sense to group related changes together, when they are of the
same type and touching closely related bits of code, and focused on the
more egregious case of making wildly unrelated changes (performance and
bugfixes, presumably in unrelated parts of the code) in one commit.
*That* is what we are trying to avoid, because when done
indiscriminately, it makes bisecting and backporting and review much harder.

Yet, there is this persistent myth among some kernel maintainers that It
Is The Rule Of The Kernel that You Do One, and Exactly One, Bug Fix Per
Commit. And this kind of applying (not even actual, in this case) rules
with an iron fist without any regard to nuance is one of the things that
contributes to the unwelcoming nature of this community.

The reality is that us engineers often notice little things to fix while
working on code doing something else, and in those cases, it often makes
little to no sense to split off the fix into another commit, especially
if it's minor enough that it doesn't need backporting or the fix ends up
getting subsumed into other changes being made.

There is room for discretion in deciding how to split up commits, and
it's not something you can make an objective, strict rule for that
actually makes sense. When it comes to programming in a large
collaborative project like this, strict rules can only be made and
enforced for things that can be fully checked and made consistent by a
machine... and the Linux kernel development process is too antiquated to
have many of those (e.g. many projects are doing fully automated code
formatting these days, while here all we have is checkpatch.pl and it's
often wrong and unsuitable as an automated gating check). Everything
else is at best a strong suggestion, but always subject to human
interpretation and nuance, because we aren't robots.

I do these kinds of drive-by fixes all the time, because sometimes it
just *makes sense*. See for example 38892ea4cef, which fixed a typo
while renaming a DT property. Or d182dc6de932, which refactored some
OF-related code and in the process also fixed a reference leak in the
original code. Those are things not worth backporting, and in those
cases there is absolutely no good reason to split out the "fix" part
from the "do" part into a tiny one-liner commit.

Ultimately, drive-by fixes like this are a gift from the contributor to
the maintainer, as they are almost always not something that is actually
causing pain for anyone (at least not significant pain), and therefore
there is very little reason to spend time on them. Thus, if maintainers
attempt to enforce a strict one-patch-per-bugfix policy, it will simply
result in (and I'm sure it already is) contributors just not doing those
fixes at all, or choosing not to describe them in the commit message
when they are "accidentally" fixed as part of a larger change. I know I
would. When the patch submission process is painful, you naturally
gravitate towards doing the least offensive thing that gets you where
you want to be, even if that isn't necessarily the best place for the
kernel to be in the end.

That all said, you *did* have a very valid point with the merged commit
being harder to backport in this case (where a backport is desirable),
and checking the blame to see if it would apply cleanly further back was
a good call - but then please appeal to that, and not to
submitting-patches.rst. And then if we'd agreed on that being the
desirable goal, I could've helped you structure the patches in a way
that cleanly backports to stable, as I already locally confirmed is
possible to do.

> Therefore, I believe my series to be a technically better approach
> which addresses several more issues while conforming to the "Solve
> only one problem per patch." requirement which can be trivially
> backported - and I truely believe that even patch 4 complies with
> the requirement in submitting-patches.rst. I certainly do not
> believe patch 4 is a kind of "partial" fix for the race condition,
> since it in no way changes the presence of the race.

It certainly fixed more things, but as I said, if this were about
backporting, we could've done better ;)

Patch 4 may not change the presence of the race, but both you and I know
it is preparatory work for that (as if the goal were just to fix the
gpiod leak, that would've just been adding the one-line gpiod_put()), so
it is somewhat disingenuous to claim that it is a stand-alone fix. I
posit that if nothing else it is harder to review as a fix for what it
purports to fix, and not the best approach for a backport (which as I
said doesn't work out of the box, but even if it did). Sure, it's
splitting off the fixes in the sense that only one thing is actually
fixed after each commit is applied, but it's not splitting them off at
the sensible point for that.

> I hope that we can continue to work to get the Apple M1 supported more
> fully in the future.

Thank you. (I still need to re-send those SMC patches... maybe next week
I'll finally find some time)

- Hector