[v2,0/3] docs: define policy forbidding use of "AI" / LLM code generators

Message ID	20240516162230.937047-1-berrange@redhat.com (mailing list archive)
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> From: =?utf-8?q?Daniel_P=2E_Berrang=C3=A9?= <berrange@redhat.com> To: qemu-devel@nongnu.org Cc: Thomas Huth <thuth@redhat.com>, =?utf-8?q?Alex_Benn=C3=A9e?= <alex.bennee@linaro.org>, "Michael S. Tsirkin" <mst@redhat.com>, Gerd Hoffmann <kraxel@redhat.com>, Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>, =?utf-8?q?Philippe_Mathie?= =?utf-8?q?u-Daud=C3=A9?= <philmd@linaro.org>, Kevin Wolf <kwolf@redhat.com>, =?utf-8?q?Daniel_P=2E_Berrang=C3=A9?= <berrange@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>, Alexander Graf <agraf@csgraf.de>, Paolo Bonzini <pbonzini@redhat.com>, Richard Henderson <richard.henderson@linaro.org>, Peter Maydell <peter.maydell@linaro.org>, Markus Armbruster <armbru@redhat.com> Subject: [PATCH v2 0/3] docs: define policy forbidding use of "AI" / LLM code generators Date: Thu, 16 May 2024 17:22:27 +0100 Message-ID: <20240516162230.937047-1-berrange@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=170.10.133.124; envelope-from=berrange@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -30 X-Spam_score: -3.1 X-Spam_bar: --- X-Spam_report: (-3.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-1.022, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Series	docs: define policy forbidding use of "AI" / LLM code generators \| expand [v2,0/3] docs: define policy forbidding use of "AI" / LLM code generators [v2,1/3] docs: introduce dedicated page about code provenance / sign-off [v2,2/3] docs: define policy limiting the inclusion of generated files [v2,3/3] docs: define policy forbidding use of AI code generators

Daniel P. Berrangé May 16, 2024, 4:22 p.m. UTC

This patch kicks the hornet's nest of AI / LLM code generators.

With the increasing interest in code generators in recent times,
it is inevitable that QEMU contributions will include AI generated
code. Thus far we have remained silent on the matter. Given that
everyone knows these tools exist, our current position has to be
considered tacit acceptance of the use of AI generated code in QEMU.

The question for the project is whether that is a good position for
QEMU to take or not ?

IANAL, but I like to think I'm reasonably proficient at understanding
open source licensing. I am not inherantly against the use of AI tools,
rather I am anti-risk. I also want to see OSS licenses respected and
complied with.

AFAICT at its current state of (im)maturity the question of licensing
of AI code generator output does not have a broadly accepted / settled
legal position. This is an inherant bias/self-interest from the vendors
promoting their usage, who tend to minimize/dismiss the legal questions.
From my POV, this puts such tools in a position of elevated legal risk.

Given the fuzziness over the legal position of generated code from
such tools, I don't consider it credible (today) for a contributor
to assert compliance with the DCO terms (b) or (c) (which is a stated
pre-requisite for QEMU accepting patches) when a patch includes (or is
derived from) AI generated code.

By implication, I think that QEMU must (for now) explicitly decline
to (knowingly) accept AI generated code.

Perhaps a few years down the line the legal uncertainty will have
reduced and we can re-evaluate this policy.

Discuss...

Changes in v2:

 * Fix a huge number of typos in docs
 * Clarify that maintainers should still add R-b where relevant, even
   if they are already adding their own S-oB.
 * Clarify situation when contributor re-starts previously abandoned
   work from another contributor.
 * Add info about Suggested-by tag
 * Add new docs section dealing with the broad topic of "generated
   files" (whether code generators or compilers)
 * Simplify the section related to prohibition of AI generated files
   and give further examples of tools considered covered
 * Remove repeated references to "LLM" as a specific technology, just
   use the broad "AI" term, except for one use of LLM as an example.
 * Add note that the policy may evolve if the legal clarity improves
 * Add note that exceptions can be requested on case-by-case basis
   if contributor thinks they can demonstrate a credible copyright
   and licensing status

Daniel P. Berrangé (3):
  docs: introduce dedicated page about code provenance / sign-off
  docs: define policy limiting the inclusion of generated files
  docs: define policy forbidding use of AI code generators

 docs/devel/code-provenance.rst    | 315 ++++++++++++++++++++++++++++++
 docs/devel/index-process.rst      |   1 +
 docs/devel/submitting-a-patch.rst |  19 +-
 3 files changed, 318 insertions(+), 17 deletions(-)
 create mode 100644 docs/devel/code-provenance.rst

Michael S. Tsirkin May 16, 2024, 5:20 p.m. UTC | #1

On Thu, May 16, 2024 at 05:22:27PM +0100, Daniel P. Berrangé wrote:
> This patch kicks the hornet's nest of AI / LLM code generators.
> 
> With the increasing interest in code generators in recent times,
> it is inevitable that QEMU contributions will include AI generated
> code. Thus far we have remained silent on the matter. Given that
> everyone knows these tools exist, our current position has to be
> considered tacit acceptance of the use of AI generated code in QEMU.
> 
> The question for the project is whether that is a good position for
> QEMU to take or not ?
> 
> IANAL, but I like to think I'm reasonably proficient at understanding
> open source licensing. I am not inherantly against the use of AI tools,
> rather I am anti-risk. I also want to see OSS licenses respected and
> complied with.
> 
> AFAICT at its current state of (im)maturity the question of licensing
> of AI code generator output does not have a broadly accepted / settled
> legal position. This is an inherant bias/self-interest from the vendors
> promoting their usage, who tend to minimize/dismiss the legal questions.
> >From my POV, this puts such tools in a position of elevated legal risk.
> 
> Given the fuzziness over the legal position of generated code from
> such tools, I don't consider it credible (today) for a contributor
> to assert compliance with the DCO terms (b) or (c) (which is a stated
> pre-requisite for QEMU accepting patches) when a patch includes (or is
> derived from) AI generated code.
> 
> By implication, I think that QEMU must (for now) explicitly decline
> to (knowingly) accept AI generated code.
> 
> Perhaps a few years down the line the legal uncertainty will have
> reduced and we can re-evaluate this policy.
> 
> Discuss...

At this junction, the code generated by these tools is of such
quality that I really won't expect it to pass even cursory code
review.


So for now, I propose adding a single paragraph:

 If you wrote the patch, make sure your "From:" and "Signed-off-by:"
 lines use the same spelling. It's okay if you subscribe or contribute to
 the list via more than one address, but using multiple addresses in one
 commit just confuses things. If someone else wrote the patch, git will
 include a "From:" line in the body of the email (different from your
 envelope From:) that will give credit to the correct author; but again,
 that author's Signed-off-by: line is mandatory, with the same spelling.

+Q: I prompted ChatGPT/Copilot/Llama and it wrote
+   the patch for me. Can I submit it and how do I sign it?
+A: Your patch is likely trash or trivial. Please write your own code.






> Changes in v2:
> 
>  * Fix a huge number of typos in docs
>  * Clarify that maintainers should still add R-b where relevant, even
>    if they are already adding their own S-oB.
>  * Clarify situation when contributor re-starts previously abandoned
>    work from another contributor.
>  * Add info about Suggested-by tag
>  * Add new docs section dealing with the broad topic of "generated
>    files" (whether code generators or compilers)
>  * Simplify the section related to prohibition of AI generated files
>    and give further examples of tools considered covered
>  * Remove repeated references to "LLM" as a specific technology, just
>    use the broad "AI" term, except for one use of LLM as an example.
>  * Add note that the policy may evolve if the legal clarity improves
>  * Add note that exceptions can be requested on case-by-case basis
>    if contributor thinks they can demonstrate a credible copyright
>    and licensing status
> 
> Daniel P. Berrangé (3):
>   docs: introduce dedicated page about code provenance / sign-off
>   docs: define policy limiting the inclusion of generated files
>   docs: define policy forbidding use of AI code generators
> 
>  docs/devel/code-provenance.rst    | 315 ++++++++++++++++++++++++++++++
>  docs/devel/index-process.rst      |   1 +
>  docs/devel/submitting-a-patch.rst |  19 +-
>  3 files changed, 318 insertions(+), 17 deletions(-)
>  create mode 100644 docs/devel/code-provenance.rst
> 
> -- 
> 2.43.0

Peter Maydell May 16, 2024, 5:34 p.m. UTC | #2

On Thu, 16 May 2024 at 18:20, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, May 16, 2024 at 05:22:27PM +0100, Daniel P. Berrangé wrote:
> > AFAICT at its current state of (im)maturity the question of licensing
> > of AI code generator output does not have a broadly accepted / settled
> > legal position. This is an inherant bias/self-interest from the vendors
> > promoting their usage, who tend to minimize/dismiss the legal questions.
> > >From my POV, this puts such tools in a position of elevated legal risk.
> >
> > Given the fuzziness over the legal position of generated code from
> > such tools, I don't consider it credible (today) for a contributor
> > to assert compliance with the DCO terms (b) or (c) (which is a stated
> > pre-requisite for QEMU accepting patches) when a patch includes (or is
> > derived from) AI generated code.
> >
> > By implication, I think that QEMU must (for now) explicitly decline
> > to (knowingly) accept AI generated code.
> >
> > Perhaps a few years down the line the legal uncertainty will have
> > reduced and we can re-evaluate this policy.

> At this junction, the code generated by these tools is of such
> quality that I really won't expect it to pass even cursory code
> review.

I disagree, I think that in at least some cases they can
produce code that would pass our quality bar, especially with
human supervision and editing after the fact. If the problem
was merely "LLMs tend to produce lousy output" then we wouldn't
need to write anything new -- we already have a process for
dealing with bad patches, which is to say we do code review and
suggest changes or simply reject the patches. What we *don't* have
any process to handle is the legal uncertainties that Dan outlines
above.

-- PMM

Michael S. Tsirkin May 16, 2024, 5:36 p.m. UTC | #3

On Thu, May 16, 2024 at 06:34:13PM +0100, Peter Maydell wrote:
> On Thu, 16 May 2024 at 18:20, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, May 16, 2024 at 05:22:27PM +0100, Daniel P. Berrangé wrote:
> > > AFAICT at its current state of (im)maturity the question of licensing
> > > of AI code generator output does not have a broadly accepted / settled
> > > legal position. This is an inherant bias/self-interest from the vendors
> > > promoting their usage, who tend to minimize/dismiss the legal questions.
> > > >From my POV, this puts such tools in a position of elevated legal risk.
> > >
> > > Given the fuzziness over the legal position of generated code from
> > > such tools, I don't consider it credible (today) for a contributor
> > > to assert compliance with the DCO terms (b) or (c) (which is a stated
> > > pre-requisite for QEMU accepting patches) when a patch includes (or is
> > > derived from) AI generated code.
> > >
> > > By implication, I think that QEMU must (for now) explicitly decline
> > > to (knowingly) accept AI generated code.
> > >
> > > Perhaps a few years down the line the legal uncertainty will have
> > > reduced and we can re-evaluate this policy.
> 
> > At this junction, the code generated by these tools is of such
> > quality that I really won't expect it to pass even cursory code
> > review.
> 
> I disagree, I think that in at least some cases they can
> produce code that would pass our quality bar, especially with
> human supervision and editing after the fact. If the problem
> was merely "LLMs tend to produce lousy output" then we wouldn't
> need to write anything new -- we already have a process for
> dealing with bad patches, which is to say we do code review and
> suggest changes or simply reject the patches. What we *don't* have
> any process to handle is the legal uncertainties that Dan outlines
> above.
> 
> -- PMM


Maybe I'm bad at prompting ;)

Stefan Hajnoczi May 21, 2024, 2:27 p.m. UTC | #4

On Thu, 16 May 2024 at 12:23, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> This patch kicks the hornet's nest of AI / LLM code generators.
>
> With the increasing interest in code generators in recent times,
> it is inevitable that QEMU contributions will include AI generated
> code. Thus far we have remained silent on the matter. Given that
> everyone knows these tools exist, our current position has to be
> considered tacit acceptance of the use of AI generated code in QEMU.
>
> The question for the project is whether that is a good position for
> QEMU to take or not ?
>
> IANAL, but I like to think I'm reasonably proficient at understanding
> open source licensing. I am not inherantly against the use of AI tools,
> rather I am anti-risk. I also want to see OSS licenses respected and
> complied with.
>
> AFAICT at its current state of (im)maturity the question of licensing
> of AI code generator output does not have a broadly accepted / settled
> legal position. This is an inherant bias/self-interest from the vendors
> promoting their usage, who tend to minimize/dismiss the legal questions.
> From my POV, this puts such tools in a position of elevated legal risk.
>
> Given the fuzziness over the legal position of generated code from
> such tools, I don't consider it credible (today) for a contributor
> to assert compliance with the DCO terms (b) or (c) (which is a stated
> pre-requisite for QEMU accepting patches) when a patch includes (or is
> derived from) AI generated code.
>
> By implication, I think that QEMU must (for now) explicitly decline
> to (knowingly) accept AI generated code.
>
> Perhaps a few years down the line the legal uncertainty will have
> reduced and we can re-evaluate this policy.
>
> Discuss...

Although this policy is unenforceable, I think it's a valid position
to take until the legal situation becomes clear.

Acked-by: Stefan Hajnoczi <stefanha@gmail.com>

Kevin Wolf May 28, 2024, 3:41 p.m. UTC | #5

Am 16.05.2024 um 18:22 hat Daniel P. Berrangé geschrieben:
> This patch kicks the hornet's nest of AI / LLM code generators.
> 
> With the increasing interest in code generators in recent times,
> it is inevitable that QEMU contributions will include AI generated
> code. Thus far we have remained silent on the matter. Given that
> everyone knows these tools exist, our current position has to be
> considered tacit acceptance of the use of AI generated code in QEMU.
> 
> The question for the project is whether that is a good position for
> QEMU to take or not ?
> 
> IANAL, but I like to think I'm reasonably proficient at understanding
> open source licensing. I am not inherantly against the use of AI tools,
> rather I am anti-risk. I also want to see OSS licenses respected and
> complied with.
> 
> AFAICT at its current state of (im)maturity the question of licensing
> of AI code generator output does not have a broadly accepted / settled
> legal position. This is an inherant bias/self-interest from the vendors
> promoting their usage, who tend to minimize/dismiss the legal questions.
> From my POV, this puts such tools in a position of elevated legal risk.
> 
> Given the fuzziness over the legal position of generated code from
> such tools, I don't consider it credible (today) for a contributor
> to assert compliance with the DCO terms (b) or (c) (which is a stated
> pre-requisite for QEMU accepting patches) when a patch includes (or is
> derived from) AI generated code.
> 
> By implication, I think that QEMU must (for now) explicitly decline
> to (knowingly) accept AI generated code.
> 
> Perhaps a few years down the line the legal uncertainty will have
> reduced and we can re-evaluate this policy.
> 
> Discuss...
> 
> Changes in v2:
> 
>  * Fix a huge number of typos in docs
>  * Clarify that maintainers should still add R-b where relevant, even
>    if they are already adding their own S-oB.
>  * Clarify situation when contributor re-starts previously abandoned
>    work from another contributor.
>  * Add info about Suggested-by tag
>  * Add new docs section dealing with the broad topic of "generated
>    files" (whether code generators or compilers)
>  * Simplify the section related to prohibition of AI generated files
>    and give further examples of tools considered covered
>  * Remove repeated references to "LLM" as a specific technology, just
>    use the broad "AI" term, except for one use of LLM as an example.
>  * Add note that the policy may evolve if the legal clarity improves
>  * Add note that exceptions can be requested on case-by-case basis
>    if contributor thinks they can demonstrate a credible copyright
>    and licensing status
> 
> Daniel P. Berrangé (3):
>   docs: introduce dedicated page about code provenance / sign-off
>   docs: define policy limiting the inclusion of generated files
>   docs: define policy forbidding use of AI code generators
> 
>  docs/devel/code-provenance.rst    | 315 ++++++++++++++++++++++++++++++
>  docs/devel/index-process.rst      |   1 +
>  docs/devel/submitting-a-patch.rst |  19 +-
>  3 files changed, 318 insertions(+), 17 deletions(-)
>  create mode 100644 docs/devel/code-provenance.rst

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

[v2,0/3] docs: define policy forbidding use of "AI" / LLM code generators

Message

Comments