Message ID | 20231123114026.3589272-3-berrange@redhat.com (mailing list archive)
---|---
State | New, archived
Series | docs: define policy forbidding use of "AI" / LLM code generators
Daniel P. Berrangé <berrange@redhat.com> writes:

> There has been an explosion of interest in so called "AI" (LLM)
> code generators in the past year or so. Thus far though, this is
> has not been matched by a broadly accepted legal interpretation
> of the licensing implications for code generator outputs. While
> the vendors may claim there is no problem and a free choice of
> license is possible, they have an inherent conflict of interest
> in promoting this interpretation. More broadly there is, as yet,
> no broad consensus on the licensing implications of code generators
> trained on inputs under a wide variety of licenses.
>
> The DCO requires contributors to assert they have the right to
> contribute under the designated project license. Given the lack
> of consensus on the licensing of "AI" (LLM) code generator output,
> it is not considered credible to assert compliance with the DCO
> clause (b) or (c) where a patch includes such generated code.
>
> This patch thus defines a policy that the QEMU project will not
> accept contributions where use of "AI" (LLM) code generators is
> either known, or suspected.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
>
> diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> index b4591a2dec..a6e42c6b1b 100644
> --- a/docs/devel/code-provenance.rst
> +++ b/docs/devel/code-provenance.rst
> @@ -195,3 +195,43 @@ example::
>     Signed-off-by: Some Person <some.person@example.com>
>     [Rebased and added support for 'foo']
>     Signed-off-by: New Person <new.person@example.com>
> +
> +Use of "AI" (LLM) code generators
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +TL;DR:
> +
> +  **Current QEMU project policy is to DECLINE any contributions
> +  which are believed to include or derive from "AI" (LLM)
> +  generated code.**
> +
> +The existence of "AI" (`Large Language Model <https://en.wikipedia.org/wiki/Large_language_model>`__
> +/ LLM) code generators raises a number of difficult legal questions, a
> +number of which impact on Open Source projects. As noted earlier, the
> +QEMU community requires that contributors certify their patch submissions
> +are made in accordance with the rules of the :ref:`dco` (DCO). When a
> +patch contains "AI" generated code this raises difficulties with code
> +provenence and thus DCO compliance.

I agree this is going to be a field that keeps lawyers well remunerated
for the foreseeable future. However, I suspect this elides the main use
case for LLM generators, which is non-novel transformation. One good
example is generating test fixtures: you write a piece of original code
and then ask the code completion engine to fill out some unit tests to
exercise it. It's boring mechanical work, but one an LLM is well suited
to (even if you might tweak the final result).

> +To satisfy the DCO, the patch contributor has to fully understand
> +the origins and license of code they are contributing to QEMU. The
> +license terms that should apply to the output of an "AI" code generator
> +are ill-defined, given that both training data and operation of the
> +"AI" are typically opaque to the user. Even where the training data
> +is said to all be open source, it will likely be under a wide variety
> +of license terms.
> +
> +While the vendor's of "AI" code generators may promote the idea that
> +code output can be taken under a free choice of license, this is not
> +yet considered to be a generally accepted, nor tested, legal opinion.
> +
> +With this in mind, the QEMU maintainers does not consider it is
> +currently possible to comply with DCO terms (b) or (c) for most "AI"
> +generated code.

There is a load of code out there that isn't eligible for copyright
protection because it doesn't demonstrate much originality or
creativity. In the experimentation I've done so far I've not seen much
sign of genuine creativity. LLMs benefit from having access to a wide
corpus of training data and tend to do a better job of inferring
solutions from semi-related posts than, say, a human manually comparing
posts after pasting an error message into Google.

> +
> +The QEMU maintainers thus require that contributors refrain from using
> +"AI" code generators on patches intended to be submitted to the project,
> +and will decline any contribution if use of "AI" is known or suspected.
> +
> +Examples of tools impacted by this policy includes both GitHub CoPilot,
> +and ChatGPT, amongst many others which are less well known.

What about if you took an LLM and then fine-tuned it using project data
so it could better help new users in making contributions to the
project? You would be biasing the model towards your own data for the
purpose of helping developers write better QEMU code.
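As a rough illustration of the test-generation workflow Alex describes
(a minimal sketch; the clamp function and test names are invented for
this example, not taken from QEMU): the function is hand-written, and
the test class is the kind of mechanical boilerplate one would ask a
completion engine to fill out.

    import unittest

    def clamp(value: int, lo: int, hi: int) -> int:
        """Hand-written original code the developer supplies."""
        return max(lo, min(value, hi))

    # The kind of mechanical boilerplate an LLM completion engine would
    # be asked to generate from the function above (and a human might
    # then tweak):
    class TestClamp(unittest.TestCase):
        def test_within_range(self):
            self.assertEqual(clamp(5, 0, 10), 5)

        def test_below_range(self):
            self.assertEqual(clamp(-3, 0, 10), 0)

        def test_above_range(self):
            self.assertEqual(clamp(42, 0, 10), 10)

    if __name__ == "__main__":
        unittest.main()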
Am 23.11.2023 um 12:40 hat Daniel P. Berrangé geschrieben:

<snip>

> +With this in mind, the QEMU maintainers does not consider it is

s/does/do/ or maybe s/maintainers/project/

> +currently possible to comply with DCO terms (b) or (c) for most "AI"
> +generated code.
> +
> +The QEMU maintainers thus require that contributors refrain from using
> +"AI" code generators on patches intended to be submitted to the project,
> +and will decline any contribution if use of "AI" is known or suspected.
> +
> +Examples of tools impacted by this policy includes both GitHub CoPilot,
> +and ChatGPT, amongst many others which are less well known.

Acked-by: Kevin Wolf <kwolf@redhat.com>
On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:

<snip>
> +
> +Examples of tools impacted by this policy includes both GitHub CoPilot,
> +and ChatGPT, amongst many others which are less well known.

So you called out these two by name, fine, but given "AI" is in scare
quotes I don't really know what is or is not allowed, and I don't know
how contributors will know. Is the "AI" that one must not use
necessarily an LLM? And how do you define LLM even? Wikipedia says
"general-purpose language understanding and generation".

All this seems vague to me.

However, can't we define a simpler more specific policy? For example,
isn't it true that *any* automatically generated code can only be
included if the scripts producing said code are also included or
otherwise available under GPLv2?
On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:

<snip>
>> +The QEMU maintainers thus require that contributors refrain from using
>> +"AI" code generators on patches intended to be submitted to the project,
>> +and will decline any contribution if use of "AI" is known or suspected.
>> +
>> +Examples of tools impacted by this policy includes both GitHub CoPilot,
>> +and ChatGPT, amongst many others which are less well known.
>
>So you called out these two by name, fine, but given "AI" is in scare
>quotes I don't really know what is or is not allowed, and I don't know
>how contributors will know. Is the "AI" that one must not use
>necessarily an LLM? And how do you define LLM even? Wikipedia says
>"general-purpose language understanding and generation".
>
>All this seems vague to me.
>
>However, can't we define a simpler more specific policy?
>For example, isn't it true that *any* automatically generated code
>can only be included if the scripts producing said code
>are also included or otherwise available under GPLv2?

The following definition makes sense to me:

- Automated codegen tool must be idempotent.
- Automated codegen tool must not use statistical modelling.

I'd remove all AI or LLM references. These are non-specific, colloquial
and, in the case of `AI`, non-technical. This policy should apply the
same to a Markov chain code generator.
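To make the Markov chain point concrete, here is a toy sketch (invented
for this illustration, not from the thread): such a generator is built
from a statistical model of a training corpus, and random sampling
means two runs produce different output, so it would fall under both of
the proposed criteria just as an LLM would.

    import random
    from collections import defaultdict

    def train(tokens):
        # Statistical model: record which token follows which in the
        # training corpus.
        model = defaultdict(list)
        for cur, nxt in zip(tokens, tokens[1:]):
            model[cur].append(nxt)
        return model

    def generate(model, start, length=10):
        out = [start]
        for _ in range(length):
            followers = model.get(out[-1])
            if not followers:
                break
            # Random sampling: repeated runs differ, so the tool is
            # not idempotent either.
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = "if ( err ) return err ; if ( ret < 0 ) return ret ;".split()
    print(generate(train(corpus), "if"))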
On Thu, Nov 23, 2023 at 04:56:28PM +0200, Manos Pitsidianakis wrote:
> > However, can't we define a simpler more specific policy?
> > For example, isn't it true that *any* automatically generated code
> > can only be included if the scripts producing said code
> > are also included or otherwise available under GPLv2?
>
> The following definition makes sense to me:
>
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.

Why does it matter so much?

> I'd remove all AI or LLM references. These are non-specific, colloquial
> and, in the case of `AI`, non-technical. This policy should apply the
> same to a Markov chain code generator.
On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:

<snip>

As open source LLMs mature, it may be possible to curate the training
data so that the output complies with software licenses and can be used
in QEMU. For the time being, the position in this patch seems
reasonable because it prevents license problems down the road.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
On 23/11/23 15:56, Manos Pitsidianakis wrote:
> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:

<snip>
>> So you called out these two by name, fine, but given "AI" is in scare
>> quotes I don't really know what is or is not allowed, and I don't know
>> how contributors will know. Is the "AI" that one must not use
>> necessarily an LLM? And how do you define LLM even? Wikipedia says
>> "general-purpose language understanding and generation".
>>
>> All this seems vague to me.
>>
>> However, can't we define a simpler more specific policy?
>> For example, isn't it true that *any* automatically generated code
>> can only be included if the scripts producing said code
>> are also included or otherwise available under GPLv2?
>
> The following definition makes sense to me:
>
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.
>
> I'd remove all AI or LLM references. These are non-specific, colloquial
> and, in the case of `AI`, non-technical. This policy should apply the
> same to a Markov chain code generator.

This document targets all contributors. Contributions can be typo
fixes, translations, ... and don't have to be technical. Similarly,
contributors aren't expected to be technical experts. To a neophyte,
"AI" makes sense. "Idempotent code generator" or "LLM" don't :)
Manos Pitsidianakis <manos.pitsidianakis@linaro.org> writes:

> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:

<snip>
>>However, can't we define a simpler more specific policy?
>>For example, isn't it true that *any* automatically generated code
>>can only be included if the scripts producing said code
>>are also included or otherwise available under GPLv2?
>
> The following definition makes sense to me:
>
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.
>
> I'd remove all AI or LLM references. These are non-specific,
> colloquial and, in the case of `AI`, non-technical. This policy should
> apply the same to a Markov chain code generator.

I'm fairly sure my Emacs auto-complete would fail by that definition.
On Thu, Nov 23, 2023 at 04:29:52PM +0100, Philippe Mathieu-Daudé wrote:
> This document targets all contributors. Contributions can be typo
> fixes, translations, ... and don't have to be technical. Similarly,
> contributors aren't expected to be technical experts. To a neophyte,
> "AI" makes sense. "Idempotent code generator" or "LLM" don't :)

I don't think there's any big deal in using AI for typo fixes.
On Thu, Nov 23, 2023 at 12:06:59PM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 23, 2023 at 04:29:52PM +0100, Philippe Mathieu-Daudé wrote:
> > This document targets all contributors. Contributions can be typo
> > fixes, translations, ... and don't have to be technical. Similarly,
> > contributors aren't expected to be technical experts. To a neophyte,
> > "AI" makes sense. "Idempotent code generator" or "LLM" don't :)
>
> I don't think there's any big deal in using AI for typo fixes.

For how many typos is it still OK, and would not a deterministic
spellchecker be preferred?

There are some edge cases where using AI is OK; the problem is that
most of the time it is not clear it is OK to use.

Thanks

Michal
On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:

<snip>

> I agree this is going to be a field that keeps lawyers well remunerated
> for the foreseeable future. However, I suspect this elides the main use
> case for LLM generators, which is non-novel transformation. One good
> example is generating test fixtures: you write a piece of original code
> and then ask the code completion engine to fill out some unit tests to
> exercise it. It's boring mechanical work, but one an LLM is well suited
> to (even if you might tweak the final result).

It may be suited to producing such code (disputable) but the code is
not suited for inclusion into the project, for legal reasons.
<snip>

> > +With this in mind, the QEMU maintainers does not consider it is
> > +currently possible to comply with DCO terms (b) or (c) for most "AI"
> > +generated code.
>
> There is a load of code out there that isn't eligible for copyright
> protection because it doesn't demonstrate much originality or
> creativity. In the experimentation I've done so far I've not seen much
> sign of genuine creativity. LLMs benefit from having access to a wide
> corpus of training data and tend to do a better job of inferring
> solutions from semi-related posts than, say, a human manually comparing
> posts after pasting an error message into Google.

And the license of that corpus of training data is not defined.

If you could erase the copyright on anything by feeding it into a
statistical model and pulling it back out, there would be some big
content license holders objecting, so it's very unlikely to happen.
Consequently, for all practical purposes the "AI"/LLM output is a
derivative work of the input, with all the legal consequences.

This is, of course, only a problem for *generative* use of AI/LLM,
where the output can contain copies of substantial parts of the input.

Thanks

Michal
On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:

<snip>

> I agree this is going to be a field that keeps lawyers well remunerated
> for the foreseeable future. However, I suspect this elides the main use
> case for LLM generators, which is non-novel transformation. One good
> example is generating test fixtures: you write a piece of original code
> and then ask the code completion engine to fill out some unit tests to
> exercise it. It's boring mechanical work, but one an LLM is well suited
> to (even if you might tweak the final result).

Yes, I can see how that is helpful, but I think in many cases the
resulting code will be complex enough to be considered copyrightable,
and so even with the original input code, I feel the licensing of the
output is still ill-defined.
<snip>

> > +With this in mind, the QEMU maintainers does not consider it is
> > +currently possible to comply with DCO terms (b) or (c) for most "AI"
> > +generated code.
>
> There is a load of code out there that isn't eligible for copyright
> protection because it doesn't demonstrate much originality or
> creativity. In the experimentation I've done so far I've not seen much
> sign of genuine creativity. LLMs benefit from having access to a wide
> corpus of training data and tend to do a better job of inferring
> solutions from semi-related posts than, say, a human manually comparing
> posts after pasting an error message into Google.

The boundary between what is considered copyrightable and what is not
is itself quite ill-defined, and thus it is hard to express a clear
rule that can be applied. I think more experienced long-term
contributors end up getting somewhat of a "gut feeling" about what's OK
and what's not, but I'm not sure if that is true for contributors in
general.

IOW, while there are likely cases where it is possible to safely use an
AI generator, I'm not sure how best to express that in a way that makes
sense. Perhaps a loosely worded addendum about a possible exception for
"trivial" output.

> > +The QEMU maintainers thus require that contributors refrain from using
> > +"AI" code generators on patches intended to be submitted to the project,
> > +and will decline any contribution if use of "AI" is known or suspected.
> > +
> > +Examples of tools impacted by this policy includes both GitHub CoPilot,
> > +and ChatGPT, amongst many others which are less well known.
>
> What about if you took an LLM and then fine-tuned it using project data
> so it could better help new users in making contributions to the
> project? You would be biasing the model towards your own data for the
> purpose of helping developers write better QEMU code.

It is hard to provide an answer to that question, since I think it is
something that would need to be considered case by case. It hinges on
how much the new QEMU-specific training data influences the model, vs
other pre-existing training (if any).

Perhaps we can finish this policy with a general point to solicit
feedback on possible exceptions?

  "If a contributor believes they can demonstrate that the output of a
  particular tool has deterministic licensing, such that they can
  satisfy the DCO, they should provide such info to the mailing list"

With regards,
Daniel
On Thu, Nov 23, 2023 at 09:35:43AM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:

<snip>
> > +
> > +The QEMU maintainers thus require that contributors refrain from using
> > +"AI" code generators on patches intended to be submitted to the project,
> > +and will decline any contribution if use of "AI" is known or suspected.
> > +
> > +Examples of tools impacted by this policy includes both GitHub CoPilot,
> > +and ChatGPT, amongst many others which are less well known.
>
> So you called out these two by name, fine, but given "AI" is in scare
> quotes I don't really know what is or is not allowed, and I don't know
> how contributors will know. Is the "AI" that one must not use
> necessarily an LLM? And how do you define LLM even? Wikipedia says
> "general-purpose language understanding and generation".

I used "AI" in quotes because I think it can mean different things to
different people. In practical terms it has become a bit of a catch-all
term for a wide variety of tools. Thus I think the quotes serve to
express this as a loose generalization, rather than a precise
definition. The same for "LLM"; I don't want to try to define it, as it
has also become somewhat of a general term.

> All this seems vague to me.

Deliberately so, as there are a wide variety of tools working in
varying ways, but all with similar caveats around the licensing of the
output "derivative" work.

> However, can't we define a simpler more specific policy?
> For example, isn't it true that *any* automatically generated code
> can only be included if the scripts producing said code
> are also included or otherwise available under GPLv2?

The license of a code generation tool itself is usually considered to
be not a factor in the license of its output. In most cases the license
of the input data will determine the license of the output data, since
the latter is a derivative work of the former. The person running the
tool will typically know exactly what the input data is, and so can
have confidence in the license of the output. If there are questions
about whether the output is a derivative of the tool's code itself,
then the tool author can provide a disclaimer for this. Such a
disclaimer, though, would not erase the derivative link between input
data and output data.

One example is GCC, where the output .o/exe is a derivative of the
input .c. The output, however, may also link the gcc runtime library,
and so GCC has a license exception saying that this runtime linkage
doesn't affect the license of the output program. This is OK, since the
GCC authors who added this exception owned copyright over the runtime
library they're adding an exception for.

If we apply this to LLMs, the output of the LLM is a derivative of the
training data. The output is not a derivative of the LLM code. The LLM
copyright holders could make this latter point explicit since they own
copyright of the LLM code, but they do not own copyright of the
training data, and neither does the person using the LLM, hence the
legal uncertainty.

With regards,
Daniel
On Thu, Nov 23, 2023 at 04:56:28PM +0200, Manos Pitsidianakis wrote:
> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:

<snip>
> > > +Examples of tools impacted by this policy includes both GitHub CoPilot,
> > > +and ChatGPT, amongst many others which are less well known.
> >
> > So you called out these two by name, fine, but given "AI" is in scare
> > quotes I don't really know what is or is not allowed, and I don't know
> > how contributors will know. Is the "AI" that one must not use
> > necessarily an LLM? And how do you define LLM even? Wikipedia says
> > "general-purpose language understanding and generation".
> >
> > All this seems vague to me.
> >
> > However, can't we define a simpler more specific policy?
> > For example, isn't it true that *any* automatically generated code
> > can only be included if the scripts producing said code
> > are also included or otherwise available under GPLv2?
>
> The following definition makes sense to me:
>
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.

As a casual reader, I would find this somewhat unclear to interpret and
relate to.

> I'd remove all AI or LLM references. These are non-specific, colloquial
> and, in the case of `AI`, non-technical. This policy should apply the
> same to a Markov chain code generator.

The fact that they are colloquial is, IMHO, a good thing, as it makes
the policy relatable to the casual reader who hears the terms "AI" and
"LLM" in technical press articles/blogs/etc all over the place.

I would have considered a "Markov chain code generator" to fall under
the "AI" reference, since "AI" has de facto become a general purpose
term that covers a wide variety of underlying technologies.

With regards,
Daniel
On Thu, Nov 23, 2023 at 06:29:38PM +0100, Michal Suchánek wrote:
> On Thu, Nov 23, 2023 at 12:06:59PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Nov 23, 2023 at 04:29:52PM +0100, Philippe Mathieu-Daudé wrote:
> > > This document targets all contributors. Contributions can be typo
> > > fixes, translations, ... and don't have to be technical. Similarly,
> > > contributors aren't expected to be technical experts. To a neophyte,
> > > "AI" makes sense. "Idempotent code generator" or "LLM" don't :)
> >
> > I don't think there's any big deal in using AI for typo fixes.
>
> For how many typos is it still OK, and would not a deterministic
> spellchecker be preferred?
>
> There are some edge cases where using AI is OK; the problem is that
> most of the time it is not clear it is OK to use.
>
> Thanks
>
> Michal

¯\_(ツ)_/¯ I am not a lawyer, and I don't speak for Red Hat. My point
is, however, that even if you are using e.g. a grammar corrector, you
had better make sure that it is not claiming that its output is a
derivative work.
On Thu, 23 Nov 2023 at 18:02, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Nov 23, 2023 at 04:56:28PM +0200, Manos Pitsidianakis wrote:

<snip>

> > The following definition makes sense to me:
> >
> > - Automated codegen tool must be idempotent.
> > - Automated codegen tool must not use statistical modelling.
>
> As a casual reader, I would find this somewhat unclear to interpret
> and relate to.

It's also not really relevant to what we're trying to rule out. A
non-idempotent codegen tool is fine, if the code it generates is
clearly under a license that's compatible with QEMU's. A codegen tool
that uses statistical modelling is also fine, if (for example) it's
only doing statistical modelling of the data in the single file it's
adding code to and doesn't use any external data set.

> > I'd remove all AI or LLM references. These are non-specific, colloquial
> > and, in the case of `AI`, non-technical. This policy should apply the
> > same to a Markov chain code generator.
>
> The fact that they are colloquial is, IMHO, a good thing, as it makes
> the policy relatable to the casual reader who hears the terms "AI" and
> "LLM" in technical press articles/blogs/etc all over the place.

Yes, I think that the most important thing about the wording of this
policy (assuming we agree on it) is that it should be immediately very
clear to anybody reading it that ChatGPT, Copilot, etc type tools
aren't permitted. Because in practice the most likely case is somebody
who wants to use those, and we don't want to make them have to go
through "read an abstract definition of what isn't permitted and apply
that abstract definition to the concrete tool they're using".

thanks
-- PMM
On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
> The license of a code generation tool itself is usually considered
> to be not a factor in the license of its output.

Really? I would find it very surprising if a code generation tool that
is not a language model, and so does not understand the code it's
generating, did not include some code snippets of its own in the output.

It is also possible to unintentionally run afoul of the GPL's definition
of source code, which is "the preferred form of the work for making
modifications to it". So even if you have copyright to the input,
dumping just the output and putting the GPL on it might or might not
be OK.
On Thu, Nov 23, 2023 at 06:37:47PM +0100, Michal Suchánek wrote:
> If you could erase the copyright on anything by feeding it into a
> statistical model and pulling it back out, there would be some big
> content license holders objecting, so it's very unlikely to happen.

I won't venture a guess, and I think neither should QEMU. For now,
being on the safe side and rejecting auto-generated code sounds very
reasonable to me, though, in particular because it's often quite low
quality ;).

Not a lawyer, and I don't speak for Red Hat.
On Thu, Nov 23, 2023 at 05:46:16PM +0000, Daniel P. Berrangé wrote:
> On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> >
> > <snip quoted patch and commit message>
> >
> > I agree this is going to be a field that keeps lawyers well
> > remunerated for the foreseeable future. However I suspect this elides
> > over the main use case for LLM generators, which is non-novel
> > transformation. One good example is generating test fixtures, where
> > you write a piece of original code and then ask the code completion
> > engine to fill out some unit tests to exercise the code. It's boring
> > mechanical work, but one an LLM is very suited to (even if you might
> > tweak the final result).
>
> Yes, I can see how that is helpful, but I think in many cases the
> resulting code will be complex enough to be considered copyrightable,
> and so even with the original input code, I feel the licensing of the
> output is still ill-defined.
>
> > There is a load of code out there that isn't eligible for copyright
> > protection because it doesn't demonstrate much originality or
> > creativity. In the experimentation I've done so far I've not seen
> > much sign of genuine creativity. LLMs benefit from having access to a
> > wide corpus of training data, and tend to do a better job of
> > inferring solutions from semi-related posts than, say, a human
> > manually comparing posts after pasting an error message into Google.
>
> The boundary between what is considered copyrightable and what is not
> is itself quite ill-defined, and thus it is hard to express a clear
> rule that can be applied.
>
> I think more experienced long-term contributors end up getting somewhat
> of a "gut feeling" about what's OK and what's not, but I'm not sure if
> that is true for contributors in general.
>
> IOW, while there are likely cases where it is possible to safely use
> an AI generator, I'm not sure how best to express that in a way that
> makes sense.
>
> Perhaps a loosely worded addendum about a possible exception for
> "trivial" output.
>
> > What about if you took an LLM and then fine-tuned it using project
> > data so it could better help new users in making contributions to the
> > project? You would be biasing the model to your own data for the
> > purposes of helping developers write better QEMU code.
>
> It is hard to provide an answer to that question, since I think it is
> something that would need to be considered case by case. It hinges
> on how much the new QEMU-specific training data influences the model,
> vs other pre-existing training (if any).
>
> Perhaps we can finish this policy with a general point to solicit
> feedback on possible exceptions?
>
> "If a contributor believes they can demonstrate that the output of
> a particular tool has deterministic licensing, such that they can
> satisfy the DCO, they should provide such info to the mailing list"
>
> With regards,
> Daniel

But the question is not about what QEMU should accept. We can trust
maintainers to DTRT. The question is the meaning of the DCO. If you
want the DCO to mean "this code was not generated by AI", then you had
better define "AI" in an unambiguous way; otherwise what is it
certifying?
Instead, I propose adding simply this:

    Thus, generally, Signed-off-by from *each* person who has written
    a substantial portion of the patch is required.

    If a substantial portion of the patch was not written by any
    human person but was instead generated automatically (e.g. by an AI
    such as ChatGPT, or a decompiler) then you *must* clearly document
    this in the patch commit message. As a matter of policy, and out of
    an abundance of caution, such contributions will generally be
    rejected.

    When in doubt whether a specific portion is substantial - assume
    that Signed-off-by is required.
On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
> > The license of a code generation tool itself is usually considered
> > to be not a factor in the license of its output.
>
> Really? I would find it very surprising if a code generation tool that
> is not a language model, and so does not understand the code it's
> generating, did not include some code snippets of its own in the
> output. It is also possible to unintentionally run afoul of the GPL's
> definition of source code, which is "the preferred form of the work
> for making modifications to it". So even if you have copyright to the
> input, dumping just the output and putting the GPL on it might or
> might not be OK.

Consider the C pre-processor. This takes an input .c file, and expands
all the macros, to spit out a new .c file.

The license of the output .c file is determined by the license of the
input .c file. The license of the CPP implementation (whether OSS or
proprietary) doesn't have any influence on the license of the output
file; it cannot magically force the output file to be proprietary any
more than it can force the output file to be GPL.

With regards,
Daniel
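To make the analogy concrete, here is a minimal sketch (the file name
and macro are invented for the example):

    /* align.c - hypothetical input file, under its author's chosen license */
    #define PAGE_SIZE 4096
    #define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

    unsigned long align_up(unsigned long addr)
    {
        return PAGE_ALIGN(addr);
    }

Running "cpp -P align.c" expands the macros and emits, in effect:

    unsigned long align_up(unsigned long addr)
    {
        return (((addr) + 4096 - 1) & ~(4096 - 1));
    }

Whichever cpp binary performed the expansion, the output derives from
align.c alone, so its license follows align.c.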
On Fri, Nov 24, 2023 at 09:06:29AM +0000, Daniel P. Berrangé wrote:
> On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
> > <snip>
>
> Consider the C pre-processor. This takes an input .c file, and expands
> all the macros, to spit out a new .c file.
>
> The license of the output .c file is determined by the license of the
> input .c file. <snip>
>
> With regards,
> Daniel

Sorry, I don't get how the C preprocessor is relevant here. It does not
generate source code in the GPL sense; we won't accept C preprocessor
output in a patch.

Not being a lawyer, I personally am not really interested in discussing
how copyright works, certainly not at this highly abstract and
simplified level.
Am 24.11.2023 um 00:53 hat Michael S. Tsirkin geschrieben:
> On Thu, Nov 23, 2023 at 05:46:16PM +0000, Daniel P. Berrangé wrote:
> > On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
> > > Daniel P. Berrangé <berrange@redhat.com> writes:
> > >
> > > <snip quoted patch>
> > >
> > > I agree this is going to be a field that keeps lawyers well
> > > remunerated for the foreseeable future. However I suspect this
> > > elides over the main use case for LLM generators, which is
> > > non-novel transformation. <snip>
> >
> > <snip>
> >
> > > What about if you took an LLM and then fine-tuned it using project
> > > data so it could better help new users in making contributions to
> > > the project? You would be biasing the model to your own data for
> > > the purposes of helping developers write better QEMU code.
> >
> > It is hard to provide an answer to that question, since I think it is
> > something that would need to be considered case by case. It hinges
> > on how much the new QEMU-specific training data influences the model,
> > vs other pre-existing training (if any).

I suspect fine-tuning won't be enough, because it doesn't make the
unlicensed original training data go away.

If you could make sure that all of the training data consists only of
code for which you have the right to contribute it to QEMU, that would
be a different case.

> > Perhaps we can finish this policy with a general point to solicit
> > feedback on possible exceptions?
> >
> > "If a contributor believes they can demonstrate that the output of
> > a particular tool has deterministic licensing, such that they can
> > satisfy the DCO, they should provide such info to the mailing list"
> >
> > With regards,
> > Daniel
>
> But the question is not about what QEMU should accept. We can trust
> maintainers to DTRT. The question is the meaning of the DCO. If you
> want the DCO to mean "this code was not generated by AI", then you had
> better define "AI" in an unambiguous way; otherwise what is it
> certifying?

That you can state confidently that you have the legal right to
contribute this code.

The problem is not AI per se; the problem is incompatibly licensed - or
really, unlicensed (should I call it "pirated" for effect?) - training
input for the AI. So if you got the code from ChatGPT, I simply won't
believe you even if you claim that you have the right.

> Instead, I propose adding simply this:
>
>     Thus, generally, Signed-off-by from *each* person who has written
>     a substantial portion of the patch is required.
>
>     If a substantial portion of the patch was not written by any
>     human person but was instead generated automatically (e.g. by an AI
>     such as ChatGPT, or a decompiler) then you *must* clearly document
>     this in the patch commit message. As a matter of policy, and out of
>     an abundance of caution, such contributions will generally be
>     rejected.
>
>     When in doubt whether a specific portion is substantial - assume
>     that Signed-off-by is required.

"Generated automatically" is going way too far. There is no problem at
all with code changes generated by Coccinelle if you wrote the rules
yourself or received them under a license that allows their inclusion
in QEMU.

The problem with ChatGPT etc. is that there is no licensing information
attached to the generated code. You know it's based on someone else's
work, but you don't know who it is, whether they are willing to give
you a license, and under which conditions.

And it's not an "abundance of caution" that makes us reject such
patches, but that you obviously can't actually sign the DCO under such
circumstances, and therefore the S-o-b is wrong.

Kevin
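For readers who haven't met Coccinelle: a semantic patch is a small,
reviewable rule file whose transformations are fully determined by the
rule itself. A minimal sketch (this particular rule and the file names
are hypothetical illustrations, not taken from the thread):

    // drop_null_check.cocci: free(NULL) is a no-op, so the check is redundant
    @@
    expression E;
    @@
    - if (E != NULL)
    -     free(E);
    + free(E);

Applied with something like "spatch --sp-file drop_null_check.cocci
--in-place file.c", the resulting diff is traceable entirely to file.c
plus a rule whose author and license are known, which is exactly the
provenance property the ChatGPT case lacks.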
Daniel P. Berrangé <berrange@redhat.com> writes:
> On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
>> On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
>> > The license of a code generation tool itself is usually considered
>> > to be not a factor in the license of its output.
>>
>> <snip>
>
> Consider the C pre-processor. This takes an input .c file, and expands
> all the macros, to spit out a new .c file.
>
> The license of the output .c file is determined by the license of the
> input .c file. <snip>

LLMs are just a tool, like a compiler (albeit with spookier internals).
The prompt and the instructions are arguably the more important part of
how to get good results from the LLM transformation. In fact, most of
the way I've been using them has been by pasting some existing code and
asking for a review or a transformation of it.

However, I totally get that with the various online LLMs you have very
little transparency about what has gone into their training, and
therefore there is a danger of proprietary code being hallucinated out
of their matrices. Conversely, what if I use an LLM like OpenLLaMa:

  https://github.com/openlm-research/open_llama

There I have fairly exhaustive definitions of what went into the
training data, of which the most interesting part is probably the
StarCoder dataset (paper):

  https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view

where there are tools to detect if generated code has been lifted
directly from the dataset or is indeed a transformation.

> With regards,
> Daniel
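The detection idea Alex refers to can be pictured as an overlap check
between generated output and the training corpus. A toy sketch of the
concept (an assumption-laden illustration, not the actual StarCoder
tooling; the 10-token window size is made up):

    # lift_check.py: flag generated code that shares long token windows
    # with a known corpus (hypothetical toy, not StarCoder's own tool)
    def shingles(text: str, n: int = 10) -> set[str]:
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def lifted_fraction(generated: str, corpus_docs: list[str], n: int = 10) -> float:
        gen = shingles(generated, n)
        if not gen:
            return 0.0
        corpus: set[str] = set()
        for doc in corpus_docs:
            corpus |= shingles(doc, n)
        # Fraction of generated n-token windows seen verbatim in the corpus
        return len(gen & corpus) / len(gen)

    # e.g. treat > 50% verbatim overlap as "lifted; audit before use"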
Am 23.11.2023 um 15:56 hat Manos Pitsidianakis geschrieben:
> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:
> >
> > <snip quoted patch>
> >
> > So you called out these two by name, fine, but given "AI" is in scare
> > quotes I don't really know what is or is not allowed, and I don't
> > know how contributors will know. Is the "AI" that one must not use
> > necessarily an LLM? And how do you define LLM even? Wikipedia says
> > "general-purpose language understanding and generation".
> >
> > All this seems vague to me.
> >
> > However, can't we define a simpler, more specific policy?
> > For example, isn't it true that *any* automatically generated code
> > can only be included if the scripts producing said code
> > are also included or otherwise available under GPLv2?
>
> The following definition makes sense to me:
>
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.

How are these definitions related to your ability to sign the DCO?

Kevin
On Fri, Nov 24, 2023 at 10:21:17AM +0000, Alex Bennée wrote:
> LLMs are just a tool, like a compiler (albeit with spookier internals).

We already generally don't accept compiler output in patches, since it
is not source code by the definition of the GPL.
Kevin Wolf <kwolf@redhat.com> writes:
> Am 24.11.2023 um 00:53 hat Michael S. Tsirkin geschrieben:
>> On Thu, Nov 23, 2023 at 05:46:16PM +0000, Daniel P. Berrangé wrote:
>> > On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
>> > > Daniel P. Berrangé <berrange@redhat.com> writes:
>> > >
>> > > <snip>
>> > >
>> > > What about if you took an LLM and then fine-tuned it using project
>> > > data so it could better help new users in making contributions to
>> > > the project? You would be biasing the model to your own data for
>> > > the purposes of helping developers write better QEMU code.
>> >
>> > It is hard to provide an answer to that question, since I think it
>> > is something that would need to be considered case by case. It
>> > hinges on how much the new QEMU-specific training data influences
>> > the model, vs other pre-existing training (if any).
>
> I suspect fine-tuning won't be enough, because it doesn't make the
> unlicensed original training data go away.
>
> If you could make sure that all of the training data consists only of
> code for which you have the right to contribute it to QEMU, that would
> be a different case.

That probably means we can never use even open source LLMs to generate
code for QEMU, because while the source data is all open source, it
won't necessarily be GPL-compatible.
On Fri, Nov 24, 2023 at 11:25:55AM +0100, Kevin Wolf wrote:
> > - Automated codegen tool must be idempotent.
> > - Automated codegen tool must not use statistical modelling.
>
> How are these definitions related to your ability to sign the DCO?

Not only that: while the question of whether code generated e.g. by
Copilot would be source code by the GPL definition is unclear, at least
to me, code generated by an idempotent automated tool seems highly
likely not to satisfy the GPL definition.

Though I am not a lawyer and do not speak for Red Hat.
On Fri, Nov 24, 2023 at 10:33:49AM +0000, Alex Bennée wrote:
> That probably means we can never use even open source LLMs to generate
> code for QEMU, because while the source data is all open source, it
> won't necessarily be GPL-compatible.

I would probably wait until the dust settles before we start accepting
LLM-generated code. If nothing else, generated code quality in our
niche area is at this point still nowhere near being useful.
On Fri, 24 Nov 2023 12:25, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 23.11.2023 um 15:56 hat Manos Pitsidianakis geschrieben:
>> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> > On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:
>> >
>> > <snip quoted patch>
>> >
>> > So you called out these two by name, fine, but given "AI" is in
>> > scare quotes I don't really know what is or is not allowed, and I
>> > don't know how contributors will know. <snip>
>> >
>> > However, can't we define a simpler, more specific policy?
>> > For example, isn't it true that *any* automatically generated code
>> > can only be included if the scripts producing said code
>> > are also included or otherwise available under GPLv2?
>>
>> The following definition makes sense to me:
>>
>> - Automated codegen tool must be idempotent.
>> - Automated codegen tool must not use statistical modelling.
>
> How are these definitions related to your ability to sign the DCO?
>
> Kevin

This was a response to Michael's salient observation that AI and LLM
are very vague and not clearly defined terms. I did not mention the DCO
at all.

Manos
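For concreteness, the sort of tool that passes both of Manos's tests -
deterministic output, no statistical model - could be as small as a
boilerplate generator like this hedged sketch (the field table and
function naming are invented for the example):

    #!/usr/bin/env python3
    # gen_props.py: hypothetical deterministic codegen; identical input
    # always yields identical output, and no external data is consulted.
    FIELDS = [("irq_level", "int"), ("mmio_base", "uint64_t")]

    for name, ctype in FIELDS:
        print(f"{ctype} dev_get_{name}(DeviceState *s);")
        print(f"void dev_set_{name}(DeviceState *s, {ctype} value);")

The output's provenance is exactly the script plus its input table, so
whoever wrote (or is licensed to use) those can sign the DCO for the
result.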
On Fri, 24 Nov 2023 at 10:42, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Nov 24, 2023 at 10:33:49AM +0000, Alex Bennée wrote:
> > That probably means we can never use even open source LLMs to
> > generate code for QEMU, because while the source data is all open
> > source, it won't necessarily be GPL-compatible.
>
> I would probably wait until the dust settles before we start accepting
> LLM-generated code.

I think that's pretty much my take on what this policy is:
"say no for now; we can always come back later when the legal
situation seems clearer".

-- PMM
On Fri, Nov 24, 2023 at 10:43:05AM +0000, Peter Maydell wrote:
> I think that's pretty much my take on what this policy is:
> "say no for now; we can always come back later when the legal
> situation seems clearer".

Absolutely. So I think we should not venture into terminology such as
what is or isn't AI, or try to promote legal copyright theories. At the
moment there's no good reason for someone who did not write the code to
put their DCO sign-off on it. If it is not clear who wrote the code,
because it was generated and not written, then we don't want it.
On Fri, Nov 24, 2023 at 10:43:05AM +0000, Peter Maydell wrote:
> I think that's pretty much my take on what this policy is:
> "say no for now; we can always come back later when the legal
> situation seems clearer".

Yes, that was my thoughts exactly.

And if anyone comes along with a specific LLM/AI code generator that
they believe can be used in a way compatible with the DCO, they can
ask for an exception to the general policy, which we can discuss then.

With regards,
Daniel
On Fri, Nov 24, 2023 at 11:37:15AM +0000, Daniel P. Berrangé wrote:
> On Fri, Nov 24, 2023 at 10:43:05AM +0000, Peter Maydell wrote:
> > I think that's pretty much my take on what this policy is:
> > "say no for now; we can always come back later when the legal
> > situation seems clearer".
>
> Yes, that was my thoughts exactly.
>
> And if anyone comes along with a specific LLM/AI code generator that
> they believe can be used in a way compatible with the DCO, they can
> ask for an exception to the general policy, which we can discuss then.

Yea. But why do you keep worrying about the LLM/AI mess? Are there code
generators whose output we *do* allow? What are these?
On Fri, Nov 24, 2023 at 06:39:21AM -0500, Michael S. Tsirkin wrote:
> <snip>
>
> Yea. But why do you keep worrying about the LLM/AI mess? Are there
> code generators whose output we *do* allow? What are these?

And to clarify, I mean source code in the GPL sense, so please do not
say "compiler".
On Fri, Nov 24, 2023 at 10:21:17AM +0000, Alex Bennée wrote:
> <snip>
>
> However, I totally get that with the various online LLMs you have very
> little transparency about what has gone into their training, and
> therefore there is a danger of proprietary code being hallucinated out
> of their matrices. Conversely, what if I use an LLM like OpenLLaMa:
>
>   https://github.com/openlm-research/open_llama
>
> There I have fairly exhaustive definitions of what went into the
> training data, of which the most interesting part is probably the
> StarCoder dataset (paper):
>
>   https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view
>
> where there are tools to detect if generated code has been lifted
> directly from the dataset or is indeed a transformation.

I've not looked at the links above, but I think if someone can make a
compelling argument that *specific* tools have sufficient transparency
to be compatible with signing the DCO, then I think we could maintain
a list of exceptions in the policy.

With regards,
Daniel
diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
index b4591a2dec..a6e42c6b1b 100644
--- a/docs/devel/code-provenance.rst
+++ b/docs/devel/code-provenance.rst
@@ -195,3 +195,43 @@ example::
     Signed-off-by: Some Person <some.person@example.com>
     [Rebased and added support for 'foo']
     Signed-off-by: New Person <new.person@example.com>
+
+Use of "AI" (LLM) code generators
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+TL;DR:
+
+  **Current QEMU project policy is to DECLINE any contributions
+  which are believed to include or derive from "AI" (LLM)
+  generated code.**
+
+The existence of "AI" (`Large Language Model <https://en.wikipedia.org/wiki/Large_language_model>`__
+/ LLM) code generators raises a number of difficult legal questions, a
+number of which impact on Open Source projects. As noted earlier, the
+QEMU community requires that contributors certify their patch submissions
+are made in accordance with the rules of the :ref:`dco` (DCO). When a
+patch contains "AI" generated code this raises difficulties with code
+provenence and thus DCO compliance.
+
+To satisfy the DCO, the patch contributor has to fully understand
+the origins and license of code they are contributing to QEMU. The
+license terms that should apply to the output of an "AI" code generator
+are ill-defined, given that both training data and operation of the
+"AI" are typically opaque to the user. Even where the training data
+is said to all be open source, it will likely be under a wide variety
+of license terms.
+
+While the vendor's of "AI" code generators may promote the idea that
+code output can be taken under a free choice of license, this is not
+yet considered to be a generally accepted, nor tested, legal opinion.
+
+With this in mind, the QEMU maintainers does not consider it is
+currently possible to comply with DCO terms (b) or (c) for most "AI"
+generated code.
+
+The QEMU maintainers thus require that contributors refrain from using
+"AI" code generators on patches intended to be submitted to the project,
+and will decline any contribution if use of "AI" is known or suspected.
+
+Examples of tools impacted by this policy includes both GitHub CoPilot,
+and ChatGPT, amongst many others which are less well known.