Written by Tristan Marcelin with Filippo Cassetti.
To train their models, general-purpose AI (GPAI) providers need large datasets, which may include copyrighted materials. Despite the EU Directive 2019/790 on Copyright and the EU Artificial Intelligence (AI) Act, researchers have identified legal limitations and uncertainty in the use of copyrighted materials for GPAI training.
Training GPAI
AI models able to perform a wide range of distinct tasks, such as OpenAI’s GPT models, are known as general-purpose AI (GPAI), and are each trained on a very large amount of data. The European AI Act legally defines GPAI, using factors such as capabilities, characteristics, and number of end-users. This definition comprises what are also known as generative AI models or foundation models. The latest GPAI models are multimodal, meaning they can work with different types of content. Moreover, current state-of-the-art GPAI models are termed ‘reasoning models’, as they are able to ‘reason’ step by step. OpenAI’s o3-mini model and DeepSeek’s R1 model are examples of recently released reasoning models.
GPAI models rely on deep learning techniques, which involve training the internal parameters of the model using data. The construction of datasets for training starts with the collection stage. In practice, this often relies on freely available online materials. OpenAI’s GPT-4o model, for example, was trained on data that included publicly available data. Mistral’s 7B model was also trained with data from the web. Providers have generally kept the exact data used to train their models confidential, considering it a key part of their competitive edge. Rights-holders, on the other hand, fear losing control over their content. Various pending lawsuits outside the EU, listed by researchers, claim that GPAI training data contains copyrighted materials.
EU copyright law and the AI Act
To find publicly available web data to train GPAI, providers use web crawlers – programmes that autonomously navigate the web in order to perform a defined set of actions. OpenAI’s crawler, for example, is known as GPTBot. Web crawlers have been used for years by companies such as Google, whose Googlebot crawls the web to index content for its search engine. As highlighted by researchers, the emergence of the web ‘created unprecedented challenges and opportunities for copyright holders’, although international copyright law has been changed to some extent to adapt to the Information Age.
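To make the notion of a crawler more concrete, the following is a minimal sketch in Python, not a description of how GPTBot or Googlebot actually work. It assumes a tiny in-memory ‘web’ (the hypothetical PAGES dictionary and example.org addresses) instead of real network requests, and simply follows links breadth-first from a starting page, recording each page it visits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML body. A real crawler would
# download pages over HTTP; this stand-in keeps the sketch self-contained.
PAGES = {
    "https://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.org/a": '<a href="/b">B</a>',
    "https://example.org/b": '<a href="/">home</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, pages):
    """Visit pages breadth-first from start_url and return the visit order."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # page not available in our stand-in web
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


order = crawl("https://example.org/", PAGES)
print(order)
```

In this sketch the ‘defined set of actions’ is only recording the visit; a crawler collecting training data would store the page content at that step, which is where the acts of reproduction discussed below come into play.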
Copyright law grants exclusive economic and moral rights to authors, such as the right to reproduce, distribute, communicate to the public, and make available to the public. With the Information Society Directive (Directive 2001/29), the EU created an exception for temporary acts of reproduction as part of a technological process (Article 5(1)). The EU Copyright Directive (Directive 2019/790) added two new exceptions for ‘text and data mining’ (TDM) purposes (Articles 3 and 4). TDM is defined as ‘any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations’. The exceptions allow, under specific conditions, the reproduction and extraction of protected works for TDM purposes. Performing such acts would otherwise constitute violations of certain rights under copyright and database law.
The European AI Act has two provisions related to copyright (Article 53(1)(c) and (d)). The first requires GPAI providers to comply with copyright law and the opt-out exception of the Copyright Directive, which authorises TDM as long as rights-holders do not express their refusal. It concerns any provider placing a GPAI on the EU market, ‘regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of those general-purpose AI models take place’ (recital 106). The second provision requires GPAI providers to make public a sufficiently detailed summary explaining the content used for training. Those requirements apply to providers of GPAI with or without systemic risks. To facilitate compliance with the regulation, the Commission is due to release a GPAI Code of Practice in May 2025.
Problem of copyrighted materials in GPAI training
According to researchers, the EU legislation does not yet fully address issues related to AI models and intellectual property law. The core issue is the potential presence of copyrighted materials in GPAI training datasets. Researchers have therefore been trying to assess to what extent copyright exceptions permit the reproduction of works for GPAI training. They consider the Copyright Directive’s existing TDM exceptions insufficiently clear, so legal limitations and uncertainty persist.
Uncertainty and limitations with the legal framework
The two TDM exceptions only cover specific rights protected under copyright law. However, exceptions to other rights, such as the right of communication to the public, could be needed. Indeed, researchers argue that the right of communication to the public could be triggered by enabling public access to GPAI models that produce outputs with substantial portions of copyright-protected works.
Regarding the two exceptions themselves, researchers identified legal uncertainties in using them to train GPAI models with copyright-protected materials.
The first exception for reproduction and extraction of works authorises research organisations and cultural heritage institutions to perform TDM for the purposes of scientific research and under lawful access (Article 3, Copyright Directive). There are two issues with claiming this exception for GPAI training. Firstly, researchers expressed concerns over its technical applicability. Indeed, rights-holders can implement technological protection measures (TPM) – such as restrictive application programming interfaces limiting requests – to control TDM, which would prevent researchers from fully exercising their right. Secondly, the ambiguity surrounding the ‘lawful access’ condition further complicates the practical application of the exception. In this context, stakeholders may be better off concluding licensing agreements than relying on the exception. As noted by stakeholders, several Member States have broadened the legal framework for scientific research in their transposition of the Directive. They have extended the exception to include communication to the public, in addition to reproduction and extraction.
The second exception for reproduction and extraction of works authorises TDM as long as it ‘has not been expressly reserved by their right holders in an appropriate manner, such as machine-readable means …’ (Article 4, Copyright Directive). This is known as the opt-out exception. Stakeholders have been debating the definition of ‘machine-readable’ and the duration for which reproductions of works can be kept. For ‘machine-readable’, GPAI providers support the adoption of an easy-to-access standardised file such as robots.txt. A recent German court case ruled that including the opt-out in ‘natural language’ – for instance in terms of use – qualifies as a machine-readable opt-out. Experts noted that this decision may be appealed ‘given the fundamental legal issues involved and the ambiguity of the law …’. Researchers added that the opt-out mechanism is likely to fail whenever rights-holders do not have the administrative rights for the webpage displaying their works, as they cannot add the opt-out themselves. Regarding the duration for which reproductions of works can be kept, the exception allows it for as long as needed for TDM. However, GPAI providers may need them for further processes such as evaluating models.
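As an illustration of what a standardised machine-readable opt-out could look like in practice, the sketch below uses Python’s standard urllib.robotparser module to read a robots.txt file and check whether a given crawler may fetch a page. The file content, the example.org address and the ExampleSearchBot name are hypothetical. The sketch only shows the technical mechanism: whether such a file satisfies Article 4 is precisely the legal question that remains open.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a rights-holder: it refuses access
# to the crawler identifying itself as GPTBot and allows all other crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_opted_out(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt tells user_agent not to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


page = "https://example.org/articles/1"
gptbot_blocked = is_opted_out(ROBOTS_TXT, "GPTBot", page)
other_blocked = is_opted_out(ROBOTS_TXT, "ExampleSearchBot", page)
print(gptbot_blocked, other_blocked)
```

The check is voluntary on the crawler’s side and can only be published by whoever controls the website, which mirrors the researchers’ concern about rights-holders who lack administrative rights for the page displaying their works.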
Potential next steps
A number of Member States set up a Copyright Infrastructure Task Force in 2023 to assist the Commission in finding solutions. Meanwhile, the Council of the EU published a summary in December 2024 of the Member States’ views on the issue. Several Member States state that ‘copyright uses for AI training go beyond the scope of the TDM exception’. The majority considers that introducing a legislative instrument is not necessary at this stage, prioritising implementation and monitoring of the existing legal framework.
Commissioner Henna Virkkunen suggested in October 2024 that the Commission should investigate if specific licensing mechanisms would facilitate the conclusion of licences between creative industries and AI companies. Unlike the Copyright Directive’s requirements on certain uses of protected content by online services, the AI Act does not mention licensing agreements in the context of GPAI training.
While the AI Act’s GPAI Code of Practice will not have the mandate to change the EU copyright framework, this guidance could be an intermediate step before the review of the Copyright Directive, set by law for June 2026. A revised Copyright Directive could address the identified limitations and uncertainties in training GPAI using copyright-protected works.
Read this ‘at a glance’ note on ‘AI and copyright: The training of general purpose AI’ in the Think Tank pages of the European Parliament.



