Technical aspects of text and data mining research in copyright directive

A very useful new study, requested by the Policy Department for Citizens’ Rights and Constitutional Affairs, has been published. Its author, Eleonora Rosati, outlines the main issues with the text and data mining (TDM) exception to copyright briefly, but in an informative and understandable way. The entire study is available here; below are some technical points about the exception, namely its three steps.


TDM activities can take place through different procedures and with different goals, the only common element being that of analysing and extracting associations between concepts to identify new patterns and relations. By means of a necessary simplification, it appears however possible to distinguish three common – yet not all necessary – steps in TDM processes:

  1. Access to content (Step 1);
  2. Extraction and/or copying of content (Step 2);
  3. Mining of text and/or data and knowledge discovery (Step 3).

Step 1 – Access to content

The primary distinction to be made is between content that is freely accessible and content that is not, and in relation to which access permission, i.e. a licence, may be required. In relation to the former, freedom of access does not necessarily entail that the content (text and data) is also free of legal restrictions. In relation to the latter, an issue might be also that of identifying the subjects from whom permission is to be sought, i.e. the relevant rightholders. Problems might also arise in relation to orphan works, these being works and other subject-matter that are protected by copyright or related rights and for which no rightholder has been identified or for which the rightholder, even if identified, has not been located.

If a licence is required and is successfully secured, its resulting scope determines the types of activities that the licensee is entitled to undertake in relation to the content to which access has been secured. It is worth recalling that exceeding the scope of the licence secured might expose the licensee to liability for infringing acts. Some publishers include the possibility of undertaking TDM activities within the scope of the licences available, but that is not always the case. In particular, if acts of extraction and/or copying of content are needed to undertake TDM, then further issues should be considered by the licensee who is not also explicitly allowed to perform TDM on the licensed content.

Step 2 – Extraction and/or copying of content

Lawful access to content – whether because such content is freely accessible or access has been obtained through a licence – does not necessarily entitle one to undertake TDM in respect of such content (text or data). This is because to undertake TDM it may be necessary to undertake certain propaedeutic (preparatory) activities, including extracting and/or copying the content, for which specific authorization may be required. Not all TDM practices necessarily require the extraction and/or copying of content.

Not all acts of copying are necessarily subject to the control of the relevant rightholder. If the content extracted and/or copied is included in a database, then both copyright and the sui generis (database) right might come into consideration, as well as other aspects in the event that neither vests in the database considered.

With regard to copyright, the author of a database is entitled to prevent a number of acts, including the reproduction – whether temporary or permanent – by any means and in any form, in whole or in part of the expression of the database which is protectable by copyright, i.e. expression that is sufficiently original. The only mandatory limitation to the rights of the copyright holder relates to the performance by the lawful user of a database or of a copy thereof of any acts that are necessary for the purposes of accessing the content of the databases and its normal use.

With regard to the sui generis (database) right, the maker of a database who has made qualitatively and/or quantitatively a substantial investment in either the obtaining, verification or presentation of the contents is entitled to prevent extraction and/or re-utilization of the whole or of a substantial part, evaluated qualitatively and/or quantitatively, of the contents of that database. Restrictions may also subsist in relation to databases that are protected by neither copyright nor the sui generis (database) right.

Related (neighbouring) rights might also come into consideration in relation to prospective acts of copying aimed at undertaking TDM activities. Should the proposed press publishers’ rights be ultimately adopted, acts of reproduction in respect of press publications might also require authorization of press publishers.

Finally, it should be noted that not only might intellectual property rights limit the activities underlying Step 2, but also other areas of the law might be relevant at this stage. In this sense, the application of data protection and privacy laws to the realm of text and data extraction should be considered. Another area that might be relevant is contract law, especially in relation to contractual restrictions and – where applicable – contractual restrictions of TDM.

Step 3 – Mining of text and/or data and knowledge discovery

This step is also propaedeutic to the realization of the goal underlying predictive TDM, which is not just mere extraction of information, but rather knowledge discovery. In addition to the steps discussed above, which consist of identifying the content to use and securing access to it, in most cases stages in text and data mining processes include:

  • Pre-processing of relevant text and data (Stage A);
  • Extraction of structured data (Stage B).
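
To make the two stages more concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the study: the stop-word list, the sample sentences and the function names `preprocess` and `extract_cooccurrences` are all assumptions. Stage A is reduced to lowercasing, tokenizing and removing stop words; Stage B is reduced to counting which terms appear together in the same document, one simple way of turning free text into structured data that can later be mined for associations.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "with"}


def preprocess(text):
    """Stage A (simplified): lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def extract_cooccurrences(documents):
    """Stage B (simplified): count term pairs that occur in the same document."""
    pairs = Counter()
    for doc in documents:
        # Sort the unique terms so each pair has one canonical ordering.
        terms = sorted(set(preprocess(doc)))
        pairs.update(combinations(terms, 2))
    return pairs


# Made-up sample content; real TDM would run on lawfully accessed material.
docs = [
    "Aspirin is associated with reduced inflammation.",
    "Inflammation and aspirin dosage.",
]
table = extract_cooccurrences(docs)
print(table[("aspirin", "inflammation")])
```

Note that even this toy example has to read, and hold in memory, a copy of each document before it can count anything, which is the kind of act that Step 2 is concerned with.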