Horizon: the EU Research & Innovation magazine | European Commission

Copyright shift would put Europe ahead in ‘future of research’ data mining

Text and data mining uses software that can sift through large quantities of material to find connections and contradictions that would otherwise be impossible to uncover. Image credit: Flickr/ Christiaan Colen

In today’s digital age, it can feel as though we are drowning in a deluge of data, and the scientific field is no different. According to a 2014 study, one paper is published every 30 seconds, and more than 70 000 papers have been published on a single protein, a tumour suppressor called p53. 

Given the challenge of manually keeping track of such a volume of information, let alone making use of it, it is unsurprising that a separate study found that 90 % of scientific papers are never cited, and only half are ever read by anyone except the authors, referees and journal editors.

Enter text and data mining (TDM). This technique uses intelligent software to sift through large quantities of material, extract the data and analyse it for patterns. The idea is that it can help scientists identify connections and contradictions that would otherwise be impossible to uncover.
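
To give a flavour of what 'analysing for patterns' can mean in practice, here is a minimal sketch in Python. It is not the method used by any project mentioned in this article: the four sample abstracts and the short term list are invented for illustration, and real TDM systems work on far larger collections with curated vocabularies or trained language models. The sketch simply counts how often pairs of terms are mentioned in the same document, one of the simplest ways to surface a possible connection.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical mini-corpus standing in for paper abstracts (invented for illustration).
ABSTRACTS = [
    "p53 mutation is linked to tumour growth in breast cancer.",
    "Loss of p53 function promotes tumour growth.",
    "Zika infection is associated with microcephaly.",
    "Studies of Zika and microcephaly point to neural damage.",
]

# Hypothetical term list; real systems use curated vocabularies or trained models.
TERMS = ["p53", "tumour", "cancer", "zika", "microcephaly"]


def terms_in(text):
    """Return the set of vocabulary terms that appear as whole words in text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return {term for term in TERMS if term in words}


def cooccurrences(docs):
    """Count how many documents mention each pair of terms together."""
    counts = Counter()
    for doc in docs:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(terms_in(doc)), 2):
            counts[pair] += 1
    return counts


pairs = cooccurrences(ABSTRACTS)
for pair, n in pairs.most_common(2):
    print(pair, n)
```

On this toy input, the two pairs that appear together most often are p53 with tumour and Zika with microcephaly, each in two documents. Scaled up to millions of papers, the same kind of counting can flag links that no single reader would have the time to spot.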

‘Text and data mining is about extracting hidden knowledge from text, from all this volume of text that is laying around that no one is able to read,’ said Natalia Manola, a researcher at the Athena Research and Innovation Centre, Greece. ‘We are able to connect and infer knowledge that even an expert might not be able to see.’

Manola coordinates the EU-funded OpenMinTeD project, which is building a registry of TDM services and tools so that researchers are able to find an appropriate piece of software for their purposes and use it easily. Joining up TDM technologies with potential users in this way also has benefits for the designers of the software, who need access to research data to test and hone their algorithms.

‘There is a world of scientists and researchers who want to use text and data mining services but they don’t know how,’ said Manola. ‘And then there is the other side, people who are able to produce these tools and services, but somehow these have stayed within their labs, without a broad uptake from scientists, industry and the public. (We’re) trying to bridge this gap.’

Zika virus

Dr Peter Murray-Rust is director of ContentMine, a not-for-profit organisation which has developed software that enables researchers to search through scientific papers on a particular subject. He gives the example of the Zika outbreak as an area where TDM can help to enhance knowledge.

‘We’re going to need to know a lot more about Zika, and much of it may already be in the scientific literature that’s been published but that we don’t read. We don’t read it because there’s so much, so we’ve built a machine, ContentMine, that will liberate the facts from the literature.’

However, while TDM has been billed as the research method of the future, there is some indication that Europe is currently lagging behind its global counterparts in using the technology.

Marco Caspers from the Institute for Information Law in the Netherlands is working on the EU’s FutureTDM project, which is trying to identify what is preventing people from using text and data mining more.

‘There is some empirical evidence that shows that scientific output using TDM technologies is significantly less apparent in Europe than, for example, in the US,’ he said. ‘We are looking at what the cause of this could be.’

Caspers says that the danger of Europe’s underactivity is that it risks driving away innovative companies that want to develop or use TDM technologies.

‘Companies that are starting to explore this field will move out of the EU because they will have a better climate – maybe economically, maybe legally, maybe otherwise. They would be leaving the EU, which would affect the growth of the economy – because it is a growing sector.’

Some of the challenges the FutureTDM project is examining include how to set up the legal framework so people can use TDM technologies without worrying about violating data protection and privacy laws, whether data can or should come in a standard format, and how to ensure the quality of the data that’s retrieved.


One of the issues Caspers is looking at closely is copyright. Because TDM techniques may involve copying protected content, they may infringe copyright law, even when a researcher has the right to access that content.

He says that while the EU has a rule allowing reproductions to be made for non-commercial scientific research, the rule is not mandatory, and very few Member States have implemented it in a way that permits text and data mining.

‘In many countries, TDM researchers do not even know if it would be legal to do any TDM,’ said Caspers. ‘It also affects cross-border collaborations (because) they are not sure if it would be lawful.’

To resolve this, the European Commission has proposed a copyright exception that would give European researchers and some innovators the explicit right to process, on a large scale, content to which they have legal access.

The aim is to create legal clarity and make it easy for researchers to access content for TDM purposes without having to invest time and money in negotiating complex licences. It would also mean that the copyright situation is the same in all EU countries.

Currently, the proposal covers public or private organisations that are carrying out scientific research in the public interest. However, many researchers would like to see the exception extended to companies, such as small and medium-sized enterprises (SMEs), which are important not only for developing the technology to perform text and data mining, but also for using these tools to innovate.

‘We are aiming for a digital single market. If we’re not allowing TDM for SMEs we break the bridge between open science and open innovation,’ said Natalia Manola. ‘On one hand we are advertising and we want to attract SMEs, but how are they going to come to this?’
