Written by Andras Schwarcz,
Now that the volume of scientific information produced exceeds what humans can process, text and data mining (TDM) is necessary to advance research. TDM has the potential to accelerate innovation by making research more efficient, by reducing literature review time by 80 %, and by discovering connections between apparently unrelated datasets. TDM may consequently help to reinterpret existing knowledge. It can help the research community, and indirectly the general public, to benefit from a larger proportion of research data, most of which is generated using public funds.
Several issues hinder the spread of TDM globally, and especially in Europe. The first to tackle are legal, such as the unclear and fragmented legal framework for copyright. Currently, researchers need the express consent of each data publisher to be able to mine for information, even if they have the right to access the data. This discourages researchers from covering a wide range of datasets, which is especially relevant in niche fields. Rules for copyright exceptions, digital rights management (DRM) and definitions of such crucial terms as ‘non-commercial use’ are unclear and vary across Member States. Publishers and data owners often set legal restrictions on data mining in their licensing contracts with libraries and research institutions, as well as technical restrictions on bulk downloads and on crawling their websites, which especially hinder APIs developed by researchers. Although many major publishers have developed and standardised their technology to allow easy access and bulk downloads, smaller publishers are not all as advanced, and do not grant easy and standardised access for TDM. There is also a lack of technical expertise, and of fora or expert networks where researchers and institutions such as libraries could access know-how.
The debate surrounding TDM in the EU is therefore currently focused on copyright, the subject of an initiative under discussion in the European Parliament. TDM needs a unified and clear legal framework on exceptions, so that researchers can use all the research data to which they have legal access for machine reading and data mining, even with tools they have developed themselves. Beyond the copyright issues, access to data, standards and interoperability need to be addressed at EU level. These are necessary for Europe to be able to compete with Asia and the USA. There is also a need for dialogue between data users and publishers to achieve a balance between profit for publishers and public access to knowledge. Much also remains to be done in educating researchers on the potential of TDM and in developing tools for TDM, areas where libraries could be at the forefront.
Libraries have a public service mission to help the public navigate the digital world and to make information and data publicly available in all forms. The role of the library in finding specific information is being extended to finding trends and connections in information. More specifically, libraries need to play a bridging role, connecting researchers and IT developers so that research questions are brought together with research tools. Libraries need to educate non-specialists, highlighting the potential of TDM. They need to act as proponents of data harmonisation by using open standards. They can serve as information hubs, influencers and content providers, act as contact points for researchers, and help develop TDM strategies for other institutions. Libraries have always played a leading role in archive digitalisation; now they have to provide machine-readable digital archives. For these reasons, librarians’ jobs will be transformed, with librarians becoming data scientists and machine-reading experts.
The panel concluded that the European Parliament Library certainly has potential in the TDM field, notably in semantic research in the EP archives and repositories of open data, especially legal texts.