Copyright Exceptions and Data Mining

On July 5, 2018, we had the opportunity to interview two esteemed individuals within the field of copyright: Dr. Jane Secker and Mr. Chris Morrison. Dr. Secker is a senior lecturer in educational development at City, University of London, and Mr. Morrison is the copyright, software licensing and information services policy manager at the University of Kent. Both Dr. Secker and Mr. Morrison sit on the Universities UK / GuildHE Copyright Negotiation and Advisory Committee and are co-founders of the UK Copyright Literacy blog.

To provide some context for their responses, two key elements of UK and European law should be noted. First, on June 1, 2014, new exceptions to UK copyright law came into force through the Copyright and Rights in Performances (Research, Education, Libraries and Archives) Regulations 2014. These included the removal of barriers to Text and Data Mining (TDM) for non-commercial purposes. Second, the European Union (EU) is in the process of modernizing its copyright laws, and recently held a vote on the proposal for a Directive of the European Parliament and of the Council on Copyright in the Digital Single Market, which was rejected and will be revisited in September 2018.

We asked Dr. Secker and Mr. Morrison the following questions:

What are some examples of the kinds of data mining that researchers should legitimately be able to do?
What legal barriers stand in the way of this and, if you could, tell us about the proposed exception for TDM that’s proposed in Europe.
Given the current controversies about data mining by social media companies and political consulting companies, privacy issues have risen to prominence. How would the proposed copyright exception intersect with privacy law and what types of research would not be permitted given European privacy regulation?

According to Mr. Morrison, the right to read should be the right to mine. Dr. Secker reiterated this notion and also stressed the importance of being able to legitimately mine various forms of data, whether it be full text subscription databases, abstracts, digitized collections, social media content, etc.

The legal barriers identified by Mr. Morrison include the various licensing terms, the terms and conditions of websites, and differing laws around the world on data mining that are often very complicated for researchers to fully grasp. Further, many researchers find themselves under pressure from external sources, such as those funding the research, to openly license the dataset, which can be troublesome, especially if the researcher is working in collaboration with a commercial organization. According to Dr. Secker, TDM has been recognized in UK law since 2014, but it is not currently available as a copyright exception in other European countries. This makes it difficult to work in partnership with others who may be subject to different legal rules on how they can interact with specific datasets. The barriers are also not just legal; they can be technical as well, especially when considering factors such as Digital Rights Management (DRM) protection. This poses a conundrum for copyright: on the one hand, there’s an exception that permits you to engage in TDM, but on the other, DRM or other technical protection measures may prevent you from obtaining access.

When it comes to research, Dr. Secker reminds us that there are pre-existing ethical codes of practice that researchers must adhere to. Any researcher working in the field of copyright or TDM would have to get ethical clearance before conducting their research. Mr. Morrison also reminds us that intellectual property laws are not implemented for privacy purposes, but to incentivize creativity and investment in information goods. Privacy concerns are a separate issue from copyright, and it’s important to keep them separate when addressing them.

See below for a transcript of the interview (transcript has been edited for clarity and readability).

What are some examples of the kinds of data mining that researchers should legitimately be able to do?

Mr. Morrison: Well, I think they should be able to mine legitimately acquired sources of data, specifically subscription databases that academic institutions subscribe to. In our view, and the view of many information professionals, we have paid to get legitimate access, and we should be able to run computational analysis and algorithms on those datasets in order to understand the facts and the underlying patterns within that information source. But also, beyond that, anything that has value to pure research, whether that be science, social science, or even humanities, anything where new knowledge and new understandings can be created from the information source, should be something that researchers are able to do without having to go into a very complex and potentially expensive process of getting additional permissions. In summary, the right to read should be the right to mine.

Dr. Secker: The only thing I would add to this is that the law should cover data in all sorts of formats. It should cover full text subscription databases, but the researcher might be mining abstracts as well, as is the case in large-scale systematic reviews, so it should cover abstracts, and image data as well, where you’ve got digitized collections. In my previous role as the copyright and digital literacy adviser at the London School of Economics (LSE), we had historical sources that had been digitized, and they were mainly image-based, although some of them had been converted to text. Being able to mine all sorts of different data is crucial to researchers, and there was a lot of interest in this from researchers.

What legal barriers stand in the way of this and, if you could, tell us about the proposed exception for TDM that’s proposed in Europe.

Mr. Morrison: Well, I think the legal barrier to this from the perspective of the researcher is the numerous licensing terms, terms and conditions, and different laws that for most people are very complicated and worrying. So, the area of research that Jane and I are most interested in is how copyright is perceived and how it’s experienced by those involved in research and education. In our experience, most of them are under a lot of pressure from the many different sources that have funded them to make their research available in certain ways, such as publishing on an open access basis. At the same time, there are ethical concerns that they have to abide by, and therefore copyright and associated rights, such as database rights, are just another aspect of a great many things that they have to make sure they get right, and it’s something they find hugely complicated. Questions such as what is commercial and what is non-commercial can also become a barrier when they’re working with other partners in what could be regarded as commercial organizations.

Dr. Secker: We’ve had TDM in UK law since 2014 [https://www.gov.uk/government/news/new-exceptions-to-copyright-reflect-digital-age], which, obviously, other European countries don’t have at the moment. So, if we want to work with a partner that is outside the UK, having this harmonized across Europe would help those kinds of projects, because at the moment it is something we’ve only had for four years in the UK, and there’s still been quite a lot of difficulty getting the message out there that it is permitted. The barriers aren’t necessarily legal; a lot of them are technical, so they could be related to things like DRM. That has caused some problems in examples I know of where, essentially, databases or some kind of web-based source will have some sort of mechanism to stop you from downloading the amount of data that you need to perform TDM, and if they use DRM, then you get into quite a difficult situation legally, because circumventing the DRM is illegal. So, what takes precedence? You’ve got an exception that says you’re allowed to do TDM, but if you’ve got DRM on there in some form, you need to apply to have it taken off; you can’t just hack into the system, which would be a way around it. But the issue about Europe is, I think, significant: where a project might be working across more than one country, having that exception only in the UK has potentially meant that there haven’t been large-scale projects to look at from a European level yet.

Mr. Morrison: Yes, and also, to add to that question about DRM or Technical Protection Measures (TPM) at the European level: we’re obviously part of a process, and there have been some developments today on what’s happening with the vote that’s now going to take place in September [https://www.bbc.com/news/technology-44712475]. But there are potentially worrying provisions in there around fixing that situation with TPM in law so that there is no way to get around that at all, even at a local level. Jane has had the experience of referring a potential TDM example to the UK Intellectual Property Office because we wanted to remove the TPM, and that’s possibly going to be changed at the European level, which would make that impossible to do. Also, the European proposal to limit it to research institutions only could be problematic where we are working, as I mentioned earlier on, in partnership with other organizations; that will potentially limit what researchers can do.

Given the current controversies about data mining by social media companies and political consulting companies, privacy issues have risen to prominence. How would the proposed copyright exception intersect with privacy law and what types of research would not be permitted given European privacy regulation?

Dr. Secker: This is an interesting question. I think in terms of social media data, for example, I’ve run into a number of situations involving the use of social media in research, particularly how to harvest data out of Facebook and Twitter. There’s a lot of interest from researchers in doing new types of research, and I think one of the things to remember is that there are ethical codes of practice that already exist. So, the Association of Internet Researchers have a strict code of conduct if you’re doing this type of research, where privacy and the use of personal data are really clearly considered. I had a number of examples where people would come, often Ph.D. students, where they might have harvested data out of blogs or from social media, and a lot of this came down to informed consent and what that means when you are taking data that somebody’s put out on the web. It doesn’t mean it’s fair game to do what you want with it. Obviously, there are huge concerns at the moment with changes to data protection, that privacy should somehow trump copyright and become the kind of thing that we always have to be mindful of. But I think any researcher that’s working in this space would be getting ethical clearance, and I think privacy would be a massive concern. I think if you’re doing a project that involves a very sensitive area, and perhaps you’re using a hashtag in a way that exposes people’s identities and the things that they say as individuals, that’s just kind of unethical from the start, really.

Mr. Morrison: Yes, I think when having conversations with people about how to overcome the potential barriers that intellectual property laws provide, the conversation often turns towards privacy, and people will say, well, does copyright stop me from doing this in order to protect people’s privacy? I think we’re very clear that intellectual property laws are not there for privacy purposes; they are there to incentivize creativity or the investment in information goods. The recent General Data Protection Regulation (GDPR) does create a challenge for researchers using TDM. For example, if they decide they have lawful access to an information source which involves lots of personal data, they would be allowed to mine it under copyright law or database rights and the TDM provisions, certainly in the UK, but they wouldn’t necessarily have permission to use that personal data for a secondary purpose, such as providing their dataset to somebody else to then go and look at and draw their own conclusions, because the original data subject would only have given their permission for it to be used by the original service, the original party that had taken it. So, researchers have this issue, but in a way that’s a separate issue from copyright, and it’s quite important, I think, to keep those separate when addressing them.

Dr. Secker: But I think it is about looking at the data while getting ethical clearance. Just because you’re not talking to individuals and interviewing them or getting the data from a questionnaire, because you might be doing some sort of large-scale mining of something like Twitter, it doesn’t mean that those people’s identities are fair game to be reproduced completely un-anonymized. But for people that do social research, if they’ve moved into this space and they haven’t done research using these types of sources before, I think it’s something you can cover in research training, and that was certainly what we were trying to do in my previous role. We ran a couple of really successful workshops where we got them to understand what the legal issues were, but really importantly what the ethical issues were with using that type of data.