The first microbial supertree from figure-mining thousands of papers

With recent reports putting the published scientific literature at more than 114,000,000 documents, finding ways to improve access to this knowledge and to synthesise it efficiently is becoming an increasingly pressing issue.

Seeking to address the problem through their PLUTo workflow, British scientists Ross Mounce and Peter Murray-Rust, University of Cambridge, and Matthew Wills, University of Bath, have made the world’s first attempt at automated supertree construction using data extracted exclusively by machines from published figure images. Their results are published in the open science journal Research Ideas and Outcomes (RIO).

For their study, the researchers picked the International Journal of Systematic and Evolutionary Microbiology (IJSEM) – the sole repository hosting all new validly described prokaryote taxa and, therefore, an excellent choice against which to test systems for the automated and semi-automated synthesis of published phylogenies. According to the authors, IJSEM publishes a greater number of phylogenetic tree figure images a year than any other journal.

An eleven-year span of articles dating back to January 2003 was systematically downloaded so that all image files of phylogenetic tree figures could be extracted for analysis. Computer vision techniques then allowed the images to be converted automatically back into re-usable, computable phylogenetic data, which were used for a formal supertree synthesis of all the evidence.

During their research, the scientists had to overcome various challenges posed by the copyright formally covering almost all of the documents they needed to mine. Here they faced quite a paradox – while easy access to, and re-use of, data published in scientific literature is generally supported and strongly promoted, common copyright practices make it difficult for scientists to be confident when incorporating previously compiled data into their own work. The authors discuss recent changes to UK copyright law that have allowed their work to see the light of day. As a result, they provide their output as facts and assign it to the public domain using the Creative Commons CC0 waiver, to enable worry-free re-use by anyone.

“We are now at the stage where no individual has the time to read even just the titles of all published papers, let alone the abstracts,” comment the authors.

“We believe that machines are now essential to enable us to make sense of the stream of published science, and this paper addresses several of the key problems inherent in doing this.”

“We have deliberately selected a subsection of the literature (limited to one journal) to reduce the volume, velocity and variety, concentrating primarily on validity. We ask whether high-throughput machine extraction of data from the semistructured scientific literature is possible and valuable.”  

 

Original source:

Mounce R, Murray-Rust P, Wills M (2017) A machine-compiled microbial supertree from figure-mining thousands of papers. Research Ideas and Outcomes 3: e13589. https://doi.org/10.3897/rio.3.e13589

 

Additional information:

The research has been funded by the BBSRC (grant BB/K015702/1 awarded to MAW and supporting RM).

Legitimacy of reusing images from scientific papers addressed

It goes without saying that scientific research has to build on previous breakthroughs and publications. It therefore feels quite counter-intuitive for data and their re-use to be legally restricted. Yet that is exactly what happens under the copyright restrictions placed on many scientific papers.

The discipline of taxonomy is highly reliant on previously published photographs, drawings and other images as biodiversity data. Prompted by the uncertainty among taxonomists, a team of taxonomists and experts in rights and copyright law has traced the role and relevance of copyright when it comes to images with scientific value. Their discussion and conclusions are published in the latest paper added to the EU BON Collection in the open science journal Research Ideas and Outcomes (RIO).

Taxonomic papers, by definition, cite a large number of previous publications, for instance, when comparing a new species to closely related ones that have already been described. Often it is necessary to use images to demonstrate characteristic traits and morphological differences or similarities. In this role, the images are best seen as biodiversity data rather than artwork. According to the authors, this puts them outside the scope, purposes and principles of copyright. Moreover, such images are most useful when they are presented in a standardized fashion, and so lack the artistic creativity that would otherwise make them ‘copyrightable works’.


“It follows that most images found in taxonomic literature can be re-used for research or many other purposes without seeking permission, regardless of any copyright declaration,” says Prof. David J. Patterson, affiliated with both Plazi and the University of Sydney.

Nonetheless, the authors point out that, “in observance of ethical and scholarly standards, re-users are expected to cite the author and original source of any image that they use.” Such practice is “demanded by the conventions of scholarship, not by legal obligation,” they add.

However, the authors underline that there are genuinely copyrightable visuals that might also find their way into a scientific paper. These include wildlife photographs, drawings and artwork produced in a distinctive individual form and intended for purposes other than comparison, as well as collections of images that qualify as databases in the sense of the European Protection of Databases directive.

In their paper, the scientists also provide an updated version of the Blue List, originally compiled in 2014 and comprising the copyright exemptions applicable to taxonomic works. In their Extended Blue List, the authors expand the list to include five extra items relating specifically to images.

“Egloff, Agosti, et al. make the compelling argument that taxonomic images, as highly standardized ‘references for identification of known biodiversity,’ by necessity, lack sufficient creativity to qualify for copyright. Their contention that ‘parameters of lighting, optical and specimen orientation’ in biological imaging must be consistent for comparative purposes underscores the relevance of the merger doctrine for photographic works created specifically as scientific data,” comments Ms. Gail Clement, Head of Research Services at the Caltech Library, on the publication.

“In these cases, the idea and expression are the same and the creator exercises no discretion in complying with an established convention. This paper is an important contribution to the literature on property interests in scientific research data – an essential framing question for legal interoperability of research data,” she adds.


Original source:

Egloff W, Agosti D, Kishor P, Patterson D, Miller J (2017) Copyright and the Use of Images as Biodiversity Data. Research Ideas and Outcomes 3: e12502. https://doi.org/10.3897/rio.3.e12502

Additional information:

The present study is a research outcome of the European Union’s FP7-funded project EU BON, grant agreement No 308454.