The first microbial supertree from figure-mining thousands of papers

With recent reports revealing the existence of more than 114,000,000 documents of published scientific literature, finding a way to improve access to this knowledge and synthesise it efficiently is becoming an increasingly pressing issue.

Seeking to address the problem through their PLUTo workflow, British scientists Ross Mounce and Peter Murray-Rust, University of Cambridge, and Matthew Wills, University of Bath, have made the world’s first attempt at automated supertree construction using data extracted exclusively by machines from published figure images. Their results are published in the open science journal Research Ideas and Outcomes (RIO).

For their study, the researchers picked the International Journal of Systematic and Evolutionary Microbiology (IJSEM) – the sole repository hosting all new validly described prokaryote taxa and, therefore, an excellent choice against which to test systems for the automated and semi-automated synthesis of published phylogenies. According to the authors, IJSEM publishes more phylogenetic tree figure images per year than any other journal.

An eleven-year span of articles dating back to January 2003 was systematically downloaded so that all image files of phylogenetic tree figures could be extracted for analysis. Computer vision techniques then allowed the images to be converted automatically back into re-usable, computable phylogenetic data, which were used for a formal supertree synthesis of all the evidence.

During their research, the scientists had to overcome various challenges posed by the copyright that formally covers almost all of the documents they needed to mine for their work. Here they faced quite a paradox – while easy access to, and re-use of, data published in scientific literature are generally supported and strongly promoted, common copyright practices make it difficult for a scientist to be confident when incorporating previously compiled data into their own work. The authors discuss recent changes to UK copyright law that have allowed their work to see the light of day. As a result, they provide their output as facts and assign it to the public domain using the Creative Commons CC0 waiver, to enable worry-free re-use by anyone.

“We are now at the stage where no individual has the time to read even just the titles of all published papers, let alone the abstracts,” comment the authors.

“We believe that machines are now essential to enable us to make sense of the stream of published science, and this paper addresses several of the key problems inherent in doing this.”

“We have deliberately selected a subsection of the literature (limited to one journal) to reduce the volume, velocity and variety, concentrating primarily on validity. We ask whether high-throughput machine extraction of data from the semistructured scientific literature is possible and valuable.”  

 

Original source:

Mounce R, Murray-Rust P, Wills M (2017) A machine-compiled microbial supertree from figure-mining thousands of papers. Research Ideas and Outcomes 3: e13589. https://doi.org/10.3897/rio.3.e13589

 

Additional information:

The research has been funded by the BBSRC (grant BB/K015702/1 awarded to MAW and supporting RM).

Openly published Open Science Prize Grant Proposal builds on ContentMine and Hypothes.is to bridge scientists and facts

Public health emergencies, such as the currently spreading Zika disease, may well be prompting open access to the available biomedical research and its underlying data, yet filtering the right information so that it lands in the hands of the right people is what still holds professionals back from taking adequate measures.

Submitted to the Open Science Prize contest, the present grant proposal, prepared through the joint efforts of scientists affiliated with Hypothes.is, ContentMine, University of Cambridge, Cottage Labs LLP and Imperial College London, suggests a new scholarly assistant system, called amanuens.is, based on the existing ContentMine and Hypothes.is prototypes. Its aim is to combine machines and humans, so that mining critically important facts and making them available to the world becomes not only significantly faster, but also less costly. Through their publication in the open access journal Research Ideas and Outcomes (RIO), the scientists, who are also well-known open access and open data proponents, are looking for further support, feedback and collaborations.

While Hypothes.is is a mixture of software and communities, which together annotate the available literature, ContentMine are building an open source pipeline to extract facts from scientific documents, thus making the literature review process cheaper, more rigorous, continuous and transparent. The role of amanuens.is is to bring these two systems together.

As a result, Hypothes.is is to display ContentMine facts as annotations on the online document, thereby increasing their visibility. In turn, the large Hypothes.is community, comprising users ranging from devoted and experienced Wikipedia editors to dedicated citizen scientists, would be able to provide their own manual annotations, which could then be fed back into the ContentMine facts store.

“Facts are important – but science is performed by people – so ContentMine are partnering with Hypothes.is to bring communities together around facts in the scholarly literature,” sums up Dr Peter Murray-Rust. “Through combining machines and humans in a tight, iterating, loop, amanuens.is will be able to mine critically important facts and make them available to the world.”

In their proposal, the authors give a hypothetical, yet foreseeable, example of a Hypothes.is community centred on research and discussions regarding a bacterium already proven to keep some mosquitoes from transmitting various viruses, and its potential use against Zika. There, amanuens.is downloads all open access papers on Zika from a multitude of sources within 3 minutes. In a matter of a couple of seconds, a total of 123 files are downloaded. Then, amanuens.is delivers a table of the extracted data, including species, human genes, DNA primers and top word frequencies.

Within the community, and thanks to the literature made available via ContentMine, the users would be able to collaborate and build on the existing research outcomes. As a result, it could take only fifteen minutes and a brief proposal to mobilise the related scholarly resources and test for Zika resistance in mosquitoes infected with the virus.

“Finding facts to finding people took 15 minutes and this is how modern collaborative science should work,” Prof Peter Murray-Rust says about the given example. “The people then create knowledge from the facts. The knowledge creates communities. The communities explore science- and people-based solutions.”

In conclusion, the proposal states that, like the content and software provided by ContentMine and Hypothes.is, the outputs produced by amanuens.is will also be openly available. All of its data and annotations are to be placed in the public domain under a CC0 waiver.

###

Original source:

Martone M, Murray-Rust P, Molloy J, Arrow T, MacGillivray M, Kittel C, Kasberger S, Steel G, Oppenheim C, Ranganathan A, Tennant J, Udell J (2016) ContentMine/Hypothes.is Proposal. Research Ideas and Outcomes 2: e8424. https://doi.org/10.3897/rio.2.e8424