The first microbial supertree from figure-mining thousands of papers

With recent reports revealing the existence of more than 114,000,000 documents of published scientific literature, finding a way to improve access to this knowledge and synthesise it efficiently is becoming an increasingly pressing issue.

Seeking to address the problem through their PLUTo workflow, British scientists Ross Mounce and Peter Murray-Rust (University of Cambridge) and Matthew Wills (University of Bath) have made the world’s first attempt at automated supertree construction using data extracted exclusively by machines from published figure images. Their results are published in the open science journal Research Ideas and Outcomes (RIO).

For their study, the researchers picked the International Journal of Systematic and Evolutionary Microbiology (IJSEM) – the sole repository hosting all new validly described prokaryote taxa and, therefore, an excellent choice against which to test systems for the automated and semi-automated synthesis of published phylogenies. According to the authors, IJSEM publishes more phylogenetic tree figure images per year than any other journal.

An eleven-year span of articles dating back to January 2003 was systematically downloaded so that all image files of phylogenetic tree figures could be extracted for analysis. Computer vision techniques then allowed the images to be converted automatically back into re-usable, computable phylogenetic data, which were used for a formal supertree synthesis of all the evidence.

During their research, the scientists had to overcome various challenges posed by copyrights formally covering almost all of the documents they needed to mine for the purpose of their work. At this point, they faced quite a paradox – while easy access and re-use of data published in scientific literature is generally supported and strongly promoted, common copyright practices make it difficult for a scientist to be confident when incorporating previously compiled data into their own work. The authors discuss recent changes to UK copyright law that have allowed for their work to see the light of day. As a result, they provide their output as facts, and assign them to the public domain by using the CC0 waiver of Creative Commons, to enable worry-free re-use by anyone.

“We are now at the stage where no individual has the time to read even just the titles of all published papers, let alone the abstracts,” comment the authors.

“We believe that machines are now essential to enable us to make sense of the stream of published science, and this paper addresses several of the key problems inherent in doing this.”

“We have deliberately selected a subsection of the literature (limited to one journal) to reduce the volume, velocity and variety, concentrating primarily on validity. We ask whether high-throughput machine extraction of data from the semistructured scientific literature is possible and valuable.”  

 

Original source:

Mounce R, Murray-Rust P, Wills M (2017) A machine-compiled microbial supertree from figure-mining thousands of papers. Research Ideas and Outcomes 3: e13589. https://doi.org/10.3897/rio.3.e13589

 

Additional information:

The research has been funded by the BBSRC (grant BB/K015702/1 awarded to MAW and supporting RM).

Biodiversity data import from historical literature assessed in an EMODnet Workshop Report

While biodiversity loss is an indisputable issue concerning everyone on a global scale, data about species distribution and numbers through the centuries is crucial for adopting adequate and timely measures.

However, abundant as this information currently is, large parts of the actual data are locked up as scanned documents or not digitized at all. Far from being machine-readable, this information is left effectively inaccessible. This is particularly the case for data from marine systems.

To address this, data managers who implement data archaeology and rescue activities, as well as external experts in data mobilization and data publication, were brought together in Crete for the European Marine Observation and Data network (EMODnet) Workshop, which is now reported in the open access journal Research Ideas and Outcomes (RIO).

“In a time of global change and biodiversity loss, information on species occurrences over time is crucial for the calculation of ecological models and future predictions”, explain the authors. “But while data coverage is sufficient for many terrestrial areas and areas with high scientific activity, large gaps exist for other regions, especially concerning the marine systems.”

Aiming to fill both spatial and temporal gaps in the availability of European marine species occurrence data by implementing data archaeology and rescue activities, the workshop took place on 8th and 9th June 2015 at the Hellenic Center for Marine Research (HCMR) in Heraklion, Crete, Greece. There, the participants joined forces to assess possible mechanisms and guidelines to mobilize legacy biodiversity data.

Together, the attendees reviewed the current issues associated with manual extraction of occurrence data. They also used the occasion to test tools and mechanisms that could potentially support a semi-automated process of data extraction. Matters surrounding data re-publication that have long been disputed in scholarly communities, such as openly accessible data and author attribution, were also discussed. At the end of the event, a list of recommendations and conclusions was compiled, which is also openly available in the Workshop Report publication.

Ahead of the workshop, curators extracted legacy data to compile a list of old faunistic reports, based on certain criteria. While performing the task, they noted the time and the problems they encountered along the way. Thus, they set the starting point for the workshop, where participants would get the chance to practice data extraction themselves at the organised hands-on sessions.

“Legacy biodiversity literature contains a tremendous amount of data that are of high value for many contemporary research directions. This has been recognized by projects and institutions such as the Biodiversity Heritage Library (BHL), which have initiated mass digitization of century-old books, journals and other publications and are making them available in a digital format over the internet,” note the authors.

“However, the information remains locked up even in these scanned files, as they are available only as free text, not in a structured, machine-readable format”.

In conclusion, the participants at the European Marine Observation and Data network Workshop listed practical tips regarding in-house document scanning; suggested a reward scheme for data curators, pointing out that credit needs to be given to the people “who made these valuable data accessible again”; encouraged the publication of data papers, in line with the “emerging success of open data”; and proposed the establishment of a data encoding schema. They also highlighted the need for academic institutions to increase the number of permanent positions for professional data managers, while also providing quality training to long-term data experts.

Original source:

Faulwetter S, Pafilis E, Fanini L, Bailly N, Agosti D, Arvanitidis C, Boicenco L, Catapano T, Claus S, Dekeyzer S, Georgiev T, Legaki A, Mavraki D, Oulas A, Papastefanou G, Penev L, Sautter G, Schigel D, Senderov V, Teaca A, Tsompanou M (2016) EMODnet Workshop on mechanisms and guidelines to mobilise historical data into biogeographic databases. Research Ideas and Outcomes 2: e9774. doi: 10.3897/rio.2.e9774

Publishing grant proposals, presubmission

There are a lot of really interesting works being published over at Research Ideas and Outcomes (RIO). If you aren’t already following the updates you can do so via RSS, Twitter, or via email (scroll to the bottom for sign-up).

In this post I’m going to discuss why Chad Hammond’s contribution is so remarkable and why it could represent an exciting model for a more transparent and more immediate future of scholarly communications.

[Image: Version 1]

So, what’s special?

Well, to state the obvious first: it’s a grant proposal, not a research article. RIO Journal has published quite a lot of research proposals now; it’s becoming a real strength of the journal. But that’s not the really interesting thing about it. The really cool thing is that Chad published this grant proposal with RIO before it was submitted to the funder (Canadian Institutes of Health Research) for evaluation.

You’ll see the publication date of Version 1 of the work is 24th March 2016. Pleasingly, after publication in RIO Chad’s proposal was evaluated by CIHR and awarded research funding. Chad received news of this in late April:

…and the story gets even better from here because thanks to RIO’s unique technology called ARPHA, Chad was able to re-import his published article back into editing mode, to update the proposal to acknowledge that it had been funded:

This proposal was submitted to and received funding from the annual Canadian Institutes of Health Research (CIHR) competition for postdoctoral fellowships.

The updated proposal was then checked by the editorial team and republished as an updated version of the original proposal: Version 2, making use of CrossMark technology to formally link the two versions and to make sure readers are always made aware if a newer version of the work exists. Chad’s updated proposal now has a little ‘Funded’ button appended to it (see below), to indicate that this proposal has been successfully funded. We hope to see many more such successfully funded proposals published at RIO.

[Image: Title and metadata]

With permission given, Chad was also able to supply some of the reviewer comments passed to him from CIHR reviewers as supplementary data to the updated Version 2 proposal. These will undoubtedly provide invaluable insight into reviewing processes for many.

Finally, for funders and publishing-tech geeks: you should really take note of the lovely machine-readable XML-formatted version of Chad’s proposal. Pensoft has machine-readable XML output as standard, not just PDF and HTML. Funding agencies around the world would do well to think closely about the value of having XML-formatted machine-readable grant proposal submissions. There’s serious value to this and I think it’s something we’ll see more of in the future. Pensoft is actively looking to work with funders to further develop these ideas and approaches for genuinely adding value to scholarly communications.

RIO is truly an innovative journal, don’t you think?

References

Version 1:
Hammond C (2016) Widening the circle of care: An arts-based, participatory dialogue with stakeholders on cancer care for First Nations, Inuit, and Métis peoples in Ontario, Canada. Research Ideas and Outcomes 2: e8615. doi: 10.3897/rio.2.e8615

Version 2:
Hammond C (2016) Widening the circle of care: An arts-based, participatory dialogue with stakeholders on cancer care for First Nations, Inuit, and Métis peoples in Ontario, Canada. Research Ideas and Outcomes 2: e9115. doi: 10.3897/rio.2.e9115

 

This blog post was originally published on Ross Mounce’s blog.