The first microbial supertree from figure-mining thousands of papers

With recent reports putting the published scientific literature at more than 114,000,000 documents, finding a way to improve access to this knowledge and synthesise it efficiently is becoming an increasingly pressing issue.

Seeking to address the problem through their PLUTo workflow, British scientists Ross Mounce and Peter Murray-Rust (University of Cambridge) and Matthew Wills (University of Bath) have made the world’s first attempt at automated supertree construction using data extracted exclusively by machines from published figure images. Their results are published in the open science journal Research Ideas and Outcomes (RIO).

For their study, the researchers picked the International Journal of Systematic and Evolutionary Microbiology (IJSEM), the sole repository hosting all new validly described prokaryote taxa and, therefore, an excellent choice against which to test systems for the automated and semi-automated synthesis of published phylogenies. According to the authors, IJSEM publishes more phylogenetic tree figure images per year than any other journal.

An eleven-year span of articles dating back to January 2003 was systematically downloaded so that all image files of phylogenetic tree figures could be extracted for analysis. Computer vision techniques then converted the images automatically back into re-usable, computable phylogenetic data, which were used for a formal supertree synthesis of all the evidence.
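To illustrate what "computable phylogenetic data" and "supertree synthesis" can mean in practice, here is a minimal sketch. It assumes that each extracted tree is available as a plain Newick string, and it uses matrix representation with parsimony (MRP, Baum-Ragan coding) as the synthesis step. The text above does not say which supertree method the authors applied, so MRP and the function names `parse_newick`, `clades` and `mrp_matrix` are illustrative assumptions only, not the PLUTo implementation.

```python
def parse_newick(text):
    """Parse a simple Newick string (no branch lengths, no quoted labels,
    terminated by ';') into nested tuples of leaf names."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos]            # a leaf label

    return node()


def leaves(tree):
    """Return the set of leaf labels below a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every internal node except the root."""
    if isinstance(tree, str):
        return
    for child in tree:
        if not isinstance(child, str):
            yield frozenset(leaves(child))
            yield from clades(child)


def mrp_matrix(newick_trees):
    """Build a Baum-Ragan matrix: one column per clade of each source tree.
    '1' = taxon inside the clade, '0' = in the tree but outside the clade,
    '?' = taxon absent from that source tree."""
    trees = [parse_newick(t.strip()) for t in newick_trees]
    all_taxa = sorted(set().union(*(leaves(t) for t in trees)))
    matrix = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    matrix[taxon] += "?"
                elif taxon in clade:
                    matrix[taxon] += "1"
                else:
                    matrix[taxon] += "0"
    return matrix


if __name__ == "__main__":
    # Two toy source trees with partially overlapping taxa (hypothetical labels).
    for taxon, row in mrp_matrix(["((A,B),C);", "((B,C),D);"]).items():
        print(taxon, row)
```

In an MRP workflow the resulting matrix would then be passed to a parsimony program to search for the supertree; that step is outside the scope of this sketch.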

During their research, the scientists had to overcome various challenges posed by copyrights formally covering almost all of the documents they needed to mine for the purpose of their work. At this point, they faced quite a paradox – while easy access and re-use of data published in scientific literature is generally supported and strongly promoted, common copyright practices make it difficult for a scientist to be confident when incorporating previously compiled data into their own work. The authors discuss recent changes to UK copyright law that have allowed for their work to see the light of day. As a result, they provide their output as facts, and assign them to the public domain by using the CC0 waiver of Creative Commons, to enable worry-free re-use by anyone.

“We are now at the stage where no individual has the time to read even just the titles of all published papers, let alone the abstracts,” comment the authors.

“We believe that machines are now essential to enable us to make sense of the stream of published science, and this paper addresses several of the key problems inherent in doing this.”

“We have deliberately selected a subsection of the literature (limited to one journal) to reduce the volume, velocity and variety, concentrating primarily on validity. We ask whether high-throughput machine extraction of data from the semistructured scientific literature is possible and valuable.”  

 

Original source:

Mounce R, Murray-Rust P, Wills M (2017) A machine-compiled microbial supertree from figure-mining thousands of papers. Research Ideas and Outcomes 3: e13589. https://doi.org/10.3897/rio.3.e13589

 

Additional information:

The research has been funded by the BBSRC (grant BB/K015702/1 awarded to MAW and supporting RM).

In a nutshell: The four peer review stages in RIO explained

Having received a number of requests to further clarify our peer review process, we hereby provide a concise summary of the four author- and journal-organised peer review stages applicable to all research article publications submitted to RIO.

 

Stage 1: Author-organised pre-submission review

Optional. This review process can take place in the ARPHA Writing Tool (AWT) during the authoring process, before the manuscript is submitted to the journal. It works much like the discussion of a manuscript within an institutional department, akin to soliciting comments and changes on a collaborative Google Doc file. The author can invite reviewers via the “+Reviewers” button located on the upper horizontal bar of the AWT. The author(s) and the reviewers are then able to work together in the ARPHA online environment through an inline comment/reply interface. Finally, the reviewers are expected to submit a concise evaluation form and a final statement.

The pre-submission review is not mandatory, but we strongly encourage it. Pre-submission reviews will be published along with the article and will bear a DOI and citation details. Articles reviewed before submission are labelled “Reviewed” when published. Manuscripts that have not been peer-reviewed before submission can be published on the basis of in-house editorial and technical checks, and will be labelled “Reviewable”.

If there is no pre-submission review, the authors have to provide a public statement explaining why they do not have, or do not need, a pre-submission review for this work (e.g. the manuscript has been previously reviewed; a grant proposal has already been accepted for funding, etc.).

 

Stage 2: Pre-submission technical and editorial check with in-house editors or relevant members of RIO’s editorial board

Mandatory. Provided by the journal’s editorial office within the ARPHA Writing Tool when a manuscript is submitted to the journal. If necessary, it can take several rounds, until the manuscript is improved to the level appropriate for direct submission and publication in the journal. This stage ensures format compliance with RIO’s requirements, as well as relevant funding-body and discipline-specific requirements.

 

Stage 3: Community-sourced post-publication peer review

Continuously available. All articles published in RIO are available for post-publication review, regardless of whether they underwent a pre-submission review and regardless of their review status (Reviewable, Reviewed, or RIO-validated). The author may decide to publish a revised version of an article at any time, based on feedback received from the community. In principle, our system allows a review to be published alongside the paper even years after publication of the original work.

 

Stage 4: Journal-organised post-publication peer review

Optional. If the author(s) request it, the journal can additionally organise a formal peer review by discipline-specific researchers in a timely manner. Authors may suggest reviewers during the submission process, but RIO will not necessarily invite the suggested reviewers.

Once an editor and reviewers are invited by the journal, the review process happens much like the conventional peer review in many other journals, but is entirely open and transparent. It is also subject to a small additional fee, in order to cover the management of this process. When this review stage is successfully completed and the editors have decided to validate the article, the revised article version is labelled “RIO-validated”.

Biodiversity data import from historical literature assessed in an EMODnet Workshop Report

Biodiversity loss is an indisputable issue of global concern, and data about species distribution and numbers through the centuries are crucial for adopting adequate and timely measures.

However, abundant as this information is, large parts of the actual data are locked up in scanned documents or have not been digitized at all. Far from being machine-readable, this information remains effectively inaccessible. This is particularly the case for data from marine systems.

This is why data managers who implement data archaeology and rescue activities, as well as external experts in data mobilization and data publication, gathered in Crete for the European Marine Observation and Data network (EMODnet) Workshop, which is now reported in the open access journal Research Ideas and Outcomes (RIO).

“In a time of global change and biodiversity loss, information on species occurrences over time is crucial for the calculation of ecological models and future predictions”, explain the authors. “But while data coverage is sufficient for many terrestrial areas and areas with high scientific activity, large gaps exist for other regions, especially concerning the marine systems.”

Aiming to fill both spatial and temporal gaps in the availability of European marine species occurrence data by implementing data archaeology and rescue activities, the workshop took place on 8 and 9 June 2015 at the Hellenic Center for Marine Research Crete (HCMR), Heraklion, Crete, Greece. There, the participants joined forces to assess possible mechanisms and guidelines to mobilize legacy biodiversity data.

Together, the attendees reviewed the current issues associated with manual extraction of occurrence data. They also used the occasion to test tools and mechanisms that could potentially support a semi-automated process of data extraction. Matters surrounding data re-publication that have long been disputed in scholarly communities, such as openly accessible data and author attribution, were also discussed. At the end of the event, a list of recommendations and conclusions was compiled, which is openly available in the Workshop Report publication.

Ahead of the workshop, curators extracted legacy data to compile a list of old faunistic reports, based on certain criteria. While performing the task, they noted the time and the problems they encountered along the way. Thus, they set the starting point for the workshop, where participants would get the chance to practice data extraction themselves at the organised hands-on sessions.

“Legacy biodiversity literature contains a tremendous amount of data that are of high value for many contemporary research directions. This has been recognized by projects and institutions such as the Biodiversity Heritage Library (BHL), which have initiated mass digitization of century-old books, journals and other publications and are making them available in a digital format over the internet,” note the authors.

“However, the information remains locked up even in these scanned files, as they are available only as free text, not in a structured, machine-readable format”.
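To make the contrast between free text and a structured, machine-readable format concrete, the sketch below shows what a single occurrence record might look like once it has been extracted from a scanned report. The field names follow common Darwin Core terms, which is an assumption on our part since the report summary here does not name a schema. The `OccurrenceRecord` class, the `to_csv` helper and the sample values are entirely hypothetical and do not come from the workshop data.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class OccurrenceRecord:
    """One species occurrence; field names borrowed from Darwin Core (assumed schema)."""
    scientificName: str
    eventDate: str          # ISO 8601 date
    locality: str
    decimalLatitude: float
    decimalLongitude: float
    source: str             # citation of the legacy document


def to_csv(records):
    """Serialise records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(OccurrenceRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values only; not real data from any legacy report.
    record = OccurrenceRecord(
        scientificName="Exampleus hypotheticus",
        eventDate="1900-01-01",
        locality="Example Bay",
        decimalLatitude=0.0,
        decimalLongitude=0.0,
        source="Hypothetical faunistic report (1900)",
    )
    print(to_csv([record]))
```

Once records are in a tabular form like this, they can be validated, merged and loaded into biogeographic databases without a human having to re-read the original page.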

In conclusion, the participants at the European Marine Observation and Data network Workshop listed practical tips regarding in-house document scanning; suggested a reward scheme for data curators, pointing out that credit needs to be given to the people “who made these valuable data accessible again”; encouraged the publication of data papers, in line with the “emerging success of open data”; and proposed the establishment of a data encoding schema. They also highlighted the need for academic institutions to create more permanent positions for professional data managers, while also providing quality training to long-term data experts.


Original source:

Faulwetter S, Pafilis E, Fanini L, Bailly N, Agosti D, Arvanitidis C, Boicenco L, Capatano T, Claus S, Dekeyzer S, Georgiev T, Legaki A, Mavraki D, Oulas A, Papastefanou G, Penev L, Sautter G, Schigel D, Senderov V, Teaca A, Tsompanou M (2016) EMODnet Workshop on mechanisms and guidelines to mobilise historical data into biogeographic databases. Research Ideas and Outcomes 2: e9774. https://doi.org/10.3897/rio.2.e9774