Sunday, June 19, 2016

the first visual search engine for scientific diagrams

economist |  A PICTURE is said to be worth a thousand words. That metaphor might be expected to pertain a fortiori in the case of scientific papers, where a figure can brilliantly illuminate an idea that might otherwise be baffling. Papers with figures in them should thus be easier to grasp than those without. They should therefore reach larger audiences and, in turn, be more influential simply by virtue of being more widely read. But are they? Bill Howe and his colleagues at the University of Washington, in Seattle, decided to find out.

First, they trained a computer algorithm to distinguish between various sorts of figures—which they defined as diagrams, equations, photographs, plots (such as bar charts and scatter graphs) and tables. They exposed their algorithm to between 400 and 600 images of each of these types of figure until it could distinguish them with an accuracy greater than 90%. Then they set it loose on the more-than-650,000 papers (containing more than 10m figures) stored on PubMed Central, an online archive of biomedical-research articles.

To measure each paper’s influence, they calculated its article-level Eigenfactor score—a modified version of the PageRank algorithm Google uses to provide the most relevant results for internet searches. Eigenfactor scoring gives a better measure than simply noting the number of times a paper is cited elsewhere, because it weights citations by their influence. A citation in a paper that is itself highly cited is worth more than one in a paper that is not.

As the team describe in a paper posted on arXiv, they found that figures did indeed matter—but not all in the same way. An average paper in PubMed Central has about one diagram for every three pages and gets 1.67 citations. Papers with more diagrams per page and, to a lesser extent, plots per page tended to be more influential (on average, a paper accrued two more citations for every extra diagram per page, and one more for every extra plot per page). By contrast, including photographs and equations seemed to decrease the chances of a paper being cited by others. That agrees with a study from 2012, whose authors counted (by hand) the number of mathematical expressions in over 600 biology papers and found that each additional equation per page reduced the number of citations a paper received by 22%.