New R package: scholar

My new R package, scholar, has just been posted on CRAN.

The scholar package provides functions to extract citation data from Google Scholar. In addition to retrieving basic information about a single scholar, the package allows you to compare multiple scholars and predict future h-index values. There’s a full guide on GitHub (along with the source code), but here are some quick highlights.

Get profile data on a scholar

Not everyone has a Google Scholar profile page, but if they do, you can find it using the search box in the corner of any profile page. The URL of the resulting profile will contain a string that looks like user=B7vSqZsAAAAJ. To use the package, we need to reference scholars by that id. So, for example, here is Richard Feynman’s data:

library(scholar)
id <- 'B7vSqZsAAAAJ'
feynman <- get_profile(id)
feynman$name # Prints out his name
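
If you have copied a whole profile URL, the id can be pulled out with a few lines of base R. This is only a minimal sketch: extract_scholar_id is a hypothetical helper, not part of the scholar package, and the example URL assumes the usual citations?user=... form.

```r
# Hypothetical helper (not part of the scholar package):
# extract the value of the "user" query parameter from a profile URL.
extract_scholar_id <- function(url) {
  # regexec() returns the full match plus the captured group
  m <- regmatches(url, regexec("[?&]user=([^&#]+)", url))[[1]]
  # No match: return NA rather than failing
  if (length(m) < 2) return(NA_character_)
  m[2]
}

# Assumed URL format, used here for illustration only
url <- "https://scholar.google.com/citations?user=B7vSqZsAAAAJ&hl=en"
id <- extract_scholar_id(url)
```

The resulting id can then be passed to get_profile() as shown above.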

Compare multiple scholars

You can also compare multiple scholars, for example, a Feynman/Hawking battle royale:

library(ggplot2)

# Compare Richard Feynman and Stephen Hawking
ids <- c('B7vSqZsAAAAJ', 'qj74uXkAAAAJ')

# Compare their career trajectories, based on year of first citation
df <- compare_scholar_careers(ids)
ggplot(df, aes(x=career_year, y=cites)) + geom_line(aes(linetype=name)) + theme_bw()

Citation histories of Richard Feynman and Stephen Hawking

Predicting future h-index values

A scholar's h-index is the largest number n such that they have published at least n papers that have each been cited at least n times. Acuna et al. published a method for predicting future h-index values based on historical citation rates. The original regressions were calibrated on neuroscience researchers, so using this in other fields may well end up predicting negative h-indices. However, there is an optional argument that allows you to re-define the 'top' journals in your field. No guarantees, but still, it's a bit of fun.


## Predict Daniel Acuna's h-index
id <- 'GAi23ssAAAAJ'
predict_h_index(id)
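
For reference, the h-index definition above can be computed directly from a vector of per-paper citation counts. This is a small illustrative sketch in base R; h_index_from_cites is a hypothetical name, not a function from the scholar package, and the citation counts are made up.

```r
# Hypothetical helper (not part of the scholar package):
# compute the h-index from a numeric vector of per-paper citation counts.
h_index_from_cites <- function(cites) {
  # Rank papers from most to least cited
  cites <- sort(cites, decreasing = TRUE)
  # Count the papers whose citation count is at least their rank
  sum(cites >= seq_along(cites))
}

# Made-up citation counts: four papers have at least 4 citations each
h_index_from_cites(c(10, 8, 5, 4, 3))
```

Because the counts are sorted in decreasing order, the comparison is TRUE for exactly the first h papers, so summing it gives the index.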

That's it! If you have any suggestions for new features, comments, etc, please let me know.

22 thoughts on “New R package: scholar”

  1. Sebastian

    Fascinating!
    But the package only works if a scholar has a profile, right? I can’t just search for a name and extract citations?

    Thanks for an interesting package!
    Sebastian

  2. James Keirstead Post author

    It should be; there’s probably just a delay in propagating it to your local CRAN mirror. As an alternative, you can install the package directly from Github:


    library(devtools)
    install_github("jkeirstead/scholar")

  3. Sachin Sancheti

    Hi James! I too had some problems installing the ‘scholar’ package in 3.0.1. I tried using the devtools method, but it had its own problems.

    I was able to install it successfully with install_url("https://github.com/jkeirstead/scholar/archive/master.zip")

  4. Giacomo Bianchi

    Hi! Really a nice package. I’m experiencing a problem with it…
    I tried it with my Scholar account and it seems to work; the only problem is that it is reading the citations as the h-index and the actual h-index as the i10-index, while the citation count is “NA”.

    Thank you for your help.

    Giacomo

  5. Sebastian Pink

    Thanks, that’s a very neat way to extract citation information. To get a sample of people I would love to gather people on the basis of their fields of interest – like when you search for ‘label:whatever’. Have you thought about integrating a crawler that gathers all people according to the label entered? That would be an awesome addition.

  6. Giacomo Bianchi

    Sorry for the delay in posting sessionInfo(); I’ve read the comments on GitHub and it seems that the bug is now fixed. I reinstalled the package and it’s working great!
    Thank you again.

    Giacomo

  7. Michael Bach

    Very nice! However, there may be an error: when I look at the list returned from
    get_profile(…), this looks (in part) like this (with my id):
    $total_cites [1] NA
    $h_index [1] 6215
    $i10_index [1] 41

    So the fields seem to be offset by one item (I don’t have an h-index of 6215; not even Feynman has that :). This is not specific to my own id. Did Google change something so that the fields in “stats” in your function are shifted?

    Best, Michael

  8. Michael Bach

    And to finish my comments (sorry), I noticed this is already fixed. Following Sachin Sancheti’s advice and running
    install_url("https://github.com/jkeirstead/scholar/archive/master.zip")
    and then reloading “scholar” fixed it.
    GREAT! Thank you.

    Best, Michael.

  9. Tobias Buchmann

    Very nice package! Since I am interested in citation networks, I am wondering if it is possible to retrieve citing and cited articles of an author.
    Thanks a lot!
    Tobias

  10. Hendrik

    Thanks James!
    Of course, as with citations themselves, we always want more. So is it possible, now or in a future release, to also extract the citations per year per article? Google Scholar now shows that information in a bar graph, but it would be interesting to inspect the numbers themselves.

  11. Herman Mays

    I’m a little new to R but learning. I’m trying to figure out some quantitative ways to evaluate relative academic performance and love the scholar package so far. However, I noticed that when I use get_num_articles('id') it only returns a maximum of 20, the same number as on the first page of a Google Scholar profile. Am I doing something wrong? Thanks.
    Herman

  12. James Keirstead Post author

    Hi Herman. Getting the first 20 hits is the expected behaviour. I’ve added an option to retrieve more results but it’s not yet on CRAN. You can install the latest with:

    library(devtools)
    install_github("jkeirstead/scholar@develop")

  13. Pingback: Momento R do Dia: Keynes vs Friedman e outras batalhas entre acadêmicos no mundo das citações | De Gustibus Non Est Disputandum
