Working-with-organisation-lists.Rmd
The most useful part of ROR is retrieving lists of organisations, and for
now the wrapper for doing this is the most powerful part of the
package. This article is a quick run-through of how to use the
get_org_list() function.
.search_terms lets you enter a search term (or multiple
terms, separated by “|”) by which to filter results. The API checks
the term (or terms) against every possible field and returns any
potential match. This is most useful when you need a broad-brush
search and are worried that being too specific might exclude
useful results.
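As a small sketch of the “|” syntax described above, several terms can be collapsed into one string before being passed to .search_terms. The terms used here are illustrative only, and the call to get_org_list() is left commented out because it needs the package and network access.

```r
# Illustrative terms only; swap in whatever you are searching for.
terms <- c("university", "college")

# Join the terms with "|" so that a match on any one of them is returned.
search_string <- paste(terms, collapse = "|")
print(search_string)

# Uncomment to run the request (requires ror4r and network access):
# results <- ror4r::get_org_list(.search_terms = search_string)
```

Building the string with paste() avoids typing the separator by hand when the list of terms is long or generated elsewhere.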
.affiliation is much more precise than .search_terms. It
only looks at three fields: name, label (which often includes names in
non-English languages) and aliases. Multiple search algorithms are then
tried, with potential matches scored from 0 to 1 (exactly how this scoring
is done isn’t revealed). The API initially returns even ‘below threshold’
results, but get_org_list() narrows these down to only
the chosen results. You should only use this if you know which
organisation you’d like to return but perhaps don’t know its id. For
example, searching for affiliation “Exeter” only returns Phillips Exeter
Academy, not every organisation in Exeter, UK! “University of Exeter”,
however, only returns the University of Exeter, UK.
If you use .affiliation, the only other parameter you
can define is the desired page numbers, although it is unlikely you will
get more than one page of results.
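The narrowing step for affiliation matches can be pictured with a small mock table. This is only a sketch: the column names score and chosen, the organisation names and all of the values are invented for illustration, and the real filtering happens inside get_org_list().

```r
# Mock candidate matches. Names, scores and flags are invented for
# illustration and are NOT real API output.
candidates <- data.frame(
  name   = c("Organisation A", "Organisation B", "Organisation C"),
  score  = c(1.00, 0.72, 0.41),
  chosen = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the rows flagged as chosen, dropping below-threshold candidates.
chosen_only <- candidates[candidates$chosen, , drop = FALSE]
print(chosen_only$name)
```

With a precise affiliation string, a result of this shape would usually contain a single row, which is why more than one page is unlikely.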
ROR categorises the organisations it indexes into different types, such as “Education”.
Filtering by type with .org_type can be useful to help narrow down results.
The API lets you filter results by country, using either the ISO 3166-1 alpha-2 code (for example “GB”, passed as .country_code) or the country name as it appears in the same ISO list. You can only use one or the other at a time.
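Because a mistyped country code can cost a wasted request, it may help to check the shape of the code first. The helper below, is_alpha2_code(), is a hypothetical name and not part of ror4r. It only checks that the value is two upper-case ASCII letters, not that the code is actually assigned to a country.

```r
# Hypothetical helper (not part of ror4r): TRUE when x is a single string
# made of exactly two upper-case ASCII letters, e.g. "GB".
is_alpha2_code <- function(x) {
  is.character(x) && length(x) == 1 && !is.na(x) &&
    grepl("^[A-Z]{2}$", x, perl = TRUE)
}

print(is_alpha2_code("GB"))              # alpha-2 shape
print(is_alpha2_code("GBR"))             # alpha-3, so not accepted here
print(is_alpha2_code("United Kingdom"))  # a name, not a code
```

A value that passes could then be supplied as .country_code, while a full country name would go through the name filter instead.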
The API returns 20 results per page and defaults to sending the first page. There is no option to adjust the page size. This matters because, with more than 100,000 organisations registered, returning them all would take more than 5,000 pages. This wrapper sends only one request per second (to avoid throttling or taking more than your fair share of bandwidth), so returning every result would take at least 5,000 seconds (about 83 minutes). It is therefore critical to limit your requests in some way using the parameters.
If you only want the first page, the first few pages, or the last few, you can specify page numbers to narrow down what is returned. Be cautious, though: before running the request you may not know how many pages there are, or which page the result you want is on.
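The page and timing arithmetic above can be reproduced in a few lines. This is a rough sketch that assumes exactly 20 results per page and one request per second, as stated in the text; pages_needed() and min_seconds() are hypothetical helper names, not functions from ror4r.

```r
page_size <- 20           # fixed by the API, per the text above
seconds_per_request <- 1  # the wrapper's rate limit, per the text above

# Hypothetical helper: number of pages needed for a given result count.
pages_needed <- function(n_results) ceiling(n_results / page_size)

# Hypothetical helper: lower bound on total run time, in seconds.
min_seconds <- function(n_results) pages_needed(n_results) * seconds_per_request

print(pages_needed(100000))      # pages for 100,000 organisations
print(min_seconds(100000) / 60)  # the same lower bound, in minutes
print(pages_needed(550))         # a partly filled last page still counts
```

The estimate is a lower bound only, since it ignores the time each request itself takes to complete.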
This is a quick example of narrowing down your results iteratively, although if you have enough information up front there is no reason you can’t enter all the required parameters at once.
Say we initially want to find education organisations in the UK:
library(ror4r)
data <- get_org_list(.org_type = "Education", .country_code = "GB")
#> No encoding supplied: defaulting to UTF-8.
head(data)[, 1:6] #only six columns as there are a lot of wide columns
#>                          id                            name email_address
#> 1 https://ror.org/052411t69      Courtauld Institute of Art          <NA>
#> 2 https://ror.org/01thz0w07  Wessex Institute of Technology          <NA>
#> 3 https://ror.org/003b78q25                     Wye College          <NA>
#> 4 https://ror.org/0530xmm89     Royal College of Physicians          <NA>
#> 5 https://ror.org/023372f11 Royal Northern College of Music          <NA>
#> 6 https://ror.org/05bh64v90            South Thames College          <NA>
#>   ip_addresses established     types
#> 1         NULL        1932 Education
#> 2         NULL        1986 Education
#> 3         NULL        1447 Education
#> 4         NULL        1518 Education
#> 5         NULL        1973 Education
#> 6         NULL        1895 Education
This worked, but it took a while. There are 550 results across 28 pages, so it takes at least 28 seconds to return all the results in a dataframe.
Maybe we know a bit more about the organisation. Although we might
need to filter the data dataframe based on this, there are
some options using the .search_terms parameter. Say we know
we’re looking for a university: that’s likely to be in the
name field, which is indexed and thus searchable.
data_2 <- get_org_list(.org_type = "Education", .country_code = "GB", .search_terms = "university")
#> No encoding supplied: defaulting to UTF-8.
head(data_2)[, 1:6]
#>                          id                     name email_address ip_addresses
#> 1 https://ror.org/04vg4w365  Loughborough University          <NA>         NULL
#> 2 https://ror.org/03angcq70 University of Birmingham                       NULL
#> 3 https://ror.org/05j0ve876         Aston University          <NA>         NULL
#> 4 https://ror.org/01a77tt86    University of Warwick          <NA>         NULL
#> 5 https://ror.org/01v29qb04        Durham University          <NA>         NULL
#> 6 https://ror.org/05cncd958     Cranfield University                       NULL
#>   established     types
#> 1        1909 Education
#> 2        1900 Education
#> 3        1895 Education
#> 4        1965 Education
#> 5        1832 Education
#> 6        1993 Education
We’ve now narrowed this down to 159 results across 8 pages, which is roughly three times quicker to run.
At this point it’s difficult to do much more to reduce the list. If the results are too narrow, you could make the search more expansive by using “|” to add an OR clause to the search terms. You can also work with the resulting dataframe to make it smaller; hopefully it’s now small enough to explore.
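Working with the returned dataframe locally might look like the sketch below. To keep it runnable without network access, the six rows printed above are re-typed into a small stand-in called data_2_sample (a hypothetical name); with a real result you would apply the same subsetting to data_2 directly.

```r
# Stand-in for the first six rows of data_2, copied from the output above.
data_2_sample <- data.frame(
  name = c("Loughborough University", "University of Birmingham",
           "Aston University", "University of Warwick",
           "Durham University", "Cranfield University"),
  established = c(1909, 1900, 1895, 1965, 1832, 1993),
  stringsAsFactors = FALSE
)

# Keep organisations established before 1900.
older <- data_2_sample[data_2_sample$established < 1900, , drop = FALSE]
print(older$name)

# Keep organisations whose name starts with "University of".
univ_of <- data_2_sample[grepl("^University of", data_2_sample$name), ,
                         drop = FALSE]
print(univ_of$name)
```

Local filtering like this costs no further API requests, so it is a cheap way to finish the search once the list is down to a few pages.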