by Ben Sowter
Tracking all the papers and citations data we need from the Scopus database to fuel our evaluations is quite a challenge, and our process has always resulted in some discrepancies between the results we are using and the results that you can actually retrieve from Scopus at any given moment. Scopus is an ever-changing database: not only are Elsevier working very hard to add more journals, in more languages, and to backfill earlier content, but they are also working hard to consolidate affiliations and make it easier to retrieve all the data for a given author or institution. The database is vast, however, and the variants are many – MIT, for example, at one point in time had 1,741 name variants. Additionally, as time goes by, more papers get published and more citations get filed.
Our analysis is based on “custom data” exported from Scopus at a fixed point in time, defined within fixed limits. We use the last five complete years for both papers and citations – that is to say, we count all papers published in the five years leading up to December 31st of the previous year, together with any citations received during the same period. By the time the Times Higher Education – QS World University Rankings are published in October, there will be ten more months of papers and citations appearing in the online version of Scopus.
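To make the window concrete, the short sketch below works out the date range for a given ranking year (a minimal illustration only; the function name and the use of Python are my own assumptions, not part of the QS process):

```python
from datetime import date

def citation_window(ranking_year):
    """Return the five complete years used for both paper and citation
    counts, ending on 31 December of the year before the ranking is
    published. Hypothetical helper for illustration only.
    """
    end_year = ranking_year - 1        # last complete year
    start_year = end_year - 4          # five complete years inclusive
    return date(start_year, 1, 1), date(end_year, 12, 31)

# The 2009 analysis therefore covers 1 January 2004 to 31 December 2008.
start, end = citation_window(2009)
print(start, end)  # 2004-01-01 2008-12-31
```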
The custom data for the forthcoming 2009 analysis amounts to 18GB of raw XML data, along with which Elsevier provide an affiliation table. This table is an ever-improving lens that we can use to identify the mappings required to retrieve the aggregate data we need. We search the affiliation table for strings that match the universities (or their alternate names) in our database, which returns a list of eight-digit affiliation ID numbers that we can then use to retrieve and aggregate data from the main dataset. If key names are missing from the affiliation table, it is very difficult to identify any content that may exist in the main dataset.
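In outline, the lookup works roughly as sketched below; the file layout, the column names (affiliation_name, affiliation_id) and the simple substring match are assumptions made for illustration, not a description of the actual Scopus schema or our production process:

```python
import csv
from collections import defaultdict

def find_affiliation_ids(affiliation_table_path, alternate_names):
    """Scan the affiliation table for rows whose name contains one of an
    institution's known names and collect the matching affiliation IDs.
    Column names here are hypothetical."""
    ids_by_institution = defaultdict(set)
    lookup = {alt.lower(): inst
              for inst, alts in alternate_names.items() for alt in alts}
    with open(affiliation_table_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            name = row["affiliation_name"].lower()
            for alt, inst in lookup.items():
                if alt in name:
                    ids_by_institution[inst].add(row["affiliation_id"])
    return ids_by_institution

def aggregate_counts(paper_records, ids_by_institution):
    """Sum papers and citations from the main dataset for every
    affiliation ID mapped to each institution."""
    id_to_inst = {aid: inst
                  for inst, aids in ids_by_institution.items() for aid in aids}
    papers, citations = defaultdict(int), defaultdict(int)
    for rec in paper_records:   # e.g. {"affiliation_id": ..., "citations": ...}
        inst = id_to_inst.get(rec["affiliation_id"])
        if inst is not None:
            papers[inst] += 1
            citations[inst] += rec["citations"]
    return papers, citations
```

The weakness this illustrates is the one described above: if an alternate name is absent from our database, or a key name is missing from the affiliation table, the corresponding IDs are never found and any papers filed under them are silently excluded from the aggregation.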
Since the publication of the QS.com Asian University Rankings, a couple of institutions have come forward and pointed out that, to one degree or another, data is missing for their institution. This was discovered thanks to our practice of sharing a “fact file” with institutions prior to publication. Each of them is now working with QS to ensure that any shortfall is rectified in the future.
In future we will split our fact file distribution in two: one fact file will come out long in advance of publication, followed by a media briefing, including the ranking results, two days prior to the publication date.