Extracting citation counts from a Google Scholar page


Say you want to extract the citation counts from someone's Google Scholar page (here from Jeremy Baumberg):

Screenshot 20230625 105122.png

It could of course be copied by hand (the values show up as tooltips when hovering over the chart), but this is impractical if you want to do it for several people (something I had to do for PLMCN24).

Peeking at the source, the data is in plain sight, enclosed in spans of a particular class, gsc_g_al. This is, for instance, the number of citations for Baumberg in 2009:

<span class="gsc_g_al">1080</span>
Screenshot 20230625 105000.png

The following command thus extracts the wanted data (from the page saved here as _Jeremy J. Baumberg_ - _Google Scholar_.html, which is the default name when saving the page) and saves it in the file Baumberg.txt:

grep -oP '(?<=gsc_g_al">).*?(?=</)' _Jeremy\ J.\ Baumberg_\ -\ _Google\ Scholar_.html > Baumberg.txt
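Since the point is to do this for several people, the same command can be wrapped in a loop. This is a minimal sketch, assuming that all the saved pages sit in one directory with an .html extension and that GNU grep (needed for -P) is available; extract_counts and extract_all are hypothetical helper names, not part of the original workflow.

```shell
# extract_counts PAGE: print the yearly citation counts found in a saved
# Scholar page, one per line (same regex as the one-off command above).
extract_counts() {
  grep -oP '(?<=gsc_g_al">).*?(?=</)' "$1"
}

# extract_all DIR: for every DIR/<name>.html, write the counts to DIR/<name>.txt.
extract_all() {
  dir=${1:-.}
  for page in "$dir"/*.html; do
    [ -e "$page" ] || continue   # skip the literal pattern when nothing matches
    extract_counts "$page" > "${page%.html}.txt"
  done
}
```

Calling extract_all . writes one .txt file next to each saved page. The files are named after the pages rather than Baumberg.txt, so you may want to rename them to something shorter afterward.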

The years could be extracted similarly, although the style attribute varies from one span to the next, which would require further filtering:

<span class="gsc_g_t" style="right:451px">2009</span><span class="gsc_g_t" style="right:419px">2010</span>
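For completeness, that further filtering can be done with two chained greps: the first keeps everything from the class name up to the four-digit year, whatever the style attribute says, and the second keeps only the year. This is a sketch that assumes the markup looks like the sample above; extract_years is a hypothetical helper name.

```shell
# extract_years PAGE: print the years on the chart axis of a saved Scholar page.
# Stage 1 matches class="gsc_g_t", any other attributes, then ">YYYY".
# Stage 2 keeps only the four digits at the end of each stage-1 match.
extract_years() {
  grep -oE 'class="gsc_g_t"[^>]*>[0-9]{4}' "$1" | grep -oE '[0-9]{4}$'
}
```

The output can then be put side by side with the counts, for instance with paste years.txt Baumberg.txt, but this only pairs correctly if the page lists exactly one count per year.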

In my case it was easier to reconstruct the year axis backward from the number of items returned, since the last item is the current year (you are probably processing the page in the same year you saved it).

I did that with the following Mathematica code:

(* Pair each count with its year, assuming the last count belongs to `year` *)
DateThisList[list_, year_] := 
  Transpose[{Reverse[year + 1 - Range[Length[list]]], list}]

And that's how I processed the files:

fncit = FileNames["*txt"]

Do[cit[FileBaseName@fncit[[i]]] = 
  DateThisList[Flatten[Import[fncit[[i]], "CSV"]], 2023], {i, 
  Length[fncit]}]
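For readers without Mathematica, the same year-axis reconstruction can be sketched in the shell with awk. It assumes one count per line, oldest first, with the last line belonging to the given year; date_list is a hypothetical name mirroring DateThisList.

```shell
# date_list YEAR FILE: prefix each count in FILE with its year, counting
# backward from YEAR, which is assigned to the last line.
date_list() {
  awk -v year="$1" '
    { count[NR] = $1 }
    END { for (i = 1; i <= NR; i++) print year - NR + i, count[i] }
  ' "$2"
}
```

For example, date_list 2023 Baumberg.txt prints "year count" pairs ending in 2023.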

Unfortunately, not all years are shown (the earliest ones are chopped off), but the total is given, so the missing part follows by subtraction. Baumberg, for instance, has 517 citations from before 1998:

43711 - Total[cit["Baumberg"][[All, 2]]]
517
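The same subtraction can be done directly on the extracted file. This is a minimal sketch, assuming the total is read off the Scholar page by hand; hidden_citations is a hypothetical helper name.

```shell
# hidden_citations TOTAL FILE: citations not shown in the chart, that is,
# TOTAL minus the sum of the yearly counts in FILE (one count per line).
hidden_citations() {
  awk -v total="$1" '{ sum += $1 } END { print total - sum }' "$2"
}
```

If Baumberg.txt holds the same counts as above, hidden_citations 43711 Baumberg.txt should reproduce the 517.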

These are, for instance, the citation counts for everyone nominated at least twice by the PLMCN24 program committee:

Screenshot 20230625 181123.png

Note that the same can be done for citations to papers using the gsc_oci_g_al class instead:

grep -oP '(?<=gsc_oci_g_al">).*?(?=</)' paper-citations.html

It is interesting to compare scientists pairwise:

ratioCitations[name1_, name2_] := Module[{min},
  min = Min[{Length[cit[name1]], Length[cit[name2]]}];
  Reverse[cit[name1] [[All, 2]][[-Range[min]]]]/
  Reverse[cit[name2] [[All, 2]][[-Range[min]]]]
  ]
Screenshot 20230625 133337.png
Screenshot 20230625 134533.png
Screenshot 20230625 134637.png
Screenshot 20230625 140423.png
Screenshot 20230625 134739.png
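The pairwise ratio can also be sketched without Mathematica. It assumes two files of yearly counts, one per line, both ending in the same year, and it compares only the most recent years that both files cover. ratio_citations is a hypothetical shell counterpart of the function above, and printing NaN for a zero denominator is a choice of this sketch, not something the Mathematica version does.

```shell
# ratio_citations A B: year-by-year ratio A/B over the years both files cover,
# aligned on the last line of each file, printed oldest first.
ratio_citations() {
  awk '
    FNR == NR { a[++na] = $1; next }   # counts from the first file
              { b[++nb] = $1 }         # counts from the second file
    END {
      min = (na < nb) ? na : nb
      for (i = 1; i <= min; i++) {
        den = b[nb - min + i]
        if (den != 0) print a[na - min + i] / den
        else          print "NaN"      # guard against division by zero
      }
    }
  ' "$1" "$2"
}
```

A call such as ratio_citations me.txt advisor.txt (hypothetical file names) gives one ratio per common year, ready to plot.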

There is much to extract from this. One compelling feature is the time at which you become "established" or "settled" in your field, as measured by when your ratio stops fluctuating wildly with respect to a more senior author (say, your Ph.D. advisor, in my case). For me, that happened around 2009.

Screenshot 20230625 134415.png

It matters less whether you plateau, increase or decrease with respect to the reference: large fluctuations mean you are still at the early-career stage, while once they smooth out, you have probably penetrated your market.

Here is the Mathematica Notebook if you want to play with your own scientists.