doi2bib or parsing doi into bibTeX entries


Since about 2003, when I started to compile scientific references, I have been entering them in my sci.bib bibTeX file by hand! That amounts to 3549 entries according to bibtex-count-entries. This stops today.

I have always suspected this should not be terribly difficult to automate, but I got bullied into doing it by Eduardo, who saw me entering one such bibliographic record and told me that zotero does it automatically. I had to explain that this does not do quite what I want and need. In fact, other, better tools almost bring me there but, again, not quite, since my format is quite strict.

So on a very hot Sunday, I resolved to hack the code to do it. It turned out, not surprisingly either, to be more complicated than I had expected, but a day against 20 years put the balance against me and my laziness. Here are the details of the script:

The bibliographic information is first obtained with curl:

 curl -LH "Accept: application/json" [...] > from_doi.json

where [...] contains the actual doi, e.g.,

laussy@covid:~$ curl -LH "Accept: application/json" > from_doi.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   217  100   217    0     0    927      0 --:--:-- --:--:-- --:--:--   927
100 16695    0 16695    0     0  22274      0 --:--:-- --:--:-- --:--:-- 22274
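
The URL is elided above. As a minimal sketch, assuming the request goes to the public doi.org resolver (an assumption, the text does not say so) and keeping the same Accept header, the fetch could be wrapped as follows. The helper names doi_url and fetch_doi are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not the author's script.
# ASSUMPTION: the DOI is resolved through https://doi.org/, which is
# expected to return JSON metadata for this Accept header.

# Build the resolver URL from a bare DOI or a "doi:"-prefixed one.
doi_url() {
    printf 'https://doi.org/%s\n' "${1#doi:}"
}

# Fetch the metadata of DOI $1 into file $2 (default: from_doi.json).
# -s silences the progress meter, -f turns HTTP errors into a failure,
# -L follows the resolver's redirects.
fetch_doi() {
    curl -sfL -H "Accept: application/json" "$(doi_url "$1")" > "${2:-from_doi.json}"
}
```

It would be called as fetch_doi followed by the DOI, which leaves the metadata in from_doi.json for the jq steps.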

I then rely on jq to parse it.

This produces the list of authors:

<from_doi.json jq -r '.author|map([.given,.family]|join(" "))|join(" and ")'

with output:

David Colas and Lorenzo Dominici and Stefano Donati and Anastasiia A Pervishko and Timothy CH Liew and Ivan A Shelykh and Dario Ballarini and Milena de Giorgi and Alberto Bramati and Giuseppe Gigli and Elena del Valle and Fabrice P Laussy and Alexey V Kavokin and Daniele Sanvitto

that is, with full first names, which may be a good thing, but in principle I do not keep that information (although this should probably be sorted out at the bst level). Anyway, to enforce initials only:

<from_doi.json jq -r '.author[]|[.given,.family]|join(" ")' | awk '{for(i=1; i<NF; i++){printf "%s. ", substr($i,1,1)} print($NF)}'
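
To make the awk step explicit, here is a minimal sketch of it as a reusable filter, with a second filter that glues the lines back together with " and " as bibTeX expects. Both function names (initials, join_authors) are hypothetical, and the re-joining step is an assumption since it is not shown above.

```shell
#!/bin/sh
# Hypothetical filters: read one "Given Names Family" per line on stdin.

# Reduce every word but the last to its first letter followed by a dot.
initials() {
    awk '{
        for (i = 1; i < NF; i++) printf "%s. ", substr($i, 1, 1)
        print $NF
    }'
}

# Join all input lines into a single line separated by " and ".
join_authors() {
    awk 'NR > 1 { printf " and " } { printf "%s", $0 } END { print "" }'
}

# Example:
#   printf 'Fabrice P Laussy\nElena del Valle\n' | initials | join_authors
# prints: F. P. Laussy and E. d. Valle
```

Note that compound initials such as "CH" collapse to "C." and particles such as "del" become "d.", so such names still need a manual correction rule.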

The other bibliographic information is more straightforwardly extracted:

<from_doi.json jq '.title'
<from_doi.json jq '."container-title-short"'
<from_doi.json jq '.published."date-parts"[0][0]'
<from_doi.json jq '.volume'
<from_doi.json jq '.page'
<from_doi.json jq '.DOI'

Quite regrettably, some journals replace the page with a so-called article-number:

<from_doi.json jq '."article-number"'
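
Both cases can be handled in a single call. This is a sketch that relies on jq's alternative operator //, which falls back to its right-hand side when the left-hand side is null or false. The function name pages_of is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print .page if the record has one, otherwise the
# article number, otherwise nothing at all (empty).
pages_of() {
    jq -r '.page // ."article-number" // empty' "$1"
}
```

Calling pages_of from_doi.json then gives whichever of the two fields the journal provides.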

One can also use the long name of the journal:

<from_doi.json jq '."container-title"'

To replace that with ISO 4 titles, I prefer to use sed:

<from_doi.json jq -r '."container-title-short"' | sed -f iso4 

where iso4 contains things like:

s/J. Phys. B: At. Mol. Phys./jpb/
s/Light Sci Appl/lsa/
s/Phys. Rev. B/prb/
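
In these rules the dots are regular-expression wildcards and the patterns are unanchored, so a rule can also fire inside a longer title. A stricter variant, as a sketch only, escapes the dots and anchors each pattern to the whole line. The function name journal_key is hypothetical, and the rules are inlined instead of being read from the iso4 file to keep the example self-contained.

```shell
#!/bin/sh
# Hypothetical filter: map a journal title on stdin to its short key.
# \. matches a literal dot; ^ and $ force the whole line to match.
journal_key() {
    sed -e 's/^J\. Phys\. B: At\. Mol\. Phys\.$/jpb/' \
        -e 's/^Light Sci Appl$/lsa/' \
        -e 's/^Phys\. Rev\. B$/prb/'
}

# Example:
#   echo 'Phys. Rev. B' | journal_key
# prints: prb
```

A title with no matching rule passes through unchanged, which makes missing entries easy to spot.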

The same can (must) be done for special names too, in particular given the mayhem with Spanish names (which have two surnames) or titles:

<from_doi4.json jq -r '.author[]|[.given,.family]|join(" ")' | awk '{for(i=1; i<NF; i++){printf "%s. ", substr($i,1,1)} print($NF)}' | sed -f bibnames

with bibnames containing something like:

s/J. C. L. Carreño/J. C. {L\\'opez Carre\\~no}/
s/E. Z. Casalengua/E. {Zubizarreta Casalengua}/
s/E. d. Valle/E. {del Valle}/

The bibTeX key is, in my case, the first author's name, the last two digits of the year and a letter for lifting degeneracies.
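
As a minimal sketch of this rule, the hypothetical function bibkey below builds the key and picks the first letter not yet used in the bib file. The lowercase form, the removal of spaces from the name and the fact that a letter is always appended are assumptions, not something stated above.

```shell
#!/bin/sh
# Hypothetical helper: bibkey FAMILY YEAR BIBFILE
# ASSUMPTIONS: keys are lowercase, spaces in the name are dropped, and
# a letter is always appended, starting from "a".
bibkey() {
    family=$1 year=$2 bibfile=$3
    base=$(printf '%s' "$family" | tr -d ' ' | tr '[:upper:]' '[:lower:]')
    base="$base${year#??}"   # drop the first two characters of a 4-digit year
    for letter in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        # -F: fixed string, so the brace and comma are taken literally.
        if ! grep -qF "{${base}${letter}," "$bibfile" 2>/dev/null; then
            printf '%s%s\n' "$base" "$letter"
            return 0
        fi
    done
    return 1   # all 26 letters are already taken
}
```

For instance, bibkey Colas 2015 sci.bib would give colas15a on a file with no such key yet, and colas15b if colas15a is already present.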

Now everything can be packed in a perl script to provide the final output. This is (my version of) doi2bib, which works as follows:

[Screenshot of doi2bib in use: Screenshot 20230730 185200.png]
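
The perl script itself is not reproduced here. As a rough stand-in in shell, the final packing step could look like the sketch below, where the hypothetical function bibentry takes the already extracted fields as arguments. The field layout is an assumption, not my actual format, and the journal is left unbraced on the assumption that keys such as prb are bibTeX @string macros.

```shell
#!/bin/sh
# Hypothetical helper, not the perl script:
# bibentry KEY AUTHORS TITLE JOURNAL VOLUME PAGES YEAR DOI
bibentry() {
    printf '@article{%s,\n'      "$1"
    printf '  author  = {%s},\n' "$2"
    printf '  title   = {%s},\n' "$3"
    printf '  journal = %s,\n'   "$4"   # unbraced: assumed @string macro
    printf '  volume  = {%s},\n' "$5"
    printf '  pages   = {%s},\n' "$6"
    printf '  year    = {%s},\n' "$7"
    printf '  doi     = {%s}\n'  "$8"
    printf '}\n'
}

# Example with placeholder values:
#   bibentry key15a 'A. Author and B. Author' 'A title' prb 1 2 2015 10.1234/abc
```

Each argument would come from one of the jq, awk and sed steps described above.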