BDC: BibTeX Duplicates Checker

As a researcher, I have been using BibTeX to manage citation databases for quite a few years. It is an excellent tool, but I grew annoyed by occasional duplicate citations, which were caught only by the publisher at the proof stage. I must confess that some of my historically grown .bib files are a mess, with no clear convention for article labels and consequently with many redundant entries. And when colleagues contribute their own bib files, it is almost impossible not to have duplicates hidden under different labels. By duplicates I mean not only duplicate labels, but also the same paper cited under different labels. To add to the confusion, such entries may appear with or without a title and with different abbreviations of the authors' first names :-).

I have therefore written a simple tool to check for such duplicates. In general this task is rather complicated, so I simplified my goal somewhat: the tool is specific to the citation style in my field. The program is not very general, and at present it checks only @article entries, which are, however, the most frequent citation type occurring in my field. It relies on the fact that an article citation is uniquely defined by journal, volume, page (and year), and that it is highly improbable for articles in different journals to share the same volume, page and year. So if the three numbers checked are identical, a suspect duplicate is reported; if the journal name matches as well, a sure duplicate is reported. In this specific situation it works more reliably than the criteria of some other programs, which are based on the percentage of matching items.
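The matching rule described above can be sketched in a few lines of C++. This is only an illustration and not the code of bdc.cc: the `Article` record, the `normalize` helper and the `findDuplicates` function are hypothetical names, the journal comparison (case and punctuation ignored) is an assumption, and so is the decision to skip entries that lack a volume or page.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record for one @article entry; field names are illustrative
// and are not taken from bdc.cc.
struct Article {
    std::string label;
    std::string journal;
    std::string volume;
    std::string page;
    std::string year;
};

enum class Match { Suspect, Sure };

struct Duplicate {
    std::string first;   // label that appeared earlier in the input
    std::string second;  // label that appeared later
    Match kind;
};

// Lowercase and keep only letters and digits, so that "Phys. Rev. A"
// and "phys rev a" compare equal (an assumed normalization).
std::string normalize(const std::string& s) {
    std::string out;
    for (unsigned char c : s)
        if (std::isalnum(c)) out += static_cast<char>(std::tolower(c));
    return out;
}

// Group entries by (volume, page, year). Two entries sharing that key are
// reported as a suspect duplicate, or as a sure one if the normalized
// journal names agree too.
std::vector<Duplicate> findDuplicates(const std::vector<Article>& articles) {
    using Key = std::tuple<std::string, std::string, std::string>;
    std::map<Key, std::vector<const Article*>> seen;
    std::vector<Duplicate> result;
    for (const Article& a : articles) {
        // Assumption: entries without volume or page cannot be compared.
        if (a.volume.empty() || a.page.empty()) continue;
        auto& bucket = seen[Key{a.volume, a.page, a.year}];
        for (const Article* b : bucket) {
            Match kind = normalize(a.journal) == normalize(b->journal)
                             ? Match::Sure
                             : Match::Suspect;
            result.push_back({b->label, a.label, kind});
        }
        bucket.push_back(&a);
    }
    return result;
}
```

Keying a map on the three numbers keeps the check close to linear in the number of entries, since each new entry is compared only against entries that already share its volume, page and year.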

Parsing a .bib file is not entirely easy, so I was happy to find an existing tool, BIBTOOL by Charalampos Nikolaou, which does it and outputs the result in a simple, easy-to-read format. For convenience, I mirror the BIBTOOL package here. I use it to pre-parse the bib files and pass the result to my very simple program bdc.cc.

How to use the tool: install the BIBTOOL binaries and compile bdc.cc (which requires only the GNU C++ compiler and the Boost library). Then issue the command:
bibparse file1.bib [file2.bib ...] |tr '{}' ' '| bdc > bibaliases.tex
Duplicates are reported on stderr, and stdout gives a list of aliases for bibalias (see below). On my first run, I got about 30 duplicates in my bib files :-).

The program does not touch your .bib files; it just gives a list of suspect duplicates, which you can check to decide which entry to delete. Deleting one would, of course, break older documents that cite the item under the other label. There is, however, a way around this problem: the BIBALIAS package by Ulrich Michael Schwarz. For convenience, I mirror the BIBALIAS package here. To install it, unpack the tar.gz, run latex on bibalias.ins, install bibalias.sty in your TEXPATH, and optionally run latex on bibalias.dtx and read the manual ;-).

The stdout of bdc contains a list of aliases suitable for use with the BIBALIAS package. By using bibalias with the overload option, which redefines \cite{}, and including the generated alias list, you can remove the duplicates from your .bib files and still keep the citations in old documents unbroken. To use it, insert the following lines in your main document.tex:
\usepackage[overload]{bibalias}
\input{bibaliases.tex}


I would appreciate your bug fixes, enhancements, suggestions and comments.
Jiri Pittner
