randombio.com | computer notes
Sunday, October 25, 2020

Creating a fast database for scientific reprints
An easy way to find information in your collection of papers.
I have around 4000 reprints of various scientific papers on my computer. They're all in PDF format, and it's often a challenge to find a specific one.
Suppose I wanted to find all the papers that mentioned keywords of interest—oh, say collagen and tau. Unless I remembered the authors (and named my file accordingly), I'd have to search each file. The problem is that they take up 7,303,532,000 bytes on the computer, and finding them with pdfgrep would take hours, even if it didn't die from the strain.
One solution is to put the files in a database. MySQL comes to mind, and it's powerful, but for that you'd have to tell the software which part is the title, which part is the authors, and so on. And an ordinary VARCHAR column only holds up to 255 characters, so there's not enough space even for the abstract. So you'd have to extract the keywords manually.
There is commercial software that can help. You draw a box around the title, then click "title". The same for each part of the paper. It would be extremely time-consuming to do this for 4000 papers, and you still run up against the 255-character limit. Another possibility is to use EndNote, which sometimes saves the abstracts. But it doesn't always do that, and EndNote is clumsy for finding things.
So here is my solution.
The first step is to convert each PDF to plain text with pdf2txt, which comes with pdfminer. On a Debian-style system, run

dpkg --install python-pdfminer_20110515+dfsg-1_all.deb

to install it. Then run
pdf2txt test.pdf > test.txt
The second step uses reff, available as reff0.1.tar.gz here. It is free and unlicensed, and it compiles and runs on the Linux command line. It does the following:
reff test.txt > test2.txt
reff test.pdf >> test2.txt
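To process the whole collection at once, a loop along these lines would work. This is a sketch: it assumes pdf2txt and reff are on the PATH, and the intermediate .tmp filename is my own convention, not part of either tool.

```shell
# Convert every PDF in the current directory into a searchable .txt file.
# Assumes pdfminer's pdf2txt and reff are installed and on the PATH.
for f in *.pdf; do
    [ -e "$f" ] || continue              # no PDFs: the glob stays literal, skip
    pdf2txt "$f" > "${f%.pdf}.tmp"       # PDF -> raw text
    reff "${f%.pdf}.tmp" > "${f%.pdf}.txt"   # clean up the raw text
    rm -f "${f%.pdf}.tmp"
done
```

Run it once in the reprint directory and every paper.pdf gets a paper.txt beside it.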
Now it's possible to find the keywords or text strings easily. For example, if
you wanted to know which of your papers contain the words 'β-amyloid' and
'plaques', you'd type
grep -l beta-amyloid *.txt | xargs grep -l plaques
and it would tell you which files. One caveat: the current version of pdf2txt mangles Greek letters, so 'β' gets converted to a plain 'b'.
In that case you'd type
grep -l b-amyloid *.txt | xargs grep -l plaques
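This kind of two-keyword search can be wrapped in a small shell function. A sketch, with a function name I made up:

```shell
# List the .txt files that contain BOTH of the given words.
# grep -l prints the names of matching files; the second grep
# then narrows that list to files also containing the second word.
both() {
    grep -l -- "$1" *.txt | xargs grep -l -- "$2"
}
```

Then `both b-amyloid plaques` prints the names of the files containing both terms.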
Of course, it doesn't handle equations or figures properly, and it doesn't handle image PDFs. For that you'd need a commercial OCR program. But almost all papers in the past thirty years are ordinary PDFs. With this trick you get the file name, and it takes six seconds instead of thirty minutes to locate a string of text in a directory containing 7GB of papers.
And voilà: a quick and easy database for your scientific reprints. And no mucking about with verbose SQL commands. But there's a market for somebody to invent a document format like PDF but in a database format that can be queried.
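In the meantime, something close to that queryable format can be approximated with SQLite's FTS5 full-text index. This is only a sketch: the database file name and table name are invented for the example, and it assumes the sqlite3 CLI was built with FTS5 support.

```shell
# Load the extracted .txt files into a SQLite full-text index.
# 'reprints.db' and the 'papers' table are made-up names.
sqlite3 reprints.db "CREATE VIRTUAL TABLE IF NOT EXISTS papers USING fts5(name, body);"
for f in *.txt; do
    [ -e "$f" ] || continue            # no .txt files: skip the literal glob
    body=$(sed "s/'/''/g" "$f")        # escape single quotes for SQL
    sqlite3 reprints.db "INSERT INTO papers(name, body) VALUES ('$f', '$body');"
done
# Find every paper mentioning both terms:
sqlite3 reprints.db "SELECT name FROM papers WHERE papers MATCH 'amyloid AND plaques';"
```

The MATCH query gives you boolean keyword search with no 255-character limits, and the loading loop only needs to be rerun when new reprints arrive.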