randombio.com | computer notes
Sunday, October 25, 2020

Creating a fast database for scientific reprints
An easy way to find information in your collection of papers.
I have around 4000 reprints of various scientific papers on my computer. They're all in PDF format, and it's often a challenge to find a specific one.
Suppose I wanted to find all the papers that mentioned keywords of interest—oh, say collagen and tau. Unless I remembered the authors (and named my file accordingly), I'd have to search each file. The problem is that they take up 7,303,532,000 bytes on the computer, and finding them with pdfgrep would take hours, even if it didn't die from the strain.
One solution is to put the files in a database. MySQL comes to mind, and it's powerful, but for that you'd have to tell the software which part is the title, which part is the authors, and so on. And an ordinary VARCHAR column only holds up to 255 characters, so there's not enough space even for the abstract. So you'd have to extract the keywords manually.
There is commercial software that can help. You draw a box around the title, then click "title". The same for each part of the paper. It would be extremely time-consuming to do this for 4000 papers, and you still run up against the 255-character limit. Another possibility is to use EndNote, which sometimes saves the abstracts. But it doesn't always do that, and EndNote is clumsy for finding things.
So here is my solution.
The first step is to convert each PDF to plain text with pdf2txt, which comes with pdfminer. On a Debian-style system, run

dpkg --install python-pdfminer_20110515+dfsg-1_all.deb

to install it. Then run
pdf2txt test.pdf > test.txt
The second step uses reff, available as reff0.1.tar.gz here. It is free and unlicensed, and it compiles and runs on the Linux command line. It does the following:
reff test.txt > test2.txt
reff test.pdf >> test2.txt
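To process the whole collection at once, a loop along these lines would work. This is a sketch: it assumes pdf2txt and reff are on the PATH, and the intermediate .tmp filename is my own convention, not part of either tool.

```shell
# Convert every PDF in the current directory into a searchable .txt file.
# Assumes pdfminer's pdf2txt and reff are installed and on the PATH.
for f in *.pdf; do
    [ -e "$f" ] || continue              # no PDFs: the glob stays literal, skip
    pdf2txt "$f" > "${f%.pdf}.tmp"       # PDF -> raw text
    reff "${f%.pdf}.tmp" > "${f%.pdf}.txt"   # clean up the raw text
    rm -f "${f%.pdf}.tmp"
done
```

Run it once in the reprint directory and every paper.pdf gets a paper.txt beside it.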
Now it's possible to find the keywords or text strings easily. For example, if
you wanted to know which of your papers contain the words 'β-amyloid' and
'plaques', you'd type
grep -l beta-amyloid *.txt | xargs grep -l plaques
and it would tell you which files. One caveat: the current version of pdf2txt mangles Greek letters, so 'β' gets converted to a plain 'b'.
In that case you'd type
grep -l b-amyloid *.txt | xargs grep -l plaques
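This kind of two-keyword search can be wrapped in a small shell function. A sketch, with a function name I made up:

```shell
# List the .txt files that contain BOTH of the given words.
# grep -l prints the names of matching files; the second grep
# then narrows that list to files also containing the second word.
both() {
    grep -l -- "$1" *.txt | xargs grep -l -- "$2"
}
```

Then `both b-amyloid plaques` prints the names of the files containing both terms.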
Of course, it doesn't handle equations or figures properly, and it doesn't handle image PDFs. For that you'd need a commercial OCR program. But almost all papers in the past thirty years are ordinary PDFs. With this trick you get the file name, and it takes six seconds instead of thirty minutes to locate a string of text in a directory containing 7GB of papers.
And voilà: a quick and easy database for your scientific reprints. And no mucking about with verbose SQL commands. But there's a market for somebody to invent a document format like PDF but in a database format that can be queried.
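In the meantime, something close to that queryable format can be approximated with SQLite's FTS5 full-text index. This is only a sketch: the database file name and table name are invented for the example, and it assumes the sqlite3 CLI was built with FTS5 support.

```shell
# Load the extracted .txt files into a SQLite full-text index.
# 'reprints.db' and the 'papers' table are made-up names.
sqlite3 reprints.db "CREATE VIRTUAL TABLE IF NOT EXISTS papers USING fts5(name, body);"
for f in *.txt; do
    [ -e "$f" ] || continue            # no .txt files: skip the literal glob
    body=$(sed "s/'/''/g" "$f")        # escape single quotes for SQL
    sqlite3 reprints.db "INSERT INTO papers(name, body) VALUES ('$f', '$body');"
done
# Find every paper mentioning both terms:
sqlite3 reprints.db "SELECT name FROM papers WHERE papers MATCH 'amyloid AND plaques';"
```

The MATCH query gives you boolean keyword search with no 255-character limits, and the loading loop only needs to be rerun when new reprints arrive.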