This post sums up some of the work I have done at CERN. I have been collaborating with the DESY Library to build a controlled vocabulary of high energy physics (HEP) keywords. DESY has maintained a list of controlled HEP terms for a long time, and these index terms are widely used across HEP institutions and scientific libraries. When we started the collaboration, 8-9 months ago, the vocabulary looked like this: a simple text thesaurus, i.e. a list of keywords, one per line. Over the course of the collaboration, we have converted this text thesaurus into an organised RDF SKOS taxonomy, with some light semantics in it too. The formal release of the taxonomy is expected soon, and hopefully it will become editable by HEP scientists: a nice collaborative effort that could help it grow into a full-scale ontology.
Thesaurus, vocabulary, taxonomy, ontology and whatnot?
There is some confusion about the exact meaning of each of these terms. Please check this article for clarification.
Why all the effort?
Besides the taxonomy, I have worked on BibClassify, an automatic classification module for the CDS Invenio software suite. BibClassify performs keyword extraction from fulltext documents based on a controlled vocabulary - expressed as a SKOS taxonomy. Unlike other keyword extraction systems, BibClassify does not use any machine-learning or AI tricks:
a) it takes a fulltext document as input (e.g. a PDF),
b) it takes a taxonomy as input (SKOS or RDF),
c) it compiles regular expressions around the concepts in the taxonomy,
d) it matches the concepts against the fulltext, does some mumbo jumbo and counts the occurrences,
e) it pulls some output keywords out of the hat.
BibClassify is written in Python. And it works quite well. You can use it as part of the CDS Invenio suite, or as a standalone program.
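To make steps (b) to (e) a bit more concrete, here is a minimal sketch of the general idea. It is not BibClassify's actual code: the function names (`load_concepts`, `compile_patterns`, `extract_keywords`), the tiny `SAMPLE_SKOS` taxonomy and the `example.org` URIs are all made up for illustration, and the sketch only counts single keywords. The weighting and the composite keywords are left out entirely.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

SKOS = "http://www.w3.org/2004/02/skos/core#"

# A tiny, hypothetical SKOS taxonomy (NOT the real HEP taxonomy).
SAMPLE_SKOS = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/hep#preheating">
    <skos:prefLabel>preheating</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/hep#inflation">
    <skos:prefLabel>inflation</skos:prefLabel>
    <skos:altLabel>inflationary</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/hep#mass">
    <skos:prefLabel>mass</skos:prefLabel>
    <skos:altLabel>masses</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
"""


def load_concepts(skos_xml):
    """Step (b): map each concept's preferred label to all of its labels."""
    root = ET.fromstring(skos_xml)
    concepts = {}
    for concept in root.iter("{%s}Concept" % SKOS):
        pref = concept.findtext("{%s}prefLabel" % SKOS)
        alts = [e.text for e in concept.findall("{%s}altLabel" % SKOS)]
        concepts[pref] = [pref] + alts
    return concepts


def compile_patterns(concepts):
    """Step (c): build one case-insensitive, whole-word regex per concept."""
    patterns = {}
    for pref, labels in concepts.items():
        # Longest label first, so the alternation prefers the longer form.
        ordered = sorted(labels, key=len, reverse=True)
        alternatives = "|".join(re.escape(label) for label in ordered)
        patterns[pref] = re.compile(r"\b(?:%s)\b" % alternatives, re.IGNORECASE)
    return patterns


def extract_keywords(text, patterns, limit=10):
    """Steps (d) and (e): count matches and return the most frequent concepts."""
    counts = Counter()
    for pref, pattern in patterns.items():
        occurrences = len(pattern.findall(text))
        if occurrences:
            counts[pref] = occurrences
    return counts.most_common(limit)


if __name__ == "__main__":
    text = (
        "Preheating follows inflation. During preheating the inflaton decays; "
        "preheating is nonperturbative. The inflationary mass scale is fixed."
    )
    patterns = compile_patterns(load_concepts(SAMPLE_SKOS))
    for keyword, count in extract_keywords(text, patterns):
        print(count, keyword)
```

In this sketch the alternative labels ("inflationary", "masses") are counted under their preferred label, which is roughly why a SKOS taxonomy is a handier input than a flat list of keywords.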
The results
For example, running BibClassify on document hep-ph/0608096 using the HEP taxonomy yields (excerpt):
Composite keywords:
17 inflaton: decay [26, 57]
13 energy: density [17, 13]
6 field theory: scalar [0, 0]
6 effect: nonperturbative [10, 28]
5 baryon: asymmetry [7, 12]
3 operator: nonrenormalizable [7, 4]
2 supersymmetry: flat direction [11, 70]
Main Keywords:
36 preheating
14 mass
10 rotation
10 lead
9 interaction
9 inflation
9 complex
8 reheating
5 spectrum
5 frequency
4 temperature
4 radiation
4 momentum
4 Hubble constant
or, the following tag-cloud visualization (a bit sexier):
Sounds interesting?
Then you might want to check out the hacking guides. BibClassify is now part of the CDS Invenio digital library software.