June 17, 2005
Cheminformatics software
Working with JOELib and OpenBabel was rewarding, in the sense of being part of a community, but also somewhat frustrating, in part because the on-going development of those packages is, well, on-going.
I have a problem with Java: not the language itself, but rather the need to maintain oh so many different JRE installations, depending on the software developers or vendors involved. And maybe I'm just not paying enough attention or I'm looking in the wrong places, but I can't get decent performance out of Java on my Windows XP box, or find IDE tools that I can get along with. If anybody has any ideas why my experience has been so negative, please let me know. Maybe I'm just too lazy or dumb.
We finally went for MDL's software. I like it.
The new V3000 molfiles that MDL/Draw produces contain enhanced stereochemistry information, allowing you to handle unknown, known relative and known absolute stereocentres.
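To give a feel for what that extra stereochemistry information looks like on disk, here is a minimal Python sketch that pulls the stereo groups out of a V3000 collection block. The collection names (STEABS for absolute, STERELn for relative, STERACn for racemic groups) and the ATOMS=(count a1 a2 ...) layout are written from my recollection of the CTfile format, so check them against MDL's documentation. The function name and the sample fragment are made up for illustration, and the sketch ignores V3000 line continuations.

```python
import re

# Assumed enhanced-stereo collection names: STEABS, STERELn, STERACn.
STEREO_LINE = re.compile(
    r"^M  V30 MDLV30/(STEABS|STEREL\d+|STERAC\d+) ATOMS=\(([\d ]+)\)"
)


def stereo_groups(molfile_text):
    """Return {collection name: [atom indices]} for the stereo collections."""
    groups = {}
    for line in molfile_text.splitlines():
        match = STEREO_LINE.match(line)
        if match is None:
            continue  # not a stereo collection line
        name, numbers = match.groups()
        # The first number is the atom count, the rest are atom indices.
        count, *atoms = (int(n) for n in numbers.split())
        if count != len(atoms):
            raise ValueError(f"{name}: expected {count} atoms, got {len(atoms)}")
        groups[name] = atoms
    return groups


# Hypothetical fragment of a V3000 molfile (collection block only).
SAMPLE = """\
M  V30 BEGIN COLLECTION
M  V30 MDLV30/STEABS ATOMS=(1 2)
M  V30 MDLV30/STEREL1 ATOMS=(2 4 5)
M  V30 END COLLECTION
"""

if __name__ == "__main__":
    print(stereo_groups(SAMPLE))
```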
The Direct Oracle cartridges (molecule and reaction) allow you to do substructure searches and so-called flexmatch (similarity) searches directly through SQL. You can get molecules highlighted back from the query. These are the newer parts, actually core elements of Isentris, the replacement for ISIS. The older ISIS components (ISIS/Host, ISIS/Base) look a little tired compared to this whizzy stuff.
The real crown jewels for us are Cheshire and the Cheshire Business Rules Manager (CBRM). Cheshire is a set of tools for handling molecules and reactions in silico, comprising an IDE and an engine driven by a rich and powerful Java-like interpreted language. Cheshire acts on collections, primarily, and uses iterators to traverse them. This is all very sensible and grown up. You can join molecules, delete atoms or bonds, alter stereochemistry information, perform transforms or reactions on molecules or groups of molecules according to templates. In short, everything we needed to do can be done with Cheshire.
The CBRM is a repository for Cheshire scripts, and allows multiple "activation tags" to be associated with each script. So every script in the repository might be associated with the "test" tag, but only those to do with enumerating peptide bonds would be associated with the "enum_peptides" tag. An ActiveX control is provided which lets you programmatically load scripts for a particular activation tag, check and correct molecules, and determine the applied rules.
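The tagging idea is simple enough to sketch in a few lines of Python. This is only a toy model of the concept, not the CBRM's actual API: the class, its methods and the script names are all hypothetical.

```python
class ScriptRepository:
    """Toy model of a script repository with activation tags (hypothetical)."""

    def __init__(self):
        self._tags_by_script = {}

    def add(self, script_name, tags):
        """Register a script under one or more activation tags."""
        self._tags_by_script[script_name] = set(tags)

    def scripts_for(self, tag):
        """Return the names of all scripts carrying the given tag, sorted."""
        return sorted(
            name for name, tags in self._tags_by_script.items() if tag in tags
        )


repo = ScriptRepository()
repo.add("normalise_charges", {"test"})
repo.add("enumerate_peptide_bonds", {"test", "enum_peptides"})

if __name__ == "__main__":
    print(repo.scripts_for("test"))
    print(repo.scripts_for("enum_peptides"))
```

Loading by tag then just means running whatever `scripts_for` returns against each incoming molecule.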
The CBRM ActiveX control has an annoying bug, which is that it will only work properly under the "US English" (or generally English) regional settings. This is because "," and "." swap roles in Denmark, France etc., where the comma is the decimal separator. Very annoying, actually.
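I don't know what the control does internally, but this class of bug is easy to reproduce. Here is a small Python sketch, with made-up function names, showing how a parser that assumes "." as the decimal separator chokes on a number formatted the Danish way, and how passing the separator explicitly avoids the problem.

```python
def parse_decimal(text, decimal_separator="."):
    """Parse a decimal number written with the given decimal separator."""
    if decimal_separator != ".":
        # Normalise to the "." form that float() understands.
        text = text.replace(decimal_separator, ".")
    return float(text)


def naive_parse_fails(text):
    """Return True if a parser hard-wired to '.' rejects the text."""
    try:
        float(text)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    danish = "1,5"  # one and a half, as formatted under Danish settings
    print(naive_parse_fails(danish))   # the "." assumption breaks here
    print(parse_decimal(danish, ","))  # explicit separator works
```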
My impression after one month of working with the MDL tools is that the software is of high quality, regionalization bugs aside, and the documentation is of a very high standard. If you buy ISIS/Host, MDL/Draw, Cheshire, CBRM and the Direct cartridges, you are looking at getting through a good three reams of double-sided print if you print out all the documentation - which I like to do.
There's one niggle with the 840-page Cheshire manual - there are no contents pages! This might seem like a small detail, but it really isn't. I have had to attach small Post-it notes to mark the chapter divisions. The index is comprehensive, but of course doesn't tell you where the section on atom constants starts, for example.
Other niggles. MDL/Draw doesn't include the sequence editor that ISIS/Draw has. You have to edit the XML config file to get some of the more useful out-of-the-box MDL/Draw functionality working (Sgroup data attachments, for example, and the extended stereochemistry handling). CBRM Administrator context help (actually, any help) is non-existent. Cheshire strips atom aliases (A type fields) from molfiles. Boo.
But these relatively minor things aside, the MDL software will allow us to seriously punch above our weight when it comes to handling our cheminformatics data.
Was it cheap? No. Don't ask. But I believe we would have been hard pressed to get to the same place any other way in the timescale we have to work in (three months) and with the resources we have to work with (two developers).
Posted by daen at June 17, 2005 12:42 AM