2012-03-07

Difficult decision - 2

Firstly, I am surprised at the amount of traffic my previous post has generated. I had over a thousand visitors within a few hours, redirected from Google, Reddit, Hacker News, etc.! I wonder how many of them would have read the post had it been the other way round, i.e., a declaration that I was adopting Go for the next version of my chemistry product. It appears to demonstrate an intriguing aspect of human nature!

Plug-in

Several people have suggested various ways in which out-of-process plug-ins are better. Some of these suggestions probably arose because I had not described adequately what I mean by a plug-in. I tried to remedy that in individual responses, and I am collecting some of those points here.

I am looking at plug-ins for at least the following benefits. In all cases, the plug-in could be supplied by me, a third-party developer or the user herself.

  • Substitute one algorithm for another.
  • Substitute one implementation of an algorithm for another.
  • Add a new algorithm not originally shipped with the product.
  • Add a calculator or a transformer as a hook in a particular processing step.
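
As a rough illustration of the four uses above, here is a minimal sketch of what an in-process plug-in contract could look like in Go. Everything in it is hypothetical: the Algorithm interface, the Registry type, the capability name "size-filter" and the toy maxAtoms filter are names I made up for this example, and Molecule is reduced to an atom count to keep it short.

```go
package main

import "fmt"

// Molecule is a stand-in for the real type; only an atom count is kept here.
type Molecule struct {
	ID    string
	Atoms int
}

// Algorithm is a hypothetical contract that every plug-in would implement.
type Algorithm interface {
	Name() string
	// Apply reports whether the molecule survives this processing step.
	Apply(m *Molecule) (keep bool, err error)
}

// Registry maps a capability name to the implementation currently in use.
type Registry struct {
	algos map[string]Algorithm
}

func NewRegistry() *Registry {
	return &Registry{algos: make(map[string]Algorithm)}
}

// Register adds a new algorithm, or substitutes the one already registered
// under the same capability name.
func (r *Registry) Register(capability string, a Algorithm) {
	r.algos[capability] = a
}

// Lookup returns the algorithm registered for a capability, if any.
func (r *Registry) Lookup(capability string) (Algorithm, bool) {
	a, ok := r.algos[capability]
	return a, ok
}

// maxAtoms is a toy filter that keeps molecules up to a given size.
type maxAtoms struct{ limit int }

func (f maxAtoms) Name() string { return fmt.Sprintf("max-atoms-%d", f.limit) }

func (f maxAtoms) Apply(m *Molecule) (bool, error) { return m.Atoms <= f.limit, nil }

func main() {
	r := NewRegistry()
	r.Register("size-filter", maxAtoms{limit: 50})  // shipped with the product
	r.Register("size-filter", maxAtoms{limit: 100}) // substituted by a plug-in

	a, _ := r.Lookup("size-filter")
	keep, _ := a.Apply(&Molecule{ID: "m1", Atoms: 80})
	fmt.Println(a.Name(), keep)
}
```

Note that the plug-in receives a pointer to the molecule; nothing is copied or encoded on the way in.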

Performance

My application has to process millions (sometimes tens of millions) of molecules per run. A large number of them are processed by the proposed plug-ins. The number reduces with each advancing stage of processing, owing to elimination in each stage. The load is, hence, lower towards the tail of the workflow. Upstream plug-ins, however, are invoked for most molecules.

An out-of-process plug-in will require the following steps for communication:

  • serialisation of input in the main program,
  • deserialisation of input in the plug-in,
  • serialisation of output in the plug-in, and
  • deserialisation of output in the main program.

The above steps are in addition to the unavoidable protocol handshake for each request. Evidently, the more data that has to be exchanged between the main program and the plug-in, the slower this process will be. Let us look at the data structure that will be exchanged most often in my application, viz., Molecule. It has the following information, at a minimum:

  • a unique identifier,
  • a list of atoms, where each atom has:
    • a unique identifier,
    • element type,
    • spatial coordinates,
    • net charge,
    • stereo configuration,
    • number of implicit H atoms attached to it,
    • aromaticity,
    • list of rings it is a part of,
    • whether it is a bridgehead,
  • a list of bonds, where each bond has:
    • a unique identifier,
    • the atoms it joins,
    • order: single, double, triple, aromatic, …,
    • stereo configuration,
    • aromaticity,
    • list of rings it is a part of,
  • a list of rings with their own properties,
  • a list of components,
  • a list of functional groups,
  • … .
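
To make the size of this structure concrete, the list above could map onto Go types roughly as follows. This is only a sketch: the field names, the integer encodings for stereo configuration and bond order, and the ethene helper are my assumptions for the example, and the components, functional groups and remaining properties are left out.

```go
package main

import "fmt"

// BondOrder enumerates the bond orders named in the list above.
type BondOrder int

const (
	OrderSingle BondOrder = iota + 1
	OrderDouble
	OrderTriple
	OrderAromatic
)

type Atom struct {
	ID         int
	Element    string
	X, Y, Z    float64 // spatial coordinates
	Charge     int     // net charge
	Stereo     int     // stereo configuration; the encoding is an assumption
	ImplicitH  int     // number of implicit H atoms attached
	Aromatic   bool
	Rings      []int // IDs of the rings this atom is a part of
	Bridgehead bool
}

type Bond struct {
	ID       int
	From, To int // IDs of the atoms the bond joins
	Order    BondOrder
	Stereo   int
	Aromatic bool
	Rings    []int
}

type Ring struct {
	ID    int
	Atoms []int
}

// Molecule omits components, functional groups and further properties.
type Molecule struct {
	ID    string
	Atoms []Atom
	Bonds []Bond
	Rings []Ring
}

// ethene builds a two-carbon molecule with one double bond.
func ethene() Molecule {
	return Molecule{
		ID: "ethene",
		Atoms: []Atom{
			{ID: 1, Element: "C", ImplicitH: 2},
			{ID: 2, Element: "C", X: 1.34, ImplicitH: 2}, // approximate C=C length
		},
		Bonds: []Bond{{ID: 1, From: 1, To: 2, Order: OrderDouble}},
	}
}

func main() {
	m := ethene()
	fmt.Println(m.ID, len(m.Atoms), len(m.Bonds))
}
```

Even this trimmed version nests slices inside slices, all of which an encoder has to walk for every request.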

It may be possible to have a protocol that allows a plug-in to declare the subset of the above data that it actually needs. However, checking that declaration and selectively serialising the data has a cost of its own.
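
One way such a declaration could look is a bit mask of the parts a plug-in needs, which the main program consults before encoding. The sketch below is an assumption throughout: the Field type, its constants and the project function are invented for illustration, and Molecule is cut down to three simple slices.

```go
package main

import "fmt"

// Field is a hypothetical bit mask naming the parts of a molecule.
type Field uint8

const (
	FieldAtoms Field = 1 << iota
	FieldBonds
	FieldRings
)

// Molecule is heavily simplified for this sketch.
type Molecule struct {
	ID    string
	Atoms []string
	Bonds [][2]int
	Rings [][]int
}

// project returns a view holding only the parts the plug-in asked for.
// The checks here are the extra cost paid on every request.
func project(m *Molecule, need Field) Molecule {
	out := Molecule{ID: m.ID}
	if need&FieldAtoms != 0 {
		out.Atoms = m.Atoms
	}
	if need&FieldBonds != 0 {
		out.Bonds = m.Bonds
	}
	if need&FieldRings != 0 {
		out.Rings = m.Rings
	}
	return out
}

func main() {
	m := &Molecule{
		ID:    "m1",
		Atoms: []string{"C", "C", "O"},
		Bonds: [][2]int{{0, 1}, {1, 2}},
	}
	slim := project(m, FieldAtoms) // this plug-in declared that it needs atoms only
	fmt.Println(len(slim.Atoms), len(slim.Bonds))
}
```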

In contrast, an in-memory plug-in accesses the object using a pointer, with just a transfer of ownership but no transfer of data.
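
The difference between the two call paths can be sketched in a few lines. The example below simulates the out-of-process path inside a single program, using encoding/gob and in-memory buffers in place of a real pipe or socket; that substitution, the simplified Molecule, and the netCharge function standing in for the plug-in's work are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

type Atom struct {
	Element string
	Charge  int
}

type Molecule struct {
	ID    string
	Atoms []Atom
}

// netCharge stands in for whatever the plug-in computes.
func netCharge(m *Molecule) int {
	total := 0
	for _, a := range m.Atoms {
		total += a.Charge
	}
	return total
}

// callInProcess hands the plug-in a pointer; no data is copied.
func callInProcess(m *Molecule) int {
	return netCharge(m)
}

// callOutOfProcess walks through the four steps listed earlier.
func callOutOfProcess(m *Molecule) (int, error) {
	// 1. Serialise the input in the main program.
	var request bytes.Buffer
	if err := gob.NewEncoder(&request).Encode(m); err != nil {
		return 0, err
	}
	// 2. Deserialise the input in the plug-in.
	var received Molecule
	if err := gob.NewDecoder(&request).Decode(&received); err != nil {
		return 0, err
	}
	result := netCharge(&received)
	// 3. Serialise the output in the plug-in.
	var response bytes.Buffer
	if err := gob.NewEncoder(&response).Encode(result); err != nil {
		return 0, err
	}
	// 4. Deserialise the output in the main program.
	var decoded int
	if err := gob.NewDecoder(&response).Decode(&decoded); err != nil {
		return 0, err
	}
	return decoded, nil
}

func main() {
	ammonium := &Molecule{ID: "NH4+", Atoms: []Atom{{Element: "N", Charge: 1}}}
	direct := callInProcess(ammonium)
	viaWire, err := callOutOfProcess(ammonium)
	fmt.Println(direct, viaWire, err)
}
```

Both paths return the same answer; the second one simply does four extra passes over the data to get there, and a real deployment would add the transport on top.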

Manageability

External plug-ins also raise questions about their life-cycle management and resolution. The questions that need to be addressed include the following.

  • When is a plug-in activated? Together with the main program? On demand?
  • Should plug-ins die with the main program? How do we handle the main program terminating abnormally?
  • How long should a plug-in continue to run, if idle?
  • How should zombie plug-ins be handled?
  • If socket-based, how should port numbers be managed?
  • How should multiple plug-ins providing the same capability be resolved? How about versions?

While in-memory plug-ins do not automatically solve all the above, they do eliminate some of them easily.

Interesting Options Suggested

One interesting suggestion was to package the Go binary distribution as part of my product. Then, when a plug-in is downloaded, the main program itself could be re-compiled and re-linked to include the plug-in. This is a possibility. Some infrastructure code has to be written, though.

Another family of suggestions relates to embedding a scripting language. This is another possibility. However, it is neither fair nor acceptable to force third-party plug-ins to always suffer the relatively lower performance of the scripting language itself. This may become very important if the plug-in is intended for use in the initial stages of the workflow.

The Need for a Java API

Independent of the above, there is no straightforward way to provide a Java API on top of an application written in Go. This issue remains unaddressed.
