2011-12-25

Linux distribution trials have not ended :-(

Following up on my August post regarding Linux distribution trials, I could not – in fact – continue with CentOS. There are two important reasons.

  • yum is a really fragile package management tool. Not only could I not understand what I had done to break it, even following its own instructions for recovering a sane package database state never worked! And then, it kept complaining endlessly that Update Manager was keeping the database locked, while Update Manager kept complaining endlessly that a different package management tool was keeping the database locked. I followed resolution suggestions from about a dozen Web sites, with no success whatsoever. Even deleting the entire package database did not help! This paints a sorry picture of Linux!
  • CentOS makes it really difficult to install a broad set of essential 32-bit libraries. First, they are not available in the default 64-bit repositories under a single compat-style package; at least, I could not find one easily. Hence, I had to let a 32-bit application such as Skype or TeamViewer fail once for each missing dependency, figure out which package provided that library, and then install its .i386/.i686 version. This is tedious at best.

Consequently, I moved on. My current experiment is with the openSUSE 12.1 KDE spin. I have returned to openSUSE after almost five years. I do not remember if zypper was the default package management tool back then. After going through the man pages, and a few trials, I got used to it. I tried to break it in one or two small ways, but it has withstood them so far!

The OS itself has been quite stable in these several weeks of use. I disabled nepomuk and strigi, and left KDE notifications on. The CPU usage when idle is good (<= 1.5%), KDE has been very responsive and stable, and to top it all, openSUSE 12.1 came with Go r60.3 in the repositories!

I have moved one of my developers to it. In its default configuration, openSUSE does not allow non-root users to connect to a wireless network; that is the only complaint I have received so far. I resolved it by making my trusted wireless network a system connection.

Thus, life looks good so far! I shall update the status some time towards the end of January 2012 again.

2011-12-24

No more Apple products!

No more Apple products! I have made that decision after a careful examination of several incidents.

For the record, I have a MacBook Pro 15" and an iPod Touch (second generation) for my wife, and a MacBook Pro 17" for myself. Now, I think that I should not have purchased them in the first place.

Why did I buy them, then?

I bought the iPod Touch because I thought that it would be useful for quick e-mail checks and responses. That it could carry my music was secondary; more like tertiary, since I browsed the Internet using it much more than I listened to the music in it.

When I first bought the 15" MacBook Pro, my primary operating system for well over a decade had been Linux. I purchased a Mac Mini for cross-browser compatibility testing at work. I was bewitched by the cuteness of the user interface, even though I was at variance with some of its UX concepts. When I discovered the command line and the POSIX layer beneath the UI, though, it appeared as if my wishes were answered. So, I went ahead and ordered the MacBook Pro.

The first alarm bells did ring when I placed the order. The price was unreasonably high. Worse, the India prices were at close to a 60% premium over the corresponding US prices. But the charm worked, and I parted with the money.

When the computer did arrive, I was unhappy to see that it came with US power socket pins and no Indian pin adapter. I had to purchase an adapter myself.

Over time, my usage of the iPod Touch decreased (more about this later), and I gave it to my wife so she could use it for listening to music.

Trouble strikes

A few months later, the battery of the MacBook Pro failed rapidly. After trying to reset its hysteresis by draining it fully, I realized that it was to no avail. I went to the local Apple reseller. The genius there examined my computer, and declared that the battery was dead and needed to be replaced. And, with a rueful countenance, he further declared that it had just run out of warranty. I was perplexed. I showed the system diagnostics to him, pointing out that the battery had gone through only 310 recharge cycles. With a superior smile, he advised me that the typical life of laptop batteries was only 300 cycles! I was now disappointed. With an effort at being civil, I told him that the computer in question was my fourth or fifth, and that I had enough experience to know that laptop batteries do not die so fast. He simply ignored me, and asked me if I wanted a new battery or not. By now, I was enraged. However, I had no choice. So, I paid a ridiculously high price - the Apple premium - for a new battery.

What about the other MacBook Pro?

Technically, I did not purchase it. One of my collaborators - who lives in Toronto, Canada - did. His primary operating system had been Linux, too, for an even longer period. When I assured him that he could continue to run his familiar shell and gcc in MacOS X, he took the plunge. He purchased a 17" MacBook Pro. However, between the unfamiliar interface, the default behavior of Finder, and the inability to easily install new packages (compared to yum install package), his ride was rough.

After some research, I suggested MacPorts to him. However, he ran into trouble when some libraries installed through it (as dependencies) began conflicting with those installed by Xcode. After a while, he gave up, and simply shipped the computer to me. And he did it without warning me!

Trouble strikes, again!

Two days before the 17" computer arrived, I had gone to the local Apple reseller, and purchased an OS upgrade for my 15" computer, from Leopard to Snow Leopard.

My gentle collaborator forgot to send the power adapter along with the 17" computer. So, I needed to purchase one if I wanted to use the computer. Reluctantly, I went to the Apple reseller again. Once again, I had to pay the Apple premium, this time for the power adapter. I opened the pack, and this time, I found that it had UK power socket pins! That was ridiculous! I asked the reseller why I was given an adapter with UK pins. He shrugged, and replied that others simply purchased an additional converter. And that he could sell me one if I wished to buy it! I was getting indignant.

Nevertheless, I showed him the Snow Leopard upgrade that I had purchased two days earlier. It was new and unopened. I asked him to take it back, and give me a family pack instead, since I now had two computers to upgrade. He flatly refused. I raised the matter to the store manager, urging him to be reasonable. He seconded his genius, and said that it was against Apple policies. If I wished to upgrade the second computer, I had to purchase another single upgrade copy. I was livid!

Meanwhile, why did I stop using the iPod Touch?

The biggest reason was that I could not watch technology talks on it. Why? Because it cannot play Flash content. Another problem was that its Exchange ActiveSync did not allow me to install a self-signed certificate, barring me from accessing my corporate e-mail on it.

Conclusions

  • Apple products are needlessly expensive. And, on top of that, there is no guarantee that their expensive hardware is as durable as much less expensive hardware from other vendors.
  • Apple is an arrogant company. It has no respect for its customers. It has sold equipment to me, in India, with US and UK power socket pins, forcing me to spend additional money on adapters/converters.
  • Apple's policies work against its customers. They did not exchange the brand new, unopened Snow Leopard upgrade pack for a family pack.

I am not going to purchase another Apple product! Not until all of the above changes. Today, I spend almost all of my time on the computer inside a Linux VMware Fusion image. And, yes, I have not upgraded the MacOS.

2011-11-11

Draft 2 of a proposal for generics in Go

N.B. This is draft version 2 of a work in progress. It is intended to help me understand the subject better, elicit comments from more knowledgeable people, and elucidate the nuances of generics in Go's context. The following proposal makes use of a syntax for ease of presentation. The syntax is, however, not intended to be the primary subject, nor is it intended to represent the complexities or possible complications of the underlying mechanisms.

Generics

In Go, there are fundamental differences between built-in types and user-defined types. When we try to come up with mechanisms that help treat both of them on a completely equal footing, they seem to stretch the language along difficult dimensions. Particularly troublesome are issues such as the following.

  • Built-in types (like int and float64) do not have methods; they only have a pre-defined set of operators. Users cannot redefine those operators, nor can they define new operators.
  • Operator overloading is not allowed, preventing user-defined types from exhibiting operator compatibility with the built-in types.
  • Operators are not methods themselves, and hence do not allow a symmetry between the built-in operations and the user-defined.

Rather than explore possible solutions to the above, the current proposal accepts them as facts that will not change. In other words, the inherent differences between built-in types and user-defined types are preserved, and no unification is attempted. By adding one new concept (that of type classes, see hereunder) to the language, backwards-compatibility can be maintained even as generics are introduced into the language.

This proposal considers the following points as the base requirements for a working model of generics in Go.

  1. Enabling generics for built-in types as well as user-defined types.
  2. Avoiding boxing/unboxing and implicit casts.
  3. Preserving Go's (package-level) separate compilation model.
  4. Maintaining reasonably high compilation speed.
  5. Promoting (generated) code sharing.
  6. Allowing for reflection of generics.

Storage Classes

Types T1 and T2 belong to the same storage class if elements of T1 can either be assigned to those of T2, or converted to T2, and vice versa. Please note that only conversion is considered, not a type assertion.

Type Classes

Types T1 and T2 belong to the same type class if the following are true:

  1. T1 and T2 belong to the same storage class and
  2. every operation defined on elements of T1 is also defined on those of T2, and vice versa.

As per the above definition, for instance, int32 and int64 belong to the same type class, while int64 and complex64 do not. Thus, type classes play the equivalent role of interfaces for built-in types. However, type classes do not introduce new concrete types.
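
Mapping these definitions onto Go's existing conversion rules gives a quick intuition. A minimal sketch:

func storageClassDemo() {
    var i32 int32 = 42

    // int32 and int64: convertible in both directions, and they support
    // the same set of operators -- the same type class.
    i64 := int64(i32)
    i32 = int32(i64)

    // int64 and complex64: no conversion exists in either direction, so
    // they belong to different storage classes, and hence to different
    // type classes.
    // c64 := complex64(i64)  // does not compile

    _ = i32
}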

Predefined Type Classes

The Go language defines a set of predefined type classes for numeric types, as enumerated here.

  • Integer This type class encompasses the following built-in types: int, int8, int16, int32, int64, uint, uint8, uint16, uint32, uint64 and byte.
  • Float This type class encompasses the following built-in types: float32 and float64.
  • Real This type class composes the type classes Integer and Float above.
  • Complex This type class encompasses the following built-in types: complex64 and complex128.

In addition, the set of all named interface types in a program constitutes a single type class. This class is unnamed, and is not accessible to the users; it is defined solely for the purposes of implementation.

It is not possible to define new type classes.

The inability to define new type classes may be a limitation. However, my current thinking is that the capability to define them may not be a pressing requirement. In any case, such a capability cannot be provided until the mechanisms governing the composition of type classes are unambiguously understood.

Generic Types

A type G with a type parameter T is said to be generic in T. It is denoted by G<T>. A generic type is allowed to be generic in more than one type. Where multiple type parameters need to be specified together, they are comma-separated. Go's generic types are reified, and are amenable to introspection at run time.

A type argument T can be any valid, assignable (within the package scope of applicability) Go type, with the exception of interface{}. The semantics of interface{} interfere with the compile-time checks that generics require. Moreover, the choice of interface{} for a type expressly avoids compile-time specification of a concrete type on the part of the user. In a sense, the compile-time mechanisms of generics constitute the conjugate space of that constituted by the run-time mechanisms of interface{}.

The semantics of generic types are dependent on the type parameter constraints (described hereunder) and actual type arguments; in particular, value types and reference types lead to different semantics.

A type parameter T can appear in the following positions (or sites):

  1. type parameter declaration,
  2. variable type specification,
  3. function/method parameter type specification and
  4. allocation.

Type Parameter Declaration

Type parameter declaration makes a new type parameter available for subsequent use within the applicable scope. Such a declaration does not introduce a new concrete type. It is illegal to declare a type parameter that is unused within that scope; similarly, it is illegal to use a type parameter that is not declared in that scope or any of its enclosing scopes. Type parameters are not shadowed; hence, it is illegal to re-declare a type parameter that is already in scope.

A generic type G<T> with a specification for T introduces a new concrete type. This new concrete type behaves in full compliance with the rules governing normal Go types, including those for type inference, with explicit exceptions where noted.

Given a generic type G that is dependent on type parameters T1, ..., Tm, it is possible to construct a generic type H that is the same as G with n (n < m) of the type parameters instantiated. This results in H being generic in m-n parameters.

type HashTable<K, V> struct {
    ...
}

type HashSet<K> HashTable<K, bool>

Variable Type Specification

Using a generic type as the type of a variable requires specification of concrete types for the generic type's type parameters. The number of specified type arguments should match that in the generic type's declaration. It is illegal to leave any type parameter uninstantiated, unless such parameter is in scope from an enclosing declaration.

var h1 HashTable<string, float64>  // legal: both type parameters instantiated
var h2 HashTable<string, T>        // legal only if T is in scope from an enclosing declaration

It is allowed to specify a type parameter as the type of a variable, as long as the type parameter has a valid declaration in the current scope.

It is illegal for a naked type parameter to itself act as a generic type. For instance, the following is illegal.

type A<T, U> struct {
    x T<U>  // illegal: a naked type parameter cannot act as a generic type
}

Function/Method Parameter Type Specification

A function or a method can utilize type parameters available in scope. The function/method will then be generic in those type parameters utilized.

func (h *HashTable<K, V>) At(k K) V {
    ...
}

A function/method may itself be generic in one or more type parameters. This is independent of its being generic in one or more type parameters from an outer scope.

func modulus<T>(v T) float64 {
    ...
}

Allocation

In the presence of type parameters, allocation requirements may be determined at run time. The type class of the type parameter determines the allocation mechanism. Also, value types and reference types have different semantics, as described later.

type S struct {
    ...
}

var v = new(HashTable<string, S>)

An Example

The following example illustrates the above.


type Vector<T> []T

func (v *Vector<T>) Push(el T) {
    *v = append(*v, el)
}

// Map is generic in U, in addition to the enclosing type parameter T.
func (v *Vector<T>) Map<U>(mapper func (T) U) *Vector<U> {
    u := new(Vector<U>)  // allocation involving the method-specific U
    for _, el := range *v {
        u.Push(mapper(el))
    }
    return u
}

Constrained Generic Types

Since an unconstrained generic type cannot assume anything specific about the nature of the types eventually instantiated into its type parameters, the set of possible operations on the elements of such types is very small. Constraints can improve the situation by allowing a wider set of operations.

A type parameter can optionally be constrained to only those types that satisfy a particular interface. This enables all the methods from the interface to be utilized in the context of variables belonging to the applicable type parameter.

type Comparable<E> interface {
    Compare(E) int
}

type OrderedSet<E : Comparable<E>> struct {
    ...
}

The above declaration allows ordered sets to be constructed over any type whose elements are ordered according to the Compare() relation.
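
For illustration, a user-defined type could satisfy the constraint as follows. This is a sketch in the proposal's syntax; the Version type is hypothetical.

type Version struct {
    major, minor int
}

// Compare makes Version satisfy Comparable<Version>.
func (a Version) Compare(b Version) int {
    if a.major != b.major {
        return a.major - b.major
    }
    return a.minor - b.minor
}

var releases OrderedSet<Version>  // legal: the constraint is satisfied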

Similarly, a type parameter can optionally be constrained to only those types that belong to a particular type class.

type Vector<T : Integer> []T 

The above declaration allows a generic vector type to be written with algorithms that provide operations for any integral type.
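
For instance, a method such as the following hypothetical Sum becomes expressible, because every type in the Integer class supports the built-in arithmetic operators.

func (v *Vector<T>) Sum() T {
    var total T
    for _, el := range *v {
        total += el  // legal only because T is constrained to Integer
    }
    return total
}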

Constraints on type parameters can appear at all sites except allocation sites. A constraint on a type parameter need only be specified when it is introduced for the first time. In the example above, for instance, the methods of OrderedSet need not repeat the constraint on E.

It is illegal to specify more than one constraint for a given type parameter.

Implementation Notes

Caveat: I do not understand the current Go compilers. Hence, the following is only a hint. More knowledgeable people should be able to comment on the naïveté, the feasibility and the complexity of implementing a scheme such as the one suggested here. I have represented some of the logic as high-level pseudocode, in the hope of making it easier to discuss.

The following implementation notes cover various aspects of compilation, linking and reflection.

Compilation of Generics

In Go, compilation of a generic type generates two outputs: (a) generic type metadata and (b) partially compiled form of the generic.

Generic Type Metadata

At least the following metadata is generated for a given generic type, during compilation. This information is packed into a structure, which we refer to as a type handle, that is amenable to reflection.

  • unique identifier of the generic,
  • number of type parameters,
  • type handles of all type parameters,
  • function descriptors of all methods and
  • a pointer to the method table.

A similar type handle is generated for each type parameter. It contains at least the following metadata.

  • unique identifier of the type parameter,
  • a slot to hold the size of an instantiated type,
  • any constraint specified for it and
  • a pointer to its entry in the applicable function/method table.

In the case of functions and methods, a function descriptor is generated. As with a type handle, this too is amenable to reflection. It contains at least the following metadata.

  • unique identifier of the function/method and
  • type handles of all type parameters.

It should be noted that the implementation distinguishes function-specific (or method-specific) type parameters from those declared in any applicable enclosing generic type.

If the generic type or function is exported, the above information is exported from the package of origin, and is available to the compiler, the linker and the runtime.

Partial Compilation

Each type parameter of the generic type serves as a slot that is filled for each instance of the generic type. A pointer to the metadata structure mentioned above fills this slot.

When compiling code, the size of the involved data types needs to be known. The following mechanism is used in order to arrive at one, for each type parameter.

When a generic type is compiled, the compiler first checks to see if its type parameters are constrained. For each type parameter constrained to a type class, the size of its elements is chosen to be that of the widest type (not in the variance sense, since Go does not support classical inheritance, but in the storage sense) in that class.

For each type parameter that is either constrained to an interface or unconstrained, the size is looked up from its type handle. This size is only a placeholder; it is filled, and may differ, for each instantiation of the generic type. This is applicable to all user-defined types supplied as value type arguments to any generic type.

compileGeneric() {
    if generic is a struct or an interface {
        generate type handle metadata placeholder
    } else {  // the generic is a function or a method
        generate a function descriptor placeholder
    }

    for each type-parameter {
        if type-parameter is constrained {
            if constrained to a type-class {
                use the size of the widest type in class
                point to the predefined pseudo-method-set of this type-class
            } else {  // constrained to an interface
                set a mark to obtain the size from type handle
                setup, or point to, the method set of the interface
            }
        } else {
            set a mark to obtain the size from type handle
            point to a predefined empty method set
        }
    }

    type check the code as per the above type handles and method sets

    generate Go AST using the above type handles and function descriptors
    serialize the generated AST into the package binary
    set pointers for functions, methods and method tables
}

Notes

  • The choice of an AST representation of the generic's code is intended to preserve the separate compilation model of Go. It also relieves us of the need to invent a new exportable intermediate language; we just use a standard part of the language. The package go/ast may have to be enhanced to support this requirement.
  • Please note that the generic code is type checked only once, in its package of origin. For each instantiation, the supplied type arguments are checked only against any specified constraints. This speeds up compilation, since there is no need for repeated full type checks. Even though we are using an approach opposite to that of C++ (where type checks are completely use-site based), we are not going anywhere near a Hindley-Milner (H-M) type system.
  • The above mechanism - as described - is not space-efficient for type parameters constrained to type classes. For instance, a vector of bytes will occupy more than a byte per byte. This is a trade-off between monomorphizing (code bloat) and user data space-efficiency. However, this could be tuned.
  • This lookup of the actual size from the corresponding type handle introduces a small run time overhead. However, it greatly facilitates code sharing. It may be possible, and even easy, to avoid this overhead through some clever trick. I do not know enough!
  • I also have not thought through any possible repercussions of differently-instantiated recursive functions and methods. For instance, foo<T1>(x T1) may have a call to foo<T2>(y), where y has type T2.

Compilation of Client Code

Once the generic type is compiled as above, client code can be written to instantiate and use it.

Metadata

When the generic type is instantiated in client code, and type parameters are specified, those type arguments are substituted in a copy of the generic's metadata. The result is a new concrete type. For most type checking purposes, the generic origins of this new concrete type can be ignored, as long as the applicable constraints are taken into account.

It should be noted that the unique identifier of the generic can be used to trace this instance back to its generic type definition. Similarly for type parameters and their constraints. This filled copy of the metadata is available to the compiler, the linker and the runtime.

Instance Code Generation

Generation of machine code for a specific instance of the generic type depends on factors such as whether a supplied type argument is a value type or not, and whether it is constrained or not. The following high-level logic could help illustrate the mechanism.

generateClientCode() {
    create a local copy of the generic type's metadata structure
    fill the metadata with actual type arguments

    // exact instantiation seen before?
    if new concrete type in package dictionary {
        // reuse the previously generated code
        register this type to use the corresponding code
        return
    }

    if new concrete type type-class-compatible with an existing type {
        // the generated code can be shared, even though the metadata
        // records the precise type arguments
        register this type to use the corresponding code
        return
    }

    read the serialized AST of the generic from the package of origin

    for each type-argument {
        type check against any applicable constraint

        if type-argument is value-type {
            handleValueType(type-argument)
        } else {
            handleReferenceType(type-argument)
        }
    }

    generate binary code from the AST, filling type information

    register new concrete type in package dictionary with pointer to generated code
}

handleValueType(type-argument) {
    if type-argument belongs to a built-in type-class {
        use the pre-selected (widest) element size
    } else {
        use the actual size of the given type
    }
}

handleReferenceType(type-argument) {
    use the predefined size of standard interface
}

The package dictionary referred to is always exported, i.e., it is available to the compiler, the linker and the runtime. This is independent of whether a particular generic instance is itself exported.

I currently think that the above exported information need not contain package namespace qualifiers. But, I could be wrong in thinking so, and namespace qualifiers may be necessary for some purpose.

In order to facilitate the process of looking up the size of a given value type at run time, a hidden parameter is inserted into all applicable function and method signatures. The corresponding type handle of the supplied type argument is passed into the said hidden parameter at run time. When multiple type arguments are involved, a single consolidated meta handle of their type handles is passed into the hidden parameter. This simplifies the internal signatures of the functions/methods.
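
As an illustration of this lowering, the modulus function from earlier might compile into something of the following shape. The names typeHandle and modulusImpl are hypothetical, and the actual representation would differ.

import "unsafe"

// typeHandle stands in for the per-type-parameter metadata described
// earlier; the size slot is filled at instantiation time.
type typeHandle struct {
    size uintptr
}

// Source level:    func modulus<T>(v T) float64
// Lowered form, with the hidden type-handle parameter made explicit.
func modulusImpl(th *typeHandle, v unsafe.Pointer) float64 {
    _ = th.size  // drives element access and copying in generated code
    _ = v
    return 0
}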

Notes

  • Since Go does not support traditional inheritance, there is no transitive closure of types that we need to take into account. Thus, the amount of code explosion is much less than what it would be had transitive closure been applicable. It should be noted that this holds independently of whether generic metadata is used to reuse generated code. The combination of the two, though, results in a big win over the code bloat that would otherwise occur.
  • A topic that is not discussed relates to partial specialization. It can be dealt with once the mechanisms governing the simpler cases are understood and established.

Linking

During linking, the generic metadata exported from various packages is checked to see if there are redundant copies of:

  • the same metadata in different packages and
  • the same code in different packages (library files).

A global dictionary is created by merging the above information from all packages. This global dictionary is constructed using the exported package-level dictionaries themselves. The detected redundant copies are not linked into the final executable.

Reflection Capabilities

Reflection of generic types at run time is possible. Type handles and function descriptors are utilized for this purpose. At least the following capabilities are provided through the reflection API.

  • Query if a type is generic.
  • Query the number of type parameters.
  • Query the number of type-specific parameters.
  • Query the number of function-specific (or method-specific) parameters.
  • Retrieve specific or all type handles.
  • Retrieve the information within a type handle or a function descriptor.

The specific form of the API can be determined later to suit the philosophy of the current reflection API.
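
Purely as an illustration, one hypothetical shape for a part of such an API follows; every name here is a placeholder.

// GenericType describes a possibly-generic type at run time.
type GenericType interface {
    IsGeneric() bool
    NumTypeParams() int    // all type parameters
    NumTypeSpecific() int  // declared on the enclosing generic type
    NumFuncSpecific() int  // declared on the function/method itself
    TypeHandleAt(i int) TypeHandle
}

// TypeHandle exposes the metadata recorded for one type parameter.
type TypeHandle interface {
    ID() uint64
    Size() uintptr
    Constraint() interface{}  // the interface or type class, if any
}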

2011-11-02

A proposal for generics in Go

N.B. This is a work in progress. It is intended to help me understand the subject better, elicit comments from more knowledgeable people, and elucidate the nuances of generics in Go's context. The following proposal makes use of a syntax for ease of presentation. The syntax is, however, not intended to be the primary subject, nor is it intended to represent the complexities or possible complications of the underlying mechanisms.

Background

In Go, there are fundamental differences between built-in types and user-defined types. When we try to come up with mechanisms that help treat both of them on a completely equal footing, it appears to be stretching the language along difficult dimensions. Particularly troublesome are issues such as the following.

  • Built-in types (like int and float64) do not have methods; they only have a pre-defined set of operators. Users cannot redefine those operators, nor can they define new operators.
  • Operator overloading is not allowed, preventing user-defined types from exhibiting operator compatibility with the built-in types.
  • Operators are not methods themselves, and hence do not allow a symmetry between the built-in operations and the user-defined.

Rather than explore possible solutions to the above, the current proposal accepts them as facts that will not change. In other words, the inherent differences between built-in types and user-defined types are preserved, and no unification is attempted. By adding one new concept (that of type classes, see hereunder) to the language, backwards-compatibility can be maintained even as generics are introduced into the language.

Storage Classes

Types T1 and T2 belong to the same storage class if elements of T1 can either be assigned to those of T2, or converted to T2, and vice versa. Please note that only conversion is considered, not a type assertion.

Type Classes

Types T1 and T2 belong to the same type class if the following are true:

  1. T1 and T2 belong to the same storage class and
  2. every operation defined on elements of T1 is also defined on those of T2, and vice versa.

As per the above definition, for instance, int32 and int64 belong to the same type class, while int64 and complex64 do not. Thus, type classes play the equivalent role of interfaces for built-in types. However, type classes do not introduce new concrete types.

Predefined Type Classes

The Go language defines a set of predefined type classes for numeric types, as enumerated here.

  • Integer This type class encompasses the following built-in types: int, int8, int16, int32, int64, uint, uint8, uint16, uint32, uint64 and byte.
  • Float This type class encompasses the following built-in types: float32 and float64.
  • Complex This type class encompasses the following built-in types: complex64 and complex128.

In addition, the set of all named interface types in a program constitutes a single type class. This class is unnamed, and is not accessible to the users; it is defined solely for the purposes of implementation. It is not possible to define new type classes.

Generic Types

A type G with a type parameter T is said to be generic in T. It is denoted by G<T>. A generic type is allowed to be generic in more than one type. Where multiple type parameters need to be specified together, they are comma-separated. Go's generic types are reified, and are amenable to introspection at run time.

A type parameter T can represent any valid, assignable Go type, with the exception of interface{}. The semantics of interface{} interfere with the compile-time checks that generics require. Moreover, the choice of interface{} for a type expressly avoids compile-time specification of a concrete type on the part of the user.

The semantics of generic types are dependent on the type parameters; in particular, value types and pointer types lead to different semantics. Go's generic types are reified, and additional type information is passed through hidden parameters (i.e., the dictionary approach). In the most general case, for efficiency reasons, a distinct reified copy may be generated for each instantiated type class. However, this is an implementation issue.

The dictionary approach incurs some run time cost, but usually does not need the compiler to be available at run time. In my opinion, it is desirable, since that preserves the current Go model.

A type parameter T can appear in the following positions (or sites):

  1. type parameter declaration,
  2. variable type specification,
  3. function/method parameter type specification and
  4. allocation.

Type Parameter Declaration

Type parameter declaration makes a new type parameter available for subsequent use within the applicable scope. Such a declaration does not introduce a new concrete type. It is illegal to declare a type parameter that is unused within that scope; similarly, it is illegal to use a type parameter that is not declared in that scope or any of its enclosing scopes. Type parameters are not shadowed; hence, it is illegal to re-declare a type parameter that is already in scope.

A generic type G<T> with a specification for T introduces a new concrete type. This new concrete type behaves in full compliance with the rules governing normal Go types, including those for type inference, with explicit exceptions where noted.

Given a generic type G that is dependent on type parameters T1, ..., Tm, it is possible to construct a generic type H that is the same as G with n (n < m) of the type parameters instantiated. This results in H being generic in m-n parameters.

type HashTable<K, V> struct {
    ...
}

type HashSet<K> HashTable<K, bool>

Variable Type Specification

Using a generic type as the type of a variable requires specification of concrete types for the generic type's type parameters. The number of specified type arguments should match that in the generic type's declaration. It is illegal to leave any type parameter uninstantiated.

var h1 HashTable<string, float64>  // legal: both type parameters instantiated
var h2 HashTable<string, T>        // legal only if T is declared in the current scope

It is allowed to specify a type parameter as the type of a variable, as long as the type parameter has a valid declaration in the current scope.

Function/Method Parameter Type Specification

A function or a method can utilize type parameters available in scope. The function/method will then be generic in those type parameters utilized.

func (h *HashTable<K, V>) At(k K) V {
    ...
}

A function/method may itself be generic in one or more type parameters. This is independent of its being generic in one or more type parameters from an outer scope.

func modulus<T>(v T) float64 {
    ...
}

Allocation

In the presence of type parameters, allocation requirements may be determined at run time. The type class of the type parameter determines the allocation mechanism. Also, value types and pointer types have different semantics, as with non-generic types.

type S struct {
    ...
}

var v = new(HashTable<string, S>)

An Example

The following example illustrates the above.

type Vector<T> []T

func (v *Vector<T>) Push(el T) {
    *v = append(*v, el)
}

func (v *Vector<T>) Map<U>(mapper func (T) U) *Vector<U> {
    u := new(Vector<U>)
    for _, el := range *v {
        u.Push(mapper(el))
    }
    return u
}

Constrained Generic Types

Since an unconstrained generic type cannot assume anything specific about the nature of the types eventually instantiated into its type parameters, the set of possible operations on the elements of such types is very small. Constraints can improve the situation by allowing a wider set of operations. A type parameter can optionally be constrained to only those types that satisfy a particular interface. This enables all the methods from the interface to be utilized in the context of variables belonging to the applicable type parameter.

type Comparable<E> interface {
    Compare(E) int
}

type OrderedSet<E : Comparable<E>> struct {
    ...
}

The above declaration allows ordered sets to be constructed over any type whose elements are ordered according to the Compare() relation.

Similarly, a type parameter can optionally be constrained to only those types that belong to a particular type class.

type Vector<T : Integer> []T  

The above declaration allows a generic vector type to be written with algorithms that provide operations for any integral type.

Constraints on type parameters can appear at all sites except allocation sites. It is illegal to specify more than one constraint for a given type parameter.

2011-08-06

Linux distribution trials end, hopefully!

I have used several Linux distributions over the last 16-17 years. For many years, my primary Linux distribution was SuSE/openSUSE. Since 2007, it has mostly been Ubuntu. Starting with release 10.10, though, Ubuntu's (in)stability had become increasingly concerning. Release 11.04 only made matters worse. Window borders were erratic in their display, transitions between workspaces were jarring, the feel was sluggish, and an update even rendered the system unable to restart!

Initially, I thought that Ubuntu's Unity was playing tricks. It perhaps interfered with standard GNOME in such ways as to complicate GDM, power management, etc. However, I could not invest the time needed to investigate the problems. Then, I tried Linux Mint 11. It sure did not have Ubuntu's Unity, but had (almost) all the other problems that Ubuntu 11.04 had.

For the first time, I then tried Arch Linux. I was impressed! It is a small, fast, no-nonsense distribution, and has excellent documentation — the best that I have seen for any Linux distribution at all. Added to that, it is a rolling distribution! We, of course, have to allow that it is meant for people who are rather technically advanced. My team thought it was too much of a hassle to even set up a `working' development machine.

Two weeks ago, I tried CentOS 6. My previous trial of CentOS was with version 5.4. Its power management was so terrible that my laptop used to drain out in about 1.5 hours. It used to last over 3 hours otherwise.

This time, though, it is Just Good! I installed the so-called `Minimal Desktop' flavor. In these two weeks of exclusive use, it has demonstrated itself to be very stable, fast and predictable. I am very impressed! All going well, I will be switching my team to CentOS 6, as well, in the coming week or so!

2011-07-28

Performance improvement - 3

Hmm … a new addition to the ring detection algorithm set brings the distance matrix back into the game. Over the last week, I was interacting with potential users, trying to assess the accuracy of the algorithm. They wanted an SSSR (Smallest Set of Smallest Rings) result set for each input molecule, rather than the more exhaustive set that I had deemed necessary.

Accordingly, I implemented a new SSSR algorithm. This relies on eliminating contours which can be short-circuited. That, in turn, requires shortest paths between all atoms - pair-wise - to be determined first. So, the matrix is back!

Consequently, the running time for my test set of 19946 molecules is now in the range of 16.5-17s. Preliminary profiling shows that building this matrix for each molecule takes almost 60% of the running time. I should investigate better spanning tree algorithms!

2011-07-12

LOWESS normalization

Several years ago, I led a team that developed a bioinformatics product, specifically for managing and analyzing data from microarray experiments. I wrote several of the algorithms needed (including those for hierarchical clustering and K-means clustering), together with parts of the underlying mathematical machinery (such as matrices and operations on them).

Even when I did not write an algorithm myself, I often used to write a version (sans error-checking) in Ruby. It was used either as a basis for a production version in C++, or to compare the output from an independently-developed C++ version. Either way, those Ruby versions were useful.

One such that I wrote was for performing locally-weighted robust regression, usually known as LOWESS normalization. It is actually a rather small algorithm. However, my team felt lazy. A few days later, as was the usual practice, I asked them to produce the C++ version for review. I was told that the Ruby version itself was being incorporated directly into the release branch. I was alarmed, and for two reasons:
  • the Ruby version was itself untested, and
  • we were dealing with huge data sets.

Ruby is notorious for being slow. I tried convincing the team that we could not afford to have one algorithm take a disproportionately long time in comparison to all the other algorithms in the product. However, we were running out of time, and the only other person working on the algorithms had quit by then. Perforce, I had to spend the little time available on testing the Ruby version rather than rewriting it in C++. The reference software used was R.

Much later, I decided to extract that algorithm, make it independent, and publish it under a liberal license. It was done with the consent of my company, and the result was published on Rubyforge. At that time, I announced its availability on the ruby-talk mailing list, but on no bioinformatics list per se. Despite that, when I looked at the project statistics today – after a few years – I was pleasantly surprised to see that it has been downloaded over 320 times!

Here is the README file therein.

* * * *

BACKGROUND


Microarray data normalization is an important pre-processing step for obtaining data that is reliable and usable for downstream analysis.

LOWESS normalization, or robust locally weighted regression, is one of the most commonly utilized within-slide normalization techniques. Within-slide normalization aims to correct dye-incorporation differences that affect all genes similarly, or that affect genes of comparable intensity similarly. Therefore, LOWESS normalization is a useful technique when we suspect a systematic dye-dependent bias.

LOWESS


Control-, housekeeping-, or dye-swap-based experiments assume the existence and availability of certain genes whose expression levels can be treated as constant across a wide range. In comparison, LOWESS is much more resilient to range, as well as to the type of study.

The LOWESS algorithm, as applied to dual-channel microarrays, utilizes a locally weighted polynomial regression of the A-M scatterplot in order to obtain the calibration factor to be applied, where:
    A    = log2(√(iR * iG)),
    M    = log2(iR / iG),
    iR   = intensity in the red channel, and
    iG   = intensity in the green channel.
for each spot on the microarray.
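
Though the script itself is in Ruby, in Go terms (the language used elsewhere on this blog) the per-spot computation is simply the following sketch; input validation is omitted.

import "math"

// amValues computes the (A, M) coordinates of one spot from its
// red- and green-channel intensities.
func amValues(iR, iG float64) (a, m float64) {
    a = math.Log2(math.Sqrt(iR * iG))
    m = math.Log2(iR / iG)
    return
}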

A general LOWESS algorithm is parameterized by five variables:
  1. order of the polynomial,
  2. number of regression iterations,
  3. weight function,
  4. subset of the input space considered for regression, and
  5. number of points to regress at.

Of the above, I have restricted the order of the polynomial to 1, and the number of regression iterations, also, to 1 (see references for why these are sufficient in a vast majority of cases). These parameters are normally referred to as degree of polynomial and iterations, respectively.

The weight function employed is that which is most widely used: the tri-cube function, defined as
    weight = (1 - (d/m)^3)^3,
where,
    d = distance between the point of regression and the current point, and
    m = maximum distance in the interval.
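
A direct rendering of the weight function, assuming 0 < d <= m:

// triCube computes the tri-cube weight of a point at distance d from
// the point of regression; m is the maximum distance in the interval.
func triCube(d, m float64) float64 {
    r := d / m
    w := 1 - r*r*r
    return w * w * w
}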

The operating subset of the input space, for regression, is determined using a fraction in the half-open interval (0, 1]. This fraction is usually called the smoothening parameter.

The input space is divided into a number of cells, and a representative point is picked up from each cell for regression. The number of elements in such cells is determined by a fraction in the half-open interval (0, 0.1]. This fraction is usually called delta.

As can be readily seen, too high a value for the smoothening parameter tends to over-smoothen the plot, while too low a value is very sensitive to outliers. Too high a value for delta leads to fewer points at which regression is performed, resulting in a coarse approximation, while too low a value yields higher accuracy, but at the expense of more computing power.

MODES SUPPORTED


Normalization can be performed on either all the elements of the microarray, or by pin/zone/subarray.

HOW TO RUN


This script requires Jeff Mitchell's 'linalg' for matrix manipulations. It is available from the Ruby Application Archive. Please install it before running this script.

The input data must be in a tab-separated text file, and must not contain any header or trailer lines.

The file 'main.rb' is the driver that needs to be run. Run it without any arguments, and brief usage notes are printed.

REFERENCES


  • Cleveland, W.S., "Robust locally weighted regression and smoothing scatterplots", J. Amer. Stat. Assoc. 74, 829-836 (1979).
  • Yang, I.V. et al., "Within the fold: assessing differential expression measures and reproducibility in microarray assays", Genome Biol. 3, research0062.1-0062.12 (2002).
  • Quackenbush, J., "Microarray data normalization and transformation", Nature Genetics. Vol.32 supplement pp496-501 (2002).
  • Berger, J.A. et al., "Optimized LOWESS normalization parameter selection for DNA microarray data", BMC Bioinformatics 2004, 5:194.

2011-07-01

Performance improvement - 2

No sooner had I finished my previous post than I began reviewing the algorithmic aspects of ring discovery in the program. Two points stood out.

  1. During the analysis of the molecule's graph, we build a matrix of pair-wise distances between all atoms of the molecule.
  2. During ring discovery, we remove all terminal chains (remember terminal nodes from here?), and then explore the remaining graph to detect rings.

As I reviewed the code, I saw that I was recursively walking the graph in both cases. When building the distance matrix, the logic is simple: directly bonded neighbors are at a distance of 1; recurse over the neighbors of each neighbor, and increment the distance by 1 at each level. Whenever we encounter a shorter path between the same pair, we record this shorter distance. Exercise: Why do multiple paths exist at all? A similar algorithm is followed for ring discovery, but walking only the subgraph comprising non-terminal atoms.
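
A minimal sketch of that recursive walk, assuming a hypothetical adjacency-list representation: adj[i] lists the neighbors of atom i, and dist is pre-filled with a large sentinel everywhere except dist[i][i] = 0. Calling fillDistances(adj, dist, i, i, 0) once for each atom i fills the matrix.

// fillDistances propagates distances outward from atom `from`,
// recording a distance whenever a shorter path is found.
func fillDistances(adj [][]int, dist [][]int, from, cur, d int) {
    for _, nbr := range adj[cur] {
        if d+1 < dist[from][nbr] {
            dist[from][nbr] = d + 1
            fillDistances(adj, dist, from, nbr, d+1)
        }
    }
}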

Whenever a ring is detected, it is necessary to determine if it is a Jordan curve (simple closed path) or not. We are interested in Jordan curves; we mark others as fused systems. How do we, then, determine if the current path is a Jordan curve? The parent molecule of every component has a map that holds information about bonds. The key of the map is a hash; the value, the bond itself. Suppose that the bond is between the atoms a1 and a2. The hash is computed as: 10000 * aL.id + aG.id, where aL is the atom with the lower ID of the two, and aG is that with the higher. This map is constructed when parsing the input, and is, consequently, ready by the time ring discovery is requested. Clearly, it is adequate to determine if a given path is a Jordan curve or not. How? :-)
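
In code, the key computation is simply the following sketch (field and function names are illustrative):

// bondHash computes the map key for the bond between two atoms,
// placing the lower ID first, as described above.
func bondHash(id1, id2 int) int {
    if id1 > id2 {
        id1, id2 = id2, id1
    }
    return 10000*id1 + id2
}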

I suddenly recollected that in the first version of the ring discovery algorithm, the distance matrix was used to test if a path was a Jordan curve. In that version, it was necessary to pre-calculate the distance matrix. With it being no longer mandatory, I removed the call to build it during ring discovery.

Lo and behold — the running time dropped from 18.5-19s to 6.5-7s! This time, the improvement in performance had nothing to do with Go, and everything to do with my own oversight. Needless to say, I am happier now! :-)

2011-06-30

Performance improvement

One of my chemistry programs does a topological analysis of input molecules, printing a variety of information about them. It currently processes a standardized input file of about 20,000 molecules (19,944, to be precise) in about 18.5-19s. Considering the fact that the running time came down to that level from as much as 149s a few months ago, I am happy! A few key reasons for this improvement are:

  1. replacement of functional-style closures in collection methods with C-style explicit loops (in several places),
  2. manually inlining a few functions, and
  3. upgrading from r57 to the latest weekly.

I always have a temptation to use functional-style closures with collections. They suit my style of thinking. Accordingly, I wrote a custom collection – à la container/vector – with several methods that take a closure. Examples include Collect (or Map), Detect (or Find), Select (or Filter) and TakeWhile. You may recollect that the no longer extant exp/iter used to provide some of those methods!
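
As an example of that style, a Select over a hypothetical element type looked roughly like this:

// Vector is a sketch of the custom collection mentioned above.
type Vector []int

// Select answers a new Vector holding the elements for which pred
// returns true.
func (v Vector) Select(pred func(int) bool) Vector {
    out := make(Vector, 0, len(v))
    for _, el := range v {
        if pred(el) {
            out = append(out, el)
        }
    }
    return out
}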

However, spurious closures are not yet converted into normal function calls by the Go compiler. So, the said replacements are required until the current work on escape analysis begins to yield positive results. It is rather inelegant, but there is currently no alternative.

A related issue concerns inlining. As Ron Minnich remarked, sometimes manually inlining a function may produce a better result than what a set of sophisticated algorithms employed by the compiler may produce. Nevertheless, Russ Cox has confirmed that inlining is a high-priority item. So, we can expect to see some improvement in the coming months.

When both the above get addressed, I am sure that almost all Go programs stand to benefit significantly. It does depend on the kind of actual workload in the program, but a large class of programs should benefit automatically. I, for one, eagerly await that release!

2011-06-14

Trip to Bengaluru

Ah ... Bengaluru!

  • Some useful work.
  • Meeting a few old friends.
  • A visit to Lalbagh.
  • A dinner at Maiyyas.

That quickly sums up my last week's trip!

2011-06-06

Criteria for evaluating the new language to use

My team asked me for the criteria I used for evaluating the candidate programming languages. Here is the relevant part of my response e-mail to them. N.B. The current product is written in a mix of C and C++, and has a little over 300,000 lines of code.

* * * *

I have considered the following parameters in my evaluation. Should we choose to use language X, it should satisfy the following criteria.

1. Features of the language


It should be possible to do in X all the most useful things we currently do in C/C++. Doing those things in X should be at least as easy as in C/C++, and preferably easier.

2. Expressiveness


X should allow us to model our problem without forcing us to devote too much time to book-keeping and design elements that are technological in nature.

3. Workarounds


When a facility available in C/C++ isn't in X, there should be a reasonably convenient workaround available.

4. Additional power


X should provide additional power by way of facilities that C/C++ either do not provide, or only do so in a convoluted manner. Of particular relevance here is the ability to easily scale the program to multiple processors and nodes.

5. Readability & maintainability


Idiomatic code written in X should be easy to read and understand. It should lend itself to easy debugging and long-term maintenance.

6. Low barrier to entry


X should present a low entry barrier for programmers with background in C/C++ or, to a lesser degree, C# and Java. Reasonably good programmers should be able to start writing useful code within a fortnight of study and practice.

7. Code re-use


It should be possible to selectively re-use current code from within X.

8. Deployment compatibility


Introduction of X into the code should not adversely impact our current deployment scenarios.

2011-05-31

Graphs and molecules - 1

Background


Mathematics - referred to as the queen of the sciences by the great Carl Friedrich Gauss - comprises a large number of fields. It provides abstractions of varying sophistication, together with a vast body of more specific results. Several of them have applications in areas far beyond their origins.

One such that is (albeit indirectly) relevant to our current context is topology. Topology can be defined as the study of properties of objects that are invariant under continuous maps. Topology has evolved into a large field, whose major components include point-set topology, algebraic topology and geometric topology. Of these, point-set topology lays the foundation, defining most of the basic concepts. People who take a course in mathematics at the college level are familiar with several of them, usually in specific forms as applicable to real numbers, etc. In generalized forms, concepts such as continuity remain simple, while others such as compactness (every open cover has a finite subcover) can be surprisingly subtle!

Curves


Let I be an interval [x, y] of real numbers, with x < y. Let f : I → R². Further, let f be a continuous map. We can call the range of f a curve. It is common to call f itself the curve.

Now, suppose that the restriction of f to [x, y) is injective (i.e., f(a) = f(b) ⇒ a = b ∀ a, b ∈ [x, y)). The curve does not intersect itself, since the map is injective except at the single point y. Further suppose that f(x) = f(y). Such a curve is called a simple closed curve or a Jordan Curve.

Some texts define I above to be the interval [0, 1]. Another commonly found definition is: a Jordan Curve in R² is one for which it is possible to construct an injective continuous map g : S¹ → R², where S¹ is a circle. All the definitions are equivalent. Exercise: how? :-)

Graphs


Evidently, when we move to a discrete scenario, concepts such as continuity have to be adapted appropriately. Also, the discrete scenario brings two distinct kinds of entities into the discussion: nodes (or vertices) and edges. Note that the same notions exist in geometry as well. For instance, a triangle has three nodes and three edges. A graph is a collection of nodes and edges, where each node has at least one edge (i.e., one adjacent node).

A path can be defined as a sequence vᵢ of vertices, where an edge eᵢ exists between the nodes vᵢ and vᵢ₊₁. The index i varies in the discrete interval [1, n], for an applicable value of n.

Analogous to the continuous scenario, we can define a simple closed path or a simple cycle as a path with no repeated nodes except for the starting/ending node.

For our immediate purposes, we are mostly interested in undirected graphs.

Connectedness, terminal nodes


Let us denote the nodes of a graph by N, and its edges by E. For x, y ∈ N, x ≠ y, if there exists a path from x to y, we say that x and y are connected. If the above holds good ∀ x, y ∈ N, then we say that the graph itself is connected.

Suppose that the graph is not connected. Then it contains two or more connected subgraphs. Obviously, they are pair-wise disjoint, i.e., any given node belongs to one and exactly one such subgraph. Similarly, any given edge belongs to one and exactly one such subgraph.

A node with only one adjacent node connected to it (even though it can have any number of edges to that single adjacent node) is called a terminal node. As a special case, we note that a node that is part of a simple closed path cannot be a terminal node. Exercise: why? :-)
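
In code, the definition translates directly; a sketch over an adjacency-list graph, where adj[i] lists the neighbors of node i (with repetitions for multiple edges):

// terminalNodes answers the nodes that have exactly one distinct
// adjacent node, per the definition above.
func terminalNodes(adj [][]int) []int {
    var ts []int
    for i, nbrs := range adj {
        distinct := make(map[int]bool)
        for _, n := range nbrs {
            distinct[n] = true
        }
        if len(distinct) == 1 {
            ts = append(ts, i)
        }
    }
    return ts
}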

2011-05-24

Simple task distribution to worker goroutines -- example

The utility program stsearchsdf that I wrote about previously has a simple supervisor-workers setup for processing the input molecules in parallel. To begin with, I have an iterator that reads the input SDF file, and returns the text of the next molecule on each call to Next(). The input is read through a *bufio.Reader, as shown here.

// SDFile represents an on-disk file in MDL's SDF format.
type SDFile struct {
    f *os.File
}

// Init initializes an SDFile with the underlying file.
func (s *SDFile) Init(fn string, flag int, mode uint32) os.Error {
    f, e := os.OpenFile(fn, flag, mode)
    if e != nil {
        return e
    }

    s.f = f
    return nil
}

// Close closes the underlying file.
func (s *SDFile) Close() {
    s.f.Close()
}

// SdfIter is an iterator over SDFile.
type SdfIter struct {
    r *bufio.Reader
}

Once we open the input SDF file, we can request for an iterator as follows.

// Iter creates and returns an iterator over the current SDF file's
// molecules.
func (s *SDFile) Iter() *SdfIter {
    it := new(SdfIter)
    it.r, _ = bufio.NewReaderSize(s.f, BUF_SIZE)
    return it
}

This iterator utilizes the sequence of characters that separates consecutive molecules (the sequence is $$$$) to read the text corresponding to one molecule. It answers that text in the success scenario.

// Next reads the text of one molecule from the input buffered stream
// into a slice, and returns the same.
func (it *SdfIter) Next() (v string, b bool, err os.Error) {
    sl := make([]string, 0, SLICE_SIZE)

    for {
        s, e := it.r.ReadSlice(REC_SEP_H)
        if e != nil {
            err = io.ErrUnexpectedEOF
            return
        }
        sl = append(sl, string(s))
        s, e = it.r.ReadSlice('\n')
        if e != nil {
            err = io.ErrUnexpectedEOF
            return
        }
        sl = append(sl, string(s))

        if REC_SEP_T == strings.TrimRightFunc(string(s), IsWspace) {
            b = true
            break
        }
    }
    v = strings.Join(sl, "")

    return
}
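
For illustration, here is how the iterator might be driven by a caller. This is only a sketch: process is a hypothetical stand-in for the real per-molecule handler, and the flag and mode values are my own choices.

// readAll drives the iterator over every molecule in the named file.
func readAll(fn string) os.Error {
    var sdf SDFile
    if e := sdf.Init(fn, os.O_RDONLY, 0444); e != nil {
        return e
    }
    defer sdf.Close()

    it := sdf.Iter()
    for {
        v, b, e := it.Next()
        if !b {
            return e // end of input, or a read error
        }
        process(v) // hypothetical per-molecule handler
    }
}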

The supervisor then uses this iterator to read the molecules one at a time. Meanwhile, the supervisor creates the necessary communication channels. Since the nature of the actual task varies based on the user's request, the function representing the task (denoted simply fn here) is injected into the supervisor loop function. opts.mx determines the number of goroutines to spawn, and is taken from the command line.

    tchs := make([]chan TMsg, opts.mx)
    rch := make(chan RMsg, opts.mx)
    for j := 0; j < int(opts.mx); j++ {
        tch := make(chan TMsg)
        tchs[j] = tch
        go fn(j, tch, rch)
    }

The supervisor keeps track of whether the entire input file is processed, whether the search result has been found, and the number of live worker goroutines. When there are no more worker goroutines, the supervisor loop itself can exit.

    // Loop over the file.
    found := false
    eof := false
    liveHelpers := opts.mx
L1: for {
        if 0 == liveHelpers {
            break L1
        }

When the result has been found, or the input file is exhausted, or an error has occurred, all workers are sent termination messages, which each of them acknowledges. The acknowledgement is shown later.

        select {
        case rmsg := <-rch:
            if (-1 == rmsg.mn) && (!rmsg.b) {
                liveHelpers--
                continue L1
            }

            if !found && rmsg.b {
                fmt.Println("-- Molecule number :", rmsg.mn, "\n")
                fmt.Println(rmsg.v)
                found = true
            }

            if found || eof {
                tchs[rmsg.id] <- TMsg{-1, nil, nil}
                continue L1
            }

We are now ready for the task distribution itself. A fixed number of molecules (TASK_SIZE in the code below) serves as the quantum task size. This grouping helps in reducing the number of messages sent back and forth over the channels.

            var tmsg TMsg
            tmsg.mn = i
            tmsg.opts = opts
            tmsg.v = make([]string, 0, TASK_SIZE)
            k := 0
L2:         for k = 0; k < TASK_SIZE; k++ {
                v, b, _ := it.Next()
                if !b {
                    break L2 // TODO: This may need better handling.
                }
                tmsg.v = append(tmsg.v, v)
                i++
                if i >= opts.to {
                    break L2
                }
            }

            if k > 0 {
                tchs[rmsg.id] <- tmsg
            } else {
                tchs[rmsg.id] <- TMsg{-1, nil, nil}
                continue L1
            }
        }
    }

Once we exit the above loop, we can close the result channel rch, since all the workers will have exited by then.

Now we take a look at one of the task functions that is run in a worker goroutine. As soon as a worker is created, it sends an idle message to the supervisor.

func searchaHelper(id int, tch chan TMsg, rch chan RMsg) {
    rch <- RMsg{id, 0, "", false, nil}

In response, the supervisor starts assigning work to it. The task handling loop is quite simple: it exits when it receives a termination request (which it first acknowledges); else it processes input data.

    for {
        tmsg := <-tch
        if -1 == tmsg.mn {
            rch <- RMsg{id, -1, "", false, nil}
            close(tch)
            break
        }

        var rmsg RMsg
        rmsg.id = id
        rmsg.b = false
        for k := 0; k < len(tmsg.v); k++ {
            m := mdl.NewMolecule()
            rmsg.e = m.ParseMol(strings.Split(tmsg.v[k], "\n", -1), skip)
            if nil == rmsg.e {
                if m.HasAtoms(tmsg.opts.acount, tmsg.opts.comp) {
                    rmsg.mn = tmsg.mn + k + 1
                    rmsg.v = tmsg.v[k]
                    rmsg.b = true
                    break
                }
            }
        }

        rch <- rmsg
    }

Finally, depending on the actual outcome of the search, we show an appropriate message to the user.

The above is conceptually simple, and easy to understand. For production, we have to introduce the requisite error handling code, of course. But, Go makes it really easy to think about the task distribution structure at a reasonably high level. What is more, it allows us to implement such a structure in a straight-forward manner, too!
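
To round the discussion off, here is a self-contained miniature of the same supervisor-workers shape, reduced to summing numbers. All the names here are mine; this sketches the structure, not the actual stsearchsdf code.

package main

import "fmt"

type task struct {
    id   int   // -1 signals termination
    nums []int // the quantum of work
}

type result struct {
    worker int
    sum    int
    done   bool // true when acknowledging termination
}

// worker announces itself as idle, then processes task quanta until
// it receives a termination message, which it acknowledges.
func worker(id int, tch chan task, rch chan result) {
    rch <- result{id, 0, false}
    for {
        t := <-tch
        if t.id == -1 {
            rch <- result{id, 0, true}
            return
        }
        s := 0
        for _, n := range t.nums {
            s += n
        }
        rch <- result{id, s, false}
    }
}

func main() {
    const nw = 2
    tchs := make([]chan task, nw)
    rch := make(chan result, nw)
    for j := 0; j < nw; j++ {
        tchs[j] = make(chan task)
        go worker(j, tchs[j], rch)
    }

    next, limit := 0, 10
    live, total := nw, 0
    for live > 0 {
        r := <-rch
        if r.done {
            live--
            continue
        }
        total += r.sum
        if next >= limit {
            tchs[r.worker] <- task{-1, nil} // terminate this worker
            continue
        }
        // Hand out a quantum of two numbers.
        tchs[r.worker] <- task{next, []int{next, next + 1}}
        next += 2
    }
    close(rch) // all workers have exited by now
    fmt.Println("total:", total) // 0 + 1 + ... + 9 = 45
}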

2011-05-23

Large files and multiple cores - 2

In November 2010, I wrote a small utility program - in Go - to view and search for molecules in an MDL SDF format file. I wrote about it here. As the program turned out to be quite useful, I cleaned it up, and made it more streamlined and modular. Here is the current version of the function that prints its help text.

func printHelp() {
    str := `
NAME
    stsearchsdf - view, extract from and search an SDF file

SYNOPSIS - GENERAL FORM
    stsearchsdf command [params]

SYNOPSIS - SPECIFIC FORMS
    stsearchsdf help

    stsearchsdf show in=filename [from=m] [to=n]

    stsearchsdf copy in=filename out=filename [from=m] [to=n]

    stsearchsdf searcha in=filename (symbol=count)... [comp=(eq|gt|lt)]
    [from=m] [to=n] [mx=c]

    stsearchsdf searcht in=filename (tag=tagname tagval=tagvalue)... [from=m]
    [to=n] [mx=c]

    stsearchsdf counta in=filename (symbol=count)... [comp=(eq|gt|lt)]
    [from=m] [to=n] [mx=c]

    stsearchsdf countt in=filename (tag=tagname tagval=tagvalue)... [from=m]
    [to=n] [mx=c]

DESCRIPTION
    stsearchsdf is a program that can be used to view or extract parts of an
    SDF file.  It can also be used to search an SDF file for molecules that
    contain atoms of given elements in specified number, or those with
    specified tag values.

COMMANDS
    help
        Display this help text, and exit.

    show
        Display the sequence of molecules in the specified serial number
        range.

    copy
        Copy the sequence of molecules in the specified serial number range
        to the given output file.

    searcha
        Search for and display the first matching molecule containing specified
        numbers of given atom types.

    searcht
        Search for and display the first matching molecule containing specified
        tag values.

    counta
        Display the number of matching molecules containing specified numbers
        of given atom types.

    countt
        Display the number of matching molecules containing specified tag
        values.

OPTIONS
    in
        Input SDF file name.

    out
        Output SDF file name.  Any existing output file will be overwritten.
        Applicable to only the command 'copy'.

    from
        Number of the molecule where processing should start.  Defaults to 1.

    to
        Number of the molecule where processing should stop.  Defaults to the
        last molecule in the file.

    symbol
        Atomic symbol.  This should be given in proper case, e.g. C, Na.

    comp
        The numeric comparison to be used.  Use 'eq' for equality check, 'gt'
        for greater than, and 'lt' for less than.  Defaults to 'eq'.

    tag
        Name of the tag whose value follows.

    tagval
        Value of the tag field whose name this follows.

    mx
        Maximum number of threads to use for processing.  Defaults to 2.
        Applicable to only search and count commands.
`
    fmt.Println(str)
}

2011-05-22

Legacy code and multi-threading

After a hiatus of a few months, here is an update. What follows is an e-mail that I sent to my collaborators regarding the future technological direction of our organic chemistry product.
* * * *
Apart from resuming my work on importing available chemicals databases, I have spent some time over the last few weeks studying recent developments in multi-threaded programming. I have written several multi-threaded programs in the past. Usually, the following are the important problems that arise in the context of multi-threaded programs:

  1. identifying the variables (mutable state of data) that need to be accessed from more than one thread,
  2. determining the locking mechanisms needed to protect simultaneous access to such variables from multiple threads, and
  3. using appropriate synchronization mechanisms for proper control flow and cause-effect relationships.

In contrast to the model described above, an alternative approach was proposed by Hewitt, Bishop and Steiger in 1973. It is called the Actor model. It comprises independent objects (all objects are actors) communicating via messages. No other form of interaction exists between independent objects. This communication can be synchronous or asynchronous, but is usually asynchronous by default. Also, actors can create new actors. Actors can also maintain state, which determines how they behave in response to future incoming messages.

Yet another alternative was proposed in 1978 by Hoare, called Communicating Sequential Processes (CSP). While being similar in spirit to the Actor model (all objects are communicating processes), CSP differs in a few aspects. For instance, in the Actor model, the sender of a message needs to know the address of the receiver in order to send a message. In CSP, communication happens through channels: a sender need not be aware of the identity of the receiver at the other end of the channel. Also, in CSP, message-passing is in itself a synchronization mechanism - a sender cannot complete the process of sending a message unless the receiver is ready to accept one. In the Actor model, on the other hand, the `system' itself holds the messages that receivers are not yet ready to accept.

Please refer to http://en.wikipedia.org/wiki/Actor_model and http://en.wikipedia.org/wiki/Communicating_sequential_processes for introductory details.

It is pertinent to note that neither the Actor model nor CSP mandates conventional locks or globally-shared data! It is well known in the industry that locking and lock-based synchronization are difficult to implement correctly, and even more difficult to scale to multi-processor scenarios.

About 2-3 years ago, when I used Scala (http://www.scala-lang.org) for a few programs, I utilized the Actor model (provided in Scala through a standard library) to develop a framework for lock-less multi-threading that could utilize all the available processors. That was, of course, before I had heard of Akka. I employed it successfully for processing large amounts of data from the chemical databases of one of my clients.

Some time last year, I became aware of, and interested in, the Go (http://golang.org) language. These days, I use it for most new programs that I develop. Go implements CSP at the language level itself. I have found it to be very useful and convenient for developing multi-threaded programs that operate in a lock-less manner. I have recently written a utility program, using Go, to view and search for molecules in MDL's SDF file format.

Since C and C++ have no inherent notion of threading or message passing, there exists no straight-forward way of making ST's main program multi-threaded. We would have to employ the Boost libraries, the MPI (Message-Passing Interface) library, or (worse) POSIX threads directly, in order to wrap certain parts of the current main program. I remember that you looked at MPI (OpenMPI, in particular) several months ago.

In my (rather limited) experience, both the Boost libraries and MPI present steep learning curves. They are large, their APIs are large, they are inherently complex, and the sheer amount of code they require us to write is significant. I am, in fact, more concerned about the effort needed to debug and maintain such code in the long run. In contrast, I have found Scala's Actor model and Go's CSP model to be welcome reliefs by way of their simplicity and ease of use.

Scala is a language implemented for the Java Virtual Machine (JVM). This, unfortunately, means that should we choose to employ Scala, our clients would have to install Java on their Linux computers. The installation of Java is a decision outside our control, made by their respective IT departments.

Go, on the other hand, is similar to C and C++: it compiles to native code, and does not require a virtual machine. Consequently, we can give a new main program to our clients without asking them to install a JVM first. More importantly, we can easily call C code from within Go, making it feasible to re-use select parts of the current main program in a more flexible manner.

In view of the above, after several weeks of study and trials, I think that we should employ Go for making ST's main program multi-threaded. This applies to all areas of the main program that lend themselves to multi-threading, in general, and to the convergent synthesis problem, in particular.

I have written a small part of the new SDF import program in Go. Apart from a particular annoyance (owing to the absence of generics in Go), I should say that I was delighted to see how easily the program could be made multi-threaded to utilize both the processors in my computer. The effective increase in speed was about 50%, and I think that it can increase further with some optimization.

The above may be unclear or confusing in parts. Please ask me questions if so. However, kindly note that this is an important decision that has to be made, and one which cannot be (easily) reversed later. Accordingly, the more questions we consider now, the better.
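
To make the CSP point in the message above concrete, here is a minimal Go sketch (the names are mine). On an unbuffered channel, a send does not complete until a receiver is ready; the message-passing is itself the synchronization.

package main

import "fmt"

// A minimal illustration of CSP-style rendezvous in Go: on an
// unbuffered channel, a send blocks until a receiver is ready.
func main() {
    ch := make(chan string) // unbuffered: synchronous by construction

    go func() {
        ch <- "hello" // blocks until main is ready to receive
    }()

    fmt.Println(<-ch) // receiving releases the sender
}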

2011-02-28

Scaling databases

Here is an interesting prototype for scaling a database without partitioning. I admit that I have not yet fully grasped its lock-free algorithm for conflict resolution. But it looks quite promising. Also, note that this is a relational - not a NoSQL - database!

2011-01-26

Equivalence classes -- a recurring pattern

Within the last couple of months, I have found myself applying one pattern in three different scenarios.

(1) Literature on Clinical Trials

Each clinical trial - usually - has multiple trial aliases. A journal article may cite one or more of them. It is, therefore, possible for two articles to refer to the same trial, but using two different aliases. However, as we compile newer articles, it may be possible to cross-link them based on those that cite multiple aliases.

Example:
- Article A1 refers to trial aliases TA1 and TA5.
- Article A2 refers to trial aliases TA2 and TA4.
- Article A3 refers to trial aliases TA1 and TA3.
- Article A4 refers to trial aliases TA4 and TA5.
- Article A5 refers to trial alias TA3.

Based on the above, the program should infer that they are all referring to the same trial.

(2) Stripping Salts of a Molecule

A complex may have more than one molecule. Often, it is a combination of a main molecule and one or more salts. Reading from an MDL `.mol' file, the program should recognize the different units, identify the main molecule, and strip the salts. The input is read one atom at a time, followed by one bond at a time.

Obviously, the two atoms participating in a bond belong to one unit. The neighbors of each of them also belong to the same unit, etc.

(3) Fused Rings of a Molecule

A single molecule can have multiple ring systems. Within each such system may exist a set of fused rings. Step 1 is to identify all rings in the system. In Step 2, we have to identify fused rings.

If rings R1 and R2 are fused, they belong to one ring system. The fused neighbors of each of them also belong to the same ring system, etc.

Equivalence Classes

Each of the problems above exhibits the same recurring pattern -- that of equivalence classes! In each case, there is an equivalence relation on the input set (obtained, where necessary, as the transitive closure of the given pairwise relation), and the program has to utilize it to partition the set into its equivalence classes. Recognizing that pattern has helped me in streamlining my design and implementation.
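
As a concrete illustration of the pattern, here is a minimal union-find (disjoint-set) sketch in Go, applied to the trial-alias example above. The type and function names are mine.

package main

import "fmt"

// UnionFind maps each element to its parent; a root is its own parent.
type UnionFind struct {
    parent map[string]string
}

func NewUnionFind() *UnionFind {
    return &UnionFind{parent: make(map[string]string)}
}

// Find returns the representative of x's equivalence class,
// compressing paths along the way.
func (u *UnionFind) Find(x string) string {
    p, ok := u.parent[x]
    if !ok {
        u.parent[x] = x
        return x
    }
    if p != x {
        u.parent[x] = u.Find(p)
    }
    return u.parent[x]
}

// Union merges the equivalence classes of x and y.
func (u *UnionFind) Union(x, y string) {
    u.parent[u.Find(x)] = u.Find(y)
}

func main() {
    u := NewUnionFind()
    u.Union("TA1", "TA5") // from article A1
    u.Union("TA2", "TA4") // from article A2
    u.Union("TA1", "TA3") // from article A3
    u.Union("TA4", "TA5") // from article A4
    // TA3 (cited alone by article A5) now shares a class with every
    // other alias, so all the articles refer to the same trial.
    fmt.Println(u.Find("TA3") == u.Find("TA2")) // true
}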