The purpose of this experiment is to see whether an exploratory data mining program can recognize degrees of similarity among about 60 species of animals, ranging from honeybees and lions to starfish and parrots.
The software used is from www.viscovery.net. Some uses of this program are: (a) clustering customer data according to behaviour; (b) identifying biomarkers for the diagnosis of diseases in gene data analytics; (c) fraud profiling and forensic applications.
The data is from the University of California Irvine data archive (archive.ics.uci.edu/ml).
The ‘properties’ (attributes) considered were: backbone, legs, fins, lungs, hair, feathers, eggs, milk, tail, domestic, aquatic, airborne, venomous, predator. Presence of an attribute is denoted by 1 and absence by 0, except for legs, which takes a numerical value: the number of legs (2, 4, 6 or 8).
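As a rough illustration of this encoding, the sketch below turns one animal into a numeric vector. The attribute values for the lion are filled in by hand for illustration, not copied from the data file. Dividing the leg count by 8, so that it sits on the same 0 to 1 scale as the other attributes, is an assumption made here; the tool's own scaling is not described.

```python
# Attribute order follows the list in the text above.
ATTRIBUTES = [
    "backbone", "legs", "fins", "lungs", "hair", "feathers", "eggs",
    "milk", "tail", "domestic", "aquatic", "airborne", "venomous", "predator",
]

MAX_LEGS = 8  # largest leg count mentioned in the text


def encode(animal):
    """Return the animal's attributes as a list of floats in ATTRIBUTES order.

    Attributes missing from the dict are treated as absent (0).
    """
    vector = []
    for name in ATTRIBUTES:
        value = animal.get(name, 0)
        if name == "legs":
            # Assumption: scale the leg count into the 0..1 range.
            vector.append(value / MAX_LEGS)
        elif value in (0, 1):
            vector.append(float(value))
        else:
            raise ValueError(f"{name} must be 0 or 1, got {value!r}")
    return vector


# Hand-filled, illustrative values (not taken from the data file).
lion = {"backbone": 1, "legs": 4, "lungs": 1, "hair": 1,
        "milk": 1, "tail": 1, "predator": 1}

print(encode(lion))
```

Each animal becomes one such row, and the full table of rows is what a clustering tool would be given as input.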
The tool was made to learn the attributes of all the animals and group them into clusters, each cluster containing animals with some degree of overall similarity across all attributes. The output of the tool is a self-organizing map in which each cluster is shown in a different colour. Within each cluster, two animals can be considered similar if they lie close to each other. The distance of an animal from the centre of its cluster indicates how strongly it belongs to that cluster. Since we know the real types of the animals, the clustering found by the tool is good if animals of the same type are grouped together and different types fall into different clusters.
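To give a feel for how a self-organizing map arranges similar items near each other, here is a minimal textbook-style sketch in plain Python. It is not the algorithm used by the Viscovery tool, whose internals are not described here. The grid size, number of epochs, learning rate and the four-attribute toy data are all assumptions chosen for illustration.

```python
import math
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: squared_distance(weights[node], x))


def train_som(data, rows=4, cols=4, epochs=200, lr0=0.5, seed=0):
    """Train a small rectangular SOM and return {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        remaining = 1.0 - epoch / epochs
        lr = lr0 * remaining                  # learning rate decays to zero
        sigma = max(sigma0 * remaining, 0.5)  # neighbourhood radius shrinks
        for x in rng.sample(data, len(data)):  # present samples in random order
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                # Nodes near the winner on the grid are pulled harder toward x.
                pull = lr * math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += pull * (x[i] - w[i])
    return weights


# Hypothetical toy data: [feathers, fins, airborne, aquatic]
toy = {
    "bird_a": [1, 0, 1, 0],
    "bird_b": [1, 0, 1, 1],
    "fish_a": [0, 1, 0, 1],
    "fish_b": [0, 1, 0, 1],
}

som = train_som(list(toy.values()))
for name, vector in toy.items():
    print(name, best_matching_unit(som, vector))
```

Animals with identical attributes land on the same node, and dissimilar ones end up on different parts of the grid. A real tool adds a further step that groups neighbouring nodes into the coloured clusters.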
With these comments in mind, let’s take a look at the results:
a. There are 7 clusters (C1, C2, ... C7).
b. The classification is generally accurate: e.g. every fish is in C3, and all birds are in C1. The exception is Tortoise, which is in the wrong cluster.
c. C5 has all the venomous creatures. The exception is Frog, where the clustering is wrong.
e. C7 has all the domestic animals. Reindeer is classified as a domestic animal; I am not sure why, though I know people in Finland keep them like cattle.
f. Strangely, ‘Girl’ is in the same C7 cluster as domestic animals right next to ‘Pussycat’.
g. In C2, you can see that lion, leopard and cheetah are very close together, which fairly reflects what we perceive in real life.
h. Also in C2, you can see that flying mammals (fruit bat and vampire bat) are in the same cluster as the other mammals but far away from the animals which typify our perception of ‘mammal’.
i. In C1, the birds cluster, you can see that gull, skimmer and skua (all seabirds) are very close together, even though I never told the program that they are seabirds, nor is their diet specified.
j. And hey, a Ladybird (which is an insect) is on the border between the Insect (C4) and the Bird clusters (C1)!
k. It is quite interesting that Starfish is in the same cluster as Crayfish, Crab, Lobster and Clam.
l. Fruit Bat and Vampire Bat are in C2 on the border between birds and mammals, which is quite good since they are flying mammals.
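The closeness described in the points above can be approximated by comparing attribute vectors directly. The sketch below lists the attributes on which pairs of animals disagree. The values are filled in by hand for illustration rather than taken from the data file, and only a subset of the attributes is used.

```python
def differing_attributes(a, b):
    """Names of the attributes on which two animals disagree."""
    return sorted(name for name in a if a[name] != b[name])


# Hand-filled, illustrative values for a subset of the attributes.
animals = {
    "lion":      {"hair": 1, "feathers": 0, "eggs": 0, "milk": 1, "airborne": 0, "legs": 4},
    "leopard":   {"hair": 1, "feathers": 0, "eggs": 0, "milk": 1, "airborne": 0, "legs": 4},
    "fruit bat": {"hair": 1, "feathers": 0, "eggs": 0, "milk": 1, "airborne": 1, "legs": 2},
    "gull":      {"hair": 0, "feathers": 1, "eggs": 1, "milk": 0, "airborne": 1, "legs": 2},
}

for first, second in [("lion", "leopard"), ("lion", "fruit bat"),
                      ("fruit bat", "gull"), ("lion", "gull")]:
    diff = differing_attributes(animals[first], animals[second])
    print(f"{first} vs {second}: {len(diff)} differ {diff}")
```

On these illustrative values the lion and leopard are identical, the fruit bat differs from the lion on two attributes and from the gull on four, and the lion and gull differ on all six. That ordering is consistent with the bats sitting inside the mammal cluster but toward the bird side.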
Of what use is such a self-organizing map?
With the right data, the right ‘tuning’ and some additional post-processing, it could be used for, e.g.:
1. Stock market arbitrage. The program could be presented with a list of attributes such as P/E, EPS, volatility, market capitalization, analyst rating, Sharpe ratio, beta, % price change, average volume, etc. (all pre-processed and normalized). Stocks that are temporarily out of line with their peers, despite having many similar characteristics, can then be bought or sold in anticipation of a reversion to the normal situation.
2. Medical conditions, diseases and their symptoms can also be fed into such a network to estimate the likelihood of having or not having a disease. Of course this is not a simple task and requires information from medical specialists (whose input can also be fed into the program). Medical diagnosis is also not a black-or-white process but one with many shades of gray.
3. This tool is also suitable for classifying data with a mix of quantitative (measurable) and qualitative (non-measurable except as a ranking or grading) attributes, such as for different types of wine. On the one hand we may have numeric values such as acidity, price per x litre and age. On the other hand we also have the wine-taster’s grading, the colour, the country of origin, etc. All these could be used as input to see whether the distances in our program’s classification match the differences in $ price.
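Before mixed data like this can be fed to a clustering tool, everything has to become numbers on comparable scales; the same kind of preparation applies to the stock attributes in item 1. The sketch below shows one common way to do that: min-max scaling for numeric values, a rank for the ordered grading, and one-hot columns for the unordered country. The wines, the three-step grading scale and the field names are all hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the 0..1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, categories):
    """One column per category: 1.0 for the matching one, 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical wines, for illustration only.
wines = [
    {"name": "wine_a", "acidity": 3.2, "age": 2, "grade": "good", "country": "France"},
    {"name": "wine_b", "acidity": 3.6, "age": 10, "grade": "excellent", "country": "Chile"},
    {"name": "wine_c", "acidity": 3.4, "age": 6, "grade": "fair", "country": "France"},
]

GRADES = ["fair", "good", "excellent"]  # assumed scale, ordered worst to best
COUNTRIES = sorted({w["country"] for w in wines})


def encode_wines(wines):
    """Return {name: [acidity, age, grade_rank, one column per country]}."""
    acidity = min_max_scale([w["acidity"] for w in wines])
    age = min_max_scale([w["age"] for w in wines])
    rows = {}
    for i, w in enumerate(wines):
        grade_rank = GRADES.index(w["grade"]) / (len(GRADES) - 1)
        rows[w["name"]] = ([acidity[i], age[i], grade_rank]
                           + one_hot(w["country"], COUNTRIES))
    return rows


encoded = encode_wines(wines)
for name, row in encoded.items():
    print(name, row)
```

Price is deliberately left out of the input here, so that the resulting clusters can afterwards be compared against it.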