If you are more interested in the coding and implementation aspects of the exercise, feel free to go directly to the Jupyter Notebook. This page focuses instead on explaining the approach and discussing the results.
The problem
Nowadays the genome is a mainstream concept. We know that it contains key messages (genes) that regulate life, and that this information is coded in an "alphabet" of 4 letters (A, T, C and G). You might wonder: "how did scientists discover that this bunch of letters carries the code of life?". Well, that is what we will look at today: from a bacterium's genome, a stream of these 4 characters, we will try to identify the "words" that compose the genetic information, or genes.
The approach
What kind of "words" carry the minimal genomic information? If we assume that these words all have the same length, we can analyze their frequency in our genomic "book" and see whether a specific dictionary yields a better structure for our text.
You can think of this as trying to decipher which language a book is written in by checking the frequency of its words against dictionaries of several languages. Surely the right dictionary will have more of its words appearing frequently in the book, right?
Now imagine the amount of data we would have to analyze if this were an actual book. A dictionary has thousands of words, so if we stored the frequency of each word in a column of a matrix, with each row being a "chapter" (more on this later), spotting patterns would be nearly impossible for a human. We have the same problem here: for example, the possible 4-letter combinations alone generate 4^4 = 256 columns.
That's where PCA (principal component analysis) comes into play. It allows us to find the key dimensions in our data, i.e. the combinations of words that represent most of the information contained in the text. This usage of PCA is also known as dimensionality reduction, and it is often used in Data Science to reduce complexity before applying other algorithms.
We humans are pretty good at analyzing 2-dimensional graphs. Therefore, by plotting the two principal components of our data, we hope to see differences in how each word length represents the data set, and maybe a pattern.
You will notice that a lot of the hard work lies in pre-processing, while the PCA and K-Means algorithms themselves are simple to apply. The key is feeding them the correct data, so let's get to it!
Pre-processing
As is usually the case in Data Science, pre-processing is key to getting accurate and understandable results. Both PCA and K-Means need standardized, featurized data, which we will build in this section.
File Extraction
Our data comes in a file in FASTA format (the most widely used format in bioinformatics). A FASTA file is close to a standard text file, with a header line for each piece of genome. We import the file, removing these header lines as well as any other unwanted characters.
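Here is a minimal sketch of this step, assuming a hypothetical file name "genome.fa" (the actual parsing in the notebook may differ slightly):

```python
def read_fasta(path):
    """Read a FASTA file and return the sequence as one long string."""
    with open(path) as f:
        # Keep only sequence lines, dropping FASTA headers (lines starting with ">")
        return "".join(line.strip() for line in f if not line.startswith(">"))

# "genome.fa" is a placeholder file name
genome = read_fasta("genome.fa").upper()
```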
Separation into Data points
The next step is to break this long string into smaller data points. Why do we do that? If we thought of this long string as one single "book" to read, we could theorize about how to read it, but we would be unable to prove anything, since we would only have one observation. Instead, by separating this book into "chapters", we are able to test our ideas across a number of observations.
So how many "chapters", or data points, should we break it into? To answer this we balance two factors: the first is having a critical mass of data points, and the second is how representative each data point is. For instance, we are looking at "words" of 1 to 4 letters, so a 100-character string would contain 25 to 100 "words". In this sense, if we use data points of 300 letters, we would be looking at around 1 000 data points with 75 to 300 words each.
Both of these numbers sound reasonable to start with, but given the subjectivity of the reasoning above, we can vary the data point length and see its effect on the results. A possible way to do the split is sketched below.
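This is one way to cut the genome string into fixed-length data points, under the 300-letter assumption discussed above (any incomplete final fragment is simply dropped):

```python
# Cut the genome into fixed-length, non-overlapping data points
fragment_size = 300
data_points = [genome[i:i + fragment_size]
               for i in range(0, len(genome) - fragment_size + 1, fragment_size)]
```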
Creation of word dictionaries
In order to analyze the frequencies of words in each data point, we need to create a list of all possible "words", which we call a dictionary. We do this by taking the cartesian product of the 4 basic letters with itself, with the number of repetitions varying from 1 to 4, thus creating one dictionary for each word length we are looking at.
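A short sketch of how these dictionaries can be built (names like "dictionaries" are just illustrative):

```python
from itertools import product

# One dictionary per word length: all combinations of the 4 letters
letters = "ACGT"
dictionaries = {k: ["".join(word) for word in product(letters, repeat=k)]
                for k in range(1, 5)}
# e.g. len(dictionaries[4]) == 256, as mentioned earlier
```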
Frequency tables
We are now ready to compute our frequency tables, the biggest task in pre-processing.
For every word size (therefore 4 times), we create a matrix of zeros with one row per data point and one column per dictionary word. Then we step through each data point from its beginning, reading one "word" at a time, and add one to the column of the word we just read. At the end of this process we have, for each word size, a table containing the frequency of every dictionary word in every data point.
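A minimal sketch of this counting step, assuming the data points and dictionaries defined earlier and a non-overlapping reading starting at the beginning of each data point (the notebook may implement this differently):

```python
import numpy as np

def frequency_table(data_points, word_length, dictionary):
    """Count the words of a given length read from the start of each data point."""
    column = {word: j for j, word in enumerate(dictionary)}
    table = np.zeros((len(data_points), len(dictionary)))
    for i, point in enumerate(data_points):
        # Read the data point word by word, starting from its beginning
        for start in range(0, len(point) - word_length + 1, word_length):
            word = point[start:start + word_length]
            if word in column:            # ignore words with unexpected characters
                table[i, column[word]] += 1
    return table

tables = {k: frequency_table(data_points, k, dictionaries[k]) for k in range(1, 5)}
```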
Standard Scaling
The last step of pre-processing is scaling our data: we transform the data set so that every column (feature) has a mean of zero and a standard deviation of one. This is an important prerequisite for PCA, K-Means and many other algorithms.
Why do that? To guarantee that all features have the same "weight" in the algorithm. Imagine that these features (columns) were grades given by different people to restaurants. Some people tend to be lenient, while others are strict in their reviews. Scaling the data in that case would remove the personal bias that could otherwise influence the results.
Scaling is very easy to do since there are several very good packages for it in Python. Here we use Scikit-Learn (which we also use for PCA and K-Means), as in the sketch below.
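A minimal version of the scaling step, applied to the frequency tables built above:

```python
from sklearn.preprocessing import StandardScaler

# Give every column a mean of 0 and a standard deviation of 1
scaled = {k: StandardScaler().fit_transform(tables[k]) for k in tables}
```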
PCA Analysis
All we need to do now is run the PCA algorithm, which takes only a few lines. We run it for each word size and then plot the results, as sketched below.
Wow! This is seriously cool: we can clearly see that 3-letter words result in clear, identifiable clusters, meaning that they give structure to the data. It is a strong indication that the information is coded in 3-letter words, which we now know to be true: these triplets are called codons.
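A possible way to produce these plots, assuming the scaled tables from the previous section:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project each scaled frequency table onto its first two principal components
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, k in zip(axes, sorted(scaled)):
    components = PCA(n_components=2).fit_transform(scaled[k])
    ax.scatter(components[:, 0], components[:, 1], s=5)
    ax.set_title(f"{k}-letter words")
plt.show()
```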
K-means clustering
Now the cherry on the cake =) Since we have the data set ready, we might as well apply a clustering algorithm to separate the clearly visible clusters for 3-letter words. The K-Means algorithm accomplishes this task for us. We use a known number of clusters here because they are clearly distinguishable and in line with reality, but we could just as well pick the best-performing number of clusters by trial. A possible implementation is sketched below.
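One way to cluster the 3-letter-word table and colour the PCA plot by cluster (here K-Means is run on the full scaled table; running it on the 2D projection would also work):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Cluster the 3-letter-word data into 7 groups and colour the PCA plot by cluster
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(scaled[3])
components = PCA(n_components=2).fit_transform(scaled[3])

plt.scatter(components[:, 0], components[:, 1], c=labels, s=5)
plt.title("3-letter words, coloured by K-Means cluster")
plt.show()
```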
We can see 7 clear clusters (six on the edges and one in the center). Any ideas about why this structure? We won't go into too many details here (you can read about it in the original paper), but the 6 outer clusters represent shifted readings of 3-letter words (i.e. starting to read the sequence one letter to the right or to the left), and the central cluster corresponds to data points that do not carry information (the genome is not composed only of information-carrying genes).
That's it for this first post. The next topic in the pipeline will be more pragmatic, related to recommendation systems for business. Stay tuned, and feel free to ask away or comment!


