October is the breast cancer awareness month. Cancer is classified as a genetic disease caused by the abnormal cell division that destroy body tissue. Wait! Cell? Body tissue? Disease? So now you might be wondering: what does this have to do with computers?
In fact, cancer research has been in the heart of life sciences for the past few decades. Since genetics play an important role in most cancers, computational methods are crucial in understanding the development of the disease as well as predicting the results of clinical trials for treatment. That’s where computer science comes into action.
Biological background
Before we define the computational problem, let’s review some biology from high school and learn some facts about cancer.
Our human body consists of trillions of cells. Although each cell has the exact same DNA all over your body, every cell carries out its own function. The DNA is a long sequence of nucleotides preserved inside a cell nucleus.
We represent the DNA as a sequence of characters consisting of four letters: A, C, G and T; they stand for Adenine, Cytosine, Guanine, Thymine respectively. For example, a portion of the DNA might look like: ACGTCCGGTAAATGGCGTA. In humans, the length of the DNA is about 3 billion nucleotides – that’s 3 billion characters. If we imagine it stored on a text file containing this sequence, the file size would be about 3 GBs. Actually, it turns out it is much larger than that, since we store additional information too.
Some parts from the DNA are copied from inside the nucleus to the outside. The copied regions are ultimately converted into something that makes the cell functional. For example, cells of the stomach copy regions of the DNA that will ultimately be converted into enzymes that help digesting food. Cells of the eyes copy regions of the DNA that will ultimately be converted into products that sense the light. Regions that are copied from inside the nucleus to the outside are called genes. For example, in ACGTCCGGTAAATGGCGTA, we might have the first three characters and the last three characters copied together and concatenated to form the gene ACGGTA that will ultimately be converted into something useful.
In our 3 billion nucleotides of the DNA, there are about 20,000 genes of different lengths. Suppose that the average length of a gene is 1000 nucleotides, then all genes will represent a length of 20,000,000 nucleotides; that is less than 1% of the total length of the DNA (3 billion). Does this mean that most of the DNA is not useful because it doesn’t encode genes? The answer is No. It’s extremely useful as we will see shortly.
Cancer genomics
At every single moment, inside our bodies, cells divide and replicate. Cell replication is also what we call growing up! As they substitute dead cells or just replicate to carry out different functions. When they do, they form new identical daughter cells. This replication starts with obtaining two identical copies of the DNA that each copy will be contained in the nucleus of a separate cell.
What is the problem? During replication, some nucleotides might get mutated in the DNA copying process. In general, mutations are ok as the copy process occurs in micro-seconds and it’s expected that mutations happen with this very high speed copying mechanism.
So, how harmful is a mutation? If it happens that mutations occurred in part of the DNA that encode genes, that might be a problem leading genes to be either over-expressed or under-expressed. And how likely is it to have mutations? It’s very likely as we said. However, luckily, if it happens, there is only less than 1% probability that this mutation lies in a gene region (as we showed above). Ok. What if I’m very unlucky that a mutation happened in a DNA region that encodes for an important gene? There is a backup plan, called programmed cell death. In the programmed cell death, a cell simply commits suicide if it discovers that it is wrong, as shown in the below picture.
How does the cell know that something went wrong during the copy process? There is a group of genes called tumor-suppressor genes that monitor the cell replication and explode it if it fails the checkpoint (that there is no mutations). In cancer:
- Mutations happen in the 1% gene regions of the DNA.
- Tumor suppressor genes fail to kill the malfunctioned cell.
- Another type of genes, called oncogenes, promote the replication process surviving all trials of killing it.
Result: a malfunctioned daughter cell that will in turn replicate forming more malfunctioned cells. At the end, the uncontrolled and abnormal cell division forms tumors, which we call cancer. Normally, a single mutation is not enough for tumors to form. Cancer is a result of the accumulation of a large number of mutations over a long period of time, that makes the cell resist all trials of being killed naturally.
A computational problem
Curing cancer is very difficult because every patient has his/her own story. Mutations that happened in patient x and caused the cell to divide abnormally is not the same as the mutations happened in patient y. That’s why there is no one treatment that cures all patients. Although we know some genes that their variations/mutations cause cancer, there are so many other factors that actually promote the disease. Now, we can define our computational problem.
Given genomic datasets of cancer patients, can we design experiments to:
- .. exactly define what mutations caused cancer for each patient?
- .. find correlations between the genomic features of the patients the type of cancer?
- .. identify biomarkers in the DNA that lead to the success or failure of a specific type of treatment?
- .. get a better understanding on the development of the disease over time?
- .. customize treatment for cancer patients?
- .. predict the outcome of clinical trials?
And the list of questions goes on. The challenge in the genomic data is that they are very heterogeneous. They come in many different formats. In addition, the high dimensionality of the genomic data (thousands of features/genes) and the relatively small number of samples available make the current statistical analysis and machine learning methods inefficient.
Cancer research is more focused now on the computational methods that can give us significant and meaningful insights about the disease and how to treat it. If you are someone with serious computational skills and you feel comfortable cleaning and analyzing tens of gigabytes of data, the computational biology community needs you. Data analytics, statistical methods and machine learning techniques are not monopolized by the software applications community. Indeed, they are now very required for helping doctors making a better decision for their patients.
If you feel motivated, join the cause and study CS + genomics!