Machine learning basics in proteomics: analysis with the iFeature Python package
iFeature implementation for protein sequence analysis
Basics of Omics
Omics is an umbrella term for fields of study such as genomics, transcriptomics, proteomics, and metabolomics. Here we briefly introduce genomics and proteomics.
Genes and DNA
First of all, “genes” are sections of a chromosome that are (biologically) meaningful. Chromosomes are made up of deoxyribonucleic acid (DNA) molecules that are packed and stored together in the nucleus (as is the case in humans).
The basic unit of heredity that occupies a specific location on a chromosome. Each consists of nucleotides arranged in a linear manner. Most genes code for a specific protein or segment of protein leading to a particular characteristic or function. — NCI Dictionary of Genetics Terms
The gene is defined as a piece of chromosome which is sufficiently short for it to last, potentially, for long enough for it to function as a significant unit of natural selection — The Selfish Gene
A double-helix DNA molecule is built from four types of nucleobases: adenine (A), thymine (T), cytosine (C), and guanine (G). Each base is attached to a sugar-phosphate backbone, and two such strands pair up to form the double helix. Wrapped around proteins such as histones, the DNA molecule is condensed into a chromosome. Human cells have 23 pairs of chromosomes (22 pairs of autosomes and one pair of sex chromosomes), giving a total of 46 per cell.
From DNA, RNA, to proteins
Genes in DNA are “transcribed” into RNAs. Some of these are messenger RNAs (mRNA), which are “translated” into proteins (chains of amino acids) with particular functions; others are non-coding RNAs, which are not translated into proteins but can still play a role in the cell.
In the quest to understand life, we are often interested in which genetic defects lead to diseases and how, and in what tools (e.g. gene editing, functional proteins, etc.) are available to us in the dance of life.
Why machine learning in bioinformatics?
Machine learning is particularly powerful in genomics because of the scale of the data: each human chromosome contains roughly 50,000,000 to 300,000,000 base pairs, and each human cell has 46 chromosomes. Machine learning aims to capture patterns in data, which suits genomic sequences well, since we traditionally understand that certain genes (specific combinations of nucleobases) lead to certain physiological states. The computational power behind modern machine learning has helped drive breakthroughs in bioinformatics.
Data sources in bioinformatics
Bioinformatics draws on several kinds of data. First, sequence analysis starts from the sequences themselves and helps to predict the function of unknown genes. Second, some genes are known to contribute to biological processes, and mapping the activity of such genes and proteins generates gene expression data. Third, the folding of molecules and proteins can significantly affect their function, and structural biology aims to predict these structures. Finally, systems and network biology looks at biological processes from a system-wide perspective and can therefore capture more data about molecular behavior.
Machine learning in omics
Prediction questions
Classification prediction is commonly used in omics. For example:
- Binary: is an unknown sequence gene A or not?
- Multiclass: which one of the classes (Class 1, Class 2, Class 3, and Class 4) does protein A belong to?
A general workflow
In machine learning, we use “handcrafted” features from which models learn patterns. As a result, the model is able to relate the input to the output using the features. (In deep learning, “learned” features are used instead, but these are outside the scope of this article.)
The critical step of this workflow is choosing the features, because they determine what the model will learn. Here we introduce some commonly used features in bioinformatics. We will use a Python package, iFeature, for demonstration.
Feature generation
DNA
k-mer
k is an integer greater than 0 that specifies the length of each subsequence of adjacent nucleobases. A k-mer analysis counts how often each subsequence of length k occurs. For example, a 2-mer analysis counts the occurrences of pairs such as AT in a DNA sequence.
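As a rough illustration of k-mer counting, the sketch below slides a window of width k along a sequence and tallies each subsequence. The function name `count_kmers` and the example sequence are made up for this demonstration; this is not part of iFeature.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a DNA sequence (hypothetical helper)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    # Slide a window of width k across the sequence, one base at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Made-up example sequence: its 2-mers are AT, TG, GA, AT, TA, AT.
counts = count_kmers("ATGATAT", 2)
print(counts["AT"])  # number of times the 2-mer AT occurs
```

Windows overlap here, so a sequence of length n yields n - k + 1 k-mers in total.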
Proteins
Amino acid composition (AAC)
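The occurrence of each amino acid in a protein. There are 20 natural amino acids in humans, so the result is a 1x20 vector with one value per amino acid. For example, one entry describes how much aspartic acid (D) the protein contains. iFeature reports these values as frequencies, i.e. the count of each amino acid divided by the sequence length.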
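To make the idea concrete, here is a minimal sketch of an AAC calculation written from scratch. The helper name `amino_acid_composition` and the toy sequence are assumptions for illustration, and the code is not taken from iFeature.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each of the 20 amino acids (hypothetical helper)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    # Divide each count by the sequence length to get a frequency.
    return {aa: counts[aa] / len(sequence) for aa in AMINO_ACIDS}


# Toy sequence, not a real protein: 1 M, 2 D, 1 A.
aac = amino_acid_composition("MDDA")
print(aac["D"])  # fraction of aspartic acid in the toy sequence
```

The result always has 20 entries, so proteins of different lengths map to feature vectors of the same size.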
Pseudo-amino acid composition (PAAC)
PAAC describes the composition of the 20 natural amino acids, like AAC, and adds extra terms that retain some information about the positions of the amino acids in the sequence.
Dipeptide pair composition (DPC)
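The occurrence of each pair of adjacent amino acids in a protein. With 20 amino acids there are 20 x 20 = 400 possible pairs, so the result is a 1x400 vector. For example, one entry describes how often AR (alanine directly followed by arginine) occurs in the protein.
Motif feature: Position-specific scoring matrix (PSSM)
A motif of a protein is a short sequence of amino acids that is associated with a particular function.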
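The dipeptide idea can be sketched in the same way as AAC. The helper `dipeptide_composition` below is a hypothetical from-scratch version, not iFeature's code; it normalizes by the number of adjacent pairs in the sequence, which is one common convention.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 adjacent pairs (hypothetical helper)."""
    sequence = sequence.upper()
    total = len(sequence) - 1  # number of adjacent pairs in the sequence
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = {"".join(pair): 0 for pair in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard letters
            counts[pair] += 1
    return {pair: n / total for pair, n in counts.items()}


# Toy sequence: its adjacent pairs are AR, RA, AR.
dpc = dipeptide_composition("ARAR")
print(len(dpc), dpc["AR"], dpc["RA"])
```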
iFeature implementation with VS Code
We will use iFeature to generate features for protein sequences.
iFeature is a comprehensive Python-based toolkit for generating various numerical feature representation schemes from protein or peptide sequences.
- Install:
1. Open VS Code and navigate to the folder you will be working in.
cd [folder location]
2. Use the following command in the terminal to download the package.
git clone https://github.com/Superzchen/iFeature
The package is now downloaded into the folder.
3. You can now read the package information with
python iFeature/iFeature.py --help
4. Download data
We use UniProt to download protein sequence data for SARS-CoV-2.
Step 1: Type a keyword in the search bar and press “Search”.
Step 2: This returns a list of reviewed and unreviewed results. “Reviewed” entries (UniProtKB/Swiss-Prot) have been manually annotated by curators using information extracted from the literature. For this little exercise, we choose “Reviewed”.
Step 3: Click “Download>Download all>Format (fasta)>Compressed”. The FASTA format is commonly used in sequence analysis; each record consists of a “header” line followed by the “sequence”. This format is required by iFeature.
After downloading and decompressing the file, you can open it with a text editor such as Notepad or TextEdit.
Step 4: Put the data in the working folder.
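To see the “header” and “sequence” structure of a FASTA file in code, here is a minimal parsing sketch. Header lines start with ">" and the sequence may span several lines. The function `parse_fasta` and the two records below are invented for illustration; they are not real UniProt entries.

```python
def parse_fasta(text):
    """Map each FASTA header to its full sequence (hypothetical helper)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; drop the leading ">".
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are collected until the next header.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Invented example content, for demonstration only.
example = """>sp|EXAMPLE1|hypothetical protein one
MFVFLVLLPL
VSSQ
>sp|EXAMPLE2|hypothetical protein two
MADSNGTITV
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```

For a real file, you could read its contents with `open("covid.fasta").read()` and pass the text to the same function.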
5. Start generating features!
For example, to create AAC features:
# python iFeature/iFeature.py --file [dataset] --type [feature] --out [output file]
python iFeature/iFeature.py --file covid.fasta --type AAC --out covid_AAC.csv
Features supported by iFeature include AAC, EAAC, CKSAAP, DPC, DDE, TPC, BINARY, GAAC, EGAAC, CKSAAGP, GDPC, GTPC, AAINDEX, ZSCALE, BLOSUM62, NMBroto, Moran, Geary, CTDC, CTDT, CTDD, CTriad, KSCTriad, SOCNumber, QSOrder, PAAC, APAAC, KNNprotein, KNNpeptide, PSSM, SSEC, SSEB, Disorder, DisorderC, DisorderB, ASA, TA.
Change the argument after --type to create different features.
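If you want several feature types in one go, a small script can build one command per type. The sketch below only assembles and prints the commands, using the --file, --type and --out flags shown above. The helper `build_ifeature_command`, the output naming scheme, and the chosen feature list are assumptions for illustration.

```python
def build_ifeature_command(fasta_path, feature, script="iFeature/iFeature.py"):
    """Build the command line for one feature type (hypothetical helper)."""
    # Name the output after the input file and the feature, e.g. covid_AAC.csv.
    stem = fasta_path.rsplit(".", 1)[0]
    out_path = f"{stem}_{feature}.csv"
    return ["python", script, "--file", fasta_path,
            "--type", feature, "--out", out_path]


for feature in ["AAC", "DPC", "CTDC"]:
    command = build_ifeature_command("covid.fasta", feature)
    print(" ".join(command))
    # To actually run it (requires the cloned iFeature folder):
    # import subprocess; subprocess.run(command, check=True)
```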
Now you can use these features to create machine learning models!
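As a possible first step toward modeling, the feature file can be read into a list of sequence names and a numeric matrix. The sketch below assumes the file is a tab-separated table with a header row and the sequence name in the first column. That layout is an assumption, so open your output file and adjust the delimiter if it differs. The helper `read_feature_table` and the demo content are made up.

```python
import csv
import io


def read_feature_table(handle, delimiter="\t"):
    """Read a feature table into (feature names, sequence names, rows)."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    names, rows = [], []
    for row in reader:
        if not row:
            continue
        names.append(row[0])                      # first column: sequence name
        rows.append([float(v) for v in row[1:]])  # remaining columns: features
    return header[1:], names, rows


# Made-up two-feature table standing in for a real output file.
demo = io.StringIO("#\tA\tC\nseq1\t0.25\t0.0\nseq2\t0.1\t0.05\n")
features, names, matrix = read_feature_table(demo)
print(features, names, matrix)
```

With a real file, you would pass `open("covid_AAC.csv", newline="")` instead of the in-memory demo; the resulting matrix can then be handed to the machine learning library of your choice.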