Machine learning basics in proteomics — analysis with iFeature python package

iFeature implementation for protein sequence analysis

6 min read · Apr 4, 2022

Basics of Omics

Omics is an umbrella term for fields such as genomics, transcriptomics, proteomics, and metabolomics. Here we briefly introduce genomics and proteomics.

Genes and DNA

First of all, “genes” are sections of a chromosome that are biologically meaningful. Chromosomes are made of deoxyribonucleic acid (DNA) molecules that are packed and stored in the cell nucleus (as in humans and other eukaryotes).

The basic unit of heredity that occupies a specific location on a chromosome. Each consists of nucleotides arranged in a linear manner. Most genes code for a specific protein or segment of protein leading to a particular characteristic or function. — NCI Dictionary of Genetics Terms
The gene is defined as a piece of chromosome which is sufficiently short for it to last, potentially, for long enough for it to function as a significant unit of natural selection — The Selfish Gene

https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/dna

A double-helix DNA molecule is built from four types of nucleobases: adenine (A), thymine (T), cytosine (C), and guanine (G). Each base is attached to a sugar-phosphate backbone, and bases on the two strands pair up (A with T, C with G) to form the double helix. Wrapped around proteins such as histones, the DNA molecule is condensed into a chromosome. Human cells have 23 pairs of chromosomes (22 pairs of autosomes and one pair of sex chromosomes), giving a total of 46 per cell.

From DNA, RNA, to proteins

Genes in DNA are “transcribed” into RNA. Some RNAs code for proteins (messenger RNA, or mRNA), while others do not (non-coding RNA, which can still play regulatory or structural roles). mRNAs are then “translated” into chains of amino acids (peptides and proteins) with particular functions.

The flow of genetic information: transcription and translation. Reference: Color Atlas of Genetics

In the quest to understand life, we are often interested in which genetic defects lead to diseases and how, and in which tools (e.g. gene editing, functional proteins, etc.) are available to us in the dance of life.

Why machine learning in bioinformatics?

Machine learning is particularly powerful in genomics because of the sheer volume of data: each human chromosome contains roughly 50 million to 250 million base pairs, and each human cell has 46 chromosomes. Machine learning aims to capture patterns in data. This suits genomic sequences well, because we traditionally understand that a certain gene (a specific sequence of nucleobases) leads to a certain physiological state. Together with growing computational power, this makes machine learning a key driver of progress in bioinformatics.

Data sources in bioinformatics

The main data sources can be grouped as follows. First, sequence data: sequence analysis helps to predict the function of unknown genes. Second, gene expression data: some genes are known to contribute to particular biological processes, and measuring the activity of such genes and their proteins generates gene expression data. Third, structural data: the folding of molecules and proteins can significantly affect their function, and structure prediction is used to explore this. Finally, systems and network biology looks at biological processes from a system-wide perspective and combines more data to understand molecular behavior.

Machine learning in omics

Prediction questions

Classification prediction is commonly used in omics. For example:

Binary: is an unknown sequence gene A or not?
Multiclass: which one of the classes (Class 1, Class 2, Class 3, and Class 4) does protein A belong to?

A general workflow

In classical machine learning, we give the model “handcrafted” features from which it learns patterns, so that it can relate the input to the output. (In deep learning, the features are “learned” by the model itself, but that is outside the scope of this article.)

A workflow of machine learning in bioinformatics

The critical step of this workflow is feature generation, because the features determine what the model can learn. Here we introduce some commonly used features in bioinformatics. We will use a Python package, iFeature, for demonstration.

Feature generation

DNA

k-mer
A k-mer is a run of k adjacent nucleobases, where k is an integer greater than 0. A k-mer analysis counts how often each k-mer occurs in a sequence. For example, a 2-mer analysis counts the occurrences of pairs such as AT in a DNA sequence.
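To make the k-mer idea concrete, here is a minimal sketch in plain Python. The function name count_kmers and the toy sequence are made up for illustration; this is not part of iFeature, which targets protein sequences.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a DNA sequence."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    # Slide a window of width k along the sequence, one base at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence (hypothetical): its 2-mers are AT, TG, GA, AT, TA, AT.
counts = count_kmers("ATGATAT", 2)
print(counts["AT"])  # 3
```

A sequence of length n has n - k + 1 overlapping k-mers, so the counts above sum to 6.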

Proteins

Amino acid composition (AAC)
How often each amino acid occurs in a protein. There are 20 standard amino acids in humans, so the result is a 1x20 vector giving the quantity (count or frequency) of each amino acid. For example, the number of aspartic acid (D) residues in a protein.
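As a rough illustration of AAC, the sketch below computes the 20-value vector by hand. The function name, the normalize switch, and the toy peptide are assumptions made for this example; iFeature's own AAC output may differ in details such as normalization and column order.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence, normalize=True):
    """Return a 20-element list: one count (or frequency) per amino acid."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    if normalize:
        # Divide by the sequence length so proteins of different sizes are comparable.
        return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]
    return [counts[aa] for aa in AMINO_ACIDS]


# Toy peptide (hypothetical): 2 of its 5 residues are aspartic acid (D).
aac = amino_acid_composition("MDDKA")
print(len(aac))                     # 20
print(aac[AMINO_ACIDS.index("D")])  # 0.4
```

Frequencies are usually preferred over raw counts as model inputs, since raw counts mostly reflect how long the protein is.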

Pseudo-amino acid composition (PAAC)
Like AAC, PAAC describes the composition of the 20 standard amino acids, but it adds extra terms that retain some information about the order of the amino acids along the sequence.

Dipeptide pair composition (DPC)
How often each pair of adjacent amino acids occurs in a protein. For example, the number of AR pairs (alanine immediately followed by arginine) in a protein. With 20 amino acids there are 20 x 20 = 400 possible pairs.
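The same counting idea extends to adjacent pairs. The sketch below is a hand-rolled version with hypothetical names and a toy peptide; it assumes frequencies are normalized by the number of adjacent pairs, which may not match iFeature's DPC output exactly.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs: AA, AC, AD, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list with the frequency of each adjacent pair."""
    sequence = sequence.upper()
    total_pairs = len(sequence) - 1
    if total_pairs < 1:
        raise ValueError("sequence needs at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total_pairs):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard letters such as X
            counts[pair] += 1
    return [counts[d] / total_pairs for d in DIPEPTIDES]


# Toy peptide (hypothetical): its pairs are AR, RA, AR, RK.
dpc = dipeptide_composition("ARARK")
print(len(dpc))                    # 400
print(dpc[DIPEPTIDES.index("AR")])  # 0.5
```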

Motif feature: position-specific scoring matrix (PSSM)
A motif is a short sequence pattern of amino acids that is associated with a particular function. A PSSM scores how likely each amino acid is to appear at each position of such a pattern.

iFeature implementation with VS Code

We will use iFeature to generate features for protein sequences.

iFeature is a comprehensive Python-based toolkit for generating various numerical feature representation schemes from protein or peptide sequences.

Install:

1. Open VS Code and navigate to the folder you will be working in.

cd [folder location]

2. Use the following command in the terminal to download the package.

git clone https://github.com/Superzchen/iFeature

The package is now downloaded into a folder named iFeature inside your working folder.

3. Now you can read the package information with

python iFeature/iFeature.py --help

4. Download data
We use UniProt to download protein sequence data for SARS-CoV-2.

Step 1: Type your keyword in the search bar and press “Search”.

Step 2: This returns a list of reviewed and unreviewed results. “Reviewed” means that the entry has been manually annotated and reviewed by UniProt curators, largely based on the literature. For this little exercise, we choose “Reviewed”.

Step 3: Click “Download>Download all>Format (fasta)>Compressed”. The FASTA format is commonly used in sequence analysis; each record consists of a “header” line starting with “>”, followed by one or more “sequence” lines. This format is required by iFeature. Because the download is compressed, decompress it before use.

After downloading the data, you can open it with a plain-text editor such as Notepad or TextEdit.

An example of fasta format from the COVID search result.

Step 4: Put the data in the working folder. The example command below assumes the file is named covid.fasta.
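Before running iFeature, it can help to check that the file really is in FASTA format. The following is a minimal reader sketch in plain Python; the function name read_fasta and the two records are made up for illustration and are not real UniProt entries.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, used only to show the header/sequence structure.
example = """>sp|EXAMPLE1|HYPOTHETICAL_1 made-up record
MKTAYIAK
QRQISFVK
>sp|EXAMPLE2|HYPOTHETICAL_2 made-up record
MFVFLVLL
""".splitlines()

records = list(read_fasta(example))
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

For a real file, pass an open file handle instead of the example lines, for instance `with open("covid.fasta") as handle: records = list(read_fasta(handle))`.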

5. Start generating features!
For example, to create AAC features:

# python iFeature/iFeature.py --file [dataset] --type [feature] --out [output file]
python iFeature/iFeature.py --file covid.fasta --type AAC --out covid_AAC.csv

Features supported by iFeature include AAC, EAAC, CKSAAP, DPC, DDE, TPC, BINARY, GAAC, EGAAC, CKSAAGP, GDPC, GTPC, AAINDEX, ZSCALE, BLOSUM62, NMBroto, Moran, Geary, CTDC, CTDT, CTDD, CTriad, KSCTriad, SOCNumber, QSOrder, PAAC, APAAC, KNNprotein, KNNpeptide, PSSM, SSEC, SSEB, Disorder, DisorderC, DisorderB, ASA, TA.

Change the argument after --type to create different features.
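If you want several feature types, the command can be generated in a loop. The sketch below only uses the --file, --type, and --out flags shown above; the helper names, the output naming scheme, and the script path are assumptions for this example. Some feature types (PSSM, for example) may need extra input files, so check the package help before adding them to the list.

```python
import subprocess
import sys
from pathlib import Path


def build_command(fasta_path, feature, script="iFeature/iFeature.py"):
    """Build the iFeature command line for one feature type."""
    # Assumed naming scheme: covid.fasta + AAC -> covid_AAC.csv
    fasta = Path(fasta_path)
    out_path = fasta.with_name(f"{fasta.stem}_{feature}.csv")
    return [sys.executable, script,
            "--file", str(fasta), "--type", feature, "--out", str(out_path)]


def run_features(fasta_path, features):
    """Run iFeature once per feature type; stops at the first failure."""
    for feature in features:
        subprocess.run(build_command(fasta_path, feature), check=True)


# Print the commands without executing them.
for feature in ["AAC", "DPC", "PAAC"]:
    print(" ".join(build_command("covid.fasta", feature)))
```

Calling `run_features("covid.fasta", ["AAC", "DPC"])` from the working folder would then execute the commands one after another.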

Now you can use these features to create machine learning models!
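As a starting point, the output file can be loaded into plain Python lists. This sketch assumes the file has one header row and one row per protein, with the protein name in the first column; as far as I can tell iFeature writes tab-separated text even when the file is named .csv, so the loader accepts either tabs or commas. The function name and the stand-in file are made up for the demonstration.

```python
import csv
import os
import tempfile


def load_feature_table(path):
    """Return (column_names, protein_names, matrix) from a feature file."""
    with open(path, newline="") as handle:
        sample = handle.read(4096)
        handle.seek(0)
        # Assumption: tab-separated if any tab is present, otherwise comma-separated.
        delimiter = "\t" if "\t" in sample else ","
        rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    protein_names = [row[0] for row in body]
    matrix = [[float(value) for value in row[1:]] for row in body]
    return header[1:], protein_names, matrix


# Stand-in file with made-up values, so the sketch runs without iFeature.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_AAC.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("#\tA\tC\nprotein_1\t0.25\t0.75\nprotein_2\t0.5\t0.5\n")
    columns, names, matrix = load_feature_table(demo_path)

print(columns)  # ['A', 'C']
print(names)    # ['protein_1', 'protein_2']
print(matrix)   # [[0.25, 0.75], [0.5, 0.5]]
```

Each row of the matrix is the feature vector for one protein, which is the input shape most machine learning libraries expect.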