Bioinformatics is an interdisciplinary field that combines biology, informatics, mathematics, computer science, and statistics. In bioinformatics, we analyze biological data and use its interpretation for various purposes. Some biologists define bioinformatics as the computational branch of biology, in which computation is used to solve specific biological problems.
There are two traditional types of biological experiments. The first is in-vivo (within a living organism) and the second is in-vitro (within glass, such as a test tube). Bioinformatics makes a third type of analysis possible, one that is carried out on silicon chips and computers. This is called in-silico biology.
In this article, we will explore the world of bioinformatics. We will talk about the sequence analysis of proteins and DNA, and about how bioinformatics changes the way problems are solved using biological data. Lastly, we will discuss some important software and tools used in bioinformatics to help you discover new horizons.
Scientists first started using informatics in biology after Frederick Sanger determined the sequence of insulin. Different scientists then developed new techniques and tools for analyzing and sequencing DNA. Bioinformatics has since evolved into an integral part of research in different areas of biology. For example:
- It helps in sequencing genes and plays an important role in analyzing gene expression.
- It allows text mining of biological literature.
- It analyses large amounts of data through image and signal processing.
- It plays an important role in the analysis of protein expression.
- It aids in comparing and interpreting sequences.
- Structural bioinformatics helps in the modelling of different molecules such as DNA, RNA, and proteins.
- It helps in organizing and analyzing the biological pathways.
Goals of bioinformatics
Understanding the biological processes with increased speed and efficiency is the primary goal of bioinformatics. It includes the development and implementation of various computationally intensive techniques to manage biological information and also the development of new algorithms to find relationships among large sets of data.
Some of the important techniques include pattern recognition, data mining, signal processing, visualization, and machine learning. These techniques help in sequence analysis, sequence alignment, drug discovery, drug design, gene finding, genome assembly, protein structure and function prediction, mapping of DNA, and the creation and visualization of 3-D protein models.
Bioinformatics should not be confused with biological computation. The former uses computation to understand biological data efficiently, whereas the latter uses bioengineering to build biological computers.
A sequence is the order in which monomers, such as nucleotides or amino acids, are linked together to form a large polymer. Sequence analysis is the process of analyzing the order of these units in a molecule such as DNA, RNA, or protein to understand and predict its structure, function, and behaviour. We use different methods for this purpose, including sequence alignment, sequence assembly, and searches against biological databases. Sequence analysis increases the scientist’s understanding of the features, function, and evolution of the organism.
There is a very wide scope of sequence analysis in molecular biology. For example, it includes a comparison of new sequences to those with known functions, finding similarities between different sequences, recognition of variations and differences among sequences, and identification of the characteristics of a sequence. Moreover, sequence analysis is also beneficial for the identification of molecular structure.
Sequence alignment is a way of arranging DNA, RNA, or protein sequences so that scientists can identify regions of similarity and difference between them. Usually, the aligned sequences are represented as the rows of a matrix, and gaps are inserted between residues so that identical or similar characters line up in successive columns.
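The row-and-column picture can be made concrete with a minimal Python sketch. It assumes the two sequences are already aligned, with `-` standing for a gap. The example strings and the helper names `column_report` and `percent_identity` are hypothetical and serve only as illustration.

```python
def column_report(row_a: str, row_b: str) -> str:
    """Return a marker line for two already-aligned rows.

    '|' marks a column with identical residues, '.' a mismatch,
    and ' ' a column in which either row has a gap ('-').
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    marks = []
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            marks.append(" ")
        elif a == b:
            marks.append("|")
        else:
            marks.append(".")
    return "".join(marks)


def percent_identity(row_a: str, row_b: str) -> float:
    """Identical columns as a percentage of all alignment columns."""
    marks = column_report(row_a, row_b)
    if not marks:
        raise ValueError("alignment is empty")
    return 100.0 * marks.count("|") / len(marks)


# Hypothetical aligned pair (made up for this example).
seq1 = "GATTACA-GT"
seq2 = "GAT-ACATGA"
print(seq1)
print(column_report(seq1, seq2))
print(seq2)
print(f"identity: {percent_identity(seq1, seq2):.1f}%")
```

Dividing by the total number of columns is only one of several common definitions of percent identity; others divide by the length of the shorter sequence or ignore gap columns.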
Sequence alignment is of two types: pair-wise alignment and multiple sequence alignment. Pair-wise alignment compares only two sequences at a time. Common methods of pair-wise alignment are dot-matrix methods, dynamic programming, and word methods. The two most widely used dynamic programming algorithms are the Needleman-Wunsch algorithm, which produces a global alignment, and the Smith-Waterman algorithm, which produces a local alignment.
On the other hand, multiple sequence alignment can incorporate many sequences at a time. The sequences in a query set are generally assumed to share an evolutionary relationship, meaning that they descend from a common ancestor.
The alignment methods used for multiple sequence alignment include dynamic programming, iterative methods, consensus methods, progressive alignment construction, hidden Markov models, and many more. Creating a multiple sequence alignment is computationally demanding, but it provides a lot of information and is highly valuable to bioinformatics researchers.
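As an illustration of the dynamic programming approach, the following is a minimal sketch of Needleman-Wunsch global alignment in Python. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity. Practical tools usually rely on substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Smith-Waterman follows the same recurrence but clamps every cell at zero and starts the traceback from the highest-scoring cell, which is what makes it a local rather than a global method.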
Sequence assembly plays an important role in modern DNA sequencing. It is the process by which scientists reconstruct a DNA sequence by aligning and merging DNA fragments. It was developed because sequencing technologies cannot read long stretches of DNA in a single pass. That is why the DNA is first cut into small pieces, and scientists then reconstruct the original DNA by aligning and merging the information from the different fragments.
DNA sequencing is the process of determining the order of the 4 nucleotides in a DNA molecule. This is done through rapid DNA sequencing methods, which are described later in this article. Determining DNA sequences has led to advances in biological research areas such as forensics, virology, medical diagnosis, and biotechnology. Moreover, it helps in diagnosing different types of diseases such as various cancers and antibody disorders.
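The align-and-merge idea can be sketched with a naive greedy overlap assembler. This is a toy illustration that assumes error-free fragments from a single strand and a hypothetical minimum overlap of 3 bases. Real assemblers use overlap graphs or de Bruijn graphs and must cope with sequencing errors and repeats.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the list of contigs left when no overlaps remain.
    """
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical fragments cut from one short made-up sequence.
print(greedy_assemble(["GTTAGC", "ATGGCGTAC", "GTACGTTA"]))
```

Greedy merging is simple but can make wrong joins when the genome contains repeated regions, which is one reason production assemblers take a graph-based approach.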
In the early 1950s, Frederick Sanger revolutionized molecular biology when he determined the amino acid sequence of insulin. Determining a DNA sequence is generally faster and simpler than determining a protein sequence because DNA has only 4 nucleotides, compared with 20 amino acids in proteins. These 4 nucleotides are named after their bases: Adenine (A), Thymine (T), Guanine (G), and Cytosine (C). Each of them has a 5’ phosphoryl end and a 3’ hydroxyl end, and the formation of bonds between these two ends of successive nucleotides makes a DNA molecule.
When the nucleotides have been linked, the resulting molecule has a free phosphoryl group at the 5’ end and a free hydroxyl group at the 3’ end. These two ends are called the 5’ terminus and the 3’ terminus, respectively. By convention, a DNA sequence is written as the succession of these nucleotides listed from the 5’ end to the 3’ end.
There are various methods of sequencing DNA. Basic methods include the Maxam-Gilbert and chain-termination methods. Long-read sequencing methods include SMRT sequencing and nanopore DNA sequencing. There are also many short-read sequencing methods, such as polony sequencing, massively parallel signature sequencing, SOLiD sequencing, 454 pyrosequencing, DNA nanoball sequencing, and combinatorial probe anchor synthesis (cPAS).
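Because sequences are written from 5’ to 3’, the opposite strand of a double helix has to be complemented and reversed before it reads in the same direction. A minimal Python sketch, assuming the input contains only the letters A, C, G, and T:

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


strand = "ATGC"  # made-up example sequence
print(strand, "->", reverse_complement(strand))
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.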
The process of identifying the genes and other functional elements along the sequence of a genome is known as genome annotation. This process gives meaning to the raw sequence. In 1995, a team at The Institute for Genomic Research completely sequenced and analyzed the genome of a bacterium (Haemophilus influenzae) and published the first description of a comprehensive genome annotation system. Currently, there are many programs available for the analysis of genomic DNA, such as the GeneMark program.
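A very simple form of gene finding is to scan for open reading frames (ORFs): stretches that begin with the start codon ATG and run in frame to a stop codon. The sketch below is a naive illustration and not how GeneMark or any other real gene finder works, since those rely on statistical models. It assumes the forward strand only, and the `min_codons` threshold is a hypothetical parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end, orf) tuples for ORFs on the forward strand.

    `start` is the 0-based index of the ATG, `end` is exclusive and
    includes the stop codon. `min_codons` counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, seq[start:pos + 3]))
                start = None  # close it and look for the next ATG
    return orfs


# Made-up sequence containing one short ORF.
for start, end, orf in find_orfs("CCATGAAATTTTAGGG"):
    print(start, end, orf)
```

A fuller version would also scan the reverse complement, because genes can lie on either strand.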
Analysis of mutations
Sequencing and sequence analysis play an important role in identifying gene mutations in cancer. The genomes of cancer-affected cells are rearranged in complex and somewhat unpredictable ways. Technologies such as single-nucleotide polymorphism arrays have been developed to help detect point mutations.
Identification of mutations through bioinformatics is based upon two principles. First, cancer is a disease of accumulated somatic mutations in genes. Second, cancer contains driver mutations, which need to be distinguished from passenger mutations. Next-generation sequencing technology allows bioinformatics to sequence many cancer genomes efficiently and affordably.
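At its simplest, detecting point mutations means comparing a sample sequence with a reference, position by position. The sketch below assumes two pre-aligned sequences of equal length and reports substitutions only. It does not handle insertions or deletions, and it makes no attempt to separate driver mutations from passenger mutations. The sequences are made up for the example.

```python
def point_mutations(reference: str, sample: str):
    """List (position, ref_base, sample_base) for every substitution.

    Positions are 1-based, as is common when reporting variants.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


reference = "ACGTACGT"  # hypothetical reference
tumour = "ACGAACGC"     # hypothetical sample
for pos, ref, alt in point_mutations(reference, tumour):
    print(f"{pos}: {ref}>{alt}")
```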
Analyzing protein expression
The building blocks of proteins are called amino acids, which are organic molecules made of carbon, hydrogen, oxygen, and nitrogen atoms, and in some cases sulfur. These amino acids are linked like a chain in a precise order to make a specific protein molecule. Some examples of amino acids are alanine, glycine, glutamine, and tyrosine.
To identify the proteins present in a biological sample, scientists use protein microarrays and high-throughput mass spectrometry. Bioinformatics plays an important role in interpreting the data from both. It is involved in organizing large amounts of data, matching it against predicted protein sequences, and carrying out complicated statistical analysis of samples. Another task is the localization of cellular proteins, which scientists carry out using spatial data from immunohistochemistry and tissue microarray analysis.
The bioinformatics of protein sequence and protein expression is a very large topic that cannot be covered fully here. It includes retrieving protein sequences from databases, computing amino acid composition and other parameters, predicting elements of secondary structure, visualizing protein structure in 3-D, and predicting the 3-D structure of proteins. It is also involved in finding proteins of similar sequence and classifying proteins into different families.
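Computing amino acid composition is one of the simplest of these tasks. A minimal Python sketch, assuming the protein is given in one-letter codes (the peptide below is made up for illustration):

```python
from collections import Counter


def amino_acid_composition(protein: str) -> dict:
    """Return the fraction of each residue in a one-letter protein sequence."""
    protein = protein.upper()
    if not protein:
        raise ValueError("protein sequence is empty")
    counts = Counter(protein)
    return {aa: counts[aa] / len(protein) for aa in sorted(counts)}


peptide = "MKTAYIAKQR"  # hypothetical peptide
for residue, fraction in amino_acid_composition(peptide).items():
    print(f"{residue}: {fraction:.2f}")
```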
Gene expression and regulation
Gene regulation is a complex process in which a signal, such as a hormone, sets off a sequence of events that eventually increases or decreases the activity of one or more proteins. Bioinformatics techniques are applied to explore each step of this process. For example, elements located near a gene in the genome regulate its expression. Biologists use promoter analysis to identify sequence motifs in the DNA near the coding region, because these motifs influence the transcription of that region into mRNA. More distant elements, such as enhancers, can also influence this process through three-dimensional looping interactions, which biologists study with various bioinformatics methods.
It is also possible to infer gene regulation from expression data. For example, comparing microarray data from an organism in different states helps in forming hypotheses about the genes involved. Moreover, in a unicellular organism, one can compare the stages of the cell cycle under various stress conditions and then apply clustering algorithms to that expression data to identify co-expressed genes. Some important clustering algorithms are self-organizing maps, consensus clustering, k-means clustering, and hierarchical clustering.
Measuring mRNA levels is extremely helpful in determining the expression of various genes. Different techniques are used for this purpose, such as EST sequencing, microarrays, SAGE tag sequencing, and Whole Transcriptome Shotgun Sequencing (WTSS). The main problem with all of these techniques is that they are noise-prone. That is why scientists are developing new statistical tools to separate the signal from the noise in high-throughput studies.
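A basic form of motif search can be sketched with a regular expression. The example below looks for the commonly cited TATA-box consensus TATA(A/T)A(A/T). The choice of this motif and the example sequence are assumptions made for illustration. Real promoter analysis usually relies on position weight matrices rather than exact patterns.

```python
import re


def find_motif(dna: str, pattern: str = "TATA[AT]A[AT]"):
    """Return the 0-based start of every match, including overlapping ones."""
    # The lookahead matches without consuming characters, so matches may overlap.
    regex = re.compile(f"(?=({pattern}))")
    return [m.start() for m in regex.finditer(dna.upper())]


upstream = "GGCTATAAAAGGC"  # hypothetical upstream region
print(find_motif(upstream))
```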
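To show how clustering groups genes with similar expression profiles, here is a minimal k-means sketch in plain Python. The gene names and expression values are invented, and seeding the centroids with the first k points is a simplification made to keep the result deterministic. Library implementations normally use randomized or k-means++ initialization.

```python
def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]  # simplistic deterministic seeding
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical expression levels of four genes across three conditions.
expression = {
    "geneA": [1.0, 1.2, 0.9],
    "geneB": [5.0, 5.1, 4.8],
    "geneC": [0.8, 1.1, 1.0],
    "geneD": [5.2, 4.9, 5.0],
}
labels = kmeans(list(expression.values()), k=2)
for gene, label in zip(expression, labels):
    print(gene, "-> cluster", label)
```

Genes that end up in the same cluster have similar profiles across the conditions, which is the sense in which they are candidates for co-expression.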
Analysis of the cellular organization
There are several approaches and techniques to organize, analyze, and interpret data on cellular organization. It is now possible to analyze the location of different components of a cell, such as organelles, proteins, and genes. These components affect the events and processes of the cell. That is why studying and understanding the location of these components helps a biologist to predict the behaviour of biological systems.
For determining the location of organelles and other components of the cells, scientists use microscopic pictures. These microscopic pictures are also helpful to differentiate between normal cells and cancerous cells. Additionally, there are some protein subcellular localization prediction resources available that help in the localization of proteins. It is important to note that the localization of proteins helps in the evaluation of their role. An example is that the proteins found in the mitochondria are involved in respiration whereas the proteins found in the nucleus are involved in gene regulation.
Identifying the three-dimensional structure and nuclear organization of chromatin is not an easy task. Fortunately, scientists have made it possible through chromosome conformation capture experiments, and the analysis of the data from these experiments helps a lot in this regard. Some common experiments include Hi-C and ChIA-PET.
Biomedical text mining
With the advancement of biological research and a rapid increase in the number of published articles, it is becoming nearly impossible to read every paper. That is why modern techniques use computational and statistical analysis to mine biological text. Biological text mining refers to the study of how text mining methods can be applied to the biomedical and molecular biology literature. This involves a combination of bioinformatics and medical informatics. Additionally, natural language processing and computational linguistics are also related to this type of research.
Some common processes of biological text mining are named-entity recognition, relationship discovery, document classification and clustering, claim detection, hedge cue detection, and information extraction. The difference between document classification and clustering is that the former is supervised and the latter is unsupervised. Despite this difference, both approaches focus on grouping biological documents into subsets based on their specific characteristics.
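Named-entity recognition, the first process listed above, can be illustrated with a dictionary lookup. This is a deliberately simple sketch: the three-entry gene lexicon and the function name `tag_genes` are hypothetical, and real systems use much larger vocabularies together with statistical or neural models.

```python
import re

# Tiny illustrative lexicon; a real one would hold many thousands of names.
GENE_LEXICON = {"BRCA1", "TP53", "EGFR"}


def tag_genes(text: str, lexicon=GENE_LEXICON):
    """Return (start, end, token) for every token found in the lexicon.

    Matching is case-insensitive; offsets index into the original text.
    """
    hits = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        if match.group().upper() in lexicon:
            hits.append((match.start(), match.end(), match.group()))
    return hits


sentence = "Mutations in TP53 and BRCA1 are common."
for start, end, token in tag_genes(sentence):
    print(start, end, token)
```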
Another process, called information extraction, refers to the automatic identification of structured information from unstructured text. It is a complex process that may involve several other processes of biological text mining. Biologists use information extraction to create links between the concepts and ideas present in biomedical text.
Structural bioinformatics is another domain of bioinformatics in which scientists analyze and predict the 3-D structure of various biological molecules, especially proteins. It involves comparisons of folds and motifs, evolution, molecular folding, and structure/function relationships of biological macromolecules. Structural bioinformatics aims to create efficient methods of analyzing and understanding biological molecules.
Proteins are made up of amino acids, and the amino acid sequence is called the primary structure of a protein. The primary structure is determined from the sequence of the encoding gene. A detailed understanding of this structure is important for determining and predicting the function of the protein. Protein structure is usually described at four levels: primary, secondary, tertiary, and quaternary.
Another important concept in bioinformatics is homology. In the context of genetics and bioinformatics, the term homology refers to shared ancestry in the evolutionary history of life. In genomic bioinformatics, it plays an important role in predicting the function of a gene. In structural bioinformatics, on the other hand, homology is used to identify the important parts of a protein structure. Homology modelling has long been one of the most reliable ways to predict the structure of a protein. Using it, biologists can predict the structure of a protein from the known structure of a homologous protein.
Another domain of structural bioinformatics uses protein structures for virtual screening models, such as Quantitative Structure-Activity Relationship models and proteochemometric models (PCM). Additionally, in-silico mutagenesis studies also require data on a protein’s crystal structure.
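The step from the encoding gene to the primary structure can be sketched as a codon-by-codon translation. The example assumes the standard genetic code, a coding sequence written 5’ to 3’ that starts in frame, and only the letters A, C, G, and T. The input sequence is made up.

```python
from itertools import product

# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate coding DNA into one-letter amino acids, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]  # KeyError on non-ACGT input
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))
```

Any trailing bases that do not form a complete codon are ignored.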
Software and tools
There is a wide range of software and tools available for bioinformatics, such as simple command-line tools, open-source software, web services in bioinformatics, bioinformatics workflow management systems, and BioCompute objects. In this part of the article, we will discuss some of them.
- Open-source bioinformatics software
Several open-source software packages are freely available to help researchers in the field of bioinformatics. New and emerging algorithms, the potential for in-silico experiments, and other qualities of these tools make them attractive to researchers. The main advantage of open-source tools is that a community supports them and also creates new plug-ins for various purposes.
Some common bioinformatics software packages include Bioconductor, GenoCAD, BioPerl, BioPython, UGENE, Apache Taverna, BioJava, EMBOSS, and BioRuby. A non-profit organization called the Open Bioinformatics Foundation supports the improvement of these software and tools.
- Web services in bioinformatics
Scientists have developed SOAP-based and REST-based interfaces that allow bioinformatics data to be used and transferred all around the world. These interfaces allow researchers to work efficiently without the burden of software and database maintenance. There are three basic types of bioinformatics services:
Sequence search services (SSS)
Multiple sequence alignment (MSA)
Biological sequence analysis (BSA)
- Bioinformatics workflow management systems
A bioinformatics workflow management system executes a range of important tasks. One of the important qualities of this type of system is that it gives scientists an easy-to-use environment in which they can create custom workflows. Moreover, it makes it very easy to share workflows between scientists and researchers. It also provides interactive tools that enable scientists to track the results and the steps used to create a workflow.
Databases play an integral role in bioinformatics applications. They contain different types of data, such as empirical data and predicted data, that may be general or specific to a particular organism. There are large differences in the structure and function of these databases. Some common databases are GenBank, PDB (Protein Data Bank), Pfam, InterPro, and the Sequence Read Archive. All of these are used in various domains of bioinformatics analysis.
Bioinformatics is an interdisciplinary field that combines biology, informatics, mathematics, computer science, and statistics. Experts in this field analyze and interpret biomedical data with the help of different techniques and tools. The primary goal of bioinformatics is to make biomedical data easier to interpret. It focuses on the efficient and quick understanding of biological processes. Various techniques are used for this purpose, such as sequence analysis, sequence assembly, data mining, data visualization, and signal processing.
Sequence analysis is of utmost importance in bioinformatics. It refers to the analysis of the sequences of monomers in biological macromolecules such as DNA, RNA, and proteins. It increases the scientist’s understanding of the characteristics, features, evolution, and structure of organisms. There are several methods available for DNA sequencing, which refers to the identification of the sequence of the 4 nucleotides in a DNA molecule. The analysis of proteins, of cellular organization, and of RNA is also part of bioinformatics.
Bioinformatics also helps in the understanding of gene expression and gene regulation. Various methods of bioinformatics help in exploring the details of the steps involved in gene regulation. Another domain of bioinformatics, known as structural bioinformatics, helps in predicting the structure and function of different biological macromolecules.
Another important area of bioinformatics is biomedical text mining. It refers to the study of how text mining methods can be applied to the biomedical and molecular biology literature. Lastly, there are several tools and software packages available for bioinformatics research and application, such as GenoCAD, BioPerl, BioPython, and many others.