Using Data Analytics to Predict Protein Structure

December 17, 2017

Protein structure prediction is one of the most important goals of bioinformatics and protein chemistry: inferring the three-dimensional structure of a protein from its amino acid sequence. Proteins are central to nearly every function in the body, and when they malfunction the result can be severe disease, such as Alzheimer’s. Designing treatments that target a faulty protein generally requires knowing its structure. Working out structures from genome data alone has been complicated and time-consuming; computer modeling and the availability of large volumes of sequence data have made the process considerably easier.

Analysis of protein sequence data has led researchers to conclude that metagenomics can help predict protein structures. Metagenomics is the study of genetic material recovered directly from environmental samples. Traditionally, the technique has been used to sequence the DNA of entire microbial communities.

Pfam, a large collection of protein families represented by multiple sequence alignments and hidden Markov models, contains about 15,000 protein families, more than 5,000 of which have no structural information.

David Baker of the University of Washington, in collaboration with researchers at the U.S. Department of Energy Joint Genome Institute (DOE JGI), published an article in the journal Science on January 20, 2017, titled “Protein Structure Determination Using Metagenome Sequence Data.” The article describes how the team produced structural models for more than 600 protein families for which no structural information had previously been available.

“We show that Rosetta structure prediction guided by residue-residue contacts inferred from evolutionary information can accurately model proteins that belong to large families and that metagenome sequence data more than triple[s] the number of protein families with sufficient sequences for accurate modeling. We then integrate metagenome data, contact-based structure matching, and Rosetta structure calculations to generate models for 614 protein families with currently unknown structures,” wrote the authors of the Science article.
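The idea of inferring residue-residue contacts from evolutionary information can be illustrated with a toy example. The sketch below scores pairs of alignment columns by mutual information, a deliberately simple stand-in chosen for illustration; it is not the statistical model used in the Science article, and the alignment `toy_msa` and the function names are hypothetical. Columns that change together across the family score highest, hinting that the two positions may be in contact in the folded protein.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column(msa, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(column(msa, i))
    counts_j = Counter(column(msa, j))
    counts_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def rank_pairs(msa):
    """Score every pair of columns and sort from most to least co-varying."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical four-residue alignment: positions 1 and 3 always change together
# (K pairs with E, R pairs with D), as do positions 0 and 2.
toy_msa = ["AKLE", "AKLE", "ARLD", "ARLD", "GKVE", "GRVD"]

for pair, score in rank_pairs(toy_msa)[:2]:
    print(pair, round(score, 3))
```

With only six sequences the scores are noisy, which is one reason methods of this kind need large families before their predictions can be trusted.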

The team modeled protein structures using the Rosetta structure prediction software together with the metagenomic sequences publicly available in the Integrated Microbial Genomes (IMG) system run by the DOE JGI.

The research revealed that the majority of protein families in Pfam contain only a small number of sequences. This has had two consequences. First, these families attracted little research attention because they were small. Second, coevolution methods, which need many sequences to detect correlated changes between residues, could not be applied to them. The researchers found that some of these neglected families, previously known from only a handful of sequences, become as large as some of the most studied ones once metagenomic data are taken into account.
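What counts as "enough" sequences depends on how different they are from one another, not just on how many there are. A minimal sketch of that idea follows, assuming a common convention in which sequences at least 80% identical are treated as redundant and share a single unit of weight. The threshold, the function names, and the example sequences are all assumptions made for illustration and are not taken from the article.

```python
def identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences match."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def effective_sequences(msa, threshold=0.8):
    """Count sequences after down-weighting near-duplicates.

    Each sequence contributes 1 / (number of sequences, itself included,
    that are at least `threshold` identical to it).
    """
    total = 0.0
    for seq in msa:
        neighbours = sum(identity(seq, other) >= threshold for other in msa)
        total += 1.0 / neighbours
    return total


# Hypothetical family known only from isolate genomes: three copies of one
# sequence plus one distinct sequence.
isolate_only = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY"]

# Hypothetical metagenome hits that are related but clearly diverged.
metagenome_hits = ["ACNEFGRIKV", "MDPQKSTVAY"]

print(effective_sequences(isolate_only))
print(effective_sequences(isolate_only + metagenome_hits))
```

In this toy case, four raw sequences collapse to two effective ones, and the two diverged metagenome hits double that count, which mirrors how metagenomic data can make a small family usable for coevolution analysis.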

For each family, the researchers built a 3D model of a representative sequence. Of the 614 families modeled, 206 are membrane proteins and 137 have folds not represented in the Protein Data Bank.

Nikos Kyrpides, head of the DOE JGI Prokaryote Super Program, said such efforts were previously restricted to protein families built only from sequences found in isolate genomes, which comprise about 200 million sequences. “As expected, when we added [to] those our metagenomics data, harnessing the five billion assembled metagenome sequences available on our IMG/M database, we were able to dramatically increase the coverage of many of the known protein families. Efforts like this one [are] heavily dependent on the availability of assembled metagenomics sequences, which is an advantage the DOE JGI brings to the table with our high-quality assemblies.”

Combining new computational methods with large datasets adds a new dimension to earlier studies, and the resulting structures could eventually inform new treatments for disease.
