There’s a lot of research going on around the world, and that means lots of data.
On a personal level, we’ve seen computer hard drives grow steadily in capacity to keep pace with ever more information, bigger images, and so on. Many people have an external drive with 1TB (terabyte) or 2TB of storage.
To show the scale of the issue, the European Bioinformatics Institute (EMBL-EBI) has gone from managing 40 petabytes of data to working with 250 petabytes in just six years. A petabyte is 1,024 terabytes, so 250 petabytes is the equivalent of 256,000 of those 1TB drives.
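The drive comparison above can be checked with a few lines of Python. This is only an illustrative sketch: the function name is made up for this example, and it uses the 1,024 terabytes-per-petabyte convention quoted in the text.

```python
# Binary convention quoted in the article: 1 petabyte = 1,024 terabytes.
TB_PER_PB = 1024


def petabytes_to_1tb_drives(petabytes: int) -> int:
    """Return how many 1TB drives are needed to hold the given number of petabytes."""
    return petabytes * TB_PER_PB


drives_before = petabytes_to_1tb_drives(40)
drives_after = petabytes_to_1tb_drives(250)
growth_factor = 250 / 40  # how many times larger the archive became

print(f"40 PB  = {drives_before:,} drives of 1TB")
print(f"250 PB = {drives_after:,} drives of 1TB")
print(f"Growth over six years: {growth_factor}x")
```

By this count, the archive grew a little more than sixfold over the six years.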
The rapid development of disciplines across biological and biomedical research (such as genomics, proteomics, and transcriptomics) in recent decades has led to exponential growth in the amount of biological data available.
About the Bioteque developed by IRB Barcelona scientists
Scientists led by Patrick Aloy, ICREA researcher and head of the Structural Bioinformatics and Network Biology laboratory at IRB Barcelona, have developed a computational tool to harmonize, integrate and simplify these data. The result is a knowledge graph that provides information on how different biological entities are related to each other, including more than 30 million functional interactions.
The Bioteque works by integrating different levels of biological complexity. For two related genes, for example, it can report whether they physically interact, whether they are active in the same type of cells, and whether they are related to the same disease. It can also predict the sensitivity or resistance of a type of cell to a specific drug.
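To make the idea of a knowledge graph more concrete, here is a minimal sketch of the kind of question described above, asked of a toy graph. It is not the Bioteque's code or data model: the entity names, relation names, and functions are all invented for illustration.

```python
from collections import defaultdict

# Toy (source, relation, target) triples. Every name here is hypothetical.
TRIPLES = [
    ("GENE_A", "interacts_with", "GENE_B"),
    ("GENE_A", "expressed_in", "CELL_X"),
    ("GENE_B", "expressed_in", "CELL_X"),
    ("GENE_B", "expressed_in", "CELL_Y"),
    ("GENE_A", "associated_with", "DISEASE_1"),
    ("GENE_B", "associated_with", "DISEASE_1"),
    ("GENE_C", "associated_with", "DISEASE_2"),
]


def build_index(triples):
    """Group targets by (source, relation) so lookups are direct."""
    index = defaultdict(set)
    for source, relation, target in triples:
        index[(source, relation)].add(target)
    return index


def compare_genes(index, gene_1, gene_2):
    """Report how two genes are connected across different kinds of relations."""

    def targets(gene, relation):
        return index.get((gene, relation), set())

    return {
        # Physical interaction is treated as symmetric in this toy example.
        "physically_interact": (
            gene_2 in targets(gene_1, "interacts_with")
            or gene_1 in targets(gene_2, "interacts_with")
        ),
        "shared_cells": targets(gene_1, "expressed_in") & targets(gene_2, "expressed_in"),
        "shared_diseases": targets(gene_1, "associated_with") & targets(gene_2, "associated_with"),
    }


graph_index = build_index(TRIPLES)
report = compare_genes(graph_index, "GENE_A", "GENE_B")
print(report)
```

A real resource holds tens of millions of such links, but the principle is the same: one query can draw on several kinds of relationship at once.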
“This computational resource that we’ve developed is one of the first aimed at unifying biological information and it’s the only one to address such diversity and amount of data. It allows access, in an easy and harmonized way, to practically all the biological knowledge currently available, and it has enormous potential to accelerate biomedical research,” Aloy said.
Almost 1,000 descriptors for 12 biological entities
The information held in the Bioteque is structured into 12 types of biological entities, such as gene, disease, tissue, and cell. For each of these entities, the tool considers a series of descriptors or characteristics: for example, the pattern of mutations of a gene, the profile of physical interactions of the resulting proteins, the expression of the gene in different cell types, or its relationship with different diseases. Across the 12 biological entities, the system covers around 1,000 types of descriptors.
“We have worked with information from 150 different databases, so first we had to integrate them, that is, put them all in the same ‘language.’ And then we converted that knowledge into numerical descriptors that could be interpreted by algorithms, and that way we could computationally exploit these networks and connections,” said Adrià Fernández, the first author of the article and a doctoral student in the same laboratory.
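The two steps Fernández describes, harmonizing sources and then turning the result into numbers, can be sketched in miniature. This is an assumption-laden illustration, not the team's method: the two "databases", the synonym tables, and the function names are all made up, and a simple presence/absence vector with cosine similarity stands in for whatever descriptors the Bioteque actually computes.

```python
import math

# Two hypothetical sources that name the same things differently.
DB_ONE = [("geneA", "disease one"), ("geneB", "disease one")]
DB_TWO = [("ID_0001", "DISEASE-2"), ("ID_0002", "DISEASE-1")]

# Hypothetical synonym tables mapping each source's labels to one shared vocabulary.
GENE_SYNONYMS = {"geneA": "GENE_A", "ID_0001": "GENE_A", "geneB": "GENE_B", "ID_0002": "GENE_B"}
DISEASE_SYNONYMS = {"disease one": "DISEASE_1", "DISEASE-1": "DISEASE_1", "DISEASE-2": "DISEASE_2"}


def harmonize(records):
    """Step 1: translate every record into the shared vocabulary."""
    return {(GENE_SYNONYMS[gene], DISEASE_SYNONYMS[disease]) for gene, disease in records}


def to_vectors(pairs):
    """Step 2: one 0/1 vector per gene, with one position per disease."""
    diseases = sorted({disease for _, disease in pairs})
    genes = sorted({gene for gene, _ in pairs})
    return {
        gene: [1 if (gene, disease) in pairs else 0 for disease in diseases]
        for gene in genes
    }


def cosine(u, v):
    """Cosine similarity between two non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


merged = harmonize(DB_ONE) | harmonize(DB_TWO)  # duplicates across sources collapse
vectors = to_vectors(merged)
similarity = cosine(vectors["GENE_A"], vectors["GENE_B"])
print(vectors)
print(round(similarity, 3))
```

Once every entity is a list of numbers, standard algorithms can compare, cluster, or make predictions from them, which is the kind of computational use the quote points to.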
The Bioteque will be expanded periodically with new databases as they are made public. The tool, the databases, and the algorithms are all open access.