New study uses computer learning to provide quality control for genetic databases

AI and quality control in genomic data are made for each other.

A new study published in The Plant Journal helps shed light on the transcriptomic differences among tissues in Arabidopsis, an important model organism, by creating a standardized “atlas” that can automatically annotate samples to restore lost metadata such as tissue type. By combining data from over 7,000 samples and 200 labs, this work offers a way to leverage the growing volume of publicly available ‘omics data while improving quality control, enabling large-scale studies and data reuse.

“As more and more ‘omics data are hosted in the public databases, it becomes increasingly difficult to leverage those data. One big obstacle is the lack of consistent metadata,” says first author and Brookhaven National Laboratory research associate Fei He. “Our study shows that metadata might be detected based on the data itself, opening the door for automatic metadata re-annotation.”

The study focuses on data from microarray analyses, an early high-throughput genetic analysis technique that remains in common use. Such data are often made publicly available through repositories such as the National Center for Biotechnology Information’s Gene Expression Omnibus (GEO), which over time accumulates vast amounts of information from thousands of studies.
