Transcription network construction for large-scale microarray datasets using a high-performance computing approach

Research output: Contribution to journal (Article)

13 Citations (Scopus)

Abstract

Background: High-throughput genomic technologies, including microarrays, can generate a tremendous amount of gene expression data for the entire genome. Deciphering transcriptional networks that convey information on intracluster correlations and intercluster connections of genes is a crucial analysis task in the post-sequence era. Most existing analysis methods for genome-wide gene expression profiles consist of several steps that often require human involvement based on experiential knowledge that is generally difficult to acquire and formalize. Moreover, large-scale datasets typically incur prohibitively expensive computation overhead and thus result in a long experiment-analysis research cycle.

Results: We propose a parallel computation-based random matrix theory approach to analyze the cross correlations of gene expression data in an entirely automatic and objective manner, eliminating the ambiguities and subjectivity inherent to human decisions. We apply the proposed approach to the publicly available human liver cancer data and yeast cycle data, and generate transcriptional networks that illustrate interacting functional modules. The experimental results agree closely with those published in the literature.

Conclusions: The correlations calculated from experimental measurements typically contain both "genuine" and "random" components. In the proposed approach, we remove the "random" component by testing the statistics of the eigenvalues of the correlation matrix against a "null hypothesis": a truly random correlation matrix obtained from mutually uncorrelated expression data series. Our investigation into the components of deviating eigenvectors after varimax orthogonal rotation reveals distinct functional modules.

The use of high-performance computing resources, including the ScaLAPACK package, a supercomputer, and a Linux PC cluster, in our implementations and experiments significantly reduces the computation time that would otherwise be needed on a single workstation. More importantly, the large distributed shared memory and parallel computing power allow us to process very large genomic datasets.
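The eigenvalue test described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' parallel ScaLAPACK implementation, and the abstract does not state which null distribution they use; the sketch assumes the Marchenko-Pastur upper edge, a common random matrix theory null for the correlation matrix of mutually uncorrelated series. The function name `deviating_eigenpairs` and the `expr` array layout (genes by samples) are hypothetical.

```python
import numpy as np

def deviating_eigenpairs(expr):
    """Return eigenpairs of the gene-gene correlation matrix that exceed
    the Marchenko-Pastur upper edge (assumed null model, see lead-in).

    expr: array of shape (n_genes, n_samples), one expression series per row.
    """
    n_genes, n_samples = expr.shape
    # Standardize each gene's series to zero mean and unit variance.
    z = expr - expr.mean(axis=1, keepdims=True)
    z = z / expr.std(axis=1, keepdims=True)
    # Pearson cross-correlation matrix between all pairs of genes.
    corr = z @ z.T / n_samples
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(corr)
    # Upper edge of the Marchenko-Pastur spectrum for uncorrelated series.
    lambda_max = (1.0 + np.sqrt(n_genes / n_samples)) ** 2
    keep = eigvals > lambda_max
    return eigvals[keep], eigvecs[:, keep], lambda_max
```

Eigenvalues above the returned bound are the candidates for "genuine" correlation; the columns of the returned eigenvector matrix are the deviating eigenvectors that a later step, such as the varimax rotation mentioned in the abstract, would operate on. Finite sample sizes blur the edge slightly, so values just above the bound should be read with caution.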

Original language: English
Article number: S5
Journal: BMC Genomics
Volume: 9
Issue number: SUPPL. 1
DOI: 10.1186/1471-2164-9-S1-S5
State: Published - 4 Mar 2008

Fingerprint

Computing Methodologies
Gene Regulatory Networks
Genome
Gene Expression
Liver Neoplasms
Transcriptome
Yeasts
Technology
Research
Genes
Datasets

Cite this

@article{f9a9bb8fc80043c28ab1733cfdc3194c,
title = "Transcription network construction for large-scale microarray datasets using a high-performance computing approach",
author = "Zhu, {Mengxia Michelle} and Wu, Qishi",
year = "2008",
month = "3",
day = "4",
doi = "10.1186/1471-2164-9-S1-S5",
language = "English",
volume = "9",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central Ltd.",
number = "SUPPL. 1",
}

Transcription network construction for large-scale microarray datasets using a high-performance computing approach. / Zhu, Mengxia Michelle; Wu, Qishi.

In: BMC Genomics, Vol. 9, No. SUPPL. 1, S5, 04.03.2008.

TY - JOUR

T1 - Transcription network construction for large-scale microarray datasets using a high-performance computing approach

AU - Zhu, Mengxia Michelle

AU - Wu, Qishi

PY - 2008/3/4

Y1 - 2008/3/4

UR - http://www.scopus.com/inward/record.url?scp=44449158290&partnerID=8YFLogxK

U2 - 10.1186/1471-2164-9-S1-S5

DO - 10.1186/1471-2164-9-S1-S5

M3 - Article

C2 - 18366618

AN - SCOPUS:44449158290

VL - 9

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

IS - SUPPL. 1

M1 - S5

ER -