Figure 1.
Principal component analysis, with axes representing the 3 principal components (Comp) (linear combinations of genes) accounting for most of the data variance, using all genes exhibiting variation across the data set (A) and using the top 10 genes most highly associated with each tumor class (B). CNS indicates central nervous system; PNETs, primitive neuroectodermal tumors.

Figure 2.
Self-organizing maps were used to discover 2 predominant classes of medulloblastoma: class 0 (high ribosome content) and class 1 (low ribosome content). The top 50 genes for each class are shown. Each column represents an individual sample and each row represents a single gene. Relative gene expression is depicted by red when high and blue when low. mRNA indicates messenger RNA.

Figure 3.
Top 50 genes associated with survival (A) and treatment failure (B). Each column represents an individual sample and each row represents a single gene. Relative gene expression is depicted by red when high and blue when low. mRNA indicates messenger RNA.

Figure 4.
Pictorial representation of supervised analysis methods for sample classification. A hypothetical distribution of samples in multidimensional gene expression space is shown in 2 dimensions. The samples are identified as class A (red) or class B (green) to demonstrate how the following algorithms assign by binary decision an unclassified sample (blue). For outcome predictions, the 2 classes correspond to patients who died of tumor progression due to treatment failure vs survivors after therapy. A, k-Nearest neighbor assigns the unknown (test) sample to a class based on its proximity in gene expression space to the surrounding samples (neighbors). In this case, the unknown sample has a gene expression profile closer to that of the samples in class A. B, Weighted voting determines a decision boundary (DB) midway between the mean gene expression levels of 2 groups of samples. The closer each gene of the unknown sample is to the DB for a particular gene, the less weight that gene carries toward the assignment of the sample to an outcome class. The unknown sample is assigned to the class with the most positive voting genes, class B in this example. C, Support vector machine algorithms determine a DB based on the optimum separation of the 2 classes according to their gene expression profiles. It assigns a classification to the unknown sample according to its position in gene expression space in relation to that of the DB. In this case, a hypothetical DB is shown assigning the unknown sample to class A.

Basic Science Seminars in Neurology
May 2003

Application of Microarrays to Neurological Disease

Author Affiliations

From the Division of Neuroscience, Department of Neurology, Children's Hospital, Harvard Medical School, Boston, Mass (Drs Sturla, Fernandez-Teijeiro, and Pomeroy); and the Unidad de Oncologia Pediatrica, Hospital de Cruces-Baracaldo, Basque Country, Spain (Dr Fernandez-Teijeiro).

 

HASSAN M. FATHALLAH-SHAYKH, MD

Arch Neurol. 2003;60(5):676-682. doi:10.1001/archneur.60.5.676

Modern microarray-based functional genomics holds great promise for revealing novel molecular and cellular mechanisms of disease. First introduced commercially in 1996, microarrays have been used widely to monitor the expression of thousands of genes in biological samples, as described in the following paragraphs. Other microarray-based genomic applications are also in development, including comparative genomic hybridization, on-chip sequencing, and novel drug discovery. For example, DNA array-based comparative genomic hybridization identifies chromosomal gains and losses with greatly improved resolution compared with conventional methods that use metaphase chromosomes as hybridization targets.1 This resolution will continue to improve as the technology advances. Moreover, microarrays provide a better platform for automation than is possible with standard metaphase techniques. Where genetic mutations and aberrations are already well characterized, microarrays can be customized for use as diagnostic and prognostic tools.2,3 In drug discovery, microarrays have the potential to accelerate progress at all stages, from target discovery (through validation of new molecular targets and understanding of modes of action) to prediction of patient response.4

These devices are beginning to revolutionize how scientists explore the operation of normal cells in the body and the molecular aberrations that underlie medical disorders. DNA microarrays, which are based on well-established principles of nucleic acid hybridization, simultaneously interrogate thousands of genes.5-7 The mechanics of capturing data from raw material are well documented and continue to improve; it is to the analysis of these data, and the discovery of meaningful gene expression patterns within them, that we must now turn our attention.

Analytical approaches to gene expression analysis using a cancer classification model are illustrated in the recent article by Pomeroy et al.8 Several important clinical questions were answered via the application of microarray technology and emerging data analysis techniques to pediatric brain tumors.8 Using microarrays that monitor the expression of more than 6800 genes, we endeavored to definitively differentiate a group of embryonal tumors whose diagnosis on the basis of morphologic features remains controversial and to predict outcome in the most common of these tumors, medulloblastoma, for which patient response to treatment is unpredictable.

There are 2 general approaches to data analysis: supervised and unsupervised. Unsupervised methods are applied to the entire gene expression data set without any previous knowledge of sample classification, allowing an impartial assessment of the underlying features within a data set. Two examples of unsupervised methods are principal component analysis and self-organizing maps (SOMs). Principal component analysis allowed us to differentiate at a molecular level between the different brain tumor types and normal cerebellum (Figure 1). The marker genes responsible for this distinction supported the conclusion that medulloblastomas are derived from cerebellar granule cell precursors and that they are molecularly distinct from supratentorial primitive neuroectodermal tumors. This argues against the hypothesis that medulloblastomas are a subset of primitive neuroectodermal tumors, differing only in their location in the cerebellum. Self-organizing maps are ideally suited for exploratory data analysis in the generally large and complex data sets generated in the study of a particular disease, in our case brain tumors. Using SOMs, we identified 2 distinct biological subtypes of medulloblastomas with low and high ribosomal protein expression (Figure 2). Electron microscopy subsequently confirmed that these differences in ribosomal gene expression were reflected at a cellular level by differences in ribosome biogenesis. Although this was not an expected result, it provided us with an interesting therapeutic target. Sirolimus and its analogues are currently under clinical investigation in tumors reliant on the PI3K signaling pathway and ribosome biogenesis.9

Unsupervised analysis, although useful for pulling out prominent structure in a data set (eg, medulloblastoma vs primitive neuroectodermal tumors), may miss more subtle distinctions. We found this to be true for outcome prediction. Neither principal component analysis nor SOMs identified prognostically significant subgroups of medulloblastomas, so we turned to supervised analysis. Expression profiles were obtained from 60 children with medulloblastomas who received similar treatment and whose outcome was known. Supervised methods were used to "learn" the distinction between survivors and patients whose treatment failed (Figure 3). In take-one-out (leave-one-out) cross-validation, gene expression patterns predicted survival with substantially more accuracy than current clinical risk criteria. Several supervised analysis methods showed a similar degree of accuracy, including k-nearest neighbor, support vector machines, and structural pattern localization analysis by sequential histograms.

Supervised methods were also used to successfully classify classic and desmoplastic medulloblastomas (histologically confirmed by a single neuropathologist). These algorithms allowed us not only to classify tumors and predict outcome but also to discover previously unknown relationships between coordinate gene expression and tumor characteristics. For example, we demonstrated that the genes encoding sonic hedgehog (shh)–related proteins are highly expressed in desmoplastic medulloblastomas, suggesting that they arise as a consequence of dysregulated shh signaling. Thus, microarray analysis can identify gene expression profiles that signify an activated regulatory pathway or interacting molecular processes leading to a known cellular response.

There are, of course, limitations to any approach that involves the generation of such a large amount of data for each of a relatively small group of samples. One of the most significant risks is finding statistically significant associations by chance. Consequently, identification of gene expression patterns that may underlie the pathogenesis of brain tumors requires validation. Validation of the expression of single genes can be done using well-established techniques such as Northern or Western blotting, as well as immunohistochemistry or in situ hybridization. Hypotheses that arise from the interpretation of significant patterns of gene expression can be tested in a variety of ways. For example, we used electron microscopy to demonstrate that tumors with increased coordinate expression of ribosomal proteins have high numbers of free ribosomes. Our gene expression–based outcome predictions must be validated in an independent, prospective cohort of patients before gene expression profiling can be used for risk stratification in future clinical trials. It is evident, then, that the hypotheses generated from the analysis of complex gene expression patterns must be tested by independent measures before final conclusions can be reached.

METHODS
UNSUPERVISED
Principal Component Analysis

A simple multidimensional scaling of the data set was obtained by plotting, in scatterplots, the top principal components (linear combinations of genes) that account for a significant fraction of the variance. To study the natural clustering of the brain tumor samples, we initially considered the subset of genes with the highest variation across samples (Figure 1A). In this case, the top 3 principal components accounted for approximately 43% of the variance of these genes. We then plotted principal components based on the top 10 marker genes associated with each tumor class, selected by the signal-to-noise statistic.10 The top 3 principal components of this data set accounted for approximately 61% of the variance, and the separation and clustering of tumor types were markedly improved over those obtained with the genes of highest variation (Figure 1B). These calculations were performed using Mathsoft software available on the Internet at http://www.mathsoft.com.
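The variance percentages quoted above follow from the eigenvalues of the sample covariance matrix. The standard formulation is sketched here for orientation only; the symbols (n samples x_i, p genes, eigenpairs λ_j and v_j) are notation introduced for this sketch and do not come from the original analysis.

```latex
\[
C = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\mathsf T},
\qquad C\,v_j = \lambda_j v_j,
\quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
\]
\[
z_{ij} = v_j^{\mathsf T}(x_i - \bar{x}),
\qquad
\text{fraction of variance in the top 3 components}
  = \frac{\lambda_1 + \lambda_2 + \lambda_3}{\sum_{j=1}^{p} \lambda_j}
\]
```

Here z_{ij} is the coordinate of sample i on component j, which is what each axis of Figure 1 displays. Under this reading, the values of approximately 43% and 61% correspond to the ratio in the second line for the two gene sets.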

Self-organizing Maps

We performed SOMs using the GeneCluster software package available on the Internet at http://www.genome.wi.mit.edu. As an exploratory data analysis method, SOMs identify groups of samples with common gene expression patterns within a large heterogeneous sample set. To calculate SOMs, one initially randomly selects and maps a grid of nodes onto the tumor sample set. Through an iterative series of calculations testing the similarity of gene expression between samples, the geometry of the nodes is adjusted to reflect the data structure. If the number of nodes exceeds the number of "natural" clusters in the sample set, then the nodes will converge to reflect the natural clustering. In our case, applying this unsupervised approach to the medulloblastoma data set led to the discovery that 2 is the optimum number of groups identifiable by SOMs (Figure 2).
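The iterative adjustment of nodes described above can be illustrated with a very small self-organizing map. This is a sketch under stated assumptions and is not the GeneCluster implementation: the 1-dimensional node grid, the linear decay of learning rate and neighborhood radius, the function names, and the toy profiles are all hypothetical choices made for illustration.

```python
import random

def nearest_node(nodes, x):
    """Index of the node whose weight vector is closest to sample x."""
    return min(range(len(nodes)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(nodes[j], x)))

def train_som(samples, n_nodes=2, n_iter=500, lr0=0.5, seed=0):
    """Train a tiny SOM on a 1-D grid of nodes; returns the node vectors."""
    rng = random.Random(seed)
    # Assumption: each node starts at a randomly chosen sample.
    nodes = [list(rng.choice(samples)) for _ in range(n_nodes)]
    for t in range(n_iter):
        decay = 1.0 - t / n_iter             # shrinks from 1 toward 0
        lr = lr0 * decay                     # learning rate decays over time
        radius = (n_nodes / 2.0) * decay     # neighborhood narrows over time
        x = rng.choice(samples)
        winner = nearest_node(nodes, x)
        for j, node in enumerate(nodes):
            grid_dist = abs(j - winner)
            if grid_dist <= radius:
                # Nodes near the winner on the grid move too, but less far.
                step = lr / (1.0 + grid_dist)
                for d, value in enumerate(x):
                    node[d] += step * (value - node[d])
    return nodes

# Hypothetical 2-gene expression profiles forming two obvious groups.
profiles = [[0.0, 0.2], [0.3, 0.0], [0.1, 0.4], [0.4, 0.3],
            [10.0, 10.2], [10.3, 9.9], [9.8, 10.1], [10.2, 10.4]]
nodes = train_som(profiles)
clusters = [nearest_node(nodes, p) for p in profiles]
```

After training, each sample is assigned to its nearest node, so `clusters` gives the group membership that a figure such as Figure 2 would display.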

SUPERVISED

To build supervised classifiers, we defined target classes based on morphologic features, tumor class, or treatment outcome. The method is illustrated by our analysis of treatment outcome (Figure 3). In this case, we created 2 classes of patients based on clinical outcome. Gene expression profiles from patients who died of progressive disease due to treatment failure were compared with expression profiles of patients who were still alive at the end of the study and who had been followed for at least 1 year after cessation of therapy. Genes correlated with the 2 outcome classes were identified by sorting all of the genes on the array according to the signal-to-noise statistic.10 We built a sample classifier in cross-validation by removing 1 sample, using the remaining samples as a training set, and repeating this procedure until every sample had been tested as an "unknown." Several models were built using different numbers of marker genes, and the final chosen model was the one that minimized the total error (number of samples that were misclassified) in cross-validation. For this, k-nearest neighbor, weighted voting, and support vector machine algorithms were used (Figure 4).
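The gene ranking and cross-validation loop described above can be sketched as follows. Several points are assumptions rather than details taken from the text: the signal-to-noise statistic is written in its commonly used form (difference of class means divided by the sum of class standard deviations), genes are ranked by its absolute value, marker genes are reselected inside each fold, and a simple nearest-centroid rule stands in for the real classifiers. All function names and data are hypothetical, and each training fold is assumed to contain both classes.

```python
from statistics import mean, pstdev

def signal_to_noise(values_a, values_b):
    """(mean_A - mean_B) / (sd_A + sd_B) for one gene; 0.0 if no spread."""
    spread = pstdev(values_a) + pstdev(values_b)
    return (mean(values_a) - mean(values_b)) / spread if spread else 0.0

def top_marker_genes(samples, labels, n_genes):
    """Indices of the n_genes genes with the largest |signal-to-noise|."""
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == "A"]
        b = [s[g] for s, y in zip(samples, labels) if y == "B"]
        scores.append((abs(signal_to_noise(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]

def nearest_centroid(train, train_labels, test_sample):
    """Stand-in classifier: pick the class whose mean profile is closer."""
    def sq_dist_to_mean(cls):
        rows = [s for s, y in zip(train, train_labels) if y == cls]
        centroid = [mean(col) for col in zip(*rows)]
        return sum((x - c) ** 2 for x, c in zip(test_sample, centroid))
    return min(("A", "B"), key=sq_dist_to_mean)

def leave_one_out_errors(samples, labels, n_genes, classify):
    """Number of held-out samples misclassified across all folds."""
    errors = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Marker genes are chosen without looking at the held-out sample.
        genes = top_marker_genes(train, train_labels, n_genes)
        project = lambda s: [s[g] for g in genes]
        predicted = classify([project(s) for s in train], train_labels,
                             project(samples[i]))
        errors += predicted != labels[i]
    return errors

# Hypothetical data: gene 0 separates the classes, genes 1 and 2 do not.
samples = [[9.0, 1.0, 5.0], [8.5, 2.0, 4.0], [9.5, 1.5, 6.0],
           [1.0, 1.2, 5.5], [1.5, 1.8, 4.5], [0.5, 1.4, 5.2]]
labels = ["A", "A", "A", "B", "B", "B"]
# Model selection: keep the number of marker genes with the fewest errors.
best_n = min(range(1, 4),
             key=lambda n: leave_one_out_errors(samples, labels, n,
                                                nearest_centroid))
```

Any classifier with the same three-argument signature could be passed as `classify`, which mirrors how several algorithms were compared under one cross-validation scheme.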

k-Nearest Neighbors

The k-nearest neighbor algorithm11 was used to predict the class of the unknown sample by calculating the distance of that sample from those surrounding it in gene expression space (class-specific marker genes, ie, those associated with survival or treatment failure). The unknown sample was predicted to be in one or the other outcome class based on the similarity of gene expression with that of most of the k-nearest (neighbor) samples (Figure 4A). Marker genes were chosen from those highly correlating with the predetermined classes using the signal-to-noise statistic.
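A minimal sketch of this majority-vote rule is shown below. It assumes Euclidean distance over the selected marker genes and k = 3; neither choice is stated in the text, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, train_labels, sample, k=3):
    """Assign `sample` to the majority class among its k nearest neighbors."""
    # Pair each training profile's distance to the sample with its label.
    by_distance = sorted((math.dist(profile, sample), label)
                         for profile, label in zip(train, train_labels))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-gene marker profiles for two outcome classes.
train = [[9.0, 1.0], [8.5, 1.5], [9.5, 0.5],
         [1.0, 8.0], [1.5, 9.0], [0.5, 8.5]]
train_labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(train, train_labels, [8.0, 2.0])
```

An odd k avoids tied votes when there are only 2 classes.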

Weighted Voting

The weighted voting algorithm10 makes a weighted linear combination of relevant "marker" or "informative" genes obtained in the training set to provide a classification scheme for new samples. The selection of marker genes for each outcome class was determined by computing the signal-to-noise statistic of each gene for the predefined classes. The algorithm determined the decision boundary (halfway) between the outcome class means for each gene (Figure 4B). To predict the class of the unknown sample, each gene in the sample expression profile casts a vote, and the unknown sample is assigned to the class with the most positive voting genes. The distance of that sample from the decision boundary determines the weight that each gene carries in this voting process. The closer each gene of the unknown sample is to the decision boundary for a particular gene, the less weight that gene carries toward the assignment of the sample to an outcome class. Confidence in the class prediction of the unknown sample was determined by the size of the voting margin responsible for putting the sample in one class vs the other.
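The voting scheme can be sketched as below. This is an illustration under assumptions, not the authors' code: each gene's weight is taken to be its signal-to-noise value, its vote is the weight times the distance of the sample from that gene's decision boundary, the class with the larger summed vote wins, and confidence is the normalized voting margin. Function names and data are hypothetical.

```python
from statistics import mean, pstdev

def fit_weighted_voting(class_a, class_b):
    """Return one (weight, boundary) pair per gene from training profiles."""
    model = []
    for a_vals, b_vals in zip(zip(*class_a), zip(*class_b)):
        mu_a, mu_b = mean(a_vals), mean(b_vals)
        spread = pstdev(a_vals) + pstdev(b_vals)
        weight = (mu_a - mu_b) / spread if spread else 0.0  # signal-to-noise
        boundary = (mu_a + mu_b) / 2.0   # halfway between the class means
        model.append((weight, boundary))
    return model

def predict_weighted_voting(model, sample):
    """Return (winning class, confidence in [0, 1]) for one sample."""
    # A positive vote favors class A, a negative vote favors class B.
    # Genes lying close to their boundary contribute little.
    votes = [w * (x - b) for (w, b), x in zip(model, sample)]
    v_a = sum(v for v in votes if v > 0)
    v_b = -sum(v for v in votes if v < 0)
    winner = "A" if v_a >= v_b else "B"
    total = v_a + v_b
    strength = abs(v_a - v_b) / total if total else 0.0
    return winner, strength

# Hypothetical 2-gene training profiles.
class_a = [[9.0, 2.0], [8.0, 3.0], [10.0, 2.5]]
class_b = [[1.0, 6.0], [2.0, 7.0], [1.5, 6.5]]
model = fit_weighted_voting(class_a, class_b)
clear_call = predict_weighted_voting(model, [8.5, 2.8])   # genes agree
mixed_call = predict_weighted_voting(model, [8.5, 6.0])   # genes disagree
```

In the second call the 2 genes vote for different classes, so the sample is still assigned to a class but with a smaller margin, which is how a low-confidence prediction would appear.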

Support Vector Machines

The basic idea behind support vector machines is to construct an optimal class-separating hyperplane (decision boundary) by mapping the gene expression data to a high-dimensional space.12,13 Linear separation in this higher dimensional space corresponds to a nonlinear decision boundary separating the 2 outcome classes (Figure 4C). This allowed us to separate the outcome classes more effectively than with weighted voting, in which the decision boundary is linear. As with weighted voting, samples are assigned to a class by their position in relation to the decision boundary, and, again, confidence in the classification depends on the distance of the sample from that boundary into a particular class.
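For orientation, the standard kernel form of the support vector machine decision rule is given below. The notation is introduced for this sketch and is not taken from the text: training profiles x_i with labels y_i of +1 or -1, learned coefficients α_i that are nonzero only for the support vectors, offset b, and kernel function K.

```latex
\[
f(x) = \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b,
\qquad
\hat{y}(x) = \operatorname{sign}\bigl(f(x)\bigr),
\qquad \alpha_i \ge 0
\]
\[
K(x_i, x) = x_i^{\mathsf T} x \ \ \text{(linear kernel)},
\qquad
K(x_i, x) = \bigl(x_i^{\mathsf T} x + 1\bigr)^{d} \ \ \text{(polynomial kernel of degree } d\text{)}
\]
```

The sign of f(x) gives the predicted class, and the magnitude of f(x) grows with the distance of the sample from the decision boundary, which is the sense in which that distance can serve as a confidence measure. A nonlinear kernel corresponds to the mapping into a higher dimensional space described above.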

COMMENT
RELEVANCE TO THE STUDY OF NEUROSCIENCE AND THE PRACTICE OF NEUROLOGY

The use of microarrays provides a springboard from which we can start to examine the cellular pathogenesis underlying neurological disease and perhaps narrow down the search to a manageable group of therapeutic targets. Examples of this can be seen in neurological disorders such as multiple sclerosis, Huntington disease, Parkinson disease, and Alzheimer disease.14-17 Applying microarray technology to a transgenic animal model of Huntington disease resulted in the finding that genes encoding certain neurotransmitters, calcium and retinoid signaling pathway components, were down-regulated, whereas those encoding inflammatory components were up-regulated.16 These findings were unexpected consequences of the mutant huntingtin protein and could only have been discovered on this scale by microarray analysis.

Multiple sclerosis is a complex disorder with multiple clinical subtypes that cannot be diagnosed by clinical criteria at initial presentation.18 Immunomodulatory treatment has proved to be relatively successful in relapsing-remitting disease, but it is not as useful in primary or secondary progressive disease.18,19 To date, there are no in vivo markers that allow specific direction of treatment. Microarray analysis has begun to dissect the molecular heterogeneity of multiple sclerosis, identifying genes related to cell metabolism, structure, cytokines, and cell adhesion molecules. In addition, a gene not previously associated with multiple sclerosis, encoding the Duffy chemokine receptor, was identified using this technology.14,17 Although these results are preliminary, eventually microarray expression analysis may lead to the identification of markers that are detectable in living patients, allowing prognosis to be accurately predicted at the time of initial diagnosis and treatment to be tailored accordingly. An even greater future challenge is offered by the investigation of psychiatric disorders, which seem to result from the interplay of polygenic and epigenetic factors on multiple brain circuits.20

CONCLUSIONS

The development of array-based DNA mutation screening may, in the future, prove to be beneficial for the identification of an individual's genetic propensity to acquire disorders such as Alzheimer or Parkinson disease. Consequently, at-risk candidates can be selected for close monitoring, intensive preventive care, and early clinical intervention. Other applications may include screening of patients for gene variants that affect the individual's response to certain medications, allowing the physician to tailor the best treatment regimen for a given disease in an individual patient.

USEFUL WEB SITES

The following Web sites are useful in the study of microarray-based functional genomics:

Article Information

Corresponding author and reprints: Scott L. Pomeroy, MD, PhD, Division of Neuroscience, Department of Neurology, Children's Hospital, 300 Longwood Ave, Boston, MA 02115 (e-mail: scott.pomeroy@tch.harvard.edu).

Accepted for publication December 11, 2002.

Author contributions: Study concept and design (Dr Pomeroy); acquisition of data (Drs Sturla, Fernandez-Teijeiro, and Pomeroy); analysis and interpretation of data (Drs Sturla and Pomeroy); drafting of the manuscript (Drs Sturla, Fernandez-Teijeiro, and Pomeroy); critical revision of the manuscript for important intellectual content (Dr Pomeroy); statistical expertise (Dr Pomeroy); obtained funding (Dr Pomeroy); administrative, technical, and material support (Dr Sturla); study supervision (Dr Pomeroy).

References
1. Theillet C, Orsetti B, Redon R, Manoir SD. Genomic profiling: from molecular genetics to DNA arrays. Bull Cancer. 2001;88:261-268.
2. Jain AN, Chin K, Borresen-Dale AL, et al. Quantitative analysis of chromosomal CGH in human breast tumors associates copy number abnormalities with p53 status and patient survival. Proc Natl Acad Sci U S A. 2001;98:7952-7957.
3. Hui AB, Lo KW, Yin XL, Poon WS, Ng HK. Detection of multiple gene amplifications in glioblastoma multiforme using array-based comparative genomic hybridisation. Lab Invest. 2001;81:717-723.
4. Clarke PA, Poele RT, Wooster R, Workman P. Gene expression and microarray analysis in cancer biology, pharmacology, and drug development: progress and potential. Biochem Pharmacol. 2001;62:1311-1336.
5. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270:467-470.
6. DeRisi J, Penland L, Brown PO, et al. Use of cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet. 1996;14:457-460.
7. Lockhart DJ, Winzeler EA. Genomics, gene expression and DNA arrays. Nature. 2000;405:827-836.
8. Pomeroy SL, Tamayo P, Gaasenbeek M, et al. Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature. 2001;415:436-442.
9. Hidalgo M, Rowinsky EK. The rapamycin-sensitive signal transduction pathway as a target for cancer therapy. Oncogene. 2000;19:6680-6686.
10. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998;95:14863-14868.
11. Dasarathy VB. Nearest Neighbour (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, Calif: IEEE Computer Society Press; 1991.
12. Mukherjee S, Tamayo P, Mesirov JP, Slonim D, Verri A, Poggio T. Support Vector Machine Classification of Microarray Data, CBCL Paper #182/AI Memo #1676. Cambridge: Massachusetts Institute of Technology; 1999.
13. Brown MP, Grundy WN, Lin D, et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A. 2000;97:262-267.
14. Whitney LW, Becker KG, Tresser NJ, et al. Analysis of gene expression in multiple sclerosis lesions using cDNA microarrays. Ann Neurol. 1999;46:425-428.
15. Ginsberg SD, Hemby SE, Lee VM, Eberwine JH, Trojanowski JQ. Expression profile of transcripts in Alzheimer's disease tangle-bearing CA1 neurons. Ann Neurol. 2000;48:77-87.
16. Luthi-Carter R, Strand A, Peters NL, et al. Decreased expression of striatal signaling genes in a mouse model of Huntington's disease. Hum Mol Genet. 2000;9:1259-1271.
17. Steinman L. Gene microarrays and experimental demyelinating disease: a tool to enhance serendipity. Brain. 2001;124:1897-1899.
18. Bitsch A, Bruck W. Differentiation of multiple sclerosis subtypes: implications for treatment. CNS Drugs. 2002;16:405-418.
19. Goodin DS, Frohman EM, Garmany GP, et al. Disease modifying therapies in multiple sclerosis: report of the Therapeutics and Technology Assessment Subcommittee of the American Academy of Neurology and the MS Council for Clinical Practice Guidelines. Neurology. 2002;58:169-178.
20. Mirnics K, Middleton FA, Lewis DA, Levitt P. Analysis of complex brain disorders with gene expression microarrays: schizophrenia as a disease of the synapse. Trends Neurosci. 2001;24:479-486.