DNA microarrays represent a technological intersection between biology
and computers that enables gene expression analysis in human tissues on a
genome-wide scale. This application can be expected to prove extremely valuable
for the study of the genetic basis of complex diseases. Despite the enormous
promise of this revolutionary technology, there are several issues and possible
pitfalls that may undermine the authority of the microarray platform. We discuss
some of the conceptual, practical, statistical, and logistical issues surrounding
the use of microarrays for gene expression profiling. These issues include
the imprecise definition of normal in expression
comparisons; the cellular and subcellular heterogeneity of the tissues being
studied; the difficulty in establishing the statistically valid comparability
of arrays; the logistical logjam in analysis, presentation, and archiving
of the vast quantities of data generated; and the need for confirmational
studies that address the functional relevance of findings. Although several
complicated issues must be resolved, the potential payoff remains large.
Increasing numbers of human diseases, both acquired and genetic, are
thought to be based at least in part on alterations in DNA sequence.
For most diseases, inheritance and acquisition are likely to be complex and
polygenic. The efforts of the Human Genome Project to elucidate the structural
genetic background by identifying the chromosomal positions and genomic
organization of approximately 30 000 to 35 000 human genes are nearly
complete.1 Based on this structural knowledge,
a byproduct should be a better "scaffolding" to help link specific genes to
susceptibility to various human diseases. However, to understand how the products
of these genetic linkages work together to orchestrate the initiation and
progression of particular complex diseases, there will be a need to apply
a functional genetic rather than a structural genetic approach.2,3
Until recently, functional genetic studies have generally been of limited
scope, only able to elucidate the role of 1 or a few genes at a time in 1
system. Information on the specificity and relative abundance of expression
products has traditionally been obtained by techniques such as RNA Northern
blot hybridization and ribonuclease protection assays. Somewhat more sophisticated
methods, such as differential display4 and Serial Analysis of Gene Expression,5
have been used to screen larger numbers of complementary DNA (cDNA) clones.
However, technical limitations make these techniques poorly suited to large-scale
genetic surveys.
To this end, a powerful new technology is emerging, using hybridization
to nucleotide arrays, the so-called gene chips.6,7
This technological intersection of biology and computers enables the reliable
screening of a vast number of genes simultaneously and is amenable to automation.
On a nylon membrane or glass surface, gene-specific cDNAs can be spotted,
or oligonucleotides can be synthesized in situ by a combination of photolithography
and oligonucleotide chemistry. This permits simultaneous monitoring of the
expression of thousands of genes in a single step. Individual chips can be
customized to include any chosen set of fully or partially characterized genomic
or expressed sequences. Chips can monitor over 50 000 unique sequences.
The power of these chips lies in the potential for comparative expression
studies in diseased vs normal samples, and in documenting
changes at different stages during the natural course of the disease or in
response to treatment. This technology gives the researcher a new arsenal
with which to analyze underlying pathomechanisms on a grand scale and to
review the rationale of therapeutic concepts.
However, despite the enormous potential of this revolutionary technology,
there are several issues and possible pitfalls that attenuate the power of
microarrays. First, the definition of normal in expression comparisons is
neither precise nor unambiguous. Second, the heterogeneity of the tissues
being studied complicates the meaning of the expression profiles. Third, the
statistically valid comparability of arrays is an unresolved problem. Fourth,
the vast quantities of data create a logistical logjam for analysis, presentation,
and archiving. Finally, confirmational studies are needed to corroborate the
biological significance of microarray data (Figure 1).
The standard normal vs diseased tissue type of comparison, which is
the basic design foundation of profiling studies, may be more quicksand than
bedrock. Normal is not so easy to define—neither is diseased. Gene expression
in normal tissue is likely to be dependent on several factors involving patient
and sample variation. These factors will also have an impact on expression
profiles of diseased tissue.
Patient Variation: Ethnicity, Sex, Age, Genetic Background, Disease States
The ethnicity, sex, age, and genetic background of a patient are likely
to affect the gene expression profiles of many tissues to varying extents.8,9 A simple example is provided by the
expression profiles of genes involved in scalp and body hair follicle activity,
which can be expected to vary over a normal range under the influence of all
of these sources of patient variation. The effects of these parameters on
gene expression are likely to be subtle but pervasive, not fully understood
at this time, and quite problematic for defining normal.
The presence of disease in a subject who is the source of tissue for
control purposes presents further potential variabilities. For example, there
may be a significant difference in the conclusions reached by 2 similar microarray
expression profiling studies. One may compare genes expressed in a patient's
diseased lung tissue with those expressed in normal, nondiseased lung tissue
from the same patient, and another may compare genes expressed in the same
patient's diseased lung tissue with those expressed in normal, nondiseased
lung tissue from a healthy control or normal individual. Moreover, it is also
possible that seemingly unrelated disease states may influence gene expression
at distant sites. For instance, the presence of diabetes in 1 of 2 renal cancer
patients may complicate the direct comparison of renal tissues.
Sample Variation: Proximity to Disease, Anatomic Location, and Developmental Range
Yet another complication derives from the proximity of the normal tissue
used as a control for the diseased tissue. Tissue adjacent to an area of disease
may not be normal despite absence of evidence of disease clinically or under
the light microscope. Normal-appearing tissue near a tumor could, for example,
be genotypically altered or exhibit an altered gene expression profile.10-13 Moreover,
factors such as the degree of disease-associated inflammation may have a significant
impact on gene expression profiles. Other bystander effects, epiphenomena,
or secondary disease processes could all play important roles in determining
expression profiles within these adjacent, so-called normal tissues. These
factors must be considered in the choice of normal.
The precise location within a particular organ may be another important
factor that affects gene expression.9 For example,
just as location relative to the urethra may influence expression profiles
in the prostate,14 skin samples from the nose, back,
and palm are certain to have different expression profiles as well, despite
all being from the same organ. Thus, site and specific anatomic location must
also be taken into account in a description of normal.
It must also be kept in mind that the definition of normal actually
represents a dynamic state.14 All tissues,
which are composed of early and late-stage cells, have a normal developmental
range. For example, normal epithelium in prostatic ducts ranges from atrophic
to resting to hyperplastic, and each has a unique pattern of gene expression.14
A 3-dimensional analytic approach is a strategy that has been used to
address some of these concerns about defining normal. Cole et al14
used a 3-dimensional model to characterize the entire prostate gland in their
study of gene expression profiles in prostate cancer. In this study, whole-mount
prostatectomy specimens were divided into transverse cross sections such
that the entire prostate gland, including the complete spectrum of normal
epithelium and tumor progression, was available for viewing, microdissection,
and microarray analysis. This method was used to determine the exact physical
relationship of the normal ducts, premalignant lesions, and tumors—thus
obtaining an anatomic framework on which to overlay gene expression data.
This technique offers several advantages over the normal vs tumor comparison.
Previous studies had used normal epithelium in prostatic ducts as a baseline
control against which to compare and contrast tumor gene expression profiles.15,16 However, the expression profile of
this normal epithelium is affected by proximity to tumor, location within
the gland, and developmental state.14 These
factors can be better appreciated using a 3-dimensional approach.
Disease-Related Variation
Of course, many of the parameters that affect normal expression profiles
(patient ethnicity, age, sex, and genetic background, location within an organ,
and developmental stage) will also affect disease expression profiles.17-19 Disease heterogeneity,
including subtype, activity, severity, stage, and previous as well as current
treatments, also may have a significant impact on gene expression.20-25
Categorizing and subgrouping patients on entry into a study may be useful
to control for as many of these factors as possible. However, there may be
problems surrounding attempts to define microarray-based categorization on
the basis of another imperfect categorization system, such as histology, as
these groups are sometimes arbitrary or inconsistently designated. Nevertheless,
determining whether gene expression profiles correlate with existing clinical
or histological categories can provide new insights into the meaning of these
categories. So can new methods of classifying cancers or other diseases into
specific diagnostic categories on the basis of their gene expression signatures.
Several studies have been able to establish expression-based criteria (class
predictors) for preexisting categories and then use these new criteria to
categorize new cases (class prediction).26-28
Global profiling may also allow the development of new classification systems
based on gene expression alone (class discovery).29,30
Thus, when possible, it will be of value to profile a range of normal
and diseased cell populations from a number of patients to distinguish between
differences in expression that are relevant to the disease process and those
reflecting the biological spectrum of the normal tissue or that have occurred
for reasons unrelated to the disease. The significance of this distinction
is further appreciated when taking into account the vast quantity of data
generated from microarrays and the potential for confounding interpretation
from the inclusion of differential expression unrelated to disease processes.
It is worth noting, however, that the issues of patient and sample variability
are not unique to microarray experiments. In fact, microarray experiments,
in contrast to classic single-gene experiments, may actually provide the tools
for identifying this heterogeneity. For example, DNA microarrays have been
used to explore physiological variation in gene expression on a genomic scale
in 60 cell lines derived from diverse tumor tissues.31
Cluster analysis allows the identification of prominent features in gene expression
patterns that appear to reflect molecular signatures of the tissue from which
the cells originated.31
Heterogeneous cell populations
A further complication encountered with expression profiles is that
any given tissue is composed of several cell types, members of which are likely
to be within a spectrum of dynamic functional states. For example, a simple
punch biopsy of the skin may contain keratinocytes, melanocytes, Langerhans
cells, Merkel cells, adipocytes, smooth muscle cells of arrector pili, striated
muscle cells of the panniculus carnosus, blood cells including immune system
cells, and cellular elements of blood vessels, nerves, hair follicles, sebaceous
glands, and sweat glands. Moreover, cells from each of these populations will
be at various stages of development and levels of activation, performing different
functions and responding to disease processes or treatments in different ways
and to varying extents. The result is a highly heterogeneous sampling of cells,
each expressing a unique set of genes. An expression profile generated from
a microarray study of the RNA in such a sample will thus represent merely
a snapshot of the genes expressed by a plethora of cells at a moment in time.
Such extensive cellular heterogeneity complicates the ability to draw conclusions
about specific processes occurring within a tissue specimen. An illustrative
example is provided by Stanton et al,32 who
used microarrays to identify genes differentially expressed during myocardial
infarction. The expression profiles they studied represented transcripts from
cell populations as diverse as immune system cells, which migrated to the
infarct region and are responsible for the inflammatory response, cardiac
myocytes within the ischemic area undergoing apoptosis and necrosis, fibroblasts
undergoing proliferation and participating in the formation of scar tissue
to replace the infarct, and cardiac myocytes undergoing hypertrophy to compensate
for the loss of cells in the infarct area.33
The issue of such cellular heterogeneity was avoided by categorizing the differentially
expressed genes into functional categories to look for patterns indicative
of cardiac remodeling without attempting to attribute specific transcripts
to specific cell types. For gene expression studies involving samples with
mixed cellular populations, further investigation, such as with in situ
messenger RNA (mRNA) hybridization, may be necessary to localize the transcripts before
conclusions can be drawn about the roles of specific genes in specific cell
types during the disease process.
Laser Capture Microdissection
An ingenious but technically delicate approach to the study of complex
biological samples has become possible with the development of laser capture
microdissection (Figure 2).34 This technique allows for the rapid and accurate
procurement of cells from specific areas of tissue under direct microscopic
visualization, and thus makes the molecular genetic analysis of defined populations
in their native tissue environment possible.35
Sgroi et al36 demonstrated the feasibility
of combining laser capture microdissection with high-throughput cDNA arrays.
They showed that in vivo subpopulations of malignant cells from multiple stages
of breast cancer progression could be separated from nonmalignant populations,
and their expression profiles could subsequently be analyzed using microarrays.
The potential is powerful. Specimens could be separated into tissue
layers; for example, separating a skin biopsy into epidermis, dermis, and
hypodermis. Tissues could be further differentiated into specific structural
components, such as dermis into blood vessels, adipose, arrector pili, and
sebaceous glands. Structures could be separated into defined cell types, such
as blood vessels into endothelial cells, erythrocytes, and lymphocytes. Cell
types could even be separated into marker-defined subtypes, such as lymphocytes
into CD4 and CD8 cells. Expression profiles from refined and defined structures
and cell types likely would be extremely valuable in the study of disease.
Potential aside, there are significant limitations to this technology
at the present time. The standard protocols for fixing and embedding tissue
samples from surgical resections were not designed to be compatible with microarray
experiments, with or without laser capture microdissection. Typically, tissue
suspected of being important for diagnosis and staging is processed through
aldehyde-based fixatives, such as formalin, which damage mRNA integrity.37 If frozen tissue is available, mRNA can be recovered
and studied from dissected cell populations. However, frozen tissue sections
are technically difficult to prepare, the histology is often severely compromised,
and the tissue available may contain only a limited portion of the lesion.14
Moreover, the sample amounts generated from laser capture microdissection
can be small, even as minuscule as a single cell.38
Consequently, the yields of RNA are low. Arrays have a threshold for the quantity
of molecular starting material: at least 5 to 15 µg for oligonucleotide
arrays and between 2 and 100 µg for cDNA arrays, depending on the manufacturer,
the source of the RNA, and the use of signal amplification.39,40
Studies that have successfully integrated laser capture microdissection with
microarray technology have used samples of approximately 1 × 10^4 to 1 × 10^5 cells with 95% to 98% homogeneity as determined
by microscopic visualization.36,41
If needed, amplification techniques may be used to generate sufficient genetic
material for microarray hybridizations.41 Laser
capture microdissection is an intriguing technology, but time will tell whether
its potential is realized.
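The mismatch between microdissected sample size and array input requirements can be made concrete with a rough calculation. The sketch below assumes roughly 10 pg of total RNA per mammalian cell, a commonly quoted order-of-magnitude figure that is not taken from the studies cited here. The function names are illustrative, and the 5-µg requirement is the lower bound quoted above for oligonucleotide arrays.

```python
PG_PER_UG = 1_000_000  # picograms per microgram

# Assumption (not from the cited studies): about 10 pg of total RNA per cell.
ASSUMED_RNA_PG_PER_CELL = 10.0


def total_rna_ug(cell_count, rna_pg_per_cell=ASSUMED_RNA_PG_PER_CELL):
    """Estimated total RNA, in micrograms, recovered from `cell_count` cells."""
    return cell_count * rna_pg_per_cell / PG_PER_UG


def fold_amplification_needed(cell_count, required_ug,
                              rna_pg_per_cell=ASSUMED_RNA_PG_PER_CELL):
    """Factor by which the recovered RNA falls short of `required_ug`."""
    return required_ug / total_rna_ug(cell_count, rna_pg_per_cell)


# Sample sizes reported for combined microdissection and microarray studies.
for cells in (10_000, 100_000):
    yield_ug = total_rna_ug(cells)
    fold = fold_amplification_needed(cells, required_ug=5.0)
    print(f"{cells} cells: ~{yield_ug:.1f} ug RNA, ~{fold:.0f}-fold short of 5 ug")
```

Under this assumption, samples of 10^4 to 10^5 cells would yield on the order of 0.1 to 1 µg of RNA, which suggests why amplification is often required before hybridization.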
Although some biological issues related to gene expression may be complicated
by the presence of heterogeneous cell populations in studied samples, it is
also true that some biological conditions can be understood only in the context
of these heterogeneous cell populations. The nature of global gene expression
experiments is to uncover differences between 2 biological samples, including
those differences based on diverse cell populations. For example, to appreciate
a disease that is characterized by an inflammatory infiltrate, it must be
understood that the inflammatory infiltrate is part of the disease and is
part of the difference between diseased and nondiseased tissue. Thus, the
isolation of specific cell populations for study is not necessarily required
or even desirable in all instances.
Making microarrays comparable
Ideally, microarray experiments should be comparable both within and
between laboratory or manufacturing systems, but obtaining consistent and
comparable data is a critical challenge for microarray-based expression analysis.
Major sources for the observed variability of microarray data include the
normal physiological gene expression variations in different samples and the
noise introduced in the microarray assay process.42
Physiological Gene Expression Variation
Inextricably linked to the issues of patient and sample variability
and tissue heterogeneity discussed above is the problem of normal gene expression
variations and how to distinguish these variations from significant disease-associated
changes. Few studies have systematically investigated physiological expression
changes, but data from in situ hybridizations suggest that normal variance
for many tightly regulated tissue-specific genes can be within 20% to 30%.42 However, random fluctuations of as much as 2- to 4-fold
have been observed for many genes in yeast.43,44
Affymetrix (Santa Clara, Calif) guidelines have suggested that for most of
the "housekeeping" genes in human tissues, which are likely to be less tightly
regulated, differences of less than 4-fold are probably not biologically significant.45 Consequently, a significant portion of microarray
data variability for high- or medium-abundance mRNAs may simply be due to
normal expression variations. Several previous studies have used arbitrary
2-fold change criteria to define significant expression change.46
However, the 2-fold threshold has been shown to be statistically invalid even
for duplicate experiments.46 In a recent study
that used cDNA microarrays to profile gene expression in samples of normal
skin from breast-reduction surgeries, 71 of 4400 genes were found to demonstrate
variability in expression greater than 2 SDs from the mean of each gene.47 These included genes coding for transport proteins,
proteins involved in gene transcription, cell-signaling proteins, and cell-surface proteins. Thus,
physiological variation should be considered in the analysis and interpretation
of microarray data. More stringent criteria for defining significant expression
change may be useful.
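One way to make a more stringent criterion concrete is to judge a change against the spread observed among normal samples rather than against a fixed fold change. The following is a minimal sketch under that assumption. It is not the method of the studies cited above, and the function names, the 2-SD cutoff on log2 values, and the intensities are all hypothetical.

```python
import math
import statistics


def exceeds_fold(normal_values, test_value, fold=2.0):
    """Fixed criterion: test value differs from the normal mean by >= `fold`."""
    baseline = statistics.mean(normal_values)
    return abs(math.log2(test_value / baseline)) >= math.log2(fold)


def exceeds_normal_range(normal_values, test_value, n_sd=2.0):
    """Variation-aware criterion: the test value lies more than `n_sd` sample
    standard deviations from the mean of the normal samples (log2 scale)."""
    logs = [math.log2(v) for v in normal_values]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)
    return abs(math.log2(test_value) - mean) > n_sd * sd


# Hypothetical intensities for two genes across five normal samples.
variable_gene = [100.0, 400.0, 150.0, 600.0, 250.0]  # loosely regulated
stable_gene = [100.0, 105.0, 95.0, 102.0, 98.0]      # tightly regulated

# A >2-fold change in the variable gene stays within its normal spread.
print(exceeds_fold(variable_gene, 650.0), exceeds_normal_range(variable_gene, 650.0))
# A 1.6-fold change in the stable gene falls well outside its normal spread.
print(exceeds_fold(stable_gene, 160.0), exceeds_normal_range(stable_gene, 160.0))
```

The two criteria disagree in both directions here, which illustrates why a single fold-change cutoff applied to all genes can both overcall and undercall changes.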
Noise in the Microarray Assay Process
For the tightly regulated (mostly low abundance) mRNA species, inconsistencies
introduced at any stage of the microarray-based assay process may play a major
role in data variability.42 Due to the miniaturization
and the large number of genes involved, it is difficult to maintain consistent
processing conditions for each sequence across multiple assays, and obtaining
accurate absolute signals is unlikely.42 Noise
may be introduced by slide heterogeneities, printing irregularities (eg, pin-to-pin
variations), and spotting volume fluctuations.48
Some of the systematic variations may be reduced by the inclusion of controls,
but random fluctuation at various manufacturing stages cannot be completely
controlled and can accumulate quickly in a complicated assay.48
In certain microarray systems, 2 samples are competitively hybridized
to 1 array using different fluors for labeling. In other systems, there is
only single-sample hybridization. A 2-color system might be expected to be
more reliable since variations in spot size or amount of cDNA probe on the
chip should not affect the signal ratio (both signals are derived from the
same spot). However, this only holds true if signals are well above the background
in both detection channels.42 In fact, the
signal level for most of the tightly regulated genes will likely be close
to the background level.42 In addition, background
level on a slide can also vary significantly from spot to spot due to factors
such as unevenness in slide surface properties, dust contamination, and incomplete
washing, leading to high levels of signal variability for low-abundance mRNA
species even in 2-color systems.42
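The dependence of the 2-color ratio on signal strength can be illustrated numerically. In the sketch below, all intensities, the background level, and the function names are hypothetical. It assumes a simple additive background and a spot whose red-labeled target is truly twice as abundant as its green-labeled target.

```python
TRUE_BACKGROUND = 100.0  # hypothetical additive background in both channels


def spot_signals(green_true, spot_amount=1.0, fold=2.0):
    """Raw (red, green) intensities for one spot.

    `spot_amount` scales both channels equally, mimicking variation in spot
    size or in the amount of probe deposited on the chip.
    """
    red_raw = fold * green_true * spot_amount + TRUE_BACKGROUND
    green_raw = green_true * spot_amount + TRUE_BACKGROUND
    return red_raw, green_raw


def corrected_ratio(red_raw, green_raw, red_bg, green_bg):
    """Red/green ratio after subtracting the estimated backgrounds."""
    return (red_raw - red_bg) / (green_raw - green_bg)


# With the background known exactly, spot amount cancels out of the ratio.
for amount in (1.0, 0.5):
    red, green = spot_signals(5000.0, spot_amount=amount)
    print("high signal, spot amount", amount, "->",
          corrected_ratio(red, green, 100.0, 100.0))

# Overestimate the green background by 20 units (120 instead of 100).
for label, green_true in (("high signal", 5000.0), ("near background", 50.0)):
    red, green = spot_signals(green_true)
    print(label, "->", round(corrected_ratio(red, green, 100.0, 120.0), 2))
```

The same small background error barely shifts the ratio for the abundant transcript but distorts it substantially for the transcript near background, consistent with the caveat described above.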
The high levels of variability of microarray data also mean that subtle
changes in experimental conditions may significantly alter the results, making
it difficult for separate laboratories to compare experimental data. In addition,
the lack of standard controls, the predominant use of relative signals (ratios),
and the adoption of incompatible data formats contribute to poor comparability
between studies.42
Despite the hard-wired variability introduced by chip manufacturing
conditions, most of the published studies to date using microarray-based expression
analysis include only limited numbers of replicates.49
In fact, many studies conduct the experiment only once. Considering the potential
sources of assay variation, the need for sufficiently replicated studies is
underscored.49
Microarray Data Normalization
Because microarray data are variable, both single-sample arrays and
2-color arrays that are to be analyzed further must be brought onto the
same scale before 2 or more arrays can be compared. How to perform this normalization
of gene expression levels across multiple arrays, thus removing systematic
variation between the arrays and rendering different experiments comparable,
remains an issue that is not yet fully resolved.50
Many of the early microarray studies in the literature simply ignored this
issue. A more statistically rigorous approach is needed.
One difficulty has been that leading microarray manufacturers have not
published statistical error models for their products. Thus, it is unclear to users
how much to adjust data for variations in spot intensity, hybridization patterns,
and intensity measurement sensitivity. Software does exist to allow for array-to-array
comparisons by using a scaling factor to normalize gene expression patterns
across arrays. However, in general, these algorithms assume that intensity
differences between arrays are linearly related with a zero y-intercept.51 This assumption allows software to trim the tails
off distributions of expression from different arrays at statistical cutoffs
and then simply move the distributions along an axis to a common level to
provide comparisons. However, this linear relationship often does not hold
true.51 When the average expression level of
1 array is higher than that of a second array, a longer tail will be trimmed
from the second array. Thus, a greater number of genes from the first array
will be counted as being expressed because their expression level is above
the statistical cutoff point. In this case, the 2 arrays cannot be considered
comparable.
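The scaling-factor approach and its underlying assumption can be sketched in a few lines. This is an illustrative reconstruction and not any vendor's algorithm: the function names, the trimmed-mean choice, and the intensities are assumptions. It shows that a single multiplicative factor recovers the reference array when the relationship is linear through the origin, but leaves systematic residuals when there is a nonzero intercept.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean after discarding `trim_fraction` of values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


def scale_to_reference(array, reference, trim_fraction=0.02):
    """Multiply `array` by one factor so its trimmed mean matches `reference`."""
    factor = (trimmed_mean(reference, trim_fraction)
              / trimmed_mean(array, trim_fraction))
    return [value * factor for value in array], factor


reference = [100.0, 200.0, 300.0, 400.0]

# Case 1: second array is exactly 2 x reference (linear, zero intercept).
linear_array = [2.0 * v for v in reference]
scaled_linear, factor_linear = scale_to_reference(linear_array, reference)
print(factor_linear, scaled_linear)

# Case 2: second array is 2 x reference + 100 (nonzero intercept).
offset_array = [2.0 * v + 100.0 for v in reference]
scaled_offset, factor_offset = scale_to_reference(offset_array, reference)
residuals = [round(s - r, 1) for s, r in zip(scaled_offset, reference)]
print(round(factor_offset, 3), residuals)
```

In the second case the array means agree after scaling, yet low intensities are inflated and high intensities deflated, so the arrays are matched on average without being comparable gene by gene.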
Although bioinformatic software has recently been developed that offers
more statistically robust normalization, the cost of these commercially available
programs (combined with the already expensive microarrays) has been prohibitive
for many researchers.50 Standardization of
these processes awaits the development of improved methods of normalization
leading to valid statistical models widely available to all researchers.
To this end, Schadt et al51 have developed
a standard nonlinear curve technique for normalizing the data in arrays that
do not demonstrate a linear relationship between data sets. This model performs
well when the 2 samples being compared demonstrate a low number of differentially
expressed genes. However, when expression profiles of 2 samples vary to a
greater extent, Schadt et al51 recommend a
rank-selection method. Using this method, genes expressed on an array are
ranked from highest to lowest level of expression. Then, for the array expressing
a greater number of genes, the genes with the lowest expression levels are
removed from the list until the 2 arrays list a comparable number of expressed
genes. This type of rank-selection method has gained support from other groups,
but it too has limitations.50 Removing low-expression
level data points restricts the study to the more extreme and easily detected
entities, a technique that blunts the genomic-scale potential of microarray
technology.
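The rank-selection step as described in the preceding paragraph can be sketched as follows. This follows the verbal description only and should not be read as the published algorithm of Schadt et al; the function name, the data layout, and the expression values are hypothetical.

```python
def rank_select(expressed_a, expressed_b):
    """Trim two arrays to a comparable number of expressed genes.

    Each argument maps gene name -> expression level for the genes called
    "expressed" on that array. Genes are ranked from highest to lowest level,
    and the lowest-ranked genes are dropped from the longer list until both
    lists have the same length.
    """
    n = min(len(expressed_a), len(expressed_b))

    def top_ranked(expressed):
        ranked = sorted(expressed.items(), key=lambda item: item[1], reverse=True)
        return dict(ranked[:n])

    return top_ranked(expressed_a), top_ranked(expressed_b)


# Hypothetical example: array A calls 5 genes expressed, array B calls 3.
array_a = {"g1": 900.0, "g2": 500.0, "g3": 120.0, "g4": 60.0, "g5": 40.0}
array_b = {"g1": 800.0, "g2": 450.0, "g3": 100.0}

selected_a, selected_b = rank_select(array_a, array_b)
print(sorted(selected_a), sorted(selected_b))
```

The genes discarded from array A are exactly its lowest-expressed ones, which makes the limitation noted above visible: weakly expressed transcripts are the first to be excluded from the comparison.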
Efforts continue to improve comparability between arrays. Jones50 recently applied a statistical model to normalize
spotted cDNA array data that takes into account not only the differences in
numbers of genes expressed between arrays, but also the interarray variations
in fluorescent dye intensity and mechanical error occurring in the printing
process. Nevertheless, the issue of how to properly normalize array data has
not been settled. Researchers must continue to demand statistical rigor in
their comparisons before they can trust the mathematical results derived
from their data.
Microarrays deliver massive amounts of data on tens of thousands of
genes. The result is an immense quantity of biological information that must
be analyzed, presented, and archived in a meaningful way.
In human studies, the number of hybridizations that can be performed
for any set of experimental conditions is often restricted by the limited
number of obtainable tissue samples and by the expense of arrays. Restricted
numbers of hybridizations for each experiment hamper the ability to assess
the biological significance of variation within or between given sets of conditions.
Thus, for the assessment of thousands of genes in a setting of limited hybridizations,
the importance of reliable and sophisticated algorithms for data analysis
becomes amplified.51
A logical beginning is to examine the extremes, that is, genes with
significant differential expression in individual samples. For example, a
comparison of 2 samples can be visualized in the form of a simple bivariate
scatterplot in which the expression profile of 1 sample (x-axis) is plotted
against that of the second sample (y-axis). The distribution pattern generally
demonstrates that the expression ratios cluster around the line in which x
is equal to y (indicating comparable levels), with individual genes falling
varying distances from this line. Additional lines can be placed on the scatterplot
to represent various fold changes of expression. Data points that fall above
or below these lines represent genes exhibiting expression ratios greater
or less than the specified fold change. Thus, one can begin by examining those
genes that demonstrate a 10-fold or greater change in expression level. To
expand the number of genes under investigation, one can examine genes that
demonstrate a 5-fold or greater change, or a 3-fold or greater change, and
so forth. Many studies define a 2-fold or greater change in expression level
to represent significant differential expression. The 2-fold threshold, however,
as noted above, has been shown to be statistically invalid.46
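The stepwise widening of the fold-change threshold can be sketched programmatically. The example below is illustrative only: the gene names and intensities are invented, the function name is hypothetical, and the `floor` parameter (which guards against taking the logarithm of zero or negative intensities) is an assumption not drawn from the studies discussed.

```python
import math


def fold_change_hits(sample_x, sample_y, fold, floor=1.0):
    """Return genes whose expression differs by at least `fold` between samples.

    sample_x and sample_y map gene name -> intensity. The result maps each
    selected gene to its log2(y / x) ratio, so the sign gives the direction.
    """
    threshold = math.log2(fold)
    hits = {}
    for gene in sample_x.keys() & sample_y.keys():
        x = max(sample_x[gene], floor)
        y = max(sample_y[gene], floor)
        log_ratio = math.log2(y / x)
        # Points beyond the fold-change lines on the scatterplot satisfy this.
        if abs(log_ratio) >= threshold:
            hits[gene] = log_ratio
    return hits


# Hypothetical intensities for the two samples being compared.
sample_x = {"geneA": 100.0, "geneB": 200.0, "geneC": 50.0, "geneD": 400.0}
sample_y = {"geneA": 1100.0, "geneB": 210.0, "geneC": 160.0, "geneD": 75.0}

# Start with the extremes, then relax the threshold to widen the gene set.
for fold in (10, 5, 3):
    print(f"{fold}-fold:", sorted(fold_change_hits(sample_x, sample_y, fold)))
```

Working on the log scale treats increases and decreases symmetrically, which matches the pair of lines drawn above and below the line of equality on the scatterplot.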
Although this simple technique can be efficient and effective for focusing
on expanding sets of differentially expressed sequences, again, such an analysis
does not take advantage of the full potential of genome-scale experiments
to enhance our comprehension of cellular biology that would be provided by
an inclusive analysis of the entire repertoire of transcripts in a cell as
it goes through a biological process.52 A more
holistic approach, which allows the deciphering of patterns from the entire
data set, is needed.
Data Organization and Presentation
Statistical algorithms can be applied to detect and extract patterns
within profiling data. It is a basic assumption of many gene expression studies
that knowledge of where and when a gene is expressed provides information
about the function of the gene. Therefore, an important beginning is to organize
genes on the basis of similarities in their expression profiles.53
However, even this basic tenet deserves critical consideration. Similarity
of gene expression profile does not mandate similarity of function or mechanistic
pathway, and it may be purely coincidental. Nevertheless, the idea of clustering
genes on the basis of their expression patterns is well established and cluster
analysis has become the most widely used statistical technique applied to
large-scale gene expression data.52
Although various cluster methods can usefully organize tables of gene
expression measurements, the resulting ordered but still massive collection
of numbers remains difficult to assimilate. Thus, another important component
of genome-wide expression data exploration is the development of powerful
data visualization methods and tools. Approaches have been developed that
present clustering results in simple graphical displays such as dendrograms,
which represent relationships among genes by a tree whose branch lengths reflect
the degree of similarity in expression between the genes. Similarity is mathematically
defined.54 The computed trees can be used to
order genes in the original data table such that genes or groups of genes
with similar expression level patterns are placed adjacently. Clustering methods
can also be combined with representation of each data point with a color that
quantitatively and qualitatively reflects the original experimental observations.52 Visual assimilation is then more intuitive.
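A small example may help show how a clustering tree yields both branch lengths and a gene ordering. The sketch below uses average-linkage agglomerative clustering with a distance of 1 minus the Pearson correlation, which is one common choice and is an assumption here rather than the specific method of the cited work. The function names and expression profiles are hypothetical, and profiles with no variation across conditions are not handled.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length profiles (must not be constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)


def cluster_genes(profiles):
    """Average-linkage clustering of genes by expression profile.

    Returns (merges, leaf_order). Each merge is (members_a, members_b,
    distance); the distance plays the role of branch height in a dendrogram.
    leaf_order lists genes so that similar profiles sit next to each other.
    """
    genes = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in genes for b in genes if a != b}
    clusters = [(g,) for g in genes]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[(a, b)] for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)  # average linkage
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges, list(clusters[0])


# Hypothetical profiles across four conditions: two rising, two falling genes.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g4": [8.0, 6.5, 4.0, 2.0],
}
merges, order = cluster_genes(profiles)
for members_a, members_b, distance in merges:
    print(members_a, "+", members_b, "at distance", round(distance, 3))
print("row order for the data table:", order)
```

Reordering the rows of the expression table by `order` places co-regulated genes adjacently, which is the step that makes a color-coded display readable.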
Data Archiving and Mining
Ultimately, successful interpretation of gene profiling studies is likely
to be dependent on the integration of experimental data with external information
resources. As multiple experiments involving multiple cell types and tissues
from multiple laboratory groups accumulate, data archiving may well become
the watershed issue. Ideally, all data, in a suitably standardized form, would
be freely accessible in the public domain. Even assuming a willingness to
share the data, such utopian goals would require a user-friendly and powerful
database system and standardization of correction and normalization procedures
such that data points from various projects become comparable.55
The National Center for Biotechnology Information Entrez system (http://www.ncbi.nlm.nih.gov/Entrez/) does provide useful data in this regard, but current databases may
be limited in scope or computability.53 A major
focus of infrastructure development to support genomic-scale expression studies
will need to be in the area of electronic biological pathway databases and
resources.
The development of more sophisticated analytical algorithms and databases
will help lend credence to the biological significance of differential gene
expression determined by microarray analysis. In the meantime, several studies
have begun to examine the sensitivity and specificity of microarray-based
experiments. Sensitivity, defined as the minimum reproducible signal detected
by a given array scanning system, has been reported for microarrays to be
approximately 10 mRNA copies per cell, which is slightly inferior to the sensitivity
of Northern blot analyses.56,57
Specificity studies showed that for a given probe any nontarget transcripts
with more than 75% sequence similarity may show cross-hybridization.56 The problem of clone misidentification and the need
for clone confirmation have also been addressed.58
One study found that of 1189 bacterial stock cultures, only 62.2% were uncontaminated
and contained cDNA inserts that had significant sequence identity with published
data for the ordered clones.59 Thus, the use
of sequence-verified clones for cDNA microarray construction is warranted.
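To make the 75% similarity figure concrete, the sketch below computes percent identity between a probe and an equal-length, already aligned stretch of a nontarget transcript and flags pairs above the threshold. The sequences and function names are hypothetical, and the ungapped position-by-position comparison is a simplifying assumption; in practice, similarity is assessed with alignment tools.

```python
def percent_identity(probe, target):
    """Percent of identical bases in two pre-aligned, equal-length sequences."""
    if len(probe) != len(target):
        raise ValueError("sequences must be the same length (ungapped)")
    matches = sum(1 for a, b in zip(probe.upper(), target.upper()) if a == b)
    return 100.0 * matches / len(probe)


def may_cross_hybridize(probe, target, threshold=75.0):
    """Flag a nontarget sequence whose identity to the probe exceeds the threshold."""
    return percent_identity(probe, target) > threshold


probe = "ATGCGTACGTTAGCCTAGGA"      # hypothetical 20-base probe
nontarget = "ATGCGTACGTTAGCATCGGA"  # differs from the probe at 2 positions

print(percent_identity(probe, nontarget))     # 18 of 20 bases match
print(may_cross_hybridize(probe, nontarget))
```

A check of this kind could be run against every known transcript at probe design time, so that probes likely to pick up nontarget signal are redesigned or at least annotated.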
Additionally, potential gene candidates can be assessed for relevance
to disease using parallel technologies. Several such alternative platforms
have been used to bolster the importance of specific sequences first suggested
in gene chip comparisons including (1) methods at the RNA level, (2) methods
at the protein level, and (3) functional studies.
Methods at the RNA Level
Reverse transcriptase polymerase chain reaction (RT-PCR) is a method
often used to verify microarray data. Although RT-PCR is not well suited to
quantitation, the relative technological ease of this assay and the ability
to rapidly monitor multiple samples make it a useful technology.60,61
Hybridization data can be verified and multiple putative markers can be screened
in a short period.
Several other studies have used real-time quantitative RT-PCR (TaqMan,
PE Applied Biosystems, Foster City, Calif).15,62
Real-time PCR is a technique that increases the quantitative ability of RT-PCR
by providing accurate and reproducible information on RNA copy number (Figure 3).
In this method, a fluorogenic probe (labeled at the 5′ end with a reporter
fluorochrome and at the 3′ end with a quencher fluorochrome) is annealed to
1 strand of the target cDNA sequence between the forward and reverse PCR
primers. As Taq polymerase extends the forward primer, its intrinsic 5′ to
3′ nuclease activity
displaces and degrades the dual-labeled probe, releasing the reporter fluorochrome
from the quencher label and allowing the detection of a fluorescent signal
that is proportional to the amount of PCR product generated in each cycle.63
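The link between signal and starting copy number can be illustrated with an idealized amplification model in which the product grows by a factor of (1 + E) per cycle. The sketch below assumes a constant efficiency E and an arbitrary detection threshold; both are hypothetical simplifications that are not taken from the cited studies. It shows why a sample with more starting template reaches a given signal level at an earlier cycle.

```python
from math import log


def product_after(n0, cycles, efficiency=1.0):
    """Idealized amount of PCR product after a given number of cycles."""
    return n0 * (1.0 + efficiency) ** cycles


def threshold_cycle(n0, threshold, efficiency=1.0):
    """Fractional cycle at which the product reaches the detection threshold."""
    return log(threshold / n0) / log(1.0 + efficiency)


# Two hypothetical samples that differ 8-fold in starting template.
low = threshold_cycle(n0=100, threshold=1e9)
high = threshold_cycle(n0=800, threshold=1e9)

# With perfect doubling (E = 1), an 8-fold difference in template shifts
# the threshold cycle by log2(8) = 3 cycles.
print(round(low - high, 6))
```

Real reactions deviate from perfect doubling, so quantitative work typically calibrates against standards of known copy number rather than assuming E = 1.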
Northern blot analysis is also commonly used as a confirmational technique
because it is a standard, specific, and semiquantitative method.15,57,61
For mRNA expressed at moderate-to-high levels, and for which cDNA probes are
available, Northern blot analysis works well, but it is not well suited for
low-copy mRNA.64,65 Furthermore,
only a small number of genes can be analyzed with this conventional method.
Methods at the Protein Level
DNA microarray technology is limited to the study of gene expression
at the mRNA level. However, it has been established that mRNA levels do not
necessarily correlate with protein levels. Moreover, the level of expression
or even presence of a protein is not tightly linked to physiological consequences.
An investigation conducted by Winzeler et al,66
for example, provides a cautionary tale. Their study demonstrated that genes
upregulated in yeast growing in minimal medium did not prove to be more important
for growth than genes that were not upregulated.33
They found only 2 of 8 genes required for yeast growth in minimal medium to
be induced. The lesson to be learned is that genes that are not differentially
expressed may be of equal functional importance in disease states compared
with those that are.
Furthermore, the regulation of some genes may be at the translational
rather than the transcriptional level, which would preclude detection by DNA
microarrays. Posttranslational modification of proteins is also an important
mode of regulation that cannot be detected by DNA microarrays. Protein activity,
particularly receptor activity, is heavily dependent on phosphorylation, for
example. DNA and mRNA reveal nothing about whether a given protein is active,
and can be deceptive when used to speculate about quantities of proteins.
It has been demonstrated that the correlation between mRNA and protein abundance
is less than 0.5,67 emphasizing that ideally,
mRNA expression studies should be accompanied by analyses at the protein level.39 Radioimmunoassay and immunohistochemistry have been
used in a number of studies.15,68,69
These techniques, however, are not well suited to detecting low levels of
expression, and they require the availability of an antibody specific for
the protein to be studied.
The field of proteomics, the large-scale parallel analysis of the proteins
that are present in a cell, is developing rapidly, but has problems of its
own. Proteins vary in abundance by many orders of magnitude within a given
cell, and there is no PCR equivalent for the amplification of proteins. Moreover,
proteins fold in many known (and unknown) ways that affect their function.
The feasibility of the microarray analysis of proteins has begun to be explored.
Antibodies attached to microarrays can be used to bind to and quantitatively
detect proteins that have been tagged with fluorescent dyes.70
Skeptics doubt the plausibility of identifying thousands of unknown proteins
in this manner.70 The diverse chemistry of
various proteins poses serious difficulties, and it will be challenging to
find antibodies for every protein. Thus, although it is important to incorporate
protein analyses into expression profiling studies, current platforms are
technically limiting.
Functional Studies
Confirming the role of a gene initially identified in a microarray experiment
in animal models with transgene or knockout studies provides a particularly
powerful alternative platform. Transcript function, rather than mere presence,
is addressed. However, this approach is ill-suited for high-throughput conditions.
It may be ideal for an in-depth investigation of 1 or 2 genes of interest,
but it is not practical for confirming large quantities of profiling data.
Confirmational studies are useful to corroborate the biological significance
of differential gene expression determined by microarray analysis. While improved
databases and more reliable statistical models will help to lend greater authority
to array data, alternative platforms can be used to assess the relevance of
genes first identified by array comparisons. It should be realized, however,
that the alternative technologies are not intended for large-scale analyses.
Realistically, only selected sequences from the array data can be confirmed
with other platforms in the short term, which is a retreat from the initial
purpose of the genome-scale investigation by microarray.
Microarrays can be expected to prove extremely valuable as tools for
the study of the genetic basis of complex diseases. The ability to measure
expression profiles across entire genomes provides a level of information
not previously attainable. Although complicated issues must be resolved, the
potential payoff is large. Microarrays make it possible to investigate differential
gene expression in normal vs diseased tissue, in treated vs nontreated tissue,
and in different stages during the natural course of a disease, all on a genomic
scale. Gene expression profiles may help to unlock the molecular basis of
phenotype, response to treatment, and heterogeneity of disease. They may also
help to define patterns of expression that will aid in diagnosis as well as
define susceptibility loci that may lead to the identification of individuals
at risk. Finally, as specific genes are identified and their functional roles
in the development and course of disease are characterized, new targets for
therapy should be identified.
Despite the problems of defining normal, understanding tissue heterogeneity,
making arrays comparable, analyzing and archiving massive quantities of data,
and performing confirmational studies in alternative platforms, expression
profiling with microarrays stands as a truly revolutionary technology. As
we continue to delve into the possibilities, we will surely progress in our
understanding of current issues and complications. No doubt the ride on the
high-throughput highway will be exhilarating.
1. Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science. 2001;291:1304-1351.
4. Liang P, Pardee AB. Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science. 1992;257:967-971.
5. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science. 1995;270:484-487.
6. Lockhart DJ, Dong H, Byrne MC, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996;14:1675-1680.
7. Strachan T, Abitbol M, Davidson D, Beckmann JS. A new dimension for the human genome project. Nat Genet. 1997;16:126-132.
8. Nishimoto IN, Hanaoka T, Sugimura H, et al. Cytochrome P450 2E1 polymorphism in gastric cancer in Brazil. Cancer Epidemiol Biomarkers Prev. 2000;9:675-680.
9. Furuya KN, Gebhardt R, Schuetz EG, Schuetz JD. Isolation of rat pgp3 cDNA. Biochim Biophys Acta. 1994;1219:636-644.
10. Deng G, Lu Y, Zlotnikov G, et al. Loss of heterozygosity in normal tissue adjacent to breast carcinomas. Science. 1996;274:2057-2059.
11. Zhuang Z, Vortmeyer AO, Mark EJ, et al. Barrett's esophagus. Cancer Res. 1996;56:1961-1964.
12. Hung J, Kishimoto Y, Sugio K, et al. Allele-specific chromosome 3p deletions occur at an early stage in the pathogenesis of lung carcinoma. JAMA. 1995;273:558-563. [published correction appears in JAMA. 1995;273:1908].
13. Shimada S, Shiomori K, Tashima S, et al. Frequent p53 mutation in brain (fetal)-type glycogen phosphorylase positive foci adjacent to human 'de novo' colorectal carcinomas. Br J Cancer. 2001;84:1497-1504.
14. Cole KA, Krizman DB, Emmert-Buck MR. The genetics of cancer—a 3D model. Nat Genet. 1999;21:38-41.
15. Xu J, Stolk JA, Zhang X, et al. Identification of differentially expressed genes in human prostate cancer using subtraction and microarray. Cancer Res. 2000;60:1677-1682.
16. Elek J, Park KH, Narayanan R. Microarray-based expression profiling in prostate tumors. In Vivo. 2000;14:173-182.
17. Chia SJ, Tang WY, Elnatan J, et al. Prostate tumours from an Asian population. Br J Cancer. 2000;83:761-768.
18. Dong M, Nio Y, Tamura K, et al. Ki-ras point mutation and p53 expression in human pancreatic cancer. Cancer Epidemiol Biomarkers Prev. 2000;9:279-284.
19. Pettaway CA. Racial differences in the androgen/androgen receptor pathway in prostate cancer. J Natl Med Assoc. 1999;91:653-660.
20. Hata K, Fujiwaki R, Nakayama K, Miyazaki K. Expression of the endostatin gene in epithelial ovarian cancer. Clin Cancer Res. 2001;7:2405-2409.
21. Saida T. Recent advances in melanoma research. J Dermatol Sci. 2001;26:1-13.
22. Hegde U, Wilson WH. Gene expression profiling of lymphomas. Curr Oncol Rep. 2001;3:243-249.
23. Liu L, Yang K. A study on C-erbB2, nm23 and p53 expressions in epithelial ovarian cancer and their clinical significance [in Chinese]. Zhonghua Fu Chan Ke Za Zhi. 1999;34:101-104.
24. Zhang Z, DuBois RN. Detection of differentially expressed genes in human colon carcinoma cells treated with a selective COX-2 inhibitor. Oncogene. 2001;20:4450-4456.
25. Oguri T, Isobe T, Fujitaka K, et al. Association between expression of the MRP3 gene and exposure to platinum drugs in lung cancer. Int J Cancer. 2001;93:584-589.
26. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer. Science. 1999;286:531-537.
27. Khan J, Wei JS, Ringner M, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001;7:673-679.
28. Zhang H, Yu CY, Singer B, Xiong M. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci U S A. 2001;98:6730-6735.
29. Bittner M, Meltzer P, Chen Y, et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature. 2000;406:536-540.
30. Welsh JB, Zarrinkar PP, Sapinoso LM, et al. Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc Natl Acad Sci U S A. 2001;98:1176-1181.
31. Ross DT, Scherf U, Eisen MB, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000;24:227-235.
32. Stanton LW, Garrard LJ, Damm D, et al. Altered patterns of gene expression in response to myocardial infarction. Circ Res. 2000;86:939-945.
33. Abdellatif M. Leading the way using microarray. Circ Res. 2000;86:919-920.
34. Bonner RF, Emmert-Buck M, Cole K, et al. Laser capture microdissection. Science. 1997;278:1481,1483.
35. Emmert-Buck MR, Bonner RF, Smith PD, et al. Laser capture microdissection. Science. 1996;274:998-1001.
36. Sgroi DC, Teng S, Robinson G, et al. In vivo gene expression profile analysis of human breast cancer progression. Cancer Res. 1999;59:5656-5661.
37. Klimecki WT, Futscher BW, Dalton WS. Effects of ethanol and paraformaldehyde on RNA yield and quality. Biotechniques. 1994;16:1021-1023.
38. Dolter KE, Braman JC. Small-sample total RNA purification. Biotechniques. 2001;30:1358-1361.
39. van Hal NL, Vorst O, van Houwelingen AM, et al. The application of DNA microarrays in gene expression analysis. J Biotechnol. 2000;78:271-280.
40. Burgess JK. Gene expression studies using microarrays. Clin Exp Pharmacol Physiol. 2001;28:321-328.
41. Kitahara O, Furukawa Y, Tanaka T, et al. Alterations of gene expression during colorectal carcinogenesis revealed by cDNA microarrays after laser-capture microdissection of tumor tissues and normal epithelia. Cancer Res. 2001;61:3544-3549.
42. Watson SJ, Meng F, Thompson RC, Akil H. The "chip" as a specific genetic tool. Biol Psychiatry. 2000;48:1147-1156.
43. Cho RJ, Campbell MJ, Winzeler EA, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell. 1998;2:65-73.
44. Klevecz RR, Kauffman SA, Shymko RM. Cellular clocks and oscillators. Int Rev Cytol. 1984;86:97-128.
45. Warrington JA, Nair A, Mahadevappa M, Tsyganskaya M. Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol Genomics. 2000;2:143-147.
46. Claverie JM. Computational methods for the identification of differential and coordinated gene expression. Hum Mol Genet. 1999;8:1821-1832.
47. Cole J, Tsou R, Wallace K, et al. Comparison of normal human skin gene expression using cDNA microarrays. Wound Repair Regen. 2001;9:77-85.
48. Schuchhardt J, Beule D, Malik A, et al. Normalization strategies for cDNA microarrays. Nucleic Acids Res. 2000;28:E47.
49. Lee ML, Kuo FC, Whitmore GA, Sklar J. Importance of replication in microarray gene expression studies. Proc Natl Acad Sci U S A. 2000;97:9834-9839.
51. Schadt EE, Li C, Su C, Wong WH. Analyzing high-density oligonucleotide gene expression array data. J Cell Biochem. 2000;80:192-202.
52. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998;95:14863-14868.
53. Bassett Jr DE, Eisen MB, Boguski MS. Gene expression informatics—it's all in your mine. Nat Genet. 1999;21:51-55.
54. Alon U, Barkai N, Notterman DA, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A. 1999;96:6745-6750.
55. Granjeaud S, Bertucci F, Jordan BR. Expression profiling. Bioessays. 1999;21:781-790.
56. Kane MD, Jatkoe TA, Stumpf CR, et al. Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 2000;28:4552-4557.
57. Taniguchi M, Miura K, Iwao H, Yamanaka S. Quantitative assessment of DNA microarrays—comparison with Northern blot analyses. Genomics. 2001;71:34-39.
58. Bowtell DD. Options available—from start to finish—for obtaining expression data by microarray. Nat Genet. 1999;21:25-32.
59. Halgren RG, Fielden MR, Fong CJ, Zacharewski TR. Assessment of clone identity and sequence fidelity for 1189 IMAGE cDNA clones. Nucleic Acids Res. 2001;29:582-588.
60. Ichikawa JK, Norris A, Bangera MG, et al. Interaction of Pseudomonas aeruginosa with epithelial cells. Proc Natl Acad Sci U S A. 2000;97:9659-9664.
61. Wang K, Gan L, Jeffery E, et al. Monitoring gene expression profile changes in ovarian carcinomas using cDNA microarray. Gene. 1999;229:101-108.
62. Wong KK, Cheng RS, Mok SC. Identification of differentially expressed genes from ovarian cancer cells by MICROMAX cDNA microarray system. Biotechniques. 2001;30:670-675.
64. Raval P. Qualitative and quantitative determination of mRNA. J Pharmacol Toxicol Methods. 1994;32:125-127.
65. Jung R, Soondrum K, Neumaier M. Quantitative PCR. Clin Chem Lab Med. 2000;38:833-836.
66. Winzeler EA, Shoemaker DD, Astromoff A, et al. Functional characterization of the S cerevisiae genome by gene deletion and parallel analysis. Science. 1999;285:901-906.
68. Shirota Y, Kaneko S, Honda M, et al. Identification of differentially expressed genes in hepatocellular carcinoma with cDNA microarrays. Hepatology. 2001;33:832-840.
69. Storz M, Zepter K, Kamarashev J, et al. Coexpression of CD40 and CD40 ligand in cutaneous T-cell lymphoma (mycosis fungoides). Cancer Res. 2001;61:452-454.
70. Dalton R, Abbott A. Can researchers find recipe for proteins and chips? Nature. 1999;402:718-719.