Context The Internet has become an important tool for finding health information,
especially among adolescents. Many computers have software designed to block
access to Internet pornography. Because pornography-blocking software cannot
perfectly discriminate between pornographic and nonpornographic Web sites,
such products may block access to health information sites, particularly those
related to sexuality.
Objective To quantify the extent to which pornography-blocking software used in
schools and libraries limits access to health information Web sites.
Design and Setting In a simulation of adolescent Internet searching, we compiled search
results from 24 health information searches (n = 3206) and 6 pornography searches
(n = 781). We then classified the content of each site as either health information
(n = 2467), pornography (n = 516), or other (n = 1004). We also compiled a
list of top teen health information sites (n = 586). We then tested 6 blocking
products commonly used in schools and libraries and 1 blocking product used
on home computers, each at 2 or 3 levels of blocking restrictiveness.
Main Outcome Measure Rates of health information and pornography blocking.
Results At the least restrictive blocking setting, configured to block only
pornography, the products blocked a mean of only 1.4% of health information
sites. The differences between blocking products were small (range, 0.6%-2.3%).
However, about 10% of health sites found using some search terms related to
sexuality (eg, safe sex, condoms) and homosexuality (eg, gay) were blocked.
The mean pornography blocking rate was 87% (range, 84%-90%). At moderate settings,
the mean blocking rate was 5% for health information and 90% for pornography.
At the most restrictive settings, health information blocking increased substantially
(24%), but pornography blocking was only slightly higher (91%).
Conclusions Blocking settings have a greater impact than choice of blocking product
on frequency of health information blocking. At their least restrictive settings,
overblocking of general health information poses a relatively minor impediment.
However, searches on some terms related to sexuality led to substantially
more health information blocking. More restrictive blocking configurations
blocked pornography only slightly more, but substantially increased blocking
of health information sites.
The Internet has become an important tool for many individuals with
health concerns,1 especially adolescents.2 Teenagers grapple with sensitive health issues, including
depression, substance abuse, and birth control. Concerns about confidentiality,
accentuated by many teens not yet having their own health provider, make adolescents'
access to information via the Internet particularly important. Given rapidly
expanding Internet access, it is not surprising that more than 70% of 15-
to 17-year-olds say they have used the Internet to look up health information.3 Almost half have researched traditional health topics
such as cancer or diabetes. About 40% of adolescents have searched for information
on a sexual health topic such as pregnancy, birth control, human immunodeficiency
virus/acquired immunodeficiency syndrome, or other sexually transmitted diseases;
1 in 4 have researched problems with drugs or alcohol; 17% have searched for
information on depression or mental illness; and 11% have searched for information
on sexual assault.3
In 2000, the US Congress passed the Children's Internet Protection Act (CIPA)
mandating that schools and libraries install pornography-blocking software
on computers used by minors in order to be eligible for some forms of federal
funding. While the CIPA requirement for libraries was struck down by a circuit
court on the grounds that it violates the First Amendment,4 it
is currently being appealed to the US Supreme Court. Meanwhile, 73% of schools5 and 43% of public libraries6 already
use filters of some kind.
Filtering software intended to limit minors' exposure to pornography
and other controversial material may inadvertently reduce the usefulness of
the Internet as a health information tool for adolescents. Web sites that
address issues of health and sexuality might be particularly susceptible to
erroneous blocking. For example, cases of filters blocking access to breast
cancer sites were widely publicized beginning in 1995, although this particular
error has largely been corrected in recent years.7 The
use of filtering software in public schools and libraries is of special concern,
because adolescents' health concerns often focus on issues related to sexuality,
and because those who do not have computers at home rely on schools and libraries
for Internet access.
Despite the concerns about the potential impact of blocking software
on access to health information, and prolonged and impassioned public debate,
surprisingly little empirical evidence exists regarding blocking errors. Recent
government-commissioned studies in the United States, Europe, and Australia8-10 used methodologies
similar to ours but had smaller samples of health information sites. Furthermore,
most filtering software systems allow administrators to specify blocking configurations,
providing individual schools or libraries with the ability to tailor the blocking
to local community standards. The effect of different configurations on the
accuracy of the blocking systems has not been sufficiently tested.
We developed a computer model to simulate information-seeking by adolescents.
Using this model, we tested the ability of 6 different blocking software packages
commonly used in schools and libraries, as well as 1 product commonly used
on home computers, each under a variety of blocking configurations, to discriminate
between health information Web sites and pornography Web sites.
We simulated adolescent searching and browsing on the Internet to compile
lists of Web sites that adolescents might come across while looking for either
health information or pornography. For the search simulation, trained raters
then classified each of the sites in these lists as health information, pornography,
or other. Finally, we tested each site against 7 blocking products, each configured
at 2 or 3 different levels of blocking restriction, to determine blocking
rates for health information and pornography.
To simulate searches, we submitted search terms to the 6 Internet search
engines that are among the most popular with teens according to data from
a Kaiser Family Foundation survey2: Yahoo,
Google, America Online (AOL), Microsoft Network, Ask Jeeves, and Alta Vista.
To ensure that we had some variety in our list of sites with respect to likelihood
of being blocked, we selected search terms from the following categories:
(1) health topics unrelated to sex (eg, diabetes); (2) health topics involving
sexual body parts, but not sex related (eg, breast cancer); (3) health topics
related to sex (eg, pregnancy prevention); (4) controversial health topics
(eg, abortion); and (5) pornography.
For each of the first 4 categories, we chose 6 frequently used search
terms for health topics relevant to adolescents.2 Frequency
data for each search term were obtained from 2 different search engine logs
of search term use, one from Overture.com11 and
the other from Excite.12 For the fifth category,
we also used the Overture and Excite data to select 6 frequently used search
strings: blowjob, free sex, teen porn, hardcore porn, porn, and XXX.
On May 9, 2002, we ran a custom Java computer program to conduct searches
for the 30 search strings on each of the 6 search engines and to store the
results in a database. The search procedure programmed into this simulation
program was based on data from an observational pilot study during which we
observed 12 teens conducting a total of 69 health information searches. Because
none of the adolescents in the observational study clicked on advertisements or
sponsored links, and they looked past the fourth page of results less than
5% of the time, our Java program also ignored ads and sponsored links and
captured only the first 40 search results from each search. The list of search
results was collapsed into a smaller list of unique uniform resource locators
(URLs), and sites that were not available for classifying or blocking tests
because they were off-line or broken, or for other technical reasons, were
not included in the analysis. We also screened each Web site for automatic
redirect coding, and for most of these sites we were able to follow the redirect
link in our blocking tests. If either the original URL or the redirected destination
was blocked, we considered the site to be blocked.
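As an illustration only, the collapsing and redirect rules described above could be sketched in Python as follows. The original program was written in Java and is not reproduced here; the function names, the input layout, and the `normalize` rule (lowercasing the host and dropping a trailing slash) are assumptions made for this sketch, not details reported in the study.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlsplit

MAX_RESULTS_PER_SEARCH = 40  # only the first 40 results of each search were captured


def normalize(url: str) -> str:
    """Hypothetical normalization: lowercase scheme and host, drop a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def unique_urls(searches: Dict[Tuple[str, str], List[str]]) -> List[str]:
    """Collapse per-search result lists into one list of unique URLs.

    Keys are (search string, search engine) pairs; first-seen order is kept.
    """
    seen: Dict[str, None] = {}
    for results in searches.values():
        for url in results[:MAX_RESULTS_PER_SEARCH]:
            seen.setdefault(normalize(url), None)
    return list(seen)


def site_blocked(original_blocked: bool, redirect_blocked: Optional[bool]) -> bool:
    """A site counts as blocked if the original URL or its redirect target is blocked.

    redirect_blocked is None when the site has no automatic redirect.
    """
    return original_blocked or bool(redirect_blocked)
```

A stricter or looser normalization rule would change the number of URLs treated as duplicates, so the unique-site count depends on that choice.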
Research associates coded the Web sites following a detailed coding
scheme according to whether or not they contained health information and then
by whether or not they were pornographic. The raters explored each site by
reading pages and following links, seeking both health information and pornography.
If no health information was found within 2 minutes, the site was classified
as nonhealth; the same was done for pornography. Any information about topics
that might be discussed in a medical school or school of public health counted
as health information, even if the source or quality of the information was
questionable. Loosely following the definitions of obscenity in US law,13 any text or graphics depicting genitals or a sexual
act and designed to appeal to a prurient interest, and not of an educational
or scientific nature, were considered pornography. Sites that contained both
health information and pornography (n = 14/3987 rated sites) were classified
as pornographic for all analyses.
Two primary raters were each assigned 55% of the sites, and ratings
were done independently. Sites were assigned to raters using a systematic
sampling from the complete list with a random component to ensure that raters
could not know which sites would be rated by the other rater. The 10% overlap
for each allowed us to calculate interrater reliabilities for both the health
information rating (κ = .84) and for the pornography rating (κ
= .92). Primary raters also had the option of not assigning a classification
to a site for which they were unsure of the proper rating. These sites, and
those given 2 different ratings by the 2 primary reviewers, were subsequently
discussed with a third rater and a consensus rating of health, pornography,
or other was assigned.
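For readers unfamiliar with the κ statistic, the following Python sketch shows how Cohen's κ can be computed from two raters' labels on the overlapping sites. It is a generic textbook calculation, not the authors' code; the function name and the label strings in the usage note are hypothetical. In the study, κ was reported separately for the health rating and the pornography rating, so the function would be applied once per rating.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of sites on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa(["health", "health", "other", "other"], ["health", "other", "other", "other"])` gives 0.5: the raters agree on 3 of 4 sites (0.75), and chance agreement from the marginals is 0.5.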
We tested 7 different blocking products (Table 1), 6 of which were products commonly used in schools and
libraries. All 6 of these products allow the network administrator to specify
a custom blocking configuration by specifying topics or categories. The categories
vary from vendor to vendor, though they tend to be roughly comparable. Some
vendors provide one or more default configurations, but vendors have a wide
range of customers, including corporations as well as schools and libraries,
and most vendors were not willing to identify a "typical" school configuration.
Calls to 20 school systems and libraries confirmed wide variability in their
configurations and that none was using a vendor's default setting. We defined
3 configurations for each product, to reflect extreme choices and a middle
position. Our least-restrictive configuration, matching the configuration
used in another recent test,8 was designed
to block only pornography. Our moderately restrictive configuration blocked
pornography as well as a few other categories such as illicit drugs, nudity,
and weapons. It was modeled on the configuration used by one major statewide
school network that blocked fewer categories than some school districts but
more than others. Our most restrictive configuration for each product was
set up to block all topics or categories that plausibly might be blocked in
some school or library. For most products, all categories that the products
offered were blocked except news, health, education, finance, search engine,
and job search sites. The details of our product configurations are available
on the study Web page (http://www.kff.org).
The seventh blocking product we tested was America Online Parental Controls
(AOL PC). At the time of our study, this product, designed primarily for home
use, allowed only 2 configuration options appropriate for teens. Parents could
choose a moderately restrictive setting for mature teens or a very restrictive
setting for young teens. We have chosen not to include AOL PC in the between-product
comparisons. This is partly because AOL is not commonly used in schools and
libraries and partly because the limited configuration options made it impossible
to determine if AOL's blocking was truly comparable to the configurations
that we set for the other products in the study.
Most of the blocking tests were completed immediately after the searches,
on the same day. Due to technical difficulties related to AOL's proprietary
browsing software, the AOL PC blocking test took several days to complete.
Due to errors in the initial runs, CyberPatrol's configurations and 2 of the
configurations each for Symantec (least restrictive and moderately restrictive)
and Websense (moderately restrictive and most restrictive) were rerun about
6 weeks later.
Top Health Sites Recommended for Teens
Our browsing simulation entailed compiling a list of recommended health
information Web sites for teens (n = 633). Two online directories (Yahoo and
Google) were used to determine the most popular and widely recommended health
sites for adolescents. Within these directories, there are several health
categories (eg, Kids and Teens > Health > Drugs and Alcohol). We selected only
those sites for which the category header mentioned teens or youth as well
as health issues related to 1 of our 24 health search terms. These sites were
assumed to be health information sites and were not independently rated. The
sites were compiled from the directories in June 2002.
As a measure of an individual blocking product's tendency to block health
information Web sites, we calculated the percentage of health information
sites that were blocked by each of the blocking products at each blocking
level. The denominator for all of these percentages was the number of unique
health information sites in our list of search simulation results that were
reachable at the time of the blocking test for each product and configuration.
A similar analysis was done for the pornography sites in our search simulation
results and for the recommended health sites list. We also calculated summary
percentage results for all of the blocking products at a given configuration.
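The percentage calculation described above can be sketched in Python as follows. The record layout, the use of `None` to mark a site that was unreachable at test time, and the unweighted mean across products are assumptions made for illustration; the study does not describe its data structures.

```python
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple

# One blocking test: (product, level, blocked). blocked is None when the site
# was unreachable for that test, so it is left out of the denominator.
Record = Tuple[str, str, Optional[bool]]


def blocking_rates(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Percentage of reachable sites blocked, per (product, level)."""
    blocked: Dict[Tuple[str, str], int] = {}
    tested: Dict[Tuple[str, str], int] = {}
    for product, level, result in records:
        if result is None:
            continue  # unreachable sites do not count toward the denominator
        key = (product, level)
        tested[key] = tested.get(key, 0) + 1
        blocked[key] = blocked.get(key, 0) + int(result)
    return {key: 100.0 * blocked[key] / tested[key] for key in tested}


def mean_rate(rates: Dict[Tuple[str, str], float], level: str) -> float:
    """Unweighted mean of the per-product rates at one blocking level."""
    return mean(rate for (_, lvl), rate in rates.items() if lvl == level)
```

The same functions would be run separately on the health information sites, the pornography sites, and the recommended health sites list.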
In order to identify statistically significant differences in product
performance, we used a series of 6 multivariable logistic regressions to calculate
odds ratios and 95% confidence intervals for tendency to block health information
or pornography. Multivariable logistic regression was used because it allowed
us to test statistical significance of product differences without having
to do postestimation adjustments for multiple comparisons. The regression
model also allowed us to examine the effects of factors such as search term
on likelihood of appropriately blocking pornography or inappropriately blocking
health information while controlling for blocking product. Models were estimated
independently at each blocking level and independently for health blocking
and for pornography blocking. The dependent variable in all of these models
was a dichotomous variable representing the results of a single blocking test
by 1 product at 1 blocking level for 1 site, either blocked or not blocked.
The independent variables were dummy dichotomous variables representing the
6 different school and library blocking products.
For each model, we chose the blocking product that performed best (either
least likely to block health information or most likely to block pornography)
as the reference group when specifying the independent variables. This allowed
us to interpret the regression results such that odds ratios significantly
different from 1 indicate that the product performed significantly worse than
the best product in the category. We used Stata v7.0 SE14 for
this analysis and used Stata's "svy" commands, allowing adjustment for clustering
by site. These models were estimated using a pseudo-log likelihood method.
Goodness of fit was tested using the Hosmer-Lemeshow method on all 6 models
unadjusted for clustering, and all had excellent fit across the range of probabilities.
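To make the interpretation of the odds ratios concrete, the sketch below computes a crude odds ratio for one product against the reference product from a 2 x 2 table of counts, with a 95% confidence interval from the standard error of the log odds ratio. This is a simplification and not the analysis reported here: it does not adjust for clustering by site, so its interval would generally differ from the one produced by the survey-adjusted models. The function name and the counts in the usage note are hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_ci(blocked: int, not_blocked: int,
                  ref_blocked: int, ref_not_blocked: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Crude odds ratio (product vs reference) with a Wald-type 95% CI."""
    cells = (blocked, not_blocked, ref_blocked, ref_not_blocked)
    if min(cells) <= 0:
        raise ValueError("all four cell counts must be positive")
    # Odds of blocking for the product divided by odds for the reference.
    odds_ratio = (blocked * ref_not_blocked) / (not_blocked * ref_blocked)
    # Standard error of log(OR) for a 2 x 2 table.
    se_log = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper
```

With hypothetical counts of 20 blocked and 80 not blocked for a product, and 10 blocked and 90 not blocked for the reference, `odds_ratio_ci(20, 80, 10, 90)` returns an odds ratio of 2.25. An interval that includes 1 would mean the product is not shown to differ from the reference.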
We selected 1 of the search terms that resulted in a large number of
blocked health information sites (safe sex) for more
detailed analysis of the content of health information sites (n = 45) that
were blocked by at least 1 product at either the least restrictive or moderately
restrictive settings. A research associate visited each of the sites and summarized
the content in 1 or 2 sentences, with specific attention to content that might
have triggered the blocking software. We then analyzed the summaries to determine
patterns.
Search Simulation Results
Our search simulation yielded a total of 6760 Web sites. After eliminating
duplicate sites (n = 2501) and sites that were unreachable or could not be
included for technical reasons (n = 272), 3987 unique URLs remained. Of these
unique sites, 2467 contained health information and not pornography, 516 contained
pornography, and 1004 were rated as neither health information nor pornography.
Results of the blocking tests on the health information sites are shown
in the first section of Table 2.
Large differences are apparent across the 3 levels of blocking when the 6 comparable
products are considered as a group. At the least restrictive blocking
configuration, the mean blocking rate of health sites was 1.4% (range for
the 6 products, 0.6%-2.3%). The mean blocking rate of pornography sites was
87.2% (range, 84%-90%). As the level of blocking increased from least to moderate
to most restrictive, the frequency of health blocking increased substantially
while the improvement in pornography blocking was small. At moderate blocking
settings, the mean blocking rate of health information sites was 5.2%; at
the most restrictive settings, it was 24%. At the least restrictive configuration,
5% of all health information sites were blocked by at least 1 product. This
compares with 16% of sites for moderate blocking settings and 63% of sites
for the most restrictive settings.
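The "blocked by at least 1 product" figures are larger than the mean per-product rates because different products block different sites. A minimal Python sketch of that calculation, assuming a hypothetical mapping from product name to the set of sites it blocked at one level, is:

```python
from typing import Dict, Set


def pct_blocked_by_any(blocked_by_product: Dict[str, Set[str]],
                       all_sites: Set[str]) -> float:
    """Percentage of sites blocked by at least one product at a given level."""
    if not all_sites:
        raise ValueError("all_sites must not be empty")
    # Union of every product's blocked set, restricted to the tested sites.
    blocked_any = set().union(*blocked_by_product.values()) & all_sites
    return 100.0 * len(blocked_any) / len(all_sites)
```

If two products each block 1 of 4 sites but not the same one, each has a 25% blocking rate while 50% of sites are blocked by at least 1 product.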
There were some statistically significant differences between products,
as summarized in Table 3. In the
least restrictive blocking configuration, Websense was the least likely to
block health information, so Websense became the reference category (odds
ratio, 1). SmartFilter, 8e6, CyberPatrol, and Symantec were all more likely
to block health information than Websense, but N2H2 was not significantly
more likely to block health information than was Websense. Within the margin
of error for our study, Websense and N2H2 are both top products at not blocking
health information at the least restrictive blocking configuration. Across
all 3 blocking levels, N2H2 was the best at blocking pornography.
Overall, for the 24 health search strings only about 1% of search results
were pornography, but the software blocked fewer of these pornography sites
(62%) than those resulting from pornography searches (89%). Adding a dummy
variable for a pornography vs not pornography search in the logistic regression
reported in Table 3 confirmed
that this difference is statistically significant (P<.001).
When comparing health information blocking rates across the 24 different
health search terms, there were some notable differences in performance as
summarized in Table 4. At the
least restrictive setting, where products were supposed to block pornography
only, about 10% of nonpornographic health information sites returned from
searches using the terms safe sex, condom, and gay were blocked, while for most
other searches less than 1% of health sites were blocked. At the moderately
restrictive setting, these search terms again yielded a larger percentage
of health results blocked, as did ecstasy, presumably
because the moderately restrictive setting was supposed to block access to
sites about illegal drugs. At the most restrictive blocking setting, most
strings yielded a health information blocking rate of at least 10%, and half
of the more controversial topics had rates above 40%.
When we tested the blocking products against a list of 633 top health
information sites, we found similar results. After excluding 29 sites that
were unreachable and eliminating duplicates, we ran our blocking test on 586
unique recommended health sites. At the least restrictive blocking setting,
0.5% (range, 0%-1.4%) of recommended teen health information sites were
blocked. This compares with 2.5% (range, 0.9%-8.4%) at the moderately restrictive
blocking settings and 23% (range, 10.9%-39%) at the most restrictive blocking
settings.
What Kinds of Health Sites Were Blocked?
Of the 86 unique health sites resulting from searches using the term safe sex, 28 were blocked by some product at the least
restrictive configuration and 42 were blocked by some product at the moderately
restrictive configuration. Of those blocked at the least restrictive configuration,
the vast majority contained at least moderately specific descriptions of condom
use and/or alternatives to intercourse. Four of these sites contained pictures
and graphic depictions of sexual acts and 2 contained nudity that seemed to
be artistic in nature. Three required users to confirm that they were older
than 18 years before visiting the site. Four sites sold condoms. The additional
health sites blocked at the moderately restrictive configuration did not appear
qualitatively different from those blocked at the least restrictive level,
ie, they did not contain more offers for condoms or more explicit information
on safer sexual practices.
For all 7 of the filtering products we tested, access was blocked to
only a small percentage of health information Web sites when the blocking
configurations were set to the least restrictive settings. With only 1.4%
of health information sites that we tested blocked, a teenager whose access
to a particular health information site is inadvertently blocked will probably
be able to easily find an unblocked site with similar information. This suggests
that filtering software set to block pornography will not necessarily have
a serious impact on access to general health information. Compared with other
factors that may limit teenagers' access to health information when searching
the Internet, including spelling errors, limited search skills, and uneven
quality of search engines, overblocking by filtering software set at the least
restrictive blocking settings poses a relatively minor barrier for most of
the health topics we studied.
However, the blocking rates were noticeably higher for some topics.
For example, with searches on safe sex, almost 10%
of attempts to access health results were blocked, and 33% of health sites
were blocked by at least 1 of the products, even on the least restrictive
setting. More than 20% of attempts were blocked at the moderate setting. These
blocking rates may be enough to make blocking software a serious impediment
to searching for this type of health information. This is particularly concerning
given that 80% of teens identify sexual health as very important.2 The conventional wisdom that the presence of words
mentioning sexual body parts fools blocking software appears not to be true
(no breast cancer search results were blocked at
the least restrictive configuration). There do seem to be patterns, however,
in the types of blocking errors. To the extent that these blocked health information
sites represent errors and not intentional blocking of controversial sites,
further research and product development should be devoted to improving the
ability of products to discriminate between pornography and health information
in sites related to safe sex, condoms, and homosexuality.
We also found that configuration of the products can have a large impact
on access to health information. The moderately restrictive configurations
that we believe approximate many schools' settings led to more than 3 times
as much blocking of health information as the least restrictive, pornography-only
blocking settings. Overall, the most restrictive configurations blocked more
than 17 times as often as the least restrictive configurations. However, these
more restrictive settings led to only slight improvements in blocking of pornography:
their main effect was to block other potentially controversial types of information,
including some types of health information.
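The multiples quoted above follow directly from the mean health information blocking rates given in the Results (1.4%, 5.2%, and 24%). The short Python check below reproduces the arithmetic; the dictionary and function names are illustrative only.

```python
# Mean health information blocking rates (%) as reported in the Results.
MEAN_HEALTH_BLOCKING = {"least": 1.4, "moderate": 5.2, "most": 24.0}


def fold_increase(level: str, baseline: str = "least") -> float:
    """How many times more health blocking occurs at `level` than at `baseline`."""
    return MEAN_HEALTH_BLOCKING[level] / MEAN_HEALTH_BLOCKING[baseline]


moderate_ratio = fold_increase("moderate")  # 5.2 / 1.4, a little under 4
most_ratio = fold_increase("most")          # 24 / 1.4, a little over 17
```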
There may be principled reasons why some schools or libraries choose
to block more than pornography, including some kinds of health information.
These decisions, however, should be viewed as important policy decisions and
not mere technical configuration issues to be left to network administrators.
The choice of configurations should get at least as much public and managerial
scrutiny as the initial decision about whether to install filters at all.
Comparing among the products, the blocking rates for health information
varied by a factor of 2 or more. At the least restrictive settings, for most
health searches the overall blocking rates were small enough that erroneous
blocking was rare for all the products. For more restrictive settings, and
for searches on topics such as safe sex, differences
among products would become more noticeable.
The products each blocked 80% to 90% of the pornography sites at minimal
blocking levels. Health searches generated links to pornography sites only
about 1% of the time, so that accidentally stumbling across pornography while
searching for health information is a rare occurrence, and even rarer with
blocking software. However, it is interesting to note that the filters were
far more effective at blocking pornography sites resulting from pornography
searches than at blocking pornography sites resulting from health searches.
One possible explanation is that the same characteristics, such as particular
text appearing in the content or links to and from other sites, that caused
pornography sites to appear in health search results also caused the blocking
software to classify them incorrectly. We do not know exactly what text content
or link patterns might be the source of the errors, or whether the sites were
deliberately designed to induce such errors.
Some simple industry-wide actions might reduce error rates even further
and aid in product selection and configuration. For example, it would be helpful
if creators of health or pornography sites could provide hints to the vendors
about how the site should be classified. One solution might be the more widespread
use of embedded labels15 or the creation and
use of domain names such as .health and .xxx.7 Conversely,
it would be helpful if vendors informed operators of Web sites about whether
their sites were blocked, so that errors could be identified and corrected
more quickly. This could be accomplished through an electronic clearinghouse,
operated by a nonprofit organization or government agency, where people could
submit a URL and find out immediately whether the site was blocked on any
of the configurations of the major vendors.
Vendors may have commercial reasons for not fully disclosing their blocking
strategies. However, providing the ability to check if a specific URL is blocked
would not require vendors to divulge the trade secrets of their classification
methods or publish their entire blocking lists. Some vendors voluntarily provide
sites allowing users to check for blocking of specific URLs. Legislation or
regulation could mandate vendor participation or provide incentives such as
certifying vendors for government contracts if they allow these blocking checks.
Moreover, if a publisher does find that its site is blocked and feels that
it is a mistake, the software vendor may not be responsive to an inquiry asking
for a reevaluation. One possible solution would be to establish an appeal
process that the vendor would have to respond to within a fixed period of
time. Finally, to aid in product and configuration selection, tests of the
form reported in this article should be conducted on a regular basis, using
a different set of search topics each time.
While the rigorous sampling methods and the large sample size lend weight
to the results, there are several limitations to our study. First, while we
simulated searches on topics that previous surveys indicate interest teenagers,
our simulations were still fairly basic. We did not attempt to model how teenagers
react to the short summary text for each site that a search engine returns,
and how that influences their choices of which links to follow. Similarly,
we did not attempt to model how having some sites blocked would affect the
progress of a search. Second, we made no attempt to rate the quality of health
information or the relevance of health sites to the search topics. Third,
when we counted blocked health information sites, we made no attempt to check
whether alternative sources of the same health information were available
and not blocked. Thus, this study measures the percentages of health and pornography
items that are blocked, but was not designed to give a detailed picture of
how the presence of blocking software would affect the quality of health information
a teenager would find when searching. Fourth, some of the product configurations
were tested at a later date due to technical difficulties. Since the search
results from an earlier date were used, it is possible that the product vendors
had revised their blocking decisions for those URLs, perhaps reducing the
number of blocked health sites or increasing the number of blocked pornography
sites. However, results for product configurations tested later were roughly
consistent with the overall pattern of results, both for individual products
and across products.
Another important limitation of the study is that it focused only on
the categories of pornography and health information. Some individuals may
think that teenagers should be prevented from accessing information on controversial
topics such as condoms, homosexuality, and abortion. Our analysis treated
sites discussing these topics as health information sites. Depending on one's
opinion about accessibility of information on these controversial topics,
the more restrictive blocking rates for health information found in some of
the software configurations may or may not be problematic. While it was fairly
easy to achieve interrater reliability in classifying pornography and health
information, it is less clear what the objective criteria for more controversial
topics would be, and we deferred that to future research. For those who are
interested in rerating our sample, running their own statistics, or simply
examining our ratings, the database is available on the study Web page (http://www.kff.org).
The differences between products were much smaller than the differences
between settings within each product. For general health information searches,
at their least restrictive settings, overblocking by filtering software poses
a relatively minor risk. However, for searches for some sexually related health
information and for homosexuality, the blocking of health information sites
was around 10% even on the least restrictive setting, suggesting that blocking
software is less effective at distinguishing pornography sites from those
discussing these health topics. Moreover, more restrictive blocking configurations
substantially increased health information blocking with only slight improvement
in pornography blocking: the main effect of the more restrictive settings
is to block other categories of controversial material besides pornography.
1. Fox S, Rainie L. The Online Health Care Revolution: How the Web Helps
Americans Take Better Care of Themselves. Washington, DC: Pew Internet and American Life Project; 2000:23.
2. Rideout V. Generation Rx.com: How Young People Use the Internet
for Health Information. Menlo Park, Calif: Henry J. Kaiser Family Foundation; 2001:37.
3. Generation Rx.com Survey. Menlo Park, Calif: Henry J. Kaiser Family Foundation; 2001.
4. American Library Association Inc v United States, 9537 US Dist Lexis (2002 ED PA).
5. Cattagni A, Westat EF. Internet Access in US Public Schools and Classrooms:
1994-2000. Washington, DC: National Center for Education Statistics; 2001:20.
7. Thornburgh D, Lin H. Youth, Pornography, and the Internet. Washington, DC: National Academy Press; 2002.
8. US Department of Justice. Web Content Filtering Software Comparison. Morrisville, NC: eTesting Labs; 2001:18.
9. Greenfield P, Rickwood P, Tran HC. Effectiveness of Internet Filtering Software Products. Sydney, Australia: CSIRO Mathematical and Information Sciences; 2001:90.
12. Jansen B, Spink A, Saracevic T. Real life, real users, and real needs: a study and analysis of user
queries on the web. Information Processing Manage. 2000;36:207-227.
13. Miller v California, 413 US 15 (1973).
14. StataCorp. Stata Statistical Software: Release 7.0 SE. College Station, Tex: Stata Corp; 2001.
15. Resnick P. Filtering information on the Internet. Sci Am. 1997;273:106-108.
16. Curry A, Haycock K. Filtered or unfiltered? Sch Libr J. 2001;1:42-47.