Figure 1. Uniform resource locator (URL) use in the Archives of Dermatology (A), the Journal of the American Academy of Dermatology (B), and The Journal of Investigative Dermatology (C) from 1999 to 2004. Percentages indicate unavailable URLs for each year. The asterisk indicates that data in 2004 were only from January through September.
Figure 2. Dermatology article author responses regarding unavailable uniform resource locators (URLs). CD indicates compact disc. Percentages are based on the denominator of total respondents for each question. Boldface indicates the most frequent response.
Wren JD, Johnson KR, Crockett DM, Heilig LF, Schilling LM, Dellavalle RP. Uniform Resource Locator Decay in Dermatology Journals: Author Attitudes and Preservation Practices. Arch Dermatol. 2006;142(9):1147-1152. doi:10.1001/archderm.142.9.1147
Copyright 2006 American Medical Association. All Rights Reserved. Applicable FARS/DFARS Restrictions Apply to Government Use.
To describe dermatology journal uniform resource locator (URL) use and persistence and to better understand the level of control and awareness of authors regarding the availability of the URLs they cite.
Software was written to automatically access URLs in articles published between January 1, 1999, and September 30, 2004, in the 3 dermatology journals with the highest scientific impact. Authors of publications with unavailable URLs were surveyed regarding URL content, availability, and preservation.
Main Outcome Measures
Uniform resource locator use and persistence and author opinions and practices.
The percentage of articles containing at least 1 URL increased from 2.3% in 1999 to 13.5% in 2004. Of the 1113 URLs, 81.7% were available (decreasing with time since publication from 89.1% of 2004 URLs to 65.4% of 1999 URLs) (P<.001). Uniform resource locator unavailability was highest in The Journal of Investigative Dermatology (22.1%) and lowest in the Archives of Dermatology (14.8%) (P=.03). Some content was partially recoverable via the Internet Archive for 120 of the 204 unavailable URLs. Most authors (55.2%) agreed that the unavailable URL content was important to the publication, but few controlled URL availability personally (5%) or with the help of others (employees, colleagues, and friends) (6.7%).
Uniform resource locators are increasingly used and lost in dermatology journals. Loss will continue until better preservation policies are adopted.
Approximately 80% of dermatologists with Internet access use the Internet for medical updating and professional purposes.1 Locating online health information, however, can be problematic because of the inconstant nature of Internet addresses, also known as uniform resource locators (URLs).2-7 The continual flux of information on the Internet is reflected in the changing content and disappearance of URLs, which may become unavailable because of changes in Web site organization, hardware reconfiguration, and file renaming.8
Previous studies2,4,5,7,9 examined the loss of cited URLs in journals encompassing multiple academic disciplines. Unlike previous estimates of URL use and availability, this study used an automated program to examine many full-text publications. To our knowledge, this is also the first study to survey authors with unavailable URLs regarding URL content and preservation.
All online publications from January 1, 1999, to September 30, 2004, in the 3 dermatology journals with the highest scientific impact, according to the 2003 Institute for Scientific Information Journal Citation Reports, were examined: The Journal of Investigative Dermatology, Archives of Dermatology, and the Journal of the American Academy of Dermatology. Advertisements were excluded. Full-text publications were downloaded to a local hard drive and saved in HTML format using an automated script (Visual Basic 6). An automated program then extracted all URLs located within text sections. Hence, URLs embedded in tables or figures were not detected for this study. The availability of each URL was determined in September 2004 using a previously described program (Visual Basic 6).4
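The extraction step can be illustrated with a minimal Python sketch. It is not the Visual Basic 6 program used in the study; the regular expression, the class name TextOnlyParser, and the choice to skip the table and figure elements are assumptions made only to mirror the restriction to text sections described above.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: an http(s) scheme followed by characters that rarely end a URL in prose.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


class TextOnlyParser(HTMLParser):
    """Collects text nodes, ignoring anything nested inside <table> or <figure>."""

    SKIPPED = {"table", "figure"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_urls(html):
    """Return URLs found in the running text of an HTML document, in order."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Trailing sentence punctuation is stripped because it is rarely part of the URL.
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(text)]
```

For example, a paragraph citing (http://www.archive.org) yields that URL, whereas the same address placed inside a table cell is not returned, which matches the limitation noted above.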
Article characteristics captured included PubMed identification, journal name, and date of publication. Data recorded for each URL included text location, URL address, top-level domain (eg, “.com” or “.gov”), directory depth, presence of tildes, availability of the URL, and recoverability of unavailable URLs using the Internet Archive (IA) (http://www.archive.org). The presence or absence of an accession date (date the author last accessed the URL) was noted for a random sample, chosen using a random-number generator (http://www.random.org), of 100 URLs found in journal articles with PubMed identifications.
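The URL-level variables can be derived mechanically from the address itself. The following Python sketch is an illustration only; the function name url_features and the rule that every nonempty path segment (including a trailing file name) counts toward directory depth are assumptions, since the study does not spell out how file names were treated.

```python
from urllib.parse import urlsplit


def url_features(url):
    """Return top-level domain, directory depth, and tilde presence for a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # The top-level domain is taken as the last dot-separated label of the host name.
    tld = "." + host.rsplit(".", 1)[-1] if "." in host else ""
    # Assumption: each nonempty path segment counts as one level of depth.
    segments = [segment for segment in parts.path.split("/") if segment]
    return {
        "top_level_domain": tld,
        "directory_depth": len(segments),
        "has_tilde": "~" in parts.path,
    }
```

Under these assumptions, http://www.uchsc.edu has a directory depth of 0 and http://www.uchsc.edu/derm/ a depth of 1, with ".edu" as the top-level domain in both cases.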
The URLs were classified as either available (yielding no error message when accessed using an Internet browser) or unavailable (yielding an error message when accessed using an Internet browser). The URLs that were redirected were noted and classified as available.
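A rough programmatic analogue of this classification is sketched below in Python. The study relied on the error message shown by an Internet browser; treating HTTP status codes of 400 or higher and connection failures as the equivalent of an error message is an assumption, as are the function names fetch_status and classify.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return (status_code, final_url); both are None if no response was obtained."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, None
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None, None


def classify(requested_url, status, final_url):
    """Label a URL as available or unavailable and note whether it was redirected."""
    if status is None or status >= 400:
        return {"url": requested_url, "available": False, "redirected": False}
    redirected = final_url is not None and final_url != requested_url
    return {"url": requested_url, "available": True, "redirected": redirected}
```

A call such as classify(url, *fetch_status(url)) combines the two steps. Redirected URLs remain in the available group, as in the study, and the redirected flag merely records them; a server that only appends a trailing slash would also be flagged under this simple comparison.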
For all unavailable URLs, content recovery was attempted using the IA, an Internet archiving resource. The URLs were pasted into the IA's Wayback Machine to minimize data entry error, and subcategorized as follows: (a) a recoverable URL (ie, at least some retrievable content via the IA) or (b) an unrecoverable URL (irretrievable content). Two investigators (D.M.C. and L.F.H.) independently attempted this recovery and created 2 separate databases, which were then compared and reconciled by consensus when differences occurred.
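The comparison of the 2 independently created databases can be sketched as follows. This is a hypothetical Python illustration, not the procedure actually used (the study data were stored in Access 2000); the function name find_disagreements, the dictionary layout, and the example.org addresses are assumptions. The sketch only lists the discrepancies, because the reconciliation itself was done by investigator consensus.

```python
def find_disagreements(db_a, db_b):
    """Compare two reviewers' labels and separate agreements from disputes.

    Each argument maps a URL to "recoverable" or "unrecoverable".
    Returns (agreed, disputed), where disputed holds (url, label_a, label_b).
    """
    agreed = {}
    disputed = []
    for url in sorted(set(db_a) | set(db_b)):
        label_a = db_a.get(url)
        label_b = db_b.get(url)
        if label_a is not None and label_a == label_b:
            agreed[url] = label_a
        else:
            # Different labels, or a URL missing from one database.
            disputed.append((url, label_a, label_b))
    return agreed, disputed


# Hypothetical entries for illustration only.
reviewer_1 = {"http://example.org/a": "recoverable", "http://example.org/b": "unrecoverable"}
reviewer_2 = {"http://example.org/a": "recoverable", "http://example.org/b": "recoverable"}
agreed, disputed = find_disagreements(reviewer_1, reviewer_2)
```

Here the first URL is agreed to be recoverable, and the second is returned for discussion.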
Journal policies regarding URLs were sought in the online versions of the “Instructions for Authors” for all 3 journals. Statistical analyses, including descriptive statistics and χ2 tests, were performed using SAS statistical software, version 8 (SAS Institute Inc, Cary, NC). Data were stored in a database (Access 2000; Microsoft Corporation, Redmond, Wash).
Between June 30 and September 30, 2005, a questionnaire (available from the authors) was sent to the corresponding author of articles containing URLs initially identified as unavailable in September 2004 and reconfirmed as unavailable in May 2005. Random selection was used in cases in which multiple articles had the same corresponding author or the same unavailable URL, so that each author and URL were unique. E-mails and addresses were obtained from the articles. When an e-mail was not given, standard post was used. Up to 3 contacts were attempted by e-mail and 1 via standard post for each author. Replies were compiled and descriptive statistics provided by a commercially available Internet electronic survey tool (http://www.surveymonkey.com). This study received Colorado Multiple Institutional Board approval.
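One way to obtain a sample in which each author and each URL appears only once is sketched below in Python. The exact selection procedure is not described in the study, so the greedy pass over a shuffled list, the function name pick_unique, and the fixed seed are assumptions made for illustration.

```python
import random


def pick_unique(records, seed=0):
    """Randomly select (author, url) pairs so that no author or URL repeats."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    seen_authors = set()
    seen_urls = set()
    chosen = []
    for author, url in shuffled:
        if author in seen_authors or url in seen_urls:
            continue  # this author or this URL is already represented
        seen_authors.add(author)
        seen_urls.add(url)
        chosen.append((author, url))
    return chosen
```

A greedy pass of this kind can exclude an author whose only unavailable URL was already assigned to another author, so it is a simplification rather than a reconstruction of the sampling.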
In the 271 online journal issues sampled (Archives of Dermatology, 81; Journal of the American Academy of Dermatology, 108; and The Journal of Investigative Dermatology, 82), 7337 articles included 1113 URLs, of which 801 were unique (Table 1). Overall, 7.6% of articles (554 of 7337) contained at least 1 URL. The percentage of published articles containing at least 1 URL increased from 2.3% in 1999 to 13.5% in 2004 (January to September). The total number of URLs published increased annually from 78 in 1999 to 309 in 2003. Of the URLs, 27.3% appeared in the article text (“Introduction,” “Materials and Methods,” “Results,” or “Discussion”), while the remaining 72.7% were located in the references or other areas of the article.
Overall, 18.3% of URLs across all 3 dermatology journals and all years were unavailable. The availability of URLs decreased significantly with time since article publication, with 89.1% of URLs published in 2004 (January through September) and 65.4% of those published in 1999 available (P<.001) (Figure 1). Availability was highest in the Archives of Dermatology (85.2%) and lowest in The Journal of Investigative Dermatology (77.9%) (P=.03). For all years, availability was significantly associated with top-level domain (P=.003); the percentages unavailable were as follows: “.edu” (34.2%), “.org” (18.7%), “.net” (18.8%), “.com” (15.5%), “.gov” (14.7%), and other (ungrouped top-level domains) (21.5%). The URLs with a directory depth of 0 (also known as root directories) (eg, http://www.uchsc.edu) were significantly more likely to be available compared with those with a directory depth of 1 (eg, http://www.uchsc.edu/derm/) or more (8.4% vs 26.1% unavailable) (P<.001). Neither the presence of an accession date, indicating when the URL was last viewed by the author (P<1.0), nor a tilde in the URL (P≤.22) was significantly associated with availability. (A tilde [~] character is an “alias” indicating that the Web page is located on a user or group directory and, thus, does not specify the directory path. If a user account becomes inactive, then redirection will fail even if the Web page files are available on the Web server.) Of 100 randomly chosen URLs, 39 had accession dates.
Of 204 unavailable URLs, the content of 120 (58.8%) was recoverable in some form using the IA. This increased overall recoverability of at least partial content to 92.5% of URLs in all journals for all years.
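The reported percentages follow directly from the counts given in the text, as the short Python check below illustrates. Only the counts 1113, 204, and 120 come from the study; the helper pct and the variable names are introduced here for illustration.

```python
total_urls = 1113       # all URLs examined
unavailable = 204       # URLs yielding an error message
recovered_via_ia = 120  # unavailable URLs with some content in the Internet Archive

available = total_urls - unavailable  # 909 URLs


def pct(part, whole):
    """Percentage rounded to 1 decimal place, as reported in the text."""
    return round(100 * part / whole, 1)


available_pct = pct(available, total_urls)                    # 81.7
unavailable_pct = pct(unavailable, total_urls)                # 18.3
recovered_pct = pct(recovered_via_ia, unavailable)            # 58.8
overall_pct = pct(available + recovered_via_ia, total_urls)   # 92.5
```

The last figure counts a URL as recoverable if it was either still available (909) or at least partially retrievable through the Internet Archive (120), giving 1029 of 1113.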
A total of 102 unique corresponding authors of articles with unavailable URLs were e-mailed a survey (Figure 2) regarding the unavailable URLs, and 67 (65.7%) responded. Fewer than half (43.9%) had attempted to access the URL after publication, suggesting that most URLs become unavailable without the knowledge of the citing authors. Most (55.0%) of the cited URLs referenced content outside the direct control of the authors and their coworkers. Of 60 respondents, 7 (11.7%) had direct control over URL availability.
Most authors (32 [51.6%] of 62) did not know why the URL they cited was unavailable. However, consistent with previous findings,4 about 11% of URLs were misspelled in the final publication. Three (4.5%) indicated that the URLs became unavailable because of a lack of funding or support.
Most responding authors (63.9%) had preserved cited URL content, most commonly (29.5%) by printing it. Few (4.9%) had used an Internet-based archive for content preservation. Most (55.2%) agreed that the content of the cited URL was important to their publication, most often (60.7%) as a means of contributing to background information for the study. The most common reason for citing a URL was to provide additional information about a topic (54.1%) or to link to additional data or analyses (37.7%). Only 14.3% indicated that an alternative source of data (other than the cited URL) was available at publication.
Most often, the cited URL pointed to a text-based document (46.8%), which can be backed up by several means, but 45.2% of the URLs pointed to either a database (33.9%) or a software program (11.3%), neither of which is as straightforward to back up.
Since January 2002, the “Instructions for Authors” of the Archives of Dermatology (http://archderm.ama-assn.org) has provided an example Internet reference with an accession date and has recommended that authors retain a printed copy of any referenced Internet-only information to ensure access to cited information if the URL is altered or disappears. The “Instructions for Authors” of the Journal of the American Academy of Dermatology (http://www.eblue.org) and The Journal of Investigative Dermatology (http://www.jidonline.org) do not mention an Internet referencing policy. None of the 3 journals restricted URLs to specific locations in articles.
This study confirms that URLs are increasingly cited as sources of scholarly information in dermatology journals and that a substantial portion of the cited information is no longer available. Of 1113 URLs examined, 18.3% were unavailable. The probability that a URL would be unavailable was significantly associated with longer time since publication, journal, top-level domain, and greater directory depth, but not with the presence of a tilde or an accession date. These associations support the findings of Casserly and Byrd2 in information science journals. Of unavailable URLs, 58.8% were recoverable in some form in the IA, and an assessment of the content relevance of randomly selected URLs found no irrelevant content. This study also corroborates findings that 12% of URLs in MEDLINE abstracts contain spelling or formatting errors that render the published URL unavailable.4
The Internet serves as an invaluable network that provides global access to information. However, the average lifespan of a Web site is far from sufficient to ensure reliable long-term availability.10,11 Because of the inconstant nature of URLs, neither publishers nor authors are able to guarantee the long-term accuracy or availability of digital information referenced in dermatology journals. Effective solutions will likely require a collaborative effort on the part of researchers, authors, and journal editors.
Digital archiving resources offer one approach to preserving digital information. The IA, a public nonprofit organization, was constructed with the purpose of archiving Internet content and can often locate content of otherwise unrecoverable URLs, with snapshots taken on multiple dates. Unfortunately, archived versions of dynamic Web pages may not fully retain functionality, and other URLs, including those that are password protected or that block Web crawlers, are not available for archiving. Moreover, IA archiving typically takes place every couple of months, so changes made during this time will not be preserved. Thus, while 58.8% of unavailable URLs were classified as “recoverable” on the IA, the information recovered could not be verified as identical to that viewed and cited by the author.
An additional problem is the possibility of copyright infringement associated with preserving Internet content that is not the intellectual property of the citing author. In terms of scientific publications, for example, a recent study12 demonstrated that many authors make journal article reprints available online, which may in turn be archived by the IA regardless of whether the journals want this content freely available. It is difficult, if not impossible, in many cases for the IA to ascertain what content has been legally posted and what content may be illegal. Web authors may ask to have their electronic content removed from the IA (more information is available at: http://www.archive.org/about/faqs.php), which may further limit the ability of the IA to preserve URLs.
Other efforts to remedy the problem of URL loss exist (Table 2). Software programs, such as Peridot (IBM Corporation, White Plains, NY)13 and Xenu's Link Sleuth (http://home.snafu.de/tilman/xenulink.html), automate the updating of linked Web sites. Another program (FURL; LookSmart, Ltd, San Francisco, Calif) (http://www.furl.net) also serves as a digital information archive, but preserves only URL content submitted by individuals for personal archiving. Alternatively, WebCite specifically targets preservation of URLs in academic journals.
Readers commonly use additional recovery methods, such as typing the higher-level stem (beginning) of an unavailable URL or the entire URL into a search engine such as Google. About 30% of the unavailable URLs in our study yielded prima facie relevant information using these methods. In the end, however, the reader does not know with certainty that this retrieved information is, in fact, the originally cited information.
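The higher-level stems of a URL can be generated mechanically, as in the Python sketch below. The function name url_stems and the decision to drop any query string or fragment are assumptions; the sketch only produces candidate addresses and does not judge whether the content found there matches what was originally cited.

```python
from urllib.parse import urlsplit, urlunsplit


def url_stems(url):
    """Return progressively shorter stems of a URL, ending at the site root."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    stems = []
    # Drop one trailing path segment at a time.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"
        stems.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return stems
```

For http://www.archive.org/about/faqs.php, the stems are http://www.archive.org/about/ and then http://www.archive.org/; a root URL such as http://www.uchsc.edu has no higher-level stem and returns an empty list.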
Uniform resource locator content might also be better preserved by using more permanent alternatives to URLs for locating information on the Internet. Uniform resource locators serve as the name (identifying content) and address (identifying location) for Internet resources, rendering cited content unavailable if either one changes. Alternatively, permanent URLs are associated with specific URLs, but are unchanging, effectively redirecting the Web client to the correct URL via an intermediary resolution service.8 This process is not fully location independent, and its success depends on the reliability of permanent URL maintainers to update the associated URL if it changes.8 Other alternatives are uniform resource names, permanent location-independent identifiers of cited resources that rely on a resolving service; and digital object identifiers, which identify a digital object by name only, using a persistent novel identifier embedded within a URL.14
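The indirection behind permanent URLs and resolver-based identifiers can be illustrated with a toy Python model. It is a conceptual sketch, not the interface of any real resolution service; the class Resolver, the identifier "purl:derm-atlas", and the example.org addresses are hypothetical.

```python
class Resolver:
    """Toy resolution service mapping a persistent identifier to its current URL."""

    def __init__(self):
        self._table = {}

    def register(self, persistent_id, url):
        """Create or update the location recorded for an identifier."""
        self._table[persistent_id] = url

    def resolve(self, persistent_id):
        """Return the current URL; raises KeyError for an unknown identifier."""
        return self._table[persistent_id]


resolver = Resolver()
resolver.register("purl:derm-atlas", "http://example.org/atlas/index.html")
first_location = resolver.resolve("purl:derm-atlas")

# The resource moves; only the resolver entry changes, not the published citation.
resolver.register("purl:derm-atlas", "http://example.org/resources/atlas/")
second_location = resolver.resolve("purl:derm-atlas")
```

The citation keeps pointing at the same identifier throughout. As noted above, the scheme works only as long as the maintainer updates the resolver entry when the resource moves.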
In light of the limitations of URL preservation options, the importance of improving journal policies regarding URLs cannot be overstated. In a recent study15 of the top 100 medical and scientific journals, as rated by the Institute for Scientific Information for scientific impact, only one, the Archives of General Psychiatry, had a URL preservation policy stated in the “Instructions for Authors.” Of the 3 dermatology journals, only the Archives of Dermatology specifically mentions Internet referencing in the “Instructions for Authors,” using the same policy used by the Archives of General Psychiatry. The Archives of Dermatology also demonstrated a significantly lower rate of unavailable URLs in this study. Publishers, editors, and authors should work together to discover and implement feasible solutions to URL content loss15-18 by (1) requiring authors to retain digital backup or printed copies of cited Internet-only information to facilitate content recovery should a URL become unavailable and (2) advocating the inclusion of referenced Internet content in an online archive (Table 2). In addition, URLs need systematic double checking before publication to minimize unavailability due to spelling errors or misprints.
The adoption of standard electronic referencing policies, the use of Internet-based archives, and collaboration between authors and publishers will hopefully lead to more permanent URL availability in dermatology journals. Ultimately, widespread acceptance and support for these easily implemented policies could serve as a model for all medical literature.
Correspondence: Robert P. Dellavalle, MD, PhD, MSPH, Dermatology Service, Department of Veterans Affairs Medical Center, 1055 Clermont St, Mail Code 165, Denver, CO 80220 (firstname.lastname@example.org).
Financial Disclosure: None reported.
Previous Presentation: This study was presented at the Fifth International Congress on Peer Review and Biomedical Publication; September 16, 2005; Chicago, Ill.
Accepted for Publication: January 9, 2006.
Author Contributions: Study concept and design: Wren, Schilling, and Dellavalle. Acquisition of data: Wren, Johnson, Crockett, Heilig, and Dellavalle. Analysis and interpretation of data: Wren, Johnson, Heilig, and Schilling. Drafting of the manuscript: Johnson, Crockett, Heilig, Schilling, and Dellavalle. Critical revision of the manuscript for important intellectual content: Wren, Heilig, Schilling, and Dellavalle. Statistical analysis: Heilig. Obtained funding: Dellavalle. Administrative, technical, and material support: Wren, Heilig, Schilling, and Dellavalle. Study supervision: Heilig and Dellavalle.
Funding/Support: This study was supported by grant EPS-0447262 from the National Science Foundation Experimental Program to Stimulate Competitive Research (Dr Wren); grant T32 AR07411 from the National Institutes of Health (Dr Johnson); in part by research grant R25 CA49981 from the National Cancer Institute Education (Mr Crockett); grant 5 D14HP00153, a Faculty Development in Primary Care Health Services Research Award (Dr Schilling); and grant K-07 CA92550 from the National Cancer Institute (Dr Dellavalle).
Acknowledgment: We thank John Kittelson, PhD, Department of Preventive Medicine and Biometrics, University of Colorado at Denver and Health Sciences Center, for statistical advice; and Eric Hester, MD, Jennifer Myers, MD, Renee D’Ambrosia, MD, Kristy Lundahl, MBA, and Shayla Francis, MD, for their work on this project.