Invited Commentary
Statistics and Research Methods
May 19, 2021

Setting Assessment Standards for Artificial Intelligence Computer Vision Wound Annotations

Author Affiliations
  • 1Section of Trauma, Surgical Critical Care and Acute Care Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, California
  • 2The Buncke Clinic, San Francisco, California
  • 3Department of Biomedical Data Science, Stanford University, Stanford, California
JAMA Netw Open. 2021;4(5):e217851. doi:10.1001/jamanetworkopen.2021.7851

Artificial intelligence (AI) and machine learning methods are increasingly being used to generate insights into clinical problems. The area of AI known as computer vision (CV) is being leveraged to augment aspects of medicine that involve the interpretation of visual information, including wound care.1 Wound classification, measurement, and evolution over time are all critical to optimal care. Each of these features is visual and should be amenable to CV algorithms.2 As research efforts proliferate, it seems prudent to set assessment standards for the work that is being done. Doing so would better enable different research teams to compare their work. It could also facilitate a more coordinated effort in building algorithms that together help to solve clinical problems.

Three major categories of CV algorithms are those built for classification, detection, and segmentation. Classification algorithms assign labels to entire images. Detection algorithms assign a bounding box and label to objects within images. Segmentation algorithms go a step further by assigning a mask outline and corresponding label to objects within images. We are not aware of any prior set of assessment standards designed specifically for CV-based wound annotations. With a focus on segmentation algorithms, Howell et al3 began the process of laying out nuanced methods to compare area-based wound tracings performed by human and AI annotators. Specifically, they used error measure distributions to compare the annotations and coupled those with qualitative comparisons done by blinded human reviewers. Using these methods, they showed that for the algorithm assessed, AI vs human segmentations varied in ways quantitatively similar to human vs human segmentations.
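To make the relationship among the 3 output types concrete, the following is a minimal sketch, not taken from Howell et al, in which a segmentation mask is reduced to a detection-style bounding box and then to a classification-style image label. The representation of a mask as a set of (row, column) pixel coordinates, the function names, and the label strings are all illustrative assumptions.

```python
# Assumption: a mask is a set of (row, col) pixel coordinates marked as wound.

def bounding_box(mask):
    """Detection-style output: tightest (top, left, bottom, right) box around the mask."""
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    return (min(rows), min(cols), max(rows), max(cols))


def image_label(mask):
    """Classification-style output: one label for the entire image (labels are hypothetical)."""
    return "wound" if mask else "no wound"


# A small segmentation mask carries the most information of the three outputs.
segmentation = {(2, 3), (2, 4), (3, 3), (3, 4), (4, 4)}

print(image_label(segmentation))   # wound
print(bounding_box(segmentation))  # (2, 3, 4, 4)
```

Each step discards information: the box no longer records the wound outline, and the label no longer records where the wound is.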

The first challenge that Howell et al3 addressed in this process was that of generating the ground truth data set. Like many other visually based diagnoses, wound assessment lacks a perfectly objective ground truth. The process of training a model encourages it to replicate the ground truth labels that it is given. In many clinical CV segmentation studies, the ground truth is a set of human-labeled images. The authors rightly note the variance among human annotators in their own study. These differences may have arisen because some human annotators were more skilled with digital tracings or more detail oriented than others, perhaps even zooming in on the photograph to make their outlines more precise.

Therefore, any team building a model should think critically about what they want to replicate. Do they want to mimic a specific rater? Perhaps an extremely detailed annotator is able to trace wound edges perfectly, and they want to focus algorithm training on that. Do they instead want to poll the wisdom of the crowd?4 Perhaps having multiple raters annotate each image gives a better ground truth on which to train the algorithm.5 Regardless, the training process reflects the labels it is given. Whatever path is chosen regarding human-labeled data, intra-observer and interobserver variability and concordance should be assessed and reported.
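As one possible illustration of the multiple-rater option, the sketch below builds a consensus mask by pixelwise majority vote and reports pairwise agreement between raters. Majority voting and pairwise intersection over union are assumptions chosen for simplicity; they are only one of several ways to pool annotations and to quantify interobserver concordance, and the rater masks are invented toy data.

```python
from itertools import combinations

# Assumption: each rater's tracing is a set of (row, col) pixels inside the wound.

def iou(a, b):
    """Intersection over union of two pixel sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def majority_vote(masks):
    """Keep a pixel if more than half of the raters marked it as wound."""
    counts = {}
    for mask in masks:
        for pixel in mask:
            counts[pixel] = counts.get(pixel, 0) + 1
    return {pixel for pixel, n in counts.items() if 2 * n > len(masks)}


def pairwise_iou(masks):
    """Agreement for every pair of raters, keyed by their indices."""
    return {(i, j): iou(masks[i], masks[j])
            for i, j in combinations(range(len(masks)), 2)}


# Toy tracings from three hypothetical raters.
rater_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
rater_b = {(0, 0), (0, 1), (1, 1)}
rater_c = {(0, 1), (1, 1), (1, 2)}
raters = [rater_a, rater_b, rater_c]

consensus = majority_vote(raters)
print(sorted(consensus))      # [(0, 0), (0, 1), (1, 1)]
print(pairwise_iou(raters))   # agreement ranges from 0.4 to 0.75
```

Reporting the spread of the pairwise values alongside the consensus mask gives a reader a sense of how much the human labels themselves disagree.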

Once algorithms are trained on labeled data, they are then tested against novel data, and their performance is assessed. Both classification and segmentation algorithms have a core set of performance metrics. Traditionally for segmentation algorithms, the primary measures of performance are intersection over union (IOU) or the Dice coefficient. In some cases, precision and recall are also reported. IOU and the Dice coefficient can be derived from one another. IOU treats false positives and false negatives equally and jointly optimizes them. Howell et al3 broke down the standard metric of IOU into its component parts by using error analysis. The false-negative area and false-positive area metrics describe the type of mistake made and the extent to which the algorithm is underpredicting vs overpredicting. Relative error describes the difference in area measurement. Used together, these 3 metrics permit a granular understanding of how close the predicted tracing is to the human tracing and in what direction the error is occurring.
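The metrics named above can be sketched on pixel sets as follows. The formulas for IOU and the Dice coefficient are standard, as is the identity Dice = 2 × IOU / (1 + IOU). The normalization of the false-negative and false-positive areas by the reference area, and the signed definition of relative error, are assumptions made for this illustration and may differ from the exact definitions used by Howell et al.

```python
# Assumption: masks are sets of (row, col) pixels; "reference" is the human tracing.

def segmentation_metrics(reference, predicted):
    """Overlap and error metrics for one predicted mask against one reference mask."""
    if not reference:
        raise ValueError("reference mask must contain at least one pixel")
    tp = len(reference & predicted)   # pixels both tracings call wound
    fn = len(reference - predicted)   # wound pixels the prediction missed
    fp = len(predicted - reference)   # predicted pixels outside the reference
    ref_area = len(reference)
    return {
        "iou": tp / (tp + fn + fp),
        "dice": 2 * tp / (2 * tp + fn + fp),
        # Assumed normalization: fractions of the reference area.
        "false_negative_area": fn / ref_area,
        "false_positive_area": fp / ref_area,
        # Assumed sign convention: positive means the prediction is larger.
        "relative_error": (len(predicted) - ref_area) / ref_area,
    }


# A 4 x 4 reference and the same square shifted one pixel to the right.
reference = {(r, c) for r in range(4) for c in range(4)}
predicted = {(r, c + 1) for r in range(4) for c in range(4)}

metrics = segmentation_metrics(reference, predicted)
print(metrics)

# Dice and IOU can be derived from one another.
assert abs(metrics["dice"] - 2 * metrics["iou"] / (1 + metrics["iou"])) < 1e-9
```

In this toy case the two tracings have identical areas, so the relative error is 0 even though the IOU is only 0.6; the false-negative and false-positive areas (0.25 each) reveal that the prediction is displaced rather than too large or too small.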

Another important insight provided by Howell et al3 is the impact of size on an algorithm’s performance metric results. They note that “relative error between tracings tended to increase significantly for smaller absolute wound sizes.”3 This is not entirely surprising given that IOU is normalized by size. When you have a small object, each pixel error has a greater effect. In some extreme cases, just the fact that there is overlap between the predicted mask and the ground truth mask can already be a good sign. A possible way to account for size impact is to group wounds into a set of meaningful categories (eg, small, medium, and large) and then specifically assess how a wound segmentation algorithm performs on wounds in each category.
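The size effect and the suggested stratification can be sketched as follows. The pixel thresholds that separate small, medium, and large wounds are hypothetical placeholders; clinically meaningful cutoffs would need to be chosen by the study team, likely in physical units rather than pixels.

```python
# Assumption: masks are sets of (row, col) pixels; thresholds below are hypothetical.

def iou(a, b):
    """Intersection over union of two pixel sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def size_category(area_px, small_max=100, medium_max=1000):
    """Bin a wound by reference area in pixels (placeholder cutoffs)."""
    if area_px <= small_max:
        return "small"
    if area_px <= medium_max:
        return "medium"
    return "large"


def mean_iou_by_size(pairs):
    """Average IOU of (reference, predicted) pairs within each size category."""
    grouped = {}
    for reference, predicted in pairs:
        category = size_category(len(reference))
        grouped.setdefault(category, []).append(iou(reference, predicted))
    return {category: sum(values) / len(values)
            for category, values in grouped.items()}


def square(side, offset=0):
    """Toy wound: a side x side square, optionally shifted right by `offset` pixels."""
    return {(r, c + offset) for r in range(side) for c in range(side)}


# The same one-pixel shift applied to a small and a large wound.
pairs = [
    (square(4), square(4, offset=1)),
    (square(40), square(40, offset=1)),
]
print(mean_iou_by_size(pairs))  # small: 0.6, large: about 0.95
```

The identical one-pixel displacement costs the 4 × 4 wound far more IOU than the 40 × 40 wound, which is why a single pooled score can hide poor performance on small wounds.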

That Howell et al3 include their full set of images as a supplemental appendix is of great benefit to the reader. Every image contains tracings produced by both the human annotators and the AI algorithm. Looking over the images and their tracings, one can see evidence of tracing discrepancies. In their supplementary discussion, the authors have broken these into subgroups. Most of the images contain wounds annotated with slightly different borders. The aforementioned performance metrics work well to assess these images. There are also wound images where dominant wounds or smaller satellite wounds are variably annotated. For example, sometimes a satellite wound is not identified at all. Finally, wounds are often complex. Various types of tissue, including necrosis, eschar, slough, granulation tissue, and epithelialized tissue, may all be present. One segmentation of a wound may appropriately include all portions of the wound within the segmentation mask, while another may annotate only a single portion of the wound (eg, granulation tissue only). Future models could further delineate their performance on these subgroups of problems.
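One way the satellite-wound subgroup might be assessed separately is sketched below: the reference mask is split into connected regions, and any region with no overlapping predicted pixel is flagged as missed. This is an illustrative approach, not the method used by Howell et al; the use of 4-connectivity and the "any overlap" matching rule are assumptions.

```python
# Assumption: masks are sets of (row, col) pixels; regions are 4-connected.

def connected_components(mask):
    """Split a pixel set into 4-connected regions using an iterative flood fill."""
    remaining = set(mask)
    components = []
    while remaining:
        stack = [remaining.pop()]
        component = set(stack)
        while stack:
            r, c = stack.pop()
            for neighbor in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def missed_components(reference, predicted):
    """Reference regions that the prediction does not touch at all."""
    return [region for region in connected_components(reference)
            if not region & predicted]


# A dominant 5 x 5 wound plus a 2-pixel satellite; the prediction finds only the former.
dominant = {(r, c) for r in range(5) for c in range(5)}
satellite = {(10, 10), (10, 11)}
reference = dominant | satellite
predicted = set(dominant)

missed = missed_components(reference, predicted)
print(len(connected_components(reference)))  # 2
print([sorted(region) for region in missed])  # [[(10, 10), (10, 11)]]
```

Here the pixel-level IOU is 25/27, or roughly 0.93, yet one of the two wounds was missed entirely, which a per-region count makes visible.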

Howell et al3 articulate well the pragmatic challenges faced in wound care assessments. They highlight the need for “clear, quick, and precise wound analysis to guide best clinical practices.”3 We agree with their conclusion that a structured approach to wound assessment augmented by AI could help to improve clinical practice and ultimately patient outcomes. Their methods and approach will be useful to those wishing to assess the performance of a wound segmentation algorithm. As the methods to apply AI algorithms to wound assessments continue to mature, we envision that the accompanying assessment standards will also evolve. This is an important first step.

Article Information

Published: May 19, 2021. doi:10.1001/jamanetworkopen.2021.7851

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Jopling JK et al. JAMA Network Open.

Corresponding Author: Jeffrey K. Jopling, MD, MSHS, Section of Trauma, Surgical Critical Care and Acute Care Surgery, Department of Surgery, Stanford University School of Medicine, 300 Pasteur Dr, H3691, Stanford, CA 94305 (

Conflict of Interest Disclosures: None reported.

1. Wang C, Anisuzzaman DM, Williamson V, et al. Fully automatic wound segmentation with deep convolutional neural networks. Sci Rep. 2020;10(1):21897. doi:10.1038/s41598-020-78799-w
2. Anisuzzaman DM, Wang C, Rostami B, Gopalakrishnan S, Niezgoda J, Yu Z. Image-based artificial intelligence in wound assessment: a systematic review. arXiv. Preprint published on September 15, 2020. Accessed April 12, 2021.
3. Howell RS, Liu HH, Khan AA, et al. Development of a method for clinical evaluation of artificial intelligence–based digital wound assessment tools. JAMA Netw Open. 2021;4(5):e217234. doi:10.1001/jamanetworkopen.2021.7234
4. Kovashka A, Russakovsky O, Fei-Fei L, Grauman K. Crowdsourcing in Computer Vision. Now Foundations and Trends, 2016. doi:10.1561/0600000071
5. Barnett ML, Boddupalli D, Nundy S, Bates DW. Comparative accuracy of diagnosis by collective intelligence of multiple physicians vs individual physicians. JAMA Netw Open. 2019;2(3):e190096. doi:10.1001/jamanetworkopen.2019.0096