THE CREATION OF STANDARD DATA SETS FOR THE EVALUATION OF NEW SPATIAL ANALYSIS METHODS

 

Gregg Petrie, Senior Research Scientist

National Security Directorate

Pacific Northwest National Laboratory

P.O. Box 999

Richland, WA 99352

509 372 6057

gregg.petrie@pnl.gov

 

 

Ian Anderson, Senior Product Manager

Leica Geosystems GIS and Mapping

2801 Buford Hwy, Suite 300

Atlanta GA 30329

ian.anderson@gis.leica-geosystems.com

(404) 248-9000 ext 2306

 

 

Scott A. Bennett, Vice President - Sales

ImageLinks, Inc.

Western Regional Office:

1746 Cole Blvd., Suite 225, Golden, CO 80031

Tel: 303-215-1700

cell 303 517 4329

sbennett@imagelinks.com

 

 

Haans Wesley Fisk, Remote Sensing Analyst

USDA Forest Service

Remote Sensing Applications Center (RSAC)

2222 West 2300 South

Salt Lake City, UT 84119

hfisk@fs.fed.us

(801) 975-3760

 

 

Eileen Perry, Assistant Director

Center for Precision Agriculture Systems

Washington State University

Irrigated Ag. Research & Extension Center

24106 N. Bunn Rd. ; Prosser WA 99350-8694

509-786-9257 voice; 786-9370 fax

eileen_perry@wsu.edu

 

Thomas K. Windholz, Associate GIS Director

Idaho State University

Pocatello, Idaho 83209-8130

windthom@isu.edu

208 282 5606

 

 


ABSTRACT

 

Remote sensing systems and supporting technologies have improved recently in several ways that promote the development of new algorithms.  However, a set of standard imagery is needed to promote the objective evaluation and comparison of the new methods that will be developed to exploit these opportunities.  Standard images and corresponding ground truth maps require a number of useful characteristics: they should be complete, widely recognized, easily accessible, environmentally representative, and well understood.  To develop a definitive collection of data sets with these characteristics, it will be necessary to identify and generate ground truth maps for a variety of environment types (such as urban, industrial, agricultural, forest, and rangeland) that are both important and representative.  For each environmental category, it will also be necessary to identify the representative classes that are both complete in an academic sense and useful in a practical, commercial sense.  Once this is done, a set of images will be found and acquired that captures the range of spectral (e.g., the number and type of bands), spatial (pixel size), and quality (8-bit versus 11-bit numeric precision) characteristics available to the remote sensing community.  Ideally, these image data sets would include both the processed (e.g., registered and atmospherically corrected) imagery and the raw source data.  They must include both complete documentation/metadata explaining the image processing that was done and complete ground truth registered to 1/10 pixel.

 

INTRODUCTION

Remote sensing systems and supporting technology have improved recently in several ways that promote the development of new algorithms.  For example, the QuickBird system offers both submeter panchromatic and 2.4-meter multispectral band imagery.  Hyperion offers new hyperspectral opportunities, and the Moderate Resolution Imaging Spectroradiometer (MODIS), a National Aeronautics and Space Administration (NASA)/Earth Observing System (EOS) instrument, offers high temporal sampling.  In addition, in keeping with Moore’s Law, the computing power available to a large number of researchers has greatly increased.  The user-friendly Beowulf Class Cluster Computing (Jones et al., 2003), optimized to solve image processing problems, is an example of current computer software strategies that encourage the development of new spatial analysis methodology.  The availability of new geographic information system (GIS) methods and data sets (Dean et al., 2002) encourages the development of novel techniques to fuse imagery with other kinds of data sets.  To take full advantage of such opportunities, it is necessary to develop a set of freely shared, standard images (SI).  Although the use of a single test image is standard practice in the areas of image processing (Yang and Cohen, 1999), medical imaging (Giacomuzzi et al., 1998), and remote sensing (e.g., Chen et al., 2003), we have found only one reference to date for the use of a complete image set (Toet et al., 2001).  Therefore, this could be the optimal time to develop a full set of SI that can be used to test, refine, demonstrate, and help communicate the strengths and weaknesses of new methods to the widest possible audience.

 

A collection of SI data sets would also provide a useful tool for education that would (1) help reduce costs for individual instructors; (2) help ensure a consistently high level of instruction across institutions; and (3) provide a common experience among students that can help them both communicate with their peers and move between different software packages for their work.  As an example, if the same training examples were used by both Imagine and ENVI, it would be easier for students to transfer between the two systems.

 

Whereas the value of using SI to promote the objective evaluation of new methods has been well demonstrated in other disciplines, the selection and presentation of SI is not necessarily a trivial task. The purpose of this short paper is to present an initial discussion of some of the issues to provide a starting point for further discussion and action within the remote sensing community.


 

 

ISSUES

 

We suggest that SI and the corresponding auxiliary data and ground truth information would ideally have a number of useful characteristics that include the following:

 

  • Complete: in the present context, the SI fully characterizes a site spectrally, temporally, and spatially.  In addition, the documentation/metadata explaining the history of each data set have to be complete and should serve as a model for researchers who are entering the remote sensing field.  In the ideal case for a given test site, this might mean a well-documented set of hyperspectral imagery that covers a wide spectral range and a wide range of dates at both high spectral resolution (narrow 16-bit bands) and high spatial resolution (small pixels); radar images; and corresponding auxiliary data (e.g., high-resolution Digital Elevation Model [DEM] grids, ground truth).  In addition, representative imagery from all commercial sensor systems (e.g., Land Remote-Sensing Satellite [Landsat], Satellite Pour l'Observation de la Terre [SPOT], IKONOS, QuickBird, Radar Satellite [RADARSAT]) would be available.  The imagery would come in many forms, such as raw imagery, geometrically corrected versions, and spectrally corrected versions (e.g., with atmospheric corrections).  This would allow researchers to use image data appropriate to their requirements.  For instance, if researchers were testing a new method for removing shading from imagery, they would be able to combine the hyperspectral imagery into ideal bands, or use a Landsat image that is ready at hand, and exploit the high-resolution DEM included in the SI to quickly test, refine, and demonstrate their new method.  Alternatively, researchers interested in developing a new classification method for SPOT imagery could quickly test their results against a general classification map that they had abstracted from a detailed classification ground truth map that was part of the SI.

 

  • Well understood: the SI is well documented, including known problems; it is supported by high-quality ground truth; and it is familiar.  For example, an SI set created at Cuprite, Nevada, would allow the many researchers already familiar with the site to build on past experience.  The ground truth must be extensive, with, for example, detailed classification maps and spectral signatures, and must be registered to 1/10 pixel.  The researcher should also have the option to visit the site that was imaged.  Having easy access, both legally and physically, to the site where the image data were acquired can be important when an investigator needs new information that was not anticipated when the auxiliary data sets were assembled.

 

  • Environmentally representative: spectral signatures, auxiliary data needs, and spatial signatures can vary with different environmental settings, such as forest, urban sites, rangeland, agricultural areas, or wetlands.  For example, the regular street patterns found in an urban image are not commonly found in wilderness areas.  Providing GIS data sets that include streets would be expected to be most important for the urban sites, whereas soil information might be more important for a rangeland area; however, good ground truth spectral measurements would be important for both.  Therefore, it will be necessary to create a number of SI data sets for each of the important environmental areas.  For each environmental site, it will also be necessary to create a detailed ground truth classification map.  This effort will entail the identification of the representative classes that are complete in both an academic sense—for example, that include all classes present—and useful in a practical, commercial sense—for example, that have commercial application.

 

  • Easily accessible:  the data sets should be physically easy to access, for example, via download over the Internet, without legal restrictions that would hamper sharing with fellow workers, and widely available from a number of sources.

 

  • Widely recognized: in the context of this discussion, researchers are aware of the fundamental existence of the SI data sets and know how to access them and/or the metadata about them.
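
Two of the characteristics above lend themselves to a small illustration.  The following is only a minimal sketch in Python, not a prescribed procedure: the class names, the lookup table, the sample classifier output, and the control-point residuals are all hypothetical, and reading "registered to 1/10 pixel" as a root-mean-square error of at most 0.1 pixel is our assumption.  The first part abstracts a general classification map from a detailed ground truth map and scores a classifier against it; the second part checks a set of registration residuals against the tolerance.

```python
import math

# Hypothetical lookup from detailed ground truth classes to general classes.
DETAILED_TO_GENERAL = {
    "oak": "forest",
    "pine": "forest",
    "wheat": "agriculture",
    "corn": "agriculture",
    "asphalt": "urban",
}


def generalize_map(detailed_map, lookup):
    """Collapse a detailed class map (list of rows) into general classes."""
    return [[lookup[label] for label in row] for row in detailed_map]


def overall_accuracy(predicted_map, truth_map):
    """Fraction of pixels whose predicted class matches the ground truth."""
    total = 0
    correct = 0
    for pred_row, truth_row in zip(predicted_map, truth_map):
        for pred, truth in zip(pred_row, truth_row):
            total += 1
            correct += pred == truth
    return correct / total


def registration_rmse(residuals):
    """Root-mean-square error of (dx, dy) control-point residuals, in pixels."""
    squared = [dx * dx + dy * dy for dx, dy in residuals]
    return math.sqrt(sum(squared) / len(squared))


def meets_registration_tolerance(residuals, tolerance=0.1):
    """True when the RMSE is within the tolerance (assumed to be 1/10 pixel)."""
    return registration_rmse(residuals) <= tolerance


# Hypothetical detailed ground truth map for a tiny 2 x 3 pixel site.
detailed = [["oak", "pine", "wheat"],
            ["corn", "asphalt", "oak"]]
general_truth = generalize_map(detailed, DETAILED_TO_GENERAL)

# Hypothetical classifier output at the general level (one pixel is wrong).
predicted = [["forest", "forest", "agriculture"],
             ["agriculture", "urban", "urban"]]

accuracy = overall_accuracy(predicted, general_truth)

# Hypothetical residuals (in pixels) measured at three control points.
residuals = [(0.05, -0.03), (-0.04, 0.06), (0.02, 0.02)]
registered = meets_registration_tolerance(residuals)
```

Because the lookup table is distributed with the detailed map, two researchers who generalize the same ground truth would score their methods against identical reference classes.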




 

IMPLEMENTATION STRATEGY

 

The above SI characteristics imply that the effort to generate a set of SI will not be inconsequential.  Coming to agreement on representative sites to generate imagery and auxiliary data for an SI set could be an expensive and time-consuming process.  However, there are possible strategies to reduce cost.  Instead of simply collecting raw image data, providing complex models that allow users to create their own data sets to meet their specific needs may be more appropriate.  An important advantage of this strategy would be that because the user would fully understand the data, there would be less ambiguity in interpreting the results.  The investigator could create a wide range of test cases to fully exercise the new methodology.  A major disadvantage would be that with model imagery, there is perhaps less chance for informative surprises.  Real data do not include any hidden model basis and can therefore sometimes provide a more satisfying test.  A compromise may be to modify an SI data set with model results or imagery from other sites.  For instance, to provide a compact test set, it may be efficient to cut out features from several data sets and combine them into one small image.  Such a strategy may work well for testing new classification methodology that does not consider information from adjacent pixels.  Alternatively, the parameters that a model uses would have to be calculated from real imagery.  As an example, if a model required a mean and standard deviation for the spectral signatures of oak trees used to generate Landsat images, the users would be required to use a standard set of parameters with a well-understood history.
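
The model-based strategy and the compact composite image described above can be sketched as follows.  This is only an illustrative sketch in Python under stated assumptions: the signature values are invented, each band is drawn as an independent Gaussian (a real model would likely account for correlation between bands), and the function names are hypothetical.

```python
import random

# Hypothetical per-band (mean, standard deviation) signatures in digital
# numbers; a real SI set would supply documented, agreed-upon values.
SIGNATURES = {
    "oak":   [(40.0, 3.0), (35.0, 2.5), (90.0, 6.0)],
    "water": [(20.0, 1.5), (15.0, 1.0), (8.0, 1.0)],
}


def simulate_pixel(class_name, rng, signatures=SIGNATURES, max_dn=255):
    """Draw one pixel: an independent Gaussian sample per band, clipped to 8 bits."""
    pixel = []
    for mean, std in signatures[class_name]:
        value = round(rng.gauss(mean, std))
        pixel.append(min(max(value, 0), max_dn))
    return pixel


def simulate_image(class_map, seed=0):
    """Build a synthetic image (rows of pixels) from a map of class labels."""
    rng = random.Random(seed)  # fixed seed so every user gets the same image
    return [[simulate_pixel(label, rng) for label in row] for row in class_map]


def mosaic(chips):
    """Place equally tall chips side by side to form one compact test image."""
    height = len(chips[0])
    if any(len(chip) != height for chip in chips):
        raise ValueError("all chips must have the same number of rows")
    return [sum((chip[r] for chip in chips), []) for r in range(height)]


# Two small synthetic chips (3 rows each) combined into one 3 x 6 test image.
oak_chip = simulate_image([["oak"] * 4] * 3, seed=1)
water_chip = simulate_image([["water"] * 2] * 3, seed=2)
test_image = mosaic([oak_chip, water_chip])
```

Fixing both the parameter table and the random seed is one way to give a simulated data set the "well-understood history" called for above, since any user can regenerate exactly the same pixels.  The abrupt seam between chips is harmless for per-pixel classifiers but would mislead methods that use neighboring pixels.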

 

One way to make the generation of SI more practical would be to use current channels of distribution. For instance, both the commercial and open source communities have expressed a clear interest in supporting the exposure, maintenance, and distribution of SI sets to the remote sensing community.  The open source community has an extensive Internet infrastructure that can naturally support the efficient distribution of the standardized data sets.  The commercial software vendors can use the SI sets as training examples and thus support both their effective distribution and understanding.

 

CONCLUSION

 

In this short paper, we have suggested that a set or sets of standard imagery, documentation, and supporting auxiliary data could be important to promote the objective evaluation and comparison of new methods that will be developed in response to new capabilities now being offered to the remote sensing community.  Metrics and methodology to develop these data sets were also discussed.  However, these ideas were offered only as a starting point to promote more discussion of the complex issues involved in the creation of standard data sets.  More work is clearly needed to further develop the initial concepts presented in this paper.

 

 

REFERENCES

 

Chen, C. T., K. S. Chen, et al. (2003). The use of fully polarimetric information for the fuzzy neural classification of SAR images. IEEE Transactions on Geoscience and Remote Sensing, 41(9):2089-2100.

Dean, G., M. Oimoen, et al. (2002). The National Elevation Dataset. Photogrammetric Engineering & Remote Sensing, 68(1):5.

Giacomuzzi, S. M., P. Springer, et al. (1998). The Austrian Academic Computer Network and its usefulness for teleradiology. Journal of Telemedicine and Telecare, 4:41-42.

Jones, D. R., G. M. Petrie, and S. E. Thompson. (2003). An overview of Beowulf cluster computing for remote sensing applications. In: Proceedings of the American Society for Photogrammetry & Remote Sensing 2003 Annual Conference, May 5-9, 2003, Anchorage, Alaska.

Toet, A., P. Bijl, et al. (2001). Image dataset for testing search and detection models. Optical Engineering, 40(9):1760-1767.

Yang, Z. W. and F. S. Cohen. (1999). Image registration and object recognition using affine invariants and convex hulls. IEEE Transactions on Image Processing, 8(7):934-946.