Jungsoon Yoo
Department of Computer Science,
Middle Tennessee State University,
Murfreesboro, TN 37132,
Email: csyoojp@mtsu.edu
Alexander Gray, Joseph Roden, Usama M. Fayyad
Jet Propulsion Laboratory,
M/S 525-3660,
Machine Learning Systems Group,
4800 Oak Grove Drive, Pasadena, CA 91109
R. R. de Carvalho¹, S. G. Djorgovski
Department of Physics, Mathematics, and Astronomy,
California Institute of Technology,
M/S 105-24,
Pasadena, CA 91125
¹ On leave of absence from Observatório Nacional/CNPq, Rio de Janeiro, CEP 20921, Brazil
SKICAT is a highly successful astronomical data analysis and cataloging system that produces well-calibrated digital catalogs of sky objects photographed in the northern sky as part of the Second Palomar Observatory Sky Survey (POSS-II). SKICAT also incorporates machine learning techniques to classify these objects into one of a handful of human-useful classes, most notably "star" and "galaxy" (Fayyad et al. 1993). A training set of objects hand-classified by an astronomer is used to train the software, which can then discriminate between the classes in new data presented to it. This is an instance of supervised learning.
An unsupervised learning system does not require preclassified examples. Conceptual clustering is an instance of unsupervised learning, which discovers useful categories in unclassified data.
Conceptual clustering is well matched to many problems in astronomy. This paper focuses on the relatively well-defined problem of star-galaxy separation, a sub-problem of the more general task of morphological classification of sky objects. Machine learning techniques in general have the advantage over humans of being able to handle large volumes of high-dimensional data. Unsupervised learning has the further advantage over supervised learning that the subjective element introduced by the human classifier is eliminated, leaving only a clearly defined and objective process. In addition, a good unsupervised learning scheme can identify classes that were previously unknown to the human user.
Various unsupervised learning methodologies have been applied to this problem and will be discussed in later reports. One of the more promising results from these studies came from the COBWEB/95 system. This paper gives a brief description of the COBWEB/95 system and presents the results of the experiment.
COBWEB is a conceptual clustering system that organizes observations to maximize inference ability. Observations must be encoded in a language that the system understands. COBWEB uses the simplest encoding scheme, attribute-value descriptions, also known as feature vectors. Each line of an input file contains one tuple, whose values correspond to the attributes in the columns of the file.
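The feature-vector encoding can be illustrated with a minimal sketch. The whitespace-separated layout, the function name `parse_feature_vectors`, and the sample values are assumptions made for illustration; the actual COBWEB/95 input format is not specified here.

```python
from typing import List, Tuple


def parse_feature_vectors(text: str) -> List[Tuple[float, ...]]:
    """Parse one attribute-value tuple per non-empty line.

    Assumes whitespace-separated numeric columns (a hypothetical layout).
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            rows.append(tuple(float(f) for f in fields))
    return rows


# Hypothetical values: three numeric attributes per object.
sample = "18.2 0.41 0.07\n19.5 0.88 0.12\n"
vectors = parse_feature_vectors(sample)
```

Each tuple in `vectors` then corresponds to one observation, with position in the tuple identifying the attribute.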
When humans observe objects within an environment, they often group the objects into classes by finding common features within a class and differences among classes. This classification process aids our understanding of objects. Conceptual clustering is a machine learning technique that simulates the human classification process. One of the best-known conceptual clustering systems is Fisher's COBWEB (Fisher 1987). The heuristic measure used in COBWEB, called the category utility, has a firm grounding in probability theory and was originally developed as a means of predicting "preferred" concepts in human classification (Gluck & Corter 1985). The category utility is defined for nominal (symbolic) attributes, so it cannot be applied directly to objects with numeric attributes. The COBWEB/95 system has been developed to handle numeric as well as nominal attributes.
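For reference, the category utility for nominal attributes is commonly written, following Fisher (1987), as shown in this block. The symbols are introduced here for illustration: $C_1, \dots, C_K$ are the classes of a partition, $A_i$ is the $i$-th attribute, and $V_{ij}$ is its $j$-th value. The numeric extension used by COBWEB/95 is not reproduced here.

```latex
\[
CU(\{C_1,\dots,C_K\}) \;=\; \frac{1}{K}\sum_{k=1}^{K} P(C_k)
\left[\,\sum_{i}\sum_{j} P(A_i = V_{ij}\mid C_k)^2
\;-\; \sum_{i}\sum_{j} P(A_i = V_{ij})^2 \right]
\]
```

The bracketed term is the gain in the expected number of attribute values that can be guessed correctly once the class is known, and the division by $K$ allows partitions of different sizes to be compared.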
COBWEB organizes observations incrementally into a classification tree. Each node in a classification tree represents a class (concept). It is labeled by a probabilistic concept that summarizes the attribute-value distributions of the objects classified under the node. The terminal nodes are the most specific classes and cover individual instances that have been observed previously. The root represents the most general concept, which summarizes all of the observations. Higher-level nodes represent more general concepts than lower-level nodes. The classification tree may be used to organize objects into classes as well as to predict missing attributes or the class of a new object.
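A probabilistic concept of this kind can be sketched as a node that stores value counts per attribute. This is a simplified illustration for nominal attributes only; the class name `ConceptNode`, its methods, and the attribute name `shape` are hypothetical and are not taken from COBWEB/95.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


class ConceptNode:
    """A tree node summarizing the instances classified under it."""

    def __init__(self) -> None:
        self.count = 0  # number of instances under this node
        self.value_counts: Dict[str, Counter] = defaultdict(Counter)
        self.children: List["ConceptNode"] = []

    def add(self, instance: Dict[str, str]) -> None:
        """Update the attribute-value counts with one new instance."""
        self.count += 1
        for attr, value in instance.items():
            self.value_counts[attr][value] += 1

    def probability(self, attr: str, value: str) -> float:
        """Estimate P(attr = value | this concept) from the counts."""
        if self.count == 0:
            return 0.0
        return self.value_counts[attr][value] / self.count

    def predict(self, attr: str) -> Optional[str]:
        """Return the most frequent value of attr, e.g. to fill a gap."""
        counts = self.value_counts.get(attr)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


# Hypothetical instances with a single nominal attribute.
root = ConceptNode()
for shape in ("compact", "compact", "extended"):
    root.add({"shape": shape})
```

Because only counts are stored, a node can be updated incrementally as each new observation arrives, which is what allows the tree to be built one instance at a time.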
COBWEB uses an explicit heuristic measure called the category utility to guide classification. Gluck and Corter developed this metric as a means of predicting the basic level in human classification hierarchies. Briefly, certain categories (e.g., bird) are retrieved more quickly by humans than either more general (e.g., animal) or more specific (e.g., robin) categories during object recognition. The category utility favors classes that maximize the potential for inferring information. It assumes that concept descriptions are probabilistic in nature.
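The way the category utility rewards partitions that improve inference can be shown with a small sketch for nominal attributes. It scores a partition by the class-weighted gain in the expected number of correctly guessed attribute values, averaged over the number of classes, as in the usual formulation after Gluck and Corter. The function names, attribute names, and toy data are assumptions, and the numeric-attribute handling of COBWEB/95 is not modeled.

```python
from collections import Counter
from typing import Dict, List

Instance = Dict[str, str]


def expected_correct_guesses(instances: List[Instance]) -> float:
    """Sum over attributes and values of P(attribute = value) squared."""
    n = len(instances)
    if n == 0:
        return 0.0
    counts: Counter = Counter()
    for inst in instances:
        for attr, value in inst.items():
            counts[(attr, value)] += 1
    return sum((c / n) ** 2 for c in counts.values())


def category_utility(partition: List[List[Instance]]) -> float:
    """Category utility of a partition (a list of clusters)."""
    everything = [inst for cluster in partition for inst in cluster]
    n, k = len(everything), len(partition)
    if n == 0 or k == 0:
        return 0.0
    baseline = expected_correct_guesses(everything)
    total = 0.0
    for cluster in partition:
        p_cluster = len(cluster) / n
        total += p_cluster * (expected_correct_guesses(cluster) - baseline)
    return total / k


# Toy data: two hypothetical nominal attributes per object.
compact = {"shape": "compact", "core": "bright"}
extended = {"shape": "extended", "core": "faint"}

clean_split = category_utility([[compact, compact], [extended, extended]])
mixed_split = category_utility([[compact, extended], [compact, extended]])
```

In this toy case the clean split scores higher than the mixed split, because knowing the class makes every attribute value predictable, whereas the mixed split adds no predictive information over the unpartitioned data.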
Figure 1: A Classification Tree with SKICAT Data.
The COBWEB/95 system has been applied to POSS-II data obtained from SKICAT. The number of well-placed instances (McKusick & Langley 1991) is used to measure the quality of a tree structure. The well-placed instances are the singleton nodes that are descendants of the proper target concepts. We measure clustering quality as the ratio of well-placed objects. This is possible because each object has a predetermined class assignment of "star" or "galaxy" from SKICAT's supervised learning component. Since the SKICAT classification accuracy is known to peak at 94%, there is uncertainty in the class label itself. Note, though, that this accuracy is excellent compared to human visual analysis of the data.
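One possible reading of this measure can be sketched as follows. The sketch assumes that the target concept of a node is the majority class among the singleton leaves beneath it, and counts a leaf as well placed when its label matches that majority. The `Node` class, the `depth` parameter, and the toy tree are hypothetical; the exact procedure of McKusick & Langley (1991) may differ.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: Optional[str] = None  # class label, set only on singleton leaves
    children: List["Node"] = field(default_factory=list)


def leaf_labels(node: Node) -> List[Optional[str]]:
    """Collect the labels of all singleton leaves under a node."""
    if not node.children:
        return [node.label]
    labels: List[Optional[str]] = []
    for child in node.children:
        labels.extend(leaf_labels(child))
    return labels


def well_placed_ratio(root: Node, depth: int = 1) -> float:
    """Fraction of leaves matching the majority class of their ancestor
    at the given depth (depth 0 is the root)."""
    frontier = [root]
    for _ in range(depth):
        next_frontier: List[Node] = []
        for node in frontier:
            # Leaves shallower than `depth` are carried forward unchanged.
            next_frontier.extend(node.children if node.children else [node])
        frontier = next_frontier
    well_placed = total = 0
    for node in frontier:
        labels = leaf_labels(node)
        well_placed += Counter(labels).most_common(1)[0][1]
        total += len(labels)
    return well_placed / total


# Toy tree: one mostly-star node and one pure galaxy node.
tree = Node(children=[
    Node(children=[Node("star"), Node("star"), Node("galaxy")]),
    Node(children=[Node("galaxy"), Node("galaxy")]),
])
ratio_level_1 = well_placed_ratio(tree, depth=1)
```

In the toy tree, four of the five leaves agree with the majority class of their depth-1 ancestor, so the ratio is 0.8. The class labels are used only for scoring after the tree is built, never during clustering.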
The SKICAT dataset that we used contains 33021 objects. Each object has an object identification number, 10 photographic attribute values, and the SKICAT-supplied class value. The following attributes were used for the analysis: (1) Magnitude; (2) Resolution scale; (3) Resolution fraction (attributes 2 and 3 are described by Valdes 1982); (4) Ellipticity; (5) Normalized core magnitude; (6) Normalized area; (7) First intensity moment; (8) The S parameter introduced by Collins et al. (1989); (9) Mean surface brightness; and (10) Concentration index.
We used only objects classified as galaxies or stars by SKICAT. It is important to emphasize that we did not use color, which is known to be effective for star-galaxy separation and is available in our catalogs, in order to demonstrate the power of this multidimensional method. Instead we use color to help interpret the classes that emerge from the experiment. See Weir et al. (1995) for a more detailed description of the data and references regarding the attributes.
It took three days to run the whole dataset on a SPARCstation 20, and the final tree occupies 41 MB of disk space. Because of the size of the dataset, we ran COBWEB/95 on the data only once. (Because it is an incremental algorithm, its results are sensitive to the order in which the data appear, making multiple runs desirable.) Figure 1 shows the COBWEB/95 tree generated up to level 4. It does not show lower levels in cases where a node's "purity" with respect to the majority class (the name of the majority class is printed in each node) is higher than 94%. The classification accuracy of this tree with respect to the class label is 94% at the second level (one below the root) and 97% at the fourth level. Recall that our system is an unsupervised learning system, which means that the class information is not provided when it builds the classification tree.
In this paper, we described a system that classifies objects whose attribute values may be nominal or numeric, in an unsupervised manner. We consider the experimental result to be preliminary evidence that hierarchical unsupervised learning techniques can be very useful for astronomical data analysis. The next step of this research is to let astronomers analyze the final hierarchy in order to associate astronomical meaning with as much of the hierarchy as possible and to investigate the possibility of discovering previously unknown objects or classes of objects.
Collins, C. A., Heydon-Dumbleton, N. H., & MacGillivray, H. T. 1989, MNRAS, 236, 7p
Fayyad, U., Weir, N., & Djorgovski, S. G. 1993, in Proc. of Tenth Int. Conf. on Machine Learning, Morgan Kaufmann
Fisher, D. H. 1987, in Machine Learning, 2, 139
Gluck, M., & Corter, J. 1985, in Proceedings of the Seventh Annual Conference of the Cognitive Science Society, Irvine, CA, Lawrence Erlbaum, 283
McKusick, K., & Langley, P. 1991, in Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, 810
Valdes, F. 1982, in Instrumentation in Astronomy IV, SPIE Proc., ed. D. L. Crawford, 331, 465
Weir, N., Djorgovski, S., & Fayyad, U. 1995, AJ, 101, 1