R. J. Brunner, I. Csabai, A. Szalay, A. J. Connolly, G. P. Szokoly
Department of Physics & Astronomy, The Johns Hopkins University, Baltimore, MD 21218
K. Ramaiyer
Department of Computer Science, The Johns Hopkins University, Baltimore, MD 21218
[...] ($\pi$ steradians or 10,000 sq. degrees) over a period of five
years. The photometric imaging will be done in five broad bands
($u'$, $g'$, $r'$, $i'$, & $z'$), to a limiting magnitude of [...]. From
the photometric catalog, a sample of $10^6$ galaxies (complete
to [...]) and $10^5$ quasars ([...]) will be observed using two
multi-fiber double spectrographs. The entire dataset produced during
the course of the survey will be tens of terabytes in size, with the
photometric catalog alone occupying several hundred gigabytes.
In addition to being an enormously large dataset, the Sloan Digital Sky Survey Science Archive will be complex, containing several hundred million objects in five colors, with measured attributes, and associated spectral and calibration data. Some objects will be observed multiple times due to the survey geometry, for quality control, or potential variability studies. As a result of the large and complex nature of this data, we have decided to employ a distributed object oriented archival system that utilizes a geometric indexing scheme to provide a rapid query and analysis interface to the Sloan Digital Sky Survey Science Archive.
The Sloan Digital Sky Survey (SDSS) is a collaboration between The
Johns Hopkins University, the University of Chicago, Princeton
University, the Institute for Advanced Study, Fermi National
Accelerator Laboratory, the Japan Participation Group, the University
of Washington, and the United States Naval Observatory, with
additional funding provided by the Sloan Foundation and the National
Science Foundation. The survey will employ a dedicated 2.5 meter
Ritchey-Chrétien telescope with a 3° field of view. The
photometric imaging will use a CCD array that consists of 30 2K x 2K
Imaging CCDs, 22 2K x 400 Astrometric CCDs, and 2 2K x 400 Focus
CCDs. The raw data rate from this imaging camera will exceed 8
megabytes per second. The spectroscopic survey utilizes two
multi-fiber high resolution spectrographs, each with 320 3'' fibers,
that provide spectral coverage from [...].
Overall, the survey archive has evolved into two distinct components: the Operational Archive where the raw data is reduced and mission critical information is stored, and the Science Archive where calibrated data is available to the collaboration for analysis. The separation of the archive into two components is crucial to the successful operation of the survey. This approach helps to ensure the integrity of the operational data by removing possible outside interference, as well as guaranteeing that the scientists are provided data of the highest calibrated quality.
In designing the Science Archive for the Sloan Digital Sky Survey, we have developed a three-tiered system that provides maximum flexibility and isolates potential portability conflicts. The system uses a novel geometrical indexing scheme to provide both fast access to the data and feedback on the estimated size and execution time for a submitted query. The query results are extracted from the data repository and returned to the user in a customizable environment in order to promote the scientific viability of the survey.
The main project for the SDSS is a drift scan of 10,000 sq. degrees
centered on the North Galactic Cap in five broadband filters:
$u'$, $g'$, $r'$, $i'$, and $z'$. The photometric calibration is
expected to be 2% at [...], and the astrometric calibration should be
accurate to better than 60 milliarcseconds. From this photometric
catalog, a complete galaxy redshift survey to [...] will be
constructed. A secondary project, which will complement the northern
survey, is to image the same strip in the southern galactic cap
repeatedly, for variability studies and deeper imaging, while the main
northern survey area is not available.
The SDSS Science Archive will consist of four main components: a photometric catalog, a spectroscopic catalog, images, and spectra. The photometric catalog is expected to contain one hundred million galaxies, one hundred million stars, and one million quasars, with magnitudes, profiles, and observational information recorded in the archive. Each detected object will also have an associated image cutout for each of the five bandpasses. The spectroscopic catalog will contain identified emission and absorption lines, and one dimensional spectra for one million galaxies, one hundred thousand stars, one hundred thousand quasars, and about ten thousand clusters. In addition, derived, custom catalogs may be included, such as a photometric cluster catalog, or QSO absorption line catalog.
The Science Archive employs a three-tiered architecture: the user interface, the query support component, and the data warehouse. This distributed approach provides maximum flexibility, while maintaining portability, by isolating hardware specific features. The different tiers communicate through TCP/IP sockets using a well defined ASCII control protocol, while data is transferred in a binary machine independent format.
The user interface is primarily responsible for aiding in the formulation of queries, and in receiving the extracted data. A graphical interface is provided that allows the user to construct a query from the attributes retrieved from the archive metadata. This graphical query is then converted into an ASCII query string written in a subset of SQL. The query support layer then tokenizes, parses, and transforms the ASCII string into a query tree which encapsulates the geometrical nature of the data. Depending on the user's particular request, either the estimated query feedback is returned to the user interface, or the query is optimized and sent to the data warehouse for execution. The feedback mechanism provides a trial-and-error exploratory querying ability that can prevent unintended queries from wasting system resources.
The data warehouse contains the actual data as well as the corresponding metadata (data that describes the contents of the archive) for the entire science archive. This data repository is implemented using Objectivity, a commercial object oriented database management system. The interface to the data repository is through an object request broker (ORB), which communicates with the query support layer. When a query is executed, the ORB opens the appropriate databases, and extracts the desired attributes from all objects which satisfy the optimized query. These queries can be executed using several distinct methods of access: an OQL predicate query, an ODBC interface, and an SQL++ interface, all provided by Objectivity. The extracted data is sent to the query support layer where it is routed to the user's desired destination.
Astronomical data often contains a numerical subset (e.g., spatial
coordinates) that is indexed in order to expedite queries.
Traditional indexing schemes must be restricted to a few parameters;
otherwise, the index begins to match the actual data in physical
size. Moreover, current archive queries are limited to simple ranges
of parameter values, while the desired query may be more
complicated. For example, "Find all blue ([...]) galaxies fainter
than [...] that are within 3 arcseconds of a quasar brighter than
[...]." Such complex queries can be succinctly modeled in the
formalisms of computational geometry: "Find all data points within a
given metric distance of a specified simplex." As a result, we have
developed a multidimensional geometric indexing scheme that provides
accurate predictions of query volumes and times, a snapshot of the
spatial relationships that exist within the dataset, and also a
quantization of data on the storage media.
Our geometric index consists of a modified k-d tree, in which a d-dimensional numerical dataset is partitioned using k-key attributes in a binary tree structure. Within the tree, a node conceptually represents a subvolume within the entire d-dimensional volume occupied by the data. Thus, the root node represents the entire dataset, which is then partitioned using one of the k-key attributes into two separate volumes, which are represented by the root node's two children. This process continues until a predetermined leaf node level is reached, wherein all objects which lie within the leaf node's volume are quantized and stored contiguously on the storage media in an attempt to ensure efficient cache hits. The leaf nodes, or cells, form a coarse grained density map of the actual data.
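The partitioning described above can be illustrated with a minimal Python sketch. It is not the archive's implementation: the names `Node`, `build`, and `leaves`, the median split, the round-robin choice among the key attributes, and the tight bounding boxes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    lo: Point                      # lower corner of the node's subvolume
    hi: Point                      # upper corner of the node's subvolume
    count: int                     # number of objects inside the subvolume
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    points: Optional[List[Point]] = None   # set only on leaf nodes (cells)


def build(points: Sequence[Point], keys: Sequence[int],
          depth: int = 0, max_depth: int = 3) -> Node:
    """Partition d-dimensional points on the key attributes (hypothetical scheme).

    `keys` lists the attribute indices used for splitting; the split axis
    cycles through them, and each split is made at the median object.
    """
    dims = range(len(points[0]))
    lo = tuple(min(p[i] for p in points) for i in dims)
    hi = tuple(max(p[i] for p in points) for i in dims)
    # Stop at the predetermined leaf level (or when nothing is left to split).
    if depth == max_depth or len(points) < 2:
        return Node(lo, hi, len(points), points=list(points))
    axis = keys[depth % len(keys)]
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(lo, hi, len(ordered),
                left=build(ordered[:mid], keys, depth + 1, max_depth),
                right=build(ordered[mid:], keys, depth + 1, max_depth))


def leaves(node: Node) -> List[Node]:
    """Return the leaf cells, which together form the coarse density map."""
    if node.points is not None:
        return [node]
    return leaves(node.left) + leaves(node.right)
```

In this sketch each leaf keeps its objects in one list, standing in for contiguous storage, while the `(lo, hi, count)` triples of the leaves are the only information a density map needs.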
Queries are first executed on the geometric index, producing feedback to the user on the estimated number of objects satisfying the search, and the estimated search time. As the index is compact enough to fit entirely within the memory of the user's system, any interactions with the index are extremely fast. Using the information contained within the index, queries can be optimized on a cell-by-cell basis by removing extraneous or redundant portions of the query within the given cell. These optimized container queries fit naturally within Objectivity's database hierarchy, which provides a one-to-one mapping from our leaf nodes to containers of objects on the storage media. The quantization of queries provides an additional benefit of optimizing the query server by instituting a queued container query access (i.e., optimizing cache hits across individual queries).
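As an illustration of how such feedback could be derived from the index alone, the sketch below estimates the number of objects matching a simple range query and sorts the cells into those that need no per-object test and those that do. The function names, the representation of a cell as a `(lo, hi, count)` triple, the restriction to axis-aligned range queries, and the assumption of uniform density within a cell are all hypothetical simplifications, not details taken from the archive.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[Sequence[float], Sequence[float], int]   # (lo corner, hi corner, count)


def overlap_fraction(cell_lo: Sequence[float], cell_hi: Sequence[float],
                     q_lo: Sequence[float], q_hi: Sequence[float]) -> float:
    """Fraction of the cell's volume that lies inside the query box."""
    frac = 1.0
    for cl, ch, ql, qh in zip(cell_lo, cell_hi, q_lo, q_hi):
        inter = min(ch, qh) - max(cl, ql)
        if inter < 0:
            return 0.0            # disjoint on this axis: cell cannot match
        width = ch - cl
        if width == 0:
            continue              # flat cell that lies inside the query range
        frac *= inter / width
    return frac


def plan_query(cells: Sequence[Cell], q_lo: Sequence[float],
               q_hi: Sequence[float]) -> Tuple[float, List[int], List[int]]:
    """Return (estimated object count, fully covered cells, partially covered cells)."""
    estimate = 0.0
    full: List[int] = []
    partial: List[int] = []
    for idx, (lo, hi, count) in enumerate(cells):
        frac = overlap_fraction(lo, hi, q_lo, q_hi)
        if frac == 0.0:
            continue              # extraneous cell: never touched on disk
        estimate += frac * count  # assumes uniform density inside the cell
        (full if frac == 1.0 else partial).append(idx)
    return estimate, full, partial
```

Cells in the `full` list can be returned wholesale, since the query is redundant there, and only the `partial` cells require the predicate to be evaluated object by object. An estimated search time could be obtained in the same pass by weighting each visited cell with an assumed access cost.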
The geometrical indexing strategy can naturally incorporate the actual query, resulting in a more powerful search mechanism. Rather than limit a user to parameter cuts, linear combinations of attributes form the query primitive within our system. These linear combinations can then be combined using Boolean Algebra to form complex polyhedra that can carve out complicated volumes within the available parameter space. In order to simplify spatial queries, we work with a Cartesian projection of the spherical astrometric coordinates. This simplifies coordinate conversions, and reduces spherical proximities to a linear combination of the Cartesian coordinates.
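The following sketch shows one way these ideas fit together; it is an illustration under stated assumptions rather than the archive's query engine. A query primitive is modeled as a half-space (a linear combination of attributes compared with a bound), primitives are combined with Boolean operators, and a proximity search on the sphere becomes a single half-space test on Cartesian unit vectors, because two directions lie within an angle $\theta$ of each other exactly when their dot product is at least $\cos\theta$. All function names are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Vec = Tuple[float, float, float]
Predicate = Callable[[Sequence[float]], bool]


def unit_vector(ra_deg: float, dec_deg: float) -> Vec:
    """Convert spherical coordinates (degrees) to a Cartesian unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def halfspace(coeffs: Sequence[float], bound: float) -> Predicate:
    """Query primitive: true where sum(coeffs[i] * x[i]) >= bound."""
    return lambda x: sum(c * v for c, v in zip(coeffs, x)) >= bound


def all_of(*preds: Predicate) -> Predicate:
    """Boolean AND: the intersection of half-spaces is a convex polyhedron."""
    return lambda x: all(p(x) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Boolean OR: a union of volumes."""
    return lambda x: any(p(x) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Boolean NOT: the complement of a volume."""
    return lambda x: not pred(x)


def within(ra_deg: float, dec_deg: float, radius_arcsec: float) -> Predicate:
    """Proximity on the sphere as one half-space in Cartesian coordinates."""
    centre = unit_vector(ra_deg, dec_deg)
    return halfspace(centre, math.cos(math.radians(radius_arcsec / 3600.0)))
```

For example, `within(180.0, 30.0, 3.0)` accepts any unit vector within 3 arcseconds of the given position, and a color cut on two magnitudes can be written as a `halfspace` over the corresponding attribute vector and combined with other cuts through `all_of`.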
We wish to thank Robert Lupton, Don Petravick, Steve Kent, Jeff Munn, Brian Yanny, and Ruth Pordes for stimulating discussions. We also would like to acknowledge the Sloan Digital Sky Survey for providing funding for this project.
Gunn, J. E. & Knapp, G. R. 1992, PASP, 43, 267
Samet, H. 1990, The Design and Analysis of Spatial Data Structures, (Addison-Wesley)
The Object Database Standard: ODMG-93, ed. R. Cattell, (Morgan Kaufmann)
The SDSS Grey Book: A Proposal to the National Science Foundation