Astronomical Data Analysis Software and Systems V
ASP Conference Series, Vol. 101, 1996
George H. Jacoby and Jeannette Barnes, eds.

The CADC/ST-ECF Archives of HST Data: Less is More

Dennis R. Crabtree, Daniel Durand, Séverin Gaudet, Norman Hill

Canadian Astronomy Data Centre, Dominion Astrophysical Observatory, Herzberg Institute of Astrophysics, National Research Council of Canada

Benoît Pirenne

Space Telescope-European Coordinating Facility

Abstract:

The CADC and ST-ECF maintain copies of the HST archive in coordination with the STScI. We are continually exploring ways in which to make the archive a more powerful scientific tool for researchers (previous examples include: the science table, preview images, and a prototype Web interface). We also seek out innovative, cost-effective methods for accessing the HST archive. We have developed a new approach to the HST archive which not only makes it more useful scientifically, but also uses new technologies to provide more cost-effective service to our users.

The CADC and ST-ECF now support on-line access to the HST archive through a CD-ROM jukebox. We also support on-the-fly reprocessing of HST data when it is requested from the archive, which ensures that archival researchers can obtain data calibrated with the latest software and reference files.

Performing on-the-fly recalibration allows us to keep only the raw data on-line, and we further subset this by selecting only data for the science targets. Finally, the raw data is compressed using gzip before being written to CD-ROM. Taken together, the complete set of public HST data, i.e., data obtained until November 1994, fits on approximately 31 CDs!

This paper describes the overall architecture of the archive, including the Web access, maintenance of data on CD-ROM, maintenance of the reference data, and details of the recalibration pipeline. Possible extensions for the future are also discussed.

1. Introduction

The CADC and ST-ECF maintain archives of data from the Hubble Space Telescope in cooperation with the Space Telescope Science Institute (STScI). Our two sites maintain a duplicate of the HST database, which is updated over the Internet using Sybase's Replication Server. We receive a copy of the actual HST data on 12-inch Sony optical disks (WORM). The ST-ECF receives a copy of all the data in the archive, while the CADC receives everything but the engineering data. However, unlike STScI, we cannot afford the jukeboxes needed to store the HST data on-line. The disks are stored in cabinets and all requests for data need to be handled manually by an operator. Also, the software we distributed to access the HST archive would only run on Sparc machines (SunOS and Solaris), which limited access to the archive.

Unlike the STScI Hubble archive, which needs to support the operational needs for the telescope, our focus is solely on archival researchers. This allows us to explore different options and mechanisms for providing access to the archive. We made three key observations while exploring the future of the HST archive at the CADC and ST-ECF.

  1. CD-ROM technology has changed rapidly in the past two years, with blank CD-R media costing approximately $10, CD jukeboxes with a capacity of 500 CDs readily available at a reasonable price, and the cost of CD recorders also dropping.

  2. The capabilities of the World Wide Web and browsers such as Netscape continue to grow at a remarkable pace. It is now feasible to offer the Web as the sole access to the archive. While the capabilities of this interface may be somewhat limited compared to a custom developed interface, the limitations are not severe and are likely to be temporary.

  3. The HST Calibration Workshop in May 1995 raised the issue of recalibration of the data in the archive. It seemed that most experienced users would routinely recalibrate HST data as new reference files or software became available. The question became: why would you not recalibrate data from the archive?

We decided on the following goal for the CADC/ST-ECF HST archive:

Store all HST science observations on-line and provide users with data calibrated at the time of the request using the latest calibration files and calibration software. Access to the HST archive will be Web-based in order to support the largest possible user-base.

2. The Data

The CADC version of the HST archive, which excludes engineering data, contains over 1 Tbyte of data. This includes raw and calibrated scientific data, calibration files, data used to produce the calibration files such as Earth flat observations, etc. Archival researchers are usually interested in a subset of this data. The typical archival researcher is interested in calibrated data for a set of science targets.

The data for such science targets is a subset of the data in the Hubble archive. While the end-user is interested in calibrated data, which is a further subset of the archive, an equivalent approach is to consider raw data in conjunction with automatic recalibration of the data.

A further reduction of the data can be realized through data compression. It turns out that the raw data in the archive compresses very well with either the standard UNIX compress utility or gzip. In fact, using gzip compression, the compressed raw data for ALL public HST observations of science targets fits on fewer than 30 CDs (as of Oct. 31, 1995). Each CD holds approximately 1000 datasets, 90% of which are either WFPC or WFPC2 observations.
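
As a rough illustration of this step, the following Python sketch gzip-compresses a file and estimates how many CDs a given volume of compressed data would need. The 650 MB CD capacity and the function names are assumptions made for the example; the paper does not state the capacity figure or the tooling actually used.

```python
import gzip
import math
import shutil
from pathlib import Path

# Assumed usable capacity of one CD-R (not a figure taken from the paper).
CD_CAPACITY_BYTES = 650 * 1024 * 1024


def gzip_file(src: Path, dst: Path) -> int:
    """Compress src into dst with gzip and return the compressed size in bytes."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst.stat().st_size


def cds_needed(total_bytes: int, capacity: int = CD_CAPACITY_BYTES) -> int:
    """Number of CDs needed to hold total_bytes, ignoring filesystem overhead."""
    return math.ceil(total_bytes / capacity)
```

Summing the sizes returned by gzip_file over all datasets and passing the total to cds_needed gives an estimate of the CD count under the assumed capacity.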

The CDs containing the HST data are stored on-line in a CD jukebox. The data on the CDs are available as ordinary UNIX files which can be accessed using standard UNIX tools. Data is copied to a magnetic disk as it becomes public and then written to CD when enough data is accumulated. We expect the data rate from HST (prior to the 1997 servicing mission) will require approximately one CD per month.
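
The staging step described above, where data waits on magnetic disk until there is enough to fill a CD, might be sketched as follows. This is a minimal illustration only: the capacity value, the function name, and the simple in-order fill policy are assumptions, not a description of the actual archive software.

```python
from typing import List, Tuple

# Assumed usable capacity of one CD-R (not a figure taken from the paper).
CD_CAPACITY_BYTES = 650 * 1024 * 1024

StagedFile = Tuple[str, int]  # (file name, size in bytes)


def next_cd_batch(
    staged: List[StagedFile], capacity: int = CD_CAPACITY_BYTES
) -> Tuple[List[StagedFile], List[StagedFile]]:
    """Split staged files into (batch to write to CD, files left on disk).

    Files are taken in the order they became public. If the staged data
    would not yet fill a CD, the batch is empty and everything stays on disk.
    """
    total = sum(size for _, size in staged)
    if total < capacity:
        return [], list(staged)  # not enough data accumulated yet

    batch: List[StagedFile] = []
    used = 0
    for i, (name, size) in enumerate(staged):
        if size > capacity:
            raise ValueError(f"{name} is larger than one CD")
        if used + size > capacity:
            return batch, list(staged[i:])
        batch.append((name, size))
        used += size
    return batch, []
```

A more careful policy could pack files out of order to waste less space per disc; the in-order version is shown because it keeps data on each CD grouped by release date.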

3. Recalibration

HST observations are calibrated at the time of observation using the best reference files and software available. This process is header driven in the sense that the steps to be performed, and the reference files and tables needed, are stored as header keywords. This data is provided to the Principal Investigator and stored in the archive.
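
To make the idea of a header-driven process concrete, here is a small Python sketch in which a header is modelled as a dictionary. The keyword names, the PERFORM/OMIT values, and the file names are illustrative assumptions for the example and should not be read as the exact keywords of any particular instrument.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative mapping from a calibration switch keyword to the keyword
# that names the reference file the step needs (assumed names).
STEP_TO_REFERENCE = {
    "BIASCORR": "BIASFILE",
    "DARKCORR": "DARKFILE",
    "FLATCORR": "FLATFILE",
}


def plan_calibration(header: Dict[str, str]) -> List[Tuple[str, Optional[str]]]:
    """Return (step, reference file) pairs for every step marked PERFORM."""
    plan = []
    for switch, ref_keyword in STEP_TO_REFERENCE.items():
        if header.get(switch) == "PERFORM":
            plan.append((switch, header.get(ref_keyword)))
    return plan


# Hypothetical header: bias and flat-field corrections requested, dark omitted.
example_header = {
    "BIASCORR": "PERFORM",
    "BIASFILE": "bias_a.ref",
    "DARKCORR": "OMIT",
    "DARKFILE": "dark_a.ref",
    "FLATCORR": "PERFORM",
    "FLATFILE": "flat_a.ref",
}
```

Calling plan_calibration(example_header) yields the bias and flat-field steps with their reference files, which is all a pipeline needs to know in order to process the dataset.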

However, when data is requested from the archive, which is at least one year later, there is a good chance that either better reference data is available or the calibration software has improved. Either one of these conditions implies that the user of the data should recalibrate the data. Fortunately, STScI stores information on which reference files should be used in the archive database. Thus it is straightforward to identify the best reference files and tables to use given a particular dataset. Also, the software needed to calibrate the data is distributed as part of STSDAS. Thus all the pieces are in place to automatically recalibrate the data when it is requested from the archive. The following excerpt from an FOS Information Bulletin is an example of the reasons for pursuing the goal of recalibration:

20 June 1995: New flat field reference files derived from post-COSTAR observations for the low dispersion FOS gratings and both detectors have been delivered to CDBS and installed in the PODPS pipeline. Low dispersion GO data obtained prior to this date have been reduced with flat fields appropriate to the pre-COSTAR epoch, which were (until now) the best available flat fields. All low dispersion (G160L, G650L, or PRISM) GO data obtained prior to this date should be re-processed with CALFOS and the new reference files...

Figure 1: Retrieval pipeline for HST data.

There are three main pieces in the implementation of recalibration of data from the CADC/ST-ECF Hubble archives and these are illustrated in Figure 1. When a request for data is submitted (using our WWW interface) the necessary raw data is copied from the CDs stored on-line in the CD-ROM jukebox. Once on magnetic disk the data is uncompressed and read into STSDAS. Then for each observation, the database is queried for the best reference files. The output from this procedure is stored as a small cl script containing updates to the parameters for the reference files and calibration steps. The header of the raw data is then updated to reflect the new calibration file information. Finally, the data is calibrated using the pipeline software available in STSDAS. All of the current reference files and tables are stored on-line on magnetic disk, a total of approximately 7 GB. When the data is calibrated it is written out in FITS format and made available via anonymous ftp or written to tape if requested.
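
The reference-file update at the heart of this pipeline can be sketched in Python as below. This is a simplified model under stated assumptions: headers are plain dictionaries, and the database query and the STSDAS calibration are passed in as hypothetical callables (lookup_best_refs and calibrate) rather than real interfaces.

```python
from typing import Callable, Dict

Header = Dict[str, str]


def reference_updates(header: Header, best_refs: Dict[str, str]) -> Dict[str, str]:
    """Return only the reference keywords whose best file differs from the header."""
    return {kw: best for kw, best in best_refs.items() if header.get(kw) != best}


def recalibrate(
    header: Header,
    lookup_best_refs: Callable[[Header], Dict[str, str]],
    calibrate: Callable[[Header], Header],
) -> Header:
    """Query the best reference files, update the header, then calibrate.

    lookup_best_refs stands in for the database query and calibrate for the
    pipeline software; both are placeholders supplied by the caller.
    """
    best = lookup_best_refs(header)
    updates = reference_updates(header, best)
    updated_header = {**header, **updates}  # leave the original untouched
    return calibrate(updated_header)
```

Keeping the updates as a separate small mapping mirrors the role of the generated script in the text: it records exactly what changed relative to the original calibration.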

WFPC2 datasets, which are generally the most time-consuming, take approximately 3 minutes each to fully process. At this rate we can process almost 500 datasets per day on a Sparc 10/51. During our testing of this process we recalibrated all public HRS data in approximately 2.5 days.
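
The daily throughput figure follows directly from the per-dataset time, assuming datasets are processed one after another around the clock (an assumption of this sketch):

```python
MINUTES_PER_DAY = 24 * 60


def datasets_per_day(minutes_per_dataset: float) -> float:
    """Datasets processed in 24 hours of continuous, serial processing."""
    return MINUTES_PER_DAY / minutes_per_dataset


print(datasets_per_day(3))  # 480.0
```

At 3 minutes per dataset this gives 480 datasets per day, consistent with the figure of almost 500 quoted above.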

4. The Web Interface and the Future

The CADC/ST-ECF Hubble archives can be accessed via the World Wide Web at either http://cadcwww.dao.nrc.ca or http://arch-http.hq.eso.org/ESO-ECF-Archive.html.

The WWW access provides complete access to the Hubble Science Archive including access to preview images in GIF format. The Web interface also provides easy access to the CFHT and ESO archives, SIMBAD and the Digitized Sky Survey. The Web truly does allow varying services to be effectively linked together!

Possible future enhancements to the system include:

Acknowledgments:

We are grateful to the Canadian Space Agency for the continued support of the Hubble archive at the CADC.


Wed Jul 3 07:33:56 MST 1996