T. Comeau
Space Telescope Science Institute, 3700 San Martin Drive,
Baltimore, MD 21218
Analysis of current retrieval behavior shows that less than one quarter of archived datasets are ever retrieved, and less than one tenth are retrieved more than once, which argues against a cache. However, some classes of data for certain proposal types are repeatedly requested, and internal users differ greatly from external users. A separate cache strategy for each group, along with changes to place related data on the same optical platters, should improve performance.
The Data Archive and Distribution System (DADS) currently ingests 40 to 60 gigabytes of new data each month, about half of which are CAL class (calibrated science) datasets.
Retrieval volume (excluding internal DADS operation retrievals) has risen to slightly more than the ingested volume. Two thirds to three quarters of the retrieved volume goes to users internal to the Space Telescope Science Institute (STScI).
Based on our experience with the interim Data Management Facility (DMF), we expected that for DADS nearly all new datasets would be retrieved within the first week after ingest, and that most datasets would be retrieved more than once.
This assumption was incorrect.
Our analysis was based on information in several database tables maintained by DADS, which is operated by STScI.
The Archive Data Set All table contains information about each ingested dataset, including the date on which the dataset was created, the size of the dataset, the source of the data, and the observing program for which the dataset was created. Joining this table to the Proposal table allows us to determine the type of observing program with which a dataset is associated.
The Requests table contains information about each retrieval request, including the user who made the request and whether the request was successful. By joining the Requests table with the Registered Users table, we determine whether the user is internal to the Institute, and whether the user is a Superuser. Superusers include data analysts and instrument scientists with essentially unrestricted access to the Archive.
The Request Files table contains information about each dataset requested by a user, including the name, generation date, and archive class of the dataset.
We limited our analysis to datasets ingested from the Science Operations Ground System (SOGS) between the First Servicing Mission (December 1993) and August 1995. Datasets from other sources, including DMF, were not considered.
We joined the information in these tables to determine, for all datasets, the number of times each dataset was retrieved by Internal and External users.
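The join described above can be sketched with an in-memory SQLite database. This is only an illustration: the table and column names below are hypothetical simplifications, not the actual DADS schema, and counting only successful requests is an assumption.

```python
import sqlite3

# Hypothetical, simplified schema; the real DADS tables differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datasets (name TEXT PRIMARY KEY, archive_class TEXT);
CREATE TABLE users (user_id INTEGER PRIMARY KEY, internal INTEGER);
CREATE TABLE requests (request_id INTEGER PRIMARY KEY,
                       user_id INTEGER, successful INTEGER);
CREATE TABLE request_files (request_id INTEGER, dataset_name TEXT);
""")

# Toy rows for illustration only (not real archive data).
conn.executemany("INSERT INTO datasets VALUES (?, ?)",
                 [("d1", "CAL"), ("d2", "CAL"), ("d3", "EDT")])
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 1), (2, 0)])
conn.executemany("INSERT INTO requests VALUES (?, ?, ?)",
                 [(10, 1, 1), (11, 2, 1), (12, 2, 0)])
conn.executemany("INSERT INTO request_files VALUES (?, ?)",
                 [(10, "d1"), (11, "d1"), (11, "d2"), (12, "d3")])

# LEFT JOINs keep datasets that were never retrieved, with counts of zero.
# Assumption: only successful requests count as retrievals.
QUERY = """
SELECT d.name,
       COALESCE(SUM(u.internal = 1), 0) AS internal_count,
       COALESCE(SUM(u.internal = 0), 0) AS external_count
FROM datasets d
LEFT JOIN request_files rf ON rf.dataset_name = d.name
LEFT JOIN requests r ON r.request_id = rf.request_id AND r.successful = 1
LEFT JOIN users u ON u.user_id = r.user_id
GROUP BY d.name
ORDER BY d.name
"""
counts = conn.execute(QUERY).fetchall()
for name, internal_count, external_count in counts:
    print(name, internal_count, external_count)
```

Each output row gives a dataset name followed by its internal and external retrieval counts; never-retrieved datasets appear with zeros rather than being dropped.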
Of the 222,089 datasets ingested, only 47,238 (21.27%) were ever retrieved, and only 18,120 (8.16%) were retrieved more than once. Most datasets are never retrieved from the archive, and of those that are retrieved, only 38% are retrieved more than once.
The rarely retrieved datasets include intermediate products such as EDT (edited science data) class data, scheduling and support data, and engineering data. For EDT data, only 2,006 of 38,097 datasets (5.27%) were ever retrieved. Although most non-science data is rarely retrieved, it is occasionally needed for problem analysis, troubleshooting, and reprocessing, and it is impossible to determine in advance which datasets within a class will be needed. By comparison, about 60% of CAL datasets are retrieved.
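The percentages quoted above follow directly from the dataset counts. A short check, using only the counts given in the text (the variable names are ours):

```python
# Dataset counts as quoted in the text.
ingested = 222_089
retrieved = 47_238
retrieved_multiple = 18_120
edt_ingested = 38_097
edt_retrieved = 2_006

pct_retrieved = 100 * retrieved / ingested
pct_multiple = 100 * retrieved_multiple / ingested
# Share of retrieved datasets that were retrieved more than once.
pct_repeat_of_retrieved = 100 * retrieved_multiple / retrieved
pct_edt = 100 * edt_retrieved / edt_ingested

print(f"ever retrieved:            {pct_retrieved:.2f}%")
print(f"retrieved more than once:  {pct_multiple:.2f}%")
print(f"repeats among retrieved:   {pct_repeat_of_retrieved:.0f}%")
print(f"EDT ever retrieved:        {pct_edt:.2f}%")
```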
The first cache option we considered, based on the DMF experience, was to preload a cache with incoming ingest datasets, replacing least recently ingested datasets. This is roughly the behavior we would expect from a Hierarchical Storage Management system (HSM). This assumes, however, that datasets are actually retrieved soon after ingest. Since less than one quarter of all datasets are ever retrieved, this "preloaded" cache would not be effective.
The second approach considered was a cache loaded from user retrievals, using a least recently used replacement strategy. To evaluate this approach we examined how quickly after ingest datasets are retrieved. While some datasets are retrieved shortly after ingest, most are retrieved much later. Assuming that the cache held only a few weeks' worth of data, the expected hit rate is below five percent.
Plotting the "age" of datasets at retrieval (the interval between the generation date of the dataset and the date of the retrieval request) revealed that while some datasets are retrieved immediately after ingest, most are not. Only 8.25% of datasets are retrieved in the first week after ingest, and only 13.9% in the first month.
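The age-at-retrieval measure can be sketched as follows. The function name, the dictionary layout, and the dates are hypothetical toy values; the sketch assumes the fraction is taken over all ingested datasets, including those never retrieved.

```python
from datetime import date

def fraction_retrieved_within(generated, first_retrieved, days):
    """Fraction of all datasets first retrieved within `days` of generation.

    generated:       dataset name -> generation date
    first_retrieved: dataset name -> date of first retrieval request
                     (datasets never retrieved are simply absent)
    """
    hits = sum(
        1
        for name, gen_date in generated.items()
        if name in first_retrieved
        and (first_retrieved[name] - gen_date).days <= days
    )
    return hits / len(generated)

# Toy data for illustration only.
generated = {name: date(1995, 1, 1) for name in ("a", "b", "c", "d")}
first_retrieved = {
    "a": date(1995, 1, 3),   # age 2 days
    "b": date(1995, 1, 20),  # age 19 days
    "c": date(1995, 6, 1),   # age 151 days
}                            # "d" is never retrieved

week_fraction = fraction_retrieved_within(generated, first_retrieved, 7)
month_fraction = fraction_retrieved_within(generated, first_retrieved, 30)
print(week_fraction, month_fraction)
```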
Internal users account for about 60% of retrievals, external users about 40%. Plotting retrievals separately for internal and external users revealed very different patterns for the two groups.
Internal users do retrieve recently ingested data, particularly CAL datasets taken for Engineering and Calibration proposals. Internal retrievals of datasets for GO and GTO proposals during the first few weeks are rare. Internal users do retrieve datasets for GO and GTO proposals, but they do so between 50 and 200 days after ingest.
External users retrieve predominantly CAL datasets taken for GO and GTO proposals, and they frequently do so in the month immediately after the end of the proprietary period for that dataset.
We then examined the individual retrievals to determine if there were patterns in the way datasets are retrieved.
First, most users request CAL class datasets. Users who generate retrieval requests using Starview will automatically also request related OCX and PDQ data, and in some cases OMS class datasets related to the observation.
Second, users tend to request groups of related datasets from a single proposal or target. For example, the most popular dataset in the Archive is a WFPC-2 image of SN1987A. Users requesting this dataset tend to also get other datasets taken in conjunction with this image, and to get the OCX and PDQ datasets associated with each image.
Because datasets for an observation are usually scattered across multiple DADS platters, retrievals generated in this way require mounting many platters. Platter mounts are time-consuming operations that also accelerate wear on the DADS drives and jukeboxes. Collecting datasets for a single observation or group of observations on the same platter would significantly reduce the number of mounts required to service a request. This would improve both the performance of the Archive and the reliability of DADS hardware components.
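The effect of platter placement on mount count can be illustrated with a small sketch. The dataset names, platter identifiers, and the `mounts_needed` helper are hypothetical, and the sketch assumes each distinct platter is mounted exactly once per request.

```python
def mounts_needed(request, platter_of):
    """Number of distinct platters that must be mounted for a request."""
    return len({platter_of[name] for name in request})

# One observation: a CAL dataset plus its related OCX and PDQ datasets.
request = ["obs1_cal", "obs1_ocx", "obs1_pdq"]

# Scattered layout: each class was written to a different platter.
scattered = {"obs1_cal": "P017", "obs1_ocx": "P042", "obs1_pdq": "P063"}

# Grouped layout: all datasets for the observation share one platter.
grouped = {"obs1_cal": "P017", "obs1_ocx": "P017", "obs1_pdq": "P017"}

scattered_mounts = mounts_needed(request, scattered)
grouped_mounts = mounts_needed(request, grouped)
print(scattered_mounts, grouped_mounts)
```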
Our analysis suggests that two different caches would improve DADS performance: a preloaded cache for non-operator internal users, and a retrieval-loaded cache for external users.
The preloaded cache would contain recently ingested CAL, OCX, and PDQ datasets for Calibration and Engineering proposals, with a least recently ingested replacement strategy, sized to hold seven to ten days of datasets. Currently this would require about 7 GB of storage. For WFPC-2 Calibration proposals, the hit rate for this cache would approach 100%.
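A minimal sketch of such a preloaded cache, with least recently ingested replacement, is given below. The class name, the proposal type codes ("CAL", "ENG", "GO"), and the byte-based capacity are illustrative assumptions, not a description of any DADS implementation.

```python
from collections import OrderedDict

CACHED_CLASSES = {"CAL", "OCX", "PDQ"}
# Hypothetical codes for Calibration and Engineering proposals.
CACHED_PROPOSAL_TYPES = {"CAL", "ENG"}

class PreloadedCache:
    """Cache filled at ingest time; evicts the least recently ingested."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # name -> size, oldest ingest first

    def ingest(self, name, size, archive_class, proposal_type):
        """Offer a newly ingested dataset; return True if it was cached."""
        if archive_class not in CACHED_CLASSES:
            return False
        if proposal_type not in CACHED_PROPOSAL_TYPES:
            return False
        if size > self.capacity:
            return False
        if name in self.entries:  # re-ingest replaces the old copy
            self.used -= self.entries.pop(name)
        while self.used + size > self.capacity:
            _, old_size = self.entries.popitem(last=False)
            self.used -= old_size
        self.entries[name] = size
        self.used += size
        return True

    def lookup(self, name):
        """Return True on a cache hit."""
        return name in self.entries

cache = PreloadedCache(capacity_bytes=100)
accepted = [
    cache.ingest("u1", 60, "CAL", "CAL"),  # cached
    cache.ingest("u2", 60, "EDT", "CAL"),  # wrong archive class
    cache.ingest("u3", 60, "CAL", "GO"),   # wrong proposal type
    cache.ingest("u4", 50, "PDQ", "ENG"),  # cached, evicts u1
]
print(accepted, cache.lookup("u1"), cache.lookup("u4"))
```

Unlike a least recently used cache, retrievals do not reorder the entries here: a dataset ages out a fixed amount of ingest volume after it arrives, regardless of how often it is requested.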
The retrieval-loaded cache would contain recently retrieved datasets for GO and GTO proposals, with a least recently retrieved replacement strategy, sized to hold several weeks' worth of retrievals. Currently this would require 10 to 15 GB of storage. The hit rate for this cache would vary widely: for some periods the estimated hit rate is near 80%, but for most periods it is below 40%. The best cache performance occurs when a small number of visually interesting images become non-proprietary. An automatic method of identifying visually interesting images would be useful.
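A hit rate for a retrieval-loaded cache can be estimated by replaying a retrieval log through a simulated cache. The sketch below is one way to do this under simple assumptions; the function name, the log format, and the toy log are hypothetical, and it does not reproduce the estimates quoted above.

```python
from collections import OrderedDict

def lru_hit_rate(retrievals, capacity_bytes):
    """Replay (name, size) retrievals through a least recently used cache.

    Returns the fraction of retrievals served from the cache.
    """
    cache = OrderedDict()  # name -> size, least recently used first
    used = 0
    hits = 0
    for name, size in retrievals:
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # too large to cache at all
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[name] = size
        used += size
    return hits / len(retrievals) if retrievals else 0.0

# Toy log for illustration only: unit-sized datasets.
log = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
small_rate = lru_hit_rate(log, capacity_bytes=2)
large_rate = lru_hit_rate(log, capacity_bytes=3)
print(small_rate, large_rate)
```

Replaying the same log at several capacities gives a rough picture of how much storage is needed before the hit rate stops improving.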
Improved performance, particularly for internal users, is needed to support the SM-97 instruments.
Conventional least recently used caching of all datasets would result in a low hit rate, because most datasets are retrieved once, or not at all.
Internal users would benefit from a preloaded cache of recently ingested datasets for proposals other than GO and GTO, while external users would benefit from a larger cache using least recently used caching of only datasets for GO and GTO proposals.
More significant retrieval performance gains would be obtained by grouping CAL class and related datasets for an observation on the same platter. This would reduce the number of mounts required to service a request. Rarely retrieved classes should be placed on separate platters for export to an off-line store.
I wish to thank Lisa Gardner (STScI), Cynthia Montalvo and Simone Stewart for their assistance in working out the database queries for this paper.