

Astronomical Data Analysis Software and Systems V
ASP Conference Series, Vol. 101, 1996
George H. Jacoby and Jeannette Barnes, eds.

The Astronomical Software Directory Service: Distributed Documents---Centralized Searchable Index

H. E. Payne, R. J. Hanisch

Space Telescope Science Institute

A. Warnock

A/WWW Enterprises

Abstract:

We have created a service on the World Wide Web for discovering astronomical software. Tools allow us to create a searchable index of documents distributed across the Web, rather than maintaining a centralized collection of documents. We have indexed astronomical software documentation, which is maintained directly by the software providers at their home sites. We also provide a uniform set of high-level package descriptions.

1. Introduction

There is a lot of software of interest to astronomers and astronomical software developers. Still, finding the software to solve a particular problem can be difficult. Sometimes we end up re-writing software, and sometimes we get something from a friend of a friend. Neither solution is optimal.

About three years ago, we started to think about a solution to this problem. The tools we were using for discovering software were archie and COSMIC. We liked archie as a free and open system, but felt limited because only file names are indexed---it's hard to find something unless you know what it is called, i.e., what you are looking for. NASA's COSMIC project distributes software developed under NASA contracts. It is not free, and not very well known.

What we wanted was (1) to catalog enough information about a software package's capabilities for you to decide whether it might be what you are looking for, (2) to catalog enough information about a package's system requirements and availability for you to decide whether you can run the software on your system, and how to get it if you can, (3) to incorporate the experience we gained from the package installations one of us (H. E. P.) was doing as part of our job, (4) to have a system that uses the Internet client/server model, (5) to use a public domain, well-maintained, and easy-to-install client for the user interface, and (6) to promote completeness, currency, and maintainability by using material available at the software developers' sites, and minimizing the amount of material kept at our end. This makes us a software directory, not a library. What we did not know was how we would do it.

By two years ago, when we announced the Astronomical Software Directory Service (ASDS) concept at ADASS III (Hanisch, Payne, & Hayes 1994), we had been introduced to the World Wide Web (WWW), and to WAIS indexing. We were pretty sure that these were the technologies we wanted to use. We felt it was very important to be able to search through the contents of our collection, and not just browse through the collection, guided by titles, or worse, filenames.

We now have a working ASDS system, which we will describe, and a small, introductory collection of documents. ASDS is an Internet service, accessible via the WWW. Instead of WAIS, we are using a modified version of the Center for Networked Information Discovery and Retrieval (CNIDR) Isite software. The modifications allow us to index documents made available to the WWW by software providers from their own sites---that is, we maintain a centralized index to a distributed document collection. At this point we are prepared to enhance the usefulness of our service by enlarging our document collection to include more software packages. To do this, we seek the cooperation of you, the astronomical software providers.

2. Isite Instead of WAIS

Our system architecture differs in one significant way from our original conception: it is based on CNIDR's Isite package instead of WAIS. The primary reason for making this switch was the ability to do fielded searching, but we also appreciate the search engine's support for phrase searching, right truncation (e.g., "galax*", which matches both galaxy and galaxies), and Boolean operations. We have also been fortunate to receive support from the Isite developers.
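As an illustration of right truncation combined with Boolean operations, the following is a minimal Python sketch. It is not Isite code; the function names (`tokenize`, `term_matches`, `search_and`, `search_or`) are hypothetical and the matching rules are simplified assumptions.

```python
import re


def tokenize(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_matches(term, words):
    """Match one query term against a set of words.

    A trailing '*' requests right truncation (prefix matching);
    otherwise the whole word must match.
    """
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        return any(word.startswith(prefix) for word in words)
    return term in words


def search_and(terms, text):
    """Boolean AND: every term must match somewhere in the text."""
    words = tokenize(text)
    return all(term_matches(term, words) for term in terms)


def search_or(terms, text):
    """Boolean OR: at least one term must match somewhere in the text."""
    words = tokenize(text)
    return any(term_matches(term, words) for term in terms)
```

With these definitions, the query term "galax*" matches a document that mentions either "galaxy" or "galaxies", while the plain term "galaxy" matches only the former.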

 
Figure 1:  Schematic representation of an index created from documents served to the WWW by httpd servers.

The system we are using contains modifications to the Isite distribution made by one of us (A. W.) and Jim Fullton of CNIDR. The essential modification allows URLs on the Web to be opened for indexing and searching. File names and URLs are treated as instantiations of a text file class, where the URL is retrieved over the network and held in a temporary cache. We are currently experimenting with the Harvest object cache software.
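The idea of treating file names and URLs as two forms of one text source can be sketched in Python as follows. This is an illustration under stated assumptions, not the modified Isite implementation: the class name `TextSource`, its methods, and the cache naming scheme are all hypothetical.

```python
import hashlib
import os
import tempfile
import urllib.request


class TextSource:
    """A document named either by a local file name or by a URL (hypothetical)."""

    def __init__(self, location, cache_dir=None):
        self.location = location
        # Assumption: remote documents are cached in the system temp directory.
        self.cache_dir = cache_dir or tempfile.gettempdir()

    def is_url(self):
        return self.location.startswith(("http://", "https://"))

    def cache_path(self):
        """Cache file name derived from a hash of the URL."""
        digest = hashlib.sha1(self.location.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest + ".cache")

    def local_path(self):
        """Return a local path, fetching the URL into the cache if needed."""
        if not self.is_url():
            return self.location
        path = self.cache_path()
        if not os.path.exists(path):
            with urllib.request.urlopen(self.location) as response:
                data = response.read()
            with open(path, "wb") as out:
                out.write(data)
        return path

    def read(self):
        """Return the document text, whichever way it was named."""
        with open(self.local_path(), "rb") as handle:
            return handle.read().decode("utf-8", errors="replace")
```

The indexer and the search engine can then call `read()` without knowing whether the document is local or remote, which is the property the modification is meant to provide.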

A schematic representation of the architecture for indexing is shown in Figure 1. Documents are served to the Web (or created on-the-fly) by httpd servers at remote sites. The indexer saves pointers to all of the phrases in the collection, along with a list of all of the fields in the collection.

3. The Collection

Two years ago, we had already decided that our collection would consist of two different types of documents: (1) a uniform set of high-level package descriptions, and (2) whatever else we could get our hands on, as long as it was in electronic form. Software documentation covers such a broad range of style, completeness, intended audience, format, and readability that uniformity could be obtained only by creating a description for each package. We settled on a set of about three dozen characteristics describing the package's capabilities, obtaining the software, requirements for running the software, obtaining more information, on-line or otherwise, and finally, our subjective comments on installing and using the software. These descriptions are in HyperText Markup Language (HTML) format, and each item is tagged with a field identifier so that it is possible to restrict a search to the list of supported operating systems, for example. We keep all of the high-level descriptions locally, to make it easier to keep track of what is in the collection.
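To show what restricting a search to one field means in practice, here is a minimal Python sketch. The record layout, the field names (`operating_systems`, `capabilities`), and the package names are invented for illustration; the actual descriptions are HTML documents with tagged items, and the real field identifiers are not reproduced here.

```python
def fielded_search(records, term, field=None):
    """Return titles of records containing the term.

    If a field name is given, only that field is searched;
    otherwise every field of each record is searched.
    """
    term = term.lower()
    hits = []
    for record in records:
        fields = record["fields"]
        names = [field] if field else list(fields)
        if any(term in fields.get(name, "").lower() for name in names):
            hits.append(record["title"])
    return hits


# Hypothetical package descriptions, for illustration only.
records = [
    {
        "title": "Package A",
        "fields": {
            "operating_systems": "SunOS, VMS",
            "capabilities": "image display on UNIX workstations",
        },
    },
    {
        "title": "Package B",
        "fields": {
            "operating_systems": "UNIX, Linux",
            "capabilities": "spectral fitting",
        },
    },
]

unrestricted = fielded_search(records, "unix")
restricted = fielded_search(records, "unix", field="operating_systems")
```

In this example the unrestricted search returns both packages, because "UNIX" appears in the capabilities text of Package A, while the search restricted to the operating systems field returns only Package B.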

The distributed documentation collection is eclectic. Currently, user guides, programmer guides, on-line help files, and UNIX man pages are included. In contrast to the high-level descriptions, these documents are widely distributed around the Internet. Each document is in native HTML format, was converted to HTML by us (e.g., RosettaMan was used for man to HTML conversion), or is converted into HTML on-the-fly by CGI scripts (e.g., the AIPS on-line help files, and various IDL procedure libraries). Almost all documents are retrieved by URL rather than by file name. While neither indexing nor searching is limited to HTML format, this situation does eliminate the problem of how to present the result of a successful search: the user's Web browser presents it the way the author intended. The only HTML tag that is common to our entire collection is the Title tag. This allows a search result to be presented as a list of titles.
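Presenting results as a list of titles requires pulling the Title element out of each document. The sketch below does this with Python's standard `html.parser` module; it is an illustration, not the code used by ASDS, and the names `TitleParser` and `extract_title` and the fallback string are assumptions.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase, so <TITLE> also matches.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html, fallback="(untitled)"):
    """Return the document title with whitespace collapsed, or a fallback."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or fallback
```

A result list is then just `extract_title(text)` applied to each matching document, with the fallback covering any document that lacks a title.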

Leaving the documentation files in the hands of the software providers is the best assurance that they are up-to-date. On the other hand, it means that the index can be out-of-date. We have developed a "client package" that the software provider can install as a CGI script. It allows us to ask the provider's Web server to supply us with a list of URLs that are new, changed, or deleted. This allows us to update the index with a minimal amount of network traffic. Updates can be made periodically or at the provider's request.
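The classification into new, changed, and deleted URLs can be sketched as a comparison of two snapshots. The Python below is a hypothetical illustration only: the format actually exchanged with the client package is not described here, so the snapshot layout (a mapping from URL to a fingerprint such as a modification time or checksum), the function name `diff_manifests`, and the example URLs are all assumptions.

```python
def diff_manifests(old, new):
    """Compare two {url: fingerprint} snapshots of a provider's documents.

    Returns the URLs that are new, changed, or deleted since the old
    snapshot, so that only those documents need to be re-indexed.
    """
    added = sorted(url for url in new if url not in old)
    deleted = sorted(url for url in old if url not in new)
    changed = sorted(
        url for url in new if url in old and new[url] != old[url]
    )
    return {"new": added, "changed": changed, "deleted": deleted}


# Hypothetical snapshots taken at the last index build and today.
previous = {
    "http://example.org/docs/a.html": "1996-01-10",
    "http://example.org/docs/b.html": "1996-01-10",
    "http://example.org/docs/c.html": "1996-02-01",
}
current = {
    "http://example.org/docs/a.html": "1996-01-10",
    "http://example.org/docs/b.html": "1996-03-15",
    "http://example.org/docs/d.html": "1996-03-20",
}

update = diff_manifests(previous, current)
```

Only the documents listed under "new" and "changed" need to be fetched again, and those under "deleted" can be dropped from the index, which keeps the network traffic for an update small.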

4. Your Assistance Requested!

The material currently in the collection consists largely of material we found, in HTML format, by browsing the WWW---i.e., this is the stuff that is easy to find. Given this, and the limited manpower on our end, we need to enlist the cooperation of the software providers themselves to further enhance the completeness of the collection. To facilitate this process, we have created a few products to support participation at a number of levels of commitment, ranging from a one-time submission of a high-level package description, through supplying a list of URLs of software documentation, to installing the client package mentioned above.

5. Future Enhancements

Our prototype system has reached the point where we can profitably ask software providers for their participation, but it is not ready for users. The system is both too slow and too fragile. The Isite software is not yet optimized to support the modifications we have made. The Isite indexes contain a lot of information, to support the advanced search capabilities. To reduce the size of the index, a lot of information is not copied into the index, but is retrieved from the documents at search time. This works well for local files, but retrieving a URL from the WWW is very expensive, by comparison. In addition, it makes the search process vulnerable to servers that are down, domain name servers that are down, and to slowdowns due to network traffic. And finally, if the documents have changed since the index was constructed, then the information retrieved at search time will not be what is expected, and searches will fail. We are currently working to keep more information locally, and to reduce, and perhaps eliminate, the network traffic required at search time.

References:

Hanisch, R. J., Payne, H. E., & Hayes, J. J. E., 1994, in Astronomical Data Analysis Software and Systems III, ASP Conf. Ser., Vol. 61, eds. D. R. Crabtree, R. J. Hanisch, & J. Barnes (San Francisco, ASP), p. 41


Wed Jul 3 08:01:26 MST 1996