Francisco G. Valdes
IRAF Group, NOAO, PO Box 26732, Tucson, AZ 85726
One of the most challenging and time-consuming steps in the processing of astronomical spectra is determining the instrumental dispersion function. In this paper the term dispersion coordinate refers to whatever units the observer chooses. The dispersion function is determined by identifying spectral lines with known dispersion coordinates in a calibration spectrum (typically an arc-lamp spectrum) and fitting a function to the set of pixel positions and dispersion coordinates.
The goal of the tasks and algorithms discussed here is to automate this process so that, given an observed spectrum of a spectral line source and a file of known dispersion coordinates for the lines, the software will determine the proper dispersion function. We want this capability to be as general as possible so that the correct dispersion function is found even when there is poor or no knowledge of the dispersion coverage and resolution of the observation. However, we also want to allow two optional inputs to constrain the search: a dispersion calibrated reference spectrum (not necessarily at the same resolution or wavelength coverage), used to identify the significant lines in the coordinate file, and approximate dispersion parameters with their uncertainties. The dispersion parameters may be specified by the user or through keywords in the image header supplied by the acquisition system.
The algorithm described here is actually a combination of many algorithms. This short paper provides a high level overview of the tasks and the algorithm. Details of the various separate algorithms may be found in a paper to be submitted to the PASP and in ftp://iraf.noao.edu/iraf/docs/autoid.ps.Z
The automatic line identification capability will appear in IRAF V2.11. It will be part of the basic spectroscopic tasks IDENTIFY, REIDENTIFY, and AUTOIDENTIFY. In addition these tasks will be used in revised versions of DOSLIT, DOHYDRA, DOARGUS, DOFIBERS, and DO3FIBER. The latter reduction scripts will, therefore, provide automatic line identifications where currently an interactive step is required.
The three basic tasks serve somewhat different purposes. IDENTIFY is a general interactive task. It begins in the same way as in previous versions by plotting the observed spectrum in pixels. At this point the user may type a new command that runs the automatic line identification algorithm and then redraws the spectrum with a dispersion calibration and the identified lines marked. Approximate dispersion information (if known) is specified by task parameters which may be directed to appropriate keywords in the image header provided by the data acquisition system.
There is also a second command in IDENTIFY that finds a dispersion shift at the same dispersion scale (the change in dispersion coordinate per pixel) using a set of lines previously identified. This is used when there are multiple shifted observations such as occur with slit masks. A dispersion function and a set of identified lines are obtained either as in the previous paragraph or by interactive identification and fitting. When the user moves on to another spectrum it will appear shifted relative to the inherited solution from the previous spectrum. Typing a command key will apply the automatic line identification algorithm constrained by the previous dispersion scale and direction and with the previous central dispersion coordinate as the starting point. A small search in the dispersion scale and a large search in the central dispersion value are performed. Initially, the method uses the set of previously identified lines as the reference coordinate list (described later) and looks for shifts of up to 60% of the dispersion range. If the shifts are even greater it will try to automatically identify the lines using the coordinate reference file, since in that case there may be too few or no overlapping lines.
The task REIDENTIFY is used to find new dispersion functions which are similar to a reference dispersion function. In the past this meant that the spectra were restricted to shifts of only a few pixels so that the reference lines could be found again close to the previous position. The automatic algorithm allows finding larger shifts assuming the spectra to be reidentified have basically the same dispersion. This uses the same steps described in the previous paragraph but in a non-interactive setting. In essence, REIDENTIFY can be used to find dispersion functions for slit mask spectra given one dispersion function and set of representative lines. For this type of application one would also use the option to add new lines from the reference coordinate file to fill in the region not covered in the reference spectrum due to the large shifts.
The new task AUTOIDENTIFY provides the automatic line identification algorithm in a non-interactive format. Unlike REIDENTIFY it does not start with a reference spectrum having the same dispersion. In its most general use the task is presented with one or more spectra, a reference coordinate file, an optional reference spectrum, and approximate dispersion values. The task then finds reference lines and a dispersion function automatically. In both REIDENTIFY and AUTOIDENTIFY there is a parameter that allows the user to examine the results interactively.
An example of how these tasks may be used together is in the automatic calibration of multi-slit mask spectra. After the spectra are extracted, AUTOIDENTIFY is used on one slit to find a dispersion function based on some approximate dispersion information recorded in the header. If the reduction process is interactive the user will be presented with the solution in the IDENTIFY mode of AUTOIDENTIFY to verify the result and make any changes. Then REIDENTIFY is used to find solutions for the other slits in the presence of large dispersion shifts. Finally, REIDENTIFY is used to find additional dispersion functions for calibration spectra from the same mask, where each slit is matched against the previous solution for the same slit. No large shifts are involved in this last stage.
The automatic line identification method proceeds as follows. First a list of pixel positions for the strong spectral lines in the target spectrum is created. We use the word line to mean the coordinate of a spectral line. The list is created by finding the local maxima, sorting them by pixel value, and then applying a centering algorithm to accurately determine the centers of the line profiles. The number of lines to use is a parameter which has a default value of 20.
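The line-list step can be sketched as follows. This is only an illustration under stated assumptions: the function name `find_strong_lines` is hypothetical, and a three-point parabolic interpolation stands in for the centering algorithm, which is not specified in this overview.

```python
def find_strong_lines(flux, nlines=20):
    """Return pixel centers of the strongest local maxima, in increasing order.

    Hypothetical helper for illustration; the parabolic centering below is
    an assumed stand-in for the actual centering algorithm.
    """
    # Local maxima: above the left neighbor and not below the right neighbor.
    peaks = [i for i in range(1, len(flux) - 1)
             if flux[i - 1] < flux[i] >= flux[i + 1]]
    # Sort by pixel value (peak height), strongest first, and keep nlines.
    peaks.sort(key=lambda i: flux[i], reverse=True)
    centers = []
    for i in peaks[:nlines]:
        a, b, c = flux[i - 1], flux[i], flux[i + 1]
        denom = a - 2.0 * b + c
        # Vertex of the parabola through the three points around the peak.
        offset = 0.5 * (a - c) / denom if denom != 0 else 0.0
        centers.append(i + offset)
    return sorted(centers)
```

For example, `find_strong_lines(spectrum, nlines=20)` would return at most 20 sub-pixel positions, matching the default number of lines quoted above.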
A list of reference dispersion coordinates is selected from a coordinate file. The size of this list is another parameter which has a default value of 40. The reference coordinates are either selected uniformly from the coordinate file if no reference spectrum is available or by locating the strong spectral lines (in the same way as for the target spectrum) in a reference spectrum. The selection is limited to the expected range of the dispersion as specified by the user or given in the image header. If no approximate dispersion information is available the full range of the coordinate file or reference spectrum is used.
The lists of coordinates are sorted in increasing order and the ratios of consecutive spacings for patterns of N lines are computed. The size of the patterns is a user parameter with a recommended value of five lines. Rather than considering all possible combinations of lines only those sets of lines with members within a specified number (the default is ten) of successive lines are used. So the default case is to find all sets of five lines which are within ten lines of each other and compute the three ratios of spacings for each set. Note that if the direction of the dispersion is unknown then one computes the ratios in the reference coordinates in both directions.
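A minimal sketch of the pattern construction follows. The function name `spacing_ratio_patterns` and its parameter names are hypothetical, and the reading of "within ten lines of each other" as "all members drawn from a window of ten successive lines" is an assumption. The coordinates are assumed to be distinct so that no spacing is zero.

```python
from itertools import combinations


def spacing_ratio_patterns(coords, npattern=5, nneighbors=10):
    """Return (member_indices, ratios) for every pattern of npattern lines.

    Illustrative only: each pattern starts at one line and takes its other
    members from the following nneighbors - 1 lines of the sorted list.
    """
    coords = sorted(coords)
    patterns = []
    for start in range(len(coords)):
        window = range(start + 1, min(start + nneighbors, len(coords)))
        for rest in combinations(window, npattern - 1):
            idx = (start,) + rest
            # Spacings between consecutive members of the pattern.
            gaps = [coords[j] - coords[i] for i, j in zip(idx, idx[1:])]
            # Ratios of consecutive spacings: npattern - 2 values per pattern.
            ratios = tuple(g2 / g1 for g1, g2 in zip(gaps, gaps[1:]))
            patterns.append((idx, ratios))
    return patterns
```

Because the ratios are unchanged by a shift or a rescaling of the coordinates, the same function can be applied to the pixel list and to the dispersion list. One way to handle an unknown dispersion direction in this sketch would be to call it a second time on the negated reference coordinates.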
The idea is that similar patterns in the pixel list and the dispersion list will have matching ratios to within some tolerance. All matches in the ratio space are found between patterns in the two lists. When a match is made, the candidate pairings between the members of the patterns are recorded. A count is made of how often each possible candidate pairing occurs. When there is a sufficient number of true pairs between the lists (of order 25% of the shorter list), true pairs will appear in common in many different patterns. Thus the candidate pairings with the highest counts are the most likely to be true pairs.
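The counting of candidate pairings can be illustrated with a short sketch. The name `vote_for_pairs`, the relative tolerance `tol`, and the brute-force comparison of every pattern against every other are all assumptions made for illustration; an efficient implementation would presumably search the ratio space more cleverly.

```python
from collections import Counter


def vote_for_pairs(pixel_patterns, ref_patterns, tol=0.05):
    """Count how often each (pixel, reference) pairing occurs in matched patterns.

    Each pattern is a (members, ratios) tuple, where members are the line
    coordinates and ratios are the ratios of consecutive spacings.
    Hypothetical helper; tol is an assumed relative tolerance.
    """
    votes = Counter()
    for pix_members, pix_ratios in pixel_patterns:
        for ref_members, ref_ratios in ref_patterns:
            # Patterns match when every ratio agrees to within the tolerance.
            if all(abs(p - r) <= tol * r
                   for p, r in zip(pix_ratios, ref_ratios)):
                # Record the member-by-member candidate pairings.
                votes.update(zip(pix_members, ref_members))
    return votes
```

Calling `votes.most_common(k)` then gives the `k` pairings that appeared in the most matching patterns, which are the ones most likely to be true pairs.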
The candidate pairs with the highest likelihood are used to compute all linear dispersions containing two or more pairs. The dispersions are constrained by any approximate dispersion information provided by the user or recorded in the image. The dispersions are ranked by the number of pairs the candidate dispersion fits and the counts of how often the pairs appeared in matching patterns.
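As a rough sketch of this step, the code below forms a straight line through every two candidate pairs, counts how many pairs each line fits, and ranks the lines by that count and by the accumulated pattern counts. The function name, the `fit_tol` tolerance (in dispersion units), and the optional `scale_range` constraint are hypothetical choices. The sketch also does not merge duplicate solutions that arise from different pairs of pairs.

```python
from itertools import combinations


def rank_linear_dispersions(pairs, votes, fit_tol=1.0, scale_range=None):
    """Rank linear dispersions coord = zero + scale * pixel built from pairs.

    pairs: list of (pixel, coord) candidate pairs.
    votes: mapping from a pair to how often it appeared in matching patterns.
    Returns a list of ((n_fit, vote_sum), zero, scale), best first.
    """
    candidates = []
    for (p1, w1), (p2, w2) in combinations(pairs, 2):
        if p1 == p2:
            continue  # two pairs at the same pixel cannot define a line
        scale = (w2 - w1) / (p2 - p1)
        # Apply any approximate dispersion constraint (assumed form).
        if scale_range and not (scale_range[0] <= scale <= scale_range[1]):
            continue
        zero = w1 - scale * p1
        fitted = [(p, w) for p, w in pairs
                  if abs(zero + scale * p - w) <= fit_tol]
        score = (len(fitted), sum(votes.get(pw, 0) for pw in fitted))
        candidates.append((score, zero, scale))
    candidates.sort(reverse=True)
    return candidates
```

Every candidate fits at least the two pairs that define it, so the list contains only dispersions with two or more pairs.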
Each candidate dispersion function is evaluated as follows. Each line in the coordinate file is converted to a pixel coordinate based on the dispersion function. A centering algorithm attempts to find a line profile near that position. The lines found are used to determine an improved linear dispersion function. Additional lines are sought from the coordinate list based on the new fit.
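One refinement pass might look like the sketch below. It makes a simplifying assumption: instead of running a centering algorithm on the spectrum near each predicted position, it matches each predicted position to the nearest entry in a non-empty list of already measured line centers. The names `refine_linear_dispersion` and `match_tol` are hypothetical.

```python
def refine_linear_dispersion(zero, scale, ref_coords, measured_pix, match_tol=2.0):
    """Match reference lines to measured centers and refit coord = zero + scale * pixel.

    Simplified stand-in for the evaluation step; measured_pix must be non-empty.
    Returns (new_zero, new_scale, matched_pairs).
    """
    matched = []
    for w in ref_coords:
        # Convert the reference coordinate to a predicted pixel position.
        predicted = (w - zero) / scale
        nearest = min(measured_pix, key=lambda p: abs(p - predicted))
        if abs(nearest - predicted) <= match_tol:
            matched.append((nearest, w))
    if len(matched) < 2:
        return zero, scale, matched
    # Ordinary least-squares straight line through the matched pairs.
    n = len(matched)
    mean_p = sum(p for p, _ in matched) / n
    mean_w = sum(w for _, w in matched) / n
    sxx = sum((p - mean_p) ** 2 for p, _ in matched)
    sxy = sum((p - mean_p) * (w - mean_w) for p, w in matched)
    if sxx == 0:
        return zero, scale, matched
    new_scale = sxy / sxx
    return mean_w - new_scale * mean_p, new_scale, matched
```

Calling the function again with the returned `zero` and `scale` corresponds to seeking additional lines from the coordinate list based on the new fit.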
The quality of the dispersion function is evaluated based on three criteria. The first criterion is the root-mean-square (RMS) of the residuals between the pixel coordinates derived from the coordinate file and the measured pixel coordinates. This pixel RMS is normalized by a target RMS with a default value of 0.02 pixels. A good solution will have a value less than one. Note that a pixel RMS is used instead of a dispersion RMS so that all candidate dispersions can be compared using a physical property of the target spectrum.
The other two criteria are the fraction of strong lines from the pixel list which were not identified with lines in the coordinate file and the fraction of all the lines (within the candidate range) in the coordinate file which were not identified. These are normalized to a target value with a default of 0.2; i.e., 80% of the strong lines and of all lines should be identified. As with the RMS, a value of one or less is good.
The reason the fraction identified criteria are used is that the pixel RMS can be minimized by finding solutions with large dispersion per pixel. This puts all the lines in the coordinate file into a small range of pixels and so lines with very small residuals can be found. The strong line criterion is clearly a requirement that humans use in evaluating a solution. The fraction of all lines, as opposed to the number of lines, identified in the coordinate file is used to reject the case of a large dispersion per pixel mapping a large number of lines (such as the entire list) into the range of pixels in the target spectrum. This can give the appearance of finding a large number of lines from the coordinate file. However, an incorrect dispersion will also leave a large number of lines unmatched. Hence the fraction identified will be low.
The three criteria, all of which are normalized so that values less than one are good, are combined into a single figure of merit by a weighted average. Equal weights have been found to work well. In testing it has been found that correct solutions over a wide range of resolutions and dispersion coverage have figures of merit less than one and typically of order 0.2. All incorrect candidate dispersions have values of order two to three.
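The figure of merit can be written out as a short sketch using the default targets quoted above (0.02 pixels for the RMS and 0.2 for the unidentified fractions). The function and argument names are hypothetical.

```python
import math


def figure_of_merit(residuals_pix, n_strong, n_strong_missed,
                    n_ref, n_ref_missed,
                    target_rms=0.02, target_frac=0.2,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three normalized criteria (below one is good).

    residuals_pix: pixel residuals of the identified lines.
    n_strong, n_strong_missed: strong lines in the pixel list, and how many
        of them were not identified.
    n_ref, n_ref_missed: coordinate-file lines within the candidate range,
        and how many of them were not identified.
    """
    rms = math.sqrt(sum(r * r for r in residuals_pix) / len(residuals_pix))
    terms = (
        rms / target_rms,
        (n_strong_missed / n_strong) / target_frac,
        (n_ref_missed / n_ref) / target_frac,
    )
    return sum(w * t for w, t in zip(weights, terms)) / sum(weights)
```

With equal weights, a candidate is accepted in this sketch when the returned value is less than one.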
The search for the correct dispersion function terminates immediately when a figure of merit less than one is found. The candidate dispersion functions are evaluated based on their ranking so that the correct solution is often found on the first attempt.
When the approximate dispersion is unknown or very imprecise, it is often the case that the pixel and coordinate lists will not overlap much and will have few or no true coordinate pairs. Thus, at a higher level the above steps are iterated by partitioning the allowed dispersion space into bins of various sizes. The search starts with bins in the middle of the size range and in the middle of the dispersion range, and works outward towards larger and smaller bins and larger and smaller dispersion ranges. This is done to improve the chances of finding the correct dispersion function in the smallest number of steps.
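The middle-outward ordering can be illustrated with a small sketch. The helper names are hypothetical, and the nesting shown (bin sizes in the outer loop, bin positions in the inner loop) is an assumption, since the exact iteration order is not given here.

```python
def middle_out(items):
    """Return items reordered to start at the middle and alternate outward."""
    items = list(items)
    mid = (len(items) - 1) // 2
    order = []
    for step in range(len(items)):
        offsets = (mid,) if step == 0 else (mid + step, mid - step)
        for idx in offsets:
            if 0 <= idx < len(items):
                order.append(items[idx])
    return order


def search_order(bin_sizes, bin_centers):
    """List (size, center) trials, middle sizes and middle centers first.

    Assumed nesting for illustration only.
    """
    return [(size, center)
            for size in middle_out(bin_sizes)
            for center in middle_out(bin_centers)]
```

Each trial in the returned list would be passed through the matching and evaluation steps in turn, stopping at the first figure of merit below one.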