NOAO Data Products Program

A Virtual Observatory Data Model

Francisco Valdes
fvaldes@noao.edu
May 9, 2003

COVER LETTER

The following (also http://iraf.noao.edu/projects/vo/dal/datamodel.html) is a contribution for the data model discussions at the working groups meeting next week in Cambridge. It is an extension of some earlier ideas ( [1] , [2] ) on including celestial coordinates in the WCS for 1D and 2D spectra and on the question of whether accessing spectra and images in the prototype VO framework requires different protocols. Because of the deadline imposed by the meeting the discussion is abbreviated in some areas. My hope is to emphasize the general philosophy and approach. There are some important ideas which I support in the Spectral Data Models draft from Jonathan McDowell and Steve Lowe, but there are some philosophical differences which I wanted to offer. These are primarily the ideas of treating images and spectra as projections of a more general class and of simplifying as much as possible by limiting VO data to "calibrated" forms which don't require complex metadata to interpret.

Because I decided it was more valuable to try to build a consistent discussion from my perspective, I did not have time to also critique the McDowell and Lowe concepts. But it makes sense that before doing that one really needs to reach consensus on whether to treat spectra of various dimensions separately or whether to work towards an integrated spectrum/image data model.

Good luck in the meetings. I'm sorry I can't be there.

Frank Valdes

1. What is a virtual observatory data model?

The first hurdle to overcome in defining virtual observatory (VO) data models is to understand what they are and what they are not. In the discussion given here a VO data model is the SIMPLEST abstraction of physically calibrated, wavelength regime and detector technology independent astronomical data.

We emphasize simplest because a key part of the VO concept is that users, called VO observers, should not need to be experts in every regime of astronomy and instead only be educated astrophysicists. The science done by VO observers generally involves data from various telescopes and various energy subdisciplines. The reason for striving towards the simplest description is to allow consensus and interoperability between a wide variety of data providers.

The other side of the question, which should be a mantra of sorts, is:

    "VO data models are not FITS or file formats"
    "VO data models are not archived data"
    "VO data models are not instrumental data"

2. Celestial Sphere Binned Photon Observations - 4DBIN

This document defines a broad class of astronomical data called "Celestial Sphere Binned Photon Observations". Note that the detailed definition of the class identified by this label is more specific than the literal interpretation of the words. The definition of the class flows from the name as follows.

Celestial Sphere
Restricts the class to data about the two dimensional celestial sphere. There are two spatial parameters specifying a longitude and a latitude in some specified celestial system.
Photon
Restricts the class to data about the photon energies as described by an energy parameter.
Binned
Restricts the class to data about the number of photons arriving over finite regions, called bins, of the parameter domain. A way to look at this is that photon events are indistinguishable within a bin. A further restriction is that the bins are rectangular so they may be described by a center and width in each parameter.
Observations
Restricts the class to data about photons over a time described by a time parameter. Observation evokes the idea of detecting photons over an integration period, though simulation and model results can be cast into simulated observations.

This definition of the class has four parameters: two for celestial position, one for energy, and one for time. These form a continuous space or domain which is divided into a set of bins that are not necessarily uniformly distributed or of equal size. Each bin is associated with the number of photons it contains. The number of photons may be expressed in various ways such as number, energy, and flux.

This class may be thought of as data obtained through the following process. Photons of various energies are detected as a function of time coming from points on the sky. Each photon is tagged by four numbers from a four dimensional continuous space. The numbers are a latitude and longitude on the celestial sphere from which the photon arrives, the energy of the photon, and the time. The continuous space is divided into a set of discrete regions or bins which are indexed in some fashion. The photons are counted in each bin. The details of the continuous energy, position, and time parameters are lost and only the bin index and bin counts are retained.
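
The counting step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Bin and count_photons are hypothetical, each bin is a rectangular region given by a center and a full width per parameter, the four parameters are plain floating point numbers (for example decimal degrees, Angstroms, and seconds) with no longitude wraparound handling, and the bins are assumed not to overlap.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# One photon event: (longitude, latitude, energy, time).
Photon = Tuple[float, float, float, float]


@dataclass
class Bin:
    """Hypothetical rectangular bin: a center and a full width per parameter."""
    center: Photon
    width: Photon

    def contains(self, photon: Photon) -> bool:
        # Half-open interval [center - w/2, center + w/2) on every axis.
        return all(
            c - w / 2 <= p < c + w / 2
            for p, c, w in zip(photon, self.center, self.width)
        )


def count_photons(photons: Sequence[Photon], bins: Sequence[Bin]) -> List[int]:
    """Return the photon count per bin, indexed by bin ordinal."""
    counts = [0] * len(bins)
    for photon in photons:
        for index, b in enumerate(bins):
            if b.contains(photon):
                counts[index] += 1
                break  # assumes bins do not overlap
    return counts
```

Only the list of counts and the bin descriptions survive this step; the individual photon tags are discarded, which is the loss of detail noted above.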

This definition makes a notable distinction between the measured quantity, the photons, and the sampling, the bins. This distinction is often confused or lost. The photon value, sometimes thought of as the "z" axis in an image, is the scientific content and is conveyed in standard physical units. The sampling or binning is variable and dependent on the way the data was obtained. The VO infrastructure or the data providers may "convert" units for the photon values and "resample" the bins at the request of the VO observer.

To identify data which falls into this class we define a top level tag

	VOCLASS = 4DBIN

2.1 What is the difference between VO data and observational data?

A key aspect of virtual observatory photon binned data is that the primary bin values be calibrated to standard physically meaningful units. There are two important reasons for this. One is to allow VO observers to easily intercompare data with only simple physical unit conversions. The other is to simplify the data model and limit metadata which must be supplied to allow meaningful interpretation.

This does place a small burden on the data providers beyond what has been typical. For instance, optical imaging often provides data in digital units with the conversion to photons implicit in a gain and a magnitude zeropoint. For VO data the data provider does the gain multiplication and the conversion of the magnitude system to photon based units, so that non-optical astronomers don't need to understand the detector technology or many of the ideas of magnitudes, and the metadata doesn't need to include a gain and magnitude zeropoint.
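
As a rough illustration of this provider-side arithmetic, the sketch below applies a gain to digital units and converts a zeropoint-based magnitude to a flux. The function names are hypothetical, and the convention that the zeropoint is the magnitude producing one count per second is an assumption for the example. The argument flux_at_mag_zero stands for whatever flux the magnitude system assigns to magnitude zero and has to come from the definition of that system.

```python
import math


def adu_to_electrons(adu: float, gain_e_per_adu: float) -> float:
    """Gain multiplication: digital units to detected electrons.

    For an optical CCD one detected electron is normally taken
    to correspond to one detected photon.
    """
    return adu * gain_e_per_adu


def instrumental_magnitude(count_rate: float, zeropoint: float) -> float:
    """Magnitude of a source, ASSUMING the zeropoint is the magnitude of 1 count/s."""
    return zeropoint - 2.5 * math.log10(count_rate)


def magnitude_to_flux(mag: float, flux_at_mag_zero: float) -> float:
    """Invert m = -2.5 log10(F / F0) to recover F in the units of F0."""
    return flux_at_mag_zero * 10.0 ** (-0.4 * mag)
```

Once the provider has applied steps of this kind, neither the gain nor the zeropoint has to travel with the data.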

In order to provide a "caveat emptor" option to the VO observers and data providers, a top-level metadata declaration is whether the primary data values meet the VO standard for this class:

	4DBIN.CALIBRATED = [yes|no|relative]

By asserting "no" the data may be useful but would require the VO observer to calibrate it themselves in some way. The "relative" calibration is a way to assert that the data is proportional to photon counts and that the response to photon fluxes is independent of position (after taking differences in bin sizes into account). Therefore, relative comparison between different bins is scientifically meaningful even though an absolute calibration is not defined.
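
A client might act on this declaration as in the following sketch. The "KEY = value" text serialization and the function names are assumptions made for the example, since this document does not define how the metadata is encoded; only the tag name and its three values come from the text above.

```python
ALLOWED_CALIBRATED = {"yes", "no", "relative"}


def parse_metadata(text: str) -> dict:
    """Parse 'KEY = value' lines into a dictionary (hypothetical serialization)."""
    metadata = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or non-declaration lines
        key, value = line.split("=", 1)
        metadata[key.strip()] = value.strip()
    return metadata


def calibration_state(metadata: dict) -> str:
    """Return the declared calibration state, rejecting missing or unknown values."""
    value = metadata.get("4DBIN.CALIBRATED")
    if value not in ALLOWED_CALIBRATED:
        raise ValueError(f"4DBIN.CALIBRATED must be one of {sorted(ALLOWED_CALIBRATED)}")
    return value


def absolute_comparison_allowed(metadata: dict) -> bool:
    """Only 'yes' supports comparing bin values between different data sets."""
    return calibration_state(metadata) == "yes"
```

With "relative", bins within one data set may still be compared with each other, so a client would typically restrict itself to ratios rather than reject the data.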

Note that the first sentence of this section refers to the "primary bin values". The reason for this is that the observational and calibration characteristics appear in the ancillary data and metadata. This is primarily contained in the uncertainties but some other useful information may be provided in exposure maps and data quality flags.

2.2 What is an image and a spectrum?

Inasmuch as astronomers define and distinguish between "images" and "spectra", an image is a subclass with only a single energy bin, a single time bin, and multiple bins in both spatial parameters. The energy bin is often fairly wide but not always. A spectrum also has only a single time bin, but has more than one energy bin, and one or more spatial bins.

Astronomers also typically discriminate between spectra having a single spatial bin, called a "one-dimensional spectrum", and multiple spatial bins, often called a "data cube". The special case of spatial bins restricted to a curve on the celestial sphere is called a "slit spectrum".

In this document there is no distinction made between spectra and images. However, one could choose to subclass the metadata concepts. A subclass means using implicit and explicit conventions and defaults. The subclasses might be:

	VOCLASS = 4DBIN.IMAGE
	VOCLASS = 4DBIN.1DSPECTRUM
	VOCLASS = 4DBIN.SLITSPECTRUM
	VOCLASS = 4DBIN.DATACUBE
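
For gridded data, the distinctions drawn in this section could be applied mechanically from the number of bins along each parameter, as in the sketch below. The function name is hypothetical. Treating "exactly one spatial parameter with more than one bin" as a slit spectrum is a simplifying assumption, since a slit may follow any curve on the sky, and anything the text does not name falls back to the parent class.

```python
def suggest_subclass(n_lon: int, n_lat: int, n_energy: int, n_time: int) -> str:
    """Suggest a VOCLASS value from the bin counts along each parameter."""
    if min(n_lon, n_lat, n_energy, n_time) < 1:
        raise ValueError("every parameter needs at least one bin")
    if n_time != 1:
        return "4DBIN"  # no time-resolved subclass is named in the text
    # Number of spatial parameters that have more than one bin (0, 1, or 2).
    n_spatial_axes = int(n_lon > 1) + int(n_lat > 1)
    if n_energy == 1:
        # An image needs multiple bins in both spatial parameters.
        return "4DBIN.IMAGE" if n_spatial_axes == 2 else "4DBIN"
    if n_spatial_axes == 0:
        return "4DBIN.1DSPECTRUM"
    if n_spatial_axes == 1:
        return "4DBIN.SLITSPECTRUM"  # ASSUMPTION: slit lies along one grid axis
    return "4DBIN.DATACUBE"
```

For instance, bin counts of (1, 1, 2048, 1) give 4DBIN.1DSPECTRUM and (512, 512, 1, 1) give 4DBIN.IMAGE.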

3. Metadata

Data from the 4DBIN class fundamentally consists of a set of numbers related to photon counts. To make sense of this set of numbers requires metadata or conventions which describe the relationship between photon counts and the bin value and which define the bins, the uncertainties in the values, and associated attributes.

As a thought experiment, which we use to identify the metadata through a use case, suppose one is given the set of numbers {0,6,7,2,5,3,1,4}. What do we need in order to understand something about the photons observed on the sky? Along these lines the minimal metadata necessary should be separated from optional metadata. Here we suggest the minimal description is provided by section 3.1 on the bin geometry and section 3.2 on the bin values.

First we need a top level piece of metadata defining the class and conventions. This type of metadata is sometimes associated with a name, such as FITS (with SIMPLE=T). For this document we define this metadata class domain

	VOCLASS = 4DBIN

3.1 Bin Geometry

The metadata for the bin geometry describes the mapping from the continuous four dimensional photon parameter space to the discrete indexed bins. As noted in section 2, the bins are required to be described by a center and width along each of the four parameter dimensions. This constitutes the bin geometry.

The first thing we need is a definition for the indexing of the data bin values. There are two straightforward ways to do this. One is to use the ordinal of the data value set. The other is to arrange the values into an array. For the 4DBIN class the array is required to be four dimensional.

	4DBIN.INDEXING = ordinal
	4DBIN.INDEXING = array(N1,N2,N3,N4)

3.1.1 Ordinal or tabular indexing

The first method is completely general while the second requires the number of data values to be the product of the array dimensions. At this point the two indexing schemes seem pretty much the same. The distinction comes in how the indices are used to map to the bin geometries in the four dimensional parameter space. In practice, the ordinal indexing is used with a table and the array is used for gridded bins.
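
The size constraint on array indexing, and the way an ordinal relates to a four dimensional array index, can be sketched as follows. The function names are hypothetical. The choice that the first axis varies fastest (as in FITS arrays) and the use of zero-based indices are assumptions, since the text does not fix an ordering.

```python
from math import prod
from typing import Sequence, Tuple


def check_array_indexing(n_values: int, shape: Sequence[int]) -> None:
    """Raise ValueError unless shape is 4D and its product equals n_values."""
    if len(shape) != 4:
        raise ValueError("4DBIN array indexing requires four dimensions")
    if prod(shape) != n_values:
        raise ValueError("number of data values must equal N1*N2*N3*N4")


def ordinal_to_index(ordinal: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Map a 0-based ordinal to a 0-based (i1, i2, i3, i4), first axis fastest."""
    if not 0 <= ordinal < prod(shape):
        raise IndexError("ordinal outside the array")
    index = []
    for n in shape:
        ordinal, i = divmod(ordinal, n)  # peel off one axis at a time
        index.append(i)
    return tuple(index)
```

The eight values of the thought experiment in section 3 would fit array(1,1,8,1), where only the energy index varies.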

In the ordinal indexing the metadata includes a table of bin geometry values. The table is a set of numbers ordered such that each sequential set of eight values defines a line and the line number corresponds to the data value with matching ordinal. For example, the first eight numbers apply to the first data value, the second eight to the second data value, and so forth. The eight values are the bin centers in longitude, latitude, energy, and time followed by the bin widths in the same order.

In the simple 1D spectrum example we might have

  0 : 12h10m15s 32d15m10s 4001A 2003-05-07T12:10:15 1arcsec 1arcsec 1A 300s
  6 : 12h10m15s 32d15m10s 4002A 2003-05-07T12:10:15 1arcsec 1arcsec 1A 300s
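
Reading such a table from a flat sequence of numbers might look like the sketch below. The names BinGeometry and rows_of_eight are hypothetical, and the values are assumed to have already been converted to plain numbers in some agreed units (for example decimal degrees, Angstroms, and seconds) rather than the sexagesimal and date strings shown above.

```python
from typing import List, NamedTuple, Sequence


class BinGeometry(NamedTuple):
    """One table line: four bin centers followed by four bin widths."""
    lon: float
    lat: float
    energy: float
    time: float
    lon_width: float
    lat_width: float
    energy_width: float
    time_width: float


def rows_of_eight(flat: Sequence[float]) -> List[BinGeometry]:
    """Group a flat sequence into table lines; line k describes data value k."""
    if len(flat) % 8 != 0:
        raise ValueError("bin geometry table must hold a multiple of eight values")
    return [BinGeometry(*flat[i:i + 8]) for i in range(0, len(flat), 8)]
```

The number of lines returned should equal the number of data values, which gives a simple consistency check between the data and its bin geometry.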

3.1.2 Array or raster indexing

For the array indexing we use a metadata description along the lines of the FITS WCS. This is a complex description which we only touch on here with attention to the restrictions imposed by the 4DBIN class. The metadata components would include many of the basic elements of the FITS WCS metadata. Besides the actual formalism for evaluating the bin centers and widths another key piece of metadata is the units of the four parameters.

The main restriction on the FITS WCS formalism as it applies to the 4DBIN class is that the axis ordering is required to be latitude, longitude, energy, and time, so the FITS WCS always has a WCSDIM of 4. The FITS WCS does not currently explicitly define time coordinates. But for the main data types of interest, images and spectra with a single time bin, we simply use a linear WCS.

The bin centers are a direct analog to the pixel centers in the FITS WCS. There is a linear mapping from the array index to an intermediate WCS coordinate. There is potentially a distortion transformation to an ideal intermediate coordinate. For calibrated data typical of the VO this should not be required except possibly to describe the path of a slit spectrum on the sky. Finally there is a projection or standard non-linear transformation to the final coordinates.

One new feature of the FITS WCS formalism is use of a lookup table. This allows for bin centers which are not uniformly arrayed in the parameter space. It can provide similar information to the ordinal description.

The concept of bin widths is only implicit in the FITS WCS formalism. For the array indexing metadata model defined here, the bin widths are computed from the WCS using the idea that the WCS functions are continuous in the index space. So the bin edges are computed by adding and subtracting one-half to the integer indices and evaluating the parameter value at those points. The WCS formalism is more general than simple rectangular bins, so this computation is done by varying only the index of one parameter at a time. The width of the bin is the average difference between the parameter value at the integer index center and the values at the two half-index points.
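
The half-index evaluation can be sketched as below. The function names are hypothetical, and the WCS is represented simply as a callable from a four-element index to four parameter values. The linear_wcs helper borrows the meaning of the FITS CRVAL, CDELT, and CRPIX keywords for a purely linear mapping. The sketch reports the separation of the two edges as the full width, which is one reading of the text; for a linear WCS this is twice the average center-to-edge difference.

```python
from typing import Callable, Sequence, Tuple

Coord = Tuple[float, ...]
Wcs = Callable[[Sequence[float]], Coord]


def linear_wcs(crval: Sequence[float], cdelt: Sequence[float],
               crpix: Sequence[float]) -> Wcs:
    """Return a purely linear index-to-parameter mapping, one term per axis."""
    def evaluate(index: Sequence[float]) -> Coord:
        return tuple(v + d * (i - p)
                     for v, d, i, p in zip(crval, cdelt, index, crpix))
    return evaluate


def bin_edges_and_width(wcs: Wcs, index: Sequence[float],
                        axis: int) -> Tuple[float, float, float]:
    """Evaluate the WCS at index -/+ 0.5 along one axis only.

    Returns (lower edge, upper edge, full width) for that parameter.
    """
    lower = list(index)
    upper = list(index)
    lower[axis] -= 0.5
    upper[axis] += 0.5
    edge_lo = wcs(lower)[axis]
    edge_hi = wcs(upper)[axis]
    return edge_lo, edge_hi, abs(edge_hi - edge_lo)
```

Because only a callable is required, the same function would work with a nonlinear or lookup-table WCS, provided it can be evaluated at non-integer indices.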

3.2 Bin Values

Section 2.1 declares that calibrated 4DBIN data be in certain physical units directly related to the photons and the bin sizes. The primary metadata for the bin values is then the units. For example,

    4DBIN.VALUES.UNITS = ergs/s/cm^2/A
    4DBIN.VALUES.UNITS = photons
    4DBIN.VALUES.UNITS = Jy

The definition of the allowed units also needs to provide standards such as calibrations to above the atmosphere.
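
As an example of the simple physical unit conversions intended here, the sketch below converts between two of the units listed above, Jy and ergs/s/cm^2/A. It uses the standard relation F_lambda = F_nu * c / lambda^2 with 1 Jy = 1e-23 ergs/s/cm^2/Hz. The function names are hypothetical, and the wavelength is assumed to be the bin center in Angstroms.

```python
C_ANGSTROM_PER_S = 2.99792458e18   # speed of light in Angstroms per second
JY_IN_CGS = 1.0e-23                # 1 Jy in ergs/s/cm^2/Hz


def jy_to_flam(flux_jy: float, wavelength_angstrom: float) -> float:
    """Convert a flux density in Jy to ergs/s/cm^2/A at the given wavelength."""
    return flux_jy * JY_IN_CGS * C_ANGSTROM_PER_S / wavelength_angstrom ** 2


def flam_to_jy(flam: float, wavelength_angstrom: float) -> float:
    """Convert ergs/s/cm^2/A back to Jy at the given wavelength."""
    return flam * wavelength_angstrom ** 2 / (JY_IN_CGS * C_ANGSTROM_PER_S)
```

The conversion depends only on the bin center in energy, which is already part of the bin geometry, so no additional metadata is needed.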

When there is a significant variation in the detection of photons across an energy bin, such as occurs with a filter in a broadband image, the calibration must be referenced to the filter system.

    4DBIN.FILTER = Johnson(B)

Background contributions need to be described by primary metadata.

    4DBIN.VALUES.BACKGROUND = Subtracted using nearby simultaneous observations
    4DBIN.VALUES.BACKGROUND = Subtracted by CCD shuffling
    4DBIN.VALUES.BACKGROUND = None subtracted

3.4 Uncertainties

For identification purposes, such as finding sources or redshifts, and when the magnitude of the signal is high, such as continuum shapes over decades of energy, the uncertainties about the data bin values may not be important. In other words, there are a number of uses for calibrated VO data that just depend on the data units and the binning.

But for detailed measurements where detection and instrumental effects are important, a significant piece of metadata is the uncertainties. There are two approaches which might be provided by the data model. The more rigorous approach would be to give statistical information about each bin (possibly including covariances).

The statistical description of the uncertainties implicitly carries information about exposure times, rejected data in combined observations, variable sensitivities, and so on. Other attribute metadata may explicitly provide the means to separate these implicit contributions to the total uncertainties.

The other approach is to provide a functional description. This is only really useful if the data is relatively homogeneous so that variable DQE, bin sizes, and backgrounds are not present. A typical model describes the variances as a function of the data values. For instance,

	V = A + B N ...

where N is the binned photon number.
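
Evaluating such a model is straightforward, as in the sketch below, which keeps only the two terms written out above and does not guess at any further terms. The function names are hypothetical. As one illustrative case, A could be a constant variance from the detector and B = 1 would correspond to pure photon counting statistics, but those interpretations are assumptions and not part of the model definition.

```python
import math


def model_variance(n_photons: float, a: float, b: float) -> float:
    """Variance V = A + B*N for a bin holding N photons (first two terms only)."""
    return a + b * n_photons


def model_sigma(n_photons: float, a: float, b: float) -> float:
    """One-sigma uncertainty, the square root of the model variance."""
    return math.sqrt(model_variance(n_photons, a, b))
```

With this approach the metadata carries only the coefficients, and the per-bin uncertainties are generated on demand from the bin values.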

3.5 Attributes

This section on attributes is a catch-all for all the rest of the metadata. This is all to be defined. However a quick list of common useful attributes is given below.

label/title
a label or title provided by the observer
object ID
a standard object id
instrument
details of the telescope and instrumentation
conditions
information about the observing conditions
calibrations
details of the calibrations

data quality
a table of data quality indicators for:
  • uncalibrated bins due to vignetting or masking
  • poorly calibrated bins
exposure map
a table of effective exposure times
exposure filter
a table describing chopping, shuffling, sequences of combined exposures, etc. This is a filter function for the time dimension of a bin.