Cloudnet NetCDF Convention

Introduction

The Cloudnet convention is applicable to any dataset on a time-height grid, including radar and lidar data, single-site model forecasts and derived geophysical products. It adopts many of the components of other netCDF conventions, specifically the Climate and Forecast (CF) Metadata Conventions. The conventions mainly concern attributes: those that should always be supplied, and those that are recommended when a particular piece of information needs to be conveyed.

Suggestions for improvements and clarifications to the convention are welcome.

Files and filenames

Files should use the netCDF4 classic file format. Each level 1 (instrumental or model data) or level 2 (geophysical product) file should contain data from a single day. The times reported in the file should be in hours UTC, so for instruments that operate continuously, each individual file should run from midnight to midnight UTC.

Filenames should be of the form YYYYMMDD_WHERE_WHAT.nc or YYYYMMDD_WHERE_WHAT_ID.nc, where the fields are as follows:

YYYYMMDD
The date, UTC.
WHERE
A lower case string identifying the site (e.g. chilbolton, arm-sgp).
WHAT
This field identifies the instrument (e.g. galileo for the Galileo radar), the model (e.g. harmonie-fmi-0-5 for the 0-5 hour forecast of the HARMONIE-AROME model) or the geophysical product (e.g. iwc-Z-T-method for ice water content derived using the reflectivity+temperature method).
ID
Additional unique identifier to distinguish multiple files of the same type (e.g. two radars in the same location) from each other.

The fields should not contain underscores (_); hyphens (-) should be used to separate information within fields. This then allows for an additional field to be added in a future revision of the convention. Thus filenames should contain only the characters [-_.a-z0-9]. Spaces are forbidden as they have the habit of breaking Unix scripts.
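
The filename rules above can be checked mechanically. The following Python sketch is only an illustration: the pattern FILENAME_RE and the function parse_filename are hypothetical names, not part of the convention. It applies the stated character set literally, so a field containing upper-case letters is rejected.

```python
import re
from datetime import datetime

# Hypothetical pattern: YYYYMMDD_WHERE_WHAT with an optional _ID field.
# Each field is limited to lower-case letters, digits and hyphens.
FILENAME_RE = re.compile(
    r"(?P<date>\d{8})_(?P<where>[a-z0-9-]+)_(?P<what>[a-z0-9-]+)"
    r"(?:_(?P<id>[a-z0-9-]+))?\.nc"
)

def parse_filename(name):
    """Return the filename fields as a dict, or None if the name is invalid."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        return None
    try:
        # Reject strings of eight digits that are not real dates.
        datetime.strptime(match["date"], "%Y%m%d")
    except ValueError:
        return None
    return match.groupdict()

fields = parse_filename("20240105_chilbolton_galileo.nc")
print(fields)
```

Because underscores are reserved as field separators, splitting on them is unambiguous, which is what allows a further field to be added later.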

Dimensions

The netCDF dataset should contain the dimension time, which should be the first dimension defined. The vertical dimension may be range, height or level:

time
This dimension may have the length unlimited. NetCDF permits one dimension to be unlimited, which means that variables using this dimension can grow along this dimension. However, if the data are read one variable at a time then the use of an unlimited dimension seems to slow down the read speed.
range
This dimension is used for instrumental data up to level 1b, and indicates that distance is measured from the instrument rather than from mean sea level, and also allows for instruments not pointing at zenith.
height
For level 1c and 2, ranges from instruments are converted to heights above mean sea level, and this dimension name is used.
level
For level 1 model data this is used to indicate model level rather than height, since model levels often do not correspond to unique heights.

Other dimensions may be defined. For example, the level 1b model data contains microwave propagation parameters derived from the model fields for several different frequencies, so uses the dimension frequency. The level 1c categorize dataset holds model data on the original vertical model grid (to save space), which is referenced using the model_height dimension.

Variables

Compulsory variables

The following compulsory variables are stored as variables rather than global attributes because they have a unit or other describing attribute associated with them; the attributes that should be set are shown indented after each variable name. Each netCDF attribute consists of a “name” and a “value”, where the value can be a text string or a vector of numbers. All these variables are of type float, i.e. a 4-byte floating-point number.

latitude
units = "degrees_north"
long_name = "Latitude of site"
longitude
It is conventional to always report positive longitudes, i.e. +359.0 rather than -1.0.
units = "degrees_east"
long_name = "Longitude of site"
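
As a small illustration of the positive-longitude convention, a longitude given in the range -180 to 180 can be mapped to the range 0 to 360 with a modulo operation. The function name below is hypothetical.

```python
def positive_longitude(lon_deg):
    """Map a longitude in degrees east to the range [0, 360)."""
    return lon_deg % 360.0

print(positive_longitude(-1.0))
```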

For each dimension a “coordinate variable” must be defined, i.e. a vector variable with the same name as the dimension. Typically these would be of type float. Thus all datasets should contain a time variable:

time(time)
Note that the float type has enough precision for time in hours to be discretised to better than 0.007 seconds.
units = "hours since YYYY-MM-DD 00:00:00 +00:00"
where YYYY-MM-DD must contain the date on which the data were taken (e.g. 2002-09-05). The zeros at the end indicate that the time is measured from midnight UTC (i.e. timezone offset +00:00). This way of reporting time follows the CF convention. Note that reporting time in hours rather than seconds from midnight is much more convenient for the user.
long_name = "Time UTC"
axis = "T"
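
The precision figure quoted for the time variable can be verified with a short calculation. The sketch below uses only the Python standard library to find the gap between adjacent 4-byte floats near the end of the day, where the resolution is coarsest. The helper float32_spacing is a hypothetical name and is intended for positive finite values only.

```python
import struct

def float32_spacing(value):
    """Gap between value (rounded to float32) and the next float32 above it."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    current = struct.unpack("<f", struct.pack("<I", bits))[0]
    following = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return following - current

# Times between 16 and 24 hours share the same (coarsest) spacing.
resolution_s = float32_spacing(23.99) * 3600.0
print(f"Resolution near midnight: {resolution_s:.4f} s")
```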

A range, height or level variable should then also be defined, depending on the dimensions present, e.g.

range(range)
units = "km"
long_name = "Range from antenna to the centre of each range gate"
An example long name.
axis = "Z"
height(height)
units = "m"
long_name = "Height above mean sea level"
axis = "Z"

Note that the axis attribute is the CF way of stating the dominant temporal and vertical variables against which 2D variables in the file should be plotted. No more than one axis of a given type should be present in the file.

Compulsory variable attributes

All variables should set the following two attributes:

units
The units should be readable by the UDUNITS package, as required by the CF convention description. An additionally accepted unit is "dBZ". If possible, units should be SI. The main points for uniform use of units are as follows:
  • Exponents should be written as a number directly after the unit, as in "g m-3", not "g m^{-3}", "gm-3", "g/m3", "g(m)^-1" etc.
  • If conventional modifiers such as "kilo" are used, please use the correct case, i.e. "km" not "Km" for kilometers.
  • The appropriate way to express microns is "um", not "microns" or "1e-6 m".
  • The units for time should conform to the use indicated in the section above.
  • Dimensionless variables use the unit "1".
Note that bit fields and status fields, defined below, need not use the units attribute.
long_name
This should be a concise but informative phrase describing the variable, short enough to fit comfortably in the axis or title of a plot (i.e. shorter than around 60 characters). It should start with an upper case letter.

The following attributes are good ways to express information about a variable. They should conform to the conventions indicated.

comment
This is by far the most important attribute that a variable can have as it describes to the user what the variable is. Do not assume that the user has a copy of documentation that should have been distributed with the file: put enough information here to explain what the variable contains, how it was derived, what the calibration convention was and things the user should be aware of when using this variable. If there are references specific to this variable (i.e. those that would be inappropriate in the global references attribute) then include them here. Ideally this attribute should start with "This variable contains...". Use complete sentences terminated with a full-stop/period so that extra comments can be easily appended. New line characters (ASCII code: decimal 10) should be used to break long lines. Note that the use of the plural comments has been deprecated.
_FillValue
If the variable contains missing data (e.g. because an instrument was not working, or the variable indicates cloud particle size but no cloud is present) then _FillValue should be present to indicate which value has been used to flag that no valid data are available. It must be of the same type as the variable itself.
source
For datasets containing variables derived from different sources, it is useful to indicate the particular source here. Typically one would take the global source attribute from the dataset from which this variable was derived.

Variables indicating error and sensitivity

All derived geophysical products at level 2 and above should ideally be accompanied by an indication of their error. Typically errors can be divided into random error that decorrelates rapidly with time, and a bias due to the accuracy with which an instrument was calibrated and which may affect all measurements in a day uniformly. Additionally, many instruments and the products derived from them have a sensitivity, or a minimum detectable value, which should be reported in order that comparison with models be fair. Variables affected in this way should define one or more of the following attributes:

<variable>_error
Contains the name of the variable in the file that indicates the random error of the variable in question. Typically if the variable name were Z, then the corresponding error variable would be Z_error. The variables should be linked with the ancillary_variables attribute as specified in the CF conventions.
<variable>_bias
As above, but for the bias. Similarly, the typical name for the bias in Z would be Z_bias.
<variable>_sensitivity
As above, but for the sensitivity. The typical name for the minimum detectable Z would be Z_sensitivity.

Sometimes errors can have a long (and difficult to define) decorrelation time, and it is not obvious how to differentiate between random error and bias. In this case only a <variable>_error need be defined. The variables used to report error and sensitivity should conform to the following conventions:

Bit fields and status fields

It is often necessary to indicate the status of a retrieval, enabling the user to distinguish pixels for which the retrieval was (for example) “reliable”, “probably reliable but…”, “unreliable”, “not possible”. Sometimes targets need to be classified into one of a number of different types, such as “liquid clouds”, “ice clouds”, “aerosol”, “insects”. In these cases one can use a status field, where the integer variable takes one of a limited number of values, or a bit field, where each bit of the integer variable is interpreted as a separate flag.

Rather than use a units attribute, the variable should use a definition attribute, where each line (separated by the newline character) indicates the meaning either of each value, or of each bit. In the case of status fields, we could have:

definition =
"Value 0: No cloud present
Value 1: Reliable retrieval
Value 2: Possibly unreliable retrieval due to spiders in the waveguide
Value 3: Unreliable retrieval"

while in the case of bit fields we could have:

definition =
"Bit 0: Liquid droplets are present
Bit 1: Ice particles are present
Bit 2: Raindrops are present
Bit 3: Aerosol particles are present"

where bit 0 is the least significant bit.
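
A reader of such a file could interpret the definition attribute along the following lines. This is a minimal sketch that assumes every line has exactly the "Value N: text" or "Bit N: text" layout shown above; the function names are hypothetical.

```python
def parse_definition(definition):
    """Map each value or bit number in a definition attribute to its text."""
    meanings = {}
    for line in definition.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        label, _, text = line.partition(":")
        # label looks like "Value 2" or "Bit 3"; keep only the number
        meanings[int(label.split()[1])] = text.strip()
    return meanings

def decode_bits(value, meanings):
    """Return the descriptions of all bits set in value (bit 0 is the LSB)."""
    return [text for bit, text in sorted(meanings.items()) if value & (1 << bit)]

bit_definition = (
    "Bit 0: Liquid droplets are present\n"
    "Bit 1: Ice particles are present\n"
    "Bit 2: Raindrops are present\n"
    "Bit 3: Aerosol particles are present"
)
# 0b0101 has bits 0 and 2 set.
flags = decode_bits(0b0101, parse_definition(bit_definition))
print(flags)
```

For a status field the dictionary returned by parse_definition can be indexed directly with the pixel value, since each pixel holds exactly one of the listed values.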

Global attributes

Global attributes provide important information about the data in a netCDF file.

Compulsory global attributes

The following attributes should be present and of type text:

Conventions = "CF-1.8"
Indicates that your data satisfies the CF conventions.
day
The day of the month on which the data were taken as a two-digit number (e.g. "01").
month
The month of the year as a two-digit number (e.g. "01" for January).
year
The year as a full four-digit number (e.g. "2024").
cloudnet_file_type
Identifier for product (e.g. "lidar" or "classification").
location
The site at which the instrument was operating, such as "Chilbolton", "Cabauw", "Palaiseau" and "ARM Southern Great Plains".
title
A suitable title for plots created from the dataset, such as "Ice water content from Chilbolton", "Chilbolton 94-GHz Cloud Radar (Galileo)" or "Cabauw 905-nm CT75K Vaisala Lidar Ceilometer".
history
Each program that acts on the file should append to this attribute a brief description of what it did, and when it did it (again using the newline character as a separator). Extra information can include the user and the name of the machine. For example, "Wed Nov 28 18:38:12 GMT 2001 - NetCDF generated from original data by Robin Hogan <r.j.hogan@reading.ac.uk> on voldemort". If the calibration needs to be changed then it may be appended by "\nThu Nov 29 18:38:12 GMT 2001 - Recalibrated (+3 dB) by Robin Hogan <r.j.hogan@reading.ac.uk> on voldemort", where '\n' indicates the newline character (i.e. not a backslash character followed by an "n" character).
source
In the case of instrumental data, this would contain a brief specification of the instrument. The spec of a radar should include frequency, antenna diameter, pulse repetition frequency, pulse width (in microseconds) and peak power, and the spec of a lidar should include wavelength, divergence, field of view and pulse repetition frequency. The fields would be newline separated. In the case of model data a single-line title for the model is sufficient, e.g. "UK Met Office mesoscale model". Data derived from a variety of sources should concatenate the global source attributes from the input datasets, separated by semi-colon (;) and newline.
file_uuid
Universally unique identifier (UUID) of the file.
references
Any web-based or published information about the data, e.g. "Information on the data is available at http://www.met.rdg.ac.uk/radar/doc/galileo.html". Obviously please ensure that the web site referred to is maintained for the likely lifetime of the data.
comment
Any further general information for the user (that is not specific to individual variables) should be added here. Use complete sentences terminated with a full-stop/period so that extra comments can be easily appended. It is also useful to add newline characters to break up long lines.
source_file_uuids
Comma-separated list of UUIDs that identify the files used in generation of this product. Useful in categorize and level 2 files.
instrument_pid
Persistent identifier (PID) of the source instrument (e.g. https://hdl.handle.net/21.12132/3.d98f6fd2bec94e5e).
pid
Persistent identifier (PID) of the file (e.g. https://hdl.handle.net/21.12132/1.ce67fc697f3f4aa5).
serial_number
Serial number of the source instrument.
<software>_version
If the processing program changes over time then it is useful to store the version number (as a string) of the program here.
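
The date attributes and the history attribute described above could be generated as in the following sketch. The function names are hypothetical, and the timestamp is labelled UTC rather than GMT as in the example; the convention does not prescribe an exact timestamp layout, so this is only one reasonable choice.

```python
from datetime import date, datetime, timezone

def date_attributes(day):
    """Return the year, month and day global attributes as zero-padded text."""
    return {
        "year": f"{day.year:04d}",
        "month": f"{day.month:02d}",
        "day": f"{day.day:02d}",
    }

def append_history(history, message, when=None):
    """Append one timestamped line to an existing history string."""
    when = when or datetime.now(timezone.utc)
    entry = f"{when.strftime('%a %b %d %H:%M:%S UTC %Y')} - {message}"
    # A real newline character separates entries, not a backslash and an "n".
    return f"{history}\n{entry}" if history else entry

history = append_history(
    "", "NetCDF generated from original data",
    datetime(2001, 11, 28, 18, 38, 12, tzinfo=timezone.utc),
)
history = append_history(
    history, "Recalibrated (+3 dB)",
    datetime(2001, 11, 29, 18, 38, 12, tzinfo=timezone.utc),
)
print(history)
print(date_attributes(date(2024, 1, 5)))
```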

The following describes additional conventions that should make radar and lidar data from different sites as similar as possible.

Scalar variables

The following variables are single values that are stored as variables rather than global attributes because they have a unit or other describing attribute associated with them; the attributes that should be set are shown indented after each variable name. All these variables are of type float.

altitude
To get the altitude of each range gate above mean sea level, the user of this data should add this value to the values in the range variable (assuming the instrument is vertically pointing, and taking account of the fact that altitude is in metres and range is in km).
units = "m"
long_name = "Altitude of antenna above mean sea level"
elevation
Most radars will be vertically pointing, so their elevation will be 90°. Lidars may be deployed off-zenith to avoid specular reflection from horizontally aligned plate crystals, in which case the elevation will be less than 90°.
units = "degrees"
long_name = "Elevation above horizon"
azimuth
An optional variable that gives the azimuth of instruments that are not vertically pointing.
units = "degrees"
long_name = "Azimuth clockwise from due north"
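
The conversion from range to height described under altitude can be sketched as follows. The unit conversion (range in km, altitude in m) follows the text. The use of the sine of the elevation for instruments that are not vertically pointing is an added assumption that ignores Earth curvature and refraction, and the function name is hypothetical.

```python
import math

def gate_heights_m(altitude_m, range_km, elevation_deg=90.0):
    """Height of each range gate above mean sea level, in metres."""
    sin_elevation = math.sin(math.radians(elevation_deg))
    return [altitude_m + 1000.0 * r * sin_elevation for r in range_km]

# Vertically pointing instrument 100 m above mean sea level.
print(gate_heights_m(100.0, [0.5, 1.0]))
```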

For radar the following should also be defined:

frequency
units = "GHz"
long_name = "Radar frequency"

For lidar, use:

wavelength
If this is a multi-wavelength lidar, then wavelength should be a one-dimensional array containing all the wavelengths available. This requires an extra dimension, also with name wavelength.
units = "nm"
long_name = "Lidar wavelength"

Two-dimensional variables

Most two-dimensional variables will be of type float. However, for some data it may make sense to use the short data type (a signed 2-byte integer). The CT75K lidar ceilometer is a good candidate, as the raw data are stored to this precision, so no information is lost. You may then use scale_factor and/or add_offset attributes to get the data into suitable units and to provide the correct calibration. If both are present then the data in the file should be multiplied by scale_factor before add_offset is added. Note also that the _FillValue attribute applies to the data before they have been scaled and shifted in this way. Usually scale_factor and add_offset would be of type float.
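
The order of operations for packed data can be illustrated with a short sketch. The function name and the numbers used are hypothetical; the point is that the fill value is compared with the stored integers before any scaling, and that scaling precedes the offset.

```python
def unpack(raw, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Convert packed integers to physical values; fill values become None."""
    return [
        # Test against the fill value first, on the unscaled data.
        None if fill_value is not None and value == fill_value
        # Then scale, and only then add the offset.
        else value * scale_factor + add_offset
        for value in raw
    ]

print(unpack([0, 100, -32768], scale_factor=0.5, add_offset=10.0, fill_value=-32768))
```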

For some variables, notably radar reflectivity, accurate calibration can be difficult and the data may need to be recalibrated after the initial release. These variables should therefore indicate the calibration that has been applied to them in the processing stage in the calibration_applied attribute.

The following are variable names that could be used in radar data, and some of the attributes that should be present:

Z(time, range)
units = "dBZ"
long_name = "Radar reflectivity factor"
comment = "Calibration convention: in the absence of attenuation, a cloud at 273 K containing one million 100-micron droplets per cubic metre will have a reflectivity of 0 dBZ at all frequencies."
calibration_applied
...in dB.
v(time, range)
units = "m s-1"
long_name = "Doppler velocity"
comment = "Positive velocities are away from the radar."
folding_velocity
This attribute indicates that the velocities may be folded, lying in the range -folding_velocity to folding_velocity.
width(time, range)
units = "m s-1"
long_name = "Spectral width"
comment = "This variable is the standard deviation of the reflectivity-weighted velocities in the radar pulse volume."
sigma_v(time, range)
Level 1 data is typically averaged to 30 seconds, so the velocity variable in the netCDF file is typically an average of a number of high-resolution mean velocity values measured in the averaging time. The sigma_v variable is the standard deviation of these high-resolution mean velocities. Spectral width is the standard deviation of actual particle velocities measured within the radar pulse volume in a short time (typically around 1 second), so tends to be dominated by the differential fall speeds of the different sized particles. This variable, on the other hand, is dominated by turbulence.
units = "m s-1"
long_name = "Standard deviation of mean velocity"
comment = "The data in this file are at a lower resolution than the raw data, and this variable is the standard deviation of the raw Doppler velocities measured within each output gate and ray."
ldr(time, range)
units = "dB"
long_name = "Linear depolarisation ratio"

Similarly, the following are variable names that could be used with lidar data:

beta(time, range)
If attenuated backscatter coefficient is measured at more than one wavelength, then the wavelength could be indicated in the variable name, such as beta1064, beta532 etc.
units = "m-1 sr-1"
long_name = "Attenuated backscatter coefficient"
ldr(time, range)
units = "1"
Lidar depolarisation ratio normally lies in the range 0 to 1.
long_name = "Linear depolarisation ratio"

If there is a need to have an unprocessed version of a variable in the file, then the suggested names are Z_raw, beta_raw and so on.