Working with ASCAT data and Xarray

Working with swath files

Creating a SwathFileCollection

If we have a collection of swath files, we can create a SwathFileCollection object that will handle them as a group.

from datetime import datetime
from importlib import reload

import ascat.read_native.ragged_array_ts as rat
from ascat.read_native import xarray_io

from time import time
swath_source = "/home/charriso/p14/data-write/RADAR/hsaf/h121_v1.0/swaths"

It’s important to understand the structure of the data, so that the SwathFileCollection can find and parse all of the data that is requested from it. Handily, this package comes with existing parsers for several ASCAT products. These can be used with SwathFileCollection.from_product_id():

swath_collection = rat.SwathFileCollection.from_product_id(swath_source, "H121_V1.0")

The currently included product ids are the keys of ascat.read_native.xarray_io.swath_io_catalog:

from ascat.read_native.xarray_io import swath_io_catalog
swath_io_catalog.keys()
dict_keys(['H129', 'H129_V1.0', 'H121_V1.0', 'H122', 'SIG0_6.25', 'SIG0_12.5'])

If your data structure does not match anything included in the package, you can write a class inheriting from ascat.read_native.xarray_io.SwathIOBase. The example below assumes that numpy is imported as np and that FibGrid, grid_cache, and SwathIOBase are in scope, as they are in ascat.read_native.xarray_io, where the package's built-in IO classes are defined:

class AscatH121v1Swath(SwathIOBase):
    # outlines the basic structure of the swath filenames, with {bracketed} variable names
    # in the place of anything that changes according to the data inside
    fn_pattern = "W_IT-HSAF-ROME,SAT,SSM-ASCAT-METOP{sat}-12.5km-H121_C_LIIB_{placeholder}_{placeholder1}_{date}____.nc"

    # defines the names of the subfolder/directory names that contain the data. In this case,
    # files are sorted into folders by satellite and year, so that a typical filepath looks like
    # ".../metop_b/2021/W_IT-HSAF-ROME..."
    sf_pattern = {
        "satellite_folder": "metop_[abc]",
        "year_folder": "{year}"
    }

    # specifies the string format of the date that occupies the "date" field in fn_pattern
    date_format = "%Y%m%d%H%M%S"

    # provides the grid that the data is based on. In this case, using the grid_cache allows
    # us to reuse the same grid for multiple classes in the source code without
    # actually generating several different FibGrids.
    # however, it would also work to just write, for example:
    #     grid = FibGrid(12.5)
    grid = grid_cache.fetch_or_store("Fib12.5", FibGrid, 12.5)["grid"]

    # specifies the size (in degrees) of the grid cell to be written out by SwathFileCollection.stack()
    grid_cell_size = 5

    # specifies the filename format for the cells written out by SwathFileCollection.stack()
    cell_fn_format = "{:04d}.nc"

    # names any dataset variables with a "beams" dimension - not relevant with this particular product
    beams_vars = []

    # defines all the dataset's data variable names and the dtype that they should be written as when packed.
    ts_dtype = np.dtype([
        ("sat_id", np.int8),
        ("as_des_pass", np.int8),
        ("swath_indicator", np.int8),
        ("surface_soil_moisture", np.float32),
        ("surface_soil_moisture_noise", np.float32),
        ("backscatter40", np.float32),
        ("slope40", np.float32),
        ("curvature40", np.float32),
        ("surface_soil_moisture_sensitivity", np.float32),
        ("backscatter_flag", np.uint8),
        ("correction_flag", np.uint8),
        ("processing_flag", np.uint8),
        ("surface_flag", np.uint8),
        ("snow_cover_probability", np.int8),
        ("frozen_soil_probability", np.int8),
        ("wetland_fraction", np.int8),
        ("topographic_complexity", np.int8),
        ("subsurface_scattering_probability", np.int8),
    ])

    # a function for generating a filename regex for a particular timestamp
    # (used for filtering swath files by date)
    @staticmethod
    def fn_read_fmt(timestamp):
        return {"date": timestamp.strftime("%Y%m%d*"),
                "sat": "[ABC]",
                "placeholder": "*",
                "placeholder1": "*"}

    # a function that returns subfolder regexes based on a timestamp
    # Here, in the case of satellite_folder, we don't actually need to be able to search
    # by satellite, so we just have a regex that is inclusive of all three possible satellites.
    # However, in the case of year_folder, we want the search year to actually match with the
    # timestamp we pass to the function.
    @staticmethod
    def sf_read_fmt(timestamp):
        return {
            "satellite_folder": {"satellite": "metop_[abc]"},
            "year_folder": {"year": f"{timestamp.year}"},
        }

    def __init__(self, filename, **kwargs):
        super().__init__(filename, "netcdf4", **kwargs)

After creating your IO class, you can use it to make a collection by passing it to the SwathFileCollection class:

swath_collection = rat.SwathFileCollection(swath_source, ioclass=AscatH121v1Swath)

Regardless of how you define your collection, once created it can be used to read data from your swath collection for any given date range and geographic extent. It can also be used to stack data in the collection into cellwise timeseries in indexed ragged array format, according to the CellGrid defined in the ioclass.

Let’s start with a rectangular, roughly country-sized area in Central Europe, and a time period of four days.

bounds = (45, 50, 10, 20) #latmin, latmax, lonmin, lonmax
dates = (datetime(2020, 12, 1), datetime(2020, 12, 5))

By calling the read method of our SwathFileCollection, we open every dataset within the date_range we’ve passed, concatenate them into an xarray dataset, and filter out any observations that don’t intersect with the bounding box we’ve passed to bbox. Other spatial selections we could pass to read are cell (cell number or list of cell numbers in the swath data’s grid system), location_id (grid point ID or list of IDs in the swath data’s grid system), coords (a lat/lon coordinate or list of coordinates that will be converted to the nearest grid point ID or list of grid point IDs), or geom (a Shapely geometry).
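For example, the other selection options might be used like this (a sketch only; the cell numbers, grid point IDs, and box geometry below are purely illustrative):

from shapely.geometry import box

by_cells = swath_collection.read(cell=[1431, 1432], date_range=dates)
by_gpis = swath_collection.read(location_id=[1258693, 1259680], date_range=dates)
by_geom = swath_collection.read(geom=box(10, 45, 20, 50), date_range=dates)  # lonmin, latmin, lonmax, latmax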

output = swath_collection.read(bbox=bounds, date_range=dates)
output
<xarray.Dataset>
Dimensions:                            (obs: 45007)
Coordinates:
    latitude                           (obs) float64 dask.array<chunksize=(1056,), meta=np.ndarray>
    longitude                          (obs) float64 dask.array<chunksize=(1056,), meta=np.ndarray>
    time                               (obs) datetime64[ns] 2020-12-01T08:09:...
Dimensions without coordinates: obs
Data variables: (12/19)
    location_id                        (obs) float64 dask.array<chunksize=(1056,), meta=np.ndarray>
    as_des_pass                        (obs) float32 dask.array<chunksize=(1056,), meta=np.ndarray>
    swath_indicator                    (obs) float32 dask.array<chunksize=(1056,), meta=np.ndarray>
    surface_soil_moisture              (obs) float32 dask.array<chunksize=(1056,), meta=np.ndarray>
    surface_soil_moisture_noise        (obs) float32 dask.array<chunksize=(1056,), meta=np.ndarray>
    backscatter40                      (obs) float64 dask.array<chunksize=(1056,), meta=np.ndarray>
    ...                                 ...
    snow_cover_probability             (obs) float32 dask.array<chunksize=(1056,), meta=np.ndarray>
    frozen_soil_probability            (obs) float32 dask.array<chunksize=(1056,), meta=np.ndarray>
    wetland_fraction                   (obs) float32 dask.array<chunksize=(1056,), meta=np.ndarray>
    topographic_complexity             (obs) float32 dask.array<chunksize=(1056,), meta=np.ndarray>
    subsurface_scattering_probability  (obs) float32 dask.array<chunksize=(1056,), meta=np.ndarray>
    sat_id                             (obs) int64 ...

Now we have a nice Xarray dataset that we can work with however we wish. In this case, it’s one-dimensional, so we are basically working with a tabular data structure. In order to work with it as 3-dimensional (latitude, longitude, time) raster data, we can aggregate it into timesteps.
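As a quick preview, here is a minimal sketch of one such aggregation (daily means of surface soil moisture per grid point), using flox, which is introduced properly below; the day grouping variable and the expected_groups argument are explained later in this notebook, and the names here are just illustrative.

import numpy as np
from flox.xarray import xarray_reduce

# daily mean surface soil moisture per grid point (dimensions: location_id x day)
day = output["time"].dt.floor("D").rename("day")
daily_ssm = xarray_reduce(
    output["surface_soil_moisture"],
    output["location_id"],
    day,
    expected_groups=(np.unique(output["location_id"].values), np.unique(day.values)),
    func="mean",
)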

First, to make sure we got the desired data, let’s make some plots.

This is not a very useful plot, but it shows that the data covers the time range we requested, and that it includes data from all three Metop satellites.

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
plt.close()
fig, ax = plt.subplots()
scatter = ax.scatter(output.time, output.longitude, s=0.01, c=output.sat_id, cmap="rainbow", alpha=0.8)
legend1 = ax.legend(*scatter.legend_elements(), title="Satellite")
for i in range(3):
    legend1.get_texts()[i].set_text(f"Metop {chr(65+i)}")
ax.add_artist(legend1)
plt.xlabel("Time")
plt.ylabel("Latitude (degrees)")
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d\n%H:%M"))
plt.xticks(rotation=30)
plt.tight_layout()
[figure: scatter of observation time vs. longitude, colored by Metop satellite]

We can check the spatial coverage of the data by plotting it on a map.

import cartopy.crs as ccrs
from shapely import Point

plt.close()
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
ax.gridlines(draw_labels=True)
ax.set_extent([-10, 30, 35, 65])
plt.scatter(
    output.longitude,
    output.latitude,
    s=0.1,
    alpha=0.1
)
<matplotlib.collections.PathCollection at 0x7fa7d97e7f70>
[figure: map of swath observation locations over Central Europe]

Having the data as an Xarray dataset makes it handy to do transformations. For example, we can group by location_id and get the average surface soil moisture at each location. First, we need to load location_id into memory, since it currently exists as a chunked dask array, and the groupby method only works with numpy arrays.

output["location_id"].load()
<xarray.DataArray 'location_id' (obs: 45007)>
array([1258693., 1259680., 1260057., ..., 1260222., 1260311., 1260455.])
Coordinates:
    latitude   (obs) float64 49.91 49.96 49.98 49.82 ... 49.98 49.99 49.99 50.0
    longitude  (obs) float64 19.94 19.78 19.35 19.84 ... 11.69 10.57 12.38 11.26
    time       (obs) datetime64[ns] 2020-12-01T08:09:59.415000064 ... 2020-12...
Dimensions without coordinates: obs
Attributes:
    long_name:  Location identifier (Grid Point ID)
    valid_min:  0
    valid_max:  3300000
%%time
avg_ssm = output["surface_soil_moisture"].groupby(output["location_id"]).mean("obs")
avg_ssm.load()
/home/charriso/micromamba/envs/egu2024/lib/python3.8/site-packages/flox/core.py:2004: PerformanceWarning: Slicing with an out-of-order index is generating 78 times more chunks
  result = result[..., sorted_idx]
CPU times: user 1.24 s, sys: 82.1 ms, total: 1.33 s
Wall time: 1.21 s
<xarray.DataArray 'surface_soil_moisture' (location_id: 2652)>
array([55.3      , 73.21357  , 54.573635 , ...,  3.9599998, 41.566666 ,
       25.1      ], dtype=float32)
Coordinates:
  * location_id  (location_id) float64 1.163e+06 1.163e+06 ... 1.26e+06 1.26e+06
Attributes:
    long_name:  surface soil moisture
    units:      percent saturation
    valid_min:  0
    valid_max:  10000

This takes ~1.2 seconds on my own machine, which isn’t too bad, but we’re only working with four days of data here, for a relatively small area. This operation could easily balloon in complexity and become intractable, especially if it needs to be repeated often. We also get warnings about slicing with an out-of-order index.

However, if we use flox directly (a package from the Xarray developers designed for faster groupby operations on Xarray datasets), we can accomplish the same operation in a fraction of the time (~90 ms on my machine). When scaling up to much longer time periods and larger areas, these savings can make a huge difference.

from flox.xarray import xarray_reduce
%%time
avg_ssm_flox = xarray_reduce(output["surface_soil_moisture"], output["location_id"], func="mean")
avg_ssm_flox.load()
CPU times: user 93.7 ms, sys: 14.4 ms, total: 108 ms
Wall time: 93.4 ms
<xarray.DataArray 'surface_soil_moisture' (location_id: 2652)>
array([55.3      , 73.21357  , 54.57363  , ...,  3.9599998, 41.566666 ,
       25.100002 ], dtype=float32)
Coordinates:
  * location_id  (location_id) float64 1.163e+06 1.163e+06 ... 1.26e+06 1.26e+06
Attributes:
    long_name:  surface soil moisture
    units:      percent saturation
    valid_min:  0
    valid_max:  10000

Note: if, when using flox, you get an error about needing to provide expected_groups, make sure you have .load()-ed the variables you’ll be grouping your data by into memory first. If your dataset is too big for that, you can calculate the unique values of those variables and pass them as a tuple to the expected_groups parameter. For example, if we want to calculate soil moisture averages per location and hour of day, we can add a grouping on the time variable to our xarray_reduce arguments. However, if we haven’t loaded location_id into memory yet, we’ll get an error:

short_dates = (datetime(2020, 12, 1), datetime(2020, 12, 2))
ds = swath_collection.read(bbox=bounds, date_range=dates)
xarray_reduce(ds["surface_soil_moisture"], ds["location_id"], ds["time"].dt.hour, func="mean")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 3
      1 short_dates = (datetime(2020, 12, 1), datetime(2020, 12, 2))
      2 ds = swath_collection.read(bbox=bounds, date_range=dates)
----> 3 xarray_reduce(ds["surface_soil_moisture"], ds["location_id"], ds["time"].dt.hour, func="mean")

File ~/micromamba/envs/egu2024/lib/python3.8/site-packages/flox/xarray.py:337, in xarray_reduce(obj, func, expected_groups, isbin, sort, dim, fill_value, dtype, method, engine, keep_attrs, skipna, min_count, reindex, *by, **finalize_kwargs)
    332     if isbin_:
    333         raise ValueError(
    334             f"Please provided bin edges for group variable {idx} "
    335             f"named {group_name} in expected_groups."
    336         )
--> 337     expect_ = _get_expected_groups(b_.data, sort=sort)
    338 else:
    339     expect_ = expect

File ~/micromamba/envs/egu2024/lib/python3.8/site-packages/flox/core.py:103, in _get_expected_groups(by, sort)
    101 def _get_expected_groups(by: T_By, sort: bool) -> pd.Index:
    102     if is_duck_dask_array(by):
--> 103         raise ValueError("Please provide expected_groups if not grouping by a numpy array.")
    104     flatby = by.reshape(-1)
    105     expected = pd.unique(flatby[~isnull(flatby)])

ValueError: Please provide expected_groups if not grouping by a numpy array.

We didn’t get this error before because we had already loaded location_id into memory. Loading a single variable into memory shouldn’t be much of a problem, but if it is, here’s how you would use expected_groups to solve it instead:

import numpy as np
xarray_reduce(
    ds["surface_soil_moisture"],
    ds["location_id"],
    ds["time"].dt.hour,
    expected_groups=(np.unique(output["location_id"].values),
                     np.unique(output["time"].dt.hour.values)),
    func="mean"
).load()
<xarray.DataArray 'surface_soil_moisture' (location_id: 2652, hour: 9)>
array([[91.71     , 37.46     , 71.503334 , ..., 55.53     , 40.95     ,
               nan],
       [75.7      , 56.953327 , 63.656666 , ..., 80.432495 , 94.45     ,
               nan],
       [52.       , 42.173332 , 57.54333  , ..., 58.72333  ,        nan,
               nan],
       ...,
       [ 3.9599998,        nan,        nan, ...,        nan,        nan,
               nan],
       [       nan, 26.9      , 70.9      , ...,        nan,        nan,
               nan],
       [14.49     , 21.055    , 12.83     , ...,        nan, 31.404999 ,
               nan]], dtype=float32)
Coordinates:
  * location_id  (location_id) float64 1.163e+06 1.163e+06 ... 1.26e+06 1.26e+06
  * hour         (hour) int32 7 8 9 10 17 18 19 20 21
Attributes:
    long_name:  surface soil moisture
    units:      percent saturation
    valid_min:  0
    valid_max:  10000

I’ll step away from the data for a second and write a quick function for plotting it on a map:

from matplotlib import pyplot as plt
import cartopy.crs as ccrs

def simple_map(lons, lats, color_var, cmap, dates=None, cbar_label=None):
    plt.close()
    ax = plt.axes(projection=ccrs.PlateCarree())
    ax.coastlines()
    gl = ax.gridlines(draw_labels=True)
    gl.bottom_labels = False
    gl.right_labels = False
    ax.set_extent([lons.min()-5, lons.max()+5, lats.min()-5, lats.max()+5])
    # ax.set_extent([-10, 30, 35, 65])
    plt.scatter(
        lons,
        lats,
        c=color_var,
        cmap=cmap,
        s=1,
        # alpha=0.8,
        # clim=(0, 100)
    )
    if cbar_label is None:
        cbar_label = (
            f"Average {color_var.long_name}\n"
            f"({color_var.units})\n"
        )
    if dates is not None:
        cbar_label += f"\n{np.datetime_as_string(dates[0], unit='s')} - {np.datetime_as_string(dates[1], unit='s')}"

    plt.colorbar(label=(cbar_label),
                 shrink=0.5,
                 pad=0.05,
                 orientation="horizontal"
    )
    plt.tight_layout()

And here is our mean soil moisture!

import cmcrameri.cm as cmc
lons, lats = swath_collection.grid.gpi2lonlat(avg_ssm_flox.location_id.values)
simple_map(lons, lats, avg_ssm_flox, cmc.roma, (output.time.values.min(), output.time.values.max()))
[figure: map of average surface soil moisture for the selected area and date range]

Now it’s easy to make a map of any of the other variables in the dataset. Here’s the average backscatter at 40 degrees incidence angle:

avg_backscatter40 = xarray_reduce(output["backscatter40"], output["location_id"], func="mean")
simple_map(lons, lats, avg_backscatter40, "viridis", (output.time.values.min(), output.time.values.max()))
[figure: map of average backscatter40 for the selected area and date range]

Or we could make a timeseries plot of a variable at a single location or a collection of locations:

week_dates = (datetime(2020, 12, 1), datetime(2020, 12, 8))
week_data = swath_collection.read(date_range=week_dates, bbox=bounds)
date_groups = week_data.groupby("time.date")
for dt, ds in date_groups:
    plt.scatter(ds["time.date"], ds.backscatter40, color="black", s=1, alpha=0.01)

plt.plot(date_groups.groups.keys(), date_groups.mean().backscatter40.values, color="red")

plt.title("Daily backscatter values, Metop A, B and C\n"
          "Latitudes 45-50, Longitudes 10-20")
plt.ylabel(f"{ds.backscatter40.units}")
plt.xlabel(f"date")
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=1))
plt.xticks(rotation=30)
plt.tight_layout()
[figure: daily backscatter40 observations and daily means, 2020-12-01 to 2020-12-08]

We can make a 5-year climatology for our data in three lines of code, but it will take a while to run, since we’ll have to read metadata from thousands of files to compile the xarray dataset. I do not recommend running this cell.

# five year climatology
five_years = [datetime(2015, 1, 1), datetime(2020, 1, 1)]
# `gpis` here stands in for a list of grid point IDs of interest
five_years_data = swath_collection.read(location_id=gpis, date_range=five_years)#.load()
climatology = five_years_data.groupby("time.dayofyear").mean("obs")

If you need to do several operations on longer periods of time, it could be useful to convert the data to a cell file collection and work from that instead (see the cell file collection sections below).

Converting swath collections to cell collections

To convert a collection of swath files into a collection of cell files, we only need to call a single method from our SwathFileCollection. We pass it at least an output directory path (output_dir), where the outputs will be written, and we can also pass it several other options.

# unless this variable is set to False, the package will warn the user and wait for
# confirmation before running, since carelessly pointing `stack` at the wrong
# directory could delete or corrupt a lot of data
rat.process_warnings = False
# where to save the files
cell_file_directory = ""
# a list of swath file names to write, if you have a specific list
fnames = None
# the date range to stack data from
date_range = None
# mode : "w" for creating new files if any already exist, "a" to append data to existing cell files
# note that old data and new data will not be sorted after the append
mode = "w"
# the number of processes to use when writing the data.
# does NOT have anything to do with xarray's dask processing
# I've found that using too many processes, even on machines with many cores, may not be optimal.
# A good number is 8.
processes = 8

# the maximum size of the data buffer before dumping to file (actual maximum memory used will be higher)
# default is 6144MB
buffer_memory_mb = None

# # This command commented to avoid accidental execution
# swath_collection.stack(
#     output_dir=cell_file_directory,
#     fnames=fnames,
#     date_range=date_range,
#     mode=mode,
#     processes=processes,
#     buffer_memory_mb=buffer_memory_mb
# )

The output cells are in indexed ragged array format. In order to convert them to contiguous ragged array format, we can create a CellFileCollection from the output directory, and call the method to_contiguous():

cell_collection = rat.CellFileCollection.from_product_id(cell_file_directory, product_id="H121_V1.0")
contiguous_cell_file_directory = ""
cell_collection.to_contiguous(contiguous_cell_file_directory)

This will sort the entire dataset first by time and then by locationIndex, and then replace the dataset’s locationIndex variable with a row_size variable. At this point it is no longer practically possible to append new data to the dataset without first re-converting it to indexed ragged array format and then converting back.
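For intuition, here is a rough numpy sketch of what that conversion does, assuming ds is an indexed ragged array dataset with an obs dimension, a time variable, and a locationIndex variable (and ignoring locations that have no observations):

import numpy as np

# sort primarily by location, and by time within each location
order = np.lexsort((ds["time"].values, ds["locationIndex"].values))
ds_sorted = ds.isel(obs=order)

# a contiguous ragged array stores one row_size per location
# instead of one locationIndex per observation
_, row_size = np.unique(ds_sorted["locationIndex"].values, return_counts=True)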

Working with collections of cell files

Although CellFileCollection exists, it is currently optimized for use under the hood of CellFileCollectionStack. Ideally both could be used directly, but there are still some bugs to work out and some refactoring to do. To work with a single collection of cell files, simply create a CellFileCollectionStack containing that one collection.

Working with stacks of cell file collections

Creating a cell file collection

from datetime import datetime
from importlib import reload
from time import time

import ascat.read_native.ragged_array_ts as rat

Our cell files, in this case, all live in a single directory, so that’s the path we’ll pass to rat.CellFileCollectionStack.from_product_id(). If we had multiple sets of cell files contained in different directories, we could pass a list of these directories’ paths, assuming they were all of the same product type (and therefore had the same dimensions, data variables, etc).

The product id, "H121_V1.0", refers to a specific handler class defined in ascat.read_native.xarray_io. There are several of these already defined for various products we use, and it is also possible to define your own handler class if you need to process a product we haven’t included in this package already.

cell_source = "/home/charriso/p14/data-write/USERS/charriso/h121_merged/metop_abc/"
cell_collection = rat.CellFileCollectionStack.from_product_id(cell_source, "H121_V1.0")
/home/charriso/Projects/ascat/src/ascat/read_native/ragged_array_ts.py:945: UserWarning: Could not determine date range for collection 'metop_abc' from directory name. Using min/max datetime from files instead.
  warnings.warn(

Reading from a cell file collection

We can read data for a specific geographic and temporal extent, but if you have a single collection, it may actually take longer to create an xarray dataset if you also try to trim down the time range. In this case it is best to subset only by geographic extent on read, and then do any temporal subsetting after the xarray dataset is created, but before the data is actually loaded into memory with .load().

On the other hand, if you have a stack with multiple collections that cover different time ranges, you can possibly save a lot of time when reading using temporal subsetting. (Imagine you have dozens of weekly cell collections and only need two weeks - no need to even look at the other files).
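For example, a sketch of that multi-collection case might look like this (the directory paths below are hypothetical):

weekly_dirs = [
    "/path/to/cells_week_01",
    "/path/to/cells_week_02",
]
weekly_stack = rat.CellFileCollectionStack.from_product_id(weekly_dirs, "H121_V1.0")
two_weeks = weekly_stack.read(
    bbox=(45, 50, 10, 20),  # latmin, latmax, lonmin, lonmax
    date_range=(datetime(2020, 12, 1), datetime(2020, 12, 15)),
)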

Our options for geographic extent are cell, bbox, geom, and location_id. cell is a cell number or list of cell numbers, bbox is a tuple of (latmin, latmax, lonmin, lonmax), geom is a Shapely geometry object, and location_id is a grid point ID or list of grid point IDs.
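For instance, a read by grid point ID might look like this (the IDs below are purely illustrative):

output_gpis = cell_collection.read(location_id=[1258693, 1259680])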

bounds = (43, 51, 11, 21) #latmin, latmax, lonmin, lonmax
dates = (np.datetime64(datetime(2020, 12, 1)), np.datetime64(datetime(2020, 12, 15)))
output_bbox = cell_collection.read(bbox=bounds)#, date_range=dates)
output_bbox
<xarray.Dataset>
Dimensions:                            (obs: 59143813, locations: 4378)
Coordinates:
    time                               (obs) datetime64[ns] 2007-01-01T08:00:...
    lon                                (locations) float32 ...
    lat                                (locations) float32 ...
    alt                                (locations) float32 ...
Dimensions without coordinates: obs, locations
Data variables: (12/22)
    locationIndex                      (obs) int64 ...
    as_des_pass                        (obs) float32 ...
    swath_indicator                    (obs) float32 ...
    surface_soil_moisture              (obs) float32 ...
    surface_soil_moisture_noise        (obs) float32 ...
    backscatter40                      (obs) float32 ...
    ...                                 ...
    topographic_complexity             (obs) float32 ...
    subsurface_scattering_probability  (obs) float32 ...
    global_attributes_flag             (locations) int64 ...
    sat_id                             (obs) float32 ...
    location_id                        (locations) int64 ...
    location_description               (locations) object ...
Attributes: (12/15)
    title:             ASCAT surface soil moisture near real-time product
    summary:           ASCAT surface soil moisture expressed in degree of sat...
    doi:               unset
    keywords:          Metop-A ASCAT surface soil moisture
    history:           original generated product
    institution:       H SAF
    ...                ...
    disposition_mode:  Operational
    environment:       Operational
    references:        h-saf.eumetsat.int
    software_version:  warp_h_nrt 0.0.0
    conventions:       CF-1.10
    featureType:       timeSeries
date_range_data = output_bbox.sel(obs=(output_bbox["time"] > dates[0]) & (output_bbox["time"] < dates[1]))
date_range_data
<xarray.Dataset>
Dimensions:                            (obs: 226965, locations: 4378)
Coordinates:
    time                               (obs) datetime64[ns] 2020-12-01T08:11:...
    lon                                (locations) float32 ...
    lat                                (locations) float32 ...
    alt                                (locations) float32 ...
Dimensions without coordinates: obs, locations
Data variables: (12/22)
    locationIndex                      (obs) int64 ...
    as_des_pass                        (obs) float32 ...
    swath_indicator                    (obs) float32 ...
    surface_soil_moisture              (obs) float32 ...
    surface_soil_moisture_noise        (obs) float32 ...
    backscatter40                      (obs) float32 ...
    ...                                 ...
    topographic_complexity             (obs) float32 ...
    subsurface_scattering_probability  (obs) float32 ...
    global_attributes_flag             (locations) int64 ...
    sat_id                             (obs) float32 ...
    location_id                        (locations) int64 ...
    location_description               (locations) object ...
Attributes: (12/15)
    title:             ASCAT surface soil moisture near real-time product
    summary:           ASCAT surface soil moisture expressed in degree of sat...
    doi:               unset
    keywords:          Metop-A ASCAT surface soil moisture
    history:           original generated product
    institution:       H SAF
    ...                ...
    disposition_mode:  Operational
    environment:       Operational
    references:        h-saf.eumetsat.int
    software_version:  warp_h_nrt 0.0.0
    conventions:       CF-1.10
    featureType:       timeSeries

Now let’s map the average surface soil moisture over the area and time range we selected.

from flox.xarray import xarray_reduce
avg_sm = xarray_reduce(date_range_data["surface_soil_moisture"], date_range_data["locationIndex"], func="mean")
avg_sm
<xarray.DataArray 'surface_soil_moisture' (locationIndex: 4000)>
array([73.23913 , 85.52    , 52.248085, ..., 64.45    , 35.750626,
       62.632   ], dtype=float32)
Coordinates:
  * locationIndex  (locationIndex) int64 1 2 4 5 6 ... 4373 4374 4375 4376 4377
Attributes:
    long_name:    surface soil moisture
    units:        percent saturation
    valid_min:    0
    valid_max:    10000
    coordinates:  time latitude longitude
import numpy as np
lons = date_range_data.lon.values[avg_sm.locationIndex.values]
lats = date_range_data.lat.values[avg_sm.locationIndex.values]
simple_map(lons, lats, avg_sm, cmc.roma, (date_range_data.time.values.min(), date_range_data.time.values.max()))
[figure: map of average surface soil moisture, 2020-12-01 to 2020-12-15]

When we read data using cell ids, the process is just as easy:

output_cells = cell_collection.read(cell=[1431, 1432, 1395, 1396])
avg_sm = xarray_reduce(output_cells["surface_soil_moisture"], output_cells["locationIndex"], func="mean")
lons = output_cells.lon.values[avg_sm.locationIndex.values]
lats = output_cells.lat.values[avg_sm.locationIndex.values]
simple_map(lons, lats, avg_sm, cmc.roma, (output_cells.time.values.min(), output_cells.time.values.max()))
[figure: map of average surface soil moisture for cells 1395, 1396, 1431, and 1432]

I forgot to filter by time range, but it took flox only a few seconds to calculate the average surface soil moisture over the entire time range of the dataset for these cells!

Using geometries

If you have a shapefile you would like to use to filter your data, you will have to turn it into a shapely geometry object. There are a few ways you could do this (using geopandas, fiona, or ogr, for example). This function uses cartopy’s shapereader to fetch a world country boundaries shapefile from Natural Earth, and then uses shapely to create a geometry object from the desired country names.

import cartopy.io.shapereader as shpreader
from shapely.ops import unary_union

def get_country_geometries(country_names, resolution="10m", ne_product="admin_0_countries"):
    countries = shpreader.Reader(
        shpreader.natural_earth(
            resolution=resolution,
            category="cultural",
            name=ne_product,
        )
    ).records()
    if isinstance(country_names, str):
        country_names = [country_names]
    country_names = [name.lower() for name in country_names]

    # collect the geometry of every country whose sovereignty matches a requested name
    geometries = [
        country.geometry
        for country in countries
        if country.attributes["SOVEREIGNT"].lower() in country_names
        and country.geometry is not None
    ]
    return unary_union(geometries)

If we are interested in the Baltic countries, for example, we can simply pass a list of their names to get_country_geometries, then pass the resulting geometry to the geom argument of cell_collection.read().

baltics = ["Estonia", "Latvia", "Lithuania"]
country_data = cell_collection.read(geom=get_country_geometries(baltics))

Groupby operations are easy with flox. Here we calculate the average summer soil moisture for each location in the Baltics across the entire time range of the dataset.

import numpy as np
from flox.xarray import xarray_reduce

baltic_summer = country_data.sel(obs=(country_data.time.dt.season == "JJA"))
avg_sm = xarray_reduce(baltic_summer["surface_soil_moisture"], baltic_summer["locationIndex"], func="mean")
lons = country_data.lon.values[avg_sm.locationIndex.values]
lats = country_data.lat.values[avg_sm.locationIndex.values]
label = (
        f"Average summer soil moisture "
        f"in the Baltic countries\n"
        f"({avg_sm.units})\n"
        f"June, July, and August of 2007 - 2022"
)
simple_map(lons, lats, avg_sm, cmc.roma, cbar_label=label)
[figure: map of average summer surface soil moisture in the Baltic countries]

Remember that climatology we were going to make in the swaths section? Let’s do that now; it’s simple:

# 15-year climatology
ssm_climatology = xarray_reduce(country_data["surface_soil_moisture"], country_data["time"].dt.dayofyear, func="mean")
plt.close()
plt.plot(ssm_climatology)
plt.xlabel("Day of year")
plt.ylabel("Average surface soil moisture\n(% saturation)")
plt.title("Average surface soil moisture per day of year\n(Estonia, Latvia, and Lithuania; 2010-2019)")
plt.tight_layout()
[figure: average surface soil moisture per day of year for the Baltic countries]