How to download Earth Observation datasets using Python with Radiant MLHub, TorchGeo and Dataset4EO
I tried three different Python libraries to download Remote Sensing datasets: radiant_mlhub, torchgeo and dataset4eo.
In a recent paper titled “EarthNets: Empowering AI in Earth Observation”, Xiong et al. put together a queryable collection of 401 Remote Sensing datasets (featured in the latest Spectral Reflectance Newsletter). As part of that work, they also developed Dataset4EO, a Python library that provides an easy-to-use way of loading datasets. With that as motivation, I tried three Python libraries we can use to download about 80 different Remote Sensing datasets: radiant_mlhub, torchgeo and dataset4eo.
These three Python libraries try to accomplish slightly different things.
radiant_mlhub.Dataset provides access to 64 datasets for users to download locally, while dataset4eo and torchgeo go one step further: both provide a PyTorch DataLoader for each of the available datasets. So, instead of writing their own custom PyTorch Dataset/DataLoader classes, users can rely on pre-built ones, which gives them a streamlined way of downloading a dataset and feeding it to their models. Cool, right?
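To make concrete what the pre-built classes save you from, here is a minimal sketch of the kind of map-style dataset you would otherwise write by hand. It is plain Python so it runs without PyTorch installed; a `torch.utils.data.DataLoader` works with any object that implements `__len__` and `__getitem__`. The class name `FolderChipDataset`, the one-folder-per-class layout and the `*.tif` pattern are assumptions made for illustration, not something any of the three libraries prescribes.

```python
from pathlib import Path


class FolderChipDataset:
    """Hypothetical map-style dataset: one sub-folder per class, one file per chip."""

    def __init__(self, root, pattern="*.tif"):
        self.root = Path(root)
        # Sort so that indices are stable across runs and machines.
        self.paths = sorted(self.root.glob(f"*/{pattern}"))
        self.classes = sorted({path.parent.name for path in self.paths})

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        path = self.paths[index]
        return {
            # Raw bytes only; a real dataset would decode them into a tensor here.
            "image": path.read_bytes(),
            "label": self.classes.index(path.parent.name),
        }
```

Even this toy version has to decide on file discovery, label encoding and sample format, which is exactly the boilerplate the pre-built datasets take care of.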
Radiant MLHub
The radiant_mlhub library just provides the means to download datasets, without any fancy DataLoader wrappers on top of it or Deep Learning models shipped with it.
After getting an API key things are fairly straightforward. Here you can have a look at the 64 datasets available through radiant_mlhub. The workflow has three steps: request an API key, list the available datasets and download the CV4A Kenya Crop Type Competition dataset.
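A minimal sketch of those three steps might look like the following. It assumes the key has been stored either with the `mlhub configure` command or in the `MLHUB_API_KEY` environment variable, and that `ref_african_crops_kenya_02` is the id of the CV4A Kenya dataset (worth confirming against the output of `list_datasets()`). The helper names `api_key_from_env`, `list_datasets` and `download_dataset` are my own, and the `Dataset.list`, `Dataset.fetch` and `download` calls are written as I understand them from the 0.4-era releases, so check them against the version you have installed.

```python
import os

# Assumption: id of the CV4A Kenya Crop Type Competition dataset on Radiant MLHub.
CV4A_KENYA_ID = "ref_african_crops_kenya_02"


def api_key_from_env(env_var="MLHUB_API_KEY"):
    """Return the API key stored in the environment, or None if it is unset."""
    key = os.environ.get(env_var, "").strip()
    return key or None


def list_datasets():
    """Return (id, title) pairs for the datasets the API exposes."""
    # Imported lazily so the helper above works without the package installed.
    from radiant_mlhub import Dataset

    return [(ds.id, ds.title) for ds in Dataset.list(api_key=api_key_from_env())]


def download_dataset(dataset_id=CV4A_KENYA_ID, output_dir="data"):
    """Fetch one dataset by id and download it into output_dir."""
    from radiant_mlhub import Dataset

    api_key = api_key_from_env()
    dataset = Dataset.fetch(dataset_id, api_key=api_key)
    os.makedirs(output_dir, exist_ok=True)
    dataset.download(output_dir=output_dir, api_key=api_key)
    return output_dir
```

Calling `list_datasets()` and then `download_dataset()` needs network access and a valid key. When the environment variable is unset, `api_key=None` is passed and the library is expected to fall back to a profile saved by `mlhub configure`.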
TorchGeo
TorchGeo is a PyTorch domain library focused on geospatial data that provides both datasets and (pre-trained) models.
One of the great things about torchgeo is that it ships PyTorch DataLoaders for its datasets, meaning that once you download a dataset you can use an out-of-the-box DataLoader to train your model (in PyTorch)! However, in some cases torchgeo doesn’t provide a “download method” and expects users to have already downloaded the dataset. In addition, downloading some datasets requires a Radiant MLHub API key, as a few of them are fetched from there. Here is a list of the datasets available through TorchGeo.
Once again, using torchgeo datasets is simple: you can download the same dataset as before (CV4A Kenya Crop Type) and iterate through it with a PyTorch DataLoader.
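As a rough sketch of that workflow, assuming `torch` and `torchgeo` are installed and a Radiant MLHub API key is at hand: the `root`, `download` and `api_key` arguments of `CV4AKenyaCropType` are written as I understand them from the torchgeo releases current when this post was written, and may differ in newer versions. The helper names `make_loader`, `batch_shapes` and `preview`, and the assumption that each sample is a dict with `"image"` and `"mask"` entries, are mine.

```python
def make_loader(root="data", batch_size=4, api_key=None):
    """Download CV4A Kenya Crop Type (if needed) and wrap it in a DataLoader."""
    # Imported lazily; both packages must be installed for this to run.
    from torch.utils.data import DataLoader
    from torchgeo.datasets import CV4AKenyaCropType

    dataset = CV4AKenyaCropType(root=root, download=True, api_key=api_key)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)


def batch_shapes(batch, keys=("image", "mask")):
    """Summarise a batch as {key: shape} for the keys that are present."""
    return {key: tuple(batch[key].shape) for key in keys if key in batch}


def preview(loader, limit=2):
    """Print the tensor shapes of the first few batches."""
    for step, batch in enumerate(loader):
        if step >= limit:
            break
        print(step, batch_shapes(batch))
```

Running `preview(make_loader(api_key="..."))` triggers the download on first use, so expect it to take a while; after that the same loader can be handed to a training loop.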
NOTE: At the time of writing, TorchGeo only supports radiant-mlhub 0.2.1–0.4. Versions 0.5+ won’t work! [link]
To learn more about TorchGeo, here is their paper on arXiv.
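If you want to check this constraint before running anything, a small standard-library sketch like the one below can help. It reads the range above as "at least 0.2.1 and below 0.5", which is my interpretation, and the function names are hypothetical.

```python
import re
from importlib import metadata

# Assumption: the supported range is read as >= 0.2.1 and < 0.5.
MIN_VERSION = (0, 2, 1)
UPPER_BOUND = (0, 5)


def parse_version(text):
    """Turn '0.4.3' into (0, 4, 3), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_supported(version_text):
    """True if the version falls inside the assumed supported range."""
    return MIN_VERSION <= parse_version(version_text) < UPPER_BOUND


def installed_mlhub_is_supported():
    """Check the installed radiant-mlhub; None means it is not installed."""
    try:
        return is_supported(metadata.version("radiant-mlhub"))
    except metadata.PackageNotFoundError:
        return None
```

If `installed_mlhub_is_supported()` returns `False`, pinning the package when installing (for example with the specifier `radiant-mlhub>=0.2.1,<0.5`) is the usual way out.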
Dataset4EO
The EarthNets platform (part of which is dataset4eo) is designed to provide an easy way to apply (deep learning) models to Remote Sensing datasets. Furthermore, EarthNets pushes for a more consistent evaluation of deep learning methods on remote sensing data by selecting 5 different datasets as benchmarks on 3 different tasks: image classification, object detection and semantic segmentation. As the authors state, “there are some difficulties in loading RS datasets, especially for researchers in other communities” and “it would be helpful to download, decompress and split the dataset [into train/val/test] automatically”.
There are several aspects of dataset4eo that make it very similar to TorchGeo: both provide a PyTorch (or PyTorch compatible) DataLoader and deep learning models to use out-of-the-box.
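To give a feel for the "split automatically" step the authors mention, here is a small standard-library sketch of a reproducible train/val/test split over a list of file names. This is not how dataset4eo implements it; the function name `split_items`, the default fractions and the seed are all assumptions chosen for illustration.

```python
import random


def split_items(items, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Split items into train/val/test lists in a reproducible way."""
    # Sort first so the result does not depend on the input order,
    # then shuffle with a fixed seed so every run gives the same split.
    ordered = sorted(items)
    random.Random(seed).shuffle(ordered)

    n_val = int(len(ordered) * val_fraction)
    n_test = int(len(ordered) * test_fraction)
    return {
        "val": ordered[:n_val],
        "test": ordered[n_val:n_val + n_test],
        "train": ordered[n_val + n_test:],
    }
```

Having the library fix the split, rather than each user rolling their own, is also what makes results comparable across papers, which ties in with the benchmarking goal above.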
Here, however, we are more interested in the “dataset loading” part of it.
At the moment, I’m afraid that dataset4eo isn’t a mature enough codebase. While trying to run the code example and download the LandSlide4Sense dataset, I ran into several errors. Eventually I managed to run the LandSlide4Sense example and iterate through a PyTorch DataLoader. Check this ticket here for the steps I followed to work around the errors I was getting.
In the table below, I have put together a short comparison among the three libraries.