{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyNpbEcMqB/VrzjjyAfuSWwJ",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/akhuff157/33bed04948fba54fbf8271f32e85f00a/tutorial_mss_viirs_aws_nodd_20250507.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"#**Tutorial: Accessing VIIRS Granules Files from the JPSS NODD on AWS using Python**\n",
"\n",
"This Python tutorial was written in May 2025 by Dr. Amy Huff, STC at NOAA/NESDIS/STAR (amy.huff@noaa.gov). It demonstrates how to 1) select VIIRS granule files in the JPSS NODD AWS S3 buckets for a specific geographic region of interest and 2) open a selected file from an S3 bucket and read variable data without downloading the file.\n",
"\n",
"<font color='red'>**If you use any of the Python code in your research, please credit the NOAA/NESDIS/STAR Aerosols & Atmospheric Composition Science Team.**</font>"
],
"metadata": {
"id": "Lo-fWLDTTN-U"
}
},
{
"cell_type": "markdown",
"source": [
"## **Section 0: Set up Google Colab**"
],
"metadata": {
"id": "x_amqa-K5S87"
}
},
{
"cell_type": "markdown",
"source": [
"[Google Colab](https://colab.google/) is a hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources.\n",
"\n",
"[Jupyter Notebook](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html) is an open-source web application that supports >40 programming languages, including Python. It allows code to be broken into “blocks” that run independently, which makes it ideal for learning. Any output from the code in a \"block\" will appear underneath it.\n",
"\n",
"**The Python code demonstrated in this training is universal**. Specific lines of code or functions will run in any Python IDE (e.g., Spyder, Visual Studio Code), Jupyter Notebook, or the Python interpreter."
],
"metadata": {
"id": "xWSzp7qCARPl"
}
},
{
"cell_type": "markdown",
"source": [
"###**Example of how to run Jupyter Notebook code blocks**\n",
"\n",
"To see how Jupyter Notebook works, let's run the Python code to print \"Hello world!\"\n",
"\n",
"Place your cursor over the grey code block below, then click the little black circle with the white arrow inside, located on the far left side of the block."
],
"metadata": {
"id": "8RQO-i-i4TM2"
}
},
{
"cell_type": "code",
"source": [
"print('Hello world!')"
],
"metadata": {
"id": "kKCQc4nl2vuN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### **Limitations of using Google Colab**\n",
"\n",
"Colab is free, powerful, and easy to use, but it has limitations:\n",
"\n",
"\n",
"1. **Colab sessions are temporary.** A session will expire after 12 hours of continuous use or after 90 minutes of idle time. <font color='red'>**All output, including downloaded or generated files, will be lost after the session expires.**</font> Therefore, any files users want to save must be downloaded to the user's local computer or Google Drive account.\n",
"2. **Colab cannot be configured to use a virtual or conda environment.** Therefore, in general, users must work with the existing, current Colab configuration.\n",
"3. **The Colab configuration changes frequently, with the addition of new packages and updates to existing packages.** This means code that runs today may give an error in the future after an update to the Colab configuration."
],
"metadata": {
"id": "NnfX517t1yxB"
}
},
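{
"cell_type": "markdown",
"source": [
"As a minimal sketch of point 1, the two common ways to save a session file before it is lost are shown below, assuming a hypothetical file named `example.nc` in the session's working directory."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: save a Colab session file ('example.nc' is a hypothetical file name)\n",
"\n",
"# Option 1: download the file to your local computer\n",
"from google.colab import files\n",
"files.download('example.nc')\n",
"\n",
"# Option 2: mount your Google Drive, then copy the file into it\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"metadata": {},
"execution_count": null,
"outputs": []
},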
{
"cell_type": "markdown",
"metadata": {
"id": "d3404f3a"
},
"source": [
"### **Import Python modules and packages**\n",
"\n",
"- [pathlib](https://docs.python.org/3/library/pathlib.html): module to set file system paths\n",
"- [datetime](https://docs.python.org/3/library/datetime.html): module to manipulate dates and times\n",
"- [S3Fs](https://s3fs.readthedocs.io/en/latest/): library to set up a file system interface with AWS Simple Storage Service (S3) buckets\n",
"- [requests](https://requests.readthedocs.io/en/latest/): library to send HTTP requests\n",
"- [pandas](https://pandas.pydata.org/docs/user_guide/index.html): library for data analysis\n",
"-[h5py](https://docs.h5py.org/en/stable/index.html): interface to the HDF5 binary data format\n",
"- [xarray](https://docs.xarray.dev/en/stable/index.html): library to work with labeled multi-dimensional arrays\n",
"- [NumPy](https://numpy.org/doc/stable/user/index.html): library to perform array operations\n",
"\n",
"Used by `xarray` but not imported:\n",
"- [h5netcdf](https://h5netcdf.org/): interface for the netCDF4 file-format based on h5py"
]
},
{
"cell_type": "markdown",
"source": [
"The Colab configuration does not include the `s3fs` package, so it needs to be installed.\n",
"\n",
"**Ignore any error messages about package dependency conflicts.**"
],
"metadata": {
"id": "uKyeZsUEBdGs"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6Fsya_6ip_1L"
},
"outputs": [],
"source": [
"# Install missing packages in Colab quietly (no progress notifications)\n",
"\n",
"!pip install --quiet s3fs"
]
},
{
"cell_type": "code",
"source": [
"# Import modules and packages\n",
"\n",
"from pathlib import Path\n",
"\n",
"import datetime\n",
"from datetime import date\n",
"\n",
"import s3fs\n",
"\n",
"import requests\n",
"\n",
"import pandas as pd\n",
"\n",
"import h5py\n",
"\n",
"import xarray as xr\n",
"\n",
"import numpy as np"
],
"metadata": {
"id": "cfhllG-zyU0v"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## **Section 1: Select VIIRS Granule Files on the JPSS NODD on AWS for a Specific Region of Interest**"
],
"metadata": {
"id": "Uwbi0UaNm_8J"
}
},
{
"cell_type": "markdown",
"source": [
"### **NOAA Open Data Dissemination (NODD) Cloud Data Archives**\n",
"The [NODD Program](https://www.noaa.gov/information-technology/open-data-dissemination) provides public access to NOAA's open data via commercial cloud platforms.\n",
"\n",
"The Python `s3fs` package makes it very easy to search for, download, and remotely access data files from the NODD on Amazon Web Services (AWS). Other commonly-used options to access AWS include [AWS S3 commands](https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html) and the open-access [s3cmd tool](https://s3tools.org/s3cmd).\n",
"\n",
"**You do not need an AWS cloud computing account to access NOAA data!** Think of the NODD on AWS as a data archive that just happens to be in the cloud instead of hosted on a physical server."
],
"metadata": {
"id": "VvjEmgss8anY"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "d57b9190"
},
"source": [
"### **Connect to AWS S3 (Simple Storage Service)**\n",
"\n",
"Files on AWS are stored in directories called Simple Storage Service (S3) buckets. The `s3fs` package allows users to access AWS S3 buckets as if they are file system (`fs`) directories.\n",
"\n",
"The NODD S3 buckets are publicly available & read-only, so we use an anonymous connection (```annon=True```)."
]
},
{
"cell_type": "code",
"source": [
"# Connect to AWS S3 anonymously\n",
"\n",
"fs = s3fs.S3FileSystem(anon=True)"
],
"metadata": {
"id": "es5fbF2205EE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "324f7068"
},
"source": [
"### **The JPSS NODD S3 buckets**\n",
"\n",
"The JPSS NODD on AWS contains data from NOAA's polar-orbiting satellites. Data files can also be searched/downloaded manually via the [web interface](https://registry.opendata.aws/noaa-jpss/).\n",
"\n",
"There are separate S3 buckets for each of the three JPSS satellites:\n",
"- SNPP: `noaa-nesdis-snpp-pds` [SNPP web interface](https://noaa-nesdis-snpp-pds.s3.amazonaws.com/index.html)\n",
"- NOAA-20: `noaa-nesdis-n20-pds` [NOAA-20 web interface](https://noaa-nesdis-n20-pds.s3.amazonaws.com/index.html)\n",
"- NOAA-21: `noaa-nesdis-n21-pds` [NOAA-21 web interface](https://noaa-nesdis-n21-pds.s3.amazonaws.com/index.html)"
]
},
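{
"cell_type": "markdown",
"source": [
"As a quick check of these bucket names, we can list the product directories at the top level of a bucket with the `fs.ls()` [function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.ls). This is a minimal sketch, assuming the anonymous `fs` connection created above; the product list is long, so we print only the first 10 entries."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# List the first 10 product directories in the NOAA-20 S3 bucket\n",
"\n",
"products = fs.ls('noaa-nesdis-n20-pds')\n",
"\n",
"print('Total number of products:', len(products), '\\n')\n",
"\n",
"for product_path in products[:10]:\n",
"    print(product_path.split('/')[-1])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},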
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-20T21:04:01.704184Z",
"start_time": "2023-06-20T21:04:01.688225Z"
},
"id": "d656e9c8"
},
"source": [
"### **Finding VIIRS files in the JPSS S3 buckets**\n",
"\n",
"JPSS data files are organized on AWS as follows:\n",
"- Satellite (S3 bucket name)\n",
"- Product\n",
"- Year\n",
"- Month\n",
"- Day\n",
"- Filename\n",
"\n",
"To find a specific data file, set the full S3 bucket directory path for the satellite, product, year, month, day, and filename; for example:\n",
"\n",
"`noaa-nesdis-n20-pds/VIIRS-I4-SDR/2024/03/14/SVI04_j01_d20240314_t0636388_e0638033_b32743_c20240314070542288000_oeac_ops.h5`"
]
},
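{
"cell_type": "markdown",
"source": [
"As a quick illustration of this layout, the sketch below uses the `fs.exists()` function to confirm that the example file path above is present in the NOAA-20 bucket (again assuming the anonymous `fs` connection created earlier)."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: confirm the example SDR file path exists in the NOAA-20 bucket\n",
"\n",
"example_path = ('noaa-nesdis-n20-pds/VIIRS-I4-SDR/2024/03/14/'\n",
"                'SVI04_j01_d20240314_t0636388_e0638033_b32743_'\n",
"                'c20240314070542288000_oeac_ops.h5')\n",
"\n",
"print(fs.exists(example_path))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},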
{
"cell_type": "markdown",
"source": [
"###**Section 1.1: Working with VIIRS SDR Files**"
],
"metadata": {
"id": "a24HnWhxPH_8"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "e15a0b14"
},
"source": [
"#### **Navigate to the NOAA-20 VIIRS SDR I04 file directory for March 14, 2024**\n",
"\n",
"Our case study event is March 14, 2024. We want to select the NOAA-20 VIIRS SDR I04 granule files for this date that cover the ASEAN region of interest (ROI). For reference, the mapped locations and observation times of VIIRS granules are viewable on the [JSTAR Mapper website](https://www.star.nesdis.noaa.gov/mapper) under the `Non-product layers` menu.\n",
"\n",
"The first step is to navigate to the March 14, 2024 directory on the NOAA-20 S3 bucket. To do that, let's define a `data_path` variable to set the directory path for NOAA-20 VIIRS SDR I04 granule files for March 14, 2024, and then use the `fs.ls()` [function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.ls) to list the full path names for the individual data files.\n",
"\n",
"The JPSS satellites have global coverage, and the daily directories on the NODD contain all of the VIIRS files generated each day. For VIIRS SDR granules, that's **>1000 files per day**! Because there are so many files, let's print the total number of files in the March 14, 2024 directory and only the first 10 file names.\n",
"\n",
"For clarity, I prefer to print only the VIIRS file names instead of the full directory paths. We can also print the full path for the first file to see what it looks like, for comparison.\n",
"\n",
"---\n",
"---\n",
"\n",
"**About the code:** The `fs.ls()` [function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.ls) takes the source directory path argument `data_path` as a string, which is why the `month` and `day` variables, entered as integers, are converted to strings. The Python `str.zfill(width)` [method](https://docs.python.org/3/library/stdtypes.html#str.zfill) ensures the `month` and `day` strings in the `data_path` are 2 digits; `str.zfill(width)` returns a copy of the string left-filled with ASCII '0' digits to make a string of length `width`. This way, the `data_path` syntax is correct for `month` and `day` variable integers <10."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dc3713ea"
},
"outputs": [],
"source": [
"# Find all the NOAA-20 VIIRS SDR I04 granule files for March 14, 2024\n",
"# Print total number of files in directory & first 10 file names\n",
"\n",
"bucket = 'noaa-nesdis-n20-pds'\n",
"product = 'VIIRS-I4-SDR'\n",
"year = 2024\n",
"month = 3\n",
"day = 14\n",
"\n",
"data_path = (bucket\n",
" + '/'\n",
" + product\n",
" + '/'\n",
" + str(year)\n",
" + '/'\n",
" + str(month).zfill(2)\n",
" + '/'\n",
" + str(day).zfill(2))\n",
"\n",
"viirs_sdr_files = fs.ls(data_path)\n",
"\n",
"print('Total number of files:', len(viirs_sdr_files), '\\n')\n",
"\n",
"for file in viirs_sdr_files[:10]:\n",
" print(file.split('/')[-1])"
]
},
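{
"cell_type": "markdown",
"source": [
"An equivalent, more compact way to build the same `data_path` is a single f-string: the `:02d` format specifier zero-pads the `month` and `day` integers to 2 digits, doing the same job as `str.zfill(2)`."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Equivalent data_path built with an f-string (zero-padded month & day)\n",
"\n",
"data_path_alt = f'{bucket}/{product}/{year}/{month:02d}/{day:02d}'\n",
"\n",
"print(data_path_alt)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},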
{
"cell_type": "code",
"source": [
"# Print the full directory path for the first data file\n",
"\n",
"viirs_sdr_files[0]"
],
"metadata": {
"id": "My9ZtoLGH-Ey"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "6a77cf1d"
},
"source": [
"#### **Function to find the start/end times for JPSS satellite daytime overpass(es) of a geographic domain**\n",
"\n",
"Often users want to select only a subset of the full day's 500-1000+ VIIRS granule files, corresponding to a specific region such as the ASEAN ROI. The JPSS S3 buckets do **not** include a tool to subset the global files for a given date, however.\n",
"\n",
"To solve this problem, I wrote the Python function below, which uses the [Boxtimes tool](https://sips.ssec.wisc.edu/orbnav#/tools/boxtimes) from the [UW SSEC OrbNav API](https://sips.ssec.wisc.edu/orbnav#/) to find, for a given date and JPSS satellite, the start/end times (in UTC) for the satellite's daytime overpass(es) of a geographic domain, set by a latitude/longitude bounding box. We will use these overpass start/end times to subset the global VIIRS files for a given day on the S3 bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8tus_akvn3Ob"
},
"outputs": [],
"source": [
"def find_jpss_overpass_times_day(observation_date, satellite_name, w_lon, n_lat,\n",
" e_lon, s_lat):\n",
" \"\"\"\n",
" For a given date and JPSS satellite, finds the start and end times in UTC\n",
" for the satellite's daytime (ascending orbit) overpass(es) of a specified\n",
" geographic domain. Uses the UWisc OrbNav API.\n",
"\n",
" For best results, make the search domain a little larger than the actual\n",
" region of interest, i.e., by ~5 degrees in each direction.\n",
"\n",
" ***********\n",
" Positional Input Parameters:\n",
"\n",
" observation_date: type=str; required. Observation date in 'YYYYMMDD' format.\n",
"\n",
" satellite_name: type=str; required. JPSS satellite name (case sensitive);\n",
" one of: 'SNPP', 'NOAA20', 'NOAA21'.\n",
"\n",
" n_lat: type=int; required. Northern latitude boundary of geographic domain.\n",
" Use negative values for °S latitude.\n",
"\n",
" s_lat: type=int; required. Southern latitude boundary of geographic domain.\n",
" Use negative values for °S latitude.\n",
"\n",
" w_lon: type=int; required. Western longitude boundary of geographic domain.\n",
" Use negative values for °W longitude.\n",
"\n",
" e_lon: type=int; required. Eastern longitude boundary of geographic domain.\n",
" Use negative values for °W longitude.\n",
"\n",
" ************\n",
" Returns:\n",
"\n",
" start_times: type=list; Start time(s) of daytime overpass(es) in 'HHMM' format.\n",
" end_times: type=list; End time(s) of daytime overpass(es) in 'HHMM' format.\n",
"\n",
" ************\n",
" Example:\n",
"\n",
" >>> find_jpss_overpass_times_day('20240314', 'NOAA20', 85, 35, 135, -20)\n",
" (['0450', '0632'], ['0506', '0647'])\n",
" \"\"\"\n",
"\n",
" # Convert entered observation_date to format needed by UWisc OrbNav API\n",
" api_date = date.isoformat(datetime.datetime.strptime(observation_date,\n",
" '%Y%m%d'))\n",
"\n",
" # Set JPSS satellite URL for UWisc OrbNav API\n",
" sat_number_dict = {'SNPP':'37849', 'NOAA20':'43013', 'NOAA21':'54234'}\n",
" sat_number = sat_number_dict.get(satellite_name)\n",
" # Break long url string using f-string formatting\n",
" url = (f'http://sips.ssec.wisc.edu/orbnav/api/v1/boxtimes.json?'\n",
" f'start={api_date}T00:00:00Z&sat={sat_number}&end={api_date}T23:59:59Z'\n",
" f'&ur={n_lat},{e_lon}&ll={s_lat},{w_lon}')\n",
"\n",
" # Use requests library to get json response from UWisc OrbNav API\n",
" response = requests.get(url)\n",
" data = response.json()\n",
"\n",
" # Convert json response values from \"data\" key into a dataframe\n",
" # \"enter\" & \"leave\": times when satellite enters/leaves domain bounding box\n",
" df = pd.DataFrame(data['data'], columns=['enter', 'leave'])\n",
"\n",
" # Make two new dataframes, for \"enter\" and \"leave\" column lists\n",
" # Read in all the values in the lists as separate columns in new dataframes\n",
" df_enter = pd.DataFrame(df['enter'].to_list(), columns = ['enter_datetime',\n",
" 'enter_lat',\n",
" 'enter_lon',\n",
" 'enter_sea',\n",
" 'enter_orbit'])\n",
" df_leave = pd.DataFrame(df['leave'].to_list(), columns = ['leave_datetime',\n",
" 'leave_lat',\n",
" 'leave_lon',\n",
" 'leave_sea',\n",
" 'leave_orbit'])\n",
"\n",
" # Combine \"enter\" & \"leave\" dataframes into new dataframe; drop extra columns\n",
" combined = (pd.concat([df_enter, df_leave], axis=1, join='outer')\n",
" .drop(columns=['enter_lat', 'enter_lon', 'enter_sea', 'leave_lat',\n",
" 'leave_lon', 'leave_sea'], axis=1))\n",
"\n",
" # Drop rows with descending orbits (nighttime orbits)\n",
" combined.drop(combined[(combined['leave_orbit'] == 'D') |\n",
" (combined['enter_orbit'] == 'D')].index, inplace=True)\n",
"\n",
" # Export the \"enter_datetime\" & \"leave_datetime\" columns to lists\n",
" enter_list = combined['enter_datetime'].tolist()\n",
" leave_list = combined['leave_datetime'].tolist()\n",
"\n",
" # Remove the colon from the list of enter/leave times (strings)\n",
" # Need 'HHMM' format for use with satellite data file names\n",
" start_times = [time[11:16].replace(':','') for time in enter_list]\n",
" end_times = [time[11:16].replace(':','') for time in leave_list]\n",
"\n",
" return start_times, end_times"
]
},
{
"cell_type": "markdown",
"source": [
"#### **Example: Find the start/end times for NOAA-20's daytime overpasses of the ASEAN ROI on March 14, 2024**"
],
"metadata": {
"id": "G5R7LMvF4awl"
}
},
{
"cell_type": "code",
"source": [
"start_times, end_times = find_jpss_overpass_times_day('20240314', 'NOAA20', 85, 35, 135, -20)"
],
"metadata": {
"id": "Dlfqv69g33OX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"start_times, end_times"
],
"metadata": {
"id": "z96-zB_P5dz3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### **Use the NOAA-20 start/end daytime overpass times for the ASEAN ROI on March 14, 2024 to select the corresponding VIIRS SDR I04 files on the AWS NODD**\n",
"\n",
"All VIIRS file names contain information about the observation time of the data. For VIIRS SDR file names, the `tHHMMSSS` segment contains the observation start time.\n",
"\n",
"Using Python [slicing](https://stackoverflow.com/questions/509211/how-slicing-in-python-works) and [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions), we can select VIIRS granule file paths on the NODD S3 bucket by matching the observation start time from the file name with the time range of the JPSS satellite overpass `start_times` and `end_times` found using the `find_jpss_overpass_times_day()` function."
],
"metadata": {
"id": "Y8f1xho09bfo"
}
},
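{
"cell_type": "markdown",
"source": [
"To see how the slicing works before applying it, the sketch below extracts the 'HHMM' observation start time from the example SDR file name shown earlier: `split('_')[3]` isolates the `tHHMMSSS` segment, and `[1:5]` keeps the 'HHMM' portion."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: extract 'HHMM' observation start time from an SDR file name\n",
"\n",
"file_name = 'SVI04_j01_d20240314_t0636388_e0638033_b32743_c20240314070542288000_oeac_ops.h5'\n",
"\n",
"time_segment = file_name.split('_')[3]  # 'tHHMMSSS' segment: 't0636388'\n",
"print(time_segment[1:5])  # 'HHMM' observation start time: '0636'"
],
"metadata": {},
"execution_count": null,
"outputs": []
},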
{
"cell_type": "code",
"source": [
"# Select NOAA-20 VIIRS SDR I04 daytime files for the ASEAN ROI on March 14, 2024\n",
"# Use slicing to to extract observation time from VIIRS SDR file name\n",
"\n",
"for start_time, end_time in zip(start_times, end_times):\n",
" subset_sdr_files = [file for start_time, end_time in zip(start_times, end_times)\n",
" for file in viirs_sdr_files\n",
" if (file.split('/')[-1].split('_')[3][1:5] >= start_time\n",
" and file.split('/')[-1].split('_')[3][1:5] <= end_time)]\n",
"\n",
"# Print the VIIRS file names from the S3 file directory paths\n",
"for file in subset_sdr_files:\n",
" print(file.split('/')[-1])"
],
"metadata": {
"id": "Tkn2chc133LY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"###**Section 1.2: Working with VIIRS Cloud Mask Files**"
],
"metadata": {
"id": "JBGS2vBdRT92"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "0TL7C4AERT93"
},
"source": [
"#### **Navigate to the NOAA-20 VIIRS Cloud Mask file directory for March 14, 2024**\n",
"\n",
"The process to find any type of files on the JPSS NODD is the same as demonstrated for the VIIRS SDR I04 files in Section 1.1. Let's look at another example, for the VIIRS Cloud Mask granule files.\n",
"\n",
"The first step is to navigate to the March 14, 2024 directory on the NOAA-20 S3 bucket. The only difference from Section 1.1 is the `product_name` in the S3 bucket has changed to `VIIRS-JRR-CloudMask`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hpiIOap4RT93"
},
"outputs": [],
"source": [
"# Find all the NOAA-20 VIIRS Cloud Mask granule files for March 14, 2024\n",
"# Print total number of files in directory & first 10 file names\n",
"\n",
"bucket = 'noaa-nesdis-n20-pds'\n",
"product = 'VIIRS-JRR-CloudMask'\n",
"year = 2024\n",
"month = 3\n",
"day = 14\n",
"\n",
"data_path = (bucket\n",
" + '/'\n",
" + product\n",
" + '/'\n",
" + str(year)\n",
" + '/'\n",
" + str(month).zfill(2)\n",
" + '/'\n",
" + str(day).zfill(2))\n",
"\n",
"viirs_cm_files = fs.ls(data_path)\n",
"\n",
"print('Total number of files:', len(viirs_cm_files), '\\n')\n",
"\n",
"for file in viirs_cm_files[:10]:\n",
" print(file.split('/')[-1])"
]
},
{
"cell_type": "code",
"source": [
"# Print the full directory path for the first data file\n",
"\n",
"viirs_cm_files[0]"
],
"metadata": {
"id": "vzxIDASLRT94"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### **Use the NOAA-20 start/end daytime overpass times for the ASEAN ROI on March 14, 2024 to select the corresponding VIIRS Cloud Mask files on the AWS NODD**\n",
"\n",
"We can use the exact same output from the `find_jpss_overpass_times_day()` function to select the VIIRS Cloud Mask file paths on the JPSS NODD. The only difference is how we slice the VIIRS file name; for VIIRS Cloud Mask file names, the `sYYYYMMDDHHMMSSS` segment contains the observation start time.\n",
"\n"
],
"metadata": {
"id": "aS6oLxHqVIm_"
}
},
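{
"cell_type": "markdown",
"source": [
"The sketch below checks this slicing on a hypothetical file name that follows the VIIRS Cloud Mask naming convention: the `sYYYYMMDDHHMMSSS` segment is the fourth underscore-separated field, so `[9:13]` skips the 's' and the 'YYYYMMDD' date to keep the 'HHMM' portion."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: extract 'HHMM' start time from a hypothetical Cloud Mask file name\n",
"\n",
"file_name = 'JRR-CloudMask_v3r2_j01_s202403140636388_e202403140638033_c202403140728130.nc'\n",
"\n",
"time_segment = file_name.split('_')[3]  # 'sYYYYMMDDHHMMSSS' segment\n",
"print(time_segment[9:13])  # 'HHMM' observation start time: '0636'"
],
"metadata": {},
"execution_count": null,
"outputs": []
},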
{
"cell_type": "code",
"source": [
"# Select NOAA-20 VIIRS Cloud Mask daytime files for the ASEAN ROI on March 14, 2024\n",
"# Use slicing to to extract observation time from VIIRS Cloud Mask file name\n",
"\n",
"for start_time, end_time in zip(start_times, end_times):\n",
" subset_cm_files = [file for start_time, end_time in zip(start_times, end_times)\n",
" for file in viirs_cm_files\n",
" if (file.split('/')[-1].split('_')[3][9:13] >= start_time\n",
" and file.split('/')[-1].split('_')[3][9:13] <= end_time)]\n",
"\n",
"# Print the VIIRS file names from the S3 file directory paths\n",
"for file in subset_cm_files:\n",
" print(file.split('/')[-1])"
],
"metadata": {
"id": "PaIyAI0MVInA"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## **Section 2: Open a VIIRS Granule File on the JPSS NODD on AWS & Read Variable Data Without Downloading the File**"
],
"metadata": {
"id": "9WZ3T4ri_Mdx"
}
},
{
"cell_type": "markdown",
"source": [
"Any file in an AWS S3 bucket can be read remotely - without first downloading the file - using the `fs.open` [function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.open). This process is also called **streaming** the file.\n",
"\n",
"A streamed satellite file can be treated just like a downloaded file and opened using standard Python packages, such as [xarray](https://docs.xarray.dev/en/stable/index.html) and [h5py](https://docs.h5py.org/en/stable/index.html)."
],
"metadata": {
"id": "__-7dKWYcwIa"
}
},
{
"cell_type": "markdown",
"source": [
"###**Section 2.1: Working with VIIRS SDR Files (HDF)**"
],
"metadata": {
"id": "wZ8UZ9MDW6Af"
}
},
{
"cell_type": "markdown",
"source": [
"VIIRS SDR files, such as `I04` and `GIMGO`, are in HDF format (`.h5`). The `h5py` package must be used to open streamed HDF files.\n",
"\n",
"The `h5py` package treats HDF file `groups` like Python dictionaries and `datasets` (variables) like `NumPy` arrays.\n",
"\n",
"The examples below demonstrate how to open a streamed HDF file using `h5py` and view metadata, group names, and datasets."
],
"metadata": {
"id": "or-54moRLZBX"
}
},
{
"cell_type": "code",
"source": [
"# Open a VIIRS SDR I04 file remotely using s3fs & h5py\n",
"\n",
"remote_sdr_file = subset_sdr_files[0]\n",
"f = h5py.File(fs.open(remote_sdr_file, 'rb'))"
],
"metadata": {
"id": "QxolI4Qt33IP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Print the global file metadata\n",
"\n",
"for key, value in f.attrs.items():\n",
" print(f'{key}: {value[0,0].decode()}')"
],
"metadata": {
"id": "6EfVirzuKRp7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# List the groups in the SDR I04 file\n",
"\n",
"for key in f.keys():\n",
" print(key)"
],
"metadata": {
"id": "TfgoL1a5_PCB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# List the variables in the \"All_Data/VIIRS-I4-SDR_All\" group\n",
"\n",
"for variable in f['All_Data/VIIRS-I4-SDR_All']:\n",
" print(variable)"
],
"metadata": {
"id": "9Ls7YcpoKmFK"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Read \"Radiance\" and \"RadianceFactors\" variables into numpy arrays\n",
"\n",
"radiance = f['All_Data/VIIRS-I4-SDR_All//Radiance'][()]\n",
"radiance_factors = f['All_Data/VIIRS-I4-SDR_All//RadianceFactors'][()]"
],
"metadata": {
"id": "dK8Casf5LhyB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Print a snippet of the array\n",
"\n",
"radiance"
],
"metadata": {
"id": "42jWfSqWMUKa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Mask fill values (65528-65535)\n",
"\n",
"radiance = np.ma.masked_where(radiance >= 65528, radiance)"
],
"metadata": {
"id": "S7WKt4WpL-Uu"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Print a snippet of the array\n",
"\n",
"radiance"
],
"metadata": {
"id": "OS6O8mCNMaFt"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Convert corrected radiance dtype from int to float\n",
"\n",
"radiance = radiance*radiance_factors[0] + radiance_factors[1]"
],
"metadata": {
"id": "zcOuVk4wMc_Q"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Print a snippet of the array\n",
"\n",
"radiance"
],
"metadata": {
"id": "fn5ZmgF3Muaw"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Check the range of radiance values\n",
"\n",
"print(np.nanmax(radiance), np.nanmin(radiance))"
],
"metadata": {
"id": "exJbP67eMwvn"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Close the remotely opened file\n",
"\n",
"f.close()"
],
"metadata": {
"id": "yYSG_7zAM0JW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"In practice, use the Python [with](https://docs.python.org/3/reference/compound_stmts.html#with) statement when opening files, in order to automatically close the file when you're done using it. The block below demonstrates how to use the `with` statement to open an HDF file remotely using the `s3fs` and `h5py` packages."
],
"metadata": {
"id": "8YQ08RXKjNe4"
}
},
{
"cell_type": "code",
"source": [
"# Open file remotely using s3fs & h5py (automatically closes file when done)\n",
"with h5py.File(fs.open(remote_sdr_file, 'rb'))as f:\n",
" # <Put your code to use the file contents here>\n",
"\n",
" # Example: print the global attributes of the file (to prove you opened it!)\n",
" for key, value in f.attrs.items():\n",
" print(f'{key}: {value[0,0].decode()}')"
],
"metadata": {
"id": "19j2LWnVNZoO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"###**Section 2.2: Working with VIIRS Cloud Mask Files (netCDF4)**"
],
"metadata": {
"id": "0aO0vMxcXANJ"
}
},
{
"cell_type": "markdown",
"source": [
"Most VIIRS EDR files, such as `Cloud Mask`, are in netCDF4 format (`.nc`). The `xarray` package can be used to open streamed netCDF4 files.\n",
"\n",
"The `xarray` package has much more functionality than `h5py`, and is commonly used to work with satellite data files.\n",
"\n",
"The examples below demonstrate how to open a streamed netCDF4 file with `xarray` and view the contents and metadata."
],
"metadata": {
"id": "V-aHEwuikd9z"
}
},
{
"cell_type": "code",
"source": [
"# Open a VIIRS Cloud Mask file remotely using s3sf & xarray\n",
"\n",
"remote_cm_file = subset_cm_files[0]\n",
"ds = xr.open_dataset(fs.open(remote_cm_file, mode='rb'), engine='h5netcdf')"
],
"metadata": {
"id": "u8XhNtBGXC7F"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# View the xarray dataset contents & metadata\n",
"\n",
"ds"
],
"metadata": {
"id": "0s2LMArLZEdj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Check the data type of the \"Cloud Mask\" data array\n",
"\n",
"ds.CloudMask.encoding['dtype']"
],
"metadata": {
"id": "Gheipw8mZQZY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Xarray automatically loads data arrays as floats\n",
"# Convert the \"Cloud Mask\" data array to a numpy array with correct dtype\n",
"\n",
"cloud_mask = ds.CloudMask.to_masked_array().astype('int8')"
],
"metadata": {
"id": "I3M90KSsZYnI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Print a snippet of the array\n",
"\n",
"cloud_mask"
],
"metadata": {
"id": "gDGs_u0MZghj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Close the remote file\n",
"\n",
"ds.close()"
],
"metadata": {
"id": "XzE1sYAGbWM5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"In practice, use the Python [with](https://docs.python.org/3/reference/compound_stmts.html#with) statement when opening files, in order to automatically close the file when you're done using it. The block below demonstrates how to use the `with` statement to open a netCDF4 file remotely using the `s3fs` and `xarray` packages."
],
"metadata": {
"id": "4ARHdQpYmasu"
}
},
{
"cell_type": "code",
"source": [
"# Open file remotely using s3fs & xarray (automatically closes file when done)\n",
"with xr.open_dataset(fs.open(remote_cm_file, mode='rb'), engine='h5netcdf') as ds:\n",
" # <Put your code to use the file contents here>\n",
"\n",
" # Example: print the attributes of the file (to prove you opened it!)\n",
" for key, value in ds.attrs.items():\n",
" print(f'{key}: {value}')"
],
"metadata": {
"id": "5O-RAMEibjlf"
},
"execution_count": null,
"outputs": []
}
]
}