rainforest.database package¶
This module provides routines used for database management.
rainforest.database.database
: main database class, used as an entry-point to the database
rainforest.database.db_populate
: command-line script used to add data to database
rainforest.database.retrieve_radar_data
: functions used to add new radar data to database
rainforest.database.retrieve_reference_data
: functions used to add new Cartesian reference data to database
rainforest.database.retrieve_dwh_data R module
: R functions used to add new gauge data to database
rainforest.database.database module¶
Main class to update the RADAR/STATION database and run queries to retrieve specific data
Note that I use Spark because there is currently no way to run SQL queries with Dask
-
class
rainforest.database.database.
DataFrameWithInfo
(name, df)¶ Bases:
pyspark.sql.dataframe.DataFrame
-
property
info
¶
-
property
-
class
rainforest.database.database.
Database
(config_file=None)¶ Bases:
object
Creates a Database instance that can be used to load data, update new data, run queries, etc
- Parameters
config_file (str (optional)) – Path of the configuration file that you want to use, can also be provided later and is needed only if you want to update the database with new data
-
add_tables
(filepaths_dic, get_summaries=False)¶ Reads a set of data contained in a folder as a Spark DataFrame and adds them to the database instance
- Parameters
filepaths_dic (dict) –
Dictionary where the keys are the names of the dataframes to add and the values are the wildcard patterns pointing to the files. For example, {'gauge': '/mainfolder/gauge/*.csv',
'radar': '/mainfolder/radar/*.csv', 'reference': '/mainfolder/reference/*.parquet'}
will add the three tables 'gauge', 'radar' and 'reference' to the database
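To illustrate how such a dictionary of wildcard patterns resolves to files, here is a minimal sketch using only the standard library. The helper expand_patterns and the temporary folder layout are hypothetical and are not part of rainforest; the actual loading into Spark DataFrames is not shown.

```python
import glob
import os
import tempfile


def expand_patterns(filepaths_dic):
    """Map every table name to the sorted list of files matching its pattern.

    Hypothetical helper, for illustration only.
    """
    return {name: sorted(glob.glob(pattern))
            for name, pattern in filepaths_dic.items()}


# Build a throwaway folder layout so that the sketch runs anywhere
main_folder = tempfile.mkdtemp()
for table, filename in [("gauge", "a.csv"), ("gauge", "b.csv"),
                        ("radar", "a.csv"), ("reference", "a.parquet")]:
    os.makedirs(os.path.join(main_folder, table), exist_ok=True)
    open(os.path.join(main_folder, table, filename), "w").close()

filepaths_dic = {
    "gauge": os.path.join(main_folder, "gauge", "*.csv"),
    "radar": os.path.join(main_folder, "radar", "*.csv"),
    "reference": os.path.join(main_folder, "reference", "*.parquet"),
}
matched = expand_patterns(filepaths_dic)
```

Each key of matched corresponds to one table name, and each value lists the files that the pattern for that table selects.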
-
property
config_file
¶
-
query
(sql_query, to_memory=True, output_file='')¶ Performs an SQL query on the database, returns the result and, if requested, writes it to a file
- Parameters
sql_query (str) – Valid SQL query. All tables referred to in the query must be included in the tables attribute of the database (i.e. they must first be added with the add_tables command)
to_memory (bool (optional)) – If true, will try to put the result into RAM in the form of a pandas DataFrame. If the predicted size of the query is larger than the parameter WARNING_RAM in common.constants, this will be ignored
output_file (str (optional)) – Full path of an output file where the query will be dumped into. Must end with .csv, .gz.csv or .parquet; the extension determines the output format
- Returns
If the result fits in memory, it returns a pandas DataFrame, otherwise
a cached Spark DataFrame
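The rule that the extension of output_file determines the output format can be sketched as follows. The function output_format and the labels it returns are hypothetical; how the Database class actually performs this check is not documented here.

```python
def output_format(output_file):
    """Return a format label based on the suffix of output_file.

    Hypothetical helper, for illustration only.
    """
    # '.gz.csv' must be tested before '.csv', since both end in '.csv'
    if output_file.endswith(".gz.csv"):
        return "csv_gzip"
    if output_file.endswith(".csv"):
        return "csv"
    if output_file.endswith(".parquet"):
        return "parquet"
    raise ValueError("output_file must end with .csv, .gz.csv or .parquet")


formats = [output_format(path) for path in
           ("/tmp/result.csv", "/tmp/result.gz.csv", "/tmp/result.parquet")]
```

Any other suffix raises a ValueError in this sketch, which is one possible way to reject unsupported output files early.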
-
update_radar_data
(gauge_table_name, output_folder, t0=None, t1=None)¶ Updates the radar table using timesteps from the gauge table
- Inputs:
- gauge_table_name: str
name of the gauge table, must be included in the tables of the database, i.e. you must first add it with add_tables(..)
- output_folder: str
directory where to store the computed radar tables
- t0: start time in YYYYMMDD(HHMM) (optional)
starting time of the retrieval, by default all timesteps that are in the gauge table will be used
- t1: end time in YYYYMMDD(HHMM) (optional)
ending time of the retrieval, by default all timesteps that are in the gauge table will be used
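The YYYYMMDD(HHMM) convention used for t0 and t1 can be parsed as in the sketch below. The helper parse_time is hypothetical, and treating a missing HHMM part as midnight is an assumption, not something stated in this documentation.

```python
from datetime import datetime


def parse_time(value):
    """Parse 'YYYYMMDD' or 'YYYYMMDDHHMM' into a datetime.

    Hypothetical helper; a missing HHMM part is assumed to mean 00:00.
    """
    formats = {8: "%Y%m%d", 12: "%Y%m%d%H%M"}
    if len(value) not in formats:
        raise ValueError("expected YYYYMMDD or YYYYMMDDHHMM, got %r" % value)
    return datetime.strptime(value, formats[len(value)])


t0 = parse_time("20200101")
t1 = parse_time("202001011230")
```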
-
update_reference_data
(gauge_table_name, output_folder, t0=None, t1=None)¶ Updates the reference product table using timesteps from the gauge table
- Inputs:
- gauge_table_name: str
name of the gauge table, must be included in the tables of the database, i.e. you must first add it with add_tables(..)
- output_folder: str
directory where to store the computed reference tables
- t0: start time in YYYYMMDD(HHMM) (optional)
starting time of the retrieval, by default all timesteps that are in the gauge table will be used
- t1: end time in YYYYMMDD(HHMM) (optional)
ending time of the retrieval, by default all timesteps that are in the gauge table will be used
-
update_station_data
(t0, t1, output_folder)¶ Populates the csv files that contain the point measurement data and that serve as base to update the database. A different file will be created for every station. If the file is already present, the new data will be appended to the file.
- Inputs:
- t0: start time in YYYYMMDD(HHMM) format, (HHMM) is optional
- t1: end time in YYYYMMDD(HHMM) format, (HHMM) is optional
- output_folder: where the files should be stored. If the directory is not empty, the new data will be merged with existing files if relevant
-
class
rainforest.database.database.
TableDict
¶ Bases:
dict
This is an extension of the classic python dict that automatically calls createOrReplaceTempView once a table has been added to the dict
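The idea behind TableDict can be sketched with a plain dict subclass. TableDictSketch and FakeTable below are hypothetical stand-ins written for this example; only the method name createOrReplaceTempView comes from the Spark DataFrame API, and the real class may differ in its details.

```python
class FakeTable:
    """Stand-in for a Spark DataFrame; it only records the view name."""

    def __init__(self):
        self.view_name = None

    def createOrReplaceTempView(self, name):
        self.view_name = name


class TableDictSketch(dict):
    """dict that registers every value as a temporary view on assignment."""

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        # Make the table reachable under its key in later SQL queries
        value.createOrReplaceTempView(key)


tables = TableDictSketch()
gauge = FakeTable()
tables["gauge"] = gauge
```

In this sketch only item assignment triggers the registration; dict.update() and the dict constructor bypass an overridden __setitem__ in CPython, so they would need to be overridden as well.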
rainforest.database.db_populate module¶
Command line script to add new data to the database
see Database command-line tool
-
rainforest.database.db_populate.
main
()¶
rainforest.database.retrieve_radar_data module¶
Main routine for retrieving radar data. This is meant to be run as a command-line command from a slurm script,
i.e. ./retrieve_radar_data -t <task_file_name> -c <config_file_name> -o <output_folder>
IMPORTANT: this function is called by the main routine in database.py so you should never have to call it manually
-
class
rainforest.database.retrieve_radar_data.
Updater
(task_file, config_file, output_folder)¶ Bases:
object
Creates an Updater class instance that allows new radar data to be added to the database
- Parameters
task_file (str) – The full path to a task file, i.e. a file with the following format: timestamp, station1, station2, station3…stationN. These files are generated by the database.py module, so normally you shouldn’t have to create them yourself
config_file (str) – The full path of a configuration file written in yaml format that indicates how the radar retrieval must be done
output_folder (str) – The full path where the generated files will be stored
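A task file of the form timestamp, station1, station2, … can be read as in the sketch below. The helper parse_task_file, the one-task-per-line layout, the station names and the timestamp values are all assumptions made for illustration; the timestamp is kept as an opaque string because its exact format is not specified here.

```python
def parse_task_file(lines):
    """Map each timestamp to the list of stations on the same line.

    Hypothetical helper; assumes one comma-separated task per line.
    """
    tasks = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        timestamp, *stations = [field.strip() for field in line.split(",")]
        tasks[timestamp] = stations
    return tasks


# Placeholder content, not taken from a real task file
example_lines = ["202001010000, STA1, STA2, STA3", "", "202001010010, STA2"]
tasks = parse_task_file(example_lines)
```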
-
get_agg_operators
()¶ Returns all aggregation operator codes needed to aggregate all columns to 10 min resolution: 0 = mean, 1 = log mean
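The two operator codes could be applied as in the following sketch. The function aggregate is hypothetical, and the interpretation of "log mean" is an assumption: it is taken here to mean that values expressed in decibels are converted to linear units, averaged, and converted back. The actual definition used by rainforest may differ.

```python
import math


def aggregate(values, operator):
    """Aggregate the values of one 10 min period with the given operator code.

    Hypothetical helper. Code 1 is ASSUMED to be a mean of dB values
    computed in linear units.
    """
    if operator == 0:
        # Plain arithmetic mean
        return sum(values) / len(values)
    if operator == 1:
        # dB -> linear, arithmetic mean, linear -> dB
        linear = [10 ** (value / 10) for value in values]
        return 10 * math.log10(sum(linear) / len(linear))
    raise ValueError("unknown aggregation operator code: %r" % operator)


mean_value = aggregate([2.0, 4.0], 0)
log_mean_value = aggregate([10.0, 20.0], 1)
```

Under this assumption the log mean of 10 and 20 is larger than their arithmetic mean of 15, because the larger value dominates in linear units.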
-
process_all_timesteps
()¶ Processes all timesteps that are in the task file
-
process_single_timestep
(list_stations, radar_object, tidx)¶ Processes a single 5 min timestep for a set of stations
- Parameters
list_stations (list of str) – Names of all SMN or pluvio stations for which to retrieve the radar data
radar_object (Radar object instance as defined in common.radarprocessing) – a radar object which contains all radar variables in polar format
tidx (int) – indicates if a radar 5 min timestep is the first or the second in the corresponding 10 min gauge period, 1 = first, 2 = second
-
retrieve_radar_files
(radar, start_time, end_time, include_vpr=True, include_status=True)¶ Retrieves a set of radar files for a given time range
- Parameters
radar (char) – The name of the radar, i.e. one of 'A', 'D', 'L', 'P', 'W'
start_time (datetime.datetime instance) – starting time of the time range
end_time (datetime.datetime instance) – end time of the time range
include_vpr (bool (optional)) – Whether or not to also include VPR files
include_status (bool (optional)) – Whether or not to also include status files
rainforest.database.retrieve_reference_data module¶
Main routine for retrieving reference MeteoSwiss data (e.g. CPC, RZC, POH, etc.). This is meant to be run as a command-line command from a slurm script,
i.e. ./retrieve_reference_data -t <task_file_name> -c <config_file_name> -o <output_folder>
IMPORTANT: this function is called by the main routine in database.py, so you should never have to call it manually
Daniel Wolfensberger, LTE-MeteoSwiss, 2020
-
class
rainforest.database.retrieve_reference_data.
Updater
(task_file, config_file, output_folder)¶ Bases:
object
Creates an Updater class instance that allows new reference data to be added to the database
- Parameters
task_file (str) – The full path to a task file, i.e. a file with the following format: timestamp, station1, station2, station3…stationN. These files are generated by the database.py module, so normally you shouldn’t have to create them yourself
config_file (str) – The full path of a configuration file written in yaml format that indicates how the radar retrieval must be done
output_folder (str) – The full path where the generated files will be stored
-
process_all_timesteps
()¶ Processes all timesteps in the task file
-
retrieve_cart_files
(start_time, end_time, products)¶ Retrieves a set of reference product files for a given time range
- Parameters
start_time (datetime.datetime instance) – starting time of the time range
end_time (datetime.datetime instance) – end time of the time range
products (list of str) – list of all products to retrieve, must be valid MeteoSwiss product names, for example CPC, CPCH, RZC, MZC, BZC, etc
rainforest.database.retrieve_dwh_data R module¶
Main routine for retrieving station data. This is meant to be run as a command-line command from a slurm script,
i.e. Rscript retrieve_dwh_data.r <t0> <t1> <threshold> <stations> <variables> <output_folder> <missing_value> <overwrite>
IMPORTANT: this function is called by the main routine in database.py so you should never have to call it manually
retrieve_dwh_data.R [options]
Options:
t0 (str) - start time in YYYYMMDDHHMM format
t1 (str) - end time in YYYYMMDDHHMM format
threshold (float) - minimum value of hourly precipitation for the entire hour to be included in the database (i.e. all 6 10min timesteps)
variables (str) - list of variables to retrieve, using the DWH names, for example “tre200s0,prestas0,ure200s0,rre150z0,dkl010z0,fkl010z0”
output_folder (str) - directory where to store the csv files containing the retrieved data
missing_value (float) - value used to represent missing data
overwrite (bool) - whether or not to overwrite already existing data in the output_folder
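The effect of the threshold option can be illustrated with the Python sketch below, which mirrors the description above rather than the R implementation. The function filter_hours is hypothetical, and several points are assumptions: the hourly precipitation is taken as the sum of the 10 min values, timesteps are grouped by the clock hour of their timestamp, and an hour is kept when its sum is greater than or equal to the threshold.

```python
from collections import defaultdict
from datetime import datetime


def filter_hours(records, threshold):
    """Keep all 10 min records of every hour whose total reaches threshold.

    records is a list of (datetime, precipitation) tuples.
    Hypothetical helper; grouping and comparison rules are assumptions.
    """
    by_hour = defaultdict(list)
    for time, value in records:
        hour = time.replace(minute=0, second=0, microsecond=0)
        by_hour[hour].append((time, value))

    kept = []
    for hour in sorted(by_hour):
        hourly_total = sum(value for _, value in by_hour[hour])
        if hourly_total >= threshold:
            # The whole hour is kept, including its dry timesteps
            kept.extend(by_hour[hour])
    return kept


wet_hour = [0.0, 0.0, 0.1, 0.2, 0.0, 0.0]
records = [(datetime(2020, 1, 1, 12, 10 * i), value)
           for i, value in enumerate(wet_hour)]
records += [(datetime(2020, 1, 1, 13, 10 * i), 0.0) for i in range(6)]
kept = filter_hours(records, threshold=0.2)
```

In this example the six timesteps of the first hour are all kept, even the dry ones, while the second hour is dropped entirely.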