readabs.read_abs_cat
Download timeseries data from the Australian Bureau of Statistics.
Download timeseries data from the Australian Bureau of Statistics (ABS) for a specified ABS catalogue identifier.
"""Download *timeseries* data from the Australian Bureau of Statistics.

Download timeseries data from the Australian Bureau of Statistics (ABS)
for a specified ABS catalogue identifier.
"""

import calendar
from functools import cache
from typing import Any, Unpack

import pandas as pd
from pandas import DataFrame

from readabs.abs_meta_data import metacol
from readabs.grab_abs_url import grab_abs_url, grab_abs_zip
from readabs.read_support import HYPHEN, ReadArgs

# Constants
MAX_DATETIME_CHARS = 20
TABLE_DESC_ROW = 4
TABLE_DESC_COL = 1


# --- functions ---
# - public -
@cache  # minimise slowness for any repeat business
def read_abs_cat(
    cat: str,
    **kwargs: Unpack[ReadArgs],
) -> tuple[dict[str, DataFrame], DataFrame]:
    """For a specific catalogue identifier, return the complete ABS Catalogue information as DataFrames.

    This function returns the complete ABS Catalogue information as a
    python dictionary of pandas DataFrames, as well as the associated metadata
    in a separate DataFrame. The function automates the collection of zip and
    excel files from the ABS website. If necessary, these files are downloaded,
    and saved into a cache directory. The files are then parsed to extract time
    series data, and the associated metadata.

    By default, the cache directory is `./.readabs_cache/`. You can change the
    default directory name by setting the shell environment variable
    `READABS_CACHE_DIR` with the name of the preferred directory.

    Parameters
    ----------
    cat : str
        The ABS Catalogue Number for the data to be downloaded and made
        available by this function. This argument must be specified in the
        function call.

    **kwargs : Unpack[ReadArgs]
        The following parameters may be passed as optional keyword arguments.

    url : str = ""
        The URL of an ABS landing page. Use this for discontinued series
        that are no longer in the ABS Time Series Directory. If provided,
        data will be retrieved from this URL instead of looking up the
        catalogue number. Example:
        `read_abs_cat(cat="8501.0", url="https://www.abs.gov.au/.../jun-2025")`

    keep_non_ts : bool = False
        A flag for whether to keep the non-time-series tables
        that might form part of an ABS catalogue item. Normally, the
        non-time-series information is ignored, and not made available to
        the user.

    history : str = ""
        Provide a month-year string to extract historical ABS data.
        For example, you can set history="dec-2023" to get the ABS data
        for a catalogue identifier that was originally published in respect
        of Q4 of 2023. Note: not all ABS data sources are structured so that
        this technique works in every case; but most are.

    verbose : bool = False
        Setting this to true may help diagnose why something
        might be going wrong with the data retrieval process.

    ignore_errors : bool = False
        Normally, this function will cease downloading when
        an error is encountered. However, sometimes the ABS website has
        malformed links, and you may need to change this setting. (Note:
        if you drop a message to the ABS, they will usually fix broken
        links within a business day).

    get_zip : bool = True
        Download the excel files in .zip files.

    get_excel_if_no_zip : bool = True
        Only try to download .xlsx files if there are no zip
        files available to be downloaded. Only downloading individual excel
        files when there are no zip files to download can speed up the
        download process.

    get_excel : bool = False
        The default value means that excel files are not
        automatically downloaded. Note: at least one of `get_zip`,
        `get_excel_if_no_zip`, or `get_excel` must be true. For most ABS
        catalogue items, it is sufficient to just download the one zip
        file. But note, some catalogue items do not have a zip file.
        Others have quite a number of zip files.

    single_excel_only : str = ""
        If this argument is set to a table name (without the
        .xlsx extension), only that excel file will be downloaded. If
        set, and only a limited subset of available data is needed,
        this can speed up download times significantly. Note: overrides
        `get_zip`, `get_excel_if_no_zip`, `get_excel` and `single_zip_only`.

    single_zip_only : str = ""
        If this argument is set to a zip file name (without
        the .zip extension), only that zip file will be downloaded.
        If set, and only a limited subset of available data is needed,
        this can speed up download times significantly. Note: overrides
        `get_zip`, `get_excel_if_no_zip`, and `get_excel`.

    cache_only : bool = False
        If set to True, this function will only access
        data that has been previously cached. Normally, the function
        checks the date of the cache data against the date of the data
        on the ABS website, before deciding whether the ABS has fresher
        data that needs to be downloaded to the cache.

    zip_file : str | Path = ""
        If set to a specific zip file name (with or without the .zip
        extension), this function will only extract data from that zip file
        on the local file system. This may be useful for debugging purposes.

    Returns
    -------
    tuple[dict[str, DataFrame], DataFrame]
        The function returns a tuple of two items. The first item is a
        python dictionary of pandas DataFrames (which is the primary data
        associated with the ABS catalogue item). The second item is a
        DataFrame of ABS metadata for the ABS collection.

    Note:
        You can retrieve non-timeseries data using the grab_abs_url()
        function. That takes the URL for the ABS landing page for the ABS
        collection you are interested in. The read_abs_cat function is for
        ABS catalogue identifiers which are timeseries data, for which the
        metadata can be extracted.

    Example
    -------

    ```python
    import readabs as ra
    from pandas import DataFrame
    cat_num = "6202.0"  # The ABS labour force survey
    data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
    abs_dict, meta = data
    ```

    """
    # --- get the time series data ---
    if kwargs.get("zip_file"):
        raw_abs_dict = grab_abs_zip(kwargs["zip_file"], **kwargs)
    else:
        raw_abs_dict = grab_abs_url(cat=cat, **kwargs)
    response = _get_time_series_data(cat, raw_abs_dict, **kwargs)

    if not response:
        response = {}, DataFrame()

    return response  # dictionary of DataFrames, and a DataFrame of metadata


# - private -
def _get_time_series_data(
    cat: str,
    abs_dict: dict[str, DataFrame],
    **kwargs: Any,  # keep_non_ts, verbose, ignore_errors
) -> tuple[dict[str, DataFrame], DataFrame]:
    """Extract the time series data for a specific ABS catalogue identifier."""
    # --- set up ---
    cat = "<catalogue number missing>" if not cat.strip() else cat.strip()
    new_dict: dict[str, DataFrame] = {}
    meta_data = DataFrame()

    # --- group the sheets and iterate over these groups
    long_groups = _group_sheets(abs_dict)
    for table, sheets in long_groups.items():
        args = {
            "cat": cat,
            "from_dict": abs_dict,
            "table": table,
            "long_sheets": sheets,
        }
        new_dict, meta_data = _capture(new_dict, meta_data, args, **kwargs)
    return new_dict, meta_data


def _copy_raw_sheets(
    from_dict: dict[str, DataFrame],
    long_sheets: list[str],
    to_dict: dict[str, DataFrame],
    *,
    keep_non_ts: bool,
) -> dict[str, DataFrame]:
    """Copy the raw sheets across to the final dictionary.

    Used if the data is not in a timeseries format, and keep_non_ts
    flag is set to True. Returns an updated final dictionary.
    """
    if not keep_non_ts:
        return to_dict

    for sheet in long_sheets:
        if sheet in from_dict:
            to_dict[sheet] = from_dict[sheet]
        else:
            # should not happen
            raise ValueError(f"Glitch: Sheet {sheet} not found in the data.")
    return to_dict


def _capture(
    to_dict: dict[str, DataFrame],
    meta_data: DataFrame,
    args: dict[str, Any],
    **kwargs: Any,  # keep_non_ts, ignore_errors
) -> tuple[dict[str, DataFrame], DataFrame]:
    """Capture the time series data and meta data from an Excel file.

    For a specific Excel file, capture *both* the time series data
    from the ABS data files as well as the meta data. These data are
    added to the input 'to_dict' and 'meta_data' respectively, and
    the combined results are returned as a tuple.
    """
    # --- step 0: set up ---
    keep_non_ts: bool = kwargs.get("keep_non_ts", False)
    ignore_errors: bool = kwargs.get("ignore_errors", False)

    # --- step 1: capture the meta data ---
    short_names = [x.split(HYPHEN, 1)[1] for x in args["long_sheets"]]
    if "Index" not in short_names:
        print(f"Table {args['table']} has no 'Index' sheet.")
        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)
        return to_dict, meta_data
    index = short_names.index("Index")

    index_sheet = args["long_sheets"][index]
    this_meta = _capture_meta(args["cat"], args["from_dict"], index_sheet)
    if this_meta.empty:
        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)
        return to_dict, meta_data

    meta_data = pd.concat([meta_data, this_meta], axis=0)

    # --- step 2: capture the actual time series data ---
    data = _capture_data(meta_data, args["from_dict"], args["long_sheets"], **kwargs)
    if len(data):
        to_dict[args["table"]] = data
    else:
        # a glitch: we have the metadata but not the actual data
        error = f"Unexpected: {args['table']} has no actual data."
        if not ignore_errors:
            raise ValueError(error)
        print(error)
        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)

    return to_dict, meta_data


def _capture_data(
    abs_meta: DataFrame,
    from_dict: dict[str, DataFrame],
    long_sheets: list[str],
    **kwargs: Any,  # verbose
) -> DataFrame:
    """Take a list of ABS data sheets and stitch them into a DataFrame.

    Find the DataFrames for those sheets in the from_dict, and stitch them
    into a single DataFrame with an appropriate PeriodIndex.
    """
    # --- step 0: set up ---
    verbose: bool = kwargs.get("verbose", False)
    merged_data = DataFrame()
    header_row: int = 8

    # --- step 1: capture the time series data ---
    # identify the data sheets in the list of all sheets from Excel file
    data_sheets = [x for x in long_sheets if x.split(HYPHEN, 1)[1].startswith("Data")]

    for sheet_name in data_sheets:
        if verbose:
            print(f"About to capture data from {sheet_name=}")

        # --- capture just the data, nothing else
        sheet_data = from_dict[sheet_name].copy()

        # get the columns
        header = sheet_data.iloc[header_row]
        sheet_data.columns = pd.Index(header)
        sheet_data = sheet_data[(header_row + 1) :]

        # get the row indexes
        sheet_data = _index_to_period(sheet_data, sheet_name, abs_meta, verbose=verbose)

        # --- merge data into a single dataframe
        if len(merged_data) == 0:
            merged_data = sheet_data
        else:
            merged_data = merged_data.merge(
                right=sheet_data,
                how="outer",
                left_index=True,
                right_index=True,
                suffixes=("", ""),
            )

    # --- step 2 - final tidy-ups
    # remove NA rows
    merged_data = merged_data.dropna(how="all")
    # check for NA columns - rarely happens
    # Note: these empty columns are not removed,
    # but it is useful to know they are there
    if merged_data.isna().all().any() and verbose:
        na_cols = merged_data.columns[merged_data.isna().all()]
        print(f"Caution: These columns are all NA: {list(na_cols)}")

    # check for duplicate columns - should not happen
    # Note: these duplicate columns are removed
    duplicates = merged_data.columns.duplicated()
    if duplicates.any():
        if verbose:
            dup_table = abs_meta[metacol.table].iloc[0]
            print(f"Note: duplicates removed from {dup_table}: " + f"{merged_data.columns[duplicates]}")
        merged_data = merged_data.loc[:, ~duplicates].copy()

    # make the data all floats.
    return merged_data.astype(float).sort_index()


def _index_to_period(sheet_data: DataFrame, sheet_name: str, abs_meta: DataFrame, *, verbose: bool) -> DataFrame:
    """Convert the index of a DataFrame to a PeriodIndex."""
    index_column = sheet_data[sheet_data.columns[0]].astype(str)
    sheet_data = sheet_data.drop(sheet_data.columns[0], axis=1)
    long_row_names = index_column.str.len() > MAX_DATETIME_CHARS  # 19 chars in datetime str
    if verbose and long_row_names.any():
        print(f"You may need to check index column for {sheet_name}")
    index_column = index_column.loc[~long_row_names]
    sheet_data = sheet_data.loc[~long_row_names]

    proposed_index = pd.to_datetime(index_column)

    # get the correct period index
    short_name = sheet_name.split(HYPHEN, 1)[0]
    series_id = sheet_data.columns[0]
    freq_value = abs_meta[abs_meta[metacol.table] == short_name].loc[series_id, metacol.freq]
    freq = str(freq_value).upper().strip()[0]
    freq = "Y" if freq == "A" else freq  # pandas prefers yearly
    freq = "Q" if freq == "B" else freq  # treat Biannual as quarterly
    if freq not in ("Y", "Q", "M", "D"):
        print(f"Check the frequency of the data in sheet: {sheet_name}")

    # create an appropriate period index
    if freq:
        if freq in ("Q", "Y"):
            month = str(calendar.month_abbr[proposed_index.dt.month.max()]).upper()
            freq = f"{freq}-{month}"
        sheet_data.index = pd.PeriodIndex(proposed_index, freq=freq)
    else:
        raise ValueError(f"With sheet {sheet_name} could not determine PeriodIndex")

    return sheet_data


def _capture_meta(
    cat: str,
    from_dict: dict[str, DataFrame],
    index_sheet: str,
) -> DataFrame:
    """Capture the metadata from the Index sheet of an ABS excel file.

    Returns a DataFrame specific to the current excel file.
    Returning an empty DataFrame, means that the meta data could not
    be identified. Meta data for each ABS data item is organised by row.
    """
    # --- step 0: set up ---
    frame = from_dict[index_sheet]

    # --- step 1: check if the metadata is present in the right place ---
    # Unfortunately, the header for some of the 3401.0
    # spreadsheets starts on row 10
    starting_rows = 8, 9, 10
    required = metacol.did, metacol.id, metacol.stype, metacol.unit
    required_set = set(required)

    header_row = None
    header_columns = None
    for row in starting_rows:
        columns = frame.iloc[row]
        if required_set.issubset(set(columns)):
            header_row = row
            header_columns = columns
            break

    if header_row is None or header_columns is None:
        print(f"Table has no metadata in sheet {index_sheet}.")
        return DataFrame()

    # --- step 2: capture the metadata ---
    file_meta = frame.iloc[header_row + 1 :].copy()
    file_meta.columns = pd.Index(header_columns)

    # make sure there are no rogue white spaces
    for col in required:
        file_meta[col] = file_meta[col].str.strip()

    # remove empty columns and rows
    file_meta = file_meta.dropna(how="all", axis=1).dropna(how="all", axis=0)

    # populate the metadata
    file_meta[metacol.table] = index_sheet.split(HYPHEN, 1)[0]
    tab_desc_value = frame.iloc[TABLE_DESC_ROW, TABLE_DESC_COL]
    tab_desc = str(tab_desc_value).split(".", 1)[-1].strip()
    file_meta[metacol.tdesc] = tab_desc
    file_meta[metacol.cat] = cat

    # drop last row - should just be copyright statement
    file_meta = file_meta.iloc[:-1]

    # set the index to the series_id
    file_meta.index = pd.Index(file_meta[metacol.id])

    return file_meta


def _group_sheets(
    abs_dict: dict[str, DataFrame],
) -> dict[str, list[str]]:
    """Group the sheets from an Excel file."""
    keys = list(abs_dict.keys())
    long_pairs = [(x.split(HYPHEN, 1)[0], x) for x in keys]

    def group(p_list: list[tuple[str, str]]) -> dict[str, list[str]]:
        groups: dict[str, list[str]] = {}
        for x, y in p_list:
            if x not in groups:
                groups[x] = []
            groups[x].append(y)
        return groups

    return group(long_pairs)


# --- initial testing ---
if __name__ == "__main__":

    def simple_test() -> None:
        """Test the read_abs_cat function."""
        # ABS Catalogue ID 8731.0 has a mix of time
        # series and non-time series data. Also,
        # it has unusually structured Excel files. So, a good test.

        print("Starting test.")

        d, _m = read_abs_cat("8731.0", keep_non_ts=False, verbose=False)
        print(f"--- {len(d)=} ---")
        print(f"--- {d.keys()=} ---")
        for table in d:
            freq_str = getattr(d[table].index, "freqstr", "Unknown")
            print(f"{table=} {d[table].shape=} {freq_str=}")

        print("=" * 20)

        d, _m = read_abs_cat("", zip_file=".test-data/Qrtly-CPI-Time-series-spreadsheets-all.zip", verbose=False)
        print(f"--- {len(d)=} ---")
        print(f"--- {d.keys()=} ---")
        for table in d:
            freq_str = getattr(d[table].index, "freqstr", "Unknown")
            print(f"{table=} {d[table].shape=} {freq_str=}")

        print("Test complete.")

    simple_test()
For a specific catalogue identifier, return the complete ABS Catalogue information as DataFrames.
This function returns the complete ABS Catalogue information as a python dictionary of pandas DataFrames, as well as the associated metadata in a separate DataFrame. The function automates the collection of zip and excel files from the ABS website. If necessary, these files are downloaded, and saved into a cache directory. The files are then parsed to extract time series data, and the associated metadata.
By default, the cache directory is ./.readabs_cache/. You can change the
default directory by setting the shell environment variable
READABS_CACHE_DIR to the name of the preferred directory.
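The same redirection can be done from inside Python rather than from the shell. The following is a minimal sketch, assuming that readabs reads READABS_CACHE_DIR from the process environment no earlier than import time (the text above implies this but does not say when the variable is read). The directory name `abs_cache_demo` is purely illustrative.

```python
import os
from pathlib import Path

# Hypothetical cache location; any writable directory would do.
cache_dir = Path.cwd() / "abs_cache_demo"

# Set the variable before importing readabs, so that the library sees it
# regardless of when it consults the environment (an assumption).
os.environ["READABS_CACHE_DIR"] = str(cache_dir)

# import readabs as ra   # import only after the variable has been set
print(os.environ["READABS_CACHE_DIR"])
```

Setting the variable in the shell profile instead keeps scripts free of path handling, which is probably the better choice for a shared project.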
Parameters
cat : str The ABS Catalogue Number for the data to be downloaded and made available by this function. This argument must be specified in the function call.
**kwargs : Unpack[ReadArgs] The following parameters may be passed as optional keyword arguments.
url : str = ""
The URL of an ABS landing page. Use this for discontinued series
that are no longer in the ABS Time Series Directory. If provided,
data will be retrieved from this URL instead of looking up the
catalogue number. Example:
read_abs_cat(cat="8501.0", url="https://www.abs.gov.au/.../jun-2025")
keep_non_ts : bool = False A flag for whether to keep the non-time-series tables that might form part of an ABS catalogue item. Normally, the non-time-series information is ignored, and not made available to the user.
history : str = "" Provide a month-year string to extract historical ABS data. For example, you can set history="dec-2023" to get the ABS data for a catalogue identifier that was originally published in respect of Q4 of 2023. Note: not all ABS data sources are structured so that this technique works in every case; but most are.
verbose : bool = False Setting this to true may help diagnose why something might be going wrong with the data retrieval process.
ignore_errors : bool = False Normally, this function will cease downloading when an error is encountered. However, sometimes the ABS website has malformed links, and you may need to change this setting. (Note: if you drop a message to the ABS, they will usually fix broken links within a business day).
get_zip : bool = True Download the excel files in .zip files.
get_excel_if_no_zip : bool = True Only try to download .xlsx files if there are no zip files available to be downloaded. Only downloading individual excel files when there are no zip files to download can speed up the download process.
get_excel : bool = False
The default value means that excel files are not
automatically downloaded. Note: at least one of get_zip,
get_excel_if_no_zip, or get_excel must be true. For most ABS
catalogue items, it is sufficient to just download the one zip
file. But note, some catalogue items do not have a zip file.
Others have quite a number of zip files.
single_excel_only : str = ""
If this argument is set to a table name (without the
.xlsx extension), only that excel file will be downloaded. If
set, and only a limited subset of available data is needed,
this can speed up download times significantly. Note: overrides
get_zip, get_excel_if_no_zip, get_excel and single_zip_only.
single_zip_only : str = ""
If this argument is set to a zip file name (without
the .zip extension), only that zip file will be downloaded.
If set, and only a limited subset of available data is needed,
this can speed up download times significantly. Note: overrides
get_zip, get_excel_if_no_zip, and get_excel.
cache_only : bool = False If set to True, this function will only access data that has been previously cached. Normally, the function checks the date of the cache data against the date of the data on the ABS website, before deciding whether the ABS has fresher data that needs to be downloaded to the cache.
zip_file: str | Path = "" If set to a specific zip file name (with or without the .zip extension), this function will only extract data from that zip file on the local file system. This may be useful for debugging purposes.
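The override rules among the download options can be easier to follow as code. The sketch below is not the library's implementation: `DownloadOptions` and `describe_plan` are hypothetical names that restate the documented defaults and precedence (single_excel_only first, then single_zip_only, then the three get_* flags, of which at least one must be true). Raising `ValueError` for the all-false case is also an assumption, and the table name `6202001` is only a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadOptions:
    """Hypothetical container mirroring the documented keyword defaults."""

    get_zip: bool = True
    get_excel_if_no_zip: bool = True
    get_excel: bool = False
    single_excel_only: str = ""
    single_zip_only: str = ""


def describe_plan(opts: DownloadOptions) -> str:
    """Return a one-line description of what would be downloaded."""
    # single_excel_only overrides every other download option.
    if opts.single_excel_only:
        return f"only {opts.single_excel_only}.xlsx"
    # single_zip_only overrides the three get_* flags.
    if opts.single_zip_only:
        return f"only {opts.single_zip_only}.zip"
    # Otherwise at least one of the get_* flags must be true.
    if not (opts.get_zip or opts.get_excel_if_no_zip or opts.get_excel):
        raise ValueError("at least one of get_zip, get_excel_if_no_zip, get_excel must be true")
    parts: list[str] = []
    if opts.get_zip:
        parts.append("zip files")
    if opts.get_excel:
        parts.append("all excel files")
    elif opts.get_excel_if_no_zip:
        parts.append("excel files only if no zip is found")
    return ", ".join(parts)


print(describe_plan(DownloadOptions()))
# → zip files, excel files only if no zip is found
print(describe_plan(DownloadOptions(single_excel_only="6202001")))
# → only 6202001.xlsx
```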
Returns
tuple[dict[str, DataFrame], DataFrame] The function returns a tuple of two items. The first item is a python dictionary of pandas DataFrames (which is the primary data associated with the ABS catalogue item). The second item is a DataFrame of ABS metadata for the ABS collection.
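To show how the two returned items fit together, here is a small sketch built on synthetic data, so it runs without touching the ABS website. It relies on what the module source shows: the dictionary is keyed by table name, the data columns are series IDs, and the metadata is indexed by series ID with a column naming the table. The literal column names ("Series ID", "Table", "Data Item Description"), the table name and the series IDs are assumptions made for illustration; in real code, use the names exposed by `metacol`.

```python
import pandas as pd

# Synthetic stand-ins for the two returned items (illustrative values only).
index = pd.period_range("2024-01", periods=3, freq="M")
abs_dict = {
    "6202001": pd.DataFrame({"A1": [1.0, 2.0, 3.0], "A2": [4.0, 5.0, 6.0]}, index=index),
}
meta = pd.DataFrame(
    {
        "Series ID": ["A1", "A2"],
        "Table": ["6202001", "6202001"],
        "Data Item Description": ["Employed total", "Unemployed total"],
    }
).set_index("Series ID", drop=False)

# Find the metadata row for one series, then use its table name and
# series ID to pull the matching column out of the data dictionary.
row = meta.loc[meta["Data Item Description"] == "Unemployed total"].iloc[0]
series = abs_dict[row["Table"]][row["Series ID"]]
print(series)
```

Looking a series up through the metadata first avoids hard-coding table names, which can differ between ABS releases.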
Note:
You can retrieve non-timeseries data using the grab_abs_url()
function. That takes the URL for the ABS landing page for the ABS
collection you are interested in. The read_abs_cat function is for
ABS catalogue identifiers which are timeseries data, for which the
metadata can be extracted.
Example
import readabs as ra
from pandas import DataFrame
cat_num = "6202.0" # The ABS labour force survey
data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
abs_dict, meta = data