readabs.read_abs_cat

Download timeseries data from the Australian Bureau of Statistics.

Download timeseries data from the Australian Bureau of Statistics (ABS) for a specified ABS catalogue identifier.

"""Download *timeseries* data from the Australian Bureau of Statistics.

Download timeseries data from the Australian Bureau of Statistics (ABS)
for a specified ABS catalogue identifier.
"""

import calendar
from functools import cache
from typing import Any, Unpack

import pandas as pd
from pandas import DataFrame

from readabs.abs_meta_data import metacol
from readabs.grab_abs_url import grab_abs_url, grab_abs_zip
from readabs.read_support import HYPHEN, ReadArgs

# Constants
MAX_DATETIME_CHARS = 20
TABLE_DESC_ROW = 4
TABLE_DESC_COL = 1


# --- functions ---
# - public -
@cache  # minimise slowness for any repeat business
def read_abs_cat(
    cat: str,
    **kwargs: Unpack[ReadArgs],
) -> tuple[dict[str, DataFrame], DataFrame]:
    """For a specific catalogue identifier, return the complete ABS Catalogue information as DataFrames.

    This function returns the complete ABS Catalogue information as a
    python dictionary of pandas DataFrames, as well as the associated metadata
    in a separate DataFrame. The function automates the collection of zip and
    excel files from the ABS website. If necessary, these files are downloaded,
    and saved into a cache directory. The files are then parsed to extract time
    series data, and the associated metadata.

    By default, the cache directory is `./.readabs_cache/`. You can change the
    default directory name by setting the shell environment variable
    `READABS_CACHE_DIR` with the name of the preferred directory.

    Parameters
    ----------
    cat : str
        The ABS Catalogue Number for the data to be downloaded and made
        available by this function. This argument must be specified in the
        function call.

    **kwargs : Unpack[ReadArgs]
        The following parameters may be passed as optional keyword arguments.

    url : str = ""
        The URL of an ABS landing page. Use this for discontinued series
        that are no longer in the ABS Time Series Directory. If provided,
        data will be retrieved from this URL instead of looking up the
        catalogue number. Example:
        `read_abs_cat(cat="8501.0", url="https://www.abs.gov.au/.../jun-2025")`

    keep_non_ts : bool = False
        A flag for whether to keep the non-time-series tables
        that might form part of an ABS catalogue item. Normally, the
        non-time-series information is ignored, and not made available to
        the user.

    history : str = ""
        Provide a month-year string to extract historical ABS data.
        For example, you can set history="dec-2023" to get the ABS data
        for a catalogue identifier that was originally published in respect
        of Q4 of 2023. Note: not all ABS data sources are structured so that
        this technique works in every case; but most are.

    verbose : bool = False
        Setting this to True may help diagnose why something
        might be going wrong with the data retrieval process.

    ignore_errors : bool = False
        Normally, this function will cease downloading when
        an error is encountered. However, the ABS website sometimes has
        malformed links, and this setting then needs to be changed to
        True. (Note: if you drop a message to the ABS, they will usually
        fix broken links within a business day).

    get_zip : bool = True
        Download the excel files in .zip files.

    get_excel_if_no_zip : bool = True
        Only try to download .xlsx files if there are no zip
        files available to be downloaded. Only downloading individual excel
        files when there are no zip files to download can speed up the
        download process.

    get_excel : bool = False
        The default value means that excel files are not
        automatically downloaded. Note: at least one of `get_zip`,
        `get_excel_if_no_zip`, or `get_excel` must be true. For most ABS
        catalogue items, it is sufficient to just download the one zip
        file. But note, some catalogue items do not have a zip file.
        Others have quite a number of zip files.

    single_excel_only : str = ""
        If this argument is set to a table name (without the
        .xlsx extension), only that excel file will be downloaded. If
        set, and only a limited subset of available data is needed,
        this can speed up download times significantly. Note: overrides
        `get_zip`, `get_excel_if_no_zip`, `get_excel` and `single_zip_only`.

    single_zip_only : str = ""
        If this argument is set to a zip file name (without
        the .zip extension), only that zip file will be downloaded.
        If set, and only a limited subset of available data is needed,
        this can speed up download times significantly. Note: overrides
        `get_zip`, `get_excel_if_no_zip`, and `get_excel`.

    cache_only : bool = False
        If set to True, this function will only access
        data that has been previously cached. Normally, the function
        checks the date of the cache data against the date of the data
        on the ABS website, before deciding whether the ABS has fresher
        data that needs to be downloaded to the cache.

    zip_file : str | Path = ""
        If set to a specific zip file name (with or without the .zip
        extension), this function will only extract data from that zip file
        on the local file system. This may be useful for debugging purposes.

    Returns
    -------
    tuple[dict[str, DataFrame], DataFrame]
        The function returns a tuple of two items. The first item is a
        python dictionary of pandas DataFrames (which is the primary data
        associated with the ABS catalogue item). The second item is a
        DataFrame of ABS metadata for the ABS collection.

        Note:
        You can retrieve non-timeseries data using the grab_abs_url()
        function. That takes the URL for the ABS landing page for the ABS
        collection you are interested in. The read_abs_cat function is for
        ABS catalogue identifiers which are timeseries data, for which the
        metadata can be extracted.

    Example
    -------

    ```python
    import readabs as ra
    from pandas import DataFrame
    cat_num = "6202.0"  # The ABS labour force survey
    data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
    abs_dict, meta = data
    ```

    """
    # --- get the time series data ---
    if kwargs.get("zip_file"):
        raw_abs_dict = grab_abs_zip(kwargs["zip_file"], **kwargs)
    else:
        raw_abs_dict = grab_abs_url(cat=cat, **kwargs)
    response = _get_time_series_data(cat, raw_abs_dict, **kwargs)

    if not response:
        response = {}, DataFrame()

    return response  # dictionary of DataFrames, and a DataFrame of metadata


# - private -
def _get_time_series_data(
    cat: str,
    abs_dict: dict[str, DataFrame],
    **kwargs: Any,  # keep_non_ts, verbose, ignore_errors
) -> tuple[dict[str, DataFrame], DataFrame]:
    """Extract the time series data for a specific ABS catalogue identifier."""
    # --- set up ---
    cat = "<catalogue number missing>" if not cat.strip() else cat.strip()
    new_dict: dict[str, DataFrame] = {}
    meta_data = DataFrame()

    # --- group the sheets and iterate over these groups
    long_groups = _group_sheets(abs_dict)
    for table, sheets in long_groups.items():
        args = {
            "cat": cat,
            "from_dict": abs_dict,
            "table": table,
            "long_sheets": sheets,
        }
        new_dict, meta_data = _capture(new_dict, meta_data, args, **kwargs)
    return new_dict, meta_data


def _copy_raw_sheets(
    from_dict: dict[str, DataFrame],
    long_sheets: list[str],
    to_dict: dict[str, DataFrame],
    *,
    keep_non_ts: bool,
) -> dict[str, DataFrame]:
    """Copy the raw sheets across to the final dictionary.

    Used if the data is not in a timeseries format, and keep_non_ts
    flag is set to True. Returns an updated final dictionary.
    """
    if not keep_non_ts:
        return to_dict

    for sheet in long_sheets:
        if sheet in from_dict:
            to_dict[sheet] = from_dict[sheet]
        else:
            # should not happen
            raise ValueError(f"Glitch: Sheet {sheet} not found in the data.")
    return to_dict


def _capture(
    to_dict: dict[str, DataFrame],
    meta_data: DataFrame,
    args: dict[str, Any],
    **kwargs: Any,  # keep_non_ts, ignore_errors
) -> tuple[dict[str, DataFrame], DataFrame]:
    """Capture the time series data and meta data from an Excel file.

    For a specific Excel file, capture *both* the time series data
    from the ABS data files as well as the meta data. These data are
    added to the input 'to_dict' and 'meta_data' respectively, and
    the combined results are returned as a tuple.
    """
    # --- step 0: set up ---
    keep_non_ts: bool = kwargs.get("keep_non_ts", False)
    ignore_errors: bool = kwargs.get("ignore_errors", False)

    # --- step 1: capture the meta data ---
    short_names = [x.split(HYPHEN, 1)[1] for x in args["long_sheets"]]
    if "Index" not in short_names:
        print(f"Table {args['table']} has no 'Index' sheet.")
        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)
        return to_dict, meta_data
    index = short_names.index("Index")

    index_sheet = args["long_sheets"][index]
    this_meta = _capture_meta(args["cat"], args["from_dict"], index_sheet)
    if this_meta.empty:
        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)
        return to_dict, meta_data

    meta_data = pd.concat([meta_data, this_meta], axis=0)

    # --- step 2: capture the actual time series data ---
    data = _capture_data(meta_data, args["from_dict"], args["long_sheets"], **kwargs)
    if len(data):
        to_dict[args["table"]] = data
    else:
        # a glitch: we have the metadata but not the actual data
        error = f"Unexpected: {args['table']} has no actual data."
        if not ignore_errors:
            raise ValueError(error)
        print(error)
        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)

    return to_dict, meta_data


def _capture_data(
    abs_meta: DataFrame,
    from_dict: dict[str, DataFrame],
    long_sheets: list[str],
    **kwargs: Any,  # verbose
) -> DataFrame:
    """Take a list of ABS data sheets and stitch them into a DataFrame.

    Find the DataFrames for those sheets in the from_dict, and stitch them
    into a single DataFrame with an appropriate PeriodIndex.
    """
    # --- step 0: set up ---
    verbose: bool = kwargs.get("verbose", False)
    merged_data = DataFrame()
    header_row: int = 8

    # --- step 1: capture the time series data ---
    # identify the data sheets in the list of all sheets from Excel file
    data_sheets = [x for x in long_sheets if x.split(HYPHEN, 1)[1].startswith("Data")]

    for sheet_name in data_sheets:
        if verbose:
            print(f"About to capture data from {sheet_name=}")

        # --- capture just the data, nothing else
        sheet_data = from_dict[sheet_name].copy()

        # get the columns
        header = sheet_data.iloc[header_row]
        sheet_data.columns = pd.Index(header)
        sheet_data = sheet_data[(header_row + 1) :]

        # get the row indexes
        sheet_data = _index_to_period(sheet_data, sheet_name, abs_meta, verbose=verbose)

        # --- merge data into a single dataframe
        if len(merged_data) == 0:
            merged_data = sheet_data
        else:
            merged_data = merged_data.merge(
                right=sheet_data,
                how="outer",
                left_index=True,
                right_index=True,
                suffixes=("", ""),
            )

    # --- step 2 - final tidy-ups
    # remove NA rows
    merged_data = merged_data.dropna(how="all")
    # check for NA columns - rarely happens
    # Note: these empty columns are not removed,
    # but it is useful to know they are there
    if merged_data.isna().all().any() and verbose:
        na_cols = merged_data.columns[merged_data.isna().all()]
        print(f"Caution: These columns are all NA: {list(na_cols)}")

    # check for duplicate columns - should not happen
    # Note: these duplicate columns are removed
    duplicates = merged_data.columns.duplicated()
    if duplicates.any():
        if verbose:
            dup_table = abs_meta[metacol.table].iloc[0]
            print(f"Note: duplicates removed from {dup_table}: " + f"{merged_data.columns[duplicates]}")
        merged_data = merged_data.loc[:, ~duplicates].copy()

    # make the data all floats.
    return merged_data.astype(float).sort_index()


def _index_to_period(sheet_data: DataFrame, sheet_name: str, abs_meta: DataFrame, *, verbose: bool) -> DataFrame:
    """Convert the index of a DataFrame to a PeriodIndex."""
    index_column = sheet_data[sheet_data.columns[0]].astype(str)
    sheet_data = sheet_data.drop(sheet_data.columns[0], axis=1)
    long_row_names = index_column.str.len() > MAX_DATETIME_CHARS  # 19 chars in datetime str
    if verbose and long_row_names.any():
        print(f"You may need to check index column for {sheet_name}")
    index_column = index_column.loc[~long_row_names]
    sheet_data = sheet_data.loc[~long_row_names]

    proposed_index = pd.to_datetime(index_column)

    # get the correct period index
    short_name = sheet_name.split(HYPHEN, 1)[0]
    series_id = sheet_data.columns[0]
    freq_value = abs_meta[abs_meta[metacol.table] == short_name].loc[series_id, metacol.freq]
    freq = str(freq_value).upper().strip()[0]
    freq = "Y" if freq == "A" else freq  # pandas prefers yearly
    freq = "Q" if freq == "B" else freq  # treat Biannual as quarterly
    if freq not in ("Y", "Q", "M", "D"):
        print(f"Check the frequency of the data in sheet: {sheet_name}")

    # create an appropriate period index
    if freq:
        if freq in ("Q", "Y"):
            month = str(calendar.month_abbr[proposed_index.dt.month.max()]).upper()
            freq = f"{freq}-{month}"
        sheet_data.index = pd.PeriodIndex(proposed_index, freq=freq)
    else:
        raise ValueError(f"With sheet {sheet_name} could not determine PeriodIndex")

    return sheet_data


def _capture_meta(
    cat: str,
    from_dict: dict[str, DataFrame],
    index_sheet: str,
) -> DataFrame:
    """Capture the metadata from the Index sheet of an ABS excel file.

    Returns a DataFrame specific to the current excel file.
    Returning an empty DataFrame means that the meta data could not
    be identified. Meta data for each ABS data item is organised by row.
    """
    # --- step 0: set up ---
    frame = from_dict[index_sheet]

    # --- step 1: check if the metadata is present in the right place ---
    # Unfortunately, the header for some of the 3401.0
    #                spreadsheets starts on row 10
    starting_rows = 8, 9, 10
    required = metacol.did, metacol.id, metacol.stype, metacol.unit
    required_set = set(required)

    header_row = None
    header_columns = None
    for row in starting_rows:
        columns = frame.iloc[row]
        if required_set.issubset(set(columns)):
            header_row = row
            header_columns = columns
            break

    if header_row is None or header_columns is None:
        print(f"Table has no metadata in sheet {index_sheet}.")
        return DataFrame()

    # --- step 2: capture the metadata ---
    file_meta = frame.iloc[header_row + 1 :].copy()
    file_meta.columns = pd.Index(header_columns)

    # make damn sure there are no rogue white spaces
    for col in required:
        file_meta[col] = file_meta[col].str.strip()

    # remove empty columns and rows
    file_meta = file_meta.dropna(how="all", axis=1).dropna(how="all", axis=0)

    # populate the metadata
    file_meta[metacol.table] = index_sheet.split(HYPHEN, 1)[0]
    tab_desc_value = frame.iloc[TABLE_DESC_ROW, TABLE_DESC_COL]
    tab_desc = str(tab_desc_value).split(".", 1)[-1].strip()
    file_meta[metacol.tdesc] = tab_desc
    file_meta[metacol.cat] = cat

    # drop last row - should just be copyright statement
    file_meta = file_meta.iloc[:-1]

    # set the index to the series_id
    file_meta.index = pd.Index(file_meta[metacol.id])

    return file_meta


def _group_sheets(
    abs_dict: dict[str, DataFrame],
) -> dict[str, list[str]]:
    """Group the sheets from an Excel file."""
    keys = list(abs_dict.keys())
    long_pairs = [(x.split(HYPHEN, 1)[0], x) for x in keys]

    def group(p_list: list[tuple[str, str]]) -> dict[str, list[str]]:
        groups: dict[str, list[str]] = {}
        for x, y in p_list:
            if x not in groups:
                groups[x] = []
            groups[x].append(y)
        return groups

    return group(long_pairs)


# --- initial testing ---
if __name__ == "__main__":

    def simple_test() -> None:
        """Test the read_abs_cat function."""
        # ABS Catalogue ID 8731.0 has a mix of time
        # series and non-time series data. Also,
        # it has unusually structured Excel files. So, a good test.

        print("Starting test.")

        d, _m = read_abs_cat("8731.0", keep_non_ts=False, verbose=False)
        print(f"--- {len(d)=} ---")
        print(f"--- {d.keys()=} ---")
        for table in d:
            freq_str = getattr(d[table].index, "freqstr", "Unknown")
            print(f"{table=} {d[table].shape=} {freq_str=}")

        print("=" * 20)

        d, _m = read_abs_cat("", zip_file=".test-data/Qrtly-CPI-Time-series-spreadsheets-all.zip", verbose=False)
        print(f"--- {len(d)=} ---")
        print(f"--- {d.keys()=} ---")
        for table in d:
            freq_str = getattr(d[table].index, "freqstr", "Unknown")
            print(f"{table=} {d[table].shape=} {freq_str=}")

        print("Test complete.")

    simple_test()
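
The frequency handling inside `_index_to_period` can be read in isolation as the sketch below. It follows the mapping in the source listing (first letter of the metadata frequency, `A` becomes `Y`, `B` becomes `Q`, and quarterly or yearly data get a month anchor taken from the largest month number seen in the index). The function name `period_freq` and the sample labels such as "Quarter" are illustrative assumptions, not part of readabs.

```python
import calendar


def period_freq(freq_label: str, last_month: int) -> str:
    """Return a pandas-style period frequency string.

    freq_label: frequency text from the metadata (assumed examples below).
    last_month: largest month number (1-12) found in the sheet's dates.
    """
    code = freq_label.strip().upper()[0]
    code = "Y" if code == "A" else code  # Annual -> yearly
    code = "Q" if code == "B" else code  # Biannual treated as quarterly
    if code in ("Q", "Y"):
        # anchor quarterly and yearly periods on the closing month
        code = f"{code}-{calendar.month_abbr[last_month].upper()}"
    return code


print(period_freq("Quarter", 12))
print(period_freq("Annual", 6))
print(period_freq("Month", 12))
```

Anchoring on the closing month means a series reported for years ending in June keeps June as the end of each period instead of being forced onto calendar years.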
MAX_DATETIME_CHARS = 20
TABLE_DESC_ROW = 4
TABLE_DESC_COL = 1
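
`MAX_DATETIME_CHARS` is used in `_index_to_period` to discard rows whose first-column label is too long to be a date (a datetime string is 19 characters). A minimal pure-Python sketch of that filter follows; the helper name `keep_date_like` and the sample labels are hypothetical, and the real code applies the same test to a pandas column.

```python
MAX_DATETIME_CHARS = 20  # same value as the module constant


def keep_date_like(labels: list[str]) -> list[str]:
    """Keep only labels short enough to be a datetime string."""
    return [label for label in labels if len(label) <= MAX_DATETIME_CHARS]


labels = [
    "2023-09-01 00:00:00",
    "2023-12-01 00:00:00",
    "Hypothetical footnote text that is longer than a date",
]
kept = keep_date_like(labels)
print(kept)
```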
@cache
def read_abs_cat(cat: str, **kwargs: Unpack[readabs.ReadArgs]) -> tuple[dict[str, pandas.DataFrame], pandas.DataFrame]:

For a specific catalogue identifier, return the complete ABS Catalogue information as DataFrames.

This function returns the complete ABS Catalogue information as a python dictionary of pandas DataFrames, as well as the associated metadata in a separate DataFrame. The function automates the collection of zip and excel files from the ABS website. If necessary, these files are downloaded, and saved into a cache directory. The files are then parsed to extract time series data, and the associated metadata.

By default, the cache directory is ./.readabs_cache/. You can change the default directory name by setting the shell environment variable READABS_CACHE_DIR with the name of the preferred directory.

Parameters

cat : str The ABS Catalogue Number for the data to be downloaded and made available by this function. This argument must be specified in the function call.

**kwargs : Unpack[ReadArgs] The following parameters may be passed as optional keyword arguments.

url : str = "" The URL of an ABS landing page. Use this for discontinued series that are no longer in the ABS Time Series Directory. If provided, data will be retrieved from this URL instead of looking up the catalogue number. Example: read_abs_cat(cat="8501.0", url="https://www.abs.gov.au/.../jun-2025")

keep_non_ts : bool = False A flag for whether to keep the non-time-series tables that might form part of an ABS catalogue item. Normally, the non-time-series information is ignored, and not made available to the user.

history : str = "" Provide a month-year string to extract historical ABS data. For example, you can set history="dec-2023" to get the ABS data for a catalogue identifier that was originally published in respect of Q4 of 2023. Note: not all ABS data sources are structured so that this technique works in every case; but most are.

verbose : bool = False Setting this to True may help diagnose why something might be going wrong with the data retrieval process.

ignore_errors : bool = False Normally, this function will cease downloading when an error is encountered. However, the ABS website sometimes has malformed links, and this setting then needs to be changed to True. (Note: if you drop a message to the ABS, they will usually fix broken links within a business day).

get_zip : bool = True Download the excel files in .zip files.

get_excel_if_no_zip : bool = True Only try to download .xlsx files if there are no zip files available to be downloaded. Only downloading individual excel files when there are no zip files to download can speed up the download process.

get_excel : bool = False The default value means that excel files are not automatically downloaded. Note: at least one of get_zip, get_excel_if_no_zip, or get_excel must be true. For most ABS catalogue items, it is sufficient to just download the one zip file. But note, some catalogue items do not have a zip file. Others have quite a number of zip files.

single_excel_only : str = "" If this argument is set to a table name (without the .xlsx extension), only that excel file will be downloaded. If set, and only a limited subset of available data is needed, this can speed up download times significantly. Note: overrides get_zip, get_excel_if_no_zip, get_excel and single_zip_only.

single_zip_only : str = "" If this argument is set to a zip file name (without the .zip extension), only that zip file will be downloaded. If set, and only a limited subset of available data is needed, this can speed up download times significantly. Note: overrides get_zip, get_excel_if_no_zip, and get_excel.

cache_only : bool = False If set to True, this function will only access data that has been previously cached. Normally, the function checks the date of the cache data against the date of the data on the ABS website, before deciding whether the ABS has fresher data that needs to be downloaded to the cache.

zip_file : str | Path = "" If set to a specific zip file name (with or without the .zip extension), this function will only extract data from that zip file on the local file system. This may be useful for debugging purposes.
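
The override rules among the download options can be summarised in a small sketch. This is not readabs code: `describe_download_plan` is a hypothetical helper that only restates the documented precedence. It assumes `zip_file` is checked first (as in the source listing), and raising `ValueError` when all three flags are false is an assumption, because the documentation does not say how readabs reacts in that case.

```python
def describe_download_plan(
    *,
    get_zip: bool = True,
    get_excel_if_no_zip: bool = True,
    get_excel: bool = False,
    single_excel_only: str = "",
    single_zip_only: str = "",
    zip_file: str = "",
) -> str:
    """Describe which files would be fetched under the documented rules."""
    if zip_file:
        return f"read only the local zip file {zip_file!r}"
    if single_excel_only:  # overrides every other download flag
        return f"download only {single_excel_only}.xlsx"
    if single_zip_only:  # overrides get_zip, get_excel_if_no_zip, get_excel
        return f"download only {single_zip_only}.zip"
    if not (get_zip or get_excel_if_no_zip or get_excel):
        raise ValueError("at least one of get_zip, get_excel_if_no_zip, get_excel must be True")
    steps = []
    if get_zip:
        steps.append("download zip files")
    if get_excel:
        steps.append("download Excel files")
    elif get_excel_if_no_zip:
        steps.append("download Excel files only if no zip files are available")
    return "; ".join(steps)


print(describe_download_plan())
print(describe_download_plan(single_excel_only="table01", single_zip_only="all"))
```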

Returns

tuple[dict[str, DataFrame], DataFrame] The function returns a tuple of two items. The first item is a python dictionary of pandas DataFrames (which is the primary data associated with the ABS catalogue item). The second item is a DataFrame of ABS metadata for the ABS collection.
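
A rough sketch of how the two returned items fit together: the metadata is indexed by series identifier and records which table each series belongs to, so a series can be located through the metadata and then pulled from the dictionary. The data below is a stand-in built by hand, and the column names "Table" and "Series ID" are assumptions; in real code the names would come from `metacol.table` and `metacol.id`.

```python
import pandas as pd

# Hand-built stand-in for the tuple returned by read_abs_cat().
index = pd.period_range("2023-01", periods=3, freq="M")
abs_dict = {
    "6202001": pd.DataFrame({"A1": [1.0, 2.0, 3.0]}, index=index),
    "6202002": pd.DataFrame({"B1": [4.0, 5.0, 6.0]}, index=index),
}
meta = pd.DataFrame(
    {"Table": ["6202001", "6202002"], "Series ID": ["A1", "B1"]},
).set_index("Series ID", drop=False)


def get_series(abs_dict, meta, series_id, table_col="Table"):
    """Find the table that holds `series_id`, then return that column."""
    table = meta.loc[series_id, table_col]
    return abs_dict[table][series_id]


series = get_series(abs_dict, meta, "B1")
print(series.tolist())
```

If the same series identifier appears in more than one table, `meta.loc` returns several rows and the lookup would need to pick one of them first.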

Note:
You can retrieve non-timeseries data using the grab_abs_url()
function. That takes the URL for the ABS landing page for the ABS
collection you are interested in. The read_abs_cat function is for
ABS catalogue identifiers which are timeseries data, for which the
metadata can be extracted.

Example

import readabs as ra
from pandas import DataFrame
cat_num = "6202.0"  # The ABS labour force survey
data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
abs_dict, meta = data