🧰️ Microsoft AzureBlob

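1. Overview

A data lake is a data storage approach built on clusters and distributed file systems that stores all of an enterprise's data in one place. It is the holding area for raw data (exact copies of source-system data), which is then transformed into target data for tasks such as reporting, visual analytics, and machine learning. Data in a data lake includes structured data (relational database data), semi-structured data (CSV, XML, JSON, etc.), unstructured data (emails, documents, PDFs), and binary data (images, audio, video), forming a centralized store that accommodates data of every form.

The main data lake vendors are generally hyperscale public cloud providers such as Amazon AWS, Microsoft Azure, and Google Cloud Platform (GCP).

AILab-PDBC (Python DataBase Connectivity) is an efficient, flexible data interface (API) developed by the AI Lab 100 team at 数智教育发展(山东)有限公司.

The ailab100.pdbc.datalake classes read from and write to data lakes, and support mainstream data lakes such as MinIO, Amazon S3, Google GCS, and Microsoft Azure Blob.

This chapter describes how to use ailab100.pdbc.datalake.AzureBlob to connect to Azure Blob for reading, writing, and downloading.

Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob Storage is best suited to storing massive amounts of unstructured data.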

2. API Reference

alcedo_pdbc.datalake.AzureBlob

The AzureBlob class creates an AzureBlob object through which you can read, write, upload, and download data in Azure Blob Storage.

Parameters:

Name | Type | Description | Default
config | dict | Automatically loaded from the config file (YAML). | required

Functions

alcedo_pdbc.datalake.AzureBlob.download_file(container_name, blob_name, path_to_download='.') method descriptor

AzureBlob.download_file(self, str container_name: str, str blob_name: str, path_to_download='.')

Takes a container name and a blob name and downloads the blob to path_to_download.

Parameters:

Name | Type | Description | Default
container_name | str | Container name. | required
blob_name | str | Blob name. | required
path_to_download | str | Local save location. | '.'

alcedo_pdbc.datalake.AzureBlob.download_folder(container_name, blob_path, local_path_to_download='.') method descriptor

AzureBlob.download_folder(self, str container_name: str, str blob_path: str, local_path_to_download='.')
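
The two download methods above might be wrapped as follows. This is a minimal sketch under stated assumptions, not the library's implementation: `fetch_blob` and `fetch_folder` are hypothetical helper names, `client` is assumed to be an already-constructed `AzureBlob` instance, and the target directory is created up front only because the documentation does not say whether the library creates it.

```python
from pathlib import Path


def fetch_blob(client, container_name, blob_name, target_dir="."):
    """Download one blob into target_dir via AzureBlob.download_file."""
    # Assumption: the library may not create the directory, so create it here.
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    client.download_file(container_name, blob_name, path_to_download=target_dir)
    return Path(target_dir)


def fetch_folder(client, container_name, blob_path, target_dir="."):
    """Call AzureBlob.download_folder for blob_path (semantics assumed from the name)."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    client.download_folder(
        container_name, blob_path, local_path_to_download=target_dir
    )
    return Path(target_dir)
```

Both helpers only forward the documented arguments, so any object exposing the same two methods can stand in for `client` during testing.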

alcedo_pdbc.datalake.AzureBlob.read_as_dataframe(container_name, blob_name, pandas_args={}, polars_args={}, extension='csv', return_type='pandas') method descriptor

AzureBlob.read_as_dataframe(self, str container_name: str, str blob_name: str, dict pandas_args: Dict = {}, dict polars_args: Dict = {}, extension='csv', return_type='pandas')

Takes an Azure Storage account container name and a blob name and returns a DataFrame.

Parameters:

Name | Type | Description | Default
container_name | str | Container name in the Azure storage account. | required
blob_name | str | Name of the blob to read. | required
pandas_args | dict | pandas arguments such as encoding. | {}
extension | str | File extension; taken automatically from the blob_name parameter. | 'csv'
return_type | str | Which DataFrame type to return (pandas, polars, dask, etc.). | 'pandas'

Returns:

Name | Type | Description
DataFrame | ``Pandas``, ``Polars`` or ``Dask`` | The DataFrame matching the return_type parameter.
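
A call to `read_as_dataframe` could look roughly like the sketch below. `load_frame` and `SUPPORTED_RETURN_TYPES` are hypothetical names, `client` is assumed to be an already-constructed `AzureBlob` instance, and the up-front check covers only the three return types the documentation names; the library may accept others.

```python
# The three return types named in the documentation above.
SUPPORTED_RETURN_TYPES = ("pandas", "polars", "dask")


def load_frame(client, container_name, blob_name,
               return_type="pandas", encoding="utf-8"):
    """Read a blob into a DataFrame via AzureBlob.read_as_dataframe."""
    if return_type not in SUPPORTED_RETURN_TYPES:
        raise ValueError(f"unsupported return_type: {return_type!r}")
    return client.read_as_dataframe(
        container_name,
        blob_name,
        pandas_args={"encoding": encoding},  # a pandas argument, per the docs
        return_type=return_type,
    )
```

Failing fast on an unexpected `return_type` keeps a typo from turning into a remote call; whether the library itself validates the value is not documented.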

alcedo_pdbc.datalake.AzureBlob.upload_file(source_file_path, container_name, blob_name=None) method descriptor

AzureBlob.upload_file(self, str source_file_path: str, str container_name: str, str blob_name: str = None)

Takes a source file path, a container name, and an optional blob name and uploads the file to Azure Blob Storage.

Parameters:

Name | Type | Description | Default
source_file_path | str | Source file path. | required
container_name | str | Container name. | required
blob_name | str | Blob name; if omitted, the source file name is used as the blob name. | None

alcedo_pdbc.datalake.AzureBlob.upload_folder(local_folder_path, container_name, blob_name) method descriptor

AzureBlob.upload_folder(self, str local_folder_path: str, str container_name: str, str blob_name: str) -> None
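
For `upload_file`, the documented fallback of `blob_name` to the source file name can be sketched as below, together with a thin wrapper around `upload_folder`. The helper names are hypothetical, `client` is assumed to be an already-constructed `AzureBlob` instance, and using the final path component as "the source filename" is an assumption about what the library does internally.

```python
from pathlib import Path


def resolve_blob_name(source_file_path, blob_name=None):
    """Mirror the documented default: fall back to the source file name."""
    if blob_name is not None:
        return blob_name
    # Assumption: "source filename" means the last path component.
    return Path(source_file_path).name


def push_file(client, source_file_path, container_name, blob_name=None):
    """Upload one local file via AzureBlob.upload_file; return the blob name used."""
    name = resolve_blob_name(source_file_path, blob_name)
    client.upload_file(source_file_path, container_name, blob_name=name)
    return name


def push_folder(client, local_folder_path, container_name, blob_name):
    """Forward the three documented arguments to AzureBlob.upload_folder."""
    client.upload_folder(local_folder_path, container_name, blob_name)
```

Resolving the name before the call makes the destination explicit in logs and return values instead of relying on the implicit default.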

alcedo_pdbc.datalake.AzureBlob.write_dataframe(df, container_name, blob_name, overwrite=True, extension='csv', pandas_args={}, polars_args={}) method descriptor

AzureBlob.write_dataframe(self, df, str container_name: str, str blob_name: str, overwrite=True, extension='csv', pandas_args={}, polars_args={})

Takes a DataFrame, a container name, and a blob name and writes the DataFrame to Azure Blob Storage.

Parameters:

Name | Type | Description | Default
df | DataFrame | DataFrame to upload. | required
container_name | str | Container name in the Azure storage account. | required
blob_name | str | File name, including its extension. | required
overwrite | bool | Whether to overwrite existing data. | True
extension | str | File extension; taken automatically from the blob_name parameter. | 'csv'
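
The note that `extension` is taken from the file name can be made explicit on the caller's side, as in this sketch. `infer_extension` and `save_frame` are hypothetical helpers, `client` is assumed to be an already-constructed `AzureBlob` instance, and the parsing rule (text after the last dot, lowercased) is an assumption rather than the library's documented behavior.

```python
from pathlib import PurePosixPath


def infer_extension(blob_name, default="csv"):
    """Return blob_name's extension without the dot, or default if there is none."""
    suffix = PurePosixPath(blob_name).suffix
    return suffix[1:].lower() if suffix else default


def save_frame(client, df, container_name, blob_name, overwrite=True):
    """Write df via AzureBlob.write_dataframe with an explicit extension."""
    extension = infer_extension(blob_name)
    client.write_dataframe(
        df, container_name, blob_name,
        overwrite=overwrite, extension=extension,
    )
    return extension
```

Passing `overwrite=False` is forwarded unchanged; how the library reacts when the blob already exists is not described in this reference.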