🧰️ Microsoft AzureBlob
1. Overview
A data lake is a data storage approach built on clusters and distributed file systems that provides unified storage for all of an enterprise's data. The data lake serves as the raw-data preservation zone: raw data (exact copies of source-system data) is transformed into target data for tasks such as reporting, visual analytics, and machine learning. Data in a data lake includes structured data (relational database data), semi-structured data (CSV, XML, JSON, etc.), unstructured data (emails, documents, PDFs), and binary data (images, audio, video), forming a centralized data store that accommodates data in all its forms.
The major data lake vendors are generally hyperscale public cloud providers such as Amazon AWS, Microsoft Azure, and Google Cloud Platform (GCP).
AILab-PDBC (Python DataBase Connectivity) is an efficient, flexible data interface (API) developed by the AI Lab 100 team of 数智教育发展(山东)有限公司.
The ailab100.pdbc.datalake module handles data lake reads and writes, supporting mainstream data lakes such as MinIO, Amazon S3, Google GCS, and Microsoft Azure Blob.
This chapter describes how to use ailab100.pdbc.datalake.AzureBlob to connect to Azure Blob Storage for reading, writing, and downloading.
Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob Storage is optimized for storing massive amounts of unstructured data.
2. API Reference
alcedo_pdbc.datalake.AzureBlob
The AzureBlob class creates an AzureBlob client object through which you can read, write, upload, and download data from Azure Blob Storage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Automatically loaded from the config file (YAML) | required |
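As a sketch, instantiation might look like the following. The config key names shown (`account_name`, `account_key`) are illustrative assumptions, not the library's documented schema; in practice the dict is loaded automatically from the project's YAML config file.

```python
# Hypothetical AzureBlob setup; the config keys below are assumptions
# for illustration -- check your deployment's YAML config for the real keys.
config = {
    "account_name": "mystorageaccount",  # assumed key name
    "account_key": "<secret-key>",       # assumed key name
}

try:
    from alcedo_pdbc.datalake import AzureBlob
    client = AzureBlob(config=config)
except ImportError:
    client = None  # library not installed in this environment; sketch only
```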
Functions
alcedo_pdbc.datalake.AzureBlob.download_file(container_name, blob_name, path_to_download='.')
method descriptor
AzureBlob.download_file(self, container_name: str, blob_name: str, path_to_download='.')
Takes a container name and blob name as arguments and downloads the file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| container_name | str | container name | required |
| blob_name | str | blob name | required |
| path_to_download | str | save location. Defaults to '.'. | '.' |
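A hedged usage sketch: `client` is assumed to be an already-configured AzureBlob instance, and the container and blob names below are made up for illustration.

```python
import os

def fetch_report(client, download_dir="./downloads"):
    # Download one blob into download_dir; "reports" and "2023/sales.csv"
    # are illustrative assumptions, not real resources.
    os.makedirs(download_dir, exist_ok=True)
    client.download_file(
        container_name="reports",       # assumed container
        blob_name="2023/sales.csv",     # assumed blob
        path_to_download=download_dir,
    )
    # The downloaded file is expected to land under download_dir.
    return download_dir
```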
alcedo_pdbc.datalake.AzureBlob.download_folder(container_name, blob_path, local_path_to_download='.')
method descriptor
AzureBlob.download_folder(self, container_name: str, blob_path: str, local_path_to_download='.')
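The source gives no prose description for this method; judging from the signature, it mirrors every blob under a prefix to a local directory. A sketch with assumed names:

```python
def fetch_folder(client, local_dir="./data"):
    # Download all blobs under a prefix; "raw-data" and "images/" are
    # illustrative assumptions.
    client.download_folder(
        container_name="raw-data",
        blob_path="images/",
        local_path_to_download=local_dir,
    )
    return local_dir
```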
alcedo_pdbc.datalake.AzureBlob.read_as_dataframe(container_name, blob_name, pandas_args={}, polars_args={}, extension='csv', return_type='pandas')
method descriptor
AzureBlob.read_as_dataframe(self, container_name: str, blob_name: str, pandas_args: Dict = {}, polars_args: Dict = {}, extension='csv', return_type='pandas')
Takes an Azure Storage account container name and blob name and returns a DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| container_name | str | Container name of the Azure storage account | required |
| blob_name | str | Name of the blob to read | required |
| pandas_args | dict | pandas arguments, such as encoding | {} |
| polars_args | dict | polars arguments, forwarded to the polars reader | {} |
| extension | str | File extension; taken automatically from the blob_name parameter. Defaults to 'csv'. | 'csv' |
| return_type | str | Which DataFrame type to return ('pandas', 'polars', 'dask', etc.). Defaults to 'pandas'. | 'pandas' |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | ``Pandas``, ``Polars`` or ``Dask`` | Depends on the return_type parameter |
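An illustrative call, assuming a configured client and made-up container/blob names; `pandas_args` is passed through to the underlying pandas reader.

```python
def load_sales(client):
    # Read a CSV blob straight into a pandas DataFrame.  The extension
    # ('csv') is inferred from blob_name automatically.
    return client.read_as_dataframe(
        container_name="reports",            # assumed container
        blob_name="2023/sales.csv",          # assumed blob
        pandas_args={"encoding": "utf-8"},   # forwarded to pandas
        return_type="pandas",                # or "polars" / "dask"
    )
```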
alcedo_pdbc.datalake.AzureBlob.upload_file(source_file_path, container_name, blob_name=None)
method descriptor
AzureBlob.upload_file(self, source_file_path: str, container_name: str, blob_name: str = None)
Takes a source file path, container name, and blob name as arguments and uploads the file to Azure Blob Storage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_file_path | str | source file path | required |
| container_name | str | container name | required |
| blob_name | str | blob name; if omitted, the source filename is used as the blob name. Defaults to None. | None |
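A sketch of the common case where `blob_name` is left out; the path and container name are illustrative assumptions.

```python
def publish_report(client, path="./downloads/sales.csv"):
    # Upload a local file; blob_name is omitted, so the source filename
    # ("sales.csv") should be used as the blob name.
    client.upload_file(
        source_file_path=path,
        container_name="reports",   # assumed container
    )
```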
alcedo_pdbc.datalake.AzureBlob.upload_folder(local_folder_path, container_name, blob_name)
method descriptor
AzureBlob.upload_folder(self, local_folder_path: str, container_name: str, blob_name: str) -> None
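This method is also undescribed in the source; by symmetry with download_folder it presumably uploads everything under a local folder beneath a blob prefix. A sketch with assumed names:

```python
def publish_folder(client, local_dir="./data/images"):
    # Upload everything under local_dir beneath the "images" prefix;
    # container and prefix names are illustrative assumptions.
    client.upload_folder(
        local_folder_path=local_dir,
        container_name="raw-data",
        blob_name="images",
    )
```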
alcedo_pdbc.datalake.AzureBlob.write_dataframe(df, container_name, blob_name, overwrite=True, extension='csv', pandas_args={}, polars_args={})
method descriptor
AzureBlob.write_dataframe(self, df, container_name: str, blob_name: str, overwrite=True, extension='csv', pandas_args={}, polars_args={})
Takes a DataFrame, container name, and filename as arguments and writes the DataFrame to Azure Blob Storage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame to be uploaded | required |
| container_name | str | Container name of the Azure storage account | required |
| blob_name | str | file name with extension | required |
| overwrite | bool | Overwrite the existing data. Defaults to True. | True |
| extension | str | File extension; taken automatically from the blob_name parameter. Defaults to 'csv'. | 'csv' |
| pandas_args | dict | pandas arguments, forwarded to the pandas writer | {} |
| polars_args | dict | polars arguments, forwarded to the polars writer | {} |