🧰️ Amazon S3
1. Overview
A data lake (Data Lake) is a data storage approach built on clusters and distributed file systems that stores all of an enterprise's data in one place. The data lake serves as the raw-data zone: exact copies of source-system data are kept there and later transformed into target data for tasks such as reporting, visual analytics, and machine learning. Data in a data lake includes structured data (relational database tables), semi-structured data (CSV, XML, JSON, etc.), unstructured data (emails, documents, PDFs), and binary data (images, audio, video), forming a centralized store that accommodates data in every form.
The major data lake providers are generally the hyperscale public cloud vendors, such as Amazon AWS, Microsoft Azure, and Google Cloud Platform (GCP).
AILab-PDBC (Python DataBase Connectivity) is an efficient, flexible data-access API developed by the AI Lab 100 team at 数智教育发展(山东)有限公司.
The ailab100.pdbc.datalake class handles reading from and writing to data lakes, and supports mainstream data lake backends including MinIO, Amazon S3, Google GCS, and Microsoft Azure Blob.
This chapter describes how to use the S3 class to connect to Amazon S3 for reading, writing, uploading, and downloading.
Amazon S3 is an object storage service built on a distributed architecture: data is stored in multiple physical locations to improve reliability and fault tolerance, and the service scales horizontally to meet storage needs of any size.
2. API Reference
alcedo_pdbc.datalake.S3
The S3 class creates an S3 connection object through which you can read, write, upload, and download data from AWS S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Automatically loaded from the config file (YAML) | required |
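A minimal construction sketch is shown below. How the YAML config is located and loaded is not specified here, so the explicit `yaml.safe_load` step, the file name `config.yaml`, and the import path are assumptions for illustration only.

```python
import yaml
from alcedo_pdbc.datalake import S3   # assumed import path

# Hypothetical: load the YAML config yourself and pass it as a dict.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

s3 = S3(config=config)
```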
Functions
alcedo_pdbc.datalake.S3.download_file(s3_path=None, bucket=None, key=None, local_path='.')
method descriptor
S3.download_file(self, s3_path: str = None, bucket: str = None, key: str = None, local_path: str = '.')
Takes an S3 path, or a bucket and key name, as arguments and downloads the file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| s3_path | str | S3 path from which to download the file. Defaults to None. | None |
| bucket | str | S3 bucket name, if an S3 path is not provided. Defaults to None. | None |
| key | str | S3 key name, if an S3 path is not provided. Defaults to None. | None |
| local_path | str | Save location. Defaults to '.' (current directory). | '.' |
Returns:
| Name | Type | Description |
|---|---|---|
| file | ``CSV``, ``Excel``, ``JSON``, ``HTML``, ``HDF5``, ``Feather``, ``Parquet``, ``Apache Avro`` | Depends on the arguments |
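A usage sketch for download_file follows; the bucket, key, and local paths are made-up names for illustration, and the config handling follows the assumed pattern shown earlier.

```python
import yaml
from alcedo_pdbc.datalake import S3   # assumed import path

s3 = S3(config=yaml.safe_load(open("config.yaml")))  # hypothetical YAML config

# Option 1: pass the full S3 path (illustrative bucket/key)
s3.download_file(s3_path="s3://my-bucket/data/sales.csv", local_path="./downloads")

# Option 2: pass bucket and key separately
s3.download_file(bucket="my-bucket", key="data/sales.csv", local_path="./downloads")
```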
alcedo_pdbc.datalake.S3.download_folder(s3_path=None, bucket=None, key=None, local_path_to_download='.')
method descriptor
S3.download_folder(self, s3_path: str = None, bucket: str = None, key: str = None, local_path_to_download: str = '.')
Takes an S3 path, or a bucket and key name, as arguments and downloads the folder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| s3_path | str | S3 path from which to download the folder. Defaults to None. | None |
| bucket | str | S3 bucket name, if an S3 path is not provided. Defaults to None. | None |
| key | str | S3 key name, if an S3 path is not provided. Defaults to None. | None |
| local_path_to_download | str | Save location. Defaults to '.' (current directory). | '.' |
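A usage sketch for download_folder; the bucket, prefix, and local directory names are illustrative assumptions.

```python
import yaml
from alcedo_pdbc.datalake import S3   # assumed import path

s3 = S3(config=yaml.safe_load(open("config.yaml")))  # hypothetical YAML config

# Download every object under the prefix into ./local_data (illustrative names)
s3.download_folder(s3_path="s3://my-bucket/raw/2024/",
                   local_path_to_download="./local_data")

# Equivalent call with bucket and key given separately
s3.download_folder(bucket="my-bucket", key="raw/2024/",
                   local_path_to_download="./local_data")
```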
alcedo_pdbc.datalake.S3.read_as_dataframe(s3_path=None, bucket=None, key=None, pandas_args={}, polars_args={}, extension='csv', return_type='pandas')
method descriptor
S3.read_as_dataframe(self, s3_path: str = None, bucket: str = None, key: str = None, pandas_args: Dict = {}, polars_args: Dict = {}, extension='csv', return_type='pandas')
Takes an S3 path as argument and returns a DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| s3_path | str | S3 path of the file to be loaded. For multiple files, use s3://bucket/path/filename* to match by pattern; to load all files from a folder, use s3://bucket/folder/. | None |
| bucket | str | S3 bucket name | None |
| key | str | File name with extension | None |
| pandas_args | dict | pandas arguments such as encoding, etc. | {} |
| polars_args | dict | polars arguments, analogous to pandas_args | {} |
| extension | str | Extension of the files; inferred automatically from the s3_path parameter. | 'csv' |
| return_type | str | Which DataFrame type to return (pandas, polars, dask, etc.). Defaults to 'pandas'. | 'pandas' |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | ``Pandas``, ``Polars`` or ``Dask`` | Depends on return_type |
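A usage sketch for read_as_dataframe, covering a single file and a wildcard load; the bucket and file names are illustrative, and config handling follows the assumed pattern shown earlier.

```python
import yaml
from alcedo_pdbc.datalake import S3   # assumed import path

s3 = S3(config=yaml.safe_load(open("config.yaml")))  # hypothetical YAML config

# Load a single CSV object into a pandas DataFrame (illustrative path)
df = s3.read_as_dataframe(s3_path="s3://my-bucket/data/sales.csv",
                          pandas_args={"encoding": "utf-8"})

# Wildcard: load all matching Parquet files and return a Polars DataFrame
df_pl = s3.read_as_dataframe(s3_path="s3://my-bucket/data/part-*.parquet",
                             return_type="polars")
```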
alcedo_pdbc.datalake.S3.upload_file(source_file_path, bucket, key)
method descriptor
S3.upload_file(self, source_file_path: str, bucket: str, key: str)
Takes a source file path, bucket, and key as arguments and uploads the file to S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_file_path | str | Source file path | required |
| bucket | str | Destination bucket | required |
| key | str | Destination file path (key) | required |
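A usage sketch for upload_file; the local path, bucket, and key are illustrative assumptions.

```python
import yaml
from alcedo_pdbc.datalake import S3   # assumed import path

s3 = S3(config=yaml.safe_load(open("config.yaml")))  # hypothetical YAML config

# Upload a local file to s3://my-bucket/data/sales.csv (illustrative names)
s3.upload_file(source_file_path="./exports/sales.csv",
               bucket="my-bucket",
               key="data/sales.csv")
```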
alcedo_pdbc.datalake.S3.upload_folder(local_folder_path, bucket, key)
method descriptor
S3.upload_folder(self, local_folder_path: str, bucket: str, key: str) -> None
Takes a local path, bucket, and key as arguments and uploads the folder to S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| local_folder_path | str | Local path of the folder to be uploaded | required |
| bucket | str | S3 bucket name | required |
| key | str | S3 key name | required |
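A usage sketch for upload_folder; the local folder, bucket, and key prefix are illustrative assumptions.

```python
import yaml
from alcedo_pdbc.datalake import S3   # assumed import path

s3 = S3(config=yaml.safe_load(open("config.yaml")))  # hypothetical YAML config

# Upload the local ./exports folder under the raw/2024/ prefix (illustrative names)
s3.upload_folder(local_folder_path="./exports",
                 bucket="my-bucket",
                 key="raw/2024/")
```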
alcedo_pdbc.datalake.S3.write_dataframe(df, bucket, key, extension='csv', pandas_args={}, polars_args={})
method descriptor
S3.write_dataframe(self, df, bucket: str, key: str, extension='csv', pandas_args={}, polars_args={}) -> None
Takes a DataFrame, bucket name, and file name as arguments and writes the DataFrame to S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame to be uploaded | required |
| bucket | str | S3 bucket name | required |
| key | str | File name with extension | required |
| extension | str | Extension of the file; inferred automatically from the key (file name) parameter. Defaults to 'csv'. | 'csv' |
| pandas_args | dict | pandas arguments such as encoding, etc. | {} |
| polars_args | dict | polars arguments, analogous to pandas_args | {} |
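A usage sketch for write_dataframe; the DataFrame contents, bucket, and key are illustrative assumptions.

```python
import yaml
import pandas as pd
from alcedo_pdbc.datalake import S3   # assumed import path

s3 = S3(config=yaml.safe_load(open("config.yaml")))  # hypothetical YAML config

df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})

# Write the DataFrame to S3 as a Parquet object (illustrative bucket/key)
s3.write_dataframe(df, bucket="my-bucket", key="reports/amounts.parquet",
                   extension="parquet")
```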