
🧰️ Amazon S3

1. Overview

A data lake is a data storage approach built on clusters and distributed file systems that stores all of an enterprise's data in one place. The data lake is the raw-data zone: raw data (exact copies of source-system data) is transformed into target data used for tasks such as reporting, visual analytics, and machine learning. A data lake holds structured data (relational database tables), semi-structured data (CSV, XML, JSON, etc.), unstructured data (email, documents, PDFs), and binary data (images, audio, video), forming a centralized store for data in every form.

The major data lake vendors are generally the hyperscale public cloud providers, such as Amazon AWS, Microsoft Azure, and Google Cloud Platform (GCP).

AILab-PDBC (Python DataBase Connectivity) is an efficient and flexible data access API developed by the AI Lab 100 team at 数智教育发展(山东)有限公司.

The ailab100.pdbc.datalake classes read from and write to data lakes, and support mainstream object stores such as MinIO, Amazon S3, Google Cloud Storage (GCS), and Microsoft Azure Blob Storage.

This chapter describes how to use alcedo_pdbc.datalake.S3 to connect to Amazon S3 for reading, writing, uploading, and downloading data.

Amazon S3 is an object storage service with a distributed architecture: data is stored in multiple physical locations to improve reliability and fault tolerance. S3 scales horizontally, so it can serve storage needs of different sizes and requirements.

2. API Reference

alcedo_pdbc.datalake.S3

The S3 class creates an S3 client object through which you can read, write, upload, and download data from AWS S3.

Parameters:

| Name   | Type | Description                                        | Default  |
| ------ | ---- | -------------------------------------------------- | -------- |
| config | dict | Automatically loaded from the config file (YAML).  | required |
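
Below is a minimal sketch of constructing the client. The import path follows the class name above, and the configuration keys shown are assumptions for illustration only; in practice the config dict is loaded automatically from the project's YAML config file.

```python
from alcedo_pdbc.datalake import S3  # import path assumed from the class name above

# Hypothetical config dict; the real keys come from the YAML config file
# and are normally loaded automatically.
config = {
    "aws_access_key_id": "YOUR_ACCESS_KEY",      # assumed key name
    "aws_secret_access_key": "YOUR_SECRET_KEY",  # assumed key name
    "region_name": "us-east-1",                  # assumed key name
}

s3 = S3(config)  # assumes the config dict may also be passed explicitly
```

The `s3` object created here is reused in the usage sketches below.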

Functions

alcedo_pdbc.datalake.S3.download_file(s3_path=None, bucket=None, key=None, local_path='.') method descriptor

S3.download_file(self, s3_path: str = None, bucket: str = None, key: str = None, local_path: str = '.')

Takes an S3 path, or a bucket and key name, as arguments and downloads the file.

Parameters:

| Name       | Type | Description                                                    | Default |
| ---------- | ---- | -------------------------------------------------------------- | ------- |
| s3_path    | str  | S3 path from which to download the file. Defaults to None.     | None    |
| bucket     | str  | S3 bucket name, if s3_path is not provided. Defaults to None.  | None    |
| key        | str  | S3 key name, if s3_path is not provided. Defaults to None.     | None    |
| local_path | str  | Save location. Defaults to '.' (the current directory).        | '.'     |

Returns:

| Name | Type                                                                                         | Description                                                                          |
| ---- | -------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| file | ``CSV``, ``Excel``, ``JSON``, ``HTML``, ``HDF5``, ``Feather``, ``Parquet``, ``Apache Avro``   | The file is saved to the specified directory in the format implied by its file name. |
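
A usage sketch of download_file, assuming the `s3` client from the sketch above; the bucket, key, and local path values are placeholders.

```python
# Download a single object by its full S3 path
s3.download_file(s3_path="s3://my-bucket/data/report.csv", local_path="./downloads")

# Equivalent call using bucket and key instead of the full path
s3.download_file(bucket="my-bucket", key="data/report.csv", local_path="./downloads")
```
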

alcedo_pdbc.datalake.S3.download_folder(s3_path=None, bucket=None, key=None, local_path_to_download='.') method descriptor

S3.download_folder(self, s3_path: str = None, bucket: str = None, key: str = None, local_path_to_download: str = '.')

Takes an S3 path, or a bucket and key name, as arguments and downloads the folder.

Parameters:

| Name                   | Type | Description                                                    | Default |
| ---------------------- | ---- | -------------------------------------------------------------- | ------- |
| s3_path                | str  | S3 path from which to download the folder. Defaults to None.   | None    |
| bucket                 | str  | S3 bucket name, if s3_path is not provided. Defaults to None.  | None    |
| key                    | str  | S3 key name, if s3_path is not provided. Defaults to None.     | None    |
| local_path_to_download | str  | Save location. Defaults to '.' (the current directory).        | '.'     |
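
A usage sketch of download_folder, again assuming the `s3` client created earlier; the bucket and key prefix are placeholders.

```python
# Download every object under a folder (key prefix) by its S3 path
s3.download_folder(s3_path="s3://my-bucket/data/", local_path_to_download="./downloads")

# Equivalent call using bucket and key prefix
s3.download_folder(bucket="my-bucket", key="data/", local_path_to_download="./downloads")
```
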
alcedo_pdbc.datalake.S3.read_as_dataframe(s3_path=None, bucket=None, key=None, pandas_args={}, polars_args={}, extension='csv', return_type='pandas') method descriptor

S3.read_as_dataframe(self, s3_path: str = None, bucket: str = None, key: str = None, pandas_args: Dict = {}, polars_args: Dict = {}, extension='csv', return_type='pandas')

Takes an S3 path as argument and returns a DataFrame.

Parameters:

| Name        | Type | Description                                                                                                                                                        | Default  |
| ----------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------- |
| s3_path     | str  | S3 path of the file to load. For loading multiple files, use s3://bucket/path/filename* to load matching files, or s3://bucket/folder/ to load all files in a folder. Defaults to None. | None     |
| bucket      | str  | S3 bucket name.                                                                                                                                                    | None     |
| key         | str  | File name with extension.                                                                                                                                          | None     |
| pandas_args | dict | pandas reader arguments, such as encoding.                                                                                                                         | {}       |
| polars_args | dict | polars reader arguments.                                                                                                                                           | {}       |
| extension   | str  | Extension of the files; inferred automatically from the s3_path parameter. Defaults to 'csv'.                                                                      | 'csv'    |
| return_type | str  | Which DataFrame type to return (pandas, polars, dask, etc.). Defaults to 'pandas'.                                                                                 | 'pandas' |

Returns:

| Name      | Type                               | Description                                                          |
| --------- | ---------------------------------- | --------------------------------------------------------------------- |
| DataFrame | ``Pandas``, ``Polars`` or ``Dask`` | Returns the DataFrame type specified by the return_type parameter.   |
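
A usage sketch of read_as_dataframe with the `s3` client from above; the paths and reader options are placeholders, and the wildcard form follows the s3_path description in the table.

```python
# Read a single CSV object into a pandas DataFrame (the default return_type)
df = s3.read_as_dataframe(s3_path="s3://my-bucket/data/report.csv")

# Read all matching CSV objects in a folder and return a polars DataFrame,
# passing reader options through pandas_args / polars_args
df_pl = s3.read_as_dataframe(
    s3_path="s3://my-bucket/data/report_*",
    extension="csv",
    polars_args={"encoding": "utf8"},  # placeholder reader option
    return_type="polars",
)
```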

alcedo_pdbc.datalake.S3.upload_file(source_file_path, bucket, key) method descriptor

S3.upload_file(self, source_file_path: str, bucket: str, key: str)

Takes a source file path, bucket, and key as arguments and uploads the file to S3.

Parameters:

| Name             | Type | Description            | Default  |
| ---------------- | ---- | ---------------------- | -------- |
| source_file_path | str  | Source file path.      | required |
| bucket           | str  | Destination bucket.    | required |
| key              | str  | Destination file path. | required |
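
A usage sketch of upload_file with placeholder paths, assuming the `s3` client created earlier.

```python
# Upload a local file to s3://my-bucket/data/report.csv
s3.upload_file(
    source_file_path="./report.csv",
    bucket="my-bucket",
    key="data/report.csv",
)
```
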
alcedo_pdbc.datalake.S3.upload_folder(local_folder_path, bucket, key) method descriptor

S3.upload_folder(self, local_folder_path: str, bucket: str, key: str) -> None

Takes a local path, bucket, and key as arguments and uploads the folder to S3.

Parameters:

| Name              | Type | Description                              | Default  |
| ----------------- | ---- | ---------------------------------------- | -------- |
| local_folder_path | str  | Local path of the folder to be uploaded. | required |
| bucket            | str  | S3 bucket name.                          | required |
| key               | str  | S3 key name.                             | required |
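
A usage sketch of upload_folder with placeholder paths, assuming the `s3` client created earlier.

```python
# Upload the contents of a local folder under the key prefix data/
s3.upload_folder(
    local_folder_path="./exports",
    bucket="my-bucket",
    key="data",
)
```
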
alcedo_pdbc.datalake.S3.write_dataframe(df, bucket, key, extension='csv', pandas_args={}, polars_args={}) method descriptor

S3.write_dataframe(self, df, bucket: str, key: str, extension='csv', pandas_args={}, polars_args={}) -> None

Takes a DataFrame, bucket name, and file name as arguments and writes the DataFrame to S3.

Parameters:

| Name        | Type      | Description                                                                                   | Default  |
| ----------- | --------- | --------------------------------------------------------------------------------------------- | -------- |
| df          | DataFrame | DataFrame to be uploaded.                                                                     | required |
| bucket      | str       | S3 bucket name.                                                                               | required |
| key         | str       | File name with extension.                                                                     | required |
| extension   | str       | Extension of the file; inferred automatically from the key's file name. Defaults to 'csv'.   | 'csv'    |
| pandas_args | dict      | pandas writer arguments.                                                                      | {}       |
| polars_args | dict      | polars writer arguments.                                                                      | {}       |
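
A usage sketch of write_dataframe, assuming the `s3` client created earlier; the DataFrame and object key are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.7, 0.8, 0.9]})

# Write the DataFrame to s3://my-bucket/data/scores.parquet;
# the extension is normally inferred from the key's file name
s3.write_dataframe(
    df,
    bucket="my-bucket",
    key="data/scores.parquet",
    extension="parquet",
)
```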