🧰️ Google Cloud Storage
1. Overview

A data lake is a data storage approach built on clusters and distributed file systems that stores all of an enterprise's data in one place. The data lake is the raw-data zone: raw data (exact copies of source-system data) is transformed there into target data for tasks such as reporting, visual analytics, and machine learning. Data in a data lake includes structured data (relational database data), semi-structured data (CSV, XML, JSON, etc.), unstructured data (emails, documents, PDFs), and binary data (images, audio, video), forming a centralized store for data in every form.

The major data lake vendors are generally hyperscale public cloud providers such as Amazon AWS, Microsoft Azure, and Google Cloud Platform (GCP).

AILab-PDBC (Python DataBase Connectivity) is an efficient, flexible data interface (API) developed by the AI Lab 100 team of 数智教育发展(山东)有限公司.

The ailab100.pdbc.datalake classes handle data lake reads and writes and support mainstream data lakes such as MinIO, Amazon S3, Google GCS, and Microsoft Azure Blob.
This chapter describes how to use ailab100.pdbc.datalake.GCS to connect to Google Cloud Storage for reading, writing, and downloading.
Google Cloud Storage (GCS) is Google's storage platform. It provides high-performance object storage with excellent scalability, data availability, durability, and security. It lets you store objects and instantly access any amount of data from any storage class, integrate storage into applications through a single, unified API, and easily optimize for price and performance.
2. API Reference
alcedo_pdbc.datalake.GCS
The GCS class creates a GCS client object through which you can read, write, upload, and download data from Google Cloud Storage.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Automatically loaded from the config file (yaml) | required |
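A minimal usage sketch follows. The import path and the config keys shown (a hypothetical project id and service-account path) are assumptions, not the confirmed API; in practice the dict is loaded automatically from the YAML config file, so adapt the keys to your deployment.

```python
# Minimal sketch: construct a GCS client from a config dict.
# The import path and config keys below are assumptions.
from alcedo_pdbc.datalake import GCS

config = {
    "project": "my-gcp-project",                      # hypothetical key
    "credentials": "/path/to/service-account.json",   # hypothetical key
}

gcs = GCS(config=config)
```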
Functions
alcedo_pdbc.datalake.GCS.download_file(gcs_path=None, bucket=None, blob_name=None, path_to_download='.')
GCS.download_file(self, gcs_path: str = None, bucket: str = None, blob_name: str = None, path_to_download: str = '.')
Takes a GCS path or (bucket and blob name) as arguments and downloads the file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| gcs_path | str | GCS file path. Defaults to None. | None |
| bucket | str | GCS bucket name, if gcs_path is not provided. Defaults to None. | None |
| blob_name | str | GCS blob name, if gcs_path is not provided. Defaults to None. | None |
| path_to_download | str | Save location. Defaults to '.'. | '.' |
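For illustration, a sketch of both documented calling styles, assuming a `gcs` object constructed as in the earlier sketch; the bucket, blob, and local paths are placeholders.

```python
# Download by full GCS path (placeholder paths).
gcs.download_file(gcs_path="gs://my-bucket/data/sales.csv",
                  path_to_download="./data")

# Equivalent call using bucket and blob name instead of a full path.
gcs.download_file(bucket="my-bucket",
                  blob_name="data/sales.csv",
                  path_to_download="./data")
```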
alcedo_pdbc.datalake.GCS.download_folder(gcs_path=None, bucket=None, blob_path=None, local_path_to_download='.')
GCS.download_folder(self, gcs_path: str = None, bucket: str = None, blob_path: str = None, local_path_to_download: str = '.')
Takes a GCS path or (bucket and blob path) as arguments and downloads the folder to the local path.
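A hedged usage sketch (placeholder names, `gcs` as constructed in the earlier sketch):

```python
# Download every blob under a prefix into a local directory (placeholders).
gcs.download_folder(bucket="my-bucket",
                    blob_path="exports/2024/",
                    local_path_to_download="./exports")
```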
alcedo_pdbc.datalake.GCS.read_as_dataframe(gcs_path=None, bucket=None, blob_name=None, pandas_args={}, polars_args={}, extension='csv', return_type='pandas')
GCS.read_as_dataframe(self, gcs_path: str = None, bucket: str = None, blob_name: str = None, pandas_args: Dict = {}, polars_args: Dict = {}, extension='csv', return_type='pandas')
Takes a GCS path as argument and returns a DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| gcs_path | str | GCS path of the file to be loaded. To load multiple files, use gs://bucket/path/filename*; to load all files from a folder, use gs://bucket/folder/. | None |
| bucket | str | GCS bucket name | None |
| blob_name | str | File name with extension | None |
| pandas_args | dict | pandas arguments, such as encoding, etc. | {} |
| polars_args | dict | polars arguments passed to the reader | {} |
| extension | str | Extension of the files; inferred automatically from the gcs_path parameter. Defaults to 'csv'. | 'csv' |
| return_type | str | Which DataFrame type to return (pandas, polars, dask, etc.). Defaults to 'pandas'. | 'pandas' |
Returns:

| Name | Type | Description |
|---|---|---|
| DataFrame | ``Pandas``, ``Polars`` or ``Dask`` | A DataFrame of the type specified by return_type |
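An illustrative sketch using the parameters above (placeholder paths; that `pandas_args` is forwarded to the underlying pandas reader is an assumption):

```python
# Read a single CSV blob into a pandas DataFrame.
df = gcs.read_as_dataframe(gcs_path="gs://my-bucket/data/sales.csv",
                           pandas_args={"encoding": "utf-8"})

# Read all matching CSV files under a prefix and return a Polars DataFrame instead.
df_all = gcs.read_as_dataframe(gcs_path="gs://my-bucket/data/sales_*.csv",
                               return_type="polars")
```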
alcedo_pdbc.datalake.GCS.upload_file(source_file_path, bucket, blob_name)
GCS.upload_file(self, source_file_path: str, bucket: str, blob_name: str)
Takes a source file path, bucket, and blob name as arguments and uploads the file to GCS.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| source_file_path | str | Source file path | required |
| bucket | str | GCS bucket name | required |
| blob_name | str | Blob name (destination file path) | required |
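A short sketch with placeholder paths, following the signature above:

```python
# Upload a local file to gs://my-bucket/uploads/sales.csv (placeholders).
gcs.upload_file(source_file_path="./data/sales.csv",
                bucket="my-bucket",
                blob_name="uploads/sales.csv")
```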
alcedo_pdbc.datalake.GCS.upload_folder(local_folder_path, bucket, blob_path='')
GCS.upload_folder(self, local_folder_path: str, bucket: str, blob_path: str = '')
Takes a local folder path, bucket, and blob path as arguments and uploads the folder to GCS.
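A hedged sketch with placeholder paths:

```python
# Upload the contents of a local folder under a GCS prefix (placeholders).
gcs.upload_folder(local_folder_path="./exports",
                  bucket="my-bucket",
                  blob_path="exports/2024/")
```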
alcedo_pdbc.datalake.GCS.write_dataframe(df, bucket, blob_name, extension='csv', pandas_args={}, polars_args={})
GCS.write_dataframe(self, df, bucket, blob_name, extension='csv', pandas_args={}, polars_args={})
Takes a DataFrame, bucket name, and blob name as arguments and writes the DataFrame to GCS.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame which needs to be uploaded | required |
| bucket | str | GCS bucket name | required |
| blob_name | str | File name with extension | required |
| extension | str | Extension of the file; inferred automatically from the filename parameter. Defaults to 'csv'. | 'csv' |
| pandas_args | dict | pandas arguments passed to the writer | {} |
| polars_args | dict | polars arguments passed to the writer | {} |
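A closing sketch that writes an in-memory pandas DataFrame straight to GCS. The bucket and blob name are placeholders, and passing `{"index": False}` through `pandas_args` assumes those arguments are forwarded to `DataFrame.to_csv`.

```python
import pandas as pd

# Build a small DataFrame and write it to GCS as CSV (placeholder bucket/blob).
df = pd.DataFrame({"id": [1, 2], "amount": [10.5, 20.0]})
gcs.write_dataframe(df,
                    bucket="my-bucket",
                    blob_name="outputs/summary.csv",
                    extension="csv",
                    pandas_args={"index": False})  # assumed to be forwarded to to_csv
```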