
🧰️ Google Cloud Storage

1. Overview

A data lake is a data storage approach built on clusters and distributed file systems that stores all of an enterprise's data in one place. The data lake serves as the raw-data zone: exact copies of source-system data are transformed into target data for tasks such as reporting, visual analytics, and machine learning. A data lake holds structured data (relational database tables), semi-structured data (CSV, XML, JSON, etc.), unstructured data (email, documents, PDFs), and binary data (images, audio, video), forming a centralized store for data in every form.

The main data lake vendors are the hyperscale public cloud providers, such as Amazon AWS, Microsoft Azure, and Google Cloud Platform (GCP).

AILab-PDBC (Python DataBase Connectivity) is an efficient, flexible data access interface (API) developed by the AI Lab 100 team at 数智教育发展(山东)有限公司.

The ailab100.pdbc.datalake classes provide read and write access to data lakes, supporting mainstream object stores such as MinIO, Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage.

This chapter describes how to use ailab100.pdbc.datalake.GCS to connect to Google Cloud Storage for reading, writing, and downloading.

Google Cloud Storage (GCS) is Google's storage platform. It provides high-performance object storage with excellent scalability, data availability, durability, and security. It lets you store objects and immediately access any amount of data from any storage class, integrate storage into applications through a single unified API, and easily optimize for price and performance.

2. API Reference

alcedo_pdbc.datalake.GCS

The GCS class creates a GCS client object through which you can read, write, upload, and download data from Google Cloud Storage.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | dict | Automatically loaded from the config file (YAML) | required |
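The config dict is loaded from the project's YAML config file; the class is then constructed as `GCS(config)`. The exact keys are not documented in this reference, so the fragment below is a hypothetical sketch only — every key name is an assumption, not the library's confirmed schema:

```yaml
# Hypothetical GCS section of the config file; actual key names
# and structure are not specified in this reference.
gcs:
  project_id: my-gcp-project
  credentials_file: /path/to/service-account.json
  bucket: my-default-bucket
```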

Functions

alcedo_pdbc.datalake.GCS.download_file(gcs_path=None, bucket=None, blob_name=None, path_to_download='.') method descriptor

GCS.download_file(self, gcs_path: str = None, bucket: str = None, blob_name: str = None, path_to_download: str = '.')

Takes a GCS path, or a bucket and blob name, as arguments and downloads the file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| gcs_path | str | GCS file path. Defaults to None. | None |
| bucket | str | GCS bucket name, if gcs_path is not provided. Defaults to None. | None |
| blob_name | str | GCS blob name, if gcs_path is not provided. Defaults to None. | None |
| path_to_download | str | Local save location. Defaults to '.'. | '.' |
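The reference does not state exactly where the downloaded file lands. A minimal sketch of the likely path logic, under the assumption that download_file joins the save location with the blob's base name (the library's actual behavior may differ):

```python
import os

def local_target(blob_name: str, path_to_download: str = ".") -> str:
    """Guess the local file path download_file would write to.

    Assumption (not confirmed by the reference): the save location is
    joined with the base name of the blob.
    """
    return os.path.join(path_to_download, os.path.basename(blob_name))

# e.g. downloading blob reports/2024/sales.csv into ./data
print(local_target("reports/2024/sales.csv", "./data"))
```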
alcedo_pdbc.datalake.GCS.download_folder(gcs_path=None, bucket=None, blob_path=None, local_path_to_download='.') method descriptor

GCS.download_folder(self, gcs_path: str = None, bucket: str = None, blob_path: str = None, local_path_to_download: str = '.')

Takes a GCS folder path, or a bucket and blob path, as arguments and downloads the folder's contents to a local directory.

alcedo_pdbc.datalake.GCS.read_as_dataframe(gcs_path=None, bucket=None, blob_name=None, pandas_args={}, polars_args={}, extension='csv', return_type='pandas') method descriptor

GCS.read_as_dataframe(self, gcs_path: str = None, bucket: str = None, blob_name: str = None, pandas_args: Dict = {}, polars_args: Dict = {}, extension='csv', return_type='pandas')

Takes a GCS path as argument and returns a DataFrame.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| gcs_path | str | GCS path of the file to load. To load multiple files, use `gs://bucket/path/filename*`; to load all files in a folder, use `gs://bucket/folder/`. | None |
| bucket | str | GCS bucket name | None |
| blob_name | str | File name with extension | None |
| pandas_args | dict | pandas arguments, e.g. encoding | {} |
| polars_args | dict | Polars arguments, analogous to pandas_args | {} |
| extension | str | Extension of the files; inferred automatically from the gcs_path parameter. Defaults to 'csv'. | 'csv' |
| return_type | str | Which DataFrame type to return (pandas, polars, dask, etc.). Defaults to 'pandas'. | 'pandas' |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | `Pandas`, `Polars`, or `Dask` | The DataFrame type corresponding to the return_type parameter |
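The `gcs_path` form `gs://bucket/blob` and the separate `bucket`/`blob_name` form are interchangeable ways of naming the same object. A minimal sketch of how such a path decomposes, including the extension inference described above — this is illustrative only, not the library's actual parsing code:

```python
def split_gcs_path(gcs_path: str):
    """Split a gs://bucket/blob path into (bucket, blob_name, extension).

    Illustrative only; alcedo_pdbc's internal parsing may differ.
    """
    without_scheme = gcs_path.removeprefix("gs://")
    bucket, _, blob_name = without_scheme.partition("/")
    # Infer the extension from the blob name, as read_as_dataframe
    # does from its gcs_path parameter (defaulting to 'csv').
    extension = blob_name.rsplit(".", 1)[-1] if "." in blob_name else "csv"
    return bucket, blob_name, extension

print(split_gcs_path("gs://my-bucket/reports/2024/sales.csv"))
# → ('my-bucket', 'reports/2024/sales.csv', 'csv')
```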

alcedo_pdbc.datalake.GCS.upload_file(source_file_path, bucket, blob_name) method descriptor

GCS.upload_file(self, source_file_path: str, bucket: str, blob_name: str)

Takes a source file path, bucket, and blob name as arguments and uploads the file to GCS.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| source_file_path | str | Source file path | required |
| bucket | str | GCS bucket name | required |
| blob_name | str | Blob name (destination file path) | required |
alcedo_pdbc.datalake.GCS.upload_folder(local_folder_path, bucket, blob_path='') method descriptor

GCS.upload_folder(self, local_folder_path: str, bucket: str, blob_path: str = '')

Takes a local folder path, bucket, and optional blob path as arguments and uploads the folder's contents to GCS.
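Object stores have no real directories, so uploading a folder amounts to mapping each local file to a blob name. A sketch of that mapping, under the assumption that upload_folder preserves the folder's relative structure beneath blob_path (the library's actual naming scheme is not documented here):

```python
import os
import posixpath

def blob_names_for_folder(local_folder_path: str, blob_path: str = ""):
    """Map each file under local_folder_path to a GCS blob name.

    Assumption (not confirmed by the reference): relative paths are
    preserved under blob_path.
    """
    names = []
    for root, _dirs, files in os.walk(local_folder_path):
        for fname in sorted(files):
            rel = os.path.relpath(os.path.join(root, fname), local_folder_path)
            # Blob names always use forward slashes, regardless of OS.
            names.append(posixpath.join(blob_path, *rel.split(os.sep)))
    return names
```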

alcedo_pdbc.datalake.GCS.write_dataframe(df, bucket, blob_name, extension='csv', pandas_args={}, polars_args={}) method descriptor

GCS.write_dataframe(self, df, bucket, blob_name, extension='csv', pandas_args={}, polars_args={})

Takes a DataFrame, bucket name, and blob name as arguments and writes the DataFrame to GCS.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame to be uploaded | required |
| bucket | str | GCS bucket name | required |
| blob_name | str | File name with extension | required |
| extension | str | Extension of the files; inferred automatically from the file name (blob_name). | 'csv' |
| pandas_args | dict | pandas arguments forwarded to the writer | {} |
| polars_args | dict | Polars arguments forwarded to the writer | {} |
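Under the hood, writing a DataFrame to an object store means serializing it to bytes in the chosen format. A sketch of that serialization step, under the assumption that pandas_args is forwarded to the matching pandas writer — write_dataframe's real implementation and upload logic are not shown in this reference:

```python
import io

import pandas as pd

def serialize_dataframe(df: pd.DataFrame, extension: str = "csv", pandas_args: dict = None) -> bytes:
    """Serialize a DataFrame to bytes for upload (illustrative sketch only)."""
    pandas_args = pandas_args or {}
    if extension == "csv":
        # to_csv with no path returns a str; encode it for the object store.
        return df.to_csv(**pandas_args).encode("utf-8")
    if extension == "parquet":
        buf = io.BytesIO()  # to_parquet needs a binary buffer (requires pyarrow)
        df.to_parquet(buf, **pandas_args)
        return buf.getvalue()
    raise ValueError(f"unsupported extension: {extension!r}")

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(serialize_dataframe(df, "csv", {"index": False}).decode("utf-8"))
```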