Welcome to pyobjectify

Bridge the gap between different file formats and streamline access to ingested data via Python objects.

Overview

Open data abounds. For example, NYC Open Data has over 3,000 datasets spanning over 97 agencies in New York City. This data comes in many different formats, including CSV, JSON, XML, XLS/XLSX, KML, KMZ, Shapefile, GeoJSON, and more.

Importing and analyzing the data in Python involves sending a request to download the raw data, then converting it into a Python object whose methods can be used to parse its contents. However, this process differs from one data format to the next.

This project aims to streamline this process and bridge the gap between the different file formats, so the end user can get started on data analytics with a single function call.

Install from pip

pip install pyobjectify

Quick start

import pyobjectify
import pandas as pd

json_dict = pyobjectify.from_url("https://bit.ly/42KCUSv")  # URL holds JSON data, returns data in dict
json_df = pyobjectify.from_url("https://bit.ly/42KCUSv", pd.DataFrame)  # User-specified output data type

Examples

The main entry point for most users is from_url(). It takes two parameters: a URL to a resource and, optionally, the desired data type of the output.

You can use it as in the Quick start example above:

import pyobjectify
import pandas as pd

json_dict = pyobjectify.from_url("https://bit.ly/42KCUSv")  # URL holds JSON data, returns data in dict
json_df = pyobjectify.from_url("https://bit.ly/42KCUSv", pd.DataFrame)  # User-specified output data type

Note that if the resource cannot be converted to the user-specified output data type, or that conversion is not implemented, a TypeError is raised. The supported resource (input) data types and supported conversions are delineated above.
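Because an unsupported conversion surfaces as a TypeError, a caller can catch it and fall back to the default output type. The helper below is a minimal sketch of that pattern. load_with_fallback is a hypothetical name, not part of pyobjectify, and the loader is passed in as a parameter so the snippet runs without network access; in practice you would pass pyobjectify.from_url as the loader.

```python
def load_with_fallback(loader, url, out_type):
    """Try the requested output type; fall back to the loader's default.

    `loader` is assumed to behave like pyobjectify.from_url: it accepts
    (url) or (url, out_type) and raises TypeError when the requested
    conversion is unsupported.
    """
    try:
        return loader(url, out_type)
    except TypeError:
        # The requested conversion is unsupported, so let the loader
        # choose its default output type instead.
        return loader(url)
```

Note that a TypeError raised by the second call is not caught, so a resource that cannot be converted at all still fails loudly.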

Subroutines

In addition to the main from_url() method, which wraps the whole library in a single call, there are subroutines that are exposed publicly so the user can control the more granular operations:

  • url_to_connectivity(url)

  • retrieve_resource(url, connectivity)

  • get_resource_types(resource)

  • get_conversions(in_types, out_type=None)

  • convert(resource, conversions)

In fact, the from_url() method runs all of these subroutines, in that order.
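The order above can be sketched as a short function. This illustrates the documented sequence only and is not pyobjectify's actual source: from_url_sketch and its steps parameter are hypothetical, and steps stands for any object that exposes the five subroutines as attributes.

```python
def from_url_sketch(url, steps, out_type=None):
    """Chain the five subroutines in the documented order.

    `steps` is any object exposing the five subroutine names
    (an assumption made so the sketch can be run with stand-ins).
    """
    connectivity = steps.url_to_connectivity(url)
    resource = steps.retrieve_resource(url, connectivity)
    in_types = steps.get_resource_types(resource)
    conversions = steps.get_conversions(in_types, out_type)
    return steps.convert(resource, conversions)
```

Each step consumes only the output of the steps before it, which is why any one of them can be swapped out or inspected on its own.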

url_to_connectivity(url)

This function returns the connectivity type of the resource at the given URL.

Example:

connectivity = url_to_connectivity("https://bit.ly/42KCUSv")

print(connectivity)
"""
<Connectivity.ONLINE_STATIC: 1>
"""

Connectivity is an enumeration of the supported file connectivity types: ONLINE_STATIC and LOCAL. (At the moment, a data stream from the Internet is not supported.)

A TypeError will be raised if the connectivity type is not supported.
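One way such a classification could work is to look at the URL scheme. The sketch below is illustrative and is not the library's implementation: SketchConnectivity and classify_url are hypothetical names, the value of LOCAL is an assumption (only ONLINE_STATIC = 1 appears in the output above), and the scheme rules are a guess at a reasonable heuristic.

```python
from enum import Enum
from urllib.parse import urlparse


class SketchConnectivity(Enum):
    ONLINE_STATIC = 1  # value matches the repr shown above
    LOCAL = 2          # assumed value; the source does not show it


def classify_url(url):
    """Guess the connectivity type from the URL scheme (illustrative)."""
    scheme = urlparse(url).scheme.lower()
    if scheme in ("http", "https"):
        return SketchConnectivity.ONLINE_STATIC
    if scheme in ("", "file"):
        # No scheme (a plain path) or file:// is treated as a local file.
        return SketchConnectivity.LOCAL
    raise TypeError(f"Unsupported connectivity for URL: {url!r}")
```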

retrieve_resource(url, connectivity)

This function is used to retrieve the resource at the URL, which has the specified connectivity type.

A TypeError will be raised if the resource connectivity type is not supported.

Example:

url = "https://bit.ly/42KCUSv"
connectivity = url_to_connectivity(url)  
# <Connectivity.ONLINE_STATIC: 1>
resource = retrieve_resource(url, connectivity)

print(resource)
"""
<__main__.Resource object at 0x104be6fd0>
"""

The Resource class stores metadata about the resource: its URL, the connectivity type, the HTTP response, and the response body as plain text.

get_resource_types(resource)

This function is used to get a list of the possible input types of the resource. Heuristics are used to determine possible data types.

Example:

url = "https://bit.ly/42KCUSv"
connectivity = url_to_connectivity(url)  
# <Connectivity.ONLINE_STATIC: 1>
resource = retrieve_resource(url, connectivity)
# <__main__.Resource object at 0x104be6fd0>
in_types = get_resource_types(resource)

print(in_types)
"""
[<InputType.JSON: 1>]
"""

InputType is an enumeration of the supported input data types. If the input type cannot be determined, a TypeError will be raised.
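To give a feel for what a type heuristic can look like, here is a small sketch that tests plain text against two formats. It is not the library's detection logic: SketchInputType, looks_like_csv, and guess_input_types are hypothetical names, the CSV enum value is an assumption, and the CSV rule (every row has the same number of fields, at least two) is deliberately simplistic.

```python
import csv
import io
import json
from enum import Enum


class SketchInputType(Enum):
    JSON = 1  # value matches the repr shown above
    CSV = 2   # assumed value, for illustration only


def looks_like_csv(text):
    """True if every non-empty row has the same field count (>= 2)."""
    try:
        rows = [row for row in csv.reader(io.StringIO(text)) if row]
    except csv.Error:
        return False
    if not rows or len(rows[0]) < 2:
        return False
    return all(len(row) == len(rows[0]) for row in rows)


def guess_input_types(text):
    """Return every input type the text plausibly matches."""
    candidates = []
    try:
        json.loads(text)
        candidates.append(SketchInputType.JSON)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        pass
    if looks_like_csv(text):
        candidates.append(SketchInputType.CSV)
    if not candidates:
        raise TypeError("Could not determine the input type")
    return candidates
```

Because the checks are independent, a single resource can match more than one type, which is why the result is a list rather than a single value.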

get_conversions(in_types, out_type=None)

This function is used to get a list of the possible conversions to output data types, given the list of the probable input data types of the resource. If there are no possible conversions, a TypeError is raised.

Example:

url = "https://bit.ly/42KCUSv"
connectivity = url_to_connectivity(url)  
# <Connectivity.ONLINE_STATIC: 1>
resource = retrieve_resource(url, connectivity)
# <__main__.Resource object at 0x104be6fd0>
in_types = get_resource_types(resource)
# [<InputType.JSON: 1>]
conversions = get_conversions(in_types)

print(conversions)
"""
[
   (<InputType.JSON: 1>, <class 'dict'>), 
   (<InputType.JSON: 1>, <class 'list'>), 
   (<InputType.JSON: 1>, <class 'pandas.core.frame.DataFrame'>)
]
"""

This function returns a list of (in, out) conversion tuples. Since the only probable input data type was determined to be JSON, the three supported conversions are to a Python dict, a Python list, or a pandas DataFrame.
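A lookup table is one simple way to produce such a list. The sketch below is an assumption-laden illustration, not the library's code: SKETCH_TABLE and get_conversions_sketch are hypothetical names, input types are plain strings, the DataFrame target is left out to keep the snippet dependency-free, and out_type is assumed to narrow the result to that single output type.

```python
# Hypothetical table of supported (input type -> output types) pairs.
SKETCH_TABLE = {
    "JSON": [dict, list],
}


def get_conversions_sketch(in_types, out_type=None, table=SKETCH_TABLE):
    """List (in, out) pairs, optionally restricted to one output type."""
    pairs = [
        (in_type, out)
        for in_type in in_types
        for out in table.get(in_type, [])
        if out_type is None or out is out_type
    ]
    if not pairs:
        raise TypeError("No supported conversions for the given types")
    return pairs
```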

convert(resource, conversions)

This function is used to convert the resource data through the list of possible conversions. The first successful conversion from the probable resource type to an output data type is returned.

If all conversions are unsuccessful, a TypeError is raised.
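The "first successful conversion wins" behavior can be sketched as a loop over candidate converters. This is an illustration under stated assumptions rather than the library's implementation: convert_sketch, to_dict, and to_list are hypothetical names, and the converters operate on raw JSON text instead of a Resource object.

```python
import json


def to_dict(text):
    """Parse JSON text and require a top-level object."""
    value = json.loads(text)
    if not isinstance(value, dict):
        raise ValueError("top-level JSON value is not an object")
    return value


def to_list(text):
    """Parse JSON text and require a top-level array."""
    value = json.loads(text)
    if not isinstance(value, list):
        raise ValueError("top-level JSON value is not an array")
    return value


def convert_sketch(text, converters):
    """Return the result of the first converter that succeeds."""
    errors = []
    for converter in converters:
        try:
            return converter(text)
        except (ValueError, TypeError) as exc:
            # Remember why this attempt failed, then try the next one.
            errors.append(f"{converter.__name__}: {exc}")
    raise TypeError("No conversion succeeded: " + "; ".join(errors))
```

The order of the converters matters: with [to_dict, to_list], a JSON object is returned as a dict, and a JSON array falls through to the list converter.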

Example:

url = "https://bit.ly/42KCUSv"
connectivity = url_to_connectivity(url)  
# <Connectivity.ONLINE_STATIC: 1>
resource = retrieve_resource(url, connectivity)
# <__main__.Resource object at 0x104be6fd0>
in_types = get_resource_types(resource)
# [<InputType.JSON: 1>]
conversions = get_conversions(in_types)
# [(<InputType.JSON: 1>, <class 'dict'>), (<InputType.JSON: 1>, <class 'list'>), (<InputType.JSON: 1>, <class 'pandas.core.frame.DataFrame'>)]
output = convert(resource, conversions)

print(output)
"""
{'data': [{'condition': 'Clear sky', ...
"""

print(type(output))
"""
<class 'dict'>
"""

Note that this whole sequence of subroutines is equivalent to a single call to from_url("https://bit.ly/42KCUSv"). However, as shown, the end user can customize the inner workings by calling the exposed subroutines directly.