Metadata-Version: 2.1
Name: sleepyjson
Version: 0.0.3
Summary: Read from JSON files without having to keep everything in memory
Home-page: https://github.com/jdferreira/sleepyjson
Author: João D. Ferreira
Author-email: jotomicron@gmail.com
License: UNKNOWN
Description: # `sleepyjson`
        
        In some situations, particularly in big data scenarios, it is necessary to extract information from a JSON file without needing to read the full content into memory. For an example, see the "Example" section below.
        
        `sleepyjson` provides a mechanism to deal with this scenario, where the JSON file is only parsed until the necessary information is found, and only that data is kept in memory.
        
        Although the package provides ways to handle random access to the contents of the file, random access runs in time linear in the size of the file. In fact, the whole idea of the package is to support memory-lightweight **sequential processing** of the JSON file.
        
        # Example
        
        Imagine you have a 10GB JSON file, where the top value is an array and the various items follow a predictable structure, as illustrated in the snippet below (pretend that the top level array contains millions of items and the `snippets` key contains large arrays with potentially long strings). Imagine as well that you want to extract the identifiers of the items that are dated from January of any year.
        
        ```json
        [
          {
            "identifier": "AX1999",
            "url": "http://www.example.org/url-path-with-a-slug",
            "date": {
              "year": 2020,
              "month": 1,
              "day": 1
            },
            "snippets": [
              "A happy snippet of text found in the URL",
              "Another snippet of text, this time sad",
              "Yet another",
              "And potentially many more"
            ]
          },
          // ...
        ]
        ```
        
        Traditionally (with the standard library `json` package), you would need to read the full dataset into memory.
        
        ```py
        import json
        
        with open('data.json') as f:
          data = json.load(f)
        
        identifiers = [
          item['identifier']
          for item in data
          if item['date']['month'] == 1
        ]
        ```
        
        Because you're reading the full file contents into the `data` variable, the memory consumption for this snippet is quite high.
        
        With `sleepyjson`, you can keep memory usage low and still achieve the same result:
        
        ```py
        import sleepyjson
        
        with open('data.json') as f:
          reader = sleepyjson.Reader(f)
        
          identifiers = [
            item['identifier'].value()
            for item in reader
            if item['date']['month'].value() == 1
          ]
        ```
        
        Notice that while the memory consumption is low, time complexity is linear for most purposes. If you want to get to a value near the end of a JSON file, the file must be parsed all the way up to the position you need to access. Additionally, because this is a pure python implementation, parsing is slow (I *may* change the parsing mechanism in the future to a compiled process to accelerate this).
        
        # Comparison with `json`
        
        As you can spot in the previous snippet of code, `sleepyjson` requires you to keep the file open while you are moving within the JSON contents. This is because no content is moved into memory unless the code does so explicitly.
        
        Additionally, the contents of a value must be explicitly requested with the `.value()` method. *Note*: I want to change this method to something more explicit, like `materialize`, to convey the meaning that we are not simply getting the value, but actually parsing and building a JSON value, which might be costly if the value is big.
        
        The `sleepyjson.Reader` class takes a file-like object, but doesn't read its contents until requested to. You can move around the file using iteration and indexation.
        
        Also, `sleepyjson.Reader` can read "JSON streams" in addition to regular JSON files. JSON streams are files that contain JSON values in succession. The reader can navigate within these files using the `.next()` method.
        
        ```py
        # Assuming file `data.json` contains
        # ["an", "array"] {"an": "object"}
        
        from sleepyjson import Reader
        
        with open('data.json') as f:
          reader = Reader(f)
        
          print(reader.value())  # ['an', 'array']
          reader.next()
          print(reader.value())  # {'an': 'object'}
        ```
        
        # Partially valid JSON
        
        If the information you need does not require reading the file to the end, `sleepyjson` parses only the necessary contents, which means that the file does not need to be completely valid.
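        
        The principle that a parser which stops early never sees errors further along can be tried out in miniature with the standard library. The sketch below is only an analogy: it uses `json.JSONDecoder.raw_decode` rather than `sleepyjson`, and the sample text is made up. The first value is decoded even though what follows it is not valid JSON.
        
        ```python
        import json
        
        # A valid object followed by content that is not valid JSON.
        text = '{"id": "AX1999"} <<< truncated garbage'
        
        # raw_decode parses one value from the start of the string and reports
        # where it stopped; it never inspects the rest.
        value, end = json.JSONDecoder().raw_decode(text)
        print(value)       # {'id': 'AX1999'}
        print(text[end:])  # the remainder that was never parsed
        
        # Parsing the whole text, on the other hand, fails.
        try:
            json.loads(text)
            whole_text_is_valid = True
        except json.JSONDecodeError:
            whole_text_is_valid = False
        print(whole_text_is_valid)  # False
        ```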
        
        # Comments, trailing commas
        
        Even though python's `json` package accepts neither comments nor trailing commas, some popular packages elsewhere do. To support reading this "non-standard" data format, `sleepyjson` understands double-slash comments and ignores trailing commas. So the following would be a valid JSON file from the point of view of this package:
        
        ```json
        {
          // Comment
          "powers of two": [
            1,
            2,
            4,
            8,
            16,
            32,
          ],
        }
        ```
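        
        For contrast, here is a small check of how the standard library reacts to the same two features. It only uses `json` (not `sleepyjson`), and the two sample strings are invented for the illustration.
        
        ```python
        import json
        
        samples = {
            "comment": '{\n  // Comment\n  "a": 1\n}',
            "trailing comma": "[1, 2, 4,]",
        }
        
        # Record, for each sample, whether the standard library rejects it.
        rejected = {}
        for name, text in samples.items():
            try:
                json.loads(text)
                rejected[name] = False
            except json.JSONDecodeError:
                rejected[name] = True
        
        print(rejected)  # {'comment': True, 'trailing comma': True}
        ```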
        
        # Simple documentation
        
        The API surface of this package is simple, providing three classes:
        - `Node`
        - `NodeType`
        - `Reader`
        
        While you can produce instances of the `Node` class, I recommend you only instantiate `Reader` directly. `NodeType` is an enumeration class that represents the possible JSON value types.
        
        ## The `Node` class
        
        This class represents a value in the JSON file. It deliberately does not contain a full representation of the value, particularly for strings, arrays and objects, because doing so would defeat the purpose of the package. It does, however, offer mechanisms to access those values, by allowing iteration over arrays and objects, and allowing (but not requiring) the construction, in memory, of its contents.
        
        In the following examples, we assume `node` points to the JSON object represented below:
        ```json
        {
          "a": [0, 1.337e3],
          "b": "string",
          "c": [true, false, null]
        }
        ```
        
        In general, some operations are only valid for some types (namely indexing, iterating, etc.). If the corresponding methods are called on a node of an incorrect type, a `ValueError` is raised.
        
        ### The `Node.type` attribute
        
        Returns an instance of `NodeType` that represents the type of JSON value under this node. Possible values are:
        - `NodeType.OBJECT`
        - `NodeType.ARRAY`
        - `NodeType.NUMBER`
        - `NodeType.STRING`
        - `NodeType.TRUE`
        - `NodeType.FALSE`
        - `NodeType.NULL`
        
        ### The `Node.value` method
        
        Builds and returns the value of this node.
        
        - If the node is a JSON true, false or null literal, it returns `True`, `False` or `None` respectively.
        - If the node is a number, it parses and returns the number (returns an `int` if no decimal and no exponent is provided, `float` otherwise).
        - If the node is a string, it parses the string, unescaping any escaped characters.
        - If the node is an array, it returns a python list.
        - If the node is an object, it returns a python dict.
        
        The inner values of arrays and objects are recursively built with the `.value` method as well.
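        
        The `int` versus `float` rule described above for numbers matches what the standard library `json` module does, so it can be tried out without a large file. This sketch uses only `json`, not `sleepyjson`, and the literals are arbitrary examples.
        
        ```python
        import json
        
        literals = ["0", "1337", "-5", "2.0", "1e3", "1.337e3"]
        
        # Map each JSON number literal to the name of the python type it parses to.
        kinds = {literal: type(json.loads(literal)).__name__ for literal in literals}
        
        for literal, kind in kinds.items():
            print(f"{literal:>8} -> {kind}")
        ```
        
        Only the literals without a decimal part and without an exponent come back as `int`; note that `1e3` is a `float` even though its value is a whole number.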
        
        ### The `Node.__getitem__` method (`node[i]`)
        
        For arrays and objects, returns a `Node` that represents the requested item. For arrays, you can index with integers. Negative values are allowed, but this requires parsing the entire array to determine the length of the array. For objects, you can index with strings. Indexing parses the node only until the correct item is found (except for indexing arrays with a negative value). If the item is not found, an `IndexError` is raised (for arrays) or a `KeyError` is raised (for objects).
        
        ```py
        node['a'].value() # [0, 1337.0]
        node['c'][0].value() # True
        node['c'][-1].value() # None
        
        
        node['x'] # raises KeyError
        ```
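        
        Why a negative index costs a full pass can be illustrated with any forward-only source. The sketch below is a generic illustration using plain python iterators, not `sleepyjson` internals; `logged` is a hypothetical helper that records which items were actually read.
        
        ```python
        from collections import deque
        from itertools import islice
        
        
        def logged(items, log):
            """Yield items one by one, recording each item that is read."""
            for item in items:
                log.append(item)
                yield item
        
        
        data = [10, 20, 30, 40, 50]
        
        # Index 1: reading stops as soon as the item is reached.
        read_for_index = []
        second = next(islice(logged(data, read_for_index), 1, None))
        print(second, read_for_index)  # 20 [10, 20]
        
        # Index -1: the end is not known in advance, so everything is read.
        read_for_last = []
        last = deque(logged(data, read_for_last), maxlen=1)[0]
        print(last, read_for_last)  # 50 [10, 20, 30, 40, 50]
        ```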
        
        ### The `Node.__len__` method (`len(node)`)
        
        For arrays and objects, returns the length of the value. Determining the length parses the value but doesn't construct the items, which means it is easy on memory.
        
        ```py
        len(node) # 3
        len(node['a']) # 2
        len(node['b']) # raises ValueError; you cannot determine the length of a string
        ```
        
        ### The `Node.__iter__` method (`for i in node`)
        
        Iterates over the items in an array, or over the keys in an object. This iterates in the order the values appear in the file. Also see `Node.items`, to iterate over the keys *and* values of a JSON object.
        
        ```py
        list(node) # ['a', 'b', 'c']
        [i.value() > 0 for i in node['a']] # [False, True]
        ```
        
        ### The `Node.items` method
        
        Iterates over the items of a JSON object. The iterator returned from this method sequentially produces pairs of type `(str, Node)`, where the first item is the key and the second item is the node representing the value associated with that key. The iterator respects the order in the file.
        
        ### The `Node.__contains__` method (`x in node`)
        
        This defines the `in` operator. `x in node` is `True` if:
        - `node` represents a JSON array and one of its inner values is equal to `x`
        - `node` represents a JSON object and one of its keys is equal to `x`
        
        ```py
        'a' in node # True
        0 in node['a'] # True
        ```
        
        ### The `Node.is_*` methods
        
        There are several of these methods, each testing the type of value the node points to:
        
        - `Node.is_object`: Determines whether the node points to an object
        - `Node.is_array`: Determines whether the node points to an array
        - `Node.is_string`: Determines whether the node points to a string
        - `Node.is_number`: Determines whether the node points to a number
        - `Node.is_true`: Determines whether the node points to the `true` literal
        - `Node.is_false`: Determines whether the node points to the `false` literal
        - `Node.is_boolean`: Determines whether the node points to a boolean (`true` or `false`)
        - `Node.is_null`: Determines whether the node points to the `null` literal
        
        ### A note on key uniqueness
        
        `sleepyjson` does **not** make an effort to validate that keys on objects are unique. This means that iterating over the keys of an object can produce the same key more than once; however, retrieving the actual value of a JSON object preserves only one of those key-value pairs (since the returned object is actually a python dictionary).
        
        Additionally, retrieving an item from an object stops when the key is *first* found in the file, whereas building the python dictionary likely preserves the *last* value associated with the key.
        
        As such, when a key is repeated in a JSON object, the following can happen:
        - `len(node) > len(node.value())`
        - `node[key].value() != node.value()[key]`
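        
        The dictionary side of this is easy to check with the standard library alone. In this sketch (plain `json`, with a made-up object), `object_pairs_hook=list` exposes every key-value pair in file order, while the default dictionary keeps only the last value of the repeated key.
        
        ```python
        import json
        
        text = '{"k": 1, "k": 2}'
        
        # Default behaviour: a dict, so the repeated key collapses to one entry.
        as_dict = json.loads(text)
        
        # Keep every pair as it appears in the text instead of building a dict.
        pairs = json.loads(text, object_pairs_hook=list)
        
        print(as_dict)  # {'k': 2}
        print(pairs)    # [('k', 1), ('k', 2)]
        
        # The first value found for the key differs from the one the dict kept.
        first_value = pairs[0][1]
        print(first_value != as_dict["k"])  # True
        ```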
        
        
        ## The `Reader` class
        
        In the following examples, we assume `reader` to be constructed from a file whose contents are:
        ```json
        {
          "a": [1, 2, 3]
        }
        true
        [null, false, true]
        ```
        
        ### The `Reader.__init__` constructor
        
        This class constructor takes a file-like object whose contents are in the JSON format. The file should contain a JSON value or a sequence of JSON values (à la JSON streams). It can also receive multiple files.
        
        ### The `Reader.node` attribute
        
        Returns the node that is currently being read in the JSON stream. As a convenience, you can access the fields and methods of this node by calling them directly on the reader:
        
        ```py
        reader.node.value() # {'a': [1, 2, 3]}
        reader.value() # {'a': [1, 2, 3]}
        ```
        
        ### The `Reader.__len__` method (`len(reader)`)
        
        Returns the length of the current node. Equivalent to `len(reader.node)`.
        
        ### The `Reader.__iter__` method (`for i in reader`)
        
        Iterates over the current node. Equivalent to `iter(reader.node)`.
        
        ### The `Reader.__getattr__` method (`reader.*`)
        
        This method gets the requested attribute from `reader.node`, thus ensuring that the reader behaves in most ways like the node it is currently reading. Read the `Node` documentation to know more about this.
        
        ### The `Reader.__getitem__` method (`reader[i]`)
        
        This method implements random access to the contents of the current node. Equivalent to `reader.node[i]`. See the documentation for the `Node` class.
        
        ```py
        reader['a'].value() # [1, 2, 3]
        len(reader['a']) # 3
        ```
        
        ### The `Reader.next` method
        
        Jumps to the next value on the JSON stream. Notice that if multiple files have been given in the constructor, this is the way to access the next files. There is no way to jump back to a previous value on the stream.
        
        ```py
        reader.value() # {'a': [1, 2, 3]}
        reader.next()
        reader.value() # True
        reader.next()
        reader.value() # [None, False, True]
        ```
        
        If there are no more nodes in the current file and no more files to process, this method raises a `StopIteration` exception.
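        
        For comparison, a forward-only walk over a JSON stream can be imitated with the standard library. This is a sketch and not how `sleepyjson` works internally: `iter_values` is a hypothetical helper, and unlike `Reader` it needs the whole text in memory. It does follow the same protocol, including `StopIteration` once the stream is exhausted.
        
        ```python
        import json
        
        
        def iter_values(text):
            """Yield the JSON values found one after another in `text`."""
            decoder = json.JSONDecoder()
            position = 0
            while True:
                # raw_decode does not skip leading whitespace, so do it here.
                while position < len(text) and text[position].isspace():
                    position += 1
                if position == len(text):
                    return
                value, position = decoder.raw_decode(text, position)
                yield value
        
        
        stream = iter_values('{"a": [1, 2, 3]}\ntrue\n[null, false, true]\n')
        
        print(next(stream))  # {'a': [1, 2, 3]}
        print(next(stream))  # True
        print(next(stream))  # [None, False, True]
        
        # Asking for one more value than the stream holds raises StopIteration.
        try:
            next(stream)
            exhausted = False
        except StopIteration:
            exhausted = True
        print(exhausted)  # True
        ```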
        
        ### The `Reader.jump` method
        
        Calls the `.next()` method the given number of times, which must be non-negative.
        
        ```py
        reader.value() # {'a': [1, 2, 3]}
        reader.jump(2)
        reader.value() # [None, False, True]
        ```
        
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
