Metadata-Version: 2.2
Name: preparser
Version: 2.0.8
Summary: A lightweight preparser that helps parse webpage content or fetch responses from URLs; supports Windows, macOS and Unix.
Home-page: https://github.com/BertramYe/preparser
Author: BertramYe
Author-email: bertramyerik@gmail.com
License: MIT
Keywords: preparser,parser,parse,crawl,webpage,html,api,requests,BeautifulSoup4,BeautifulSoup4,python3,windows,mac,linux
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Requires-Python: >=3.9.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: playwright
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary


# Description

`preparser` is a lightweight parser that pre-parses data from a `specified website URL or API`. It removes the repetitive code needed to request each `specified URL` and can `speed up the process with a thread pool`, so you only need to focus on your business logic once you have the response from the `specified webpage or API URLs`.

# Attention

Version 1.0.0 of this lightweight pre-parser could only pre-parse `static html` or `api` content. Since version 2.0.0 there is a new `html_dynamic` mode, which retrieves all of the content, including content generated by `JS` code.

```bash

python version >= 3.9 

```

# How to use

## install

```bash
$ pip install preparser
```



> GitHub resource ➡️ [Github Repos](https://github.com/BertramYe/preparser)

> Feel free to fork and modify the code. If you like this project, please star ⭐ it, uwu.

> PyPI: ➡️ [PyPI Publish](https://pypi.org/project/preparser/)

## parameters

Below are the parameters you can use to initialize the `PreParser` object from the `preparser` package:


|        Parameters      | Type                | Description                                               |
| ---------------------  | -----------------   |--------------------------------------------------------   |
| url_list               | list                | The list of URLs to parse from. Default is an empty list. |
| request_call_back_func | Callable or None    | A callback function that, depending on `parser_mode`, handles the `BeautifulSoup` object or the request `json` object. Return `None` to signal that your business processing failed; otherwise return any object that is not `None`. |
|  parser_mode           | `'html'`, `'api'` or `'html_dynamic'` | The pre-parsing mode. Default is `'html'`.<br/>  `html`: parse static HTML content and return a `BeautifulSoup` object. <br/> `api`: parse data from an API and return the `json` object. <br/> `html_dynamic`: parse the whole page's HTML content, including content generated by dynamic JS code, and return a `BeautifulSoup` object. <br/>  **You receive these objects in `request_call_back_func` when it is defined; otherwise read them from `PreParser(...).cached_request_datas`.** |
| cached_data | bool | Whether to cache the parsed data. Default is `False`. |
| start_threading | bool | Whether to use a thread pool for parsing the data. Default is `False`. |
| threading_mode | `'map'` or `'single'` | How tasks are distributed to the thread pool. Default is `single`. <br/>  `map`: use the thread pool's `map` function to distribute tasks. <br/> `single`: use the `submit` function to hand tasks to the thread pool one by one. |
| stop_when_task_failed | bool | Whether to stop when a request to a URL fails. Default is `True`. |
| threading_numbers | int | The maximum number of threads in the thread pool. Default is `3`. |
| checked_same_site | bool | Whether to add extra headers so that requests appear to come from the same site, to work around `CORS` blocking. Default is `True`. |
| html_dynamic_scope | list or None | Restricts parsing to a specific DOM scope of the page. Default is `None`, which stands for the whole page.<br />If set, the value must be a list of two items. <br/> 1. The first item is a tag <a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector">selector</a>. <br /> For example, 'div#main' means a div tag with 'id=main', and 'div.test' gets the first matched div tag with 'class=test'. Don't make the selector too complex or let it match multiple parent DOM nodes; otherwise the inner_html() cannot be read correctly or the request times out. The `request_call_back_func` then receives the BeautifulSoup object built from the inner_html of the tag this selector selects. <br /> 2. The second item must be one of the values below: <br />`attached`: wait for element to be present in DOM. <br />`detached`: wait for element to not be present in DOM. <br />`visible`: wait for element to have non-empty bounding box and no 'visibility:hidden'. Note that an element without any content or with 'display:none' has an empty bounding box and is not considered visible. <br /> `hidden`: wait for element to be either detached from DOM, or have an empty bounding box or 'visibility:hidden'. This is opposite to the 'visible' option. |
| ssl_certi_verified | bool | Whether to verify the SSL certificate when requesting data from URLs. Default is `True`, which verifies the certificate to keep requests safe. |
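
The difference between the `map` and `single` values of `threading_mode` can be illustrated with the standard library alone. The sketch below is not `preparser`'s implementation; it assumes the pool behaves like `concurrent.futures.ThreadPoolExecutor`, and `fetch`, `run_with_map` and `run_with_submit` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List


def fetch(url: str) -> str:
    """Stand-in for a real request; it only tags the URL so the sketch runs offline."""
    return f"response for {url}"


def run_with_map(urls: List[str], workers: int = 3) -> List[str]:
    # map(): hand the whole list to the pool; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


def run_with_submit(urls: List[str], workers: int = 3) -> Dict[str, str]:
    # submit(): queue the tasks one by one; each future is tracked individually.
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


urls = ["https://example.com/api/1", "https://example.com/api/2"]
mapped = run_with_map(urls)
submitted = run_with_submit(urls)
```

With `submit`, each task has its own future, so a failure can be handled per URL, which is one plausible reason to prefer `single` together with `stop_when_task_failed`.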

## example

```python

#  test.py
from typing import Optional, Union

from preparser import PreParser, BeautifulSoup, Json_Data, Filer


# Union/Optional are used instead of `X | Y` so the annotations also work on Python 3.9
def handle_preparser_result(
    url: str, preparser_object: Union[BeautifulSoup, Json_Data]
) -> Optional[Union[BeautifulSoup, Json_Data]]:
    # write your business logic here

    # attention:
    # the type of preparser_object depends on the `parser_mode` of the `PreParser`:
    #               'api' : preparser_object is a Json_Data
    #               'html' or 'html_dynamic' : preparser_object is a BeautifulSoup

    # ........

    # for the final return:
    # return None to mark the current result as failed; otherwise return any object that is not None.
    return preparser_object


if __name__ == "__main__":
    
    #  start the parser
    url_list = [
        'https://example.com/api/1',
        'https://example.com/api/2',
        # .....
    ]

    parser = PreParser(
        url_list=url_list,
        request_call_back_func=handle_preparser_result,
        parser_mode='api',    # set this to the mode you need: 'api', 'html' or 'html_dynamic'
        start_threading=True,
        threading_mode='single',
        cached_data=True,
        stop_when_task_failed=False,
        threading_numbers=3,
        checked_same_site=True
    )

    #  start parsing
    parser.start_parse()

    # when all tasks have finished, you can get all of the task results like below:
    all_result = parser.cached_request_datas

    # if you want to terminate, just execute the function below
    # parser.stop_parse()

    # you can also use the Filer to save the final result above,
    # and then find the data in `result/test.json`
    filer = Filer('json')
    filer.write_data_into_file('result/test',[all_result])

```
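
The example above uses the `api` mode. For the `html_dynamic` mode with an `html_dynamic_scope`, the arguments could be assembled roughly as follows. This is a minimal sketch under stated assumptions: `build_dynamic_kwargs` and `extract_text` are hypothetical helpers, not part of `preparser`, and only the parameter names come from the table in this document. The block runs without `preparser` installed, so the actual `PreParser` call is left as a comment.

```python
from typing import Any, Callable, Dict, List, Optional

# The four wait states listed for html_dynamic_scope in the parameter table.
ALLOWED_STATES = ("attached", "detached", "visible", "hidden")


def extract_text(url: str, soup: Any) -> Optional[str]:
    """Hypothetical callback: return the scoped text, or None to mark the URL as failed."""
    text = soup.get_text(strip=True)  # get_text() is a BeautifulSoup method
    return text if text else None


def build_dynamic_kwargs(
    url_list: List[str],
    selector: str,
    state: str,
    callback: Callable[[str, Any], Optional[Any]],
) -> Dict[str, Any]:
    """Hypothetical helper: assemble PreParser keyword arguments for html_dynamic mode."""
    if state not in ALLOWED_STATES:
        raise ValueError(f"state must be one of {ALLOWED_STATES}, got {state!r}")
    return {
        "url_list": url_list,
        "request_call_back_func": callback,
        "parser_mode": "html_dynamic",
        "html_dynamic_scope": [selector, state],  # [selector, wait state]
        "cached_data": True,
    }


kwargs = build_dynamic_kwargs(["https://example.com"], "div#main", "attached", extract_text)
# With preparser installed, the dictionary would be used as:
#     parser = PreParser(**kwargs)
#     parser.start_parse()
#     results = parser.cached_request_datas
```

Validating the wait state before building the parser is a design choice of this sketch; it surfaces a typo in the second list item early instead of waiting for the page load to fail.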


# Get Help

Get help ➡️ [Github issue](https://github.com/BertramYe/preparser/issues)


# Update logs

* `version 2.0.8 `: add the `read_datas_from_file` function to `Filer`, to help read data from files of the specified type.

* `version 2.0.7 `: add the `ssl_certi_verified` parameter to control whether errors caused by SSL certificate verification are ignored when making requests.

* `version 2.0.6 `: add the `html_dynamic_scope` parameter to let users specify the scope of the dynamic parse, which can speed up pre-parsing when the `parser_mode` is `html_dynamic`. Also move the additional tools into the `ToolsHelper` package.

* `version 2.0.5 `: move the browser core installation for dynamic mode from setup into the package call.

* `version 2.0.4 `: test the installation command.

* `version 2.0.3 `: optimise the `error` alert for `html_dynamic`.

* `version 2.0.2 `: correct the README doc for `parser_mode`.

* `version 2.0.1 `: update the README doc.

* `version 2.0.0 `: add the new `html_dynamic` `parser_mode`, which helps pre-parse all of the content from `html`, even content generated by `JS` code.

* `version 1.0.0 `: basic version; only pre-parses static `html` and `api` content.
