Metadata-Version: 2.4
Name: tubecensus
Version: 1.0.0
Summary: Sample YouTube channels and retrieve their historical Wayback Machine metadata
Project-URL: Homepage, https://github.com/blast-cu/tubecensus
Project-URL: Repository, https://github.com/blast-cu/tubecensus
Project-URL: Issues, https://github.com/blast-cu/tubecensus/issues
Author-email: Chloe Eggleston <chloe.eggleston@colorado.edu>
License: MIT License
        
        Copyright (c) 2026 Boulder Language and Social Technologies
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: internet archive,sampling,social media,wayback,youtube
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Requires-Dist: beautifulsoup4>=4.14.3
Requires-Dist: cdx-toolkit>=0.9.38
Requires-Dist: huggingface-hub>=1.12.1
Requires-Dist: jsonpath-ng>=1.8.0
Requires-Dist: msgpack>=1.1.2
Requires-Dist: pandas>=1.5
Requires-Dist: rocksdict>=0.3.29
Requires-Dist: tqdm
Description-Content-Type: text/markdown

# TubeCensus

A Python library for sampling YouTube channels and retrieving their historical Wayback Machine metadata.

## Installation
* Requirements: ~20GB of storage.
    * Data is stored in `~/.tubecensus` by default; this can be overridden with `TubeCensus(data_dir=...)` or the `TUBECENSUS_DIR` environment variable.

* `pip install tubecensus`
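
The storage location options above can be pictured with a minimal sketch. The helper name `resolve_data_dir` is hypothetical, and the precedence shown (explicit argument, then `TUBECENSUS_DIR`, then `~/.tubecensus`) is an assumption rather than the library's documented behavior.

```python
import os
from pathlib import Path
from typing import Optional, Union


def resolve_data_dir(data_dir: Optional[Union[str, Path]] = None) -> Path:
    """Hypothetical helper: choose where the ~20GB of data would live.

    Assumed precedence (not confirmed by the library):
    explicit argument > TUBECENSUS_DIR environment variable > ~/.tubecensus
    """
    if data_dir is not None:
        return Path(data_dir).expanduser()
    env_dir = os.environ.get("TUBECENSUS_DIR")
    if env_dir:
        return Path(env_dir).expanduser()
    return Path.home() / ".tubecensus"
```

Setting `TUBECENSUS_DIR` is convenient when the home directory sits on a small disk, since it avoids passing `data_dir` in every script.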

## Features

### sample(n, by={"usernames","ids","customs", "handles"})
- Sample YouTube channels from the URLs collected from the Wayback Machine indices.
- The current version includes unique URLs up to 2023, covering the four YouTube channel URL formats:
    1. Username (`/profile?user=`, `/user/`): 34.8M channels
    2. ID (`/channel/UC`): 106M channels 
    3. Custom Page (`/c/`): 5.9M channels
    4. Handle (`/@`): 25.4M channels
- See our paper for more discussion.

### sample_until(n, by, condition)
- Construct a conditional sample by repeatedly drawing channels and keeping those that satisfy the condition function.
- Can be used along with the YouTube API / Innertube to construct samples conditioned on API metadata (e.g. country, join date, channel topic), or alternatively on our metadata (subscribers at a given timestamp).
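
The draw-and-keep loop described above is a form of rejection sampling. Here is a minimal, library-independent sketch of it. The function `sample_until_sketch`, the `draw` callable, the `max_draws` cap, and the skipping of repeated draws are all illustrative assumptions, not the library's implementation.

```python
import random
from typing import Callable, List, Set


def sample_until_sketch(
    n: int,
    draw: Callable[[], str],
    condition: Callable[[str], bool],
    max_draws: int = 100_000,
) -> List[str]:
    """Keep drawing channels until n of them satisfy `condition`.

    `max_draws` is an assumed safety cap so that a condition which is
    rarely (or never) met cannot loop forever.
    """
    kept: List[str] = []
    seen: Set[str] = set()
    for _ in range(max_draws):
        if len(kept) >= n:
            break
        channel = draw()
        if channel in seen:  # assumption: repeated draws are skipped
            continue
        seen.add(channel)
        if condition(channel):
            kept.append(channel)
    return kept


# Stand-in population of made-up identifiers (not real channel IDs).
rng = random.Random(0)
population = [f"UC{i:06d}" for i in range(1000)]


def ends_even(channel: str) -> bool:
    """Toy condition; a real one might query the YouTube API instead."""
    return int(channel[-1]) % 2 == 0


picked = sample_until_sketch(5, lambda: rng.choice(population), ends_even)
```

Because every rejected draw still costs a call to the condition function, an expensive condition (such as an API lookup) benefits from caching its results.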

### fetch(channels, by, from_ts, to_ts, closest)
- Retrieve the subscriber counts for a given timestamp using the Wayback Machine.
- Requires either a timestamp range, `(from_ts, to_ts)`, or `closest`.
- Returns a pandas DataFrame that includes additional channel identifier metadata extracted from the page (username / id fields).
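
The "range or `closest`" requirement can be illustrated with a small argument check. This sketch assumes timestamps are 14-digit Wayback Machine strings (`YYYYMMDDhhmmss`) and that a range and `closest` are mutually exclusive; neither assumption is confirmed by the library, and both helper names are hypothetical.

```python
from datetime import datetime
from typing import Optional


def to_wayback_ts(dt: datetime) -> str:
    """Format a datetime as a 14-digit Wayback Machine timestamp."""
    return dt.strftime("%Y%m%d%H%M%S")


def check_time_args(
    from_ts: Optional[str] = None,
    to_ts: Optional[str] = None,
    closest: Optional[str] = None,
) -> str:
    """Hypothetical check: return "range" or "closest", else raise ValueError."""
    if (from_ts is None) != (to_ts is None):
        raise ValueError("from_ts and to_ts must be given together")
    has_range = from_ts is not None
    has_closest = closest is not None
    if has_range == has_closest:
        # Assumption: exactly one of the two modes must be chosen.
        raise ValueError("specify either (from_ts, to_ts) or closest")
    if has_range and from_ts > to_ts:
        # Fixed-width digit strings compare correctly as plain strings.
        raise ValueError("from_ts must not be later than to_ts")
    return "range" if has_range else "closest"
```

Validating the arguments before any network request means a mistyped range fails immediately rather than after a slow Wayback Machine query.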

## Citation
```
@article{tubecensus, 
    title={TubeCensus: A Transparent, Replicable, and Large-Scale Census of YouTube Channels and their Subscriber Counts Over Time}, 
    volume={20}, 
    number={1}, 
    journal={Proceedings of the International AAAI Conference on Web and Social Media}, 
    author={Eggleston, Chloe and Handler, Abram and Pacheco, Maria Leonor}, 
    year={2026}, 
    month={May}, 
}
```

## TO-DOs
- Early channel IDs via CDN URLs
    - Before the standardization of the YouTube channel ID (c. 2012), channel IDs occasionally appeared in the URLs of custom channel page content (such as profile pictures and custom CSS). These URLs can be used to map additional usernames to channel IDs.
- Scrape channel hubs / related channels
    - Subscriber counts for additional channels are sometimes accessible in the related channels tab. When paired with identifiers extracted from profile pictures or subscriber button HTML attributes, they can add upwards of 10 subscriber counts to a given page scrape.
- Caching
    - We redistribute the data collected in our paper as a part of [our dataset](https://zenodo.org/uploads/18267682), which is downloaded with this library. We plan to integrate this data into the library so that URLs in the cache are not re-scraped.