Metadata-Version: 2.4
Name: NewsLookout
Version: 3.0.0
Summary: News scraping application
Home-page: https://github.com/sandeep-sandhu/NewsLookout
Author: Sandeep Singh Sandhu
Author-email: sandeep.sandhu@gmx.com
Maintainer: Sandeep Singh Sandhu
License: GPL-3
Keywords: Web-scraping,News,NLP,Information-Retrieval,crawler
Platform: Operating System :: MacOS :: MacOS X
Platform: Operating System :: Microsoft :: Windows
Platform: Operating System :: POSIX
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Environment :: No Input/Output (Daemon)
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: newspaper4k
Requires-Dist: beautifulsoup4
Requires-Dist: lxml
Requires-Dist: nltk
Requires-Dist: spacy
Requires-Dist: requests
Requires-Dist: enlighten
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: tld
Requires-Dist: urllib3
Requires-Dist: configparser
Requires-Dist: openpyxl
Requires-Dist: fastapi
Requires-Dist: uvicorn
Dynamic: license-file



1. [Overview](#overview)
2. [Installation](#installation)
3. [Quick Start](#quick-start)
4. [Library Usage](#library-usage)
5. [Architecture](#architecture)
6. [Configuration](#configuration)
7. [Plugin Development](#plugin-development)
8. [API Reference](#api-reference)
9. [Troubleshooting](#troubleshooting)

## Overview



NewsLookout is a comprehensive, multi-threaded web scraping framework designed for extracting news articles and data from various online sources. It features a plugin-based architecture for extensibility and supports concurrent processing across multiple news sources.


- **Multi-threaded Architecture**: Concurrent URL discovery, content fetching, and data processing
- **Plugin-Based Design**: Easy to extend with custom scrapers for different news sources
- **Session Management**: Tracks completed URLs to avoid duplicate processing
- **Data Processing Pipeline**: Built-in support for deduplication, classification, and keyword extraction
- **Graceful Shutdown**: Handles interrupts cleanly without data loss
- **Library Interface**: Can be used as a Python library in your own applications
- **Configurable Timeouts**: Prevents indefinite hangs with configurable timeout mechanisms


1. **Timeout Management**: URL gathering operations now have configurable timeouts (default: 10 minutes)
2. **Dedicated Database Thread**: All database operations handled by single thread to prevent lock conflicts
3. **Improved Recursion**: Iterative link extraction with strict depth limiting (max 4 levels)
4. **Better Interrupt Handling**: Graceful shutdown on Ctrl+C with proper cleanup
5. **Queue-Based URL Streaming**: URLs processed as discovered, not in batches
6. **Library Interface**: Can be imported and used programmatically

## Installation



```bash
pip install newslookout
```


When installed via pip, NewsLookout stores all user-writable files outside the Python
package directory so that package upgrades never overwrite your data or configuration.

| Platform | Config file | Log / PID files | Data & archive |
|----------|-------------|-----------------|----------------|
| **Linux** | `~/.config/newslookout/newslookout.conf` | `~/.local/state/newslookout/` | `~/.local/share/newslookout/data/` |
| **macOS** | `~/Library/Preferences/newslookout/newslookout.conf` | `~/Library/Logs/newslookout/` | `~/Library/Application Support/newslookout/data/` |
| **Windows** | `%APPDATA%\newslookout\newslookout.conf` | `%APPDATA%\newslookout\` | `%APPDATA%\newslookout\data\` |

> **Tip:** You can override any path in the config file. Set the `data_dir`, `log_file`,
> and `archive_base_path` keys under `[environment]` to any absolute path you prefer.
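
For example (the paths below are illustrative, not defaults):

```ini
[environment]
data_dir = /srv/newslookout/data
log_file = /srv/newslookout/logs/newslookout.log
archive_base_path = /srv/newslookout/archive
```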


The first time you run `newslookout` without specifying a config file, it will:

1. Create the default configuration at the platform-appropriate path shown above.
2. Print the path and exit so you can review it before scraping begins.

```bash
newslookout          # first run: creates config and exits
newslookout -d 2024-03-22
```

You can also point to a custom config explicitly:

```bash
newslookout -c /path/to/my.conf -d 2024-03-22
```



```bash
git clone https://github.com/sandeep-sandhu/newslookout.git
cd newslookout
pip install -e .
```


NewsLookout requires Python 3.8+ and will install the following key dependencies:

- `beautifulsoup4` - HTML parsing
- `newspaper4k` - Article extraction
- `nltk` - Natural language processing
- `requests` - HTTP requests
- `pandas` - Data manipulation
- `enlighten` - Progress bars
- `spacy` - Advanced NLP (optional, for deduplication)
- `torch` - Deep learning (optional, for classification)



After installation, download the required NLP model data:

```bash
# Download the spaCy English language model
python -m spacy download en_core_web_lg

# Download the required NLTK data packages
python - <<'EOF'
import nltk
for pkg in ['punkt', 'punkt_tab', 'maxent_treebank_pos_tagger',
            'reuters', 'universal_treebanks_v20']:
    nltk.download(pkg)
EOF
```

If NLTK data is stored in a non-standard location, set the `NLTK_DATA` environment variable
to its path. See https://www.nltk.org/data.html for details.
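
For example (the path is illustrative):

```bash
export NLTK_DATA=/opt/nltk_data
```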


Alternatively, you can download these data packages manually from the source location: https://github.com/nltk/nltk_data/tree/gh-pages/packages

The following NLTK data packages are required:
1. reuters
2. universal_treebanks_v20
3. maxent_treebank_pos_tagger
4. punkt
5. punkt_tab

## Quick Start





```bash
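# Scrape news for the given date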
newslookout -c config.conf -d 2025-12-21

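# The same run with verbose (DEBUG) logging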
newslookout -c config.conf -d 2025-12-21 --log-level DEBUG
```


```python
from newslookout import NewsLookoutApp

app = NewsLookoutApp(config_file='config.conf')
stats = app.run(run_date='2025-12-21', max_runtime=3600)

print(f"Processed {stats['urls_processed']} URLs in {stats['duration']:.1f} seconds")
```


```python
from newslookout import NewsLookoutApp

with NewsLookoutApp('config.conf') as app:
    app.start()   # run scraping in the background
    # ... do other work while the scrape proceeds ...
    app.stop()    # request a graceful shutdown
```


```python
from newslookout import scrape

stats = scrape('config.conf', run_date='2025-12-21', max_runtime=3600)
```

## Library Usage



```python
from newslookout import NewsLookoutApp

app = NewsLookoutApp(config_file='path/to/config.conf')

stats = app.run(run_date='2025-12-21')

print(f"URLs discovered: {stats['urls_discovered']}")
print(f"URLs processed: {stats['urls_processed']}")
print(f"Data processed: {stats['data_processed']}")
print(f"Duration: {stats['duration']:.1f} seconds")
```


```python
from newslookout import NewsLookoutApp
import time

app = NewsLookoutApp('config.conf')

app.start()

while app.is_running:
    stats = app.get_statistics()
    print(f"Progress: {stats['urls_processed']} URLs processed")
    time.sleep(10)

app.wait_for_completion()

final_stats = app.get_statistics()
```


```python
from newslookout import NewsLookoutApp

app = NewsLookoutApp('config.conf')

stats = app.run(max_runtime=3600)

if app.is_running:
    print("Timeout reached, stopping...")
    app.stop()
```


```python
app = NewsLookoutApp('config.conf')
app.start()

plugin_status = app.get_plugin_status()
for plugin_name, state in plugin_status.items():
    print(f"{plugin_name}: {state}")
```


The application status is also visible on the monitoring dashboard, which uses the
REST API to publish the status and progress of scraping activity.
It is accessible at http://localhost:8080/dashboard.html.

![Monitoring Dashboard](monitoring_dashboard.png)

## Architecture




```
┌─────────────────────────────────────────────────────┐
│                  NewsLookoutApp                      │
│              (Library Interface)                     │
└───────────────────┬─────────────────────────────────┘
                    │
┌───────────────────▼─────────────────────────────────┐
│                 QueueManager                         │
│          (Orchestrates all workers)                  │
└─────┬────────────┬────────────┬────────────┬────────┘
      │            │            │            │
      ▼            ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│   URL    │ │ Content  │ │   Data   │ │ Progress │
│Discovery │ │ Fetching │ │Processing│ │ Watcher  │
│ Workers  │ │ Workers  │ │ Workers  │ │          │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
      │            │            │            │
      └────────────┴────────────┴────────────┘
                    │
            ┌───────▼────────┐
            │   Database     │
            │     Worker     │
            │  (Dedicated)   │
            └────────────────┘
```


1. **URL Discovery Workers**: One per plugin, discovers URLs to scrape
2. **Content Fetch Workers**: Multiple workers that download and parse content
3. **Data Processing Workers**: Process scraped data through plugins
4. **Database Worker**: Single thread handling all database operations
5. **Progress Watcher**: Monitors progress and updates UI


- **URL Discovery Queue**: New URLs streamed here as discovered
- **Fetch Queue**: URLs pending content download
- **Processing Queue**: Downloaded content pending processing
- **Database Queue**: Database operations to be executed
- **Completed Queue**: Finished items
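
The hand-off between these queues follows a standard producer-consumer pattern. Below is a minimal, self-contained sketch of that pattern (illustrative only, not the actual NewsLookout worker code):

```python
import queue
import threading

fetch_queue = queue.Queue()       # URLs pending content download
processing_queue = queue.Queue()  # downloaded content pending processing

def fetch_worker():
    """Consume URLs from one queue, hand results to the next."""
    while True:
        url = fetch_queue.get()
        if url is None:                   # sentinel value signals shutdown
            break
        html = f"<html>{url}</html>"      # stand-in for a real download
        processing_queue.put((url, html))

worker = threading.Thread(target=fetch_worker, daemon=True)
worker.start()
fetch_queue.put('https://example.com/article/1')
fetch_queue.put(None)                     # request shutdown
worker.join()
```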

## Configuration


```ini
[installation]
prefix = /opt/newslookout
data_dir = /var/cache/newslookout_data
plugins_dir = /opt/newslookout/plugins
log_file = /var/log/newslookout/app.log
pid_file = /tmp/newslookout.pid

[operation]
url_gathering_timeout = 600

recursion_level = 2

user_agent = Mozilla/5.0 ...
fetch_timeout = 60
connect_timeout = 3
retry_count = 3

proxy_url_http = http://proxy.example.com:8080
proxy_url_https = https://proxy.example.com:8080

[logging]
log_level = INFO
max_logfile_size = 10485760
logfile_backup_count = 30

[plugins]
plugin1 = mod_en_in_ecotimes|10
plugin2 = mod_en_in_timesofindia|20
plugin3 = mod_dedupe|100
```


- `url_gathering_timeout`: Maximum seconds for URL discovery (default: 600)
- `recursion_level`: Depth of link extraction (1-4, default: 2)

- `fetch_timeout`: Timeout for downloading content (seconds)
- `connect_timeout`: Timeout for establishing connection (seconds)
- `retry_count`: Number of retry attempts
- `user_agent`: User agent string for requests

- `completed_urls_datafile`: SQLite database for session history

- `log_level`: DEBUG, INFO, WARNING, ERROR
- `max_logfile_size`: Maximum log file size before rotation
- `logfile_backup_count`: Number of rotated logs to keep

## Plugin Development



```python
from base_plugin import BasePlugin
from data_structs import PluginTypes

class mod_my_news_site(BasePlugin):
    """
    Plugin for scraping MyNewsSite.com
    """

    pluginType = PluginTypes.MODULE_NEWS_CONTENT
    mainURL = 'https://www.mynewssite.com'
    allowedDomains = ['www.mynewssite.com']

    validURLStringsToCheck = ['mynewssite.com/article/']
    invalidURLSubStrings = ['mynewssite.com/ads/', '/video/']

    def __init__(self):
        super().__init__()

    def getURLsListForDate(self, runDate, sessionHistoryDB):
        """Discover URLs for the given date."""
        urls = []
        # Populate urls from the site's listing or archive pages here.
        return urls

    def extractArticleBody(self, htmlContent):
        """Extract article text from HTML."""
        text = ''
        # Parse htmlContent and extract the article body here.
        return text

    def extractUniqueIDFromURL(self, url):
        """Extract a unique identifier from the URL."""
        unique_id = url.split('/')[-1]  # e.g., use the trailing URL segment
        return unique_id
```


```python
from base_plugin import BasePlugin
from data_structs import PluginTypes

class mod_my_processor(BasePlugin):
    """
    Plugin for processing scraped data
    """

    pluginType = PluginTypes.MODULE_DATA_PROCESSOR

    def __init__(self):
        super().__init__()

    def additionalConfig(self, sessionHistoryObj):
        """Additional configuration."""
        pass

    def processDataObj(self, newsEventObj):
        """Process a news event object."""
        # Compute the transformed article text; the transformation
        # itself is plugin-specific and omitted here.
        processed_text = ''
        newsEventObj.setText(processed_text)

        filename = newsEventObj.getFileName().replace('.json', '')
        newsEventObj.writeFiles(filename, '', saveHTMLFile=False)
```


- `MODULE_NEWS_CONTENT`: Scrapes news articles
- `MODULE_NEWS_AGGREGATOR`: Aggregates URLs from multiple sources
- `MODULE_DATA_CONTENT`: Scrapes structured data
- `MODULE_DATA_PROCESSOR`: Post-processes scraped data
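
To activate a custom plugin, add it to the `[plugins]` section of the configuration using the same `name|priority` pattern shown earlier (the module name and priority below are illustrative):

```ini
[plugins]
plugin4 = mod_my_news_site|30
```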


## API Reference


```python
NewsLookoutApp(config_file: str, run_date: Optional[str] = None)
```

**Parameters:**
- `config_file` (str): Path to configuration file
- `run_date` (str, optional): Date in 'YYYY-MM-DD' format

**Raises:**
- `FileNotFoundError`: If config file doesn't exist
- `ValueError`: If configuration is invalid
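
A minimal sketch of handling the documented exceptions, using only the constructor signature above:

```python
from newslookout import NewsLookoutApp

try:
    app = NewsLookoutApp(config_file='config.conf', run_date='2025-12-21')
except FileNotFoundError:
    print("Configuration file not found")
except ValueError as exc:
    print(f"Invalid configuration: {exc}")
```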



```python
run(run_date: Optional[str] = None,
    max_runtime: Optional[int] = None,
    blocking: bool = True) -> Dict[str, Any]
```

Run the scraping process.

**Parameters:**
- `run_date` (str, optional): Override run date
- `max_runtime` (int, optional): Maximum runtime in seconds
- `blocking` (bool): If True, wait for completion

**Returns:**
- `dict`: Statistics dictionary


```python
start()
```

Start application in background mode.


```python
stop(timeout: int = 30)
```

Stop the running application gracefully.

**Parameters:**
- `timeout` (int): Maximum seconds to wait for shutdown


```python
get_statistics() -> Dict[str, Any]
```

Get current or last run statistics.

**Returns:**
- `dict`: Statistics including:
  - `urls_discovered`: Total URLs found
  - `urls_processed`: URLs successfully scraped
  - `data_processed`: Items processed
  - `start_time`: Execution start time
  - `end_time`: Execution end time
  - `duration`: Runtime in seconds
  - `is_running`: Current status


```python
get_plugin_status() -> Dict[str, str]
```

Get status of all loaded plugins.

**Returns:**
- `dict`: Map of plugin names to states


```python
wait_for_completion(timeout: Optional[int] = None) -> bool
```

Wait for background execution to complete.

**Parameters:**
- `timeout` (int, optional): Maximum seconds to wait

**Returns:**
- `bool`: True if completed, False if timeout
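
A typical background-mode pattern combining `start()`, `wait_for_completion()`, and `stop()` (the timeout values are illustrative):

```python
from newslookout import NewsLookoutApp

app = NewsLookoutApp('config.conf')
app.start()                                  # run in the background

if not app.wait_for_completion(timeout=7200):
    print("Run did not finish within 2 hours, stopping...")
    app.stop(timeout=30)                     # graceful shutdown

print(app.get_statistics())
```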



```python
scrape(config_file: str,
       run_date: Optional[str] = None,
       max_runtime: Optional[int] = None) -> Dict[str, Any]
```

Convenience function to run a scraping job.

## Troubleshooting




**Symptom:** Application hangs during URL discovery

**Solution:** Increase `url_gathering_timeout` in configuration:

```ini
[operation]
url_gathering_timeout = 1200  # 20 minutes
```


**Symptom:** `database is locked` errors in logs

**Solution:** All database operations now go through a dedicated thread. If the issue persists:
- Check no other process is accessing the database
- Remove `-journal` files if present
- Increase timeout in session_hist.py


**Symptom:** Ctrl+C doesn't stop the application

**Solution:** Recent versions include periodic shutdown checks. Ensure that:
- You are running the latest version
- The application is not stuck in a long-running external call
- Network timeouts are set to reasonable values


**Symptom:** Memory exhaustion from excessive URLs

**Solution:**
- Reduce `recursion_level` in configuration
- Improve URL filtering in plugins
- Use more restrictive `validURLStringsToCheck`
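
For example, tightening a plugin's URL filters (the patterns below are illustrative, for a hypothetical site) can dramatically reduce the number of discovered URLs:

```python
# Illustrative, more restrictive filters for a hypothetical plugin
validURLStringsToCheck = ['mynewssite.com/article/2025/']
invalidURLSubStrings = ['/video/', '/ads/', '/tag/', '?page=']
```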


**Symptom:** Specific plugin never completes

**Solution:**
- Check plugin's `is_stopped` flag periodically
- Ensure network operations have timeouts
- Review `getURLsListForDate()` implementation
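
A sketch of a `getURLsListForDate()` loop that honors the stop flag; the `listing_pages` attribute and `extract_links()` helper are hypothetical placeholders:

```python
def getURLsListForDate(self, runDate, sessionHistoryDB):
    """Discover URLs, checking the stop flag on every iteration."""
    urls = []
    for listing_page in self.listing_pages:    # hypothetical attribute
        if self.is_stopped:                    # bail out on shutdown requests
            break
        urls.extend(self.extract_links(listing_page))  # hypothetical helper
    return urls
```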


Enable detailed logging:

```ini
[logging]
log_level = DEBUG
```

Or programmatically:

```python
import logging
logging.getLogger().setLevel(logging.DEBUG)
```



```ini
[operation]
fetch_timeout = 30  # Reduce if sites are fast
retry_count = 2     # Reduce retries
```


Modify in code:

```python
queue_manager.dataproc_threads = 10  # Increase for more parallelism
```


```ini
[operation]
recursion_level = 1  # Minimum recursion
```



- Use separate configs for different environments
- Version control your configuration files
- Document custom settings


- Always check `self.is_stopped` in loops
- Use timeouts for all network operations
- Handle exceptions gracefully
- Log progress at regular intervals
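
A sketch of a fetch helper that applies these practices; the function is illustrative and not part of the NewsLookout API:

```python
import logging
import requests

def fetch_with_timeout(url, connect_timeout=3, read_timeout=60):
    """Fetch a URL with explicit timeouts and graceful error handling."""
    try:
        resp = requests.get(url, timeout=(connect_timeout, read_timeout))
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        logging.warning("Fetch failed for %s: %s", url, exc)
        return None
```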


- Monitor disk space for data directory
- Rotate logs regularly
- Clean up old session data periodically


- Use systemd or supervisor for service management
- Set up log rotation
- Monitor application health
- Configure appropriate timeouts
- Use separate database for each instance


- Review logs after each run
- Set up alerts for critical errors
- Test plugins with edge cases
- Handle malformed HTML gracefully


| Log message | Cause | Fix |
|-------------|-------|-----|
| `can't compare offset-naive and offset-aware datetimes` | The news site returns a timezone-aware publication date | Apply Patch 2 to `base_plugin.py` |
| `'NoneType' object has no attribute 'getURL'` in `mod_keywordflags` | The JSON article file for a previously scraped URL no longer exists on disk | Apply Patch 4 to `worker.py`; also verify your `data_dir` path in the config |
| `Invalid article_id: None` / `Falling back to legacy file storage` | The URL did not match any `urlMatchPatterns` in the plugin | Apply Patch 3 to `base_plugin.py` |
| `Error fetching status: TypeError: can't access property "textContent" … is null` | Dashboard JS runs before DOM is ready | Apply Patch 5 to `dashboard.html` |
| `Request for font "Ubuntu Sans" blocked at visibility level 2` | Browser privacy policy blocks Google Fonts | Apply Patch 5a to `dashboard.html` |
| Installed package appears under `src/newslookout` instead of `newslookout` | Missing `src`-layout config in `setup.cfg` / `pyproject.toml` | Apply Patches 9 and 10 |



- **Documentation**: https://github.com/sandeep-sandhu/newslookout
- **Issues**: Report bugs on GitHub Issues
- **Contributing**: Pull requests welcome


This software is provided "AS IS" without warranty. See LICENSE file for details.

---
