Metadata-Version: 2.3
Name: wdr-article-semantic-chunking-2
Version: 0.1.0
Summary: Semantic segmentation and topic boundary detection
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: sentence-transformers
Requires-Dist: transformers
Requires-Dist: huggingface-hub
Requires-Dist: matplotlib
Requires-Dist: httpx
Requires-Dist: aiohttp
Requires-Dist: pydantic
Requires-Dist: spacy
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: jupyter ; extra == 'dev'
Requires-Dist: matplotlib ; extra == 'dev'
Requires-Dist: seaborn ; extra == 'dev'
Requires-Python: >=3.10
Provides-Extra: dev
Description-Content-Type: text/markdown

## Table of Contents
- [Project Goal](#project-goal)
- [How do we determine whether sentences have similar meaning?](#how-do-we-determine-whether-sentences-have-similar-meaning)
- [Cosine Similarity Values](#cosine-similarity-values)
- [Sliding window mechanism](#sliding-window-mechanism)
- [Challenge](#challenge)
- [Coding Plan](#coding-plan)
  - [Data Preparation](#data-preparation)
  - [Running Algorithm](#running-algorithm)
  - [Visualization](#visualization)
  - [Results Evaluation](#results-evaluation)
    - [Model Results](#model-results)
    - [3rd‑Party Library Results](#3rd-party-library-results)
    - [Overall Evaluation](#overall-evaluation)
- [File Structure](#file-structure)
- [TODO](#todo)

# Semantic Chunker
## 🚀 Project Goal
The goal of this project is to automatically find topic boundaries within a document.
It uses cosine similarity and a sliding‑window mechanism to identify points
where the semantic content of the text shifts noticeably.

## How do we determine whether sentences have similar meaning?
Natural Language Processing (NLP) models are trained on massive amounts of text and convert the meaning of words
and sentences into mathematical representations called vectors. These vectors can be thought of as points located in
a multidimensional coordinate space.
When we give such a model an input word, it returns the word's numerical representation in the form of
a vector. If we provide a second word, the model generates another vector. With these two numerical
representations (vectors) we can perform mathematical operations such as addition and subtraction.

For example, if we take the vector of the word “king”, subtract the vector of “man”, add the vector of “woman”,
and then look up the word whose vector is closest to the result, we obtain “queen”.
![SVG Image](docs/king.webp)

<p align="center">
  <i>Figure 1: A geometric illustration of word‑vector relationships showing how semantic 
transformations appear in vector space.</i>
</p>
<p align="center">
  <img src="docs/king2.webp" alt="cos sim" width="80%" />
</p>

Using this approach, we can also find synonyms and other semantically related words.
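
The vector arithmetic described above can be sketched with a toy example. The two-dimensional vectors below are invented purely for illustration (real models use hundreds of dimensions), and the helper `analogy` is a hypothetical name, not part of this project:

```python
import math

# Made-up 2-D "embeddings": axis 0 is roughly "royalty", axis 1 is roughly "gender".
# These numbers are illustrative assumptions, not output of a real model.
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), ignoring the input words."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (word for word in vectors if word not in {a, b, c})
    # Nearest neighbour by Euclidean distance in the toy space.
    return min(candidates, key=lambda word: math.dist(vectors[word], target))


result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

With a real embedding model the result is only approximately equal to the vector of “queen”, which is why a nearest-neighbour lookup is needed.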


We can also convert whole sentences into vectors and compare them to see how similar they are in meaning.
Words and sentences with similar meaning end up close to each other in the vector space,
while those with different meaning end up far apart.
To measure this, we use cosine similarity.

<p align="center">
  <img src="docs/cos_sim.png" alt="cos sim" width="80%" />
</p>
<p align="center"><i>Figure 2: Conceptual explanation of cosine similarity as the angle between vectors.</i></p>


### So What Does Cosine Similarity Do?
Cosine similarity measures how similar two vectors are by checking the angle between them.
Think of each word or sentence as an arrow (a vector) in a many‑dimensional space:

* If two arrows point in almost the same direction, their meanings are similar.
* If they point in different directions, their meanings are different.

Mathematically, cosine similarity is the cosine of the angle between the two vectors.

### Cosine Similarity Values

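Cosine similarity always returns a value between -1 and 1:

* 1.0 → the vectors point in the same direction; the meanings are almost the same.
* 0.0 → the vectors are orthogonal; the meanings are unrelated.
* -1.0 → the vectors point in opposite directions.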
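
To make these values concrete, here is a minimal pure-Python sketch of the formula cos(θ) = (a · b) / (|a| · |b|). The function name `cosine` and the sample vectors are chosen for illustration only; the project itself relies on a library implementation:

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])  # right angle
opposite = cosine([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(round(same, 6), round(orthogonal, 6), round(opposite, 6))  # 1.0 0.0 -1.0
```

Note that only the direction of the vectors matters: scaling a vector by a positive factor does not change the result.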

Let's look at an example of cosine similarity between sentences:
```python
# 1. Import libraries
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# 2. Initialize the model
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# 3. Define the sample input sentences
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 4. Encode the sentences to get their embeddings
embeddings = model.encode(sentences)

# 5. Compute the cosine similarities
first_sim = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]  # between index 0 and index 1
second_sim = cosine_similarity([embeddings[1]], [embeddings[2]])[0][0]  # between index 1 and index 2
print(first_sim)
print(second_sim)
```

It will output:
```
0.81397283
0.15795702
```
So the first two sentences are semantically similar (both talk about the weather), while the third sentence is quite different (it talks about driving to a stadium).

We can visualize this as follows:

<p align="center">
<img src="docs/test1.png" alt="cos sim" width="60%" />
</p>

<p align="center">
  <i>Figure 3: Basic example — strong similarity (S0–S1) vs. weak similarity (S1–S2).</i>
</p>


So the relationship between the sentence with index 0 (*"The weather is lovely today."*) and the sentence with index 1 (*"It's so sunny outside!"*) is
strong, while the relationship between the sentence with index 1 (*"It's so sunny outside!"*) and the sentence with index
2 (*"He drove to the stadium."*) is weak.

Such a visualization helps when we have many sentences and want to quickly see where the topic changes.

Example:
![SVG Image](docs/example_of_more_sentences.png)

<p align="center">
  <i>Figure 4: Visualization of cosine similarity across a large number of sentences.</i>
</p>

### Sliding Window Mechanism

So far, so good. However, comparing each sentence only with its direct neighbor
is sometimes not enough to detect topic changes. Adjacent sentences may belong to the same topic
even though their cosine similarity is low. For example:
*"The cat is on the roof." "The children are going to school."*

The opposite can also happen: two sentences on either side of a topic boundary belong to different topics,
but their cosine similarity is high. For example:
*“The cat is on the roof.”*
*“The dog is on the roof.”*
These two sentences may be from completely different topics (for instance, one about a family’s pets and the other about guard dogs), but they will have high cosine similarity because of the shared phrase “on the roof”. This results in a misleading similarity plot:

![SVG Image](docs/problem.png)
<p align="center">
  <i>Figure 5: Example of noisy results with many sentences.</i>
</p>

To solve this problem, we merge nearby sentences into a single context. For example:

*  *“They were a wonderful big family; grandpa taught them to be kind to everyone.”*
*  *“They had several animals — cows, dogs, chickens — and the children treated them well.”*
*  *“The cat is on the roof.”*
*  *“The children are going to school.”*
*  *“The cat was watching them leave, saying goodbye with his eyes.”*
*  *“One of the children noticed the cat and waved at him.”*

If we take 3 sentences to the left and 3 sentences to the right of the current sentence and
compare cosine similarity between these windows, we can better understand whether a topic shift occurs.
In the example above, we can see that the first three sentences are related to each other because they 
describe a family with animals, and thus their cosine similarity will be high.

*Note: We do not expect to find the exact boundary position. Instead, we consider a prediction correct if the true
boundary lies within a tolerance window of ±3 sentences around the detected boundary.*
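
The window comparison described above can be sketched in pure Python. This is a simplified illustration, not the project's code: the 2-D embeddings are invented, and averaging each window into one vector is an assumption about how the windows are condensed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def window_scores(embeddings, window_size):
    """Similarity between the averaged left and right window at each position."""
    scores = {}
    for i in range(window_size, len(embeddings) - window_size):
        left = mean_vector(embeddings[i - window_size:i])
        right = mean_vector(embeddings[i:i + window_size])
        scores[i] = cosine(left, right)
    return scores


# Invented embeddings: sentences 0-3 belong to one topic, sentences 4-7 to another.
toy_embeddings = [
    (1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.8, 0.3),
    (0.1, 1.0), (0.0, 0.9), (0.2, 1.0), (0.1, 0.8),
]
scores = window_scores(toy_embeddings, window_size=2)
lowest = min(scores, key=scores.get)
print(lowest)  # 4
```

The similarity is lowest at position 4, where the left window lies entirely in the first topic and the right window entirely in the second.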



### Challenge
We have a list of models, and we don’t know which window size and which min_gap value will work best for each model. This means we need to test all combinations of these parameters and evaluate their performance.
Additionally, libraries such as LlamaIndex provide semantic splitters that can also detect topic boundaries. We want to compare our results against these baselines and see whether our method can perform better.
The idea is to run our algorithm:

* for each model,
* for each window size,
* and for each min_gap value,

and then evaluate the results using metrics such as:

* the percentage of correctly detected boundaries,
* and visualizations that allow us to compare different configurations side-by-side.

We use news articles from the WDR NRW archive, where each file contains five news stories.
For every news story, we have ground‑truth annotations that mark the exact topic boundaries.
We compare our predicted boundaries with these annotations and measure how accurately each model 
and parameter combination performs.
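
The comparison against the ground truth could look roughly like the sketch below. It assumes the ±3-sentence tolerance mentioned earlier; the function name `evaluate_boundaries` and the sample numbers are hypothetical, and the project's own evaluation code may differ in detail.

```python
def evaluate_boundaries(expected, predicted, tolerance=3):
    """Mark each expected boundary as matched if a prediction lies within the tolerance.

    Returns a dict {expected_boundary: bool} and the percentage of matched boundaries.
    """
    matches = {
        boundary: any(abs(boundary - p) <= tolerance for p in predicted)
        for boundary in expected
    }
    percentage = 100.0 * sum(matches.values()) / len(expected) if expected else 0.0
    return matches, percentage


# Made-up sample: three true boundaries, three predictions.
matches, percentage = evaluate_boundaries(expected=[10, 25, 40], predicted=[12, 31, 39])
print(matches)  # {10: True, 25: False, 40: True}
print(round(percentage, 2))  # 66.67
```

This measure only counts how many true boundaries were found; it does not penalize additional false predictions, which would need a separate measure.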



### Coding Plan
Next, we describe how we prepare the data, run the algorithm, evaluate the predictions, save the results, visualize them, and finally summarize our findings.



<details>

<summary><b>Data Preparation</b></summary>

#### Test data preparation
The details of the test data preparation process are not essential here.
We start with the original JSON files, parse them, and then write the cleaned version back in JSON format. All processed files are stored in the `data/` directory.
For debugging purposes, the same data is also converted into .txt format. In these text files:

* every sentence is indexed,
* topic boundaries are marked with an asterisk (`*`).

These debug-friendly files are located in `computer/content/`.

#### Algorithm input
In total, we use 13 different models. For each model, we test 5 window sizes and 5 gap values, which results in:
13 × 5 × 5 = 325 possible parameter combinations.
These combinations are evaluated independently, allowing us to analyze how each model behaves under different configurations.
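
A parameter grid like this can be enumerated with `itertools.product`. The sketch below is only an illustration: the model names are placeholders, and the concrete window sizes and gap values are assumptions, since the actual lists are defined in the project code.

```python
from itertools import product

# Placeholders and assumed values; the real lists live in the project code.
MODEL_NAMES = [f"placeholder-model-{n}" for n in range(1, 14)]  # 13 models
WINDOW_SIZES = [2, 3, 4, 5, 6]  # 5 hypothetical window sizes
MIN_GAPS = [1, 2, 3, 4, 5]      # 5 hypothetical min_gap values

combinations = list(product(MODEL_NAMES, WINDOW_SIZES, MIN_GAPS))
# Same naming scheme as the result files: model_<name>_w_<window>_m_<gap>
names = [f"model_{model}_w_{window}_m_{gap}" for model, window, gap in combinations]

print(len(combinations))  # 325
print(names[0])           # model_placeholder-model-1_w_2_m_1
```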

*See main.py.*
</details>


<details>
<summary><b>Running Algorithm</b></summary>

####  Sliding Window Mechanism Implementation
In the previous example, we took three sentences and compared them with each other. In this example,
we will use more sentences and adapt our code accordingly, but the main idea will remain the same.
The main code of the sliding window mechanism is in the file
<pre>slid_win.py</pre>

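<pre>
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def segment_topics_window(
        blocks,
        window_size,
        min_gap,
        model
):
    # (1) Encode every sentence into an embedding vector.
    embeddings = model.encode(blocks)

    # (2) Similarity scores and the sentence indices they belong to.
    scores = []
    indices = []

    # (3) Visit every position that has a full window on both sides.
    for i in range(window_size, len(blocks) - window_size):
        # (4) Embeddings of the windows to the left and right of position i.
        left = embeddings[i - window_size:i]
        right = embeddings[i:i + window_size]

        # (5) Condense each window (optimize_embddings is a project helper not shown here).
        left_mean = optimize_embddings(left)
        right_mean = optimize_embddings(right)

        # (6) Cosine similarity between the two windows.
        sim = cosine_similarity(left_mean, right_mean)[0][0]
        # (7) Record the score and its position.
        scores.append(sim)
        indices.append(i)

    # (8) Dynamic threshold: 1.2 standard deviations below the mean score.
    threshold = np.mean(scores) - 1.2 * np.std(scores)

    boundaries = []
    last = 0

    # (9) Keep low-scoring positions that are at least min_gap after the last boundary.
    for idx, score in zip(indices, scores):
        if score < threshold and idx - last >= min_gap:
            boundaries.append(idx)
            last = idx

    return boundaries, scores, indices
</pre>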

1 - Encode the sentences.

2 - Initialize lists to store the similarity scores and the sentence indices.

3 - Iterate over every position i from window_size up to (but not including) len(blocks) - window_size, one sentence at a time, so that a full window fits on both sides.

4 - Take the window_size embeddings to the left of i and the window_size embeddings starting at i.

5 - Apply embedding optimization. This helps reduce noise and capture the overall topic of each window more robustly.

6 - Compute the cosine similarity between the two windows.

7 - Store the similarity score and the corresponding index in the lists.

8 - Compute a dynamic threshold based on the distribution of similarity scores (the mean minus 1.2 standard deviations).
This helps identify unusually low similarity values that may indicate potential topic shifts.

9 - Detect topic boundaries where the similarity score falls below the threshold and
the distance from the last detected boundary is at least min_gap.
This prevents overly dense or noisy boundary detection.
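
Steps 8 and 9 can be tried out in isolation with the following pure-Python sketch. It uses `statistics.pstdev` (population standard deviation, the same default as `np.std`); the function name `pick_boundaries` and the scores are made up for illustration.

```python
from statistics import fmean, pstdev


def pick_boundaries(indices, scores, min_gap, k=1.2):
    """Apply the mean - k * std threshold and the min_gap rule to precomputed scores."""
    threshold = fmean(scores) - k * pstdev(scores)
    boundaries = []
    last = 0
    for idx, score in zip(indices, scores):
        if score < threshold and idx - last >= min_gap:
            boundaries.append(idx)
            last = idx
    return boundaries, threshold


# Made-up similarity scores for positions 3..12.
indices = list(range(3, 13))
scores = [0.80, 0.82, 0.79, 0.30, 0.81, 0.80, 0.35, 0.33, 0.80, 0.78]

boundaries, threshold = pick_boundaries(indices, scores, min_gap=3)
print(boundaries)  # [6, 9]
```

Positions 6, 9 and 10 all fall below the threshold (about 0.397), but position 10 is dropped because it is only one sentence after the boundary at 9.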


####  Main Code

The hardest part is over — from here, it’s all smooth sailing.

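<pre>
def compute(
        window_size,
        min_gap,
        model_name):
    model = SentenceTransformer(model_name)
    combination_name = f"model_{model_name}_w_{window_size}_m_{min_gap}"
    # (1) Loop over the 100 test samples.
    for i in range(0, 100):
        file_name = f"merged_filtered_{i}.json"
        # (2) Extract the text blocks and the expected boundaries.
        blocks, expected_boundary, source_count, _ = extract_texts_and_write_to_file(file_name, False)

        # (3) Run the sliding-window segmentation.
        boundaries, scores, indices = segment_topics_window(blocks, ...)

        # (4) Plot and save the sliding-window result.
        plot_sliding_window(...)

        # (5) Save the per-sample result to CSV.
        save_pair_to_csv(...)

    # (6) Read the per-sample match percentages for this combination.
    df = pd.read_csv(get_path_for_csv(combination_name), usecols=[MATCH_PERCENTAGE])
    # (7) Save their mean to the final results CSV.
    save_result_tocsv(combination_name, df.mean().iloc[0])
</pre>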
                                                                   
1 - After defining the model and the combination name, we loop through 100
test samples.

2 - We extract the text
blocks and the expected boundaries.

3 - This step needs no further explanation; we described it in detail above.

4 - We generate and save a visualization of the sliding window results.
This helps us to visually inspect why
and where the algorithm decided that the topic changes.

5 - We save the per-sample results to CSV.

6-7 - After processing all samples for the current combination, we read the per-sample
match percentages and save their average to a final CSV file for later analysis.
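
The averaging in steps 6-7 can be illustrated without pandas. The sketch below swaps in the standard-library `csv` module and an in-memory CSV with invented rows; the column name `percentage2` is taken from the result files described in the evaluation section, and the function name is hypothetical.

```python
import csv
import io


def mean_match_percentage(csv_text, column="percentage2"):
    """Average one numeric column of a CSV given as text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


# Invented per-sample results for one model/parameter combination.
sample_csv = (
    "File,percentage2\n"
    "merged_filtered_0.json,100.0\n"
    "merged_filtered_1.json,75.0\n"
    "merged_filtered_2.json,50.0\n"
)
average = mean_match_percentage(sample_csv)
print(average)  # 75.0
```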

</details>


<details>

<summary><b>Visualization</b></summary>

For each test case we generate a visualization like this:
![SVG Image](docs/visualisation.png)
<p align="center"><i>Figure 6: Sliding‑window similarity plot: the blue line shows similarity scores, green dashed lines show ground truth, red points show detected boundaries.</i></p>

The red points represent the detected
boundaries, the blue line represents the
similarity scores across the text, and the
vertical green dashed lines indicate the expected
boundaries (ground truth).

The plot is saved in the result folder, in a subfolder named after the model and parameter combination.
For example, this one is saved as `computer/result/model_all-MiniLM-L12-v2_w_3_m_3/merged_filtered_4/merged_filtered_4.json.png`.

The source code for the visualization is in
<pre>computer/plotter.py</pre>
</details>

<details>

<summary><b>Results Evaluation</b></summary>

#### Model Results
After each run of the algorithm — for every model and every parameter configuration — we
save the results to a CSV file. The files are stored in the result/ directory and 
each one is named according to the model and the parameters used. For example:

*model_paraphrase-multilingual-mpnet-base-v2_w_3_m_3.csv*

![SVG Image](docs/result1.png)
<p align="center"><i>Figure 7: Example of per‑model and per‑parameter evaluation results stored in CSV format.</i></p>

The structure of this file includes the following columns:

* the name of the test file - *File*,
* the expected boundaries  - *boundary*,
* the predicted boundaries  - *possible_breaks*,
* a dictionary indicating whether each boundary was detected correctly  - *matches2*,
* and the overall match percentage  - *percentage2*.

The code responsible for saving the results to a CSV file is located in slid_win.py inside the
function save_pair_to_csv(...).
To keep the documentation simple, we do not include the full implementation here, but the function 
itself is straightforward.
And if needed, feel free to ask an AI for help — (p.s. that’s where I copied it from myself :)).

#### 3rd‑Party Library Results
We also tested third‑party libraries for semantic segmentation, specifically the LlamaIndex
node parsers
SemanticSplitterNodeParser and SemanticDoubleMergingSplitterNodeParser.
We used the same test dataset, and the results were saved in CSV files with the same structure
as our own algorithm’s output.
However, these libraries did not perform well.
Although they detected all real boundaries, they also generated a large number of incorrect ones,
which significantly reduced their overall usefulness.

#### Overall Evaluation
After running all combinations of models and parameters, we compiled the results into a final CSV file 
that summarizes the performance of each configuration. This allows us to compare different models 
and parameter settings side by side and identify which ones are most effective at detecting topic
boundaries in our test dataset.

![SVG Image](docs/overall_result.png)
<p align="center"><i>Figure 9: Comparison of all model and parameter combinations,
showing boundary‑detection accuracy.</i></p>

Our top performers with a window size of 3 and a min_gap of 3 were the models paraphrase-multilingual-mpnet-base-v2 
and distiluse-base-multilingual-cased-v1. 
</details>


# File Structure

![SVG Image](docs/file_structure.png)
<p align="center">
  <i>Figure 10: Directory layout.</i>
</p>

* `Artikel_WDR_NRW/` This folder contains **raw test data**.
After extraction and text cleaning, the processed data is saved into the `data/` folder.

* `data/`
Stores the cleaned and preprocessed data generated from the raw inputs. This folder is used as the main input source for the processing pipeline.

* `computer/` Contains the core application logic. All main processing steps are implemented here.

* `content/`, `result/` and `grafic/` These folders are primarily used for debugging and inspection purposes. 
 All output data is classified and stored in one of these folders depending on its type.

* `text_util/` and `util/` Contain helper and utility functions, including:
    *  Text cleaning and normalization

    * Format conversion

    * Shared helper logic used across the project

    
# TODO
* ***Fine‑tune the model*** — Hugging Face provides tools to further train embedding models
on custom datasets, which may significantly improve boundary‑detection accuracy for our domain.

* ***Experiment with alternative approaches*** such as agglomerative clustering — instead 
of using a sliding window, clustering algorithms could group semantically similar sentences and identify topic boundaries between clusters.

* ***Extend algorithm to find the exact boundary position.***
We want to extend the existing code so that it can identify the boundary more precisely.
The planned approach is as follows:
We have a predicted boundary X, and we know that the true boundary lies within a window of ±3 sentences around X.
This means we can take the contextual text to the left of (X − 3) and compare it with each sentence in that window.
Then we do the same with the contextual text to the right of (X + 3) and compare it with each sentence.
This should produce a pattern similar to the one below:
for the left context, the similarity values are high at first and then drop sharply;
for the right context, they behave in the opposite way. So the exact boundary will be at the point where the similarity drops (for the left context) and rises (for the right context).
<p align="center">
  <img src="docs/exec_boundary_left_half.png" alt="Left figure" width="48%" />
  <img src="docs/exec_boundary_right_half.png" alt="Right figure" width="48%" />
</p>

<p align="center"><i>Figure 11: Similarity between the left and right context and each sentence 
within the approximate boundary range.</i></p>

<p align="center">
  <img src="docs/exec_boundary_left.png" alt="Left figure" width="48%" />
  <img src="docs/exec_boundary_right.png" alt="Right figure" width="48%" />
</p>

<p align="center"><i>Figure 12: If the left‑side similarities are low while the right‑side similarities are high,
then the true boundary is likely located at (X − 3).</i></p>

* If both sides show consistently high similarity, then the prediction is likely ambiguous.
In this case, a more advanced approach (for example, using an OpenAI LLM) may be required to determine the 
exact boundary with higher accuracy.
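
One possible way to turn this idea into code is sketched below. It is a hypothetical illustration of the planned refinement, not an existing part of the project: the function name `refine_boundary`, the `margin` parameter, and the similarity values are all assumptions.

```python
def refine_boundary(x, left_sims, right_sims, radius=3, margin=0.1):
    """Return the first position in [x - radius, x + radius] whose sentence is
    clearly more similar to the right context than to the left context.

    left_sims[i] and right_sims[i] are the similarities of the sentence at
    position x - radius + i to the left and right context, respectively.
    Returns None when no position shows a clear switch (ambiguous case).
    """
    positions = range(x - radius, x + radius + 1)
    for pos, left, right in zip(positions, left_sims, right_sims):
        if right - left > margin:
            return pos
    return None


# Made-up similarities for positions 17..23 around a predicted boundary X = 20.
left_sims = [0.80, 0.78, 0.81, 0.75, 0.30, 0.25, 0.28]
right_sims = [0.20, 0.25, 0.22, 0.30, 0.79, 0.80, 0.77]
print(refine_boundary(20, left_sims, right_sims))  # 21

# Both contexts look equally similar everywhere: the result is ambiguous.
print(refine_boundary(20, [0.8] * 7, [0.8] * 7))  # None
```

If the right context already wins at the first position, the function returns X − 3, which corresponds to the situation shown in Figure 12.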
