Metadata-Version: 2.4
Name: wowool-chunks
Version: 1.3.1
Summary: Wowool Chunks
Home-page: https://www.wowool.com/
Author: Wowool
Author-email: info@wowool.com
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: wowool-common
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Chunking documents semantically

The chunks app intelligently segments documents into meaningful, self-contained sections based on their semantic content.

## Options

#### ChunksOptions

```typescript
interface ChunksOptions
{
    max_chunk_size?: number;
    soft_sentence_limit?: number;
    header_level?: number;
    canonicalize?: boolean;
    fix_spelling_mistakes?: boolean;
    lowercase: boolean;
    cleanup: boolean;
    lemmas: boolean;
    dates?: boolean;
    add_themes?: boolean;
    add_topics?: boolean;
    add_outline?: boolean;
}

const default_options: ChunksOptions = {
    max_chunk_size: 100,
    soft_sentence_limit: 4,
    header_level: -1,
    canonicalize: true,
    fix_spelling_mistakes: true,
    lowercase: false,
    cleanup: false,
    lemmas: false,
    dates: true,
    add_themes: false,
    add_topics: false,
    add_outline: false,
};
```

with:

| Property                | Description                                                                    |
|-------------------------|--------------------------------------------------------------------------------|
| `max_chunk_size`        | Maximum number of tokens allowed in each chunk                                 |
| `soft_sentence_limit`   | Maximum number of sentences to find a soft boundary like a header or paragraph |
| `header_level`          | Use document outlining and split the level of the markup or <h.> tags          |
| `canonicalize`          | Resolve names of people and companies to their canonical form                  |
| `lowercase`             | Lowercase the results                                                          |
| `fix_spelling_mistakes` | Fix spelling mistakes. Example: cooool -> cool or lable -> label               |
| `cleanup`               | Remove markup characters                                                       |
| `lemmas`                | Return all the data using the lemmas                                           |
| `dates`                 | Resolve relative dates. Example: The previous year                             |
| `add_topics`            | Add topics to the output of every chunk                                        |
| `add_themes`            | Add themes/categories to the output of every chunk                             |
| `add_outline`           | Add the outline of the document to the output of every chunk                   |

## Results

#### ChunksResult

```typescript
type ChunksResult = Chunk[];
```

#### Chunk

```typescript
interface Chunk {
    sentences : string[];
    begin_offset : number;
    end_offset : number;
    outline?: string[];
    topics?: Topic[];
    themes?: Theme[];
}
```

with:

| Property       | Description                                                                    |
|----------------|--------------------------------------------------------------------------------|
| `sentences`    | Sentences in the given chunk                                                   |
| `begin_offset` | Begin offset of the chunk                                                      |
| `end_offset`   | End offset of the chunk                                                        |
| `outline`      | Outline of a chunk; provides a hierarchical summary where the chunk is located |
| `topics`       | Topics found in the chunk                                                      |
| `themes`       | Themes or categories found in the chunk                                        |

#### Theme

```typescript
interface Theme {
    name : string;
    relevancy : number;
}
```

#### Topic

```typescript
interface Topic {
    name : string;
    relevancy : number;
}
```

## Examples

#### Creating chunks from a Markdown document

<sample data-uuid="chunking"></sample>

# Chunking documents semantically

The chunks app intelligently segments documents into meaningful, self-contained sections based on their semantic content.

## Options

#### ChunksOptions

```typescript
interface ChunksOptions
{
    max_chunk_size?: number;
    soft_sentence_limit?: number;
    header_level?: number;
    canonicalize?: boolean;
    fix_spelling_mistakes?: boolean;
    lowercase: boolean;
    cleanup: boolean;
    lemmas: boolean;
    dates?: boolean;
    add_themes?: boolean;
    add_topics?: boolean;
    add_outline?: boolean;
}

const default_options: ChunksOptions = {
    max_chunk_size: 100,
    soft_sentence_limit: 4,
    header_level: -1,
    canonicalize: true,
    fix_spelling_mistakes: true,
    lowercase: false,
    cleanup: false,
    lemmas: false,
    dates: true,
    add_themes: false,
    add_topics: false,
    add_outline: false,
};
```

with:

| Property                | Description                                                                    |
|-------------------------|--------------------------------------------------------------------------------|
| `max_chunk_size`        | Maximum number of tokens allowed in each chunk                                 |
| `soft_sentence_limit`   | Maximum number of sentences to find a soft boundary like a header or paragraph |
| `header_level`          | Use document outlining and split the level of the markup or <h.> tags          |
| `canonicalize`          | Resolve names of people and companies to their canonical form                  |
| `lowercase`             | Lowercase the results                                                          |
| `fix_spelling_mistakes` | Fix spelling mistakes. Example: cooool -> cool or lable -> label               |
| `cleanup`               | Remove markup characters                                                       |
| `lemmas`                | Return all the data using the lemmas                                           |
| `dates`                 | Resolve relative dates. Example: The previous year                             |
| `add_topics`            | Add topics to the output of every chunk                                        |
| `add_themes`            | Add themes/categories to the output of every chunk                             |
| `add_outline`           | Add the outline of the document to the output of every chunk                   |

## Results

#### ChunksResult

```typescript
type ChunksResult = Chunk[];
```

#### Chunk

```typescript
interface Chunk {
    sentences : string[];
    begin_offset : number;
    end_offset : number;
    outline?: string[];
    topics?: Topic[];
    themes?: Theme[];
}
```

with:

| Property       | Description                                                                    |
|----------------|--------------------------------------------------------------------------------|
| `sentences`    | Sentences in the given chunk                                                   |
| `begin_offset` | Begin offset of the chunk                                                      |
| `end_offset`   | End offset of the chunk                                                        |
| `outline`      | Outline of a chunk; provides a hierarchical summary where the chunk is located |
| `topics`       | Topics found in the chunk                                                      |
| `themes`       | Themes or categories found in the chunk                                        |

#### Theme

```typescript
interface Theme {
    name : string;
    relevancy : number;
}
```

#### Topic

```typescript
interface Topic {
    name : string;
    relevancy : number;
}
```

# API

## Examples

### Creating Chunks

This script demonstrate the capabilities of the chunks app, the input text has been chunked and information like outlines, themes, topics have been added to each chunk.



```python
from wowool.sdk import Pipeline

text = """# List of Authors and Their Books

## J.R.R. Tolkien

J.R.R. Tolkien was an English writer, poet, and professor known for his high fantasy works. He created the richly detailed world of Middle-earth, a place inhabited by hobbits, elves, dwarves, and orcs.

### The Hobbit

- Published: 1937
- Genre: Fantasy
- **Abstract**: 
  *The Hobbit* is a classic tale of adventure and self-discovery. Bilbo Baggins, a reluctant hobbit, is recruited by the wizard Gandalf and a group of thirteen dwarves to help them reclaim their homeland and treasure from the fearsome dragon Smaug. Along the way, Bilbo encounters trolls, goblins, giant spiders, and a mysterious creature named Gollum. The novel explores themes of bravery, friendship, and the unexpected heroism that lies within ordinary individuals. This book also serves as a prelude to *The Lord of the Rings*, setting up the history of Middle-earth.

### The Lord of the Rings
- Published: 1954
- Genre: Epic Fantasy
- **Abstract**: 
  *The Lord of the Rings* is a monumental epic fantasy trilogy that follows the journey of Frodo Baggins as he attempts to destroy the One Ring, a powerful artifact created by the dark lord Sauron. The ring grants immense power but also corrupts those who possess it. With the help of friends like Samwise Gamgee, Aragorn, Gandalf, and others, Frodo travels across Middle-earth, facing tremendous challenges and internal struggles. The novel delves into themes of good versus evil, the corrupting influence of power, and the importance of hope and perseverance in the face of overwhelming odds. Tolkien's world-building and intricate mythology make this one of the most beloved and influential works of fantasy literature.

## George Orwell

George Orwell was an English novelist, essayist, and critic whose works often focused on social issues, particularly those related to politics, totalitarianism, and personal freedoms. Orwell's keen insights into human nature and government have made his works enduringly relevant.

### 1984
- Published: 1949
- Genre: Dystopian, Political Fiction
- **Abstract**: 
  *1984* is a chilling portrayal of a dystopian future where the government, led by the omnipresent Big Brother, controls every aspect of life. Citizens are constantly watched through telescreens, and any hint of rebellion or independent thought is ruthlessly suppressed by the Thought Police. The novel follows Winston Smith, a low-ranking government employee who becomes disillusioned with the oppressive regime and secretly longs for freedom. Orwell's work explores themes such as the dangers of totalitarianism, the manipulation of truth, and the loss of individual identity. The novel remains a powerful critique of oppressive governments and the dangers of surveillance and propaganda.

### Animal Farm
- Published: 1945
- Genre: Allegory, Satire
- **Abstract**: 
  *Animal Farm* is an allegorical novella that uses a group of farm animals to represent the events leading up to the Russian Revolution of 1917 and the subsequent rise of the Soviet Union. The animals, led by the pigs Napoleon and Snowball, overthrow their human farmer in a bid for equality and freedom. However, as time goes on, the pigs become indistinguishable from the humans they replaced, and the farm’s original ideals are betrayed. Orwell uses the fable to critique the corruption of socialist ideals and the rise of totalitarianism, demonstrating how power can corrupt even the most well-intentioned leaders. The famous line, \"All animals are equal, but some animals are more equal than others,\" captures the central theme of the story.
"""
pipeline = Pipeline(
    [
        "english",
        "entity",
        "topics",
        "semantic-theme",
        {
            "name": "chunks.app",
            "options": {
                "add_outline": True,
                "add_themes": True,
                "add_topics": True,
                "header_level": 3,
                "canonicalize": True,
                "cleanup": True,
                "lowercase": True,
                "fix_spelling_mistakes": True,
            },
        },
    ]
)
document = pipeline(text)
for chunk in document.chunks:
    print("Offsets:", chunk.begin_offset, chunk.end_offset)
    print("Outline:", chunk.outline)
    print("Themes:", chunk.themes)
    print("Topics:", chunk.topics)
    print("Sentences:")
    for sentence in chunk.sentences:
        print("  ", sentence)

    print("-" * 30)

```



## License

In both cases you will need to acquirer a license file at https://www.wowool.com

### Non-Commercial

    This library is licensed under the GNU AGPLv3 for non-commercial use.  
    For commercial use, a separate license must be purchased.  

### Commercial license Terms

    1. Grants the right to use this library in proprietary software.  
    2. Requires a valid license key  
    3. Redistribution in SaaS requires a commercial license.  
