add.deduce()

Text-based label deduction - Pure Python, No LLMs

🔒 Privacy-First & Offline

Your data never leaves your machine. This function uses pure Python text similarity (TF-IDF + cosine similarity) with no LLMs, no external API calls, and no telemetry. Everything runs locally on your computer.

What does add.deduce() do?

The add.deduce() function automatically fills in missing labels by learning from your existing labeled examples. It uses text similarity to find the most similar labeled example for each unlabeled row.

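Common use cases, as shown in the examples below: categorizing support tickets, deducing task status from comments, and classifying bug reports by type.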

📖 Parameters

Parameter    | Type             | Required | Description
df           | DataFrame        | ✅ Yes   | DataFrame with some labeled and some unlabeled rows (pandas, polars, or cuDF)
from_column  | str or List[str] | ✅ Yes   | Text column(s) to analyze: a single column name, or a list of column names for better accuracy
to_column    | str              | ✅ Yes   | Label column to fill with deduced values

🚀 Example 1: Support Ticket Categorization (Simplest)

Scenario: You have support tickets and want to categorize them. You've manually labeled a few, and want the rest filled in automatically.

Setup: Create sample support tickets
import pandas as pd
import additory as add

# Support tickets with some labeled, some unlabeled
tickets = pd.DataFrame({
    "ticket_text": [
        "Cannot log in to my account",
        "How do I reset my password?",
        "App crashes when I click submit",
        "Need help with billing",
        "The app won't open on my phone",
        "Question about my invoice",
        "Error message on login screen",
        "Want to upgrade my subscription"
    ],
    "category": [
        "Technical",
        "Account",
        "Technical",
        "Billing",
        None,  # To be deduced
        None,  # To be deduced
        None,  # To be deduced
        None   # To be deduced
    ]
})

print("Support tickets:")
print(tickets)
Deduce missing categories
# Deduce missing categories from ticket text
result = add.deduce(tickets, from_column="ticket_text", to_column="category")

print("\nResult with deduced categories:")
print(result)
Output
Support tickets:
                       ticket_text   category
0      Cannot log in to my account  Technical
1      How do I reset my password?    Account
2  App crashes when I click submit  Technical
3           Need help with billing    Billing
4   The app won't open on my phone        NaN
5        Question about my invoice        NaN
6    Error message on login screen        NaN
7  Want to upgrade my subscription        NaN

✓ Deduced 4 labels from 4 examples (offline, no LLMs)

Result with deduced categories:
                       ticket_text   category
0      Cannot log in to my account  Technical
1      How do I reset my password?    Account
2  App crashes when I click submit  Technical
3           Need help with billing    Billing
4   The app won't open on my phone  Technical
5        Question about my invoice    Billing
6    Error message on login screen  Technical
7  Want to upgrade my subscription    Billing

📝 Example 2: Task Status from Comments

Scenario: You have task comments and want to deduce completion status.

Setup: Task data with comments
import pandas as pd
import additory as add

# Task data with comments
tasks = pd.DataFrame({
    "comment": [
        "I finished the task yesterday",
        "Still working on it",
        "Completed the update",
        "Not done yet",
        "Will complete tomorrow",
        "Task is finished",
        "I haven't started",
        "All done and tested"
    ],
    "status": [
        "Completed",
        "Not Completed",
        "Completed",
        "Not Completed",
        None,
        None,
        None,
        None
    ]
})

print("Task data:")
print(tasks)
Deduce task status
# Deduce missing statuses
result = add.deduce(tasks, from_column="comment", to_column="status")

print("\nResult:")
print(result)

# Show which labels were deduced
print("\nDeduced labels:")
for idx in range(len(result)):
    # pd.isna() catches both None and NaN in the original label column
    if pd.isna(tasks["status"][idx]):
        print(f"  Row {idx}: '{result['comment'][idx]}' → {result['status'][idx]}")
Output
Task data:
                         comment         status
0  I finished the task yesterday      Completed
1            Still working on it  Not Completed
2           Completed the update      Completed
3                   Not done yet  Not Completed
4         Will complete tomorrow            NaN
5               Task is finished            NaN
6              I haven't started            NaN
7            All done and tested            NaN

✓ Deduced 4 labels from 4 examples (offline, no LLMs)

Result:
                         comment         status
0  I finished the task yesterday      Completed
1            Still working on it  Not Completed
2           Completed the update      Completed
3                   Not done yet  Not Completed
4         Will complete tomorrow      Completed
5               Task is finished      Completed
6              I haven't started  Not Completed
7            All done and tested      Completed

Deduced labels:
  Row 4: 'Will complete tomorrow' → Completed
  Row 5: 'Task is finished' → Completed
  Row 6: 'I haven't started' → Not Completed
  Row 7: 'All done and tested' → Completed

🔍 Example 3: Multiple Text Columns for Better Accuracy

Scenario: You have multiple text columns and want to use all of them for better accuracy.

Setup: Bug reports with title and description
import pandas as pd
import additory as add

# Bug reports with title and description
bugs = pd.DataFrame({
    "title": [
        "Login bug",
        "New dashboard feature",
        "Docs outdated",
        "Slow performance",
        "Login issue",
        "Feature request",
        "Documentation error",
        "Performance problem"
    ],
    "description": [
        "Users cannot log in",
        "Add analytics dashboard",
        "API docs need update",
        "App takes 10s to load",
        "Cannot access account",
        "Need export feature",
        "Missing examples",
        "Very slow response"
    ],
    "type": [
        "Bug",
        "Feature",
        "Documentation",
        "Performance",
        None,
        None,
        None,
        None
    ]
})

print("Bug reports:")
print(bugs)
Deduce using both title and description
# Use both title and description for better accuracy
result = add.deduce(
    bugs,
    from_column=["title", "description"],  # Multiple columns!
    to_column="type"
)

print("\nResult:")
print(result)
Output
Bug reports:
                   title              description           type
0              Login bug      Users cannot log in            Bug
1  New dashboard feature  Add analytics dashboard        Feature
2          Docs outdated     API docs need update  Documentation
3       Slow performance    App takes 10s to load    Performance
4            Login issue    Cannot access account            NaN
5        Feature request      Need export feature            NaN
6    Documentation error         Missing examples            NaN
7    Performance problem       Very slow response            NaN

✓ Deduced 4 labels from 4 examples (offline, no LLMs)

Result:
                   title              description           type
0              Login bug      Users cannot log in            Bug
1  New dashboard feature  Add analytics dashboard        Feature
2          Docs outdated     API docs need update  Documentation
3       Slow performance    App takes 10s to load    Performance
4            Login issue    Cannot access account            Bug
5        Feature request      Need export feature        Feature
6    Documentation error         Missing examples  Documentation
7    Performance problem       Very slow response    Performance
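
When from_column is a list, the text from each column has to be merged into a single string per row before it can be compared. The sketch below shows one plausible way to do that in plain Python. The helper name combine_text_columns and the space-joining strategy are assumptions for illustration; the source does not say how additory combines columns internally.

```python
def combine_text_columns(row, columns):
    """Join the non-missing values of `columns` into one space-separated string.

    `row` is a plain dict standing in for one DataFrame row (an assumption
    made to keep this sketch dependency-free).
    """
    # Accept a single column name as well as a list, like from_column does
    if isinstance(columns, str):
        columns = [columns]
    parts = [str(row[name]) for name in columns if row.get(name) is not None]
    return " ".join(parts)


row = {"title": "Login issue", "description": "Cannot access account", "type": None}
print(combine_text_columns(row, ["title", "description"]))
# Login issue Cannot access account
```

Merging the columns gives each row more words to match on, which is the intuition behind using both title and description.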

⚙️ How It Works

Step 1: Tokenizes text into words (lowercase, removes special characters)
Step 2: Creates TF (term frequency) vectors for each text
Step 3: Computes cosine similarity between unlabeled and labeled examples
Step 4: Assigns the label from the most similar labeled example
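
The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not additory's actual implementation: the function names (tokenize, cosine_similarity, nearest_label) and the sample data are hypothetical, and it uses plain term-frequency counts without the IDF weighting mentioned in the overview, so it will not necessarily reproduce the outputs shown in the examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Step 1: lowercase the text and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Step 3: cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def nearest_label(text, labeled_examples):
    """Steps 2 and 4: build TF vectors, return the label of the best match."""
    query = Counter(tokenize(text))
    best_label, best_score = None, -1.0
    for example_text, label in labeled_examples:
        score = cosine_similarity(query, Counter(tokenize(example_text)))
        if score > best_score:  # ties keep the earliest example
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled examples, not taken from the library
labeled = [
    ("refund for my invoice", "Billing"),
    ("app crashes on startup", "Technical"),
]
print(nearest_label("invoice refund question", labeled))  # Billing
```

Because only exact word overlap counts in this sketch, texts that share no tokens with any labeled example all score 0 and fall back to the first example's label.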

⚠️ Important Notes

Minimum Examples: Requires at least 3 labeled examples to work. More labeled examples generally give better accuracy.
Multiple Columns: Using multiple text columns (for example, title + description) can improve accuracy.
DataFrame Support: Works with pandas, polars, and cuDF DataFrames.
Privacy: 100% offline. No external connections, no LLMs, no telemetry.
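
Given the minimum-examples note above, it can be useful to count labeled rows before calling add.deduce(). The following is a small sketch using only the standard library. The helper names is_missing and has_enough_labels are hypothetical and not part of additory, and the default of 3 is taken from the note above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as a missing label."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def has_enough_labels(labels, minimum=3):
    """Return True when at least `minimum` labels are present."""
    labeled = sum(1 for value in labels if not is_missing(value))
    return labeled >= minimum


# Label column from Example 1: 4 labeled rows, 4 unlabeled
categories = ["Technical", "Account", "Technical", "Billing", None, None, None, None]
print(has_enough_labels(categories))  # True

# Only one real label here, so the check fails
print(has_enough_labels(["Bug", None, float("nan")]))  # False
```

With a pandas DataFrame, the label column can be passed as a list, for example has_enough_labels(tickets["category"].tolist()).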

🎯 Quick Reference

Basic syntax templates
# Single column
result = add.deduce(df, from_column="comment", to_column="status")

# Multiple columns (better accuracy)
result = add.deduce(
    df,
    from_column=["comment", "notes", "description"],
    to_column="status"
)

# Works with polars too
import polars as pl
df_polars = pl.DataFrame({...})
result = add.deduce(df_polars, from_column="text", to_column="label")

🔒 Privacy & Trust

Why You Can Trust add.deduce()