Text-based label deduction - Pure Python, No LLMs
What does add.deduce() do?
The add.deduce() function automatically fills in missing labels by learning from your existing labeled examples. For each unlabeled row, it uses text similarity to find the most similar labeled example and assigns that example's label.
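To make the idea concrete, here is a minimal pure-Python sketch of nearest-neighbor label deduction. It is not additory's implementation: the Jaccard word-overlap measure and the names `tokenize`, `jaccard`, and `deduce_labels` are assumptions chosen for illustration only.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def deduce_labels(texts, labels):
    """Fill each None label with the label of the most similar labeled text."""
    # Hypothetical helper, not part of additory's API.
    labeled = [(tokenize(t), l) for t, l in zip(texts, labels) if l is not None]
    if not labeled:
        raise ValueError("need at least one labeled example")
    filled = []
    for text, label in zip(texts, labels):
        if label is None:
            tokens = tokenize(text)
            # max() keeps the first best match when similarities tie
            _, label = max(labeled, key=lambda pair: jaccard(tokens, pair[0]))
        filled.append(label)
    return filled

texts = [
    "Need help with billing",
    "App crashes on submit",
    "Question about billing",
    "App crashes at startup",
]
labels = ["Billing", "Technical", None, None]
print(deduce_labels(texts, labels))
# → ['Billing', 'Technical', 'Billing', 'Technical']
```

Word overlap is a crude similarity measure, so a sketch like this can mislabel short texts that share few words with any labeled example. Reviewing a sample of deduced labels is worthwhile whatever measure is used.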
Common use cases:
| Parameter | Type | Required | Description |
|---|---|---|---|
| df | DataFrame | ✅ Yes | DataFrame with some labeled and some unlabeled rows (pandas, polars, or cuDF) |
| from_column | str or List[str] | ✅ Yes | Text column(s) to analyze. Pass a single column name, or a list of column names for better accuracy |
| to_column | str | ✅ Yes | Label column to fill with deduced values |
Scenario: You have support tickets and want to categorize them. You've manually labeled a few, and want the rest filled in automatically.
import pandas as pd
import additory as add
# Support tickets with some labeled, some unlabeled
tickets = pd.DataFrame({
    "ticket_text": [
        "Cannot log in to my account",
        "How do I reset my password?",
        "App crashes when I click submit",
        "Need help with billing",
        "The app won't open on my phone",
        "Question about my invoice",
        "Error message on login screen",
        "Want to upgrade my subscription"
    ],
    "category": [
        "Technical",
        "Account",
        "Technical",
        "Billing",
        None,  # To be deduced
        None,  # To be deduced
        None,  # To be deduced
        None   # To be deduced
    ]
})
print("Support tickets:")
print(tickets)
# Deduce missing categories from ticket text
result = add.deduce(tickets, from_column="ticket_text", to_column="category")
print("\nResult with deduced categories:")
print(result)
Support tickets:
ticket_text category
0 Cannot log in to my account Technical
1 How do I reset my password? Account
2 App crashes when I click submit Technical
3 Need help with billing Billing
4 The app won't open on my phone NaN
5 Question about my invoice NaN
6 Error message on login screen NaN
7 Want to upgrade my subscription NaN
✓ Deduced 4 labels from 4 examples (offline, no LLMs)
Result with deduced categories:
ticket_text category
0 Cannot log in to my account Technical
1 How do I reset my password? Account
2 App crashes when I click submit Technical
3 Need help with billing Billing
4 The app won't open on my phone Technical
5 Question about my invoice Billing
6 Error message on login screen Technical
7 Want to upgrade my subscription Billing
Scenario: You have task comments and want to deduce completion status.
import pandas as pd
import additory as add
# Task data with comments
tasks = pd.DataFrame({
    "comment": [
        "I finished the task yesterday",
        "Still working on it",
        "Completed the update",
        "Not done yet",
        "Will complete tomorrow",
        "Task is finished",
        "I haven't started",
        "All done and tested"
    ],
    "status": [
        "Completed",
        "Not Completed",
        "Completed",
        "Not Completed",
        None,
        None,
        None,
        None
    ]
})
print("Task data:")
print(tasks)
# Deduce missing statuses
result = add.deduce(tasks, from_column="comment", to_column="status")
print("\nResult:")
print(result)
# Show which labels were deduced
print("\nDeduced labels:")
for idx in range(len(result)):
    # pd.isna() matches both None and NaN, however pandas stores the gap
    if pd.isna(tasks["status"][idx]):
        print(f"  Row {idx}: '{result['comment'][idx][:40]}' → {result['status'][idx]}")
Task data:
comment status
0 I finished the task yesterday Completed
1 Still working on it Not Completed
2 Completed the update Completed
3 Not done yet Not Completed
4 Will complete tomorrow NaN
5 Task is finished NaN
6 I haven't started NaN
7 All done and tested NaN
✓ Deduced 4 labels from 4 examples (offline, no LLMs)
Result:
comment status
0 I finished the task yesterday Completed
1 Still working on it Not Completed
2 Completed the update Completed
3 Not done yet Not Completed
4 Will complete tomorrow Completed
5 Task is finished Completed
6 I haven't started Not Completed
7 All done and tested Completed
Deduced labels:
Row 4: 'Will complete tomorrow' → Completed
Row 5: 'Task is finished' → Completed
Row 6: 'I haven't started' → Not Completed
Row 7: 'All done and tested' → Completed
Scenario: You have multiple text columns and want to use all of them for better accuracy.
import pandas as pd
import additory as add
# Bug reports with title and description
bugs = pd.DataFrame({
    "title": [
        "Login bug",
        "New dashboard feature",
        "Docs outdated",
        "Slow performance",
        "Login issue",
        "Feature request",
        "Documentation error",
        "Performance problem"
    ],
    "description": [
        "Users cannot log in",
        "Add analytics dashboard",
        "API docs need update",
        "App takes 10s to load",
        "Cannot access account",
        "Need export feature",
        "Missing examples",
        "Very slow response"
    ],
    "type": [
        "Bug",
        "Feature",
        "Documentation",
        "Performance",
        None,
        None,
        None,
        None
    ]
})
print("Bug reports:")
print(bugs)
# Use both title and description for better accuracy
result = add.deduce(
    bugs,
    from_column=["title", "description"],  # Multiple columns!
    to_column="type"
)
print("\nResult:")
print(result)
Bug reports:
title description type
0 Login bug Users cannot log in Bug
1 New dashboard feature Add analytics dashboard Feature
2 Docs outdated API docs need update Documentation
3 Slow performance App takes 10s to load Performance
4 Login issue Cannot access account NaN
5 Feature request Need export feature NaN
6 Documentation error Missing examples NaN
7 Performance problem Very slow response NaN
✓ Deduced 4 labels from 4 examples (offline, no LLMs)
Result:
title description type
0 Login bug Users cannot log in Bug
1 New dashboard feature Add analytics dashboard Feature
2 Docs outdated API docs need update Documentation
3 Slow performance App takes 10s to load Performance
4 Login issue Cannot access account Bug
5 Feature request Need export feature Feature
6 Documentation error Missing examples Documentation
7 Performance problem Very slow response Performance
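One plausible way to use several text columns is to join them into a single string per row before comparing texts. Whether additory does this internally is not stated here, so treat `combine_columns` below as a hypothetical helper and the row-as-dict layout as an assumption made to keep the sketch free of dependencies.

```python
def combine_columns(rows, columns, sep=" "):
    """Join the given columns of each row into one text, skipping missing values."""
    # Hypothetical helper for illustration; not part of additory's API.
    return [
        sep.join(str(row[col]) for col in columns if row.get(col) is not None)
        for row in rows
    ]

rows = [
    {"title": "Login bug", "description": "Users cannot log in"},
    {"title": "Docs outdated", "description": None},
]
combined = combine_columns(rows, ["title", "description"])
print(combined)
# → ['Login bug Users cannot log in', 'Docs outdated']
```

Joining columns gives the similarity step more words to match on, which is one reason extra columns can help when titles alone are short or ambiguous.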
# Single column
result = add.deduce(df, from_column="comment", to_column="status")
# Multiple columns (better accuracy)
result = add.deduce(
    df,
    from_column=["comment", "notes", "description"],
    to_column="status"
)
# Works with polars too
import polars as pl
df_polars = pl.DataFrame({...})
result = add.deduce(df_polars, from_column="text", to_column="label")