sparkly.index_optimizer package
Submodules
sparkly.index_optimizer.index_optimizer module
- class sparkly.index_optimizer.index_optimizer.IndexOptimizer(is_dedupe: bool, scorer: QueryScorer | None = None, conf: float = 0.99, init_top_k: int = 10, max_combination_size: int = 3, opt_query_limit: int = 250, sample_size: int = 10000, use_early_pruning: bool = True)
Bases:
object
a class for optimizing the search columns and analyzers for indexes
- Attributes:
- index
Methods
make_index_config(df[, id_col])
create the starting index config, which can then be used for optimization; throws out any columns where the average number of whitespace-delimited tokens is >= 50
optimize(index, search_df)
create a query spec optimized for searching for search_df using index
- property index
- make_index_config(df: DataFrame, id_col='_id') -> IndexConfig
create the starting index config, which can then be used for optimization; throws out any columns where the average number of whitespace-delimited tokens is >= 50
- Parameters:
- df : pyspark.sql.DataFrame
the dataframe that we want to generate a config for
- id_col : str
the unique id column for the records in the dataframe
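The column-pruning rule described for make_index_config can be illustrated with a minimal pure-Python sketch. This is not sparkly's implementation: the helper names `average_token_count` and `columns_to_keep`, the list-of-dicts input, the handling of missing values as zero tokens, and the exclusion of the id column are all assumptions made for illustration. Only the threshold (average whitespace-delimited token count >= 50 means the column is dropped) comes from the description above.

```python
from typing import Any, Dict, List, Optional


def average_token_count(values: List[Optional[Any]]) -> float:
    """Average number of whitespace-delimited tokens per value.

    Assumption: non-string values (including None) count as zero tokens.
    """
    counts = [len(v.split()) if isinstance(v, str) else 0 for v in values]
    return sum(counts) / len(counts) if counts else 0.0


def columns_to_keep(
    rows: List[Dict[str, Any]],
    id_col: str = "_id",
    max_avg_tokens: float = 50.0,
) -> List[str]:
    """Return the non-id columns whose average token count is below the limit."""
    if not rows:
        return []
    keep = []
    for col in rows[0]:
        if col == id_col:
            continue  # assumption: the id column is never a search column
        avg = average_token_count([row.get(col) for row in rows])
        if avg < max_avg_tokens:  # columns with an average >= 50 are thrown out
            keep.append(col)
    return keep


# Hypothetical records: "body" averages exactly 50 tokens, so it is dropped.
rows = [
    {"_id": 1, "title": "red apple", "body": "word " * 60},
    {"_id": 2, "title": "green pear", "body": "word " * 40},
]
kept = columns_to_keep(rows)
```

Here `kept` contains only "title", because the two "body" values average (60 + 40) / 2 = 50 tokens, which meets the >= 50 cutoff.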
- optimize(index: Index, search_df: DataFrame) -> QuerySpec
- Parameters:
- index : Index
the index that will have an optimized query spec created for it
- search_df : pyspark.sql.DataFrame
the records that will be used to choose the query spec
- Returns:
- QuerySpec
a query spec optimized for searching for search_df using index
sparkly.index_optimizer.query_scorer module
- class sparkly.index_optimizer.query_scorer.AUCQueryScorer
Bases:
QueryScorer
Methods
score_query_result
score_query_results
- score_query_result(query_result, query_spec, drop_first) -> float
- score_query_results(query_results, query_spec, drop_first) -> list
- class sparkly.index_optimizer.query_scorer.QueryScorer
Bases:
ABC
Methods
score_query_results(query_results, query_spec)
score_query_result(query_result, query_spec)
- abstract score_query_result(query_result, query_spec) -> float
- abstract score_query_results(query_results, query_spec) -> list
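Since QueryScorer is an abstract base with two abstract methods, a custom scorer has to implement both. The sketch below shows that pattern without importing sparkly: `QueryScorerLike` is a stand-in that mirrors only the two signatures listed above, `FakeQueryResult` and its `scores` field are hypothetical, and scoring by the mean is an arbitrary choice for illustration rather than anything the library does.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class FakeQueryResult:
    """Hypothetical stand-in for a query result; only `scores` is assumed."""
    scores: List[float]


class QueryScorerLike(ABC):
    """Stand-in mirroring the two abstract methods of QueryScorer."""

    @abstractmethod
    def score_query_result(self, query_result, query_spec) -> float:
        ...

    @abstractmethod
    def score_query_results(self, query_results, query_spec) -> list:
        ...


class MeanScoreScorer(QueryScorerLike):
    """Illustrative scorer: rates one result by the mean of its scores."""

    def score_query_result(self, query_result, query_spec) -> float:
        scores = query_result.scores
        # Assumption: an empty result scores 0.0 instead of raising.
        return sum(scores) / len(scores) if scores else 0.0

    def score_query_results(self, query_results, query_spec) -> list:
        # The batch method applies the single-result method to each item.
        return [self.score_query_result(r, query_spec) for r in query_results]


scorer = MeanScoreScorer()
batch = [FakeQueryResult([1.0, 3.0]), FakeQueryResult([])]
batch_scores = scorer.score_query_results(batch, query_spec=None)
```

With the real library, the subclass would presumably derive from sparkly.index_optimizer.query_scorer.QueryScorer and be passed as the scorer argument of IndexOptimizer. Note that AUCQueryScorer's methods also take a drop_first argument that the base signatures do not list.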
- class sparkly.index_optimizer.query_scorer.RankQueryScorer(threshold, k)
Bases:
QueryScorer
Methods
score_query_result
score_query_results
- score_query_result(query_result, query_spec) -> float
- score_query_results(query_results, query_spec) -> list
- sparkly.index_optimizer.query_scorer.compute_wilcoxon_score(x, y)
- sparkly.index_optimizer.query_scorer.score_query_result(scores, drop_first=False)
- sparkly.index_optimizer.query_scorer.score_query_result_sum(scores)
- sparkly.index_optimizer.query_scorer.score_query_results(query_results)