sparkly package¶
Subpackages¶
- sparkly.index package
- Submodules
- sparkly.index.index_base module
- sparkly.index.index_config module
- sparkly.index.lucene_index module
LuceneIndex
LuceneIndex.ANALYZERS
LuceneIndex.LUCENE_DIR
LuceneIndex.PY_META_FILE
LuceneIndex.config
LuceneIndex.deinit()
LuceneIndex.delete_docs()
LuceneIndex.get_full_query_spec()
LuceneIndex.id_to_lucene_id()
LuceneIndex.index_path
LuceneIndex.init()
LuceneIndex.is_built
LuceneIndex.is_on_spark
LuceneIndex.num_indexed_docs()
LuceneIndex.query_gen
LuceneIndex.score_docs()
LuceneIndex.search()
LuceneIndex.search_many()
LuceneIndex.to_spark()
LuceneIndex.upsert_docs()
- Module contents
- sparkly.index_optimizer package
- sparkly.query_generator package
Submodules¶
sparkly.index_config module¶
- class sparkly.index_config.IndexConfig(*, store_vectors: bool = False, id_col: str = '_id', weighted_queries: bool = False)¶
Bases:
object
- Attributes:
id_col
The unique id column for the records in the index; this must be a 32- or 64-bit integer
is_frozen
True if this config is frozen (not modifiable), else False
store_vectors
True if the term vectors in the index should be stored, else False
weighted_queries
True if queries generated from this config should be weighted, else False
Methods
- add_concat_field(field, concat_fields, analyzers)
Add a new concat field to be indexed with this config
- add_field(field, analyzers)
Add a new field to be indexed with this config
- freeze()
Return a frozen deepcopy of this index config
- from_json(data)
Construct an index config from a dict or JSON string; see IndexConfig.to_dict for the expected format
- get_analyzed_fields([query_spec])
Get the fields used by the index or query_spec
- remove_field(field)
Remove a field from the config
- to_dict()
Convert this IndexConfig to a dictionary which can easily be stored as JSON
- to_json()
Dump this IndexConfig to a valid JSON string
(See the usage sketch after this class entry.)
- add_concat_field(field: str, concat_fields: Iterable[str], analyzers: Iterable[str])¶
Add a new concat field to be indexed with this config
- Parameters:
- field : str
The name of the field that will be added to the index
- concat_fields : set, list, or tuple of str
the fields in the table that will be concatenated together to create field
- analyzers : set, list, or tuple of str
The names of the analyzers that will be used to index the field
- add_field(field: str, analyzers: Iterable[str])¶
Add a new field to be indexed with this config
- Parameters:
- field : str
The name of the field in the table to be added to the index
- analyzers : set, list, or tuple of str
The names of the analyzers that will be used to index the field
- freeze()¶
- Returns:
- IndexConfig
a frozen deepcopy of this index config
- classmethod from_json(data)¶
construct an index config from a dict or json string; see IndexConfig.to_dict for the expected format
- Returns:
- IndexConfig
- get_analyzed_fields(query_spec=None)¶
Get the fields used by the index or query_spec. If query_spec is None, the fields that are used by the index are returned.
- Parameters:
- query_spec : QuerySpec, optional
if provided, the fields that query_spec uses in creating a query are returned
- Returns:
- list of str
the fields used
- property id_col¶
The unique id column for the records in the index; this must be a 32- or 64-bit integer
- property is_frozen¶
- Returns:
- bool
True if this index config is frozen (not modifiable), else False
- remove_field(field: str)¶
remove a field from the config
- Parameters:
- field : str
the field to be removed from the config
- Returns:
- bool
True if the field existed else False
- property store_vectors¶
True if the term vectors in the index should be stored, else False
- to_dict()¶
convert this IndexConfig to a dictionary which can easily be stored as json
- Returns:
- dict
A dictionary representation of this IndexConfig
- to_json()¶
Dump this IndexConfig to a valid JSON string
- Returns:
- str
- property weighted_queries¶
True if queries generated from this config should be weighted, else False
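A minimal usage sketch of IndexConfig follows. It assumes the analyzer names 'standard' and '3gram' are among the analyzers registered for the index (see LuceneIndex.ANALYZERS); substitute whichever analyzer names your build exposes.

```python
from sparkly.index_config import IndexConfig

# analyzer names here ('standard', '3gram') are assumed to be registered
# analyzers; check LuceneIndex.ANALYZERS for the names your build supports
config = IndexConfig(id_col='_id')
config.add_field('title', ['standard', '3gram'])
config.add_field('description', ['standard'])
# index a derived field built by concatenating title and description
config.add_concat_field('title_description', ['title', 'description'], ['standard'])

frozen = config.freeze()              # frozen deepcopy; `config` itself stays editable
assert frozen.is_frozen

# round-trip through JSON, e.g. to store the config alongside the index
restored = IndexConfig.from_json(config.to_json())
print(restored.get_analyzed_fields())  # fields the index will analyze
```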
sparkly.analysis module¶
- class sparkly.analysis.Gram2Analyzer¶
Bases:
PythonAnalyzer
- Attributes: class, reuseStrategy, self
- Methods: ReuseStrategy, TokenStreamComponents, cast_, close, createComponents, equals, finalize, getClass, getOffsetGap, getPositionIncrementGap, getReuseStrategy, hashCode, initReader, instance_, normalize, notify, notifyAll, pythonExtension, toString, tokenStream, wait
- createComponents(fieldName)¶
- class sparkly.analysis.Gram3Analyzer¶
Bases:
PythonAnalyzer
- Attributes: class, reuseStrategy, self
- Methods: ReuseStrategy, TokenStreamComponents, cast_, close, createComponents, equals, finalize, getClass, getOffsetGap, getPositionIncrementGap, getReuseStrategy, hashCode, initReader, instance_, normalize, notify, notifyAll, pythonExtension, toString, tokenStream, wait
- createComponents(fieldName)¶
- class sparkly.analysis.Gram4Analyzer¶
Bases:
PythonAnalyzer
- Attributes: class, reuseStrategy, self
- Methods: ReuseStrategy, TokenStreamComponents, cast_, close, createComponents, equals, finalize, getClass, getOffsetGap, getPositionIncrementGap, getReuseStrategy, hashCode, initReader, instance_, normalize, notify, notifyAll, pythonExtension, toString, tokenStream, wait
- createComponents(fieldName)¶
- class sparkly.analysis.PythonAlnumTokenFilter(tokenStream)¶
Bases:
PythonFilteringTokenFilter
- Attributes: attributeClassesIterator, attributeFactory, attributeImplsIterator, class, self
- Methods: State, accept, addAttribute, addAttributeImpl, captureState, cast_, clearAttributes, cloneAttributes, close, copyTo, end, endAttributes, equals, finalize, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, getClass, hasAttribute, hasAttributes, hashCode, incrementToken, instance_, notify, notifyAll, pythonExtension, reflectAsString, reflectWith, removeAllAttributes, reset, restoreState, toString, unwrap, wait
- accept()¶
- class sparkly.analysis.StandardEdgeGram36Analyzer¶
Bases:
PythonAnalyzer
- Attributes: class, reuseStrategy, self
- Methods: ReuseStrategy, TokenStreamComponents, cast_, close, createComponents, equals, finalize, getClass, getOffsetGap, getPositionIncrementGap, getReuseStrategy, hashCode, initReader, instance_, normalize, notify, notifyAll, pythonExtension, toString, tokenStream, wait
- createComponents(fieldName)¶
- class sparkly.analysis.StrippedGram3Analyzer¶
Bases:
PythonAnalyzer
- Attributes: class, reuseStrategy, self
- Methods: ReuseStrategy, TokenStreamComponents, cast_, close, createComponents, equals, finalize, getClass, getOffsetGap, getPositionIncrementGap, getReuseStrategy, hashCode, initReader, instance_, normalize, notify, notifyAll, pythonExtension, toString, tokenStream, wait
- createComponents(fieldName)¶
- initReader(fieldName, reader)¶
- class sparkly.analysis.UnfilteredGram3Analyzer¶
Bases:
PythonAnalyzer
- Attributes: class, reuseStrategy, self
- Methods: ReuseStrategy, TokenStreamComponents, cast_, close, createComponents, equals, finalize, getClass, getOffsetGap, getPositionIncrementGap, getReuseStrategy, hashCode, initReader, instance_, normalize, notify, notifyAll, pythonExtension, toString, tokenStream, wait
- createComponents(fieldName)¶
- class sparkly.analysis.UnfilteredGram5Analyzer¶
Bases:
PythonAnalyzer
- Attributes: class, reuseStrategy, self
- Methods: ReuseStrategy, TokenStreamComponents, cast_, close, createComponents, equals, finalize, getClass, getOffsetGap, getPositionIncrementGap, getReuseStrategy, hashCode, initReader, instance_, normalize, notify, notifyAll, pythonExtension, toString, tokenStream, wait
- createComponents(fieldName)¶
- sparkly.analysis.analyze(analyzer, text, with_offset=False)¶
Apply the analyzer to the text and return the tokens, optionally with offsets
- Parameters:
- analyzer
The lucene analyzer to be applied
- text : str
the text that will be analyzed
- with_offset : bool
if True, return the offsets with the tokens in the form (TOKEN, START_OFFSET, END_OFFSET)
- Returns:
- list of str or tuples
a list of tokens potentially with offsets
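A small sketch of tokenizing a string outside of Spark, assuming the JVM has not already been started elsewhere in the process; the example output shown in the comments is illustrative.

```python
from sparkly.utils import init_jvm
from sparkly.analysis import get_standard_analyzer_no_stop_words, analyze

init_jvm()  # PyLucene must be running before any analyzer is constructed
analyzer = get_standard_analyzer_no_stop_words()

print(analyze(analyzer, 'Apache Lucene 9.4'))
# e.g. ['apache', 'lucene', '9.4']
print(analyze(analyzer, 'Apache Lucene', with_offset=True))
# e.g. [('apache', 0, 6), ('lucene', 7, 13)]
```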
- sparkly.analysis.analyze_generator(analyzer, text, with_offset=False)¶
Apply the analyzer to the text and return the tokens, optionally with offsets
- Parameters:
- analyzer
The lucene analyzer to be applied
- text : str
the text that will be analyzed
- with_offset : bool
if True, return the offsets with the tokens in the form (TOKEN, START_OFFSET, END_OFFSET)
- Returns:
- generator of str or tuples
a generator of tokens, potentially with offsets
- sparkly.analysis.get_shingle_analyzer()¶
- sparkly.analysis.get_standard_analyzer_no_stop_words()¶
sparkly.search module¶
- class sparkly.search.Searcher(index: Index, search_chunk_size: int = 500)¶
Bases:
object
class for performing bulk search over a dataframe
Methods
- get_full_query_spec()
get a query spec that searches on all indexed fields
- search(search_df, query_spec, limit[, id_col])
perform search for all the records in search_df according to query_spec
- get_full_query_spec()¶
get a query spec that searches on all indexed fields
- search(search_df: DataFrame, query_spec: QuerySpec, limit: int, id_col: str = '_id')¶
perform search for all the records in search_df according to query_spec
- Parameters:
- search_df : pyspark.sql.DataFrame
the records used for searching
- query_spec : QuerySpec
the query spec for searching
- limit : int
the number of top-k results that will be retrieved for each query
- id_col : str
the id column from search_df that will be output with the query results
- Returns:
- pyspark DataFrame
a pyspark dataframe with the schema (id_col, ids array<long>, scores array<float>, search_time float)
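A sketch of a bulk top-k search. It assumes `index` is an already-built sparkly index (for example a LuceneIndex; construction is not shown here) and that `search.parquet` is an illustrative path to the query records, which include an `_id` column.

```python
from pyspark.sql import SparkSession
from sparkly.search import Searcher

spark = SparkSession.builder.getOrCreate()
search_df = spark.read.parquet('search.parquet')    # illustrative input path

searcher = Searcher(index, search_chunk_size=500)   # index: an already-built sparkly index
spec = searcher.get_full_query_spec()               # query on every indexed field
results = searcher.search(search_df, spec, limit=50, id_col='_id')

# results has the schema (_id, ids array<long>, scores array<float>, search_time float)
results.show()
```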
- sparkly.search.search(index, query_spec, limit, search_recs)¶
- sparkly.search.search_gen(index, query_spec, limit, search_recs)¶
sparkly.utils module¶
- class sparkly.utils.Timer¶
Bases:
object
utility class for timing execution of code
Methods
- get_interval()
get the time that has elapsed since the object was created or the last time get_interval() was called
- get_total()
get total time this Timer has been alive
- set_start_time()
set the start time to the current time
- get_interval()¶
get the time that has elapsed since the object was created or the last time get_interval() was called
- Returns:
- float
- get_total()¶
get total time this Timer has been alive
- Returns:
- float
- set_start_time()¶
set the start time to the current time
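A small sketch of timing two stages of a job; the time.sleep calls stand in for real work.

```python
import time
from sparkly.utils import Timer

timer = Timer()

time.sleep(0.5)                                   # stand-in for a first stage of work
print(f'stage 1 took {timer.get_interval():.2f}s')

time.sleep(0.25)                                  # stand-in for a second stage
print(f'stage 2 took {timer.get_interval():.2f}s')

print(f'total {timer.get_total():.2f}s')          # time since the Timer was created
```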
- sparkly.utils.atomic_unzip(zip_file_name, output_loc)¶
atomically unzip the file; that is, this function is safe to call from multiple threads at the same time
- Parameters:
- zip_file_name : str
the name of the file to be unzipped
- output_loc : str
the location that the file will be unzipped to
- sparkly.utils.attach_current_thread_jvm()¶
attach the current thread to the jvm for PyLucene
- sparkly.utils.auc(x)¶
- sparkly.utils.get_index_name(n, *postfixes)¶
utility function for generating index names in a uniform way
- sparkly.utils.get_logger(name, level=10)¶
Get the logger for a module
- Returns:
- Logger
- sparkly.utils.init_jvm(vmargs=[])¶
initialize the jvm for PyLucene
- Parameters:
- vmargs : list of str
the jvm args to be passed to the vm
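A sketch of JVM startup and thread attachment; the heap-size flag is an illustrative choice, not a required value.

```python
import threading
from sparkly.utils import init_jvm, attach_current_thread_jvm

# start the JVM once per process before touching any PyLucene classes;
# the -Xmx flag is illustrative, not required
init_jvm(['-Xmx4g'])

def worker():
    # threads created after JVM startup must attach themselves
    # before calling into any Lucene classes
    attach_current_thread_jvm()
    ...  # do Lucene work here

threading.Thread(target=worker).start()
```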
- sparkly.utils.invoke_task(task)¶
invoke a task created by joblib.delayed
- sparkly.utils.is_null(o)¶
check if the object is null; this exists to work around the inconsistent behavior of np.isnan and pd.isnull
- sparkly.utils.is_persisted(df)¶
check if the pyspark dataframe is persisted
- sparkly.utils.kill_loky_workers()¶
kill all the child loky processes of this process; used to prevent joblib from holding onto resources after using joblib.Parallel to do computation
- sparkly.utils.local_parquet_to_spark_df(file)¶
- sparkly.utils.norm_auc(x)¶
- sparkly.utils.persisted(df, storage_level=StorageLevel(True, True, False, False, 1))¶
context manager for persisting a dataframe in a with statement. This automatically unpersists the dataframe at the end of the context
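A sketch of the context manager, assuming it yields the persisted dataframe; the toy dataframe built with spark.range stands in for any dataframe that is reused several times inside the block.

```python
from pyspark.sql import SparkSession
from sparkly.utils import persisted

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)          # toy dataframe reused several times below

# assumes the context manager yields the persisted dataframe
with persisted(df) as cached:
    n = cached.count()
    sample = cached.limit(10).toPandas()
# the dataframe is automatically unpersisted here
```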
- sparkly.utils.repartition_df(df, part_size, by=None)¶
repartition the dataframe into chunks of size 'part_size' by column 'by'
- sparkly.utils.spark_to_pandas_stream(df, chunk_size, by='_id')¶
repartition df into chunks of size chunk_size and return an iterator of pandas dataframes
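A sketch of streaming a large dataframe to the driver in bounded pieces instead of collecting it all at once; the toy dataframe with an integer `_id` column is illustrative.

```python
from pyspark.sql import SparkSession
from sparkly.utils import spark_to_pandas_stream

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000).withColumnRenamed('id', '_id')   # toy dataframe with an _id column

# pull pandas chunks of roughly 10,000 rows at a time
for chunk in spark_to_pandas_stream(df, chunk_size=10_000, by='_id'):
    print(len(chunk), chunk['_id'].min(), chunk['_id'].max())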
- sparkly.utils.type_check(var, var_name, expected)¶
type checking utility; throws a TypeError if var is not the expected type
- sparkly.utils.type_check_iterable(var, var_name, expected_var_type, expected_element_type)¶
type checking utility for iterables; throws a TypeError if var is not the expected type or any of its elements is not the expected element type
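A sketch of how these helpers might guard a function's arguments; the function and its parameters here are hypothetical.

```python
from sparkly.utils import type_check, type_check_iterable

def make_config(field, analyzers):
    # a TypeError is raised if an argument has the wrong type
    type_check(field, 'field', str)
    type_check_iterable(analyzers, 'analyzers', list, str)
    return {field: analyzers}

make_config('title', ['standard', '3gram'])
```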
- sparkly.utils.zip_dir(d, outfile=None)¶
Zip a directory d and output it to outfile. If outfile is not provided, the zipped file is output in /tmp
- Parameters:
- d : str or Path
the directory to be zipped
- outfile : str or Path, optional
the output location of the zipped file
- Returns:
- Path
the path to the new zip file
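A sketch of zipping a directory and unpacking it elsewhere; the temporary directory stands in for a real index directory.

```python
import pathlib
import tempfile
from sparkly.utils import zip_dir, atomic_unzip

src = pathlib.Path(tempfile.mkdtemp())      # stand-in for an index directory
(src / 'example.txt').write_text('hello')

zip_path = zip_dir(src)                     # no outfile given, so the archive lands in /tmp
# safe even if several threads try to unpack the same archive concurrently
atomic_unzip(str(zip_path), str(src) + '_copy')
```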