The Search Module

The search module of the repytah package holds functions used to find and record the diagonals in the thresholded matrix, T. These functions prepare the found diagonals for later transformation and assembly.

  • find_complete_list: Finds all smaller diagonals (and the associated pairs of repeats) that are contained in pair_list, which is composed of larger diagonals found in `find_initial_repeats <https://github.com/smith-tinkerlab/repytah/blob/main/docs/utilities_vignette.ipynb>`__.

  • find_all_repeats: Finds all the diagonals present in thresh_mat. This function is nearly identical to find_initial_repeats except for two crucial differences. First, we do not remove diagonals after we find them. Second, there is no smallest bandwidth size as we are looking for all diagonals.

  • find_complete_list_anno_only: Finds annotations for all pairs of repeats found in find_all_repeats. This list contains all the pairs of repeated structures with their starting/ending indices and lengths.

The following function is imported from the `utilities <https://github.com/smith-tinkerlab/repytah/blob/main/docs/utilities_vignette.ipynb>`__ module to reformat outputs and assist with the operations of the `search <https://github.com/smith-tinkerlab/repytah/blob/main/docs/search_vignette.ipynb>`__ functions.

  • add_annotations

For more in-depth information on the function calls, an example function pipeline is shown below. Functions from the current module are shown in purple.

[Flow chart: example function pipeline, with functions from the search module shown in purple]

Importing necessary modules

[1]:
# NumPy is used for mathematical calculations
import numpy as np

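# Import the search functions used in this vignette
from repytah.search import (find_complete_list, find_all_repeats,
                            find_complete_list_anno_only)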

find_complete_list

As seen in the flow chart, find_initial_repeats is called by `example <https://github.com/smith-tinkerlab/repytah/blob/main/docs/example_vignette.ipynb>`__ right before find_complete_list. In find_complete_list, smaller pairs of repeats are added to the original list of pairs of repeats made in find_initial_repeats. Each pair of repeats corresponds to a repeated structure in another NumPy array called thresh_mat. This array holds all the repeated structures in a sequential data stream, with each repeated structure represented as a diagonal.

The inputs for the function are:

  • pair_list (np.ndarray): List of pairs of repeats found in earlier steps (bandwidths MUST be in ascending order). If you have run find_initial_repeats before this function, then pair_list will already be ordered correctly.

  • song_length (int): Song length, which is the number of audio shingles.

The output for the function is:

  • lst_out (np.ndarray): List of pairs of repeats with smaller repeats added.
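The ordering requirement on pair_list can be checked before calling the function. The sketch below is a hedged illustration only: `check_pair_list` is a hypothetical helper that is not part of repytah, and it assumes each row has the layout [start1, end1, start2, end2, length] with 1-based indices, as the example arrays in this vignette suggest.

```python
import numpy as np

def check_pair_list(pair_list, song_length):
    """Hypothetical sanity check (not a repytah function).

    Assumes rows of the form [start1, end1, start2, end2, length]
    with 1-based, inclusive indices.
    Returns a list of problem descriptions; an empty list means no issue found.
    """
    pair_list = np.asarray(pair_list)
    starts1, ends1, starts2, ends2, lengths = pair_list.T
    problems = []

    # Bandwidths (the last column) must never decrease from row to row
    if np.any(np.diff(lengths) < 0):
        problems.append("bandwidths are not in ascending order")

    # Each repeat should span exactly `length` time steps
    if np.any(ends1 - starts1 + 1 != lengths) or np.any(ends2 - starts2 + 1 != lengths):
        problems.append("a row's length does not match its start/end indices")

    # All indices should fall inside the song
    if pair_list[:, :4].min() < 1 or pair_list[:, :4].max() > song_length:
        problems.append("an index falls outside 1..song_length")

    return problems


example_pairs = np.array([[ 1, 10, 46, 55, 10],
                          [31, 40, 46, 55, 10],
                          [10, 20, 40, 50, 11],
                          [ 1, 15, 31, 45, 15]])
print(check_pair_list(example_pairs, 55))        # []
print(check_pair_list(example_pairs[::-1], 55))  # ['bandwidths are not in ascending order']
```

An empty list means the array passed these three checks; it does not guarantee that the rows correspond to real diagonals in thresh_mat.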

[2]:
pair_list = np.array([[ 1, 10, 46, 55, 10],
                      [31, 40, 46, 55, 10],
                      [10, 20, 40, 50, 11],
                      [ 1, 15, 31, 45, 15]])
song_length = 55

print("The input array is: \n", pair_list)
print("The number of audio shingles is: \n", song_length)
The input array is:
 [[ 1 10 46 55 10]
 [31 40 46 55 10]
 [10 20 40 50 11]
 [ 1 15 31 45 15]]
The number of audio shingles is:
 55
[3]:
output = find_complete_list(pair_list, song_length)

print("The output array is: \n", output)
The output array is:
 [[11 15 41 45  5  1]
 [ 1 10 31 40 10  1]
 [ 1 10 46 55 10  1]
 [31 40 46 55 10  1]
 [10 20 40 50 11  1]
 [ 1 15 31 45 15  1]]

In this example, there are two more rows added to the initial pair_list input, as find_complete_list can detect smaller diagonals contained in larger diagonals already found in find_initial_repeats. The repeats now look like this:

[Figure: pairs of repeats laid out by time step, with starting indices in black]

Each row represents a pair of repeats, and each column represents a time step. The time steps colored black are the starting indices of repeats of length k, which we use to check lst_no_anno for more repeats of length k.

[Figure: matching repeats with the same starting index and length, in yellow]

Using the same starting index and the same length, we find the matching repeats, shown in yellow.

[Figure: two more groups of repeats]

Then we find two more groups of repeats.
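The two added rows in the output can be reproduced by hand by cutting the longest pair at a known repeat length. The sketch below is a minimal illustration of that single step, not repytah's implementation: `split_pair_at` is a hypothetical helper, and the row layout [start1, end1, start2, end2, length] is assumed from the example arrays.

```python
import numpy as np

def split_pair_at(pair, k):
    """Hypothetical helper (not part of repytah).

    Splits one pair of repeats [start1, end1, start2, end2, length]
    into a leading pair of length k and the remaining pair.
    """
    s1, e1, s2, e2, length = (int(v) for v in pair)
    if not 0 < k < length:
        raise ValueError("k must be between 1 and length - 1")

    # First k time steps of both repeats
    head = [s1, s1 + k - 1, s2, s2 + k - 1, k]
    # Whatever is left after removing the first k time steps
    tail = [s1 + k, e1, s2 + k, e2, length - k]
    return np.array(head), np.array(tail)


# The length-15 pair starts at time step 1, where a length-10 repeat
# is already known to start (first row of pair_list), so cut it at k = 10.
head, tail = split_pair_at(np.array([1, 15, 31, 45, 15]), 10)
print(head)  # [ 1 10 31 40 10]
print(tail)  # [11 15 41 45  5]
```

These two rows, without the trailing annotation column, match the first two rows of the output above. The library itself decides where to cut by scanning all starting indices and lengths in the list, which this sketch does not attempt.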

find_all_repeats

find_all_repeats finds all the diagonals present in thresh_mat. This function is nearly identical to find_initial_repeats, with two crucial differences. First, we do not remove diagonals after we find them. Second, there is no smallest bandwidth size as we are looking for all diagonals.

The inputs for the function are:

  • thresh_mat (np.ndarray): Thresholded matrix that we extract diagonals from

  • bandwidth_vec (np.ndarray): Vector of lengths of diagonals to be found. Should be 1, 2, 3, …, n where n is the number of time steps.

The output for the function is:

  • all_lst (np.ndarray): Pairs of repeats that correspond to diagonals in thresh_mat
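To make the notion of a diagonal in thresh_mat concrete, the sketch below scans each superdiagonal for maximal runs of ones. It is an illustration under stated assumptions, not the algorithm used by find_all_repeats: `diagonal_runs` is a hypothetical helper, it reports only the longest unbroken run at each position rather than one row per bandwidth, and the 1-based row layout [start1, end1, start2, end2, length] is assumed from the examples in this vignette.

```python
import numpy as np

def diagonal_runs(thresh_mat):
    """Hypothetical helper (not part of repytah).

    Returns maximal runs of ones on each superdiagonal as rows of
    [start1, end1, start2, end2, length], using 1-based indices.
    """
    n = thresh_mat.shape[0]
    runs = []
    for offset in range(1, n):
        # Entries (i, i + offset) for i = 0 .. n - offset - 1
        diag = np.diagonal(thresh_mat, offset=offset)
        run_start = None
        for i, value in enumerate(diag):
            if value and run_start is None:
                run_start = i  # a new run of ones begins here
            is_last = i == len(diag) - 1
            if run_start is not None and (not value or is_last):
                # The run ends at i if this entry is one, otherwise at i - 1
                run_end = i if value else i - 1
                length = run_end - run_start + 1
                runs.append([run_start + 1, run_end + 1,
                             run_start + offset + 1, run_end + offset + 1,
                             length])
                run_start = None
    return np.array(runs, dtype=int).reshape(-1, 5)


demo_mat = np.array([[1, 0, 1, 0, 0],
                     [0, 1, 0, 1, 0],
                     [1, 0, 1, 0, 1],
                     [0, 1, 0, 1, 0],
                     [0, 0, 1, 0, 1]])
print(diagonal_runs(demo_mat))  # [[1 3 3 5 3]]
```

Here the only off-diagonal structure is a single run of three ones two steps above the main diagonal, so time steps 1 to 3 repeat at time steps 3 to 5. The rows returned by find_all_repeats are organized differently, since it reports repeats for each requested bandwidth.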

[4]:
thresh_mat = np.array([[1, 0, 1, 0, 0],
                       [0, 1, 0, 1, 0],
                       [1, 0, 1, 0, 1],
                       [0, 1, 0, 1, 0],
                       [0, 0, 1, 0, 1]])

bandwidth_vec = np.array([1, 2, 3, 4, 5])

print("The threshold matrix is: \n", thresh_mat)
print("The lengths of the diagonals to be found are: \n", bandwidth_vec)
The threshold matrix is:
 [[1 0 1 0 0]
 [0 1 0 1 0]
 [1 0 1 0 1]
 [0 1 0 1 0]
 [0 0 1 0 1]]
The lengths of the diagonals to be found are:
 [1 2 3 4 5]
[5]:
output = find_all_repeats(thresh_mat, bandwidth_vec)

print("The output array is: \n", output)
The output array is:
 [[1 1 3 3 1]
 [2 2 4 4 1]
 [3 3 5 5 1]
 [1 2 3 4 2]
 [2 3 4 5 2]
 [1 2 3 4 2]
 [2 3 4 5 2]]

find_complete_list_anno_only

find_complete_list_anno_only finds annotations for all pairs of repeats found in find_all_repeats. This list contains all the pairs of repeated structures with their starting/ending indices and lengths.

The inputs for the function are:

  • pair_list (np.ndarray): List of pairs of repeats.

  • song_length (int): Number of audio shingles in song.

The output for the function is:

  • out_lst (np.ndarray): List of pairs of repeats with smaller repeats added and with annotation markers.

[6]:
pair_list = np.array([[3,  3,  5,  5, 1],
                      [2,  2,  8,  8, 1],
                      [3,  3,  9,  9, 1],
                      [2,  2, 15, 15, 1],
                      [8,  8, 15, 15, 1],
                      [4,  4, 17, 17, 1],
                      [2,  3,  8,  9, 2],
                      [3,  4,  9, 10, 2],
                      [2,  3, 15, 16, 2],
                      [8,  9, 15, 16, 2],
                      [3,  4, 16, 17, 2],
                      [2,  4,  8, 10, 3],
                      [3,  5,  9, 11, 3],
                      [7,  9, 14, 16, 3],
                      [2,  4, 15, 17, 3],
                      [3,  5, 16, 18, 3],
                      [9, 11, 16, 18, 3],
                      [7, 10, 14, 17, 4],
                      [7, 11, 14, 18, 5],
                      [8, 12, 15, 19, 5],
                      [7, 12, 14, 19, 6]])
song_length = 19

print("The pairs of repeats are: \n", pair_list)
print("The number of audio shingles in the song is:", song_length)
The pairs of repeats are:
 [[ 3  3  5  5  1]
 [ 2  2  8  8  1]
 [ 3  3  9  9  1]
 [ 2  2 15 15  1]
 [ 8  8 15 15  1]
 [ 4  4 17 17  1]
 [ 2  3  8  9  2]
 [ 3  4  9 10  2]
 [ 2  3 15 16  2]
 [ 8  9 15 16  2]
 [ 3  4 16 17  2]
 [ 2  4  8 10  3]
 [ 3  5  9 11  3]
 [ 7  9 14 16  3]
 [ 2  4 15 17  3]
 [ 3  5 16 18  3]
 [ 9 11 16 18  3]
 [ 7 10 14 17  4]
 [ 7 11 14 18  5]
 [ 8 12 15 19  5]
 [ 7 12 14 19  6]]
The number of audio shingles in the song is: 19
[7]:
output = find_complete_list_anno_only(pair_list, song_length)

print("The output array is: \n", output)
The output array is:
 [[ 2  2  8  8  1  1]
 [ 2  2 15 15  1  1]
 [ 8  8 15 15  1  1]
 [ 3  3  5  5  1  2]
 [ 3  3  9  9  1  2]
 [ 4  4 17 17  1  3]
 [ 2  3  8  9  2  1]
 [ 2  3 15 16  2  1]
 [ 8  9 15 16  2  1]
 [ 3  4  9 10  2  2]
 [ 3  4 16 17  2  2]
 [ 2  4  8 10  3  1]
 [ 2  4 15 17  3  1]
 [ 3  5  9 11  3  2]
 [ 3  5 16 18  3  2]
 [ 9 11 16 18  3  2]
 [ 7  9 14 16  3  3]
 [ 7 10 14 17  4  1]
 [ 7 11 14 18  5  1]
 [ 8 12 15 19  5  2]
 [ 7 12 14 19  6  1]]
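
One way to read the last column of this output is as a group label: pairs of the same length that share a starting index receive the same annotation. The sketch below encodes that reading with a small union-find. It is a simplified illustration inferred from this example's output, not the logic of add_annotations or find_complete_list_anno_only; `annotate_by_shared_starts` is a hypothetical helper, and the grouping and ordering rules are assumptions that may not hold for other inputs.

```python
import numpy as np

def annotate_by_shared_starts(pair_list):
    """Hypothetical helper (not part of repytah).

    For each repeat length, links the two start indices of every pair,
    then numbers the resulting groups by their smallest start index.
    Rows are [start1, end1, start2, end2, length]; a label column is appended.
    """
    annotated = []
    for length in np.unique(pair_list[:, 4]):
        rows = pair_list[pair_list[:, 4] == length]
        parent = {}

        def find(x):
            # Follow parent links to the group's representative
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for s1, _, s2, _, _ in rows:
            root1, root2 = find(int(s1)), find(int(s2))
            # Keep the smaller start index as the representative
            parent[max(root1, root2)] = min(root1, root2)

        roots = sorted({find(x) for x in list(parent)})
        label = {root: k + 1 for k, root in enumerate(roots)}
        for row in rows:
            annotated.append([*row.tolist(), label[find(int(row[0]))]])

    # Order by length, then label, then the two start indices
    annotated.sort(key=lambda r: (r[4], r[5], r[0], r[2]))
    return np.array(annotated)


length_one_pairs = np.array([[3, 3,  5,  5, 1],
                             [2, 2,  8,  8, 1],
                             [3, 3,  9,  9, 1],
                             [2, 2, 15, 15, 1],
                             [8, 8, 15, 15, 1],
                             [4, 4, 17, 17, 1]])
print(annotate_by_shared_starts(length_one_pairs))
```

For these six length-1 pairs, the sketch groups start indices {2, 8, 15}, {3, 5, 9}, and {4, 17}, which corresponds to annotations 1, 2, and 3 in the first six rows of the output above.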