Comparison types, transform add-ons, aggregate features, and household aggregate features

This page describes the comparison types available for the [[comparison_features]] section, the attributes available to all comparison types, and some aggregate features that are not configurable.

Comparison types

Each header below represents a comparison type. Transforms are used in the context of comparison_features.

[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true

maximum_jaro_winkler

Finds the greatest Jaro-Winkler value among the cartesian product of multiple columns. For example, given an input of column_names = ['namefrst', 'namelast'], it would return the maximum Jaro-Winkler name comparison value among the following four comparisons:

[('namefrst_a', 'namefrst_b'),
 ('namefrst_a', 'namelast_b'),
 ('namelast_a', 'namefrst_b'),
 ('namelast_a', 'namelast_b')]

  • Attributes:

    • column_names – Type: list of strings. Required. The list of columns used as input for the set of comparisons generated by taking the cartesian product.

[[comparison_features]]
alias = "maximum_jw"
column_names = ["namelast", "namefrst"]
comparison_type = "maximum_jaro_winkler"
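
As a rough illustration of the cartesian product described above, the following Python sketch lists the column pairs and takes the maximum score over them. It is not hlink's implementation: `column_pairs`, `maximum_similarity`, and `exact` are hypothetical names, `similarity` is a stand-in for a Jaro-Winkler function supplied by the caller, and the `_a`/`_b` suffixes simply follow the naming used in the list above.

```python
from itertools import product


def column_pairs(column_names):
    """Pair every column from record A with every column from record B."""
    return [(f"{a}_a", f"{b}_b") for a, b in product(column_names, repeat=2)]


def maximum_similarity(record, column_names, similarity):
    """Return the highest score over all A/B column pairs.

    `similarity` is a caller-supplied function of two strings; a real
    Jaro-Winkler implementation would be plugged in here.
    """
    return max(
        similarity(record[col_a], record[col_b])
        for col_a, col_b in column_pairs(column_names)
    )


def exact(x, y):
    # Toy stand-in for Jaro-Winkler: 1.0 on an exact match, otherwise 0.0.
    return 1.0 if x == y else 0.0


# A record whose first and last names are swapped between A and B.
record = {
    "namefrst_a": "john", "namelast_a": "smith",
    "namefrst_b": "smith", "namelast_b": "john",
}
pairs = column_pairs(["namefrst", "namelast"])
best = maximum_similarity(record, ["namefrst", "namelast"], exact)
```

Because every A column is compared with every B column, a record with swapped first and last names still receives a high score under this scheme.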

jaro_winkler

Returns the Jaro-Winkler comparison score for a given column.

  • Attributes:

    • column_name – Type: string. Required. The column to compare using the Jaro-Winkler score.

[[comparison_features]]
alias = "namefrst_jw"
column_name = "namefrst"
comparison_type = "jaro_winkler"

jaro_winkler_street

Uses an additional geographic column value to filter for major location changes before comparing street names. If boundary column A is not equal to boundary column B, a Jaro-Winkler score of zero is returned. If boundary column A and B are equal, the Jaro-Winkler comparison score of the street columns is returned.

  • Attributes:

    • column_name – Type: string. Required. The input street column.

    • boundary – Type: string. Required. An input column to match on before comparing street name values.

[[comparison_features]]
alias = "jw_street"
column_name = "street"
boundary = "enum_dist"
comparison_type = "jaro_winkler_street"
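
The boundary check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `street_similarity` and `toy_similarity` are hypothetical names, the similarity function is a placeholder for Jaro-Winkler, and null handling is not modeled.

```python
def street_similarity(street_a, street_b, boundary_a, boundary_b, similarity):
    """Return 0.0 when the boundary values differ, else compare the streets."""
    if boundary_a != boundary_b:
        return 0.0
    return similarity(street_a, street_b)


def toy_similarity(x, y):
    # Placeholder for Jaro-Winkler: 1.0 on an exact match, otherwise 0.0.
    return 1.0 if x == y else 0.0


# Same street name, same boundary value: the street score is returned.
same_district = street_similarity("main st", "main st", "12", "12", toy_similarity)
# Same street name, different boundary value: the score is forced to zero.
moved_district = street_similarity("main st", "main st", "12", "40", toy_similarity)
```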

max_jaro_winkler

Returns the greatest Jaro-Winkler value from the comparisons of a list of names.

  • Attributes:

    • column_name – Type: string. Required. Input column containing a list of names to compare (such as related household members, or neighborhood surnames).

[[comparison_features]]
alias = "related_individual_max_jw"
column_name = "namefrst_related"
comparison_type = "max_jaro_winkler"

equals

Asserts that values are the same for both compared columns using SQL: a.{column_name} IS NOT DISTINCT FROM b.{column_name}

[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true

f1_match

Checks whether the first initial of A's first name matches the first initial of B's first name or either of B's first two middle initials. If so, returns 1. Otherwise, returns 2.

1 = the first initial of A's first name matches the first initial of one of B's first or middle names

2 = mismatch

Uses the following SQL query:

CASE WHEN (
    (a.{fi} IS NOT DISTINCT FROM b.{fi}) OR
    (a.{fi} IS NOT DISTINCT FROM b.{mi0}) OR
    (a.{fi} IS NOT DISTINCT FROM b.{mi1})
) THEN 1 ELSE 2 END

[[comparison_features]]
alias = "f1_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f1_match"
categorical = true
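
For readers less familiar with IS NOT DISTINCT FROM, here is a minimal Python mirror of the f1_match CASE expression. It is a sketch for reasoning about the output values, not code from hlink. The function name and argument names are illustrative, and None stands in for SQL NULL.

```python
def f1_match(fi_a, fi_b, mi0_b, mi1_b):
    """Return 1 for a match and 2 for a mismatch, like the CASE expression.

    Python's == treats None == None as True, which lines up with the
    null-safe IS NOT DISTINCT FROM comparison used in the SQL.
    """
    if fi_a == fi_b or fi_a == mi0_b or fi_a == mi1_b:
        return 1
    return 2
```

Note that, because the comparison is null-safe, a null first initial in A counts as a match whenever any of the three B values is also null.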

f2_match

Checks whether the first middle initial of A is empty or null. If so, returns 0. Otherwise, returns 1 if the first or second middle initial of A (when not null) matches the first name initial of B or the first or second middle initial of B. Otherwise, returns 2.

0 = A has no middle initial (no second first name)

1 = a middle initial of A matches the first initial of one of B's first or middle names

2 = mismatch

Uses the following SQL:

CASE WHEN ((a.{mi0} == '') OR (a.{mi0} IS NULL)) THEN 0 WHEN (
    (a.{mi0} IS NOT DISTINCT FROM b.{fi}) OR
    ((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{fi})) OR
    (a.{mi0} IS NOT DISTINCT FROM b.{mi0}) OR
    (a.{mi0} IS NOT DISTINCT FROM b.{mi1}) OR
    ((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi0})) OR
    ((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi1}))
) THEN 1 ELSE 2 END

  • Attributes:

    • first_init_col – Type: string. Required. First name initial input column.

    • mid_init_cols – Type: list of strings. Required. List of first and second middle initial input columns.

[[comparison_features]]
alias = "f2_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f2_match"
categorical = true
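
The three possible outputs of f2_match can be traced with a small Python mirror of the SQL above. This is an illustrative sketch rather than hlink code. The function and argument names are assumptions, and None stands in for SQL NULL.

```python
def f2_match(mi0_a, mi1_a, fi_b, mi0_b, mi1_b):
    """Return 0, 1, or 2 following the CASE expression for f2_match."""
    # 0: A has no first middle initial at all.
    if mi0_a is None or mi0_a == "":
        return 0
    candidates_b = (fi_b, mi0_b, mi1_b)
    # 1: the first middle initial of A matches any initial of B.
    if mi0_a in candidates_b:
        return 1
    # 1: the second middle initial of A is present and matches any initial of B.
    if mi1_a is not None and mi1_a in candidates_b:
        return 1
    # 2: A has a middle initial, but nothing matched.
    return 2
```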

not_equals

Asserts that values are distinct between compared individuals using SQL: a.{column_name} IS DISTINCT FROM b.{column_name}. Used mainly in caution flag features (f_caution, m_caution, sp_caution).

  • Attributes:

    • column_name – Type: string. Required. Input column to compare.

[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"

equals_as_int

Checks for equality using the equals sign (=) and returns the boolean result as an integer. Uses SQL: CAST(a.{col} = b.{col} as INT)

  • Attributes:

    • column_name – Type: string. Required. Input column to compare.

[[comparison_features]]
alias = "namelast_equal_as_int"
column_name = "namelast_clean"
comparison_type = "equals_as_int"

all_equals

Asserts whether the values in all given columns match. Uses a SQL expression generated by joining a.{col} = b.{col} and AND for each given column.

  • Attributes:

    • column_names – Type: list of strings. Required. List of the columns to evaluate if all are equal across records being compared.

[[comparison_features]]
alias = "exact"
column_names = ["namefrst_unstd", "namelast_clean"]
comparison_type = "all_equals"
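
To make the generated expression concrete, the sketch below builds the kind of SQL string described above for a list of columns. The helper name `all_equals_sql` is hypothetical, and the exact spacing or parenthesization that hlink emits may differ.

```python
def all_equals_sql(column_names):
    """Join one a.col = b.col clause per column with AND."""
    return " AND ".join(f"a.{col} = b.{col}" for col in column_names)


# Columns taken from the example configuration above.
expression = all_equals_sql(["namefrst_unstd", "namelast_clean"])
```

For the example configuration, this yields a.namefrst_unstd = b.namefrst_unstd AND a.namelast_clean = b.namelast_clean.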

not_zero_and_not_equals

Checks that both values are present (not null) and nonzero and that they are not equal to one another. Evaluates to a boolean. This is primarily useful when a value of 0 indicates some kind of incomparability akin to the value being missing.

See also present_and_equal_categorical_in_universe, which is a similar but more general comparison type.

  • Attributes:

    • column_name – Type: string. Required. Input column to compare.

[[comparison_features]]
alias = "fbpl_nomatch"
column_name = "fbpl"
comparison_type = "not_zero_and_not_equals"

or

Allows for the concatenation of up to four comparison features into one feature using a SQL OR between the generated clause for each sub-comparison.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b – Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.

    • comp_c, comp_d – Type: Object. Optional. Sub-comparison using any of the comparison feature types documented in this section.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr"]
comparison_type = "or"
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
lower_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7

and

Allows for the concatenation of up to four comparison features into one feature using a SQL AND between the generated clause for each sub-comparison.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b – Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.

    • comp_c, comp_d – Type: Object. Optional. Sub-comparison using any of the comparison feature types documented in this section.

In this example, the and comparison appears in [comparison_features.comp_b].

[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"

times

Takes the output of two sub-comparisons and multiplies them together after casting as floats.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b – Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section. comp_a and comp_b can also have sub-comparisons, as in the given example.

[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"

caution_comp_3

Generates an SQL expression in the form (comparison A OR comparison B) AND comparison C.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b, comp_c – Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_3"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7

caution_comp_3_012

Similar to caution_comp_3, but first checks the value of comparison C. If comparison C evaluates to false, then caution_comp_3_012 evaluates to 2. Otherwise, it evaluates to the result of caution_comp_3, so 0 or 1.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b, comp_c – Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.
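
The relationship between caution_comp_3 and caution_comp_3_012 can be sketched in Python as follows. The functions take the already-evaluated boolean results of the three sub-comparisons. This is an illustration of the logic described above, not the SQL that hlink generates.

```python
def caution_comp_3(comp_a, comp_b, comp_c):
    """(A OR B) AND C, returned as 0 or 1."""
    return int((comp_a or comp_b) and comp_c)


def caution_comp_3_012(comp_a, comp_b, comp_c):
    """Return 2 when C is false, otherwise the caution_comp_3 result."""
    if not comp_c:
        return 2
    return caution_comp_3(comp_a, comp_b, comp_c)
```

The extra value 2 lets a model distinguish "comparison C failed" from "comparison C held but neither A nor B raised a caution", which plain caution_comp_3 collapses into a single 0.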

caution_comp_4

Generates an SQL expression in the form (comparison A OR comparison B OR comparison C) AND comparison D.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b, comp_c, comp_d – Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.

[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"

caution_comp_4_012

Similar to caution_comp_4, but first checks the value of comparison D. If comparison D evaluates to false, then caution_comp_4_012 evaluates to 2. Otherwise, it evaluates to the result of caution_comp_4, so 0 or 1.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b, comp_c, comp_d – Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.

any_equals

Used to compare middle initials and first names under specific circumstances. Evaluates to true if middle initial A is not empty/null and is the same as either middle initial B or first name B, or if first name A is not empty/null and is the same as middle initial B.

  • Attributes:

    • column_names – Type: list of strings. Required. The first input column should be the middle initial column, and the second input column should be the first name column.

[[comparison_features]]
alias = "mid_init_match"
column_names = ["namefrst_mid_init", "namefrst_unstd"]
comparison_type = "any_equals"

either_are_1

Checks if the column value for either A or B is equal to 1.

  • Attributes:

    • column_name – Type: string. Required. Input column to compare to 1.

[[comparison_features]]
alias = "either_1"
column_name = "nativity"
comparison_type = "either_are_1"
categorical = true

either_are_0

Checks if the column value for either A or B is equal to 0.

  • Attributes:

    • column_name – Type: string. Required. Input column to compare to 0.

[[comparison_features]]
alias = "either_0"
column_name = "nativity"
comparison_type = "either_are_0"
categorical = true

second_gen_imm

Checks if individual A is a second-generation immigrant by looking for nativity value of 2, 3, or 4 (one or both parents foreign-born).

  • Attributes:

    • column_name – Type: string. Required. Input should be the name of the nativity column.

[[comparison_features]]
alias = "sgen"
column_name = "nativity"
comparison_type = "second_gen_imm"
categorical = true

rel_jaro_winkler

Uses a Scala function to determine the number of people in the input column with a name similarity score (Jaro-Winkler) greater than or equal to the given jw_threshold, an age difference less than or equal to the given age_threshold, and matching sex for the sample A individual and the sample B potential match. Takes a column generated with the feature selection transform related_individual_rows as input (list of person data objects to compare). Can be used for related or unrelated individuals, depending on the input column specified.

  • Attributes:

    • column_name – Type: string. The input column with data in the form of a list of person data objects.

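    • name_col – Type: string. The name of the column containing the name to compare.

    • birthyr_col – Type: string. The name of the column containing the birth year.

    • jw_threshold – Type: float. The minimum Jaro-Winkler score required to count a person as a match.

    • age_threshold – Type: int. The maximum age difference allowed to count a person as a match.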

[[comparison_features]]
alias = "rel"
column_name = "namefrst_related_rows"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
comparison_type = "rel_jaro_winkler"
jw_threshold = 0.9
age_threshold = 5

extra_children

Uses a Scala function to check whether there are children present in sample B who are not present in sample A but who, based on relate codes, age, sex, and name, would have been expected to be present in A. Returns a count of suspected “extra” children. Takes a column generated with the feature selection transform related_individual_rows as input (list of person data objects to compare).

  • Attributes:

    • column_name – Type: string. The input column with data in the form of a list of person data objects.

    • relate_col – Type: string. The name of the column with the relate code.

    • histid_col – Type: string. The name of the id column.

    • name_col – Type: string. The name of the column containing the first name for comparison.

    • birthyr_col – Type: string. The name of the column containing the birth year.

    • year_b – Type: int. The year that sample B was taken.

    • jw_threshold – Type: float. The minimum acceptable Jaro-Winkler score to consider a match.

    • age_threshold – Type: int. The maximum acceptable age difference to consider a match.

[[comparison_features]]
alias = "extra_children"
column_name = "namefrst_related_rows"
relate_col = "relate"
histid_col = "histid"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
year_b = 1910
comparison_type = "extra_children"
jw_threshold = 0.8
age_threshold = 2

jaro_winkler_rate

Uses a Scala function to calculate the share of individuals who have a Jaro-Winkler score greater than or equal to the given threshold. The rate is returned as a float (in the example below, a lower_threshold of 0.05 corresponds to 5%).

  • Attributes:

    • column_name – Type: string. The input column with data in the form of a list of person data objects. The input column seen below (“namelast_neighbors”) was generated using a “neighbor_aggregate” feature selection.

    • jw_threshold – Type: float. The minimum Jaro-Winkler threshold to consider an acceptable match.

In the following example, a lower_threshold feature add-on is used to convert the returned rate to a boolean asserting whether it meets the given minimum threshold. (>= 5% of neighbors have a Jaro-Winkler score >= 0.95)

[[comparison_features]]
alias = "nbors"
comparison_type = "times"
column_names = ["namelast_neighbors", "county", "statefip"]
[comparison_features.comp_a]
column_name = "namelast_neighbors"
comparison_type = "jaro_winkler_rate"
jw_threshold = 0.95
lower_threshold = 0.05
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"

sum

Adds the column values for A and B together (takes the sum).

  • Attributes:

    • column_name – Type: string. The input column to be added.

[[comparison_features]]
alias = "namelast_popularity_sum"
column_name = "namelast_popularity"
comparison_type = "sum"

length_b

Returns the length of the column value in record B using the SQL size() function.

  • Attributes:

    • column_name – Type: string. The name of the input column to take the length of in dataset B.

abs_diff

Takes the absolute value of the difference between the values of the given column in datasets A and B.

  • Attributes:

    • column_name – Type: string. The input column to evaluate.

    • not_equals – Type: int. OPTIONAL. You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of an absolute difference. For example, if you are evaluating the difference in marriage duration values, and “99” is a placeholder value for “unknown” in the data, you can exclude those values from consideration using this attribute.

[[comparison_features]]
alias = "byrdiff"
column_name = "replaced_birthyr"
comparison_type = "abs_diff"

[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "abs_diff"
btwn_threshold = [9, 14]
categorical = true
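
The effect of the optional not_equals attribute can be sketched in Python as shown below. This is a simplified model, not hlink's generated SQL. It assumes that the -1 result applies when either the A value or the B value equals the placeholder, which the description above does not spell out.

```python
def abs_diff(value_a, value_b, not_equals=None):
    """Absolute difference, or -1 if either value is the invalid placeholder.

    Assumption: a placeholder on either side is enough to return -1.
    """
    if not_equals is not None and not_equals in (value_a, value_b):
        return -1
    return abs(value_a - value_b)


# Birth years five years apart, no placeholder configured.
birth_year_gap = abs_diff(1880, 1885)
# Marriage duration of 99 means "unknown", so the difference is not computed.
unknown_duration = abs_diff(99, 10, not_equals=99)
```

Returning -1 keeps placeholder values from producing a large, misleading difference such as 89.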

b_minus_a

Returns the result of subtracting the value of the column in A from the value of the column in B (B minus A).

  • Attributes:

    • column_name – Type: string. The input column to evaluate.

    • not_equals – Type: int. OPTIONAL. You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of the difference. For example, if you are evaluating the difference in marriage duration values, and “99” is a placeholder value for “unknown” in the data, you can exclude those values from consideration using this attribute.

[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true

geo_distance

Uses a lookup table to find the geographic distance between locations. The SQL expression is generated by hlink/linking/core/dist_table.py. There are several ways to configure this feature. You can look up distances in the given file using one or two keys (specified with the key_count attribute). You can also optionally specify a secondary look-up table that provides a back-up value when the primary look-up does not contain a value for the given locations. This is particularly useful for county distance: you can set the primary join to be across counties and set up a secondary join on state, which has far fewer combinations and thus less risk of nulls, to fill in when the specified counties aren’t in the look-up.

  • Attributes:

    • key_count – Type: int. The number of keys used to join on the primary (or only) look-up table. Acceptable values are 1 or 2. Ex: for state and county, key_count = 2. For just state, key_count = 1 even though there is county_a and county_b.

    • distances_file – Type: string of path. Path to the distances look-up file.

    • table_name – Type: string. What to name the table that will be generated from the distances file. If several feature specifications use the same table_name, the table is only read in once.

    • Attributes for key_count = 1:

      • column_name – Type: string. The column in the input data that you want to use as a key to look up the geographic distance.

      • loc_a – Type: string. First column to join on in the look-up table (where to find the value coming from the column_name column A).

      • loc_b – Type: string. Second column to join on in the look-up table (where to find the value coming from the column_name column B).

      • distance_col – Type: string. Name of the column containing the geographic distance in the look-up table.

    • Attributes for key_count = 2:

      • column_names – Type: list of strings. The two columns you want to use as keys to look up the geographic distance.

      • source_column_a – Type: string. First column to join on in the source data.

      • source_column_b – Type: string. Second column to join on in the source data.

      • loc_a_0 – Type: string. First column to join on in the look-up table.

      • loc_a_1 – Type: string. First column to join on in the look-up table.

      • loc_b_0 – Type: string. Second column to join on in the look-up table.

      • loc_b_1 – Type: string. Second column to join on in the look-up table.

      • distance_col – Type: string. Name of the column containing the geographic distance in the look-up table.

    • Attributes if using a secondary join:

      • secondary_key_count – Type: int. The number of keys used to join on the secondary (backup) look-up table. Acceptable values are 1 or 2.

      • secondary_table_name – Type: string. What to name the table that will be generated from the secondary_distances_file. If several feature specifications use the same table name, the table is only read in once.

      • secondary_distances_file – Type: string of path. Path to the secondary distances look-up file.

      • secondary_source_column – Type: string. The column in the input data that you want to use as a key in the secondary geographic distance look-up.

      • secondary_loc_a – Type: string. First column to join on in the secondary look-up table.

      • secondary_loc_b – Type: string. Second column to join on in the secondary look-up table.

      • secondary_distance_col – Type: string. Name of the column containing the geographic distance in the secondary look-up table.

[[comparison_features]]
alias = "state_distance"
comparison_type = "geo_distance"
key_count = 1
table_name = "state_distance_lookup"
distances_file = "/path/to/county_state_distance.csv"
column_name = "bpl"
loc_a = "statecode1"
loc_b = "statecode2"
distance_col = "dist"


[[comparison_features]]
alias = "county_distance"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
key_count = 2
table_name = "county_distance_lookup"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"

# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"

# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"
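
The primary and secondary look-ups in the county example above behave roughly like the following Python sketch, which uses dictionaries in place of the look-up tables. The function name, the key layout, and the distance values are made up for illustration. hlink performs these look-ups with SQL joins instead.

```python
def lookup_distance(county_a, state_a, county_b, state_b, county_table, state_table):
    """Look up the county-level distance, falling back to the state level."""
    distance = county_table.get((county_a, state_a, county_b, state_b))
    if distance is None:
        # Primary look-up has no entry, so use the secondary (state) table.
        distance = state_table.get((state_a, state_b))
    return distance


# Made-up values for illustration only.
county_table = {(10, 27, 30, 27): 42.0}
state_table = {(27, 27): 0.0, (27, 55): 310.5}

primary_hit = lookup_distance(10, 27, 30, 27, county_table, state_table)
fallback = lookup_distance(10, 27, 99, 55, county_table, state_table)
missing = lookup_distance(10, 27, 99, 99, county_table, state_table)
```

In this sketch, a pair that is missing from both tables still yields None, which corresponds to a null distance.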

fetch_a

Gets the value of column A.

  • Attributes:

    • column_name – Type: string. Required. The column to get the value from.

[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_a"
categorical = true

fetch_b

Gets the value of column B.

  • Attributes:

    • column_name – Type: string. The column to get the value from.

[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_b"
categorical = true

present_both_years

Checks whether column A and column B are both present and both positive (> 0). Evaluates to 1 if both are present and positive and 0 otherwise.

  • Attributes:

    • column_name – Type: string. The column to check. Must be a column with a numerical type.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"

neither_are_null

Checks that neither column A nor column B is null or the empty string ''. Evaluates to 1 if neither column is null or '' and evaluates to 0 otherwise.

  • Attributes:

    • column_name – Type: string. The column of type string to check.

[[comparison_features]]
alias = "mpres"
column_name = "m_namefrst"
comparison_type = "neither_are_null"
categorical = true

present_and_matching_categorical

Checks that both column A and column B are present and that they match according to SQL’s IS DISTINCT FROM. Evaluates to 0, 1, or 2:

0 → columns are both present and match

1 → columns are both present but are distinct

2 → one or both columns are missing

  • Attributes:

    • column_name – Type: string. Required. The column to check.
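
The three output codes can be sketched in Python as follows. This is a simplified model of the behavior described above, and it assumes that "present" means "not null" (None here). The actual presence check in hlink may be stricter.

```python
def present_and_matching_categorical(value_a, value_b):
    """Return 0 for a match, 1 for a mismatch, 2 if either value is missing."""
    # Assumption: a value is "missing" only when it is None (SQL NULL).
    if value_a is None or value_b is None:
        return 2
    return 0 if value_a == value_b else 1
```

Note that, unlike most comparison types on this page, a match is encoded as 0 rather than 1.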

present_and_not_equal

Checks that column A and column B are both present but are not equal.

  • Attributes:

    • column_name – Type: string. The column to check.

present_and_equal_categorical_in_universe

Checks that column A and column B are both present, are not equal to the not-in-universe value NIU, and are equal to each other according to SQL’s IS DISTINCT FROM. Evaluates to 0 if either column is missing or if either column is the NIU value. Otherwise, evaluates to 0 if the columns are distinct or 1 if the columns are equal.

  • Attributes:

    • column_name – Type: string. Required. The column to check.

    • NIU – Type: same as the type of the input column. Required. The not-in-universe value to use in the check.

[[comparison_features]]
alias = "mfbplmatch"
column_name = "nativity"
comparison_type = "present_and_equal_categorical_in_universe"
NIU = "0"
categorical = true

sql_condition

This is a flexible comparison type that allows users to write their own SQL expressions to be evaluated. Favor using a different comparison type if that’s a reasonable option. If there are no other comparison types that work for a particular use case, this one is a good fallback.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all columns used in the SQL expression.

    • condition – Type: string. The SQL expression to evaluate.

In this example, we make use of the hlink-defined jw function, which computes the Jaro-Winkler similarity of two strings. nvl is a Spark builtin function which returns its second argument if the first is null, and the first argument otherwise.

[[comparison_features]]
alias = "namelast_jw_max"
comparison_type = "sql_condition"
column_names = ["namelast1", "namelast2", "namelast3"]
condition = "GREATEST(jw(nvl(a.namelast1, ''), nvl(b.namelast1, '')), jw(nvl(a.namelast2, ''), nvl(b.namelast2, '')), jw(nvl(a.namelast3, ''), nvl(b.namelast3, '')))"

Feature add-ons

These attributes can be added to most comparison feature types above to extend the type of output returned beyond the standard comparison feature.

alias

  • Attributes:

    • alias – Type: string. The name for the output column. Should be used at the top-level comparison of every comparison feature.

[[comparison_features]]
alias = "jw_f"
column_name = "father_namefrst"
comparison_type = "jaro_winkler"

power

Raises a comparison feature to a given exponential power.

  • Attributes:

    • power – Type: int. The power to raise the comparison output to. For example, power = 2 will square the output.

[[comparison_features]]
alias = "county_distance_squared"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
# PRIMARY JOIN
# key count: the number of keys used for the join per source file.  Ex: for state and county, key_count = 2.  For just state, key_count = 1 even though there is county_a and county_b
key_count = 2
table_name = "county_distance_lookup"
#distances_file = "/path/to/county_state_distance.csv"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"
# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"
# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"
power = 2

threshold

  • Attributes:

    • threshold – Type: numeric types. Asserts if the comparison feature output is not null and is greater than or equal to (>=) the given threshold value.

[[comparison_features]]
alias = "imm"
column_name = "nativity"
comparison_type = "fetch_a"
threshold = 5
categorical = true

lower_threshold

  • Attributes:

    • lower_threshold – Type: numeric types. Asserts if the comparison feature output is not null and is greater than or equal to (>=) the given threshold value.

[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"

upper_threshold

  • Attributes:

    • upper_threshold – Type: numeric types. Asserts if the comparison feature output is not null and is less than or equal to (<=) the given threshold value.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"

gt_threshold

  • Attributes:

    • gt_threshold – Type: numeric types. Asserts if the comparison feature output is not null and is greater than (>) the given threshold value.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"

btwn_threshold

  • Attributes:

    • btwn_threshold – Type: list of numeric types. Asserts if the comparison feature output is greater than or equal to (>=) the first threshold value, and less than or equal to (<=) the second threshold value.

[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true
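The threshold add-ons described above (threshold, lower_threshold, upper_threshold, gt_threshold, and btwn_threshold) differ only in the comparison operator they apply to the feature output. The following is a minimal Python sketch of those semantics, not the program's actual implementation. The function name passes and its arguments are hypothetical, and treating a null output as failing btwn_threshold is an assumption, since the description above only states the null rule for the other four add-ons.

```python
from typing import Optional, Sequence, Union

Number = Union[int, float]


def passes(value: Optional[Number], add_on: str,
           limit: Union[Number, Sequence[Number]]) -> bool:
    """Return True when `value` satisfies the named threshold add-on.

    Hypothetical helper for illustration only.
    """
    if value is None:
        # Null outputs never pass (assumed for btwn_threshold as well).
        return False
    if add_on in ("threshold", "lower_threshold"):
        return value >= limit          # greater than or equal to
    if add_on == "upper_threshold":
        return value <= limit          # less than or equal to
    if add_on == "gt_threshold":
        return value > limit           # strictly greater than
    if add_on == "btwn_threshold":
        low, high = limit              # both bounds are inclusive
        return low <= value <= high
    raise ValueError(f"unknown add-on: {add_on}")
```

With btwn_threshold = [5,14] as in the example above, a b_minus_a result of 14 would pass and a result of 15 would not.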

look_at_addl_var

  • Attributes:

    • look_at_addl_var – Type: boolean. Flags the program to consider an additional column value before reporting the comparison feature value.

    • addl_var – Type: string. The additional column to consider.

    • check_val_expr – Type: expression. The expression to use to evaluate the additional column. For example, check_val_expr = "= 5".

    • else_val – Type: same type as comparison feature output. The value to return instead of the comparison feature value if the additional column value does not meet the check_val_expr specification.

In the following example, the generated SQL expression for the column would be: CASE WHEN {datasource}.nativity = 5 THEN {yrimmig abs_diff value} ELSE -1 END.

[[comparison_features]]
alias = "immyear_diff"
column_name = "yrimmig"
comparison_type = "abs_diff"
look_at_addl_var = true
addl_var = "nativity"
datasource = "a"
check_val_expr = "= 5"
else_val = -1
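To make the way these attributes combine more concrete, here is a small Python sketch that assembles a CASE expression of the shape described above from the config values. It is an illustration under assumptions, not the program's own code: the function addl_var_case is hypothetical, and the SQL passed in for the abs_diff feature is only a placeholder.

```python
def addl_var_case(feature_sql: str, addl_var: str, datasource: str,
                  check_val_expr: str, else_val) -> str:
    """Wrap a comparison feature expression in a CASE on an additional column.

    Hypothetical helper: mirrors the CASE WHEN ... THEN ... ELSE ... END
    shape described in the text.
    """
    return (
        f"CASE WHEN {datasource}.{addl_var} {check_val_expr} "
        f"THEN {feature_sql} ELSE {else_val} END"
    )


# Values taken from the immyear_diff example; the abs(...) SQL is assumed.
expr = addl_var_case(
    feature_sql="abs(a.yrimmig - b.yrimmig)",
    addl_var="nativity",
    datasource="a",
    check_val_expr="= 5",
    else_val=-1,
)
print(expr)
```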

Aggregate Features

These features are not configurable. To include them in the generated comparison features, list them in the [training][independent_vars] section of the config. They are generated using the “aggregate_features” SQL template.

hits

The number of potential matches generated for the given individual (counted by aggregating on {id_column}_a).

hits2

The value of hits squared.

exact_mult

Indicator for the existence of multiple potential matches with the exact same first and last name as the A sample individual within the B data. Returns numeric boolean (0 or 1).
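The three aggregate features can be sketched in Python as a group-by over the potential matches. This is an illustrative approximation of what the SQL template computes, not the template itself. The function name and the dictionary keys (id_a, namefrst_a, and so on) are assumptions made for the example, and exact_mult is computed here over the potential matches only.

```python
from collections import defaultdict


def aggregate_features(potential_matches):
    """Compute hits, hits2, and exact_mult per A individual.

    Hypothetical helper; each potential match is a dict with assumed keys
    id_a, namefrst_a, namelast_a, namefrst_b, namelast_b.
    """
    hits = defaultdict(int)
    exact = defaultdict(int)
    for pm in potential_matches:
        hits[pm["id_a"]] += 1
        same_name = (pm["namefrst_a"] == pm["namefrst_b"]
                     and pm["namelast_a"] == pm["namelast_b"])
        if same_name:
            exact[pm["id_a"]] += 1
    return {
        id_a: {
            "hits": n,
            "hits2": n * n,
            # 1 when more than one B record shares the exact name, else 0
            "exact_mult": int(exact[id_a] > 1),
        }
        for id_a, n in hits.items()
    }
```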

Household Aggregate Features

These features are not configurable. To include them in the generated comparison features, list them in the [hh_training][independent_vars] section of the config. They are generated using the “hh_aggregate_features” SQL template.

jw_max_a

The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where birth year difference is less than or equal to ten, excluding the individual A in the current potential match. Returns 0 if no other individuals are in the household for comparison.

jw_max_b

The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where sex matches and birth year difference is less than or equal to ten, excluding the individual A in the current potential match. Returns 0 if no other individuals are in the household for comparison.
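One possible reading of jw_max_a and jw_max_b is sketched below in Python. It is an interpretation of the descriptions above, not the SQL template's actual logic. The function jw_max, the record keys (id, namefrst, birthyr, sex), and the caller-supplied score function are all assumptions; a real Jaro-Winkler implementation would be passed in as score.

```python
from typing import Callable, Iterable, Mapping


def jw_max(household_a: Iterable[Mapping], household_b: Iterable[Mapping],
           current_id_a, score: Callable[[str, str], float],
           require_same_sex: bool = False) -> float:
    """Best first-name score between other members of household A and
    members of household B whose birth years differ by at most ten.

    Hypothetical helper: require_same_sex=False approximates jw_max_a,
    require_same_sex=True approximates jw_max_b.
    """
    household_b = list(household_b)
    best = 0.0  # returned when nobody else is available for comparison
    for pa in household_a:
        if pa["id"] == current_id_a:
            continue  # skip individual A from the current potential match
        for pb in household_b:
            if abs(pa["birthyr"] - pb["birthyr"]) > 10:
                continue
            if require_same_sex and pa["sex"] != pb["sex"]:
                continue
            best = max(best, score(pa["namefrst"], pb["namefrst"]))
    return best
```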