Halo Merger Tree

Section author: Stephen Skory <sskory@physics.ucsd.edu>

New in version 1.7.

The Halo Merger Tree extension is capable of building a database of halo mergers over a set of time-ordered Enzo datasets. The fractional contribution of older ‘parent’ halos to younger ‘child’ halos is calculated by comparing the unique index labels of their constituent particles. The data is stored in a SQLite database which enables the use of powerful and fast SQL queries over all the halos.

General Overview

The first requirement is a set of sequential Enzo datasets. The detail of the merger tree increases as the time between snapshots is reduced, at the cost of greater computational effort for the tree itself and more disk usage for the snapshots. The merger tree relies on the output of one of the Halo Finders in yt, and the user can choose which one to use. The merger tree is capable of running the halo finder if that has not already been done. Once halo finding is accomplished for all the data snapshots, the halo lineage is calculated by comparing the particle membership of halos between pairs of time steps. The halo data and tree data are stored in the SQLite database.

Another requirement is that Python has the sqlite3 module available.
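A quick way to verify this from a Python prompt:

import sqlite3
# If the import succeeds, print the version of the underlying SQLite library.
print sqlite3.sqlite_version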

The merger tree can be calculated in parallel, and if necessary, it will run the halo finding in parallel as well. Please see the note below about the special considerations needed for Network File Systems.

There is a convenience wrapper for querying the database, called MergerTreeConnect. It simplifies accessing data in the database.

There are two output classes for the merger tree. The MergerTreeDotOutput class outputs the tree for a user-specified subset of halos to a Graphviz format file. Graphviz is an open-source package for visualizing connected objects in a graphical way. There are binary distributions for all major operating systems. It is also possible to dump the contents of the SQLite database to a simple text file with the MergerTreeTextOutput class. The data is saved in columnar format.

Conceptual Primer

The best way to view the merger tree extension is as a two-part process. First, the merger tree is built and stored in the database. This process can be quite time consuming, depending on the size of the simulation and the number and size of halos found in the snapshots. It is not a process one wants to repeat very often, which is why it is kept separate from the analysis parts.

The second part is actually a many-part process, which is the analysis of the merger tree itself. The first step is computationally intensive, but the analysis step is user-intensive. The user needs to decide what to pull out of the merger tree and figure out how to extract the needed data with SQL statements. Once an analysis pipeline is written, it should run very fast for even very large databases.

A Note About Network File Systems

Accessing a SQLite database stored on a Network (or Distributed) File System (NFS) is a risky thing to do, particularly if more than one task wants to write at the same time. NFS disks can store files on multiple physical hard drives, and it can take time for changes made by one task to become visible to all the parallel tasks.

The Merger Tree takes extra care to ensure that every task sees the exact same version of the database before writing to it, and only one task ever writes to the database at a time. This is accomplished by using MPI Barriers and md5 hashing of the database between writes. In general, it is recommended to keep the database on a ‘real disk’ (/tmp for example, if all the tasks are on the same SMP node) if possible, but it should work on an NFS disk as well. If the database must be stored on an NFS disk, the documentation for the NFS protocol should be consulted to see what settings are available that can minimize the potential for file replication problems.
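As a rough illustration of the consistency check (a minimal sketch; the actual mechanism is internal to the merger tree), hashing a database file might look something like this:

import hashlib

def hash_database(fname):
    # Read the entire database file and return its md5 digest.
    # Tasks can compare digests to verify they all see the same bytes.
    m = hashlib.md5()
    f = open(fname, 'rb')
    m.update(f.read())
    f.close()
    return m.hexdigest()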

Running and Using the Halo Merger Tree

It is very simple to create a merger tree database for a series of snapshots. The most difficult part is creating an ordered list of Enzo restart files. There are two ways to do it, by hand or with the EnzoSimulation extension.

By Hand

Here is an example of how to build the list and build the database by hand. Here, the snapshots are stored in directories named DD????, and the Enzo restart files are named data????, where ???? is a four-digit zero-padded integer. The final snapshot considered (most progressed in time) is DD0116, and the earliest that will be examined is DD0100. The database will be saved to /path/to/database/halos.db. The example below works identically in serial or in parallel.

from yt.extensions.merger_tree import *

files = []
start = 100
finish = 116
for i in range(start, finish + 1):
    files.append('/path/to/snapshots/DD%04d/data%04d' % (i, i))

MergerTree(restart_files=files, database='/path/to/database/halos.db')

If the halos have not been found previously for the snapshots, the halo finder will be run automatically. See the note about this below.

Using EnzoSimulation

Here is how to build the input list of restart files using the EnzoSimulation extension. It is possible to set the range of snapshots used and the interval between them. Please see the EnzoSimulation documentation (Analyzing an Entire Simulation) for details.

from yt.extensions.merger_tree import *
import yt.extensions.EnzoSimulation as ES

es = ES.EnzoSimulation('/path/to/snapshots/simulation.par')

files = []
for output in es.allOutputs:
    files.append(output['filename'])

MergerTree(restart_files=files, database='/path/to/database/halos.db')

Merger Tree Parallelism

If the halos are to be found during the course of building the merger tree, run with a number of tasks appropriate to the size of the dataset and the halo finder used. The merger tree itself, which compares halo membership in parallel very effectively, is almost completely constrained by the read/write times of the SQLite file. In tests with the halos pre-located, there is not much speedup beyond two MPI tasks. Running the merger tree with more tasks does no harm (which is why, if halos are to be found by the merger tree, it should be run with as many tasks as that step requires), but there is no benefit.
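All of this parallelism uses MPI, so a merger tree script is launched like any other parallel yt script, e.g. something like mpirun -np 4 python my_script.py --parallel; consult the yt parallelism documentation for the exact invocation on a given system.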

How The Database Is Handled

The Merger Tree is designed to allow the merger tree database to be built incrementally. For example, if a simulation is currently being run, the merger tree database can be built for the available datasets, and extended to include new datasets as they are created. So if there are going to be 60 data snapshots total (indexed 0, 1, 2, ..., 59), and only 50 are saved when the tree is first built, the tree should first be built on datasets [0, 49]. When the last ten become available, re-run the merger tree on datasets [49, 59], referencing the same database as before so that the earlier work does not need to be repeated.
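Continuing that example, the incremental update might look like this (a sketch using the same path pattern as the By Hand example above):

from yt.extensions.merger_tree import *

# Re-run on the overlapping range [49, 59], pointing at the same database.
files = []
for i in range(49, 60):
    files.append('/path/to/snapshots/DD%04d/data%04d' % (i, i))

MergerTree(restart_files=files, database='/path/to/database/halos.db')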

Additional Parameters

When calling MergerTree, there are several parameters that control how the halo finder is run (if it needs to be run) and how the database is built.

  • halo_finder_function (name) - Which of the halo finders (Halo Finding) to use. Default: HaloFinder (HOP).
  • halo_finder_threshold (float) - When using HOP or Parallel HOP, this sets the threshold used. Default: 80.0.
  • FOF_link_length (float) - When using Friends of Friends (FOFHaloFinder), this sets the inter-particle link length used. Default: 0.2.
  • dm_only (bool) - Whether to include stars along with the dark matter (False), or use only the dark matter particles when building halos (True). Default: False.
  • refresh (bool) - If set to True, this will run the halo finder and rebuild the database regardless of whether or not the halo files or database already exist on disk. Default: False.
  • sleep (float) - The amount of time in seconds each task waits between checks to make sure the SQLite database file is globally identical. This delay allows a parallel file system to sync up globally. The value may not be negative or zero. Default: 1.
  • index (bool) - Whether to add an index to the SQLite file. True makes SQL searches faster at the cost of additional disk space. Default: True.

Example using Parallel HOP:

MergerTree(restart_files=files, database='/path/to/database/halos.db',
    halo_finder_function=parallelHF, halo_finder_threshold=100.)

Pre-Computing Halos

If halo finding is to happen before the merger tree is calculated, and the work is not to be wasted, special care should be taken to ensure that all the data required for the merger tree is saved. By default, the merger tree looks for files that begin with the name MergerHalos in the same directory as each Enzo restart file, and if those files are missing or renamed, halo finding will be performed again. If halos is the list of halos returned by the halo finder, these three commands should be called to save the needed data:

halos.write_out('MergerHalos.out')
halos.write_particle_lists('MergerHalos')
halos.write_particle_lists_txt('MergerHalos')

Please see the documents on halo finding for more information on what these commands do (Halo Finding).
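Putting that together, a pre-computation loop might look something like the sketch below. It assumes the usual yt pattern of load and HaloFinder, and explicitly prefixes the output names with each snapshot's directory so the files land beside the restart files:

from yt.mods import *
import os

for i in range(100, 117):
    fn = '/path/to/snapshots/DD%04d/data%04d' % (i, i)
    pf = load(fn)
    halos = HaloFinder(pf)
    # Write the three files the merger tree expects, next to the restart file.
    basename = os.path.join(os.path.dirname(fn), 'MergerHalos')
    halos.write_out(basename + '.out')
    halos.write_particle_lists(basename)
    halos.write_particle_lists_txt(basename)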

Accessing Data in the Database

SQLite databases support nearly all of the standard SQL queries. It is possible to write very complicated and powerful SQL queries, but only simple examples are shown below. Please see other resources (the web, books) for more on how to write SQL queries.

It is possible to read and modify a SQLite database from the command line using the sqlite3 command (e.g. sqlite3 database.db). It can be very convenient to use this to quickly inspect a database, but it is not suitable for extracting or inserting large amounts of data. There are many examples (again, see the web or books) available on how to use the command-line sqlite3 tool.
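The database can also be opened directly with Python's standard sqlite3 module, which is roughly what the MergerTreeConnect class described below wraps. For example, to list the tables in a database:

import sqlite3

# Open the database and list the tables it contains.
conn = sqlite3.connect('halos.db')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print cursor.fetchall()
conn.close()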

The table containing halo data in the database is named ‘Halos’. All queries for halo data will come from this table. The table has these columns:

  1. GlobalHaloID (int) - A fully-unique identifier for the halo.
  2. SnapCurrentTimeIdentifier (int) - A unique time identifier for the snapshot the halo comes from. Equivalent to ‘CurrentTimeIdentifier’ from the Enzo restart file.
  3. SnapZ (float) - The redshift for the halo.
  4. SnapHaloID (int) - The halo ID for the halo taken from the output of the halo finder (i.e. ‘halos.write_out(“HopAnalysis.out”)’). It is unique for halos in the same snapshot, but not unique across the full database.
  5. HaloMass (float) - The total mass of dark matter in the halo as identified by the halo finder.
  6. NumPart (int) - Number of dark matter particles in the halo as identified by the halo finder.
  7. CenMassX,
  8. CenMassY,
  9. CenMassZ (float) - The location of the center of mass of the halo in code units.
  10. BulkVelX,
  11. BulkVelY,
  12. BulkVelZ (float) - The velocity of the center of mass of the halo in cgs units.
  13. MaxRad (float) - The distance from the center of mass to the most remote particle in the halo in code units.
  14. ChildHaloID0 (int) - The GlobalHaloID of the child halo which receives the greatest proportion of particles from this halo.
  15. ChildHaloFrac0 (float) - The fraction by mass of particles from this (parent) halo that goes to the child halo recorded in ChildHaloID0. If all the particles from this parent halo go to ChildHaloID0, this number will be 1.0, regardless of the mass of the child halo.
  16. ChildHaloID[1-4], ChildHaloFrac[1-4] (int, float) - Similar to the columns above, these store the second through fifth greatest recipients of particle mass from this parent halo.

Warning

A value of -1 in any of the ChildHaloID columns corresponds to a fake (placeholder) child halo entry. There is no halo with an ID equal to -1. This is used during the merger tree construction, and must be accounted for when constructing SQL queries of the database.
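For example, a query over parent-child links can exclude the placeholder entries directly in the WHERE clause (using the query pattern shown below):

line = "SELECT GlobalHaloID, ChildHaloID0 FROM Halos WHERE ChildHaloID0 != -1;"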

To get the data for the most massive halo at the end of the simulation, use the MergerTreeConnect convenience class, which simplifies database access. A query might look like this:

from yt.extensions.merger_tree import *

mt = MergerTreeConnect(database='halos.db')
line = "SELECT * FROM Halos WHERE SnapZ=0.0 AND SnapHaloID=0;"
results = mt.query(line)

results is a list containing a single tuple with the values for that halo, in the same order as the columns given above.

If all that is wanted is a few of the columns, this slight modification below will retrieve only the desired data. In general, it is a good idea to retrieve only the columns that will actually be used. Requesting all the columns (with *) requires more reads from disk and slows down the query.

line = "SELECT NumPart, GlobalHaloID FROM Halos WHERE SnapZ=0.0 AND SnapHaloID=0;"
results = mt.query(line)

results is a list containing a single tuple with two items: the value of NumPart first and GlobalHaloID second.

If data from more than one halo is desired, more than one item will be returned. This query will find the largest halo from each of the snapshots.

from yt.extensions.merger_tree import *

mt = MergerTreeConnect(database='halos.db')
line = "SELECT HaloMass,SnapZ FROM Halos WHERE SnapHaloID=0;"
results = mt.query(line)

results is a list of multiple two-tuples. Note that SQLite doesn’t return the values in any particular order; if order is unimportant, this saves time. If order is important, the query can be modified to sort the results by redshift.

line = "SELECT HaloMass,SnapZ FROM Halos WHERE SnapHaloID=0 ORDER BY SnapZ DESC;"

Now results will be ordered by time, first to last, for each two-tuple in the list.

The last example shows the kernel of the most important operation for a merger tree: recursion back in time to find the progenitors of a halo. Using a query similar to the ones above, the GlobalHaloID is found for the halo of interest at some late point in time (z=0, typically). Using that value (here the made-up placeholder 1234567), the halos that came before can be identified very easily:

from yt.extensions.merger_tree import *

mt = MergerTreeConnect(database='halos.db')

# Recursive function that follows the chain of parent halos back in time.
def findParent(haloID, lineage):
    line = "SELECT GlobalHaloID from Halos where ChildHaloID0=%d;" % haloID
    results = mt.query(line)
    if results == []:
        # No parent found; we've reached the earliest snapshot.
        return lineage
    # A one-tuple inside a list.
    parentID = results[0][0]
    lineage[parentID] = haloID
    # Now we recurse back in time, returning the accumulated lineage.
    return findParent(parentID, lineage)

# Stores the parent->child relationships.
lineage = {}
# Call the function once with the late halo.
lineage = findParent(1234567, lineage)

Contained within the dict lineage is the primary lineage for the final chosen halo. Storing the family tree in this way may not be the best choice, but this makes it clear how easy it is to build up the history of a halo over time.
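As a follow-on sketch, the progenitor IDs collected in lineage can be fed back into simple queries, for example to print a mass history for the halo:

# For each progenitor found above, retrieve its mass and redshift.
for parentID in lineage:
    line = "SELECT HaloMass, SnapZ FROM Halos WHERE GlobalHaloID=%d;" % parentID
    results = mt.query(line)
    print results[0]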

Merger Tree Output

There are two included methods for outputting the contents of a Merger Tree database: Graphviz and plain-text columnar format.

Graphviz Output

The Graphviz output saves a plain text file to disk that is parsed by one of the Graphviz engines. The typical engine for this output is the dot engine, which produces hierarchical diagrams where directionality (such as left to right or top to bottom) indicates some meaningful property. In the case of the merger tree, top to bottom indicates the progress of time. Graphviz can output the visualization into a wide range of image and vector formats suitable for any application.

Below is a simple example of the Graphviz/dot visualization. Each box contains the mass of the halo (in Msun) and the center of mass of the halo in simulation units. For each snapshot, the box for the largest halo is colored red. The numbers next to the link arrows give the percentage of the parent halo’s mass that goes to the child. On each row, the un-linked black boxes contain the redshift for that snapshot.

[Image: merger_tree_ex.png, an example of the Graphviz/dot merger tree visualization described above.]

To output the merger tree for a set of halos, the chosen halos need to be identified. There are two choices: either the GlobalHaloID, or the SnapHaloID along with the SnapCurrentTimeIdentifier value for the chosen halo(s), may be used. Two pieces of information are needed if GlobalHaloID is not specified because SnapHaloID is not a unique identifier in the database. The reason SnapCurrentTimeIdentifier is used rather than SnapZ has to do with the floating-point representation of the redshift column and the way SQL queries work. If SnapZ were used, the precise float value of the desired redshift would have to be given, rather than the simpler-to-get-correct integer value of SnapCurrentTimeIdentifier.

Luckily it isn’t as hard as it sounds to get the GlobalHaloID for the desired halo(s). By using the MergerTreeConnect class, it is simple to pick out halos before creating the Graphviz output. Below, the GlobalHaloID for the most massive halo in the last (z~0, typically) snapshot is found:

from yt.extensions.merger_tree import *

mt = MergerTreeConnect(database='halos.db')

line = "SELECT max(GlobalHaloID) FROM Halos WHERE SnapHaloID=0;"
results = mt.query(line)
print results

Because the database is built from early times to late, the most massive halo at z~0 will have the largest GlobalHaloID of all halos with SnapHaloID=0. results will contain the desired GlobalHaloID as a one-tuple in a list.

To output the merger tree for the five largest halos in the last snapshot, it may be simplest to find the SnapCurrentTimeIdentifier for that snapshot. This can either be done by referencing the dataset itself by hand (look for CurrentTimeIdentifier in the Enzo restart file), or by querying the database. Here is how to query the database for the right information:

from yt.extensions.merger_tree import *

mt = MergerTreeConnect(database='halos.db')

line = "SELECT max(GlobalHaloID) FROM Halos WHERE SnapHaloID=0;"
results = mt.query(line)

line = "SELECT SnapCurrentTimeIdentifier FROM Halos WHERE GlobalHaloID=%d;" % results[0][0]
results = mt.query(line)
print results

results contains the desired SnapCurrentTimeIdentifier as a one-tuple in a list. Supposing that the desired SnapCurrentTimeIdentifier is 72084721, outputting merger trees is now simple:

from yt.extensions.merger_tree import *

MergerTreeDotOutput(halos=[0,1,2,3,4], database='halos.db',
    dotfile='MergerTree.gv', current_time=72084721)

This will output the file MergerTree.gv which can be parsed by Graphviz.
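For example, running dot -Tpng MergerTree.gv -o MergerTree.png at the command line renders the tree as a PNG image; other output formats are selected with the -T flag.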

If the GlobalHaloID values are known for all of the desired halos, current_time should not be specified, as below:

from yt.extensions.merger_tree import *

MergerTreeDotOutput(halos=[24212,5822,19822,10423,51324], database='halos.db',
    dotfile='MergerTree.gv', link_min=0.7)

The link_min parameter above limits the tree to following links between parent and child halos for which at least 70% of the parent halo’s mass goes to the child. The default is 0.2.

Plain-Text Output

This is how to output the entire contents of the database to a text file:

from yt.extensions.merger_tree import *

MergerTreeTextOutput(database='halos.db', outfile='MergerTreeDB.txt')

Putting it All Together

Here is an example of how to create a merger tree for the most massive halo in the final snapshot from start to finish. This will work in serial and in parallel.

from yt.extensions.merger_tree import *

# Pick our snapshots to use.
files = []
start = 100
finish = 116
for i in range(start, finish + 1):
    files.append('/path/to/snapshots/DD%04d/data%04d' % (i, i))

my_database = '/path/to/database/halos.db'

# Build the tree.
MergerTree(restart_files=files, database=my_database)

# Get the GlobalHaloID for the halo.
mt = MergerTreeConnect(database=my_database)
line = "SELECT max(GlobalHaloID) FROM Halos WHERE SnapHaloID=0;"
results = mt.query(line)
my_halo = results[0][0] # one-tuple in a list

# Output the Graphviz file.
MergerTreeDotOutput(halos=[my_halo], database=my_database, link_min=0.5,
    dotfile='MergerTree.gv')