GTFSTK 6.0 Documentation

GTFSTK is a Python 3.5 tool kit for processing General Transit Feed Specification (GTFS) data in memory without a database. It is mostly for computing statistics, such as daily service distance per route and daily number of trips per stop. It uses Pandas and Shapely to do the heavy lifting.

Installation

Create a Python 3.5 virtual environment and pip install gtfstk.

Examples

You can play with ipynb/examples.ipynb in a Jupyter notebook.

Conventions

In conformance with GTFS and unless specified otherwise, dates are encoded as date strings of the form YYYYMMDD and times are encoded as time strings of the form HH:MM:SS with the possibility that the hour is greater than 23. Unless specified otherwise, ‘data frame’ and ‘series’ refer to Pandas data frames and series, respectively.

constants Module

gtfstk.constants.CRS_WGS84 = {'ellps': 'WGS84', 'no_defs': True, 'proj': 'longlat', 'datum': 'WGS84'}
gtfstk.constants.DIST_UNITS = ['ft', 'mi', 'm', 'km']
gtfstk.constants.DTYPE = {'zone_id': <class 'str'>, 'route_short_name': <class 'str'>, 'date': <class 'str'>, 'trip_id': <class 'str'>, 'shape_id': <class 'str'>, 'parent_station': <class 'str'>, 'to_stop_id': <class 'str'>, 'route_id': <class 'str'>, 'service_id': <class 'str'>, 'agency_id': <class 'str'>, 'origin_id': <class 'str'>, 'end_date': <class 'str'>, 'contains_id': <class 'str'>, 'destination_id': <class 'str'>, 'start_date': <class 'str'>, 'stop_code': <class 'str'>, 'from_stop_id': <class 'str'>, 'fare_id': <class 'str'>, 'stop_id': <class 'str'>}
gtfstk.constants.FEED_ATTRS_PRIVATE = ['_trips_i', '_calendar_i', '_calendar_dates_g']
gtfstk.constants.FEED_ATTRS_PUBLIC = ['agency', 'stops', 'routes', 'trips', 'stop_times', 'calendar', 'calendar_dates', 'fare_attributes', 'fare_rules', 'shapes', 'frequencies', 'transfers', 'feed_info', 'dist_units']
gtfstk.constants.GTFS_TABLES_OPTIONAL = ['calendar_dates', 'fare_attributes', 'fare_rules', 'shapes', 'frequencies', 'transfers', 'feed_info']
gtfstk.constants.GTFS_TABLES_REQUIRED = ['agency', 'stops', 'routes', 'trips', 'stop_times', 'calendar']
gtfstk.constants.INT_COLUMNS = ['location_type', 'wheelchair_boarding', 'route_type', 'direction_id', 'stop_sequence', 'wheelchair_accessible', 'bikes_allowed', 'pickup_type', 'drop_off_type', 'timepoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'exception_type', 'payment_method', 'transfers', 'shape_pt_sequence', 'exact_times', 'transfer_type', 'transfer_duration', 'min_transfer_time']

utilities Module

gtfstk.utilities.almost_equal(f, g)

Return True if and only if the given data frames are equal after sorting their column names, sorting their values, and resetting their indices.

gtfstk.utilities.datestr_to_date(x, format_str='%Y%m%d', inverse=False)

Given a string object x representing a date in the given format, convert it to a datetime.date object and return the result. If inverse, then assume that x is a date object and return its corresponding string in the given format.
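The described behaviour can be approximated with the standard library alone. The following is a minimal sketch, not the library's source; the name datestr_to_date_sketch is hypothetical and chosen only to mirror the documented function.

```python
from datetime import date, datetime


def datestr_to_date_sketch(x, format_str="%Y%m%d", inverse=False):
    """Illustrative stand-in for the documented function (an assumption,
    not the library's implementation)."""
    if inverse:
        # x is a datetime.date object; format it as a string.
        return x.strftime(format_str)
    # x is a string; parse it and keep only the date part.
    return datetime.strptime(x, format_str).date()


print(datestr_to_date_sketch("20160801"))
print(datestr_to_date_sketch(date(2016, 8, 1), inverse=True))
```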

gtfstk.utilities.get_convert_dist(dist_units_in, dist_units_out)

Return a function of the form

distance in the units dist_units_in -> distance in the units dist_units_out

Only supports distance units in DIST_UNITS.
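One plausible way to build such a converter is to route every unit through metres. The sketch below assumes that approach; the table METERS_PER_UNIT and the function name are hypothetical and not taken from the library.

```python
# Metres per unit for the four units listed in DIST_UNITS.
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def get_convert_dist_sketch(dist_units_in, dist_units_out):
    """Return a function mapping a distance in dist_units_in to the
    same distance in dist_units_out (illustrative only)."""
    for units in (dist_units_in, dist_units_out):
        if units not in METERS_PER_UNIT:
            raise ValueError("Unsupported distance units: {!r}".format(units))
    factor = METERS_PER_UNIT[dist_units_in] / METERS_PER_UNIT[dist_units_out]
    return lambda d: d * factor


km_to_m = get_convert_dist_sketch("km", "m")
print(km_to_m(1.5))
```

Returning a function rather than a number lets the caller apply the same conversion to a whole column of distances.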

gtfstk.utilities.get_max_runs(x)

Given a list of numbers, return a NumPy array of pairs (start index, end index + 1) of the runs of max value.

EXAMPLES:

>>> get_max_runs([7, 1, 2, 7, 7, 1, 2])
array([[0, 1],
       [3, 5]])

Assume x is not empty. Recipe from here
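For readers who want to see the idea without NumPy, here is a pure-Python sketch of the same computation. It returns a list of tuples rather than a NumPy array, and the name get_max_runs_sketch is hypothetical.

```python
def get_max_runs_sketch(x):
    """Return (start index, end index + 1) pairs for each run of
    consecutive entries equal to max(x). Assumes x is not empty."""
    peak = max(x)
    runs = []
    start = None  # start index of the run currently being scanned, if any
    for i, value in enumerate(x):
        if value == peak and start is None:
            start = i
        elif value != peak and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        # The last run reaches the end of the list.
        runs.append((start, len(x)))
    return runs


print(get_max_runs_sketch([7, 1, 2, 7, 7, 1, 2]))
```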

gtfstk.utilities.get_peak_indices(times, counts)

Given an increasing list of times as seconds past midnight and a list of trip counts at those times, return a pair of indices i, j such that times[i] to times[j] is the first longest time period such that for all i <= x < j, counts[x] is the max of counts. Assume times and counts have the same nonzero length.
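The selection of the "first longest" peak period can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name is hypothetical, and a run that reaches the end of the list is measured up to the last available time, a choice the documentation does not specify.

```python
from itertools import groupby


def get_peak_indices_sketch(times, counts):
    """Return (i, j) delimiting the first longest period during which
    counts stays at its maximum (illustrative only)."""
    peak = max(counts)
    # Collect (start, end + 1) index pairs of runs equal to the peak.
    runs = []
    i = 0
    for value, group in groupby(counts):
        n = len(list(group))
        if value == peak:
            runs.append((i, i + n))
        i += n

    last = len(times) - 1

    def duration(run):
        # Assumption: clamp the end index so a trailing run stays in range.
        return times[min(run[1], last)] - times[run[0]]

    # max() keeps the first of several equally long runs.
    return max(runs, key=duration)


times = [0, 300, 600, 900, 1200, 1500]
counts = [2, 5, 5, 1, 5, 3]
print(get_peak_indices_sketch(times, counts))
```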

gtfstk.utilities.get_segment_length(linestring, p, q=None)

Given a Shapely linestring and two Shapely points, project the points onto the linestring, and return the distance along the linestring between the two points. If q is None, then return the distance from the start of the linestring to the projection of p. The distance is measured in the native coordinates of the linestring.

gtfstk.utilities.get_utm_crs(lat, lon)

Return a GeoPandas coordinate reference system (CRS) dictionary corresponding to the UTM projection appropriate to the given WGS84 latitude and longitude.
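The choice of UTM zone follows from the longitude, and the hemisphere from the latitude. The sketch below shows that arithmetic; the function names and the exact keys of the returned dictionary are assumptions (a PROJ-style parameter dictionary is used for illustration), and the special zones around Norway and Svalbard are ignored.

```python
def utm_zone_sketch(lat, lon):
    """Return (zone number, is_southern_hemisphere) for a WGS84 point.
    Zones are 6 degrees wide, numbered 1 to 60 starting at 180 degrees west."""
    zone = min(int((lon + 180) // 6) + 1, 60)
    return zone, lat < 0


def utm_crs_sketch(lat, lon):
    """Build a PROJ-style dictionary for the zone (keys are an assumption)."""
    zone, south = utm_zone_sketch(lat, lon)
    crs = {
        "proj": "utm",
        "zone": zone,
        "ellps": "WGS84",
        "datum": "WGS84",
        "units": "m",
        "no_defs": True,
    }
    if south:
        crs["south"] = True
    return crs


print(utm_crs_sketch(-36.85, 174.76))
```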

gtfstk.utilities.is_not_null(data_frame, column_name)

Return True if the given data frame has a column of the given name (string), and there exists at least one non-NaN value in that column; return False otherwise.

gtfstk.utilities.linestring_to_utm(linestring)

Given a Shapely LineString in WGS84 coordinates, convert it to the appropriate UTM coordinates. If inverse, then do the inverse.

gtfstk.utilities.time_it(f)

gtfstk.utilities.timestr_mod24(timestr)

Given a GTFS time string in the format %H:%M:%S, return a timestring in the same format but with the hours taken modulo 24.

gtfstk.utilities.timestr_to_seconds(x, inverse=False, mod24=False)

Given a time string of the form ‘%H:%M:%S’, return the number of seconds past midnight that it represents. In keeping with GTFS standards, the hours entry may be greater than 23. If mod24, then return the number of seconds modulo 24*3600. If inverse, then do the inverse operation. In this case, if mod24 also, then first take the number of seconds modulo 24*3600.
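The conversion rules above can be sketched in a few lines of plain Python. This is an illustrative reimplementation under the stated rules, not the library's source, and the name timestr_to_seconds_sketch is hypothetical.

```python
DAY_SECONDS = 24 * 3600


def timestr_to_seconds_sketch(x, inverse=False, mod24=False):
    """Convert 'HH:MM:SS' to seconds past midnight, or back if inverse."""
    if not inverse:
        hours, minutes, seconds = (int(part) for part in x.split(":"))
        total = 3600 * hours + 60 * minutes + seconds
        return total % DAY_SECONDS if mod24 else total
    # Inverse direction: x is a number of seconds.
    total = int(x) % DAY_SECONDS if mod24 else int(x)
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, seconds)


print(timestr_to_seconds_sketch("25:30:00"))
print(timestr_to_seconds_sketch("25:30:00", mod24=True))
print(timestr_to_seconds_sketch(91800, inverse=True, mod24=True))
```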

gtfstk.utilities.weekday_to_str(weekday, inverse=False)

Given a weekday, that is, an integer in range(7), return its corresponding weekday name as a lowercase string. Here 0 -> ‘monday’, 1 -> ‘tuesday’, and so on. If inverse, then perform the inverse operation.
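The numbering matches that of datetime.date.weekday() in the standard library, which makes it easy to look up the calendar column for a given date. A minimal sketch, with hypothetical names:

```python
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def weekday_to_str_sketch(weekday, inverse=False):
    """Map 0..6 to a lowercase weekday name, or back if inverse."""
    if inverse:
        return WEEKDAYS.index(weekday)
    return WEEKDAYS[weekday]


# date.weekday() also counts from Monday = 0.
print(weekday_to_str_sketch(date(2016, 8, 1).weekday()))
```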

feed Module

This module defines the Feed class, which represents a GTFS feed as a collection of data frames, and defines some basic operations on Feed objects. Almost all other operations on Feed objects are defined as functions living outside of the Feed class rather than methods of the Feed class. Every function that acts on a Feed object assumes that every attribute of the feed that represents a GTFS file, such as agency or stops, is either None or a data frame with the columns required in the GTFS.

class gtfstk.feed.Feed(dist_units, agency=None, stops=None, routes=None, trips=None, stop_times=None, calendar=None, calendar_dates=None, fare_attributes=None, fare_rules=None, shapes=None, frequencies=None, transfers=None, feed_info=None)

Bases: object

A class that represents a GTFS feed, where GTFS tables are stored as data frames. Beware, the stop times data frame can be big (several gigabytes), so make sure you have enough memory to handle it. Feed (public) attributes are

  • dist_units: a string in constants.DIST_UNITS; specifies the distance units to use when calculating various stats, such as route service distance; should match the implicit distance units of the shape_dist_traveled column values, if present
  • agency
  • stops
  • routes
  • trips
  • stop_times
  • calendar
  • calendar_dates
  • fare_attributes
  • fare_rules
  • shapes
  • frequencies
  • transfers
  • feed_info

There are also a few private Feed attributes that are derived from some public attributes and are automatically updated when those public attributes change. However, for this update to work, you must properly update the primary attributes like this:

feed.trips['route_short_name'] = 'bingo'
feed.trips = feed.trips

and not like this:

feed.trips['route_short_name'] = 'bingo'

The first way ensures that the altered trips data frame is saved as the new trips attribute, but the second way does not.
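The mechanism can be illustrated with a toy class. The sketch below is not the library's Feed: TinyFeed and its list-of-dictionaries storage are hypothetical, and only the private attribute name _trips_i is borrowed from constants.FEED_ATTRS_PRIVATE. It shows why an in-place change bypasses a property setter while a reassignment triggers it.

```python
class TinyFeed:
    """Toy stand-in for Feed; stores trips as a list of dicts."""

    def __init__(self, trips):
        self.trips = trips  # goes through the setter below

    @property
    def trips(self):
        return self._trips

    @trips.setter
    def trips(self, value):
        self._trips = value
        # Rebuild the derived (private) index whenever trips is assigned.
        self._trips_i = {row["trip_id"]: row for row in value}


feed = TinyFeed([{"trip_id": "t1"}])

feed.trips.append({"trip_id": "t2"})  # in-place change: setter not called
stale = sorted(feed._trips_i)

feed.trips = feed.trips               # reassignment: setter called
fresh = sorted(feed._trips_i)

print(stale, fresh)
```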

calendar

A public Feed attribute made into a property for easy auto-updating of private feed attributes based on the calendar data frame.

calendar_dates

A public Feed attribute made into a property for easy auto-updating of private feed attributes based on the calendar dates data frame.

copy()

Return a copy of this feed, that is, a feed with all the same public and private attributes.

dist_units

A public Feed attribute made into a property for easy validation.

trips

A public Feed attribute made into a property for easy auto-updating of private feed attributes based on the trips data frame.

gtfstk.feed.read_gtfs(path, dist_units=None)

Create a Feed object from the given path and given distance units. The path points to a directory containing GTFS text files or a zip file that unzips as a collection of GTFS text files (but not as a directory containing GTFS text files).

gtfstk.feed.write_gtfs(feed, path, ndigits=6)

Export the given feed to a zip archive located at path. Round all decimals to ndigits decimal places. All distances will be displayed in units feed.dist_units.

validator Module

This module contains functions that supplement but do not replace the feedvalidator module of the transitfeed package. The latter module checks if GTFS feeds adhere to the GTFS specification.

exception gtfstk.validator.GTFSError(feed, msg)

Bases: Exception

Exception raised for Feed objects that do not conform to the GTFS specification. Attributes:

  • msg: explanation of the error

gtfstk.validator.check_calendar(feed)

Check that one of feed.calendar or feed.calendar_dates is nonempty.

cleaner Module

This module contains functions for cleaning Feed objects.

gtfstk.cleaner.aggregate_routes(feed, by='route_short_name', route_id_prefix='route_')

Given a GTFSTK Feed object, group routes by the by column of feed.routes and for each group,

  1. choose the first route in the group,
  2. assign a new route ID based on the given route_id_prefix string and a running count, e.g. 'route_013'
  3. assign all the trips associated with routes in the group to that first route.

Update feed.routes and feed.trips with the new routes, and return the resulting feed.

gtfstk.cleaner.assess(feed)

Return a Pandas series containing various feed assessments, such as the number of trips missing shapes. This is not a GTFS validator.

gtfstk.cleaner.clean(feed)

Given a GTFSTK Feed instance, apply the following functions to it and return the resulting feed.

  1. clean_ids()
  2. clean_stop_times()
  3. clean_route_short_names()
  4. prune_dead_routes()

gtfstk.cleaner.clean_ids(feed)

Strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting feed.
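Applied to a single ID, the rule amounts to a strip followed by a regular-expression substitution. A minimal sketch, assuming the ID is already a string; the helper name clean_id_sketch is hypothetical:

```python
import re


def clean_id_sketch(value):
    """Strip outer whitespace, then replace each inner whitespace
    chunk with a single underscore."""
    return re.sub(r"\s+", "_", value.strip())


print(clean_id_sketch("  route 10   A "))
```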

gtfstk.cleaner.clean_route_short_names(feed)

In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting feed.
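The three steps can be sketched on plain dictionaries instead of a data frame. The function below is illustrative only; its name and the list-of-dictionaries input are assumptions made to keep the example free of dependencies.

```python
from collections import Counter


def clean_route_short_names_sketch(routes):
    """routes: list of dicts with 'route_id' and 'route_short_name'.
    Return a cleaned copy of the list."""
    cleaned = []
    for route in routes:
        name = route.get("route_short_name")
        # Fill in missing names and strip whitespace from the rest.
        name = "n/a" if name is None else name.strip()
        cleaned.append({**route, "route_short_name": name})

    # Count names after cleaning, then disambiguate the duplicated ones.
    counts = Counter(route["route_short_name"] for route in cleaned)
    for route in cleaned:
        name = route["route_short_name"]
        if counts[name] > 1:
            route["route_short_name"] = name + "-" + route["route_id"]
    return cleaned


routes = [
    {"route_id": "r1", "route_short_name": " 10 "},
    {"route_id": "r2", "route_short_name": "10"},
    {"route_id": "r3", "route_short_name": None},
    {"route_id": "r4", "route_short_name": "20"},
]
print([r["route_short_name"] for r in clean_route_short_names_sketch(routes)])
```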

gtfstk.cleaner.clean_stop_times(feed)

In feed.stop_times, prefix a zero to arrival and departure times if necessary. This makes sorting by time work as expected. Return the resulting feed.
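The effect on sorting is easy to see with a few time strings. In this minimal sketch, pad_time_sketch is a hypothetical helper that assumes each value is a string of the form H:MM:SS or HH:MM:SS.

```python
def pad_time_sketch(timestr):
    """Left-pad a GTFS time string such as '7:05:00' to 'HH:MM:SS'."""
    return timestr.zfill(8)


times = ["10:00:00", "7:05:00", "25:10:00"]

# A plain string sort puts '7:05:00' last because '7' > '1' and '7' > '2'.
unpadded_order = sorted(times)

# After padding, string order and chronological order agree.
padded_order = sorted(pad_time_sketch(t) for t in times)

print(unpadded_order)
print(padded_order)
```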

gtfstk.cleaner.drop_invalid_columns(feed)

Given a GTFSTK Feed instance, drop all data frame columns not listed in constants.VALID_COLS. Return the resulting feed.

gtfstk.cleaner.prune_dead_routes(feed)

Remove all routes from feed.routes that do not have trips listed in feed.trips. Return the resulting feed.

calculator Module

This module contains functions for calculating properties of Feed objects, such as daily service duration per route.

gtfstk.calculator.append_dist_to_shapes(feed)

Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting feed.

Assume the following feed attributes are not None:

  • feed.shapes

gtfstk.calculator.append_dist_to_stop_times(feed, trips_stats)

Calculate and append the optional shape_dist_traveled field in feed.stop_times in terms of the distance units feed.dist_units. Need trip stats in the form output by compute_trip_stats() for this. Return the resulting feed. Does not always give accurate results, as described below.

Assume the following feed attributes are not None:

ALGORITHM:

Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip linestring. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.

Get the average speed of the trip via trips_stats and use it to linearly interpolate distances for stop times, assuming that the first stop is at shape_dist_traveled = 0 (the start of the shape) and the last stop is at shape_dist_traveled = the length of the trip (taken from trips_stats and equal to the length of the shape, unless trips_stats was computed with compute_dist_from_shapes == False). This fallback method usually kicks in on trips with self-intersecting linestrings. Unfortunately, this fallback method will produce incorrect results when the first stop does not lie at the start of its shape (so shape_dist_traveled != 0). This is the case for several trips in the Portland feed at https://transitfeeds.com/p/trimet/43/1400947517, for example.
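The fallback can be sketched for a single trip as follows. This is a simplified illustration, not the library's code: the function names are hypothetical, stop times are given as seconds past midnight, and the monotonicity test is a non-strict one chosen for the example.

```python
def is_increasing(values):
    """Return True if the values never decrease."""
    return all(a <= b for a, b in zip(values, values[1:]))


def interpolate_distances_sketch(stop_seconds, trip_distance):
    """Spread trip_distance over the stops in proportion to elapsed time,
    which is equivalent to assuming a constant average speed."""
    start, end = stop_seconds[0], stop_seconds[-1]
    duration = end - start
    if duration == 0:
        return [0.0 for _ in stop_seconds]
    speed = trip_distance / duration
    return [speed * (t - start) for t in stop_seconds]


# Distances measured along the shape; the third value goes backwards.
measured = [0.0, 2.1, 1.8, 6.0]
stop_seconds = [0, 600, 1200, 1800]

if is_increasing(measured):
    distances = measured
else:
    distances = interpolate_distances_sketch(stop_seconds, 6.0)
print(distances)
```

The first stop always lands on 0 here, which is exactly the limitation described above for trips that start partway along their shape.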

gtfstk.calculator.append_route_type_to_shapes(feed)

Append a route_type column to a copy of feed.shapes and return the resulting shapes data frame. Note that a single shape can be linked to multiple trips on multiple routes of multiple route types. In that case the route type of the shape is the route type of the last route (sorted by ID) with a trip with that shape.

Assume the following feed attributes are not None:

  • feed.routes
  • feed.trips
  • feed.shapes

gtfstk.calculator.build_geometry_by_shape(feed, use_utm=False, shape_ids=None)

Return a dictionary with structure shape_id -> Shapely linestring of shape. If feed.shapes is None, then return None. If use_utm, then return each linestring in UTM coordinates. Otherwise, return each linestring in WGS84 longitude-latitude coordinates. If a list of shape IDs shape_ids is given, then only include the given shape IDs.

Assume the following feed attributes are not None:

  • feed.shapes

gtfstk.calculator.build_geometry_by_stop(feed, use_utm=False, stop_ids=None)

Return a dictionary with structure stop_id -> Shapely point object. If use_utm, then return each point in UTM coordinates. Otherwise, return each point in WGS84 longitude-latitude coordinates. If a list of stop IDs stop_ids is given, then only include the given stop IDs.

Assume the following feed attributes are not None:

  • feed.stops

gtfstk.calculator.combine_time_series(time_series_dict, kind, split_directions=False)

Given a dictionary of time series data frames, combine the time series into one time series data frame with multi-index (hierarchical) columns and return the result. The top level columns are the keys of the dictionary and the second and third level columns are ‘route_id’ and ‘direction_id’, if kind == 'route', or ‘stop_id’ and ‘direction_id’, if kind == 'stop'. If split_directions == False, then there is no third column level, no ‘direction_id’ column.

gtfstk.calculator.compute_bounds(feed)

Return the tuple (min longitude, min latitude, max longitude, max latitude), where the longitudes and latitudes vary across all the stop (WGS84) coordinates.

gtfstk.calculator.compute_busiest_date(feed, dates)

Given a list of dates, return the first date that has the maximum number of active trips. If the list of dates is empty, then raise a ValueError.

Assume the following feed attributes are not None:

gtfstk.calculator.compute_center(feed, num_busiest_stops=None)

Compute the convex hull of all the given feed’s stop coordinates and return the centroid. If an integer num_busiest_stops is given, then compute the num_busiest_stops busiest stops in the feed on the first Monday of the feed and return the mean of the longitudes and the mean of the latitudes of these stops, respectively.

gtfstk.calculator.compute_feed_stats(feed, trips_stats, date)

Given trips_stats, which is the output of feed.compute_trip_stats(), and a date, return a data frame including the following feed stats for the date.

  • num_trips: number of trips active on the given date
  • num_routes: number of routes active on the given date
  • num_stops: number of stops active on the given date
  • peak_num_trips: maximum number of simultaneous trips in service
  • peak_start_time: start time of first longest period during which the peak number of trips occurs
  • peak_end_time: end time of first longest period during which the peak number of trips occurs
  • service_distance: sum of the service distances for the active routes
  • service_duration: sum of the service durations for the active routes
  • service_speed: service_distance/service_duration

If there are no stats for the given date, return an empty data frame with the specified columns.

Assume the following feed attributes are not None:

gtfstk.calculator.compute_feed_time_series(feed, trips_stats, date, freq='5Min')

Given trips stats (output of feed.compute_trip_stats()), a date, and a Pandas frequency string, return a time series of stats for this feed on the given date at the given frequency with the following columns

  • num_trip_starts: number of trips starting at this time
  • num_trips: number of trips in service during this time period
  • service_distance: distance traveled by all active trips during this time period
  • service_duration: duration traveled by all active trips during this time period
  • service_speed: service_distance/service_duration

If there is no time series for the given date, return an empty data frame with specified columns.

Assume the following feed attributes are not None:

gtfstk.calculator.compute_route_stats(feed, trips_stats, date, split_directions=False, headway_start_time='07:00:00', headway_end_time='19:00:00')

Take trips_stats, which is the output of compute_trip_stats(), cut it down to the subset S of trips that are active on the given date, and then call compute_route_stats_base() with S and the keyword arguments split_directions, headway_start_time, and headway_end_time.

See compute_route_stats_base() for a description of the output.

Assume the following feed attributes are not None:

NOTES:
  • This is a more user-friendly version of compute_route_stats_base(). The latter function works without a feed, though.
  • Return None if the date does not lie in this feed’s date range.

gtfstk.calculator.compute_route_stats_base(trips_stats_subset, split_directions=False, headway_start_time='07:00:00', headway_end_time='19:00:00')

Given a subset of the output of Feed.compute_trip_stats(), calculate stats for the routes in that subset.

Return a data frame with the following columns:

  • route_id
  • route_short_name
  • route_type
  • direction_id
  • num_trips: number of trips
  • is_loop: 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise
  • is_bidirectional: 1 if the route has trips in both directions; 0 otherwise
  • start_time: start time of the earliest trip on the route
  • end_time: end time of latest trip on the route
  • max_headway: maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
  • min_headway: minimum of the durations (in minutes) mentioned above
  • mean_headway: mean of the durations (in minutes) mentioned above
  • peak_num_trips: maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
  • peak_start_time: start time of first longest period during which the peak number of trips occurs
  • peak_end_time: end time of first longest period during which the peak number of trips occurs
  • service_duration: total of the duration of each trip on the route in the given subset of trips; measured in hours
  • service_distance: total of the distance traveled by each trip on the route in the given subset of trips; measured in wunits, that is, whatever distance units are present in trips_stats_subset; contains all np.nan entries if feed.shapes is None
  • service_speed: service_distance/service_duration; measured in wunits per hour
  • mean_trip_distance: service_distance/num_trips
  • mean_trip_duration: service_duration/num_trips

If split_directions == False, then remove the direction_id column and compute each route’s stats, except for headways, using its trips running in both directions. In this case, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.

If trips_stats_subset is empty, return an empty data frame with the columns specified above.

Assume the following feed attributes are not None: none.
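The headway columns described above can be illustrated for a single route and direction. The sketch below is not the library's implementation: the function name is hypothetical, trip start times are given as seconds past midnight, and treating both ends of the headway window as inclusive is an assumption.

```python
def headway_stats_sketch(start_times, window_start, window_end):
    """Return (max, min, mean) headway in minutes between consecutive
    trip starts inside the window, or None if fewer than two starts."""
    inside = sorted(t for t in start_times if window_start <= t <= window_end)
    gaps = [(b - a) / 60 for a, b in zip(inside, inside[1:])]
    if not gaps:
        return None
    return max(gaps), min(gaps), sum(gaps) / len(gaps)


# Trip starts at 07:00, 07:10, 07:30, and 20:00; window 07:00 to 19:00.
starts = [25200, 25800, 27000, 72000]
print(headway_stats_sketch(starts, 7 * 3600, 19 * 3600))
```

The 20:00 trip falls outside the window, so it does not contribute a gap.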

gtfstk.calculator.compute_route_time_series(feed, trips_stats, date, split_directions=False, freq='5Min')

Take trips_stats, which is the output of compute_trip_stats(), cut it down to the subset S of trips that are active on the given date, and then call compute_route_time_series_base() with S and the given keyword arguments split_directions and freq and with date_label = ut.date_to_str(date).

See compute_route_time_series_base() for a description of the output.

If there are no active trips on the date, then return None.

Assume the following feed attributes are not None:

NOTES:
This is a more user-friendly version of compute_route_time_series_base(). The latter function works without a feed, though.

gtfstk.calculator.compute_route_time_series_base(trips_stats_subset, split_directions=False, freq='5Min', date_label='20010101')

Given a subset of the output of Feed.compute_trip_stats(), calculate time series for the routes in that subset.

Return a time series version of the following route stats:

  • number of trips in service by route ID
  • number of trip starts by route ID
  • service duration in hours by route ID
  • service distance in kilometers by route ID
  • service speed in kilometers per hour

The time series is a data frame with a timestamp index for a 24-hour period sampled at the given frequency. The maximum allowable frequency is 1 minute. date_label is used as the date for the timestamp index.

The columns of the data frame are hierarchical (multi-index) with

  • top level: name = ‘indicator’, values = [‘service_distance’, ‘service_duration’, ‘num_trip_starts’, ‘num_trips’, ‘service_speed’]
  • middle level: name = ‘route_id’, values = the active routes
  • bottom level: name = ‘direction_id’, values = 0s and 1s

If split_directions == False, then don’t include the bottom level.

If trips_stats_subset is empty, then return an empty data frame with the indicator columns.

NOTES:
  • To resample the resulting time series use the following methods:
    • for ‘num_trips’ series, use how=np.mean
    • for the other series, use how=np.sum
    • ‘service_speed’ can’t be resampled and must be recalculated from ‘service_distance’ and ‘service_duration’
  • To remove the date and seconds from the time series f, do f.index = [t.time().strftime('%H:%M') for t in f.index.to_datetime()]

gtfstk.calculator.compute_screen_line_counts(feed, linestring, date, geo_shapes=None)

Compute all the trips active in the given feed on the given date that intersect the given Shapely LineString (with WGS84 longitude-latitude coordinates), and return a data frame with the columns:

  • 'trip_id'
  • 'route_id'
  • 'route_short_name'
  • 'crossing_time': time that the trip’s vehicle crosses the linestring; one trip could cross multiple times
  • 'orientation': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction

NOTES:
  • Requires GeoPandas.

  • The first step is to geometrize feed.shapes via geometrize_shapes(). Alternatively, use the geo_shapes GeoDataFrame, if given.

  • Assume feed.stop_times has an accurate shape_dist_traveled column.

  • Assume the following feed attributes are not None:
    • feed.shapes, if geo_shapes is not given
  • Assume that trips travel in the same direction as their shapes. That restriction is part of GTFS, by the way. To calculate direction quickly and accurately, assume that the screen line is straight and doesn’t double back on itself.

  • Probably does not give correct results for trips with self-intersecting shapes.

ALGORITHM:
  1. Compute all the shapes that intersect the linestring.
  2. For each such shape, compute the intersection points.
  3. For each point p, scan through all the trips in the feed that have that shape and are active on the given date.
  4. Interpolate a stop time for p by assuming that the feed has the shape_dist_traveled field in stop times.
  5. Use that interpolated time as the crossing time of the trip vehicle, and compute the trip orientation to the screen line via a cross product of a vector in the direction of the screen line and a tiny vector in the direction of trip travel.
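
The cross-product test in the last step can be sketched in two dimensions. This is an illustration only: the function name is hypothetical, points are (x, y) pairs such as (longitude, latitude), and the sign convention is chosen to match the documented meaning of 1 (left side to right side of the directed screen line); the library's internal convention is not confirmed by this documentation.

```python
def orientation_sketch(line_start, line_end, travel_from, travel_to):
    """Return 1 if travel goes from the left side of the directed line
    to its right side, -1 for the opposite, and 0 if parallel."""
    # Vector along the screen line and vector along the direction of travel.
    lx, ly = line_end[0] - line_start[0], line_end[1] - line_start[1]
    tx, ty = travel_to[0] - travel_from[0], travel_to[1] - travel_from[1]
    # z-component of the cross product (line vector) x (travel vector).
    cross = lx * ty - ly * tx
    if cross < 0:
        return 1   # travel points to the right of the line direction
    if cross > 0:
        return -1  # travel points to the left of the line direction
    return 0


# Screen line pointing north; a trip heading east crosses left to right.
print(orientation_sketch((0, 0), (0, 1), (-1, 0.5), (1, 0.5)))
```
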
gtfstk.calculator.compute_station_stats(feed, date, split_directions=False, headway_start_time='07:00:00', headway_end_time='19:00:00')

If this feed has station data, that is, location_type and parent_station columns in feed.stops, then compute the same stats that feed.compute_stop_stats() does, but for stations. Otherwise, return an empty data frame with the specified columns.

Assume the following feed attributes are not None:

gtfstk.calculator.compute_stop_activity(feed, dates)

Return a data frame with the columns

  • stop_id
  • dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise
  • dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise
  • etc.
  • dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise

If dates is None or the empty list, then return an empty data frame with the column ‘stop_id’.

Assume the following feed attributes are not None:

gtfstk.calculator.compute_stop_stats(feed, date, split_directions=False, headway_start_time='07:00:00', headway_end_time='19:00:00')

Call compute_stop_stats_base() with the subset of trips active on the given date and with the keyword arguments split_directions, headway_start_time, and headway_end_time.

See compute_stop_stats_base() for a description of the output.

Assume the following feed attributes are not None:

NOTES:

This is a more user-friendly version of compute_stop_stats_base(). The latter function works without a feed, though.

gtfstk.calculator.compute_stop_stats_base(stop_times, trips_subset, split_directions=False, headway_start_time='07:00:00', headway_end_time='19:00:00')

Given a stop times data frame and a subset of a trips data frame, return a data frame that provides summary stats about the stops in the (inner) join of the two data frames.

The columns of the output data frame are:

  • stop_id
  • direction_id: present if and only if split_directions
  • num_routes: number of routes visiting stop (in the given direction)
  • num_trips: number of trips visiting stop (in the given direction)
  • max_headway: maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the given date
  • min_headway: minimum of the durations (in minutes) mentioned above
  • mean_headway: mean of the durations (in minutes) mentioned above
  • start_time: earliest departure time of a trip from this stop on the given date
  • end_time: latest departure time of a trip from this stop on the given date

If split_directions == False, then compute each stop’s stats using trips visiting it from both directions.

If trips_subset is empty, then return an empty data frame with the columns specified above.

gtfstk.calculator.compute_stop_time_series(feed, date, split_directions=False, freq='5Min')

Call compute_stop_time_series_base() with the subset of trips active on the given date, with the keyword arguments split_directions and freq, and with date_label equal to date. See compute_stop_time_series_base() for a description of the output.

Assume the following feed attributes are not None:

NOTES:

This is a more user-friendly version of compute_stop_time_series_base(). The latter function works without a feed, though.

gtfstk.calculator.compute_stop_time_series_base(stop_times, trips_subset, split_directions=False, freq='5Min', date_label='20010101')

Given a stop times data frame and a subset of a trips data frame, return a data frame that provides summary stats about the stops in the (inner) join of the two data frames.

The time series is a data frame with a timestamp index for a 24-hour period sampled at the given frequency. The maximum allowable frequency is 1 minute. The timestamp includes the date given by date_label, a date string of the form ‘%Y%m%d’.

The columns of the data frame are hierarchical (multi-index) with

  • top level: name = ‘indicator’, values = [‘num_trips’]
  • middle level: name = ‘stop_id’, values = the active stop IDs
  • bottom level: name = ‘direction_id’, values = 0s and 1s

If split_directions == False, then don’t include the bottom level.

If trips_subset is empty, then return an empty data frame with the indicator columns.

NOTES:

  • ‘num_trips’ should be resampled with how=np.sum
  • To remove the date and seconds from the time series f, do f.index = [t.time().strftime('%H:%M') for t in f.index.to_datetime()]

gtfstk.calculator.compute_trip_activity(feed, dates)

Return a data frame with the columns

  • trip_id
  • dates[0]: 1 if the trip is active on dates[0]; 0 otherwise
  • dates[1]: 1 if the trip is active on dates[1]; 0 otherwise
  • etc.
  • dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise

If dates is None or the empty list, then return an empty data frame with the column ‘trip_id’.

Assume the following feed attributes are not None:

gtfstk.calculator.compute_trip_locations(feed, date, times)

Return a data frame of the positions of all trips active on the given date and times. Include the columns:

  • trip_id
  • route_id
  • direction_id
  • time
  • rel_dist: number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path
  • lon: longitude of trip at given time
  • lat: latitude of trip at given time

Assume feed.stop_times has an accurate shape_dist_traveled column.

Assume the following feed attributes are not None:

gtfstk.calculator.compute_trip_stats(feed, compute_dist_from_shapes=False)

Return a data frame with the following columns:

  • trip_id
  • route_id
  • route_short_name
  • route_type
  • direction_id
  • shape_id
  • num_stops: number of stops on trip
  • start_time: first departure time of the trip
  • end_time: last departure time of the trip
  • start_stop_id: stop ID of the first stop of the trip
  • end_stop_id: stop ID of the last stop of the trip
  • is_loop: 1 if the start and end stop are less than 400m apart and 0 otherwise
  • distance: distance of the trip in feed.dist_units; contains all np.nan entries if feed.shapes is None
  • duration: duration of the trip in hours
  • speed: distance/duration

Assume the following feed attributes are not None:

NOTES:

If feed.stop_times has a shape_dist_traveled column with at least one non-NaN value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to np.nan.

Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, calculating trip distances on the Portland feed at https://transitfeeds.com/p/trimet/43/1400947517 using compute_dist_from_shapes=False and compute_dist_from_shapes=True yields a difference of at most 0.83 km.

gtfstk.calculator.convert_dist(feed, new_dist_units)

Convert the distances recorded in the shape_dist_traveled columns of the given feed from the feed’s native distance units (recorded in feed.dist_units) to the given new distance units. New distance units must lie in constants.DIST_UNITS.

gtfstk.calculator.count_active_trips(trip_times, time)

Given a data frame trip_times containing the columns

  • trip_id
  • start_time: start time of the trip in seconds past midnight
  • end_time: end time of the trip in seconds past midnight

and a time in seconds past midnight, return the number of trips in the data frame that are active at the given time. A trip is considered active at time t if start_time <= t < end_time.
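
The half-open interval test can be written as a Pandas boolean mask. This is a minimal sketch of the rule, not the library's implementation; count_active is a hypothetical name.

```python
import pandas as pd

def count_active(trip_times, t):
    """Count the rows with start_time <= t < end_time."""
    active = (trip_times["start_time"] <= t) & (t < trip_times["end_time"])
    return int(active.sum())

trip_times = pd.DataFrame({
    "trip_id": ["a", "b", "c"],
    "start_time": [0, 3600, 7200],     # seconds past midnight
    "end_time": [3600, 9000, 10800],
})

# At t = 3600 trip "a" has just ended (the interval is open on the right)
# and trip "b" has just started.
n_at_3600 = count_active(trip_times, 3600)
n_at_8000 = count_active(trip_times, 8000)
```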

gtfstk.calculator.create_shapes(feed, all_trips=False)

Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips data frames.

If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.
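
To illustrate what connecting stops with straight lines means for the shapes table, the sketch below turns a trip's ordered stop IDs into shape point rows, one per stop. The function build_shape_rows, the shape ID, and the sample coordinates are all hypothetical.

```python
def build_shape_rows(shape_id, stop_ids, stop_coords):
    """Return shape point rows (as dictionaries) for one trip.

    stop_ids: the trip's stop IDs in stop_sequence order.
    stop_coords: mapping from stop ID to a (lon, lat) pair.
    """
    rows = []
    for seq, stop_id in enumerate(stop_ids):
        lon, lat = stop_coords[stop_id]
        rows.append({
            "shape_id": shape_id,
            "shape_pt_sequence": seq,
            "shape_pt_lon": lon,
            "shape_pt_lat": lat,
        })
    return rows

stop_coords = {
    "s1": (174.76, -36.85),
    "s2": (174.77, -36.86),
    "s3": (174.78, -36.84),
}
rows = build_shape_rows("shape_t1", ["s1", "s2", "s3"], stop_coords)
```

Consecutive shape points are implicitly joined by straight segments, so no intermediate points are needed.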

Assume the following feed attributes are not None:

  • feed.stop_times
  • feed.trips
  • feed.stops

gtfstk.calculator.downsample(time_series, freq)

Downsample the given route, stop, or feed time series (the output of Feed.compute_route_time_series(), Feed.compute_stop_time_series(), or Feed.compute_feed_time_series(), respectively) to the given Pandas frequency. Return the given time series unchanged if the given frequency is shorter than the original frequency.
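
The general idea can be sketched with Pandas resampling. This example assumes every column is a count that should be summed, which may not match how the library aggregates each indicator; downsample_sum is a hypothetical name, and the sketch needs a time series with at least two rows.

```python
import pandas as pd

def downsample_sum(time_series, freq):
    """Resample a datetime-indexed data frame to a coarser frequency by summing."""
    current = time_series.index[1] - time_series.index[0]
    if pd.Timedelta(freq) < current:
        # The requested frequency is finer than the data: nothing to do.
        return time_series
    return time_series.resample(freq).sum()

ts = pd.DataFrame(
    {"num_trips": [1, 2, 3, 4]},
    index=pd.date_range("2001-01-01", periods=4, freq="30min"),
)
hourly = downsample_sum(ts, "60min")
unchanged = downsample_sum(ts, "15min")
```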

gtfstk.calculator.geometrize_shapes(shapes, use_utm=False)

Given a shapes data frame, convert it to a GeoPandas GeoDataFrame and return the result. The result has a ‘geometry’ column of WGS84 line strings instead of ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, and ‘shape_dist_traveled’ columns. If use_utm, then use UTM coordinates for the geometries.

Requires GeoPandas.

gtfstk.calculator.geometrize_stops(stops, use_utm=False)

Given a stops data frame, convert it to a GeoPandas GeoDataFrame and return the result. The result has a ‘geometry’ column of WGS84 points instead of ‘stop_lon’ and ‘stop_lat’ columns. If use_utm, then use UTM coordinates for the geometries. Requires GeoPandas.

gtfstk.calculator.get_dates(feed, as_date_obj=False)

Return a chronologically ordered list of dates for which this feed is valid. If as_date_obj, then return the dates as datetime.date objects.

If feed.calendar and feed.calendar_dates are both None, then return the empty list.

gtfstk.calculator.get_first_week(feed, as_date_obj=False)

Return a list of dates corresponding to the first Monday–Sunday week for which this feed is valid. If the given feed does not cover a full Monday–Sunday week, then return whatever initial segment of the week it does cover, which could be the empty list. If as_date_obj, then return the dates as datetime.date objects.
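
A minimal sketch of this selection, assuming the input is a chronologically ordered, gap-free list of YYYYMMDD date strings as GTFS uses; first_week is a hypothetical name and not the library function.

```python
from datetime import date, datetime, timedelta

def first_week(dates):
    """Return up to seven dates starting at the first Monday in dates."""
    for i, d in enumerate(dates):
        if datetime.strptime(d, "%Y%m%d").weekday() == 0:  # Monday
            return dates[i:i + 7]
    return []

# Fourteen consecutive days starting on Friday 1 January 2016.
dates = [(date(2016, 1, 1) + timedelta(days=i)).strftime("%Y%m%d")
         for i in range(14)]
week = first_week(dates)
partial = first_week(dates[:5])  # the feed ends on the Tuesday
```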

gtfstk.calculator.get_route_timetable(feed, route_id, date)

Return a data frame encoding the timetable for the given route ID on the given date. The columns are all those in feed.trips plus those in feed.stop_times. The result is sorted by grouping by trip ID and sorting the groups by their first departure time.

Assume the following feed attributes are not None:

gtfstk.calculator.get_routes(feed, date=None, time=None)

Return the section of feed.routes that contains only routes active on the given date. If no date is given, then return all routes. If a date and time are given, then return only those routes with trips active at that date and time. Do not take times modulo 24.

Assume the following feed attributes are not None:

gtfstk.calculator.get_shapes_intersecting_geometry(feed, geometry, geo_shapes=None, geometrized=False)

Return the slice of feed.shapes that contains all shapes that intersect the given Shapely geometry object (e.g. a Polygon or LineString). Assume the geometry is specified in WGS84 longitude-latitude coordinates.

To do this, first geometrize feed.shapes via geometrize_shapes(). Alternatively, use the geo_shapes GeoDataFrame, if given. Requires GeoPandas.

Assume the following feed attributes are not None:

  • feed.shapes, if geo_shapes is not given

If geometrized is True, then return the resulting shapes data frame in geometrized form.

gtfstk.calculator.get_start_and_end_times(feed, date=None)

Return the first departure time and last arrival time (time strings) listed in feed.stop_times, respectively. Restrict to the given date if specified.

gtfstk.calculator.get_stop_times(feed, date=None)

Return the section of feed.stop_times that contains only trips active on the given date. If no date is given, then return all stop times.

Assume the following feed attributes are not None:

gtfstk.calculator.get_stop_timetable(feed, stop_id, date)

Return a data frame encoding the timetable for the given stop ID on the given date. The columns are all those in feed.trips plus those in feed.stop_times. The result is sorted by departure time.

Assume the following feed attributes are not None:

gtfstk.calculator.get_stops(feed, date=None, trip_id=None, route_id=None)

Return feed.stops. If a date is given, then restrict the output to stops that are visited by trips active on the given date. If a trip ID (string) is given, then restrict the output possibly further to stops that are visited by the trip. Else if a route ID (string) is given, then restrict the output possibly further to stops that are visited by at least one trip on the route.

Assume the following feed attributes are not None:

gtfstk.calculator.get_stops_in_stations(feed)

If this feed has station data, that is, location_type and parent_station columns in feed.stops, then return a data frame that has the same columns as feed.stops but only includes stops with parent stations, that is, stops with location type 0 or blank and non-blank parent station. Otherwise, return an empty data frame with the specified columns.
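
The filter described above can be sketched in Pandas as follows. This example assumes blank values are stored as NaN or None rather than empty strings, and it returns an empty frame with the same columns as the input when the station columns are missing; stops_in_stations is a hypothetical name.

```python
import pandas as pd

def stops_in_stations(stops):
    """Keep stops with location type 0 or blank and a non-blank parent station."""
    if not {"location_type", "parent_station"} <= set(stops.columns):
        return stops.iloc[0:0]  # empty frame, same columns
    is_stop = stops["location_type"].isna() | (stops["location_type"] == 0)
    has_parent = stops["parent_station"].notna()
    return stops[is_stop & has_parent]

stops = pd.DataFrame({
    "stop_id": ["s1", "s2", "s3", "st1"],
    "location_type": [0, None, 0, 1],
    "parent_station": ["st1", "st1", None, None],
})
result = stops_in_stations(stops)
```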

Assume the following feed attributes are not None:

  • feed.stops

gtfstk.calculator.get_stops_intersecting_polygon(feed, polygon, geo_stops=None)

Return the slice of feed.stops that contains all stops that intersect the given Shapely Polygon object. Assume the polygon is specified in WGS84 longitude-latitude coordinates.

To do this, first geometrize feed.stops via geometrize_stops(). Alternatively, use the geo_stops GeoDataFrame, if given. Requires GeoPandas.

Assume the following feed attributes are not None:

  • feed.stops, if geo_stops is not given

gtfstk.calculator.get_trips(feed, date=None, time=None)

Return the section of feed.trips that contains only trips active on the given date. If feed.trips is None or the date is None, then return all feed.trips. If a date and time are given, then return only those trips active at that date and time. Do not take times modulo 24.
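
Because times are not taken modulo 24, a trip that runs past midnight is matched by a time string with an hour of 24 or more, not by the wrapped clock time. The sketch below illustrates this with a hypothetical helper, timestr_to_seconds, that converts HH:MM:SS strings to seconds past midnight.

```python
def timestr_to_seconds(timestr):
    """Convert an HH:MM:SS string (hours may exceed 24) to seconds past midnight."""
    hours, minutes, seconds = (int(part) for part in timestr.split(":"))
    return 3600 * hours + 60 * minutes + seconds

def is_active_at(start, end, time):
    """Return True if start <= time < end, comparing without wrapping."""
    t = timestr_to_seconds(time)
    return timestr_to_seconds(start) <= t < timestr_to_seconds(end)

# A trip from 23:30:00 to 25:10:00 is active at 24:45:00 but not at 00:45:00.
late = is_active_at("23:30:00", "25:10:00", "24:45:00")
wrapped = is_active_at("23:30:00", "25:10:00", "00:45:00")
```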

gtfstk.calculator.is_active_trip(feed, trip, date)

If the given trip (trip ID) is active on the given date, then return True; otherwise return False. To avoid error checking in the interest of speed, assume trip is a valid trip ID in the given feed and date is a valid date object.

Assume the following feed attributes are not None:

  • feed.trips

NOTES:

  • This function is key for getting all trips, routes, etc. that are active on a given date, so the function needs to be fast.

gtfstk.calculator.restrict_by_polygon(feed, polygon)

Build a new feed by taking the given one, keeping only the trips that have at least one stop intersecting the given polygon, and then restricting stops, routes, stop times, etc. to those associated with that subset of trips. Return the resulting feed. Requires GeoPandas.

Assume the following feed attributes are not None:

gtfstk.calculator.restrict_by_routes(feed, route_ids)

Build a new feed by taking the given one and chopping it down to only the stops, trips, shapes, etc. used by the routes specified in the given list of route IDs. Return the resulting feed.

gtfstk.calculator.route_to_geojson(feed, route_id, include_stops=False)

Given a feed and a route ID (string), return a (decoded) GeoJSON feature collection comprising a MultiLineString feature of distinct shapes of the trips on the route. If include_stops, then also include one Point feature for each stop visited by any trip on the route. The MultiLineString feature will contain as properties all the columns in feed.routes pertaining to the given route, and each Point feature will contain as properties all the columns in feed.stops pertaining to the stop, except the stop_lat and stop_lon properties.

Assume the following feed attributes are not None:

  • feed.routes
  • feed.shapes
  • feed.trips
  • feed.stops

gtfstk.calculator.shapes_to_geojson(feed)

Return a (decoded) GeoJSON feature collection of LineString features representing feed.shapes. Each feature will have a shape_id property. If feed.shapes is None, then return None. The coordinate reference system is the default one for GeoJSON, namely WGS84.
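
For orientation, a decoded feature collection of this kind is an ordinary Python dictionary. The sketch below builds one from a list of shape point rows; shapes_to_feature_collection is a hypothetical stand-in that takes plain dictionaries instead of a feed, and the sample points are made up.

```python
import json

def shapes_to_feature_collection(shape_points):
    """Group shape points by shape_id into GeoJSON LineString features."""
    by_shape = {}
    ordered = sorted(shape_points,
                     key=lambda p: (p["shape_id"], p["shape_pt_sequence"]))
    for p in ordered:
        # GeoJSON positions are [longitude, latitude].
        by_shape.setdefault(p["shape_id"], []).append(
            [p["shape_pt_lon"], p["shape_pt_lat"]])
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {"shape_id": shape_id},
                "geometry": {"type": "LineString", "coordinates": coords},
            }
            for shape_id, coords in by_shape.items()
        ],
    }

points = [
    {"shape_id": "A", "shape_pt_sequence": 2, "shape_pt_lon": 174.77, "shape_pt_lat": -36.86},
    {"shape_id": "A", "shape_pt_sequence": 1, "shape_pt_lon": 174.76, "shape_pt_lat": -36.85},
    {"shape_id": "B", "shape_pt_sequence": 1, "shape_pt_lon": 174.70, "shape_pt_lat": -36.90},
    {"shape_id": "B", "shape_pt_sequence": 2, "shape_pt_lon": 174.71, "shape_pt_lat": -36.91},
]
collection = shapes_to_feature_collection(points)
encoded = json.dumps(collection)  # serialize only when a GeoJSON string is needed
```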

Assume the following feed attributes are not None:

gtfstk.calculator.trip_to_geojson(feed, trip_id, include_stops=False)

Given a feed and a trip ID (string), return a (decoded) GeoJSON feature collection comprising a LineString feature representing the trip’s shape. If include_stops, then also include one Point feature for each stop visited by the trip. The LineString feature will contain as properties all the columns in feed.trips pertaining to the given trip, and each Point feature will contain as properties all the columns in feed.stops pertaining to the stop, except the stop_lat and stop_lon properties.

Assume the following feed attributes are not None:

  • feed.trips
  • feed.shapes
  • feed.stops

gtfstk.calculator.ungeometrize_shapes(geo_shapes)

The inverse of geometrize_shapes(). Produces the columns:

  • shape_id
  • shape_pt_sequence
  • shape_pt_lon
  • shape_pt_lat

If geo_shapes is in UTM (has a UTM CRS property), then convert UTM coordinates back to WGS84 coordinates.

gtfstk.calculator.ungeometrize_stops(geo_stops)

The inverse of geometrize_stops(). If geo_stops is in UTM (has a UTM CRS property), then convert UTM coordinates back to WGS84 coordinates.

plotter Module

This module contains functions for plotting various graphs related to Feed objects. It is optional and requires Matplotlib.

gtfstk.plotter.plot_feed_time_series(feed_time_series)

Given a routes time series data frame, sum each time series indicator over all routes, plot each series indicator using Matplotlib, and return the resulting figure of subplots.

NOTES:

Take the resulting figure f and do f.tight_layout() for a nice-looking plot.

gtfstk.plotter.plot_headways(stats, max_headway_limit=60)

Given a stops or routes stats data frame, return bar charts of the max and mean headways as a Matplotlib figure. Only include the stops/routes with max headways at most max_headway_limit minutes. If max_headway_limit is None, then include them all in a giant plot. If there are no stops/routes within the max headway limit, then return None.

NOTES:

Take the resulting figure f and do f.tight_layout() for a nice-looking plot.

Indices and tables