Inequality Join#
- pandance.ineq_join(left, right, how='<=', on=None, left_on=None, right_on=None, suffixes=('_x', '_y'))#
Perform an inequality join on the (single) specified left and right columns, for example:
ineq_join(df_a, df_b, '<=', on='value')
will match all
(df_a['value'] <= df_b['value'])pairs.Valid column types are those that support comparisons (numbers, strings, dateime, etc.).
Note that the operation is not guaranteed to preserve the row or column order of the input dataframes, in order to save time. The indices of the input DataFrames is ignored (a new one RangeIndex is generated).
Note
The time cost of this operation is \(O\left(n \log_2{m} + m\log_2{m} + Q \right)\), where n and m are the lengths of the shorter and longer dataframe, respectively, and Q is the total number of matching entries in the join. Thus, the worst-case time cost is \(O(n \cdot m)\), which happens when all values on the right-hand side of e.g.
<are larger than those on the left-hand side (resulting in everything matching with everything), and analogously for>. When Q is small (or constant with respect to growing n and m), the time cost becomes \(O((n + m) \cdot \log_2{m})\).- Parameters:
left (
DataFrame) – The left-hand side Pandas DataFrameright (
DataFrame) – The right-hand side Pandas DataFramehow (
str) – The inequality operator between the two columns. Can be<,<=,>=, or>.on (
Optional[str]) – (Single) column name to join on, passed topandas.merge()left_on (
Optional[str]) – (Single) column name to join on in the left DataFrame, passed topandas.merge()right_on (
Optional[str]) – (Single) column name to join on in the right DataFrame, passed topandas.merge()suffixes (
Optional[tuple]) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively, passed topandas.merge()
- Return type:
DataFrame- Returns:
The inequijoin of the two DataFrames.
See also
theta_join(): A gneric join that allows user-specified joining conditions.Examples
Temporal data
Say we want to find connecting flights between locations A and C with a layover in location B. We have a table with flights from A to B, and another from B to C: Valid connections are those flights that arrive in B before a departing flight from B:
>>> import pandas as pd >>> import pandance as dance ... >>> flights_ab = pd.DataFrame([ ... ('2023-01-01 08:00', '2023-01-01 10:00'), ... ('2023-01-01 12:00', '2023-01-01 14:00'), ... ('2023-01-01 16:00', '2023-01-01 18:00'), ... ('2023-01-01 20:00', '2023-01-01 22:00') ... ], columns=['dep', 'arr'], dtype='datetime64[ns]') ... >>> flights_bc = pd.DataFrame([ ... ('2023-01-01 09:00', '2023-01-01 12:00'), ... ('2023-01-01 14:00', '2023-01-01 17:00'), ... ('2023-01-01 18:00', '2023-01-01 21:00'), ... ('2023-01-01 21:00', '2023-01-02 00:00') ... ], columns=['dep', 'arr'], dtype='datetime64[ns]') ... >>> dance.ineq_join(flights_ab, flights_bc, ... left_on='arr', right_on='dep', how='<', ... suffixes=('_ab', '_bc')) ... dep_ab arr_ab dep_bc arr_bc 0 2023-01-01 08:00:00 2023-01-01 10:00:00 2023-01-01 14:00:00 2023-01-01 17:00:00 1 2023-01-01 08:00:00 2023-01-01 10:00:00 2023-01-01 18:00:00 2023-01-01 21:00:00 2 2023-01-01 12:00:00 2023-01-01 14:00:00 2023-01-01 18:00:00 2023-01-01 21:00:00 3 2023-01-01 08:00:00 2023-01-01 10:00:00 2023-01-01 21:00:00 2023-01-02 00:00:00 4 2023-01-01 12:00:00 2023-01-01 14:00:00 2023-01-01 21:00:00 2023-01-02 00:00:00 5 2023-01-01 16:00:00 2023-01-01 18:00:00 2023-01-01 21:00:00 2023-01-02 00:00:00
Numerical data
We’re making a groceries list, and we’re balancing macronutrients and costs.
>>> carb_sources = pd.DataFrame([ ... ('rice', 34), ... ('oat flakes', 32) ... ], columns=['item', 'price']) ... >>> protein_sources = pd.DataFrame([ ... ('lentils', 25), ... ('chickpeas', 38), ... ('soy beans', 48) ... ], columns=['item', 'price'])
We want to stock up on a single carb and protein source, but we want the carbs to cost less than the proteins. This can be expressed as the following inequality join:
>>> dance.ineq_join( ... carb_sources, protein_sources, on='price', how = '<', ... suffixes=('_carb', '_prot')) ... item_carb price_carb price_prot item_prot 0 rice 34 38 chickpeas 1 oat flakes 32 38 chickpeas 2 rice 34 48 soy beans 3 oat flakes 32 48 soy beans
Strings
Suppose we have a small sample of strings and want to find all strings that are sorted lower in a large database of strings (which here is constructed randomly).
>>> import random >>> import string >>> random.seed(42) ... >>> query = pd.DataFrame(['bbb', 'ccc'], columns=['string']) ... >>> database = pd.DataFrame( ... [''.join(random.choices(string.ascii_lowercase, k=3)) for _ in range(10)], ... columns=['string'] ... )
In this case, the random database only has a few strings of lower ordering than our query
>>> dance.ineq_join(query, database, how='>', on='string', suffixes=('_query', '_db')) string_query string_db 0 bbb afn 1 ccc afn 2 bbb afq 3 ccc afq