Inequality Join#

pandance.ineq_join(left, right, how='<=', on=None, left_on=None, right_on=None, suffixes=('_x', '_y'))#

Perform an inequality join on the (single) specified left and right columns, for example:

ineq_join(df_a, df_b, '<=', on='value')

will match all (df_a['value'] <= df_b['value']) pairs.

Valid column types are those that support comparisons (numbers, strings, dateime, etc.).

Note that the operation is not guaranteed to preserve the row or column order of the input dataframes, in order to save time. The indices of the input DataFrames is ignored (a new one RangeIndex is generated).

Note

The time cost of this operation is \(O\left(n \log_2{m} + m\log_2{m} + Q \right)\), where n and m are the lengths of the shorter and longer dataframe, respectively, and Q is the total number of matching entries in the join. Thus, the worst-case time cost is \(O(n \cdot m)\), which happens when all values on the right-hand side of e.g. < are larger than those on the left-hand side (resulting in everything matching with everything), and analogously for >. When Q is small (or constant with respect to growing n and m), the time cost becomes \(O((n + m) \cdot \log_2{m})\).

Parameters:
  • left (DataFrame) – The left-hand side Pandas DataFrame

  • right (DataFrame) – The right-hand side Pandas DataFrame

  • how (str) – The inequality operator between the two columns. Can be <, <=, >=, or >.

  • on (Optional[str]) – (Single) column name to join on, passed to pandas.merge()

  • left_on (Optional[str]) – (Single) column name to join on in the left DataFrame, passed to pandas.merge()

  • right_on (Optional[str]) – (Single) column name to join on in the right DataFrame, passed to pandas.merge()

  • suffixes (Optional[tuple]) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively, passed to pandas.merge()

Return type:

DataFrame

Returns:

The inequijoin of the two DataFrames.

See also

theta_join() : A gneric join that allows user-specified joining conditions.

Examples

Temporal data

Say we want to find connecting flights between locations A and C with a layover in location B. We have a table with flights from A to B, and another from B to C: Valid connections are those flights that arrive in B before a departing flight from B:

>>> import pandas as pd
>>> import pandance as dance
...
>>> flights_ab = pd.DataFrame([
...     ('2023-01-01 08:00', '2023-01-01 10:00'),
...     ('2023-01-01 12:00', '2023-01-01 14:00'),
...     ('2023-01-01 16:00', '2023-01-01 18:00'),
...     ('2023-01-01 20:00', '2023-01-01 22:00')
... ], columns=['dep', 'arr'], dtype='datetime64[ns]')
...
>>> flights_bc = pd.DataFrame([
...     ('2023-01-01 09:00', '2023-01-01 12:00'),
...     ('2023-01-01 14:00', '2023-01-01 17:00'),
...     ('2023-01-01 18:00', '2023-01-01 21:00'),
...     ('2023-01-01 21:00', '2023-01-02 00:00')
... ], columns=['dep', 'arr'], dtype='datetime64[ns]')
...
>>> dance.ineq_join(flights_ab, flights_bc,
...                 left_on='arr', right_on='dep', how='<',
...                 suffixes=('_ab', '_bc'))
...
               dep_ab                 arr_ab                  dep_bc                  arr_bc
0 2023-01-01 08:00:00    2023-01-01 10:00:00     2023-01-01 14:00:00     2023-01-01 17:00:00
1 2023-01-01 08:00:00    2023-01-01 10:00:00     2023-01-01 18:00:00     2023-01-01 21:00:00
2 2023-01-01 12:00:00    2023-01-01 14:00:00     2023-01-01 18:00:00     2023-01-01 21:00:00
3 2023-01-01 08:00:00    2023-01-01 10:00:00     2023-01-01 21:00:00     2023-01-02 00:00:00
4 2023-01-01 12:00:00    2023-01-01 14:00:00     2023-01-01 21:00:00     2023-01-02 00:00:00
5 2023-01-01 16:00:00    2023-01-01 18:00:00     2023-01-01 21:00:00     2023-01-02 00:00:00

Numerical data

We’re making a groceries list, and we’re balancing macronutrients and costs.

>>> carb_sources = pd.DataFrame([
...     ('rice', 34),
...     ('oat flakes', 32)
... ], columns=['item', 'price'])
...
>>> protein_sources = pd.DataFrame([
...     ('lentils', 25),
...     ('chickpeas', 38),
...     ('soy beans', 48)
... ], columns=['item', 'price'])

We want to stock up on a single carb and protein source, but we want the carbs to cost less than the proteins. This can be expressed as the following inequality join:

>>> dance.ineq_join(
...     carb_sources, protein_sources, on='price', how = '<',
...     suffixes=('_carb', '_prot'))
...
    item_carb  price_carb  price_prot  item_prot
0        rice          34          38  chickpeas
1  oat flakes          32          38  chickpeas
2        rice          34          48  soy beans
3  oat flakes          32          48  soy beans

Strings

Suppose we have a small sample of strings and want to find all strings that are sorted lower in a large database of strings (which here is constructed randomly).

>>> import random
>>> import string
>>> random.seed(42)
...
>>> query = pd.DataFrame(['bbb', 'ccc'], columns=['string'])
...
>>> database = pd.DataFrame(
...     [''.join(random.choices(string.ascii_lowercase, k=3)) for _ in range(10)],
...     columns=['string']
... )

In this case, the random database only has a few strings of lower ordering than our query

>>> dance.ineq_join(query, database, how='>', on='string', suffixes=('_query', '_db'))
  string_query string_db
0          bbb       afn
1          ccc       afn
2          bbb       afq
3          ccc       afq