Theta Join#

pandance.theta_join(left, right, condition=None, on=None, left_on=None, right_on=None, suffixes=('_x', '_y'), n_processes=None, par_threshold=10000, relation=None)#

Perform an inner join with a user-specified match condition (any boolean-valued function taking two arguments).

A theta-join is an operation in which rows in the join columns are matched using an arbitrary condition \(\theta\) that holds between the row entries (i.e. the pair is in a binary relation). It generalizes equijoins (where \(\theta\) = equality). See examples below and the Wikipedia article, though note that in Pandance, \(\theta\) is not limited to the typical set of relations {<, <=, =, !=, >=, >}.

Since version 0.3.0, this join is parallelized for larger results. To avoid unnecessary overhead on small data, multiple processes are used only if the number of rows in the longest input dataframe is at least par_threshold. Consider decreasing this threshold if the condition function takes a longer time to evaluate (complex calculation, some sort of lookup / query, etc.) By default, all CPU cores on the machine are used (taking n_processes = multiprocessing.cpu_count()).

The indices of the input DataFrames is ignored (a new one RangeIndex is generated).

Parameters:

left (DataFrame) – The left-hand side Pandas DataFrame
right (DataFrame) – The right-hand side Pandas DataFrame
condition (Optional[Callable[..., bool]]) – a function or callable object of two parameters x, y that returns True if the pair (x, y) fulfills the condition, else False. E.g. divides(2, 8) == True.
on (Optional[str]) – (Single) column name to join on, passed to pandas.merge()
left_on (Optional[str]) – (Single) column name to join on in the left DataFrame, passed to pandas.merge()
right_on (Optional[str]) – (Single) column name to join on in the right DataFrame, passed to pandas.merge()
suffixes (Optional[tuple]) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively, passed to pandas.merge()
relation (Optional[Callable[..., bool]]) – (deprecated) Synonym for condition
par_threshold (int) –
The longest input dataframe must have at least this many rows for parallelism to be used.

New in version 0.3.0.
n_processes (Optional[int]) –
How many processes to spawn for performing the join. Defaults to the number of CPUs on the system.

New in version 0.3.0.

Return type:

DataFrame

Returns:

The theta-join of the two DataFrames.

See also

fuzzy_join(): An efficient implementation of the special case of θ-join where θ is \(\approx\).
ineq_join(): An efficient implementation of the special case of θ-join where θ is an inequality {<, <=, =, >=, >} between the join columns.

Examples

Numerical condition

We have two tables with numerical entries x and y, and we want to find those combinations of x and y that represent coordinates on the unit circle. Here

\[\theta (x, y): x^2 + y^2 - 1 = 0\]

>>> import math
>>> import pandas as pd
>>> import pandance as dance
...
>>> horiz_vals = pd.DataFrame([0, 1, -1, 0.5], columns=['x'])
>>> vert_vals = pd.DataFrame([0, 1, -1, 0.5], columns=['y'])
...
>>> dance.theta_join(
...     horiz_vals, vert_vals, left_on='x', right_on='y',
...     condition = lambda x, y: math.isclose(x**2 + y**2 - 1, 0)
... )
     x    y
0  1.0  0.0
1 -1.0  0.0
2  0.0  1.0
3  0.0 -1.0

Substring matching

We have two tables of character strings and want to find all pairs in which strings from the left join column appear as substrings of the right.

>>> import pandas as pd
>>> import pandance as dance
...
>>> keywords = pd.DataFrame(['a', 'the', 'xyzzy'], columns=['keyword'])
>>> phrases = pd.DataFrame([
...     'the quick brown fox jumps over the lazy dog',
...     'lorem ipsum dolor'
... ], columns=['phrase'])

Here θ(a, b): a substring of b.

>>> hits = dance.theta_join(
...     keywords, phrases, left_on='keyword', right_on='phrase',
...     condition = lambda kw, phrase: kw in phrase)
>>> hits
  keyword                                       phrase
0       a  the quick brown fox jumps over the lazy dog
1     the  the quick brown fox jumps over the lazy dog

Inequality condition

We’re making a groceries list, and we’re balancing macronutrients and costs.

>>> import pandas as pd
>>> import pandance as dance
...
>>> carb_sources = pd.DataFrame([
...     ('rice', 34),
...     ('oat flakes', 32)
... ], columns=['item', 'price'])
...
>>> protein_sources = pd.DataFrame([
...     ('lentils', 25),
...     ('chickpeas', 38),
...     ('soy beans', 48)
... ], columns=['item', 'price'])

We want to stock up on a single carb and protein source, but we want the carbs to cost less than the proteins. This can be expressed as the θ-join below, where

\[\theta (x, y): x < y\]

>>> dance.theta_join(
...     carb_sources, protein_sources, on='price',
...     condition = lambda price_carb, price_prot: price_carb < price_prot,
...     suffixes=('_carb', '_prot'))
...
    item_carb  price_carb  item_prot  price_prot
0        rice          34  chickpeas          38
1  oat flakes          32  chickpeas          38
2        rice          34  soy beans          48
3  oat flakes          32  soy beans          48

Tip

The ineq_join() operation implements a more efficient version of this type of inequality join.