Theta Join#
- pandance.theta_join(left, right, condition=None, on=None, left_on=None, right_on=None, suffixes=('_x', '_y'), n_processes=None, par_threshold=10000, relation=None)#
Perform an inner join with a user-specified match
condition(any boolean-valued function taking two arguments).A theta-join is an operation in which rows in the join columns are matched using an arbitrary condition \(\theta\) that holds between the row entries (i.e. the pair is in a binary relation). It generalizes equijoins (where \(\theta\) = equality). See examples below and the Wikipedia article, though note that in Pandance, \(\theta\) is not limited to the typical set of relations {<, <=, =, !=, >=, >}.
Since version
0.3.0, this join is parallelized for larger results. To avoid unnecessary overhead on small data, multiple processes are used only if the number of rows in the longest input dataframe is at leastpar_threshold. Consider decreasing this threshold if the condition function takes a longer time to evaluate (complex calculation, some sort of lookup / query, etc.) By default, all CPU cores on the machine are used (takingn_processes = multiprocessing.cpu_count()).The indices of the input DataFrames is ignored (a new one RangeIndex is generated).
- Parameters:
left (
DataFrame) – The left-hand side Pandas DataFrameright (
DataFrame) – The right-hand side Pandas DataFramecondition (
Optional[Callable[...,bool]]) – a function or callable object of two parametersx,ythat returnsTrueif the pair(x, y)fulfills the condition, elseFalse. E.g.divides(2, 8) == True.on (
Optional[str]) – (Single) column name to join on, passed topandas.merge()left_on (
Optional[str]) – (Single) column name to join on in the left DataFrame, passed topandas.merge()right_on (
Optional[str]) – (Single) column name to join on in the right DataFrame, passed topandas.merge()suffixes (
Optional[tuple]) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively, passed topandas.merge()relation (
Optional[Callable[...,bool]]) – (deprecated) Synonym forconditionpar_threshold (
int) –The longest input dataframe must have at least this many rows for parallelism to be used.
New in version 0.3.0.
n_processes (
Optional[int]) –How many processes to spawn for performing the join. Defaults to the number of CPUs on the system.
New in version 0.3.0.
- Return type:
DataFrame- Returns:
The theta-join of the two DataFrames.
See also
fuzzy_join()An efficient implementation of the special case of θ-join where θ is \(\approx\).
ineq_join()An efficient implementation of the special case of θ-join where θ is an inequality {<, <=, =, >=, >} between the join columns.
Examples
Numerical condition
We have two tables with numerical entries x and y, and we want to find those combinations of x and y that represent coordinates on the unit circle. Here
\[\theta (x, y): x^2 + y^2 - 1 = 0\]>>> import math >>> import pandas as pd >>> import pandance as dance ... >>> horiz_vals = pd.DataFrame([0, 1, -1, 0.5], columns=['x']) >>> vert_vals = pd.DataFrame([0, 1, -1, 0.5], columns=['y']) ... >>> dance.theta_join( ... horiz_vals, vert_vals, left_on='x', right_on='y', ... condition = lambda x, y: math.isclose(x**2 + y**2 - 1, 0) ... ) x y 0 1.0 0.0 1 -1.0 0.0 2 0.0 1.0 3 0.0 -1.0
Substring matching
We have two tables of character strings and want to find all pairs in which strings from the left join column appear as substrings of the right.
>>> import pandas as pd >>> import pandance as dance ... >>> keywords = pd.DataFrame(['a', 'the', 'xyzzy'], columns=['keyword']) >>> phrases = pd.DataFrame([ ... 'the quick brown fox jumps over the lazy dog', ... 'lorem ipsum dolor' ... ], columns=['phrase'])
Here θ(a, b): a substring of b.
>>> hits = dance.theta_join( ... keywords, phrases, left_on='keyword', right_on='phrase', ... condition = lambda kw, phrase: kw in phrase) >>> hits keyword phrase 0 a the quick brown fox jumps over the lazy dog 1 the the quick brown fox jumps over the lazy dog
Inequality condition
We’re making a groceries list, and we’re balancing macronutrients and costs.
>>> import pandas as pd >>> import pandance as dance ... >>> carb_sources = pd.DataFrame([ ... ('rice', 34), ... ('oat flakes', 32) ... ], columns=['item', 'price']) ... >>> protein_sources = pd.DataFrame([ ... ('lentils', 25), ... ('chickpeas', 38), ... ('soy beans', 48) ... ], columns=['item', 'price'])
We want to stock up on a single carb and protein source, but we want the carbs to cost less than the proteins. This can be expressed as the θ-join below, where
\[\theta (x, y): x < y\]>>> dance.theta_join( ... carb_sources, protein_sources, on='price', ... condition = lambda price_carb, price_prot: price_carb < price_prot, ... suffixes=('_carb', '_prot')) ... item_carb price_carb item_prot price_prot 0 rice 34 chickpeas 38 1 oat flakes 32 chickpeas 38 2 rice 34 soy beans 48 3 oat flakes 32 soy beans 48
Tip
The
ineq_join()operation implements a more efficient version of this type of inequality join.