Fuzzy Join#

pandance.fuzzy_join(left, right, on=None, left_on=None, right_on=None, tol=0.001, suffixes=('_x', '_y'))#

Perform an approximate inner join of two DataFrames, on a numerical or time column. For example, \(1.03 \approx 1\) would be a match, given an absolute tolerance tol = 0.5. The tolerance is inclusive, meaning (a - b) <= tol is considered a match.

A single join column may be used and this must be given explicitly (with on, or left_on and right_on).

The joined DataFrame contains both columns that were used in the join, with appended suffixes.

Warning

The matching may misbehave if numerical values are very large and the tolerance small, due to the simple absolute tolerance test and floating point representation limitations (see Notes).

Warning

NaN and Inf values in the joining column will (silently) not yield matches, as per the IEEE 754 standard implemented by NumPy.

Note

This operation is a more efficient implementation compared to the generic theta_join(), taking an expected \(O((n + m) \cdot \log_2{m})\) time, assuming the majority of values in the longer join column are distinct (accouting for the tolerance), where m is the length of the longer of the two DataFrames, and n of the shorter one, instead of \(O(n \cdot m)\).

The worst case of the fuzzy join is still \(O(n \cdot m)\) in the extreme case when both join colunms contain identical values (accounting for tolerance), meaning everything matches with everything.

Parameters:
  • left (DataFrame) – The left-hand side Pandas DataFrame

  • right (DataFrame) – The right-hand side Pandas DataFrame

  • on (Optional[str]) – (Single) numerical or time column name to join on

  • left_on (Optional[str]) – (Single) numerical or time column name to join on in the left DataFrame

  • right_on (Optional[str]) – (Single) numerical or time column name to join on in the right DataFrame

  • tol (Union[float, Decimal, Timedelta]) – Numerical or temporal tolerance for the fuzzy matching.

  • suffixes (tuple) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right, respectively

Return type:

DataFrame

Returns:

The fuzzy join of the two DataFrames.

Examples

Numerical columns

We have two sets of model performance scores (0..1). The models in one list are rather simple, the others much more complex.

>>> import pandas as pd
>>> simple_models = pd.DataFrame([
...     ('A', 0.2),
...     ('B', 0.5),
...     ('C', 0.7),
...     ('D', 0.9)
... ], columns=['model', 'score'])
>>> complex_models = pd.DataFrame([
...     ('M1', 0.1),
...     ('M2', 0.89),
...     ('M3', 0.8),
...     ('M4', 0.54)
... ], columns=['model', 'score'])

We’re interested in finding models that perform essentially the same across the two lists, accepting a score tolerance of 0.05:

>>> import pandance as dance
>>> dance.fuzzy_join(simple_models, complex_models, on='score', tol=0.05, suffixes=('_s', '_f'))
  model_s  score_s model_f  score_f
0       B      0.5      M4     0.54
1       D      0.9      M2     0.89

Time series

Given two datasets recording the observation times of events, perform a fuzzy join on the time column to get only the events that occur at approximately the same time between sets:

df_x:                                df_y:

  +--------+---------------------+     +--------+---------------------+
  |  event |            obs_time |     |  event |            obs_time |
  +--------+---------------------+     +--------+---------------------+
  | event1 | 2021-01-01 10:23:00 |     | event4 | 2021-01-01 10:22:00 |
  | event2 | 2021-02-01 13:23:00 |     | event5 | 2021-02-01 21:23:00 |
  | event3 | 2021-03-01 15:23:00 |     | event6 | 2021-03-01 15:22:00 |
  +--------+---------------------+     | event7 | 2021-03-01 15:24:00 |
                                       +--------+---------------------+

The operation:

fuzzy_join(df_x, df_y, on='obs_time', tol=pd.Timedelta('1 minute'))

gives:

+---------+---------------------+---------+---------------------+
| event_x |          obs_time_x | event_y |          obs_time_y |
+---------+---------------------+---------+---------------------+
| event1  | 2021-01-01 10:23:00 | event4  | 2021-01-01 10:22:00 |
| event3  | 2021-03-01 15:23:00 | event6  | 2021-03-01 15:22:00 |
| event3  | 2021-03-01 15:23:00 | event7  | 2021-03-01 15:24:00 |
+---------+---------------------+---------+---------------------+

Notes

High-precision applications

Care must be taken if high precision (low tolerances) are to be used with large floating point numbers, due to representation limitations.

As a workaround, consider using arbitrary precision data types, such as the Python built-in Decimal type, accepting the performance penalty. The join columns can be converted to Decimal just before the fuzzy join operation:

from decimal import Decimal
import pandance as dance

df_a['val'] = df_a['val'].map(lambda x: Decimal(x))
df_b['val'] = df_b['val'].map(lambda x: Decimal(x))

dance.fuzzy_join(df_b, df_a, on='val', tol=Decimal(1e-10))

See the decimal documentation on setting the precision (number of decimals) to be used for results of operations with Decimals.

A more widespread practice is to adjust float comparisons with an additional relative tolerance, but there is no straightforward way to do this for this join operation, since we’re (conceptually) symmetrically comparing everything with everything in the left and right join columns. Comparisons with relative tolerance factor in both numbers to be compared. For more technical details on the issue, see this post.