Fuzzy Join#
- pandance.fuzzy_join(left, right, on=None, left_on=None, right_on=None, tol=0.001, suffixes=('_x', '_y'))#
Perform an approximate inner join of two DataFrames, on a numerical or time column. For example, \(1.03 \approx 1\) would be a match, given an absolute tolerance
tol = 0.5. The tolerance is inclusive, meaning(a - b) <= tolis considered a match.A single join column may be used and this must be given explicitly (with
on, orleft_onandright_on).The joined DataFrame contains both columns that were used in the join, with appended
suffixes.Warning
The matching may misbehave if numerical values are very large and the tolerance small, due to the simple absolute tolerance test and floating point representation limitations (see Notes).
Warning
NaN and Inf values in the joining column will (silently) not yield matches, as per the IEEE 754 standard implemented by NumPy.
Note
This operation is a more efficient implementation compared to the generic
theta_join(), taking an expected \(O((n + m) \cdot \log_2{m})\) time, assuming the majority of values in the longer join column are distinct (accouting for the tolerance), where m is the length of the longer of the two DataFrames, and n of the shorter one, instead of \(O(n \cdot m)\).The worst case of the fuzzy join is still \(O(n \cdot m)\) in the extreme case when both join colunms contain identical values (accounting for tolerance), meaning everything matches with everything.
- Parameters:
left (
DataFrame) – The left-hand side Pandas DataFrameright (
DataFrame) – The right-hand side Pandas DataFrameon (
Optional[str]) – (Single) numerical or time column name to join onleft_on (
Optional[str]) – (Single) numerical or time column name to join on in the left DataFrameright_on (
Optional[str]) – (Single) numerical or time column name to join on in the right DataFrametol (
Union[float,Decimal,Timedelta]) – Numerical or temporal tolerance for the fuzzy matching.suffixes (
tuple) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names inleftandright, respectively
- Return type:
DataFrame- Returns:
The fuzzy join of the two DataFrames.
Examples
Numerical columns
We have two sets of model performance scores (0..1). The models in one list are rather simple, the others much more complex.
>>> import pandas as pd >>> simple_models = pd.DataFrame([ ... ('A', 0.2), ... ('B', 0.5), ... ('C', 0.7), ... ('D', 0.9) ... ], columns=['model', 'score']) >>> complex_models = pd.DataFrame([ ... ('M1', 0.1), ... ('M2', 0.89), ... ('M3', 0.8), ... ('M4', 0.54) ... ], columns=['model', 'score'])
We’re interested in finding models that perform essentially the same across the two lists, accepting a score tolerance of 0.05:
>>> import pandance as dance >>> dance.fuzzy_join(simple_models, complex_models, on='score', tol=0.05, suffixes=('_s', '_f')) model_s score_s model_f score_f 0 B 0.5 M4 0.54 1 D 0.9 M2 0.89
Time series
Given two datasets recording the observation times of events, perform a fuzzy join on the time column to get only the events that occur at approximately the same time between sets:
df_x: df_y: +--------+---------------------+ +--------+---------------------+ | event | obs_time | | event | obs_time | +--------+---------------------+ +--------+---------------------+ | event1 | 2021-01-01 10:23:00 | | event4 | 2021-01-01 10:22:00 | | event2 | 2021-02-01 13:23:00 | | event5 | 2021-02-01 21:23:00 | | event3 | 2021-03-01 15:23:00 | | event6 | 2021-03-01 15:22:00 | +--------+---------------------+ | event7 | 2021-03-01 15:24:00 | +--------+---------------------+
The operation:
fuzzy_join(df_x, df_y, on='obs_time', tol=pd.Timedelta('1 minute'))
gives:
+---------+---------------------+---------+---------------------+ | event_x | obs_time_x | event_y | obs_time_y | +---------+---------------------+---------+---------------------+ | event1 | 2021-01-01 10:23:00 | event4 | 2021-01-01 10:22:00 | | event3 | 2021-03-01 15:23:00 | event6 | 2021-03-01 15:22:00 | | event3 | 2021-03-01 15:23:00 | event7 | 2021-03-01 15:24:00 | +---------+---------------------+---------+---------------------+
Notes
High-precision applications
Care must be taken if high precision (low tolerances) are to be used with large floating point numbers, due to representation limitations.
As a workaround, consider using arbitrary precision data types, such as the Python built-in Decimal type, accepting the performance penalty. The join columns can be converted to
Decimaljust before the fuzzy join operation:from decimal import Decimal import pandance as dance df_a['val'] = df_a['val'].map(lambda x: Decimal(x)) df_b['val'] = df_b['val'].map(lambda x: Decimal(x)) dance.fuzzy_join(df_b, df_a, on='val', tol=Decimal(1e-10))
See the
decimaldocumentation on setting the precision (number of decimals) to be used for results of operations with Decimals.A more widespread practice is to adjust float comparisons with an additional relative tolerance, but there is no straightforward way to do this for this join operation, since we’re (conceptually) symmetrically comparing everything with everything in the left and right join columns. Comparisons with relative tolerance factor in both numbers to be compared. For more technical details on the issue, see this post.