Getting Started#
Installation#
Pandance is available from PyPI
pip install pandance
Pandance has been tested with pandas version as low as 1.2.0, though please consider upgrading to >= 2.0.0 for significant performance improvements, both in terms of speed and memory usage.
Basic Usage#
The Pandance operations are meant to integrate as seamlessly as possible with pandas, and have a very similar API, e.g.
import pandance as dance
result_df = dance.fuzzy_join(df_a, df_b, on='column_name', tol=0.05)
Each operation has examples of usage in its documentation. The Highlights section below showcases some of Pandance’s functionality.
Highlights#
Pandance extends the set of standard join operations in pandas (inner, outer, cross, left, right) with the operations:
fuzzy join: Match columns with a tolerance. Supports numerical and datetime values.
inequality join: Match one column’s values that are less / greater than the other column’s values. Supports any type that implements comparisons (numerical, string, datetime, user-defined, etc.).
theta join: Allows the user to specify arbitrary matching conditions on which to join.
Pandance is designed with performance in mind, aiming to provide fast implementations whenever possible.
Fuzzy Joins#
While the most common type of join matches column values exactly, one will invariably come across situations where some sort of tolerance must be considered, for instance matching on timestamps to find events that occur at approximately the same time, or joining on some type of real-valued (float) score to find entries with similar performance.
One approach is to transform the input DataFrames by binning (discretizing) these values. A more natural approach would be to consider fuzzy matches.
Taking the previous example of matching things based on some real value, say we have two sets of model performance scores (0..1). The models in one list are rather simple, the others much more complex:
|
|
We’re interested in finding models that perform essentially the same across the two lists, and take a score tolerance of 0.05. This can be easily expressed as a fuzzy join on the score column:
fuzzy_join(simple_models, fancy_models, on='score', tol=0.05, suffixes=('_s', '_f'))
This gives:
model_s |
score_s |
model_f |
score_f |
|---|---|---|---|
B |
0.5 |
M2 |
0.54 |
D |
0.9 |
M4 |
0.89 |
See the fuzzy_join documentation for more details
and examples, including for time series data.
Performance#
Pandance fuzzy joins are fast and consume little memory, primarily by using interval trees as the underlying data structure and avoiding comparing all entries against all entries. This makes the implementation faster than typical solutions, like the corresponding R fuzzyjoin function. This is illustrated here by the performance on the same dataset consisting of two overlapping normal distributions (10 000 samples from each, mean=-2 and 2, respectively, sd=1) with a tolerance of 0.1, which resulted in 106 776 matches.
Implementation |
time [s] |
memory [MB] |
|---|---|---|
pandance.fuzzy_join() |
1.88 |
16.5 |
fuzzyjoin::difference_inner_join() |
4.24 |
3070 |
Click to see more technical details about the performance aspect.
By representing the longer column (length m) as a tolerance interval tree, match lookups can be made in \(O(\log{m})\).
The R fuzzyjoin implementation (as of version 0.1.6),
as well as solutions on e.g. StackOverflow,
perform the operation (conceptually) as a Cartesian join
(comparing all against all, so a \(O(m)\) lookup time
for each entry in the shorter column),
followed by filtering on pairs within the tolerance.
The dataset used for profiling consists of two tables containing 1e4 numbers sampled from two normal distributions (means -2 and 2, respectively, and sd = 1). The fuzzy join is performed with a tolerance of 0.1 on these two sets, resulting in a sort of fuzzy intersection of the populations. Only the join operation time and memory are shown.
Pandance (Python) measurement:
# Speed
python test/performance.py
snakeviz $(ls -1rt | tail -n 1)
# Memory
# (decorated fuzzy_speed_random with @profile)
python -m memory_profiler test/performance.py
Versions: Python 3.8, pandas 1.4.2, numpy 1.22.3, intervaltree 3.1.0
fuzzyjoin (R) measurement: Used RStudio’s profiling
(profvis package under the hood)
The profiling scripts are available in the Pandance repo. The profiling was performed on a machine with 4x Intel Core i7-8550U @ 4 GHz and 8 GB RAM, under Manjaro Linux 22.1.1, kernel 5.15.109, starting each test at about 37 C CPU temperature.
Inequality Joins#
Sometimes we want to match table entries based on how they compare.
For example, if we have weather data for two regions A and B,
we may want to find those entries where monthly rainfall was higher in
region B. Assuming for the sake of the example we have two dataframes
weather_region_a and weather_region_b (a single one may be split by region),
we can use an inequality join to match the two based on
weather_region_a['rainfall'] < weather_region_b['rainfall'].
Pandance provides ineq_join for this operation,
which would look like:
ineq_join(weather_region_a, weather_region_b, on='rainfall', how='<')
As an example of working with temporal data, say we want to find connecting flights between locations A and C with a layover in location B. We have a table with flights from A to B, and another from B to C:
|
|
Valid connections are those flights that arrive in B before a departing flight from B. So we can get those flights with:
ineq_join(flights_ab, flights_bc, left_on='arr', right_on='dep', how='<',
suffixes=('_ab', '_bc'))
which gives us a list of connections:
dep_ab |
arr_ab |
dep_bc |
arr_bc |
|---|---|---|---|
2023-01-01 08:00 |
2023-01-01 10:00 |
2023-01-01 14:00 |
2023-01-01 17:00 |
2023-01-01 08:00 |
2023-01-01 10:00 |
2023-01-01 18:00 |
2023-01-01 21:00 |
2023-01-01 12:00 |
2023-01-01 14:00 |
2023-01-01 18:00 |
2023-01-01 21:00 |
2023-01-01 08:00 |
2023-01-01 10:00 |
2023-01-01 21:00 |
2023-01-02 00:00 |
2023-01-01 12:00 |
2023-01-01 14:00 |
2023-01-01 21:00 |
2023-01-02 00:00 |
2023-01-01 16:00 |
2023-01-01 18:00 |
2023-01-01 21:00 |
2023-01-02 00:00 |
A more realistic example would likely entail adding a buffer value,
which can just be done by passing a mutated version of flights_ab with
the time buffer added to the arrival time column.
Or, if the tables are small, we could use a theta_join
to express the more complex condition.
Since ineq_join relies on comparisons, any Python object that supports these
may be used for the join columns.
Let’s take an example with strings, which in Python are ordered
lexicographically.
Suppose we have a small sample of strings and want to find all strings that are
sorted lower in a large database of strings (which here is constructed randomly):
query = pd.DataFrame(['bbb', 'ccc'], columns=['string'])
database = pd.DataFrame(
[''.join(random.choices(string.ascii_lowercase, k=3)) for _ in range(10)],
columns=['string']
)
ineq_join(query, database, how='>', on='string', suffixes=('_query', '_db'))
In this case, the random database only has a few strings of lower ordering than our query:
string_query |
string_db |
|---|---|
bbb |
afn |
ccc |
afn |
bbb |
afq |
ccc |
afq |
Performance#
The ineq_join operation is typically faster than the common straightforward approach
using a Cartesian (cross) join followed by filtering on the inequality condition,
this latter method having the downside of comparing everything with everything,
even if the number of actual matches is much lower.
See the documentation of ineq_join for more details.
Fro much better performance however, the excellent R
data.table package is recommended, as it supports
inequality joins (which it implements in a similar fashion to ineq_join)
but is an order of magnitude faster, as shown below on the same dataset.
Implementation |
time [s] |
memory [MB] |
|---|---|---|
|
3.24 |
248 |
|
9.3 |
255 |
|
4.42 |
343 |
|
0.2 |
30 |
|
0.15 |
60 |
Click to see more technical details about the performance measurement.
The benchmark consisted of 2 dataframes A and B containing increasing integer sequences.
The two sequences have a parametrized overlap, otherwise A values are smaller than B.
On a < join without any overlap, the result is a Cartesian join,
with \(A \cdot B\) elements.
With a nonzero overlap L, the number of matches (rows in the result) is
\(A \cdot B - L^2 + \binom{L}{2}\).
For the results below A = B = 3000 and L = 1500,
meaning a result of length 7 874 250.
Only the join operation time and memory are shown.
data.table version 1.14.8 was used.
Pandance (Python) measurement:
# Speed
python test/performance.py
snakeviz $(ls -1rt | tail -n 1)
# Memory
# (decorated ineq_join_overlap_cartesian with @profile)
python -m memory_profiler test/performance.py
Versions: Python 3.8, pandas 1.4.2, numpy 1.22.3
data.table (R) measurement: Used RStudio’s profiling
(profvis package under the hood)
The profiling scripts are available in the Pandance repo. The profiling was performed on a machine with 4x Intel Core i7-8550U @ 4 GHz and 8 GB RAM, under Manjaro Linux 22.1.1, kernel 5.15.109, starting each test at about 37 C CPU temperature.
Theta Joins#
While joins naturally capture exact or approximate matching between columns, in principle pairs of values may be considered to match based on any set of criteria.
In relational algebra, a \(\theta\)-join is a join where pairs \((a, b)\) of values from columns A and B are considered to match if they fulfill a condition \(\theta\), which we could write \(\theta(a, b) = True\) (also known as a binary relation).
Pandance implements an (inner) theta_join that takes a
user-specified boolean-valued function which judges whether pairs of elements match.
This is a departure from the typical limited choice of \(\theta\)
as an inequality {<, <=, =, !=, >=, >}.
For instance, if we have the following tables of strings and want to find all pairs in which keywords appear as substrings of phrases:
|
|
A \(\theta\)-join can be written with a user-specified match condition
lambda kw, phrase: kw in phrase like so:
dance.theta_join(
keywords, phrases,
left_on='keyword', right_on='phrase',
condition=lambda kw, phrase: kw in phrase
)
Which results in:
keyword |
phrase |
|---|---|
a |
the quick brown fox jumps over the lazy dog |
the |
the quick brown fox jumps over the lazy dog |
Performance#
theta_join is parallelized, which is especially helpful for larger dataframes,
since all rows in left join column need to be evaluated against all rows in the right
in a linear fashion
(no logarithmic lookup speed-ups such as binary search for ineq_join
or interval trees for fuzzy_join).
See the benchmark above comparing
the performance of ineq_join with an equivalent theta_join.
To avoid unnecessary overhead on small data, multiple processes are used only if the number of rows in the longest input dataframe is over a (configurable) threshold.
The usual reminder is warranted, that parallelism does not always help, and that performance depends on your data, OS, and hardware.
See the theta_join documentation for more details
and examples.