Dagster Data Engineering Glossary:
Data Caching
Definition of caching
Caching is the process of storing the result of an expensive computation so that it can be reused in the future instead of being recalculated each time it is needed. It is an important concept in data engineering, particularly when dealing with large datasets or complex transformations.
The main purpose of caching is to increase performance and reduce the computational load on your system. It shortens data processing tasks by keeping frequently accessed data, or the results of expensive computations, in a cache: a high-speed storage layer.
Suppose you run a complex SQL query on a huge dataset. The first run might take a considerable amount of time. If you cache the result, the next time you run the same query you can retrieve the result from the cache instead of running the computation again, which can drastically reduce the execution time of the repeated task.
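The query scenario above can be sketched in a few lines. This is a minimal illustration, not a production pattern: the `orders` table, the `run_query` helper, and the `query_cache` dictionary are all hypothetical names, and a small in-memory SQLite database stands in for the "huge dataset".

```python
import sqlite3

# Hypothetical in-memory database standing in for a large dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 10.0), ("south", 25.5), ("north", 4.5)],
)

query_cache = {}  # maps SQL text -> list of result rows
db_calls = 0      # counts how often the database is actually queried


def run_query(sql):
    """Return cached rows for `sql`, querying the database only on a miss."""
    global db_calls
    if sql not in query_cache:
        db_calls += 1
        query_cache[sql] = conn.execute(sql).fetchall()
    return query_cache[sql]


sql = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
first = run_query(sql)   # cache miss: the database runs the query
second = run_query(sql)  # cache hit: served from the dictionary
print(first, db_calls)   # [('north', 14.5), ('south', 25.5)] 1
```

Note that a cache like this keeps returning the stored rows even if the underlying table changes, so a real system would also need a way to invalidate stale entries.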
Example of caching in Python
In Python, there are several ways to implement caching. The simplest is the functools.lru_cache decorator from the Python standard library. LRU stands for Least Recently Used, and functools.lru_cache implements an LRU caching algorithm for your functions.
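To make "least recently used" concrete, here is a small sketch with a deliberately tiny cache. The `square` function is a hypothetical stand-in for an expensive computation; `maxsize` and `cache_info()` are part of the standard `functools.lru_cache` interface.

```python
from functools import lru_cache


@lru_cache(maxsize=2)  # keep at most two results
def square(x):
    return x * x


square(1)  # miss: cache holds results for 1
square(2)  # miss: cache holds results for 1 and 2
square(1)  # hit: 1 becomes the most recently used entry
square(3)  # miss: cache is full, so 2 (least recently used) is evicted
square(2)  # miss again, because 2 was evicted

info = square.cache_info()
print(info)  # CacheInfo(hits=1, misses=4, maxsize=2, currsize=2)
```

With `maxsize=None`, as used in the Fibonacci example, nothing is ever evicted and the cache can grow without bound.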
As a basic example, consider a function that calculates the nth Fibonacci number, a common way to demonstrate caching. The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones. The sequence looks like this: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
The recursive approach to calculating the nth Fibonacci number is quite inefficient because it ends up recalculating smaller Fibonacci numbers multiple times. Let's see how much we can speed it up by using caching.
Here's the Fibonacci function without caching:
import time

def fib_no_cache(n):
    if n < 2:
        return n
    else:
        return fib_no_cache(n-1) + fib_no_cache(n-2)

# Let's see how long it takes to calculate the 35th Fibonacci number
start = time.time()
print(fib_no_cache(35))
end = time.time()
print('Time without caching:', round(end - start, 2), 'seconds')
Running this on your local computer might yield this type of output:
9227465
Time without caching: 1.87 seconds
And here's the same function with caching, using the functools.lru_cache decorator:
from functools import lru_cache
import time

@lru_cache(maxsize=None)
def fib_with_cache(n):
    if n < 2:
        return n
    else:
        return fib_with_cache(n-1) + fib_with_cache(n-2)

# Let's see how long it takes to calculate the 35th Fibonacci number
start = time.time()
print(fib_with_cache(35))
end = time.time()
print('Time with caching:', round(end - start, 6), 'seconds')
Which will yield something like this:
9227465
Time with caching: 2.8e-05 seconds
Note: These times will vary based on the hardware and other running processes on your machine.
Clearly, the version with caching is significantly faster, and the gap widens as n increases. With caching, each Fibonacci number is calculated only once and then stored for future use, so the work grows linearly with n. The version without caching recalculates the same Fibonacci numbers many times, so its number of calls grows exponentially with n.
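One way to check this claim is to count how often each function body actually runs. The sketch below uses hypothetical names (`fib_plain`, `fib_cached`, and a `calls` counter) and a smaller input, n = 20, so the uncached version finishes quickly.

```python
from functools import lru_cache

calls = {"plain": 0, "cached": 0}  # how many times each function body executes


def fib_plain(n):
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)


@lru_cache(maxsize=None)
def fib_cached(n):
    calls["cached"] += 1  # only incremented on a cache miss
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)


assert fib_plain(20) == fib_cached(20) == 6765
print(calls)  # {'plain': 21891, 'cached': 21}
```

The cached version executes its body once for each value from 0 to 20, while the plain version repeats the same subproblems thousands of times.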
For more advanced caching needs, such as distributed caching or disk-based caching, you might need more sophisticated tools, such as joblib for disk caching or Redis and Memcached for distributed caching.
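To show the idea behind disk-based caching without depending on any of those tools, here is a rough sketch built only on the standard library. It is not joblib's API: the `disk_cache` decorator and the `slow_sum` function are hypothetical, and the sketch assumes positional arguments with a stable `repr` and return values that can be pickled.

```python
import hashlib
import pickle
import tempfile
from functools import wraps
from pathlib import Path


def disk_cache(directory):
    """Hypothetical decorator: store return values as pickle files in `directory`."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @wraps(func)
        def wrapper(*args):
            # Derive a file name from the function name and its arguments.
            key = hashlib.sha256(repr((func.__name__, args)).encode()).hexdigest()
            path = directory / f"{key}.pkl"
            if path.exists():
                with path.open("rb") as f:
                    return pickle.load(f)  # cache hit: read the stored result
            result = func(*args)           # cache miss: compute the result
            with path.open("wb") as f:
                pickle.dump(result, f)     # persist it for later calls
            return result
        return wrapper
    return decorator


cache_dir = tempfile.mkdtemp()  # temporary directory used only for this demo
executions = []                 # records each real computation


@disk_cache(cache_dir)
def slow_sum(n):
    executions.append(n)
    return sum(range(n))


first = slow_sum(1000)   # computed and written to disk
second = slow_sum(1000)  # loaded from disk, function body not run
print(first, second, len(executions))  # 499500 499500 1
```

Unlike an in-memory cache, results stored this way survive a restart of the Python process, at the cost of disk reads and writes on every call.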
Remember that caching is not always the best solution, and it is not free. A cache uses memory or disk space to store results, and it can make your code more complex and debugging more time-consuming. Use caching deliberately, and only for operations that are actually expensive and likely to be repeated with the same input.