For many programs, some of the biggest performance issues relate to network and disk access: web requests and database calls are often the slowest parts. I was commissioned by a client to improve the speed of a Python program which uses the Salesforce CRM as its database. The program was making calls via the Salesforce web API to get data. Some of these calls were made multiple times per process to fetch the same information, and each process was run multiple times per day on multiple servers. Some of the data in Salesforce only changed once a week; other data was updated within a few minutes. The options I considered to improve the speed of data access were:

Speed Improvement Options

1. Moving to a more performant database such as MySQL. This was not an option for the client, so I was stuck with getting data from Salesforce.

2. Using an existing python package to cache data which does not change regularly.

a) There are options such as requests_cache (https://requests-cache.readthedocs.io/en/stable/), which caches any HTTP request made using the requests package. This may be useful for many programs; however, the client's program was using a specific Salesforce Python wrapper to access the API, so this was not an option.

b) Python actually has a built-in function cache in functools (https://docs.python.org/dev/library/functools.html#functools.cached_property), which caches the result of a function for a given set of input parameters (a short sketch of this appears after this list). However, this cache only works in memory, so it would give some speed improvement within a particular run of the process, but if the process was run again, or was run on a different server, the cache would not be available. That mattered for this particular project.

3. Finally … roll my own cache function with local data storage, and apply it selectively across the program. This is what I ended up doing. In a similar vein to how the functools cache works, I decided to create a decorator which caches the result based on the function and the parameters supplied to it. However, my cache_result() decorator saves the result to local disk as a pickled object, so the cache survives between runs.
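
(For completeness, the built-in cache from option 2(b) is used roughly like this. It memoises purely in memory, so the cached values vanish when the process exits; get_price_data here is a made-up stand-in for illustration.)

from functools import lru_cache

@lru_cache(maxsize=None)          # in-memory only: the cache is lost when the process exits
def get_price_data(sku_prefix):
    # stand-in for a slow Salesforce call
    return {'AB-001': 9.99} if sku_prefix == 'AB-' else {}

get_price_data('AB-')   # computed on the first call
get_price_data('AB-')   # returned instantly from the in-memory cache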

The Caching Code

import hashlib
import os
import pickle
import time
from functools import wraps

CACHE_FOLDER = 'cache'

def generate_cache_filename(func, *args, **kwargs):
    # hash the function name and its arguments to get a repeatable, unique filename
    combined_data = (func.__name__, args, kwargs)
    serialized_data = pickle.dumps(combined_data)
    hash_object = hashlib.sha256(serialized_data)
    filename = hash_object.hexdigest() + '.pickle'
    return os.path.join(CACHE_FOLDER, filename)

def cache_result(ttl):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            cache_file = generate_cache_filename(func, *args, **kwargs)
            try:
                # if a fresh cache file exists, return the stored result
                with open(cache_file, 'rb') as file:
                    cached_data = pickle.load(file)
                cached_result, cached_time = cached_data
                current_time = time.time()
                if current_time - cached_time < (ttl * 60):
                    return cached_result
            except (IOError, pickle.PickleError, EOFError):
                pass  # no usable cache file: fall through and call the function

            # cache miss or expired: call the function and store the result with a timestamp
            result = func(*args, **kwargs)
            cached_data = (result, time.time())
            os.makedirs(CACHE_FOLDER, exist_ok=True)
            with open(cache_file, 'wb') as file:
                pickle.dump(cached_data, file)

            return result
        return wrapper
    return decorator

How it works

A) Python functions can themselves receive and return functions, and this is what we’re doing here with decorator(func). When we pass a function ‘func’ as an argument, inside the ‘decorator’ function we create another function, ‘wrapper’, which takes the args and kwargs and calls the original ‘func’ within it. The ‘decorator’ function then returns this new function, ‘wrapper’, which takes the place of the original function ‘func’ passed to it!
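
As a minimal, standalone sketch of that pattern (nothing to do with caching yet), here is a decorator that simply logs each call; log_calls and add are made-up names for illustration:

from functools import wraps

def log_calls(func):
    @wraps(func)                       # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)   # delegate to the original function
    return wrapper                     # this replaces the decorated function

@log_calls
def add(a, b):
    return a + b

add(2, 3)   # prints "calling add", returns 5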

B) This particular situation is a bit more complicated, since we also need to pass the ttl (time to live) parameter to the decorator (not to the original function!), which determines how many minutes the cache is valid for. In order to pass an argument to a decorator, we need an outer function definition – cache_result(ttl) – hence why the functions end up nested three layers deep.
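
Here is a stripped-down sketch of that three-layer structure, using a made-up repeat(times) decorator: the outer function receives the decorator argument, the middle one receives the function, and the inner wrapper receives the call arguments.

from functools import wraps

def repeat(times):                     # outer layer: receives the decorator argument
    def decorator(func):               # middle layer: receives the function
        @wraps(func)
        def wrapper(*args, **kwargs):  # inner layer: receives the call arguments
            for _ in range(times):
                result = func(*args, **kwargs)
            return result
        return wrapper
    return decorator

@repeat(times=3)
def greet(name):
    print(f"hello {name}")

greet("world")   # prints "hello world" three times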

C) The ‘wrapper’ function first generates the cache filename from the supplied function ‘func’ and its args and kwargs, via the helper function ‘generate_cache_filename’. This serializes the data, hashes it (using SHA-256, which is extremely unlikely to give the same hash output for different inputs), and returns a filename unique to the information we put in.
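
As a quick illustration (using the generate_cache_filename() helper from the caching code above, with a made-up function), the same function and arguments always map to the same file, while different arguments map to a different one:

# assumes generate_cache_filename() from the caching code above is already defined
def get_price_data(sku_prefix):
    ...

print(generate_cache_filename(get_price_data, 'AB-'))
# cache/<64 hex characters>.pickle -- identical every time for these inputs
print(generate_cache_filename(get_price_data, 'CD-'))
# a different hash, so a different cache file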

D) RUN ONE (no cache file present): The first time the function is called, the try/except block throws an exception (there is no cache file to open) and execution passes on. The supplied function is called with its original args and kwargs, its result and the current time are pickled to the cache file, and then the result of the original function is returned.

E) RUN TWO (cache present and within TTL): The second time the function is called, we open the cache file if it is present and load the cached_result and cached_time. If the difference between cached_time and the current time is less than the ttl (cache expiry time), we return cached_result.

F) RUN THREE (cache present but outside TTL): As per (E), but this time the difference between cached_time and the current time is greater than the ttl, so nothing is returned from the try block and the wrapper falls through to the same path as (D), computing the result of the original function again and updating the cache.
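
To see runs (D), (E) and (F) for yourself, here is a rough sketch, assuming the cache_result decorator above is in scope; slow_double and the timings are purely illustrative:

import time

@cache_result(ttl=1)              # cache entries valid for 1 minute
def slow_double(x):
    time.sleep(2)                 # stand-in for a slow API call
    return x * 2

start = time.time(); slow_double(21)
print(f"run one (no cache): {time.time() - start:.1f}s")    # ~2s: computed and written to cache

start = time.time(); slow_double(21)
print(f"run two (cache hit): {time.time() - start:.1f}s")   # ~0s: read straight from the pickle file

time.sleep(61)                    # wait for the cache entry to pass its ttl
start = time.time(); slow_double(21)
print(f"run three (expired): {time.time() - start:.1f}s")   # ~2s: recomputed and the cache updated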

Implementation Examples

# define the function with the cache_result decorator
from simple_salesforce import Salesforce   # assuming the simple-salesforce package; any wrapper with this client interface works

@cache_result(ttl=60)
def get_price_data_from_sf(sku_prefix):
    sf = Salesforce(username=SF_USERNAME, password=SF_PASSWORD,
                    security_token=SF_SECURITY_TOKEN, client_id=SF_CLIENT_ID)
    query = (f"""SELECT Id, Name, SKU, Price
                 FROM price_list__c
                 WHERE SKU LIKE '{sku_prefix}%'
              """)
    prices_by_sku = sf.query_all(query)
    return prices_by_sku

# call the decorated function
prices = get_price_data_from_sf('AB-')

# alternatively, apply the cache by hand to a function that was not decorated
get_price_data_from_sf = cache_result(60)(get_price_data_from_sf)
prices = get_price_data_from_sf('AB-')

Conclusion

Implementing this caching function gave a small speed improvement for each API call to Salesforce, but since these calls are made multiple times in each program session, the overall reduction in run time was significant for only a few lines of added code. The caching decorator can also be reused across many different functions and programs.