Similarity Computation

Similarity Valuator #

The similarity between two data objects can be calculated using a SimilarityValuator:

SimilarityValuator simVal = SimilarityModelFactory.newSimilarityValuator();
Similarity similarity = simVal.computeSimilarity(objectA, objectB);

A similarity cache can be passed to a similarity valuator to speed up time-consuming repetitive computations.

simVal.setSimilarityCache(simCache);

A similarity cache can also be passed to the similarity valuator used by a retriever, so that the cache is used during retrieval.

retriever.getValuator().setSimilarityCache(simCache);
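
The effect of attaching a cache can be pictured as a check-then-compute pattern. The following is only a sketch under assumptions, not ProCAKE code: CachingValuatorSketch, the string ids, and the double similarity values are hypothetical stand-ins, and how a SimilarityValuator actually consults its cache is not specified on this page.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Hypothetical stand-in for a valuator with an attached cache (not the ProCAKE API).
class CachingValuatorSketch {
    private final Map<String, Double> cache = new HashMap<>();
    private final ToDoubleBiFunction<String, String> expensiveMeasure;
    private int computations = 0; // counts how often the measure really ran

    CachingValuatorSketch(ToDoubleBiFunction<String, String> expensiveMeasure) {
        this.expensiveMeasure = expensiveMeasure;
    }

    double computeSimilarity(String queryId, String caseId) {
        String key = queryId + "|" + caseId; // combined id of the ordered pair (assumed format)
        Double cached = cache.get(key);
        if (cached != null) {
            return cached; // cache hit: the expensive computation is skipped
        }
        computations++;
        double similarity = expensiveMeasure.applyAsDouble(queryId, caseId);
        cache.put(key, similarity);
        return similarity;
    }

    int getComputations() {
        return computations;
    }
}
```

Calling computeSimilarity("a", "b") twice on this sketch runs the measure only once; the second call is answered from the map.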

The construction of similarity caches is described in the following sections.

Similarity Cache #

A SimilarityCache maps a pair of data objects (i.e., a query object and a case object) to a Similarity object.

There are different types of cache implementations, which share a common interface SimilarityCache with the following methods:

cache.containsSimilarity(queryObject, caseObject);
cache.setSimilarity(queryObject, caseObject, similarity); // replaces entry if it already existed
cache.getSimilarity(queryObject, caseObject); // returns null if no such entry exists
cache.getSimilarities();
cache.size();
cache.isEmpty();
cache.removeEntryByIds(queryId, caseId); // removes the entry with the given id pair
cache.removeEntriesById(id); // removes all entries with a query and/or case represented by the given id 
cache.clear();

All cache types implement the Iterable interface which allows for iterating over all cached entries.

Internally, a SimilarityCache uses a DataObjectPair to build a combined id string from the given query object and case object.
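
To make the contract above concrete, here is a minimal sketch of a pair-keyed cache. It is not the ProCAKE implementation: PairKeyedCacheSketch and PairKey are hypothetical names, ids and Double values stand in for data objects and Similarity objects, and both the "|" separator and the assumption that the pair is ordered are illustrative guesses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the cache contract described above.
class PairKeyedCacheSketch {
    // Ordered pair of ids: (query, case) and (case, query) are treated as different keys here.
    record PairKey(String queryId, String caseId) {
        String combinedId() {
            return queryId + "|" + caseId; // separator is an assumption
        }
    }

    private final Map<PairKey, Double> entries = new HashMap<>();

    void setSimilarity(String queryId, String caseId, double similarity) {
        entries.put(new PairKey(queryId, caseId), similarity); // replaces an existing entry
    }

    Double getSimilarity(String queryId, String caseId) {
        return entries.get(new PairKey(queryId, caseId)); // null if no such entry exists
    }

    boolean containsSimilarity(String queryId, String caseId) {
        return entries.containsKey(new PairKey(queryId, caseId));
    }

    // Removes every entry whose query id or case id equals the given id.
    void removeEntriesById(String id) {
        entries.keySet().removeIf(k -> k.queryId().equals(id) || k.caseId().equals(id));
    }

    int size() {
        return entries.size();
    }
}
```

Because PairKey is a record, equals() and hashCode() are derived from both ids, which is what allows it to serve as a map key.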

SimpleSimilarityCache #

A SimpleSimilarityCache holds its entries in a simple HashMap<DataObjectPair, Similarity>. It only exists at runtime and nothing will be stored after the program has terminated.

Usage:

SimpleSimilarityCache simpleCache = new SimpleSimilarityCache();

CaffeineSimilarityCache #

Like a SimpleSimilarityCache, a CaffeineSimilarityCache is an in-memory cache.

Usage:

CaffeineSimilarityCache caffeineCache = CaffeineSimilarityCache.builder().build();

Additionally, it provides a selection of eviction strategies for its entries.

Size-based eviction

When a new entry would exceed the maximum size or weight, the cache makes space by evicting entries that have not been used recently or frequently.

CaffeineSimilarityCache caffeineCache = CaffeineSimilarityCache.builder()
    .withMaximumSize(10_000)
    .build();
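
The idea of a size bound can be illustrated with the standard library alone. The sketch below uses plain least-recently-used (LRU) eviction via LinkedHashMap, which is a deliberate simplification: the cache described here also takes usage frequency into account, and LruBoundSketch is a hypothetical class that is not part of ProCAKE or Caffeine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified size-bounded map with LRU eviction (frequency is ignored here).
class LruBoundSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maximumSize;

    LruBoundSketch(int maximumSize) {
        // accessOrder = true: iteration order runs from least to most recently used
        super(16, 0.75f, true);
        this.maximumSize = maximumSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the least recently used entry.
        return size() > maximumSize;
    }
}
```

With a maximum size of 2, inserting "a" and "b", reading "a", and then inserting "c" evicts "b", because "a" was touched more recently.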

If entries have different “weights”, you may set a maximum weight instead of size. This requires you to implement your own weigher. A weigher is an object of a class which implements the functional interface Weigher<DataObjectPair, Similarity>.

Override the weigh method in your class.

class FileSizeWeigher implements Weigher<DataObjectPair, Similarity> {
    @Override
    public @NonNegative int weigh(DataObjectPair pair, Similarity similarity) {
        // Use the length of the JSON serialization as the entry's weight.
        String json = XStreamUtil.toJSON(similarity);
        return json.length();
    }
}

CaffeineSimilarityCache caffeineCache = CaffeineSimilarityCache.builder()
    .withMaximumWeight(10_000)
    .withWeigher(new FileSizeWeigher())
    .build();

Alternatively, you can use a lambda expression.

CaffeineSimilarityCache caffeineCache = CaffeineSimilarityCache.builder()
    .withMaximumWeight(10_000)
    .withWeigher((DataObjectPair pair, Similarity similarity) -> {
        String json = XStreamUtil.toJSON(similarity);
        return json.length();
    })
    .build();
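
How a weigher interacts with a maximum weight can be sketched without any library. The class below is a hypothetical illustration, not the Caffeine or ProCAKE implementation: it evicts in insertion order (a simplification of the real policy) and uses a ToIntBiFunction in place of the Weigher interface.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntBiFunction;

// Hypothetical weight-bounded map; the oldest entries are evicted first.
class WeightBoundSketch<K, V> {
    private final Map<K, V> entries = new LinkedHashMap<>();
    private final ToIntBiFunction<K, V> weigher;
    private final long maximumWeight;
    private long totalWeight = 0;

    WeightBoundSketch(long maximumWeight, ToIntBiFunction<K, V> weigher) {
        this.maximumWeight = maximumWeight;
        this.weigher = weigher;
    }

    void put(K key, V value) {
        V old = entries.remove(key);
        if (old != null) {
            totalWeight -= weigher.applyAsInt(key, old); // a replaced entry gives back its weight
        }
        entries.put(key, value);
        totalWeight += weigher.applyAsInt(key, value);
        // Evict the oldest entries until the total weight fits again.
        Iterator<Map.Entry<K, V>> it = entries.entrySet().iterator();
        while (totalWeight > maximumWeight && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            totalWeight -= weigher.applyAsInt(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    boolean containsKey(K key) {
        return entries.containsKey(key);
    }

    long getTotalWeight() {
        return totalWeight;
    }
}
```

Note that in this sketch an entry heavier than the maximum weight evicts everything, including itself.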

Time-based eviction

Another way to evict entries is to have them expire after a certain period of time. This time can be measured from the moment of writing an entry to the cache and/or the last time it was accessed.

CaffeineSimilarityCache caffeineCache = CaffeineSimilarityCache.builder()
    .withExpirationAfterWrite(Duration.ofMillis(10))
    .withExpirationAfterAccess(Duration.ofMillis(10))
    .build();
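
The difference between the two timers can be seen in a small stand-alone sketch. ExpiringCacheSketch is a hypothetical class, not the ProCAKE or Caffeine implementation; it checks expiry lazily on read and takes an injected clock (an assumption made so the behaviour can be tested without sleeping).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical cache with expire-after-write and expire-after-access timers.
class ExpiringCacheSketch<K, V> {
    private record Timed<T>(T value, long writtenAt, long accessedAt) {}

    private final Map<K, Timed<V>> entries = new HashMap<>();
    private final long expireAfterWriteMillis;
    private final long expireAfterAccessMillis;
    private final LongSupplier clock; // returns the current time in milliseconds

    ExpiringCacheSketch(long expireAfterWriteMillis, long expireAfterAccessMillis, LongSupplier clock) {
        this.expireAfterWriteMillis = expireAfterWriteMillis;
        this.expireAfterAccessMillis = expireAfterAccessMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        long now = clock.getAsLong();
        entries.put(key, new Timed<>(value, now, now)); // a write resets both timers
    }

    V get(K key) {
        long now = clock.getAsLong();
        Timed<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        boolean expired = now - entry.writtenAt() >= expireAfterWriteMillis
                || now - entry.accessedAt() >= expireAfterAccessMillis;
        if (expired) {
            entries.remove(key);
            return null;
        }
        // A read refreshes only the access timer; the write timer keeps running.
        entries.put(key, new Timed<>(entry.value(), entry.writtenAt(), now));
        return entry.value();
    }
}
```

In this sketch, frequent reads keep an entry alive with respect to the access timer, but the entry still expires once the write timer runs out.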

FileBasedCaffeineSimilarityCache #

A FileBasedCaffeineSimilarityCache is useful when the cache should be persisted and reused later without having to set all its entries again. It inherits all eviction strategies from CaffeineSimilarityCache. Invoking clear() on this cache not only removes all entries but also deletes the cache file.

Usage:

FileBasedCaffeineSimilarityCache fileCache = FileBasedCaffeineSimilarityCache.builder().build();

The default path for the cache file is <user home>/ProCAKE/similarityCache.csv. The path can be changed when building the cache, but it should still point to a CSV file.

FileBasedCaffeineSimilarityCache fileCache = FileBasedCaffeineSimilarityCache.builder()
    .withCachePath(path)
    .build();

The cache can be persisted to this file on command.

fileCache.backup();

The contents of this file can be loaded into the cache on command.

fileCache.restore();
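
The backup, restore, and clear behaviour can be sketched with a plain map and a text file. Everything in the following block is an assumption made for illustration: CsvBackedCacheSketch is a hypothetical class, the semicolon-separated line layout is invented (the real file format is not described on this page), and ids are assumed not to contain a semicolon.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical file-backed cache; one line per entry: queryId;caseId;similarity
class CsvBackedCacheSketch {
    private final Map<String, Double> entries = new LinkedHashMap<>();
    private final Path file;

    CsvBackedCacheSketch(Path file) {
        this.file = file;
    }

    void set(String queryId, String caseId, double similarity) {
        entries.put(queryId + ";" + caseId, similarity);
    }

    Double get(String queryId, String caseId) {
        return entries.get(queryId + ";" + caseId);
    }

    // Writes all current entries to the file, replacing its previous content.
    void backup() {
        List<String> lines = new ArrayList<>();
        entries.forEach((key, similarity) -> lines.add(key + ";" + similarity));
        try {
            Files.write(file, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Loads all entries from the file into the in-memory map.
    void restore() {
        try {
            for (String line : Files.readAllLines(file)) {
                String[] cols = line.split(";");
                entries.put(cols[0] + ";" + cols[1], Double.parseDouble(cols[2]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors the documented behaviour: clearing also deletes the file.
    void clear() {
        entries.clear();
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A second instance pointing at the same path starts empty and sees the entries only after restore() has been called.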

SimCacheListener #

A SimCacheListener is notified about changes to the cache it is attached to. A listener is added to or removed from a cache as follows:

cache.addListener(listener);
cache.removeListener(listener);

A SimCacheListener requires an implementation of the update() method, which the cache invokes on every change. The method receives the affected DataObjectPair, the potentially changed Similarity, and a NotificationType. The NotificationType can represent a new addition to the cache (NotificationType.ADD), a replacement of an existing entry (NotificationType.UPDATE), an entry being accessed (NotificationType.READ), an entry being removed or evicted (NotificationType.REMOVE), or the entire cache being cleared (NotificationType.CLEAR). Note that on a CLEAR event the provided DataObjectPair and Similarity are null. An implementation of update() could look like this:

public class MySimCacheListener implements SimCacheListener {

    MyTable table = new MyTable();

    @Override
    public void update(DataObjectPair dataObjects, Similarity similarity, NotificationType notificationType) {
        switch (notificationType) {
            case ADD -> table.add(dataObjects, similarity);
            case UPDATE -> table.updateSimilarity(dataObjects, similarity);
            case READ -> table.incrementReadCount(dataObjects);
            case REMOVE -> table.remove(dataObjects);
            case CLEAR -> table.removeAll();
        }
    }
}
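
For experimenting with the notification flow without ProCAKE on the classpath, the mechanism can be reduced to a small observer sketch. All types below are hypothetical stand-ins: the nested NotificationType enum only mirrors the names listed above, pair ids are plain strings, and when exactly the real cache fires each event is an assumption.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical observer sketch of a cache that notifies listeners.
class ListenerSketch {
    enum NotificationType { ADD, UPDATE, READ, REMOVE, CLEAR }

    interface Listener {
        void update(String pairId, Double similarity, NotificationType type);
    }

    static class ObservableCache {
        private final Map<String, Double> entries = new HashMap<>();
        private final List<Listener> listeners = new ArrayList<>();

        void addListener(Listener listener) { listeners.add(listener); }
        void removeListener(Listener listener) { listeners.remove(listener); }

        void set(String pairId, double similarity) {
            NotificationType type = entries.containsKey(pairId)
                    ? NotificationType.UPDATE : NotificationType.ADD;
            entries.put(pairId, similarity);
            fire(pairId, similarity, type);
        }

        Double get(String pairId) {
            Double similarity = entries.get(pairId);
            if (similarity != null) {
                fire(pairId, similarity, NotificationType.READ);
            }
            return similarity;
        }

        void clear() {
            entries.clear();
            fire(null, null, NotificationType.CLEAR); // no pair or similarity on CLEAR
        }

        private void fire(String pairId, Double similarity, NotificationType type) {
            for (Listener listener : listeners) {
                listener.update(pairId, similarity, type);
            }
        }
    }

    // Example listener that counts how often each event type occurred.
    static class CountingListener implements Listener {
        final Map<NotificationType, Integer> counts = new EnumMap<>(NotificationType.class);

        @Override
        public void update(String pairId, Double similarity, NotificationType type) {
            counts.merge(type, 1, Integer::sum);
        }
    }
}
```

A listener written this way must tolerate null arguments, since a CLEAR event carries neither a pair nor a similarity.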