Similarity Computation

Similarity Valuator #

The similarity between two data objects can be calculated using a SimilarityValuator:

SimilarityValuator simVal = SimilarityModelFactory.newSimilarityValuator();
Similarity similarity = simVal.computeSimilarity(objectA, objectB);

A similarity cache can be passed to a similarity valuator to speed up time-consuming computations that are repeated for the same pairs of objects.

simVal.setSimilarityCache(simCache);
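The effect of such a cache can be illustrated with a minimal, JDK-only sketch. It is not ProCAKE code: `MemoizingValuatorSketch`, its `computations()` counter, and the use of plain string IDs and `double` similarities are assumptions made for illustration. The sketch also assumes that the pair (a, b) is cached separately from (b, a), which may differ from the actual cache.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical stand-in for a valuator with a cache attached; not part of ProCAKE.
class MemoizingValuatorSketch {
    // Key: ordered pair of object IDs. Value: the similarity computed for that pair.
    private final Map<Map.Entry<String, String>, Double> cache = new HashMap<>();
    private final BiFunction<String, String, Double> expensiveMeasure;
    private int computations = 0;

    MemoizingValuatorSketch(BiFunction<String, String, Double> expensiveMeasure) {
        this.expensiveMeasure = expensiveMeasure;
    }

    double computeSimilarity(String idA, String idB) {
        Map.Entry<String, String> key = Map.entry(idA, idB);
        Double cached = cache.get(key);
        if (cached != null) {
            return cached; // cache hit: the expensive measure is skipped
        }
        computations++;
        double value = expensiveMeasure.apply(idA, idB);
        cache.put(key, value);
        return value;
    }

    // Number of times the underlying measure was actually evaluated.
    int computations() {
        return computations;
    }
}
```

Asking for the same pair twice evaluates the underlying measure only once; the second call is answered from the map.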

To use a cache during retrieval, pass it to the similarity valuator of the retriever.

retriever.getValuator().setSimilarityCache(simCache);

The construction of similarity caches is described in the following sections.

Similarity Cache #

A similarity cache stores key-value pairs. A key is a pair of data object IDs, and its value is the similarity between those two objects.

CaffeineSimilarityCache simCache = CaffeineSimilarityCache.builder().build();

Optionally, you can provide the cache with one or multiple eviction strategies.

Size-based eviction

When adding a new entry would exceed the maximum size or weight, the cache makes space by evicting entries that have not been used recently or frequently.

CaffeineSimilarityCache sizeSimCache = CaffeineSimilarityCache.builder()
    .withMaximumSize(10_000)
    .build();
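The idea behind a size bound can be sketched with the JDK alone. The class below is a simplified illustration, not the cache's actual policy: it evicts purely by recency (least recently used first), whereas the description above says the real cache also considers how often an entry is used. `BoundedLruSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size-bounded map that drops the least recently used entry when full.
class BoundedLruSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maximumSize;

    BoundedLruSketch(int maximumSize) {
        // accessOrder = true: iteration order runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maximumSize = maximumSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true removes the eldest entry.
        return size() > maximumSize;
    }
}
```

With a maximum size of 2, inserting a third entry removes whichever of the first two was touched least recently.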

If entries have different “weights”, you may set a maximum weight instead of size. This requires you to implement your own weigher. A weigher is an object of a class which implements the functional interface Weigher<Pair<String, String>, Similarity>.

Override the weigh method in your class.

class MyWeigher implements Weigher<Pair<String, String>, Similarity> {
    // Weight of an entry: the combined length of both data object IDs.
    @Override
    public @NonNegative int weigh(Pair<String, String> pair, Similarity similarity) {
        return pair.getLeft().length() + pair.getRight().length();
    }
}

CaffeineSimilarityCache weightSimCache = CaffeineSimilarityCache.builder()
    .withMaximumWeight(10_000)
    .withWeigher(new MyWeigher())
    .build();

Alternatively, you can use a lambda expression.

CaffeineSimilarityCache weightSimCache = CaffeineSimilarityCache.builder()
    .withMaximumWeight(10_000)
    .withWeigher((Pair<String, String> pair, Similarity similarity) -> pair.getLeft().length() + pair.getRight().length())
    .build();
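To get a feeling for what a maximum weight means with the ID-length rule used above, the following JDK-only sketch adds up the weights of a few keys. `WeightBudgetSketch` and the example IDs are made up for illustration, and `ToIntBiFunction` merely stands in for the weigher interface.

```java
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

// Hypothetical helper; mirrors the ID-length weighing rule without any cache library.
class WeightBudgetSketch {
    // Weight of one entry = length of the first ID + length of the second ID.
    static final ToIntBiFunction<String, String> ID_LENGTH_WEIGHER =
        (leftId, rightId) -> leftId.length() + rightId.length();

    // Sum of the weights of all given keys.
    static int totalWeight(List<Map.Entry<String, String>> keys) {
        int total = 0;
        for (Map.Entry<String, String> key : keys) {
            total += ID_LENGTH_WEIGHER.applyAsInt(key.getKey(), key.getValue());
        }
        return total;
    }
}
```

Under this rule a key such as ("q1", "c1") weighs 4, while ("query-0001", "case-0001") weighs 19, so a weight limit admits far more entries with short IDs than with long ones.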

Time-based eviction

Another way to evict entries is to let them expire after a certain period of time. The period can be measured from the moment an entry was written to the cache, from the last time it was accessed, or both.

CaffeineSimilarityCache expiringSimCache = CaffeineSimilarityCache.builder()
    .withExpirationAfterWrite(Duration.ofMillis(10))
    .withExpirationAfterAccess(Duration.ofMillis(10))
    .build();
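The difference between the two timers can be shown with a small JDK-only sketch. It is an illustration under assumptions, not the cache's implementation: `ExpiringCacheSketch` is a hypothetical class, expiry is checked lazily on read, and the clock is injected so that the behavior can be observed without waiting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical cache with both an after-write and an after-access expiry.
class ExpiringCacheSketch {
    private static final class Entry {
        final double similarity;
        final long writtenAt;
        long lastAccessedAt;

        Entry(double similarity, long now) {
            this.similarity = similarity;
            this.writtenAt = now;
            this.lastAccessedAt = now; // a write also counts as an access
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long expireAfterWriteMillis;
    private final long expireAfterAccessMillis;
    private final LongSupplier clock; // returns the current time in milliseconds

    ExpiringCacheSketch(long expireAfterWriteMillis, long expireAfterAccessMillis, LongSupplier clock) {
        this.expireAfterWriteMillis = expireAfterWriteMillis;
        this.expireAfterAccessMillis = expireAfterAccessMillis;
        this.clock = clock;
    }

    void put(String key, double similarity) {
        entries.put(key, new Entry(similarity, clock.getAsLong()));
    }

    // Returns the cached value, or null if the entry is missing or has expired.
    Double get(String key) {
        Entry entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        long now = clock.getAsLong();
        boolean expired = now - entry.writtenAt >= expireAfterWriteMillis
            || now - entry.lastAccessedAt >= expireAfterAccessMillis;
        if (expired) {
            entries.remove(key);
            return null;
        }
        entry.lastAccessedAt = now; // reading restarts only the access timer
        return entry.similarity;
    }
}
```

Reading an entry regularly keeps the access timer from running out, but the write timer still removes the entry once it is old enough.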

Once the cache is configured, initialize it before use:

simCache.init();

In-Memory Caffeine Cache #

This cache only exists at runtime. Nothing will be stored after the program has stopped running.

Setting the similarity of two data objects will add a new entry to the cache. If an entry with this pair of data objects already exists, the cache will replace it.

simCache.setSimilarity(queryObject, caseObject, similarity);

You can get the similarity of two data objects as shown below. If the cache does not contain an entry for this pair of data objects, it returns null.

simCache.getSimilarity(queryObject, caseObject);

It is possible to clear the cache of all its entries.

simCache.clear();

A single entry can be deleted manually by specifying the IDs of both data objects. Note that every cache entry is unique.

simCache.deleteEntryByIds(queryObjectId, caseObjectId);

Multiple entries can be deleted manually by a single data object ID. It does not matter whether the ID belongs to the first or the second data object.

simCache.deleteEntriesById(dataObjectId);
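The operations described in this section can be mimicked with a plain map keyed by ID pairs. The sketch below is JDK-only and makes several assumptions: `PairKeyedCacheSketch` is a hypothetical class, it takes string IDs and `double` similarities instead of data objects, and it treats the pair (a, b) as a different key than (b, a). The method names follow the calls shown above only to make the correspondence easy to see.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory cache keyed by an ordered pair of data object IDs.
class PairKeyedCacheSketch {
    private final Map<Map.Entry<String, String>, Double> entries = new HashMap<>();

    // Adds a new entry or replaces the existing one for this pair.
    void setSimilarity(String queryId, String caseId, double similarity) {
        entries.put(Map.entry(queryId, caseId), similarity);
    }

    // Returns null if no entry exists for this pair.
    Double getSimilarity(String queryId, String caseId) {
        return entries.get(Map.entry(queryId, caseId));
    }

    // Removes at most one entry, identified by both IDs.
    void deleteEntryByIds(String queryId, String caseId) {
        entries.remove(Map.entry(queryId, caseId));
    }

    // Removes every entry in which the ID appears in either position.
    void deleteEntriesById(String id) {
        entries.keySet().removeIf(key -> key.getKey().equals(id) || key.getValue().equals(id));
    }

    void clear() {
        entries.clear();
    }

    int size() {
        return entries.size();
    }
}
```

Deleting by a single ID scans all keys, so it can remove several entries at once, whereas deleting by both IDs is a direct lookup.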

File-based Caffeine Cache #

This cache inherits all methods of the in-memory cache. In addition, it is persisted to disk as a CSV file in which each record (line) represents one cache entry. On initialization, entries that already exist in the file are loaded into the cache. If an entry is removed in any way, automatically or manually, the corresponding record is removed as well. Clearing the cache also deletes the CSV file.

The default path for the CSV file is user/ProCAKE/similarityCache.csv. This can be changed when building the cache.

FileBasedCaffeineSimilarityCache cache = FileBasedCaffeineSimilarityCache.builder()
    .withCachePath(path)
    .build();