Similarity Measures #

Many retrieval methods rely on simple text matching and ignore semantics, so they can miss relevant results and waste resources. To address this, Case-Based Reasoning (CBR) uses similarity measures to approximate the utility of a stored solution for a new problem: problem descriptions are compared and cases are ranked by similarity, so that the most promising cases are considered first. The local-global principle is a well-established technique for attribute-value representations. It computes a local similarity for each attribute and aggregates these into a global similarity, taking the importance and utility of each attribute into account (further information at Side Information). The resulting similarity model is built on top of a data model and uses a mathematical function, the similarity measure, to determine the similarity between objects based on their common data class.
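The local-global principle can be illustrated with a small, self-contained sketch. It is not ProCAKE code: the class LocalGlobalSketch, its method names, the two example attributes and the choice of a weighted average as the global aggregation are all assumptions made for illustration only.

```java
/** Illustrative sketch of the local-global principle; hypothetical, not part of ProCAKE. */
final class LocalGlobalSketch {

    /** Local measure for a numeric attribute: 1 minus the distance normalized by the value range. */
    static double numericSimilarity(double query, double caseValue, double min, double max) {
        double range = max - min;
        if (range <= 0.0) {
            // Degenerate range: fall back to an exact match.
            return query == caseValue ? 1.0 : 0.0;
        }
        return Math.max(0.0, 1.0 - Math.abs(query - caseValue) / range);
    }

    /** Local measure for a symbolic attribute: exact match. */
    static double exactMatchSimilarity(String query, String caseValue) {
        return query.equals(caseValue) ? 1.0 : 0.0;
    }

    /** Global measure (assumed here): weighted average of the local similarities. */
    static double weightedAverage(double[] localSimilarities, double[] weights) {
        if (localSimilarities.length != weights.length) {
            throw new IllegalArgumentException("One weight per local similarity is required");
        }
        double weightedSum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            weightedSum += weights[i] * localSimilarities[i];
            weightSum += weights[i];
        }
        if (weightSum <= 0.0) {
            throw new IllegalArgumentException("Weights must sum to a positive value");
        }
        return weightedSum / weightSum;
    }

    public static void main(String[] args) {
        // Hypothetical attributes: "price" (numeric, weight 3) and "color" (symbolic, weight 1).
        double priceSim = numericSimilarity(100.0, 150.0, 0.0, 200.0); // 1 - 50/200 = 0.75
        double colorSim = exactMatchSimilarity("red", "blue");         // 0.0
        double global = weightedAverage(
                new double[] {priceSim, colorSim}, new double[] {3.0, 1.0});
        System.out.println(global); // (3 * 0.75 + 1 * 0.0) / 4 = 0.5625
    }
}
```

The weights express the importance of each attribute: here the price dominates the result, so a mismatch in color lowers the global similarity only moderately.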

Similarity Measures in ProCAKE #

ProCAKE provides several similarity measures.

The following figures give an overview of the built-in similarity measures and how they are linked to the system data classes:

Basic Measures #

[Figure: Similarity-Measures-1]

NEST Graph Measures #

The following figure depicts similarity measures for graph-based data classes:

[Figure: Similarity-Measures-2]

Object-Oriented Workflow Measures #

The following figure depicts similarity measures for object-oriented workflow data classes:

[Figure: Similarity-Measures-3]

Basic Similarity Measure Parameters #

A few parameters can be set for every measure:

  • name: Every measure requires a unique name, which is later used to select the measure for a similarity computation. If two measures share the same name, an exception is thrown (unless forceOverride is set, see below).
  • class: Every measure must be assigned a data class. The measure can only be applied to objects of this data class and its subclasses. For example, a String measure requires objects of the String class, while a numeric measure can work on Integer or Double objects. If an unsuitable class is set, for example Integer for a String measure, an exception is thrown when the measure is created.
  • default: Every measure can be marked as a default. If default is set to true, the measure is used as the default measure for similarity computations on the specified class, but only if no measure is set explicitly for the computation. If more than one measure for the same class is marked as default, the first one is used. If default is not set explicitly, it is true.
  • forceOverride: To override an already defined measure, this value has to be set to true. Only in this case does the name of the measure not need to be unique. By default, this value is false.

These parameters can be set in the XML similarity file, for example:

<SMNumericMeasure name="SMNumericMeasure" class="Integer" default="true" forceOverride="false"/>

In this case, a numeric similarity measure is created. Its name has to be unique because forceOverride is false, so it cannot replace an existing measure of the same name. It can be applied to Integer objects and is the default measure for this class.

To set these parameters at runtime, the following code can be used:

// Create a numeric measure for the Integer system class.
SMNumericMeasure smNumericMeasure = (SMNumericMeasure) simVal.getSimilarityModel().createSimilarityMeasure(SMNumericMeasure.NAME, ModelFactory.getDefaultModel().getIntegerSystemClass());
// forceOverride = false: the name must not clash with an existing measure.
smNumericMeasure.setForceOverride(false);
// Register the measure under its unique name.
simVal.getSimilarityModel().addSimilarityMeasure(smNumericMeasure, "SMNumericMeasure");
// Make it the default measure for Integer objects.
simVal.getSimilarityModel().setDefaultSimilarityMeasure(ModelFactory.getDefaultModel().getIntegerSystemClass(), "SMNumericMeasure");

Here, simVal refers to a SimilarityValuator (described here).
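The interplay of name, default and forceOverride described above can be modeled with a toy registry. This is a minimal sketch and not the ProCAKE implementation: the classes MeasureRegistrySketch and MeasureDef, the use of plain strings for data classes and the choice of IllegalStateException are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the name/default/forceOverride rules; hypothetical, not ProCAKE code. */
final class MeasureRegistrySketch {

    /** Minimal stand-in for a similarity measure definition. */
    static final class MeasureDef {
        final String name;
        final String dataClass;
        final boolean isDefault;
        final boolean forceOverride;

        MeasureDef(String name, String dataClass, boolean isDefault, boolean forceOverride) {
            this.name = name;
            this.dataClass = dataClass;
            this.isDefault = isDefault;
            this.forceOverride = forceOverride;
        }
    }

    private final Map<String, MeasureDef> byName = new LinkedHashMap<>();
    private final Map<String, String> defaultByClass = new LinkedHashMap<>();

    /** Registers a measure. The sketch assumes an overriding measure keeps its data class. */
    void add(MeasureDef measure) {
        if (byName.containsKey(measure.name) && !measure.forceOverride) {
            // Duplicate names are only accepted when forceOverride is true.
            throw new IllegalStateException("Duplicate measure name: " + measure.name);
        }
        byName.put(measure.name, measure);
        if (measure.isDefault) {
            // If several measures are marked as default for a class, the first one wins.
            defaultByClass.putIfAbsent(measure.dataClass, measure.name);
        }
    }

    /** Returns the name of the default measure for a data class, or null if none is set. */
    String defaultFor(String dataClass) {
        return defaultByClass.get(dataClass);
    }

    int size() {
        return byName.size();
    }

    public static void main(String[] args) {
        MeasureRegistrySketch registry = new MeasureRegistrySketch();
        registry.add(new MeasureDef("NumericA", "Integer", true, false));
        registry.add(new MeasureDef("NumericB", "Integer", true, false));
        System.out.println(registry.defaultFor("Integer")); // NumericA

        try {
            registry.add(new MeasureDef("NumericA", "Integer", true, false));
        } catch (IllegalStateException e) {
            System.out.println("rejected duplicate");
        }

        // With forceOverride = true the existing definition is replaced.
        registry.add(new MeasureDef("NumericA", "Integer", true, true));
        System.out.println(registry.size()); // 2
    }
}
```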

Different Strategies: Vague Knowledge #

If no value is assigned to an attribute, a strategy is needed to compute the similarity despite the missing value. Three strategies are distinguished. The chosen strategy must be stated in the domain-specific SimilarityModel.

  • Optimistic Strategy: Unknown values are assumed to argue for similarity (Constant: OPTIMISTIC).
  • Pessimistic Strategy: Unknown values are assumed to argue against similarity (Constant: PESSIMISTIC).
  • Average Strategy: Unknown values are assumed to contribute an expected similarity value E. This strategy requires calculating the expectancy value E, which is not always possible (Constant: AVERAGE).
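The three strategies can be sketched as follows. This is an illustration under stated assumptions and not ProCAKE code: the class VagueKnowledgeSketch and its methods are hypothetical, and mapping OPTIMISTIC to 1.0 and PESSIMISTIC to 0.0 is one plausible reading of "argue for" and "argue against" similarity.

```java
/** Illustrative handling of unknown attribute values; hypothetical, not part of ProCAKE. */
final class VagueKnowledgeSketch {

    enum Strategy { OPTIMISTIC, PESSIMISTIC, AVERAGE }

    /**
     * Similarity contributed by an attribute whose value is unknown.
     * expectedSimilarity is the expectancy value E; pass null if it cannot be calculated.
     */
    static double similarityForUnknown(Strategy strategy, Double expectedSimilarity) {
        switch (strategy) {
            case OPTIMISTIC:
                return 1.0; // assumption: unknown values count as a full match
            case PESSIMISTIC:
                return 0.0; // assumption: unknown values count as a full mismatch
            case AVERAGE:
                if (expectedSimilarity == null) {
                    throw new IllegalStateException("No expectancy value E available");
                }
                return expectedSimilarity;
            default:
                throw new AssertionError("Unhandled strategy: " + strategy);
        }
    }

    /** Placeholder local measure (exact match) that falls back to the strategy for missing values. */
    static double attributeSimilarity(
            Double query, Double caseValue, Strategy strategy, Double expectedSimilarity) {
        if (query == null || caseValue == null) {
            return similarityForUnknown(strategy, expectedSimilarity);
        }
        return query.equals(caseValue) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(attributeSimilarity(5.0, null, Strategy.OPTIMISTIC, null));  // 1.0
        System.out.println(attributeSimilarity(5.0, null, Strategy.PESSIMISTIC, null)); // 0.0
        System.out.println(attributeSimilarity(5.0, null, Strategy.AVERAGE, 0.4));      // 0.4
    }
}
```

The AVERAGE branch fails explicitly when no expectancy value is supplied, which mirrors the remark that E cannot always be calculated.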