# Collection #

The following measures for Collection objects are implemented:

For the following examples, the class `WeekdaysList`

is used. This is defined in the model as follows:

```
<ListClass name="WeekdayList">
<ElementClass name="Weekdays"/>
</ListClass>
```

The class uses the String class `Weekdays`

, which can be created as follows:

```
<StringClass name="Weekdays">
<InstanceTotalOrderPredicate>
<Value v="Monday"/>
<Value v="Tuesday"/>
<Value v="Wednesday"/>
<Value v="Thursday"/>
<Value v="Friday"/>
<Value v="Saturday"/>
<Value v="Sunday"/>
</InstanceTotalOrderPredicate>
</StringClass>
```

## Isolated Mapping #

The *Isolated Mapping* measure computes the similarity of two collection classes by finding the best mapping for each element.
Every element of the query is compared with every element of the case and the best possible matching is returned. These best similarities are summed up and divided by the size of the query.
It can happen, that many elements are mapped on the same case element. For example a query with the size of 10 can be mapped on a case with the size of 1. Nevertheless, this mapping can happen, when the case has also 10 elements, of which 9 are never mapped. If an optimal mapping is wanted, where each query item has only one mapping partner, the *Mapping* measure has to be used.

The following parameters can be set for this similarity measure.

Parameter | Type/Range | Default Value | Description |
---|---|---|---|

similarityToUse | Similarity Measure (String) | `null` | The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no `similarityToUse` is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned. |

The definition of a *Isolated Mapping* measure can look like:

```
<CollectionIsolatedMapping name="SMCollectionIsolatedMapping" class="WeekdayList" similarityToUse="SMStringEquals" default="false"/>
```

This measure would refer to the collection class *WeekdayList*, which is a list as the name implies. To compare the elements of this list, which are elements of the String type, the *Equals* measure is used.

To create this measure during runtime, the following code would be used:

```
SMCollectionIsolatedMapping smCollectionIsolatedMapping = (SMCollectionIsolatedMapping) simVal.getSimilarityModel().createSimilarityMeasure(SMCollectionIsolatedMapping.NAME, ModelFactory.getDefaultModel().getClass("WeekdayList"));
smCollectionIsolatedMapping.setSimilarityToUse("SMStringEquals");
simVal.getSimilarityModel().addSimilarityMeasure(smCollectionIsolatedMapping, "SMCollectionIsolatedMapping");
```

Here, the collection class *WeekdayList* and the *String Equals* similarity have to be defined somewhere else.

In the following example two collections exist, which are both objects of the class *WeekdayList*: The collection *Weekdays* contains all days from Monday until Sunday.
The collection *Workdays* contains all workdays, Monday until Friday. The first example applies for both types of collection, namely lists or sets, as the order of the elements is not relevant.

In this example, the *Isolated Mapping* measure is used on those two collections. *Weekdays* is the query and *Workdays* the case. The following mapping results:

The measure searches for the most similar element in the case for each element in the query. The elements *Monday*, *Tuesday*, *Wednesday*, *Thursday* and *Friday* appear in query and case, so they are mapped on themselves. Thus, the similarity for each of these mappings is 1.0.

The elements *Saturday* and *Sunday* only occur in the query. They both need a mapping partner, so the measure searches for the most similar element in the case. In this example, *Sunday* is mapped on *Monday* and *Saturday* on *Friday*. The mappings of these elements depend on the similarity measures, which are used for the comparison of the elements. So this mapping is just one possibility. It was chosen, because *Monday* and *Sunday* have the highest *Levenshtein* similarity. *Saturday* is also compared with every element in the case, but is equal similar to all elements except *Wednesday*. In this case *Friday* was chosen randomly.

The overall similarity of both collections is computed by summing up single similarities using the *Levenshtein* measure in this example. The similarity value of *Monday*, *Tuesday*, *Wednesday*, *Thursday* and *Friday* is 1.0. The similarity between *Sunday* and *Monday* would be
\(\frac{2}{3}\)
and between *Saturday* and *Friday* would be \(\frac{1}{2}\)
. So the total similarity would be calculated like this:

If the *String equals* measure would be used, *Saturday* and *Sunday* would have random partners. These mappings would have the similarity 0.0, so the total similarity would be \(\frac{5}{7}\)
.

As seen in this example, the *Isolated Mapping* measure can map several query items on one case item. In an extreme case, it can happen, that all query items are mapped on just one element of the case, as can be seen in the next example.

Here, the collection *Weekdays* is used again, but this time as case. The query is a list, which contains the *Monday* item 5 times. It has to be a list, because a set, as the other existing type of collection, does not allow one element to appear several times.

Following mapping results:

The item *Monday* is mapped on the *Monday* element in the case for 5 times. The resulting similarity is 1.0.

## Mapping #

The measure performs the best possible mapping of two collections using the A* algorithm. It performs a best possible mapping of the query items to the case items of the collection. Note that not all case items have to be utilized in the mapping, but all query items are tried to map. Each query element can only be mapped on a case element, which has no mapping partner, yet. If more query items than case items exist, some query items will get a similarity 0 (which will influence the overall similarity). If more case items than query items exist, then this will not interfere with the similarity-value (rather gives more possible match-scenarios).

A restriction for the A* search can be set by using the parameter `maxQueueSize`

. It is a restriction for the size of the internal queue of the A* algorithm. It can be used, to reduce the runtime, but can then also cause the best possible mapping not to be found. It’s default value is `1000`

. It requires values larger than 0. If a negative value is set, the pruning of the queue is disabled. In this case, the optimal mapping will be found, but the runtime won’t be limited.

The following parameters can be set for this similarity measure.

Parameter | Type/Range | Default Value | Description |
---|---|---|---|

similarityToUse | Similarity Measure (String) | `null` | The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no `similarityToUse` is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned. |

maxQueueSize | Size (int) | `1000` | This parameter is a restriction for the size of the internal queue of the A* algorithm. It can be used, to reduce the runtime, but can then also cause the best possible mapping not to be found. It requires values larger than 0. If a negative value is set, the pruning of the queue is disabled. In this case, the optimal mapping will be found, but the runtime won’t be limited. |

The definition of a *Mapping* measure can look like:

```
<CollectionMapping name="SMCollectionMapping" class="WeekdayList" similarityToUse="SMStringEquals" maxQueueSize="1100"/>
```

This measure would refer to the collection class *WeekdayList*, which is a list as the name implies. To compare the elements of this list, which are elements of the String type, the *Equals* measure is used. The maximum size of the queue is set here to 1100.

To create this measure during runtime, the following code would be used:

```
SMCollectionMapping smCollectionMapping = (SMCollectionMapping) simVal.getSimilarityModel().createSimilarityMeasure(SMCollectionMapping.NAME, ModelFactory.getDefaultModel().getClass("WeekdayList"));
smCollectionMapping.setSimilarityToUse("SMStringEquals");
smCollectionMapping.setMaxQueueSize(1100);
simVal.getSimilarityModel().addSimilarityMeasure(smCollectionMapping, "SMCollectionMapping");
```

Here, the collection class *WeekdayList* and the *String Equals* similarity have to be defined somewhere else.

In this example two collections exist: The collection *Weekdays* contains all days from Monday until Sunday.
The collection *Workdays* contains all workdays, Monday until Friday. It neither matters, if the collections are lists or sets, nor is the order important.

In this example, the *Mapping* measure is used on those two collections. *Weekdays* is the query and *Workdays* the case. Following mapping results:

The measure searches the mapping partner with the highest similarity for every element of the query. Then the similarities are compared and the mapping with the highest similarity is chosen, the other mappings are not considered anymore. The search for mapping partners restarts, but neither the mapped query item nor the mapped case element are used anymore. So, in this example, *Saturday* and *Sunday* have no mapping partner, because for each case element exists a query partner with a higher similarity.

Because the query items were mapped on their representations in the case, the similarity of each of these mappings is 1.0. *Saturday* and *Sunday* could not be mapped, so they are treated like their mapping has the similarity 0.0. So the total similarity is \(\frac{5}{7}\)
. In this example, the similarity is independent of the measure, which was used for the comparison of the items.

In the following is a second example. The element *Monday* is added to the collection *Workdays*. Because there is already one *Monday* in it, this collection must be a list. *Weekdays* again represents the query and *Workdays* the case.

The following mapping results:

Unlike the first example, there is now one available mapping partner for *Saturday* and *Sunday*. The *Monday* item of the query is mapped on one of the *Monday* items in the case, which is chosen randomly. The *Monday* item without a current mapping partner is compared to each query item. This repeats until the similarity value for the mapping of this item with *Saturday* or *Sunday* becomes the highest similarity. The mapping of this *Monday* element depends on the measure, which is used for the item comparison.

In the example above, the *Levenshtein* measure was used, so *Monday* was most similar to *Sunday*. The similarity between these elements is \(\frac{2}{3}\)
, so the total similarity is computed like:
\(\frac{1.0 + 1.0 + 1.0 + 1.0 + 1.0 + \frac{2}{3} + 0.0}{7} = 0.81\)

If the *String equals* measure would be used, *Saturday* would be the mapping partner of *Monday*, because it’s the first computation by this order. So the total similarity for this case is \(\frac{5}{7}\)
.

## List Mapping #

The *List Mapping* measure compares two lists and also considers the order of elements. Other collection types are not supported. The similarity between two lists can be computed with two different meanings:

*Exact*: This means, that the two lists must have the same length to be compared. Each element is compared with the element at the same position in the other list. The similarities for each comparison are summed and divided through the length of the list.*Inexact*: This means, that one of the two lists can be a sublist of the other. The similarities for possible sublists are computed like in the exact comparison. In the end, the best similarity is reported. In both cases the measure follows the exact order of the list.

The following parameters can be set for this similarity measure.

Parameter | Type/Range | Default Value | Description |
---|---|---|---|

similarityToUse | Similarity Measure (String) | `null` | The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no `similarityToUse` is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned. |

contains | Flag (String) | `inexact` | This parameter defines the meaning of the comparison of two lists. If it’s set to `exact` , the two lists must have the same length to be compared. Each element is compared with the element at the same position in the other list. The similarities for each comparison are summed and divided through the length of the list. If the parameter is set to `inexact` , one of the two lists can be a sublist of the other. The similarities for possible sublists are computed like in the exact comparison. In the end, the best similarity is reported. In both cases the measure follows the exact order of the list. |

The definition of a *List Mapping* measure can look like:

```
<ListMapping name="SMListMapping" class="WeekdayList" contains="exact" similarityToUse="SMStringEquals" default="false"/>
```

This measure would refer to the collection class *WeekdayList*, which is a list as the name implies. The keyword *contains* indicates, that the *exact* strategy is used. To compare the elements of this list, which are of the type String, the *Equals* measure is used.

To create this measure during runtime, the following code would be used:

```
SMListMapping smListMapping = (SMListMapping) simVal.getSimilarityModel().createSimilarityMeasure(SMListMapping.NAME, ModelFactory.getDefaultModel().getClass("WeekdayList"));
smListMapping.setContainsExact();
smListMapping.setSimilarityToUse("StringEqualsCaseSensitive");
simVal.getSimilarityModel().addSimilarityMeasure(smListMapping , "SMListMapping");
```

Here, the collection class *WeekdayList* and the *StringEquals* similarity have to be defined somewhere else.

In this example two lists exist: The list *Weekdays* contains all days from Monday until Sunday. The list *Workdays* contains all workdays, Monday until Friday. This measure works only on lists, so sets are not mentioned here.

In the following, *List Mapping* is used on the two lists. The following mapping results:

The sizes of the lists are different, so some items cannot be mapped. When using the *exact mapping* strategy, the similarity would be 0.0.

The *inexact mapping* strategy ignores the sizes of the lists. So it is checked, whether one list is a sublist of the other. In this example, three possible sublists are found. For each of them, the total similarity is computed, and the one with the highest similarity is chosen. Here, *Workdays* is a sublist of *Weekdays*, so the similarity is 1.0 and this mapping is chosen.

In the following is a second example. Here, *Workdays* is the query and *Weekdays* the case. But the order of *Weekdays* is different, so *Thursday* and *Wednesday* swapped their positions.

When using the *exact mapping* strategy, the similarity would be 0.0 and no mapping would be tried. So the following mapping just results, when using the *inexact mapping* strategy:

The measure finds three sublists again and computes the similarities. It identifies the second to the sixth element as the mapping partners with the highest similarity. But this measure follows the exact order of the items. So the element *Wednesday* is mapped on *Thursday*, just as *Thursday* on *Wednesday*. So, when using the *String equals* measure, the similarity is \(\frac{3}{5} = 0.6\)
.

## List Correctness #

The *List Correctness* measure compares two lists by the order of their elements. Other collection types aren’t supported.

This similarity measure uses the *Correctness* metric to compare two lists. This metric is the ratio of concordant and discordant pairs regarding all pairs. Its value is usually in the range of \(-1\)
to \(1\)
. \(-1\)
means, that only discordant pairs are given and \(1\)
means that only concordant pairs are given. Values in the range of \([0,1]\)
are used directly as similarity values. For values in the range of \([-1.0)\)
, a parameter `discordantParameter`

\(p_{d}\)
can be set. It defines the maximum possible similarity, if a correctness of \(-1\)
would be reached. The similarity values decrease linearly between this value till 0. The default value for this parameter is \(1.0\)
.

The meaning of concordant and discordant pairs is described here.

The similarity is computed by the following formula

\(sim_{correctness} = \begin{cases} \frac{|C|-|D|}{|C|+|D|}, &\text{if} \ |C| \geq |D|\\ \\ \bigl|\frac{|C|-|D|}{|C|+|D|}\bigl| \ * \ p_{d}, &\text{if} \ |C| < |D|\\ \end{cases}\)The following parameters can be set for this similarity measure.

Parameter | Type/Range | Default Value | Description |
---|---|---|---|

discordantParameter | Weight (double) | `1.0` | This parameter defines the maximum possible similarity, if a correctness of \(-1\) would be reached. The similarity values decrease linearly between this value till 0. |

The definition of a *List Correctness* measure can look like:

```
<ListCorrectness name="SMListCorrectness" class="WeekdayList" discordantParameter="0.5" default="false"/>
```

This measure would refer to the collection class *WeekdayList*, which is a list as the name implies. The discordant parameter is set to \(0.5\)
. So, if only discordant pairs exists, the similarity would be \(0.5\)
.

To create this measure during runtime, the following code would be used:

```
SMListCorrectness smListCorrectness = (SMListCorrectness) simVal.getSimilarityModel().createSimilarityMeasure(SMListCorrectness.NAME, ModelFactory.getDefaultModel().getClass("WeekdayList"));
smListCorrectness.setDiscordantParameter(0.5);
simVal.getSimilarityModel().addSimilarityMeasure(smListCorrectness, "SMListCorrectness");
```

In this example two lists exist: The first list contains all days from Monday until Wednesday in order of the week. The second list contains the days in the following order: Monday, Wednesday, Tuesday. Here, the order of the element Monday to the element Tuesday and of Monday to the element Wednesday matches. So, two concordant pairs can be found. Only the order of the element Tuesday to the element Wednesday doesn’t match. So, this pair is discordant. Because more concordant than discordant pairs does exits, the following similarity would be computed: \(\frac{2-1}{2+1}=\bigl\frac{1}{3}\) .

In another example, the first list containing the three days in order of the week exist, too. An additional list contains these elements in reverse order: Wednesday, Tuesday, Monday. Here, all elements occur in a different order. So, only discordant pairs could be computed. With no `discordantParameter`

set, the following similarity would be computed: \(\bigl|\frac{0-3}{0+3}\bigl| * 1= \bigl|-1\bigl| \ * \ 1 = 1\)
. If the parameter was set to \(0.5\)
, the similarity would be computed as follows: \(\bigl|\frac{0-3}{0+3}\bigl| \ * \ 0.5= \bigl|-1\bigl| \ * \ 0.5 = 0.5\)

## Sequence Matching #

Lists can be regarded as time series. Therefore, two algorithms (Smith Waterman Algorithm (SWA) and Dynamic Time Warping (DTW)) based on sequence matching have been adapted for comparing two list objects. These measures can be seen as an extension to the basic ListMapping. Both approaches find alignments between the two series by iteratively aligning sub-sequences until the best matching sub-sequences are identified. These alignments preserve the temporal order of the series. By default, the final alignment must include the last item of the query, however, this behavior can be disabled:

```
dpSimMeasure.setForceAlignmentEndsWithQuery(false);
```

The following mapping represents a valid alignment only if `setForceAlignmentEndsWithQuery(false)`

is set. Inversely, by default, this mapping is invalid and would not be found by either algorithm. Notice the gap in the alignment, which is interpreted as a *deletion* of *Wednesday*. This alignment is an example of the SWA approach:

In the following, the individual characteristics of both algorithms are outlined. Afterwards, their implementation and integration in ProCAKE is described.

### Smith Waterman Algorithm (SWA) #

SWA is based on the Needleman-Wunsch Algorithm and is able to find local alignments of two data sequences. These alignments are constructed by iteratively either aligning, inserting or deleting elements from one series with regard to the other. The approach therefore is somewhat similar to the Levenshtein distance approach. Instead of weekdays, the following example will compare two strings (which are also a sequence) \(A= \text{'}AYBCMM\text{'}\) , with length \(|A|\) , and \(B=\text{'}ABCKMMH\text{'}\) , with length \(|B|\) . String \(A\) is indexed using \(i = 1,...,|A|\) and string \(B\) is indexed by \(j = 1,...,|B|\) . With this convention, \(a_i\) depicts the i-th character from \(A\) . The steps of this example are illustrated in the figure below.

SWA calculates the alignment and thereupon the similarity score in the following way:

- First, the two parameters of the algorithm, which in this case are functions, have to be specified.

- A local similarity measure that will be used to compare the individual items in the sequences must be set. In ProCAKE, this measure can either be set manually or the measure based on the similarity model will be used. In this example, a simple
*CharacterEqual*measure is applied:

- The second parameter is a gap penalty function used to penalize indels, i.e., insertions or deletions, in the alignment. An insertion in string \(A\) will appear as a gap in \(A\) while a deletion from \(A\) will appear as a gap in \(B\) in the final alignment. For this purpose, the gap character \(\text{'}-\text{'}\) is introduced. Here, a constant function will be used, that assigns a penalty of \(0.5\) to every indel operation. In principle, this penalty can depend on the specific item being deleted or inserted (A function for this behavior can be set via code): \( W = -0.5 \)

Next, the scoring matrix \(H\) is initialized (see (a)). \(a_0\) and \(b_0\) are identified with the gap character and the first row and column of the matrix are set to zero: \(H_{0,j}=H_{i,0}=0\) . Entry \(H_{i,j}\) is interpreted as the subsequence similarity of \(a'_i = a_1 a_2 .. a_i\) and \(b'_j = b_1 b_2 .. b_j\) for \(i,j > 0\) . When either \(i=0\) or \(j=0\) , then \(a_0=\text{'}-\text{'}\) or \(b_0=\text{'}-\text{'}\) and hence, similarity of a string with the empty string is set to 0.

The scoring matrix is filled in according to the following formula (see (b)):

\( H_{i,j} = \text{max} \begin{cases} \quad 0, \\ \quad H_{i-1,j-1} + s(a_i,b_j), &\text{match/mismatch}\\ \quad H_{i-1,j} + s(a_i, -), &\text{deletion}\\ \quad H_{i,j-1} + s(-, b_j) &\text{insertion} \end{cases} \)The function \(s\) has to return a positive similarity value for a match or mismatch and a negative value for an indel operation (including zero in both cases). In this example, \(s\) is defined as follows:

\( s(a,b) = \begin{cases} \quad sim(a,b) &\text{if } a \in A \text{ and } b \in B\\ \quad W &\text{if } a=\text{'}-\text{'} \text{ or } b=\text{'}-\text{'} \end{cases} \)A match corresponds to a diagonal, an insertion to a vertical and a deletion to a horizontal step in the matrix (see (c)). The values’ origins must be stored during calculation if one wants to visualize the optimal alignment later.

In order to obtain the alignment, one has to find the maximum value in the matrix and trace back a path to a cell of value zero. This is done by successively stepping either vertically to the top, horizontally to the left or diagonally to the top left adjacent cell according to the inverted direction of the initial score calculation. Parts (c) and (d) show the alignment path and the final alignment, respectively. The strings \(A\) and \(B\) is assigned a similarity score of 4 which is significantly less than the maximum obtainable score of \(7=sim(A,A)\) if \(A\) was a subsequence of \(B\) .

### Dynamic Time Warping (DTW) #

DTW in its original application was developed in order to properly align distorted time series data collected from voice recordings. In contrast to simple Euclidean Distance which only compares values with the same timestamp, DTW is able to warp elements of one series to an element of the other series with a different timestamp. For this reason, the comparison of the two strings ‘abbcc’ and ‘abc’ using DTW would result in maximum similarity, since the two sections of length two, ‘bb’ and ‘cc’, from the first string can be warped onto the single characters ‘b’ and ‘c’ of the second string respectively. Gaps, for the same reason, do not exist in DTW.

Calculation of this method is very similar to SWA, however the interpretation of the obtained path through the matrix is fundamentally different. While horizontal or vertical steps in SWA represent either a deletion or insertion, in DTW these steps simply imply that multiple warps onto the same element occur. Natively, DTW finds a distance value. Therefore, in the following example two series of numeric data is compared: \(A=(1,3,5,3,1,2,3,4,5,6)\) and \(B=(2,5,2,1,2,4,6)\) . The same notation as in the previous section will be used.

DTW only needs the local distance function as a parameter. In this example, simple Euclidean Distance is used: \(dist(a,b) = |a-b|\)

Again, the value of the cell \(H_{i,j}\) from the scoring matrix is interpreted as the DTW-distance of the subsequences \(a'_i = a_1 a_2 .. a_i\) and \(b'_j = b_1 b_2 .. b_j\) for \(i,j > 0\) . Therefore, the first row and column, with the exception of \(H_{0,0}\) , are initialized to be infinitely large, since the distance of a series to an empty series is not calculable. \(H_{0,0}\) is initialized to zero (see (a)).

The matrix is now again iteratively filled in with the following formula (see (b)). The sources of each cell’s entry must be stored again.

\( H_{i,j} = \text{min} \begin{cases} \quad H_{i,j-1} + dist(a_i,b_j), &\text{step vertically}\\ \quad H_{i-1,j-1} + 2 * dist(a_i,b_j), &\text{step diagonally}\\ \quad H_{i-1,j} + dist(a_i,b_j) &\text{step horizontally} \end{cases} \)To find the warping path, one must start in the bottom right corner of the matrix, at \(H_{|A|,|B|}\) and follow the origins of the original calculation until \(H_{1,1}\) is reached (see (c)). Each cell \((i,j)\) in the obtained path represents an alignment, or a warp, of \(a_i\) onto \(b_j\) . When plotting the series and connecting the elements from the path, the warping path is visualized as in (d).

We now want to transform this approach to be used with similarity values instead of distances. This can be easily achieved by simply maximizing the scoring function \(H\) at each step. However, please note that due to the construction of DTW, this will result in the score growing in every step and every direction. Consider the example below: On the left, a similarity function was used that is obtained by inverting the distance. The resulting scoring matrix and alignment path is visualized below. As you can see, the alignment looks a little random and non-rational. On the right, the local similarity function was altered by mapping the score to the interval [-1,1]. Thereby, low similarity scores act as punishment and do not increase the score in the matrix. The result is shown is much more useful.

### Differences SWA vs DTW #

The main difference between both approaches lies in their basic similarity assumptions. In SWA, two series are regarded as highly similar, if a sub-sequence of series A can be converted into a sub-series of series B by means of as little indel operations as possible. On the other hand, DTW regards two series as highly similar, if a sub-sequence of series A can be projected onto a sub-sequence of series B by either stretching or compressing (i.e. warping) one sub-sequence onto the other. When interpreting the found alignments in the matrix for presenting them as mappings, differences emerge.

A typical mapping of SWA looks like this. Note the gap in the query that results from the deletion of *Tuesday*. This deletion results in a negative value being added as an indel punishment, resulting in a similarity of less than `1`

in this example:

On the other hand, a typical mapping of DTW looks like this. Note that instead of deleting one *Tuesday*, both are warped onto the singe *Tuesday* of the case. This has no effect on the final similarity score, which will be `1`

in this example:

### Implementation #

For the implementation in ProCAKE, the two presented approaches were abstracted into one similarity measure with two variations. The matrix is constructed using the following formula: \(H_{i,j} = max \begin{cases} \quad H_{i-1,j-1} &+ \ wTemp_i * ALG_{i,j}(diagonal),\\ \quad H_{i-1,j} &+ \ wTemp_i * ALG_{i,j}(horizontal),\\ \quad H_{i,j-1} &+ \ wTemp_i * ALG_{i,j}(vertical),\\ \quad 0 \end{cases}\)

Depending on the chosen approach (either SWA or DTW), \(ALG_{i,j}\) is replaced with:

\(SWA_{i,j}(x) := \begin{cases} \quad sim_{loc}(q_i,c_j)& \text{if } x=\text{'diagonal'},\\ \quad penaltyDeletion(q_i)& \text{if } x=\text{'horizontal'},\\ \quad penaltyInsertion(c_j)& \text{if } x=\text{'vertical'},\\ \quad 0& \text{otherwise} \end{cases} \\\) \(DTW_{i,j}(x) := \begin{cases} \quad 2 * sim_{loc}(q_i,c_j)& \text{if } x=\text{'diagonal'}, \\ \quad sim_{loc}(q_i,c_j)& \text{if } x=\text{'horizontal'}, \\ \quad sim_{loc}(q_i,c_j)& \text{if } x=\text{'vertical'}, \\ \quad 0& \text{otherwise} \end{cases} \\\)\(q_i\) represents the i-th query item and \(c_j\) the j-th case item. \(sim_{loc}\) is the local similarity measure used.

The factor \(wTemp\) allows for temporal weighting of the similarities: \(wTemp_i = \frac{1}{2^\frac{n-i}{h}}\) \(n\) is the length of the query. The halving distance \(h\) specifies the distance to the end where the events’ importance shall be one half. Therefore, with this formula, the weight of an event at the end of the sequence will be one, the weight of an event at \(end - h\) will be one half, and the weight of an event prior to \(end - h\) will be between 0 and one half. By using this formula, differences in the series, that are closer to the end of the sequences, will be weighted higher than dissimilarities further in the past.

Given the constructed scoring matrix, the final non-normalized similarity is calculated as follows:

\(sim_{raw}(q,c) = \max_{0 < j \leq m} H_{n,j}\)With \(m\) as the length of the case instance, this ensures that the alignment always ends with the last task of the query. Normalization is then conducted depending on the chosen approach.

For the SWA, the maximum similarity score is achieved when all elements are mapped. Thus, for all elements of the query, the weighting factors are summarized, assuming a similarity of 1. This value is used for normalization:

\(sim^{SWA}_{normed}(q, c) = \frac{sim_{raw}(q,c)}{\displaystyle\sum^{i<n}_{i=0} wTemp_i} \\\)As dynamic time warping allows warps and calculates the similarity score in the matrix differently, the alignment is included in the normalization. Let \(align =\{(0,0), \dots , (m,k)\}\) denote all cells from the alignment path. Let \(diag=\{(i_0,j_0),\dots,(i_p,j_p)\}\subseteq align\) denote all cells that originated from a diagonal step and let \(other = align \setminus diag \) denote all other assignments.

\(sim^{DTW}_{normed}(q, c) = \ \frac{sim_{raw}(q,c)}{\displaystyle\sum_{(i,j) \in diag} 2*wTemp_i + \displaystyle\sum_{(i,j) \in other} wTemp_i}\)#### Measure instantiation and configuration #

##### SWA #

For this SWA-based measure, the following parameters can be set.

Parameter | Type/Range | Default Value | Description |
---|---|---|---|

halvingDist | Percentage (Double, [0,1]) | `0.0` | This parameter specifies the distance to the end where the events’ importance shall be one half. |

constInsertionPenalty | Functional interface | `t -> -1.0` | This parameter is used to define a functional interface that gets the specific item being inserted as input. The output must be a negative numeric value. |

constDeletionPenalty | Functional interface | `t -> -1.0` | This parameter is used to define a functional interface that gets the specific item being inserted as input. The output must be a negative numeric value. |

localSimName | Similarity Measure (String) | `null` | The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no `localSimName` is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned. |

endWithQuery | Flag (boolean) | `true` | This parameter configures the binding to always end the alignment with the last task of the query can be disabled. |

The SWA-based measure can be instantiated during runtime as follows:

```
SMListSWA smListSWA = (SMListSWA) simVal.getSimilarityModel().createSimilarityMeasure(SMListSWA.NAME, ModelFactory.getDefaultModel().getClass("WeekdayList"));
simVal.getSimilarityModel().addSimilarityMeasure(smListSWA, "SMListSWA");
```

The similarity measure can be configured using the following parameters:

- The halving distance can be specified as a percentage value (denoted as a value from \([0,1]\)
) of the length of the query. The distance is measured starting from the end of the query.
`smListSWA.setHalvingDistancePercentage(0.5);`

- Insertion and deletion penalty schemes can be specified using a functional interface that gets the specific item being inserted or deleted as input. The output must be a negative numeric value.
`smListSWA.setInsertionScheme(t -> -0.5); smListSWA.setDeletionScheme(t -> -0.5);`

- When the local similarity to use is specified, this overwrites the configuration of the similarity model.
`smListSWA.setLocalSimilarityToUse("SMStringEquals");`

- The binding to always end the alignment with the last task of the query can be disabled.
`smListSWA.setForceAlignmentEndsWithQuery(false);`

Instantiation via XML is also possible, however insertion and deletion penalty schemes can only be set to a constant value.

```
<ListSWA name="SMListSWA" class="WeekdayList" halvingDist="0.5" localSimName="SMStringEquals" constInsertionPenalty="-0.5" constDeletionPenalty="-0.5" endWithQuery="false"/>
```

##### DTW #

For this DTW-based measure, the following parameters can be set.

Parameter | Type/Range | Default Value | Description |
---|---|---|---|

halvingDist | Percentage (Double, [0,1]) | `0.0` | This parameter specifies the distance to the end where the events’ importance shall be one half. |

valBelowZero | Double | `0.0` | The parameter is used to adapt the local similarity scores during calculation in the following way: \(sim_{stretched} = (1 + valBelowZero) \ast sim_{loc} - valBelowZero\) . Thereby the similarity scores will be mapped from the interval \([0,1]\) onto \([-valBelowZero,1]\) . |

localSimName | Similarity Measure (String) | `null` | The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no `localSimName` is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned. |

endWithQuery | Flag (boolean) | `true` | This parameter configures the binding to always end the alignment with the last task of the query can be disabled. |

The DTW-based measure can be instantiated accordingly:

```
SMListDTW smListDTW = (SMListDTW) simVal.getSimilarityModel().createSimilarityMeasure(SMListDTW.NAME, ModelFactory.getDefaultModel().getClass("WeekdayList"));
simVal.getSimilarityModel().addSimilarityMeasure(smListDTW, "SMListDTW");
```

The similarity measure can be configured using the following parameters:

- The halving distance can be specified as a percentage value (denoted as a value from \([0,1]\)
) of the length of the query. The distance is measured starting from the end of the query.
`smListDTW.setHalvingDistancePercentage(0.5);`

- When using DTW, it could be essential that the scoring matrix is instantiated using negative values for punishing strong dissimilarities. Otherwise every step in the matrix could increase the overall score. Therefore, a
*valBelowZero*can be introduced that adapts the local similarity scores during calculation in the following way: \(sim_{stretched} = (1 + valBelowZero) \ast sim_{loc} - valBelowZero\) . Thereby the similarity scores will be mapped from the interval \([0,1]\) onto \([-valBelowZero,1]\) .`smListDTW.setValBelowZero(0.5);`

- When the local similarity to use is specified, this overwrites the configuration of the similarity model.
`smListDTW.setLocalSimilarityToUse("SMStringEquals");`

- The binding to always end the alignment with the last task of the query can be disabled.
`smListDtw.setForceAlignmentEndsWithQuery(false);`

Instantiation via XML can be done as follows.

```
<ListDTW name="SMListDTW" class="WeekdayList" halvingDist="0.5" valBelowZero="0.5" localSimName="SMStringEquals" endWithQuery="false"/>
```