Collections

Collection #

The following measures for Collection objects are implemented:

Isolated Mapping #

The Isolated Mapping measure computes the similarity of two collection classes by finding the best mapping for each element. Every element of the query is compared with every element of the case and the best possible matching is returned. These best similarities are summed up and divided by the size of the query. It can happen, that many elements are mapped on the same case element. For example a query with the size of 10 can be mapped on a case with the size of 1. Nevertheless this mapping can happen, when the case has also 10 elements, of which 9 are never mapped. If an optimal mapping is wanted, where each query item has only one mapping partner, the Mapping measure has to be used.

The following parameters can be set for this similarity measure.

Parameter Type/Range Default Value Description
similarityToUse Similarity Measure (String) null The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no similarityToUse is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned.

The definition of a Isolated Mapping measure can look like:

<CollectionIsolatedMapping name="CollectionWeekdays" class="WeekdayList" similarityToUse="StringEqualsCaseSensitive" default="false"/>

This measure would refer to the collection class WeekdayList, which is a list as the name implies. To compare the elements of this list, which are elements of the String type, the Equals measure is used.

To create this measure during runtime, the following code would be used:

SMCollectionIsolatedMappingImpl collectionWeekdays = (SMCollectionIsolatedMappingImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMCollectionIsolatedMapping.NAME, ModelFactory.getDefaultModel().getCollectionSystemClass());
collectionWeekdays.setDataClass(ModelFactory.getDefaultModel().getClass("WeekdayList"));
collectionWeekdays.setSimilarityToUse("StringEqualsCaseSensitive");
simVal.getSimilarityModel().addSimilarityMeasure(collectionWeekdays, "CollectionWeekdays");

Here, the collection class WeekdayList and the String Equals similarity have to be defined somewhere else.

In the following example two collections exist, which are both objects of the class WeekdayList: The collection Weekdays contains all days from Monday until Sunday. The collection Workdays contains all workdays, Monday until Friday. The first example applies for both types of collection, namely lists or sets, as the order of the elements is not relevant.

In this example, the Isolated Mapping measure is used on those two collections. Weekdays is the query and Workdays the case. The following mapping results:

IsolatedMapping_Example1

The measure searches for the most similar element in the case for each element in the query. The elements Monday, Tuesday, Wednesday, Thursday and Friday appear in query and case, so they are mapped on themselves. Thus, the similarity for each of these mappings is 1.0.

The elements Saturday and Sunday only occur in the query. They both need a mapping partner, so the measure searches for the most similar element in the case. In this example, Sunday is mapped on Monday and Saturday on Friday. The mappings of these elements depend on the similarity measures, which are used for the comparison of the elements. So this mapping is just one possibility. It was chosen, because Monday and Sunday have the highest Levenshtein similarity. Saturday is also compared with every element in the case, but is equal similar to all elements except Wednesday. In this case Friday was chosen randomly.

The overall similarity of both collections is computed by summing up single similarities using the Levenshtein measure in this example. The similarity value of Monday, Tuesday, Wednesday, Thursday and Friday is 1.0. The similarity between Sunday and Monday would be \(\frac{2}{3}\) and between Saturday and Friday would be \(\frac{1}{2}\) . So the total similarity would be calculated like this:

\(\frac{1.0 + 1.0 + 1.0 + 1.0 + 1.0 + \frac{2}{3} + \frac{1}{2}}{7} = 0.881\)

If the String equals measure would be used, Saturday and Sunday would have random partners. These mappings would have the similarity 0.0, so the total similarity would be \(\frac{5}{7}\) .

As seen in this example, the Isolated Mapping measure can map several query items on one case item. In an extreme case, it can happen, that all query items are mapped on just one element of the case, as can be seen in the next example.

Here, the collection Weekdays is used again, but this time as case. The query is a list, which contains the Monday item 5 times. It has to be a list, because a set, as the other existing type of collection, does not allow one element to appear several times.

Following mapping results:

IsolatedMapping_Example2

The item Monday is mapped on the Monday element in the case for 5 times. The resulting similarity is 1.0.

Mapping #

The measure performs the best possible mapping of two collections using the A* algorithm. It performs a best possible mapping of the query items to the case items of the collection. Note that not all case items have to be utilized in the mapping, but all query items are tried to map. Each query element can only be mapped on a case element, which has no mapping partner, yet. If more query items than case items exist, some query items will get a similarity 0 (which will influence the overall similarity). If more case items than query items exist, then this will not interfere with the similarity-value (rather gives more possible match-scenarios).

A restriction for the A* search can be set by using the parameter maxQueueSize. It is a restriction for the size of the internal queue of the A* algorithm. It can be used, to reduce the runtime, but can then also cause the best possible mapping not to be found. It’s default value is 1000. It requires values larger than 0. If a negative value is set, the pruning of the queue is disabled. In this case, the optimal mapping will be found, but the runtime won’t be limited.

The following parameters can be set for this similarity measure.

Parameter Type/Range Default Value Description
similarityToUse Similarity Measure (String) null The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no similarityToUse is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned.
maxQueueSize Size (int) 1000 This parameter is a restriction for the size of the internal queue of the A* algorithm. It can be used, to reduce the runtime, but can then also cause the best possible mapping not to be found. It requires values larger than 0. If a negative value is set, the pruning of the queue is disabled. In this case, the optimal mapping will be found, but the runtime won’t be limited.

The definition of a Mapping measure can look like:

<CollectionMapping name="CollectionWeekdays" class="WeekdayList" similarityToUse="StringEqualsCaseSensitive" maxQueueSize="1100"/>

This measure would refer to the collection class WeekdayList, which is a list as the name implies. To compare the elements of this list, which are elements of the String type, the Equals measure is used. The maximum size of the queue is set here to 1100.

To create this measure during runtime, the following code would be used:

SMCollectionMappingImpl collectionWeekdays = (SMCollectionMappingImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMCollectionMapping.NAME,  ModelFactory.getDefaultModel().getCollectionSystemClass());
collectionWeekdays.setDataClass(ModelFactory.getDefaultModel().getClass("WeekdayList"));
collectionWeekdays.setSimilarityToUse("StringEqualsCaseSensitive");
simVal.getSimilarityModel().addSimilarityMeasure(collectionWeekdays, "CollectionWeekdays");

Here, the collection class WeekdayList and the String Equals similarity have to be defined somewhere else.

In this example two collections exist: The collection Weekdays contains all days from Monday until Sunday. The collection Workdays contains all workdays, Monday until Friday. It neither matters, if the collections are lists or sets, nor is the order important.

In this example, the Mapping measure is used on those two collections. Weekdays is the query and Workdays the case. Following mapping results:

Mapping_Example1

The measure searches the mapping partner with the highest similarity for every element of the query. Then the similarities are compared and the mapping with the highest similarity is chosen, the other mappings are not considered anymore. The search for mapping partners restarts, but neither the mapped query item nor the mapped case element are used anymore. So, in this example, Saturday and Sunday have no mapping partner, because for each case element exists a query partner with a higher similarity.

Because the query items were mapped on their representations in the case, the similarity of each of these mappings is 1.0. Saturday and Sunday could not be mapped, so they are treated like their mapping has the similarity 0.0. So the total similarity is \(\frac{5}{7}\) . In this example, the similarity is independent of the measure, which was used for the comparison of the items.

In the following is a second example. The element Monday is added to the collection Workdays. Because there is already one Monday in it, this collection must be a list. Weekdays again represents the query and Workdays the case.

The following mapping results:

Mapping_Example2

Unlike the first example, there is now one available mapping partner for Saturday and Sunday. The Monday item of the query is mapped on one of the Monday items in the case, which is chosen randomly. The Monday item without a current mapping partner is compared to each query item. This repeats until the similarity value for the mapping of this item with Saturday or Sunday becomes the highest similarity. The mapping of this Monday element depends on the measure, which is used for the item comparison.

In the example above, the Levenshtein measure was used, so Monday was most similar to Sunday. The similarity between these elements is \(\frac{2}{3}\) , so the total similarity is computed like: \(\frac{1.0 + 1.0 + 1.0 + 1.0 + 1.0 + \frac{2}{3} + 0.0}{7} = 0.81\)

If the String equals measure would be used, Saturday would be the mapping partner of Monday, because it’s the first computation by this order. So the total similarity for this case is \(\frac{5}{7}\) .

List Mapping #

The List Mapping measure compares two lists and also considers the order of elements. Other collection types are not supported. The similarity between two lists can be computed with two different meanings:

  • Exact: This means, that the two lists must have the same length to be compared. Each element is compared with the element at the same position in the other list. The similarities for each comparison are summed and divided through the length of the list.
  • Inexact: This means, that one of the two lists can be a sublist of the other. The similarities for possible sublists are computed like in the exact comparison. In the end, the best similarity is reported. In both cases the measure follows the exact order of the list.

The following parameters can be set for this similarity measure.

Parameter Type/Range Default Value Description
similarityToUse Similarity Measure (String) null The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no similarityToUse is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned.
contains Flag (String) inexact This parameter defines the meaning of the comparison of two lists. If it’s set to exact, the two lists must have the same length to be compared. Each element is compared with the element at the same position in the other list. The similarities for each comparison are summed and divided through the length of the list. If the parameter is set to inexact, one of the two lists can be a sublist of the other. The similarities for possible sublists are computed like in the exact comparison. In the end, the best similarity is reported. In both cases the measure follows the exact order of the list.

The definition of a List Mapping measure can look like:

<ListMapping name="ListWeekdays" class="WeekdayList" contains="exact" similarityToUse="StringEqualsCaseSensitive" default="false"/>

This measure would refer to the collection class WeekdayList, which is a list as the name implies. The keyword contains indicates, that the exact strategy is used. To compare the elements of this list, which are of the type String, the Equals measure is used.

To create this measure during runtime, the following code would be used:

SMListMappingImpl listWeekdays = (SMListMappingImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMListMapping.NAME,  ModelFactory.getDefaultModel().getListSystemClass());
listWeekdays.setDataClass(ModelFactory.getDefaultModel().getClass("WeekdayList"));
listWeekdays.setContainsExact();
listWeekdays.setSimilarityToUse("StringEqualsCaseSensitive");
simVal.getSimilarityModel().addSimilarityMeasure(listWeekdays , "ListWeekdays");

Here, the collection class WeekdayList and the StringEquals similarity have to be defined somewhere else.

In this example two lists exist: The list Weekdays contains all days from Monday until Sunday. The list Workdays contains all workdays, Monday until Friday. This measure works only on lists, so sets are not mentioned here.

In the following, List Mapping is used on the two lists. The following mapping results:

ListMapping_Example1

The sizes of the lists are different, so some items cannot be mapped. When using the exact mapping strategy, the similarity would be 0.0.

The inexact mapping strategy ignores the sizes of the lists. So it is checked, whether one list is a sublist of the other. In this example, three possible sublists are found. For each of them, the total similarity is computed, and the one with the highest similarity is chosen. Here, Workdays is a sublist of Weekdays, so the similarity is 1.0 and this mapping is chosen.

In the following is a second example. Here, Workdays is the query and Weekdays the case. But the order of Weekdays is different, so Thursday and Wednesday swapped their positions.

When using the exact mapping strategy, the similarity would be 0.0 and no mapping would be tried. So the following mapping just results, when using the inexact mapping strategy:

ListMapping_Example3

The measure finds three sublists again and computes the similarities. It identifies the second to the sixth element as the mapping partners with the highest similarity. But this measure follows the exact order of the items. So the element Wednesday is mapped on Thursday, just as Thursday on Wednesday. So, when using the String equals measure, the similarity is \(\frac{3}{5} = 0.6\) .

List Correctness #

The List Correctness measure compares two lists by the order of their elements. Other collection types aren’t supported.

This similarity measure uses the Correctness metric to compare two lists. This metric is the ratio of concordant and discordant pairs regarding all pairs. Its value is usually in the range of \(-1\) to \(1\) . \(-1\) means, that only discordant pairs are given and \(1\) means that only concordant pairs are given. Values in the range of \([0,1]\) are used directly as similarity values. For values in the range of \([-1.0)\) , a parameter discordantParameter \(p_{d}\) can be set. It defines the maximum possible similarity, if a correctness of \(-1\) would be reached. The similarity values decrease linearly between this value till 0. The default value for this parameter is \(1.0\) .

The meaning of concordant and discordant pairs is described here.

The similarity is computed by the following formula

\(sim_{correctness} = \begin{cases} \frac{|C|-|D|}{|C|+|D|}, &\text{if} \ |C| \geq |D|\\ \\ \bigl|\frac{|C|-|D|}{|C|+|D|}\bigl| \ * \ p_{d}, &\text{if} \ |C| < |D|\\ \end{cases}\)

The following parameters can be set for this similarity measure.

Parameter Type/Range Default Value Description
discordantParameter Weight (double) 1.0 This parameter defines the maximum possible similarity, if a correctness of \(-1\) would be reached. The similarity values decrease linearly between this value till 0.

The definition of a List Correctness measure can look like:

<ListCorrectness name="SMListCorrectness" class="WeekdayList" discordantParameter="0.5" default="false"/>

This measure would refer to the collection class WeekdayList, which is a list as the name implies. The discordant parameter is set to \(0.5\) . So, if only discordant pairs exists, the similarity would be \(0.5\) .

To create this measure during runtime, the following code would be used:

SMListCorrectnessImpl smListCorrectness = (SMListCorrectnessImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMListCorrectness.NAME,  ModelFactory.getDefaultModel().getListSystemClass());
smListCorrectness.setDataClass(ModelFactory.getDefaultModel().getClass("WeekdayList"));
smListCorrectness.setDiscordantParameter(0.5);
simVal.getSimilarityModel().addSimilarityMeasure(smListCorrectness_05, LIST_CORRECTNESS_0_5);

In this example two lists exist: The first list contains all days from Monday until Wednesday in order of the week. The second lists contains the days in the following order: Monday, Wednesday, Tuesday. Here, the order of the element Monday to the element Tuesday and of Monday to the element Wednesday matches. So, two concordant pairs can be found. Only the order of the element Tuesday to the element Wednesday doesn’t match. So, this pair is discordant. Because more concordant than discordant pairs does exits, the following similarity would be computed: \(\frac{2-1}{2+1}=\bigl\frac{1}{3}\) .

In another example, the first list containing the three days in order of the week exist, too. An additional list contains these elements in reverse order: Wednesday, Tuesday, Monday. Here, all elements occur in a different order. So, only discordant pairs could be computed. With no discordantParameter set, the following similarity would be computed: \(\bigl|\frac{0-3}{0+3}\bigl| * 1= \bigl|-1\bigl| \ * \ 1 = 1\) . If the parameter was set to \(0.5\) , the similarity would be computed as follows: \(\bigl|\frac{0-3}{0+3}\bigl| \ * \ 0.5= \bigl|-1\bigl| \ * \ 0.5 = 0.5\)

Sequence Matching #

Lists can be regarded as time series. Therefore, two algorithms (Smith Waterman Algorithm (SWA) and Dynamic Time Warping (DTW)) based on sequence matching have been adapted for comparing two list objects. These measures can be seen as an extension to the basic ListMapping. Both approaches find alignments between the two series by iteratively aligning sub-sequences until the best matching sub-sequences are identified. These alignments preserve the temporal order of the series. By default, the final alignment must include the last item of the query, however, this behavior can be disabled:

dpSimMeasure.setForceAlignmentEndsWithQuery(false);

The following mapping represents a valid alignment only if setForceAlignmentEndsWithQuery(false) is set. Inversely, by default, this mapping is invalid and would not be found by either algorithm. Notice the gap in the alignment, which is interpreted as a deletion of Wednesday. This alignment is an example of the SWA approach:

DynamicProgramming_Example1

In the following, the individual characteristics of both algorithms are outlined. Afterwards, their implementation and integration in ProCAKE is described.

Smith Waterman Algorithm (SWA) #

SWA is based on the Needleman-Wunsch Algorithm and is able to find local alignments of two data sequences. These alignments are constructed by iteratively either aligning, inserting or deleting elements from one series with regard to the other. The approach therefore is somewhat similar to the Levenshtein distance approach. Instead of weekdays, the following example will compare two strings (which are also a sequence) \(A= \text{'}AYBCMM\text{'}\) , with length \(|A|\) , and \(B=\text{'}ABCKMMH\text{'}\) , with length \(|B|\) . String \(A\) is indexed using \(i = 1,...,|A|\) and string \(B\) is indexed by \(j = 1,...,|B|\) . With this convention, \(a_i\) depicts the i-th character from \(A\) . The steps of this example are illustrated in the figure below.

Swa_example

SWA calculates the alignment and thereupon the similarity score in the following way:

  1. First, the two parameters of the algorithm, which in this case are functions, have to be specified.
  • A local similarity measure that will be used to compare the individual items in the sequences must be set. In ProCAKE, this measure can either be set manually or the measure based on the similarity model will be used. In this example, a simple CharacterEqual measure is applied:
\( \ \ \ \ \ \ s(a_i,b_j) = \begin{cases} \quad 1 &\text{if } a_i = b_j\\ \quad 0 &\text{if } else \end{cases} \)
  • The second parameter is a gap penalty function used to penalize indels, i.e., insertions or deletions, in the alignment. An insertion in string \(A\) will appear as a gap in \(A\) while a deletion from \(A\) will appear as a gap in \(B\) in the final alignment. For this purpose, the gap character \(\text{'}-\text{'}\) is introduced. Here, a constant function will be used, that assigns a penalty of \(0.5\) to every indel operation. In principle, this penalty can depend on the specific item being deleted or inserted (A function for this behavior can be set via code): \( W = -0.5 \)
  1. Next, the scoring matrix \(H\) is initialized (see (a)). \(a_0\) and \(b_0\) are identified with the gap character and the first row and column of the matrix are set to zero: \(H_{0,j}=H_{i,0}=0\) . Entry \(H_{i,j}\) is interpreted as the subsequence similarity of \(a'_i = a_1 a_2 .. a_i\) and \(b'_j = b_1 b_2 .. b_j\) for \(i,j > 0\) . When either \(i=0\) or \(j=0\) , then \(a_0=\text{'}-\text{'}\) or \(b_0=\text{'}-\text{'}\) and hence, similarity of a string with the empty string is set to 0.

  2. The scoring matrix is filled in according to the following formula (see (b)):

    \( H_{i,j} = \text{max} \begin{cases} \quad 0, \\ \quad H_{i-1,j-1} + s(a_i,b_j), &\text{match/mismatch}\\ \quad H_{i-1,j} + s(a_i, -), &\text{deletion}\\ \quad H_{i,j-1} + s(-, b_j) &\text{insertion} \end{cases} \)

    The function \(s\) has to return a positive similarity value for a match or mismatch and a negative value for an indel operation (including zero in both cases). In this example, \(s\) is defined as follows:

    \( s(a,b) = \begin{cases} \quad sim(a,b) &\text{if } a \in A \text{ and } b \in B\\ \quad W &\text{if } a=\text{'}-\text{'} \text{ or } b=\text{'}-\text{'} \end{cases} \)

    A match corresponds to a diagonal, an insertion to a vertical and a deletion to a horizontal step in the matrix (see (c)). The values' origins must be stored during calculation if one wants to visualize the optimal alignment later.

  3. In order to obtain the alignment, one has to find the maximum value in the matrix and trace back a path to a cell of value zero. This is done by successively stepping either vertically to the top, horizontally to the left or diagonally to the top left adjacent cell according to the inverted direction of the initial score calculation. Parts (c) and (d) show the alignment path and the final alignment, respectively. The strings \(A\) and \(B\) is assigned a similarity score of 4 which is significantly less than the maximum obtainable score of \(7=sim(A,A)\) if \(A\) was a subsequence of \(B\) .

Dynamic Time Warping (DTW) #

DTW in its original application was developed in order to properly align distorted time series data collected from voice recordings. In contrast to simple Euclidean Distance which only compares values with the same timestamp, DTW is able to warp elements of one series to an element of the other series with a different timestamp. For this reason, the comparison of the two strings ‘abbcc’ and ‘abc’ using DTW would result in maximum similarity, since the two sections of length two, ‘bb’ and ‘cc’, from the first string can be warped onto the single characters ‘b’ and ‘c’ of the second string respectively. Gaps, for the same reason, do not exist in DTW.

Calculation of this method is very similar to SWA, however the interpretation of the obtained path through the matrix is fundamentally different. While horizontal or vertical steps in SWA represent either a deletion or insertion, in DTW these steps simply imply that multiple warps onto the same element occur. Natively, DTW finds a distance value. Therefore, in the following example two series of numeric data is compared: \(A=(1,3,5,3,1,2,3,4,5,6)\) and \(B=(2,5,2,1,2,4,6)\) . The same notation as in the previous section will be used.

dtw_example

  1. DTW only needs the local distance function as a parameter. In this example, simple Euclidean Distance is used: \(dist(a,b) = |a-b|\)

  2. Again, the value of the cell \(H_{i,j}\) from the scoring matrix is interpreted as the DTW-distance of the subsequences \(a'_i = a_1 a_2 .. a_i\) and \(b'_j = b_1 b_2 .. b_j\) for \(i,j > 0\) . Therefore, the first row and column, with the exception of \(H_{0,0}\) , are initialized to be infinitely large, since the distance of a series to an empty series is not calculable. \(H_{0,0}\) is initialized to zero (see (a)).

  3. The matrix is now again iteratively filled in with the following formula (see (b)). The sources of each cell’s entry must be stored again.

    \( H_{i,j} = \text{min} \begin{cases} \quad H_{i,j-1} + dist(a_i,b_j), &\text{step vertically}\\ \quad H_{i-1,j-1} + 2 * dist(a_i,b_j), &\text{step diagonally}\\ \quad H_{i-1,j} + dist(a_i,b_j) &\text{step horizontally} \end{cases} \)
  4. To find the warping path, one must start in the bottom right corner of the matrix, at \(H_{|A|,|B|}\) and follow the origins of the original calculation until \(H_{1,1}\) is reached (see (c)). Each cell \((i,j)\) in the obtained path represents an alignment, or a warp, of \(a_i\) onto \(b_j\) . When plotting the series and connecting the elements from the path, the warping path is visualized as in (d).

We now want to transform this approach to be used with similarity values instead of distances. This can be easily achieved by simply maximizing the scoring function \(H\) at each step. However, please note that due to the construction of DTW, this will result in the score growing in every step and every direction. Consider the example below: On the left, a similarity function was used that is obtained by inverting the distance. The resulting scoring matrix and alignment path is visualized below. As you can see, the alignment looks a little random and non-rational. On the right, the local similarity function was altered by mapping the score to the interval [-1,1]. Thereby, low similarity scores act as punishment and do not increase the score in the matrix. The result is shown is much more useful.

stretchEx

Differences SWA vs DTW #

The main difference between both approaches lies in their basic similarity assumptions. In SWA, two series are regarded as highly similar, if a sub-sequence of series A can be converted into a sub-series of series B by means of as little indel operations as possible. On the other hand, DTW regards two series as highly similar, if a sub-sequence of series A can be projected onto a sub-sequence of series B by either stretching or compressing (i.e. warping) one sub-sequence onto the other. When interpreting the found alignments in the matrix for presenting them as mappings, differences emerge.

A typical mapping of SWA looks like this. Note the gap in the query that results from the deletion of Tuesday. This deletion results in a negative value being added as an indel punishment, resulting in a similarity of less than 1 in this example:

DynamicProgramming_SWA_Example1

On the other hand, a typical mapping of DTW looks like this. Note that instead of deleting one Tuesday, both are warped onto the singe Tuesday of the case. This has no effect on the final similarity score, which will be 1 in this example:

DynamicProgramming_DTW_Example1

Implementation #

For the implementation in ProCAKE, the two presented approaches were abstracted into one similarity measure with two variations. The matrix is constructed using the following formula: \(H_{i,j} = max \begin{cases} \quad H_{i-1,j-1} &+ \ wTemp_i * ALG_{i,j}(diagonal),\\ \quad H_{i-1,j} &+ \ wTemp_i * ALG_{i,j}(horizontal),\\ \quad H_{i,j-1} &+ \ wTemp_i * ALG_{i,j}(vertical),\\ \quad 0 \end{cases}\)

Depending on the chosen approach (either SWA or DTW), \(ALG_{i,j}\) is replaced with:

\(SWA_{i,j}(x) := \begin{cases} \quad sim_{loc}(q_i,c_j)& \text{if } x=\text{'diagonal'},\\ \quad penaltyDeletion(q_i)& \text{if } x=\text{'horizontal'},\\ \quad penaltyInsertion(c_j)& \text{if } x=\text{'vertical'},\\ \quad 0& \text{otherwise} \end{cases} \\\) \(DTW_{i,j}(x) := \begin{cases} \quad 2 * sim_{loc}(q_i,c_j)& \text{if } x=\text{'diagonal'}, \\ \quad sim_{loc}(q_i,c_j)& \text{if } x=\text{'horizontal'}, \\ \quad sim_{loc}(q_i,c_j)& \text{if } x=\text{'vertical'}, \\ \quad 0& \text{otherwise} \end{cases} \\\)

\(q_i\) represents the i-th query item and \(c_j\) the j-th case item. \(sim_{loc}\) is the local similarity measure used.

The factor \(wTemp\) allows for temporal weighting of the similarities: \(wTemp_i = \frac{1}{2^\frac{n-i}{h}}\) \(n\) is the length of the query. The halving distance \(h\) specifies the distance to the end where the events' importance shall be one half. Therefore, with this formula, the weight of an event at the end of the sequence will be one, the weight of an event at \(end - h\) will be one half, and the weight of an event prior to \(end - h\) will be between 0 and one half. By using this formula, differences in the series, that are closer to the end of the sequences, will be weighted higher than dissimilarities further in the past.

Given the constructed scoring matrix, the final non-normalized similarity is calculated as follows:

\(sim_{raw}(q,c) = \max_{0 < j \leq m} H_{n,j}\)

With \(m\) as the length of the case instance, this ensures that the alignment always ends with the last task of the query. Normalization is then conducted depending on the chosen approach. Let \(align =\{(0,0), \dots , (m,k)\}\) denote all cells from the alignment path. Let \(diag=\{(i_0,j_0),\dots,(i_p,j_p)\}\subseteq align\) denote all cells that originated from a diagonal step and let \(other = align \setminus diag \) denote all other assignments.

\(sim^{SWA}_{normed}(q, c) = \frac{sim_{raw}(q,c)}{\displaystyle\sum_{(i,j) \in diag} wTemp_i} \\\) \(sim^{DTW}_{raw}(q, c) = \ \frac{sim_{raw}}{\displaystyle\sum_{(i,j) \in diag} 2*wTemp_i + \displaystyle\sum_{(i,j) \in other} wTemp_i}\)

Measure instantiation and configuration #

SWA #

For this SWA-based measure, the following parameters can be set.

Parameter Type/Range Default Value Description
halvingDist Percentage (Double, [0,1]) 0.0 This parameter specifies the distance to the end where the events' importance shall be one half.
constInsertionPenalty Functional interface t -> -1.0 This parameter is used to define a functional interface that gets the specific item being inserted as input. The output must be a negative numeric value.
constDeletionPenalty Functional interface t -> -1.0 This parameter is used to define a functional interface that gets the specific item being inserted as input. The output must be a negative numeric value.
localSimName Similarity Measure (String) null The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no localSimName is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned.
endWithQuery Flag (boolean) true This parameter configures the binding to always end the alignment with the last task of the query can be disabled.

The SWA-based measure can be instantiated during runtime as follows:

SMListSWA smListSwa = (SMListSWA) simVal.getSimilarityModel().createSimilarityMeasure(SMListSWA.NAME, ModelFactory.getDefaultModel().getListSystemClass());
simVal.getSimilarityModel().addSimilarityMeasure(smListSwa, "ListSwa");

The similarity measure can be configured using the following parameters:

  • The halving distance can be specified as a percentage value (denoted as a value from \([0,1]\) ) of the length of the query. The distance is measured starting from the end of the query.
    smListSwa.setHalvingDistancePercentage(0.5);
    
  • Insertion and deletion penalty schemes can be specified using a functional interface that gets the specific item being inserted or deleted as input. The output must be a negative numeric value.
    smListSwa.setInsertionScheme(t -> -0.5);
    smListSwa.setDeletionScheme(t -> -0.5);
    
  • When the local similarity to use is specified, this overwrites the configuration of the similarity model.
    smListSwa.setLocalSimToUse("StringEqualsCaseSensitive");
    
  • The binding to always end the alignment with the last task of the query can be disabled.
    smListSwa.setForceAlignmentEndsWithQuery(false);
    

Instantiation via XML is also possible, however insertion and deletion penalty schemes can only be set to a constant value.

<ListSWA class="WeekdayList" name="WeekdayListSWA" halvingDist="0.5" localSimName="StringEqualsCaseSensitive" constInsertionPenalty="-0.5" constDeletionPenalty="-0.5" endWithQuery="false"/>
DTW #

For this DTW-based measure, the following parameters can be set.

Parameter Type/Range Default Value Description
halvingDist Percentage (Double, [0,1]) 0.0 This parameter specifies the distance to the end where the events' importance shall be one half.
valBelowZero Double 0.0 The parameter is used to adapt the local similarity scores during calculation in the following way: \(sim_{stretched} = (1 + valBelowZero) \ast sim_{loc} - valBelowZero\) . Thereby the similarity scores will be mapped from the interval \([0,1]\) onto \([-valBelowZero,1]\) .
localSimName Similarity Measure (String) null The parameter is used to define, which similarity has to be taken for the similarity computation between the elements of the collections. For all elements, this one local measure will be used and afterwards the global similarity is aggregated. The similarity measure, which is referred, must exists in the runtime domain, too. Otherwise, an exception will be thrown. If no localSimName is defined, a usable data class similarity will be searched. If this one can’t be found, too, an invalid similarity will be returned.
endWithQuery Flag (boolean) true This parameter configures the binding to always end the alignment with the last task of the query can be disabled.

The DTW-based measure can be instantiated accordingly:

SMListDTW smListDtw = (SMListDTW) simVal.getSimilarityModel().createSimilarityMeasure(SMListDTW.NAME, ModelFactory.getDefaultModel().getListSystemClass());
simVal.getSimilarityModel().addSimilarityMeasure(smListDtw, "ListDtw");

The similarity measure can be configured using the following parameters:

  • The halving distance can be specified as a percentage value (denoted as a value from \([0,1]\) ) of the length of the query. The distance is measured starting from the end of the query.
    smListDtw.setHalvingDistancePercentage(0.5);
    
  • When using DTW, it could be essential that the scoring matrix is instantiated using negative values for punishing strong dissimilarities. Otherwise every step in the matrix could increase the overall score. Therefore, a valBelowZero can be introduced that adapts the local similarity scores during calculation in the following way: \(sim_{stretched} = (1 + valBelowZero) \ast sim_{loc} - valBelowZero\) . Thereby the similarity scores will be mapped from the interval \([0,1]\) onto \([-valBelowZero,1]\) .
    smListDtw.setValBelowZero(0.5);
    
  • When the local similarity to use is specified, this overwrites the configuration of the similarity model.
    smListDtw.setLocalSimToUse("StringEqualsCaseSensitive");
    
  • The binding to always end the alignment with the last task of the query can be disabled.
    smListDtw.setForceAlignmentEndsWithQuery(false);
    

Instantiation via XML can be done as follows.

<ListDTW class="WeekdayList" name="WeekdayListDTW" halvingDist="0.5" valBelowZero="0.5" localSimName="StringEqualsCaseSensitive" endWithQuery="false"/>