Aggregate #

The following measures for Aggregate objects are implemented:

Global similarity measures are defined by applying an aggregation function \(\Phi\) to the local similarity values. Such aggregation functions are defined by determining

  • a basic aggregation function and
  • a weight model that determines weights \(\overline{\omega}=(\omega_1,\ldots,\omega_n)\) such that \(0 \leq \omega_i \leq 1\) and \(\sum_{i=1...n} \omega_i = 1\) .

To guarantee that \(\sum_{i=1...n} \omega_i = 1\) all weights will be normalized automatically during runtime.
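The normalization step can be sketched as follows. This is only an illustration under stated assumptions, not the framework's actual implementation: the class `WeightNormalizationSketch`, the method `normalize`, and the attribute names `a`, `b` and `c` are hypothetical. The sketch divides each raw weight by the sum of all raw weights.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the documented API.
public class WeightNormalizationSketch {

    // Divides every raw weight by the sum of all raw weights,
    // so that the returned weights sum to 1.
    static Map<String, Double> normalize(Map<String, Double> rawWeights) {
        double sum = 0.0;
        for (double w : rawWeights.values()) {
            sum += w;
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("At least one weight must be positive");
        }
        Map<String, Double> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : rawWeights.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / sum);
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Double> raw = new LinkedHashMap<>();
        raw.put("a", 1.0); // assumed example weights
        raw.put("b", 0.5);
        raw.put("c", 0.5);
        System.out.println(normalize(raw)); // prints {a=0.5, b=0.25, c=0.25}
    }
}
```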

Global parameters of all aggregate similarity measures #

| Parameter | Type/Range | Default Value | Description |
| --- | --- | --- | --- |
| defaultWeight | Weight (Double, [0,1]) | 1.0 | The default weight for attributes that is used if no specific weight is defined. |
| ignoreNullAttributesInQuery | Flag (Boolean) | true | The parameter specifies whether missing/unspecified attributes (null attributes) in aggregate objects contained in a query should be ignored. If set to false, these null attributes are not ignored and thus have to be also null in the corresponding case objects to be evaluated with a similarity of 1.0. This is particularly useful for finding semantically equal objects. |

Furthermore, there are parameters to be set for the respective attributes:

| Parameter | Type/Range | Default Value | Description |
| --- | --- | --- | --- |
| weight | Weight (Double, [0,1]) | see above | This parameter specifies the weight of a single attribute, overriding the default weight. |
| similarityToUse | Similarity Measure (String) | null | This parameter specifies which similarity measure is used for the local similarity calculation of the respective attribute. The measure must be defined in advance; otherwise, an exception is thrown. If no value is defined for similarityToUse, the default measure for the data class of the attribute values is used. |

All the example measures below require an Aggregate class called Employee, which has to be defined in the model.xml with the attributes name (String) and room (Integer).

For example, an aggregate similarity measure might look like this (not instantiable):

<SMAggregate name="SMAggregate" class="Employee" defaultWeight="0.5">
   <AggWeight att="name" similarityToUse="SMName"/>
   <AggWeight att="room" weight="0.3"/>
</SMAggregate>

In this example, the default weight and a specific similarity measure are used for the name attribute. For the room attribute, a reduced weight is used (both weights are normalized in the background) and the default similarity measure for integers is applied.

Average #

The weighted average is the most commonly used aggregation function. Each attribute contributes to the similarity (and thereby to the utility) in proportion to its weight.

Similarity: \(\Phi(s_1,\ldots,s_n)=\sum_{i=1}^n \omega_i \cdot s_i\)

For example, an Aggregate Average measure can look like:

sim.xml

    <AggregateAverage name="SMAggregateAverage" class="Employee" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateAverage>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight or similarityToUse for name, so it uses the default weight of 0.5 and the default similarity measure for Strings. The weight for room is set to 0.3. As the snippet does not specify a similarityToUse for room either, the default similarity measure for Integers is used; a specific measure such as “NumericLinearInteger” would have to be set via the similarityToUse attribute.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateAverage smAggregateAverage = (SMAggregateAverage) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateAverage.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateAverage.setDefaultWeight(0.5);
    smAggregateAverage.setWeight("room", 0.3);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateAverage, "AggregateAverageEmployee");

For example, there are two aggregate objects of the class Employee. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name ( \(\frac{0.5}{0.5+0.3}\) ) and 0.375 for room ( \(\frac{0.3}{0.5+0.3}\) ).

Using simple measures for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. For the total similarity, the weighted average of both similarities is used. So, the similarity is \(0.625 * 1.0 + 0.375 * 0.0 = 0.625\) .
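The calculation above can be reproduced with a few lines of plain Java. This is a minimal sketch, not the code of SMAggregateAverage: the class `WeightedAverageSketch` and the method `weightedAverage` are hypothetical, and the local similarities are passed in as plain numbers instead of being computed by local similarity measures.

```java
import java.util.Locale;

// Hypothetical helper, not part of the documented API.
public class WeightedAverageSketch {

    // Computes sum(w_i * s_i) / sum(w_i), which is the weighted average
    // with weights normalized to sum to 1.
    static double weightedAverage(double[] rawWeights, double[] localSims) {
        if (rawWeights.length != localSims.length) {
            throw new IllegalArgumentException("One weight per local similarity is required");
        }
        double weightSum = 0.0;
        double weightedSum = 0.0;
        for (int i = 0; i < rawWeights.length; i++) {
            weightSum += rawWeights[i];
            weightedSum += rawWeights[i] * localSims[i];
        }
        if (weightSum <= 0.0) {
            throw new IllegalArgumentException("At least one weight must be positive");
        }
        return weightedSum / weightSum;
    }

    public static void main(String[] args) {
        double[] rawWeights = {0.5, 0.3}; // name (default weight), room
        double[] localSims = {1.0, 0.0};  // names are equal, rooms differ
        double sim = weightedAverage(rawWeights, localSims);
        System.out.println(String.format(Locale.ROOT, "%.3f", sim)); // prints 0.625
    }
}
```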

Please note the following when specifying weights: Attributes that have the same weight, e.g. 1, contribute equally to the similarity computation. To increase the importance of an attribute, choose a value greater than 1; to decrease it, choose a value smaller than 1. Weights are normalized automatically (i.e., they sum to 1.0) during similarity computation. However, already normalized weights can also be set manually. Please also note that SMAggregateAverage aggregates only attributes that are contained in the query aggregate object. Consequently, the original weighting may change during normalization if attributes are missing.

Maximum #

When using the maximum aggregation, the overall similarity is determined by the maximum local similarity. If one attribute indicates a high utility, the utility for the whole case is high.

Similarity: \(\Phi(s_1,\ldots,s_n)=\max_{i=1}^n \omega_i \cdot s_i\)

For example, an Aggregate Maximum measure can look like:

sim.xml

    <AggregateMaximum name="SMAggregateMaximum" class="Employee" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateMaximum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateMaximum smAggregateMaximum = (SMAggregateMaximum) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateMaximum.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateMaximum.setDefaultWeight(0.5);
    smAggregateMaximum.setWeight("room", 0.3);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateMaximum, "SMAggregateMaximum");

As an example, consider the same two aggregate objects of the class Employee as before. The first one has the values name=“TestEmployee” and room=123, the second one name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are only used to rank the local similarity values: each value is multiplied by its weight, and the ranking is based on these products. The result of the similarity computation is the selected local similarity value without its weight. Using simple measures for the attributes, e.g. StringEquals, the similarity between the names is 1.0 and between the rooms 0.0. The weighted values are \(0.625 \cdot 1.0 = 0.625\) and \(0.375 \cdot 0.0 = 0.0\), so name ranks highest and its unweighted similarity of 1.0 is returned.
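The ranking behaviour described in this example can be sketched in plain Java. The sketch assumes exactly what the paragraph states (rank by weight times similarity, return the unweighted value); the class `AggregateMaximumSketch` and the method `maximum` are hypothetical and do not reproduce the code of SMAggregateMaximum.

```java
// Hypothetical helper, not part of the documented API.
public class AggregateMaximumSketch {

    // Returns the unweighted local similarity of the attribute whose
    // weighted similarity (weight * similarity) is the largest.
    static double maximum(double[] weights, double[] localSims) {
        if (weights.length != localSims.length || localSims.length == 0) {
            throw new IllegalArgumentException("Need one weight per local similarity");
        }
        int best = 0;
        for (int i = 1; i < localSims.length; i++) {
            if (weights[i] * localSims[i] > weights[best] * localSims[best]) {
                best = i;
            }
        }
        return localSims[best];
    }

    public static void main(String[] args) {
        double[] weights = {0.625, 0.375}; // normalized weights for name and room
        double[] localSims = {1.0, 0.0};   // names are equal, rooms differ
        System.out.println(maximum(weights, localSims)); // prints 1.0
    }
}
```

Under this reading, a heavily weighted attribute can win the ranking even if its raw similarity is lower: with weights 0.1 and 0.9 and local similarities 1.0 and 0.5, the products are 0.1 and 0.45, so 0.5 is returned.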

K-Maximum #

The k-th highest local similarity value determines the global similarity. Hence, if k out of n attributes indicate a high utility, the overall utility for the whole case is high.

Similarity: \(\begin{array}{l} \Phi(s_1,\ldots,s_n)=\omega_{i_k} \cdot s_{i_k} \\ \quad \text{with}\, \omega_{i_r}\cdot s_{i_r} \geq \omega_{i_{r+1}}\cdot s_{i_{r+1}} \end{array}\)

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
| --- | --- | --- | --- |
| k | Attribute number (int) | 1 | The parameter defines which local similarity, counted from the highest, is returned (the k-th highest). Only values greater than 0 are allowed. If k is set to a value higher than the number of attributes, an exception is thrown. |

For example, an Aggregate K Maximum measure can look like:

sim.xml

    <AggregateKMaximum name="SMAggregateKMaximum" class="Employee" k="2" default="false" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateKMaximum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set as 0.3. The value for k is set to 2, so the second-highest similarity will be returned.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateKMaximum smAggregateKMaximum = (SMAggregateKMaximum) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateKMaximum.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateKMaximum.setDefaultWeight(0.5);
    smAggregateKMaximum.setWeight("room", 0.3);
    smAggregateKMaximum.setK(2);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateKMaximum, "SMAggregateKMaximum");

As an example, consider the same two aggregate objects of the class Employee as before. The first one has the values name=“TestEmployee” and room=123, the second one name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are only used to rank the local similarity values: each value is multiplied by its weight, and the ranking is based on these products. The result of the similarity computation is the selected local similarity value without its weight. Using simple measures for the attributes, e.g. StringEquals, the similarity between the names is 1.0 and between the rooms 0.0. The weighted values are 0.625 and 0.0, so room ranks second and its unweighted similarity of 0.0 is returned.

As another example, assume two aggregate objects of any class containing five attributes. The following similarities were calculated: 1.0, 0.8, 0.6, 0.4 and 0.2. Using the k-value of 2, the second-highest value of 0.8 is returned. Using the k-value of 4, the fourth-highest value of 0.4 is returned. For a k-value of 1, the highest value of 1.0 is returned, analogous to AggregateMaximum, and for k=5 the smallest value 0.2, corresponding to AggregateMinimum.
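The five-attribute example can be traced with the following sketch. It is an illustration under assumptions, not the code of SMAggregateKMaximum: the class `KMaximumSketch` and the method `kMaximum` are hypothetical, equal weights are assumed because the example gives none, and `IllegalArgumentException` stands in for whatever exception the framework throws for an invalid k.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the documented API.
public class KMaximumSketch {

    // Ranks the attributes by weight * similarity (highest first) and
    // returns the unweighted similarity of the attribute at rank k.
    static double kMaximum(double[] weights, double[] localSims, int k) {
        if (weights.length != localSims.length) {
            throw new IllegalArgumentException("Need one weight per local similarity");
        }
        if (k < 1 || k > localSims.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of attributes");
        }
        Integer[] order = new Integer[localSims.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Descending order of the weighted similarities.
        Arrays.sort(order, (a, b) ->
                Double.compare(weights[b] * localSims[b], weights[a] * localSims[a]));
        return localSims[order[k - 1]];
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.2, 0.2, 0.2, 0.2}; // assumed equal weights
        double[] localSims = {1.0, 0.8, 0.6, 0.4, 0.2};
        System.out.println(kMaximum(weights, localSims, 2)); // prints 0.8
        System.out.println(kMaximum(weights, localSims, 4)); // prints 0.4
    }
}
```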

Minimum #

When using the minimum aggregation, the overall similarity is determined by the minimum local similarity. If one attribute indicates a low utility, the utility for the whole case is low.

Similarity: \(\Phi(s_1,\ldots,s_n)=\min_{i=1}^n \omega_i \cdot s_i\)

For example, an Aggregate Minimum measure can look like:

sim.xml

    <AggregateMinimum name="SMAggregateMinimum" class="Employee" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateMinimum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateMinimum smAggregateMinimum = (SMAggregateMinimum) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateMinimum.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateMinimum.setDefaultWeight(0.5);
    smAggregateMinimum.setWeight("room", 0.3);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateMinimum, "SMAggregateMinimum");

As an example, consider the same two aggregate objects of the class Employee as before. The first one has the values name=“TestEmployee” and room=123, the second one name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are only used to rank the local similarity values: each value is multiplied by its weight, and the ranking is based on these products. The result of the similarity computation is the selected local similarity value without its weight. Using simple measures for the attributes, e.g. StringEquals, the similarity between the names is 1.0 and between the rooms 0.0. The weighted values are 0.625 and 0.0, so room ranks lowest and its unweighted similarity of 0.0 is returned.

K-Minimum #

The k-th lowest local similarity value determines the global similarity. Hence, if k out of n attributes indicate a low utility, the overall utility for the whole case is low.

Similarity: \(\begin{array}{l} \Phi(s_1,\ldots,s_n)=\omega_{i_k} \cdot s_{i_k} \\ \quad \text{with}\, \omega_{i_r}\cdot s_{i_r} \leq \omega_{i_{r+1}}\cdot s_{i_{r+1}} \end{array}\)

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
| --- | --- | --- | --- |
| k | Attribute number (int) | 1 | The parameter defines which local similarity, counted from the lowest, is returned (the k-th lowest). Only values greater than 0 are allowed. If k is set to a value higher than the number of attributes, an exception is thrown. |

For example, an Aggregate K Minimum measure can look like:

sim.xml

    <AggregateKMinimum name="SMAggregateKMinimum" class="Employee" k="2" default="false" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateKMinimum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3. The value for k is set to 2, so the second-lowest similarity will be returned.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateKMinimum smAggregateKMinimum = (SMAggregateKMinimum) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateKMinimum.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateKMinimum.setDefaultWeight(0.5);
    smAggregateKMinimum.setWeight("room", 0.3);
    smAggregateKMinimum.setK(2);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateKMinimum, "SMAggregateKMinimum");

As an example, consider the same two aggregate objects of the class Employee as before. The first one has the values name=“TestEmployee” and room=123, the second one name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are only used to rank the local similarity values: each value is multiplied by its weight, and the ranking is based on these products. The result of the similarity computation is the selected local similarity value without its weight. Using simple measures for the attributes, e.g. StringEquals, the similarity between the names is 1.0 and between the rooms 0.0. The weighted values are 0.625 and 0.0, so name ranks second-lowest and its unweighted similarity of 1.0 is returned.

As another example, assume two aggregate objects of any class containing five attributes. The following similarities were calculated: 1.0, 0.8, 0.6, 0.4 and 0.2. Using the k-value of 2, the second-lowest value of 0.4 is returned. Using the k-value of 4, the fourth-lowest value of 0.8 is returned. For a k-value of 1, the lowest value of 0.2 is returned, analogous to AggregateMinimum, and for k=5 the highest value of 1.0, corresponding to AggregateMaximum.
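A sketch of the ascending ranking for this five-attribute example is given below. As before, this is an illustration and not the code of SMAggregateKMinimum: the class `KMinimumSketch` and the method `kMinimum` are hypothetical, equal weights are assumed because the example gives none, and `IllegalArgumentException` is a placeholder for the exception type actually thrown.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the documented API.
public class KMinimumSketch {

    // Ranks the attributes by weight * similarity (lowest first) and
    // returns the unweighted similarity of the attribute at rank k.
    static double kMinimum(double[] weights, double[] localSims, int k) {
        if (weights.length != localSims.length) {
            throw new IllegalArgumentException("Need one weight per local similarity");
        }
        if (k < 1 || k > localSims.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of attributes");
        }
        Integer[] order = new Integer[localSims.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Ascending order of the weighted similarities.
        Arrays.sort(order, (a, b) ->
                Double.compare(weights[a] * localSims[a], weights[b] * localSims[b]));
        return localSims[order[k - 1]];
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.2, 0.2, 0.2, 0.2}; // assumed equal weights
        double[] localSims = {1.0, 0.8, 0.6, 0.4, 0.2};
        System.out.println(kMinimum(weights, localSims, 2)); // prints 0.4
        System.out.println(kMinimum(weights, localSims, 4)); // prints 0.8
    }
}
```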

Minkowski #

The Minkowski aggregation is a generalization of the weighted average. The higher the value of the parameter p ≥ 1, the higher the influence of the attribute with the highest local similarity. As p approaches infinity, this aggregation function becomes the maximum aggregation. If no weights are given, the similarity will be \(1.0\).

Similarity: \(\Phi(s_1,\ldots,s_n)=(\sum_{i=1}^n \omega_i \cdot s_i^p)^{1/p}\)

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
| --- | --- | --- | --- |
| p | Weight (double) | 2.0 | The parameter is used to define the influence of the attribute with the highest local similarity. |

For example, an Aggregate Minkowski measure can look like:

sim.xml

    <AggregateMinkowski name="SMAggregateMinkowski" class="Employee" p="3" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateMinkowski>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3. The value for p is set to 3.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateMinkowski smAggregateMinkowski = (SMAggregateMinkowski) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateMinkowski.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateMinkowski.setMinkowskiP(3);
    smAggregateMinkowski.setDefaultWeight(0.5);
    smAggregateMinkowski.setWeight("room", 0.3);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateMinkowski, "SMAggregateMinkowski");

For example, there are the same two aggregate objects of the class Employee, which were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. Using simple measures for the attributes, the similarity between the names is 1.0, between the rooms 0.0.

For the total similarity, the following calculation results: \(((0.625 * 1.0)^3 + (0.375 * 0.0)^3)^{\frac{1}{3}} = ((0.625)^3 + (0.0)^3)^{\frac{1}{3}} = (0.2441 + 0.0) ^{\frac{1}{3}} = 0.625 \) .

Euclidian #

The Euclidean aggregation is the same as the Minkowski aggregation with a fixed p = 2. If no weights are given, the similarity will also be \(1.0\).

Similarity: \(\Phi(s_1,\ldots,s_n)=(\sum_{i=1}^n \omega_i \cdot s_i^2)^{1/2}\)

For example, an Aggregate Euclidean measure can look like:

sim.xml

    <AggregateEuclidian name="SMAggregateEuclidian" class="Employee" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateEuclidian>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateEuclidian smAggregateEuclidian = (SMAggregateEuclidian) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateEuclidian.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateEuclidian.setDefaultWeight(0.5);
    smAggregateEuclidian.setWeight("room", 0.3);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateEuclidian, "SMAggregateEuclidian");

For example, there are the same two aggregate objects of the class Employee, which were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. Using simple measures for the attributes, the similarity between the names is 1.0, between the rooms 0.0.

For the total similarity, the following calculation results: \(((0.625 * 1.0)^2 + (0.375 * 0.0)^2)^{\frac{1}{2}} = ((0.625)^2 + (0.0)^2)^{\frac{1}{2}} = (0.3906 + 0.0) ^{\frac{1}{2}} = 0.625 \) .