Aggregate #

The similarity measures for Aggregate objects fall into two categories: weighted and unweighted. For the unweighted approach, only the Relevance measure is available.

Weighted aggregate similarity measures #

Global similarity measures are defined by applying an aggregation function \(\Phi\) to the local similarity values. In the weighted case, a weight model is needed in addition to the aggregation function. It determines weights \(\overline{\omega}=(\omega_1,\ldots,\omega_n)\) such that \(0 \leq \omega_i \leq 1\) and \(\sum_{i=1}^n \omega_i = 1\). The weights express the importance of each attribute in the similarity computation.

To guarantee that \(\sum_{i=1}^n \omega_i = 1\), all weights are normalized automatically at runtime.
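
The normalization step can be sketched as follows. This is a minimal illustration of dividing each weight by the sum of all weights, not ProCAKE's actual implementation; the class WeightNormalizationSketch and the method normalize are hypothetical names chosen for this example.

```java
import java.util.Arrays;

final class WeightNormalizationSketch {

    // Divides every weight by the sum of all weights so that the result sums to 1.
    static double[] normalize(double[] weights) {
        double sum = Arrays.stream(weights).sum();
        if (sum <= 0.0) {
            throw new IllegalArgumentException("at least one weight must be positive");
        }
        return Arrays.stream(weights).map(w -> w / sum).toArray();
    }

    public static void main(String[] args) {
        // Raw weights 0.5 and 0.3 become approximately 0.625 and 0.375.
        System.out.println(Arrays.toString(normalize(new double[] {0.5, 0.3})));
    }
}
```

Because only the ratio between the weights matters, raw weights of 5 and 3 would lead to the same normalized values.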

Global parameters #

| Parameter | Type/Range | Default Value | Description |
|---|---|---|---|
| defaultWeight | Weight (Double, [0,1]) | 1.0 | The default weight for attributes that is used if no specific weight is defined. |
| ignoreNullAttributesInQuery | Flag (Boolean) | true | The parameter specifies whether missing/unspecified attributes (null attributes) in aggregate objects contained in a query should be ignored. If set to false, these null attributes are not ignored and thus also have to be null in the corresponding case objects to be evaluated with a similarity of 1.0. This is particularly useful for finding semantically equal objects. |

Furthermore, there are parameters to be set for the respective attributes:

| Parameter | Type/Range | Default Value | Description |
|---|---|---|---|
| weight | Weight (Double, [0,1]) | see above | This parameter specifies the weight of a single attribute, overriding the default weight. |
| similarityToUse | Similarity Measure (String) | null | This parameter specifies which similarity measure is used for the respective attribute for local similarity calculation. The measure must be defined in advance. Otherwise, an exception will be thrown. If no value is defined for similarityToUse, the default measure for the data class of the attribute value is used. |

All the example measures in the following require an aggregate class called Employee, which has to be defined in the model.xml. It contains two attributes: name (String) and room (Integer).

For example, an aggregate similarity measure might look like this (not instantiable):

<SMAggregate name="SMAggregate" class="Employee" defaultWeight="0.5">
   <AggWeight att="name" similarityToUse="SMName"/>
   <AggWeight att="room" weight="0.3"/>
</SMAggregate>

In this example, the default weight and a specific similarity measure are used for the name attribute. For the room attribute, a reduced weight is used (both weights are normalized in the background) and the default similarity measure for integers is applied.

Average #

The weighted average is the most commonly used aggregation function. Each attribute contributes to the similarity (and thereby to the utility) in proportion to its weight.

Similarity: \(\Phi(s_1,\ldots,s_n)=\sum_{i=1}^n \omega_i \cdot s_i\)

For example, an Aggregate Average measure can look like:

sim.xml

    <AggregateAverage name="SMAggregateAverage" class="Employee" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateAverage>

This measure refers to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). No weight or similarityToUse is given for name, so it uses the default weight of 0.5 and the default similarity measure for Strings. For room, the weight is set to 0.3; the XML above sets no similarityToUse for it, so the default similarity measure for Integers applies.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

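    // simVal (presumably the similarity valuator) is set up elsewhere in the test class.
    // Create a new SMAggregateAverage for the Employee class of the default model.
    SMAggregateAverage smAggregateAverage = (SMAggregateAverage) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateAverage.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    // Weight used for every attribute without an explicit weight (here: name).
    smAggregateAverage.setDefaultWeight(0.5);
    // Attribute-specific weight for room.
    smAggregateAverage.setWeight("room", 0.3);
    // Register the configured measure under a name in the similarity model.
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateAverage, "AggregateAverageEmployee");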

For example, there are two aggregate objects of the class Employee. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name ( \(\frac{0.5}{0.5+0.3}\) ) and 0.375 for room ( \(\frac{0.3}{0.5+0.3}\) ).

Using simple measures for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. For the total similarity, the weighted average of both similarities is used. So, the similarity is \(0.625 \cdot 1.0 + 0.375 \cdot 0.0 = 0.625\) .
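
The calculation above can be reproduced with a few lines of Java. The following is a sketch under the assumption that the weights are normalized by dividing by their sum; WeightedAverageSketch and weightedAverage are hypothetical names and not part of the ProCAKE API.

```java
final class WeightedAverageSketch {

    // Computes sum(w_i * s_i) / sum(w_i), i.e. the weighted average
    // with the weights normalized on the fly.
    static double weightedAverage(double[] weights, double[] similarities) {
        if (weights.length != similarities.length) {
            throw new IllegalArgumentException("one weight per similarity is required");
        }
        double weightedSum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            weightedSum += weights[i] * similarities[i];
            weightSum += weights[i];
        }
        if (weightSum <= 0.0) {
            throw new IllegalArgumentException("at least one weight must be positive");
        }
        return weightedSum / weightSum;
    }

    public static void main(String[] args) {
        // name: raw weight 0.5, similarity 1.0; room: raw weight 0.3, similarity 0.0.
        // The result is approximately 0.625.
        System.out.println(weightedAverage(new double[] {0.5, 0.3}, new double[] {1.0, 0.0}));
    }
}
```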

Please note the following when specifying weights: Attributes that all have the same weight, e.g. 1, are equally weighted in the similarity computation. To increase the importance of one attribute, choose a value greater than 1; to decrease its importance, choose a value smaller than 1. Weights are automatically normalized (i.e., their sum equals 1.0) during similarity computation. However, already normalized weights can also be set manually. Please note that SMAggregateAverage aggregates only attributes that are contained in the query aggregate object. Consequently, the original ratio of weights might change during normalization if attributes are missing.
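
The effect of missing query attributes on the normalized weights can be illustrated with a small sketch. It assumes that only the weights of attributes present in the query enter the normalization; QueryWeightSketch and normalizeForQuery are hypothetical names for this example and do not exist in ProCAKE.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class QueryWeightSketch {

    // Normalizes only the weights of attributes that occur in the query,
    // so that the returned weights sum to 1.
    static Map<String, Double> normalizeForQuery(Map<String, Double> rawWeights,
                                                 Set<String> queryAttributes) {
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : rawWeights.entrySet()) {
            if (queryAttributes.contains(entry.getKey())) {
                sum += entry.getValue();
            }
        }
        Map<String, Double> normalized = new LinkedHashMap<>();
        if (sum <= 0.0) {
            return normalized; // nothing to aggregate
        }
        for (Map.Entry<String, Double> entry : rawWeights.entrySet()) {
            if (queryAttributes.contains(entry.getKey())) {
                normalized.put(entry.getKey(), entry.getValue() / sum);
            }
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Double> raw = new LinkedHashMap<>();
        raw.put("name", 0.5);
        raw.put("room", 0.3);
        // Query contains both attributes: name is about 0.625, room about 0.375.
        System.out.println(normalizeForQuery(raw, Set.of("name", "room")));
        // Query contains only room: room alone carries the full weight of 1.0.
        System.out.println(normalizeForQuery(raw, Set.of("room")));
    }
}
```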

Maximum #

When using the maximum aggregation, the overall similarity is determined by the maximum local similarity. If one attribute indicates a high utility, the utility for the whole case is high.

Similarity: \(\Phi(s_1,\ldots,s_n)=\max_{i=1}^n \omega_i \cdot s_i\)

For example, an Aggregate Maximum measure can look like:

sim.xml

    <AggregateMaximum name="SMAggregateMaximum" class="Employee" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateMaximum>

This measure refers to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateMaximum smAggregateMaximum = (SMAggregateMaximum) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateMaximum.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateMaximum.setDefaultWeight(0.5);
    smAggregateMaximum.setWeight("room", 0.3);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateMaximum, "SMAggregateMaximum");

For example, take the same two aggregate objects of the class Employee that were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are only used to rank the local similarity values: each value is multiplied by its weight and the products are sorted. The result of the similarity computation is the selected value without its weight. Using simple measures, e.g. StringEquals, for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. As 1.0 is the highest similarity, this value is returned.
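
The ranking behaviour described above can be sketched in Java. The sketch assumes that the attribute with the largest product of weight and local similarity is selected and that its unweighted similarity is returned; AggregateMaximumSketch and maximum are hypothetical names, not ProCAKE classes.

```java
final class AggregateMaximumSketch {

    // Picks the attribute with the largest weight * similarity
    // and returns its similarity without the weight.
    static double maximum(double[] weights, double[] similarities) {
        if (weights.length != similarities.length || similarities.length == 0) {
            throw new IllegalArgumentException("need one weight per similarity and at least one value");
        }
        int best = 0;
        for (int i = 1; i < similarities.length; i++) {
            if (weights[i] * similarities[i] > weights[best] * similarities[best]) {
                best = i;
            }
        }
        return similarities[best];
    }

    public static void main(String[] args) {
        // Employee example: name (0.625, sim 1.0) beats room (0.375, sim 0.0).
        System.out.println(maximum(new double[] {0.625, 0.375}, new double[] {1.0, 0.0})); // 1.0
        // The weights can change the winner: 0.8 * 0.5 is larger than 0.2 * 1.0.
        System.out.println(maximum(new double[] {0.2, 0.8}, new double[] {1.0, 0.5})); // 0.5
    }
}
```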

K-Maximum #

The k-th highest local similarity value determines the global similarity. Hence, if k out of n attributes indicate a high utility, the overall utility for the whole case is high.

Similarity: \(\begin{array}{l} \Phi(s_1,\ldots,s_n)=\omega_{i_k} \cdot s_{i_k} \\ \quad \text{with}\, \omega_{i_r}\cdot s_{i_r} \geq \omega_{i_{r+1}}\cdot s_{i_{r+1}} \end{array}\)

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
|---|---|---|---|
| k | Attribute number (int) | 1 | The parameter defines which of the ranked local similarities is returned: the k-th highest. Only values greater than 0 are allowed. If k is set to a value higher than the number of attributes, an exception is thrown. |

For example, an Aggregate K Maximum measure can look like:

sim.xml

    <AggregateKMaximum name="SMAggregateKMaximum" class="Employee" k="2" default="false" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateKMaximum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set as 0.3. The value for k is set to 2, so the second-highest similarity will be returned.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateKMaximum smAggregateKMaximum = (SMAggregateKMaximum) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateKMaximum.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateKMaximum.setDefaultWeight(0.5);
    smAggregateKMaximum.setWeight("room", 0.3);
    smAggregateKMaximum.setK(2);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateKMaximum, "SMAggregateKMaximum");

For example, take the same two aggregate objects of the class Employee that were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are only used to rank the local similarity values: each value is multiplied by its weight and the products are sorted. The result of the similarity computation is the selected value without its weight. Using simple measures, e.g. StringEquals, for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. As 0.0 is the second-highest similarity, the value 0.0 is returned.

As another example, assume two aggregate objects of any class containing five attributes. The following similarities were calculated: 1.0, 0.8, 0.6, 0.4 and 0.2. Using the k-value of 2, the second-highest value of 0.8 is returned. Using the k-value of 4, the fourth-highest value of 0.4 is returned. For a k-value of 1, the highest value of 1.0 is returned, analogous to AggregateMaximum, and for k=5 the smallest value 0.2, corresponding to AggregateMinimum.
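
The five-attribute example can be checked with the following sketch. It assumes equal weights for all five attributes (the text does not state them) and ranks the attributes by weight times similarity in descending order; AggregateKMaximumSketch and kMaximum are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

final class AggregateKMaximumSketch {

    // Ranks attributes by weight * similarity (descending) and returns the
    // unweighted similarity of the attribute at rank k (1-based).
    static double kMaximum(double[] weights, double[] similarities, int k) {
        if (weights.length != similarities.length) {
            throw new IllegalArgumentException("one weight per similarity is required");
        }
        if (k < 1 || k > similarities.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of attributes");
        }
        Integer[] order = new Integer[similarities.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order,
            Comparator.comparingDouble((Integer i) -> weights[i] * similarities[i]).reversed());
        return similarities[order[k - 1]];
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.2, 0.2, 0.2, 0.2}; // assumption: equal weights
        double[] sims = {1.0, 0.8, 0.6, 0.4, 0.2};
        System.out.println(kMaximum(weights, sims, 2)); // 0.8
        System.out.println(kMaximum(weights, sims, 4)); // 0.4
        System.out.println(kMaximum(weights, sims, 1)); // 1.0, same as the maximum
        System.out.println(kMaximum(weights, sims, 5)); // 0.2, same as the minimum
    }
}
```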

Minimum #

When using the minimum aggregation, the overall similarity is determined by the minimum local similarity. If one attribute indicates a low utility, the utility for the whole case is low.

Similarity: \(\Phi(s_1,\ldots,s_n)=\min_{i=1}^n \omega_i \cdot s_i\)

For example, an Aggregate Minimum measure can look like:

sim.xml

    <AggregateMinimum name="SMAggregateMinimum" class="Employee" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateMinimum>

This measure refers to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateMinimum smAggregateMinimum = (SMAggregateMinimum) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateMinimum.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateMinimum.setDefaultWeight(0.5);
    smAggregateMinimum.setWeight("room", 0.3);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateMinimum, "SMAggregateMinimum");

For example, take the same two aggregate objects of the class Employee that were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are only used to rank the local similarity values: each value is multiplied by its weight and the products are sorted. The result of the similarity computation is the selected value without its weight. Using simple measures, e.g. StringEquals, for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. As 0.0 is the lowest similarity, this value is returned.

K-Minimum #

The k-th lowest local similarity value determines the global similarity. Hence, if k out of n attributes indicate a low utility, the overall utility for the whole case is low.

Similarity: \(\begin{array}{l} \Phi(s_1,\ldots,s_n)=\omega_{i_k} \cdot s_{i_k} \\ \quad \text{with}\, \omega_{i_r}\cdot s_{i_r} \leq \omega_{i_{r+1}}\cdot s_{i_{r+1}} \end{array}\)

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
|---|---|---|---|
| k | Attribute number (int) | 1 | The parameter defines which of the ranked local similarities is returned: the k-th lowest. Only values greater than 0 are allowed. If k is set to a value higher than the number of attributes, an exception is thrown. |

For example, an Aggregate K Minimum measure can look like:

sim.xml

    <AggregateKMinimum name="SMAggregateKMinimum" class="Employee" k="2" default="false" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateKMinimum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3. The value for k is set to 2, so the second-lowest similarity will be returned.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateKMinimum smAggregateKMinimum = (SMAggregateKMinimum) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateKMinimum.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateKMinimum.setDefaultWeight(0.5);
    smAggregateKMinimum.setWeight("room", 0.3);
    smAggregateKMinimum.setK(2);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateKMinimum, "SMAggregateKMinimum");

For example, take the same two aggregate objects of the class Employee that were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are only used to rank the local similarity values: each value is multiplied by its weight and the products are sorted. The result of the similarity computation is the selected value without its weight. Using simple measures, e.g. StringEquals, for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. As 1.0 is the second-lowest similarity, this value is returned.

As another example, assume two aggregate objects of any class containing five attributes. The following similarities were calculated: 1.0, 0.8, 0.6, 0.4 and 0.2. Using the k-value of 2, the second-lowest value of 0.4 is returned. Using the k-value of 4, the fourth-lowest value of 0.8 is returned. For a k-value of 1, the lowest value of 0.2 is returned, analogous to AggregateMinimum, and for k=5 the highest value of 1.0, corresponding to AggregateMaximum.
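
A sketch of the k-th lowest selection for these five similarities is given below. It assumes equal weights for all attributes (not stated in the text) and an ascending ranking by weight times similarity; AggregateKMinimumSketch and kMinimum are hypothetical names, not ProCAKE classes.

```java
import java.util.Arrays;
import java.util.Comparator;

final class AggregateKMinimumSketch {

    // Ranks attributes by weight * similarity (ascending) and returns the
    // unweighted similarity of the attribute at rank k (1-based).
    static double kMinimum(double[] weights, double[] similarities, int k) {
        if (weights.length != similarities.length) {
            throw new IllegalArgumentException("one weight per similarity is required");
        }
        if (k < 1 || k > similarities.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of attributes");
        }
        Integer[] order = new Integer[similarities.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> weights[i] * similarities[i]));
        return similarities[order[k - 1]];
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.2, 0.2, 0.2, 0.2}; // assumption: equal weights
        double[] sims = {1.0, 0.8, 0.6, 0.4, 0.2};
        System.out.println(kMinimum(weights, sims, 1)); // 0.2, same as the minimum
        System.out.println(kMinimum(weights, sims, 2)); // 0.4
        System.out.println(kMinimum(weights, sims, 4)); // 0.8
        System.out.println(kMinimum(weights, sims, 5)); // 1.0, same as the maximum
    }
}
```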

Minkowski #

The Minkowski aggregation is a generalization of the weighted average. The higher the value of the parameter p (with p ≥ 1), the stronger the influence of the attribute with the highest local similarity. As p approaches infinity, this aggregation function becomes the maximum aggregation. If no weights are given, the similarity will be \(1.0\) .

Similarity: \(\Phi(s_1,\ldots,s_n)=(\sum_{i=1}^n \omega_i \cdot s_i^p)^{1/p}\)
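
The formula can be evaluated directly, as in the sketch below. It follows the formula as written, with the weights normalized by their sum, and may differ in detail from ProCAKE's implementation; MinkowskiAggregationSketch and minkowski are hypothetical names. Returning 1.0 when the weights sum to zero is one possible reading of "no weights are given" and is an assumption.

```java
final class MinkowskiAggregationSketch {

    // Evaluates (sum_i w_i * s_i^p)^(1/p) with weights normalized to sum 1.
    static double minkowski(double[] weights, double[] similarities, double p) {
        if (weights.length != similarities.length) {
            throw new IllegalArgumentException("one weight per similarity is required");
        }
        if (p < 1.0) {
            throw new IllegalArgumentException("p must be at least 1");
        }
        double weightSum = 0.0;
        for (double w : weights) {
            weightSum += w;
        }
        if (weightSum == 0.0) {
            return 1.0; // assumption: interpretation of "no weights are given"
        }
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += (weights[i] / weightSum) * Math.pow(similarities[i], p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    public static void main(String[] args) {
        double[] weights = {0.5, 0.5};
        double[] sims = {1.0, 0.5};
        // p = 1 reproduces the weighted average of 0.75.
        System.out.println(minkowski(weights, sims, 1.0));
        // Larger p moves the result towards the highest local similarity (1.0).
        System.out.println(minkowski(weights, sims, 2.0));
        System.out.println(minkowski(weights, sims, 50.0));
    }
}
```

The three printed values increase with p, which illustrates the growing influence of the attribute with the highest local similarity.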

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
|---|---|---|---|
| p | Weight (double) | 2.0 | The parameter is used to define the influence of the attribute with the highest local similarity. |

For example, an Aggregate Minkowski measure can look like:

sim.xml

    <AggregateMinkowski name="SMAggregateMinkowski" class="Employee" p="3" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateMinkowski>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3. The value for p is set to 3.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateMinkowski smAggregateMinkowski = (SMAggregateMinkowski) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateMinkowski.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateMinkowski.setMinkowskiP(3);
    smAggregateMinkowski.setDefaultWeight(0.5);
    smAggregateMinkowski.setWeight("room", 0.3);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateMinkowski, "SMAggregateMinkowski");

For example, there are the same two aggregate objects of the class Employee, which were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. Using simple measures for the attributes, the similarity between the names is 1.0, between the rooms 0.0.

For the total similarity, applying the formula above gives: \((0.625 \cdot 1.0^3 + 0.375 \cdot 0.0^3)^{\frac{1}{3}} = (0.625 + 0.0)^{\frac{1}{3}} = 0.625^{\frac{1}{3}} \approx 0.855\) .

Euclidian #

The Euclidean aggregation is the same as the Minkowski aggregation with a fixed p = 2. If no weights are given, the similarity will also be \(1.0\) .

Similarity: \(\Phi(s_1,\ldots,s_n)=(\sum_{i=1}^n \omega_i \cdot s_i^2)^{1/2}\)

For example, an Aggregate Euclidean measure can look like:

sim.xml

    <AggregateEuclidian name="SMAggregateEuclidian" class="Employee" defaultWeight="0.5">
        <AggWeight att="room" weight="0.3"/>
    </AggregateEuclidian>

This measure refers to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateEuclidian smAggregateEuclidian = (SMAggregateEuclidian) simVal.getSimilarityModel().createSimilarityMeasure(
      SMAggregateEuclidian.NAME,
      ModelFactory.getDefaultModel().getClass("Employee")
    );
    smAggregateEuclidian.setDefaultWeight(0.5);
    smAggregateEuclidian.setWeight("room", 0.3);
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateEuclidian, "SMAggregateEuclidian");

For example, there are the same two aggregate objects of the class Employee, which were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. Using simple measures for the attributes, the similarity between the names is 1.0, between the rooms 0.0.

For the total similarity, applying the formula above gives: \((0.625 \cdot 1.0^2 + 0.375 \cdot 0.0^2)^{\frac{1}{2}} = (0.625 + 0.0)^{\frac{1}{2}} = 0.625^{\frac{1}{2}} \approx 0.791\) .

Aggregate Relevance #

The AggregateRelevance measure allows the attributes to have different relevance levels depending on the solution (referred to as class label). For each solution, an AggregateAverage similarity measure can be chosen according to the attribute values contained in an AggregateObject. The example below explains the use of the AggregateRelevance measure in more detail and presents a use case.

AggregateRelevance uses the global similarity measure AggregateAverage, therefore the similarity computation for the AggregateRelevance measure is the following:

Similarity: \(\Phi(s_1,\ldots,s_n)=\Phi_{AggAv}(s_1,\ldots,s_n)\)

Hereby, it holds that \(AggAv \in X\) , where \(X\) is the set of all AggregateAverage measures contained in the AggregateRelevance measure, and the following condition is met:

\(\forall i \in |AggRel.solutionAttribute|: AggRel.solutionAttribute_i = AggAv.solutionLabel_i \text{ (for strings) } \lor AggRel.solutionAttribute_i \in AggAv.solutionLabel_i \text{ (for intervals) }\)

\( AggRel \) is the AggregateRelevance measure.

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
|---|---|---|---|
| defaultMeasure | Similarity Measure (String) | null | The parameter specifies the default similarity measure for AggregateObjects. It is used if no specific similarity measure is defined for the attribute values. |
| solutionAttribute | Concatenation of attribute names (bundled in one string, e.g.: attr1,...,attrX) | null | The parameter is a string concatenation of the considered attributes from the AggregateObjects. The attributes have to be attributes of the AggregateObject's class. |

For the attributes, the following parameter can be set:

| Parameter | Type/Range | Default Value | Description |
|---|---|---|---|
| similaritiesToUse | String, Similarity Measure (bundled in one string, e.g.: attrValue1,...,attrValueX, String) | null | This parameter is used to set a similarity measure for specific attribute values. The first argument is a concatenated string of attribute values. The values have to match the class of the associated attribute in solutionAttribute. For numeric values, a range of values can be given in interval notation. For other classes, such as boolean, only the plain values (true or false) can be set. The second argument is the similarity measure that should be used for the given attribute values. |

All the example measures in the following require an AggregateClass called Cars, which has to be defined in the model.xml. It can look like this:

model.xml

    <AggregateClass name="Cars">
        <Attribute name="price" class="Double"/>
        <Attribute name="miles" class="Miles"/>
        <Attribute name="year" class="Integer"/>
        <Attribute name="paint_color" class="String"/>
        <Attribute name="transmission" class="String"/>
        <Attribute name="deficiency" class="Deficiency_Types"/>
    </AggregateClass>

The class Miles is an InstanceIntervalPredicate class, which is used to define a range for the attribute miles. The class Deficiency_Types is an InstanceTaxonomyOrderPredicate class, which is used to define a taxonomy for the attribute deficiency like clutchWear or peelingVarnish which are common deficiencies in cars.

For example, an AggregateRelevance measure can look like:

sim.xml

    <AggregateRelevance name="SMCarRelevance" class="Cars" solutionAttribute="price" defaultMeasure="SMObjectEquals">
        <AggRelevance solutionLabel="[0,50000)" similarityToUse="SM1"/>
        <AggRelevance solutionLabel="[50000,100000]" similarityToUse="SM2"/>
    </AggregateRelevance>

In this example, if the solution attribute price of a Cars object lies in [0,50000), then SM1 is chosen to compute the similarity between query and case. If the price lies in [50000,100000], then SM2 is chosen. In all other cases the default measure is used. SM1 and SM2 have to be AggregateAverage measures, so the returned similarity value is always computed as a weighted average.
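
The selection step for interval labels can be sketched as follows. This is an illustrative reading of the interval notation, not ProCAKE code: RelevanceSelectionSketch, contains and selectMeasure are hypothetical names, and only single numeric interval labels are handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RelevanceSelectionSketch {

    // Returns true if value lies in an interval label such as "[0,50000)".
    // '[' and ']' denote inclusive bounds, '(' and ')' exclusive bounds.
    static boolean contains(String label, double value) {
        String s = label.trim();
        boolean lowerInclusive = s.charAt(0) == '[';
        boolean upperInclusive = s.charAt(s.length() - 1) == ']';
        String[] bounds = s.substring(1, s.length() - 1).split(",");
        double lower = Double.parseDouble(bounds[0].trim());
        double upper = Double.parseDouble(bounds[1].trim());
        boolean aboveLower = lowerInclusive ? value >= lower : value > lower;
        boolean belowUpper = upperInclusive ? value <= upper : value < upper;
        return aboveLower && belowUpper;
    }

    // Returns the measure name of the first matching label, else the default measure.
    static String selectMeasure(Map<String, String> labelToMeasure, double solutionValue,
                                String defaultMeasure) {
        for (Map.Entry<String, String> entry : labelToMeasure.entrySet()) {
            if (contains(entry.getKey(), solutionValue)) {
                return entry.getValue();
            }
        }
        return defaultMeasure;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("[0,50000)", "SM1");
        labels.put("[50000,100000]", "SM2");
        System.out.println(selectMeasure(labels, 49999.99, "SMObjectEquals")); // SM1
        System.out.println(selectMeasure(labels, 50000.0, "SMObjectEquals"));  // SM2
        System.out.println(selectMeasure(labels, 120000.0, "SMObjectEquals")); // SMObjectEquals
    }
}
```

Note that a price of exactly 50000 falls into the second interval, because the first interval excludes its upper bound.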

To create this measure during runtime, the following code would be used:

Wiki_AggregateTest.java

    SMAggregateRelevance smAggregateRelevance = (SMAggregateRelevance) simVal.getSimilarityModel().createSimilarityMeasure(
        SMAggregateRelevance.NAME,
        ModelFactory.getDefaultModel().getClass("Cars")
    );
    smAggregateRelevance.setSolutionAttribute("price");
    smAggregateRelevance.setDefaultMeasure("SMObjectEquals");
    smAggregateRelevance.setSimilarityToUse("[0,50000)", "SM1");
    smAggregateRelevance.setSimilarityToUse("[50000,100000]", "SM2");
    simVal.getSimilarityModel().addSimilarityMeasure(smAggregateRelevance, "SMCarRelevance");

Example #

To demonstrate the use of the AggregateRelevance measure, consider the following example. The measure contains five different similarity measures, and the solutionAttribute contains two attributes of the Cars class: miles and deficiency. The first measure, SMCarClutchWearDueToOperator, is used if the mileage of a car is low but the deficiency is clutchWear. This could indicate that the driver has damaged the clutch, which can only be the case if the transmission is manual. Therefore, the AggregateAverage measure is instantiated as follows.

sim.xml

    <AggregateAverage name="SMCarClutchWearDueToOperator" class="Cars" defaultWeight="0.0">
        <AggWeight att="transmission" weight="1.0"/>
        <AggWeight att="miles" weight="0.1"/>
    </AggregateAverage>

In contrast, if a car has a high mileage, clutch wear is most likely due to normal wear and tear. Therefore, the measure SMCarClutchWearDueToMiles is used and instantiated as follows.

sim.xml

    <AggregateAverage name="SMCarClutchWearDueToMiles" class="Cars" defaultWeight="0.0">
        <AggWeight att="transmission" weight="0.1"/>
        <AggWeight att="miles" weight="1.0"/>
    </AggregateAverage>

The AggregateRelevance measure is used to choose the right measure for the right case. The following code demonstrates the instantiation of the measure in XML.

sim.xml

    <AggregateRelevance name="SMCarRelevanceLargeTest" class="Cars" solutionAttribute="miles,deficiency" defaultMeasure="SMObjectEquals">
        <AggRelevance solutionLabel="[0,19999],clutchWear" similarityToUse="SMCarClutchWearDueToOperator"/>
        <AggRelevance solutionLabel="[20000,149999],clutchWear" similarityToUse="SMCarClutchWearDueToConstructionFault"/>
        <AggRelevance solutionLabel="[150000,999999),clutchWear" similarityToUse="SMCarClutchWearDueToMiles"/>
        <AggRelevance solutionLabel="[0,199999],peelingVarnish" similarityToUse="SMCarPeelingVarnishDueToManufacturer"/>
        <AggRelevance solutionLabel="[200000,500000],peelingVarnish" similarityToUse="SMCarPeelingVarnishDueToLackMaintenance"/>
    </AggregateRelevance>

If the miles of a car object lie in [0,19999] and the deficiency is clutchWear, then the similarity is computed using the SMCarClutchWearDueToOperator measure. If the miles are in the same range but the deficiency is peelingVarnish, the label [0,199999],peelingVarnish matches, and the similarity is computed using the SMCarPeelingVarnishDueToManufacturer measure.